A distributed SQL engine designed for data-intensive distributed applications for interactive analysis of large-scale datasets.It follows Schema-free JSON document model similar to MongoDB and has pluggable architecture that enables connectivity to multiple datastores. Apache drill project inspired form Google's Dremel system which is available as an infrastructure service called Google BigQuery. key features of Apache drill(Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage) are-
- Drill supports a variety of NoSQL databases and file systems like HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files.
- Data from multiple datastores(NoSQL,HDFS or HBase) can be joined using single query.
- Drill processes the data in-situ(in place) without requiring users to define schemas or transform data.
- Drill support industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
Note:- Oracle Java SE Development (JDK) Kit 7 + must be available in our system, if not available refer Oracle JDK installation in Ubuntu.
Apache drill installation in embedded mode
1. Download a stable distribution of Apache drill from here OR execute one of the following commands and get specified version of Apache drill.zytham@ubuntu:~$ wget http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz
zytham@ubuntu:~$ curl -o apache-drill-1.1.0.tar.gz http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz
zytham@ubuntu:~$ tar -xvzf ./Downloads/apache-drill-1.1.0.tar.gz
zytham@ubuntu:~$ mv apache-drill-1.1.0 /opt/drill
zytham@ubuntu:/opt/drill$ ./bin/drill-embedded
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Nov 20, 2015 9:54:10 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.0.0
"say hello to my little drill"
0: jdbc:drill:zk=local>
zytham@ubuntu:/opt/drill$ ./bin/drill-embedded
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006c0000000, 1431633920, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 1431633920 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/drill/hs_err_pid3122.log
- Open drill-env.sh and update these two fields :- DRILL_MAX_DIRECT_MEMORY and DRILL_HEAP so that memory allocated to starting script is reduced.
DRILL_MAX_DIRECT_MEMORY="1G"
DRILL_HEAP="512M" - Save it and again start drill shell in embedded mode.
SQL query execution in drill prompt
From drill prompt, we can execute SQL query on flat files (JSON, CSV, Praquet,etc.). Here we are using sample JSON file employee.json shipped with Apache drill, packaged in the Foodmart data JAR in Drill's classpath availale at <DRILL_HOME>/jars/3rdparty/foodmart-data-json-0.4.jar.Note:- Below cp.'employee.json' refers to employee json file available in class path.
0: jdbc:drill:zk=local> show databases;
+---------------------+
| SCHEMA_NAME |
+---------------------+
| INFORMATION_SCHEMA |
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| sys |
+---------------------+
6 rows selected (3.015 seconds)
0: jdbc:drill:zk=local> select employee_id, first_name,last_name,position_id,salary FROM cp.`employee.json` where salary > 30000;
+--------------+-------------+------------+--------------+----------+
| employee_id | first_name | last_name | position_id | salary |
+--------------+-------------+------------+--------------+----------+
| 1 | Sheri | Nowmer | 1 | 80000.0 |
| 2 | Derrick | Whelply | 2 | 40000.0 |
| 4 | Michael | Spence | 2 | 40000.0 |
| 5 | Maya | Gutierrez | 2 | 35000.0 |
| 10 | Darren | Stanz | 5 | 50000.0 |
| 21 | Pedro | Castillo | 2 | 35000.0 |
| 22 | Laurie | Borges | 2 | 35000.0 |
| 36 | Donna | Arnold | 7 | 45000.0 |
+--------------+-------------+------------+--------------+----------+
8 rows selected (0.231 seconds)
0: jdbc:drill:zk=local> select employee_id, first_name,last_name,position_id,salary FROM cp.`employee.json` where salary > 30000 and position_id=2;
+--------------+-------------+------------+--------------+----------+
| employee_id | first_name | last_name | position_id | salary |
+--------------+-------------+------------+--------------+----------+
| 2 | Derrick | Whelply | 2 | 40000.0 |
| 4 | Michael | Spence | 2 | 40000.0 |
| 5 | Maya | Gutierrez | 2 | 35000.0 |
| 21 | Pedro | Castillo | 2 | 35000.0 |
| 22 | Laurie | Borges | 2 | 35000.0 |
+--------------+-------------+------------+--------------+----------+
5 rows selected (0.276 seconds)
Database Join operation:- In continuation of that, lets perform join operation on two json files(employee.json and department.json).First execute a query with aliasing followed by join two two files and get department_description from department_id.
0: jdbc:drill:zk=local> select emp.employee_id, emp.first_name,emp.salary,emp.department_id FROM cp.`employee.json` emp where emp.salary <40000 and emp.salary>21000;
+--------------+-------------+----------+----------------+
| employee_id | first_name | salary | department_id |
+--------------+-------------+----------+----------------+
| 5 | Maya | 35000.0 | 1 |
| 6 | Roberta | 25000.0 | 2 |
| 20 | Beverly | 30000.0 | 1 |
| 21 | Pedro | 35000.0 | 1 |
| 22 | Laurie | 35000.0 | 1 |
+--------------+-------------+----------+----------------+
0: jdbc:drill:zk=local> select emp.employee_id, emp.first_name,emp.salary,emp.department_id,dept.department_description FROM cp.`employee.json` emp , cp.`department.json` dept where emp.salary <40000 and emp.salary>21000 and emp.department_id = dept.department_id;
+--------------+-------------+----------+----------------+-------------------------+
| employee_id | first_name | salary | department_id | department_description |
+--------------+-------------+----------+----------------+-------------------------+
| 5 | Maya | 35000.0 | 1 | HQ General Management |
| 6 | Roberta | 25000.0 | 2 | HQ Information Systems |
| 20 | Beverly | 30000.0 | 1 | HQ General Management |
| 21 | Pedro | 35000.0 | 1 | HQ General Management |
| 22 | Laurie | 35000.0 | 1 | HQ General Management |
+--------------+-------------+----------+----------------+-------------------------+
SQL query on CSV file:-
Create a csv(comma separated values) file at location /tmp/sample.csv and copy following data in it.(Physical attributes of four people.)
1, NIKS, 79 , 5.8
2, CKM , 83 , 5.7
3, RITZ, 85 ,5.10
4, YUNNUS, 71, 5.11
Individual columns from CSV file is extracted - considering each row of csv file as zero indexed based array like columns[0], columns[1] and so on.2, CKM , 83 , 5.7
3, RITZ, 85 ,5.10
4, YUNNUS, 71, 5.11
0: jdbc:drill:zk=local> select * FROM dfs.`/tmp/sample.csv`;
+---------------------------------+
| columns |
+---------------------------------+
| ["1"," NIKS"," 79 "," 5.8 "] |
| ["2"," CKM "," 83 "," 5.7"] |
| ["3"," RITZ"," 85 ","5.10"] |
| ["4"," YUNNUS"," 71"," 5.11 "] |
+---------------------------------+
4 rows selected (0.189 seconds)
0: jdbc:drill:zk=local> select columns[0] as id, columns[1] name, columns[2] as weight, columns[3] as height FROM dfs.`/tmp/sample.csv`;
+-----+----------+---------+---------+
| id | name | weight | height |
+-----+----------+---------+---------+
| 1 | NIKS | 79 | 5.8 |
| 2 | CKM | 83 | 5.7 |
| 3 | RITZ | 85 | 5.10 |
| 4 | YUNNUS | 71 | 5.11 |
+-----+----------+---------+---------+
4 rows selected (0.288 seconds)
Querying multiple data source in single query :-
In Apache drill multiple data source(CSV,JSON, etc) can be accessed and joined in single query. Below query access employee.json, department.json and sample.csv file and display employee details with physical attributes.
0: jdbc:drill:zk=local> select emp.employee_id, phy.columns[1] as Name ,dept.department_description,phy.columns[2] as Weight , phy.columns[3] as Height FROM cp.`employee.json` emp , cp.`department.json` dept, dfs.`/tmp/sample.csv` phy where CAST(emp.employee_id AS INT) = CAST(phy.columns[0] AS INT) and emp.department_id = dept.department_id;
+--------------+----------+-------------------------+---------+---------+
| employee_id | Name | department_description | Weight | Height |
+--------------+----------+-------------------------+---------+---------+
| 1 | NIKS | HQ General Management | 79 | 5.8 |
| 2 | CKM | HQ General Management | 83 | 5.7 |
| 4 | YUNNUS | HQ General Management | 71 | 5.11 |
+--------------+----------+-------------------------+---------+---------+
3 rows selected (0.679 seconds)
Explicit type casting is mandatory when we are dealing with different sources,otherwise we will get error some thing like this.
Error: SYSTEM ERROR: DrillRuntimeException: Join only supports implicit casts between
1. Numeric data 2. Varchar, Varbinary data 3. Date, Timestamp data Left type: BIGINT, Right type: VARCHAR. Add explicit casts to avoid this error
http://<IP address or host name>:8047
In terminal type ifconfig and get ip(inet) address, update above url accordingly and open it in browser.Apache drill web console and configuration window |
While querying CSV file in above example,we created sample.csv at location /tmp/ because there is a entry for workspace location /temp/ and write permission is enabled(Refer diagram-2.). We can modify these settrings as per our convenience.
Note:-
1. By convention this /tmp/ directory is used to create "view" and it is created locally.
CREATE or REPLACE view dfs.tmp.<VIEW_NAME> as <SLLECT_QUERY>
2. It is not mandatory to use configured location in SQL query, we can place data source file at some location and in SQL query we can specify the same as follows:
SELECT * FROM dfs.`/users/zytham/Downloads/data/<data_source_json>`;
Stopping Apache drill:- Execute following command to exit prompt and reach at terminal.
0: jdbc:drill:zk=local> !quit
In next post, we will see how to connect Apache drill with JDBC client and query in the same way we did form Apache drill prompt.
References :- Apache drill official documentation
Mua vé tại Aivivu, tham khảo
ReplyDeletevé máy bay đi Mỹ Vietnam Airline
bay về việt nam từ mỹ
bao giờ có chuyến bay từ đức về việt nam
lịch bay từ moscow đến hà nội
chuyến bay từ anh về việt nam
các chuyến bay từ châu âu về việt nam
danh sách khách sạn cách ly ở tphcm