Getting Started with Apache Drill

Apache Drill: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

Standalone Mode

[qihuang.zheng@dp0653 ~]$ cd apache-drill-1.0.0
[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/drill-embedded
apache drill 1.0.0
"json ain't no thang"
0: jdbc:drill:zk> select * from cp.`employee.json` limit 2;
+--------------+------------------+-------------+------------+--------------+---------------------+-----------+----------------+-------------+------------------------+----------+----------------+------------------+-----------------+---------+--------------------+
| employee_id | full_name | first_name | last_name | position_id | position_title | store_id | department_id | birth_date | hire_date | salary | supervisor_id | education_level | marital_status | gender | management_role |
+--------------+------------------+-------------+------------+--------------+---------------------+-----------+----------------+-------------+------------------------+----------+----------------+------------------+-----------------+---------+--------------------+
| 1 | Sheri Nowmer | Sheri | Nowmer | 1 | President | 0 | 1 | 1961-08-26 | 1994-12-01 00:00:00.0 | 80000.0 | 0 | Graduate Degree | S | F | Senior Management |
| 2 | Derrick Whelply | Derrick | Whelply | 2 | VP Country Manager | 0 | 1 | 1915-07-03 | 1994-12-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | M | M | Senior Management |
+--------------+------------------+-------------+------------+--------------+---------------------+-----------+----------------+-------------+------------------------+----------+----------------+------------------+-----------------+---------+--------------------+
2 rows selected (1.247 seconds)

Drill uses ZooKeeper to coordinate its cluster; here local means the ZooKeeper on the local machine.

You can also start it via sqlline:

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/sqlline -u jdbc:drill:zk=local
log4j:WARN No appenders could be found for logger (DataNucleus.General).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Jun 15, 2015 11:11:18 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.0.0
"a drill in the hand is better than two in the bush"
0: jdbc:drill:zk=local>

Exiting Drill:

0: jdbc:drill:zk=local> !quit
Closing: org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection
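
!quit is one of sqlline's built-in shell commands; a couple of others that come in handy here (standard sqlline, listed only as a reference):

!help      -- list all sqlline commands
!tables    -- list the tables visible in the current schema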

Starting Drill as a background daemon:

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/drillbit.sh start
starting drillbit, logging to /home/qihuang.zheng/apache-drill-1.0.0/log/drillbit.out

Checking the Drill processes:

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ jps -lm
2788 org.apache.drill.exec.server.Drillbit
3045 sqlline.SqlLine -d org.apache.drill.jdbc.Driver --maxWidth=10000 --color=true -u jdbc:drill:zk=local

The first is the Drillbit daemon; the second is the client process started via sqlline or drill-embedded.
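
The daemon started with drillbit.sh start is stopped the same way (from memory, the script's standard options also include restart and status):

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/drillbit.sh stop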

Storage Plugin

cp is the classpath storage plugin; see Drill's web UI at http://192.168.6.53:8047/storage.
Drill supports a variety of storage backends and can query data in any of them with SQL.

[Figure: Storage Plugins page in the web UI]

By default only cp and dfs are enabled. Click Enable next to a plugin under Disabled Storage Plugins to start using it.

Adding an HDFS Plugin

There is no hdfs plugin by default. Under New Storage Plugin, enter hdfs, click Create, and put the plugin's configuration into the Configuration box:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://192.168.6.53:9000/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": ["tsv"],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    }
  }
}
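
If you prefer scripting over the web UI, the same configuration can probably be pushed through Drill's REST interface; the endpoint path and payload shape below are my assumptions based on Drill's documented REST API, not something verified on this 1.0.0 setup:

curl -X POST -H "Content-Type: application/json" \
  -d '{"name": "hdfs", "config": {"type": "file", "enabled": true, "connection": "hdfs://192.168.6.53:9000/"}}' \
  http://192.168.6.53:8047/storage/hdfs.json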

HDFS HA mode is supported too, e.g. a connection of hdfs://tdhdfs. You can also add more workspaces:

"tmp": {
"location": "/user/qihuang.zheng",
"writable": true,
"defaultInputFormat": null
}
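
Once defined, a workspace is referenced directly in queries; a quick sketch (the file name is purely illustrative):

select * from hdfs.tmp.`some-file.parquet` limit 10;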

Paths (dfs and hdfs)

To set a working directory for the dfs plugin, click Update on the dfs plugin, add a work workspace, then click Update again:

"connection": "file:///",
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null
},
"work": {
"location": "/home/qihuang.zheng/apache-drill-1.0.0/",
"writable": true,
"defaultInputFormat": null
}
},

For hdfs you can likewise define your own workspace, say work=/user/zhengqh; it then points at /user/zhengqh and you can query with hdfs.work directly.

DFS Files

The following queries the Parquet files under sample-data in the Drill install directory using different paths:

  • Without the predefined work workspace, the file cannot be found
  • With the custom work workspace (note the form: dfs.work.`<relative path>`), a relative path finds the file
  • An absolute path works as well
0: jdbc:drill:zk> select * from dfs.`sample-data/region.parquet` limit 2;
Error: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs.sample-data/region.parquet' not found
[Error Id: a1e53ed6-cc07-4799-9e9f-a7b112bb4e36 on dp0657:31010] (state=,code=0)

0: jdbc:drill:zk> select * from dfs.work.`sample-data/region.parquet` limit 2;
+--------------+----------+-----------------------+
| R_REGIONKEY | R_NAME | R_COMMENT |
+--------------+----------+-----------------------+
| 0 | AFRICA | lar deposits. blithe |
| 1 | AMERICA | hs use ironic, even |
+--------------+----------+-----------------------+
2 rows selected (0.338 seconds)

0: jdbc:drill:zk> select * from dfs.`/home/qihuang.zheng/apache-drill-1.0.0/sample-data/region.parquet` limit 2;
+--------------+----------+-----------------------+
| R_REGIONKEY | R_NAME | R_COMMENT |
+--------------+----------+-----------------------+
| 0 | AFRICA | lar deposits. blithe |
| 1 | AMERICA | hs use ironic, even |
+--------------+----------+-----------------------+
2 rows selected (0.235 seconds)

HDFS Files

  • The first query uses a relative path; the default hdfs plugin's root workspace points at /, and its connection is hdfs://192.168.6.53:9000/.
  • The second query uses the fully-qualified path.
0: jdbc:drill:zk> select count(*) from hdfs.`user/admin/evidence`;
+----------+
| EXPR$0 |
+----------+
| 4003278 |
+----------+
1 row selected (0.586 seconds)
0: jdbc:drill:zk> select count(*) from hdfs.`hdfs://tdhdfs/user/admin/evidence`;
+----------+
| EXPR$0 |
+----------+
| 4003278 |
+----------+
1 row selected (0.278 seconds)

One more point worth knowing: you can query all the files in a directory at once, and files in its subdirectories as well.
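
When querying a directory tree, Drill also exposes the directory levels as implicit columns dir0, dir1, and so on, usable for grouping and pruning; a sketch (purely illustrative here, since the evidence directory listed below contains only flat files):

select dir0, count(*) from hdfs.`/user/admin/evidence` group by dir0;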

[qihuang.zheng@dp0653 ~]$ /usr/install/hadoop/bin/hadoop fs -ls /user/admin/evidence
15/06/17 08:25:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r-- 3 shuoyi.zhao supergroup 67561800 2015-05-21 10:49 /user/admin/evidence/4b75f114-7f64-40df-9ff6-11a1e75637a7.parquet
-rw-r--r-- 3 shuoyi.zhao supergroup 96528887 2015-05-21 10:49 /user/admin/evidence/bdb0bdb4-fa04-402f-af05-b2aea02728ed.parquet
-rw-r--r-- 3 shuoyi.zhao supergroup 80968799 2015-05-21 10:49 /user/admin/evidence/da1439fc-0c32-4cd8-90f2-67d24dbaa6cc.parquet
-rw-r--r-- 3 shuoyi.zhao supergroup 136852232 2015-05-21 10:50 /user/admin/evidence/f0954a8f-583b-4173-9b89-55ed3107daf1.parquet

More Complex SQL Queries

  1. Two-table join
SELECT nations.name, regions.name FROM (
  SELECT N_REGIONKEY as regionKey, N_NAME as name
  FROM dfs.work.`sample-data/nation.parquet`
) nations join (
  SELECT R_REGIONKEY as regionKey, R_NAME as name
  FROM dfs.work.`sample-data/region.parquet`
) regions
on nations.regionKey = regions.regionKey
order by nations.name;

+-----------------+--------------+
| name | name0 |
+-----------------+--------------+
| ALGERIA | AFRICA |
| ARGENTINA | AMERICA |
| BRAZIL | AMERICA |
| CANADA | AMERICA |
| CHINA | ASIA |
...
+-----------------+--------------+
25 rows selected (1.038 seconds)
  2. IN subquery
SELECT N_REGIONKEY as regionKey, N_NAME as name  
FROM dfs.work.`sample-data/nation.parquet`
WHERE cast(N_NAME as varchar(10)) IN ('INDIA', 'CHINA');

+------------+--------+
| regionKey | name |
+------------+--------+
| 2 | INDIA |
| 2 | CHINA |
+------------+--------+
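
The IN list above is literal; a true subquery works as well. A sketch reusing the two sample files from the join earlier (non-correlated IN subqueries are the safe case; treat anything fancier as untested):

SELECT N_NAME as name
FROM dfs.work.`sample-data/nation.parquet`
WHERE N_REGIONKEY IN (
  SELECT R_REGIONKEY
  FROM dfs.work.`sample-data/region.parquet`
  WHERE cast(R_NAME as varchar(10)) = 'ASIA'
);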

Connecting Drill to Hive

Using Hive (local environment: cdh542)

Drill ships with a default hive plugin configuration:

{
  "type": "hive",
  "enabled": false,
  "configProps": {
    "hive.metastore.uris": "",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=../sample-data/drill_hive_db;create=true",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}

We change it to use the settings from hive-site.xml:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "hive.metastore.sasl.enabled": "false"
  }
}
  • Start Hadoop: start-all.sh
  • Start Hive: hive --service metastore and hiveserver2 (see the sketch below)
  • Start Drill: bin/drill-embedded
  • Inside the Drill shell, much of the syntax mirrors Hive: show databases lists the existing databases, use xxx switches to one, and so on...
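
As a quick reference, the startup sequence might look like this sketch (install paths assumed; the metastore and hiveserver2 would typically run in the background):

$ start-all.sh                  # HDFS + YARN
$ hive --service metastore &    # Hive metastore, default port 9083
$ hive --service hiveserver2 &
$ bin/drill-embedded            # Drill shell

A session then looks like this: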
0: jdbc:drill:zk=local> show databases;
+---------------------+
| SCHEMA_NAME |
+---------------------+
| INFORMATION_SCHEMA |
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| hive.default |
| hive.wiki |
| sys |
+---------------------+
8 rows selected (0.627 seconds)
0: jdbc:drill:zk=local> use hive.wiki;
+-------+----------------------------------------+
| ok | summary |
+-------+----------------------------------------+
| true | Default schema changed to [hive.wiki] |
+-------+----------------------------------------+
1 row selected (0.156 seconds)
0: jdbc:drill:zk=local> show tables;
+---------------+-------------+
| TABLE_SCHEMA | TABLE_NAME |
+---------------+-------------+
| hive.wiki | invites |
| hive.wiki | pokes |
| hive.wiki | u_data |
| hive.wiki | u_data_new |
+---------------+-------------+
4 rows selected (1.194 seconds)
0: jdbc:drill:zk=local> select count(*) from invites;
+---------+
| EXPR$0 |
+---------+
| 525 |
+---------+
1 row selected (4.204 seconds)

Hive Test Environment (hive-1.2.0)

Update the hive plugin configuration:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://192.168.6.53:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://192.168.6.53:3306/hive?characterEncoding=UTF-8",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "hdfs://tdhdfs",
    "hive.metastore.sasl.enabled": "false"
  }
}

Note: of the highlighted properties above, none except fs.default.name is strictly required.

Problem: Hive table data cannot be queried
In the test environment we hit a stubborn problem: queries against Hive tables never return, even though show databases, show tables, and describe xx all work fine.
For example, select count(*) from employee; simply hangs, and the web UI shows the query stuck in the pending state.

[Figure: query stuck in pending state]

After cancelling with Control+C, even previously working commands stop responding, and !quit no longer exits cleanly; the only way out is kill -9 <pid> on the sqlline process!

Tracking it down
Set the log level in conf/logback.xml to debug so that every command prints its log output.
Everything up to that point was fine; the moment a Hive table query ran, the error showed Drill connecting to an entirely different address: tdhdfs/220.250.64.20:8020.

[Figure: debug log showing the unexpected address]

Thinking it through
Searching around hadoop /etc/hosts, DNS, and HDFS HA DNS led nowhere.
Then it clicked: the 8020 in 220.250.64.20:8020 is simply the default NameNode port,
while our test cluster runs HDFS HA on port 9000.

In other words, Drill never found the Hadoop configuration at all, even though the hive plugin config sets fs.default.name to hdfs://tdhdfs!
Since Drill did not locate the Hadoop config files on its own, we have to tell it where they are!

Another issue
drill-embedded logs an error at startup:

10:43:54.789 [main] DEBUG org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.

It does not stop Drill from starting, but let's fix it anyway: vi ~/.bashrc

export HADOOP_HOME="/usr/install/hadoop"
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

and add the following to drill-env.sh:

export HADOOP_HOME="/usr/install/hadoop"

With these additions the startup error is gone, but the original problem remains.

The fix
Copy core-site.xml, mapred-site.xml, hdfs-site.xml, and yarn-site.xml from the Hadoop install directory into DRILL/conf!
Restart bin/drill-embedded. (After the restart the earlier hive and hdfs plugin configurations are gone, so they have to be updated again.)

[qihuang.zheng@dp0653 conf]$ ll -rt
-rw-r--r--. 1 qihuang.zheng users 3835 May 16 10:35 drill-override-example.conf
-rwxr-xr-x. 1 qihuang.zheng users 1276 Jun 16 10:51 drill-env.sh
-rw-r--r--. 1 qihuang.zheng users 2354 Jun 16 15:12 core-site.xml
-rw-r--r--. 1 qihuang.zheng users 3257 Jun 16 15:12 hdfs-site.xml
-rw-r--r--. 1 qihuang.zheng users 2111 Jun 16 15:12 mapred-site.xml
-rw-r--r--. 1 qihuang.zheng users 8382 Jun 16 15:12 yarn-site.xml
-rw-r--r--. 1 qihuang.zheng users 3119 Jun 16 15:25 logback.xml
-rw-r--r--. 1 qihuang.zheng users 1237 Jun 16 15:35 drill-override.conf

Hive Data Types

Drill does not currently support Hive's complex types such as LIST, MAP, STRUCT, and UNION:

0: jdbc:drill:zk> describe koudai;
+-------------------+---------------------------------------+--------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+-------------------+---------------------------------------+--------------+
| sequence_id | VARCHAR | YES |
| occur_time | BIGINT | YES |
| activity_map | (VARCHAR(65535), VARCHAR(65535)) MAP | NO |
| device_map | (VARCHAR(65535), VARCHAR(65535)) MAP | NO |
| event_result_map | (VARCHAR(65535), VARCHAR(65535)) MAP | NO |
| geo_map | (VARCHAR(65535), VARCHAR(65535)) MAP | NO |
| policy_map | (VARCHAR(65535), VARCHAR(65535)) MAP | NO |
| indice | VARCHAR | YES |
+-------------------+---------------------------------------+--------------+
8 rows selected (0.504 seconds)
0: jdbc:drill:zk> select count(*) from koudai;
Error: SYSTEM ERROR: java.lang.RuntimeException: Unsupported Hive data type MAP.
Following Hive data types are supported in Drill for querying: BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, DATE, TIMESTAMP, BINARY, DECIMAL, STRING, and VARCHAR

Fragment 1:0

[Error Id: 7c5dd0d4-1e18-4dbc-b470-1eb6ca6a3b36 on dp0653:31010] (state=,code=0)
0: jdbc:drill:zk> describe int_table;
+--------------+------------+--------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+--------------+------------+--------------+
| id | INTEGER | YES |
+--------------+------------+--------------+
1 row selected (0.334 seconds)
0: jdbc:drill:zk> select count(*) from int_table;
+---------+
| EXPR$0 |
+---------+
| 90 |
+---------+
1 row selected (0.654 seconds)

So what can we do when we want to query a Hive table that has a map-typed column?

A reply on the Drill mailing list speaks to exactly this:
http://mail-archives.apache.org/mod_mbox/drill-dev/201504.mbox/browser

We haven’t yet added support for Hive’s Map type. Can we work together on
adding this? Drill doesn’t distinguish between maps and structs given its
support for schemaless data. If you could post a small example piece of
data, maybe we could figure out the best way to work together to add this
functionality. As I said, it is mostly just a metadata mapping exercise
since Drill already has complex type support in the execution engine. You
can see how it works by looking at the JSONReader complex Parquet reader.

The gist: Hive's Map type is not supported yet. Why? Because Drill handles schemaless data, a map and a struct look the same to it.
Drill's execution engine already supports complex types; look at how the JSON reader and the complex Parquet reader handle them.

Which led to a thought: even though Drill cannot use a Hive table whose schema contains a map, Drill can still read Hive's data files.
Since a Hive table has a schema, its data files necessarily have a well-defined layout too.
So while Drill cannot share the table schema with Hive here, reading the table's data files directly amounts to falling back to the hdfs plugin.
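
For instance, a sketch (the path follows the hive.metastore.warehouse.dir setting above; whether the columns come out usable depends on the storage format Hive actually wrote):

select * from hdfs.`/user/hive/warehouse/koudai` limit 2;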

Drill Distributed Mode

  • Take the single-node setup above and copy the drill directory to every node in the cluster. Remember to edit conf/drill-override.conf:
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "192.168.6.55:2181,192.168.6.56:2181,192.168.6.57:2181"
}

Every node gets the same cluster-id, which is what makes all the nodes form one cluster.
This is just like setting up an ElasticSearch cluster, with the same benefit: nodes can be added at any time without touching any existing configuration.

  • Then run bin/drillbit.sh start on every machine
  • Port 8047 on any machine lists all the Drill services in the cluster

http://192.168.6.52:8047/
http://192.168.6.53:8047/
http://192.168.6.54:8047/
http://192.168.6.56:8047/
http://192.168.6.57:8047/
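
Membership can also be checked from SQL through the sys.drillbits system table (named here from memory of Drill's system tables, so treat it as an assumption to verify):

0: jdbc:drill:zk> select * from sys.drillbits;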

[Figure: distributed mode, all drillbits listed in the web UI]

Client Connections

Drill ships with several client tools, and third-party tools can reach Drill data as well.
Chiefly, Drill provides an ODBC driver just like other SQL databases. See: https://drill.apache.org/docs/interfaces-introduction/

sqlline

Connecting to local ZK:

bin/sqlline -u jdbc:drill:zk=local

Connecting to a ZK cluster

  • Specify the ZK quorum explicitly:
[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/sqlline -u jdbc:drill:zk=192.168.6.55,192.168.6.56,192.168.6.57:2181
  • In cluster mode the ZK address can also be omitted: bin/sqlline -u jdbc:drill:zk automatically reads the settings from drill-override.conf

Note: when the ZK you point at is brand new, all the Storage Plugin settings made earlier under zk=local are missing in the new session,
because the new ZooKeeper cluster contains no Drill data yet.
Storage Plugin updates and creates made in the web UI are written to the corresponding ZooKeeper znodes!
After updating and enabling hive in the UI again, show databases once more shows the Hive tables.
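
Since sqlline itself just drives the Drill JDBC driver (the -d org.apache.drill.jdbc.Driver visible in the jps output earlier), any JDBC program can connect the same way. A minimal sketch; the query is illustrative:

import java.sql.*;

public class DrillJdbcExample {
    public static void main(String[] args) throws Exception {
        // Same driver class that sqlline loads with -d
        Class.forName("org.apache.drill.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select full_name from cp.`employee.json` limit 3")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name"));
            }
        }
    }
}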

iodbc

[Figure: iODBC Data Source Manager]

Pick an existing driver and set the connection type: for ZooKeeper, specify the ZK quorum and the clusterId;
for Direct, specify the Drill host and port.

[Figure: iODBC connection settings]

Once the test succeeds, create a new SQL query.

[Figure: iODBC new SQL query]

Click OK and the query results come back.

[Figure: iODBC query result]

iodbc terminal

$ iodbctest
iODBC Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008

Enter ODBC connect string (? shows list): ?
DSN | Driver
------------------------------------------------------------------------------
Sample MapR Drill DSN | MapR Drill ODBC Driver

Enter ODBC connect string (? shows list): DRIVER=MapR Drill ODBC Driver;AdvancedProperties= {HandshakeTimeout=0;QueryTimeout=0; TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys, INFORMATION_SCHEMA;};Catalog=DRILL;Schema=; ConnectionType=Direct;Host=192.168.6.53;Port=31010
1: SQLDriverConnect = [iODBC][Driver Manager]dlopen(MapR Drill ODBC Driver, 6): image not found (0) SQLSTATE=00000
2: SQLDriverConnect = [iODBC][Driver Manager]Specified driver could not be loaded (0) SQLSTATE=IM003

Enter ODBC connect string (? shows list): DSN=Sample MapR Drill DSN;ConnectionType=Direct;Host=192.168.6.53;Port=31010
Driver: 1.0.0.1001 (MapR Drill ODBC Driver)

SQL>select count(*) from cp.`employee.json`
EXPR$0
--------------------
1155
result set 1 returned 1 rows.
SQL>

Note the ODBC connect string entered above; some spots in the official docs write it as:

DRIVER=MapR Drill ODBC Driver;AdvancedProperties= {HandshakeTimeout=0;QueryTimeout=0; TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys, INFORMATION_SCHEMA;};Catalog=DRILL;Schema=; ConnectionType=Direct;Host=192.168.6.53;Port=31010

which fails with "image not found". The format that works is:

DSN=Sample MapR Drill DSN;ConnectionType=Direct;Host=192.168.6.53;Port=31010

Drill Explorer

Drill Explorer uses the same connect-string format as above; enter it under Advanced.

[Figure: Drill Explorer connection dialog]

Drill Explorer lets you browse the data and create views.

[Figure: Drill Explorer]

Performance Tests

Querying Parquet files on HDFS (standalone vs. distributed)

Standalone mode:

0: jdbc:drill:zk=local> select count(*) from hdfs.`/user/admin/evidence`;
+----------+
| EXPR$0 |
+----------+
| 4003278 |
+----------+
1 row selected (0.854 seconds)

0: jdbc:drill:zk=local> select fraud_type,count(*) from hdfs.`/user/admin/evidence` group by fraud_type order by count(*) desc;
+--------------------+----------+
| fraud_type | EXPR$1 |
+--------------------+----------+
| fakeMobile | 1941589 |
...
| clickFraud, | 18 |
+--------------------+----------+
14 rows selected (4.451 seconds)

Distributed mode across five machines:

0: jdbc:drill:zk=192.168.6.55,192.168.6.56,19> select count(*) from hdfs.`/user/admin/evidence`;
+----------+
| EXPR$0 |
+----------+
| 4003278 |
+----------+
1 row selected (0.394 seconds)

0: jdbc:drill:zk=192.168.6.55,192.168.6.56,19> select fraud_type,count(*) from hdfs.`/user/admin/evidence` group by fraud_type order by count(*) desc;
+--------------------+----------+
| fraud_type | EXPR$1 |
+--------------------+----------+
| fakeMobile | 1941589 |
...
| clickFraud, | 18 |
+--------------------+----------+
14 rows selected (1.744 seconds)

Observation 1: the Foreman is not fixed
On each machine's port-8047 Profiles page, not every machine shows the queries.
For example, when the query ran against dp0655, only dp0656 had entries; the other machines had none.
Even from the same client, different sessions may run on different Foremen.

[Figure: queries appearing on only one node's Profiles page]

A: The Foreman is something like a facade: it is the node that ultimately returns the result to the client, and one is enough.
The Foreman decides how Drill's distributed computation is handed out to the Drillbit nodes.

How to verify: open a query under Profiles; the first Major Fragment is normally the Foreman node,
and the remaining Major Fragments are distributed across other nodes.

[Figure: major fragments in a query profile]

Drill's architecture diagrams likewise show a single Foreman:

[Figure: client query flow]

[Figure: leaf fragments]

Observation 2: the first query is slow
Some queries run slowly on the first execution; the same statement afterwards runs more than twice as fast,
eventually stabilizing, as in the group by / order by test below (4 million rows, grouped then sorted; times in seconds).
Columns are the machines the query was issued against, in order; rows are successive runs of the same SQL on that machine.

Round    dp0653   dp0652   dp0655   dp0657   dp0656   dp0653
Round1   7.871    5.079    4.8299   1.764    1.557    4.305
Round2   2.549    2.103    2.106    1.66     1.418    1.854
Round3   1.888    1.893    1.779    1.534    1.512    1.955
Round4   1.841    1.641    1.703
Round5   1.744    1.714    1.9
Round6   1.763    1.572    1.53

Other Drill Notes

1. Query UI

Enter the query on the http://192.168.6.53:8047/query page. Note the table reference hdfs.tmp.by_yr below:
hdfs.tmp acts like first running use hdfs.tmp in sqlline, and by_yr is the table name.

Click Submit to get the query results.
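
The sqlline equivalent would be (a sketch; by_yr and its yr column are the same ones the REST example below queries):

0: jdbc:drill:zk> use hdfs.tmp;
0: jdbc:drill:zk> select yr, count(*) from by_yr group by yr;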

2. REST Service

curl  \
--header "Content-type: application/json" \
--request POST \
--data '{
"queryType" : "SQL",
"query" : "select yr,count(*) from hdfs.tmp.by_yr group by yr having count(*) > 30000 order by count(*) desc"
}' \
http://192.168.6.52:8047/query.json

The response:

{
  "columns" : [ "yr", "EXPR$1" ],
  "rows" : [ {
    "yr" : "2008",
    "EXPR$1" : "38425"
  }, {
    "yr" : "2004",
    "EXPR$1" : "38252"
  }, {
    "yr" : "2007",
    "EXPR$1" : "38069"
  }, {
    "yr" : "2003",
    "EXPR$1" : "37050"
  }, {
  ...
  }, {
    "yr" : "1990",
    "EXPR$1" : "30368"
  } ]
}

Q&A

  • [ ] Q: Why is the first execution slower?

  • [x] Q: Since Drill is distributed, why does each query get dispatched to a single Foreman?

    A: The Foreman is only the node that returns the final result to the client, so one is enough.
    But the actual query is fanned out by the Foreman to multiple nodes!


TODO

  • The test-environment Hive has hardly any tables, and the koudai table's columns are map-typed, which Drill cannot query.
    Plan to import some test datasets and try again.
  • Drill + Tableau

References

Apache Drill official site
Google Dremel principles: how to analyze 1 PB in 3 seconds

apache drill 0.8.0 standalone/distributed installation and test
Deploying a distributed Drill cluster
Setting up Apache Drill and connecting to HDFS

