Apache CarbonData

Versions: carbondata-1.1.0, spark-2.1.1, hadoop-2.6.0

$ mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.1 -Dhadoop.version=2.6.0 clean package

$ ll assembly/target/scala-2.11
8.9M 7 12 16:14 carbondata_2.11-1.1.1-shade-hadoop2.6.0.jar

Local-mode test: the first argument when creating the CarbonSession is a path on the local file system.

bin/spark-shell --jars ~/Github/carbondata-parent-1.1.0/assembly/target/scala-2.11/carbondata_2.11-1.1.1-shade-hadoop2.6.0.jar

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("/tmp/carbon")
carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string,name string,city string,age Int)STORED BY 'carbondata'")
carbon.sql("LOAD DATA INPATH '/Users/zhengqh/Downloads/spark-2.1.1-bin-hadoop2.7/sample.csv' INTO TABLE test_table")
carbon.sql("SELECT city, avg(age), sum(age) FROM test_table GROUP BY city").show()

On the local file system, the table folder contains Fact (table data) and Metadata (table schema).

➜  carbondata-parent-1.1.0 tree /tmp/carbon
/tmp/carbon
├── default
│   └── test_table
│       ├── Fact
│       │   └── Part0
│       │       └── Segment_0
│       │           ├── 0_batchno0-0-1499845043969.carbonindex
│       │           └── part-0-0_batchno0-0-1499845043969.carbondata
│       └── Metadata
│           ├── 3d8bd318-a620-419b-b0fd-c276936375e2.dict
│           ├── 3d8bd318-a620-419b-b0fd-c276936375e2.dictmeta
│           ├── 3d8bd318-a620-419b-b0fd-c276936375e2_27.sortindex
│           ├── f2f45986-6fb6-42af-b991-513ee43aad01.dict
│           ├── f2f45986-6fb6-42af-b991-513ee43aad01.dictmeta
│           ├── f2f45986-6fb6-42af-b991-513ee43aad01_18.sortindex
│           ├── f93ce55d-b82a-4eca-9076-e21dcd819218.dict
│           ├── f93ce55d-b82a-4eca-9076-e21dcd819218.dictmeta
│           ├── f93ce55d-b82a-4eca-9076-e21dcd819218_30.sortindex
│           ├── schema
│           └── tablestatus
└── modifiedTime.mdt

For YARN mode, deploy as described in the official installation guide: http://carbondata.apache.org/installation-guide.html

Note: in YARN mode there is no need to scp the CarbonData jar to every node; it only has to be present on the Driver side. Also, this version no longer depends on Kettle.

cd spark-2.1.1*
mkdir carbonlib
cp ~/carbondata_2.11-1.1.1-shade-hadoop2.6.0.jar carbonlib
cp ~/carbon.properties conf

tar -zcvf carbondata.tar.gz carbonlib/
mv carbondata.tar.gz carbonlib/

$ vi conf/spark-defaults.conf
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=/usr/install/spark-2.1.1-bin-2.6.0-cdh5.9.0/conf/carbon.properties
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=/usr/install/spark-2.1.1-bin-2.6.0-cdh5.9.0/conf/carbon.properties
spark.driver.extraClassPath /usr/install/spark-2.1.1-bin-2.6.0-cdh5.9.0/carbonlib/*
spark.executor.extraClassPath /usr/install/spark-2.1.1-bin-2.6.0-cdh5.9.0/carbonlib/*
spark.yarn.dist.files /usr/install/spark-2.1.1-bin-2.6.0-cdh5.9.0/conf/carbon.properties
spark.yarn.dist.archives /usr/install/spark-2.1.1-bin-2.6.0-cdh5.9.0/carbonlib/carbondata.tar.gz

Starting spark-shell still requires --jars. Note that the first argument when creating the CarbonSession must carry the hdfs:// prefix, otherwise it fails with a file-not-found error.

$ bin/spark-shell --jars /home/admin/carbondata_2.11-1.1.1-shade-hadoop2.6.0.jar

sql("CREATE TABLE IF NOT EXISTS test_table1(id string,name string,city string,age Int)")
sql("insert into table test_table1 values('1','david','shenzhen',31)")
sql("insert into table test_table1 values('2','eason','shenzhen',20)")
sql("insert into table test_table1 values('3','jarry','wuhan',35)")

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://tdhdfs/user/tongdun/carbon","/home/admin/carbon")

carbon.sql("CREATE TABLE IF NOT EXISTS test_table2(id string,name string,city string,age Int)STORED BY 'carbondata'")
carbon.sql("INSERT INTO test_table2 SELECT * FROM test_table1") // insert #1
carbon.sql("select * from test_table2").show
carbon.sql("INSERT INTO test_table2 SELECT * FROM test_table1") // insert again
carbon.sql("select * from test_table2").show

carbon.sql("INSERT overwrite table test_table2 SELECT * FROM test_table1") // overwrite

When CarbonData runs on HDFS, both its fact data and its metadata are stored on HDFS.


After importing the HDFS table data into the table created by CarbonData, run a few queries and observe the Spark UI.

Note: a data load in CarbonData runs in two steps: GlobalDictionary and CarbonDataRDD.
The global dictionary step produces index files under Metadata, while the CarbonDataRDD step generates data files under Fact.
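
Whether a load actually produced a new segment (and whether it succeeded) can be checked from the same session with CarbonData's segment listing, e.g. for the table loaded above:

// Each successful LOAD/INSERT shows up as one segment under Fact/Part0
carbon.sql("SHOW SEGMENTS FOR TABLE test_table2 LIMIT 10").show()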


CarbonData data import and querying

Create the crosspartner CarbonData table

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://tdhdfs/user/tongdun/carbon","/home/admin/carbon")

carbon.sql("drop table cross_partner_carbon")
carbon.sql("""CREATE TABLE IF NOT EXISTS cross_partner_carbon(
partnerCode string,
eventType string,
idNumber string,
accountMobile string,
accountEmail string,
accountPhone string,
deviceId string,
cardNumber string,
contact1Mobile string,
contact2Mobile string,
contact3Mobile string,
contact4Mobile string,
contact5Mobile string,
contact1IdNumber string,
contact2IdNumber string,
contact3IdNumber string,
contact4IdNumber string,
contact5IdNumber string,
sequenceId string
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_EXCLUDE'='sequenceId')
""")

Then populate the CarbonData table and compare queries against the Parquet-backed table:

carbon.sql("insert into cross_partner_carbon select * from crosspartner")

spark.sql("select count(*) from cross2partner_dt").show
carbon.sql("select count(*) from cross_partner_carbon_dm").show

spark.sql("select * from cross2partner_dt").show
carbon.sql("select * from cross_partner_carbon_dm").show

val idnumber=""
spark.sql(s"select sequenceId from cross2partner_dt where partnerCode='007fenqi' and eventType='Loan' and idNumber='$idnumber'").show
carbon.sql(s"select sequenceId from cross_partner_carbon_dm where partnerCode='007fenqi' and eventType='Loan' and idNumber='$idnumber'").show

Compare filtering on crosspartner_hdfs with the equivalent CarbonData query:

carbon.sql("select sequenceId from cross_partner_carbon where partnerCode='qufenqi' and eventType='Loan' and idNumber=''").show

Experiment results

When the CarbonData table is created with the default of indexing (dictionary-encoding) every column, the Executors hit OOM errors during the data load.
If the index is removed from all columns, loading is fast, but queries become slow.
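
A middle ground, assuming the OOM comes from building global dictionaries for the high-cardinality ID-like columns, is to exclude only those columns and keep the dictionary on the low-cardinality filter columns; section 2 below pushes this to excluding every such column. A minimal sketch (the table name and reduced column list are hypothetical):

carbon.sql("""CREATE TABLE IF NOT EXISTS cross_partner_carbon_lite(
partnerCode string,
eventType string,
idNumber string,
accountMobile string,
sequenceId string
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_EXCLUDE'='idNumber,accountMobile,sequenceId')
""")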

Comparing disk usage: without the dictionary index, Parquet and CarbonData take roughly the same amount of space.
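
One way to make that comparison concrete is to sum the bytes under each store with the Hadoop FileSystem API; a sketch (the CarbonData path assumes the default/<table> layout seen in the local-mode tree above):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val parquetBytes = fs.getContentSummary(new Path("/user/hive/warehouse/cross_partner_hdfs")).getLength
val carbonBytes = fs.getContentSummary(new Path("/user/tongdun/carbon/default/cross_partner_carbon")).getLength
println(f"parquet=${parquetBytes/1e9}%.2f GB, carbondata=${carbonBytes/1e9}%.2f GB")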


Issues

1. Hive tables vs. CarbonData tables

From the activity event data, take only the Loan and Lending events and save them into a staging table crosspartner_hdfs.

spark.sql("""CREATE TABLE crosspartner_hdfs(
partnerCode string,
eventType string,
idNumber string,
accountMobile string,
accountEmail string,
accountPhone string,
deviceId string,
cardNumber string,
contact1Mobile string,
contact2Mobile string,
contact3Mobile string,
contact4Mobile string,
contact5Mobile string,
contact1IdNumber string,
contact2IdNumber string,
contact3IdNumber string,
contact4IdNumber string,
contact5IdNumber string,
sequenceId string
) partitioned by(ds string)
""")

spark.sql("""insert into table crosspartner_hdfs partition(ds='201706')
select
activity_map.partnerCode as partnerCode,
activity_map.eventType as eventType,
activity_map.idNumber as idNumber,
activity_map.accountMobile as accountMobile,
activity_map.accountEmail as accountEmail,
activity_map.accountPhone as accountPhone,
activity_map.deviceId as deviceId,
activity_map.cardNumber as cardNumber,
activity_map.contact1Mobile as contact1Mobile,
activity_map.contact2Mobile as contact2Mobile,
activity_map.contact3Mobile as contact3Mobile,
activity_map.contact4Mobile as contact4Mobile,
activity_map.contact5Mobile as contact5Mobile,
activity_map.contact1IdNumber as contact1IdNumber,
activity_map.contact2IdNumber as contact2IdNumber,
activity_map.contact3IdNumber as contact3IdNumber,
activity_map.contact4IdNumber as contact4IdNumber,
activity_map.contact5IdNumber as contact5IdNumber,
activity_map.sequenceId as sequenceId
from activity
where year=2017 and month=6
and activity_map.eventType in('Loan','Lending')
""")

If the table above is created without specifying Parquet as the storage format, the output ends up as plain part-xxx files.
And even when Parquet is specified, the INSERT statement gives no control over the number of output files per partition.

Below we switch to writing plain Parquet folders with manually added partitions instead: cross_partner_hdfs

import java.text.SimpleDateFormat
import java.util.{Calendar,Date}

def year(ymd: String) = ymd.substring(0,4)
def month(ymd: String) = {
  var month = ymd.substring(4,6)
  if (month.startsWith("0")) month = ymd.substring(5,6)
  month
}
def day(ymd: String) = {
  var d = ymd.substring(6,8)
  if (d.startsWith("0")) d = ymd.substring(7,8)
  d
}

// Write one Parquet folder per day
def genCrossData(beg: String, end: String) = {
  var cal = Calendar.getInstance()
  var datef = new SimpleDateFormat("yyyyMMdd")
  var beginTime = datef.parse(beg)
  var endTime = datef.parse(end)
  while (beginTime.compareTo(endTime) <= 0) {
    cal.setTime(beginTime)
    var ymd = datef.format(beginTime)
    println(ymd)
    var y = year(ymd)
    var m = month(ymd)
    var d = day(ymd)
    spark.sql(s"""
      select
        activity_map.partnerCode as partnerCode,
        activity_map.eventType as eventType,
        activity_map.idNumber as idNumber,
        activity_map.accountMobile as accountMobile,
        activity_map.accountEmail as accountEmail,
        activity_map.accountPhone as accountPhone,
        activity_map.deviceId as deviceId,
        activity_map.cardNumber as cardNumber,
        activity_map.contact1Mobile as contact1Mobile,
        activity_map.contact2Mobile as contact2Mobile,
        activity_map.contact3Mobile as contact3Mobile,
        activity_map.contact4Mobile as contact4Mobile,
        activity_map.contact5Mobile as contact5Mobile,
        activity_map.contact1IdNumber as contact1IdNumber,
        activity_map.contact2IdNumber as contact2IdNumber,
        activity_map.contact3IdNumber as contact3IdNumber,
        activity_map.contact4IdNumber as contact4IdNumber,
        activity_map.contact5IdNumber as contact5IdNumber,
        activity_map.sequenceId as sequenceId
      from activity
      where year=$y and month=$m and day=$d
        and activity_map.eventType in('Loan','Lending')
    """).repartition(1).write.mode("overwrite").parquet(s"/user/hive/warehouse/cross_partner_hdfs/ds=$ymd")
    cal.add(Calendar.DATE, 1)
    beginTime = cal.getTime()
  }
}
genCrossData("20170101","20170630")

genCrossData("20170621","20170630")

Query the Parquet data by registering a temp view and using Spark SQL:

val df=spark.read.parquet("/user/hive/warehouse/cross_partner_hdfs/*")
df.createOrReplaceTempView("cross_partner_hdfs")

spark.sql("select * from cross_partner_hdfs").show

spark.sql("select sequenceId from cross_partner_hdfs where partnerCode='qufenqi' and eventType='Loan' and idNumber=''").show

Insert the data from the temp view into the CarbonData table:

val df=spark.read.parquet("/user/hive/warehouse/cross_partner_hdfs/*")
df.createOrReplaceTempView("cross_partner_hdfs")

carbon.sql("insert into cross_partner_carbon select * from cross_partner_hdfs")

CarbonData does not recognize the temp view registered from the DataFrame:

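This is expected: a temp view is scoped to the SparkSession that registered it, and spark and carbon are separate sessions here. A sketch of a workaround is to register the view on the CarbonSession itself:

// Register the view on the carbon session so carbon.sql can resolve it
val df = carbon.read.parquet("/user/hive/warehouse/cross_partner_hdfs/*")
df.createOrReplaceTempView("cross_partner_hdfs")
carbon.sql("insert into cross_partner_carbon select * from cross_partner_hdfs")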

Create the Hive table with Parquet format specified, and build it directly from the data already in the Parquet folders:

spark.sql("""CREATE TABLE crosspartner(
partnerCode string,
eventType string,
idNumber string,
accountMobile string,
accountEmail string,
accountPhone string,
deviceId string,
cardNumber string,
contact1Mobile string,
contact2Mobile string,
contact3Mobile string,
contact4Mobile string,
contact5Mobile string,
contact1IdNumber string,
contact2IdNumber string,
contact3IdNumber string,
contact4IdNumber string,
contact5IdNumber string,
sequenceId string
) partitioned by(ds string) stored as parquet
""")

import java.text.SimpleDateFormat
import java.util.{Calendar,Date}

def genCrossData(beg: String, end: String) = {
  var cal = Calendar.getInstance()
  var datef = new SimpleDateFormat("yyyyMMdd")
  var beginTime = datef.parse(beg)
  var endTime = datef.parse(end)
  while (beginTime.compareTo(endTime) <= 0) {
    cal.setTime(beginTime)
    var ymd = datef.format(beginTime)
    var df = spark.read.parquet(s"/user/hive/warehouse/cross_partner_hdfs/ds=$ymd")
    df.repartition(1).write.mode("overwrite").parquet(s"/user/hive/warehouse/crosspartner/ds=$ymd")
    spark.sql(s"alter table crosspartner add partition(ds='$ymd')")
    cal.add(Calendar.DATE, 1)
    beginTime = cal.getTime()
  }
}
genCrossData("20170101","20170630")

Alternatively, create an external table directly over the Parquet files:

spark.sql("""
create external table cross2partner_dt(
partnerCode string,
eventType string,
idNumber string,
accountMobile string,
accountEmail string,
accountPhone string,
deviceId string,
cardNumber string,
contact1Mobile string,
contact2Mobile string,
contact3Mobile string,
contact4Mobile string,
contact5Mobile string,
contact1IdNumber string,
contact2IdNumber string,
contact3IdNumber string,
contact4IdNumber string,
contact5IdNumber string,
sequenceId string
)
partitioned by (ds string)
stored as parquet
location '/user/hive/warehouse/cross_partner_hdfs'
""")
spark.sql(s"alter table cross2partner_dt add partition(ds='20170101')")

import java.text.SimpleDateFormat
import java.util.{Calendar,Date}

def genCrossData(beg: String, end: String) = {
  var cal = Calendar.getInstance()
  var datef = new SimpleDateFormat("yyyyMMdd")
  var beginTime = datef.parse(beg)
  var endTime = datef.parse(end)
  while (beginTime.compareTo(endTime) <= 0) {
    cal.setTime(beginTime)
    var ymd = datef.format(beginTime)
    spark.sql(s"alter table cross2partner_dt add partition(ds='$ymd')")
    cal.add(Calendar.DATE, 1)
    beginTime = cal.getTime()
  }
}
genCrossData("20170102","20170630")

Inserting all of the data into CarbonData in one shot is too slow.

carbon.sql(s"insert into cross_partner_carbon select * from crosspartner where ds like '$ymd%'")

Switch to loading the CarbonData table month by month (or day by day) instead:

import java.text.SimpleDateFormat
import java.util.{Calendar,Date}

def genCrossCarbonData(beg: String, end: String) = {
  var cal = Calendar.getInstance()
  var datef = new SimpleDateFormat("yyyyMM")
  var beginTime = datef.parse(beg)
  var endTime = datef.parse(end)
  while (beginTime.compareTo(endTime) <= 0) {
    cal.setTime(beginTime)
    var ymd = datef.format(beginTime)
    println(ymd)
    carbon.sql(s"insert into cross_partner_carbon select * from cross2partner_dt where ds like '$ymd%'")
    // advance by one month (the format is yyyyMM); stepping by Calendar.DATE here
    // would re-insert the same month once per day
    cal.add(Calendar.MONTH, 1)
    beginTime = cal.getTime()
  }
}
genCrossCarbonData("201701","201706")

The load still throws errors.

Increase the memory:

bin/spark-shell \
--conf spark.executor.instances=15 \
--conf spark.executor.cores=2 \
--conf spark.executor.memory=8g \
--conf spark.driver.memory=8g \
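
Besides raising the Spark memory, the load itself can be throttled from conf/carbon.properties. A hedged sketch using two load-related settings from the CarbonData configuration guide (the values below are illustrative, not tuned):

# conf/carbon.properties: data-load tuning (illustrative values)
# cores used per executor during the load step
carbon.number.of.cores.while.loading=4
# records sorted in memory before spilling an intermediate sort file
carbon.sort.size=100000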

2. Other CarbonData settings

carbon.sql("""CREATE TABLE IF NOT EXISTS crosspartner1(
...
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_EXCLUDE'='sequenceId,idNumber,accountMobile,accountEmail,accountPhone,deviceId,cardNumber,contact1Mobile,contact2Mobile,contact3Mobile,contact4Mobile,contact5Mobile,contact1IdNumber,contact2IdNumber,contact3IdNumber,contact4IdNumber,contact5IdNumber')
""")

carbon.sql("insert into crosspartner1 select * from cross_partner_hdfs")

3. carbon thrift server

bin/spark-submit \
--conf spark.sql.hive.thriftServer.singleSession=true \
--hiveconf hive.server2.thrift.port=10002 \
--hiveconf hive.server2.thrift.bind.host="192.168.39.25" \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
carbonlib/carbondata_2.11-1.1.1-shade-hadoop2.6.0.jar \
hdfs://tdhdfs/user/tongdun/carbon
hdfs://tdhdfs/user/hive/warehouse/carbon.store
hdfs://tdhdfs/user/tongdun/carbondata/CarbonStore
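
Once the thrift server is up, it can be exercised with the beeline client bundled with Spark (host and port taken from the command above; this assumes no authentication is configured):

bin/beeline -u jdbc:hive2://192.168.39.25:10002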

4. spark-2.2.0

carbondata-1.1.1 does not yet support Spark 2.2. Adding a profile and changing the Spark version to 2.2.0, the build fails:

$ mvn -DskipTests -Pspark-2.2 -Dspark.version=2.2.0 -Dhadoop.version=2.6.0 clean package

[WARNING] /Users/zhengqh/Github/carbondata-parent-1.1.1/integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/UpdateCoalescedRDD.scala:23: warning: imported `RDD' is permanently hidden by definition of class RDD in package rdd
[INFO] import org.apache.spark.rdd.{CoalescedRDDPartition, DataLoadPartitionCoalescer, RDD}
[INFO] ^
[WARNING] /Users/zhengqh/Github/carbondata-parent-1.1.1/integration/spark-common/src/main/scala/org/apache/carbondata/spark/util/CarbonScalaUtil.scala:125: warning: non-variable type argument Any in type pattern scala.collection.Map[Any,Any] is unchecked since it is eliminated by erasure
[INFO] case m: scala.collection.Map[Any, Any] =>
[INFO] ^
[ERROR] /Users/zhengqh/Github/carbondata-parent-1.1.1/integration/spark-common/src/main/scala/org/apache/spark/sql/optimizer/CarbonDecoderOptimizerHelper.scala:87: error: value child is not a member of org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable
[INFO] case i: InsertIntoTable => process(i.child, nodeList)
[INFO] ^
[WARNING] 11 warnings found
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache CarbonData :: Parent ........................ SUCCESS [ 5.140 s]
[INFO] Apache CarbonData :: Common ........................ SUCCESS [ 10.114 s]
[INFO] Apache CarbonData :: Core .......................... SUCCESS [ 29.232 s]
[INFO] Apache CarbonData :: Processing .................... SUCCESS [ 9.828 s]
[INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 5.719 s]
[INFO] Apache CarbonData :: Spark Common .................. FAILURE [01:10 min]
[INFO] Apache CarbonData :: Spark Common Test ............. SKIPPED
[INFO] Apache CarbonData :: Assembly ...................... SKIPPED
[INFO] Apache CarbonData :: Spark2 ........................ SKIPPED
[INFO] Apache CarbonData :: Spark2 Examples ............... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:10 min
[INFO] Finished at: 2017-08-03T14:39:55+08:00
[INFO] Final Memory: 72M/786M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:compile (default) on project carbondata-spark-common: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :carbondata-spark-common

Dropping the binary built against spark-2.1.1 into spark-2.2.0 also fails with errors.


spark-1.6.2

case class InsertIntoTable(
    table: LogicalPlan,
    partition: Map[String, Option[String]],
    child: LogicalPlan,
    overwrite: Boolean,
    ifNotExists: Boolean)
  extends LogicalPlan {

  override def children: Seq[LogicalPlan] = child :: Nil
  override def output: Seq[Attribute] = Seq.empty

  assert(overwrite || !ifNotExists)
  override lazy val resolved: Boolean = childrenResolved && child.output.zip(table.output).forall {
    case (childAttr, tableAttr) =>
      DataType.equalsIgnoreCompatibleNullability(childAttr.dataType, tableAttr.dataType)
  }
}

spark-2.2.0

case class InsertIntoTable(
    table: LogicalPlan,
    partition: Map[String, Option[String]],
    query: LogicalPlan,
    overwrite: Boolean,
    ifPartitionNotExists: Boolean)
  extends LogicalPlan {
  // We don't want `table` in children as sometimes we don't want to transform it.
  override def children: Seq[LogicalPlan] = query :: Nil
  override def output: Seq[Attribute] = Seq.empty
  override lazy val resolved: Boolean = false
}

After changing it to i.query, rebuild.
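
The one-line patch in CarbonDecoderOptimizerHelper.scala (line 87, per the compiler error above) is just the field rename; a sketch:

// before (compiles against Spark 2.1, where the plan's child is named `child`):
//   case i: InsertIntoTable => process(i.child, nodeList)
// after (Spark 2.2 renamed the field to `query`):
case i: InsertIntoTable => process(i.query, nodeList)

The rebuild then fails further along, in the assembly module: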

[INFO] Apache CarbonData :: Assembly ...................... FAILURE [  2.180 s]
[INFO] Apache CarbonData :: Spark2 ........................ SKIPPED
[INFO] Apache CarbonData :: Spark2 Examples ............... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:57 min
[INFO] Finished at: 2017-08-03T15:33:59+08:00
[INFO] Final Memory: 83M/728M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project carbondata-assembly: Could not resolve dependencies for project org.apache.carbondata:carbondata-assembly:pom:1.1.1: Could not find artifact org.apache.carbondata:carbondata-spark:jar:1.1.1 in central (http://repo1.maven.org/maven2) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :carbondata-assembly

By default the assembly module uses the Spark 1.6 profile and cannot resolve the 1.1.1 carbondata-spark artifact, so make the spark-2.2 profile the default (i.e. add the following):

<profile>
  <id>spark-2.2</id>
  <activation>
    <activeByDefault>true</activeByDefault>
  </activation>
  <dependencies>
    <dependency>
      <groupId>org.apache.carbondata</groupId>
      <artifactId>carbondata-spark2</artifactId>
      <version>${project.version}</version>
    </dependency>
  </dependencies>
</profile>
