Cassandra Queries

  • Query
    • Range Query
    • Problem 1: querying an application's most recent records has a bug: it returns the earliest, not the most recent
    • Fix: use ORDER BY DESC for per-application recent records
    • Problem 2: the current table schema cannot answer global most-recent queries
    • Splitting the table
  • CLUSTERING ORDER BY
  • Real-world queries: Tombstones
    • Too many tombstones abort the query
    • ORDER BY + LIMIT queries
    • A timestamp bound to avoid reading tombstones
    • ORDER BY + timestamp
    • ORDER BY + LIMIT: if there are fewer live rows than the limit, are TTL-expired rows read?
    • Tombstone lessons learned
    • Slice Query
  • Paging Query
  • Read Time Out
    • Future Interrupt
  • More About Query: Token?
  • Node down effect
  • TTL
    • Is TTL per row or per column?
    • USING TTL when inserting data
    • Testing a table's default TTL
    • gc_grace_seconds
    • DELETE operations
  • Read consistency
  • Coordinator node timeout

Query

http://www.planetcassandra.org/blog/we-shall-have-order/
http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

CREATE TABLE velocity (
attribute text, // attribute value, e.g. zhangshan
partner_code text, // partner, e.g. koudai
app_name text, // application, e.g. koudai_ios
type text, // attribute type name, e.g. account
"timestamp" bigint, // event timestamp
event text, // event payload, a JSON string
sequence_id text, // sequence number
PRIMARY KEY ((attribute), partner_code, app_name, type, "timestamp")
)

Primary key: (attribute), partner_code, app_name, type, "timestamp"
Partition key: attribute
Clustering key: partner_code, app_name, type, timestamp

The timestamp column was added for sorting; see CassandraVelocityReadDaoImpl Line_133:

Collections.sort(velocities, new Comparator<VelocityDO>() {
    @Override
    public int compare(VelocityDO velo1, VelocityDO velo2) {
        if (velo1.getTimestamp() - velo2.getTimestamp() == 0) {
            return 0;
        }
        return velo1.getTimestamp() < velo2.getTimestamp() ? -1 : 1;
    }
});

TODO: the whole purpose of the clustering key is to support ORDER BY. The improvement is to let Cassandra do the ordering itself, which makes the client-side sort above unnecessary; see the sketch below.
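
A minimal sketch of that server-side ordering with the DataStax Java driver, assuming a Session named session already connected to the forseti keyspace (the bind values are just the test values used below):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;

// Hypothetical sketch: push the ordering down to Cassandra via the clustering key.
PreparedStatement ps = session.prepare(
    "SELECT * FROM velocity " +
    "WHERE attribute = ? AND partner_code = ? AND app_name = ? AND type = ? " +
    "ORDER BY partner_code DESC, app_name DESC, type DESC, timestamp DESC LIMIT 1000");
ResultSet rs = session.execute(ps.bind("test", "koudai", "koudai_ios", "account"));
for (Row row : rs) {
    // Rows already arrive newest-first; no client-side Collections.sort is needed.
    long ts = row.getLong("timestamp");
}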

Test data (apart from timestamp and sequence_id the rows below are identical; note that the insertion order has nothing to do with the timestamp values):

insert into velocity(attribute,partner_code,app_name,type,timestamp,event,sequence_id)values('test','koudai','koudai_ios','account',1234567,'event blob','1234567');
insert into velocity(attribute,partner_code,app_name,type,timestamp,event,sequence_id)values('test','koudai','koudai_ios','account',12345,'event blob','12345');
insert into velocity(attribute,partner_code,app_name,type,timestamp,event,sequence_id)values('test','koudai','koudai_ios','account',123456,'event blob','123456');
cqlsh:forseti> select * from velocity where attribute='test';
attribute | partner_code | app_name | type | timestamp | event | sequence_id
-----------+--------------+------------+---------+-----------+------------+-------------
test | koudai | koudai_ios | account | 12345 | event blob | 12345
test | koudai | koudai_ios | account | 123456 | event blob | 123456
test | koudai | koudai_ios | account | 1234567 | event blob | 1234567

How is this data stored inside Cassandra?

http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure
[Figure: cassandra_mapping]

Every non-primary-key column is stored under a name made of all the clustering key values plus that column's CQL name, and the cell value is that column's value.

     |koudai:koudai_ios:account:12345:event|koudai:koudai_ios:account:12345:sequence_id|koudai:koudai_ios:account:123456:event|koudai:koudai_ios:account:123456:sequence_id
test |event blob |12345 |event blob |123456

Here test is the row key, i.e. the value of the partition key attribute.
In the column name koudai:koudai_ios:account:12345:event, the prefix koudai:koudai_ios:account:12345 holds the values of the clustering key partner_code:app_name:type:timestamp,
and the trailing event is the name of the event column; likewise the sequence_id in koudai:koudai_ios:account:12345:sequence_id is the name of the sequence_id column. The cell value is the column's value.

Because the same attribute (with the same partner, application and attribute type) can occur many times, all of those records end up in the same row (same row here meaning the same partition key).

So the per-application query select * from velocity where attribute=? and partner_code=? and app_name=? and type=? corresponds to:
select * from velocity where attribute='test' and partner_code='koudai' and app_name='koudai_ios' and type='account'
This query returns the three records above for attribute='test' (12345, 123456, 1234567), and it is fast, because the data is already sorted on disk by the clustering key.

Although the WHERE clause above does not mention timestamp, by the semantics of the clustering key any query driven by the clustering key returns rows in clustering order.

Range Query

The partition key must be defined in the WHERE clause and
the ORDER BY clause defines the clustering column to use for ordering.

SELECT * FROM users WHERE userID IN (102,104) ORDER BY age ASC;

userID is the partition key and age is a clustering column,
so the primary key of the users table must be PRIMARY KEY(userID, age).
Columns appearing in the WHERE clause must be part of the primary key or carry a secondary index.

Cassandra does not support non-equal conditional operations on the partition key.
Use the token function for range queries on the partition key.

Problem 1: querying an application's most recent records has a bug: it returns the earliest, not the most recent

Without an explicit ordering, a LIMIT query does not return the newest rows but the oldest: when no ordering is specified, the default is ascending clustering-key order.

cqlsh:forseti> select * from velocity where attribute='test' limit 2;
attribute | partner_code | app_name | type | timestamp | event | sequence_id
-----------+--------------+------------+---------+-----------+------------+-------------
test | koudai | koudai_ios | account | 12345 | event blob | 12345
test | koudai | koudai_ios | account | 123456 | event blob | 123456

Fix: use ORDER BY DESC for per-application recent records

The correct way to use ORDER BY here is to specify all of the clustering keys. With DESC ordering, the newest record (1234567) is returned.

cqlsh:forseti> select * from velocity 
where attribute='test' and partner_code='koudai' and app_name='koudai_ios' and type='account'
order by partner_code desc,app_name desc,type desc,timestamp desc limit 1;

attribute | partner_code | app_name | type | timestamp | event | sequence_id
-----------+--------------+------------+---------+-----------+------------+-------------
test | koudai | koudai_ios | account | 1234567 | event blob | 1234567
(1 rows)

When ordering by multiple columns, the sort direction has to be given for every column.

Problem 2: the current table schema cannot answer global most-recent queries

A global query ordered by time descending still has to specify the leading clustering keys; ordering by timestamp alone is not allowed.
That hurts correctness: suppose partner_code='abc' has 10 records with ts=1, while partner_code='cde' has records with ts=10; the query should return cde's records, not abc's.
And a plain LIMIT without ORDER BY returns data that is not in reverse time order either.

select * from velocity where attribute='test' order by timestamp desc limit 10;  ×: syntax error!
select * from velocity where attribute='test' order by partner_code desc,app_name desc,type desc,timestamp desc limit 10; ×: wrong for the business logic; not a strict reverse time order, so the result is inaccurate

WITH CLUSTERING ORDER BY (timestamp DESC) can force the on-disk ordering, but it can only be specified when the table is created; it cannot be changed later with ALTER TABLE.

CREATE TABLE velocity_order (
attribute text,
partner_code text,
app_name text,
type text,
"timestamp" bigint,
event text,
sequence_id text,
PRIMARY KEY ((attribute), partner_code, app_name, type, "timestamp")
) WITH CLUSTERING ORDER BY ("timestamp" DESC);

Bad Request: Missing CLUSTERING ORDER for column partner_code

CLUSTERING ORDER BY must include all of the clustering keys; specifying only timestamp as above does not work. It has to be:

CREATE TABLE velocity_order (
attribute text,
partner_code text,
app_name text,
type text,
"timestamp" bigint,
event text,
sequence_id text,
PRIMARY KEY ((attribute), partner_code, app_name, type, "timestamp")
) WITH CLUSTERING ORDER BY (partner_code DESC, app_name DESC, type DESC, "timestamp" DESC);

For the business requirement this is still wrong for global queries: a global query should be ordered by timestamp descending only, and fields such as partner_code should not take part in the ordering at all.

Splitting the table

Because the global query above cannot return the most recent 1000 records, velocity needs to be split into three tables: per application, per partner, and global.

Writes now have to go to three tables, but writes are cheap in Cassandra; the data redundancy is acceptable as long as reads stay fast (see the fan-out sketch after the table definitions below).

Per application:
CREATE TABLE velocity_app (
attribute text,
partner_code text,
app_name text,
type text,
"timestamp" bigint,
event text,
sequence_id text,
PRIMARY KEY ((attribute, partner_code, app_name, type), "timestamp")
)
select * from velocity_app where attribute=? and partner_code=? and app_name=? and type=? order by timestamp desc limit 1000;

Per partner:
CREATE TABLE velocity_partner (
attribute text,
partner_code text,
app_name text,
type text,
"timestamp" bigint,
event text,
sequence_id text,
PRIMARY KEY ((attribute, partner_code), "timestamp")
)
select * from velocity_partner where attribute=? and partner_code=? order by timestamp desc limit 1000;

Global:
CREATE TABLE velocity_global (
attribute text,
partner_code text,
app_name text,
type text,
"timestamp" bigint,
event text,
sequence_id text,
PRIMARY KEY ((attribute), "timestamp")
)
select * from velocity_global where attribute=? order by timestamp desc limit 1000;
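
Every incoming event now has to be written to velocity_app, velocity_partner and velocity_global. A minimal sketch of that write fan-out with asynchronous inserts, assuming a connected Session named session and using the earlier test values as placeholders:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one event is written to all three velocity tables.
String attr = "test", partner = "koudai", app = "koudai_ios", type = "account";
long ts = System.currentTimeMillis();
String event = "event blob", seqId = String.valueOf(ts);

String columns = " (attribute, partner_code, app_name, type, timestamp, event, sequence_id) VALUES (?,?,?,?,?,?,?)";
PreparedStatement appPs     = session.prepare("INSERT INTO velocity_app" + columns);
PreparedStatement partnerPs = session.prepare("INSERT INTO velocity_partner" + columns);
PreparedStatement globalPs  = session.prepare("INSERT INTO velocity_global" + columns);

List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
futures.add(session.executeAsync(appPs.bind(attr, partner, app, type, ts, event, seqId)));
futures.add(session.executeAsync(partnerPs.bind(attr, partner, app, type, ts, event, seqId)));
futures.add(session.executeAsync(globalPs.bind(attr, partner, app, type, ts, event, seqId)));
for (ResultSetFuture f : futures) {
    f.getUninterruptibly();   // wait until each of the three writes is acknowledged
}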

Is that the end of it? Would adding CLUSTERING ORDER BY to the CREATE TABLE statements above improve performance?

CLUSTERING ORDER BY

http://docs.datastax.com/en/cql/3.1/cql/cql_reference/create_table_r.html?scroll=reference_ds_v3f_vfk_xj__ordering-results

You can order query results to make use of the on-disk sorting of columns.
You can order results in ascending or descending order.
The ascending order will be more efficient than descending. If you need results in descending order,
you can specify a clustering order to store columns on disk in the reverse order of the default.
Descending queries will then be faster than ascending ones.

Note: "columns" here means the columns of Cassandra's internal storage layout, not CQL columns (or columns in the usual sense).
A storage-level column name is the clustering key values plus the name of a non-clustering-key column; rows with the same row key / partition key but different clustering keys live in the same wide row.
These clustering-key combinations become the columns of one partition, and the number of columns per partition is limited.

http://www.planetcassandra.org/blog/we-shall-have-order/

In summary, when it comes to ordering your result sets in Cassandra, remember these points:
- The clustering column(s) determine the data's on-disk sort order only within a partition key.
- Do not model your table with a "dummy" partition key (a key that always has the same value).
- It is not necessary to specify the ORDER BY clause on a query if your desired sort direction ("ASCending/DESCending") already matches the CLUSTERING ORDER in the table definition.
- On a CREATE TABLE command, Cassandra will specify CLUSTERING ORDER for you in the table definition if you did not specify it yourself. The default CLUSTERING ORDER on a table is comprised of your clustering keys in the "ASCending" sort direction.
- Be mindful of the limit of two billion columns per wide row (partition), as well as the problems that accompany unbounded row growth.
- For tables supporting queries for recent, time-based data, you may want to specify a "DESCending" sort direction in your table definition.
- Do not create a new cluster using the Byte Ordered Partitioner.

CREATE TABLE velocity_app (
attribute text,
partner_code text,
app_name text,
type text,
"timestamp" bigint,
event text,
sequence_id text,
PRIMARY KEY ((attribute, partner_code, app_name, type), "timestamp")
) WITH CLUSTERING ORDER BY ("timestamp" DESC)

View 1: the newest records are stored first, so reads no longer need ORDER BY timestamp DESC at all.
View 2: the query should still specify ORDER BY timestamp DESC; CLUSTERING ORDER BY merely makes the descending query faster.
Conclusion: with CLUSTERING ORDER BY in the table definition, a descending read no longer needs an explicit ORDER BY ... DESC (specifying it anyway does no harm).
The query below has no ORDER BY timestamp DESC, yet the rows still come back in descending timestamp order.

cqlsh:forseti> select * from velocity_app where attribute='test_55' and partner_code='koudai' and app_name='koudai_ios' and type='account' limit 2;
attribute | partner_code | app_name | type | timestamp | event | sequence_id
-----------+--------------+------------+---------+-----------+--------------------+-------------
test_55 | koudai | koudai_ios | account | 12346595 | raw event json str | 12346595
test_55 | koudai | koudai_ios | account | 12346594 | raw event json str | 12346594

Real-world queries: Tombstones

Too many tombstones abort the query

1. The original query had no ORDER BY and frequently timed out: once 100,000 tombstones have been scanned the query is aborted, and the abort shows up as a timeout.

select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress' limit 1000;

Request did not complete within rpc_timeout.
Skipped 0/5 non-slice-intersecting sstables, included 0 due to tombstones | 15:23:59,601 | 192.168.47.203 | 830
Merging data from memtables and 5 sstables | 15:23:59,601 | 192.168.47.203 | 838
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 15:23:59,929 | 192.168.47.203 | 328301
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 15:24:00,068 | 192.168.47.222 | 502772
Timed out; received 0 of 1 responses | 15:24:01,565 | 192.168.47.202 | 2000874
Request complete | 15:24:01,564 | 192.168.47.202 | 2000993

Why do tombstones show up in this query?
All the conditions except timestamp are fixed, so the rows are scanned in ascending timestamp order, i.e. in insertion-time order.
Because every row carries a TTL, the earliest rows have long since expired into tombstones.
Scanning in ascending timestamp order therefore walks through those tombstones first: even though they are logically deleted, they have not been purged yet and still get read.

The corresponding error in system.log when the Java API is used:

ERROR [ReadStage:332] 2015-10-14 13:25:35,597 SliceQueryFilter.java (line 206) Scanned over 100000 tombstones in forseti.velocity; query aborted (see tombstone_failure_threshold)
ERROR [ReadStage:332] 2015-10-14 13:25:35,597 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:332,5,main]
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2016)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

If there are too many tombstones the query is aborted; otherwise the scan looks like this:

these tombstones do not count toward the limit    scan starts here          limit reached, stop  -->  final result: 1000 live rows read, plus xxx tombstones
---------------------------------------------------->|
tombstone .... tombstone .. | rc1 rc2 rc3 .... rc1000 rc1001 rc1002 rc1003

ORDER BY + LIMIT queries

  1. Querying with ORDER BY not only fixes correctness (it returns the most recent records) but also fixes the timeouts caused by too many tombstones: two birds with one stone.
    Reading in descending timestamp order, tombstone cells = 0 and the query takes 62 ms (62592 µs).
select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress' 
order by partner_code desc,app_name desc,type desc,timestamp desc limit 1000;

Skipped 0/5 non-slice-intersecting sstables, included 0 due to tombstones | 15:26:13,525 | 192.168.47.203 | 315
Merging data from memtables and 5 sstables | 15:26:13,525 | 192.168.47.203 | 323
Read 1001 live and 0 tombstone cells | 15:26:13,542 | 192.168.47.203 | 17706
Enqueuing response to /192.168.47.202 | 15:26:13,542 | 192.168.47.203 | 17852
Sending message to /192.168.47.202 | 15:26:13,542 | 192.168.47.203 | 18011
Message received from /192.168.47.203 | 15:26:13,551 | 192.168.47.202 | null
Processing response from /192.168.47.203 | 15:26:13,551 | 192.168.47.202 | null
Request complete | 15:26:13,548 | 192.168.47.202 | 62592

Why are no tombstones read here? Scanning in reverse time order starts from the newest data, which cannot have expired yet. With TTL = 3 months there are far more than 1000 live rows,
so the first 1000 rows contain no expired data, i.e. no tombstones.
But what if there are fewer than 1000 live rows and we still ask for LIMIT 1000: will rows whose TTL has expired be read (say there are many of them)?

                                      limit reached; not a single tombstone was touched
|<----------------------------------- a descending scan starts here and moves left
tombstone .... tombstone .. | rc1 rc2 rc3 .... rc1000 rc1001 rc1002 rc1003

|<------------------------ and if there are fewer than 1000 live rows? does the scan continue into the tombstones?
tombstone .... tombstone .. | rc1 rc2 rc3 .... rc500 r501

A timestamp bound to avoid reading tombstones

3. Another approach: do not use ORDER BY, but put a lower bound on timestamp, which avoids reading tombstones altogether.
Since every record has TTL = 3 months, adding the condition timestamp > (now - 3 months) guarantees that no tombstoned record can ever match the query.
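
The cutoff has to be computed on the client, because CQL here cannot do arithmetic on timestamps. A minimal sketch, assuming a connected Session named session and the 3-month application TTL:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;

// Hypothetical sketch: ask only for rows younger than the TTL, so expired (tombstoned) cells are never scanned.
long ttlMillis = 90L * 24 * 60 * 60 * 1000;            // roughly 3 months, matching the data's TTL
long cutoff = System.currentTimeMillis() - ttlMillis;  // computed client-side
PreparedStatement ps = session.prepare(
    "SELECT * FROM velocity " +
    "WHERE attribute = ? AND partner_code = ? AND app_name = ? AND type = ? " +
    "AND timestamp > ? LIMIT 1000");
ResultSet rs = session.execute(ps.bind("58.246.57.66", "testDemo", "testDemo", "ipAddress", cutoff));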

system.log still shows entries like these:

cat /var/log/cassandra/system.log | grep "tombstone cells in forseti.velocity" | tail 
WARN [ReadStage:303] 2015-10-14 12:30:25,333 SliceQueryFilter.java (line 231) Read 1001 live and 1110 tombstone cells in forseti.velocity (see tombstone_warn_threshold).
1000 columns was requested, slices=[jiedaibao:jiedaibao_andro:accountPassword-jiedaibao:jiedaibao_andro:accountPassword:!]
WARN [ReadStage:352] 2015-10-14 12:30:37,635 SliceQueryFilter.java (line 231) Read 1001 live and 45075 tombstone cells in forseti.velocity (see tombstone_warn_threshold).
1000 columns was requested, slices=[koudai:weidian:payeeIdNumber-koudai:weidian:payeeIdNumber:!]

These log lines were produced after the per-application query had already been optimized (ORDER BY + LIMIT), yet by our reasoning
a descending LIMIT 1000 should read no tombstones as long as more than 1000 live rows exist. So why does "Read 1001 live" still come with tombstones?
Could these be partner-level or global queries? But then why does the slice include partner_code:app_name:type?
Note: a slice is a range in the form [xxx-yyy]; it contains the clustering key values, but not the partition key (attribute) and not timestamp.

How does this situation arise? The query completes normally but still scans tombstones.
The measurements below show that the tombstone count does affect latency: 32 ms with no tombstones versus 55 ms with about 2600 tombstones.

The cutoff time has to be supplied by the client: CQL can obtain the current timestamp but cannot add to or subtract from it.

If the cutoff is placed exactly at (or after) the point three months ago, no tombstones are read.
Read 1001 live and 0 tombstone cells, 32 ms

A few seconds further back, a small number of tombstones appear.
Read 1001 live and 45 tombstone cells, 33 ms

select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress'
and timestamp>1437645780000 limit 1000;
Read 1001 live and 519 tombstone cells, 44 ms

Half an hour further back:
select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress'
and timestamp>1437643644000 limit 1000;
Read 1001 live and 2661 tombstone cells, 55 ms

ORDER BY + timestamp

Combining ORDER BY with the timestamp bound: sometimes the two are comparable, but overall the variant with ORDER BY is still a little faster,
and ORDER BY + timestamp reads no tombstones at all, while timestamp alone still does.
Testing also shows that ORDER BY on its own reads no tombstones.

1. timestamp only
select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress' and timestamp>1437710925000 limit 1000;
Read 1001 live and 111 tombstone cells
111: 38ms,
162: 71ms,
207: 51ms
2241: 79ms

2. ORDER BY only
select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress' order by partner_code desc,app_name desc,type desc,timestamp desc limit 1000;
Read 1001 live and 0 tombstone cells
39ms

3. ORDER BY + timestamp
select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress' and timestamp>1437710925000 order by partner_code desc,app_name desc,type desc,timestamp desc limit 1000;
Read 1001 live and 0 tombstone cells
33ms

Queries that return only a handful of live rows but scan more than 10,000 tombstones:

cat /var/log/cassandra/system.log | grep "Read [0-9] live and [0-9][0-9][0-9][0-9][0-9] tombstone cells in forseti.velocity"

WARN [ReadStage:403] 2015-10-17 15:06:22,030 SliceQueryFilter.java (line 231) Read 1 live and 26295 tombstone cells in forseti.velocity (see tombstone_warn_threshold). 1000 columns was requested, slices=[testDemo:testDemo:accountMobile-testDemo:testDemo:accountMobile:!]

WARN [ReadStage:7030] 2015-10-17 15:05:33,481 SliceQueryFilter.java (line 231) Read 0 live and 26295 tombstone cells in forseti.velocity (see tombstone_warn_threshold). 1000 columns was requested, slices=[testDemo:testDemo:accountMobile-testDemo:testDemo:accountMobile:!]
WARN [ReadStage:6968] 2015-10-18 12:12:39,221 SliceQueryFilter.java (line 231) Read 1 live and 15990 tombstone cells in forseti.velocity (see tombstone_warn_threshold). 1000 columns was requested, slices=[testDemo:testDemo:ipAddress-testDemo:testDemo:ipAddress:!]
WARN [ReadStage:7024] 2015-10-21 15:46:03,710 SliceQueryFilter.java (line 231) Read 2 live and 60159 tombstone cells in forseti.velocity (see tombstone_warn_threshold). 1 columns was requested, slices=[testDemo:testDemo:ipAddress:1437580800000:!-testDemo:testDemo:ipAddress:!]
WARN [ReadStage:7038] 2015-10-21 15:47:02,929 SliceQueryFilter.java (line 231) Read 6 live and 60333 tombstone cells in forseti.velocity (see tombstone_warn_threshold). 5 columns was requested, slices=[testDemo:testDemo:ipAddress:1437580800000:!-testDemo:testDemo:ipAddress:!]

ORDER BY + LIMIT: if there are fewer live rows than the limit, are TTL-expired rows read?

TombstoneQueryTest: insert 103 rows, delete 50.

select * from velocity where attribute='test2' and partner_code='koudai' and app_name='koudai_ios' and type='account'

limit 10;
limit 100;
With the default ascending order the tombstone cell count is the same no matter what the limit is; only the live count depends on the limit:
Read 11 live and 15 tombstone cells
Read 95 live and 15 tombstone cells

order by partner_code desc,app_name desc,type desc,timestamp desc limit 30; // descending, returning only part of the live rows (more than 30 rows are live in total)
Read 31 live and 0 tombstone cells

order by partner_code desc,app_name desc,type desc,timestamp desc limit 100; // descending, all live rows returned, but fewer live rows than the limit: does the scan go on into the expired data?
Read 90 live and 30 tombstone cells
After another 50 minutes, 5 more rows had expired, adding 5*3 = 15 tombstones:
Read 85 live and 45 tombstone cells
So when there are fewer live rows than the limit, the descending scan does continue into the tombstones.

Tombstone lessons learned

http://yangzhe1991.org/blog/2015/05/using-cassandra-for-one-year/

LSM-style databases turn a delete into the write of a marker called a tombstone. HBase works on much the same principle: a delete writes a tombstone to the log,
compaction can drop the deleted data when it encounters the tombstone, and a read that meets a tombstone masks the data, skips it, and finally returns everything that was not deleted.
Cassandra, however, does not funnel reads and writes through a single region server; N nodes read and write concurrently, so unlike HBase a node cannot maintain tombstones by considering only itself.
When a single node reads a tombstone it cannot simply mask the data and move on; it has to **return the tombstone to the coordinator that received the client request**,
because it cannot know whether the other nodes ever wrote that tombstone (a write succeeds once W nodes acknowledge it).
Nor can a node decide during compaction that "all data covered by this tombstone is gone locally, so I can drop the tombstone too",
because other nodes may never have received it (a write only needs W nodes, so up to N-W nodes may be missing the tombstone).
If this node drops the tombstone during compaction, a later read may find a node that still has the row but no node that has its tombstone, and the deleted data comes back to life...

The way to handle these two tombstone problems is, first of all, to avoid generating tombstones in bulk. If deleting many column keys cannot be avoided, then at least **avoid range reads over column keys**
(a range read over column keys means reading many columns of one row key; a single partition key can hold a great many columns),
because the read has to hand every tombstone in the range back to the coordinator, and tombstones are not valid result rows and cannot be used for paging:
a client setting limit 100 expects 100 rows, but for the nodes it may mean 100 rows plus 1000 interleaved tombstones, with GC pressure and response time close to reading 1100 rows.
That is why Cassandra by default aborts the read and throws once it hits tens of thousands of tombstones in one query.

Experience with partition keys:

Cassandra builds bloom filters only on the row key, not on column keys (HBase, as I recall, can have them on columns).
So when reading one specific row, if the table declares the entire primary key as the partition key, the bloom filters may narrow the read down to a single SSTable;
but if there are column keys and one partition key holds a very large number of rows, the bloom filter cannot help with column keys,
that partition's data is spread across several SSTables, and all of them have to be read, which hurts performance.
Therefore, if you have no need for range queries, make all of the primary key columns part of the partition key.

http://yangzhe1991.org/blog/2014/08/cassandra删除数据的坑

Cassandra marks a deletion by writing a "tombstone".
By default the marked data is only physically removed from disk after 10 days (gc_grace_seconds in the configuration) and once compaction or cleanup reaches the corresponding SSTable.
During a select, every tombstone in every SSTable that matches the query has to be kept in memory, so that when the SSTables are merged
Cassandra can decide which columns were not deleted and can be returned. Too many tombstones occupy a large part of the heap, trigger all kinds of GC, and hurt performance.
So if the column keys under a row key are deleted frequently (for example through TTL expiry, where each logical row's storage columns get marked as tombstones),

and at the same time a single select reads many column keys of one row key, this situation is easy to hit: even if the read stays under 100k tombstones and eventually succeeds,
it is slow and quickly triggers young-generation GC (the data is read into memory, lands in Eden, and a large volume fills Eden and forces a YGC), dragging down the whole node's response time.

Slice Query

http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

A slice query will keep reading columns until one of the following conditions is met (assuming regular, non-reverse order):
- the specified limit of live columns has been read (only live columns count toward the limit)
- a column beyond the finish column has been read (if specified)
- all columns in the row have been read

http://aryanet.com/blog/cassandra-garbage-collector-tuning

When you have rows with dynamic columns (same partition key, different clustering keys; each dynamic column can be thought of as one traditional row) and your application deletes those columns,
Cassandra will apply tombstones to those columns until the next compaction runs and cleans up expired tombstones past gc_grace_seconds.
When your client does a slice query to read a few columns, even if the majority of columns were previously deleted,
Cassandra will have to read and scan all the tombstones in order to find as many non-deleted columns as you requested in your slice query.

Paging Query

https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

  1. I have some difficulties with paging through a row set; are there any examples available?

Prior to Cassandra 2.0, tricks were used to page through a large result set. Cassandra 2.0, however, has automatic paging:
instead of using the token function to build paging by hand, it is now a built-in feature.
Developers can iterate over the entire result set without caring whether it is larger than memory.
As the client code iterates over the results, extra rows are fetched while old ones are dropped.
Looking at this in Java, note that the SELECT statement matches all rows, while the number of rows retrieved at a time is set to 100.
A simple statement is shown here, but the same code can be written with a prepared statement coupled with a bound statement.
Automatic paging can be disabled if it is not desired. It is also important to test various fetch size settings:
you want to keep the memory footprint small, but not so small that too many round trips to the database are needed.

import com.datastax.driver.core.*;   // Statement, SimpleStatement, ResultSet, Row

// Fetch in pages of 100; iterating over the ResultSet transparently pulls the next page when needed.
Statement stmt = new SimpleStatement(
    "SELECT * FROM raw_weather_data WHERE wsid = '725474:99999' AND year = 2005 AND month = 6");
stmt.setFetchSize(100);
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    System.out.println(row);
}
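
The fetch size can also be configured once for the whole Cluster instead of per statement; a sketch, assuming driver 2.x where QueryOptions.setFetchSize is available (a very large fetch size effectively disables paging):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

// Hypothetical sketch: a default fetch size of 100 rows for every statement on this Cluster.
Cluster cluster = Cluster.builder()
        .addContactPoint("192.168.47.202")
        .withQueryOptions(new QueryOptions().setFetchSize(100))
        .build();
Session session = cluster.connect("forseti");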

Read Time Out

Timeout-related settings in cassandra.yaml:

cross_node_timeout: false
read_request_timeout_in_ms: 2000        # timeout for reading a single row: 2 s
range_request_timeout_in_ms: 15000      # timeout for range reads: 15 s
write_request_timeout_in_ms: 10000      # timeout for writes: 10 s
request_timeout_in_ms: 5000             # default request timeout: 5 s
truncate_request_timeout_in_ms: 60000   # timeout for truncating a table: 1 min
cas_contention_timeout_in_ms: 1000

1. read_request_timeout_in_ms: a single-row read that has not returned within 2 s is aborted.
Example: a query that scans more than 100,000 tombstones.

select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' and app_name='testDemo' and type='ipAddress' limit 1000;
Request did not complete within rpc_timeout.
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 15:24:00,068 | 192.168.47.222 | 502772
Timed out; received 0 of 1 responses | 15:24:01,565 | 192.168.47.202 | 2000874
Request complete | 15:24:01,564 | 192.168.47.202 | 2000993

2. range_request_timeout_in_ms = 15 s: range reads time out after 15 s.
Example: truncate a table and then select with limit 1, or count(*) over all the data:

cqlsh:forseti> select * from activity_detail limit 1;
Request did not complete within rpc_timeout.
Executing seq scan across 15 sstables for (max(-8451995648092911156), max(-8399560236495679024)] | 09:35:20,165 | 192.168.47.203 | 197
Timed out; received 0 of 1 responses for range 144 of 3329 | 09:35:35,149 | 192.168.47.202 | 15148870
Request complete | 09:35:35,149 | 192.168.47.202 | 15149076

cqlsh:forseti> select count(*) from activity_detail;
Request did not complete within rpc_timeout.
Executing seq scan across 15 sstables for (max(-9209434738668044748), max(-9194935286458247645)] | 09:39:28,453 | 192.168.47.203 | 154
Timed out; received 0 of 1 responses for range 10 of 3329 | 09:39:43,435 | 192.168.47.202 | 15010833
Request complete | 09:39:43,434 | 192.168.47.202 | 15010948

3. What if the query simply times out, the cause is not an excess of tombstones, and there is no extra condition that could be added?

 activity                                                                                                                | timestamp    | source         | source_elapsed
-------------------------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 15:38:26,989 | 192.168.48.159 | 0
Parsing SELECT * from android_device_session where session_id = 'Android_....................................' limit 1; | 15:38:26,989 | 192.168.48.159 | 51
Preparing statement | 15:38:26,989 | 192.168.48.159 | 154
Executing single-partition query on android_device_session | 15:38:26,989 | 192.168.48.159 | 305
Acquiring sstable references | 15:38:26,989 | 192.168.48.159 | 315
Merging memtable tombstones | 15:38:26,989 | 192.168.48.159 | 347
Bloom filter allows skipping sstable 542646 | 15:38:26,989 | 192.168.48.159 | 382
Bloom filter allows skipping sstable 542645 | 15:38:26,989 | 192.168.48.159 | 394
Bloom filter allows skipping sstable 542640 | 15:38:26,989 | 192.168.48.159 | 404
Bloom filter allows skipping sstable 542635 | 15:38:26,989 | 192.168.48.159 | 418
Bloom filter allows skipping sstable 542631 | 15:38:26,989 | 192.168.48.159 | 431
Bloom filter allows skipping sstable 542546 | 15:38:26,989 | 192.168.48.159 | 441
Key cache hit for sstable 542489 | 15:38:26,989 | 192.168.48.159 | 456
Seeking to partition beginning in data file | 15:38:26,989 | 192.168.48.159 | 461
Bloom filter allows skipping sstable 541100 | 15:38:26,989 | 192.168.48.159 | 817
Key cache hit for sstable 539754 | 15:38:26,989 | 192.168.48.159 | 838
Seeking to partition beginning in data file | 15:38:26,989 | 192.168.48.159 | 844
Bloom filter allows skipping sstable 538761 | 15:38:26,990 | 192.168.48.159 | 1050
Bloom filter allows skipping sstable 528588 | 15:38:26,990 | 192.168.48.159 | 1064
Bloom filter allows skipping sstable 540223 | 15:38:26,990 | 192.168.48.159 | 1079
Bloom filter allows skipping sstable 500552 | 15:38:26,990 | 192.168.48.159 | 1090
Bloom filter allows skipping sstable 436090 | 15:38:26,990 | 192.168.48.159 | 1100
Bloom filter allows skipping sstable 375563 | 15:38:26,990 | 192.168.48.159 | 1111
Key cache hit for sstable 275761 | 15:38:26,990 | 192.168.48.159 | 1130
Seeking to partition beginning in data file | 15:38:26,990 | 192.168.48.159 | 1138
Key cache hit for sstable 180212 | 15:38:26,990 | 192.168.48.159 | 1362
Seeking to partition beginning in data file | 15:38:26,990 | 192.168.48.159 | 1370
Timed out; received 0 of 1 responses | 15:38:28,989 | 192.168.48.159 | 2000450
Request complete | 15:38:28,989 | 192.168.48.159 | 2000522

Future Interrupt

The read path was changed to use a Future, but some queries still exceed 500 ms:

[Figure: velocity_500ms]
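
A minimal sketch of such a Future-style read with a hard client-side deadline, assuming a connected Session named session and using one of the velocity queries from earlier:

import com.datastax.driver.core.*;   // Statement, SimpleStatement, ResultSet, ResultSetFuture
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: stop waiting on the client after 500 ms.
Statement stmt = new SimpleStatement(
    "select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' " +
    "and app_name='testDemo' and type='ipAddress' " +
    "order by partner_code desc,app_name desc,type desc,timestamp desc limit 1000");
ResultSetFuture future = session.executeAsync(stmt);
try {
    ResultSet rs = future.get(500, TimeUnit.MILLISECONDS);
    // ... consume rs ...
} catch (TimeoutException e) {
    future.cancel(true);        // give up waiting; the coordinator may still finish the read on its own
} catch (InterruptedException | ExecutionException e) {
    // driver-side or server-side failure; handle or log as appropriate
}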

Generally, once queries exceed 50 ms the zabbix query-time graph climbs and opscenter shows jitter:

[Figure: slow_zabbix]

[Figure: slow_monitor]

[Figure: slow_ops]

More About Query: Token?

In Cassandra the partition key maps onto the token ring made up of the cluster nodes' vnodes: hashing the partition key places it at a specific position on the ring.

cqlsh:forseti> select token(attribute) from velocity limit 1;
token(attribute)
----------------------
-9223372016580570173
(1 rows)

Tracing session: 3d1450b0-7e16-11e5-971e-2bcdca057dae
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 16:22:48,380 | 192.168.47.202 | 0
Parsing select token(attribute) from velocity limit 1; | 16:22:48,380 | 192.168.47.202 | 81
Preparing statement | 16:22:48,381 | 192.168.47.202 | 288
Determining replicas to query | 16:22:48,381 | 192.168.47.202 | 651
Message received from /192.168.47.202 | 16:22:48,382 | 192.168.47.222 | 61
Executing seq scan across 6 sstables for [min(-9223372036854775808), max(-9218212468661360888)] | 16:22:48,383 | 192.168.47.222 | 372
Enqueuing request to /192.168.47.222 | 16:22:48,384 | 192.168.47.202 | 3252
Sending message to /192.168.47.222 | 16:22:48,384 | 192.168.47.202 | 3460
Seeking to partition beginning in data file | 16:22:48,386 | 192.168.47.222 | 3476
Read 1 live and 0 tombstone cells | 16:22:48,386 | 192.168.47.222 | 4165
Seeking to partition beginning in data file | 16:22:48,387 | 192.168.47.222 | 4204
Read 1 live and 0 tombstone cells | 16:22:48,387 | 192.168.47.222 | 4239
Scanned 1 rows and matched 1 | 16:22:48,387 | 192.168.47.222 | 4341
Enqueuing response to /192.168.47.202 | 16:22:48,390 | 192.168.47.222 | 8042
Sending message to /192.168.47.202 | 16:22:48,391 | 192.168.47.222 | 8272
Message received from /192.168.47.222 | 16:22:48,393 | 192.168.47.202 | 12316
Processing response from /192.168.47.222 | 16:22:48,393 | 192.168.47.202 | 12407
Request complete | 16:22:48,392 | 192.168.47.202 | 12789

Node down effect

With only node 52 listed in seeds, once 52 goes down even a CQL connection to 53 fails with: Unable to complete request: one or more nodes were unavailable.

Querying through the API gives: com.datastax.driver.core.exceptions.UnavailableException: Not enough replica available for query at consistency ONE (1 required but only 0 alive)

The reason is that the replication factor is 1 (set on the keyspace). Even with both 52 and 53 in seeds, the same exception appears after 52 goes down.
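
Since the UnavailableException comes from replication_factor = 1 on the keyspace (one dead node leaves zero live replicas for part of the ring), one way out is to raise the replication factor and then run a repair. A sketch, assuming the forseti keyspace uses SimpleStrategy:

// Hypothetical sketch: with two replicas a single dead node no longer makes data unavailable at consistency ONE.
// Afterwards, nodetool repair has to be run so existing data is actually copied to the new replicas.
session.execute(
    "ALTER KEYSPACE forseti WITH replication = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 2}");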

TTL

http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
http://engblog.polyvore.com/2015/03/cassandra-compaction-and-tombstone.html

Data in a column, other than a counter column, can have an optional expiration period called TTL (time to live).
The client request specifies a TTL value, defined in seconds, for the data.
TTL data is marked with a tombstone after the requested amount of time has expired.
A tombstone exists for gc_grace_seconds. After data is marked with a tombstone,
the data is automatically removed during the normal compaction and repair processes.
If you want to change the TTL of expiring data, you have to re-insert the data with a new TTL.

Data in a column can carry an optional TTL. Once the specified number of seconds has passed, the TTL'd data (the specified columns) is marked with a tombstone.
The tombstone then lives for gc_grace_seconds: if that setting is 0, it is removed during compaction and repair;
if it is non-zero, the tombstone survives for another gc_grace_seconds and is only removed by a compaction after that.
To change the TTL of expiring data, the data has to be re-inserted with a new TTL (an insert in Cassandra is an upsert: it inserts or updates depending on whether the row already exists).

Since all Cassandra SSTable entries are immutable, any update or delete to a row requires a new SSTable entry to be recorded.
Values which are deleted are replaced by a tombstone record for the length of time configured in gc_grace_seconds.
Cassandra reconciles updates by determining the most recent data, and then only returns the valid data to the client.

This means that:

  • Deletions and storage space recovery are lazy.
  • A row can fragment across more than one SSTable.
  • A tombstone is itself a fragment of a row (after a delete is flushed to disk, that fragment is the tombstone marker).
  • Reading a row requires reconciling all SSTable entries and determining the most recent set of data before returning it to the caller.
  • Row fragments spread across more SSTables increase response latency.

"The most recent set of data" means exactly that: the newest write wins.

Tombstones only arise from TTL expiry (3 months) and from DELETE operations (our workload issues no DELETEs).
velocity's gc_grace_seconds is already set to 0, so once the TTL expires the data is eligible for removal,
but an expired record is only physically deleted at the next compaction.

Because every record carries TTL = 3 months, every second some records reach their TTL.
So a record marked as a tombstone (TTL reached) lives from the moment its TTL expires until the next compaction.

Which turns the question into: when does an SSTable actually get compacted?

http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

LeveledCompactionStrategy (LCS): The leveled compaction strategy creates SSTables of a fixed,
relatively small size (160 MB by default) that are grouped into levels.
Within each level, SSTables are guaranteed to be non-overlapping.
Each level (L0, L1, L2 and so on) is 10 times as large as the previous.
Disk I/O is more uniform and predictable on higher than on lower levels as
SSTables are continuously being compacted into progressively larger levels.
At each level, row keys are merged into non-overlapping SSTables.
This can improve performance for reads, because Cassandra can determine
which SSTables in each level to check for the existence of row key data.

Compaction sub-properties: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html?scroll=compactSubprop__compactionSubpropertiesLCS

http://stackoverflow.com/questions/29132033/cassandra-low-read-performance-with-high-sstable-count
If you are updating data within the same partition frequently and at different times,
you could consider using LeveledCompactionStrategy.
LCS will keep data from the same partition together in the same SSTable within a level,
which greatly improves read performance (fewer files to read), at the cost of more disk I/O doing compaction.
In my experience, the extra compaction disk I/O more than pays off in read performance if you have a high ratio of reads to writes.

Is TTL per row or per column?

INSERT INTO users (user_name, password) VALUES ('cbrown', 'ch@ngem4a') USING TTL 86400;

Q: with this TTL, after 86400 seconds the whole row is no longer visible in CQL. So what does the documentation mean by "column", and why not "row"?
A: a column in Cassandra's storage model is in effect what a traditional database would call a row.

http://cjwebb.github.io/blog/2015/03/02/cassandra-ttl-is-per-column/
That post first inserts a row with one TTL, then uses another INSERT to update just one of its fields with a new TTL.
The two fields now have different TTLs, so they expire at different points in time.
The question then is: once both fields have expired, is the row still visible, perhaps with both fields as NULL? -> It is not visible.
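
A small sketch of that per-column behaviour, assuming a hypothetical users table with user_name as the primary key plus password and email columns (the email column and its value are invented here purely for illustration):

import com.datastax.driver.core.Row;

// Hypothetical sketch: two columns of the same row end up with different TTLs.
session.execute("INSERT INTO users (user_name, password, email) VALUES ('cbrown', 'ch@ngem4a', 'c@example.com') USING TTL 86400");
// Re-insert only one column with a much shorter TTL; the other column keeps its original TTL.
session.execute("INSERT INTO users (user_name, email) VALUES ('cbrown', 'c@example.com') USING TTL 60");
Row row = session.execute("SELECT ttl(password), ttl(email) FROM users WHERE user_name = 'cbrown'").one();
System.out.println(row.getInt(0) + " / " + row.getInt(1));   // e.g. 86395 / 55
// Once every non-primary-key column has expired, the row is no longer visible in CQL at all.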

USING TTL when inserting data

With default_time_to_live=0 on the table:
inserting without a TTL (test) means the data never expires and ttl() returns null;
inserting with a TTL (test3) means the row expires and is removed once the TTL has passed.

cqlsh:forseti> select ttl(event) from velocity where attribute='test' limit 1;
ttl(event)
------------
null

cqlsh:forseti> select ttl(event) from velocity where attribute='test3' limit 1;
ttl(event)
------------
3544

Testing a table's default TTL

1. Specify a default TTL when the table is created and insert without USING TTL.

CREATE TABLE velocity_test (
attribute text,
partner_code text,
app_name text,
type text,
"timestamp" bigint,
event text,
sequence_id text,
PRIMARY KEY ((attribute, partner_code, app_name, type), "timestamp")
) WITH
bloom_filter_fp_chance=0.100000 AND
caching='ALL' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=0 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=3600 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'unchecked_tombstone_compaction': 'true', 'tombstone_threshold': '0.1', 'class': 'LeveledCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};

After inserting data through CassandraClient or the raw API, check the TTL of a non-primary-key column:

cqlsh:forseti> select ttl(event) from velocity_test where attribute='test' and partner_code='koudai' and app_name='koudai_ios' and type='account' limit 1;
ttl(event)
------------
3310
(1 rows)

2. Change the table's default_time_to_live.

alter table velocity_test with  default_time_to_live = 1800;

cqlsh:forseti> select ttl(event) from velocity_test where attribute='test' and partner_code='koudai' and app_name='koudai_ios' and type='account' limit 1;
ttl(event)
------------
2796
(1 rows)

cqlsh:forseti> alter table velocity_test with default_time_to_live = 1800;
cqlsh:forseti> select ttl(event) from velocity_test where attribute='test' and partner_code='koudai' and app_name='koudai_ios' and type='account' limit 1;
ttl(event)
------------
2782
(1 rows)

3. Only newly inserted data (test2) picks up the updated default_time_to_live; existing data (test) keeps its old TTL.

cqlsh:forseti> select ttl(event) from velocity_test where attribute='test2' and partner_code='koudai' and app_name='koudai_ios' and type='account' limit 1;
ttl(event)
------------
1772
(1 rows)

Summary

  • Changing the table's default_time_to_live does not change the TTL of existing data; it only affects newly inserted data.
  • Specifying a TTL at insert time and setting default_time_to_live on the table have the same effect: the data expires and is removed after the TTL.
  • A per-insert TTL normally has to be set in the code; with default_time_to_live, USING TTL can be omitted at insert time.

gc_grace_seconds

Specifies the time to wait before garbage collecting tombstones (deletion markers).
The default value allows a great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node cluster it can be safely set to zero.

gc_grace_seconds exists so that consistency can be achieved before data is physically deleted (the default is 10 days).
If it is set to 0, records tombstoned because of TTL are removed immediately at the next compaction.
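
gc_grace_seconds is a per-table property, so the velocity setting mentioned above would have been applied with something like the following sketch (setting it to 0 is only reasonable here because this workload never issues explicit DELETEs and relies purely on TTL):

// Hypothetical sketch: let TTL tombstones be purged at the next compaction instead of 10 days later.
session.execute("ALTER TABLE velocity WITH gc_grace_seconds = 0");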

DELETE operations

http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_delete.html

CQL provides the DELETE command to delete a column or row.
Deleted values are removed completely by the first compaction following deletion.

Cassandra does support deletes, but the data is not removed immediately; it is removed by the first compaction that runs after the DELETE.

Read consistency

Problem: the first few queries do not find some rows, and then later queries suddenly do.

Reproducing the problem:

  1. With the default consistency level ONE, only one record is returned.
  2. After changing the level to QUORUM, three records are returned.
  3. After changing the level back to ONE, three records are still returned.

The records and their times:

no  sequence_id                        timestamp   date -r
1   1458614825458434F3079346B9597753  1458614825  Tue, Mar 22 2016, 10:47:05
2   1455455063813744F307919065153577  1455455063  Sun, Feb 14 2016, 21:04:23
3   1455454988601891F307919021350725  1455454988  Sun, Feb 14 2016, 21:03:08
  1. When only one record is returned, it is always the most recent one.
  2. When multiple records are returned, they are in ascending time order.
cqlsh:forseti> select sequence_id,attribute from velocity where attribute='450602197704245112';

sequence_id | attribute
----------------------------------+--------------------
1458614825458434F3079346B9597753 | 450602197704245112

activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 09:47:41,507 | 192.168.47.202 | 0
Parsing select sequence_id,attribute from velocity where attribute='450602197704245112' LIMIT 10000; | 09:47:41,508 | 192.168.47.202 | 1453
Preparing statement | 09:47:41,509 | 192.168.47.202 | 2438
Sending message to /192.168.47.221 | 09:47:41,510 | 192.168.47.202 | 2825
Message received from /192.168.47.202 | 09:47:41,523 | 192.168.47.221 | 39
Executing single-partition query on velocity | 09:47:41,523 | 192.168.47.221 | 267
Acquiring sstable references | 09:47:41,523 | 192.168.47.221 | 314
Merging memtable tombstones | 09:47:41,523 | 192.168.47.221 | 358
Bloom filter allows skipping sstable 829683 | 09:47:41,523 | 192.168.47.221 | 462
→→→→ Sending message to /192.168.47.205 | 09:47:41,526 | 192.168.47.202 | 19083
Message received from /192.168.47.221 | 09:47:41,530 | 192.168.47.202 | 22909
Processing response from /192.168.47.221 | 09:47:41,530 | 192.168.47.202 | 23374
Partition index with 0 entries found for sstable 829120 | 09:47:41,533 | 192.168.47.221 | 10514
Seeking to partition beginning in data file | 09:47:41,533 | 192.168.47.221 | 10553
Bloom filter allows skipping sstable 829524 | 09:47:41,534 | 192.168.47.221 | 11161
Skipped 0/6 non-slice-intersecting sstables, included 0 due to tombstones | 09:47:41,534 | 192.168.47.221 | 11234
Merging data from memtables and 1 sstables | 09:47:41,534 | 192.168.47.221 | 11271
Read 1 live and 0 tombstone cells | 09:47:41,534 | 192.168.47.221 | 11346
Enqueuing response to /192.168.47.202 | 09:47:41,534 | 192.168.47.221 | 11482
Sending message to /192.168.47.202 | 09:47:41,534 | 192.168.47.221 | 11655
Request complete | 09:47:41,530 | 192.168.47.202 | 23614

cqlsh:forseti> CONSISTENCY QUORUM
Consistency level set to QUORUM.
cqlsh:forseti> select sequence_id,attribute from velocity where attribute='450602197704245112';

sequence_id | attribute
----------------------------------+--------------------
1455454988601891F307919021350725 | 450602197704245112
1455455063813744F307919065153577 | 450602197704245112
1458614825458434F3079346B9597753 | 450602197704245112

activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 09:51:27,836 | 192.168.47.202 | 0
Parsing select sequence_id,attribute from velocity where attribute='450602197704245112' LIMIT 10000; | 09:51:27,836 | 192.168.47.202 | 137
Preparing statement | 09:51:27,836 | 192.168.47.202 | 218
Sending message to /192.168.47.221 | 09:51:27,836 | 192.168.47.202 | 576
Sending message to /192.168.47.203 | 09:51:27,836 | 192.168.47.202 | 584
Sending message to /192.168.47.205 | 09:51:27,836 | 192.168.47.202 | 589
Message received from /192.168.47.221 | 09:51:27,838 | 192.168.47.202 | 2411
Processing response from /192.168.47.221 | 09:51:27,838 | 192.168.47.202 | 2538
Message received from /192.168.47.205 | 09:51:27,839 | 192.168.47.202 | 2878
Processing response from /192.168.47.205 | 09:51:27,839 | 192.168.47.202 | 2916
Message received from /192.168.47.203 | 09:51:27,841 | 192.168.47.202 | 4825
Processing response from /192.168.47.203 | 09:51:27,841 | 192.168.47.202 | 4966
Message received from /192.168.47.202 | 09:51:27,846 | 192.168.47.221 | 58
Executing single-partition query on velocity | 09:51:27,846 | 192.168.47.221 | 139
Acquiring sstable references | 09:51:27,846 | 192.168.47.221 | 150
Merging memtable tombstones | 09:51:27,846 | 192.168.47.221 | 182
Partition index with 0 entries found for sstable 830589 | 09:51:27,846 | 192.168.47.221 | 294
Seeking to partition beginning in data file | 09:51:27,846 | 192.168.47.221 | 300
Key cache hit for sstable 829120 | 09:51:27,846 | 192.168.47.221 | 487
Seeking to partition beginning in data file | 09:51:27,846 | 192.168.47.221 | 491
Skipped 0/5 non-slice-intersecting sstables, included 0 due to tombstones | 09:51:27,846 | 192.168.47.221 | 755
Merging data from memtables and 2 sstables | 09:51:27,846 | 192.168.47.221 | 762
Read 3 live and 0 tombstone cells | 09:51:27,847 | 192.168.47.221 | 810
Message received from /192.168.47.202 | 09:51:27,847 | 192.168.47.205 | 46
Enqueuing response to /192.168.47.202 | 09:51:27,847 | 192.168.47.221 | 899
Executing single-partition query on velocity | 09:51:27,847 | 192.168.47.205 | 176
Sending message to /192.168.47.202 | 09:51:27,847 | 192.168.47.221 | 1086
Acquiring sstable references | 09:51:27,847 | 192.168.47.205 | 215
Merging memtable tombstones | 09:51:27,847 | 192.168.47.205 | 271
Key cache hit for sstable 804150 | 09:51:27,847 | 192.168.47.205 | 386
Seeking to partition beginning in data file | 09:51:27,847 | 192.168.47.205 | 395
Key cache hit for sstable 796552 | 09:51:27,848 | 192.168.47.205 | 775
Seeking to partition beginning in data file | 09:51:27,848 | 192.168.47.205 | 788
Skipped 0/6 non-slice-intersecting sstables, included 0 due to tombstones | 09:51:27,848 | 192.168.47.205 | 1094
Merging data from memtables and 2 sstables | 09:51:27,848 | 192.168.47.205 | 1107
Read 3 live and 0 tombstone cells | 09:51:27,848 | 192.168.47.205 | 1181
Enqueuing response to /192.168.47.202 | 09:51:27,848 | 192.168.47.205 | 1393
Sending message to /192.168.47.202 | 09:51:27,848 | 192.168.47.205 | 1648
Message received from /192.168.47.202 | 09:51:27,851 | 192.168.47.203 | 40
Executing single-partition query on velocity | 09:51:27,851 | 192.168.47.203 | 147
Acquiring sstable references | 09:51:27,851 | 192.168.47.203 | 181
Merging memtable tombstones | 09:51:27,851 | 192.168.47.203 | 217
Partition index lookup complete (bloom filter false positive) for sstable 760553 | 09:51:27,851 | 192.168.47.203 | 325
Partition index with 0 entries found for sstable 757598 | 09:51:27,852 | 192.168.47.203 | 796
Seeking to partition beginning in data file | 09:51:27,852 | 192.168.47.203 | 810
Partition index with 0 entries found for sstable 747846 | 09:51:27,854 | 192.168.47.203 | 2813
Seeking to partition beginning in data file | 09:51:27,854 | 192.168.47.203 | 2829
Skipped 0/5 non-slice-intersecting sstables, included 0 due to tombstones | 09:51:27,855 | 192.168.47.203 | 3659
Merging data from memtables and 2 sstables | 09:51:27,855 | 192.168.47.203 | 3678
Read 3 live and 0 tombstone cells | 09:51:27,855 | 192.168.47.203 | 3735
Enqueuing response to /192.168.47.202 | 09:51:27,855 | 192.168.47.203 | 3817
Sending message to /192.168.47.202 | 09:51:27,855 | 192.168.47.203 | 3904
Request complete | 09:51:27,841 | 192.168.47.202 | 5142
cqlsh:forseti> CONSISTENCY ONE
Consistency level set to ONE.
cqlsh:forseti> select sequence_id,attribute from velocity where attribute='450602197704245112';

sequence_id | attribute
----------------------------------+--------------------
1455454988601891F307919021350725 | 450602197704245112
1455455063813744F307919065153577 | 450602197704245112
1458614825458434F3079346B9597753 | 450602197704245112

activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 09:52:15,416 | 192.168.47.202 | 0
Parsing select sequence_id,attribute from velocity where attribute='450602197704245112' LIMIT 10000; | 09:52:15,416 | 192.168.47.202 | 71
Preparing statement | 09:52:15,416 | 192.168.47.202 | 186
Sending message to /192.168.47.221 | 09:52:15,416 | 192.168.47.202 | 463
Message received from /192.168.47.221 | 09:52:15,418 | 192.168.47.202 | 2359
Processing response from /192.168.47.221 | 09:52:15,418 | 192.168.47.202 | 2448
Message received from /192.168.47.202 | 09:52:15,426 | 192.168.47.221 | 34
Executing single-partition query on velocity | 09:52:15,426 | 192.168.47.221 | 118
Acquiring sstable references | 09:52:15,426 | 192.168.47.221 | 144
Merging memtable tombstones | 09:52:15,426 | 192.168.47.221 | 178
Key cache hit for sstable 830589 | 09:52:15,426 | 192.168.47.221 | 226
Seeking to partition beginning in data file | 09:52:15,426 | 192.168.47.221 | 238
Key cache hit for sstable 829120 | 09:52:15,426 | 192.168.47.221 | 535
Seeking to partition beginning in data file | 09:52:15,426 | 192.168.47.221 | 543
Skipped 0/5 non-slice-intersecting sstables, included 0 due to tombstones | 09:52:15,427 | 192.168.47.221 | 885
Merging data from memtables and 2 sstables | 09:52:15,427 | 192.168.47.221 | 901
Read 3 live and 0 tombstone cells | 09:52:15,427 | 192.168.47.221 | 986
Enqueuing response to /192.168.47.202 | 09:52:15,427 | 192.168.47.221 | 1074
Sending message to /192.168.47.202 | 09:52:15,427 | 192.168.47.221 | 1303
Request complete | 09:52:15,418 | 192.168.47.202 | 2607

Observations

  1. At consistency ONE the request is sent to a single node (221), which reads just one SSTable (829120) before returning the result to the client (the coordinator).
  2. At QUORUM the request is sent to three nodes, and node 221 now reads two SSTables (829120 and 830589).
    829120 had been read by the previous query, so it is found in the key cache; 830589 goes through the partition index.
    Whether it is a key cache hit or a partition index lookup, a seek into the data file still follows, because only keys are cached, not the data itself.
    All three nodes send their data back, i.e. Sending message to /192.168.47.202.
  3. At ONE again, the request goes to 221, which now reads the two SSTables, both of them via the key cache.

Other notes:

  1. The order of the trace lines does not reflect the order in which the events happened.

Take the third query as an example:

09:52:15,416

Sending message to /192.168.47.221

09:52:15,418:

192.168.47.202 Message received from /192.168.47.221
192.168.47.202 Processing response from /192.168.47.221
192.168.47.202 Request complete

The request was already complete at 09:52:15,418, yet the trace still shows work being done at 09:52:15,426 and later.

Some conclusions

  1. SSTable 829120 holds only one live cell, while 830589 holds two.
  2. So on node 221 the queried row is spread across two SSTables.
  3. The node cqlsh connects to acts as the coordinator, and all responses are finally sent back to it.
  4. At ONE the request is sent to one node; at QUORUM it is sent to three.
  5. The replicas of the queried row live on 221, 205 and 203.
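
The same experiment can be reproduced from application code by setting the consistency level per statement instead of via cqlsh; a minimal sketch with the Java driver, assuming a connected Session named session:

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Hypothetical sketch: run the same read at ONE and then at QUORUM, mirroring the cqlsh session above.
Statement stmt = new SimpleStatement(
    "SELECT sequence_id, attribute FROM velocity WHERE attribute = '450602197704245112'");
stmt.setConsistencyLevel(ConsistencyLevel.ONE);
System.out.println(session.execute(stmt).all().size());     // may see only one replica's local view

stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
System.out.println(session.execute(stmt).all().size());     // 2 of 3 replicas must answer; the newest version wins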

Changing the coordinator node

cqlsh:forseti> select sequence_id,attribute from velocity where attribute='450602197704245112';

sequence_id | attribute
----------------------------------+--------------------
1455454988601891F307919021350725 | 450602197704245112
1455455063813744F307919065153577 | 450602197704245112
1458614825458434F3079346B9597753 | 450602197704245112

activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 10:22:25,031 | 192.168.47.224 | 0
Parsing select sequence_id,attribute from velocity where attribute='450602197704245112' LIMIT 10000; | 10:22:25,031 | 192.168.47.224 | 78
Preparing statement | 10:22:25,031 | 192.168.47.224 | 180
Sending message to /192.168.47.205 | 10:22:25,032 | 192.168.47.224 | 592
Message received from /192.168.47.224 | 10:22:25,032 | 192.168.47.205 | 76
Executing single-partition query on velocity | 10:22:25,032 | 192.168.47.205 | 302
Acquiring sstable references | 10:22:25,032 | 192.168.47.205 | 336
Merging memtable tombstones | 10:22:25,032 | 192.168.47.205 | 390
Bloom filter allows skipping sstable 804350 | 10:22:25,032 | 192.168.47.205 | 440
Bloom filter allows skipping sstable 804346 | 10:22:25,032 | 192.168.47.205 | 463
Bloom filter allows skipping sstable 804342 | 10:22:25,032 | 192.168.47.205 | 480
Bloom filter allows skipping sstable 804341 | 10:22:25,032 | 192.168.47.205 | 494
Bloom filter allows skipping sstable 804315 | 10:22:25,032 | 192.168.47.205 | 510
Key cache hit for sstable 804150 | 10:22:25,032 | 192.168.47.205 | 539
Seeking to partition beginning in data file | 10:22:25,032 | 192.168.47.205 | 549
Key cache hit for sstable 796552 | 10:22:25,033 | 192.168.47.205 | 899
Seeking to partition beginning in data file | 10:22:25,033 | 192.168.47.205 | 912
Bloom filter allows skipping sstable 794792 | 10:22:25,033 | 192.168.47.205 | 1218
Skipped 0/8 non-slice-intersecting sstables, included 0 due to tombstones | 10:22:25,033 | 192.168.47.205 | 1243
Merging data from memtables and 2 sstables | 10:22:25,033 | 192.168.47.205 | 1264
Read 3 live and 0 tombstone cells | 10:22:25,033 | 192.168.47.205 | 1356
Message received from /192.168.47.205 | 10:22:25,035 | 192.168.47.224 | 3920
Processing response from /192.168.47.205 | 10:22:25,035 | 192.168.47.224 | 4004
Request complete | 10:22:25,035 | 192.168.47.224 | 4205

Connecting cqlsh to a different node changes the coordinator, but if the row has been queried before (even through another node), the lookups still hit the key cache.
For example, connecting to node 224 makes 224 the coordinator; at the default ONE it contacts 205. This query runs after step 3 above, so just like step 3
it returns all three records, and everything is served from the key cache.

  1. So changing the coordinator does not change the result; the key cache lives on the replica nodes, not on the coordinator.
  2. Changing the coordinator does change which "closest" replica is chosen at ONE: connecting to 202 picked 221, connecting to 224 picked 205.
  3. The replica the coordinator picks is not fixed from query to query: it may choose 205 one time and 203 the next.

Open questions

The official documentation describes ONE as follows:

In a single data center cluster with a replication factor of 3, and a read consistency level of ONE,
the closest replica for the given row is contacted to fulfill the read request. In the background
a read repair is potentially initiated, based on the read_repair_chance setting of the table, for the other replicas.
Only the closest replica is contacted, which is consistent with the CQL tracing above showing the request going to a single node.
In the figure below, ONE should really have only one black (read) arrow and two blue (read repair) arrows, because the request is not sent to all three nodes with one of them picked for the reply.

[Figure: c_read_one]
[Figure: c_read_quorum]

And QUORUM:

In a single data center cluster with a replication factor of 3, and a read consistency level of QUORUM,
2 of the 3 replicas for the given row must respond to fulfill the read request. If the contacted replicas
have different versions of the row, the replica with the most recent version will return the requested data.
In the background, the third replica is checked for consistency with the first two, and if needed,
a read repair is initiated for the out-of-date replicas.
Two of the replicas must respond; does that mean the request is sent to only two nodes? The CQL trace, however, shows requests going to all three nodes at once.
And what are "different versions of the row": different columns or the same column? Here several rows are returned, which in Cassandra's storage model are different columns.

The coordinator can send three kinds of requests to the replicas:

  1. direct read request
  2. digest request
  3. background read repair request

The coordinator node contacts one replica node with a direct read request.
Then the coordinator sends a digest request to a number of replicas determined by the consistency level specified by the client.
The digest request checks the data in the replica node to make sure it is up to date.
Then the coordinator sends a digest request to all remaining replicas.
If any replica nodes have out of date data, a background read repair request is sent.
Read repair requests ensure that the requested row is made consistent on all replicas.

The coordinator contacts one node with a direct read request and then, according to the consistency level requested by the client, sends digest requests to other replicas. With 3 replicas and QUORUM = 2,
the first replica (R1) gets the direct read and the remaining QUORUM - 1 = 2 - 1 = 1 replica (R2) gets a digest request. The read request returns the data itself, while the digest request returns only a digest of it.
The digest request checks whether R2's data is up to date; compared with what? With R1's data, of course, because the same data produces the same digest (similar to a CRC check).
The comparison happens on the coordinator, which at that point already has R1's data, but because the level is QUORUM it cannot return that data to the client yet.
After sending the digest request and receiving R2's digest, the coordinator compares it with the digest computed from R1's data; if they differ, it sends R2 a read repair request, which runs in the background.

Since there are three replicas, the coordinator sends the direct read request to R1, a digest request to R2, and a digest request to the remaining replica as well.
Read repair ensures that the requested row ends up consistent on all replicas.

Question 1: if R1 is the closest replica (and gets the direct read) but does not hold the newest data, while R2 does, the coordinator will see that R2's digest differs from the one computed from R1: does it then send the read repair to R2?

Question 2: how does read repair actually propagate the newest data: is it read from the newest replica, or does the coordinator push the update?
The coordinator stores no data itself, but because it issued the direct read request it does hold one replica's data in memory.

Coordinator node timeout

When a read times out, even with tracing on all you learn is that it timed out; no more detailed information is available.

ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] 
message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
Statement trace did not complete within 10 seconds
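
On the Java side the same condition surfaces as a driver exception rather than a cqlsh message; a sketch of catching it, assuming a connected Session named session and reusing the timed-out query from earlier:

import com.datastax.driver.core.exceptions.ReadTimeoutException;

// Hypothetical sketch: the cqlsh ReadTimeout above corresponds to ReadTimeoutException in the Java driver.
try {
    session.execute("select * from velocity where attribute='58.246.57.66' and partner_code='testDemo' " +
                    "and app_name='testDemo' and type='ipAddress' limit 1000");
} catch (ReadTimeoutException e) {
    // The coordinator gave up waiting for replicas; the message carries the received/required counts.
    System.err.println("read timed out: " + e.getMessage());
}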
