Cassandra Source Code Analysis: Storage Engine

A source-level walkthrough of the Cassandra 2.2/3.0 storage engine.

DecoratedKey and Token

The DecoratedKey, which wraps (decorates) the raw byte-form key, is used in many places, for example in reads and writes:

StorageService.getPartitioner() returns the cluster's single Partitioner. A cluster allows exactly one Partitioner; configuring more than one would conflict. Clients must also use the same Partitioner as the server: if the server uses Murmur3 while a client uses Random, the client fails at startup.

// Read
public Row getRow(Keyspace keyspace) {
    DecoratedKey dk = StorageService.getPartitioner().decorateKey(key);
    return keyspace.getRow(new QueryFilter(dk, cfName, filter, timestamp));
}

(Figure: DecoratedKey class diagram)

Murmur3Partitioner decorates a key into a Token. A Token is stateless data, so each new key creates a new Token object.

public class Murmur3Partitioner implements IPartitioner {
    public static final Murmur3Partitioner instance = new Murmur3Partitioner();

    public DecoratedKey decorateKey(ByteBuffer key) {
        long[] hash = getHash(key);
        return new PreHashedDecoratedKey(getToken(key, hash), key, hash[0], hash[1]);
    }

    private LongToken getToken(ByteBuffer key, long[] hash) {
        return new LongToken(normalize(hash[0]));
    }

    public static class LongToken extends Token {
        final long token;
        public LongToken(long token) { this.token = token; }

        public IPartitioner getPartitioner() { return instance; }
        public Object getTokenValue() { return token; }
    }
}

public abstract class Token implements RingPosition<Token>, Serializable {
    public static final TokenSerializer serializer = new TokenSerializer();

    public static abstract class TokenFactory {
        public abstract ByteBuffer toByteArray(Token token);
        public abstract Token fromByteArray(ByteBuffer bytes);
        public abstract String toString(Token token); // serialize as string, not necessarily human-readable
        public abstract Token fromString(String string); // deserialize
    }

    abstract public IPartitioner getPartitioner();
    abstract public Object getTokenValue();
}

The abstract class DecoratedKey holds both the key bytes themselves and the Token. Its implementations are the on-heap BufferDecoratedKey and the off-heap NativeDecoratedKey.

This shows that, with native-object support, even the lowest-level key objects are now allocated in native (off-heap) memory.

public abstract class DecoratedKey implements RowPosition, FilterKey {
    private final Token token;
    public Token getToken() { return token; }
    public abstract ByteBuffer getKey();
}

The lowest-level interface is actually not DecoratedKey but RingPosition.

(Figure: RingPosition class hierarchy)

A Token is a numeric value obtained by hashing the key, so hash collisions are possible: several keys may map to the same hash value. Key and Token are therefore not one-to-one; a key yields exactly one Token, but a Token does not necessarily correspond to a unique key.
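
As a quick illustration: decorating the same key bytes always yields the same token, while distinct keys may in principle collide. A minimal sketch against the 2.2 classes shown above (the key strings are made up):

import java.nio.ByteBuffer;
import org.apache.cassandra.db.DecoratedKey;
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.utils.ByteBufferUtil;

public class TokenDemo {
    public static void main(String[] args) {
        Murmur3Partitioner p = Murmur3Partitioner.instance;
        ByteBuffer k1 = ByteBufferUtil.bytes("user:1");
        DecoratedKey dk1 = p.decorateKey(k1);
        // decoration is a pure function of the key bytes: same key, same token
        System.out.println(p.decorateKey(k1.duplicate()).getToken().equals(dk1.getToken())); // true
        // distinct keys almost always hash to distinct tokens, but collisions are possible,
        // which is why DecoratedKey compares by token first and by key bytes second
        System.out.println(p.decorateKey(ByteBufferUtil.bytes("user:2")).getToken());
    }
}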

Native

public class NativeDecoratedKey extends DecoratedKey {
    final long peer;

    public NativeDecoratedKey(Token token, NativeAllocator allocator, OpOrder.Group writeOp, ByteBuffer key) {
        super(token);
        int size = key.remaining();
        this.peer = allocator.allocate(4 + size, writeOp); // off-heap: 4-byte length prefix + key bytes
        MemoryUtil.setInt(peer, size);
        MemoryUtil.setBytes(peer + 4, key);
    }

    public ByteBuffer getKey() {
        return MemoryUtil.getByteBuffer(peer + 4, MemoryUtil.getInt(peer));
    }
}

public class BufferDecoratedKey extends DecoratedKey {
    private final ByteBuffer key;

    public BufferDecoratedKey(Token token, ByteBuffer key) {
        super(token);
        this.key = key;
    }

    public ByteBuffer getKey() { return key; }
}

http://normanmaurer.me/blog/2013/10/28/Lesser-known-concurrent-classes-Part-1/
(Figure: atomic-updater memory comparison) The left side uses Atomic* field objects; the right side uses volatile fields (about 500 MB in total) with an atomic field updater (136 MB).
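
The point of that post, sketched with plain JDK classes (class names here are illustrative): a per-instance AtomicReference costs one extra object per field, while a single static AtomicReferenceFieldUpdater operates on a plain volatile field with the same CAS semantics, so the per-instance overhead disappears:

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

class Heavy {
    // one extra AtomicReference object is allocated for every Heavy instance
    final AtomicReference<String> state = new AtomicReference<>("init");
}

class Light {
    // one shared static updater for all Light instances; the field itself is a plain volatile
    private static final AtomicReferenceFieldUpdater<Light, String> STATE =
            AtomicReferenceFieldUpdater.newUpdater(Light.class, String.class, "state");
    private volatile String state = "init";

    boolean advance(String expect, String update) {
        return STATE.compareAndSet(this, expect, update); // same CAS semantics, no wrapper object
    }
}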

TokenMetadata

DataModel

OnDiskAtom (an "on-disk atom") has two implementations: RangeTombstone and Cell. Cell itself branches into AbstractCell, CounterCell, ExpiringCell, and DeletedCell, which between them cover every deletion variant: TTL expiry is an ExpiringCell, an explicit column delete a DeletedCell, and a multi-column delete a RangeTombstone. The base class for ordinary cells is AbstractCell, with two main implementation families, BufferCell and AbstractNativeCell, representing on-heap and off-heap cells respectively.

So apart from the deletion-related cells, an ordinary Cell is either a BufferCell (on the heap) or a NativeCell (off-heap).

(Figure: Cell class hierarchy)

public class Row {
    public final DecoratedKey key;
    public final ColumnFamily cf;
}

public abstract class ColumnFamily implements Iterable<Cell>, IRowCacheEntry {
    protected final CFMetaData metadata;
}

public class ArrayBackedSortedColumns extends ColumnFamily {
    private DeletionInfo deletionInfo;
    private Cell[] cells;
    private int size;
    private int sortedSize;
}

public interface Cell extends OnDiskAtom {
    public CellName name();
    public ByteBuffer value();
}

public class BufferCell extends AbstractCell {
    protected final CellName name;    // cell name
    protected final ByteBuffer value; // cell value
    protected final long timestamp;   // every cell carries a timestamp, used for conflict resolution
}

public class Keyspace {
    private final ConcurrentMap<UUID, ColumnFamilyStore> columnFamilyStores = new ConcurrentHashMap<>();
    private volatile KSMetaData metadata;
}

Both the Keyspace and ColumnFamily definitions carry metadata: KSMetaData and CFMetaData hold the keyspace-level and table-level configuration respectively.

Database: Keyspace
Table: ColumnFamily
Primary key: DecoratedKey
Row: Row
Column: Cell

A Row consists of a key plus a ColumnFamily, i.e. the primary key and the column family (many columns). A ColumnFamily is a family of Columns, and a Column is also called a Cell. A ColumnFamily can be modeled as a two-level map: Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. Because it is a map, looking up a given key is fast; and because the columns are stored sorted, scanning a range of columns or seeking to a single column is also efficient.
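
A toy model of that two-level map, using plain JDK collections (purely illustrative, not Cassandra code):

import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TwoLevelMapDemo {
    public static void main(String[] args) {
        // outer map: row key -> columns; inner map keeps the columns sorted by name
        Map<String, NavigableMap<String, String>> table = new HashMap<>();
        NavigableMap<String, String> row = table.computeIfAbsent("row1", k -> new TreeMap<>());
        row.put("col:a", "v1");
        row.put("col:b", "v2");
        row.put("col:z", "v3");
        System.out.println(row.get("col:b"));                          // point lookup of one column
        System.out.println(row.subMap("col:a", true, "col:c", false)); // sorted range scan over columns
    }
}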

(Figure: ColumnFamily as a two-level map)

SSTableWriter

An SSTable is built on a SequenceFile, and its on-disk data is sorted. An SSTable comprises a data file and an index file; to speed up reads there are also a BloomFilter and an IndexSummary. The index file stores the data-file position of every key, while the IndexSummary stores only a subset of keys: one summary entry per fixed number of keys. The summary file is therefore small and can be mapped straight into memory with mmap.
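
The sampling idea behind the IndexSummary can be sketched in a few lines (illustrative only; the real IndexSummaryBuilder also tracks file offsets, and the interval comes from the table's min_index_interval):

import java.util.ArrayList;
import java.util.List;

class IndexSummarySketch {
    // keep every interval-th key from the full, sorted index
    static List<String> sample(List<String> sortedIndexKeys, int interval) {
        List<String> summary = new ArrayList<>();
        for (int i = 0; i < sortedIndexKeys.size(); i += interval)
            summary.add(sortedIndexKeys.get(i));
        return summary; // small enough to mmap and keep resident in memory
    }
}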

SSTable splits into SSTableWriter and SSTableReader; the concrete file-level implementations are BigTableWriter and BigTableReader.

SSTable
|-- SSTableWriter
|   |-- BigTableWriter
|-- SSTableReader
|   |-- BigTableReader

BigTableWriter.append() is the method that actually writes a row. The call chain: when a Memtable is flushed, its sorted in-memory data is appended through BigTableWriter.

BigTableWriter.append(DecoratedKey, ColumnFamily)
|-- Memtable.writeSortedContents(File)
    |-- Memtable.flush()

The first parameter of append, a DecoratedKey, is the row key; the second, a ColumnFamily, holds all the columns belonging to that key. A ColumnFamily is a collection of Columns, each with a ColumnName and a ColumnValue. With row key, column name, and column value in hand, the data is complete.

BigTableWriter

BigTableWriter writes the index file and the data file through IndexWriter and SequentialWriter respectively. The latter handles only the Data file; the former covers the Index file as well as the BloomFilter and Summary files.

public class BigTableWriter extends SSTableWriter {
    private final IndexWriter iwriter;       // the busy one: index file, plus BF and IndexSummary
    private final SequentialWriter dataFile; // the data file, written on its own

    public void append(DecoratedKey decoratedKey, ColumnFamily cf) {
        long startPosition = beforeAppend(decoratedKey); // start offset of this row in the data file
        RowIndexEntry entry = rawAppend(cf, startPosition, decoratedKey, dataFile.stream); // returns the index entry
        long endPosition = dataFile.getFilePointer();
        afterAppend(decoratedKey, endPosition, entry);
    }

    private long beforeAppend(DecoratedKey decoratedKey) {
        return (lastWrittenKey == null) ? 0 : dataFile.getFilePointer();
    }

    private void afterAppend(DecoratedKey decoratedKey, long dataEnd, RowIndexEntry index) {
        lastWrittenKey = decoratedKey;
        if (first == null) first = lastWrittenKey;
        iwriter.append(decoratedKey, index, dataEnd); // index file, BF, and IndexSummary are all updated here
        dbuilder.addPotentialBoundary(dataEnd);
    }
}

IndexWriter: the index file

First, look at how IndexWriter writes the index file along with the BF and IndexSummary.

class IndexWriter extends AbstractTransactional implements Transactional {
    private final SequentialWriter indexFile; // the Index file
    public final SegmentedFile.Builder builder;
    public final IndexSummaryBuilder summary; // the IndexSummary file
    public final IFilter bf;                  // the Bloom filter file

    public void append(DecoratedKey key, RowIndexEntry indexEntry, long dataEnd) throws IOException {
        bf.add(key); // add to the Bloom filter, which behaves like an add-only set of keys
        long indexStart = indexFile.getFilePointer();
        ByteBufferUtil.writeWithShortLength(key.getKey(), indexFile.stream);
        rowIndexEntrySerializer.serialize(indexEntry, indexFile.stream); // serialize the entry into the index file
        long indexEnd = indexFile.getFilePointer();
        summary.maybeAddEntry(key, indexStart, indexEnd, dataEnd); // maybe sample this entry into the summary
        builder.addPotentialBoundary(indexStart);
    }
}

Write Data (Column Index)

The key plus the ColumnFamily fully describe the data to write. ColumnIndex literally means an index over columns. Why index the columns? Because Cassandra is a wide-row store: a single row may contain an enormous number of columns (up to 2 billion), so an index within the row clearly pays off. The out parameter here is the dataFile output stream, so everything written below goes into the Data file.
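
Back-of-the-envelope arithmetic for the block count (column_index_size_in_kb defaults to 64 KB; the row size below is made up):

class ColumnIndexMath {
    public static void main(String[] args) {
        long rowSizeBytes = 10L * 1024 * 1024; // a hypothetical 10 MB wide row
        long blockSize = 64L * 1024;           // column_index_size_in_kb = 64 (the default)
        long indexInfoEntries = (rowSizeBytes + blockSize - 1) / blockSize;
        // ~160 IndexInfo entries: a column read can seek straight to one 64 KB block
        System.out.println(indexInfoEntries);
    }
}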

private static RowIndexEntry rawAppend(ColumnFamily cf, long startPosition, DecoratedKey key, DataOutputPlus out) throws IOException {
    ColumnIndex.Builder builder = new ColumnIndex.Builder(cf, key.getKey(), out);
    ColumnIndex index = builder.build(cf); // the row content is written to the data file here
    out.writeShort(END_OF_ROW);            // end-of-row marker
    return RowIndexEntry.create(startPosition, cf.deletionInfo().getTopLevelDeletion(), index); // the row index entry
}

An IndexInfo consists of: Composite firstName and lastName (the first and last column names of an index block), plus offset and width (the block's offset and length). A ColumnIndex contains many IndexInfo entries, so a ColumnIndex represents the combined index over all of a row's columns.
(Figure: column index layout)

public class ColumnIndex {
    public final List<IndexHelper.IndexInfo> columnsIndex;

    // Helps to create an index for a column family based on size of columns, and write said columns to disk.
    public static class Builder {
        private final ColumnIndex result;
        private final DataOutputPlus output;
        private final ByteBuffer key;

        public ColumnIndex build(ColumnFamily cf) {
            for (Cell c : cf) add(c); // ignoring tombstones etc. for now; very simple!
            ColumnIndex index = build();
            return index;
        }
    }
}

A ColumnIndex is not a single Column, nor merely the index of one Column; it really stands for all the Columns of one row. A row can have many columns, and every blockSize worth of them produces one IndexInfo entry.
(Figure: IndexInfo entries within a row)

public void add(OnDiskAtom column) throws IOException {
    if (firstColumn == null) { // "runs once per row"? No: a row has many blocks, and this runs once per block
        firstColumn = column;
        startPosition = endPosition;
        endPosition += tombstoneTracker.writeOpenedMarkers(firstColumn.name(), output, atomSerializer);
        blockSize = 0;
        maybeWriteRowHeader(); // first column: the row header may need to be written
    }
    if (tombstoneTracker.update(column, false)) {
        long size = tombstoneTracker.writeUnwrittenTombstones(output, atomSerializer);
        size += atomSerializer.serializedSizeForSSTable(column);
        endPosition += size;
        blockSize += size; // grow the block; checked below to decide whether to emit a column index entry
        atomSerializer.serializeForSSTable(column, output); // serialize the column!
    }
    lastColumn = column; // remember the most recent column

    // if we hit the column index size that we have to index after, go ahead and index it.
    if (blockSize >= DatabaseDescriptor.getColumnIndexSize()) { // one column index entry per blockSize bytes
        IndexHelper.IndexInfo cIndexInfo = new IndexHelper.IndexInfo(
                firstColumn.name(), column.name(), indexOffset + startPosition, endPosition - startPosition);
        result.columnsIndex.add(cIndexInfo);
        firstColumn = null; // reset firstColumn so the next block re-enters the if at the top
        lastBlockClosing = column;
    }
}

In the final build(), the first condition covers the case where add() never produced any IndexInfo; the second covers the case where add() did produce IndexInfo entries but the remaining columns do not fill a complete block, so one more IndexInfo must still be created for them.

public ColumnIndex build() {
    if (result.columnsIndex.isEmpty() || lastBlockClosing != lastColumn) {
        IndexHelper.IndexInfo cIndexInfo = new IndexHelper.IndexInfo(
                firstColumn.name(), lastColumn.name(), indexOffset + startPosition, endPosition - startPosition);
        result.columnsIndex.add(cIndexInfo);
    }
    return result;
}

The ColumnIndex is then used to create a RowIndexEntry. With only one index block (IndexInfo), a plain RowIndexEntry is created (the IndexInfo itself is not passed in); otherwise an IndexedEntry is created, carrying all the IndexInfo entries of the ColumnIndex.
(Figure: RowIndexEntry vs IndexedEntry)

position is simply the row's start offset; with the start offset known, the row-key index entry can be built. The IndexWriter side has already been covered above.

public static RowIndexEntry<IndexHelper.IndexInfo> create(long position, DeletionTime deletionTime, ColumnIndex index) {
    if (index.columnsIndex.size() > 1)
        return new IndexedEntry(position, deletionTime, index.columnsIndex);
    else
        return new RowIndexEntry<>(position); // a single index block needs only a plain RowIndexEntry
}

Finally, back in the IndexWriter flow, here is how a RowIndex entry is serialized into the index file:

// RowIndexEntry.Serializer
public void serialize(RowIndexEntry<IndexHelper.IndexInfo> rie, DataOutputPlus out) throws IOException {
    out.writeLong(rie.position);
    out.writeInt(rie.promotedSize(idxSerializer));
    if (rie.isIndexed()) { // only when there are multiple IndexInfo entries
        DeletionTime.serializer.serialize(rie.deletionTime(), out);
        out.writeInt(rie.columnsIndex().size());
        for (IndexHelper.IndexInfo info : rie.columnsIndex()) // serialize every IndexInfo
            idxSerializer.serialize(info, out);
    }
}

Tombstone

http://stackoverflow.com/questions/27776337/what-types-of-tombstones-does-cassandra-support
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

A tombstone can mark a single column (d), a range of columns (e), or a whole row. DELETE alone comes in several forms: delete a table's data, delete a given row, delete given columns of a row. Some tombstone types:

Tombstone type    CQL                                                                        Example
column tombstone  delete id from ts1 WHERE col1 = '3131';                                    {"key": "3131","columns": [["id","54822130",1417814320400000,"d"]]}
row tombstone     delete from ts1 WHERE col1 = '31';                                         {"key": "31","metadata": {"deletionInfo": {"markedForDeleteAt":1417814302304000,"localDeletionTime":1417814302}},"columns": []}
list tombstone    insert into flights (id, destinations) values ('BA1234', ['ORD', 'LHR']);  ["1381316637599609:45787829:tags:_","1381316637599609:45787829:tags:!",1438264650252000,"t",1438264650]

Mutation.delete

Mutation has three delete-related methods (none takes the row key as a parameter, but a Mutation is always bound to a specific key). cfName is the ColumnFamily name, i.e. the table name. In the examples below, cfName=tableX and CellName=col1.

delete from tableX where rowkey=key;      #delete(String cfName, long timestamp)
delete col1 from tableX where rowkey=key; #delete(String cfName, CellName name, long timestamp)

The three delete methods take different paths: delete, addTombstone, and addAtom. Note that deleting a single column does not go through DeletionInfo.

public class Mutation implements IMutation {
    // 1. delete a whole row
    public void delete(String cfName, long timestamp) {
        int localDeleteTime = (int) (System.currentTimeMillis() / 1000);
        addOrGet(cfName).delete(new DeletionInfo(timestamp, localDeleteTime));
    }
    // 2. delete a single column of the row
    public void delete(String cfName, CellName name, long timestamp) {
        int localDeleteTime = (int) (System.currentTimeMillis() / 1000);
        addOrGet(cfName).addTombstone(name, localDeleteTime, timestamp);
    }
    // 3. delete a range of the row's columns
    public void deleteRange(String cfName, Composite start, Composite end, long timestamp) {
        int localDeleteTime = (int) (System.currentTimeMillis() / 1000);
        addOrGet(cfName).addAtom(new RangeTombstone(start, end, timestamp, localDeleteTime));
    }
}

public abstract class ColumnFamily implements Iterable<Cell>, IRowCacheEntry {
    // 2. delete a single column of the row
    public void addTombstone(CellName name, int localDeletionTime, long timestamp) {
        addColumn(new BufferDeletedCell(name, localDeletionTime, timestamp));
    }
    // 3. delete a range of the row's columns
    public void addAtom(OnDiskAtom atom) {
        if (atom instanceof Cell) {
            addColumn((Cell)atom);
        } else {
            assert atom instanceof RangeTombstone;
            delete((RangeTombstone)atom); // RangeTombstone implements OnDiskAtom, hence the cast
        }
    }

    public abstract void delete(DeletionInfo info);
}

DeletionInfo and DeletionTime

Column-level deletes use the BufferDeletedCell structure; only row-level deletes and RangeTombstones use DeletionInfo.

"Top level" literally means the highest level: what sits above all the columns is the Row, so the Row is the top level.

// A combination of a top-level (or row) tombstone and range tombstones describing the deletions within a ColumnFamily (or row).
public class DeletionInfo implements IMeasurableMemory {
    private DeletionTime topLevel;     // the row-level deletion
    private RangeTombstoneList ranges; // a list of range tombstones within the row

    // whether this DeletionInfo deletes the given cell
    public boolean isDeleted(Cell cell) {
        if (isLive()) return false;
        if (cell.timestamp() <= topLevel.markedForDeleteAt) return true; // cell timestamp <= markedForDeleteAt: deleted
        return ranges != null && ranges.isDeleted(cell); // otherwise check the range tombstones, if any
    }
}

DeletionInfo serializes two timestamps to disk:

  1. localDeletionTime: when the tombstone was created; used only to purge the tombstone once gc_grace_seconds have elapsed.
  2. markedForDeleteAt: the timestamp up to which data counts as deleted; this is the comparison boundary. If it is MIN_VALUE, the row is not marked for deletion at all.

Deleting a row with DELETE creates a tombstone immediately, so localDeletionTime is the current system time. With TTL, localDeletionTime is the system time at which the TTL expires. Once a tombstone exists, it must not stay on disk forever: it becomes purgeable gc_grace_seconds after its creation, and the grace period counts from tombstone creation. So for a TTL'd record the tombstone is purgeable TTL + gc_grace_seconds after the insert, not gc_grace_seconds after the insert, because the tombstone only comes into existence once the TTL expires.
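
Concretely (a hypothetical calculation, not Cassandra code):

class TombstonePurgeMath {
    public static void main(String[] args) {
        long insertAt = 1_700_000_000L; // hypothetical insert time, epoch seconds
        long ttl = 3_600L;              // 1 hour TTL
        long gcGrace = 864_000L;        // gc_grace_seconds, default 10 days
        long tombstoneCreatedAt = insertAt + ttl;        // localDeletionTime of the expired cell
        long purgeableAt = tombstoneCreatedAt + gcGrace; // compaction may drop the tombstone only after this
        System.out.println("created=" + tombstoneCreatedAt + " purgeable=" + purgeableAt);
    }
}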

// DeletionTime
public class DeletionTime implements Comparable<DeletionTime>, IMeasurableMemory {
    // A special DeletionTime that signifies that there is no top-level (row) tombstone.
    public static final DeletionTime LIVE = new DeletionTime(Long.MIN_VALUE, Integer.MAX_VALUE);

    // A timestamp after which data should be considered deleted. If set to Long.MIN_VALUE,
    // this implies that the data has not been marked for deletion at all.
    public final long markedForDeleteAt;

    // The local server timestamp at which this tombstone was created. This is only used
    // for purposes of purging the tombstone after gc_grace_seconds have elapsed.
    public final int localDeletionTime;

    public DeletionTime(long markedForDeleteAt, int localDeletionTime) {
        this.markedForDeleteAt = markedForDeleteAt;
        this.localDeletionTime = localDeletionTime;
    }

    public boolean isLive() { // returns whether this DeletionTime is live, that is, deletes no columns
        return markedForDeleteAt == Long.MIN_VALUE && localDeletionTime == Integer.MAX_VALUE;
    }

    public boolean isDeleted(OnDiskAtom atom) {
        return atom.timestamp() <= markedForDeleteAt; // deletion of an atom is decided against markedForDeleteAt
    }
}

Serializing the RowKey and the top-level tombstone

Now revisit the tombstone handling in ColumnIndex when raw data is added (skipped earlier).

public ColumnIndex build(ColumnFamily cf) throws IOException {
    // cf has disentangled the columns and range tombstones, we need to re-interleave them in comparator order
    Comparator<Composite> comparator = cf.getComparator();
    DeletionInfo.InOrderTester tester = cf.deletionInfo().inOrderTester();
    Iterator<RangeTombstone> rangeIter = cf.deletionInfo().rangeIterator();
    RangeTombstone tombstone = rangeIter.hasNext() ? rangeIter.next() : null;
    for (Cell c : cf) {
        while (tombstone != null && comparator.compare(c.name(), tombstone.min) >= 0) {
            // skip range tombstones that are shadowed by partition tombstones
            if (!cf.deletionInfo().getTopLevelDeletion().isDeleted(tombstone)) add(tombstone);
            tombstone = rangeIter.hasNext() ? rangeIter.next() : null;
        }
        if (!tester.isDeleted(c)) add(c);
    }
    while (tombstone != null) { // drain the remaining tombstones from the DeletionInfo's RangeTombstoneList
        add(tombstone);
        tombstone = rangeIter.hasNext() ? rangeIter.next() : null;
    }
    finishAddingAtoms();         // writeUnwrittenTombstones: serialize any pending RangeTombstones
    ColumnIndex index = build(); // build the last IndexInfo; the others were added inside add()
    maybeWriteEmptyRowHeader();  // serialize the row key and the top-level (row) tombstone
    return index;
}

maybeWriteEmptyRowHeader is reached in two scenarios.

The DeletionTime field of DeletionInfo represents the top-level (row) tombstone. Being row-level, it is serialized together with, and at the same level as, the row key.

// inner Builder class of ColumnIndex
public void maybeWriteEmptyRowHeader() throws IOException {
    if (!deletionInfo.isLive()) maybeWriteRowHeader(); // there is a row tombstone
}

private void maybeWriteRowHeader() throws IOException {
    if (lastColumn == null) { // lastColumn == null means no column has been written yet
        ByteBufferUtil.writeWithShortLength(key, output); // write row key length and bytes
        DeletionTime.serializer.serialize(deletionInfo.getTopLevelDeletion(), output);
    }
}

// DeletionTime's serializer
public void serialize(DeletionTime delTime, DataOutputPlus out) throws IOException {
    out.writeInt(delTime.localDeletionTime);
    out.writeLong(delTime.markedForDeleteAt);
}

maybeWriteRowHeader is not only called once at the end: add(Atom) also calls it once per column block. That is, every column block both creates an IndexInfo and calls maybeWriteRowHeader; add() pairs the two operations, while the final, partial block gets its IndexInfo in build(), with maybeWriteRowHeader invoked explicitly afterwards.

Wait! Note that maybeWriteRowHeader only writes the row key when lastColumn == null, and as soon as any column exists, lastColumn is non-null! Yet even a row without any columns still needs its row key and top-level tombstone written. So maybeWriteRowHeader writes the header in two scenarios:
(Figure: when the row header gets written)

  1. A row with no columns: the header is written once, via maybeWriteEmptyRowHeader.
  2. A row with columns: the header is written exactly once regardless of the number of column blocks, because lastColumn == null only during the first block; for later blocks lastColumn != null.
public void add(OnDiskAtom column) { // never called for an empty row; then lastColumn == null and the header is still written via maybeWriteEmptyRowHeader
    if (firstColumn == null) { // reaching add() at all means the row has columns
        maybeWriteRowHeader(); // writes the row key and top-level tombstone only while lastColumn == null
    }
    lastColumn = column; // non-null as soon as a single column has been added
    if (blockSize > ...) {
        firstColumn = null; // new block: reset firstColumn, but note lastColumn is not reset
    }
}

Serializing columns

The row key and tombstones are serialized, and so of course are the columns: as each column is added, its serialized size is accounted for (to track block boundaries) and then the column itself is written.

public void add(OnDiskAtom column) throws IOException {
    if (tombstoneTracker.update(column, false)) {
        long size = tombstoneTracker.writeUnwrittenTombstones(output, atomSerializer); // flush pending tombstones first
        size += atomSerializer.serializedSizeForSSTable(column); // account for the column's serialized size
        endPosition += size;
        blockSize += size;
        atomSerializer.serializeForSSTable(column, output); // serialize the column's content
    }
}

Now look at how atomSerializer is constructed: it depends on the ColumnFamily's Comparator, which comes from CFMetaData.

// inner Builder class of ColumnIndex
public static class Builder {
    private final OnDiskAtom.Serializer atomSerializer;

    public Builder(ColumnFamily cf, ByteBuffer key, DataOutputPlus output) {
        this.key = key;
        deletionInfo = cf.deletionInfo();
        this.indexOffset = rowHeaderSize(key, deletionInfo);
        this.result = new ColumnIndex(new ArrayList<IndexHelper.IndexInfo>());
        this.output = output;
        this.tombstoneTracker = new RangeTombstone.Tracker(cf.getComparator());
        this.atomSerializer = cf.getComparator().onDiskAtomSerializer();
    }
}

// CFMetaData
public final class CFMetaData {
    // mandatory
    public volatile CellNameType comparator; // bytes, long, timeuuid, utf8, etc.
    // optional
    private volatile AbstractType<?> defaultValidator = BytesType.instance;
    private volatile AbstractType<?> keyValidator = BytesType.instance;
}

ColumnFamily's comparator type (CellNameType)

CFMetaData literally means ColumnFamily metadata, i.e. table-level settings. The example below shows what the table-level Comparator actually is. The CellNameType class hierarchy splits first into Compound vs. Simple, then into Sparse vs. Dense, giving four concrete combinations.

CellNameType
  AbstractCellNameType
    AbstractCompoundCellNameType
      CompoundDenseCellNameType
      CompoundSparseCellNameType
    AbstractSimpleCellNameType
      SimpleSparseCellNameType
      SimpleDenseCellNameType

CellName is a composite designed mainly for CQL3: "a CellName has first a number of clustering components, followed by the CQL3 column name, and then possibly followed by a collection element part." That is, first the clustering components, then the (regular) CQL3 column name, and possibly a collection element at the end.

The sparse ones are CellNames where one of the components (the last, or second-to-last for collections) is used to store the CQL3 column name.
In other words, we have 4 types of CellName/CellNameType, which correspond to the 4 types of table layout that we need to distinguish:

  1. Simple (non-truly-composite) dense: dynamic thrift CFs whose comparator is not composite.
  2. Simple (non-truly-composite) sparse: thrift static CFs (that don't have a composite comparator).
  3. Composite dense: dynamic thrift CFs with a CompositeType comparator.
  4. Composite sparse: the CQL3 layout (note that this is the only one that supports collections).
public interface CellNameType extends CType {
    public boolean isDense();          // whether or not the cell names for this type are dense
    public int clusteringPrefixSize(); // the number of clustering columns for the table this is the type of
    public CBuilder prefixBuilder();   // a builder for the clustering prefix
}

The clustering prefix means the clustering key values form the prefix of the physical column name. A table may also have no clustering key, leaving the plain partition key as the sole primary key. Note that with CQL3 COMPACT STORAGE the CQL3 column name (the regular column's name) is not stored inside the physical column name; that layout is called dense. The CellNameType hierarchy is therefore split along two axes:

  1. COMPACT STORAGE (equivalently, the thrift layout)? yes: Dense; no: Sparse.
  2. Is the comparator truly composite, i.e. does it have multiple components? yes: Compound; no: Simple.

COMPACT STORAGE   Composite comparator   Type
Y                 N                      SimpleDenseCellNameType
N                 N                      SimpleSparseCellNameType
Y                 Y                      CompoundDenseCellNameType
N                 Y                      CompoundSparseCellNameType

COMPACT STORAGE is equivalent to the thrift layout. Dense means the storage is packed (no regular column names stored); Sparse means the cell names carry the regular column names.

A CellNameType example

In the example below the partition key has four fields, the clustering key has one field (sequence_id) sorted DESC (hence ReversedType), and there are two regular fields (event, timestamp).

CREATE TABLE forseti.velocity_app (
attribute text,
partner_code text,
app_name text,
type text,
sequence_id text,
event text,
timestamp bigint,
PRIMARY KEY ((attribute, partner_code, app_name, type), sequence_id)
) WITH CLUSTERING ORDER BY (sequence_id DESC)
AND bloom_filter_fp_chance = 0.1
AND caching = '{"keys":"ALL", "rows_per_partition":"ALL"}'
AND comment = ''
AND compaction = {'unchecked_tombstone_compaction': 'true', 'tombstone_threshold': '0.1', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.1
AND speculative_retry = '99.0PERCENTILE';

insert into velocity_app(attribute, partner_code, app_name, type,timestamp,event,sequence_id)values('zqhxuyuan','tongdun','tongdun_app','login',1111111111,'{jsondata}','1111111111-1');
select * from velocity_app where attribute='zqhxuyuan' and type='login' and partner_code='tongdun' and app_name='tongdun_app';

# schema_columnfamilies corresponds to CFMetaData
cqlsh:system> select * from system.schema_columnfamilies where keyspace_name='forseti' and columnfamily_name='velocity_app';
keyspace_name | columnfamily_name | bloom_filter_fp_chance | caching | cf_id | comment | compaction_strategy_class | compaction_strategy_options | comparator | compression_parameters | default_time_to_live | default_validator | dropped_columns | gc_grace_seconds | is_dense | key_validator | local_read_repair_chance | max_compaction_threshold | max_index_interval | memtable_flush_period_in_ms | min_compaction_threshold | min_index_interval | read_repair_chance | speculative_retry | subcomparator | type
---------------+-------------------+------------------------+---------------------------------------------+--------------------------------------+---------+-----------------------------------------------------------------+-----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------------------+-------------------------------------------+-----------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------+--------------------------+--------------------+-----------------------------+--------------------------+--------------------+--------------------+-------------------+---------------+----------
forseti | velocity_app | 0.01 | {"keys":"ALL", "rows_per_partition":"NONE"} | 763248f0-8f88-11e6-a6b6-71d72bc0ba41 | | org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy | {} | org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.UTF8Type),org.apache.cassandra.db.marshal.UTF8Type) | {"sstable_compression":"org.apache.cassandra.io.compress.LZ4Compressor"} | 0 | org.apache.cassandra.db.marshal.BytesType | null | 864000 | False | org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type) | 0.1 | 32 | 2048 | 0 | 4 | 128 | 0 | 99.0PERCENTILE | null | Standard

cqlsh:system> select * FROM schema_columns where keyspace_name='forseti' and columnfamily_name='velocity_app';
keyspace_name | columnfamily_name | column_name | component_index | index_name | index_options | index_type | type | validator
---------------+-------------------+--------------+-----------------+------------+---------------+------------+----------------+----------------------------------------------------------------------------------------
forseti | velocity_app | app_name | 2 | null | null | null | partition_key | org.apache.cassandra.db.marshal.UTF8Type
forseti | velocity_app | attribute | 0 | null | null | null | partition_key | org.apache.cassandra.db.marshal.UTF8Type
forseti | velocity_app | event | 1 | null | null | null | regular | org.apache.cassandra.db.marshal.UTF8Type
forseti | velocity_app | partner_code | 1 | null | null | null | partition_key | org.apache.cassandra.db.marshal.UTF8Type
forseti | velocity_app | sequence_id | 0 | null | null | null | clustering_key | org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.UTF8Type)
forseti | velocity_app | timestamp | 1 | null | null | null | regular | org.apache.cassandra.db.marshal.LongType
forseti | velocity_app | type | 3 | null | null | null | partition_key | org.apache.cassandra.db.marshal.UTF8Type

The table below summarizes the comparator-related columns of schema_columnfamilies; the first row, comparator, is the table-level Comparator. It is a composite type whose first component is the reversed sequence_id. But which field does the second component stand for? It cannot be the partition key (that has four fields), which leaves the two regular fields, whose CQL types differ (bigint and text)! In fact a single UTF8 component stands in for both: it holds the regular column's name. So the comparator is really the column-level (cell name) comparator, because the physical column name is <clustering key value>:<regular column name>. That is, ReversedType(UTF8Type) = the clustering key value (descending), UTF8Type = the regular column's name.

Comparator          Type                                                 Notes
comparator          CompositeType(ReversedType(UTF8Type),UTF8Type)       the CellNameType
default_validator   BytesType
key_validator       CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type)   partition key only; excludes the clustering key

Note that the CQL comparator differs slightly from the Thrift comparator, unless the CQL table uses COMPACT STORAGE (COMPACT does not store regular column names), in which case they are the same.

CellName and CellNameType

CellNameType extends CType (left figure); CellName extends Composite (right figure). A CellName is the column's name, i.e. its actual data content; the CellNameType is the column's type, describing the shape of that data. In the example above, one sequence_id yields two (regular) columns, 123456789-001:event and 123456789-001:timestamp, whose values are "{json event}" and 123456789:

ColumnName    123456789-001:event   123456789-001:timestamp
ColumnValue   "{json event}"        123456789

So there are two CellNames, "123456789-001:event" and "123456789-001:timestamp", but only one CellNameType: CompositeType(ReversedType(UTF8Type),UTF8Type).
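
To make the flattening concrete, a toy sketch (names and values from the example above; plain string concatenation stands in for Cassandra's real Composite encoding):

import java.util.LinkedHashMap;
import java.util.Map;

class PhysicalCellsDemo {
    public static void main(String[] args) {
        String clusteringValue = "123456789-001"; // the sequence_id value
        Map<String, Object> regularColumns = new LinkedHashMap<>();
        regularColumns.put("event", "{json event}");
        regularColumns.put("timestamp", 123456789L);
        // each regular column becomes one physical cell:
        //   cell name = <clustering value> : <regular column name>
        for (Map.Entry<String, Object> e : regularColumns.entrySet())
            System.out.println(clusteringValue + ":" + e.getKey() + " -> " + e.getValue());
    }
}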

(Figures: CType and Composite class hierarchies)

The column name stores the column's data content; the column type describes its characteristics, such as whether it is dense and what its prefix is, so type-specific operations can be dispatched on it. Think of the column type as the column's metadata.

Column names

The lowest-level interface of CellName is Composite: "A composite value can be thought of as a list of ByteBuffer." So a column name is just ByteBuffer content. Whether SimpleSparseCellName or CompoundSparseCellName, the column name is unique and is represented by a ColumnIdentifier: "Represents an identifier for a CQL column definition."

public interface CellName extends Composite {
    public ColumnIdentifier cql3ColumnName(CFMetaData metadata);
}

public class ColumnIdentifier extends org.apache.cassandra.cql3.selection.Selectable {
    public final ByteBuffer bytes;
    private final String text;
}

public class SimpleSparseCellName extends AbstractComposite implements CellName {
    private final ColumnIdentifier columnName;
}

public class CompoundSparseCellName extends CompoundComposite implements CellName {
    protected final ColumnIdentifier columnName;
}

Column types

By the number of components, one gives Simple and several give Compound; CompoundSparseCellNameType is thus the composite case for a regular CQL3 table (not COMPACT STORAGE).

// whether a CType is truly composite depends on how many components it has
public class SimpleCType extends AbstractCType { // A not truly-composite CType.
    protected final AbstractType<?> type; // a single component, so a single field holds its type
}

public class CompoundCType extends AbstractCType { // A truly-composite CType.
    final List<AbstractType<?>> types; // multiple components, hence a list
}

public abstract class AbstractCompoundCellNameType extends AbstractCellNameType {
    protected final CompoundCType clusteringType;
    protected final CompoundCType fullType;
}

public class CompoundSparseCellNameType extends AbstractCompoundCellNameType {
    // For CQL3 columns, this is always UTF8Type. However, for compatibility with super columns, we need to allow it to be non-UTF8.
    private final AbstractType<?> columnNameType;

    public CompoundSparseCellNameType(List<AbstractType<?>> types) {
        this(types, UTF8Type.instance); // the second argument is columnNameType
    }
}

Although CFMetaData.comparator is declared as a CellNameType, the comparator shown in the CQL example above is a CompositeType. Both CompoundCType and CompositeType model composite types, and each holds a List of component types.

Question: why do both CompoundCType and CompositeType exist? A Composite can be seen as a CellName, so CompositeType would be the CellNameType, yet CompoundCType is also a CellNameType. What exactly is the difference?

public class CompoundCType extends AbstractCType { // A truly-composite CType.
    final List<AbstractType<?>> types; // multiple components, hence a list
}

public class CompositeType extends AbstractCompositeType {
    public final List<AbstractType<?>> types;
}

public abstract class AbstractCompositeType extends AbstractType<ByteBuffer> { .. }

Serializing a Cell

Back to ColumnIndex serializing a column: CellName is only the column's name, while a Cell is the complete column, name plus value. At the bottom of the Cell hierarchy sits OnDiskAtom.

public interface OnDiskAtom {
    public Composite name(); // the atom's name; e.g. CellName extends Composite
    public long timestamp(); // every atom carries a timestamp

    public static class Serializer implements ISSTableSerializer<OnDiskAtom> {
        private final CellNameType type; // the cell-name type

        public void serializeForSSTable(OnDiskAtom atom, DataOutputPlus out) throws IOException {
            if (atom instanceof Cell) {
                type.columnSerializer().serialize((Cell)atom, out); // cast the atom to Cell
            } else {
                assert atom instanceof RangeTombstone;
                type.rangeTombstoneSerializer().serializeForSSTable((RangeTombstone)atom, out);
            }
        }
    }
}

public interface Cell extends OnDiskAtom {
    public CellName name();    // column name; narrows OnDiskAtom.name()'s return type from Composite to CellName
    public ByteBuffer value(); // column value
}

A Cell comprises a name, a value, and a timestamp, so serialization writes all three. ColumnSerializer is the Cell serializer:

  1. Serialize the CellName (cell.name())
  2. For a CounterCell or ExpiringCell, additionally write the related timestamps
  3. Serialize the cell's timestamp (cell.timestamp())
  4. Serialize the cell's value (cell.value())
public class ColumnSerializer implements ISerializer<Cell> {
    private final CellNameType type;

    public void serialize(Cell cell, DataOutputPlus out) {
        type.cellSerializer().serialize(cell.name(), out); // serialize the cell's name, i.e. the CellName
        out.writeByte(cell.serializationFlags()); // serialization flags byte
        if (cell instanceof CounterCell) {
            out.writeLong(((CounterCell) cell).timestampOfLastDelete());
        } else if (cell instanceof ExpiringCell) {
            out.writeInt(((ExpiringCell) cell).getTimeToLive());
            out.writeInt(cell.getLocalDeletionTime());
        }
        out.writeLong(cell.timestamp());
        ByteBufferUtil.writeWithLength(cell.value(), out);
    }
}

The figure below summarizes the physical layout of the data file and the index file; the rightmost, finest-grained Column has four fields, matching the ColumnSerializer steps above.

(Figure: physical layout of the data and index files)

CQLSSTableWriter

Besides the normal write path, where a Memtable flush produces SSTables through SSTableWriter, SSTables can also be generated offline with CQLSSTableWriter. The latter needs no running Cassandra node and is typically used for offline data generation; to load the result into a cluster, the SSTables still have to be streamed in (via the bulk-loading path, e.g. sstableloader).

The Builder pattern shows up in many places, e.g. ColumnIndex above, where a Builder produces the per-column index block information as columns are added.

String schema = "CREATE TABLE myKs.myTable (k int PRIMARY KEY, v1 text, v2 int)";
String insert = "INSERT INTO myKs.myTable (k, v1, v2) VALUES (?, ?, ?)";
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory("path/to/directory")
        .forTable(schema)
        .using(insert).build();
writer.addRow(0, "test1", 24);
writer.addRow(1, "test2", null);
writer.addRow(2, "test3", 42);
writer.close();

CQLSSTableWriter is only a CQL-flavored facade (a utility class); in the end it still goes through SSTableWriter into the same write path.

public class CQLSSTableWriter implements Closeable {
    private final AbstractSSTableSimpleWriter writer;
    private final UpdateStatement insert; // only INSERTs are accepted, so the statement type is fixed to UpdateStatement
    private final List<ColumnSpecification> boundNames;

    public static class Builder {
        private boolean sorted = false;

        public Builder sorted() {
            this.sorted = true;
            return this;
        }

        public CQLSSTableWriter build() {
            AbstractSSTableSimpleWriter writer = sorted
                    ? new SSTableSimpleWriter(directory, schema, partitioner)
                    : new BufferedWriter(directory, schema, partitioner, bufferSizeInMB);
            return new CQLSSTableWriter(writer, insert, boundNames);
        }
    }
}

The INSERT statement is parsed into a List<ColumnSpecification>; the ? placeholders of the INSERT are bound to the arguments of addRow.

public CQLSSTableWriter rawAddRow(List<ByteBuffer> values) {
    QueryOptions options = QueryOptions.forInternalCalls(null, values);
    List<ByteBuffer> keys = insert.buildPartitionKeyNames(options);
    Composite clusteringPrefix = insert.createClusteringPrefix(options);
    UpdateParameters params = new UpdateParameters(insert.cfm, options,
            insert.getTimestamp(now, options), insert.getTimeToLive(options),
            Collections.<ByteBuffer, CQL3Row>emptyMap());
    for (ByteBuffer key : keys) {
        if (writer.shouldStartNewRow() || !key.equals(writer.currentKey().getKey()))
            writer.newRow(key); // start a new row whenever the partition key changes
        insert.addUpdateForKey(writer.currentColumnFamily(), key, clusteringPrefix, params, false);
    }
    return this;
}

SSTableSimpleWriter wraps an SSTableWriter; its writeRow method writes one row by calling SSTableWriter.append, after which the flow is the same write path analyzed above.

public abstract class AbstractSSTableSimpleWriter implements Closeable {
    public void newRow(ByteBuffer key) throws IOException {
        writeRow(currentKey, columnFamily);
    }
}

public class SSTableSimpleWriter extends AbstractSSTableSimpleWriter {
    private final SSTableWriter writer;

    protected void writeRow(DecoratedKey key, ColumnFamily columnFamily) {
        writer.append(key, columnFamily);
    }
}

The AbstractSSTableSimpleWriter class hierarchy:

AbstractSSTableSimpleWriter (org.apache.cassandra.io.sstable)
  SSTableSimpleWriter (org.apache.cassandra.io.sstable)
  SSTableSimpleUnsortedWriter (org.apache.cassandra.io.sstable)
    BufferedWriter in CQLSSTableWriter (org.apache.cassandra.io.sstable)

The client-side flow for generating SSTables with CQLSSTableWriter:
(Figure: CQLSSTableWriter flow)

The Writer in the Spark Cassandra Connector

ColumnFamily

ColumnFamily
  AtomicBTreeColumns
  ArrayBackedSortedColumns

ColumnFamily offers several ways to add atoms, such as tombstones, OnDiskAtoms, and plain columns. The abstract class also defines many column accessors, such as iterators and sorted views over the columns.

public abstract class ColumnFamily implements Iterable<Cell>, IRowCacheEntry {
    protected final CFMetaData metadata;

    public void addTombstone(CellName name, int localDeletionTime, long timestamp) {
        addColumn(new BufferDeletedCell(name, localDeletionTime, timestamp));
    }

    public void addAtom(OnDiskAtom atom) {
        if (atom instanceof Cell) {
            addColumn((Cell)atom);
        } else {
            delete((RangeTombstone)atom);
        }
    }

    public void addColumn(CellName name, ByteBuffer value, long timestamp, int timeToLive) {
        Cell cell = AbstractCell.create(name, value, timestamp, timeToLive, metadata());
        addColumn(cell);
    }

    public abstract void addColumn(Cell cell);
}
public class ArrayBackedSortedColumns extends ColumnFamily {
    private Cell[] cells;
}

SSTableReader

BigTableReader

The new 3.0 storage engine

http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html
Starting with the 3.x storage engine, Partitions, Rows, and Clustering are natively supported. A Partition is a collection of Rows that share the same Partition Key(s) and are ordered, within the Partition, by their Clustering Key(s). Rows are then globally identified by their Primary Key: the combination of Partition Key and Clustering Key. The important change is that the 3.x storage engine now knows about these ideas; it may seem strange, but previously it did not know about the Rows in a Partition. The new storage engine was created specifically to handle these concepts in a way that reduces storage requirements and improves performance.

http://www.datastax.com/2015/12/storage-engine-30
2.0 stores maps of (ordered) maps of binary data: Map<byte[], SortedMap<byte[], Cell>>. The top-level keys of that map are the partition keys, and each partition (identified by its key) is a sorted key/value map. The inner values of that partition map are called Cells, mostly because each contains both a binary value and the timestamp used for conflict resolution.

3.0 stores Map<byte[], SortedMap<Clustering, Row>>. At the top level, a table is still a map of partitions indexed by their partition key, and the partition is still a sorted map, but now one of rows indexed by their "clustering". The Clustering holds the values of the clustering columns for the CQL row it represents, and the Row object represents a given CQL row, associating each column with its value and timestamp.
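
The two generations as toy type declarations (JDK collections standing in for the real structures; Cell, Clustering, and Row here are hypothetical stubs):

import java.nio.ByteBuffer;
import java.util.Map;
import java.util.SortedMap;

class EngineModels {
    static final class Cell { ByteBuffer value; long timestamp; }

    // 2.0: cells are the unit; inside a partition the engine sees no row boundaries
    Map<ByteBuffer, SortedMap<ByteBuffer, Cell>> engine20;

    // 3.0: rows are first-class; a partition maps each clustering to a row of column values
    static final class Clustering implements Comparable<Clustering> {
        public int compareTo(Clustering o) { return 0; } // stub comparator
    }
    static final class Row { Map<String, Cell> columns; }
    Map<ByteBuffer, SortedMap<Clustering, Row>> engine30;
}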

