Common Cassandra Exceptions

  • Common Cassandra Exceptions and Solutions
    • Installation
    • Out-of-memory (OOM) errors
    • Tables missing on a newly added node
    • Cluster split caused by inconsistent seed nodes
    • Request did not complete within rpc_timeout
    • NoHostAvailableException: hosts unavailable, operation timed out
    • Not enough replica available
    • fromIndex
    • Restarting a Cassandra node after moving files
    • Broken pipe
    • Histogram mean overflow on the api cluster with leveled compaction
      • cfstats
      • cfhistograms
    • fp cluster out of disk space, compaction impossible, FileNotFoundException

Common Cassandra Exceptions and Solutions

Installation

Start the nodes one at a time during installation: wait until a node has finished starting (bootstrapping) before launching the next. Starting several at once makes bootstrap fail, because consistent range movement (cassandra.consistent.rangemovement, true by default) refuses to bootstrap while other nodes are bootstrapping, leaving, or moving.
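
A minimal pattern for sequential startup, as a sketch (the service name and polling interval are illustrative): start one node, then poll nodetool status until every node reports UN before moving on.

sudo service cassandra start

# Poll until no node in the ring is joining/leaving/down (all rows "UN")
while true; do
  ring=$(nodetool status 2>/dev/null) || { sleep 10; continue; }
  echo "$ring" | grep -E '^[UD][NLJM] ' | grep -qv '^UN ' || break
  sleep 10
done

Starting a second node while another is still bootstrapping aborts with: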

Exception (java.lang.UnsupportedOperationException) encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:571)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:781)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:709)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:595)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625)
ERROR 06:26:25 Exception encountered during startup
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:571) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:781) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:709) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:595) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.6.jar:2.2.6]
WARN 06:26:25 No local state or state is in silent shutdown, not announcing shutdown

Another startup failure: the system keyspace files were found but could not be loaded:

ERROR 06:29:09 Fatal exception during initialization
org.apache.cassandra.exceptions.ConfigurationException: Found system keyspace files, but they couldn't be loaded!
at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:728) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.StartupChecks$9.execute(StartupChecks.java:306) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.StartupChecks.verify(StartupChecks.java:107) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:162) [apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.6.jar:2.2.6]

And a forced exit caused by a file system exception during startup:

ERROR 02:03:03 Exiting forcefully due to file system exception on startup, disk failure policy "stop"
org.apache.cassandra.io.FSReadError: java.io.IOException: Channel not open for writing - cannot extend file to required size
at org.apache.cassandra.io.util.ChannelProxy.map(ChannelProxy.java:156) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.util.CompressedSegmentedFile.createMappedSegments(CompressedSegmentedFile.java:85) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.<init>(CompressedPoolingSegmentedFile.java:38) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:101) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:198) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:189) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.sstable.format.SSTableReader.load(SSTableReader.java:711) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.sstable.format.SSTableReader.load(SSTableReader.java:672) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:466) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:371) ~[apache-cassandra-2.2.6.jar:2.2.6]
at org.apache.cassandra.io.sstable.format.SSTableReader$4.run(SSTableReader.java:509) ~[apache-cassandra-2.2.6.jar:2.2.6]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_51]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: java.io.IOException: Channel not open for writing - cannot extend file to required size
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:851) ~[na:1.7.0_51]
at org.apache.cassandra.io.util.ChannelProxy.map(ChannelProxy.java:152) ~[apache-cassandra-2.2.6.jar:2.2.6]
... 15 common frames omitted

Raising the log level to DEBUG reveals which file triggers the failure.
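
Since the node dies during startup, the level has to be raised in the configuration rather than at runtime. A sketch, assuming the stock logback.xml layout that ships with Cassandra 2.1+:

# Raise the root logger from INFO to DEBUG in conf/logback.xml
sed -i 's/<root level="INFO">/<root level="DEBUG">/' conf/logback.xml

With DEBUG enabled, the offending sstable shows up right before the exit: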

DEBUG 00:11:50 Evicting cold readers for /home/admin/cassandra/data/system/sstable_activity/system-sstable_activity-ka-5688-Data.db
ERROR 00:11:51 Exiting forcefully due to file system exception on startup, disk failure policy "stop"
....

Scrubbing the affected table with sstablescrub:

bin/sstablescrub --debug -s -v system sstable_activity
Pre-scrub sstables snapshotted into snapshot pre-scrub-1468282635420
Scrubbing BigTableReader(path='/home/admin/cassandra/data/system/sstable_activity/system-sstable_activity-ka-5688-Data.db') (348911 bytes)
Reading row at 0
row 000a666f72736574695f667000000c7463705f73796e5f646174610000040001b6eb00 is 125 bytes
Reading row at 162
row 000a666f72736574695f667000000c7463705f73796e5f646174610000040001a22400 is 125 bytes
Reading row at 324
row 000a666f72736574695f667000000c7463705f73796e5f64617461000004000194bd00 is 125 bytes
...
Reading row at 15484
row 00094f707343656e746572000009726f6c6c757073363000000400000ba300 is 14 bytes
Scrub of BigTableReader(path='/home/admin/cassandra/data/system/sstable_activity/system-sstable_activity-ka-5689-Data.db') complete: 118 rows in new sstable and 0 empty (tombstoned) rows dropped
WARN  [GossipTasks:1] 2016-07-12 18:18:01,742 FailureDetector.java:258 - Not marking nodes down due to local pause of 5589128315 > 5000000000

Out-of-memory (OOM) errors

ERROR [StreamReceiveTask:1345] 2016-01-11 02:04:03,305 CassandraDaemon.java (line 258) Exception in thread Thread[StreamReceiveTask:1345,5,main]
FSReadError in /home/admin/cassandra/data/forseti/velocity_app/forseti-velocity_app-jb-620650-Index.db
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:200)
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:168)
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:345)
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:335)
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:126)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:192)
... 9 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:885)
... 10 more
ERROR [StreamReceiveTask:1345] 2016-01-11 02:04:03,356 StorageService.java (line 377) Stopping gossiper

When memory is exhausted, an .hprof heap dump is written by default into the Cassandra installation directory.

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007fce3b000, 466944, 0) failed; error='无法分配内存' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 466944 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid8223.log
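
To see what exhausted memory, the JVM fatal-error log and any heap dump are the first stops. A sketch; the file names come from the message above, and jhat ships with the JDK 7 used here:

# The fatal-error log names the failing allocation and the process memory map
less /tmp/hs_err_pid8223.log
# Heap dumps (java_pid<pid>.hprof) land in the Cassandra install directory
jhat -J-Xmx8g java_pid*.hprof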

Insufficient memory can also surface as an OutOfMemoryError from native (off-heap) allocation:

ERROR [CompactionExecutor:48826] 2016-03-11 08:20:41,088 CassandraDaemon.java (line 258) Exception in thread Thread[CompactionExecutor:48826,1,main]
java.lang.OutOfMemoryError
at sun.misc.Unsafe.allocateMemory(Native Method)
at org.apache.cassandra.io.util.NativeAllocator.allocate(NativeAllocator.java:43)
at org.apache.cassandra.io.util.Memory.<init>(Memory.java:51)
at org.apache.cassandra.io.util.Memory.allocate(Memory.java:59)
at org.apache.cassandra.io.sstable.IndexSummaryBuilder.build(IndexSummaryBuilder.java:77)
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:347)
at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:216)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

The md5_id table holds a very large volume of data, close to 100 billion rows. Partway through the load it failed:

ERROR [CompactionExecutor:1441] 2016-03-18 06:54:26,852 CassandraDaemon.java (line 258) Exception in thread Thread[CompactionExecutor:1441,1,main]
java.lang.RuntimeException: Out of native memory occured, You can avoid it by increasing the system ram space or by increasing bloom_filter_fp_chance.
at org.apache.cassandra.utils.obs.OffHeapBitSet.<init>(OffHeapBitSet.java:49)
at org.apache.cassandra.utils.FilterFactory.createFilter(FilterFactory.java:85)
at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:79)
at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.<init>(SSTableWriter.java:455)
at org.apache.cassandra.io.sstable.SSTableWriter.<init>(SSTableWriter.java:103)
at org.apache.cassandra.db.compaction.CompactionTask.createCompactionWriter(CompactionTask.java:316)
at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:162)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
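
The message itself names the two mitigations: more RAM, or a larger bloom_filter_fp_chance, which shrinks the off-heap filter at the cost of more false positives. A minimal sketch, assuming the table lives in the forseti keyspace; new filters only apply as sstables are rewritten:

cqlsh -e "ALTER TABLE forseti.md5_id WITH bloom_filter_fp_chance = 0.1;"
# Rewrite existing sstables so they pick up the new setting
nodetool upgradesstables -a forseti md5_id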

Querying random keys

[qihuang.zheng@spark047211 ~]$ nodetool -h 192.168.48.162 rangekeysample
RangeKeySample:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuilder.append(StringBuilder.java:204)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFSpan(ObjectInputStream.java:3143)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3051)
at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2864)
at java.io.ObjectInputStream.readString(ObjectInputStream.java:1638)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1341)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at java.util.ArrayList.readObject(ArrayList.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at sun.rmi.server.UnicastRef.unmarshalValue(UnicastRef.java:325)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:174)
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1029)
at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:292)
at com.sun.proxy.$Proxy0.sampleKeyRange(Unknown Source)
at org.apache.cassandra.tools.NodeProbe.sampleKeyRange(NodeProbe.java:902)
at org.apache.cassandra.tools.NodeCmd.printRangeKeySample(NodeCmd.java:1661)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1536)

Importing a large amount of data with sstableloader crashed the server:

ERROR [StreamReceiveTask:2] 2016-06-16 21:12:56,011 StreamReceiveTask.java:183 - Error applying streamed data:
org.apache.cassandra.io.FSReadError: java.io.IOException: Map failed
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:399) ~[apache-cassandra-2.1.13.jar:2.1.13]
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:365) ~[apache-cassandra-2.1.13.jar:2.1.13]
at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:174) ~[apache-cassandra-2.1.13.jar:2.1.13]
at org.apache.cassandra.io.sstable.SSTableWriter.finish(SSTableWriter.java:463) ~[apache-cassandra-2.1.13.jar:2.1.13]
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:447) ~[apache-cassandra-2.1.13.jar:2.1.13]
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:442) ~[apache-cassandra-2.1.13.jar:2.1.13]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:141) ~[apache-cassandra-2.1.13.jar:2.1.13]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_51]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888) ~[na:1.7.0_51]
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:392) ~[apache-cassandra-2.1.13.jar:2.1.13]
... 11 common frames omitted
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method) ~[na:1.7.0_51]
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:885) ~[na:1.7.0_51]
... 12 common frames omitted
ERROR [StreamReceiveTask:2] 2016-06-16 21:12:56,026 JVMStabilityInspector.java:117 - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: Map failed
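
An OutOfMemoryError reading "Map failed" usually means the process hit the kernel's per-process limit on memory mappings rather than true heap exhaustion. On Linux the limit can be checked and raised, as a sketch (the target value is illustrative):

# Default is often 65530, far too low for a node with many sstables
sysctl vm.max_map_count
sudo sysctl -w vm.max_map_count=1048575
# Persist across reboots
echo 'vm.max_map_count = 1048575' | sudo tee -a /etc/sysctl.conf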

OOM under a flash-sale workload:

ERROR [Reference-Reaper:1] 2016-08-03 14:38:40,274 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@25ffeff6) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup@725095737:/home/admin/data/data/realtime/flash-f2a59850520d11e68c4db36bb69e35dc/la-9-big-Data.db was not released before the reference was garbage collected
INFO [ScheduledTasks:1] 2016-08-03 14:38:40,301 StatusLogger.java:106 - system_traces.events 0,0
INFO [Service Thread] 2016-08-03 14:38:40,302 GCInspector.java:284 - G1 Young Generation GC in 384ms. G1 Eden Space: 1593835520 -> 0; G1 Old Gen: 31711480592 -> 33695096592;
ERROR [Reference-Reaper:1] 2016-08-03 14:38:40,302 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@2f078486) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup@725095737:/home/admin/data/data/realtime/flash-f2a59850520d11e68c4db36bb69e35dc/la-9-big-Data.db was not released before the reference was garbage collected
INFO [ScheduledTasks:1] 2016-08-03 14:38:40,302 StatusLogger.java:106 - realtime.holmes_metrics 0,0
INFO [Service Thread] 2016-08-03 14:38:40,302 StatusLogger.java:52 - Pool Name Active Pending Completed Blocked All Time Blocked
ERROR [Reference-Reaper:1] 2016-08-03 14:38:40,306 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@6a051f02) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup@725095737:/home/admin/data/data/realtime/flash-f2a59850520d11e68c4db36bb69e35dc/la-9-big-Data.db was not released before the reference was garbage collected
INFO [ScheduledTasks:1] 2016-08-03 14:38:40,306 StatusLogger.java:106 - realtime.activity_segments 0,0
ERROR [Reference-Reaper:1] 2016-08-03 14:38:40,307 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@305fddeb) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup@725095737:/home/admin/data/data/realtime/flash-f2a59850520d11e68c4db36bb69e35dc/la-9-big-Data.db was not released before the reference was garbage collected
INFO [ScheduledTasks:1] 2016-08-03 14:38:40,307 StatusLogger.java:106 - realtime.activity_segments_v2 0,0
ERROR [Reference-Reaper:1] 2016-08-03 14:38:40,307 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@5ba59704) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup@1209338690:/home/admin/data/data/realtime/flash-f2a59850520d11e68c4db36bb69e35dc/la-18-big-Data.db was not released before the reference was garbage collected
INFO [ScheduledTasks:1] 2016-08-03 14:38:40,307 StatusLogger.java:106 - realtime.dev_activity_segments_v2 0,0
ERROR [Reference-Reaper:1] 2016-08-03 14:38:40,307 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@a2ae04d) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup@1209338690:/home/admin/data/data/realtime/flash-f2a59850520d11e68c4db36bb69e35dc/la-18-big-Data.db was not released before the reference was garbage collected
INFO [ScheduledTasks:1] 2016-08-03 14:38:40,307 StatusLogger.java:106 - realtime.dev_activity_segments 1111816,194638707
WARN [GossipTasks:1] 2016-08-03 14:38:40,307 FailureDetector.java:287 - Not marking nodes down due to local pause of 52801769613 > 5000000000
ERROR [Reference-Reaper:1] 2016-08-03 14:38:40,307 Ref.java:187 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@4454d466) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup@1209338690:/home/admin/data/data/realtime/flash-f2a59850520d11e68c4db36bb69e35dc/la-18-big-Data.db was not released before the reference was garbage collected
INFO [ScheduledTasks:1] 2016-08-03 14:38:40,308 StatusLogger.java:106 - realtime.flash 673700,11091988
ERROR [SharedPool-Worker-283] 2016-08-03 14:38:40,308 JVMStabilityInspector.java:117 - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space

Tables missing on a newly added node

When a new node joins an existing cluster it has no schema yet; the table definitions must first be synchronized from the cluster.

 WARN [MessagingService-Incoming-/192.168.48.161] 2016-01-22 10:22:04,665 IncomingTcpConnection.java (line 86) UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=06010280-2614-37de-8bfa-7588805d9347
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:178)
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:103)
at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserializeOneCf(RowMutation.java:314)
at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:294)
at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:322)
at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:264)
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:166)
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:148)
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:77)

org.apache.cassandra.db.UnknownColumnFamilyException: Got slice command for nonexistent table forseti_fp.android_device_time. If the table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
at org.apache.cassandra.db.SliceFromReadCommandSerializer.deserialize(SliceFromReadCommand.java:189)

INFO [InternalResponseStage:1] 2016-01-22 10:22:04,813 ColumnFamilyStore.java (line 250) Initializing forseti.ip_ip.ip_ip_update_time_idx
ERROR [MutationStage:44] 2016-01-22 10:22:14,504 Keyspace.java (line 366) Attempting to mutate non-existant column family 9c57333a-0557-35d8-afb0-55b40bdae534

Analysis: if a node is added to the cluster before the table definitions have reached it, the requests routed to that node cannot be processed until the schema propagates.
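
Before routing traffic to a freshly joined node, schema agreement can be checked; if nodetool shows more than one schema version, propagation has not finished. A sketch:

# All nodes should be listed under a single schema version
nodetool describecluster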

Cluster split caused by inconsistent seed nodes

Symptom: the /data commit log directory on node 159 filled up, killing the process; with no free space left, the node could not be restarted either.
After some files were deleted, the node was restarted. Commit log replay writes the logged mutations into the data directory and finally deletes the commit log files, freeing the space:

[qihuang.zheng@fp-cass048159 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 271G 256G 1.3G 100% /
/dev/sdb1 11T 8.7T 1.2T 89% /home

[qihuang.zheng@fp-cass048159 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 271G 33G 225G 13% /
/dev/sdb1 11T 8.7T 1.2T 89% /home

Nodes 159 and 160 are both seeds, yet each marks the other as down (DN). Did the cluster split?
Differing seed lists can indeed split a cluster, but here the configuration files were identical, i.e. every node listed the same seeds.

[qihuang.zheng@fp-cass048159 data]$ nodetool status
-- Address Load Tokens Owns Host ID Rack
UN 192.168.48.161 7.36 TB 256 25.4% 54b3a7e0-f778-4087-98c3-ac84e56f77e6 RAC1
DN 192.168.48.160 8.24 TB 256 26.2% 18a87c56-dcba-4614-adba-804aa7761a06 RAC1
UN 192.168.48.175 3.36 TB 256 24.3% 5bbd4400-2132-42b6-91c5-389592b75423 RAC1
UN 192.168.48.159 8.65 TB 256 24.2% df9a693b-efc1-41bc-9a42-cf868ea75e65 RAC1

[qihuang.zheng@fp-cass048160 ~]$ nodetool status
-- Address Load Tokens Owns Host ID Rack
UN 192.168.48.161 7.22 TB 256 25.4% 54b3a7e0-f778-4087-98c3-ac84e56f77e6 RAC1
UN 192.168.48.160 8.24 TB 256 26.2% 18a87c56-dcba-4614-adba-804aa7761a06 RAC1
UN 192.168.48.175 3.22 TB 256 24.3% 5bbd4400-2132-42b6-91c5-389592b75423 RAC1
DN 192.168.48.159 8.65 TB 256 24.2% df9a693b-efc1-41bc-9a42-cf868ea75e65 RAC1

Request did not complete within rpc_timeout

http://stackoverflow.com/questions/27340812/cassandra-timing-out-when-queried-for-key-that-have-over-10-000-rows-even-after

cqlsh:forseti>  TRACING ON;
Now tracing requests.
cqlsh:forseti> select * from ip_account limit 1;
Request did not complete within rpc_timeout.

Seeking to partition beginning in data file | 10:59:05,049 | 192.168.47.205 | 8629
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 10:59:07,064 | 192.168.47.205 | 2023597
Scanned 10 rows and matched 10 | 10:59:07,064 | 192.168.47.205 | 2024243
Timed out; received 0 of 1 responses for range 3 of 2561 | 10:59:20,030 | 192.168.47.202 | 15001569
Request complete | 10:59:20,030 | 192.168.47.202 | 15001676

Analysis: limit 1 with no other predicate is a range query. The trace shows the scan aborting after 100,000 tombstones and then timing out; cassandra.yaml sets range_request_timeout_in_ms: 15000, i.e. 15 s.
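
The two knobs involved can be read straight from cassandra.yaml, a sketch (the tombstone threshold values shown are the usual defaults):

grep -E 'range_request_timeout_in_ms|tombstone_(warn|failure)_threshold' conf/cassandra.yaml
# range_request_timeout_in_ms: 15000
# tombstone_warn_threshold: 1000
# tombstone_failure_threshold: 100000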

NoHostAvailableException: hosts unavailable, operation timed out

For "All host(s) tried for query failed", the real cause is inside the parentheses, here an operation timeout. If a node is down, requests sent to it time out.

Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: 
All host(s) tried for query failed (tried: /192.168.6.52:9042 (com.datastax.driver.core.OperationTimedOutException:
[/192.168.6.52:9042] Operation timed out), /192.168.6.53:9042 (com.datastax.driver.core.OperationTimedOutException:
[/192.168.6.53:9042] Operation timed out))

Not enough replica available

CassandraClient - Not enough replica available for query at consistency ONE (1 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replica available for query at consistency ONE (1 required but only 0 alive)

http://stackoverflow.com/questions/27974911/not-enough-replica-available-for-query-at-consistency-one-1-required-but-only-0

If you want to be able to handle more than 1 node being unavailable, what you could do is look into
altering your keyspace to set a higher replication factor, preferably three in this case,
and then running a nodetool repair on each node to get all of your data on all nodes.
With this change, you would be able to survive the loss of 2 nodes to read at a consistency level of one.

With a 3-node cluster and a replication factor of 3, two nodes can go down and reads at consistency level ONE still succeed,
because the replication factor equals the node count: every node holds a full copy of the data.
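
The fix from the Stack Overflow answer, as a sketch (keyspace name assumed; SimpleStrategy shown for brevity, a multi-datacenter cluster would use NetworkTopologyStrategy instead, and repair must then run on every node):

cqlsh -e "ALTER KEYSPACE forseti WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
# On each node, stream in the data it is now responsible for
nodetool repair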

How the replication factor affects reads and writes: http://www.ecyrd.com/cassandracalculator/
DC name case-sensitivity when creating tables (probably not the main issue here):
https://issues.apache.org/jira/browse/CASSANDRA-7905
https://datastax-oss.atlassian.net/browse/JAVA-37
https://issues.apache.org/jira/browse/CASSANDRA-5292

Insufficient write replicas caused by inconsistent seed nodes

Bulk writes on the fp production cluster raised: Not enough replica available for query at consistency ONE (1 required but only 0 alive)

The original seed nodes were 227 and 228. After the migration to SSD machines, the new nodes' seed lists had to include one of the original seeds, and 227 was chosen.
For example, node 159's seeds were set to 227,159 and node 160's to 159,160, while the application's contact points were 159,160.
Later 227 was decommissioned, which left the seed lists on some nodes in conflict.
As a result, the gossip protocol's view of node states in the cluster became confused:

Although every node's data center is DC3, the status below shows that 228 and 159 fail to recognize 160 and 161; those two even appear under DC1 instead of the configured DC3.

[qihuang.zheng@fp-cass048159 ~]$ nodetool status
Datacenter: DC1
===============
-- Address Load Tokens Owns Host ID Rack
DN 192.168.48.161 ? 256 24.9% 54b3a7e0-f778-4087-98c3-ac84e56f77e6 r1
DN 192.168.48.160 ? 256 25.8% 18a87c56-dcba-4614-adba-804aa7761a06 r1
Datacenter: DC3
===============
-- Address Load Tokens Owns Host ID Rack
UN 192.168.47.228 11.82 TB 256 26.3% f3d26148-d2da-479c-ae9e-ae41aced1be9 RAC1
UN 192.168.48.159 3.54 TB 256 23.0% df9a693b-efc1-41bc-9a42-cf868ea75e65 RAC1

Yet 160 and 161 do recognize 228 and 159:

[qihuang.zheng@fp-cass048161 ~]$ nodetool status
Datacenter: DC3
===============
-- Address Load Tokens Owns Host ID Rack
UN 192.168.48.161 3.24 TB 256 24.9% 54b3a7e0-f778-4087-98c3-ac84e56f77e6 RAC1
UN 192.168.48.160 1.59 TB 256 25.8% 18a87c56-dcba-4614-adba-804aa7761a06 RAC1
UN 192.168.47.228 7.18 TB 256 26.3% f3d26148-d2da-479c-ae9e-ae41aced1be9 RAC1
UN 192.168.48.159 2.18 TB 256 23.0% df9a693b-efc1-41bc-9a42-cf868ea75e65 RAC1

The fix was to manually removenode the DN nodes on 228 and 159 (a force remove was needed as well; the commands are shown after the output below). For example, on node 159 both DN nodes were removed:

[qihuang.zheng@fp-cass048159 ~]$ nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC3
===============
-- Address Load Tokens Owns Host ID Rack
UN 192.168.47.228 11.6 TB 256 50.0% f3d26148-d2da-479c-ae9e-ae41aced1be9 RAC1
UN 192.168.48.159 3.54 TB 256 50.0% df9a693b-efc1-41bc-9a42-cf868ea75e65 RAC1
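
For reference, the removal is two commands, using the Host ID that nodetool status prints for the DN node; a sketch with one of the IDs from above:

nodetool removenode 54b3a7e0-f778-4087-98c3-ac84e56f77e6
# If the removal hangs on unreachable replicas, force it
nodetool removenode force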

Then restart 160 and 161, so that they rejoin the cluster formed by 228 and 159.

[qihuang.zheng@fp-cass048159 ~]$ nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC3
===============
-- Address Load Tokens Owns Host ID Rack
UN 192.168.48.161 3.24 TB 256 24.9% 54b3a7e0-f778-4087-98c3-ac84e56f77e6 RAC1
UN 192.168.48.160 2.69 TB 256 25.8% 18a87c56-dcba-4614-adba-804aa7761a06 RAC1
UN 192.168.47.228 11.6 TB 256 26.3% f3d26148-d2da-479c-ae9e-ae41aced1be9 RAC1
UN 192.168.48.159 3.54 TB 256 23.0% df9a693b-efc1-41bc-9a42-cf868ea75e65 RAC1

Checking the status on each node now shows the same view everywhere.

Conclusion: when migrating or decommissioning nodes, if seed nodes are involved, the seeds list in every node's configuration file must end up identical, and the nodes must be restarted.
Gossip does discover the other cluster members automatically, but if nodetool status shows different views on different nodes, fix the seed lists immediately!
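
A quick way to confirm that every node agrees on the seed list, as a sketch (the conf path is assumed):

# Every node should print an identical seeds line
for h in 192.168.48.159 192.168.48.160 192.168.48.161 192.168.47.228; do
  ssh "$h" grep -- '- seeds:' /home/admin/cassandra/conf/cassandra.yaml
done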

The same failure after switching the consistency level to QUORUM:

com.datastax.driver.core.exceptions.UnavailableException: 
Not enough replica available for query at consistency QUORUM (1 required but only 0 alive)

fromIndex

ERROR [ReadStage:46463] 2016-01-11 14:51:13,903 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:46463,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException: fromIndex(3000) > toIndex(0)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2016)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalArgumentException: fromIndex(3000) > toIndex(0)
at java.util.ArrayList.subListRangeCheck(ArrayList.java:964)
at java.util.ArrayList.subList(ArrayList.java:954)
at org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator.computeNext(ArrayBackedSortedColumns.java:405)
at org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator.computeNext(ArrayBackedSortedColumns.java:365)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.filter.SliceQueryFilter$1.hasNext(SliceQueryFilter.java:154)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:157)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:140)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:191)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:89)
at org.apache.cassandra.db.ColumnFamilyStore.filterColumnFamily(ColumnFamilyStore.java:1422)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1382)
at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:330)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:65)
at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1447)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2012)

Restarting a Cassandra node after moving files

One node was running out of disk space, and some of its files were very large (with size-tiered compaction, some exceed 1 TB). After moving them to another node and restarting that node to load them, overlapping key ranges were reported:

WARN 14:14:55,720 At level 1, SSTableReader(path='/home/admin/cassandra/data/forseti/velocity_app/forseti-velocity_app-jb-1675-Data.db') 
[DecoratedKey(-2251189883765179735, 002034623334643764366665656239313330373930376365363937393130343164370000096a696564616962616f00000f6a696564616962616f5f616e64726f00000c6163636f756e744c6f67696e00),
DecoratedKey(744337094338704937, 0006e4b8ade59bbd0000066a6975796f7500000a6a6975796f755f776562000010697041646472657373436f756e74727900)]
overlaps SSTableReader(path='/home/admin/cassandra/data/forseti/velocity_app/forseti-velocity_app-jb-191783-Data.db')
[DecoratedKey(-321297051577330931, 000d3136372e31342e3232352e343400000972797861617736734e00000d72797861617736734e5f77656200000969704164647265737300),
DecoratedKey(-283124829323804542, 0009e4b88ae6b5b7e5b882000005637472697000000963747269705f77656200000d6970416464726573734369747900)].
This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have dropped sstables from another node into the data directory.
Sending back to L0. If you didn't drop in sstables, and have not yet run scrub, you should do so since you may also have rows out-of-order within an sstable
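
The warning's own advice applies: since the sstables were dropped in from another node, scrub the table. A sketch:

nodetool scrub forseti velocity_app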

Broken pipe

Typically, when a node suffers a severe GC pause and cannot respond to requests, its connections get dropped.

 INFO [Native-Transport-Requests:637] 2015-11-08 16:31:24,368 Message.java (line 397) Unexpected exception during request; channel = [id: 0x64da6e04, /192.168.47.75:38758 :> /192.168.47.202:9042]
java.io.IOException: 断开的管道
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$UnpooledSendBuffer.transferTo(SocketSendBufferPool.java:203)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:202)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromTaskLoop(AbstractNioWorker.java:152)
at org.jboss.netty.channel.socket.nio.AbstractNioChannel$WriteTask.run(AbstractNioChannel.java:335)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
INFO [CompactionExecutor:19230] 2015-11-08 16:31:31,036 CompactionTask.java (line 120) Compacting [SSTableReader(path='/home/admin/cassandra/data/system/hints/system-hints-jb-4140-Data.db'), SSTableReader(path='/home/admin/cassandra/data/system/hints/system-hints-jb-4139-Data.db'), SSTableReader(path='/home/admin/cassandra/data/system/hints/system-hints-jb-4141-Data.db')]
INFO [CompactionExecutor:19230] 2015-11-08 16:32:27,041 CompactionTask.java (line 299) Compacted 3 sstables to [/home/admin/cassandra/data/system/hints/system-hints-jb-4142,]. 1,576,968,570 bytes to 1,557,007,102 (~98% of original) in 56,002ms = 26.514726MB/s. 5 total partitions merged to 2. Partition merge counts were {2:1, 3:1, }
INFO [HintedHandoff:2] 2015-11-08 16:32:27,106 HintedHandOffManager.java (line 349) Started hinted handoff for host: 57eb6f83-734d-48ef-855b-e745001e8207 with IP: /192.168.47.227
INFO [HintedHandoff:1] 2015-11-08 16:32:27,106 HintedHandOffManager.java (line 349) Started hinted handoff for host: f3d26148-d2da-479c-ae9e-ae41aced1be9 with IP: /192.168.47.228

INFO [HintedHandoff:2] 2015-11-08 16:32:54,726 HintedHandOffManager.java (line 469) Timed out replaying hints to /192.168.47.227; aborting (468 delivered)
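
To tie the broken pipes to GC, look for long collections and local-pause warnings around the same timestamps, a sketch (the log path is assumed):

grep -E 'GCInspector|local pause' /var/log/cassandra/system.log | tail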

Histogram mean overflow on the api cluster with leveled compaction

ERROR [CompactionExecutor:46654] 2016-03-15 09:31:59,459 CassandraDaemon.java (line 258) Exception in thread Thread[CompactionExecutor:46654,1,main]
java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
at org.apache.cassandra.utils.EstimatedHistogram.mean(EstimatedHistogram.java:202)
at org.apache.cassandra.io.sstable.SSTableMetadata.getEstimatedDroppableTombstoneRatio(SSTableMetadata.java:192)
at org.apache.cassandra.io.sstable.SSTableReader.getEstimatedDroppableTombstoneRatio(SSTableReader.java:1334)
at org.apache.cassandra.db.compaction.LeveledCompactionStrategy$1.compare(LeveledCompactionStrategy.java:341)
at org.apache.cassandra.db.compaction.LeveledCompactionStrategy$1.compare(LeveledCompactionStrategy.java:338)
at java.util.TimSort.binarySort(TimSort.java:265)
at java.util.TimSort.sort(TimSort.java:208)
at java.util.Arrays.sort(Arrays.java:727)
at com.google.common.collect.ImmutableSortedSet.construct(ImmutableSortedSet.java:428)
at com.google.common.collect.ImmutableSortedSet.copyOf(ImmutableSortedSet.java:357)
at com.google.common.collect.ImmutableSortedSet.copyOf(ImmutableSortedSet.java:380)
at org.apache.cassandra.db.compaction.LeveledManifest.getLevelSorted(LeveledManifest.java:612)
at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.findDroppableSSTable(LeveledCompactionStrategy.java:337)
at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:125)
at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:113)
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:192)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
WARN [ReadStage:72] 2016-03-15 09:32:03,616 SliceQueryFilter.java (line 231) Read 1002 live and 7758 tombstone cells in forseti.velocity (see tombstone_warn_threshold).
1001 columns was requested, slices=[dhgate:dh_web_seller:ip3:1455670777904:sequence_id-dhgate:!]

WARN [ReadStage:907] 2016-03-15 10:23:55,294 SliceQueryFilter.java (line 231) Read 1001 live and 6045 tombstone cells in forseti.velocity (see tombstone_warn_threshold).
1000 columns was requested, slices=[dhgate-dhgate:!]
WARN [ReadStage:824] 2016-03-15 10:23:55,374 SliceQueryFilter.java (line 231) Read 1002 live and 5691 tombstone cells in forseti.velocity (see tombstone_warn_threshold).
1001 columns was requested, slices=[dhgate:dh_web_seller:ip3:1455670777904:sequence_id-dhgate:!]

Analysis: https://issues.apache.org/jira/browse/CASSANDRA-8028 attributes this to very large partitions overflowing the estimated histogram, which by default has 90 buckets.
Next step: examine how EstimatedHistogram sizes its buckets.

Working theory: the partitions are very large and unevenly distributed, so the default 90 buckets are not enough.

cfstats

With the leveled compaction strategy, the column family stats show the number of sstables in each level:

[qihuang.zheng@cass047202 ~]$ nodetool cfstats forseti.velocity
Keyspace: forseti
Read Count: 17155706
Read Latency: 2.544613554872064 ms.
Write Count: 366320040
Write Latency: 0.05012879881482869 ms.
SSTable count: 3807
SSTables in each level: [4, 9, 100, 622, 3072, 0, 0, 0, 0]
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 228(G),504(M),356(K),366(B)
Compacted partition mean bytes: 9748

1. cfstats reports the minimum, maximum and mean compacted partition sizes. The largest partition here is 228 GB!
A partition corresponds to one value of the table's partition key.

2. The per-level sstable counts add up exactly to the SSTable count.
Each number is the count of sstables in that level. L0 sstables are 5 MB, and each subsequent level is 10x the size of the previous one.
Source: LeveledManifest

3. By level capacity that is 5M=4, 50M=9, 500M=100, 5G=622, 50G=3072, 500G=0, ...
Yet searching by file size turns up nowhere near that many 5 GB, let alone 50 GB, files:

find /home/admin/cassandra/data/forseti/velocity/*Data.db -size +53687091200c

In fact, a level's size is the total amount of data in that level, not the size of each individual file:
http://stackoverflow.com/questions/29766453/how-does-the-leveled-compaction-strategy-ensure-90-of-reads-are-from-one-sstabl

  • Every sstable is created when a fixed (relatively small) size limit is reached. By default L0 gets 5MB files, and each subsequent level is 10x the size. (in L1 you’ll have 50MB of data, L2 500MB, and so on).
  • Sstables are created with the guarantee that they don’t overlap
  • When a level fills up, a compaction is triggered and sstables from level-L are promoted to level-L+1. So, in L1 you’ll have 50MB in ~10 files, L2 500MB in ~100 files, etc..

cfhistograms

Viewing the cfhistograms output sometimes throws an ArrayIndexOutOfBoundsException at index 90 (on a second run that did not fail, the request counts were much lower).

[qihuang.zheng@cass047202 ~]$ nodetool cfhistograms forseti velocity
forseti/velocity histograms

SSTables per Read
1 sstables: 9243544
2 sstables: 2447040
3 sstables: 1714616
4 sstables: 912563
5 sstables: 430344
6 sstables: 59759
7 sstables: 14888
8 sstables: 5128
10 sstables: 690

Write Latency (microseconds)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 90
at org.apache.cassandra.tools.NodeCmd.printHistogram(NodeCmd.java:1084)
at org.apache.cassandra.tools.NodeCmd.printCfHistograms(NodeCmd.java:1133)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1469)

[qihuang.zheng@cass047202 ~]$ nodetool cfhistograms forseti velocity
forseti/velocity histograms

SSTables per Read
1 sstables: 101
2 sstables: 29
3 sstables: 22
4 sstables: 13
5 sstables: 4
6 sstables: 2
...

What the cfhistograms dimensions mean:

SSTables per Read : sstables touched per read : request count
1 sstables: 300887 (about 300 K requests read exactly one sstable)
2 sstables: 65075
8 sstables: 109

Write Latency (microseconds) : write latency in microseconds : request count
6 us: 13
42 us: 1174485
263210 us: 1

Read Latency (microseconds) : read latency in microseconds : request count
24 us: 3
2759 us: 20459
.... us: ...
4866323 us: 3

Partition Size (bytes) : partition size : partition count
124 bytes: 2029
25109160 bytes: 883

Cell Count per Partition : cells per partition : partition count
1 cells: 2030
25109160 cells: 13
  • Cell Count per Partition has 90 buckets.
  • Partition Size has 68 entries. The largest partition above is 25 MB; how that squares with the 200+ GB maximum in cfstats is unclear.
  • The request counts on the right of SSTables per Read sum to the same total as those on the right of Read Latency: the same set of read requests is bucketed once by how many sstables each touched, and once by each request's latency.

fp cluster out of disk space, compaction impossible, FileNotFoundException

(figure: FileNotFoundException errors on the fp cluster)

Analysis: no compaction had run, yet the file was deleted and could not be found, which is odd; without a compaction the original data files should never be removed.
Status: unresolved

Reads also fail when the file cannot be found:

ERROR [ReadStage:91] 2016-02-09 04:09:00,384 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:91,5,main]
java.lang.RuntimeException: java.io.FileNotFoundException: /home/admin/cassandra/data/forseti_fp/android_device_session/forseti_fp-android_device_session-jb-115318-Data.db (没有那个文件或目录)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:47)
at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.createReader(CompressedPoolingSegmentedFile.java:48)
at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(PoolingSegmentedFile.java:39)
at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:1242)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.<init>(SimpleSliceReader.java:57)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:173)
at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:250)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:53)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1567)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1386)
at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:330)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:65)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:47)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.FileNotFoundException: /home/admin/cassandra/data/forseti_fp/android_device_session/forseti_fp-android_device_session-jb-115318-Data.db (没有那个文件或目录)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:43)
... 19 more

One day node 169 logged a flood of FileNotFoundExceptions, and a restart did not cure it; the sstable count was extremely high (normally only about 30). A manual compaction also failed with FNF:

[qihuang.zheng@cass048169 ~]$ nodetool cfstats forseti.velocity_app | grep "SSTable count"
SSTable count: 5746
[qihuang.zheng@cass048169 ~]$ nodetool compact forseti velocity_app
Error occurred during compaction
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /home/admin/cassandra/data/forseti/velocity_app/forseti-velocity_app-jb-1014038-Data.db (没有那个文件或目录)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.cassandra.db.compaction.CompactionManager.performMaximal(CompactionManager.java:288)
at org.apache.cassandra.db.ColumnFamilyStore.forceMajorCompaction(ColumnFamilyStore.java:1990)
at org.apache.cassandra.service.StorageService.forceKeyspaceCompaction(StorageService.java:2205)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
at sun.rmi.transport.Transport$1.run(Transport.java:177)
at sun.rmi.transport.Transport$1.run(Transport.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /home/admin/cassandra/data/forseti/velocity_app/forseti-velocity_app-jb-1014038-Data.db (没有那个文件或目录)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1402)
at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1208)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1220)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:273)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:279)
at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:131)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:303)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.io.FileNotFoundException: /home/admin/cassandra/data/forseti/velocity_app/forseti-velocity_app-jb-1014038-Data.db (没有那个文件或目录)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
... 17 more

After one more restart (this time after waiting a while rather than restarting immediately), compaction worked again.

