
vRLI cluster unresponsive as / partition is full on one node due to multiple .hints files

Recently we've seen a situation where the root partition was full on a vRLI appliance.

The appliance was part of a three-node vRLI cluster.

When this issue occurs, the cassandra service on the affected node gets into a hung state, and the problem then starts impacting the other nodes in the cluster as well.

cassandra.log shows the service becoming unresponsive due to a space issue on the root partition:

INFO  [HANDSHAKE-XXXXXXX] 2020-03-04 10:47:57,384 - Handshaking version with XXXXXXX
INFO  [RequestResponseStage-3] 2020-03-04 10:47:57,400 - InetAddress /ZZZZZZZ is now UP
INFO  [GossipStage:1] 2020-03-04 10:47:58,379 - Node /ZZZZZZZ state jump to NORMAL
ERROR [HintsWriteExecutor:1] 2020-03-04 10:48:24,194 - Exception in thread Thread[HintsWriteExecutor:1,5,main] No space left on device
        at org.apache.cassandra.hints.HintsWriteExecutor.flushInternal( ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.hints.HintsWriteExecutor.flush( ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.hints.HintsWriteExecutor.lambda$flush$1( ~[apache-cassandra-3.11.2.jar:3.11.2]
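
To spot this signature quickly, a small filter over cassandra.log can help. The following is only a minimal sketch: the function name find_disk_full_errors is made up for illustration, and the log path in the usage comment is a placeholder, since the location of cassandra.log is not given here.

```python
from typing import Iterable, List


def find_disk_full_errors(lines: Iterable[str]) -> List[str]:
    """Return the ERROR lines that report a full disk."""
    needle = "No space left on device"
    return [
        line.rstrip("\n")
        for line in lines
        if line.startswith("ERROR") and needle in line
    ]


# Usage (the path is a placeholder, adjust it to your appliance):
# with open("/path/to/cassandra.log", errors="replace") as fh:
#     for hit in find_disk_full_errors(fh):
#         print(hit)
```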

The root partition was filled by a .hprof file, along with multiple .hints and .crc32 files being created in the /usr/lib/loginsight/application/lib/apache-cassandra-*/data/hints directory.
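
To see how much space these files take, something like the following could be run on the affected node. It is a rough sketch that only reads file sizes. The glob pattern is the hints directory mentioned above; the function name summarize_hints is made up for illustration.

```python
import glob
import os
from collections import defaultdict

# Hints directory as described above; the wildcard covers the Cassandra version.
HINTS_GLOB = "/usr/lib/loginsight/application/lib/apache-cassandra-*/data/hints"


def summarize_hints(pattern=HINTS_GLOB):
    """Return {extension: (file_count, total_bytes)} for the hints directories."""
    totals = defaultdict(lambda: [0, 0])
    for directory in glob.glob(pattern):
        if not os.path.isdir(directory):
            continue
        for entry in os.scandir(directory):
            if entry.is_file():
                ext = os.path.splitext(entry.name)[1] or "(none)"
                totals[ext][0] += 1
                totals[ext][1] += entry.stat().st_size
    return {ext: (count, size) for ext, (count, size) in totals.items()}


if __name__ == "__main__":
    for ext, (count, size) in sorted(summarize_hints().items()):
        print(f"{ext}: {count} files, {size / 1024 ** 2:.1f} MiB")
```

A large and growing number of .hints files in this summary would match the situation described here.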

Background on hints

Hints are one of three ways to support consistency in the system. When a replica node is not available, the coordinator stores the mutation data in temporary hint files so that it can be applied once the replica is available again.

For details look here -

Ideally, in all vRLI deployments, hints are configured to be deleted after the default 3 hours. In some environments, however, this does not work and the hint files appear to stay there indefinitely.

Repair also runs automatically, which is an additional way to support consistency in the system.

Manual deletion is the solution in this situation.
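
The exact deletion procedure is not spelled out here, so the following is only a hedged sketch of one way to approach it. It lists .hints and .crc32 files older than the 3-hour default mentioned above and removes them only when explicitly asked to. The function names are made up for illustration, and whether the cassandra service should be stopped before removing the files is an assumption to verify for your environment.

```python
import glob
import os
import time

# Hints directory as described above; the wildcard covers the Cassandra version.
HINTS_GLOB = "/usr/lib/loginsight/application/lib/apache-cassandra-*/data/hints"
MAX_AGE_SECONDS = 3 * 60 * 60  # the 3-hour default mentioned above


def stale_hint_files(pattern=HINTS_GLOB, max_age=MAX_AGE_SECONDS, now=None):
    """Return sorted paths of .hints/.crc32 files older than max_age seconds."""
    now = time.time() if now is None else now
    stale = []
    for directory in glob.glob(pattern):
        for ext in ("*.hints", "*.crc32"):
            for path in glob.glob(os.path.join(glob.escape(directory), ext)):
                if now - os.path.getmtime(path) > max_age:
                    stale.append(path)
    return sorted(stale)


def delete_files(paths, dry_run=True):
    """Print each path; remove it only when dry_run is False."""
    for path in paths:
        print(("would remove " if dry_run else "removing ") + path)
        if not dry_run:
            os.remove(path)


if __name__ == "__main__":
    # Dry run by default: nothing is deleted until dry_run=False is passed.
    delete_files(stale_hint_files())
```

Running it as a dry run first and reviewing the list before passing dry_run=False keeps the deletion deliberate.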

This is a bug and will be addressed in upcoming releases of vRLI.
