Description:
We are testing Apache Ignite 3.1.0 using a 3-node cluster under heavy JMeter load (~3 million records).
Environment:
Ignite 3.1.0
Java 17
3 nodes
Xms/Xmx = 16GB
G1GC
Problem:
Only node3 experiences severe JVM heap pressure while node1/node2 remain relatively stable.
Observed JVM Usage:
node1 → ~50-60%
node2 → ~60-70%
node3 → ~99.96% Old Gen
CPU usage is also high on node3.
GC Statistics on node3:
Full GC Count = 2189
Full GC Time = 16220 sec
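For scale, the two figures above work out to roughly 7.4 seconds per Full GC on average (16220 s / 2189). The sketch below does that arithmetic and also prints the same counters for whatever JVM it runs in, using the standard java.lang.management API. The class name GcPauseSummary and the helper averagePauseSeconds are made up for illustration; under G1 the full collections are typically reported by the bean named "G1 Old Generation", but that name is an assumption worth checking against your own output.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseSummary {
    /** Average pause per collection in seconds; 0 if no collections were counted. */
    static double averagePauseSeconds(long collectionCount, double totalSeconds) {
        return collectionCount <= 0 ? 0.0 : totalSeconds / collectionCount;
    }

    public static void main(String[] args) {
        // Figures reported for node3 in this post.
        System.out.printf("node3 average Full GC pause: %.2f s%n",
                averagePauseSeconds(2189, 16220));

        // Live counters for the JVM running this code (not the Ignite node itself).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();   // -1 if undefined for this collector
            long millis = gc.getCollectionTime();   // accumulated time in milliseconds
            System.out.printf("%s: count=%d, time=%d ms, avg=%.3f s%n",
                    gc.getName(), count, millis,
                    averagePauseSeconds(count, millis / 1000.0));
        }
    }
}
```

To read the counters of the Ignite node itself, the same beans would have to be queried over JMX or from code running inside that JVM.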
Errors Observed:
1. IGN-TX-4 Failed to acquire a lock due to a possible deadlock
2. Replication is timed out
3. SYSTEM_WORKER_BLOCKED
   Example:
   A critical thread is blocked for 11772 ms:
   node3-network-worker-5
4. JRaft PreVote timeout / unsuccessful election rounds
Heap Histogram Findings on node3:
Very large accumulation of:
CompletableFuture (~36 million)
PartitionReplicaListener$OperationId (~36 million)
TxCleanupReadyFutureList (~36 million)
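In case it helps anyone reproduce the counts above, here is a minimal sketch of how a class histogram can be filtered for classes with very large instance counts. It assumes the input is a text file produced by `jmap -histo:live <pid>` and that each data row has the layout "rank: instances bytes class-name (module)"; the class name HistoFilter, the Entry record, and the 1,000,000 threshold are all illustrative choices, not anything from Ignite.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HistoFilter {
    /** One histogram row: class name, live instance count, shallow bytes. */
    record Entry(String className, long instances, long bytes) {}

    // Assumed row layout: "<rank>: <instances> <bytes> <class name> [(module)]".
    // Header and "Total" lines do not match and are skipped.
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+).*$");

    /** Returns rows whose instance count is at least minInstances, in file order. */
    static List<Entry> parse(List<String> lines, long minInstances) {
        List<Entry> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (!m.matches()) {
                continue;
            }
            long instances = Long.parseLong(m.group(1));
            if (instances >= minInstances) {
                out.add(new Entry(m.group(3), instances, Long.parseLong(m.group(2))));
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the saved histogram text file.
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        for (Entry e : parse(lines, 1_000_000)) {
            System.out.printf("%,d instances, %,d bytes: %s%n",
                    e.instances(), e.bytes(), e.className());
        }
    }
}
```

Note that the `:live` option forces a full collection before the histogram is taken, so on a node that is already stuck in Full GC it adds to the pause time.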
We also observed very high Raft activity for Zone 20 partitions (13k+ events in the logs).
Example workload:
requestType=RW_GET_ALL
primaryKeys.size=39
Questions:
Is this expected under a heavy concurrent RW_GET_ALL workload?
Could this indicate a replication/transaction cleanup backlog or a future accumulation issue?
Are there recommended tuning settings for this workload pattern?
Has this behavior improved in newer Ignite 3 versions?
I can provide:
GC logs
heap histogram
thread dumps
JVM graphs
additional logs if needed.