5 changes: 5 additions & 0 deletions packages/activemq_otel/changelog.yml
@@ -1,4 +1,9 @@
# newer versions go on top
- version: "0.2.0"
  changes:
    - description: Add description and artifact fields to alerting rule template.
      type: enhancement
      link: https://github.com/elastic/integrations/pull/18506
- version: "0.1.0"
  changes:
    - description: Initial draft of the package
@@ -3,6 +3,7 @@
"type": "alerting_rule_template",
"managed": true,
"attributes": {
"description": "Alerts when producer sends are blocked by broker memory pressure (`activemq.queue.blocked.sends` rate > 0). Any non-zero value means the broker is already out of memory headroom and is back-pressuring producers.",
"name": "[ActiveMQ OTel] Blocked sends detected",
"ruleTypeId": ".es-query",
"tags": [
@@ -16,13 +17,31 @@
"alertDelay": {
"active": 1
},
"artifacts": {
"dashboards": [
{
"id": "activemq_otel-broker-health"
},
{
"id": "activemq_otel-destinations"
},
{
"id": "activemq_otel-overview"
}
],
"investigation_guide": {
"blob": "## ActiveMQ blocked sends detected\n\n### What fired\nThe counter `activemq.queue.blocked.sends` registered a non-zero rate on at least one destination. Producers are being blocked while the broker waits for memory to free up.\n\n### Why it matters\nBlocked sends are the canary for broker memory exhaustion. When memory is saturated, producers stall instead of pushing new messages through; enqueue latency spikes and throughput drops. Left alone, the broker may stop accepting new persistent messages entirely.\n\n### Triage\n1. Identify the affected destinations and brokers from the alert context.\n2. On the Broker Health dashboard, check `activemq.memory.utilization` \u2014 it is almost certainly above 0.85.\n3. Compare enqueue vs dequeue rates (`activemq.message.enqueued` / `activemq.message.dequeued`) on the hot destinations to confirm consumer lag.\n4. Check `activemq.consumer.count` on affected queues \u2014 zero or insufficient consumers is a common root cause.\n5. Inspect JVM state (`activemq.jvm.memory.heap.used`, GC activity) \u2014 broker under JVM pressure will drag memory signals with it.\n\n### Remediation\n- Scale consumers or fix slow consumer code paths for the backed-up destinations.\n- Raise broker memory limit if the workload has legitimately grown.\n- Drain the DLQ or pressure-relief queues if they are consuming memory budget.\n\n### Tuning\n- The rule fires on *any* blocked send rate > 0, which matches the documented \"zero is the only healthy value\" stance. If your cluster tolerates brief bursts, require `blocked_rate > N` over a longer window.\n"
}
},
"params": {
"searchType": "esqlQuery",
"esqlQuery": {
-"esql": "TS metrics-activemq.otel-*\n// activemq.queue.blocked.sends is counter_long; dimension: activemq.destination.name\n| WHERE activemq.queue.blocked.sends IS NOT NULL\n| STATS blocked_rate = SUM(RATE(activemq.queue.blocked.sends)) BY activemq.broker.name, activemq.destination.name\n// Any blocked sends indicate memory pressure critical signal\n| WHERE blocked_rate > 0\n| SORT blocked_rate DESC\n| LIMIT 10"
+"esql": "TS metrics-activemq.otel-*\n// activemq.queue.blocked.sends is counter_long; dimension: activemq.destination.name\n| WHERE activemq.queue.blocked.sends IS NOT NULL\n| STATS blocked_rate = SUM(RATE(activemq.queue.blocked.sends)) BY activemq.broker.name, activemq.destination.name\n// Any blocked sends indicate memory pressure \u2014 critical signal\n| WHERE blocked_rate > 0\n| SORT blocked_rate DESC\n| LIMIT 10"
},
"size": 0,
-"threshold": [0],
+"threshold": [
+0
+],
"thresholdComparator": ">",
"timeField": "@timestamp",
"timeWindowSize": 10,
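The blocked-sends rule sums `RATE(activemq.queue.blocked.sends)` per broker and destination and fires on any value above zero. As a rough illustration of what a `RATE` over a monotonic counter computes (a minimal Python sketch, not the Elasticsearch implementation; the function name and sample values are hypothetical):

```python
def per_second_rate(samples):
    """Approximate a RATE() over a monotonic counter: total increase
    divided by the time span, in seconds. `samples` is a list of
    (timestamp_seconds, counter_value) pairs, oldest first."""
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    increase = max(v1 - v0, 0)  # counters only go up; guard against resets
    span = t1 - t0
    return increase / span if span > 0 else 0.0

# Hypothetical samples over a 60 s window: 4 blocked sends registered.
samples = [(0, 10), (30, 10), (60, 14)]
rate = per_second_rate(samples)
assert rate > 0  # any non-zero rate trips the alert
```

Real counter-rate functions also compensate for counter resets segment by segment; this sketch only clamps a negative delta to zero.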
@@ -3,6 +3,7 @@
"type": "alerting_rule_template",
"managed": true,
"attributes": {
"description": "Alerts when a broker's memory utilization sustains above 85% (`activemq.memory.utilization > 0.85`). At this level producers are likely to be blocked and the broker is one incident away from refusing writes.",
"name": "[ActiveMQ OTel] Broker memory utilization high",
"ruleTypeId": ".es-query",
"tags": [
@@ -16,13 +17,28 @@
"alertDelay": {
"active": 2
},
"artifacts": {
"dashboards": [
{
"id": "activemq_otel-broker-health"
},
{
"id": "activemq_otel-overview"
}
],
"investigation_guide": {
"blob": "## ActiveMQ broker memory utilization high\n\n### What fired\n`activemq.memory.utilization` averaged above 0.85 over the evaluation window.\n\n### Why it matters\nActiveMQ reserves a bounded pool of JVM heap for buffering in-flight messages. As the pool fills, the broker applies back-pressure \u2014 first by slowing producers, then by blocking sends outright. Approaching 100% puts the broker in a near-unavailable state.\n\n### Triage\n1. Overview dashboard: confirm which broker(s) are affected.\n2. Broker Health dashboard: inspect `activemq.memory.usage` vs `activemq.memory.limit` to see absolute pressure.\n3. Correlate with `activemq.queue.blocked.sends` \u2014 if > 0 the broker is already blocking producers.\n4. Check consumer counts on busy destinations; consumer starvation is the usual root cause.\n5. Review JVM heap (`activemq.jvm.memory.heap.used` vs `.max`) \u2014 a broker under GC pressure drives this metric too.\n\n### Remediation\n- Unblock consumers / add consumer capacity for the backed-up destinations.\n- Raise the broker's memory limit (`systemUsage/memoryUsage/limit`) if the workload has legitimately grown.\n- Investigate message producers bursting into topics/queues without matching consumption capacity.\n\n### Tuning\n- `> 0.85` with a 15-minute window balances responsiveness and noise. Tighten to `> 0.80` for latency-sensitive tiers; loosen to `> 0.90` if your workload sustains high utilization in steady state.\n"
}
},
"params": {
"searchType": "esqlQuery",
"esqlQuery": {
-"esql": "TS metrics-activemq.otel-*\n// Broker memory utilization is a gauge (0–1 ratio)\n| WHERE activemq.memory.utilization IS NOT NULL\n| STATS max_util = MAX(AVG_OVER_TIME(activemq.memory.utilization)) BY activemq.broker.name, host.name\n// Alert when memory utilization exceeds 85%; producers may be blocked\n| WHERE max_util > 0.85\n| SORT max_util DESC\n| LIMIT 10"
+"esql": "TS metrics-activemq.otel-*\n// Broker memory utilization is a gauge (0\u20131 ratio)\n| WHERE activemq.memory.utilization IS NOT NULL\n| STATS max_util = MAX(AVG_OVER_TIME(activemq.memory.utilization)) BY activemq.broker.name, host.name\n// Alert when memory utilization exceeds 85%; producers may be blocked\n| WHERE max_util > 0.85\n| SORT max_util DESC\n| LIMIT 10"
},
"size": 0,
-"threshold": [0],
+"threshold": [
+0
+],
"thresholdComparator": ">",
"timeField": "@timestamp",
"timeWindowSize": 15,
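The memory-utilization rule averages the gauge over the evaluation window (`AVG_OVER_TIME`) before comparing it to 0.85, so a single spike does not fire the alert. A minimal sketch of that windowed comparison (hypothetical sample values; not the Elasticsearch implementation):

```python
def avg_over_time(values):
    """Rough analogue of a windowed average over a gauge's samples."""
    return sum(values) / len(values)

# Hypothetical utilization readings across the 15-minute window.
window = [0.82, 0.88, 0.91, 0.87]
max_util = avg_over_time(window)

# The rule fires only because the windowed average exceeds 0.85;
# the lone 0.82 sample alone would not.
assert max_util > 0.85
```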
@@ -3,6 +3,7 @@
"type": "alerting_rule_template",
"managed": true,
"attributes": {
"description": "Alerts when the `ActiveMQ.DLQ` destination accumulates more than 100 unconsumed messages. Growth in the dead-letter queue means messages are failing delivery \u2014 poison payloads, serialization errors, or downstream failures.",
"name": "[ActiveMQ OTel] Dead letter queue depth high",
"ruleTypeId": ".es-query",
"tags": [
@@ -16,13 +17,31 @@
"alertDelay": {
"active": 2
},
"artifacts": {
"dashboards": [
{
"id": "activemq_otel-destinations"
},
{
"id": "activemq_otel-overview"
},
{
"id": "activemq_otel-broker-health"
}
],
"investigation_guide": {
"blob": "## ActiveMQ dead letter queue depth high\n\n### What fired\nThe gauge `activemq.message.current` for destination `ActiveMQ.DLQ` exceeded 100 in the evaluation window.\n\n### Why it matters\nMessages land in the DLQ after exhausting broker-side retries. DLQ growth is almost always an application-layer signal: a poison message format, a serializer bug, a downstream service outage, or a TTL that is too aggressive for current consumer throughput.\n\n### Triage\n1. Destinations dashboard: confirm the DLQ depth and rate of growth.\n2. Check which source destinations are feeding the DLQ \u2014 usually via broker logs or by inspecting message headers (`OriginalDestination`).\n3. Correlate with `activemq.message.expired` increases and downstream service health.\n4. Sample a few DLQ messages (via JMX/console/Jolokia) to inspect payload and failure reason.\n\n### Remediation\n- Fix the application or consumer that is rejecting messages.\n- Drain the DLQ after the root cause is addressed (reprocess or discard as business rules dictate).\n- If the DLQ is used as a feature for async retries, consider adding a separate queue rather than accumulating in `ActiveMQ.DLQ`.\n\n### Tuning\n- Threshold (`> 100`) and 15-minute window can be lowered for zero-tolerance environments; raise for noisy systems where small DLQ churn is expected.\n"
}
},
"params": {
"searchType": "esqlQuery",
"esqlQuery": {
"esql": "TS metrics-activemq.otel-*\n// activemq.message.current is gauge; destination name in activemq.destination.name\n| WHERE activemq.message.current IS NOT NULL AND activemq.destination.name == \"ActiveMQ.DLQ\"\n| STATS dlq_depth = MAX(LAST_OVER_TIME(activemq.message.current)) BY activemq.broker.name, activemq.destination.name\n// Alert when DLQ accumulates significant messages; adjust threshold for your tolerance\n| WHERE dlq_depth > 100\n| SORT dlq_depth DESC\n| LIMIT 10"
},
"size": 0,
-"threshold": [0],
+"threshold": [
+0
+],
"thresholdComparator": ">",
"timeField": "@timestamp",
"timeWindowSize": 15,
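The DLQ rule reads the most recent value of the `activemq.message.current` gauge in the window (`LAST_OVER_TIME`) rather than an average, since queue depth is a point-in-time fact. A minimal sketch of that last-value selection (hypothetical readings; not the Elasticsearch implementation):

```python
def last_over_time(samples):
    """Rough analogue of taking the most recent gauge reading in the
    window. `samples` is a list of (timestamp, value) pairs, any order."""
    return max(samples, key=lambda s: s[0])[1]

# Hypothetical ActiveMQ.DLQ depth readings: steadily climbing.
dlq_samples = [(100, 40), (160, 85), (220, 130)]
depth = last_over_time(dlq_samples)
assert depth > 100  # latest depth of 130 messages trips the alert
```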
@@ -3,6 +3,7 @@
"type": "alerting_rule_template",
"managed": true,
"attributes": {
"description": "Alerts when the ActiveMQ broker JVM CPU usage sustains above 85% (`activemq.jvm.cpu.process.usage > 0.85`). The broker is CPU-bound and will degrade message latency and throughput.",
"name": "[ActiveMQ OTel] High JVM CPU utilization",
"ruleTypeId": ".es-query",
"tags": [
@@ -16,13 +17,28 @@
"alertDelay": {
"active": 2
},
"artifacts": {
"dashboards": [
{
"id": "activemq_otel-broker-health"
},
{
"id": "activemq_otel-overview"
}
],
"investigation_guide": {
"blob": "## ActiveMQ high JVM CPU utilization\n\n### What fired\n`activemq.jvm.cpu.process.usage` averaged above 0.85 over the evaluation window.\n\n### Why it matters\nA CPU-bound broker cannot drain its destinations fast enough. Expect GC pressure to grow (compounded by high heap use), enqueue latency to spike, and queue depth to accumulate. Recovery without adding capacity is difficult once the broker is pegged.\n\n### Triage\n1. Broker Health dashboard: confirm which brokers and compare against system CPU (`activemq.jvm.cpu.system.usage`).\n2. Check GC activity (`activemq.jvm.gc.collections` / `activemq.jvm.gc.duration` rates) \u2014 frequent old-gen GCs waste CPU.\n3. Look at destination-level traffic \u2014 is one queue producing an outsized dispatch/forward rate?\n4. Inspect thread counts (`activemq.jvm.thread.count`) and file descriptor usage.\n\n### Remediation\n- Scale out brokers / use a network of brokers to distribute load.\n- Tune GC settings or raise heap size if GC is the dominant CPU consumer.\n- Optimise high-volume destinations (batching, compression, selector simplification).\n\n### Tuning\n- `> 0.85` over 15 minutes. Raise to 0.90 for CPU-tight clusters; lower for sensitive tiers.\n"
}
},
"params": {
"searchType": "esqlQuery",
"esqlQuery": {
-"esql": "TS metrics-activemq.otel-*\n// JVM CPU is broker-level metric (ratio 0–1)\n| WHERE activemq.jvm.cpu.process.usage IS NOT NULL\n| STATS avg_cpu = MAX(AVG_OVER_TIME(activemq.jvm.cpu.process.usage)) BY activemq.broker.name, host.name\n// Alert when CPU exceeds 85%; adjust for your baseline\n| WHERE avg_cpu > 0.85\n| SORT avg_cpu DESC\n| LIMIT 10"
+"esql": "TS metrics-activemq.otel-*\n// JVM CPU is broker-level metric (ratio 0\u20131)\n| WHERE activemq.jvm.cpu.process.usage IS NOT NULL\n| STATS avg_cpu = MAX(AVG_OVER_TIME(activemq.jvm.cpu.process.usage)) BY activemq.broker.name, host.name\n// Alert when CPU exceeds 85%; adjust for your baseline\n| WHERE avg_cpu > 0.85\n| SORT avg_cpu DESC\n| LIMIT 10"
},
"size": 0,
-"threshold": [0],
+"threshold": [
+0
+],
"thresholdComparator": ">",
"timeField": "@timestamp",
"timeWindowSize": 15,
@@ -3,6 +3,7 @@
"type": "alerting_rule_template",
"managed": true,
"attributes": {
"description": "Alerts when the broker JVM heap utilization exceeds 85% (`heap.used / heap.max > 0.85`). Sustained high heap triggers GC thrashing and often precedes OutOfMemoryError.",
"name": "[ActiveMQ OTel] High JVM heap utilization",
"ruleTypeId": ".es-query",
"tags": [
@@ -16,13 +17,28 @@
"alertDelay": {
"active": 2
},
"artifacts": {
"dashboards": [
{
"id": "activemq_otel-broker-health"
},
{
"id": "activemq_otel-overview"
}
],
"investigation_guide": {
"blob": "## ActiveMQ JVM heap utilization high\n\n### What fired\n`activemq.jvm.memory.heap.used / activemq.jvm.memory.heap.max` peaked above 0.85 over the evaluation window.\n\n### Why it matters\nWhen the JVM cannot reclaim enough heap, GC pauses lengthen and become more frequent. Message processing stalls during major collections, destination memory fills up, and a full heap eventually crashes the broker with `OutOfMemoryError`.\n\n### Triage\n1. Broker Health dashboard: confirm heap trend vs `.committed` and `.max`.\n2. Correlate with broker memory utilization (`activemq.memory.utilization`) \u2014 these often move together.\n3. Look at GC metrics for a sustained upward step in `activemq.jvm.gc.collections` or `.duration` rates.\n4. Watch `activemq.jvm.thread.count` \u2014 thread leaks can drive heap growth.\n\n### Remediation\n- Raise `-Xmx` if the workload has legitimately grown.\n- Investigate leaks: unbounded DLQ, long-lived subscriptions, custom plugins holding references.\n- Ensure `storeCursor` / `fileCursor` is used for large queues instead of `vmCursor` which keeps everything in memory.\n\n### Tuning\n- Threshold `> 0.85` matches the documented \"warning\" band. Tighten to 0.80 for tight clusters.\n"
}
},
"params": {
"searchType": "esqlQuery",
"esqlQuery": {
"esql": "TS metrics-activemq.otel-*\n// JVM heap metrics are broker-level (no destination dimension)\n| WHERE activemq.jvm.memory.heap.used IS NOT NULL AND activemq.jvm.memory.heap.max IS NOT NULL AND activemq.jvm.memory.heap.max > 0\n| STATS heap_used = MAX(LAST_OVER_TIME(activemq.jvm.memory.heap.used)), heap_max = MAX(LAST_OVER_TIME(activemq.jvm.memory.heap.max)) BY activemq.broker.name, host.name\n| EVAL heap_util = heap_used / heap_max\n// Alert when heap utilization exceeds 85%; adjust threshold for your environment\n| WHERE heap_util > 0.85\n| SORT heap_util DESC\n| LIMIT 10"
},
"size": 0,
-"threshold": [0],
+"threshold": [
+0
+],
"thresholdComparator": ">",
"timeField": "@timestamp",
"timeWindowSize": 15,
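Unlike the other rules, the heap rule derives its ratio in the query itself: it takes the last `heap.used` and `heap.max` readings, then computes `heap_util = heap_used / heap_max`, with a `heap.max > 0` filter guarding the division. A minimal sketch of that computation (hypothetical byte values; not the Elasticsearch implementation):

```python
# Hypothetical last heap readings, in bytes, for one broker JVM.
heap_used = 3.6 * 1024**3  # ~3.6 GiB in use
heap_max = 4.0 * 1024**3   # -Xmx of 4 GiB

# Guard mirrors the query's `heap.max > 0` filter: no division by zero.
heap_util = heap_used / heap_max if heap_max > 0 else 0.0
assert heap_util > 0.85  # 0.9 utilization trips the 85% threshold
```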
@@ -3,6 +3,7 @@
"type": "alerting_rule_template",
"managed": true,
"attributes": {
"description": "Alerts when a non-advisory destination holds more than 1000 unconsumed messages. A deep queue indicates the consumer cohort cannot keep up with producer rate \u2014 the classic consumer-lag signal.",
"name": "[ActiveMQ OTel] Queue depth high",
"ruleTypeId": ".es-query",
"tags": [
@@ -16,13 +17,28 @@
"alertDelay": {
"active": 2
},
"artifacts": {
"dashboards": [
{
"id": "activemq_otel-destinations"
},
{
"id": "activemq_otel-overview"
}
],
"investigation_guide": {
"blob": "## ActiveMQ queue depth high\n\n### What fired\nThe gauge `activemq.message.current` on a user destination (DLQ and advisory destinations excluded) exceeded 1000 over the evaluation window.\n\n### Why it matters\nGrowing queue depth is the canonical signal that consumers are falling behind producers. Effects compound quickly: message wait times rise, broker memory usage climbs, and eventually producers are blocked or messages expire.\n\n### Triage\n1. Destinations dashboard: identify the affected destination(s) and trend.\n2. Check enqueue vs dequeue rates (`activemq.message.enqueued` rate / `activemq.message.dequeued` rate). A persistent gap confirms consumer lag.\n3. Check `activemq.consumer.count` on affected queues \u2014 zero or too few consumers is a common root cause.\n4. Inspect `activemq.queue.message.inflight` for slow/stuck consumers holding unacked messages.\n\n### Remediation\n- Add consumers or fix slow consumer code paths.\n- Shed load on the producer side if consumer capacity cannot grow.\n- For bursty workloads, consider topics with selectors or sharding.\n\n### Tuning\n- `> 1000` is a conservative default; tune to your expected steady state (deep queues are acceptable for batch-style workloads).\n- Lengthen the evaluation window to smooth bursty traffic.\n"
}
},
"params": {
"searchType": "esqlQuery",
"esqlQuery": {
"esql": "TS metrics-activemq.otel-*\n// activemq.message.current is gauge; exclude advisory/system destinations\n| WHERE activemq.message.current IS NOT NULL\n AND activemq.destination.name IS NOT NULL\n AND activemq.destination.name NOT IN (\"ActiveMQ.DLQ\", \"ActiveMQ.Advisory.MasterBroker\", \"ActiveMQ.Advisory.Queue\", \"ActiveMQ.Advisory.Topic\")\n| STATS queue_depth = MAX(LAST_OVER_TIME(activemq.message.current)) BY activemq.broker.name, activemq.destination.name\n// Alert when queue depth exceeds 1000; adjust for your expected throughput\n| WHERE queue_depth > 1000\n| SORT queue_depth DESC\n| LIMIT 10"
},
"size": 0,
-"threshold": [0],
+"threshold": [
+0
+],
"thresholdComparator": ">",
"timeField": "@timestamp",
"timeWindowSize": 15,
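The queue-depth query first excludes the DLQ and advisory destinations with a `NOT IN` list, then keeps only destinations over the threshold and sorts them deepest first. A minimal sketch of that filter-threshold-sort pipeline (hypothetical destination names and depths; not the Elasticsearch implementation):

```python
# Destinations the rule excludes, mirroring the query's NOT IN list.
SYSTEM_DESTINATIONS = {
    "ActiveMQ.DLQ",
    "ActiveMQ.Advisory.MasterBroker",
    "ActiveMQ.Advisory.Queue",
    "ActiveMQ.Advisory.Topic",
}

def deep_queues(depths, threshold=1000):
    """Return (destination, depth) pairs over the threshold, excluding
    DLQ/advisory destinations, deepest first."""
    hits = [(name, depth) for name, depth in depths.items()
            if name not in SYSTEM_DESTINATIONS and depth > threshold]
    return sorted(hits, key=lambda hit: -hit[1])

# The DLQ is deep too, but it is covered by its own dedicated rule.
depths = {"orders": 4200, "ActiveMQ.DLQ": 9000, "billing": 150}
assert deep_queues(depths) == [("orders", 4200)]
```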