
feat(electric-telemetry): add process_subtype attribute for supervisor/erlang/logger_olp granularity #4397

Open
erik-the-implementer wants to merge 2 commits into main from erik/process-subtype-attribute


Conversation

@erik-the-implementer
Contributor

Summary

Adds a new low-cardinality process_subtype attribute alongside the existing process_type on all electric-telemetry events that today carry it: vm.monitor.long_gc, vm.monitor.long_schedule, vm.monitor.long_message_queue, process.memory, process.bin_memory.

For the three coarse process_type buckets that hide the most signal during overload (per recent investigations into long-mailbox spikes), process_subtype is derived from cheap process introspection:

  • process_type = "supervisor" → registered name; else first atom in $ancestors; else nil.
  • process_type = "erlang" → registered name (catches named VM helpers like :erts_dirty_process_signal_handler); else initial_call MFA string (e.g. ":erlang.apply/2").
  • process_type = "logger_olp" → registered name (the handler id — default, otel_log_handler, logger_proxy, …).

For all other process_type values, process_subtype is nil.
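The three fallback chains above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the module name `SubtypeSketch` and its helper names are hypothetical, the input is assumed to be a keyword list shaped like the result of `Process.info/2`, and "first atom in $ancestors" is read here as "the first non-nil atom in the list".

```elixir
defmodule SubtypeSketch do
  # Hypothetical module: mirrors the fallback chains from the PR description.
  # `info` is assumed to look like the keyword list Process.info/2 returns.

  def subtype(:supervisor, info), do: registered_name(info) || first_ancestor_atom(info)
  def subtype(:erlang, info), do: registered_name(info) || initial_call_string(info)
  def subtype(:logger_olp, info), do: registered_name(info)
  # Every other process_type gets no subtype.
  def subtype(_other, _info), do: nil

  # Unregistered processes report [] for :registered_name.
  defp registered_name(info) do
    case Keyword.get(info, :registered_name) do
      name when is_atom(name) and not is_nil(name) -> Atom.to_string(name)
      _ -> nil
    end
  end

  # $ancestors lives in the process dictionary and may hold pids or atoms.
  defp first_ancestor_atom(info) do
    dictionary = Keyword.get(info, :dictionary, [])

    ancestors =
      case List.keyfind(dictionary, :"$ancestors", 0) do
        {_key, list} when is_list(list) -> list
        _ -> []
      end

    case Enum.find(ancestors, &(is_atom(&1) and not is_nil(&1))) do
      nil -> nil
      name -> Atom.to_string(name)
    end
  end

  defp initial_call_string(info) do
    case Keyword.get(info, :initial_call) do
      {m, f, a} -> Exception.format_mfa(m, f, a)
      _ -> nil
    end
  end
end
```

Because every returned string comes from an atom that already exists in the VM, a sketch like this creates no new atoms.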

The change is purely additive: process_type values are unchanged, so existing Honeycomb boards and alerts that group by process_type continue to work. process_subtype gives a drill-down dimension without exploding cardinality (registered names + MFAs only; no pids, no dynamic registry tuples).

Implementation notes

  • New ElectricTelemetry.Processes.proc_type_and_subtype/1 returns {type, subtype} in a single Process.info/2 call; proc_subtype/1 is also exported for callers that only want the subtype.
  • Process.info/2 now also fetches :registered_name (one extra key per call).
  • sorted_groups/2 groups by {type, subtype} so the process.memory / process.bin_memory metrics break down by subtype as well.
  • :process_subtype is added to the tags: lists of the affected last_value, sum, and distribution metric definitions.
  • proc_type/1 is kept unchanged for backward compatibility.
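To illustrate what grouping by `{type, subtype}` means for the memory metrics, here is a minimal sketch. The module `GroupingSketch`, the function `top_groups/2`, and the input shape (one map per process with `:type`, `:subtype`, and `:memory` keys) are assumptions made for this example; the real `sorted_groups/2` may differ in signature and output.

```elixir
defmodule GroupingSketch do
  # Hypothetical helper: sums memory per {type, subtype} pair and returns
  # the `count` largest groups, biggest first.
  def top_groups(procs, count) do
    procs
    |> Enum.group_by(&{&1.type, &1.subtype}, & &1.memory)
    |> Enum.map(fn {{type, subtype}, memories} ->
      %{type: type, subtype: subtype, memory: Enum.sum(memories)}
    end)
    |> Enum.sort_by(& &1.memory, :desc)
    |> Enum.take(count)
  end
end
```

With a compound key, two supervisors with different subtypes land in separate groups, while processes whose subtype is `nil` still aggregate under their type alone.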

Test plan

  • New unit tests for each of the three subtype-bucket derivations and their fallback paths (packages/electric-telemetry/test/electric/telemetry/processes_test.exs).
  • Existing proc_type/1 tests unchanged and still pass (121/121 tests pass in electric-telemetry).
  • Changeset added: @core/electric-telemetry: minor.

🤖 Generated with Claude Code

Adds a new low-cardinality `process_subtype` attribute alongside the
existing `process_type` on all telemetry events that today carry it
(`vm.monitor.long_{gc,schedule,message_queue}`, `process.memory`,
`process.bin_memory`).

For the three coarse `process_type` buckets that previously hid most
of the signal during overload, `process_subtype` is derived as:

  * `:supervisor`  -> registered name, else first atom in $ancestors
  * `:erlang`      -> registered name, else initial_call MFA string
  * `:logger_olp`  -> registered name (handler id)

For every other `process_type` value, `process_subtype` is `nil`.
The existing `process_type` taxonomy is unchanged, so Honeycomb boards
and alerts that group by it continue to work; `process_subtype` adds
a finer-grained drill-down without exploding cardinality.

Refs electric-sql/alco-agent-tasks#46.
@codecov

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 58.13953% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.66%. Comparing base (8803b36) to head (1bd5c76).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...telemetry/lib/electric/telemetry/system_monitor.ex | 0.00% | 12 Missing ⚠️ |
| ...tric-telemetry/lib/electric/telemetry/processes.ex | 86.20% | 4 Missing ⚠️ |
| ...ry/lib/electric/telemetry/application_telemetry.ex | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4397      +/-   ##
==========================================
+ Coverage   59.46%   59.66%   +0.20%     
==========================================
  Files         304      319      +15     
  Lines       30626    31224     +598     
  Branches     8335     8334       -1     
==========================================
+ Hits        18211    18630     +419     
- Misses      12397    12576     +179     
  Partials       18       18              
| Flag | Coverage Δ | |
|---|---|---|
| electric-telemetry | 70.06% <58.13%> | (?) |
| elixir | 70.06% <58.13%> | (?) |
| packages/agents | 67.52% <ø> | (ø) |
| packages/agents-mcp | 77.54% <ø> | (ø) |
| packages/agents-runtime | 80.67% <ø> | (ø) |
| packages/agents-server | 74.44% <ø> | (ø) |
| packages/agents-server-ui | 6.21% <ø> | (ø) |
| packages/electric-ax | 43.81% <ø> | (ø) |
| packages/experimental | 87.73% <ø> | (ø) |
| packages/react-hooks | 86.48% <ø> | (ø) |
| packages/start | 82.83% <ø> | (ø) |
| packages/typescript-client | 94.39% <ø> | (ø) |
| packages/y-electric | 56.05% <ø> | (ø) |
| typescript | 59.46% <ø> | (ø) |
| unit-tests | 59.66% <58.13%> | (+0.20%) ⬆️ |


@claude

claude Bot commented May 22, 2026

Claude Code Review

Summary

Adds a process_subtype attribute alongside the existing process_type on the five process telemetry events, giving low-cardinality drill-down for the three coarse buckets (supervisor, erlang, logger_olp) that historically hide the most signal during overload. The change is purely additive — existing process_type values are unchanged — and well-scoped to a single package.

What's Working Well

  • Genuinely additive. Existing dashboards/alerts that group on process_type keep working; the only new dimension is process_subtype, which is nil for the untouched buckets. Backward-compatibility risk is very low.
  • Single Process.info/2 per pid. proc_type_and_subtype/1 reuses one info call for both derivations and sorted_groups/2 continues to call info/1 once per pid via type_and_memory/1 — no extra round-trips.
  • Nil-safe defensive guards. registered_name_string/1 explicitly handles [], nil, atoms, and unexpected shapes. ancestor_atom_string/1 correctly guards not is_nil(name) to avoid the is_atom(nil) == true trap that would have stringified nil to "nil".
  • No atom-table risk. Subtype values are always Atom.to_string/1 of an already-existing atom (registered name, ancestor name, MFA components from initial_call) — no String.to_atom on external input.
  • Test coverage matches the spec. Each of the three subtype buckets has a positive case and a fallback/nil case, plus a non-bucketed control. The new derivation logic is well-exercised.
  • Changeset present (@core/electric-telemetry: minor).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None blocking.

Suggestions (Nice to Have)

1. proc_subtype/1 invites accidental double Process.info/2 calls

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:58-61

proc_subtype(pid) does its own info(pid) lookup. A caller that wants both type and subtype and naively uses proc_type/1 + proc_subtype/1 will end up doing two Process.info/2 calls, defeating the point of the combined proc_type_and_subtype/1 function. Consider either marking proc_subtype/1 with @doc false (it's not used anywhere outside tests in this PR) or implementing it as proc_type_and_subtype(pid) |> elem(1) to make the relationship explicit and prevent drift. Minor — current code is correct.
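The delegation form suggested here could look roughly like this. The module `DelegationSketch` and its placeholder classification (`:named` / `:unknown`) are hypothetical stand-ins; only the shape of `proc_subtype/1` delegating to the combined function is the point.

```elixir
defmodule DelegationSketch do
  # Hypothetical stand-in for the combined lookup: a single Process.info/2
  # call feeds both values. The real classification logic is not reproduced.
  def proc_type_and_subtype(pid) do
    case Process.info(pid, [:registered_name]) do
      nil -> {:dead, nil}
      [registered_name: name] when is_atom(name) -> {:named, Atom.to_string(name)}
      _ -> {:unknown, nil}
    end
  end

  # Delegating keeps the two functions from drifting apart.
  def proc_subtype(pid), do: pid |> proc_type_and_subtype() |> elem(1)
end
```

A caller that needs both values would still call `proc_type_and_subtype/1` once rather than pairing `proc_type/1` with `proc_subtype/1`.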

2. Stale subtype on :recheck_message_queues

File: packages/electric-telemetry/lib/electric/telemetry/system_monitor.ex:104-110, 128-131

{type, subtype} is captured when the long-mailbox event fires and then re-emitted on every recheck tick. If the process is renamed/relabelled in the meantime, the subtype becomes stale. This is consistent with how type was already handled (so not a regression), but worth noting because subtypes are more likely to drift than coarse types (e.g. a registered name being set after spawn). Re-resolving via proc_type_and_subtype(pid) inside log_long_message_queue_event/3 would cost one Process.info/2 per tracked pid per 200ms — probably fine given the small set, and arguably more accurate. Up to you.

3. :erlang subtype fallback is often the literal ":erlang.apply/2"

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:196-197

For any unnamed spawn/spawn_link(fn -> ... end) the subtype will collapse to the same literal ":erlang.apply/2" string (verified by the test at line 168). This is still strictly more information than the bucket alone (you can now tell named-:erlang processes apart from anonymous spawns), but if you're hoping to identify the actual call site, raw spawn does not carry it in initial_call. Worth setting expectations in the changeset that the :erlang drill-down is most useful when the process is named or proc_lib-spawned.
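The collapse can be reproduced in a few lines. This snippet is a standalone illustration using only `spawn/1`, `Process.info/2`, and `Exception.format_mfa/3`; the variable names are arbitrary.

```elixir
# Spawn an anonymous function and keep it alive long enough to inspect it.
pid = spawn(fn -> receive do :stop -> :ok end end)

# A raw spawn records :erlang.apply/2 as the initial call, not the closure.
{:initial_call, {mod, fun, arity}} = Process.info(pid, :initial_call)
label = Exception.format_mfa(mod, fun, arity)

send(pid, :stop)
```

Any other closure started with a bare `spawn/1` yields the same `label`, which is why the fallback cannot identify the call site.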

4. Cardinality note for closure MFAs

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:222-238

Exception.format_mfa/3 on the initial call is bounded by the code (one entry per distinct spawned function), so cardinality is fine, but the MFA for an Elixir closure can look like "MyModule.\"-some_fun/0-fun-0-\"/0", which is a compiler-generated name. That string is stable across processes spawning the same closure, just visually noisy in Honeycomb. Not a bug; just a heads-up.

5. No test for proc_type_and_subtype/1 on a dead pid

File: packages/electric-telemetry/test/electric/telemetry/processes_test.exs

proc_type/1 is tested for dead processes (returns :dead). The new combined function falls through to the catch-all nil for any non-subtyped type (including :dead and :unknown), so this is implicitly covered, but an explicit assertion that proc_type_and_subtype(dead_pid) == {:dead, nil} would lock in the contract for telemetry code that runs on hot paths.

6. Reporter behavior with nil tag values

File: packages/electric-telemetry/lib/electric/telemetry/application_telemetry.ex:91-130

Worth a manual sanity check: how do the configured reporters (Reporters.Otel, Reporters.Prometheus, Reporters.Statsd) serialise a nil tag value? Some encode it as the literal string "nil", others drop the tag, others emit an empty-string label. The behavior is consistent across the new metrics either way, but you may want to verify it matches what your existing Honeycomb / Prometheus queries expect before relying on process_subtype = nil as a filter predicate.

Issue Conformance

No linked issue. PR description references "recent investigations into long-mailbox spikes" but no tracking ticket. Per project guidelines, PRs should reference an issue (even a short one capturing the original observability gap) — flagging as a minor process note, not a blocker. The PR description itself is detailed and adequate as a self-contained spec.

Previous Review Status

First review — no prior context.


Review iteration: 1 | 2026-05-22

@claude

claude Bot commented May 22, 2026

Claude Code Review

Summary

Adds a new process_subtype attribute alongside process_type on five existing telemetry events, deriving a stable low-cardinality string for the three coarse buckets (:supervisor, :erlang, :logger_olp) that previously hid the most signal during overload. The change is purely additive and the existing proc_type/1 API/values are unchanged.

What's Working Well

  • Clean, well-scoped change: one new public API (proc_type_and_subtype/1) plus a small convenience (proc_subtype/1), no behavioural changes to existing callers.
  • A single Process.info/2 call carries the work for both type and subtype — no extra round-trip per process in hot paths.
  • The fallback chain per bucket (registered_name → $ancestors → initial_call MFA) is explicit, documented inline, and matches the rationale in the PR description.
  • Module attribute @subtyped_types makes it obvious which buckets get a subtype and is the single point of extension.
  • Tests cover the three buckets, both the happy path (named) and the fallback paths ($ancestors, MFA), plus the "no useful subtype" cases.
  • Atom.to_string/1 is only used on atoms read out of the process dictionary / VM (registered name, $ancestors head, initial_call module) — no atoms ever created from external input.
  • Changeset file present and accurate (@core/electric-telemetry: minor).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

Missing linked issue

Per project review convention (Phase 2.7), PRs should reference the issue they address. The PR description mentions "recent investigations into long-mailbox spikes" but doesn't link an issue. Worth adding a tracking issue (or linking the relevant incident/investigation) so future readers can find the motivation.

Suggestions (Nice to Have)

Test gap: dead process behaviour

File: packages/electric-telemetry/test/electric/telemetry/processes_test.exs

Existing tests cover proc_type/1 returning :dead for an exited process, but the new proc_type_and_subtype/1 / proc_subtype/1 functions aren't exercised for that case. The implementation handles it correctly (info(pid) returns nil, Access on nil returns nil for every lookup, and :dead is not in @subtyped_types), but the contract — {:dead, nil} and nil respectively — is worth pinning down with a test so a future refactor of info/1 (e.g. returning something other than nil) doesn't silently change the answer.
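The mechanism behind the dead-pid case can be checked in isolation. This is a standalone sketch using only standard library calls; it does not touch the package's `info/1` helper, whose exact shape is not shown in this PR page.

```elixir
# Start a process that exits immediately and wait until it is gone.
pid = spawn(fn -> :ok end)
ref = Process.monitor(pid)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
end

# Process.info/2 returns nil for a dead pid...
info = Process.info(pid, [:registered_name, :initial_call])

# ...and Access on nil returns nil for any key, so every lookup falls through.
registered = info[:registered_name]
initial_call = info[:initial_call]
```

An explicit test on `proc_type_and_subtype/1` would pin the resulting `{:dead, nil}` contract instead of relying on this fall-through.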

registered_name_string/1 clause ordering

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:206-213

defp registered_name_string(info) do
  case info[:registered_name] do
    [] -> nil
    nil -> nil
    name when is_atom(name) -> Atom.to_string(name)
    _ -> nil
  end
end

The clause order is fine in practice, but is_atom(nil) is true, so the ordering is load-bearing — drop the nil -> nil clause and Atom.to_string(nil) returns "nil" instead of nil. A slightly more defensive form like name when is_atom(name) and not is_nil(name) -> Atom.to_string(name) would make the case self-contained.
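The difference between the two guard styles can be shown side by side. `NilGuardSketch` and its function names are hypothetical and exist only for this comparison.

```elixir
defmodule NilGuardSketch do
  # Without the nil check, nil matches is_atom/1 and gets stringified.
  def unguarded(name) when is_atom(name), do: Atom.to_string(name)
  def unguarded(_other), do: nil

  # Self-contained guard: correct regardless of clause order.
  def guarded(name) when is_atom(name) and not is_nil(name), do: Atom.to_string(name)
  def guarded(_other), do: nil
end
```

With the combined guard, removing or reordering a separate `nil -> nil` clause can no longer change the result.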

proc_type_and_subtype/1 spec is very loose

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:37

@spec proc_type_and_subtype(pid()) :: {term(), binary() | nil}

term() is correct but uninformative — in practice the type is atom() | binary() (atoms cover :dead, :unknown, module atoms, and atom labels; binaries cover the string-label case). Tightening to atom() | binary() would document the contract better and help dialyzer downstream. Same applies to the existing (unspecced) proc_type/1 if you want to add a spec while in the area.

process_subtype: nil on tagged metrics

Files: packages/electric-telemetry/lib/electric/telemetry/system_monitor.ex:59-61, plus implicit nil from proc_type_and_subtype/1 for non-bucketed processes.

Adding :process_subtype to the tags: list of process.memory, process.bin_memory, vm.monitor.long_* metrics means every emitted data point now carries a process_subtype label whose value is sometimes nil. The three exporters (telemetry_metrics_prometheus_core, telemetry_metrics_statsd, otel_metric_exporter) each have their own conventions for nil label values, typically empty string or the literal "nil". Worth a quick sanity check against the actual Honeycomb / Prometheus / StatsD pipeline before relying on the "purely additive, existing boards keep working" claim, especially for any alert that filters on process_type=<bucket>. (No code change needed if the rendering is acceptable; otherwise consider mapping nil → "_none" or similar at the :telemetry.execute boundary so the label is always a string.)
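If the nil rendering turns out to be a problem, the mapping mentioned in the parenthetical could be as small as the sketch below. `TagSketch`, `normalize/1`, and the `"_none"` placeholder are assumptions for illustration; where exactly this would be called before `:telemetry.execute` is left open.

```elixir
defmodule TagSketch do
  @none "_none"

  # Ensures :process_subtype is always a string in the event metadata,
  # whether the key is missing or explicitly nil.
  def normalize(metadata) when is_map(metadata) do
    Map.update(metadata, :process_subtype, @none, fn
      nil -> @none
      value -> value
    end)
  end
end
```

The trade-off is that queries would then filter on `process_subtype = "_none"` rather than on an absent or empty label.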

Doc nit: proc_subtype/1 recomputes the type

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:44-61

The docstring for proc_subtype/1 doesn't mention that it does its own Process.info/2 call and recomputes the type internally — callers who want both should reach for proc_type_and_subtype/1. A one-line "Prefer proc_type_and_subtype/1 if you need both type and subtype" in the docstring would steer people to the cheaper path.

Issue Conformance

No linked issue. PR description is detailed and self-contained, so the implementation is easy to evaluate on its own merits. The implementation matches the description (additive attribute on the exact five events listed, three buckets subtyped, others nil). No scope creep.

Previous Review Status

First review iteration — no prior context to compare against.


Review iteration: 1 | 2026-05-22
