Skip to content

Reduce contention in attribute metadata caches#25

Closed
Symmetricity wants to merge 2 commits into
perf/reduce-allocation-churnfrom
perf/pr902-cache-candidates
Closed

Reduce contention in attribute metadata caches#25
Symmetricity wants to merge 2 commits into
perf/reduce-allocation-churnfrom
perf/pr902-cache-candidates

Conversation

@Symmetricity
Copy link
Copy Markdown
Owner

This PR is AI generated.

Summary

This PR is stacked on systemed#902 and should be reviewed after systemed#902.

It reduces lock contention and repeated metadata work in two hot paths used
while writing OpenMapTiles attributes:

  • increase the small thread-local AttributeStore lookup caches from 64 to 256
    entries
  • cache vector-layer metadata updates per worker thread so repeated writes of
    the same layer/key/type do not repeatedly take vectorLayerMetadataMutex

On a warmed Austria OpenMapTiles run stacked on systemed#902, this reduced wall time by
about 4.9%, system CPU by about 43%, voluntary context switches by about 44%,
and peak RSS by about 0.9%.

Background

PR systemed#902 reduces allocation churn in the clipping path. After that change, a
large remaining cost in profiling was synchronization around small repeated
attribute and layer-metadata updates.

AttributeStore already has small thread-local direct-mapped caches to avoid
falling through to the shared sharded stores. This PR keeps that existing
approach, but makes the cache slightly larger so common attribute pairs and
sets survive more collision patterns.

OsmLuaProcessing::setVectorLayerMetadata() records the type of each
layer/key pair for the final vector_layers metadata table. The shared write
still needs a mutex, but most calls repeat the same layer/key/type combination
many times from the same worker thread. This PR skips the mutex path for those
same-thread exact repeats.

Implementation

The AttributeStore change is deliberately small:

  • cachedAttributePairPointers and cachedAttributePairIndexes grow from 64
    to 256 entries per thread
  • cachedAttributeSetPointers and cachedAttributeSetIndexes grow from 64 to
    256 entries per thread
  • the lookup behavior, shard selection, and locking behavior are otherwise
    unchanged

The vector-layer metadata change adds a thread-local cache keyed by layer and
attribute name:

  • the cache is reset if the worker starts using a different LayerDefinition
    instance
  • exact same-thread repeats of the same layer/key/type return before taking
    vectorLayerMetadataMutex
  • first sightings and type changes still update the shared
    layers.layers[layer].attributeMap under the existing mutex

This keeps the shared metadata map as the source of truth and does not remove
the synchronization that protects actual shared writes.

Performance

I compared clean systemed#902 with clean systemed#902 plus only these two cache changes.

Runtime fixture:

fixture: Austria Geofabrik extract
profile: OpenMapTiles profile
output: PMTiles
store: no --store
threads: 8
warmup: PBF plus coastline/landcover sidecar files
method: six alternating warmed runs with /usr/bin/time -v

Combined result:

metric systemed#902 mean systemed#902 + this PR mean delta delta %
wall time 32.795 s 31.177 s -1.618 s -4.93%
user CPU 189.337 s 198.053 s +8.717 s +4.60%
system CPU 41.465 s 23.575 s -17.890 s -43.14%
peak RSS 4,523,051 KiB 4,480,713 KiB -42,337 KiB -0.94%
minor faults 1,223,322 1,222,723 -599 -0.05%
filesystem outputs 1,111,761 1,111,955 +193 +0.02%
voluntary context switches 333,560 185,866 -147,695 -44.28%
involuntary context switches 2,438 2,132 -306 -12.54%

The user-CPU total increased, but the system-time and context-switch reductions
more than offset that in wall-clock time on this fixture.

Possible Regressions

The intended output behavior is unchanged.

Potential costs:

  • the AttributeStore thread-local caches retain a few more pointer/index
    entries per worker thread
  • the vector-layer metadata cache keeps a small per-thread map of layer/key
    types already seen by that worker
  • if a profile intentionally writes the same layer/key with different types,
    type changes still go through the existing mutex and update the shared map

This PR does not make the final metadata type-resolution semantics stricter or
more deterministic. It only avoids repeated same-thread writes that would have
stored the same type again.

Testing

Build:

cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build --parallel 8

Code style:

git diff --check

Performance testing used the warmed Austria OpenMapTiles PMTiles run described
above.

Semantic checks were run during candidate acceptance on the submitted/open PR
stack used for output comparisons:

fixture: Liechtenstein Geofabrik extract
profile: OpenMapTiles profile

AttributeStore TLS cache size:
  baseline repeat changed_tiles 0
  candidate repeat changed_tiles 0
  baseline -> candidate changed_tiles 0

Vector layer metadata cache:
  baseline repeat changed_tiles 0
  candidate repeat changed_tiles 0
  baseline -> candidate changed_tiles 0

The latest timing table above is narrower: it compares clean systemed#902 with clean
systemed#902 plus only these two cache candidates, so the performance result is not
polluted by other open correctness changes.

Symmetricity and others added 2 commits June 4, 2026 11:52
Attribute pair and set lookups are very frequent during profile processing. The existing 64-entry thread-local caches leave many repeated lookups to fall back to the shared stores on larger fixtures.

Use a still-small 256-entry cache so more hot lookups are served locally without changing generated tile semantics. On the Austria fixture this was neutral to slightly faster and reduced native RSS in warmed alternating runs.

Co-authored-by: Codex <noreply@openai.com>
Layer metadata is shared across Lua processing threads and still needs the mutex added by the thread-safety fix in PR systemed#761. On larger fixtures, repeated Attribute() calls now make that coarse lock a major futex source.

Remember metadata types already recorded by the current thread for the active LayerDefinition, and skip the lock only for exact same-type repeats. This preserves the protected shared map update for first sightings and type changes while avoiding redundant lock traffic.

On the Austria fixture this reduced futex syscalls by about 96%, voluntary context switches by about 89%, and wall time by about 4% without changing decoded Liechtenstein output.

Co-authored-by: Codex <noreply@openai.com>
@Symmetricity
Copy link
Copy Markdown
Owner Author

Closed as superseded by upstream split PRs: systemed#913 and systemed#914.

@Symmetricity Symmetricity deleted the perf/pr902-cache-candidates branch June 4, 2026 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant