
Batch SQLite INSERTs for indexing pipeline#5

Open
KRRT7 wants to merge 14 commits into main from perf/batch-inserts

Conversation

@KRRT7 (Owner) commented Apr 10, 2026

Summary

  • Add add_terms_batch and add_properties_batch to ITermToSemanticRefIndex and IPropertyToSemanticRefIndex interfaces
  • SQLite backend uses executemany instead of individual cursor.execute() calls (~1000+ calls → 2-3 calls per indexing batch)
  • Restructure add_metadata_to_index_from_list and add_to_property_index to collect all data first (pure functions), then batch-insert
  • Memory backend implements batch methods as loops for interface compatibility
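The core change can be sketched with plain `sqlite3` (hypothetical table and column names standing in for the real schema in the SQLite backend):

```python
import sqlite3

# Hypothetical schema standing in for the real term-index table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term_index (term TEXT, semref_id INTEGER)")

rows = [("cat", 1), ("dog", 2), ("cat", 3)]

# Before: one trip through the cursor machinery per row.
cur = conn.cursor()
for term, semref_id in rows:
    cur.execute("INSERT INTO term_index VALUES (?, ?)", (term, semref_id))

# After: a single executemany call amortizes the per-call overhead;
# the statement is prepared once and each parameter tuple is bound in turn.
cur.executemany("INSERT INTO term_index VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM term_index").fetchone()[0]
print(count)  # both styles inserted 3 rows each → 6
```

Both paths produce identical rows; the batch path just collapses N Python-level calls into one.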

Benchmark

Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM), Python 3.13, Ubuntu 24.04
pytest-async-benchmark pedantic mode, 20 rounds, 3 warmup -- only hot path timed (setup/teardown excluded)

| Benchmark | main (min) | optimized (min) | Speedup |
|:---|---:|---:|---:|
| add_messages_with_indexing (200 msgs) | 28.8 ms | 25.0 ms | 1.16x |
| add_messages_with_indexing (50 msgs) | 7.8 ms | 6.7 ms | 1.16x |
| VTT ingest (40 msgs) | 6.9 ms | 6.1 ms | 1.14x |

Consistent ~14-16% improvement -- executemany amortizes per-call overhead.

Reproduce

Save the benchmark file below as tests/benchmarks/test_benchmark_indexing.py, then:

pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio

# Run on main
git checkout main
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

# Run on this branch
git checkout perf/batch-inserts
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s
Benchmark test file (tests/benchmarks/test_benchmark_indexing.py)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

"""Benchmarks for add_messages_with_indexing -- the core indexing pipeline.

Exercises: message storage, semantic ref creation, term index insertion,
property index insertion, and embedding computation.

Only the hot path (add_messages_with_indexing) is timed -- DB creation,
storage provider init, VTT parsing, and teardown are excluded via
async_benchmark.pedantic().

Run:
    uv run python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s
"""

import itertools
import os
import shutil
import tempfile

import pytest

from typeagent.aitools.model_adapters import create_test_embedding_model
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.knowpro.universal_message import ConversationMessage
from typeagent.storage.sqlite.provider import SqliteStorageProvider
from typeagent.transcripts.transcript import (
    Transcript,
    TranscriptMessage,
    TranscriptMessageMeta,
)
from typeagent.transcripts.transcript_ingest import ingest_vtt_transcript

TESTDATA = os.path.join(os.path.dirname(__file__), "..", "testdata")
CONFUSE_A_CAT_VTT = os.path.join(TESTDATA, "Confuse-A-Cat.vtt")


def make_settings() -> ConversationSettings:
    """Create conversation settings with fake embedding model (no API keys)."""
    model = create_test_embedding_model()
    settings = ConversationSettings(model=model)
    settings.semantic_ref_index_settings.auto_extract_knowledge = False
    return settings


async def extract_vtt_messages(vtt_path: str) -> list[ConversationMessage]:
    """Parse a VTT file via ingest_vtt_transcript and return the messages."""
    settings = make_settings()
    with tempfile.TemporaryDirectory() as tmpdir:
        db_path = os.path.join(tmpdir, "parse.db")
        transcript = await ingest_vtt_transcript(vtt_path, settings, dbname=db_path)
        n = await transcript.messages.size()
        messages = await transcript.messages.get_slice(0, n)
        await settings.storage_provider.close()
    return messages


def synthetic_messages(n: int) -> list[TranscriptMessage]:
    """Build n synthetic TranscriptMessages."""
    return [
        TranscriptMessage(
            text_chunks=[f"Message {i} about topic {i % 10}"],
            metadata=TranscriptMessageMeta(speaker=f"Speaker{i % 3}"),
            tags=[f"tag{i % 5}"],
        )
        for i in range(n)
    ]


async def run_indexing_benchmark(async_benchmark, messages, message_type):
    """Shared benchmark harness: fresh DB per round, only hot path timed."""
    settings = make_settings()
    tmpdir = tempfile.mkdtemp()
    counter = itertools.count()

    async def setup():
        i = next(counter)
        db_path = os.path.join(tmpdir, f"bench_{i}.db")
        storage = SqliteStorageProvider(
            db_path,
            message_type=message_type,
            message_text_index_settings=settings.message_text_index_settings,
            related_term_index_settings=settings.related_term_index_settings,
        )
        settings.storage_provider = storage
        transcript = await Transcript.create(settings, name="bench")
        return transcript, storage, db_path

    async def teardown(setup_rv):
        _, storage, db_path = setup_rv
        await storage.close()
        os.remove(db_path)

    async def target(transcript, storage, db_path):
        await transcript.add_messages_with_indexing(messages)

    try:
        await async_benchmark.pedantic(
            target, setup=setup, teardown=teardown, rounds=20, warmup_rounds=3
        )
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)


@pytest.mark.asyncio
async def test_benchmark_vtt_ingest(async_benchmark):
    """Benchmark indexing of pre-parsed VTT messages (Confuse-A-Cat, 40 msgs)."""
    messages = await extract_vtt_messages(CONFUSE_A_CAT_VTT)
    await run_indexing_benchmark(async_benchmark, messages, ConversationMessage)


@pytest.mark.asyncio
async def test_benchmark_add_messages_50(async_benchmark):
    """Benchmark add_messages_with_indexing with 50 synthetic messages."""
    await run_indexing_benchmark(
        async_benchmark, synthetic_messages(50), TranscriptMessage
    )


@pytest.mark.asyncio
async def test_benchmark_add_messages_200(async_benchmark):
    """Benchmark add_messages_with_indexing with 200 synthetic messages."""
    await run_indexing_benchmark(
        async_benchmark, synthetic_messages(200), TranscriptMessage
    )

If you'd like the benchmark tests committed to the repo, see follow-up PR #6.

Test plan

  • All 69 offline tests pass (pytest tests/ -k "not test_benchmark")
  • pyright passes with 0 errors
  • Memory and SQLite backends both implement the new batch interface methods

KRRT7 and others added 5 commits April 10, 2026 09:23
…crosoft#231)

- `parse_azure_endpoint` returned the full URL including
`?api-version=...`
- `AsyncAzureOpenAI` appends `/openai/` to `azure_endpoint`, producing a
mangled URL with the query string in the path
- Now strips the query string with `str.split("?", 1)[0]` before
returning
- Added 6 unit tests covering: basic URL, no version, separate env var,
missing env var, empty query string

## Benchmark

No performance impact — this is a correctness fix.

---

*Generated by codeflash optimization agent*
- Defer `import black` from module level to first use in `answers.py`
and `utils.py`
- `black` (code formatter + transitive deps: pathspec, black.nodes,
etc.) loaded on every `import typeagent` but only used in two cold
formatting paths

`black.format_str()` is called in two places:
- `create_context_prompt()` in `knowpro/answers.py` — formats debug
context for LLM prompts
- `format_code()` in `aitools/utils.py` — developer pretty-print utility

Neither runs during normal library operation. Moving the import inside
each function eliminates ~78ms of transitive module loading from the
import chain.

## Benchmark

### Azure Standard_D2s_v5 — 2 vCPU, 8 GiB RAM, Python 3.13

#### Import Time (hyperfine, warmup 5, min-runs 30)

| Benchmark | Before | After | Speedup |
|:---|---:|---:|---:|
| `import typeagent` | 791 ms ± 11 ms | 713 ms ± 8 ms | **1.11x** |

#### Offline E2E Test Suite (hyperfine, warmup 2, min-runs 10)

| Benchmark | Before | After | Speedup |
|:---|---:|---:|---:|
| 69 offline tests | 5.72 s ± 90 ms | 5.60 s ± 98 ms | 1.02x |

---

*Generated by codeflash optimization agent*

---------

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
Add add_terms_batch / add_properties_batch to the index interfaces
with executemany-based SQLite implementations. Restructure
add_metadata_to_index_from_list and add_to_property_index to collect
all items first, then batch-insert via extend() and the new batch
methods. Eliminates ~1000 individual INSERT round-trips during
indexing.
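The restructure described above follows a collect-then-flush shape: pure functions gather rows, then a single batch method writes them. A minimal sketch (all names hypothetical, not the actual typeagent API):

```python
from typing import Sequence


class TermIndex:
    """Stand-in for the SQLite-backed term index."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, int]] = []

    def add_terms_batch(self, items: Sequence[tuple[str, int]]) -> None:
        # Real backend would do: cursor.executemany("INSERT ...", items)
        self.rows.extend(items)


def collect_terms(semrefs: Sequence[tuple[int, list[str]]]) -> list[tuple[str, int]]:
    # Pure function: no I/O, just gathers (term, semref_id) pairs.
    return [(term, sid) for sid, terms in semrefs for term in terms]


semrefs = [(1, ["cat", "confuse"]), (2, ["cat"])]
index = TermIndex()
index.add_terms_batch(collect_terms(semrefs))  # one flush instead of 3 inserts
```

Keeping collection pure also makes the memory backend trivial: it can implement the same batch method as a loop, as the summary notes.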
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch
interfaces and implementations to satisfy covariance. Add missing
add_terms_batch to FakeTermIndex in conftest.py.
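The list → Sequence change matters because `list` is invariant while `Sequence` is covariant (read-only), so callers may pass tuples or lists of narrower element types without a type error. A sketch of the kind of signature involved (illustrative names, not the real interface):

```python
from typing import Sequence


def add_terms_batch(items: Sequence[tuple[str, int]]) -> int:
    # Accepts a list, a tuple, or any other read-only sequence of pairs.
    return len(items)


# With list[...] in the signature, pyright would reject the tuple argument
# below; Sequence[...] accepts both call sites.
n = add_terms_batch([("cat", 1)]) + add_terms_batch((("dog", 2), ("cat", 3)))
print(n)  # 3
```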
@KRRT7 KRRT7 force-pushed the perf/batch-inserts branch from 4030379 to 82ba650 Compare April 10, 2026 20:50
KRRT7 and others added 9 commits April 10, 2026 21:15
…ft#237)

Apparently at least one of isort and black was updated and now produces
more compact code in some cases.

Also change the black call in Makefile to use only -tpy312 since black
on 3.12 cannot parse back the py314 code it generated (maybe only the
last -t flag was effective?).

And finally make Makefile more robust by using `uv run X` instead of
`.venv/bin/X`.

---------

Co-authored-by: Kevin Turcios <turcioskevinr@gmail.com>
@bmerkle could you review this too? This is really an independent little
tool, but super handy -- it can show you all your AI chats used from VS
Code. (I suspect it wouldn't be too hard to be able to point it to other
frameworks' logs as well.)

---------

Co-authored-by: Guido van Rossum <gvanrossum@microsoft.com>
Co-authored-by: Bernhard Merkle <bernhard.merkle@gmail.com>
Fixes bugs described in microsoft#238:
- regression in URL parsing introduced by PR microsoft#231
- uv.lock updated to newer versions in many cases
…move imports

- Fix inverse_actions omission in add_metadata_to_index_from_list (regression)
- Fix inverse_actions omission in add_metadata_to_index (pre-existing)
- Delete duplicate add_entity_to_index, add_action_to_index, add_topic_to_index,
  text_range_from_location — unified into add_entity, add_action, add_topic
- Update all callers and tests to use unified functions
- Move function-level imports to top-level in sqlite/propindex.py per AGENTS.md