
Batch SQLite INSERTs for indexing pipeline#5

Open
KRRT7 wants to merge 14 commits into main from perf/batch-inserts

Conversation

@KRRT7 (Owner) commented Apr 10, 2026

Summary

  • Add add_terms_batch and add_properties_batch to ITermToSemanticRefIndex and IPropertyToSemanticRefIndex interfaces
  • SQLite backend uses executemany instead of individual cursor.execute() calls (~1000+ calls → 2-3 calls per indexing batch)
  • Restructure add_metadata_to_index_from_list and add_to_property_index to collect all data first (pure functions), then batch-insert
  • Memory backend implements batch methods as loops for interface compatibility
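The core change can be sketched with plain `sqlite3` (hypothetical table and column names standing in for the real schema in the SQLite backend):

```python
import sqlite3

# Hypothetical schema standing in for the real term-index table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term_index (term TEXT, semref_id INTEGER)")

rows = [("cat", 1), ("dog", 2), ("cat", 3)]

# Before: one trip through the cursor machinery per row.
cur = conn.cursor()
for term, semref_id in rows:
    cur.execute("INSERT INTO term_index VALUES (?, ?)", (term, semref_id))

# After: a single executemany call amortizes the per-call overhead;
# the statement is prepared once and each parameter tuple is bound in turn.
cur.executemany("INSERT INTO term_index VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM term_index").fetchone()[0]
print(count)  # both styles inserted 3 rows each → 6
```

Both paths produce identical rows; the batch path just collapses N Python-level calls into one.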

Benchmark

Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM), Python 3.13, Ubuntu 24.04
pytest-async-benchmark pedantic mode, 20 rounds, 3 warmup -- only hot path timed (setup/teardown excluded)

| Benchmark | main (min) | optimized (min) | Speedup |
|:---|---:|---:|---:|
| add_messages_with_indexing (200 msgs) | 28.8 ms | 25.0 ms | 1.16x |
| add_messages_with_indexing (50 msgs) | 7.8 ms | 6.7 ms | 1.16x |
| VTT ingest (40 msgs) | 6.9 ms | 6.1 ms | 1.14x |

Consistent ~14-16% improvement -- executemany amortizes per-call overhead.

Reproduce

Save the benchmark file below as tests/benchmarks/test_benchmark_indexing.py, then:

pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio

# Run on main
git checkout main
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

# Run on this branch
git checkout perf/batch-inserts
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s
Benchmark test file (tests/benchmarks/test_benchmark_indexing.py)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

"""Benchmarks for add_messages_with_indexing -- the core indexing pipeline.

Exercises: message storage, semantic ref creation, term index insertion,
property index insertion, and embedding computation.

Only the hot path (add_messages_with_indexing) is timed -- DB creation,
storage provider init, VTT parsing, and teardown are excluded via
async_benchmark.pedantic().

Run:
    uv run python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s
"""

import itertools
import os
import shutil
import tempfile

import pytest

from typeagent.aitools.model_adapters import create_test_embedding_model
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.knowpro.universal_message import ConversationMessage
from typeagent.storage.sqlite.provider import SqliteStorageProvider
from typeagent.transcripts.transcript import (
    Transcript,
    TranscriptMessage,
    TranscriptMessageMeta,
)
from typeagent.transcripts.transcript_ingest import ingest_vtt_transcript

TESTDATA = os.path.join(os.path.dirname(__file__), "..", "testdata")
CONFUSE_A_CAT_VTT = os.path.join(TESTDATA, "Confuse-A-Cat.vtt")


def make_settings() -> ConversationSettings:
    """Create conversation settings with fake embedding model (no API keys)."""
    model = create_test_embedding_model()
    settings = ConversationSettings(model=model)
    settings.semantic_ref_index_settings.auto_extract_knowledge = False
    return settings


async def extract_vtt_messages(vtt_path: str) -> list[ConversationMessage]:
    """Parse a VTT file via ingest_vtt_transcript and return the messages."""
    settings = make_settings()
    with tempfile.TemporaryDirectory() as tmpdir:
        db_path = os.path.join(tmpdir, "parse.db")
        transcript = await ingest_vtt_transcript(vtt_path, settings, dbname=db_path)
        n = await transcript.messages.size()
        messages = await transcript.messages.get_slice(0, n)
        await settings.storage_provider.close()
    return messages


def synthetic_messages(n: int) -> list[TranscriptMessage]:
    """Build n synthetic TranscriptMessages."""
    return [
        TranscriptMessage(
            text_chunks=[f"Message {i} about topic {i % 10}"],
            metadata=TranscriptMessageMeta(speaker=f"Speaker{i % 3}"),
            tags=[f"tag{i % 5}"],
        )
        for i in range(n)
    ]


async def run_indexing_benchmark(async_benchmark, messages, message_type):
    """Shared benchmark harness: fresh DB per round, only hot path timed."""
    settings = make_settings()
    tmpdir = tempfile.mkdtemp()
    counter = itertools.count()

    async def setup():
        i = next(counter)
        db_path = os.path.join(tmpdir, f"bench_{i}.db")
        storage = SqliteStorageProvider(
            db_path,
            message_type=message_type,
            message_text_index_settings=settings.message_text_index_settings,
            related_term_index_settings=settings.related_term_index_settings,
        )
        settings.storage_provider = storage
        transcript = await Transcript.create(settings, name="bench")
        return transcript, storage, db_path

    async def teardown(setup_rv):
        _, storage, db_path = setup_rv
        await storage.close()
        os.remove(db_path)

    async def target(transcript, storage, db_path):
        await transcript.add_messages_with_indexing(messages)

    try:
        await async_benchmark.pedantic(
            target, setup=setup, teardown=teardown, rounds=20, warmup_rounds=3
        )
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)


@pytest.mark.asyncio
async def test_benchmark_vtt_ingest(async_benchmark):
    """Benchmark indexing of pre-parsed VTT messages (Confuse-A-Cat, 40 msgs)."""
    messages = await extract_vtt_messages(CONFUSE_A_CAT_VTT)
    await run_indexing_benchmark(async_benchmark, messages, ConversationMessage)


@pytest.mark.asyncio
async def test_benchmark_add_messages_50(async_benchmark):
    """Benchmark add_messages_with_indexing with 50 synthetic messages."""
    await run_indexing_benchmark(
        async_benchmark, synthetic_messages(50), TranscriptMessage
    )


@pytest.mark.asyncio
async def test_benchmark_add_messages_200(async_benchmark):
    """Benchmark add_messages_with_indexing with 200 synthetic messages."""
    await run_indexing_benchmark(
        async_benchmark, synthetic_messages(200), TranscriptMessage
    )

If you'd like the benchmark tests committed to the repo, see follow-up PR #6.

Test plan

  • All 69 offline tests pass (pytest tests/ -k "not test_benchmark")
  • pyright passes with 0 errors
  • Memory and SQLite backends both implement the new batch interface methods

KRRT7 and others added 5 commits April 10, 2026 09:23
…crosoft#231)

- `parse_azure_endpoint` returned the full URL including
`?api-version=...`
- `AsyncAzureOpenAI` appends `/openai/` to `azure_endpoint`, producing a
mangled URL with the query string in the path
- Now strips the query string with `str.split("?", 1)[0]` before
returning
- Added 6 unit tests covering: basic URL, no version, separate env var,
missing env var, empty query string

## Benchmark

No performance impact — this is a correctness fix.

---

*Generated by codeflash optimization agent*
- Defer `import black` from module level to first use in `answers.py`
and `utils.py`
- `black` (code formatter + transitive deps: pathspec, black.nodes,
etc.) loaded on every `import typeagent` but only used in two cold
formatting paths

`black.format_str()` is called in two places:
- `create_context_prompt()` in `knowpro/answers.py` — formats debug
context for LLM prompts
- `format_code()` in `aitools/utils.py` — developer pretty-print utility

Neither runs during normal library operation. Moving the import inside
each function eliminates ~78ms of transitive module loading from the
import chain.

## Benchmark

### Azure Standard_D2s_v5 — 2 vCPU, 8 GiB RAM, Python 3.13

#### Import Time (hyperfine, warmup 5, min-runs 30)

| Benchmark | Before | After | Speedup |
|:---|---:|---:|---:|
| `import typeagent` | 791 ms ± 11 ms | 713 ms ± 8 ms | **1.11x** |

#### Offline E2E Test Suite (hyperfine, warmup 2, min-runs 10)

| Benchmark | Before | After | Speedup |
|:---|---:|---:|---:|
| 69 offline tests | 5.72 s ± 90 ms | 5.60 s ± 98 ms | 1.02x |

---

*Generated by codeflash optimization agent*

---------

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
Add add_terms_batch / add_properties_batch to the index interfaces
with executemany-based SQLite implementations. Restructure
add_metadata_to_index_from_list and add_to_property_index to collect
all items first, then batch-insert via extend() and the new batch
methods. Eliminates ~1000 individual INSERT round-trips during
indexing.
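The restructure described above follows a collect-then-flush shape: pure functions gather rows, then a single batch method writes them. A minimal sketch (all names hypothetical, not the actual typeagent API):

```python
from typing import Sequence


class TermIndex:
    """Stand-in for the SQLite-backed term index."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, int]] = []

    def add_terms_batch(self, items: Sequence[tuple[str, int]]) -> None:
        # Real backend would do: cursor.executemany("INSERT ...", items)
        self.rows.extend(items)


def collect_terms(semrefs: Sequence[tuple[int, list[str]]]) -> list[tuple[str, int]]:
    # Pure function: no I/O, just gathers (term, semref_id) pairs.
    return [(term, sid) for sid, terms in semrefs for term in terms]


semrefs = [(1, ["cat", "confuse"]), (2, ["cat"])]
index = TermIndex()
index.add_terms_batch(collect_terms(semrefs))  # one flush instead of 3 inserts
```

Keeping collection pure also makes the memory backend trivial: it can implement the same batch method as a loop, as the summary notes.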
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch
interfaces and implementations to satisfy covariance. Add missing
add_terms_batch to FakeTermIndex in conftest.py.
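The list → Sequence change matters because `list` is invariant while `Sequence` is covariant (read-only), so callers may pass tuples or lists of narrower element types without a type error. A sketch of the kind of signature involved (illustrative names, not the real interface):

```python
from typing import Sequence


def add_terms_batch(items: Sequence[tuple[str, int]]) -> int:
    # Accepts a list, a tuple, or any other read-only sequence of pairs.
    return len(items)


# With list[...] in the signature, pyright would reject the tuple argument
# below; Sequence[...] accepts both call sites.
n = add_terms_batch([("cat", 1)]) + add_terms_batch((("dog", 2), ("cat", 3)))
print(n)  # 3
```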
@KRRT7 KRRT7 force-pushed the perf/batch-inserts branch from 4030379 to 82ba650 Compare April 10, 2026 20:50
KRRT7 and others added 9 commits April 10, 2026 21:15
…ft#237)

Apparently at least one of isort and black was updated and now produces
more compact code in some cases.

Also change the black call in Makefile to use only -tpy312 since black
on 3.12 cannot parse back the py314 code it generated (maybe only the
last -t flag was effective?).

And finally make Makefile more robust by using `uv run X` instead of
`.venv/bin/X`.

---------

Co-authored-by: Kevin Turcios <turcioskevinr@gmail.com>
@bmerkle could you review this too? This is really an independent little
tool, but super handy -- it can show you all your AI chats used from VS
Code. (I suspect it wouldn't be too hard to be able to point it to other
frameworks' logs as well.)

---------

Co-authored-by: Guido van Rossum <gvanrossum@microsoft.com>
Co-authored-by: Bernhard Merkle <bernhard.merkle@gmail.com>
Fixes bugs described in microsoft#238:
- regression in URL parsing introduced by PR microsoft#231
- uv.lock updated to newer versions in many cases
…move imports

- Fix inverse_actions omission in add_metadata_to_index_from_list (regression)
- Fix inverse_actions omission in add_metadata_to_index (pre-existing)
- Delete duplicate add_entity_to_index, add_action_to_index, add_topic_to_index,
  text_range_from_location — unified into add_entity, add_action, add_topic
- Update all callers and tests to use unified functions
- Move function-level imports to top-level in sqlite/propindex.py per AGENTS.md