Add S3-native ingest_results() for MinIO result recovery #32
Conversation
>>> # Download locally first, then load
>>> rows = ingest_results(cli, registry, "my-label", download=True)
"""
import os
move imports out of methods - these are base utils, it's not like we want a lazy import here
import os
import tempfile
# 1. Resolve label/hash to run metadata
Break these steps into more readable methods
)

# 5. Resolve run_id
run_id = registry._resolve_run_id_for_hash(run_hash)
I'm now remembering we have a run_id var indexing the actual runs in RunRegistry, but josh refers to the same run_id as the batch identifier for polling from the remote target - are we handling this? Seems smelly; probably something we should just change in josh (batch_id) to disambiguate.
run_id=run_id,
run_hash=run_hash,
entity_type=export_type,
)
I'm concerned about the case where we attribute the wrong CSVs to a given hash on this query - is that possible? I suppose if we pull our label from RunRegistry and the hash is included in the CSV outputs, we are probably relatively safe by convention. But I wonder if enforcing the run_hash in the path might let us guarantee no mismatch? Not sure it's worth it...
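The "enforced run_hash in path" idea above could be a cheap invariant check before loading anything. A minimal sketch, assuming a hypothetical helper name and a list-of-URLs shape that are not in the PR:

```python
# Hypothetical guard: refuse to ingest any discovered CSV whose S3 key does not
# embed the run_hash resolved from the RunRegistry. All names here are
# illustrative assumptions, not the PR's actual implementation.

def assert_paths_match_hash(csv_urls: list[str], run_hash: str) -> list[str]:
    """Return csv_urls unchanged, or raise if any path lacks run_hash."""
    mismatched = [url for url in csv_urls if run_hash not in url]
    if mismatched:
        raise ValueError(
            f"{len(mismatched)} export path(s) do not contain run_hash "
            f"{run_hash!r}; refusing to ingest: {mismatched[:3]}"
        )
    return csv_urls
```

A guard like this would turn a silent mis-attribution into a loud error, at the cost of requiring exporters to always embed the hash in the key.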
- Move os, tempfile, StageFromMinioConfig, configure_s3 imports to module level instead of inside methods
- Extract monolithic ingest_results() into focused helpers: _resolve_ingest_metadata(), _get_josh_source(), _configure_minio_access(), _load_ingest_replicates()
- Fix test mock patch path to match new top-level import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable recovering simulation results from MinIO into the RunRegistry by label. DuckDB reads CSVs directly from S3 via httpfs (no local download needed). Also provides a download=True fallback via stageFromMinio.

- configure_s3(): reusable DuckDB httpfs + S3 credential setup
- CellDataLoader.load_csv(): accepts s3:// URLs alongside local Paths
- ingest_results(): label lookup → export path discovery → S3 read → load
- SweepManager.ingest(): convenience wrapper
- StageFromMinioConfig + JoshCLI.stage_from_minio(): download fallback
- Fix pre-existing test regression from 3e487fe (data file extension)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
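For flavor, the DuckDB setup configure_s3() performs plausibly boils down to statements like these. This is a sketch built on DuckDB's documented httpfs extension and CREATE SECRET syntax; the function name's real output in the PR may differ:

```python
# Hedged sketch: generate the SQL a configure_s3()-style helper might run to
# point DuckDB at a MinIO endpoint. Statement text is illustrative.

def build_s3_setup_sql(endpoint: str, key_id: str, secret: str,
                       use_ssl: bool = False) -> list[str]:
    """Return the httpfs install/load and S3-secret statements in order."""
    return [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        f"""CREATE OR REPLACE SECRET minio (
    TYPE S3,
    KEY_ID '{key_id}',
    SECRET '{secret}',
    ENDPOINT '{endpoint}',
    USE_SSL {'true' if use_ssl else 'false'},
    URL_STYLE 'path'
);""",
    ]
```

URL_STYLE 'path' matters for MinIO, which serves buckets path-style rather than as virtual-hosted subdomains.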
- TestStageFromMinio: mock JarManager.get_jar so tests don't require a local JAR file (they already mock subprocess.run)
- TestDiffCLI.test_main_view: mock _launch_ide so the test doesn't require VS Code's `code` CLI in PATH

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Escalating integration tests against a real MinIO service container:

- Level 1: DuckDB httpfs write/read to MinIO
- Level 2: Josh JAR simulation exports to MinIO, Python reads back
- Level 3: CellDataLoader.load_csv from s3:// URLs
- Level 4: End-to-end ingest_results() by label from MinIO
- Level 5: Partial/interrupted sweep recovery (missing replicates)
- Edge cases: bad credentials, missing bucket, namespace isolation

Infrastructure:

- tests/conftest.py with shared fixtures (minio_conn, minio_registry, seed_csv, etc.)
- tests/fixtures/minio_export.josh minimal test simulation
- pytest 'integration' marker registered in pyproject.toml
- pixi tasks: 'test' excludes integration, 'test-integration' runs only integration
- CI workflow with unit-tests + integration-tests jobs (MinIO service container)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
af4e3ba to ecbec04
Fixes zizmor alerts:

- Pin bitnamilegacy/minio to SHA digest (unpinned image reference)
- Add persist-credentials: false to checkout (credential persistence)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
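In workflow terms, those two zizmor fixes look something like the fragment below. The digest is a placeholder and the job layout is illustrative, not copied from the PR:

```yaml
# Illustrative workflow fragment; <pinned-digest> stands in for the real digest
jobs:
  integration-tests:
    runs-on: ubuntu-latest
    services:
      minio:
        image: bitnamilegacy/minio@sha256:<pinned-digest>  # pinned by digest, not tag
    steps:
      - uses: actions/checkout@v4
        with:
          persist-credentials: false  # don't leave the token in .git/config
```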
(Claude summary)
Summary
- ingest_results(cli, registry, label): recovers simulation results from MinIO into the RunRegistry by label. DuckDB reads CSVs directly from S3 via httpfs (no local download needed). Missing replicates (e.g. from OOM) are skipped gracefully.
- configure_s3(): reusable DuckDB httpfs + S3 credential setup, the foundation for all future S3 access patterns (serverless aggregators, WASM, multi-machine)
- CellDataLoader.load_csv() S3 URL support: accepts s3:// strings alongside local Path objects
- SweepManager.ingest(): convenience wrapper
- StageFromMinioConfig + JoshCLI.stage_from_minio(): for the download=True fallback
- Fixes pre-existing test regression from 3e487fe (data file extension logic changed without test update)

Part of #31 (PR 1). Access model: S3 CSVs are the source of truth; the local .duckdb is a materialized cache any machine can rebuild.

User-facing usage
Pixi task example (for josh-models):
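The original example did not survive extraction; a hypothetical sketch of what such a pixi.toml task could look like (the task name and script path are assumptions, and real MinIO credentials would come from the environment, not this file):

```toml
# Hypothetical pixi.toml fragment for josh-models (all names illustrative)
[tasks.ingest-results]
cmd = "python scripts/ingest_results.py --label my-sweep"
```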
Test plan
- TestStageFromMinioConfig: frozen dataclass, optional defaults
- TestStageFromMinio: subprocess arg building, minio flags only when non-None
- TestIngestResults: local file protocol, minio S3 read, missing replicate skip, josh_content fallback, missing creds error, unknown label error
- TestConfigureS3: INSTALL httpfs + CREATE SECRET SQL generation
- test_run_with_data_files regression fixed

🤖 Generated with Claude Code