Skip to content

Add S3-native ingest_results() for MinIO result recovery#32

Closed
GondekNP wants to merge 5 commits intomainfrom
feat/ingest-results
Closed

Add S3-native ingest_results() for MinIO result recovery#32
GondekNP wants to merge 5 commits intomainfrom
feat/ingest-results

Conversation

@GondekNP
Copy link
Copy Markdown
Contributor

@GondekNP GondekNP commented Apr 15, 2026

(Claude summary)

Summary

  • Adds ingest_results(cli, registry, label) — recovers simulation results from MinIO into the RunRegistry by label. DuckDB reads CSVs directly from S3 via httpfs (no local download needed). Missing replicates (e.g. from OOM) are skipped gracefully.
  • Adds configure_s3() — reusable DuckDB httpfs + S3 credential setup, the foundation for all future S3 access patterns (serverless aggregators, WASM, multi-machine)
  • Adds CellDataLoader.load_csv() S3 URL support — accepts s3:// strings alongside local Path objects
  • Adds SweepManager.ingest() convenience wrapper
  • Adds StageFromMinioConfig + JoshCLI.stage_from_minio() for download=True fallback
  • Fixes pre-existing test regression from 3e487fe (data file extension logic changed without test update)

Part of #31 (PR 1). Access model: S3 CSVs are source of truth, local .duckdb is a materialized cache any machine can rebuild.

User-facing usage

from joshpy.sweep import ingest_results

# Reads CSVs directly from S3 into DuckDB — no download
rows = ingest_results(cli, registry, "my-label")

# Or download locally first
rows = ingest_results(cli, registry, "my-label", download=True)

# Via SweepManager
manager.ingest()

Pixi task example (for josh-models):

recover = { cmd = "python scripts/recover.py", env = { JOSH_LABEL = "{{ LABEL }}" }, args = [{ arg = "LABEL" }] }

Test plan

  • 867 tests pass (856 existing + 11 new), 0 failures
  • TestStageFromMinioConfig — frozen dataclass, optional defaults
  • TestStageFromMinio — subprocess arg building, minio flags only when non-None
  • TestIngestResults — local file protocol, minio S3 read, missing replicate skip, josh_content fallback, missing creds error, unknown label error
  • TestConfigureS3 — INSTALL httpfs + CREATE SECRET SQL generation
  • Pre-existing test_run_with_data_files regression fixed
  • Manual e2e test against real MinIO (blocked on dev JAR update)

🤖 Generated with Claude Code

Comment thread joshpy/sweep.py Outdated
>>> # Download locally first, then load
>>> rows = ingest_results(cli, registry, "my-label", download=True)
"""
import os
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move imports out of methods - these are base utils, it's not like we want a lazy import here

Comment thread joshpy/sweep.py Outdated
import os
import tempfile

# 1. Resolve label/hash to run metadata
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break these steps into more readable methods

Comment thread joshpy/sweep.py Outdated
)

# 5. Resolve run_id
run_id = registry._resolve_run_id_for_hash(run_hash)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now remembering we have a run_id var indexing the actual runs in RunRegistry, but josh refers to the same run_id as the batch identififier for polling from the remote target - are we handling this? Seems smelly, probably something we should just change in josh (batch_id) to disambiguate

Comment thread joshpy/sweep.py Outdated
run_id=run_id,
run_hash=run_hash,
entity_type=export_type,
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned with the case where attribute the wrong csvs to a given hash on this query - is that possible? I suppose if we pull our label from RunRegistry and the hash is included in the csv outputs, we are probably relatively safe by convention. But I wonder if any enforced run_hash in path business might allow us to guarantee no mismatch? Not sure it's worth it....

GondekNP added a commit that referenced this pull request Apr 15, 2026
- Move os, tempfile, StageFromMinioConfig, configure_s3 imports to
  module level instead of inside methods
- Extract monolithic ingest_results() into focused helpers:
  _resolve_ingest_metadata(), _get_josh_source(),
  _configure_minio_access(), _load_ingest_replicates()
- Fix test mock patch path to match new top-level import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GondekNP and others added 4 commits April 16, 2026 18:39
Enable recovering simulation results from MinIO into the RunRegistry by
label.  DuckDB reads CSVs directly from S3 via httpfs — no local download
needed.  Also provides a download=True fallback via stageFromMinio.

- configure_s3(): reusable DuckDB httpfs + S3 credential setup
- CellDataLoader.load_csv(): accepts s3:// URLs alongside local Paths
- ingest_results(): label lookup → export path discovery → S3 read → load
- SweepManager.ingest(): convenience wrapper
- StageFromMinioConfig + JoshCLI.stage_from_minio(): download fallback
- Fix pre-existing test regression from 3e487fe (data file extension)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move os, tempfile, StageFromMinioConfig, configure_s3 imports to
  module level instead of inside methods
- Extract monolithic ingest_results() into focused helpers:
  _resolve_ingest_metadata(), _get_josh_source(),
  _configure_minio_access(), _load_ingest_replicates()
- Fix test mock patch path to match new top-level import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestStageFromMinio: mock JarManager.get_jar so tests don't require
  a local JAR file (they already mock subprocess.run)
- TestDiffCLI.test_main_view: mock _launch_ide so test doesn't require
  VS Code's `code` CLI in PATH

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Escalating integration tests against a real MinIO service container:
- Level 1: DuckDB httpfs write/read to MinIO
- Level 2: Josh JAR simulation exports to MinIO, Python reads back
- Level 3: CellDataLoader.load_csv from s3:// URLs
- Level 4: End-to-end ingest_results() by label from MinIO
- Level 5: Partial/interrupted sweep recovery (missing replicates)
- Edge cases: bad credentials, missing bucket, namespace isolation

Infrastructure:
- tests/conftest.py with shared fixtures (minio_conn, minio_registry, seed_csv, etc.)
- tests/fixtures/minio_export.josh minimal test simulation
- pytest 'integration' marker registered in pyproject.toml
- pixi tasks: 'test' excludes integration, 'test-integration' runs only integration
- CI workflow with unit-tests + integration-tests jobs (MinIO service container)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@GondekNP GondekNP force-pushed the feat/ingest-results branch from af4e3ba to ecbec04 Compare April 16, 2026 18:39
Comment thread .github/workflows/test.yml Fixed
Comment thread .github/workflows/test.yml Fixed
Fixes zizmor alerts:
- Pin bitnamilegacy/minio to SHA digest (unpinned image reference)
- Add persist-credentials: false to checkout (credential persistence)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@GondekNP
Copy link
Copy Markdown
Contributor Author

Superseded by #34 (feat/batch-run), which includes all commits from this PR plus PR2 (target profiles + devcontainer K8s tooling). Consolidating all batch remote PRs into a single branch per #31.

@GondekNP GondekNP closed this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants