A reproducible pipeline for building a curated catalog of public datasets as analytics-ready Parquet and Vortex files.
Raincloud is a reproducible baseline of public datasets in modern columnar formats, curated from research papers and existing community efforts. The project's motivation comes from file-format research, where consistent test corpora are needed to compare encoding, compression, and layout choices on real-world inputs. Beyond file-format research, we see broader value in providing a community-curated set of real-world data, as we expect it to be useful for other tasks such as analytical benchmarking and model evaluation.
Third-party data. Raincloud fetches data from URLs declared in sources.json. Those bytes come from upstream sources, not from us: we don't audit, host, or redistribute them. See DISCLAIMER.md for the AS IS posture, the content / license / supply-chain disclaimers, and the dataset-removal channel.
The repo is driven by a single manifest, sources.json, which declares:
- Where to fetch data from. Each entry names an upstream source (HTTP, Kaggle, Hugging Face) and the URLs to pull.
- How to transform it. Each entry names a parser (csv, parquet, jsonl, xml, pbf, custom) and a transform handler that converts the raw bytes into one or more typed Arrow tables.
- Where it lands. Transformed tables are written to outputs/v{schema_version}/<slug>/parquet/<slug>.parquet, with the optional Vortex sibling at outputs/v{schema_version}/<slug>/vortex/<slug>.vortex. The per-format subdirectory leaves room for additional artifact tiers (e.g. parquet-hydrated/) without filename collisions.
Nothing downstream of sources.json is hand-maintained; docs/datasets.md and docs/handlers.md are derived artefacts regenerated after each build. Column-level / type-coverage / vortex-skip / hydration-candidate views are queryable via list_datasets flags and the TUI rather than markdown.
Browse the catalog at a glance (sortable columns, parquet/vortex presence per slug, no builds required):
uv sync --extra tui
python -m scripts.pipeline.browse
A read-only Textual TUI over sources.json. Click any column header to sort; the right pane shows description, license, fetch URL, and on-disk state for the highlighted slug. Press q to quit.
Tell Raincloud which dataset you want; get back a Parquet + Vortex file on disk.
uv sync
python -m scripts.pipeline.status --fast --missing-only # read-only env check
python -m scripts.pipeline.build countries-of-the-world
outputs/v1/countries-of-the-world/parquet/countries-of-the-world.parquet
outputs/v1/countries-of-the-world/vortex/countries-of-the-world.vortex
The command runs every pipeline stage (fetch, extract, parse, transform, write, validate, convert) and leaves both a Parquet file and its converted Vortex sibling under per-format subdirectories of outputs/v1/<slug>/.
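A built artefact can be sanity-checked with off-the-shelf tooling; for example, a quick look at the Parquet output with pyarrow (illustrative only, not part of the pipeline — the Vortex sibling needs a Vortex-aware reader instead):

```python
# Quick inspection of a built Parquet artefact; no project imports required.
import pyarrow.parquet as pq

path = "outputs/v1/countries-of-the-world/parquet/countries-of-the-world.parquet"
pf = pq.ParquetFile(path)
print(pf.metadata.num_rows, "rows in", pf.metadata.num_row_groups, "row group(s)")
print(pf.schema_arrow)  # typed Arrow schema, including any tightened types
```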
Pick any other dataset from docs/v1/datasets.md (249 curated) and pass its slug the same way. Examples spanning the size range:
python -m scripts.pipeline.build uci-seeds # 210 rows, ~200 ms
python -m scripts.pipeline.build clickbench-hits # 100 M rows, ~10 GB parquet
python -m scripts.pipeline.build --family public-bi # all 46 Public BI workloads
157 of 249 manifest entries fetch from direct HTTPS endpoints and need no additional setup. The rest:
# Kaggle-hosted (33 slugs). One-time credential setup:
uv sync --extra kaggle
mkdir -p ~/.kaggle && mv /path/to/kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
# Hugging Face-hosted (59 slugs):
uv sync --extra huggingface
# Everything:
uv sync --extra all
If you're an AI coding agent landing in this repo:
- Read AGENTS.md (auto-loaded from CLAUDE.md → AGENTS.md) for the invariants and architecture.
- Run python -m scripts.pipeline.status --fast --missing-only to verify the env, then python -m scripts.pipeline.validate_manifest to confirm sources.json is well-formed. Both are sub-second and side-effect-free.
- Run pytest (after uv sync --extra dev) for a regression net before any non-trivial change to the manifest, schema, or handler registry.
- For catalog questions ("which slugs use handler X", "what's CC0-licensed"), use python -m scripts.pipeline.list_datasets rather than grepping sources.json or scrolling docs/v1/datasets.md.
- Copy-pasteable templates for new manifest entries and streaming handlers live in examples/.
- Harnesses that follow the Agent Skills standard get 16 invokable skills under .agents/skills/ (the .claude → .agents symlink means Claude Code sees the same files). Safe-default permissions are tracked in .agents/settings.json; see .agents/README.md for the full layout.
sources.json # the manifest: one DatasetSpec per dataset
sources.schema.md # human-friendly schema reference
sources.schema.json # machine-readable JSON Schema (Draft 2020-12)
AGENTS.md # invariants + first-contact guide for AI coding agents (CLAUDE.md → AGENTS.md)
SKILLS.md # narrative playbooks
HYDRATING.md # hand-maintained hydration policy / philosophy
DISCLAIMER.md # AS IS posture, content/license disclaimers, dataset-removal reporting
scripts/
pipeline/
build.py # orchestrator: ties the 7 stages together
fetch.py # stage 1: download raw bytes
custom_fetch.py # named custom-fetch helpers (fetch.type = "custom")
extract.py # stage 2: unpack archives into _workdir/
parse.py # stage 3: read raw files into Arrow tables
transform.py # stage 4: dispatch to named handler
write.py # stage 5: emit parquet
validate.py # stage 6: assert rows / schema_hash
convert.py # stage 7 (optional): emit sibling .vortex per spec's convert.vortex flag
hydrate.py # stage 8 (optional, opt-in): dereference URL columns into parquet-hydrated/
docs.py # regenerate docs/datasets.md + handlers.md (other catalog views live in list_datasets / TUI)
tighten_variant.py # in-place JSON → VARIANT pass
validate_manifest.py # static checks on sources.json (schema + cross-checks)
list_datasets.py # filter/list slugs by family / handler / license / etc.
status.py # per-slug filesystem state report
browse.py # interactive Textual TUI over sources.json (requires --extra tui)
spec.py # manifest loader, path helpers, duckdb_connect
handlers/ # named transform handlers
tests/ # pytest smoke suite (manifest, schema, handler registry, examples)
examples/ # copy-pasteable templates (minimal_spec.json, streaming_handler.py.tmpl)
.agents/ # tracked agent allow-list (settings.json) + 16 invokable skills (.claude → .agents)
outputs/
raw_downloads/<slug>/ # stage 1 output (unversioned, cached)
v{schema_version}/<slug>/ # stage 5 output (version-scoped)
docs/
datasets.md # auto-generated index (one row per dataset)
handlers.md # auto-generated registry view (purpose, streaming, extra deps, usage)
v{schema_version}/ # tracked canonical snapshot of datasets.md + handlers.md
_workdir/<slug>/ # stage 2 scratch space (gitignored)
# Build a single dataset
python -m scripts.pipeline.build <slug>
# Build everything in a family
python -m scripts.pipeline.build --family uci
python -m scripts.pipeline.build --family public-bi
# Build every dataset in the manifest
python -m scripts.pipeline.build --all
# Loosen validation (warn instead of error on row-count drift)
python -m scripts.pipeline.build <slug> --loose
# Regenerate the derived docs after a build
python -m scripts.pipeline.docs # both files (datasets.md + handlers.md)
python -m scripts.pipeline.docs datasets # just datasets.md
python -m scripts.pipeline.docs handlers # just handlers.md
# Catalog views that used to be markdown live as list_datasets flags now
python -m scripts.pipeline.list_datasets --columns [<slug>...] [--column-grep PATTERN]
python -m scripts.pipeline.list_datasets --coverage [--source parquet|vortex]
python -m scripts.pipeline.list_datasets --no-vortex --json # vortex-skip slugs + reasons
python -m scripts.pipeline.list_datasets --hydrate --long # hydration candidates
# Post-process: promote JSON-annotated string columns to VARIANT in-place
python -m scripts.pipeline.tighten_variant # every built parquet
python -m scripts.pipeline.tighten_variant <slug>... # specific slugs
# Emit a sibling .vortex for every spec that opts in via convert.vortex: true
python -m scripts.pipeline.convert # respects per-spec flag
python -m scripts.pipeline.convert <slug>...
# Optional stage 8: dereference URL columns into parquet-hydrated/<slug>.parquet.
# Off by default; only for slugs that opted in via the `hydrate` block.
# Safety filter ON by default; bypass requires --unsafe-allow-all-domains
# AND --i-accept-the-risk. See HYDRATING.md for the full discussion.
python -m scripts.pipeline.hydrate <slug> # one slug
python -m scripts.pipeline.hydrate <slug> --limit 100 # first N rows (recommended for first run)
python -m scripts.pipeline.hydrate --all # every spec with hydrate
# Read-only inspection / triage
python -m scripts.pipeline.status --fast --missing-only # filesystem state across the manifest
python -m scripts.pipeline.validate_manifest # static checks on sources.json (schema + cross-checks)
python -m scripts.pipeline.list_datasets --family uci # filter the catalog without grepping JSON
python -m scripts.pipeline.list_datasets --handler tighten_types --long
python -m scripts.pipeline.list_datasets --grep '\bgeo' --long
python -m scripts.pipeline.browse # interactive TUI (requires --extra tui)
# Run the test suite (sub-second, no fetch / no build)
uv sync --extra dev && pytest
Each stage is independently invokable; e.g. run python -m scripts.pipeline.fetch <slug> to download raw bytes without running the rest. Stages are idempotent: fetch skips when expected_bytes/expected_sha256 already matches on disk, and write skips when the output parquet is already current.
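The fetch skip reduces to a size-plus-checksum comparison against the spec's expectations; a minimal sketch of that idea (the actual stage lives in scripts/pipeline/fetch.py and may do more):

```python
# Minimal sketch of the idempotent-fetch check: skip the download when the
# cached file already matches the expected byte count and sha256.
import hashlib
from pathlib import Path

def already_fetched(path: Path, expected_bytes: int, expected_sha256: str) -> bool:
    if not path.exists() or path.stat().st_size != expected_bytes:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```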
Every DuckDB connection in the pipeline goes through scripts.pipeline.spec.duckdb_connect, which reads these optional env vars before opening:
| Env var | Effect |
|---|---|
| RAINCLOUD_DUCKDB_MEMORY_LIMIT | Caps DuckDB's working set (the memory_limit setting), e.g. 8GB or 512MB. DuckDB spills to disk once the limit is hit. Unset = DuckDB's default (~80% of system RAM), which can be a problem on shared hosts or CI runners. |
| RAINCLOUD_DUCKDB_THREADS | Caps the thread pool. Integer. |
| RAINCLOUD_DUCKDB_TEMP_DIRECTORY | Where DuckDB spills intermediate batches. Default = system tempdir. Point at a larger volume if the system tempdir runs out of room on a big build. |
Example:
RAINCLOUD_DUCKDB_MEMORY_LIMIT=8GB \
RAINCLOUD_DUCKDB_TEMP_DIRECTORY=/mnt/scratch/duckdb-tmp \
python -m scripts.pipeline.build jsonbench-bluesky-100m
Persistent DuckDB databases are opened with storage_compatibility_version=v1.5.0 automatically (required for VARIANT columns).
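In outline, the helper applies those env vars as DuckDB settings on each new connection; a simplified sketch (the authoritative version is scripts/pipeline/spec.py, which also handles the storage-compatibility pin):

```python
# Simplified sketch of a duckdb_connect-style helper reading the env vars above.
import os
import duckdb

def duckdb_connect(database: str = ":memory:") -> duckdb.DuckDBPyConnection:
    con = duckdb.connect(database)
    if limit := os.environ.get("RAINCLOUD_DUCKDB_MEMORY_LIMIT"):
        con.execute(f"SET memory_limit = '{limit}'")    # e.g. '8GB', '512MB'
    if threads := os.environ.get("RAINCLOUD_DUCKDB_THREADS"):
        con.execute(f"SET threads = {int(threads)}")
    if tmp := os.environ.get("RAINCLOUD_DUCKDB_TEMP_DIRECTORY"):
        con.execute(f"SET temp_directory = '{tmp}'")    # spill location
    # The real helper also pins storage_compatibility_version for persistent
    # databases (required for VARIANT columns); omitted here.
    return con
```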
See sources.schema.md for the human-friendly reference and sources.schema.json for the machine-readable JSON Schema (Draft 2020-12). After editing the manifest, run
python -m scripts.pipeline.validate_manifest to catch typo'd handler names, slug collisions, and shape errors in under a second before paying for a fetch.
A minimal entry looks like the clickbench-hits spec reproduced at the end of this README.
Current counts:
- 249 datasets across five families: direct, kaggle-upstream, nyc-tlc, public-bi, uci.
- Fetch types in use: http, kaggle, huggingface.
- Parse readers in use: csv, parquet, jsonl, xml, pbf, custom.
- Schema version: 1; outputs land in outputs/v1/.
Each dataset produces exactly one parquet per output slug (some handlers split one source into multiple outputs: glove_split → 3, osm_pbf_split → 3, stack_exchange_split → N). Within each slug directory, artefacts live in per-format subdirectories so additional tiers (Vortex sibling, future hydrated copies, partitioned variants) can coexist without filename collisions:
outputs/v1/clickbench-hits/parquet/clickbench-hits.parquet
outputs/v1/clickbench-hits/vortex/clickbench-hits.vortex
outputs/v1/glove-6b-50d/parquet/glove-6b-50d.parquet
outputs/v1/glove-6b-50d/vortex/glove-6b-50d.vortex
outputs/v1/osm-germany-nodes/parquet/osm-germany-nodes.parquet
...
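Because the layout is uniform, walking the built catalog needs nothing beyond pathlib; a small illustrative sketch (python -m scripts.pipeline.status is the supported way to get this report):

```python
# Walk outputs/v1 and report which slugs have parquet and/or vortex artefacts.
from pathlib import Path

root = Path("outputs/v1")
for slug_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    slug = slug_dir.name
    has_parquet = (slug_dir / "parquet" / f"{slug}.parquet").exists()
    has_vortex = (slug_dir / "vortex" / f"{slug}.vortex").exists()
    print(f"{slug:40s} parquet={has_parquet} vortex={has_vortex}")
```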
Raw downloads are cached separately and not version-scoped, since the same upstream bytes can feed any schema_version:
outputs/raw_downloads/clickbench-hits/hits.parquet
outputs/raw_downloads/glove-6b-50d/glove.6B.zip # hardlinked into sibling slugs
outputs/raw_downloads/osm-germany-nodes/germany-latest.osm.pbf
Sibling slugs sharing the same upstream URL (GloVe 50d/100d/200d; OSM Germany nodes/ways/relations) are deduped via hardlink during fetch.
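The dedup itself amounts to hardlinking the already-downloaded file into the sibling slug's cache directory, with a copy as fallback when linking isn't possible; a hedged sketch (the real logic lives in scripts/pipeline/fetch.py and may differ):

```python
# Sketch of hardlink-based dedup between sibling slugs sharing an upstream file.
import os
import shutil
from pathlib import Path

def link_or_copy(existing: Path, target: Path) -> None:
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        return  # already materialised for this slug
    try:
        os.link(existing, target)       # same bytes on disk, no extra space used
    except OSError:
        shutil.copy2(existing, target)  # e.g. cross-filesystem: fall back to a copy
```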
The manifest is curated to exercise a broad range of Parquet logical and nested types, including:
- VARIANT: countries-of-the-world (227 country JSON blobs), jsonbench-bluesky-100m (100M Bluesky firehose records).
- GeoParquet 1.1 with WKB geometry: osm-germany-{nodes,ways,relations}.
- fixed_size_list<float32, N>: GloVe embeddings, dbpedia 1536-dim OpenAI embeddings.
- list<...> with tightened element types, e.g. list<uint32> for Hacker News kids/parts.
- Nested struct / list<struct> / map<string, int64>: Wikipedia Structured Contents.
- Timestamp precision narrowing (ns → ms where every value is a whole second).
- UUID and JSON logical-type annotations on string columns.
- DECIMAL(P, S) where every double value round-trips losslessly through the chosen precision.
Type-tightening is idempotent: tighten_types can be re-run against any parquet in outputs/v*/ without regressing the widths.
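To make the idempotency claim concrete, here is a minimal sketch of one such rule (narrowing an int64 column to uint32 when every value fits); the real tighten_types handler covers far more cases and is the source of truth:

```python
# Illustrative tightening rule: narrow int64 -> uint32 when the values fit.
# Re-running is a no-op: an already-narrowed column fails the is_int64 check.
import pyarrow as pa
import pyarrow.compute as pc

def narrow_int_column(table: pa.Table, name: str) -> pa.Table:
    col = table[name]
    if not pa.types.is_int64(col.type):
        return table  # already narrowed, or not an int64 column
    bounds = pc.min_max(col)
    lo, hi = bounds["min"].as_py(), bounds["max"].as_py()
    if lo is not None and lo >= 0 and hi <= 2**32 - 1:
        idx = table.schema.get_field_index(name)
        table = table.set_column(idx, name, col.cast(pa.uint32()))
    return table
```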
- docs/datasets.md: one row per dataset with short/full name, description, source URL, data kind, license, row count, row group count, and file size. Regenerate with python -m scripts.pipeline.docs datasets.
- docs/handlers.md: one row per registered transform handler with its one-line purpose, streaming flag, the format-specific deps it imports (e.g. pandas, openpyxl, pyreadstat, osmium, zstandard, unlzw3; pyarrow / numpy / duckdb are suppressed as core), manifest-spec usage count, and example slugs. Useful when picking a handler for a new dataset, finding precedent for a given upstream shape, or knowing which extras a new manifest entry will pull in. Regenerate with python -m scripts.pipeline.docs handlers.
Both are machine-generated; do not hand-edit. python -m scripts.pipeline.docs with no args refreshes both.
HYDRATING.md is hand-maintained policy / philosophy for the optional hydrate stage: preamble only, no auto-generated per-slug list.
Other catalog views (column index, type coverage, vortex-skip list, hydration candidates) used to be auto-generated markdown. They moved out of the markdown layer because the multi-megabyte indexes were unscannable as a reading experience and duplicated state that was already queryable elsewhere. They're now flags on list_datasets:
python -m scripts.pipeline.list_datasets --columns [<slug>...] [--column-grep PATTERN]
python -m scripts.pipeline.list_datasets --coverage [--source parquet|vortex]
python -m scripts.pipeline.list_datasets --no-vortex --json # vortex-skip slugs + reasons
python -m scripts.pipeline.list_datasets --hydrate --long # hydration candidates
Or use python -m scripts.pipeline.browse for an interactive view.
Bug reports, feature requests, and PRs are welcome. See
CONTRIBUTING.md for dev-environment setup, the pre-PR
check sequence, and pointers into SKILLS.md for the most common
change types (new dataset, new handler). Notable changes land in
CHANGELOG.md.
Please report vulnerabilities privately rather than via a public issue β see
SECURITY.md for the disclosure channel and timelines.
DISCLAIMER.md covers Raincloud's posture on third-party
datasets: AS IS warranty disclaimer, content and association disclaimer
(any fetched file may contain questionable or offensive material β we
don't audit upstream content), license diligence, supply-chain risk, and
the process for requesting that a dataset be removed from sources.json.
Raincloud is licensed under the Apache License 2.0. Each dataset
declared in sources.json carries its own upstream license under the
license.spdx field; those licenses govern redistribution of any Parquet /
Vortex artefact built against that upstream and are independent of the
license covering the pipeline code itself.
{ "slug": "clickbench-hits", "short_name": "ClickBench Hits", "family": "direct", "license": { "spdx": "Apache-2.0", "source_url": "..." }, "fetch": { "type": "http", "urls": ["https://datasets.clickhouse.com/hits_compatible/hits.parquet"] }, "extract": { "type": "passthrough" }, "parse": { "reader": "parquet" }, "transform": { "handler": "identity" }, "write": { "output": "clickbench-hits.parquet", "compression": "zstd" }, "expect": { "rows": 99997497 } }