280 changes: 280 additions & 0 deletions .cursor/plans/dvc_pipeline_setup_872f50fc.plan.md
@@ -0,0 +1,280 @@
---
name: DVC Pipeline Setup
overview: Set up DVC to define a cached, repeatable data pipeline for the OneZoom tree-build project, replacing manual download/filter workarounds with a declarative `dvc.yaml` pipeline backed by shared remote cache storage.
todos:
- id: install-dvc
content: Add dvc to pyproject.toml dependencies, run dvc init to create .dvc/ directory
status: completed
- id: params-yaml
content: Create params.yaml with oz_tree, ot_version, ot_taxonomy_version, ot_taxonomy_extra, build_version, exclude_from_popularity
status: completed
- id: split-filters
content: "Split generate_filtered_files.py into 4 separate filter modules: filter_eol.py, filter_wikidata.py, filter_wikipedia_sql.py, filter_pageviews.py. Remove generate_and_cache_filtered_file. Each module gets its own CLI entry point writing to a specified output path."
status: completed
- id: register-scripts
content: Register the 4 new filter scripts as console_scripts in pyproject.toml
status: completed
- id: dvc-yaml
content: Create dvc.yaml with all 11 pipeline stages (tree build + 4 parallel-capable filter stages + tables + JS) using DVC templating from params.yaml
status: completed
- id: gitignore-update
content: "Update .gitignore and data/ .gitignore files: add data/filtered/, data/output_files/js/, ensure .dvc files are not ignored"
status: completed
- id: update-docs
content: Update README.markdown, oz_tree_build/README.markdown, and data/Wiki/README.markdown with DVC workflow
status: completed
isProject: false
---

# DVC Pipeline for OneZoom Tree-Build

## Current State

The build process is a sequence of manual shell commands documented in `[oz_tree_build/README.markdown](oz_tree_build/README.markdown)`. Key pain points:

- Massive source files (Wikidata ~100GB, enwiki SQL ~1GB, pageviews multi-GB) must be downloaded by every contributor
- `generate_filtered_files` takes **5-7 hours** to reduce these to usable subsets
- Pre-processed pageviews are distributed as GitHub releases as a workaround (no longer needed with DVC)
- No caching or reproducibility guarantees

## Target Workflow

```bash
# First person: downloads data, runs pipeline, pushes cache
dvc repro
dvc push

# Everyone else: pulls only the cached outputs they need
dvc repro --pull --allow-missing
```

If nothing has changed, `dvc repro --pull --allow-missing` pulls pre-built outputs from shared storage -- no multi-GB downloads, no 5-7 hour filtering runs.

## Pipeline DAG

The monolithic `generate_filtered_files` step is split into 4 independent filter stages. The EOL and wikidata filters can run in parallel (both depend on the taxonomy). The SQL and pageview filters can run in parallel (both depend on the filtered wikidata output).

```mermaid
graph TD
OT_tre[labelled_supertree.tre.dvc] --> preprocess_opentree
OT_tgz[ott_taxonomy.tgz.dvc] --> unpack_taxonomy
preprocess_opentree --> prepare_open_trees
unpack_taxonomy --> add_ott_numbers
bespoke[BespokeTree in git] --> add_ott_numbers
add_ott_numbers --> prepare_open_trees
OT_req[OT_required in git] --> prepare_open_trees
prepare_open_trees --> build_tree

unpack_taxonomy --> filter_eol
EOL[provider_ids.csv.gz.dvc] --> filter_eol

unpack_taxonomy --> filter_wikidata
WD[latest-all.json.bz2.dvc] --> filter_wikidata

filter_wikidata --> filter_sql
WP_SQL[enwiki-page.sql.gz.dvc] --> filter_sql

filter_wikidata --> filter_pageviews
WP_PV[wp_pagecounts.dvc] --> filter_pageviews

build_tree --> create_tables
filter_eol --> create_tables
filter_wikidata --> create_tables
filter_sql --> create_tables
filter_pageviews --> create_tables
unpack_taxonomy --> create_tables
SupTax[SupplementaryTaxonomy in git] --> create_tables
create_tables --> make_js
```
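
Once `dvc.yaml` exists, the same graph can be rendered by DVC itself to confirm the wiring matches this diagram:

```bash
dvc dag         # ASCII stage graph
dvc dag --outs  # keyed by output files instead of stage names
```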

## Key Design Decisions

### 1. Parameters in `params.yaml` (replaces env vars)

Currently `OT_VERSION`, `OT_TAXONOMY_VERSION`, `OT_TAXONOMY_EXTRA`, and `OZ_TREE` are shell environment variables. These become DVC parameters so that changing a version automatically invalidates the right stages.

```yaml
# params.yaml
oz_tree: AllLife
ot_version: "15.1"
ot_taxonomy_version: "3.7"
ot_taxonomy_extra: "draft2"
build_version: 28017344 # deterministic version for CSV_base_table_creator (replaces time-based default)
exclude_from_popularity:
- Archosauria_ott335588
- Dinosauria_ott90215
```

The `build_version` param is important: `CSV_base_table_creator` defaults to `int(time.time()/60)`, which would make outputs non-deterministic. A fixed param ensures DVC caching works correctly.
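
As a sketch of how this wires into `dvc.yaml` (the `--version` flag name is an assumption, not the tool's confirmed CLI):

```yaml
# dvc.yaml excerpt (sketch; the flag name is illustrative)
stages:
  create_tables:
    cmd: CSV_base_table_creator --version ${build_version} ...
    params:
      - build_version
      - exclude_from_popularity
```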

### 2. Source data tracked with `dvc add`

Large downloaded files are tracked via `dvc add`, producing `.dvc` files committed to git. The raw data itself lives only in DVC cache/remote, never in git. Files to track:

- `data/OpenTree/labelled_supertree_simplified_ottnames.tre`
- `data/OpenTree/ott${ot_taxonomy_version}.tgz`
- `data/Wiki/wd_JSON/latest-all.json.bz2`
- `data/Wiki/wp_SQL/enwiki-latest-page.sql.gz`
- `data/Wiki/wp_pagecounts/` (directory -- raw pageview files; pre-processed GitHub releases are no longer needed since DVC caches the filtered outputs)
- `data/EOL/provider_ids.csv.gz`

With `--allow-missing`, DVC can skip stages whose inputs haven't changed even when the raw files aren't present locally.
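
For reference, each `dvc add` leaves behind a small pointer file that git tracks in place of the data, along these lines (hash and size made up):

```yaml
# data/EOL/provider_ids.csv.gz.dvc
outs:
- md5: 1f0e3dad99908345f7439f8ffabdffc4
  size: 483812233
  hash: md5
  path: provider_ids.csv.gz
```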

### 3. Split filters into separate modules and remove mtime caching

The monolithic `[generate_filtered_files.py](oz_tree_build/utilities/generate_filtered_files.py)` will be refactored:

**Remove `generate_and_cache_filtered_file`** -- this function implements mtime-based caching (comparing filtered file timestamps to source file timestamps). DVC's run cache completely supersedes this. Each filter script simply writes its output; DVC decides whether to run it.

**Split into 4 separate filter modules**, each with its own CLI entry point (a sketch of the shared CLI shape follows the list):

- `oz_tree_build/utilities/filter_eol.py` -- filters EOL provider IDs CSV
- Inputs: EOL CSV (gz), taxonomy.tsv
- Output: filtered EOL CSV
- Reads taxonomy to build `source_ids` (NCBI, IF, WoRMS, IRMNG, GBIF sets), then keeps only matching EOL rows
- `oz_tree_build/utilities/filter_wikidata.py` -- filters the massive wikidata JSON dump (~100GB compressed)
- Inputs: wikidata JSON (bz2), taxonomy.tsv
- Outputs: filtered wikidata JSON, **plus a sidecar `wikidata_titles.txt`** (one Wikipedia page title per line)
- The sidecar file replaces the in-memory `context.wikidata_ids` handoff. It's produced by running the equivalent of `read_wikidata_dump()` on the filtered output and writing the titles to a text file. This is the key that enables SQL and pageview filters to run independently.
- `oz_tree_build/utilities/filter_wikipedia_sql.py` -- filters enwiki SQL page dump
- Inputs: enwiki SQL (gz), `wikidata_titles.txt`
- Output: filtered SQL file
- Reads the titles file to build the filter set (replaces `context.wikidata_ids`)
- `oz_tree_build/utilities/filter_pageviews.py` -- filters Wikipedia pageview files
- Inputs: one or more pageview files (bz2), `wikidata_titles.txt`
- Output: filtered pageview files in output directory
- Reads the titles file to build the filter set
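
A minimal sketch of that shared CLI shape, using the SQL filter as the example (argument names are assumptions; the filtering body would be ported from `generate_filtered_files.py` and is elided here):

```python
# oz_tree_build/utilities/filter_wikipedia_sql.py (sketch)
import argparse


def read_titles(titles_path):
    # Load the wikidata_titles.txt sidecar (one Wikipedia title per line)
    # into a set, replacing the old in-memory context.wikidata_ids handoff.
    with open(titles_path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def filter_sql_dump(sql_gz_path, titles, output_path):
    # Placeholder: the real logic (ported from generate_filtered_files.py)
    # parses the gzipped INSERT statements and keeps only rows whose page
    # title appears in `titles`, writing them to output_path.
    raise NotImplementedError


def main():
    parser = argparse.ArgumentParser(
        description="Filter the enwiki page SQL dump to wanted titles"
    )
    parser.add_argument("sql_dump", help="path to enwiki-latest-page.sql.gz")
    parser.add_argument("titles_file", help="path to wikidata_titles.txt")
    parser.add_argument("output", help="path for the filtered SQL file")
    args = parser.parse_args()
    filter_sql_dump(args.sql_dump, read_titles(args.titles_file), args.output)


if __name__ == "__main__":
    main()
```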

**Shared code** stays in `generate_filtered_files.py` (or a new common module): `read_taxonomy_file`, helper imports, and the orchestrating `generate_all_filtered_files` function (simplified to call the individual filter modules directly, useful for non-DVC usage and clade-specific test filtering).

**New console scripts** registered in `pyproject.toml`:

```
filter_eol = "oz_tree_build.utilities.filter_eol:main"
filter_wikidata = "oz_tree_build.utilities.filter_wikidata:main"
filter_wikipedia_sql = "oz_tree_build.utilities.filter_wikipedia_sql:main"
filter_pageviews = "oz_tree_build.utilities.filter_pageviews:main"
```

The parallelism benefit: `filter_eol` and `filter_wikidata` share no outputs, so DVC can run them concurrently. Once `filter_wikidata` finishes and produces `wikidata_titles.txt`, `filter_sql` and `filter_pageviews` can also run concurrently.

### 4. JS output stays in this repo

`make_js_treefiles` currently defaults to writing into `../OZtree/static/FinalOutputs/data/`. In the DVC pipeline, use `--outdir data/output_files/js/` to keep outputs within this repo for DVC tracking. Users copy to OZtree manually afterward.
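
The stage command then simply pins that flag:

```bash
make_js_treefiles --outdir data/output_files/js/
```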

### 5. DVC remote (shared cache)

A DVC remote must be configured for shared caching. This is a one-line config per backend:

```bash
dvc remote add -d myremote s3://my-bucket/dvc-cache # S3
dvc remote add -d myremote gs://my-bucket/dvc-cache # GCS
dvc remote add -d myremote ssh://server:/path/to/cache # SSH
dvc remote add -d myremote /mnt/shared/dvc-cache # local/NFS
```

The choice of backend can be made later; the pipeline design is independent of it.
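
If an S3-compatible backend is chosen, per-user credentials can be kept out of git with `--local`, which writes to the untracked `.dvc/config.local` (key values here are placeholders):

```bash
dvc remote modify --local myremote access_key_id 'AKIA...'
dvc remote modify --local myremote secret_access_key '<secret>'
```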

## Pipeline Stages (`dvc.yaml`)

The `dvc.yaml` at the project root will define these stages (using DVC templating with `vars` from `params.yaml`):

**preprocess_opentree** -- Perl one-liner to strip mrca labels and normalize underscores (sketched below)

- deps: `data/OpenTree/labelled_supertree_simplified_ottnames.tre`
- params: `ot_version`
- outs: `data/OpenTree/draftversion${ot_version}.tre`
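
As a sketch (the Perl substitution is an illustrative stand-in for the real one):

```yaml
stages:
  preprocess_opentree:
    cmd: >-
      perl -pe 's/mrcaott\d+ott\d+//g'
      data/OpenTree/labelled_supertree_simplified_ottnames.tre
      > data/OpenTree/draftversion${ot_version}.tre
    deps:
      - data/OpenTree/labelled_supertree_simplified_ottnames.tre
    params:
      - ot_version
    outs:
      - data/OpenTree/draftversion${ot_version}.tre
```

Values from `params.yaml` are available for `${...}` templating by default, so bumping `ot_version` renames the output and invalidates every downstream stage.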

**unpack_taxonomy** -- extract taxonomy.tsv from tarball

- deps: `data/OpenTree/ott${ot_taxonomy_version}.tgz`
- params: `ot_taxonomy_version`
- outs: `data/OpenTree/ott${ot_taxonomy_version}/` (directory)

**add_ott_numbers** -- call OpenTree API to annotate bespoke trees with OTT IDs

- deps: `data/OZTreeBuild/${oz_tree}/BespokeTree/include_noAutoOTT/`
- params: `oz_tree`, `ot_taxonomy_version`, `ot_taxonomy_extra`
- outs: `data/OZTreeBuild/${oz_tree}/BespokeTree/include_OTT${ot_taxonomy_version}${ot_taxonomy_extra}/`
- Note: calls external API; cached unless inputs change. Use `dvc repro -f add_ott_numbers` to force refresh.

**prepare_open_trees** -- copy supplementary .nwk files and extract OpenTree subtrees

- deps: `draftversion${ot_version}.tre`, `include_OTT.../`, `OT_required/`
- outs: `data/OZTreeBuild/${oz_tree}/OpenTreeParts/OpenTree_all/`

**build_tree** -- assemble the full newick tree

- deps: `include_OTT.../`, `OpenTree_all/`
- outs: `data/OZTreeBuild/${oz_tree}/${oz_tree}_full_tree.phy`

**filter_eol** -- filter EOL provider IDs to relevant sources

- deps: `data/EOL/provider_ids.csv.gz`, `data/OpenTree/ott${ot_taxonomy_version}/taxonomy.tsv`
- outs: `data/filtered/OneZoom_provider_ids.csv`
- Parallelizable with `filter_wikidata`

**filter_wikidata** -- filter the massive wikidata JSON down to taxon/vernacular items (the most expensive step; it runs for hours)

- deps: `data/Wiki/wd_JSON/latest-all.json.bz2`, `data/OpenTree/ott${ot_taxonomy_version}/taxonomy.tsv`
- outs: `data/filtered/OneZoom_latest-all.json`, `data/filtered/wikidata_titles.txt`
- Parallelizable with `filter_eol`

**filter_sql** -- filter enwiki SQL page dump to matching titles

- deps: `data/Wiki/wp_SQL/enwiki-latest-page.sql.gz`, `data/filtered/wikidata_titles.txt`
- outs: `data/filtered/OneZoom_enwiki-latest-page.sql`
- Parallelizable with `filter_pageviews`

**filter_pageviews** -- filter and aggregate Wikipedia pageview counts

- deps: `data/Wiki/wp_pagecounts/`, `data/filtered/wikidata_titles.txt`
- outs: `data/filtered/pageviews/` (directory of filtered pageview files)
- Parallelizable with `filter_sql`

**create_tables** -- map taxa, calculate popularity, produce DB-ready CSVs and ordered trees

- deps: full tree, taxonomy, all `data/filtered/` outputs, `SupplementaryTaxonomy.tsv`
- params: `build_version`, `exclude_from_popularity`
- outs: `data/output_files/`

**make_js** -- convert ordered trees to JS viewer files

- deps: `data/output_files/`
- outs: `data/output_files/js/`
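
Once defined, individual targets can be checked and rebuilt without touching the rest of the graph:

```bash
dvc status          # list stages whose deps or params have changed
dvc repro make_js   # re-run stale upstream stages, then make_js
```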

## Files to Create/Modify

- **Create** `params.yaml` -- pipeline parameters
- **Create** `dvc.yaml` -- pipeline definition (11 stages)
- **Create** `oz_tree_build/utilities/filter_eol.py` -- standalone EOL filter with CLI
- **Create** `oz_tree_build/utilities/filter_wikidata.py` -- standalone wikidata filter with CLI
- **Create** `oz_tree_build/utilities/filter_wikipedia_sql.py` -- standalone SQL filter with CLI
- **Create** `oz_tree_build/utilities/filter_pageviews.py` -- standalone pageviews filter with CLI
- **Modify** `[oz_tree_build/utilities/generate_filtered_files.py](oz_tree_build/utilities/generate_filtered_files.py)` -- remove `generate_and_cache_filtered_file`, simplify to orchestrator that calls the new modules (retains clade-filtering support for tests)
- **Modify** `[pyproject.toml](pyproject.toml)` -- add `dvc` to dependencies, register 4 new console scripts
- **Modify** `[.gitignore](.gitignore)` -- add `/data/filtered/`; DVC internals are handled by `dvc init`
- **Update** `[README.markdown](README.markdown)` -- new DVC-based workflow instructions
- **Update** `[oz_tree_build/README.markdown](oz_tree_build/README.markdown)` -- reference DVC pipeline
- **Update** `[data/Wiki/README.markdown](data/Wiki/README.markdown)` -- remove pre-processed pageview GitHub release instructions (DVC cache replaces this entirely)

After creating these files, the first pipeline run involves:

```bash
pip install -e .
dvc init
# download source files, then:
dvc add data/OpenTree/labelled_supertree_simplified_ottnames.tre
dvc add data/OpenTree/ott3.7.tgz
dvc add data/Wiki/wd_JSON/latest-all.json.bz2
dvc add data/Wiki/wp_SQL/enwiki-latest-page.sql.gz
dvc add data/Wiki/wp_pagecounts/
dvc add data/EOL/provider_ids.csv.gz
dvc repro
dvc push
git add . && git commit -m "Add DVC pipeline"
```

But this should not be run as part of this plan; the user will run it manually after the pipeline is set up.

Also note that the individual large stages should not be run as part of this plan: the input files are massive and the processing takes a long time, so the user will schedule the runs for a convenient time.
3 changes: 3 additions & 0 deletions .dvc/.gitignore
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
5 changes: 5 additions & 0 deletions .dvc/config
@@ -0,0 +1,5 @@
[core]
remote = jared-r2
['remote "jared-r2"']
url = s3://onezoom
endpointurl = https://9d168184d3ac384b6a159313dd90a75a.r2.cloudflarestorage.com
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
43 changes: 31 additions & 12 deletions README.markdown
@@ -21,6 +21,8 @@ If you want to run the test suite, make sure the test requirements are also installed

pip install -e '.[test]'

To be able to run the pipeline, you'll also need to install `wget`.

## Testing

Assuming you have installed the test requirements, you should be able to run
@@ -41,22 +43,39 @@ you will need a valid Azure Image cropping key in your appconfig.ini.

## Building the latest tree from OpenTree

### Setup
This project uses [DVC](https://dvc.org/) to manage the pipeline. The build parameters are defined in `params.yaml` and the pipeline stages are declared in `dvc.yaml`.

We assume that you want to build a OneZoom tree based on the most recent online OpenTree version.
You can check the most recent version of both the synthetic tree (`synth_id`) and the taxonomy (`taxonomy_version`) via the
[API](https://github.com/OpenTreeOfLife/germinator/wiki/Open-Tree-of-Life-Web-APIs) e.g. by running `curl -X POST https://api.opentreeoflife.org/v3/tree_of_life/about`. Later in the build, we use specific environment variables set to these version numbers. Assuming you are in a bash shell or similar, you can set them as follows:
### Quick start (using cached outputs)

You'll need to ask for the DVC remote credentials on the OneZoom Slack channel in order to pull cached results.
Then, if someone has already run the pipeline and pushed the results to the DVC remote, you can reproduce the build and any of the intermediate stages without downloading any of the massive source files:

```bash
source .venv/bin/activate
dvc repro --pull --allow-missing
```
OT_VERSION=14.9 #or whatever your OpenTree version is
OT_TAXONOMY_VERSION=3.6
OT_TAXONOMY_EXTRA=draft1 #optional - the draft for this version, e.g. `draft1` if the taxonomy_version is 3.6draft1
```

### Download
DVC will pull only the cached outputs needed for stages that haven't changed. If all stages are cached, nothing needs to be re-run.

### Full build (first time / updating source data)

1. Set `ot_version` in `params.yaml` to the desired OpenTree synthesis version (e.g. `"v16.1"`). Available versions can be found in the [synthesis manifest](https://raw.githubusercontent.com/OpenTreeOfLife/opentree/master/webapp/static/statistics/synthesis.json). The OpenTree tree and taxonomy will be downloaded automatically by the `download_opentree` pipeline stage.

2. Some source files are unversioned, so DVC will reuse cached results unless forced. To re-download them all with the latest upstream data:

```bash
dvc repro --force download_eol download_wikipedia_sql download_and_filter_wikidata download_and_filter_pageviews
```

Note that `download_and_filter_wikidata` and `download_and_filter_pageviews` take several hours to run.

3. Run the pipeline and push results to the shared cache:

Constructing the full tree of life requires various files downloaded from the internet. They should be placed within the appropriate directories in the `data` directory, as [documented here](data/README.markdown).
```bash
dvc repro
dvc push
```

### Building the tree
4. Commit `dvc.lock` to git.

Once data files are downloaded, you should be set up to actually build the tree and other backend files, by following [these instructions](oz_tree_build/README.markdown).
For detailed step-by-step documentation, see [oz_tree_build/README.markdown](oz_tree_build/README.markdown).
2 changes: 2 additions & 0 deletions data/.gitignore
@@ -0,0 +1,2 @@
/js_output
/output_files
3 changes: 2 additions & 1 deletion data/EOL/.gitignore
@@ -3,4 +3,5 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown
!*.dvc
1 change: 1 addition & 0 deletions data/OZTreeBuild/AllLife/OpenTreeParts/.gitignore
@@ -0,0 +1 @@
/OpenTree_all

This file was deleted.

2 changes: 1 addition & 1 deletion data/OpenTree/.gitignore
@@ -3,4 +3,4 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown