Merged
3 changes: 0 additions & 3 deletions .env-example
@@ -3,6 +3,3 @@ NUMERAI_PRIVATE_API_KEY=...
 
 # Optional: for wandb logging (or use `wandb login`)
 WANDB_API_KEY=...
-
-# Optional: for TabPFN3ReasoningModel / TabPFN3Reasoning API (pip install 'alphapulse[foundation-api]')
-TABPFN_API_KEY=...
30 changes: 16 additions & 14 deletions CHANGELOG.md
@@ -4,21 +4,23 @@ All notable changes to AlphaPulse are documented here.
 
 ---
 
-## [Unreleased] — WandB XAI & Plot Quality Overhaul
+## [Unreleased]
 
-- **Universal feature importance:** `compute_universal_feature_importance` extracts and normalizes importance from any supported model type (XGBoost pred_contribs, LightGBM gain, CatBoost PredictionValuesChange, sklearn `feature_importances_`), averages across all models present, and logs a ranked bar chart to WandB.
-- **Era-stratified importance:** `_log_era_stratified_importance` slices validation data by era, computes importance per slice, and logs a `line_series` chart showing each feature's importance trajectory over eras — directly reveals temporal stability.
-- **Per-era stability report wired:** `compute_feature_report` (LightGBM proxy) now surfaces in WandB via `_log_feature_report`; logs top features by mean importance, top by era stability, and worst by era stability — each with bar charts.
-- **Best-trial diagnostics run:** After HPO, the best config is retrained on an 80/20 era split and all expensive diagnostics (`log_era_importance=True`, top-50 importance artifact) are logged to a dedicated `best-trial-diagnostics` WandB run.
-- **Prediction histogram fixed:** `_log_prediction_diagnostics` now uses `np.histogram(bins=50)` (50 rows) instead of logging every prediction row (50k+ rows).
-- **Per-era line charts fixed:** `era_index` (0, 1, 2…) used as x-axis — fixes alphabetical string sort that scrambled chronological order.
-- **Drawdown curve added:** Per-era drawdown from peak cumulative correlation logged alongside the cumulative correlation line chart.
-- **Correlation distribution histogram:** Distribution of per-era Spearman correlations logged as a bar chart — directly answers "how many negative eras does this model have?"
-- **Missing bar charts added:** Feature exposure top-15, ensemble model-pair correlation (A→B format), worst stability by era — all now have companion bar charts.
-- **HPO summary table expanded:** 18 → 30 columns; adds `model_1/2/3_type` (split, for WandB parallel coordinates), XGBoost/LightGBM hyperparams, feature selection, noise injection, augmentation flags.
-- **Convergence chart:** `log_hpo_convergence` logs all trial scores and running-best `corr_sharpe` in a single WandB run after the HPO search completes, rendering as a proper convergence curve.
-- **String metric bug fixed:** `feature_importance_model_type` moved from `wandb.log()` (coerced to NaN) to `wandb.run.summary`.
-- **Duplicate metric removed:** `metric/corr_sharpe` deduplicated in `log_hpo_trial_metrics`.
+## [0.6.0] — MMC Scoring & W&B Chart Diagnostics
+
+- **MMC on validation split:** `load_mmc_validation_frame` aligns `validation.parquet` with `meta_model.parquet`; HPO merges `mmc`, `mmc_sharpe`, and `payout_score` after train-era holdout evaluation so W&B `metric/mmc` is no longer null.
+- **W&B diagnostics as charts:** Raw `diagnostics/` tables replaced with matplotlib horizontal bar charts, correlation heatmaps, and line charts; NaN metrics are skipped in trial logging.
+- **Live W&B training logs:** `wandb_logging.py` bridges loguru to the W&B Logs panel and logs per-round XGBoost metrics during training.
+- **MultiTarget diagnostics:** `pipeline/model_access.py` provides `iter_trained_models`, `model_prediction_map`, and `multitarget_blend_weights` for SHAP and ensemble diagnostics on multi-target pipelines.
+- **Feature catalog & HPO routing:** `features/catalog.py`, `hpo/feature_routing.py`, and `hpo/target_strategy.py` resolve feature sets and multi-target training from `features.json`.
+- **HPO export module:** `hpo/export.py` centralises pipeline fitting for Numerai pickle export from flat HPO configs.
+- **Universal feature importance:** `compute_universal_feature_importance` extracts importance from XGBoost, LightGBM, CatBoost, and sklearn tree models; logs ranked bar charts to W&B.
+- **Era-stratified importance:** `_log_era_stratified_importance` logs per-era importance `line_series` for temporal stability analysis.
+- **Per-era stability report:** `compute_feature_report` (LightGBM proxy) surfaces top/mean/worst stability features as bar charts in W&B.
+- **Best-trial diagnostics run:** After HPO, the best config is retrained and logged to a dedicated `best-trial-diagnostics` WandB run with full XAI artifacts.
+- **Per-era line charts fixed:** `era_index` used as x-axis for chronological ordering; drawdown and correlation distribution charts added.
+- **HPO summary expanded:** 30-column trial table plus scatter charts for corr Sharpe, MMC Sharpe, and runtime; `log_hpo_convergence` renders running-best curve in one W&B run.
+- **W&B metric fixes:** `feature_importance_model_type` stored in `wandb.run.summary`; duplicate `metric/corr_sharpe` removed; finite-check on NaN metrics.
 
 ## [0.5.0] — Production Hardening
 
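The changelog entries on per-era charts (integer `era_index` on the x-axis, a drawdown curve, and the distribution of per-era correlations) can be illustrated with a small sketch. It assumes the per-era correlations are already available as a plain sequence in chronological order. `era_drawdown` and the sample values are hypothetical and are not taken from the AlphaPulse source; the W&B logging calls are omitted.

```python
import numpy as np


def era_drawdown(per_era_corr):
    """Return (cumulative correlation, drawdown from the running peak).

    Hypothetical helper for illustration only; the peak is taken over the
    cumulative series itself, which is one of several possible conventions.
    """
    corr = np.asarray(per_era_corr, dtype=float)
    cumulative = np.cumsum(corr)
    running_peak = np.maximum.accumulate(cumulative)
    drawdown = running_peak - cumulative  # 0 at a new peak, positive below it
    return cumulative, drawdown


# Made-up per-era correlations, already in chronological order.
corrs = [0.02, -0.01, 0.03, -0.04, 0.01]
cumulative, drawdown = era_drawdown(corrs)

# An integer index keeps eras in chronological order on a chart axis,
# whereas era labels stored as strings may be sorted alphabetically.
era_index = np.arange(len(corrs))

# "How many negative eras does this model have?"
negative_eras = int((np.asarray(corrs) < 0).sum())
```

Plotting `cumulative` and `drawdown` against `era_index` gives the two line charts the changelog describes; a histogram of `corrs` gives the correlation distribution.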
36 changes: 29 additions & 7 deletions README.md
@@ -1,4 +1,4 @@
-# AlphaPulse v0.5.0
+# AlphaPulse v0.6.0
 
 AlphaPulse is a config-driven framework for building, training, and deploying ML pipelines for the [Numerai](https://numer.ai) stock-market prediction tournament. It covers the full workflow: dataset download, experiment definition, backtesting, hyperparameter optimization (HPO), and automated weekly submission.
 
@@ -13,7 +13,7 @@ The framework is organized into five layers:
 | **Data** | `NumeraiDataLoader`, parquet files, `features.json` | Downloads and loads Numerai dataset splits (train/validation/live) |
 | **Configuration** | `ExperimentV1` YAML schema, HPO search space, `TrialDB`, AutoResearch agent | Defines what to train — via static YAML, automated HPO, or Claude-agent-driven research |
 | **Core Pipeline** | Preprocessors, Models, `Pipeline` / `MultiHeadPipeline`, Ensemble, `FeatureNeutralizer` | Fits and combines models; handles feature routing, ensembling, and prediction neutralization |
-| **Evaluation** | `Backtester`, `PurgedEraCV`, SHAP report, W&B diagnostics | Computes era-aware metrics (CORR, Sharpe, MMC) and XAI reports |
+| **Evaluation** | `Backtester`, `PurgedEraCV`, SHAP report, W&B diagnostics (charts) | Era-aware metrics (CORR, Sharpe, MMC on validation split) and matplotlib XAI plots |
 | **Export & Submission** | `predict.pkl`, live inference, submission validation, Numerai upload | Produces tournament-ready predictions and submits them |
 
 > The diagram is editable — open `docs/assets/architecture.drawio` in [draw.io](https://app.diagrams.net) to modify it.
@@ -194,9 +194,22 @@ uv run python scripts/hpo_pipeline.py \
 --train-subsample 0.125 \
 --num-trials 30 \
 --output-dir artifacts/hpo_x8 \
---local
+--local \
+--wandb-project alphapulse-hpo
 ```
 
+**Useful flags:**
+- `--wandb-project <name>` — log every trial to Weights & Biases (project name is timestamped and saved for `--resume`)
+- `--resume` — skip trials already recorded in `trials.db`
+- `--trial-timeout <sec>` — kill stuck trials (default: 1800)
+- `--max-hours <h>` — stop after a time budget
+- `--objective corr_sharpe|mmc_sharpe|payout_score` — optimization target
+
+W&B trial runs include scalar metrics (`corr_sharpe`, `mmc_sharpe`, `metric/mmc`) and per-trial
+`diagnostics/` charts (per-era correlation, feature exposure, SHAP importance). After the search,
+a `best-trial-diagnostics` run and `search-convergence` / `hpo-summary` runs are logged to the
+same W&B group.
+
 The best resulting configuration will be saved to `artifacts/hpo_x8/best_config.json`.
 
 ### 4. Run AutoResearch (Agent-Driven Research Loop)
@@ -519,11 +532,12 @@ make eda-lint
 ├── src/alphapulse/ # Core framework source code
 │ ├── autoresearch/ # Agent-driven research loop (loop, agent, mutations, state)
 │ ├── evaluation/ # Backtesting, metrics, SHAP report, W&B diagnostics, submission validation
-│ ├── experiments/ # YAML schema (ExperimentV1), runner
-│ ├── hpo/ # HPO objective, search space, builder, registry, TrialDB (SQLite)
-│ ├── logging_/ # Leaderboard and W&B helpers
+│ ├── experiments/ # YAML schema (ExperimentV1), runner, data loaders (incl. MMC validation frame)
+│ ├── features/ # Feature/target catalog loaded from features.json
+│ ├── hpo/ # HPO objective, search space, builder, registry, TrialDB (SQLite), export
+│ ├── logging_/ # Leaderboard, W&B helpers, live loguru → W&B Logs bridge
 │ ├── models/ # All model implementations + factory
-│ ├── pipeline/ # Pipeline, MultiHeadPipeline, MultiTargetPipeline, ensemble, neutralizer, stacker
+│ ├── pipeline/ # Pipeline, MultiHeadPipeline, MultiTargetPipeline, model_access, ensemble, neutralizer
 │ ├── preprocessors/ # All preprocessor implementations + factory (incl. autoencoder, compression, era-stable selector)
 │ ├── utils/ # Global seed utility
 │ └── validation/ # PurgedEraCV
@@ -549,6 +563,14 @@ Commit messages: prefer conventional commits (e.g. `feat: ...`, `fix: ...`, `doc
 
 See [CHANGELOG.md](CHANGELOG.md) for completed releases.
 
+**Completed — v0.6.0 (MMC + W&B Diagnostics):**
+- **MMC on validation split:** HPO scores `mmc`, `mmc_sharpe`, and `payout_score` on `validation.parquet` rows aligned with `meta_model.parquet` (train holdout ids do not overlap meta-model ids).
+- **W&B diagnostics as charts:** `diagnostics/` logs matplotlib bar/heatmap/line charts instead of raw tables; horizontal bar charts for feature importance and exposure; ensemble correlation heatmap.
+- **Live W&B training logs:** loguru lines and per-round XGBoost metrics stream to W&B during HPO trials.
+- **MultiTarget diagnostics:** `pipeline/model_access.py` unifies model iteration and prediction collection for SHAP and ensemble diagnostics across `Pipeline` and `MultiTargetPipeline`.
+- **Feature catalog & routing:** `features/catalog.py` and HPO feature routing resolve `features.json` sets and YAML groups into per-model column lists.
+- **HPO summary charts:** scatter plots for trial corr Sharpe, MMC Sharpe, and runtime in the `hpo-summary` W&B run.
+
 **Completed — v0.5.0 (Production Hardening + XAI):**
 - **HPO fault tolerance:** Each local trial runs in an isolated subprocess; crashes mark the trial failed and the sweep continues. A SQLite-backed `TrialDB` persists trial state. `--resume` skips already-completed trials.
 - **Provenance artifact:** On every export, a hermetically sealed bundle is written: resolved config, `uv export` dependency snapshot, and git commit hash.
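The README note that MMC is scored on `validation.parquet` rows aligned with `meta_model.parquet` amounts to an id-level inner join: only rows with a meta-model prediction can be scored. The sketch below shows that idea with pandas on tiny in-memory frames. It is not the repository's `load_mmc_validation_frame`; the helper name `align_with_meta_model`, the `numerai_meta_model` column name, and the sample ids are assumptions made for illustration.

```python
import pandas as pd


def align_with_meta_model(validation, meta_model, meta_col="numerai_meta_model"):
    """Keep only validation rows whose id also has a meta-model prediction.

    Hypothetical helper: both frames are assumed to be indexed by row id,
    and `meta_col` is an assumed column name.
    """
    aligned = validation.join(meta_model[[meta_col]], how="inner")
    # Drop rows where the meta-model value is present as a key but missing.
    return aligned.dropna(subset=[meta_col])


# Tiny stand-ins for the two parquet files.
validation = pd.DataFrame(
    {"era": ["0001", "0001", "0002"], "prediction": [0.1, 0.2, 0.3]},
    index=pd.Index(["a", "b", "c"], name="id"),
)
meta_model = pd.DataFrame(
    {"numerai_meta_model": [0.5, 0.6, 0.7]},
    index=pd.Index(["b", "c", "d"], name="id"),
)

aligned = align_with_meta_model(validation, meta_model)
```

Here `aligned` keeps ids `b` and `c` only; id `a` has no meta-model value and id `d` has no validation row, so neither can contribute to an MMC score.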
15 changes: 10 additions & 5 deletions eda/pages/hpo_analysis.py
@@ -403,9 +403,14 @@ def load_trials(path: str, _min_sharpe: float) -> pd.DataFrame:
         max_value=min(100, len(df)),
         value=20,
     )
+    rank_col = (
+        "payout_score"
+        if "payout_score" in df.columns and df["payout_score"].notna().any()
+        else "sharpe"
+    )
     show_cols = [
         "trial",
-        "sharpe",
+        rank_col,
         "mean_era_corr",
         "std_era_corr",
         "max_drawdown",
@@ -416,12 +421,12 @@ def load_trials(path: str, _min_sharpe: float) -> pd.DataFrame:
         "use_neutralization",
         "elapsed_seconds",
     ]
-    if "payout_score" in df.columns:
-        show_cols.insert(2, "payout_score")
+    if rank_col == "payout_score":
+        show_cols.insert(2, "corr_sharpe" if "corr_sharpe" in df.columns else "sharpe")
 
-    leaderboard = df.nlargest(top_n, "sharpe")[show_cols]
+    leaderboard = df.nlargest(top_n, rank_col)[show_cols]
     st.dataframe(
-        leaderboard.style.background_gradient(subset=["sharpe"], cmap="RdYlGn"),
+        leaderboard.style.background_gradient(subset=[rank_col], cmap="RdYlGn"),
         use_container_width=True,
         height=420,
     )
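The `rank_col` change above ranks the leaderboard by `payout_score` only when that column exists and holds at least one non-null value, and otherwise falls back to `sharpe`. A standalone sketch of that selection rule follows; `pick_rank_col` and the sample frames are hypothetical and only mirror the expression in the diff.

```python
import numpy as np
import pandas as pd


def pick_rank_col(df):
    """Mirror of the diff's rule: prefer payout_score when it is usable."""
    if "payout_score" in df.columns and df["payout_score"].notna().any():
        return "payout_score"
    return "sharpe"


# Trials recorded before MMC scoring: the column exists but is all NaN.
older = pd.DataFrame(
    {"trial": [0, 1], "sharpe": [0.8, 1.1], "payout_score": [np.nan, np.nan]}
)
# Trials with real payout scores.
newer = pd.DataFrame(
    {"trial": [0, 1], "sharpe": [0.8, 1.1], "payout_score": [0.03, 0.01]}
)
# Trials file with no payout_score column at all.
legacy = older.drop(columns=["payout_score"])

best_older = older.nlargest(1, pick_rank_col(older))["trial"].iloc[0]
best_newer = newer.nlargest(1, pick_rank_col(newer))["trial"].iloc[0]
```

With these frames the older data is ranked by `sharpe` (trial 1 wins) and the newer data by `payout_score` (trial 0 wins), which is why the `notna().any()` check matters for trial files that predate MMC scoring.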
43 changes: 43 additions & 0 deletions experiments/mmc_asymmetric_lgbm_tabicl_v1.yaml
@@ -0,0 +1,43 @@
version: "1"
data:
data_dir: data/v5.2
train_subsample: 0.125
target_col: target
seed: 42
features:
columns: null
groups: {}
preprocessing:
- type: StandardScaler
params: {}
models:
- type: LightGBM
params:
params:
num_leaves: 31
learning_rate: 0.01
objective: regression
metric: rmse
- type: TabICL
params:
n_estimators: 4
max_train_rows: 8000
compression: autoencoder
compression_components: 128
compression_epochs: 10
kv_cache: false
batch_size: 8
ensemble_method: weighted
ensemble_params:
optimize_weights: true
objective: payout_score
min_weight: 0.05
max_weight: 0.90
neutralization:
proportion: 0.5
train:
n_rounds: 2000
early_stopping_rounds: 100
evaluation:
primary_metric: payout_score
walk_forward: false
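For readers unfamiliar with the `neutralization: proportion` setting in the config above, the sketch below shows the commonly used linear form of feature neutralization: subtract a fraction of the least-squares projection of the predictions onto the features. It is an illustration under stated assumptions, not AlphaPulse's `FeatureNeutralizer`. The function `neutralize` and the synthetic data are hypothetical, and real pipelines typically apply this per era and rescale afterwards.

```python
import numpy as np


def neutralize(predictions, features, proportion=0.5):
    """Remove `proportion` of the linear feature exposure from predictions.

    Hypothetical helper: projects predictions onto the column space of
    `features` with a pseudo-inverse and subtracts a fraction of it.
    """
    preds = np.asarray(predictions, dtype=float)
    exposures = np.asarray(features, dtype=float)
    projection = exposures @ (np.linalg.pinv(exposures) @ preds)
    return preds - proportion * projection


# Synthetic data: predictions that lean on the first feature.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
raw = 0.5 * features[:, 0] + rng.normal(size=200)

half = neutralize(raw, features, proportion=0.5)  # as in the config above
full = neutralize(raw, features, proportion=1.0)  # fully orthogonal to features
```

With `proportion=1.0` the result has zero dot product with every feature column; with `proportion=0.5` each dot product is halved relative to the raw predictions.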
36 changes: 36 additions & 0 deletions experiments/mmc_catboost_baseline_v1.yaml
@@ -0,0 +1,36 @@
version: "1"
data:
data_dir: data/v5.2
train_subsample: 0.125
target_col: target
seed: 42
features:
columns: null
groups: {}
preprocessing:
- type: RobustScaler
params: {}
models:
- type: CatBoost
params:
params:
depth: 6
learning_rate: 0.03
l2_leaf_reg: 5.0
min_data_in_leaf: 200
colsample_bylevel: 0.3
loss_function: RMSE
verbose: 0
iterations: 400
early_stopping_rounds: 50
ensemble_method: single
neutralization:
proportion: 0.35
meta_neutralization:
proportion: 0.0
train:
n_rounds: 400
early_stopping_rounds: 50
evaluation:
primary_metric: payout_score
walk_forward: false
53 changes: 53 additions & 0 deletions experiments/mmc_lgbm_catboost_ensemble_v1.yaml
@@ -0,0 +1,53 @@
version: "1"
data:
data_dir: data/v5.2
train_subsample: 0.125
target_col: target
seed: 42
features:
columns: null
groups: {}
preprocessing:
- type: StandardScaler
params: {}
models:
- type: LightGBM
params:
params:
num_leaves: 31
learning_rate: 0.01
min_child_samples: 200
reg_alpha: 0.5
reg_lambda: 5.0
colsample_bytree: 0.3
subsample: 0.7
objective: regression
metric: rmse
- type: CatBoost
params:
params:
depth: 6
learning_rate: 0.03
l2_leaf_reg: 5.0
min_data_in_leaf: 200
colsample_bylevel: 0.3
loss_function: RMSE
verbose: 0
iterations: 400
early_stopping_rounds: 50
ensemble_method: weighted
ensemble_params:
optimize_weights: true
objective: payout_score
min_weight: 0.05
max_weight: 0.90
neutralization:
proportion: 0.5
meta_neutralization:
proportion: 0.55
train:
n_rounds: 400
early_stopping_rounds: 50
evaluation:
primary_metric: payout_score
walk_forward: false
36 changes: 36 additions & 0 deletions experiments/mmc_lgbm_meta_neutral_v1.yaml
@@ -0,0 +1,36 @@
version: "1"
data:
data_dir: data/v5.2
train_subsample: 0.125
target_col: target
seed: 42
features:
columns: null
groups: {}
preprocessing:
- type: StandardScaler
params: {}
models:
- type: LightGBM
params:
params:
num_leaves: 31
learning_rate: 0.01
min_child_samples: 200
reg_alpha: 0.5
reg_lambda: 5.0
colsample_bytree: 0.3
subsample: 0.7
objective: regression
metric: rmse
ensemble_method: single
neutralization:
proportion: 0.5
meta_neutralization:
proportion: 0.6
train:
n_rounds: 400
early_stopping_rounds: 50
evaluation:
primary_metric: payout_score
walk_forward: false
35 changes: 35 additions & 0 deletions experiments/mmc_multitarget_lgbm_v1.yaml
@@ -0,0 +1,35 @@
version: "1"
data:
data_dir: data/v5.2
train_subsample: 0.125
target_col: target
auxiliary_targets:
- target_jerome_v4_20
- target_ralph_v4_20
- target_tyler_v4_20
target_blend_method: equal
seed: 42
features:
columns: null
groups: {}
preprocessing:
- type: StandardScaler
params: {}
models:
- type: LightGBM
params:
params:
num_leaves: 31
learning_rate: 0.01
objective: regression
metric: rmse
ensemble_method: single
ensemble_params: {}
neutralization:
proportion: 0.5
train:
n_rounds: 2000
early_stopping_rounds: 100
evaluation:
primary_metric: corr_sharpe
walk_forward: false