Palamabron · Jun 21, 2026 · Jun 16, 2026
diff --git a/.env-example b/.env-example
@@ -3,6 +3,3 @@ NUMERAI_PRIVATE_API_KEY=...
 
 # Optional: for wandb logging (or use `wandb login`)
 WANDB_API_KEY=...
-
-# Optional: for TabPFN3ReasoningModel / TabPFN3Reasoning API (pip install 'alphapulse[foundation-api]')
-TABPFN_API_KEY=...
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,21 +4,23 @@ All notable changes to AlphaPulse are documented here.
 
 ---
 
-## [Unreleased] — WandB XAI & Plot Quality Overhaul
+## [Unreleased]
 
-- **Universal feature importance:** `compute_universal_feature_importance` extracts and normalizes importance from any supported model type (XGBoost pred_contribs, LightGBM gain, CatBoost PredictionValuesChange, sklearn `feature_importances_`), averages across all models present, and logs a ranked bar chart to WandB.
-- **Era-stratified importance:** `_log_era_stratified_importance` slices validation data by era, computes importance per slice, and logs a `line_series` chart showing each feature's importance trajectory over eras — directly reveals temporal stability.
-- **Per-era stability report wired:** `compute_feature_report` (LightGBM proxy) now surfaces in WandB via `_log_feature_report`; logs top features by mean importance, top by era stability, and worst by era stability — each with bar charts.
-- **Best-trial diagnostics run:** After HPO, the best config is retrained on an 80/20 era split and all expensive diagnostics (`log_era_importance=True`, top-50 importance artifact) are logged to a dedicated `best-trial-diagnostics` WandB run.
-- **Prediction histogram fixed:** `_log_prediction_diagnostics` now uses `np.histogram(bins=50)` (50 rows) instead of logging every prediction row (50k+ rows).
-- **Per-era line charts fixed:** `era_index` (0, 1, 2…) used as x-axis — fixes alphabetical string sort that scrambled chronological order.
-- **Drawdown curve added:** Per-era drawdown from peak cumulative correlation logged alongside the cumulative correlation line chart.
-- **Correlation distribution histogram:** Distribution of per-era Spearman correlations logged as a bar chart — directly answers "how many negative eras does this model have?"
-- **Missing bar charts added:** Feature exposure top-15, ensemble model-pair correlation (A→B format), worst stability by era — all now have companion bar charts.
-- **HPO summary table expanded:** 18 → 30 columns; adds `model_1/2/3_type` (split, for WandB parallel coordinates), XGBoost/LightGBM hyperparams, feature selection, noise injection, augmentation flags.
-- **Convergence chart:** `log_hpo_convergence` logs all trial scores and running-best `corr_sharpe` in a single WandB run after the HPO search completes, rendering as a proper convergence curve.
-- **String metric bug fixed:** `feature_importance_model_type` moved from `wandb.log()` (coerced to NaN) to `wandb.run.summary`.
-- **Duplicate metric removed:** `metric/corr_sharpe` deduplicated in `log_hpo_trial_metrics`.
+## [0.6.0] — MMC Scoring & W&B Chart Diagnostics
+
+- **MMC on validation split:** `load_mmc_validation_frame` aligns `validation.parquet` with `meta_model.parquet`; HPO merges `mmc`, `mmc_sharpe`, and `payout_score` after train-era holdout evaluation so W&B `metric/mmc` is no longer null.
+- **W&B diagnostics as charts:** Raw `diagnostics/` tables replaced with matplotlib horizontal bar charts, correlation heatmaps, and line charts; NaN metrics are skipped in trial logging.
+- **Live W&B training logs:** `wandb_logging.py` bridges loguru to the W&B Logs panel and logs per-round XGBoost metrics during training.
+- **MultiTarget diagnostics:** `pipeline/model_access.py` provides `iter_trained_models`, `model_prediction_map`, and `multitarget_blend_weights` for SHAP and ensemble diagnostics on multi-target pipelines.
+- **Feature catalog & HPO routing:** `features/catalog.py`, `hpo/feature_routing.py`, and `hpo/target_strategy.py` resolve feature sets and multi-target training from `features.json`.
+- **HPO export module:** `hpo/export.py` centralises pipeline fitting for Numerai pickle export from flat HPO configs.
+- **Universal feature importance:** `compute_universal_feature_importance` extracts importance from XGBoost, LightGBM, CatBoost, and sklearn tree models; logs ranked bar charts to W&B.
+- **Era-stratified importance:** `_log_era_stratified_importance` logs per-era importance `line_series` for temporal stability analysis.
+- **Per-era stability report:** `compute_feature_report` (LightGBM proxy) surfaces the top features by mean importance, the most era-stable features, and the least era-stable features as bar charts in W&B.
+- **Best-trial diagnostics run:** After HPO, the best config is retrained and logged to a dedicated `best-trial-diagnostics` W&B run with full XAI artifacts.
+- **Per-era line charts fixed:** `era_index` used as x-axis for chronological ordering; drawdown and correlation distribution charts added.
+- **HPO summary expanded:** 30-column trial table plus scatter charts for corr Sharpe, MMC Sharpe, and runtime; `log_hpo_convergence` renders running-best curve in one W&B run.
+- **W&B metric fixes:** `feature_importance_model_type` stored in `wandb.run.summary`; duplicate `metric/corr_sharpe` removed; non-finite (NaN) metrics skipped via a finite check.
 
 ## [0.5.0] — Production Hardening
 

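Note on the drawdown entry: the changelog describes a per-era drawdown measured from the peak of cumulative correlation, plotted against a numeric `era_index`. The diff does not show that code, so the following is only a minimal sketch of the calculation. The function name `drawdown_curve` is hypothetical, and starting the running peak at the first cumulative value (rather than at zero) is an assumption.

```python
from itertools import accumulate


def drawdown_curve(per_era_corr):
    """Return (era_index, cumulative, drawdown) for a chronological series.

    per_era_corr: per-era correlation scores, already in chronological order.
    """
    # Numeric x-axis, so chart order never depends on string sorting of era labels.
    era_index = list(range(len(per_era_corr)))
    # Running sum of per-era correlations.
    cumulative = list(accumulate(per_era_corr))
    # Highest cumulative value seen so far (assumption: peak starts at first value).
    running_peak = list(accumulate(cumulative, max))
    # Distance below the peak; 0 means the series is at a new high.
    drawdown = [peak - cum for peak, cum in zip(running_peak, cumulative)]
    return era_index, cumulative, drawdown
```

The maximum drawdown is then `max(drawdown)` over the returned list.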
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# AlphaPulse v0.5.0
+# AlphaPulse v0.6.0
 
 AlphaPulse is a config-driven framework for building, training, and deploying ML pipelines for the [Numerai](https://numer.ai) stock-market prediction tournament. It covers the full workflow: dataset download, experiment definition, backtesting, hyperparameter optimization (HPO), and automated weekly submission.
 
@@ -13,7 +13,7 @@ The framework is organized into five layers:
 | **Data** | `NumeraiDataLoader`, parquet files, `features.json` | Downloads and loads Numerai dataset splits (train/validation/live) |
 | **Configuration** | `ExperimentV1` YAML schema, HPO search space, `TrialDB`, AutoResearch agent | Defines what to train — via static YAML, automated HPO, or Claude-agent-driven research |
 | **Core Pipeline** | Preprocessors, Models, `Pipeline` / `MultiHeadPipeline`, Ensemble, `FeatureNeutralizer` | Fits and combines models; handles feature routing, ensembling, and prediction neutralization |
-| **Evaluation** | `Backtester`, `PurgedEraCV`, SHAP report, W&B diagnostics | Computes era-aware metrics (CORR, Sharpe, MMC) and XAI reports |
+| **Evaluation** | `Backtester`, `PurgedEraCV`, SHAP report, W&B diagnostics (charts) | Era-aware metrics (CORR, Sharpe, MMC on validation split) and matplotlib XAI plots |
 | **Export & Submission** | `predict.pkl`, live inference, submission validation, Numerai upload | Produces tournament-ready predictions and submits them |
 
 > The diagram is editable — open `docs/assets/architecture.drawio` in [draw.io](https://app.diagrams.net) to modify it.
@@ -194,9 +194,22 @@ uv run python scripts/hpo_pipeline.py \
   --train-subsample 0.125 \
   --num-trials 30 \
   --output-dir artifacts/hpo_x8 \
-  --local
+  --local \
+  --wandb-project alphapulse-hpo
 ```
 
+**Useful flags:**
+- `--wandb-project <name>` — log every trial to Weights & Biases (project name is timestamped and saved for `--resume`)
+- `--resume` — skip trials already recorded in `trials.db`
+- `--trial-timeout <sec>` — kill stuck trials (default: 1800)
+- `--max-hours <h>` — stop after a time budget
+- `--objective corr_sharpe|mmc_sharpe|payout_score` — optimization target
+
+W&B trial runs include scalar metrics (`corr_sharpe`, `mmc_sharpe`, `metric/mmc`) and per-trial
+`diagnostics/` charts (per-era correlation, feature exposure, SHAP importance). After the search,
+a `best-trial-diagnostics` run and `search-convergence` / `hpo-summary` runs are logged to the
+same W&B group.
+
 The best resulting configuration will be saved to `artifacts/hpo_x8/best_config.json`.
 
 ### 4\. Run AutoResearch (Agent-Driven Research Loop)
@@ -519,11 +532,12 @@ make eda-lint
 ├── src/alphapulse/  # Core framework source code
 │   ├── autoresearch/  # Agent-driven research loop (loop, agent, mutations, state)
 │   ├── evaluation/    # Backtesting, metrics, SHAP report, W&B diagnostics, submission validation
-│   ├── experiments/   # YAML schema (ExperimentV1), runner
-│   ├── hpo/           # HPO objective, search space, builder, registry, TrialDB (SQLite)
-│   ├── logging_/      # Leaderboard and W&B helpers
+│   ├── experiments/   # YAML schema (ExperimentV1), runner, data loaders (incl. MMC validation frame)
+│   ├── features/      # Feature/target catalog loaded from features.json
+│   ├── hpo/           # HPO objective, search space, builder, registry, TrialDB (SQLite), export
+│   ├── logging_/      # Leaderboard, W&B helpers, live loguru → W&B Logs bridge
 │   ├── models/        # All model implementations + factory
-│   ├── pipeline/      # Pipeline, MultiHeadPipeline, MultiTargetPipeline, ensemble, neutralizer, stacker
+│   ├── pipeline/      # Pipeline, MultiHeadPipeline, MultiTargetPipeline, model_access, ensemble, neutralizer
 │   ├── preprocessors/ # All preprocessor implementations + factory (incl. autoencoder, compression, era-stable selector)
 │   ├── utils/         # Global seed utility
 │   └── validation/    # PurgedEraCV
@@ -549,6 +563,14 @@ Commit messages: prefer conventional commits (e.g. `feat: ...`, `fix: ...`, `doc
 
 See [CHANGELOG.md](CHANGELOG.md) for completed releases.
 
+**Completed — v0.6.0 (MMC + W&B Diagnostics):**
+- **MMC on validation split:** HPO scores `mmc`, `mmc_sharpe`, and `payout_score` on `validation.parquet` rows aligned with `meta_model.parquet` (the train-holdout ids do not overlap the meta-model ids, so MMC cannot be scored on the holdout).
+- **W&B diagnostics as charts:** `diagnostics/` logs matplotlib bar/heatmap/line charts instead of raw tables; horizontal bar charts for feature importance and exposure; ensemble correlation heatmap.
+- **Live W&B training logs:** loguru lines and per-round XGBoost metrics stream to W&B during HPO trials.
+- **MultiTarget diagnostics:** `pipeline/model_access.py` unifies model iteration and prediction collection for SHAP and ensemble diagnostics across `Pipeline` and `MultiTargetPipeline`.
+- **Feature catalog & routing:** `features/catalog.py` and HPO feature routing resolve `features.json` sets and YAML groups into per-model column lists.
+- **HPO summary charts:** scatter plots for trial corr Sharpe, MMC Sharpe, and runtime in the `hpo-summary` W&B run.
+
 **Completed — v0.5.0 (Production Hardening + XAI):**
 - **HPO fault tolerance:** Each local trial runs in an isolated subprocess; crashes mark the trial failed and the sweep continues. A SQLite-backed `TrialDB` persists trial state. `--resume` skips already-completed trials.
 - **Provenance artifact:** On every export, a hermetically sealed bundle is written: resolved config, `uv export` dependency snapshot, and git commit hash.
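Note on the MMC validation split: the v0.6.0 entry says MMC is scored on `validation.parquet` rows aligned with `meta_model.parquet`. The body of `load_mmc_validation_frame` is not part of this diff, so the sketch below only illustrates the alignment idea with a plain pandas inner join. The function name `align_with_meta_model`, the assumption that both frames are indexed by row id, and the default column name `numerai_meta_model` are all assumptions for illustration.

```python
import pandas as pd


def align_with_meta_model(
    validation: pd.DataFrame,
    meta_model: pd.DataFrame,
    meta_col: str = "numerai_meta_model",  # assumed column name
) -> pd.DataFrame:
    """Keep only validation rows whose id also has a meta-model prediction."""
    # Inner join on the index: rows missing from either frame are dropped.
    aligned = validation.join(meta_model[[meta_col]], how="inner")
    # Drop rows where the meta-model value itself is missing.
    return aligned.dropna(subset=[meta_col])
```

With this shape, rows that exist only in the training holdout never reach the MMC computation, which is consistent with the note that holdout ids and meta-model ids do not overlap.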

diff --git a/eda/pages/hpo_analysis.py b/eda/pages/hpo_analysis.py
@@ -403,9 +403,14 @@ def load_trials(path: str, _min_sharpe: float) -> pd.DataFrame:
     max_value=min(100, len(df)),
     value=20,
 )
+rank_col = (
+    "payout_score"
+    if "payout_score" in df.columns and df["payout_score"].notna().any()
+    else "sharpe"
+)
 show_cols = [
     "trial",
-    "sharpe",
+    rank_col,
     "mean_era_corr",
     "std_era_corr",
     "max_drawdown",
@@ -416,12 +421,12 @@ def load_trials(path: str, _min_sharpe: float) -> pd.DataFrame:
     "use_neutralization",
     "elapsed_seconds",
 ]
-if "payout_score" in df.columns:
-    show_cols.insert(2, "payout_score")
+if rank_col == "payout_score":
+    show_cols.insert(2, "corr_sharpe" if "corr_sharpe" in df.columns else "sharpe")
 
-leaderboard = df.nlargest(top_n, "sharpe")[show_cols]
+leaderboard = df.nlargest(top_n, rank_col)[show_cols]
 st.dataframe(
-    leaderboard.style.background_gradient(subset=["sharpe"], cmap="RdYlGn"),
+    leaderboard.style.background_gradient(subset=[rank_col], cmap="RdYlGn"),
     use_container_width=True,
     height=420,
 )
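Note on the leaderboard change: the hunk above ranks by `payout_score` only when that column exists and has at least one non-null value, and falls back to `sharpe` otherwise. The same rule is restated below as a standalone helper so it can be checked in isolation. The function name `pick_rank_column` and the toy frames in the usage note are hypothetical, not part of the page.

```python
import pandas as pd


def pick_rank_column(
    df: pd.DataFrame,
    preferred: str = "payout_score",
    fallback: str = "sharpe",
) -> str:
    """Return `preferred` if it is present with any non-null value, else `fallback`."""
    has_values = preferred in df.columns and bool(df[preferred].notna().any())
    return preferred if has_values else fallback


def top_trials(df: pd.DataFrame, top_n: int) -> pd.DataFrame:
    """Rank trials by whichever column pick_rank_column selects."""
    return df.nlargest(top_n, pick_rank_column(df))
```

For example, a frame whose `payout_score` column is entirely NaN (older trial logs) is still ranked by `sharpe`, so the page does not render an empty leaderboard.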

diff --git a/experiments/mmc_asymmetric_lgbm_tabicl_v1.yaml b/experiments/mmc_asymmetric_lgbm_tabicl_v1.yaml
@@ -0,0 +1,43 @@
+version: "1"
+data:
+  data_dir: data/v5.2
+  train_subsample: 0.125
+  target_col: target
+  seed: 42
+features:
+  columns: null
+  groups: {}
+preprocessing:
+  - type: StandardScaler
+    params: {}
+models:
+  - type: LightGBM
+    params:
+      params:
+        num_leaves: 31
+        learning_rate: 0.01
+        objective: regression
+        metric: rmse
+  - type: TabICL
+    params:
+      n_estimators: 4
+      max_train_rows: 8000
+      compression: autoencoder
+      compression_components: 128
+      compression_epochs: 10
+      kv_cache: false
+      batch_size: 8
+ensemble_method: weighted
+ensemble_params:
+  optimize_weights: true
+  objective: payout_score
+  min_weight: 0.05
+  max_weight: 0.90
+neutralization:
+  proportion: 0.5
+train:
+  n_rounds: 2000
+  early_stopping_rounds: 100
+evaluation:
+  primary_metric: payout_score
+  walk_forward: false
diff --git a/experiments/mmc_catboost_baseline_v1.yaml b/experiments/mmc_catboost_baseline_v1.yaml
@@ -0,0 +1,36 @@
+version: "1"
+data:
+  data_dir: data/v5.2
+  train_subsample: 0.125
+  target_col: target
+  seed: 42
+features:
+  columns: null
+  groups: {}
+preprocessing:
+  - type: RobustScaler
+    params: {}
+models:
+  - type: CatBoost
+    params:
+      params:
+        depth: 6
+        learning_rate: 0.03
+        l2_leaf_reg: 5.0
+        min_data_in_leaf: 200
+        colsample_bylevel: 0.3
+        loss_function: RMSE
+        verbose: 0
+      iterations: 400
+      early_stopping_rounds: 50
+ensemble_method: single
+neutralization:
+  proportion: 0.35
+meta_neutralization:
+  proportion: 0.0
+train:
+  n_rounds: 400
+  early_stopping_rounds: 50
+evaluation:
+  primary_metric: payout_score
+  walk_forward: false
diff --git a/experiments/mmc_lgbm_catboost_ensemble_v1.yaml b/experiments/mmc_lgbm_catboost_ensemble_v1.yaml
@@ -0,0 +1,53 @@
+version: "1"
+data:
+  data_dir: data/v5.2
+  train_subsample: 0.125
+  target_col: target
+  seed: 42
+features:
+  columns: null
+  groups: {}
+preprocessing:
+  - type: StandardScaler
+    params: {}
+models:
+  - type: LightGBM
+    params:
+      params:
+        num_leaves: 31
+        learning_rate: 0.01
+        min_child_samples: 200
+        reg_alpha: 0.5
+        reg_lambda: 5.0
+        colsample_bytree: 0.3
+        subsample: 0.7
+        objective: regression
+        metric: rmse
+  - type: CatBoost
+    params:
+      params:
+        depth: 6
+        learning_rate: 0.03
+        l2_leaf_reg: 5.0
+        min_data_in_leaf: 200
+        colsample_bylevel: 0.3
+        loss_function: RMSE
+        verbose: 0
+      iterations: 400
+      early_stopping_rounds: 50
+ensemble_method: weighted
+ensemble_params:
+  optimize_weights: true
+  objective: payout_score
+  min_weight: 0.05
+  max_weight: 0.90
+neutralization:
+  proportion: 0.5
+meta_neutralization:
+  proportion: 0.55
+train:
+  n_rounds: 400
+  early_stopping_rounds: 50
+evaluation:
+  primary_metric: payout_score
+  walk_forward: false
diff --git a/experiments/mmc_lgbm_meta_neutral_v1.yaml b/experiments/mmc_lgbm_meta_neutral_v1.yaml
@@ -0,0 +1,36 @@
+version: "1"
+data:
+  data_dir: data/v5.2
+  train_subsample: 0.125
+  target_col: target
+  seed: 42
+features:
+  columns: null
+  groups: {}
+preprocessing:
+  - type: StandardScaler
+    params: {}
+models:
+  - type: LightGBM
+    params:
+      params:
+        num_leaves: 31
+        learning_rate: 0.01
+        min_child_samples: 200
+        reg_alpha: 0.5
+        reg_lambda: 5.0
+        colsample_bytree: 0.3
+        subsample: 0.7
+        objective: regression
+        metric: rmse
+ensemble_method: single
+neutralization:
+  proportion: 0.5
+meta_neutralization:
+  proportion: 0.6
+train:
+  n_rounds: 400
+  early_stopping_rounds: 50
+evaluation:
+  primary_metric: payout_score
+  walk_forward: false
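Note on `neutralization.proportion`: the experiment files set a proportion between 0 and 1, but the diff does not show how `FeatureNeutralizer` applies it. A common reading in the Numerai community is linear neutralization: fit the predictions on the features by least squares and subtract that fraction of the fitted component. The sketch below follows that reading; it is an assumption about the technique, not the project's implementation, and the function name `neutralize` is hypothetical.

```python
import numpy as np


def neutralize(
    predictions: np.ndarray,
    features: np.ndarray,
    proportion: float = 0.5,
) -> np.ndarray:
    """Remove `proportion` of the component of predictions linear in features.

    predictions: shape (n_rows,)
    features:    shape (n_rows, n_features)
    """
    # Least-squares coefficients of predictions regressed on features.
    beta, *_ = np.linalg.lstsq(features, predictions, rcond=None)
    # The part of the predictions explained linearly by the features.
    exposure = features @ beta
    return predictions - proportion * exposure
```

With `proportion=1.0` the result is orthogonal to every feature column; with `proportion=0.0` the predictions are returned unchanged. Real pipelines typically apply this per era and rescale afterwards, which this sketch omits.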
diff --git a/experiments/mmc_multitarget_lgbm_v1.yaml b/experiments/mmc_multitarget_lgbm_v1.yaml
@@ -0,0 +1,35 @@
+version: "1"
+data:
+  data_dir: data/v5.2
+  train_subsample: 0.125
+  target_col: target
+  auxiliary_targets:
+    - target_jerome_v4_20
+    - target_ralph_v4_20
+    - target_tyler_v4_20
+  target_blend_method: equal
+  seed: 42
+features:
+  columns: null
+  groups: {}
+preprocessing:
+  - type: StandardScaler
+    params: {}
+models:
+  - type: LightGBM
+    params:
+      params:
+        num_leaves: 31
+        learning_rate: 0.01
+        objective: regression
+        metric: rmse
+ensemble_method: single
+ensemble_params: {}
+neutralization:
+  proportion: 0.5
+train:
+  n_rounds: 2000
+  early_stopping_rounds: 100
+evaluation:
+  primary_metric: corr_sharpe
+  walk_forward: false
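Note on `target_blend_method: equal`: this config trains on `target` plus three auxiliary targets, and the changelog mentions a `multitarget_blend_weights` helper whose body is not shown. One plausible reading of "equal" is a uniform average of the per-target predictions, sketched below. The function names, and the choice to include the primary target in the average, are assumptions.

```python
def equal_blend_weights(target_names):
    """Give every target the same weight; weights sum to 1."""
    weight = 1.0 / len(target_names)
    return {name: weight for name in target_names}


def blend_predictions(predictions_by_target, weights):
    """Weighted row-wise average of per-target prediction lists."""
    n_rows = len(next(iter(predictions_by_target.values())))
    return [
        sum(weights[name] * predictions_by_target[name][i] for name in weights)
        for i in range(n_rows)
    ]
```

For the four targets in this file, each model head would contribute a weight of 0.25 to the final prediction under this reading.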