skrub-data · jeromedockes · Jun 11, 2026 · Jun 12, 2026 · Jun 12, 2026 · Jun 12, 2026
diff --git a/.github/workflows/check_stub_files_diff.yaml b/.github/workflows/check_stub_files_diff.yaml
@@ -17,7 +17,7 @@ jobs:
       - uses: actions/checkout@v6
       - uses: prefix-dev/setup-pixi@v0.9.6
         with:
-          pixi-version: v0.59.0
+          pixi-version: v0.68.0
           frozen: true
 
       - name: Check stub file for `_data_ops.py` is up-to-date

diff --git a/.github/workflows/run-code-format-checks.yaml b/.github/workflows/run-code-format-checks.yaml
@@ -17,7 +17,7 @@ jobs:
       - uses: actions/checkout@v6
       - uses: prefix-dev/setup-pixi@v0.9.6
         with:
-          pixi-version: v0.59.0
+          pixi-version: v0.68.0
           frozen: true
 
       - name: Run tests

diff --git a/.github/workflows/test-javascript.yml b/.github/workflows/test-javascript.yml
@@ -15,7 +15,7 @@ jobs:
       - uses: actions/checkout@v6
       - uses: prefix-dev/setup-pixi@v0.9.6
         with:
-          pixi-version: v0.59.0
+          pixi-version: v0.68.0
           environments: ci-py314-latest-optional-deps
           # we can freeze the environment and manually bump the dependencies to the
           # latest version time to time.

diff --git a/.github/workflows/testing.yml b/.github/workflows/testing.yml
@@ -27,7 +27,7 @@ jobs:
       - uses: actions/checkout@v6
       - uses: prefix-dev/setup-pixi@v0.9.6
         with:
-          pixi-version: v0.59.0
+          pixi-version: v0.68.0
           environments: ${{ matrix.environment }}
           # we can freeze the environment and manually bump the dependencies to the
           # latest version time to time.
@@ -63,7 +63,7 @@ jobs:
       - uses: actions/checkout@v6
       - uses: prefix-dev/setup-pixi@v0.9.6
         with:
-          pixi-version: v0.59.0
+          pixi-version: v0.68.0
           environments: ci-nightly-deps
           # we can freeze the environment and manually bump the dependencies to the
           # latest version time to time.

diff --git a/.github/workflows/update_pixi_lock_files.yml b/.github/workflows/update_pixi_lock_files.yml
@@ -24,7 +24,7 @@ jobs:
       - uses: actions/checkout@v6
       - uses: prefix-dev/setup-pixi@v0.9.6
         with:
-          pixi-version: v0.59.0
+          pixi-version: v0.68.0
           run-install: false
 
       - name: Remove the current lock file

diff --git a/doc/conf.py b/doc/conf.py
@@ -70,6 +70,7 @@
     "sphinx.ext.linkcode",
     "sphinx.ext.autodoc.typehints",
     # contrib
+    "sphinx_design",
     "numpydoc",
     "sphinx_issues",
     "sphinx_copybutton",

diff --git a/doc/data_ops.rst b/doc/data_ops.rst
@@ -1,46 +1,66 @@
 .. _user_guide_data_ops_index:
 
-Complex multi-table pipelines with Data Ops
-===========================================
-
-Skrub provides an easy way to build complex, flexible machine learning pipelines.
-There are several needs that are not easily addressed with standard scikit-learn
-tools such as :class:`~sklearn.pipeline.Pipeline` and
-:class:`~sklearn.compose.ColumnTransformer`, and for which the skrub DataOps offer
-a solution:
-
-- Multiple tables: We often have several tables of different shapes (for
-  example, "Customers", "Orders", and "Products" tables) that need to be
-  processed and assembled into a design matrix ``X``. The target ``y`` may also
-  be the result of some data processing. Standard scikit-learn estimators do not
-  support this, as they expect right away a single design matrix ``X`` and a
-  target array ``y``, with one row per observation.
-- DataFrame wrangling: Performing typical DataFrame operations such as
-  projections, joins, and aggregations should be possible and allow leveraging
-  the powerful and familiar APIs of `Pandas <https://pandas.pydata.org>`_ or
-  `Polars <https://docs.pola.rs/>`_.
-- Hyperparameter tuning: Choices of estimators, hyperparameters, and even
-  the pipeline architecture can be guided by validation scores. Specifying
-  ranges of possible values outside of the pipeline itself (as in
-  :class:`~sklearn.model_selection.GridSearchCV`) is difficult in complex
-  pipelines.
-- Iterative development: Building a pipeline step by step while inspecting
-  intermediate results allows for a short feedback loop and early discovery of
-  errors.
-
-In this section we cover all about the skrub Data Ops, from starting out with a
-simple example, to more advanced concepts like parameter tuning and and pipeline
-validation.
+.. currentmodule:: skrub
+
+Building complete pipelines with DataOps
+========================================
+
+A skrub DataOp is a complete machine learning pipeline, from data loading and
+wrangling to the final prediction, in a single object that can be fitted, tuned,
+cross-validated, and saved to a file like any scikit-learn estimator.
+
+To solve a machine-learning task we often need to combine multiple operations
+such as loading and filtering data, joining tables and computing aggregations,
+extracting numerical features, and fitting a classifier or regressor.
+
+**Storing state**  Each of those operations may need to be fitted: to learn some
+information from training data and reuse it to apply consistent transformations
+to new data. This is the case for transformers like the
+:class:`~sklearn.preprocessing.StandardScaler` and :class:`TableVectorizer` and
+estimators like :class:`~sklearn.ensemble.RandomForestClassifier`.
+
+**Tuning**  Moreover, each processing step may involve decisions that need to be
+tuned (*tuning* means finding the value that gives the best predictive
+performance), for example: what weather forecast features should I include to
+predict the load on an electric grid? How should I encode a product description
+to help predict the product's category? What learning rate to set on a
+:class:`~sklearn.ensemble.HistGradientBoostingRegressor`?
+
+**Validation**  Finally, the quality of predictions must be evaluated on
+held-out data (with a train/test split or cross-validation), taking care to
+**avoid leakage** of test data into the training set.
+
+Separating the data wrangling from the fitted estimator prevents us from
+handling the tasks above correctly. Skrub DataOps help by binding an arbitrary set of
+transformations of any number of inputs in a single estimator. These
+transformations can be easily parametrized with tunable choices. The resulting
+objects have built-in methods for cross-validation and tuning with either Optuna
+or scikit-learn, and for inspecting runs and intermediate results. Once fitted,
+they can be saved to a file, loaded, and applied to new data as easily as a single
+:class:`~sklearn.linear_model.LogisticRegression`.
+
+.. dropdown:: Going beyond the scikit-learn Pipeline
+  :color: primary
+
+  To some extent, the DataOps exist for the same reasons as the simpler
+  scikit-learn :class:`sklearn.pipeline.Pipeline` used in other parts of this
+  documentation. However, the Pipeline is too limited for many real-world problems:
+  it can only represent a linear sequence of scikit-learn transformers; the design
+  matrix and target variables must be constructed and divided into training and
+  testing sets outside of the pipeline, and the number of rows cannot change; only
+  a single table can be handled; hyperparameter choices are difficult to define;
+  etc. Skrub DataOps remove those limitations and add several useful features
+  such as interactive previews and integration with Optuna.
 
 Data Ops basic concepts
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. toctree::
    :maxdepth: 3
 
+   auto_tutorials/1111_data_ops_quick_tour
    modules/data_ops/basics/what_are_data_ops
    modules/data_ops/basics/building_data_ops_plan
-   auto_tutorials/1110_data_ops_intro
    modules/data_ops/basics/using_previews
    modules/data_ops/basics/direct_access_methods
    modules/data_ops/basics/control_flow
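
Note on the new introduction in doc/data_ops.rst: the "Storing state" and "Validation" paragraphs describe learning information from training data and reusing it on new data without leaking test data. A minimal standard-library sketch of that idea follows. `ToyStandardScaler` is a hypothetical stand-in written for this note; it is not scikit-learn's `StandardScaler` or any skrub API.

```python
from statistics import mean, pstdev


class ToyStandardScaler:
    """Hypothetical stand-in for a fitted transformer (not scikit-learn's class)."""

    def fit(self, values):
        # Learn the state (mean and standard deviation) from training data only.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # fall back to 1.0 for constant input
        return self

    def transform(self, values):
        # Reuse the stored state so new data gets the same transformation.
        return [(v - self.mean_) / self.scale_ for v in values]


train = [10.0, 20.0, 30.0]
test = [40.0]

scaler = ToyStandardScaler().fit(train)  # the held-out value is never seen here
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

Fitting on `train` alone and only calling `transform` on `test` is what keeps the held-out value out of the learned statistics. As the new text describes it, a DataOp bundles this kind of fitted state for every step of the pipeline.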

diff --git a/doc/data_ops_report.py b/doc/data_ops_report.py
@@ -74,18 +74,16 @@ def create_employee_salaries_report():
     if (output_dir / "index.html").exists():
         return output_dir
 
-    dataset = skrub.datasets.fetch_employee_salaries(split="train").employee_salaries
-    data_var = skrub.var("data", dataset)
-    X = data_var.drop("current_annual_salary", axis=1).skb.mark_as_X()
-    y = data_var["current_annual_salary"].skb.mark_as_y()
-
-    vectorizer = TableVectorizer()
-    X_vec = X.skb.apply(vectorizer)
+    pred = (
+        skrub.var("employee_data")
+        .skb.apply(TableVectorizer())
+        .skb.apply(HistGradientBoostingRegressor(), y=skrub.var("salary"))
+    )
 
-    hgb = HistGradientBoostingRegressor()
-    predictor = X_vec.skb.apply(hgb, y=y)
+    dataset = skrub.datasets.fetch_employee_salaries(split="train")
 
-    predictor.skb.full_report(
+    pred.skb.full_report(
+        {"employee_data": dataset.X, "salary": dataset.y},
         output_dir=output_dir,
         overwrite=True,
         open=False,
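
Note on the doc/data_ops_report.py change: the variables are now declared by name only, and the data is passed as a dictionary when the report is generated. The toy sketch below illustrates that deferred-binding pattern in plain Python. `ToyVar`, `ToyStep`, `count_rows`, and `summarize` are hypothetical names invented for this note; they only mimic the shape of the calls in the diff and make no claim about how skrub implements them.

```python
class ToyVar:
    """Hypothetical named placeholder (not skrub's implementation)."""

    def __init__(self, name):
        self.name = name

    def apply(self, func, **named_inputs):
        # Record the step; nothing is computed until evaluate() receives data.
        return ToyStep(self, func, named_inputs)

    def evaluate(self, environment):
        # Look the value up by name in the dict supplied at run time.
        return environment[self.name]


class ToyStep(ToyVar):
    """A deferred function call whose arguments are other nodes."""

    def __init__(self, parent, func, named_inputs):
        self.parent = parent
        self.func = func
        self.named_inputs = named_inputs

    def evaluate(self, environment):
        value = self.parent.evaluate(environment)
        extras = {
            key: node.evaluate(environment)
            for key, node in self.named_inputs.items()
        }
        return self.func(value, **extras)


def count_rows(table):
    return len(table)


def summarize(n_rows, y):
    return {"n_rows": n_rows, "mean_salary": sum(y) / len(y)}


# The pipeline is defined without any data attached to it.
pipeline = (
    ToyVar("employee_data")
    .apply(count_rows)
    .apply(summarize, y=ToyVar("salary"))
)

# The data is bound by name only when the pipeline is evaluated.
result = pipeline.evaluate(
    {"employee_data": [{"dept": "A"}, {"dept": "B"}], "salary": [50000.0, 70000.0]}
)
```

Because the definition holds no data, the same `pipeline` object can be evaluated again with a different dictionary.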