Changes from 14 commits
2 changes: 1 addition & 1 deletion .github/workflows/check_stub_files_diff.yaml
@@ -17,7 +17,7 @@ jobs:
- uses: actions/checkout@v6
- uses: prefix-dev/setup-pixi@v0.9.6
with:
pixi-version: v0.59.0
pixi-version: v0.68.0
frozen: true

- name: Check stub file for `_data_ops.py` is up-to-date
2 changes: 1 addition & 1 deletion .github/workflows/run-code-format-checks.yaml
@@ -17,7 +17,7 @@ jobs:
- uses: actions/checkout@v6
- uses: prefix-dev/setup-pixi@v0.9.6
with:
pixi-version: v0.59.0
pixi-version: v0.68.0
frozen: true

- name: Run tests
2 changes: 1 addition & 1 deletion .github/workflows/test-javascript.yml
@@ -15,7 +15,7 @@ jobs:
- uses: actions/checkout@v6
- uses: prefix-dev/setup-pixi@v0.9.6
with:
pixi-version: v0.59.0
pixi-version: v0.68.0
environments: ci-py314-latest-optional-deps
# we can freeze the environment and manually bump the dependencies to the
# latest version from time to time.
4 changes: 2 additions & 2 deletions .github/workflows/testing.yml
@@ -27,7 +27,7 @@ jobs:
- uses: actions/checkout@v6
- uses: prefix-dev/setup-pixi@v0.9.6
with:
pixi-version: v0.59.0
pixi-version: v0.68.0
environments: ${{ matrix.environment }}
# we can freeze the environment and manually bump the dependencies to the
# latest version from time to time.
@@ -63,7 +63,7 @@ jobs:
- uses: actions/checkout@v6
- uses: prefix-dev/setup-pixi@v0.9.6
with:
pixi-version: v0.59.0
pixi-version: v0.68.0
environments: ci-nightly-deps
# we can freeze the environment and manually bump the dependencies to the
# latest version from time to time.
2 changes: 1 addition & 1 deletion .github/workflows/update_pixi_lock_files.yml
@@ -24,7 +24,7 @@ jobs:
- uses: actions/checkout@v6
- uses: prefix-dev/setup-pixi@v0.9.6
with:
pixi-version: v0.59.0
pixi-version: v0.68.0
run-install: false

- name: Remove the current lock file
1 change: 1 addition & 0 deletions doc/conf.py
@@ -70,6 +70,7 @@
"sphinx.ext.linkcode",
"sphinx.ext.autodoc.typehints",
# contrib
"sphinx_design",
"numpydoc",
"sphinx_issues",
"sphinx_copybutton",
84 changes: 52 additions & 32 deletions doc/data_ops.rst
@@ -1,46 +1,66 @@
.. _user_guide_data_ops_index:

Complex multi-table pipelines with Data Ops
===========================================

Skrub provides an easy way to build complex, flexible machine learning pipelines.
There are several needs that are not easily addressed with standard scikit-learn
tools such as :class:`~sklearn.pipeline.Pipeline` and
:class:`~sklearn.compose.ColumnTransformer`, and for which the skrub DataOps offer
a solution:

- Multiple tables: We often have several tables of different shapes (for
example, "Customers", "Orders", and "Products" tables) that need to be
processed and assembled into a design matrix ``X``. The target ``y`` may also
be the result of some data processing. Standard scikit-learn estimators do not
support this, as they expect right away a single design matrix ``X`` and a
target array ``y``, with one row per observation.
- DataFrame wrangling: Performing typical DataFrame operations such as
projections, joins, and aggregations should be possible and allow leveraging
the powerful and familiar APIs of `Pandas <https://pandas.pydata.org>`_ or
`Polars <https://docs.pola.rs/>`_.
- Hyperparameter tuning: Choices of estimators, hyperparameters, and even
the pipeline architecture can be guided by validation scores. Specifying
ranges of possible values outside of the pipeline itself (as in
:class:`~sklearn.model_selection.GridSearchCV`) is difficult in complex
pipelines.
- Iterative development: Building a pipeline step by step while inspecting
intermediate results allows for a short feedback loop and early discovery of
errors.

In this section we cover all about the skrub Data Ops, from starting out with a
simple example, to more advanced concepts like parameter tuning and pipeline
validation.
.. currentmodule:: skrub

Building complete pipelines with DataOps
========================================

A skrub DataOp is a complete machine learning pipeline —from data loading and
wrangling to the final prediction— in a single object that can be fitted, tuned,
cross-validated, and saved in a file like any scikit-learn estimator.

To solve a machine-learning task we often need to combine multiple operations
such as loading and filtering data, joining tables and computing aggregations,
extracting numerical features, and fitting a classifier or regressor.

**Storing state**  Each of those operations may need to be fitted: to learn some
information from training data and reuse it to apply consistent transformations
to new data. This is the case for transformers like the
:class:`~sklearn.preprocessing.StandardScaler` and :class:`TableVectorizer` and
estimators like :class:`~sklearn.ensemble.RandomForestClassifier`.
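
As a rough illustration of what "storing state" means, the sketch below fits a
toy scaler on training values and reuses the learned statistics on new values.
``ToyStandardScaler`` is a hypothetical, simplified stand-in written in plain
Python for this example; it is not the scikit-learn or skrub implementation.

```python
from statistics import mean, pstdev


class ToyStandardScaler:
    """Hypothetical, simplified stand-in for a fitted transformer."""

    def fit(self, values):
        # Learn the state (mean and scale) from the training data only.
        self.mean_ = mean(values)
        # Fall back to 1.0 so constant columns do not cause a division by zero.
        self.scale_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        # Reuse the stored state; nothing is re-learned from `values`.
        return [(v - self.mean_) / self.scale_ for v in values]


train_values = [1.0, 3.0]
new_values = [2.0, 5.0]

scaler = ToyStandardScaler().fit(train_values)
train_scaled = scaler.transform(train_values)
new_scaled = scaler.transform(new_values)
```

The new values are scaled with the mean and scale learned from the training
values, which is what keeps the transformation consistent between fitting and
prediction.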

**Tuning**  Moreover, each processing step may involve decisions that need to be
tuned (*tuning* means finding the value that gives the best predictive
performance), for example: what weather forecast features should I include to
predict the load on an electric grid? How should I encode a product description
to help predict the product's category? What learning rate to set on a
:class:`~sklearn.ensemble.HistGradientBoostingRegressor`?

**Validation**  Finally, the quality of predictions must be evaluated on
held-out data (with a train/test split or cross-validation), taking care to
**avoid leakage** of test data into the training set.
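
The tuning and validation steps described above can be sketched in plain
Python, assuming a toy one-dimensional nearest-neighbour regressor. The helper
names (``knn_predict``, ``tune_k``) and the data are hypothetical and only
meant to show the idea: each candidate value is scored on held-out points that
were never used for fitting. This is not how skrub performs tuning internally.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def tune_k(train_x, train_y, valid_x, valid_y, candidates):
    """Score each candidate k on the held-out set and return the best one."""
    scores = {}
    for k in candidates:
        preds = [knn_predict(train_x, train_y, x, k) for x in valid_x]
        scores[k] = mean_squared_error(valid_y, preds)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy data following y = 2 * x.
xs = [float(i) for i in range(10)]
ys = [2.0 * x for x in xs]

# Hold out every third point; the model never sees these during fitting.
train_x = [x for i, x in enumerate(xs) if i % 3 != 0]
train_y = [y for i, y in enumerate(ys) if i % 3 != 0]
valid_x = [x for i, x in enumerate(xs) if i % 3 == 0]
valid_y = [y for i, y in enumerate(ys) if i % 3 == 0]

best_k, scores = tune_k(train_x, train_y, valid_x, valid_y, candidates=[1, 2, 4])
```

Keeping the held-out points out of ``train_x`` and ``train_y`` is the part that
guards against leakage; the choice of ``k`` is then driven only by the
validation error.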

Separating the data wrangling from the fitted estimator makes it difficult to
handle the tasks above correctly. Skrub DataOps help by binding an arbitrary set
of transformations of any number of inputs into a single estimator. These
transformations can easily be parametrized with tunable choices. The resulting
objects have built-in methods for cross-validation and tuning with either Optuna
or scikit-learn, and for inspecting runs and intermediate results. Once fitted,
they can be saved in a file, loaded, and applied to new data as easily as a
single :class:`~sklearn.linear_model.LogisticRegression`.

.. dropdown:: Going beyond the scikit-learn Pipeline
:color: primary

To some extent, DataOps exist for the same reasons as the simpler
scikit-learn :class:`sklearn.pipeline.Pipeline` used in other parts of this
documentation. However, the Pipeline is too limited for many real-world
problems: it can only represent a linear sequence of scikit-learn transformers;
the design matrix and target variables must be constructed and divided into
training and testing sets outside of the pipeline; the number of rows cannot
change; only a single table can be handled; hyperparameter choices are difficult
to define; etc. Skrub DataOps remove those limitations and add several useful
features such as interactive previews and integration with Optuna.

Data Ops basic concepts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
:maxdepth: 3

auto_tutorials/1111_data_ops_quick_tour
modules/data_ops/basics/what_are_data_ops
modules/data_ops/basics/building_data_ops_plan
auto_tutorials/1110_data_ops_intro
modules/data_ops/basics/using_previews
modules/data_ops/basics/direct_access_methods
modules/data_ops/basics/control_flow
18 changes: 8 additions & 10 deletions doc/data_ops_report.py
@@ -74,18 +74,16 @@ def create_employee_salaries_report():
if (output_dir / "index.html").exists():
return output_dir

dataset = skrub.datasets.fetch_employee_salaries(split="train").employee_salaries
data_var = skrub.var("data", dataset)
X = data_var.drop("current_annual_salary", axis=1).skb.mark_as_X()
y = data_var["current_annual_salary"].skb.mark_as_y()

vectorizer = TableVectorizer()
X_vec = X.skb.apply(vectorizer)
pred = (
skrub.var("employee_data")
.skb.apply(TableVectorizer())
.skb.apply(HistGradientBoostingRegressor(), y=skrub.var("salary"))
)

hgb = HistGradientBoostingRegressor()
predictor = X_vec.skb.apply(hgb, y=y)
dataset = skrub.datasets.fetch_employee_salaries(split="train")

predictor.skb.full_report(
pred.skb.full_report(
{"employee_data": dataset.X, "salary": dataset.y},
output_dir=output_dir,
overwrite=True,
open=False,