Commits
32 commits
18ba895
Adding extensions for generating md version of docs
rcap107 Jun 17, 2026
7067138
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 18, 2026
d022a01
changing llms library
rcap107 Jun 18, 2026
209a68c
updating build to ship documentation
rcap107 Jun 18, 2026
b435e39
changelog
rcap107 Jun 18, 2026
0172334
avoiding duplication, renaming folder
rcap107 Jun 18, 2026
b24ce6a
[doc-build]
rcap107 Jun 18, 2026
1b90ce7
pyproject
rcap107 Jun 18, 2026
62dfcbb
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 18, 2026
15dcc08
cleanup of doc install
rcap107 Jun 18, 2026
8d15e9d
adding manifest file to clean up wheel
rcap107 Jun 18, 2026
6ab36e8
improvements
rcap107 Jun 18, 2026
76dfc61
more references to applytocols/selectors
rcap107 Jun 18, 2026
8f4e62d
clean up build
rcap107 Jun 19, 2026
4e577f9
updating build process
rcap107 Jun 22, 2026
95e922f
removing setup
rcap107 Jun 22, 2026
1f5d2a5
changelog
rcap107 Jun 22, 2026
4f732fc
removing manifest
rcap107 Jun 22, 2026
f78dac8
avoiding exec
rcap107 Jun 22, 2026
3c748c2
improving tv/applytocol docs
rcap107 Jun 22, 2026
d05cb0d
commenting out a test
rcap107 Jun 22, 2026
b8ec156
cleanup
rcap107 Jun 23, 2026
b8549dc
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 23, 2026
aa749a9
lock file
rcap107 Jun 23, 2026
918906d
testing build
rcap107 Jun 23, 2026
ef17076
excluding py files from test collection
rcap107 Jun 23, 2026
6519659
moving doc files to _docs
rcap107 Jun 23, 2026
6cf2470
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 25, 2026
e9f4ff7
updating examples
rcap107 Jun 25, 2026
6a19cf7
updating changelog
rcap107 Jun 25, 2026
79b2690
restoring old files
rcap107 Jun 25, 2026
6f96af6
some comments
rcap107 Jun 25, 2026
19 changes: 19 additions & 0 deletions .gitignore

Member Author


adding files copied from the skrub/_docs folder to the gitignore so they don't get counted twice (like the changelog)

@@ -67,6 +67,25 @@ doc/CHANGES.rst
doc/RELEASE_PROCESS.rst
doc/CONTRIBUTING.rst
doc/sg_execution_times.rst
# RST content files synced from skrub/_docs at build time (conf.py)
doc/about.rst
doc/column_level_featurizing.rst
doc/data_ops.rst
doc/default_wrangling.rst
doc/development.rst
doc/documentation.rst
doc/exploring_a_dataframe.rst
doc/howto.rst
doc/index.rst
doc/install.rst
doc/joining_dataframes.rst
doc/learning_materials.rst
doc/multi_column_operations.rst
doc/tutorial_example.rst
doc/vision.rst
doc/guides/
doc/modules/
doc/tutorials/
.DS_Store
doc/_templates/demo_table_report_generated.html
doc/reference/*.rst
5 changes: 5 additions & 0 deletions CHANGES.rst
@@ -65,6 +65,11 @@ Changes
:pr:`2048` by :user:`Riccardo Cappuzzo <rcap107>`.
- The minimum required version of matplotlib has been increased from 3.4.3 to 3.6.1.
:pr:`2159` by :user:`Riccardo Cappuzzo <rcap107>`.
- The package build now ships the user guide and examples inside the wheel, so
  that they can be read directly from an installed package rather than only from
  the online docs. Docs and examples are now stored in ``skrub/_docs``, rather
  than in the root of the repository.
  :pr:`2173` by :user:`Riccardo Cappuzzo <rcap107>`.

Bugfixes
--------
11 changes: 9 additions & 2 deletions README.rst
@@ -17,8 +17,7 @@ skrub
.. |black| image:: https://img.shields.io/badge/code%20style-black-000000.svg


**skrub** (formerly *dirty_cat*) is a Python
library that facilitates machine learning with dataframes.
**skrub** is a Python library that facilitates machine learning with dataframes.

If you like the package, spread the word and ⭐ this repository!
You can also join the `Discord server <https://discord.gg/ABaPnm7fDC>`_.
@@ -28,6 +27,14 @@ Website: https://skrub-data.org/
See our `examples <https://skrub-data.org/stable/auto_examples>`_, or check out
the `learning materials <https://skrub-data.org/skrub-materials/index.html>`_.

Documentation and examples are bundled with the package itself, in
``skrub/_docs``. After installing, you can locate them with:

.. code-block:: python

    import skrub
    print(skrub.__docs_dir__)
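
Once that directory is known, the bundled files can be enumerated with the standard library. The following is only a sketch: ``list_bundled_docs`` is a hypothetical helper that is not part of skrub, and a temporary directory stands in for ``skrub.__docs_dir__`` so that the snippet runs without this PR installed.

```python
import tempfile
from pathlib import Path


def list_bundled_docs(docs_dir):
    """Return the RST files under ``docs_dir`` as sorted, POSIX-style relative paths."""
    root = Path(docs_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.rst"))


# Stand-in directory (assumption): with this PR installed one would pass
# ``skrub.__docs_dir__`` instead of ``tmp``.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "guides").mkdir()
    (Path(tmp) / "index.rst").write_text("skrub\n=====\n")
    (Path(tmp) / "guides" / "selectors.rst").write_text("Selectors\n=========\n")
    found = list_bundled_docs(tmp)

print(found)
```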

Installation
------------

32 changes: 29 additions & 3 deletions doc/Makefile
@@ -29,24 +29,50 @@ html:
rm -rf $(BUILDDIR)/html/_images
#rm -rf _build/doctrees/
SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
# Build markdown sources so llms.txt links point to .md files
SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
cp -r $(BUILDDIR)/markdown/. $(BUILDDIR)/html/_sources/
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

html-noplot:
SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -D plot_gallery=0 -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D markdown_uri_doc_suffix="html.md" -D plot_gallery=0 -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
# Build markdown sources so llms.txt links point to .md files
SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D plot_gallery=0 -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
cp -r $(BUILDDIR)/markdown/. $(BUILDDIR)/html/_sources/
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Linkcheck finished. Results are in $(BUILDDIR)/linkcheck."

linkcheck-noplot:
$(SPHINXBUILD) -D plot_gallery=0 -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck-noplot
SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D plot_gallery=0 -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck-noplot
@echo
@echo "Linkcheck (no plot) finished. Results are in $(BUILDDIR)/linkcheck-noplot."

markdown:
SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
@echo
@echo "Markdown build finished. The markdown files are in $(BUILDDIR)/markdown."

markdown-noplot:
SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D plot_gallery=0 -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
@echo
@echo "Markdown build (no plot) finished. The markdown files are in $(BUILDDIR)/markdown."

# skrub/_docs is the single source of truth for guide/content RST files.
# They are synced into doc/ automatically at build time by conf.py.
# Use this target to verify the two trees are in sync (no-op if they match).
check-docs-sync:
diff -rq --exclude="*.pyc" ../skrub/_docs/ . \
--exclude="CHANGES.rst" --exclude="CONTRIBUTING.rst" \
--exclude="RELEASE_PROCESS.rst" \
$(addprefix --exclude=,$(notdir $(wildcard ../skrub/_docs/*.rst))) \
&& echo "skrub/_docs and doc/ are in sync" || true
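
For comparison, a content-level sync check can also be sketched in Python. This is not what the Makefile target does: ``out_of_sync`` is a hypothetical helper, and the two temporary directories stand in for ``skrub/_docs`` and ``doc/``. Unlike the ``diff`` recipe, which ends in ``|| true`` and therefore never fails, a helper like this could be turned into a failing CI check by exiting with a non-zero status when the returned list is not empty.

```python
import filecmp
import tempfile
from pathlib import Path


def out_of_sync(src, dst, pattern="*.rst"):
    """Return relative paths under ``src`` that are missing from ``dst`` or differ."""
    src, dst = Path(src), Path(dst)
    stale = []
    for src_file in sorted(src.rglob(pattern)):
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        # shallow=False compares file contents, not just size and mtime
        if not dst_file.is_file() or not filecmp.cmp(src_file, dst_file, shallow=False):
            stale.append(rel.as_posix())
    return stale


# Stand-in trees (assumption): "src" plays skrub/_docs, "dst" plays doc/.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "dst")
    (src / "guides").mkdir(parents=True)
    dst.mkdir()
    (src / "a.rst").write_text("same")
    (dst / "a.rst").write_text("same")
    (src / "b.rst").write_text("new text")
    (dst / "b.rst").write_text("old")
    (src / "guides" / "c.rst").write_text("only in src")
    stale = out_of_sync(src, dst)

print(stale)
```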

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
67 changes: 51 additions & 16 deletions doc/conf.py
@@ -24,6 +24,13 @@

import jinja2

# Allow skipping jupyterlite to speed up builds (e.g. html-noplot)
_SKIP_JUPYTERLITE = os.environ.get("SKIP_JUPYTERLITE", "").strip() in (
    "1",
    "true",
    "yes",
)
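
The flag parsing above can be isolated into a small function to make its behaviour easier to see. This is a sketch only: ``env_flag`` and its ``environ`` parameter are hypothetical names, not something defined in ``conf.py``.

```python
import os


def env_flag(name, environ=None):
    """Return True when the variable is set to exactly "1", "true" or "yes"."""
    env = os.environ if environ is None else environ
    # Same test as in conf.py: surrounding whitespace is ignored, case is not.
    return env.get(name, "").strip() in ("1", "true", "yes")


print(env_flag("SKIP_JUPYTERLITE", {"SKIP_JUPYTERLITE": " 1 "}))
print(env_flag("SKIP_JUPYTERLITE", {"SKIP_JUPYTERLITE": "True"}))
print(env_flag("SKIP_JUPYTERLITE", {}))
```

Because the comparison is case-sensitive, ``SKIP_JUPYTERLITE=True`` would not skip JupyterLite; lower-casing the value before the membership test would accept it, if that is the intent.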

# Generate the table report html file for the homepage
sys.path.append(os.path.relpath("."))
from data_ops_report import create_data_ops_report
@@ -43,14 +50,33 @@
from github_link import make_linkcode_resolve
from sphinx_gallery.notebook import add_code_cell, add_markdown_cell

# -- Copy files for docs --------------------------------------------------
# -- Sync documentation source files from skrub/_docs --------------------
#
# We avoid duplicating the information, but we do not use symlinks to be
# able to build the docs on Windows
# skrub/_docs is the single source of truth for all guide/content RST files
# so they are packaged with the wheel. We copy them into doc/ at build time
# rather than using symlinks (to support Windows builds).
#
# CHANGES.rst, CONTRIBUTING.rst and RELEASE_PROCESS.rst are canonical in the
# project root and are NOT stored in skrub/_docs.
shutil.copyfile("../RELEASE_PROCESS.rst", "RELEASE_PROCESS.rst")
shutil.copyfile("../CHANGES.rst", "CHANGES.rst")
shutil.copyfile("../CONTRIBUTING.rst", "CONTRIBUTING.rst")

_docs_src = Path("../skrub/_docs")

# Copy top-level RST content files
_skip_toplevel = {"CHANGES.rst", "CONTRIBUTING.rst", "RELEASE_PROCESS.rst"}
for _rst_file in _docs_src.glob("*.rst"):
    if _rst_file.name not in _skip_toplevel:
        shutil.copyfile(_rst_file, _rst_file.name)

# Copy content subdirectories (guides, modules)
for _subdir in ["guides", "modules"]:
    shutil.copytree(_docs_src / _subdir, _subdir, dirs_exist_ok=True)

# Copy tutorials source files for sphinx-gallery
shutil.copytree(_docs_src / "tutorials", "tutorials", dirs_exist_ok=True)
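
The copy-based sync can be exercised in isolation with temporary directories. This is a sketch under stated assumptions: ``sync_docs`` is a hypothetical wrapper around the same ``shutil`` calls used above, not a function from ``conf.py``, and the directory names are stand-ins.

```python
import shutil
import tempfile
from pathlib import Path


def sync_docs(src, dst, skip=(), subdirs=()):
    """Copy top-level ``*.rst`` files and the named subdirectories from src to dst."""
    src, dst = Path(src), Path(dst)
    for rst_file in src.glob("*.rst"):
        if rst_file.name not in skip:
            shutil.copyfile(rst_file, dst / rst_file.name)
    for name in subdirs:
        # dirs_exist_ok=True (Python 3.8+) lets repeated builds overwrite in place
        shutil.copytree(src / name, dst / name, dirs_exist_ok=True)


# Stand-in trees (assumption): "_docs" plays skrub/_docs, "doc" plays doc/.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "_docs"), Path(tmp, "doc")
    (src / "guides").mkdir(parents=True)
    dst.mkdir()
    (src / "index.rst").write_text("index")
    (src / "CHANGES.rst").write_text("should be skipped")
    (src / "guides" / "a.rst").write_text("guide")
    sync_docs(src, dst, skip={"CHANGES.rst"}, subdirs=["guides"])
    sync_docs(src, dst, skip={"CHANGES.rst"}, subdirs=["guides"])  # second run is safe
    copied = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*.rst"))

print(copied)
```

Note that a copy of this kind only adds or overwrites files: a file deleted from the source tree stays in the destination until it is removed by hand or by a clean build.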

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
@@ -76,27 +102,36 @@
"sphinx_copybutton",
"sphinx_gallery.gen_gallery",
"autoshortsummary",
"sphinx_llms_txt",
"sphinx_markdown_builder",
]

# -- sphinx-llms-txt configuration -------------------------------------------
# Link to Markdown sources in _sources/ (generated by the markdown builder).
llms_txt_uri_template = "{base_url}_sources/{docname}.md"
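
To see what links this template would yield, the placeholders can be filled in by hand. This is only an illustration: it assumes the placeholders are substituted as in ``str.format``, and both the base URL and the document name below are example values, not taken from the build.

```python
template = "{base_url}_sources/{docname}.md"


def markdown_source_url(base_url, docname):
    """Fill the two placeholders of the template (assumed str.format semantics)."""
    return template.format(base_url=base_url, docname=docname)


# Example values (assumptions): the base URL must end with a slash, because
# the template adds no separator before "_sources/".
url = markdown_source_url(
    "https://skrub-data.org/stable/",
    "modules/multi_column_operations/selectors",
)
print(url)
```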

try:
    import sphinxext.opengraph # noqa

    extensions.append("sphinxext.opengraph")
except ImportError:
    print("ERROR: sphinxext.opengraph import failed")

try:
    import jupyterlite_sphinx # noqa: F401

    extensions.append("jupyterlite_sphinx")
    with_jupyterlite = True
except ImportError:
    # In some cases we don't want to require jupyterlite_sphinx to be installed,
    # e.g. the doc-min-dependencies build
    warnings.warn(
        "jupyterlite_sphinx is not installed, you need to install it "
        "if you want JupyterLite links to appear in each example"
    )
if not _SKIP_JUPYTERLITE:
    try:
        import jupyterlite_sphinx # noqa: F401

        extensions.append("jupyterlite_sphinx")
        with_jupyterlite = True
    except ImportError:
        # In some cases we don't want to require jupyterlite_sphinx to be installed,
        # e.g. the doc-min-dependencies build
        warnings.warn(
            "jupyterlite_sphinx is not installed, you need to install it "
            "if you want JupyterLite links to appear in each example"
        )
        with_jupyterlite = False
else:
    with_jupyterlite = False

import sphinx_autosummary_accessors
@@ -480,7 +515,7 @@ def call_garbage_collector(gallery_conf, fname):
# See https://sphinx-gallery.github.io/stable/configuration.html#link-to-documentation # noqa
},
"filename_pattern": ".*",
"examples_dirs": ["../examples", "tutorials"],
"examples_dirs": ["../skrub/_docs/examples", "tutorials"],
"gallery_dirs": ["auto_examples", "auto_tutorials"],
"within_subsection_order": FileNameSortKey,
"download_all_examples": False,
29 changes: 15 additions & 14 deletions doc/modules/default_wrangling/apply_to_cols.rst
@@ -2,6 +2,7 @@

.. |ApplyToCols| replace:: :class:`ApplyToCols`
.. |TableVectorizer| replace:: :class:`TableVectorizer`
.. |selectors| replace:: :mod:`skrub.selectors`
.. |s.string| replace:: :meth:`~skrub.selectors.string`
.. |s.numeric| replace:: :meth:`~skrub.selectors.numeric`
.. |RejectColumn| replace:: :class:`core.RejectColumn`
@@ -14,30 +15,30 @@

.. _user_guide_multiple_columns:

Transforming selected columns with |ApplyToCols|
Transforming only some columns with |ApplyToCols|
===========================================================

Very often and for various reasons, transformers must be applied only to some of the
columns in a dataframe. For example, all numeric columns in a dataframe may need
to be scaled at the same time, while string columns should be left alone.
While the heuristics used by the :class:`TableVectorizer` are usually good enough
to apply the proper transformers to different datatypes, using it may not be an
option in all cases. In scikit-learn pipelines, the column selection operation can
be done with the :class:`~sklearn.compose.ColumnTransformer`.
option in all cases.

Skrub provides the |ApplyToCols| transformer to achieve the same results with
a larger degree of control over which columns are being transformed.
|ApplyToCols| maps a transformer to columns in a dataframe, so that all
columns that satisfy a certain condition are transformed, while the others are
left untouched.
|ApplyToCols| (optionally paired with the |selectors|) makes it possible to
transform specific columns with a large degree of control: |ApplyToCols| maps a
transformer to columns in a dataframe, so that all columns that satisfy a certain
condition are transformed, while the others are left untouched. Together,
|ApplyToCols| and the |selectors| play a role similar to scikit-learn's
:class:`~sklearn.compose.ColumnTransformer`.

.. tip::

If a skrub transformer has a ``cols`` parameter to specify a column list,
that can be a selector as well. Selectors give more control over which columns
are being transformed: they are discussed at length in the
:ref:`selectors user guide<user_guide_selectors>`.
Using selectors to choose or exclude columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a skrub transformer has a ``cols`` parameter for specifying a list of columns,
a selector can be passed there as well. Selectors give more control over which
columns are transformed; they are discussed at length in the
:ref:`selectors user guide<user_guide_selectors>`.

|ApplyToCols| can be used to transform a subset of columns in a dataframe, while
leaving the non-selected columns unchanged. In this example, we want to apply
@@ -110,7 +111,7 @@ id city_Madrid city_Paris city_Rome date_year date_month date_day date_to

Note that the column "id" was not encoded and was instead left as-is.

Dealing with columns that cannot be handled by a transformer
Rejecting columns that cannot be handled by a transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

|ApplyToCols| can allow the underlying encoder to decide which columns it can be applied to.
14 changes: 8 additions & 6 deletions doc/modules/multi_column_operations/selectors.rst
@@ -108,8 +108,8 @@ name, data type, contents, or according to arbitrary user-provided rules::

* :ref:`user_guide_advanced_selectors`

Combining selectors
-------------------
Selectors can be combined with the set operators
------------------------------------------------

The available operators are ``|``, ``&``, ``-`` and ``^``, with the same meaning
as for Python sets, and ``~`` to invert a selection:
@@ -146,8 +146,8 @@ following selector won't compute the cardinality of non-categorical columns:
(categorical() & cardinality_below(10))

.. _user_guide_selectors_expand:
Visualizing a selector
----------------------
Using selectors with dataframe libraries
----------------------------------------

All selectors have the :meth:`expand` method, which allows dataframe manipulation
outside of a skrub workflow: applying it to any dataframe will return the list
@@ -180,8 +180,10 @@ The :meth:`expand_index` method also exists: rather than returning a list of col
Using selectors with other skrub transformers
-------------------------------------------------

Skrub transformers are designed to be used in conjunction with other transformers
that operate on columns to improve their versatility.
Skrub selectors are designed to be used in conjunction with :class:`~skrub.ApplyToCols`,
:class:`~skrub.SelectCols`, and :class:`~skrub.DropCols`, as well as
:func:`~skrub.DataOp.skb.apply`, giving finer control over which columns these
operations act on.

For example, it is possible to drop columns that have more unique values than a
certain amount by combining :func:`~skrub.selectors.cardinality_below` with
1 change: 0 additions & 1 deletion doc/modules/multi_column_operations/type_of_selectors.rst
@@ -79,7 +79,6 @@ Selectors based on column data types
- :func:`~skrub.selectors.any_date`: Select columns with date or datetime data types
- :func:`~skrub.selectors.categorical`: Select columns with categorical data types
- :func:`~skrub.selectors.string`: Select columns with string data types
- :func:`~skrub.selectors.object`: Select columns with the ``object`` (pandas) or ``pl.Object`` (polars) dtype
- :func:`~skrub.selectors.boolean`: Select columns with boolean data types

Selectors based on column content and properties