Changes from 15 commits
32 commits
18ba895
Adding extensions for generating md version of docs
rcap107 Jun 17, 2026
7067138
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 18, 2026
d022a01
changing llms library
rcap107 Jun 18, 2026
209a68c
updating build to ship documentation
rcap107 Jun 18, 2026
b435e39
changelog
rcap107 Jun 18, 2026
0172334
avoiding duplication, renaming folder
rcap107 Jun 18, 2026
b24ce6a
[doc-build]
rcap107 Jun 18, 2026
1b90ce7
pyproject
rcap107 Jun 18, 2026
62dfcbb
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 18, 2026
15dcc08
cleanup of doc install
rcap107 Jun 18, 2026
8d15e9d
adding manifest file to clean up wheel
rcap107 Jun 18, 2026
6ab36e8
improvements
rcap107 Jun 18, 2026
76dfc61
more references to applytocols/selectors
rcap107 Jun 18, 2026
8f4e62d
clean up build
rcap107 Jun 19, 2026
4e577f9
updating build process
rcap107 Jun 22, 2026
95e922f
removing setup
rcap107 Jun 22, 2026
1f5d2a5
changelog
rcap107 Jun 22, 2026
4f732fc
removing manifest
rcap107 Jun 22, 2026
f78dac8
avoiding exec
rcap107 Jun 22, 2026
3c748c2
improving tv/applytocol docs
rcap107 Jun 22, 2026
d05cb0d
commenting out a test
rcap107 Jun 22, 2026
b8ec156
cleanup
rcap107 Jun 23, 2026
b8549dc
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 23, 2026
aa749a9
lock file
rcap107 Jun 23, 2026
918906d
testing build
rcap107 Jun 23, 2026
ef17076
excluding py files from test collection
rcap107 Jun 23, 2026
6519659
moving doc files to _docs
rcap107 Jun 23, 2026
6cf2470
Merge remote-tracking branch 'upstream/HEAD' into doc-improve-agentic…
rcap107 Jun 25, 2026
e9f4ff7
updating examples
rcap107 Jun 25, 2026
6a19cf7
updating changelog
rcap107 Jun 25, 2026
79b2690
restoring old files
rcap107 Jun 25, 2026
6f96af6
some comments
rcap107 Jun 25, 2026
1 change: 1 addition & 0 deletions .gitignore

Member Author


adding files copied from the skrub/_docs folder to the gitignore so they don't get counted twice (like the changelog)

Original file line number Diff line number Diff line change
@@ -71,6 +71,7 @@ doc/sg_execution_times.rst
doc/_templates/demo_table_report_generated.html
doc/reference/*.rst
doc/benchmark_indications.rst
skrub/_docs/*

# Pkl files for benchmarks
benchmarks/*.pkl
3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -62,6 +62,9 @@ Changes
:pr:`2048` by :user:`Riccardo Cappuzzo <rcap107>`.
- The minimum required version of matplotlib has been increased from 3.4.3 to 3.6.1.
:pr:`2159` by :user:`Riccardo Cappuzzo <rcap107>`.
- The package wheel has been updated so that it includes the User Guide and examples
in Markdown format.
:pr:`2173` by :user:`Riccardo Cappuzzo <rcap107>`.

Bugfixes
--------
12 changes: 12 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,12 @@
prune .binder
prune .circleci
prune .github
prune build_tools
prune doc
prune examples

exclude .coveragerc
exclude .git-blame-ignore-revs
exclude .pre-commit-config.yaml
exclude codecov.yml
exclude pixi.lock
11 changes: 9 additions & 2 deletions README.rst
@@ -17,8 +17,7 @@ skrub
.. |black| image:: https://img.shields.io/badge/code%20style-black-000000.svg


**skrub** (formerly *dirty_cat*) is a Python
library that facilitates machine learning with dataframes.
**skrub** is a Python library that facilitates machine learning with dataframes.

If you like the package, spread the word and ⭐ this repository!
You can also join the `Discord server <https://discord.gg/ABaPnm7fDC>`_.
@@ -28,6 +27,14 @@ Website: https://skrub-data.org/
See our `examples <https://skrub-data.org/stable/auto_examples>`_, or check out
the `learning materials <https://skrub-data.org/skrub-materials/index.html>`_.

The documentation (in Markdown format) is also bundled with the package itself.
After installing, you can find it at:

.. code-block:: python

    import skrub
    print(skrub.__docs_dir__)
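
A minimal sketch of how the bundled documentation could be browsed after installation. It assumes that ``skrub.__docs_dir__`` (from the README snippet above) points to a directory containing Markdown files; the helper name ``list_markdown_docs`` and the file layout in the demo are hypothetical, and the demo runs on a temporary directory so it does not depend on an installed skrub.

```python
import tempfile
from pathlib import Path


def list_markdown_docs(docs_dir):
    """Return the Markdown files under docs_dir as sorted, relative POSIX paths."""
    root = Path(docs_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.md"))


# Demo on a throwaway directory that mimics a possible bundled layout
# (assumed for illustration, not taken from an actual wheel).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "guides").mkdir()
    (Path(tmp) / "index.md").write_text("# skrub\n")
    (Path(tmp) / "guides" / "selectors.md").write_text("# Selectors\n")
    (Path(tmp) / "llms.txt").write_text("not a markdown file\n")
    demo_listing = list_markdown_docs(tmp)

print(demo_listing)
```

With a wheel built from this branch, the equivalent call would presumably be ``list_markdown_docs(skrub.__docs_dir__)``.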

Installation
------------

39 changes: 36 additions & 3 deletions doc/Makefile
@@ -29,24 +29,57 @@ html:
	rm -rf $(BUILDDIR)/html/_images
	#rm -rf _build/doctrees/
	SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	# Build markdown sources so llms.txt links point to .md files
	SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
	cp -r $(BUILDDIR)/markdown/. $(BUILDDIR)/html/_sources/
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

html-noplot:
	SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -D plot_gallery=0 -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D markdown_uri_doc_suffix="html.md" -D plot_gallery=0 -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	# Build markdown sources so llms.txt links point to .md files
	SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D plot_gallery=0 -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
	cp -r $(BUILDDIR)/markdown/. $(BUILDDIR)/html/_sources/
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Linkcheck finished. Results are in $(BUILDDIR)/linkcheck."

linkcheck-noplot:
	$(SPHINXBUILD) -D plot_gallery=0 -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck-noplot
	SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D plot_gallery=0 -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck-noplot
	@echo
	@echo "Linkcheck (no plot) finished. Results are in $(BUILDDIR)/linkcheck-noplot."

markdown:
	SKB_TABLE_REPORT_VERBOSITY=0 $(SPHINXBUILD) -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
	@echo
	@echo "Markdown build finished. The markdown files are in $(BUILDDIR)/markdown."

markdown-noplot:
	SKB_TABLE_REPORT_VERBOSITY=0 SKIP_JUPYTERLITE=1 $(SPHINXBUILD) -D plot_gallery=0 -b markdown $(ALLSPHINXOPTS) $(BUILDDIR)/markdown
	@echo
	@echo "Markdown build (no plot) finished. The markdown files are in $(BUILDDIR)/markdown."

# Copy the generated markdown docs into the skrub/ package tree so they
# get bundled with the wheel. Run after html, html-noplot, markdown, or
# markdown-noplot.
install-docs:
	rm -rf ../skrub/_docs
	mkdir -p ../skrub/_docs
	mkdir -p ../skrub/_docs/examples
	mkdir -p ../skrub/_docs/tutorials
	mkdir -p ../skrub/_docs/guides
	cp -r ../examples/* ../skrub/_docs/examples/
	cp -r ../doc/tutorials/* ../skrub/_docs/tutorials/
	cp -r $(BUILDDIR)/markdown/guides/* ../skrub/_docs/guides
	cp -r $(BUILDDIR)/markdown/modules/* ../skrub/_docs/guides
	cp -r $(BUILDDIR)/markdown/*.md ../skrub/_docs/
	cp -r $(BUILDDIR)/html/llms.txt ../skrub/_docs/
	find ../skrub/_docs -name "*.md" | wc -l | xargs echo "Number of markdown files installed:"
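
For readers less familiar with Make, the Markdown part of the ``install-docs`` target can be sketched in Python. This is only an illustration under stated assumptions: the function name ``install_markdown_docs`` is hypothetical, it covers only the ``guides``, ``modules`` and top-level ``*.md`` copies (not the examples, tutorials or ``llms.txt``), and the demo uses a temporary directory with placeholder files instead of a real Sphinx build.

```python
import shutil
import tempfile
from pathlib import Path


def install_markdown_docs(build_markdown, dest):
    """Copy generated Markdown into dest and return the number of .md files."""
    build_markdown, dest = Path(build_markdown), Path(dest)
    # Start from a clean destination, like `rm -rf` followed by `mkdir -p`.
    shutil.rmtree(dest, ignore_errors=True)
    (dest / "guides").mkdir(parents=True)
    # Both guides/ and modules/ are merged into the same guides/ folder.
    for sub in ("guides", "modules"):
        shutil.copytree(build_markdown / sub, dest / "guides", dirs_exist_ok=True)
    # Top-level pages are copied next to the guides/ folder.
    for page in build_markdown.glob("*.md"):
        shutil.copy2(page, dest / page.name)
    # Same count as `find ... -name "*.md" | wc -l` in the Makefile.
    return sum(1 for _ in dest.rglob("*.md"))


with tempfile.TemporaryDirectory() as tmp:
    build = Path(tmp) / "_build" / "markdown"
    docs = Path(tmp) / "skrub" / "_docs"
    # Placeholder build output (hypothetical file names).
    for rel in ("guides/a.md", "modules/sub/b.md", "index.md"):
        target = build / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("# placeholder\n")
    n_installed = install_markdown_docs(build, docs)
    installed = sorted(p.relative_to(docs).as_posix() for p in docs.rglob("*.md"))

print("Number of markdown files installed:", n_installed)
```

Note that merging ``guides`` and ``modules`` into one folder means files with the same relative path would overwrite each other, in the Makefile as well as in this sketch.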

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
40 changes: 28 additions & 12 deletions doc/conf.py
@@ -24,6 +24,13 @@

import jinja2

# Allow skipping jupyterlite to speed up builds (e.g. html-noplot)
_SKIP_JUPYTERLITE = os.environ.get("SKIP_JUPYTERLITE", "").strip() in (
    "1",
    "true",
    "yes",
)
)
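
As a quick illustration of how this check behaves, here is a small sketch that mirrors the expression above. The helper name ``env_flag`` and the ``environ`` argument are hypothetical, added only so the behaviour can be tried without touching the real environment.

```python
import os

_TRUTHY = ("1", "true", "yes")


def env_flag(name, environ=None):
    """Mirror the conf.py check: strip whitespace, then compare to the truthy values."""
    environ = os.environ if environ is None else environ
    return environ.get(name, "").strip() in _TRUTHY


enabled = env_flag("SKIP_JUPYTERLITE", {"SKIP_JUPYTERLITE": "1"})
padded = env_flag("SKIP_JUPYTERLITE", {"SKIP_JUPYTERLITE": " yes "})
capitalized = env_flag("SKIP_JUPYTERLITE", {"SKIP_JUPYTERLITE": "True"})
unset = env_flag("SKIP_JUPYTERLITE", {})

print(enabled, padded, capitalized, unset)
```

Because the comparison is case-sensitive, ``SKIP_JUPYTERLITE=True`` would not skip JupyterLite; lowercasing the value before the comparison would be one way to accept it, if that is desired.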

# Generate the table report html file for the homepage
sys.path.append(os.path.relpath("."))
from data_ops_report import create_data_ops_report
@@ -76,27 +83,36 @@
"sphinx_copybutton",
"sphinx_gallery.gen_gallery",
"autoshortsummary",
"sphinx_llms_txt",
"sphinx_markdown_builder",
]

# -- sphinx-llms-txt configuration -------------------------------------------
# Link to Markdown sources in _sources/ (generated by the markdown builder).
llms_txt_uri_template = "{base_url}_sources/{docname}.md"
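
To make the template concrete, the sketch below fills in the two placeholders with ``str.format``. This assumes the extension substitutes ``{base_url}`` and ``{docname}`` in this way; the base URL and document name are illustrative values, and ``markdown_uri`` is a hypothetical helper.

```python
llms_txt_uri_template = "{base_url}_sources/{docname}.md"


def markdown_uri(base_url, docname):
    """Fill the template; base_url is expected to end with a slash."""
    return llms_txt_uri_template.format(base_url=base_url, docname=docname)


example_uri = markdown_uri(
    "https://skrub-data.org/stable/",  # illustrative base URL
    "modules/default_wrangling/apply_to_cols",  # illustrative docname
)
print(example_uri)
```

Since the template has no separator between ``{base_url}`` and ``_sources``, a base URL without a trailing slash would produce a malformed link.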

try:
    import sphinxext.opengraph  # noqa

    extensions.append("sphinxext.opengraph")
except ImportError:
    print("ERROR: sphinxext.opengraph import failed")

try:
    import jupyterlite_sphinx  # noqa: F401

    extensions.append("jupyterlite_sphinx")
    with_jupyterlite = True
except ImportError:
    # In some cases we don't want to require jupyterlite_sphinx to be installed,
    # e.g. the doc-min-dependencies build
    warnings.warn(
        "jupyterlite_sphinx is not installed, you need to install it "
        "if you want JupyterLite links to appear in each example"
    )
if not _SKIP_JUPYTERLITE:
    try:
        import jupyterlite_sphinx  # noqa: F401

        extensions.append("jupyterlite_sphinx")
        with_jupyterlite = True
    except ImportError:
        # In some cases we don't want to require jupyterlite_sphinx to be installed,
        # e.g. the doc-min-dependencies build
        warnings.warn(
            "jupyterlite_sphinx is not installed, you need to install it "
            "if you want JupyterLite links to appear in each example"
        )
        with_jupyterlite = False
else:
    with_jupyterlite = False

import sphinx_autosummary_accessors
29 changes: 15 additions & 14 deletions doc/modules/default_wrangling/apply_to_cols.rst
@@ -2,6 +2,7 @@

.. |ApplyToCols| replace:: :class:`ApplyToCols`
.. |TableVectorizer| replace:: :class:`TableVectorizer`
.. |selectors| replace:: :mod:`skrub.selectors`
.. |s.string| replace:: :meth:`~skrub.selectors.string`
.. |s.numeric| replace:: :meth:`~skrub.selectors.numeric`
.. |RejectColumn| replace:: :class:`core.RejectColumn`
@@ -14,30 +15,30 @@

.. _user_guide_multiple_columns:

Transforming selected columns with |ApplyToCols|
Transforming only some columns with |ApplyToCols|
===========================================================

Transformers often need to be applied to only some of the columns in a dataframe.
For example, all numeric columns in a dataframe may need to be scaled together,
while string columns should be left alone.
While the heuristics used by the :class:`TableVectorizer` are usually good enough
to apply the proper transformers to different datatypes, using it may not be an
option in all cases. In scikit-learn pipelines, the column selection operation can
be done with the :class:`~sklearn.compose.ColumnTransformer`.
option in all cases.

Skrub provides the |ApplyToCols| transformer to achieve the same results with
a larger degree of control over which columns are being transformed.
|ApplyToCols| maps a transformer to columns in a dataframe, so that all
columns that satisfy a certain condition are transformed, while the others are
left untouched.
|ApplyToCols| (optionally paired with the |selectors|) lets you transform specific
columns with a large degree of control: it maps a transformer to the columns of a
dataframe, so that all columns that satisfy a given condition are transformed,
while the others are left untouched. Together, |ApplyToCols| and the |selectors|
play a role similar to scikit-learn's :class:`~sklearn.compose.ColumnTransformer`.

.. tip::

If a skrub transformer has a ``cols`` parameter to specify a column list,
that can be a selector as well. Selectors give more control over which columns
are being transformed: they are discussed at length in the
:ref:`selectors user guide<user_guide_selectors>`.
Using selectors to choose or exclude columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a skrub transformer has a ``cols`` parameter to specify a column list,
that can be a selector as well. Selectors give more control over which columns
are being transformed: they are discussed at length in the
:ref:`selectors user guide<user_guide_selectors>`.

|ApplyToCols| can be used to transform a subset of columns in a dataframe, while
leaving the non-selected columns unchanged. In this example, we want to apply
@@ -110,7 +111,7 @@ id city_Madrid city_Paris city_Rome date_year date_month date_day date_to

Note that the column "id" was not encoded and was instead left as-is.

Dealing with columns that cannot be handled by a transformer
Rejecting columns that cannot be handled by a transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

|ApplyToCols| can allow the underlying encoder to decide which columns it can be applied to.
14 changes: 8 additions & 6 deletions doc/modules/multi_column_operations/selectors.rst
@@ -108,8 +108,8 @@ name, data type, contents, or according to arbitrary user-provided rules::

* :ref:`user_guide_advanced_selectors`

Combining selectors
-------------------
Selectors can be combined with the set operators
------------------------------------------------

The available operators are ``|``, ``&``, ``-``, ``^``, which have the same meaning
as for Python sets, and ``~``, which inverts a selection:
@@ -146,8 +146,8 @@ following selector won't compute the cardinality of non-categorical columns:
(categorical() & cardinality_below(10))
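
Since the selector operators follow the semantics of Python sets, the analogy can be sketched with plain sets of column names. This is not skrub code: the column names and the two "selections" below are made up for illustration, and ``~`` is emulated by subtracting from the full set of columns.

```python
# Hypothetical column names of a dataframe.
all_columns = {"id", "city", "date", "price"}

# Two hypothetical selections, written as plain sets.
numeric = {"id", "price"}
low_cardinality = {"id", "city"}

union = numeric | low_cardinality          # in either selection
intersection = numeric & low_cardinality   # in both selections
difference = numeric - low_cardinality     # numeric but not low-cardinality
symmetric = numeric ^ low_cardinality      # in exactly one of the two
inverted = all_columns - numeric           # analogue of ~numeric

print(sorted(union), sorted(intersection), sorted(difference))
print(sorted(symmetric), sorted(inverted))
```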

.. _user_guide_selectors_expand:
Visualizing a selector
----------------------
Using selectors with dataframe libraries
----------------------------------------

All selectors have the :meth:`expand` method, which allows dataframe manipulation
outside of a skrub workflow: applying it to any dataframe will return the list
@@ -180,8 +180,10 @@ The :meth:`expand_index` method also exists: rather than returning a list of col
Using selectors with other skrub transformers
-------------------------------------------------

Skrub transformers are designed to be used in conjunction with other transformers
that operate on columns to improve their versatility.
Skrub selectors are designed to be used in conjunction with :class:`~skrub.ApplyToCols`,
:class:`skrub.SelectCols`, and :class:`skrub.DropCols`, as well as
:func:`~skrub.DataOp.skb.apply`, to give fine-grained control over which columns
these operations modify.

For example, it is possible to drop columns that have more unique values than a
certain amount by combining :func:`~skrub.selectors.cardinality_below` with