7 changes: 4 additions & 3 deletions doc/guides/table_report/02_exporting.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,8 @@
.. |column_associations| replace:: :func:`~skrub.column_associations`

.. _user_guide_table_report_sharing:
How to export and share the |TableReport|
-----------------------------------------
How to export and share the |TableReport| for use by other tools
----------------------------------------------------------------

The |TableReport| is generated as a standalone HTML file that includes the report
data, the plots, and the Javascript necessary to provide interactivity.
Expand All @@ -31,7 +31,8 @@ respectively.

The report can be exported in JSON format, which allows structured
access to the data and statistics used to build the report with
:func:`~skrub.TableReport.json`.
:func:`~skrub.TableReport.json`. The schema of the JSON data is reported in
:ref:`table_report_json_schema`.

.. code-block::
6 changes: 3 additions & 3 deletions doc/modules/data_ops/basics/control_flow.rst
Original file line number Diff line number Diff line change
Expand Up @@ -168,9 +168,9 @@ Finally, there are other situations where using :func:`deferred` can be helpful:

.. rubric:: Examples

- See :ref:`sphx_glr_auto_examples_data_ops_1110_data_ops_intro.py` for an introductory
- See :ref:`sphx_glr_auto_tutorials_1110_data_ops_intro.py` for an introductory
example on how to use skrub DataOps on a single dataframe.
- See :ref:`sphx_glr_auto_examples_data_ops_1120_multiple_tables.py` for an example
- See :ref:`sphx_glr_auto_examples_02_data_ops_1120_multiple_tables.py` for an example
of how skrub DataOps can be used to process multiple tables using dataframe APIs.
- See :ref:`sphx_glr_auto_examples_data_ops_1130_choices.py` for an example of
- See :ref:`sphx_glr_auto_examples_02_data_ops_1130_choices.py` for an example of
hyper-parameter tuning using skrub DataOps.
Original file line number Diff line number Diff line change
Expand Up @@ -150,4 +150,4 @@ to obtain the final result:

More info on advanced column selection and manipulation can be found in
:ref:`user_guide_selectors` and example
:ref:`sphx_glr_auto_examples_0090_apply_to_cols.py`.
:ref:`sphx_glr_auto_examples_0010_apply_to_cols.py`.
2 changes: 1 addition & 1 deletion doc/modules/data_ops/ml_pipeline/subsampling_data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,4 +26,4 @@ set to ``True`` to force using the subsampling when we call them. Note that
even if we set ``keep_subsampling=True``, subsampling is not applied when using
``predict``.

See more details in a :ref:`full example <sphx_glr_auto_examples_data_ops_1140_subsampling.py>`.
See more details in a :ref:`full example <sphx_glr_auto_examples_02_data_ops_1140_subsampling.py>`.
2 changes: 1 addition & 1 deletion doc/modules/data_ops/validation/exporting_data_ops.rst
Original file line number Diff line number Diff line change
Expand Up @@ -62,5 +62,5 @@ or in a different environment:
>>> loaded_learner.fit({"orders": new_orders_df})
SkrubLearner(data_op=<Apply TableVectorizer>)

See :ref:`sphx_glr_auto_examples_data_ops_1150_use_case.py` for an example of how
See :ref:`sphx_glr_auto_examples_02_data_ops_1150_use_case.py` for an example of how
to use the learner in a microservice.
2 changes: 1 addition & 1 deletion doc/modules/data_ops/validation/hyperparameter_tuning.rst
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,7 @@ search respectively), or with the ``choose`` parameter of
:meth:`.skb.make_learner() <DataOp.skb.make_learner>`.

A full example of how to use hyperparameter search is available in
:ref:`sphx_glr_auto_examples_data_ops_1130_choices.py`, and a full example using
:ref:`sphx_glr_auto_examples_02_data_ops_1130_choices.py`, and a full example using
Optuna is in :ref:`example_optuna_choices`.

|
8 changes: 4 additions & 4 deletions doc/modules/default_wrangling/apply_to_cols.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,8 +25,9 @@ to apply the proper transformers to different datatypes, using it may not be an
option in all cases. In scikit-learn pipelines, the column selection operation can
be done with the :class:`~sklearn.compose.ColumnTransformer`.

Skrub provides the |ApplyToCols| transformer to achieve the same results with
a larger degree of control over which columns are being transformed.
Skrub provides the |ApplyToCols| transformer and the
:ref:`selectors<user_guide_selectors>` to achieve the same results with a larger
degree of control over which columns are being transformed.
|ApplyToCols| maps a transformer to columns in a dataframe, so that all
columns that satisfy a certain condition are transformed, while the others are
left untouched.
Expand All @@ -35,8 +36,7 @@ left untouched.

If a skrub transformer has a ``cols`` parameter to specify a column list,
that can be a selector as well. Selectors give more control over which columns
are being transformed: they are discussed at length in the
:ref:`selectors user guide<user_guide_selectors>`.
are being transformed.


|ApplyToCols| can be used to transform a subset of columns in a dataframe, while
15 changes: 14 additions & 1 deletion doc/modules/joining_tables/assembling.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,20 @@ requires including as much information as possible, often from different sources
Skrub allows you to join tables on keys of different types (string, numerical,
datetime) with imprecise correspondence.

.. warning::

**Joiners are designed for small-to-medium datasets.**

- **Memory**: The auxiliary table is stored in the transformer state.
For tables > 1 million rows, consider using :ref:`skrub Data Ops
<user_guide_data_ops_index>` with pandas/polars joins instead.

- **Computational Cost**: Fuzzy joining requires vectorizing columns
and nearest-neighbor search. Test on samples first for large datasets.

- **Dynamic Data**: If your auxiliary table changes after fitting,
you must refit the transformer. Joiners are not suitable for continuously
updated tables.
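
To make the memory and cost points in the warning concrete, here is a minimal sketch of what a fuzzy join has to do. It is a toy illustration using only the standard library, not skrub's implementation: the helper names (``ngram_counts``, ``cosine``, ``toy_fuzzy_join``), the character 3-gram vectorization, and the brute-force search are all assumptions chosen for clarity.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n=3):
    """Vectorize a string as a bag of character n-grams (hypothetical helper)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_fuzzy_join(main_keys, aux_rows, key_column):
    """Match each main key to its most similar auxiliary row (brute force)."""
    # "Fitting": every auxiliary row is vectorized and kept in memory,
    # which is why a very large auxiliary table is costly to store.
    aux_vectors = [(ngram_counts(row[key_column]), row) for row in aux_rows]
    matches = []
    for main_key in main_keys:
        query = ngram_counts(main_key)
        # Nearest-neighbor search: one similarity computation per auxiliary row.
        _, best_row = max(aux_vectors, key=lambda pair: cosine(query, pair[0]))
        matches.append((main_key, best_row))
    return matches


# Made-up example data with misspelled keys in the main table.
aux_table = [
    {"country": "United States", "capital": "Washington"},
    {"country": "France", "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"},
]
matches = toy_fuzzy_join(["Untied States", "Frnace"], aux_table, "country")
for main_key, row in matches:
    print(main_key, "->", row["country"], "/", row["capital"])
```

In this sketch the stored ``aux_vectors`` grow with the auxiliary table, each lookup scans all of them, and any change to ``aux_table`` requires rebuilding them. These are the same three concerns the warning lists, which is why trying a join on a sample first is a cheap way to estimate the cost.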

Joining external tables for machine learning
--------------------------------------------
Expand Down Expand Up @@ -58,4 +71,4 @@ in the right table (the table to be added). This is done by estimating the value
that the missing rows would have by training a machine learning model on the data
we have access to.

This transformer is explored in more detail in :ref:`this example <sphx_glr_auto_examples_0080_interpolation_join.py>`.
This transformer is explored in more detail in :ref:`this example <sphx_glr_auto_examples_03_joining_0080_interpolation_join.py>`.
1 change: 1 addition & 0 deletions doc/reference/index.rst.template
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ classes and functions may not be enough to give full guidelines on their use.
{% for module, _ in API_REFERENCE %}
{{ module }} <{{ module }}>
{%- endfor %}
TableReport JSON schema <table_report_json_schema>
{%- if DEPRECATED_API_REFERENCE %}
deprecated
{%- endif %}