7 changes: 4 additions & 3 deletions doc/guides/table_report/02_exporting.rst
Original file line number Diff line number Diff line change
@@ -3,8 +3,8 @@
.. |column_associations| replace:: :func:`~skrub.column_associations`

.. _user_guide_table_report_sharing:
How to export and share the |TableReport|
-----------------------------------------
How to export and share the |TableReport| for use by other tools
----------------------------------------------------------------

The |TableReport| is generated as a standalone HTML file that includes the report
data, the plots, and the JavaScript necessary to provide interactivity.
@@ -31,7 +31,8 @@ respectively.

The report can be exported in JSON format, which allows structured
access to the data and statistics used to build the report with
:func:`~skrub.TableReport.json`.
:func:`~skrub.TableReport.json`. The schema of the JSON data is reported in
:ref:`table_report_json_schema`.

.. code-block::
6 changes: 3 additions & 3 deletions doc/modules/data_ops/basics/control_flow.rst
@@ -168,9 +168,9 @@ Finally, there are other situations where using :func:`deferred` can be helpful:

.. rubric:: Examples

- See :ref:`sphx_glr_auto_examples_data_ops_1110_data_ops_intro.py` for an introductory
- See :ref:`sphx_glr_auto_tutorials_1110_data_ops_intro.py` for an introductory
example on how to use skrub DataOps on a single dataframe.
- See :ref:`sphx_glr_auto_examples_data_ops_1120_multiple_tables.py` for an example
- See :ref:`sphx_glr_auto_examples_02_data_ops_1120_multiple_tables.py` for an example
of how skrub DataOps can be used to process multiple tables using dataframe APIs.
- See :ref:`sphx_glr_auto_examples_data_ops_1130_choices.py` for an example of
- See :ref:`sphx_glr_auto_examples_02_data_ops_1130_choices.py` for an example of
hyper-parameter tuning using skrub DataOps.
@@ -150,4 +150,4 @@ to obtain the final result:

More info on advanced column selection and manipulation can be found in
:ref:`user_guide_selectors` and example
:ref:`sphx_glr_auto_examples_0090_apply_to_cols.py`.
:ref:`sphx_glr_auto_examples_0010_apply_to_cols.py`.
2 changes: 1 addition & 1 deletion doc/modules/data_ops/ml_pipeline/subsampling_data.rst
@@ -26,4 +26,4 @@ set to ``True`` to force using the subsampling when we call them. Note that
even if we set ``keep_subsampling=True``, subsampling is not applied when using
``predict``.

See more details in a :ref:`full example <sphx_glr_auto_examples_data_ops_1140_subsampling.py>`.
See more details in a :ref:`full example <sphx_glr_auto_examples_02_data_ops_1140_subsampling.py>`.
2 changes: 1 addition & 1 deletion doc/modules/data_ops/validation/exporting_data_ops.rst
@@ -62,5 +62,5 @@ or in a different environment:
>>> loaded_learner.fit({"orders": new_orders_df})
SkrubLearner(data_op=<Apply TableVectorizer>)

See :ref:`sphx_glr_auto_examples_data_ops_1150_use_case.py` for an example of how
See :ref:`sphx_glr_auto_examples_02_data_ops_1150_use_case.py` for an example of how
to use the learner in a microservice.
2 changes: 1 addition & 1 deletion doc/modules/data_ops/validation/hyperparameter_tuning.rst
@@ -162,7 +162,7 @@ search respectively), or with the ``choose`` parameter of
:meth:`.skb.make_learner() <DataOp.skb.make_learner>`.

A full example of how to use hyperparameter search is available in
:ref:`sphx_glr_auto_examples_data_ops_1130_choices.py`, and a full example using
:ref:`sphx_glr_auto_examples_02_data_ops_1130_choices.py`, and a full example using
Optuna is in :ref:`example_optuna_choices`.

|
12 changes: 6 additions & 6 deletions doc/modules/default_wrangling/apply_to_cols.rst
@@ -25,18 +25,18 @@ to apply the proper transformers to different datatypes, using it may not be an
option in all cases. In scikit-learn pipelines, the column selection operation can
be done with the :class:`~sklearn.compose.ColumnTransformer`.

Skrub provides the |ApplyToCols| transformer to achieve the same results with
a larger degree of control over which columns are being transformed.
Skrub provides the |ApplyToCols| transformer and the
:ref:`selectors<user_guide_selectors>` to achieve the same results with a larger
degree of control over which columns are being transformed.
|ApplyToCols| maps a transformer to columns in a dataframe, so that all
columns that satisfy a certain condition are transformed, while the others are
left untouched.
columns that satisfy the condition given by the user are transformed, while the
others are left untouched.

.. tip::

If a skrub transformer has a ``cols`` parameter to specify a column list,
that can be a selector as well. Selectors give more control over which columns
are being transformed: they are discussed at length in the
:ref:`selectors user guide<user_guide_selectors>`.
are being transformed.


|ApplyToCols| can be used to transform a subset of columns in a dataframe, while
17 changes: 16 additions & 1 deletion doc/modules/joining_tables/assembling.rst
@@ -9,7 +9,22 @@ requires including as much information as possible, often from different sources
Skrub allows you to join tables on keys of different types (string, numerical,
datetime) with imprecise correspondence.

.. warning::

Keep the following in mind when using one of the joiners:

**Joiners are designed for small-to-medium datasets.**

- **Memory**: The auxiliary table is stored in the transformer state.
For tables with more than 1 million rows, consider using :ref:`skrub Data Ops
<user_guide_data_ops_index>` with pandas/polars joins instead.

- **Computational Cost**: Fuzzy joining requires vectorizing columns
and nearest-neighbor search. Test on samples first for large datasets.

- **Dynamic Data**: If your auxiliary table changes after fitting,
you must refit the transformer. Joiners are not suitable for continuously
updated tables.
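
The dataframe join mentioned in the memory note above can be sketched as follows. This is only an illustration with plain pandas on hypothetical tables and column names; it shows an exact join on a shared key, which does not replace fuzzy matching, and it leaves out the Data Ops wrapping.

```python
import pandas as pd

# Hypothetical tables; in practice the auxiliary table would be much larger.
orders = pd.DataFrame({"product_id": [1, 2, 3], "quantity": [5, 1, 2]})
products = pd.DataFrame({"product_id": [1, 2], "price": [9.5, 20.0]})

# Exact left join: every row of the main table is kept, and keys with
# no match in the auxiliary table get missing values.
joined = orders.merge(products, on="product_id", how="left")
```

Because the join is computed each time it is called, no auxiliary table is kept in a fitted transformer, and a changed ``products`` table is picked up on the next call.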

Joining external tables for machine learning
--------------------------------------------
@@ -58,4 +73,4 @@ in the right table (the table to be added). This is done by estimating the value
that the missing rows would have by training a machine learning model on the data
we have access to.

This transformer is explored in more detail in :ref:`this example <sphx_glr_auto_examples_0080_interpolation_join.py>`.
This transformer is explored in more detail in :ref:`this example <sphx_glr_auto_examples_03_joining_0080_interpolation_join.py>`.
1 change: 1 addition & 0 deletions doc/reference/index.rst.template
@@ -17,6 +17,7 @@ classes and functions may not be enough to give full guidelines on their use.
{% for module, _ in API_REFERENCE %}
{{ module }} <{{ module }}>
{%- endfor %}
TableReport JSON schema <table_report_json_schema>
{%- if DEPRECATED_API_REFERENCE %}
deprecated
{%- endif %}