diff --git a/doc/guides/table_report/02_exporting.rst b/doc/guides/table_report/02_exporting.rst index 4805d2cf5..2c39b274a 100644 --- a/doc/guides/table_report/02_exporting.rst +++ b/doc/guides/table_report/02_exporting.rst @@ -3,8 +3,8 @@ .. |column_associations| replace:: :func:`~skrub.column_associations` .. _user_guide_table_report_sharing: -How to export and share the |TableReport| ------------------------------------------ +How to export and share the |TableReport| for use by other tools +---------------------------------------------------------------- The |TableReport| is generated as a standalone HTML file that includes the report data, the plots, and the Javascript necessary to provide interactivity. @@ -31,7 +31,8 @@ respectively. The report can be exported in JSON format, which allows structured access to the data and statistics used to build the report with -:func:`~skrub.TableReport.json`. +:func:`~skrub.TableReport.json`. The schema of the JSON data is reported in +:ref:`table_report_json_schema`. .. code-block:: diff --git a/doc/modules/data_ops/basics/control_flow.rst b/doc/modules/data_ops/basics/control_flow.rst index 7cd1fc31a..6c3ecd63e 100644 --- a/doc/modules/data_ops/basics/control_flow.rst +++ b/doc/modules/data_ops/basics/control_flow.rst @@ -168,9 +168,9 @@ Finally, there are other situations where using :func:`deferred` can be helpful: .. rubric:: Examples -- See :ref:`sphx_glr_auto_examples_data_ops_1110_data_ops_intro.py` for an introductory +- See :ref:`sphx_glr_auto_tutorials_1110_data_ops_intro.py` for an introductory example on how to use skrub DataOps on a single dataframe. -- See :ref:`sphx_glr_auto_examples_data_ops_1120_multiple_tables.py` for an example +- See :ref:`sphx_glr_auto_examples_02_data_ops_1120_multiple_tables.py` for an example of how skrub DataOps can be used to process multiple tables using dataframe APIs. 
-- See :ref:`sphx_glr_auto_examples_data_ops_1130_choices.py` for an example of +- See :ref:`sphx_glr_auto_examples_02_data_ops_1130_choices.py` for an example of hyper-parameter tuning using skrub DataOps. diff --git a/doc/modules/data_ops/ml_pipeline/applying_different_transformers.rst b/doc/modules/data_ops/ml_pipeline/applying_different_transformers.rst index 54901bee1..9e44d3317 100644 --- a/doc/modules/data_ops/ml_pipeline/applying_different_transformers.rst +++ b/doc/modules/data_ops/ml_pipeline/applying_different_transformers.rst @@ -150,4 +150,4 @@ to obtain the final result: More info on advanced column selection and manipulation be found in :ref:`user_guide_selectors` and example -:ref:`sphx_glr_auto_examples_0090_apply_to_cols.py`. +:ref:`sphx_glr_auto_examples_0010_apply_to_cols.py`. diff --git a/doc/modules/data_ops/ml_pipeline/subsampling_data.rst b/doc/modules/data_ops/ml_pipeline/subsampling_data.rst index 51a045feb..b6509f6c5 100644 --- a/doc/modules/data_ops/ml_pipeline/subsampling_data.rst +++ b/doc/modules/data_ops/ml_pipeline/subsampling_data.rst @@ -26,4 +26,4 @@ set to ``True`` to force using the subsampling when we call them. Note that even if we set ``keep_subsampling=True``, subsampling is not applied when using ``predict``. -See more details in a :ref:`full example `. +See more details in a :ref:`full example `. diff --git a/doc/modules/data_ops/validation/exporting_data_ops.rst b/doc/modules/data_ops/validation/exporting_data_ops.rst index 2e462c4e5..b70ba1c23 100644 --- a/doc/modules/data_ops/validation/exporting_data_ops.rst +++ b/doc/modules/data_ops/validation/exporting_data_ops.rst @@ -62,5 +62,5 @@ or in a different environment: >>> loaded_learner.fit({"orders": new_orders_df}) SkrubLearner(data_op=) -See :ref:`sphx_glr_auto_examples_data_ops_1150_use_case.py` for an example of how +See :ref:`sphx_glr_auto_examples_02_data_ops_1150_use_case.py` for an example of how to use the learner in a microservice. 
diff --git a/doc/modules/data_ops/validation/hyperparameter_tuning.rst b/doc/modules/data_ops/validation/hyperparameter_tuning.rst index 0e2110c81..57df292c4 100644 --- a/doc/modules/data_ops/validation/hyperparameter_tuning.rst +++ b/doc/modules/data_ops/validation/hyperparameter_tuning.rst @@ -162,7 +162,7 @@ search respectively), or with the ``choose`` parameter of :meth:`.skb.make_learner() `. A full example of how to use hyperparameter search is available in -:ref:`sphx_glr_auto_examples_data_ops_1130_choices.py`, and a full example using +:ref:`sphx_glr_auto_examples_02_data_ops_1130_choices.py`, and a full example using Optuna is in :ref:`example_optuna_choices`. | diff --git a/doc/modules/default_wrangling/apply_to_cols.rst b/doc/modules/default_wrangling/apply_to_cols.rst index c25eeb418..839329824 100644 --- a/doc/modules/default_wrangling/apply_to_cols.rst +++ b/doc/modules/default_wrangling/apply_to_cols.rst @@ -25,18 +25,18 @@ to apply the proper transformers to different datatypes, using it may not be an option in all cases. In scikit-learn pipelines, the column selection operation can be done with the :class:`~sklearn.compose.ColumnTransformer`. -Skrub provides the |ApplyToCols| transformer to achieve the same results with -a larger degree of control over which columns are being transformed. +Skrub provides the |ApplyToCols| transformer and the +:ref:`selectors` to achieve the same results with a larger +degree of control over which columns are being transformed. |ApplyToCols| maps a transformer to columns in a dataframe, so that all -columns that satisfy a certain condition are transformed, while the others are -left untouched. +columns that satisfy the condition given by the user are transformed, while the +others are left untouched. .. tip:: If a skrub transformer has a ``cols`` parameter to specify a column list, that can be a selector as well. 
Selectors give more control over which columns - are being transformed: they are discussed at length in the - :ref:`selectors user guide`. + are being transformed. |ApplyToCols| can be used to transform a subset of columns in a dataframe, while diff --git a/doc/modules/joining_tables/assembling.rst b/doc/modules/joining_tables/assembling.rst index 51601324c..f2b46859f 100644 --- a/doc/modules/joining_tables/assembling.rst +++ b/doc/modules/joining_tables/assembling.rst @@ -9,7 +9,22 @@ requires including as much information as possible, often from different sources Skrub allows you to join tables on keys of different types (string, numerical, datetime) with imprecise correspondence. +.. warning:: + To be considered when using one of the joiners: + + **Joiners are designed for small-to-medium datasets.** + + - **Memory**: The auxiliary table is stored in the transformer state. + For tables > 1 million rows, consider using :ref:`skrub Data Ops + ` with pandas/polars joins instead. + + - **Computational Cost**: Fuzzy joining requires vectorizing columns + and nearest-neighbor search. Test on samples first for large datasets. + + - **Dynamic Data**: If your auxiliary table changes after fitting, + you must refit the transformer. Joiners are not suitable for continuously + updated tables. Joining external tables for machine learning -------------------------------------------- @@ -58,4 +73,4 @@ in the right table (the table to be added). This is done by estimating the value that the missing rows would have by training a machine learning model on the data we have access to. -This transformer is explored in more detail in :ref:`this example `. +This transformer is explored in more detail in :ref:`this example `. 
diff --git a/doc/reference/index.rst.template b/doc/reference/index.rst.template index 04af24f2d..989c30a7d 100644 --- a/doc/reference/index.rst.template +++ b/doc/reference/index.rst.template @@ -17,6 +17,7 @@ classes and functions may not be enough to give full guidelines on their use. {% for module, _ in API_REFERENCE %} {{ module }} <{{ module }}> {%- endfor %} + TableReport JSON schema {%- if DEPRECATED_API_REFERENCE %} deprecated {%- endif %} diff --git a/doc/reference/table_report_json_schema.rst b/doc/reference/table_report_json_schema.rst new file mode 100644 index 000000000..3d14fcd01 --- /dev/null +++ b/doc/reference/table_report_json_schema.rst @@ -0,0 +1,354 @@ +.. _table_report_json_schema: + +TableReport JSON schema +======================= + +:meth:`TableReport.json() ` returns a JSON string whose +top-level object contains the keys described below. + +.. note:: + + The ``dataframe`` and ``sample_table`` keys, which are present in the + internal summary object, are **not** included in the JSON output. + +Top-level object +---------------- + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Key + - Type + - Description + * - ``dataframe_module`` + - string + - Name of the dataframe library used. Either ``"pandas"`` or + ``"polars"``. + * - ``n_rows`` + - integer + - Number of rows in the dataframe. + * - ``n_columns`` + - integer + - Number of columns in the dataframe. + * - ``dataframe_is_empty`` + - boolean + - ``true`` when the dataframe has no rows or no columns. + * - ``plots_skipped`` + - boolean + - ``true`` when ``plot_distributions=False`` was passed to + :class:`~skrub.TableReport`. When ``true``, plot keys are absent from + column objects, but ``histogram_data`` is still present for numeric and + datetime columns. + * - ``associations_skipped`` + - boolean + - ``true`` when association computation was skipped (either because + ``with_associations=False`` was passed, or because polars was used + without pyarrow installed). 
+ * - ``cardinality_threshold`` + - integer + - The threshold from :func:`~skrub.get_config` above which a column is + considered high-cardinality. Default: 40. + * - ``n_constant_columns`` + - integer + - Number of columns whose values are all identical. + * - ``columns`` + - array of :ref:`column objects ` + - One entry per column, in the original column order. + * - ``top_associations`` + - array of :ref:`association objects ` + - Column-pair association scores (up to 1 000 pairs, sorted by strength). + Present only when ``associations_skipped`` is ``false``. + * - ``title`` + - string + - *Optional.* Present only when a ``title`` argument was passed to + :class:`~skrub.TableReport`. + * - ``order_by`` + - string + - *Optional.* Name of the column used for sorting when ``order_by`` was + passed to :class:`~skrub.TableReport`. + + +.. _table_report_json_schema_column: + +Column object +------------- + +Every entry in ``columns`` contains the following keys. Additional keys are +present depending on the column's dtype; they are documented in the subsections +below. + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Key + - Type + - Description + * - ``position`` + - integer + - Zero-based index of the column in the dataframe (same as ``idx``). + * - ``idx`` + - integer + - Zero-based index of the column in the dataframe (same as + ``position``). + * - ``name`` + - string + - Column name. + * - ``dtype`` + - string + - Column dtype as a string (e.g. ``"float64"``, ``"object"``, + ``"datetime64[ns]"``). + * - ``null_count`` + - integer + - Number of null / NaN values. + * - ``null_proportion`` + - number + - Fraction of null values in ``[0, 1]``. + * - ``nulls_level`` + - string + - Summary severity level: ``"ok"`` (no nulls), ``"warning"`` (some + nulls), or ``"critical"`` (all values are null). + * - ``value_is_constant`` + - boolean + - ``true`` when every non-null value in the column is identical. 
+ * - ``is_ordered`` + - boolean + - ``true`` when the column values are sorted in ascending or descending + order. + * - ``plot_names`` + - array of strings + - Names of the plot keys that are present on this column object (e.g. + ``["histogram_plot"]``). Empty when ``plots_skipped`` is ``true`` or + when the column contains only nulls. + + +Columns containing only nulls +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When ``null_count`` equals ``n_rows`` (all values are null) only the keys of +the base column object above are present; no statistical keys are added. + + +Categorical / string columns +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Present for non-numeric, non-datetime, non-duration columns that are not +entirely null. + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Key + - Type + - Description + * - ``n_unique`` + - integer + - Number of distinct values (including null). + * - ``unique_proportion`` + - number + - Fraction of distinct values relative to ``n_rows``. + * - ``is_high_cardinality`` + - boolean + - ``true`` when ``n_unique`` exceeds ``cardinality_threshold``. + * - ``value_counts`` + - array of ``[value, count]`` pairs + - The up to 10 most frequent values and their counts, sorted by + frequency descending. Each element is a two-element array + ``[value, count]`` where ``value`` is a string and ``count`` is an + integer. + * - ``most_frequent_values`` + - array of strings + - The up to 10 most frequent values (same order as ``value_counts``). + * - ``constant_value`` + - any + - *Present only when* ``value_is_constant`` *is* ``true``. The single + value shared by all non-null rows. + * - ``value_counts_plot`` + - string (SVG) + - *Present only when* ``plots_skipped`` *is* ``false`` *and* + ``value_is_constant`` *is* ``false``. Bar chart of the top value + counts as an inline SVG string. + + +Numeric columns +~~~~~~~~~~~~~~~ + +Present for columns whose dtype is numeric (integer or float), or duration +(timedelta). 
Boolean columns have a subset of these keys (only ``mean``). + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Key + - Type + - Description + * - ``n_unique`` + - integer + - Number of distinct values. + * - ``unique_proportion`` + - number + - Fraction of distinct values relative to ``n_rows``. + * - ``is_high_cardinality`` + - boolean + - ``true`` when ``n_unique`` exceeds ``cardinality_threshold``. + * - ``mean`` + - number + - Arithmetic mean of non-null values. + * - ``standard_deviation`` + - number + - Standard deviation of non-null values. ``null`` when it cannot be + computed (e.g. single non-null value). + * - ``inter_quartile_range`` + - number + - Difference between the 75th and 25th percentiles. + * - ``quantiles`` + - object + - Map of quantile level (as a numeric key) to value, for levels + ``0.0``, ``0.25``, ``0.5``, ``0.75``, and ``1.0``. Absent when + ``value_is_constant`` is ``true``. + * - ``constant_value`` + - number + - *Present only when* ``value_is_constant`` *is* ``true``. The single + value shared by all non-null rows. + * - ``is_duration`` + - boolean + - ``true`` for timedelta / duration columns (values were converted to a + numeric unit before statistics were computed). + * - ``duration_unit`` + - string or null + - Unit used when ``is_duration`` is ``true``: one of + ``"microsecond"``, ``"millisecond"``, ``"second"``, ``"hour"``, + ``"day"``, or ``"year"``. ``null`` for non-duration columns. + * - ``histogram_data`` + - :ref:`histogram data object ` + - Bin counts and edges for the distribution histogram. Always present + for non-constant numeric columns (even when ``plots_skipped`` is + ``true``). + * - ``histogram_plot`` + - string (SVG) + - *Present only when* ``plots_skipped`` *is* ``false`` *and* + ``order_by`` *is not set*. Distribution histogram as an inline SVG + string. + * - ``line_plot`` + - string (SVG) + - *Present only when* ``plots_skipped`` *is* ``false`` *and* ``order_by`` + *is set*. 
Line chart of the column values against the sort column as + an inline SVG string. + + +Datetime columns +~~~~~~~~~~~~~~~~ + +Present for columns with a date or datetime dtype. + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Key + - Type + - Description + * - ``n_unique`` + - integer + - Number of distinct values. + * - ``unique_proportion`` + - number + - Fraction of distinct values relative to ``n_rows``. + * - ``is_high_cardinality`` + - boolean + - ``true`` when ``n_unique`` exceeds ``cardinality_threshold``. + * - ``min`` + - string (ISO 8601) + - Earliest datetime value. Absent when ``value_is_constant`` is + ``true``. + * - ``max`` + - string (ISO 8601) + - Latest datetime value. Absent when ``value_is_constant`` is + ``true``. + * - ``constant_value`` + - string (ISO 8601) + - *Present only when* ``value_is_constant`` *is* ``true``. The single + datetime value shared by all non-null rows. + * - ``histogram_data`` + - :ref:`histogram data object ` + - Bin counts and edges. Always present for non-constant datetime + columns (even when ``plots_skipped`` is ``true``). + * - ``histogram_plot`` + - string (SVG) + - *Present only when* ``plots_skipped`` *is* ``false`` *and* + ``value_is_constant`` *is* ``false``. Distribution histogram as an + inline SVG string. + + +.. _table_report_json_schema_hist_data: + +Histogram data object +--------------------- + +The ``histogram_data`` key on numeric and datetime column objects contains an +object with the following keys. + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Key + - Type + - Description + * - ``bin_counts`` + - array of integers + - Count of values in each bin (length *n*). + * - ``bin_edges`` + - array of numbers + - Left and right edges of each bin (length *n + 1*). + * - ``n_low_outliers`` + - integer + - Number of values below the plotted range that were excluded from the + histogram bins. 
+ * - ``n_high_outliers`` + - integer + - Number of values above the plotted range that were excluded from the + histogram bins. + + +.. _table_report_json_schema_assoc: + +Association object +------------------ + +Each entry in the top-level ``top_associations`` array describes the +association between a pair of columns. + +.. list-table:: + :header-rows: 1 + :widths: 25 15 60 + + * - Key + - Type + - Description + * - ``left_column_name`` + - string + - Name of the first column in the pair. + * - ``left_column_idx`` + - integer + - Zero-based index of the first column in the pair. + * - ``right_column_name`` + - string + - Name of the second column in the pair. + * - ``right_column_idx`` + - integer + - Zero-based index of the second column in the pair. + * - ``cramer_v`` + - number or null + - `Cramér's V `_ + statistic (``[0, 1]``) for the pair. ``null`` when it could not be + computed. + * - ``pearson_corr`` + - number or null + - `Pearson correlation + `_ + coefficient (``[-1, 1]``) for the pair. ``null`` when it could not be + computed (e.g. one column is not numeric). diff --git a/doc/tutorials/0000_getting_started.py b/doc/tutorials/0000_getting_started.py index 4e7cca8f8..d659a77b8 100644 --- a/doc/tutorials/0000_getting_started.py +++ b/doc/tutorials/0000_getting_started.py @@ -112,7 +112,8 @@ # To handle rich tabular data and feed it to a machine learning model, the # pipeline returned by |tabular_pipeline| preprocesses and encodes # strings, categories and dates using the |TableVectorizer|. -# See its documentation or :ref:`sphx_glr_auto_examples_0010_encodings.py` for +# See its documentation or +# :ref:`sphx_glr_auto_examples_01_encoding_0010_encodings.py` for # more details. An overview of the chosen defaults is available in # :ref:`user_guide_tabular_pipeline`. @@ -190,7 +191,8 @@ # which uses pre-trained language models retrieved from the HuggingFace hub to # create meaningful text embeddings. 
# See :ref:`user_guide_encoders_index` for more details on all the categorical encoders -# provided by skrub, and :ref:`sphx_glr_auto_examples_0010_encodings.py` for a +# provided by skrub, and +# :ref:`sphx_glr_auto_examples_01_encoding_0010_encodings.py` for a # comparison between the different methods. # diff --git a/examples/03_joining/0070_join_aggregation.py b/examples/03_joining/0070_join_aggregation.py index 27426f2b7..f4b779b5b 100644 --- a/examples/03_joining/0070_join_aggregation.py +++ b/examples/03_joining/0070_join_aggregation.py @@ -162,7 +162,7 @@ # # We bring this logic into a |TableVectorizer| to vectorize these columns in a # single step. -# See `this example `_ +# See :ref:`this example ` # for more details about these encoding choices. from sklearn.preprocessing import OrdinalEncoder diff --git a/skrub/_apply_to_cols.py b/skrub/_apply_to_cols.py index 7168cf70b..64d34e9cf 100644 --- a/skrub/_apply_to_cols.py +++ b/skrub/_apply_to_cols.py @@ -216,6 +216,25 @@ class ApplyToCols(TransformerMixin, SkrubBaseEstimator): skrub.core.RejectColumn: Column 'A' does not have Date or Datetime dtype. Transformer DatetimeEncoder.fit_transform failed on column 'A'. See above for the full traceback. + It is also possible to wrap a :class:`TableVectorizer` or :class:`Cleaner` in + ``ApplyToCols`` to select or exclude columns based on patterns. For example, + to apply a :class:`TableVectorizer` to all columns except those ending with "_id", + we can do: + + >>> import skrub.selectors as s + >>> from skrub import ApplyToCols, TableVectorizer + + >>> df = pd.DataFrame(dict( + ... user_id=["A001", "A002"], + ... age=[25, 30], + ... department=["Engineering", "Sales"], + ... 
)) >>> tv = ApplyToCols(TableVectorizer(), cols=~s.glob("*_id")) + >>> tv.fit_transform(df) + user_id age department_Sales + 0 A001 25.0 0.0 + 1 A002 30.0 1.0 + **Accessing fitted transformers** Depending on the transformer, the fitted transformers diff --git a/skrub/_reporting/_table_report.py b/skrub/_reporting/_table_report.py index db491170f..9269e0e58 100644 --- a/skrub/_reporting/_table_report.py +++ b/skrub/_reporting/_table_report.py @@ -100,7 +100,10 @@ class TableReport: This class summarizes a dataframe or numpy array, providing information such as the type and summary statistics (mean, number of missing values, etc.) for each - column. Numpy arrays are converted to pandas DataFrame or Series. + column. Numpy arrays are converted to pandas DataFrame or Series. The computed + statistics can be accessed interactively in a Jupyter notebook or web browser. + Alternatively, the report can be saved or exported in JSON, Markdown, or HTML format + for programmatic access or for inclusion in documents. Parameters ---------- @@ -232,7 +235,8 @@ class TableReport: # DataFrame Report... The report can also be obtained in JSON format with :meth:`json`, which can - be useful for programmatic access to the report data. + be useful for programmatic access to the report data. The schema of the + JSON data is reported in :ref:`table_report_json_schema`. Note that the resulting JSON includes the plots in SVG format, which can be quite verbose: plots can be disabled by setting ``plot_distributions=False`` @@ -244,17 +248,20 @@ class TableReport: Advanced configuration: you can add custom column filters that will appear - in the report's dropdown menu. + in the report's dropdown menu, allowing you to select a subset of columns to + display in the report. >>> filters = { - ... "display_name": ["a", "b"], + ... "my_filter": ["a", "b"], ... 
} >>> report = TableReport(df, column_filters=filters) With the code above, in addition to the default filters such as "All - columns", "Numeric columns", etc., the added "Columns with at least 2 - unique values" will be available in the report, selecting columns "a" and - "b". + columns", "Numeric columns", etc., the added "my_filter" will be available + in the report, selecting both columns "a" and "b". + Filters may be specified as a list of column names, a list of column indices, + or one of the :ref:`skrub selectors ` objects. + """ def __init__( @@ -431,6 +438,13 @@ def html_snippet(self): def json(self): """Get the report data in JSON format. + By default, the JSON output includes the plots in SVG format, which can + be quite verbose. Plots can be disabled by setting + ``plot_distributions=False`` when generating the report. + + The schema of the JSON data is reported in :ref:`table_report_json_schema`. + + Returns ------- str : diff --git a/skrub/_table_vectorizer.py b/skrub/_table_vectorizer.py index 3a096a385..20baefff7 100644 --- a/skrub/_table_vectorizer.py +++ b/skrub/_table_vectorizer.py @@ -590,9 +590,11 @@ class TableVectorizer(TransformerMixin, SkrubBaseEstimator): specified transformer. This disables any preprocessing usually done by the TableVectorizer; the columns are passed to the transformer without any modification. A column is not allowed to appear twice in - ``specific_transformers``. Using ``specific_transformers`` provides - similar functionality to what is offered by scikit-learn's - :class:`~sklearn.compose.ColumnTransformer`. + ``specific_transformers``. + Consider wrapping the ``TableVectorizer`` in :class:`~skrub.ApplyToCols` + to select or exclude specific columns from the processing. Alternatively, + the :ref:`skrub Data Ops ` allow for more complex + pre-processing. drop_null_fraction : float or None, default=1.0 Fraction of null above which the column is dropped. If `drop_null_fraction` is