42 changes: 42 additions & 0 deletions .github/workflows/bench.yml
@@ -0,0 +1,42 @@
name: bench

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  bench:
    runs-on: ubuntu-latest
    timeout-minutes: 10

    steps:
      - uses: actions/checkout@v6.0.2

      - name: Set up Python
        uses: actions/setup-python@v6.2.0
        with:
          python-version: '3.14'
          cache: 'pip'

      - name: Install package
        run: |
          python -m pip install --upgrade pip
          pip install .

      - name: Run accuracy regression check
        run: |
          python experiments/bench.py \
            --quick \
            --output bench_results.json \
            --baseline experiments/bench_baseline.json \
            --accuracy-only

      - name: Upload bench results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: bench-results
          path: bench_results.json
          if-no-files-found: warn
1 change: 1 addition & 0 deletions .github/workflows/build.yml
@@ -28,6 +28,7 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install .
pip install ".[polars]" || true
pip install pytest==9.0.2 coverage==7.13.5 ruff==0.15.9

- name: Ruff lint (must pass)
2 changes: 2 additions & 0 deletions .github/workflows/ci-build-test-publish.yml
@@ -60,6 +60,7 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install .
pip install ".[polars]" || true

- name: Install test deps
run: pip install pytest==9.0.2
@@ -104,6 +105,7 @@ jobs:
else
pip install dist/*.tar.gz
fi
pip install polars || true

- name: Install test deps
run: pip install pytest==9.0.2
2 changes: 0 additions & 2 deletions .gitignore
@@ -34,8 +34,6 @@ htmlcov/
.env
.env.local

experiments/

# Zensical build output & cache
site/
.zensical/
16 changes: 11 additions & 5 deletions README.md
@@ -36,7 +36,7 @@

---

A Python package for capturing potential relationships among columns of different tabular datasets, given as pandas DataFrames.
A Python package for capturing potential relationships among columns of different tabular datasets, given as pandas or Polars DataFrames.
Valentine is based on the paper [**Valentine: Evaluating Matching Techniques for Dataset Discovery**](https://ieeexplore.ieee.org/abstract/document/9458921).

📚 **Full documentation:** <https://delftdata.github.io/valentine/> — getting started, matcher guide, API reference, and migration notes.
@@ -57,9 +57,15 @@ To install Valentine simply run:
pip install valentine
```

To enable **Polars** support, install the optional extra:

```shell
pip install valentine[polars]
```


## Usage
Valentine can be used to find matches among columns of a given pair of pandas DataFrames.
Valentine can be used to find matches among columns of a given pair of pandas or Polars DataFrames. You can even mix pandas and Polars frames in the same call — Valentine auto-detects the frame type.

### Matching methods
In order to do so, the user can choose one of the following matching methods:
@@ -103,10 +109,10 @@ In order to do so, the user can choose one of the following matching methods:

### Matching DataFrames

Pass two or more DataFrames as a list (or any iterable) along with a matcher. Valentine will match columns across all unique pairs:
Pass two or more DataFrames as a list (or any iterable) along with a matcher. Valentine will match columns across all unique pairs. Pandas and Polars frames can be freely mixed:

```python
# Match a pair of DataFrames
# Match a pair of DataFrames (pandas, Polars, or mixed)
matches = valentine_match([df1, df2], matcher)

# Match multiple DataFrames (computes all N×(N-1)/2 pairs)
@@ -171,7 +177,7 @@ metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECI


### Example
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about job candidates, and then 2) how to assess its effectiveness based on a given ground truth (a more extensive example is shown in [`valentine_example.py`](https://github.com/delftdata/valentine/blob/master/examples/valentine_example.py)):
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about job candidates, and then 2) how to assess its effectiveness based on a given ground truth. More examples are available in the [`examples/`](https://github.com/delftdata/valentine/tree/master/examples) directory, including a [pandas example](https://github.com/delftdata/valentine/blob/master/examples/valentine_example_pandas.py), a [Polars example](https://github.com/delftdata/valentine/blob/master/examples/valentine_example_polars.py), and a [mixed pandas+Polars example](https://github.com/delftdata/valentine/blob/master/examples/valentine_example_mixed.py).

```python
import pandas as pd
46 changes: 39 additions & 7 deletions docs/api.md
@@ -36,20 +36,21 @@ from valentine import (

```python
valentine_match(
dfs: Iterable[pd.DataFrame],
dfs: Iterable[pd.DataFrame | pl.DataFrame],
matcher: BaseMatcher,
df_names: list[str] | None = None,
instance_sample_size: int | None = 1000,
) -> MatcherResults
```

Match columns across every unique pair of DataFrames.
Match columns across every unique pair of DataFrames. Accepts both pandas
and Polars DataFrames, which can be freely mixed within the same call.

**Parameters**

| Name | Type | Default | Description |
|------------------------|------------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `dfs` | `Iterable[pd.DataFrame]` | — | Two or more DataFrames to match against each other. Any iterable works (list, tuple, generator). |
| `dfs` | `Iterable[pd.DataFrame \| pl.DataFrame]` | — | Two or more DataFrames to match against each other. Any iterable works (list, tuple, generator). Pandas and Polars frames may be mixed freely. |
| `matcher` | `BaseMatcher` | — | Matcher instance (e.g. `Coma()`, `Cupid()`). |
| `df_names` | `list[str] \| None` | `None` | Optional names for each DataFrame. When `None`, defaults to `"aaa"`, `"bbb"`, `"ccc"`, … (chosen for minimum string similarity so defaults don't influence schema-based matchers). Limited to 26 unnamed tables. |
| `instance_sample_size` | `int \| None` | `1000` | Cap on the number of non-empty rows sampled per column for instance-based matchers (Coma with `use_instances=True`, `DistributionBased`, `JaccardDistanceMatcher`). Pass `None` to use every row. Pass `0` to skip instance data entirely — schema-only matchers are unaffected, but instance-based matchers will see empty columns. |
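
As a rough illustration of two behaviours described in the table above (the default table names and the pairing of every DataFrame with every other one), the following standard-library sketch may help. It is not Valentine's actual code: the helper names `default_names` and `unique_pairs` are hypothetical, and raising `ValueError` past 26 tables is an assumption about how the limit could be enforced.

```python
from itertools import combinations
from string import ascii_lowercase


def default_names(n: int) -> list[str]:
    """Hypothetical helper: build 'aaa', 'bbb', ... for n unnamed tables."""
    if n > len(ascii_lowercase):
        # Assumption: the documented limit of 26 unnamed tables is a hard error.
        raise ValueError("at most 26 unnamed tables are supported")
    return [letter * 3 for letter in ascii_lowercase[:n]]


def unique_pairs(names: list[str]) -> list[tuple[str, str]]:
    """Hypothetical helper: each unordered pair once, N * (N - 1) / 2 in total."""
    return list(combinations(names, 2))


names = default_names(4)
pairs = unique_pairs(names)
print(names)       # ['aaa', 'bbb', 'ccc', 'ddd']
print(len(pairs))  # 6
```

Four tables therefore yield six pairwise comparisons, so the cost of a call grows quadratically with the number of DataFrames passed in.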
@@ -666,10 +667,11 @@ are gold matches. One-to-one filtering is **off** by default here.
## Data sources (`valentine.data_sources`)

Valentine wraps each DataFrame in a [`DataframeTable`](#dataframetable)
before handing it to a matcher. Most users never touch this layer —
[`valentine_match`](#valentine_match) builds the tables for you — but
the classes are public so that custom matchers and custom data sources
can be written against the abstractions.
(pandas) or [`PolarsTable`](#polarstable) (Polars) before handing it to
a matcher. Most users never touch this layer —
[`valentine_match`](#valentine_match) auto-detects the frame type and
builds the tables for you — but the classes are public so that custom
matchers and custom data sources can be written against the abstractions.

```python
from valentine.data_sources import (
@@ -678,6 +680,9 @@ from valentine.data_sources import (
DataframeTable,
DataframeColumn,
)

# With the polars extra installed:
from valentine.data_sources import PolarsTable, PolarsColumn
```
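
Because the Polars classes are only importable when the optional extra is installed, calling code may want to check for the package first. The sketch below shows one way to do that with the standard library. The `HAS_POLARS` flag is a hypothetical name, and this is a caller-side pattern, not something Valentine is known to expose.

```python
from importlib.util import find_spec

# find_spec returns None when the top-level package cannot be located.
HAS_POLARS = find_spec("polars") is not None

if HAS_POLARS:
    import polars as pl

    print("polars available:", pl.__version__)
else:
    print("polars not installed; install the optional extra to use Polars frames")
```

Checking with `find_spec` avoids a bare `try`/`except ImportError` around the import and keeps the decision in a single flag that the rest of the script can reuse.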

### `BaseTable`
@@ -756,6 +761,33 @@ Constructed internally by [`DataframeTable`](#dataframetable); exposes
the column name, detected data type, unique identifier, and sampled
instance values via the standard [`BaseColumn`](#basecolumn) interface.

### `PolarsTable`

```python
PolarsTable(
df: pl.DataFrame,
name: str,
instance_sample_size: int | None = 1000,
)
```

[`BaseTable`](#basetable) adapter for a Polars DataFrame. Requires the
`polars` extra (`pip install valentine[polars]`). Has the same interface
as [`DataframeTable`](#dataframetable).

| Parameter | Type | Default | Description |
|------------------------|-----------------|---------|-------------------------------------------------------------------------------------------------|
| `df` | `pl.DataFrame` | — | The Polars DataFrame to wrap. |
| `name` | `str` | — | Name of the table. |
| `instance_sample_size` | `int \| None` | `1000` | Cap on the number of non-empty rows sampled per column. Pass `None` to use the full DataFrame; pass `0` to expose no instance data at all. |
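
The three documented cases for `instance_sample_size` (a positive cap, `None`, and `0`) can be illustrated on a plain Python list. This is a minimal sketch under stated assumptions, not the library's implementation: `sample_instances` is a hypothetical helper, it takes the first values rather than a random sample, and it drops empty cells even when `None` is passed.

```python
def sample_instances(values, instance_sample_size=1000):
    """Hypothetical helper mirroring the documented parameter semantics."""
    # Assumption: "non-empty" means neither None nor the empty string.
    non_empty = [v for v in values if v is not None and v != ""]
    if instance_sample_size is None:
        # None: no cap, every non-empty value is kept.
        return non_empty
    # A positive cap keeps at most that many values; 0 keeps none.
    return non_empty[:instance_sample_size]


column = ["alice", None, "", "bob", "carol"]
print(sample_instances(column, 2))     # ['alice', 'bob']
print(sample_instances(column, None))  # ['alice', 'bob', 'carol']
print(sample_instances(column, 0))     # []
```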

### `PolarsColumn`

[`BaseColumn`](#basecolumn) adapter for a single Polars `Series`.
Constructed internally by [`PolarsTable`](#polarstable); exposes
the column name, detected data type, unique identifier, and sampled
instance values via the standard [`BaseColumn`](#basecolumn) interface.

### Writing a custom data source

If your data doesn't live in a pandas DataFrame, implement
11 changes: 8 additions & 3 deletions docs/example.md
@@ -13,11 +13,16 @@ ground truth. Every API touched here is documented in the
!!! note

The same script lives in the repo at
[`examples/valentine_example.py`][source].
[`examples/valentine_example_pandas.py`][source]. Additional examples:

[source]: https://github.com/delftdata/valentine/blob/master/examples/valentine_example.py
- [`valentine_example_polars.py`][polars] — Polars DataFrames
- [`valentine_example_mixed.py`][mixed] — mixing pandas and Polars in the same call

```python title="valentine_example.py"
[source]: https://github.com/delftdata/valentine/blob/master/examples/valentine_example_pandas.py
[polars]: https://github.com/delftdata/valentine/blob/master/examples/valentine_example_polars.py
[mixed]: https://github.com/delftdata/valentine/blob/master/examples/valentine_example_mixed.py

```python title="valentine_example_pandas.py"
import pprint
from pathlib import Path

20 changes: 20 additions & 0 deletions docs/faq.md
@@ -7,6 +7,26 @@ icon: lucide/help-circle
Common questions and gotchas. If yours isn't here, open an issue on
[GitHub](https://github.com/delftdata/valentine/issues).

## Can I use Polars instead of pandas?

Yes. Install the optional extra with `pip install valentine[polars]` and pass
Polars DataFrames directly to [`valentine_match`](api.md#valentine_match).
You can even mix pandas and Polars frames in the same call — Valentine
auto-detects the frame type and wraps each one in the appropriate data
source (`DataframeTable` or `PolarsTable`).

```python
import pandas as pd
import polars as pl
from valentine import valentine_match
from valentine.algorithms import Coma

df_pandas = pd.read_csv("source.csv")
df_polars = pl.read_csv("target.csv")

matches = valentine_match([df_pandas, df_polars], Coma())
```
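
To give an intuition for what "auto-detects the frame type" could mean, here is a small sketch that picks a wrapper name from the package that defines an object's class. It is an assumption about one possible approach, not Valentine's actual detection logic. `frame_kind` and `wrapper_name` are hypothetical helpers, and the two stand-in classes only imitate the module paths of real frames so the example runs without pandas or Polars installed.

```python
def frame_kind(obj) -> str:
    """Hypothetical helper: name the package that defines obj's class."""
    root = type(obj).__module__.split(".")[0]
    if root in ("pandas", "polars"):
        return root
    raise TypeError(f"unsupported frame type: {type(obj).__name__}")


def wrapper_name(obj) -> str:
    """Hypothetical helper: map a frame to the data source class named in the docs."""
    return {"pandas": "DataframeTable", "polars": "PolarsTable"}[frame_kind(obj)]


# Stand-in classes (assumption: they mimic where the real classes are defined).
FakePandasFrame = type("DataFrame", (), {"__module__": "pandas.core.frame"})
FakePolarsFrame = type("DataFrame", (), {"__module__": "polars.dataframe.frame"})

for frame in (FakePandasFrame(), FakePolarsFrame()):
    print(frame_kind(frame), "->", wrapper_name(frame))
```

Dispatching on the module name rather than on `isinstance` means neither library has to be imported just to tell the two frame types apart, which fits an optional dependency.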

## Which matcher should I use?

Start with [`Coma`](api.md#coma). It is the strongest default, handles
87 changes: 73 additions & 14 deletions docs/getting-started.md
@@ -29,6 +29,28 @@ command. It requires **Python 3.10 or newer** (and is tested up to 3.14).
poetry add valentine
```

### Polars support

To use Polars DataFrames, install the optional `polars` extra:

=== "pip"

    ```shell
    pip install valentine[polars]
    ```

=== "uv"

    ```shell
    uv add valentine[polars]
    ```

=== "poetry"

    ```shell
    poetry add valentine -E polars
    ```

For local development, clone the repo and install in editable mode:

```shell
@@ -41,24 +63,61 @@ pip install -e ".[dev]"

The single entry point for matching is
[`valentine_match`](api.md#valentine_match). It takes an iterable of
DataFrames and a matcher instance, and returns a
DataFrames (pandas or Polars) and a matcher instance, and returns a
[`MatcherResults`](api.md#matcherresults) mapping — see the
[Matcher results](results.md) guide for everything you can do with it.

```python
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma
=== "pandas"

df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")
```python
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)
df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")

for pair, score in matches.items():
print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
```
matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)

for pair, score in matches.items():
print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
```

=== "Polars"

```python
import polars as pl
from valentine import valentine_match
from valentine.algorithms import Coma

df1 = pl.read_csv("source_candidates.csv")
df2 = pl.read_csv("target_candidates.csv")

matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)

for pair, score in matches.items():
print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
```

=== "Mixed (pandas + Polars)"

```python
import pandas as pd
import polars as pl
from valentine import valentine_match
from valentine.algorithms import Coma

df_pandas = pd.read_csv("source_candidates.csv")
df_polars = pl.read_csv("target_candidates.csv")

matcher = Coma(use_instances=True)
matches = valentine_match([df_pandas, df_polars], matcher)

for pair, score in matches.items():
print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
```

!!! note "Table names"

@@ -70,8 +129,8 @@ for pair, score in matches.items():

## Matching many DataFrames

Pass any iterable of DataFrames — list, tuple, generator — and Valentine
computes all unique pairs:
Pass any iterable of DataFrames (pandas, Polars, or mixed) — list, tuple,
generator — and Valentine computes all unique pairs:

```python
matches = valentine_match(
10 changes: 6 additions & 4 deletions docs/index.md
@@ -35,15 +35,17 @@ hide:
</div>

Valentine is a Python package for capturing potential relationships among
columns of different tabular datasets, given as pandas DataFrames. It
implements several schema- and instance-based matching algorithms behind a
columns of different tabular datasets, given as pandas or Polars DataFrames.
It implements several schema- and instance-based matching algorithms behind a
single, uniform API, and ships with evaluation metrics so you can measure
match quality against a ground truth.
match quality against a ground truth. Pandas and Polars frames can be freely
mixed in the same call.

## Installation

```shell
pip install valentine
pip install valentine # pandas only
pip install valentine[polars] # pandas + Polars support
```

Requires Python **>=3.10, <3.15**.