feat: add TableSchemaBuilder and store partition columns as Fields #22496

Open
adriangb wants to merge 2 commits into apache:main from pydantic:table-schema-builder

@adriangb
Contributor

Which issue does this PR close?

Rationale for this change

TableSchema has one required input (the file schema) and a growing set of optional column groups: partition columns today, virtual columns in #22026. The current API expresses this awkwardly:

  • new(file_schema, partition_cols) privileges partition columns with a positional slot while virtual columns only get a builder method — an asymmetry that grows with every new column kind.
  • TableSchema eagerly recomputes and caches the concatenated table schema on every incremental setter call, so from_file_schema(s).with_table_partition_cols(p) rebuilds it twice (three times once virtual columns are added). This is exactly why new()'s docs told callers to avoid the builder-style chain.
  • The setter mutated an inner Arc<Vec<FieldRef>> in place, which is what caused the shared-Arc panic fixed in #22372 (fix: avoid panic in TableSchema::with_table_partition_cols on shared Arc).

A dedicated builder addresses all three, and mirrors the existing FileScanConfigBuilder (the type that owns a TableSchema).
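
The shared-Arc hazard from the third bullet can be sketched with std types only. This is an illustration under assumptions, not the code from #22372: it assumes the old in-place mutation relied on exclusive ownership (something like `Arc::get_mut`), which the description does not spell out, and `push_in_place` / `replace` are hypothetical helpers operating on plain `String` names.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an in-place setter on `Arc<Vec<_>>`.
/// Returns `false` instead of panicking when the `Arc` is shared.
fn push_in_place(cols: &mut Arc<Vec<String>>, name: &str) -> bool {
    // `Arc::get_mut` only yields `Some` when this is the sole owner.
    match Arc::get_mut(cols) {
        Some(v) => {
            v.push(name.to_string());
            true
        }
        // An `.unwrap()` or `.expect()` at this point would panic.
        None => false,
    }
}

/// Replacing the whole immutable list never needs exclusive access
/// to the old allocation, so sharing cannot make it fail.
fn replace(cols: &mut Arc<[String]>, new_cols: Vec<String>) {
    *cols = new_cols.into(); // Vec<String> -> Arc<[String]>
}
```

With an immutable `Arc<[T]>` there is no in-place path to get wrong: other holders of the old list keep seeing the old contents, and the setter simply swaps in a new list.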

What changes are included in this PR?

  • TableSchemaBuilder: new(file_schema).with_table_partition_cols(impl Into<Fields>).build(). The concatenated table schema is computed exactly once, in build(). The setter takes impl Into<Fields>, so an existing schema's Fields is accepted zero-copy.
  • Partition columns are now stored as arrow::datatypes::Fields (an immutable Arc<[FieldRef]>) instead of Arc<Vec<FieldRef>>: one fewer indirection, shareable zero-copy, and — being immutable — the shared-Arc mutation panic is structurally impossible.
  • TableSchema::table_partition_cols() and the delegating FileScanConfig::table_partition_cols() now return &Fields. Fields derefs to &[FieldRef], so iteration/indexing/len/is_empty are unchanged; only the arrow FileFormat path needed .to_vec().
  • TableSchema::with_table_partition_cols is deprecated in favor of the builder. It now replaces rather than appends. (Note: main currently appends here because the replace change in #22372 (fix: avoid panic in TableSchema::with_table_partition_cols on shared Arc) was not captured by that PR's squash merge, so this also restores the intended replace semantics.)
  • new / from_file_schema are kept as conveniences that route through the builder.
  • Documented in the 55.0.0 upgrade guide.

This intentionally leaves virtual columns out; #22026 should extend the builder with with_virtual_columns once it lands.
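
The shape of the builder described above can be sketched as follows. This is a minimal model under stated assumptions, not the PR's implementation: `FieldRef` and `Fields` here are hypothetical std-only stand-ins (a field is just a name, and `Fields` is a bare `Arc<[FieldRef]>` alias rather than Arrow's type), and `file_fields` / `table_fields` are illustrative accessor names.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for Arrow's types: a "field" is just a name here.
type FieldRef = Arc<str>;
type Fields = Arc<[FieldRef]>;

pub struct TableSchema {
    file_fields: Fields,
    partition_cols: Fields,
    /// File fields followed by partition columns.
    table_fields: Fields,
}

impl TableSchema {
    pub fn file_fields(&self) -> &Fields {
        &self.file_fields
    }
    pub fn table_partition_cols(&self) -> &Fields {
        &self.partition_cols
    }
    pub fn table_fields(&self) -> &Fields {
        &self.table_fields
    }
}

pub struct TableSchemaBuilder {
    file_fields: Fields,
    partition_cols: Fields,
}

impl TableSchemaBuilder {
    /// The file schema is the only required input.
    pub fn new(file_fields: impl Into<Fields>) -> Self {
        Self {
            file_fields: file_fields.into(),
            partition_cols: Vec::<FieldRef>::new().into(),
        }
    }

    /// Replaces (does not append to) any previously set partition columns.
    /// Passing an existing `Fields` just moves the `Arc`; nothing is copied.
    pub fn with_table_partition_cols(mut self, cols: impl Into<Fields>) -> Self {
        self.partition_cols = cols.into();
        self
    }

    /// The concatenated field list is computed here, once.
    pub fn build(self) -> TableSchema {
        let table_fields: Fields = self
            .file_fields
            .iter()
            .chain(self.partition_cols.iter())
            .cloned()
            .collect();
        TableSchema {
            file_fields: self.file_fields,
            partition_cols: self.partition_cols,
            table_fields,
        }
    }
}
```

Because the setters only store their input, calling them repeatedly costs nothing beyond the final `build()`, which is the property the eager-caching setter lacked.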

Are these changes tested?

Yes. New unit tests cover building with partition columns, replace-on-repeat, zero-copy Fields input, and the deprecated setter's behavior; existing TableSchema / FileScanConfig tests and doctests pass. cargo clippy --all-targets -- -D warnings is clean across the datasource/proto/arrow/parquet/catalog-listing crates.

Are there any user-facing changes?

Yes — please apply the api change label:

  • TableSchema::table_partition_cols() / FileScanConfig::table_partition_cols() return &Fields instead of &Vec<FieldRef> (source-compatible for most uses via Deref).
  • TableSchema::with_table_partition_cols is deprecated (use the builder) and now replaces rather than appends.
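
Why the return-type change is source-compatible for most callers can be illustrated with a small model. The `Fields` newtype below is a hypothetical std-only stand-in modeled on `arrow::datatypes::Fields` (a wrapper around `Arc<[FieldRef]>` that derefs to a slice), and `describe` / `owned` are illustrative caller functions, not DataFusion APIs.

```rust
use std::ops::Deref;
use std::sync::Arc;

// Hypothetical stand-in for Arrow's `FieldRef`: a field is just a name.
type FieldRef = Arc<str>;

/// Hypothetical newtype modeled on `arrow::datatypes::Fields`.
pub struct Fields(Arc<[FieldRef]>);

impl Deref for Fields {
    type Target = [FieldRef];
    fn deref(&self) -> &[FieldRef] {
        &self.0
    }
}

impl From<Vec<FieldRef>> for Fields {
    fn from(v: Vec<FieldRef>) -> Self {
        Fields(v.into())
    }
}

/// A caller that only uses slice methods (`len`, `is_empty`, `iter`,
/// indexing) reads the same whether it receives `&Vec<FieldRef>` or `&Fields`.
pub fn describe(cols: &Fields) -> String {
    if cols.is_empty() {
        return "no partition columns".to_string();
    }
    let names: Vec<&str> = cols.iter().map(|f| &**f).collect();
    format!("{} cols, first={}, all={}", cols.len(), cols[0], names.join(","))
}

/// A caller that needs an owned `Vec` has to copy explicitly.
pub fn owned(cols: &Fields) -> Vec<FieldRef> {
    cols.to_vec()
}
```

Only code that names the `&Vec<FieldRef>` type explicitly, or passes the result to something that requires a `Vec`, needs a change such as `.to_vec()`.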

🤖 Generated with Claude Code

@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) and datasource (Changes to the datasource crate) labels May 24, 2026
Introduce `TableSchemaBuilder` as the preferred way to construct a
`TableSchema`. The file schema is the only required input; partition
columns are optional, and the concatenated table schema is computed
exactly once in `build()` (rather than being recomputed on every
incremental setter call).

`TableSchema` now stores its partition columns as `arrow::datatypes::Fields`
(an immutable `Arc<[FieldRef]>`) instead of `Arc<Vec<FieldRef>>`: the
idiomatic Arrow field-list type, a single `Arc<[FieldRef]>` (one fewer
indirection), shareable zero-copy with an existing schema, and -- being
immutable -- it makes the shared-`Arc` mutation panic that motivated recent
changes structurally impossible. `TableSchemaBuilder::with_table_partition_cols`
takes `impl Into<Fields>`, accepting an existing schema's `Fields` without a
`Vec` round-trip.

`TableSchema::table_partition_cols()` (and the delegating
`FileScanConfig::table_partition_cols()`) now return `&Fields`. `Fields`
derefs to `&[FieldRef]`, so iteration/indexing/`len`/`is_empty` callers are
unchanged; only the arrow `FileFormat` path needed `.to_vec()`.

The mutating `TableSchema::with_table_partition_cols` setter is deprecated
in favor of the builder; `new`/`from_file_schema` are kept as conveniences
that route through the builder. Documented in the 55.0.0 upgrade guide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the table-schema-builder branch from 5430399 to b88ac6c Compare May 24, 2026 14:49
@github-actions github-actions Bot added the core (Core DataFusion crate), catalog (Related to the catalog crate), and proto (Related to proto crate) labels May 24, 2026
@adriangb adriangb force-pushed the table-schema-builder branch from b88ac6c to 99556d3 Compare May 24, 2026 14:57
@adriangb adriangb requested review from alamb and comphead May 24, 2026 15:14
@adriangb
Contributor Author

cc @comphead and @Dandandan since you reviewed #22372

cc @alamb: this brings us back to #19137, where you foresaw this. We now use Fields instead of Vec<FieldRef>. I'm not sure why I rejected it in that PR (maybe I was too narrowly focused on appending?), but I now think you were right 😄.

…lder + From

Consolidate `TableSchema` construction on `TableSchemaBuilder` plus the
idiomatic `From<SchemaRef>` conversion, giving a single way to do each thing:

- with partition columns: `TableSchema::builder(file_schema)
                              .with_table_partition_cols(cols).build()`
- without:                `TableSchema::from(file_schema)` / `file_schema.into()`

`TableSchema::new`, `TableSchema::from_file_schema`, and the mutating
`TableSchema::with_table_partition_cols` setter are all deprecated. `new` and
`from_file_schema` can only ever express a subset of the eventual column groups
(partition, and later virtual), so the builder is the single complete path;
`from_file_schema` was also redundant with the `From<SchemaRef>` impl.

All in-tree callers are migrated accordingly. Documented in the 55.0.0
upgrade guide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
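
The two construction paths named in this commit message can be modeled roughly as below. This is a sketch under assumptions: `SchemaRef` is a hypothetical std-only stand-in (a shared list of column names), partition columns are plain `Vec<String>`, and routing the `From` impl through the builder is an assumption rather than something the commit message states.

```rust
use std::sync::Arc;

// Hypothetical stand-in: a "schema" is a shared list of column names.
type SchemaRef = Arc<Vec<String>>;

#[derive(Debug, PartialEq)]
pub struct TableSchema {
    file_schema: SchemaRef,
    partition_cols: Vec<String>,
}

pub struct TableSchemaBuilder {
    file_schema: SchemaRef,
    partition_cols: Vec<String>,
}

impl TableSchema {
    /// Entry point for the "with partition columns" path.
    pub fn builder(file_schema: SchemaRef) -> TableSchemaBuilder {
        TableSchemaBuilder {
            file_schema,
            partition_cols: Vec::new(),
        }
    }

    pub fn table_partition_cols(&self) -> &[String] {
        &self.partition_cols
    }
}

impl TableSchemaBuilder {
    pub fn with_table_partition_cols(mut self, cols: Vec<String>) -> Self {
        self.partition_cols = cols;
        self
    }

    pub fn build(self) -> TableSchema {
        TableSchema {
            file_schema: self.file_schema,
            partition_cols: self.partition_cols,
        }
    }
}

/// The "without partition columns" path: `TableSchema::from(schema)`
/// or `schema.into()`. Assumed here to delegate to the builder.
impl From<SchemaRef> for TableSchema {
    fn from(file_schema: SchemaRef) -> Self {
        TableSchema::builder(file_schema).build()
    }
}
```

In this model `TableSchema::from(s)`, `s.into()`, and `TableSchema::builder(s).build()` all yield equal values, which is the sense in which a separate `from_file_schema` constructor would be redundant.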
@adriangb adriangb force-pushed the table-schema-builder branch from 99556d3 to 26cb424 Compare May 24, 2026 15:18
Labels

catalog (Related to the catalog crate), core (Core DataFusion crate), datasource (Changes to the datasource crate), documentation (Improvements or additions to documentation), proto (Related to proto crate)
