feat: add TableSchemaBuilder and store partition columns as Fields #22496
Open
adriangb wants to merge 2 commits into
Introduce `TableSchemaBuilder` as the preferred way to construct a `TableSchema`. The file schema is the only required input; partition columns are optional, and the concatenated table schema is computed exactly once in `build()` (rather than being recomputed on every incremental setter call).

`TableSchema` now stores its partition columns as `arrow::datatypes::Fields` (an immutable `Arc<[FieldRef]>`) instead of `Arc<Vec<FieldRef>>`: the idiomatic Arrow field-list type, a single `Arc<[FieldRef]>` (one fewer indirection), shareable zero-copy with an existing schema, and -- being immutable -- it makes the shared-`Arc` mutation panic that motivated recent changes structurally impossible.

`TableSchemaBuilder::with_table_partition_cols` takes `impl Into<Fields>`, accepting an existing schema's `Fields` without a `Vec` round-trip. `TableSchema::table_partition_cols()` (and the delegating `FileScanConfig::table_partition_cols()`) now return `&Fields`. `Fields` derefs to `&[FieldRef]`, so iteration/indexing/`len`/`is_empty` callers are unchanged; only the arrow `FileFormat` path needed `.to_vec()`.

The mutating `TableSchema::with_table_partition_cols` setter is deprecated in favor of the builder; `new`/`from_file_schema` are kept as conveniences that route through the builder. Documented in the 55.0.0 upgrade guide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
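To illustrate the build-once behavior described in this commit message, here is a minimal std-only sketch. It is not the DataFusion implementation: `FieldRef` and `Fields` are stand-in aliases (the real types are `Arc<arrow::datatypes::Field>` and `arrow::datatypes::Fields`), the `fields` helper is hypothetical, and the struct layout is assumed.

```rust
use std::sync::Arc;

// Stand-ins for the Arrow types (assumptions for this sketch only).
pub type FieldRef = Arc<str>; // real code: Arc<arrow::datatypes::Field>
pub type Fields = Arc<[FieldRef]>; // real code: arrow::datatypes::Fields

#[derive(Debug)]
pub struct TableSchema {
    file_schema: Fields,
    table_partition_cols: Fields,
    table_schema: Fields, // file columns followed by partition columns
}

impl TableSchema {
    pub fn builder(file_schema: Fields) -> TableSchemaBuilder {
        TableSchemaBuilder {
            file_schema,
            table_partition_cols: Arc::from(Vec::<FieldRef>::new()),
        }
    }

    pub fn file_schema(&self) -> &Fields {
        &self.file_schema
    }

    pub fn table_partition_cols(&self) -> &Fields {
        &self.table_partition_cols
    }

    pub fn table_schema(&self) -> &Fields {
        &self.table_schema
    }
}

pub struct TableSchemaBuilder {
    file_schema: Fields,
    table_partition_cols: Fields,
}

impl TableSchemaBuilder {
    /// Replaces any previously set partition columns.
    /// Nothing is concatenated here, so repeated calls stay cheap.
    pub fn with_table_partition_cols(mut self, cols: impl Into<Fields>) -> Self {
        self.table_partition_cols = cols.into();
        self
    }

    /// The only place where the concatenated table schema is computed.
    pub fn build(self) -> TableSchema {
        let table_schema: Fields = self
            .file_schema
            .iter()
            .chain(self.table_partition_cols.iter())
            .cloned()
            .collect();
        TableSchema {
            file_schema: self.file_schema,
            table_partition_cols: self.table_partition_cols,
            table_schema,
        }
    }
}

/// Hypothetical helper for building a stand-in field list from names.
pub fn fields(names: &[&str]) -> Fields {
    names.iter().map(|n| FieldRef::from(*n)).collect()
}
```

Because the setter takes `impl Into<Fields>`, passing an existing `Fields` value only clones the `Arc` pointer, which is the zero-copy sharing the message refers to.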
cc @comphead and @Dandandan since you reviewed #22372 cc @alamb: this brings us back to #19137 where you foresaw this. We now use |
…lder + From
Consolidate `TableSchema` construction on `TableSchemaBuilder` plus the
idiomatic `From<SchemaRef>` conversion, giving a single way to do each thing:
- with partition columns: `TableSchema::builder(file_schema).with_table_partition_cols(cols).build()`
- without: `TableSchema::from(file_schema)` / `file_schema.into()`
`TableSchema::new`, `TableSchema::from_file_schema`, and the mutating
`TableSchema::with_table_partition_cols` setter are all deprecated. `new` and
`from_file_schema` can only ever express a subset of the eventual column groups
(partition, and later virtual), so the builder is the single complete path;
`from_file_schema` was also redundant with the `From<SchemaRef>` impl.
All in-tree callers are migrated accordingly. Documented in the 55.0.0
upgrade guide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
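The two construction paths named in this commit message, and a deprecated constructor that routes through the builder, could look roughly like the sketch below. It uses std-only stand-ins: `SchemaRef` and `Fields` here are hypothetical aliases rather than the Arrow types, column names are plain `String`s, and the deprecation note text is an assumption.

```rust
use std::sync::Arc;

// Stand-ins for the Arrow types (assumptions for this sketch only).
pub type SchemaRef = Arc<Vec<String>>; // real code: arrow::datatypes::SchemaRef
pub type Fields = Arc<[String]>; // real code: arrow::datatypes::Fields

pub struct TableSchema {
    file_schema: SchemaRef,
    table_partition_cols: Fields,
}

pub struct TableSchemaBuilder {
    file_schema: SchemaRef,
    table_partition_cols: Fields,
}

impl TableSchemaBuilder {
    pub fn with_table_partition_cols(mut self, cols: impl Into<Fields>) -> Self {
        self.table_partition_cols = cols.into();
        self
    }

    pub fn build(self) -> TableSchema {
        TableSchema {
            file_schema: self.file_schema,
            table_partition_cols: self.table_partition_cols,
        }
    }
}

impl TableSchema {
    /// The complete path: every optional column group hangs off the builder.
    pub fn builder(file_schema: SchemaRef) -> TableSchemaBuilder {
        TableSchemaBuilder {
            file_schema,
            table_partition_cols: Arc::from(Vec::<String>::new()),
        }
    }

    /// Kept only for migration; it delegates to the builder.
    #[deprecated(note = "use TableSchema::builder or TableSchema::from")]
    pub fn new(file_schema: SchemaRef, table_partition_cols: Vec<String>) -> Self {
        Self::builder(file_schema)
            .with_table_partition_cols(table_partition_cols)
            .build()
    }

    pub fn file_schema(&self) -> &SchemaRef {
        &self.file_schema
    }

    pub fn table_partition_cols(&self) -> &Fields {
        &self.table_partition_cols
    }
}

/// The no-partition-columns path: `TableSchema::from(s)` or `s.into()`.
impl From<SchemaRef> for TableSchema {
    fn from(file_schema: SchemaRef) -> Self {
        TableSchema::builder(file_schema).build()
    }
}
```

Routing both `From` and the deprecated constructor through the builder keeps a single code path that has to learn about new column groups.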
Which issue does this PR close?
TableSchema::with_table_partition_cols) and the API discussion it spawned, and is informed by feat: Plumb Parquet virtual columns (row_number) through TableSchema and ParquetOpener #22026 (which adds a third column group, virtual columns, to `TableSchema`).

Rationale for this change

`TableSchema` has one required input (the file schema) and a growing set of optional column groups: partition columns today, virtual columns in #22026. The current API expresses this awkwardly:

- `new(file_schema, partition_cols)` privileges partition columns with a positional slot while virtual columns only get a builder method, an asymmetry that grows with every new column kind.
- `TableSchema` eagerly recomputes and caches the concatenated table schema on every incremental setter call, so `from_file_schema(s).with_table_partition_cols(p)` rebuilds it twice (three times once virtual columns are added). This is exactly why `new()`'s docs told callers to avoid the builder-style chain.
- The mutating setter modifies the `Arc<Vec<FieldRef>>` in place, which is what caused the shared-`Arc` panic fixed in fix: avoid panic in TableSchema::with_table_partition_cols on shared Arc #22372.

A dedicated builder addresses all three, and mirrors the existing `FileScanConfigBuilder` (the type that owns a `TableSchema`).

What changes are included in this PR?

- `TableSchemaBuilder`: `new(file_schema)` → `.with_table_partition_cols(impl Into<Fields>)` → `.build()`. The concatenated table schema is computed exactly once, in `build()`. The setter takes `impl Into<Fields>`, so an existing schema's `Fields` is accepted zero-copy.
- Partition columns are stored as `arrow::datatypes::Fields` (an immutable `Arc<[FieldRef]>`) instead of `Arc<Vec<FieldRef>>`: one fewer indirection, shareable zero-copy, and, because the value is immutable, the shared-`Arc` mutation panic is structurally impossible.
- `TableSchema::table_partition_cols()` and the delegating `FileScanConfig::table_partition_cols()` now return `&Fields`. `Fields` derefs to `&[FieldRef]`, so iteration/indexing/`len`/`is_empty` are unchanged; only the arrow `FileFormat` path needed `.to_vec()`.
- `TableSchema::with_table_partition_cols` is deprecated in favor of the builder. It now replaces rather than appends. (Note: `main` currently appends here. The replace change in fix: avoid panic in TableSchema::with_table_partition_cols on shared Arc #22372 was not captured by that PR's squash merge, so this also restores the intended replace semantics.)
- `new`/`from_file_schema` are kept as conveniences that route through the builder.

This intentionally leaves virtual columns out; #22026 should extend the builder with `with_virtual_columns` once it lands.

Are these changes tested?

Yes. New unit tests cover building with partition columns, replace-on-repeat, zero-copy `Fields` input, and the deprecated setter's behavior; existing `TableSchema`/`FileScanConfig` tests and doctests pass. `cargo clippy --all-targets -- -D warnings` is clean across the datasource/proto/arrow/parquet/catalog-listing crates.

Are there any user-facing changes?

Yes, please apply the `api change` label:

- `TableSchema::table_partition_cols()` / `FileScanConfig::table_partition_cols()` return `&Fields` instead of `&Vec<FieldRef>` (source-compatible for most uses via `Deref`).
- `TableSchema::with_table_partition_cols` is deprecated (use the builder) and now replaces rather than appends.

🤖 Generated with Claude Code
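As a rough illustration of why an immutable `Arc<[T]>` rules out the shared-`Arc` failure mode while staying source-compatible for slice-style callers, consider the std-only sketch below. The function names are hypothetical, `String` stands in for `FieldRef`, and the `Arc::get_mut` pattern is only one plausible shape of in-place mutation; it is an assumption, not a description of the code fixed in #22372.

```rust
use std::sync::Arc;

/// Old shape: a growable `Vec` behind an `Arc`.
/// In-place mutation only works while the `Arc` is uniquely owned.
pub fn try_append_in_place(cols: &mut Arc<Vec<String>>, extra: &str) -> bool {
    match Arc::get_mut(cols) {
        Some(v) => {
            v.push(extra.to_string());
            true
        }
        // Shared: an `.unwrap()` or `.expect()` at this point would panic.
        None => false,
    }
}

/// New shape: an immutable slice behind an `Arc`.
/// "Setting" the columns swaps in a whole new value; other holders are unaffected.
pub fn replace(cols: &mut Arc<[String]>, new_cols: impl Into<Arc<[String]>>) {
    *cols = new_cols.into();
}

/// A caller written against `&[String]` accepts `&Arc<[String]>` via deref coercion.
pub fn column_names(cols: &[String]) -> Vec<&str> {
    cols.iter().map(String::as_str).collect()
}
```

With the slice-backed shape there is no in-place mutation path at all, so the question of whether the `Arc` is shared never arises; callers that only iterate, index, or call `len`/`is_empty` see the same slice API as before.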