DSL Kernels as First-Class CTable Columns #659

Merged
FrancescAlted merged 13 commits into main from dsl-kernels-as-cols on Jun 4, 2026

@FrancescAlted
Member

DSL kernels (functions decorated with @blosc2.dsl_kernel) can now be registered as computed columns in CTable, alongside the existing expression-string computed columns.

Key changes:

  • CTable — computed column extensions (ctable.py)

    • add_computed_column() and add_generated_column() now accept a DSL/UDF kernel as the expression, not just a string.
    • New _build_computed_lazy(cc) helper unifies dispatch between expression-string columns and DSL-kernel columns.
    • CTable.where() now accepts UDF/DSL kernels as filter predicates, not only expression strings.
    • DSL computed and materialized columns are correctly serialized/deserialized (dsl_source + jit_backend are persisted), so they survive save/load cycles.
    • New internal helpers: _resolve_dsl_kernel, _dsl_result_dtype, _materialized_dsl_kernel, _evaluate_dsl_materialized_batch.
  • dsl_kernel.py

    • New kernel_from_source(source) utility that reconstructs a DSLKernel from its stored source text (used by the deserializer).
  • lazyexpr.py

    • lazyudf() dtype parameter is now optional for DSL kernels — dtype is inferred via NumPy type promotion of input dtypes.
    • convert_inputs() now automatically unwraps CTable.Column objects to their backing NDArray.
    • BLOSC_ME_JIT env var now takes full priority over both jit= and jit_backend= keyword arguments, making it easy to switch JIT backends from the CLI without touching code.
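
The precedence rule in the last bullet can be pictured with a small, library-free sketch. Only the variable name `BLOSC_ME_JIT` and the `jit=` / `jit_backend=` keywords come from this PR; the helper `resolve_jit` and the value convention it uses ("0" for off, "1" for on, anything else read as a backend name) are assumptions made for illustration and may not match what blosc2 actually accepts.

```python
import os


def resolve_jit(jit=None, jit_backend=None, environ=None):
    """Return (jit, jit_backend) after applying the environment override.

    Hypothetical helper. Assumed value convention, not taken from blosc2:
    "0" disables JIT, "1" enables it and keeps the caller's backend, and any
    other non-empty value is treated as a backend name that implies JIT on.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("BLOSC_ME_JIT", "").strip().lower()
    if not value:
        # Variable unset or empty: keyword arguments are used unchanged.
        return jit, jit_backend
    if value == "0":
        # The environment wins even if the caller asked for jit=True.
        return False, None
    if value == "1":
        return True, jit_backend
    # Any other value overrides the jit_backend= keyword.
    return True, value
```

Under these assumptions, a script that hard-codes `jit=False` would still run with JIT enabled when launched with the variable set, which is the "switch from the CLI without touching code" workflow the bullet describes.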

Tests & docs

  • New test file tests/ctable/test_ctable_dsl_columns.py (244 lines).
  • New test file tests/ctable/test_ctable_computed_cols.py (252 lines).
  • New example examples/ctable/udf-computed-col.py.
  • New benchmark bench/ctable/query-backends.py.
  • Documentation: new CTable section in overview.rst with performance tips; expanded ctable.rst reference.

  @blosc2.dsl_kernel-decorated functions can now back both virtual computed
  columns (add_computed_column) and stored generated columns
  (add_generated_column), survive save/open round-trips, and be referenced
  inside where() predicates.

  API:
  - add_computed_column(name, kernel, inputs=[...], dtype=None) binds one
    stored scalar column per kernel parameter; the callable form returning a
    blosc2.lazyudf(...) is also accepted.
  - add_generated_column(..., values=kernel, inputs=[...]) adds a stored DSL
    column (new transformer_kind="dsl") with append/extend auto-fill,
    refresh_generated_column, and indexing.
  - dtype is inferred by NumPy type promotion of the input column dtypes when
    omitted; pass dtype explicitly for type-changing kernels (comparisons/casts).
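
The dtype rule above can be illustrated with plain NumPy. This is a sketch of the promotion idea only: `infer_result_dtype` is a hypothetical helper, not the function blosc2 uses, and it assumes that "NumPy type promotion" means `np.result_type` applied to the input dtypes.

```python
import numpy as np


def infer_result_dtype(input_dtypes, dtype=None):
    """Explicit dtype wins; otherwise promote the input dtypes (hypothetical)."""
    if dtype is not None:
        return np.dtype(dtype)
    return np.result_type(*input_dtypes)


# Arithmetic kernels: promotion matches what NumPy arithmetic would produce.
print(infer_result_dtype([np.int32, np.float32]))    # float64
print(infer_result_dtype([np.float32, np.float32]))  # float32

# A comparison kernel returns booleans, which promotion of the inputs
# cannot predict, so the dtype has to be passed explicitly.
a = np.arange(4, dtype=np.int32)
b = np.full(4, 1.5, dtype=np.float32)
print((a > b).dtype)                                             # bool
print(infer_result_dtype([a.dtype, b.dtype], dtype=np.bool_))    # bool
```

The last two lines show why the commit message singles out comparisons and casts: the promoted dtype of the inputs (float64 here) differs from the dtype the kernel really returns.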

  Internals:
  - Factor the safe-exec DSL reconstruction into dsl_kernel.kernel_from_source()
    and route b2objects.decode_structured_lazyudf through it.
  - Persist computed entries as kind:"dsl" + dsl_source (no expression); rebuild
    the kernel and LazyUDF on open.
  - DSL computed columns materialize via LazyUDF.compute() on access: the
    miniexpr DSL path only supports full-array getitem, so reads/where() cannot
    slice lazily. Materializing also lets a DSL column join where() as a plain
    NDArray operand (chunked staged co-evaluation, not single-kernel fusion).
  - Guard the LazyExpr-only sites (compact, info display, materialize, per-row
    access, sort, index reuse) and the three materialized eval paths
    (row/batch autofill, refresh). Pad length-1 DSL batches to 2 and slice back,
    since miniexpr rejects shape-(1,) inputs.
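
The length-1 workaround in the last item can be sketched as follows. The names `run_batch` and `picky_kernel` are hypothetical, the stand-in kernel only mimics the shape-(1,) restriction, and padding by duplicating the single row is an assumption (the commit message does not say what the padding value is).

```python
import numpy as np


def run_batch(kernel, *columns):
    """Evaluate kernel on equal-length 1-D arrays, padding length-1 batches."""
    n = len(columns[0])
    if n != 1:
        return kernel(*columns)
    # Duplicate the single row so every input has shape (2,) ...
    padded = [np.concatenate([col, col]) for col in columns]
    # ... then slice back so the caller still receives exactly one row.
    return kernel(*padded)[:1]


def picky_kernel(x, y):
    """Stand-in kernel that refuses shape-(1,) inputs, like the backend does."""
    if x.shape == (1,):
        raise ValueError("shape-(1,) inputs are not supported")
    return x * 2 + y


single = run_batch(picky_kernel, np.array([3.0]), np.array([1.0]))
print(single.shape)  # (1,)
```

This only works for elementwise kernels, where the extra row cannot influence the first one; that matches the scalar-per-row columns described in the API section.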

  Add tests/ctable/test_ctable_dsl_columns.py (20 tests) covering values,
  dtype inference, partial slicing, where() (incl. multi-chunk streaming),
  persistence, compact, materialize, and generated-column autofill/refresh/index.
…d= params

This lets users switch JIT on/off and change backends entirely from the
command line without touching code, which is the natural experimentation
workflow.
Contributor

Copilot AI left a comment

Pull request overview

This PR makes DSL kernels (functions decorated with @blosc2.dsl_kernel) first-class inputs for CTable derived columns, enabling both virtual computed columns and stored generated/materialized columns to be backed by DSL kernels and to survive save/open round-trips.

Changes:

  • Extend CTable.add_computed_column(), add_generated_column(), materialize_computed_column(), and where() to accept DSL kernels / LazyUDF predicates, with persistence of dsl_source (+ optional jit_backend).
  • Add dsl_kernel.kernel_from_source() and refactor structured LazyUDF decoding to reuse it.
  • Improve DSL ergonomics: lazyudf() can infer dtype for DSL kernels; convert_inputs() unwraps CTable.Column; BLOSC_ME_JIT now overrides both jit and jit_backend.
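
As a rough picture of the `convert_inputs()` change, the sketch below unwraps column-like objects to their backing array with a single attribute lookup. The attribute name `_raw_col` is the one mentioned in this PR's review follow-up commit; `FakeNDArray`, `FakeColumn`, `unwrap` and `convert_inputs_sketch` are hypothetical stand-ins and do not reproduce blosc2's classes or code.

```python
class FakeNDArray:
    """Stand-in for the backing array."""

    def __init__(self, data):
        self.data = list(data)


class FakeColumn:
    """Stand-in for a column wrapper that exposes its storage via _raw_col."""

    def __init__(self, backing):
        self._backing = backing
        self.lookups = 0

    @property
    def _raw_col(self):
        self.lookups += 1  # count how often the property is evaluated
        return self._backing


def unwrap(obj):
    """Return the backing array for column-like objects, else obj itself."""
    raw = getattr(obj, "_raw_col", None)  # one lookup, no hasattr() beforehand
    return obj if raw is None else raw


def convert_inputs_sketch(inputs):
    """Unwrap every column-like operand and pass everything else through."""
    return [unwrap(x) for x in inputs]
```

Using `getattr` with a default avoids evaluating the property twice, which is what a `hasattr` check followed by an attribute access would do.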

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/ndarray/test_ndarray.py | Adds coverage for blosc2.array() copy semantics vs asarray(). |
| tests/ctable/test_ctable_dsl_columns.py | New end-to-end tests for DSL-backed computed/generated columns and persistence. |
| tests/ctable/test_ctable_computed_cols.py | Expands existing computed-column suite with DSL/LazyUDF behaviors and jit_backend persistence tests. |
| src/blosc2/lazyexpr.py | Adds Column unwrapping in convert_inputs, optional dtype inference for DSL lazyudf, and env var precedence for JIT. |
| src/blosc2/dsl_kernel.py | Introduces kernel_from_source() utility for reconstructing persisted DSL kernels. |
| src/blosc2/ctable.py | Implements DSL transformer normalization, persistence, evaluation paths, and where() support for LazyUDF. |
| src/blosc2/b2objects.py | Refactors DSL LazyUDF deserialization to use kernel_from_source(). |
| examples/ctable/udf-computed-col.py | New example demonstrating DSL computed/generated columns and persistence. |
| doc/reference/ctable.rst | Updates CTable docs to distinguish computed vs generated columns and documents DSL support. |
| doc/getting_started/overview.rst | Adds introductory CTable overview section and JIT performance tip. |
| bench/ctable/query-backends.py | Adds a benchmark for CTable.where() across interpreted/tcc/cc backends. |

Comment thread src/blosc2/lazyexpr.py
Comment thread src/blosc2/ctable.py
Comment thread src/blosc2/ctable.py Outdated
Comment thread src/blosc2/dsl_kernel.py
Comment thread src/blosc2/dsl_kernel.py
Comment thread doc/reference/ctable.rst
   - lazyexpr.py: avoid double _raw_col property lookup (hasattr + access)
   - ctable.py: only materialise referenced DSL computed columns in
     _where_expression_operands, not all of them eagerly
   - ctable.py: preserve jit_backend in DSL computed column metadata
     during _empty_copy (was silently dropped)
   - dsl_kernel.py: kernel_from_source() now validates source contains
     only a function definition (rejects side-effectful top-level nodes)
   - dsl_kernel.py: raise ValueError with clear message when source does
     not define the requested function name (was a cryptic KeyError)
   - ctable.rst: add security warning about opening .b2d files from
     untrusted sources when DSL columns are present
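
The two `kernel_from_source()` fixes listed above (structural validation and a clear `ValueError` for a missing name) can be sketched with the standard `ast` module. This is an illustration of the idea under stated assumptions, not the code in dsl_kernel.py: the name `kernel_from_source_sketch`, its two-argument signature, and the choice to allow several function definitions are all hypothetical.

```python
import ast


def kernel_from_source_sketch(source, name):
    """Rebuild a function from stored source text after basic checks."""
    tree = ast.parse(source)
    for node in tree.body:
        # Imports, calls, assignments and so on would run at exec() time,
        # so anything other than a plain function definition is rejected.
        if not isinstance(node, ast.FunctionDef):
            raise ValueError(
                f"unexpected top-level {type(node).__name__}; "
                "source must contain only function definitions"
            )
    if not any(node.name == name for node in tree.body):
        # Fail early with a readable message instead of a KeyError below.
        raise ValueError(f"source does not define a function named {name!r}")
    namespace = {}
    # Empty builtins reduce, but do not eliminate, what the code can reach.
    exec(compile(tree, "<dsl_source>", "exec"), {"__builtins__": {}}, namespace)
    return namespace[name]


add = kernel_from_source_sketch("def add(a, b):\n    return a + b\n", "add")
print(add(2, 3))  # 5
```

A check like this is not a sandbox: decorators and default-argument expressions are still evaluated when the definition runs, and the function body can do anything once called. That is consistent with the security warning added to ctable.rst about files from untrusted sources.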
@FrancescAlted FrancescAlted merged commit 2987623 into main Jun 4, 2026
13 of 22 checks passed
@FrancescAlted FrancescAlted deleted the dsl-kernels-as-cols branch June 4, 2026 11:05