Use a proc macro for being able to derive schemas for our action structs #129
nicklan wants to merge 19 commits into delta-io:main
Conversation
ryan-johnson-databricks left a comment
Thanks for taking a stab at this! Having something to iterate on is so much better than guessing what it might be like to implement it.
Once we get this sorted out, we should also be able to derive the various visitors such as MetadataVisitor?
```rust
fn get_data_type(path_segment: &PathSegment) -> Option<TokenStream> {
```
Some related issues here...
The name of the field is wrong. Unfortunately the struct name isn't exactly what we want (i.e. the struct is named `Metadata` but the schema name is `metaData`; also `deletionVector` vs. `DeletionVectorDescriptor`). We can likely use an attribute or a transformation rule to fix this.
I think the problem is that we try to return a StructField instead of DataType, which also makes life harder for everyone.
For example, the schema of any type doesn't have a "name" -- the type only becomes associated with a field name if/when it becomes a field of some struct. In Delta spark, for example, the Metadata action's name is metaData because that's the name given to it by the SingleAction type that unions all other action types, and the latter provides the schema we use to parse commit .json files.
Using StructField instead of DataType also leads to this get_data_type function. Once we define a schema as DataType instead, we can just impl GetSchema for all the basic types we want to support. See playground example.
The key to handling nullability is to recognize that it's not a property of the data type that might be null -- it's a property of the owning complex type. Thus, we would not impl<T: GetSchema> GetSchema for Option<T>. Instead, e.g. impl<T: GetSchema> GetSchema for Vec<T> covers non-nullable array elements, and impl<T: GetSchema> GetSchema for Vec<Option<T>> covers nullable array elements.
If we do all that, this macro's job gets simpler: Figure out the name, type, and nullability of each field (with name literally being the struct field name, type being the "base" type of that field, and nullability decided by whether the base type is wrapped in Option or not). The base types are handled recursively by appeal to the GetSchema trait, and any nested object that fails to implement the trait will trigger a compilation failure because the type bound is not satisfied.
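A minimal self-contained sketch of that idea, using toy `DataType`/`GetSchema` stand-ins (the comment proposes generic `Vec<T>` / `Vec<Option<T>>` impls; concrete element types are used here for brevity):

```rust
// Toy stand-ins for the kernel's schema types, to illustrate the idea that
// nullability belongs to the owning container rather than to the element type.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Long,
    Array { element: Box<DataType>, contains_null: bool },
}

trait GetSchema {
    fn get_schema() -> DataType;
}

impl GetSchema for i64 {
    fn get_schema() -> DataType {
        DataType::Long
    }
}

// There is deliberately no `impl GetSchema for Option<i64>`: the two container
// impls below distinguish non-nullable vs. nullable array elements instead.
impl GetSchema for Vec<i64> {
    fn get_schema() -> DataType {
        DataType::Array { element: Box::new(i64::get_schema()), contains_null: false }
    }
}

impl GetSchema for Vec<Option<i64>> {
    fn get_schema() -> DataType {
        DataType::Array { element: Box::new(i64::get_schema()), contains_null: true }
    }
}

fn main() {
    // The element type alone never says "nullable"; the container does.
    assert_eq!(
        <Vec<Option<i64>> as GetSchema>::get_schema(),
        DataType::Array { element: Box::new(DataType::Long), contains_null: true }
    );
    assert_eq!(
        <Vec<i64> as GetSchema>::get_schema(),
        DataType::Array { element: Box::new(DataType::Long), contains_null: false }
    );
}
```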
Yeah, that makes sense. Perhaps once #109 merges, we can look at changing the way we represent schemas and then make this macro less complex.
```rust
if let Some(fin) = type_path.path.segments.iter().last() {
    get_data_type(fin)
} else {
    panic!("Path for generic type must have last")
```
Right now all error handling is via `panic`. Is that a problem for macros that are "running" at compile-time? Intuitively, it should just result in a compilation error?
Yeah, it's not really an issue, but doing something more complex will give nicer errors. Not something we need to do up front, hence all the panics here.
```diff
  let size: Option<i64> = getters[5].get_opt(row_index, "remove.size")?;

- // TODO(nick) stats are skipped in getters[6] and tags are skipped in getters[7]
+ // TODO(nick) tags are skipped in getters[6]
```
the stats field appears to have been mistakenly copied over into the schema from Add. Remove does not actually have a stats field, so this was all incorrect below (but we just hadn't tested it properly before)
Next PR: derive macro for basic visitors? That would eliminate the possibility that they get out of sync.
For fields that are themselves structs, we could use that struct's own visitor (after verifying that the first non-nullable sub-field is non-null, in case the field was also nullable).
The one annoyance is all those e.g. "remove.xxx" error message helpers -- we'd have to either derive remove from the struct name, or else tell the macro the field name to use.
```diff
  /// Map containing metadata about this logical file.
- pub tags: HashMap<String, Option<String>>,
+ pub tags: Option<HashMap<String, Option<String>>>,
```
tags are optional in the spec, so this is a bug fix
```rust
let tokens: Vec<TokenTree> = list.tokens.clone().into_iter().collect();
match tokens[..] {
```
Out of curiosity -- do we actually need to clone tokens? Or can we just slice and match it directly?
```diff
- let tokens: Vec<TokenTree> = list.tokens.clone().into_iter().collect();
- match tokens[..] {
+ match list.tokens[..] {
```
(might not need the ref matching any more after that?)
[TokenStream](https://doc.rust-lang.org/proc_macro/struct.TokenStream.html) doesn't support being used as a slice; it's really only an iterator, which is why we collect it and then use it as a slice.
I think I could probably re-write this to avoid the clone, but it would look more like the previous code where we'd have to match one token at a time, and was much uglier.
```diff
- let schema = StructType::new(vec![crate::actions::schemas::METADATA_FIELD.clone()]);
  let mut visitor = MetadataVisitor::default();
- data.extract(Arc::new(schema), &mut visitor)?;
+ data.extract(Metadata::get_schema(), &mut visitor)?;
```
I'm not convinced this get_schema method is helpful. Every read we perform is ultimately some projection of fields from actions::get_log_schema(), and the latter already names all the fields (metaData in this case).
Instead of "creating" a metadata schema, and needing to worry about "magically" getting the right field name, can we just filter get_log_schema()? Spark's StructType has specific methods for extracting one or several fields -- you pass 1+ fields to be projected, and they are returned in schema order. The rust analogue would be something like:
```diff
- data.extract(Metadata::get_schema(), &mut visitor)?;
+ data.extract(get_log_schema().project_one("metaData"), &mut visitor)?;
```
or, for the add+remove case in scan/mod.rs below,

```rust
let action_schema = Arc::new(StructType::new(vec![
    Option::<Add>::get_field("add"),
    Option::<Remove>::get_field("remove"),
]));
```

becomes

```rust
let action_schema = Arc::new(get_log_schema().project(&["add", "remove"]));
```

The one bummer is, I don't see any way to actually get away from the field names. Even if we use e.g. Metadata::get_schema() to "hide" the name, the selection logic still needs to know it.
Alternatively, we could observe that the above is equivalent to:
```diff
- data.extract(Metadata::get_schema(), &mut visitor)?;
+ data.extract(Arc::new(StructType::new(vec![Metadata::get_field("metaData")])), &mut visitor)?;
```
... which is annoying because of arc/struct/vec wrappings. But we can fix that once for everyone by defining something like:

```rust
trait GetSchema: GetField {
    fn get_schema(name: impl Into<String>) -> SchemaRef {
        Arc::new(StructType::new(vec![Self::get_field(name)]))
    }
}
```

... which produces here:
```diff
- data.extract(Metadata::get_schema(), &mut visitor)?;
+ data.extract(Metadata::get_schema("metaData"), &mut visitor)?;
```
I actually quite like the project option. It has a few advantages:

- It lets only `LOG_SCHEMA` have to be in a `lazy_static`, so we can get rid of the somewhat tricky `OnceLock` construct currently used to make the generated schema static
- We no longer need the annotations to allow field rename, so we can remove the most complex macro parsing code
Sounds good to me!
We still need to solve the problem that the magic constant column name passed to project might not be correct, and trigger a runtime error. But that problem existed before, and at least now we can define a constant (in LogSchema perhaps?) for each top-level column name if we want?
```diff
  let mut visitor = ProtocolVisitor::default();
- let schema = StructType::new(vec![crate::actions::schemas::PROTOCOL_FIELD.clone()]);
- data.extract(Arc::new(schema), &mut visitor)?;
+ data.extract(Protocol::get_schema(), &mut visitor)?;
```
How does this work? The protocol column of the EngineData should be nullable, since most rows will contain some other type. But this is non-nullable?
It works because even if we didn't read any protocol objects we still read using a schema with a protocol column in it. That means that at the top level when you do column_by_name("protocol") in the arrow you do get a StructArray with children that match the schema, it's just that those columns are all null.
This would give `Error::MissingData("Found required field protocol, but it's null")` if called on an EngineData that had been read with the incorrect schema. That feels maybe correct, but we could also have it just not error out but return `None` in that case.
It does get a bit tricky to model though. What are the semantics of a struct that is nullable, with fields that are not? I guess it's reasonable to say that if the struct is null, everything can be null. I'd have to modify the existing extract code though, I don't think it would handle that case properly (i.e. it would complain that your schema says a Protocol must have a minReaderVersion even if the data had no protocol column at all and we marked protocol as nullable)
What are the semantics of a struct that is nullable, with fields that are not?
IIRC, parquet handles this case very badly in practice (corrupt file). Spark compensates by trying to force all children of a nullable field to themselves be nullable. See StructType.asNullable, for example, though the latter only goes one layer deep instead of being fully transitive.
Given that spark, parquet, and arrow all seem to treat null-struct vs struct-of-null as ~equivalent, maybe we should just formalize the idea in kernel? An exploded field is nullable if it or any parent is nullable, and null if it or any parent is null?
An exploded field is nullable if it or any parent is nullable, and null if it or any parent is null?
yes, I think this is the most logical way to represent this. I can update our extraction code to do this, and we'll need to be careful to document it for connectors so their extraction code can do the same.
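The agreed rule can be sketched as a small walk over a toy schema tree (hypothetical `FieldNode` type, not the kernel's actual extraction code): each exploded leaf is nullable iff it or any ancestor is nullable.

```rust
// Toy schema node: a field with its own nullability and optional children.
struct FieldNode {
    name: String,
    nullable: bool,
    children: Vec<FieldNode>,
}

// Explode the tree into (dotted path, effective nullability) pairs, where a
// field's effective nullability ORs in every ancestor's nullability.
fn explode(node: &FieldNode, parent_nullable: bool, prefix: &str, out: &mut Vec<(String, bool)>) {
    let nullable = parent_nullable || node.nullable;
    let path = if prefix.is_empty() {
        node.name.clone()
    } else {
        format!("{prefix}.{}", node.name)
    };
    if node.children.is_empty() {
        out.push((path, nullable));
    } else {
        for child in &node.children {
            explode(child, nullable, &path, out);
        }
    }
}

fn main() {
    // protocol is nullable at the top level, but minReaderVersion is not:
    let protocol = FieldNode {
        name: "protocol".into(),
        nullable: true,
        children: vec![FieldNode {
            name: "minReaderVersion".into(),
            nullable: false,
            children: vec![],
        }],
    };
    let mut out = Vec::new();
    explode(&protocol, false, "", &mut out);
    // The non-nullable child inherits nullability from its nullable parent.
    assert_eq!(out, vec![("protocol.minReaderVersion".to_string(), true)]);
}
```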
Co-authored-by: Ryan Johnson <ryan.johnson@databricks.com>
```diff
- Self::extract_columns_from_array(out_col_array, schema, None)?;
  } else if array.is_none() || field.is_nullable() {
      if let DataType::Struct(inner_struct) = field.data_type() {
+         Self::extract_columns_from_array(out_col_array, inner_struct.as_ref(), None)?;
```
this was a bug before where we were passing the parent schema instead of the child one
```rust
}
} else {
    quote_spanned! {field.span()=>
        #type_ident::get_field(stringify!(#name))
```
Don't we need to emit the fully qualified type name, in case the user didn't use the (full) path to it?
(especially since, if I understand correctly, this is an unresolved token stream, so any qualifiers the user gave are probably needed for it to compile at all)
```rust
proc_macro::TokenStream::from(output)
}

// turn our struct name into the schema name, goes from snake_case to camelCase
```
I don't know where to put the doc comment, but somewhere we should be careful to explain that the actual field names are all mandated by Delta spec, and so the user of this macro is responsible to ensure that e.g. Metadata::schema_string is the snake-case-ified version of schemaString from Delta's Change Metadata action, in order to keep rust happy. This macro is written with the assumption that it merely undoes that (previously correctly performed) transformation.
The same explains why it's ok to use to_ascii_uppercase below -- all Delta field names are plain ASCII.
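A minimal sketch of that transformation (hypothetical helper name, not the macro's actual code), relying only on `to_ascii_uppercase` since all Delta field names are plain ASCII:

```rust
// Undo the Rust naming convention: schema_string -> schemaString.
// Assumes ASCII input, which holds for all Delta spec field names.
fn snake_to_camel(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut upper_next = false;
    for c in s.chars() {
        if c == '_' {
            // Drop the underscore and capitalize the next character instead.
            upper_next = true;
        } else if upper_next {
            out.push(c.to_ascii_uppercase());
            upper_next = false;
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(snake_to_camel("schema_string"), "schemaString");
    assert_eq!(snake_to_camel("min_reader_version"), "minReaderVersion");
    assert_eq!(snake_to_camel("tags"), "tags");
}
```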
```rust
(f64, DataType::DOUBLE),
(bool, DataType::BOOLEAN),
(HashMap<String, String>, MapType::new(DataType::STRING, DataType::STRING, false)),
(HashMap<String, Option<String>>, MapType::new(DataType::STRING, DataType::STRING, true)),
```
Do we have a ticket somewhere that tracks getting rid of this map-of-option-of-string thing?
(it keeps coming up in various PRs)
Delta was originally written in Scala (Java variant), where there's no such thing as a nullable map entry: looking up a non-existent key returns null as a sentinel value. That makes it hard for me to imagine a case where map-of-option-of-string could ever be anything but a semantic overhead.
I guess we need to double check whether the json-serialized form of a string-string map somehow allows null entries. But even if it does somehow happen, I'd rather filter out such entries at parsing time rather than propagate them through the rest of the system?
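The "filter out such entries at parsing time" option could look something like this (hypothetical helper, sketched with std types only): drop null-valued entries so the rest of the system only ever sees `HashMap<String, String>`.

```rust
use std::collections::HashMap;

// Discard entries whose value is None, so downstream code never has to deal
// with the map-of-option-of-string shape.
fn drop_null_entries(m: HashMap<String, Option<String>>) -> HashMap<String, String> {
    m.into_iter().filter_map(|(k, v)| v.map(|v| (k, v))).collect()
}

fn main() {
    let mut raw = HashMap::new();
    raw.insert("a".to_string(), Some("1".to_string()));
    raw.insert("b".to_string(), None); // a hypothetical null entry from JSON
    let clean = drop_null_entries(raw);
    assert_eq!(clean.len(), 1);
    assert_eq!(clean.get("a"), Some(&"1".to_string()));
}
```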
```rust
.iter()
.map(|name| {
    self.fields
        .get_index_of(name.as_ref())
```
This will have O(n*m) cost, where n is the schema arity and m is the projection arity. Perhaps we could borrow spark's approach, which creates a set from the names (presumably m is much smaller than n) and then does a single filter-map pass over the fields, returning only those fields whose name is present in the set of names. Also avoids the other two passes (sort and index).
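The spark-style approach can be sketched like this (toy `&str` fields standing in for `StructField`s): build a set from the requested names, then keep fields in a single pass, which also preserves schema order and costs O(n + m).

```rust
use std::collections::HashSet;

// Project a schema (a list of field names, in schema order) down to the
// requested names with one set build plus one filter pass.
fn project<'a>(schema_fields: &[&'a str], names: &[&str]) -> Vec<&'a str> {
    let wanted: HashSet<&str> = names.iter().copied().collect();
    schema_fields
        .iter()
        .copied()
        .filter(|f| wanted.contains(f))
        .collect()
}

fn main() {
    let log_schema = ["metaData", "protocol", "add", "remove"];
    // Result comes back in schema order regardless of the order requested.
    assert_eq!(project(&log_schema, &["remove", "add"]), vec!["add", "remove"]);
    assert_eq!(project(&log_schema, &["protocol"]), vec!["protocol"]);
}
```

Note the order-preservation shown above is exactly the "surprising pitfall" raised later in this review: callers get fields in schema order, not request order.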
```diff
-     crate::actions::schemas::REMOVE_FIELD.clone(),
- ]
+ let schema_to_use = if is_log_batch {
+     get_log_schema().project_as_schema(&[ADD_NAME, REMOVE_NAME])?
```
Because project_as_schema preserves schema order, this will only work because add comes before remove in the log schema. Could be a surprising pitfall worth a comment?
```diff
-     crate::actions::schemas::PROTOCOL_FIELD.clone(),
- ]);
+ let schema = get_log_schema().project_as_schema(&[METADATA_NAME, PROTOCOL_NAME])?;
  let data_batches = self.replay(engine_interface, Arc::new(schema), None)?;
```
similar to other comment -- now this is only correct if metadata comes before protocol in the log schema.
Work to be able to do:

This also then uses the generated schemas from our actions everywhere, and removes the old schema definitions. To use these we define a static `LOG_SCHEMA` which calls all the generated `get_field` methods to build the total log schema. We also add a `project` method to `StructType` so we can then pick out the columns we want when doing something like parsing only metadata.

How it works

We introduce a trait, `GetField`, and adding the derive will generate a `GetField` impl for the annotated struct.

Right now all error handling is via `panic`, but that's generally okay as the compiler just fails with the panic message.

If you want to inspect the generated code, you can use cargo expand. Install it, and then run something like:

```shell
$ cargo expand | grep -A50 "impl crate::actions::GetField for Protocol"
```

and replace `Protocol` with whatever struct you want to see the impl for.
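The trait body and the example generated impl did not survive extraction, so here is a hedged sketch of the shape they might take, using simplified stand-in types (`DataType`, `StructField`, and the `Protocol` impl below are illustrative, not the kernel's real definitions):

```rust
// Simplified stand-ins for the kernel's schema types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Integer,
    Struct(Vec<StructField>),
}

#[derive(Debug, Clone, PartialEq)]
struct StructField {
    name: String,
    data_type: DataType,
    nullable: bool,
}

// The trait the derive targets: given a field name, produce that field's schema.
trait GetField {
    fn get_field<N: Into<String>>(name: N) -> StructField;
}

impl GetField for i32 {
    fn get_field<N: Into<String>>(name: N) -> StructField {
        StructField { name: name.into(), data_type: DataType::Integer, nullable: false }
    }
}

// Wrapping a field's type in `Option` marks the *field* nullable.
impl<T: GetField> GetField for Option<T> {
    fn get_field<N: Into<String>>(name: N) -> StructField {
        let mut field = T::get_field(name);
        field.nullable = true;
        field
    }
}

#[allow(dead_code)]
struct Protocol {
    min_reader_version: i32,
    min_writer_version: i32,
}

// Roughly what the derive might generate for Protocol (hand-written here):
// each struct field becomes a get_field call with its camelCase schema name.
impl GetField for Protocol {
    fn get_field<N: Into<String>>(name: N) -> StructField {
        StructField {
            name: name.into(),
            data_type: DataType::Struct(vec![
                i32::get_field("minReaderVersion"),
                i32::get_field("minWriterVersion"),
            ]),
            nullable: false,
        }
    }
}

fn main() {
    let field = Protocol::get_field("protocol");
    assert_eq!(field.name, "protocol");
    assert!(!field.nullable);
    // Option<T> at a field site only flips the nullability bit.
    assert!(Option::<i32>::get_field("x").nullable);
}
```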