feat(kafka): support Avro-encoded message key via schema registry #1201
Open
burnerlee wants to merge 6 commits into
Added a new setting `message_key_schema_name` to Kafka external streams. When set, the Kafka message key is decoded using the Confluent Avro wire format (5-byte prefix: 0x00 magic + 4-byte schema ID) against the named subject in the configured schema registry, and the result is emitted as a JSON string into the `_tp_message_key` virtual column.

Changes:
- ExternalStreamSettings.h: add `message_key_schema_name` String setting
- Kafka.cpp: validate setting requires `kafka_schema_registry_url`; restrict `_tp_message_key` column to String/FixedString when Avro key mode is active; build and store a `KafkaSchemaRegistryForAvro` for the key subject in the constructor; pass registry and subject name to each KafkaSource
- Kafka.h: add `avro_key_schema_registry` member; include KafkaSchemaRegistryForAvro
- KafkaSource.h: add `avro_key_schema_registry` and `avro_key_schema_subject` members; add constructor params; declare `decodeAvroKey()` private method
- KafkaSource.cpp: implement `decodeAvroKey()` using AvroDeserializer against a single String column header; branch both the typed (String/FixedString switch case) and untyped (RESERVED_MESSAGE_KEY fallback) virtual column paths to call `decodeAvroKey()` when the registry is present, raw bytes otherwise
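For reviewers unfamiliar with the wire format mentioned above, here is a minimal sketch of how the 5-byte prefix can be split off a key. The names `ConfluentKeyHeader` and `parseConfluentPrefix` are hypothetical and do not come from this PR; the sketch only assumes the standard layout (magic byte 0x00 followed by a big-endian 4-byte schema ID) and leaves the Avro body untouched.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical helper for illustration only; not the PR's decodeAvroKey().
struct ConfluentKeyHeader
{
    uint32_t schema_id;       // schema ID, stored big-endian on the wire
    const uint8_t * payload;  // Avro binary body that follows the prefix
    size_t payload_len;       // number of bytes in the Avro body
};

// Returns std::nullopt when the buffer is too short or the magic byte is not 0x00.
std::optional<ConfluentKeyHeader> parseConfluentPrefix(const uint8_t * data, size_t len)
{
    constexpr size_t prefix_len = 5;  // 1 magic byte + 4 schema ID bytes
    if (data == nullptr || len < prefix_len || data[0] != 0x00)
        return std::nullopt;

    // Assemble the schema ID from network byte order (most significant byte first).
    const uint32_t id = (static_cast<uint32_t>(data[1]) << 24)
                      | (static_cast<uint32_t>(data[2]) << 16)
                      | (static_cast<uint32_t>(data[3]) << 8)
                      |  static_cast<uint32_t>(data[4]);

    return ConfluentKeyHeader{id, data + prefix_len, len - prefix_len};
}
```

A caller would use the returned `schema_id` to fetch the writer schema from the registry and then decode `payload` with it; how the PR wires that up is not shown here.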
Collaborator
We also need to implement the KafkaSink side so that it encodes the message key, or throw an exception on write if that is not supported yet.
- Return empty string (not decode) when key_len == 0 and column is non-nullable
- Remove unused avro_key_schema_subject param/member; schema ID comes from key bytes
- Use angle-bracket includes for Formats headers in Kafka.cpp

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
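The empty-key rule in this commit can be sketched as below. This is an illustration under assumptions, not the PR's `KafkaSource` code: `keyColumnValue` and `KeyDecoder` are hypothetical names, `std::nullopt` stands in for SQL `NULL`, and the sketch assumes the empty-key check runs before the Avro/raw-bytes branch.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-in for an Avro key decoder; an empty function means
// "no key schema registry configured".
using KeyDecoder = std::function<std::string(const char *, size_t)>;

// Computes the value to store in the key column for one message.
// std::nullopt models NULL in a Nullable(String) column.
std::optional<std::string> keyColumnValue(
    const char * key, size_t key_len, bool column_is_nullable, const KeyDecoder & avro_decoder)
{
    if (key_len == 0)
    {
        // Nothing to decode: NULL for nullable columns, empty string otherwise.
        if (column_is_nullable)
            return std::nullopt;
        return std::string{};
    }

    // Avro key mode: hand the bytes to the decoder.
    if (avro_decoder)
        return avro_decoder(key, key_len);

    // Default mode: pass the raw key bytes through unchanged.
    return std::string(key, key_len);
}
```

Checking the length first means the decoder never sees a zero-length buffer, which is the case this commit guards against.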
Replace broken AvroDeserializer approach (which mapped Avro fields by column name and always returned empty string) with GenericDatum round-trip: binary Avro -> GenericDatum -> Avro JSON string. Also fix include style in KafkaSource.h (angle brackets). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writing _tp_message_key as Confluent Avro binary is not yet implemented — only decoding on read is supported. Throw a clear error on write with guidance to use a plain string key instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
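The write-side guard described in this commit amounts to a check along the following lines. The exception type `KeyWriteNotImplemented` and the function `checkKeyWriteSupported` are hypothetical stand-ins (the real code would presumably raise the engine's own `NOT_IMPLEMENTED` error), and the message text is illustrative.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the engine's NOT_IMPLEMENTED error code.
struct KeyWriteNotImplemented : std::logic_error
{
    using std::logic_error::logic_error;
};

// Rejects writes when Avro key mode is configured; plain string keys pass through.
void checkKeyWriteSupported(const std::string & message_key_schema_name)
{
    if (!message_key_schema_name.empty())
        throw KeyWriteNotImplemented(
            "Writing an Avro-encoded message key is not supported; "
            "unset message_key_schema_name and write a plain string key instead");
}
```

Failing loudly at write time avoids silently producing keys that consumers expecting the Confluent wire format could not decode.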
Contributor
Author
@yuzifeng1984 how can I test my changes locally? Can you share your testing setup?
PR checklist:
- proton: starts/ends for new code in existing community code base? Not applicable: all changes are in `src/Storages/ExternalStream/Kafka/`, which is Proton-specific code

Please write user-readable short description of the changes:

Adds a new `message_key_schema_name` setting to Kafka external streams. When set, Proton decodes the Kafka message key using the Confluent Avro wire format (5-byte wire prefix: magic byte + 4-byte schema ID) and emits the decoded value as a JSON string into `_tp_message_key`.

Motivation: Producers that use Confluent Schema Registry commonly Avro-encode message keys, not just values. Reading `_tp_message_key` on such topics previously returned raw bytes.

Behaviour:
- `message_key_schema_name` names an Avro schema subject in the schema registry
- Requires `kafka_schema_registry_url` (validated at `CREATE EXTERNAL STREAM` time)
- The `_tp_message_key` column must be `String` or `Nullable(String)` in Avro key mode (validated at `CREATE` time)
- The schema is looked up via the `KafkaSchemaRegistryForAvro` mechanism
- The key is decoded into a `GenericDatum`, then re-encoded as Avro JSON (not plain JSON: union types use Avro union encoding, bytes fields are base64). The result is stored as a string in `_tp_message_key`
- Writes throw `NOT_IMPLEMENTED` if `message_key_schema_name` is set: Avro-encoding keys on write is not yet supported. Plain string keys via `_tp_message_key` without `message_key_schema_name` continue to work
- An empty key (`key_len == 0`) returns `NULL` for `Nullable(String)` columns, empty string for `String` columns

Example:

    CREATE EXTERNAL STREAM orders (
        _tp_message_key String,
        payload String
    )
    SETTINGS
        type = 'kafka',
        brokers = 'localhost:9092',
        topic = 'orders',
        data_format = 'JSONEachRow',
        kafka_schema_registry_url = 'http://localhost:8081',
        message_key_schema_name = 'orders-key';

    SELECT _tp_message_key FROM orders;
    -- Returns Avro JSON string, e.g. {"user_id": 42, "tenant": "acme"}

Closes #915