[KYUUBI #7379][2a/4] Data Agent Engine: tool system, data source, and prompt templates#7400
Conversation
…urce, and prompt templates
Tool system with risk-based separation (RunSelectQueryTool / RunMutationQueryTool),
ToolRegistry with JSON schema generation, and SqlReadOnlyChecker keyword whitelist.
Data source abstraction with JdbcDialect auto-detection (Spark/Trino/MySQL/SQLite),
GenericDialect fallback for unknown JDBC subprotocols, TableRef value object for
structured table references with Jackson deserialization support, and HikariCP-backed
DataSourceFactory with credential isolation.
Composable SystemPromptBuilder with per-dialect prompt templates (base.md +
datasource-{name}.md), SQL workflow guidance, and query risk classification.
… remove jdbcUrl shortcut - SystemPromptBuilder.datasource() now replaces the previous datasource section instead of appending, matching the single-datasource-per-session model - Remove jdbcUrl() convenience method; callers use JdbcDialect.fromUrl() directly - Remove redundant tests (pool config defaults, toString, tool metadata, fast timeout) - Clean up todo comment in JdbcDialect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hi @pan3793, could you help review this PR when you have time? This is Part 2a of the Data Agent Engine series — it adds the tool system, data source abstraction, and prompt builder. Thanks! 🙏 |
Register the three Apache-2.0 libraries bundled by the data-agent engine (openai-java, victools jsonschema-generator and jsonschema-module-jackson) so the binary distribution's LICENSE file reflects the dependencies it ships.
There was a problem hiding this comment.
Pull request overview
This PR (KYUUBI #7379, part 2a/4) introduces the Data Agent Engine’s foundational “tool + datasource + prompt” infrastructure that later PRs will use to run SQL via JDBC and interact with an LLM using OpenAI-style function calling.
Changes:
- Add tool SPI (
AgentTool), registry (ToolRegistry) with JSON-schema generation, and SQL tools (run_select_query,run_mutation_query). - Add datasource abstraction (
JdbcDialect,TableRef,DataSourceFactory) and dialect-specific prompt templates. - Expand configuration/docs (timeouts) and add extensive unit + integration tests (SQLite + MySQL via Testcontainers).
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| pom.xml | Adds dependency/version management for OpenAI SDK, victools JSON schema generator, and MySQL Testcontainers. |
| kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala | Refines SQL timeout config doc and adds tool call timeout config. |
| docs/configuration/settings.md | Updates Data Agent timeout settings documentation and adds tool call timeout setting. |
| LICENSE-binary | Adds third-party notices for new dependencies. |
| externals/kyuubi-data-agent-engine/pom.xml | Adds module dependencies for OpenAI SDK + schema generator + MySQL integration testing. |
| externals/kyuubi-data-agent-engine/src/main/scala/org/apache/kyuubi/engine/dataagent/operation/ExecuteStatement.scala | Emits tool args as structured JSON in SSE events. |
| externals/kyuubi-data-agent-engine/src/main/resources/prompts/base.md | Adds base system prompt template and tool usage rules. |
| externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-mysql.md | Adds MySQL-specific prompt guidance. |
| externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-spark.md | Adds Spark SQL prompt guidance. |
| externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-sqlite.md | Adds SQLite-specific prompt guidance. |
| externals/kyuubi-data-agent-engine/src/main/resources/prompts/datasource-trino.md | Adds Trino-specific prompt guidance. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/AgentTool.java | Defines the tool interface used by the agent runtime. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/ToolRegistry.java | Implements tool registration, OpenAI-tool export, JSON args deserialization, and timeout enforcement. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/ToolSchemaGenerator.java | Generates JSON Schema for tool args via victools + Jackson annotations. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlQueryArgs.java | Defines shared SQL tool args schema (with JSON annotations). |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlExecutor.java | Shared JDBC execution + markdown formatting for SQL tool outputs. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlReadOnlyChecker.java | Adds a keyword-based “read-only” gate for run_select_query. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunSelectQueryTool.java | Implements read-only SQL tool with read-only guard. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunMutationQueryTool.java | Implements destructive SQL tool for DML/DDL, marked destructive. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/prompt/SystemPromptBuilder.java | Composes system prompts from markdown templates + optional sections. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/DataSourceFactory.java | Creates HikariCP pooled DataSources with optional credentials. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/JdbcDialect.java | Detects dialect from JDBC URL and provides identifier quoting/qualification. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/GenericDialect.java | Fallback dialect carrying subprotocol name without quoting support. |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/SparkDialect.java | Spark dialect implementation (backticks). |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/MysqlDialect.java | MySQL dialect implementation (backticks). |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/SqliteDialect.java | SQLite dialect implementation (double quotes). |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/TrinoDialect.java | Trino dialect implementation (double quotes). |
| externals/kyuubi-data-agent-engine/src/main/java/org/apache/kyuubi/engine/dataagent/datasource/TableRef.java | Immutable catalog/schema/table reference with JSON support. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/sql/SqlReadOnlyCheckerTest.java | Unit tests for read-only SQL detection. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunSelectQueryToolTest.java | SQLite-backed tests for select tool behavior/formatting/timeouts. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/sql/RunMutationQueryToolTest.java | SQLite-backed tests for mutation tool behavior/risk level. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/ToolTest.java | Tests registry tool export, dispatch, timeouts, and error handling. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/ToolSchemaGeneratorTest.java | Tests schema generation and type/annotation mapping. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/tool/ToolRegistryThreadSafetyTest.java | Concurrency tests for registry register/access/execute. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/prompt/SystemPromptBuilderTest.java | Tests prompt composition and datasource template selection. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/WithMySQLContainer.java | Shared MySQL Testcontainers harness for integration tests. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/ToolExecutionTest.java | Registry-level MySQL integration tests (dispatch/timeout/errors/workflows). |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/RunSelectQueryTest.java | MySQL integration tests for select tool. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/RunMutationQueryTest.java | MySQL integration tests for mutation tool. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/DialectTest.java | MySQL integration tests for dialect detection/quoting + prompt integration. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/mysql/DataSourceFactoryTest.java | MySQL integration tests for DataSourceFactory pooling/auth failures. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/datasource/TableRefTest.java | Unit tests for TableRef construction/equality/JSON. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/datasource/JdbcDialectTest.java | Unit tests for dialect detection/quoting/qualification. |
| externals/kyuubi-data-agent-engine/src/test/java/org/apache/kyuubi/engine/dataagent/datasource/DataSourceFactoryAuthTest.java | SQLite-backed tests for DataSourceFactory credential handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…labels, config validation - ToolRegistry: replace unbounded cached thread pool with bounded ThreadPoolExecutor (MAX=8, SynchronousQueue, reject-fast); implement AutoCloseable so the engine can shut the pool down cleanly - SqlExecutor: use ResultSetMetaData.getColumnLabel so SQL AS aliases render in the markdown header the LLM sees - KyuubiConf: add checkValue(>= 1s) on the two data-agent timeout entries; units stay consistent with existing timeConf conventions - Tests: new MySQL integration test covering the column-alias fix; MySQL harness now uses ObjectMapper for tool args and closes the registry in teardown
…ail rationale The previous design note argued against SQL parsers by claiming they fail on dialect-specific queries — which isn't accurate for Calcite Babel with a matching SqlDialect. Replace it with the real rationale: the checker is a guardrail against LLM misrouting, not a security boundary, because the approval gate for writes is the separate run_mutation_query tool. Also drop the now-redundant Postgres EXPLAIN ANALYZE caveat from the keyword list.
|
Thanks @yaooqinn for triggering the Copilot review. Addressed the feedback in ab5b658 + 9245098: Fixed
Kept as-is (by design)
PTAL when you have time — thanks! cc @pan3793 |
| * | ||
| * <ul> | ||
| * <li>{@code SELECT} — standard query | ||
| * <li>{@code WITH} — CTE-prefixed query |
There was a problem hiding this comment.
Only checking the firstToken may not be enough, like:
FROM t1
INSERT INTO t2
SELECT c1, c2;
WITH tmp AS (SELECT * FROM t1)
INSERT INTO t2 SELECT * FROM tmp;
There was a problem hiding this comment.
Good catch, thank you @wForget — both of these are legitimate Spark/Hive bypasses. The whitelist has FROM (for the FROM t SELECT ... read variant) and WITH (for SELECT CTEs), and your examples exploit exactly those entries.
I will open a dedicated follow-up issue to track this and commit to closing the gap there. Our practical observation is that the realistic risk is moderate: modern LLMs rarely produce FROM t INSERT ... or WITH ... INSERT on their own — those patterns are thin in training data, and the tool descriptions + run_mutation_query split already steer the model toward the straightforward INSERT ... SELECT shape. The exposure is mostly when a user pastes such a statement into the conversation verbatim. Still, a tool named "read-only" should actually be read-only for the configured dialects, and we intend to fix this properly. For now, we'd like to keep the data-agent series moving and handle the hardening in the tracked follow-up.
|
@yaooqinn, use the merge script instead of the merge button, update Assignees, Milestone, when merging PRs |
Why are the changes needed?
Part 2a of 4 for the Data Agent Engine (umbrella, KPIP-7373).
This PR adds the tool system, data source abstraction, and composable prompt builder — the infrastructure that the agent runtime (PR 2b) will use to execute SQL and interact with the LLM.
Changes include:
AgentToolinterface withToolRiskLevel, JSON schema generation for LLM function callingToolRegistry— thread-safe tool registration, dispatch, timeout enforcement, and OpenAI-compatible tool definition exportRunSelectQueryTool/RunMutationQueryTool— read-only vs. mutation SQL execution, output truncation, andSqlReadOnlyCheckerSqlExecutor— shared JDBC execution logic with statement timeout and result formattingDataSourceFactory— HikariCP connection pool creation with optional user/passwordJdbcDialect— auto-detection from JDBC URL with dialect-specific identifier quoting (Spark, MySQL, SQLite, Trino, generic fallback)TableRef— catalog/schema/table reference with JSON deserialization supportSystemPromptBuilder— composable Markdown prompt assembly with date injection, per-dialect datasource sections, and free-form text sectionsbase.md,datasource-{mysql,spark,sqlite,trino}.mdkyuubi.engine.data.agent.tool.*andkyuubi.engine.data.agent.datasource.*configuration entriesHow was this patch tested?
JdbcDialectTest,TableRefTest,DataSourceFactoryAuthTest,SqlReadOnlyCheckerTest,RunSelectQueryToolTest,RunMutationQueryToolTest,ToolTest,ToolRegistryThreadSafetyTest,ToolSchemaGeneratorTest,SystemPromptBuilderTestDataSourceFactoryTest,DialectTest,RunSelectQueryTest,RunMutationQueryTest,ToolExecutionTest— all run against a real MySQL containerWas this patch authored or co-authored using generative AI tooling?
Partially assisted by Claude Code (Claude Opus 4.6) for test generation, code review, and PR formatting. Core design and implementation are human-authored.