Simple LLM scenario example based on the Airflow survey data by vikramkoka · Pull Request #65172 · apache/airflow

vikramkoka · 2026-04-13T18:08:02Z

This PR adds a simple example for the common.ai provider, built on public data: the Airflow 2025 Survey.

The goal is to demonstrate two use cases: an interactive LLM use case for trying out exploratory data analysis, and a scheduled LLM use case that is more representative of expected usage.

Developers can use both as starting points, replacing the data extraction and LLM prompts with other integrations that pull other data sets.

Was generative AI tooling used to co-author this PR?

[x] Yes (please specify the tool below)



This example demonstrates how to answer a more complex question about the survey data, one that requires LLM synthesis across multiple queries.

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_agentic.py


Copilot

Pull request overview

Adds new common.ai provider example DAGs showcasing LLM-driven analysis of the Apache Airflow 2025 survey CSV, demonstrating interactive (HITL) and scheduled/agentic (mapped fan-out/fan-in) patterns.

Changes:

Introduces an interactive + scheduled survey analysis example using LLM-to-SQL generation and DataFusion analytics execution.
Introduces an “agentic” example using Dynamic Task Mapping to decompose a question into multiple SQL queries and synthesize results.
Extends the docs spelling wordlist with terms used by the new examples.
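The decompose, fan-out, fan-in shape described for the "agentic" example can be sketched without Airflow. This is a plain-Python illustration only: `decompose`, `run_query`, and `synthesize` are hypothetical stand-ins for the LLM and mapped-task steps, the table and rows are invented, and sqlite3 is swapped in for the DataFusion execution the PR uses.

```python
import sqlite3

# Hypothetical survey rows; the real example reads the Airflow 2025 survey CSV.
ROWS = [
    ("3.x", "Kubernetes"),
    ("3.x", "Celery"),
    ("2.x", "Celery"),
]


def decompose(question: str) -> list[str]:
    """Stand-in for the LLM step that splits one question into several SQL queries."""
    return [
        "SELECT version, COUNT(*) FROM survey GROUP BY version ORDER BY version",
        "SELECT executor, COUNT(*) FROM survey GROUP BY executor ORDER BY executor",
    ]


def run_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """One unit of fan-out work: execute a single query and return its rows."""
    return conn.execute(sql).fetchall()


def synthesize(question: str, partials: list[list[tuple]]) -> dict:
    """Stand-in for the LLM synthesis step: here it only bundles the partial results."""
    return {"question": question, "parts": partials}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (version TEXT, executor TEXT)")
conn.executemany("INSERT INTO survey VALUES (?, ?)", ROWS)

question = "How do executor choices relate to Airflow version?"
queries = decompose(question)
partials = [run_query(conn, sql) for sql in queries]  # fan-out: one result per query
answer = synthesize(question, partials)  # fan-in: combine all results
print(answer)
```

In the DAG itself, the fan-out step would presumably be a mapped task rather than a list comprehension, so each query gets its own task instance and retry behaviour.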

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File	Description
providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py	New interactive + scheduled survey analysis example DAGs.
providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_agentic.py	New mapped multi-query (“agentic”) survey analysis example DAG.
docs/spelling_wordlist.txt	Adds “auditable” and “retryable” to satisfy spellcheck.

kaxil · 2026-04-15T00:42:07Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

+    def prepare_csv(csv_text: str) -> None:
+        os.makedirs(os.path.dirname(SURVEY_CSV_PATH), exist_ok=True)
+        with open(SURVEY_CSV_PATH, "w", encoding="utf-8") as f:


Already moved to module level in e690303.

kaxil · 2026-04-15T00:42:14Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

+                )
+        else:
+            print(f"Survey analysis result:\n{data}")


The data comes from a CSV the user placed on disk, not from untrusted input. HTML-escaping survey results inside a <pre> block is over-engineering for an example DAG.
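For context, the declined suggestion would have amounted to something like the sketch below. It is not code from the PR; `render_pre` is a hypothetical helper, and the only library call is the standard `html.escape`.

```python
import html


def render_pre(data: str) -> str:
    """Wrap text in a <pre> block, escaping characters that HTML treats specially."""
    return f"<pre>{html.escape(data)}</pre>"


rendered = render_pre("<b>x</b> & y")
print(rendered)
```

Escaping of this kind matters when the text can come from an untrusted source, which is the case the reply argues does not apply to a user-supplied CSV in an example DAG.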

kaxil · 2026-04-15T00:42:17Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

+        os.makedirs(os.path.dirname(SURVEY_CSV_PATH), exist_ok=True)
+        with open(SURVEY_CSV_PATH, "w", encoding="utf-8") as f:
+            f.write(csv_text)
+
+        # Write a single-row reference CSV from the schema context so
+        # LLMSchemaCompareOperator has a structured baseline to compare against.
+        os.makedirs(os.path.dirname(REFERENCE_CSV_PATH), exist_ok=True)


Same as above -- the default always has a directory. Not worth guarding in an example DAG.

kaxil · 2026-04-15T00:42:18Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_agentic.py

+from airflow.providers.common.sql.config import DataSourceConfig
+from airflow.providers.common.sql.operators.analytics import AnalyticsOperator
+from airflow.providers.standard.operators.hitl import ApprovalOperator
+


common.sql is a declared dependency of common.ai. Installing common.ai always brings common.sql with it -- no guard needed.

kaxil · 2026-04-15T00:42:21Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

+    # ------------------------------------------------------------------
+    @task
+    def extract_data(raw: str) -> str:
+        results = json.loads(raw)
+        data = [row for item in results for row in item["data"]]


Example DAGs should be self-contained and copy-pasteable. Extracting a shared helper makes them harder to use as starting points. The duplication is intentional.
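The flattening step in `extract_data` can be tried in isolation. The payload below is invented for illustration; the only thing taken from the diff is the assumption that the upstream result is a JSON list whose items each carry a "data" list of rows.

```python
import json

# Hypothetical upstream output: one item per executed query, each with its rows.
raw = json.dumps(
    [
        {"data": [{"answer": "Yes", "n": 10}, {"answer": "No", "n": 3}]},
        {"data": [{"answer": "Maybe", "n": 1}]},
    ]
)

results = json.loads(raw)
# Same comprehension as in the diff: concatenate every item's rows into one list.
data = [row for item in results for row in item["data"]]
print(data)
```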

kaxil · 2026-04-15T00:42:20Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

+
+from airflow.providers.common.ai.operators.llm_schema_compare import LLMSchemaCompareOperator
+from airflow.providers.common.ai.operators.llm_sql import LLMSQLQueryOperator
+from airflow.providers.common.compat.sdk import dag, task
+from airflow.providers.common.sql.config import DataSourceConfig
+from airflow.providers.common.sql.operators.analytics import AnalyticsOperator
+from airflow.providers.http.operators.http import HttpOperator
+from airflow.providers.standard.operators.hitl import ApprovalOperator, HITLEntryOperator
+from airflow.sdk import Param
+


common.sql is a dependency of common.ai, so it's always available. The HttpOperator try/except guard was intentionally removed -- example DAGs should fail clearly on missing deps rather than silently hiding functionality.
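The trade-off behind removing the guard can be shown with a small standard-library sketch. The module name is deliberately hypothetical and assumed not to be installed; nothing here is taken from the PR's code.

```python
import importlib

MODULE = "hypothetical_missing_provider_xyz"  # assumed not installed anywhere

# Guarded import: the failure is swallowed and the feature quietly disappears.
try:
    optional = importlib.import_module(MODULE)
except ImportError:
    optional = None

# Direct import: the failure surfaces immediately and names the missing module.
try:
    importlib.import_module(MODULE)
    error = None
except ModuleNotFoundError as exc:
    error = str(exc)

print(optional)
print(error)
```

With the guard, a reader who copies the example sees no error and has to work out why a task is missing; without it, the import error states which package to install.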

kaxil · 2026-04-15T00:42:08Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

+
+
+# [START example_llm_survey_scheduled]
+@dag(schedule="@monthly", start_date=datetime.datetime(2025, 1, 1))


Good catch -- added catchup=False in ae2aca2.
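A rough sketch of why the setting matters: with a start_date of 2025-01-01 and a monthly schedule, every first-of-month between the start date and today is a candidate run if the scheduler catches up. The helper below is plain Python, not Airflow's timetable logic, and `monthly_points` and the `as_of` date are illustrative; the number of runs Airflow would actually create depends on the timetable and catchup defaults in effect.

```python
import datetime


def next_month(moment: datetime.datetime) -> datetime.datetime:
    """Return midnight on the first day of the month after `moment`."""
    if moment.month == 12:
        return datetime.datetime(moment.year + 1, 1, 1)
    return datetime.datetime(moment.year, moment.month + 1, 1)


def monthly_points(
    start: datetime.datetime, now: datetime.datetime
) -> list[datetime.datetime]:
    """List every first-of-month midnight in the closed range [start, now]."""
    current = datetime.datetime(start.year, start.month, 1)
    if current < start:
        current = next_month(current)
    points = []
    while current <= now:
        points.append(current)
        current = next_month(current)
    return points


start_date = datetime.datetime(2025, 1, 1)  # from the @dag decorator in the diff
as_of = datetime.datetime(2026, 4, 15)  # assumed "today": the date of this review
points = monthly_points(start_date, as_of)
print(len(points), points[0].date(), points[-1].date())
```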

kaxil · 2026-04-15T00:42:15Z

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py

+        os.makedirs(os.path.dirname(SURVEY_CSV_PATH), exist_ok=True)
+        with open(SURVEY_CSV_PATH, "w", encoding="utf-8") as f:
+            f.write(csv_text)
+
+        # Write a single-row reference CSV from the schema context so
+        # LLMSchemaCompareOperator has a structured baseline to compare against.
+        os.makedirs(os.path.dirname(REFERENCE_CSV_PATH), exist_ok=True)


The default path is /opt/airflow/data/airflow-user-survey-2025.csv which always has a directory component. Setting it to a bare filename is a config error -- no need to guard against it in example code.
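The edge case under discussion is easy to reproduce with the standard library. `ensure_parent_dir` is a hypothetical helper showing what the declined guard might look like; it is not part of the PR.

```python
import os
import tempfile

default_path = "/opt/airflow/data/airflow-user-survey-2025.csv"
bare_name = "airflow-user-survey-2025.csv"

print(repr(os.path.dirname(default_path)))  # directory component present
print(repr(os.path.dirname(bare_name)))  # empty string for a bare filename

# os.makedirs("") fails, which is the error the guard would have avoided.
try:
    os.makedirs(os.path.dirname(bare_name), exist_ok=True)
    raised = False
except FileNotFoundError:
    raised = True


def ensure_parent_dir(path: str) -> str:
    """Create the parent directory of `path` if it has one; return that parent."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent


# Demonstrate the helper under a temporary directory rather than /opt.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "survey.csv")
    created = ensure_parent_dir(target)
    parent_exists = os.path.isdir(created)

skipped = ensure_parent_dir(bare_name)
```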

boring-cyborg bot added area:providers provider:common-ai labels Apr 13, 2026

Multi-query agentic example based on the survey data

0d8f24f

This example demonstrates how to answer a more complex question about the survey data, one that requires LLM synthesis across multiple queries.

jedcunningham reviewed Apr 14, 2026

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py (outdated, resolved)

providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_agentic.py (resolved)

kaxil reviewed Apr 14, 2026


Fix merge conflicts and clean up example DAGs

716d999

kaxil marked this pull request as ready for review April 14, 2026 18:55

kaxil requested a review from gopidesupavan as a code owner April 14, 2026 18:55

kaxil added 4 commits April 14, 2026 21:08

Use SmtpHook instead of EmailOperator anti-pattern, add explicit start_date

8f16b86

Remove try/except guard around HttpOperator import

34d9aae

Fix CI: mypy, docs RST, spelling, provider deps

181165a

Fix RST: add blank line after Prerequisites label

38ed385

kaxil requested a review from Copilot April 15, 2026 00:28

Copilot started reviewing on behalf of kaxil April 15, 2026 00:29

Move inline imports to module level

e690303

Copilot AI reviewed Apr 15, 2026


Add catchup=False to scheduled DAG to prevent backfill

ae2aca2

vikramkoka wants to merge 9 commits into main from aip99_survey_example
