Simple LLM scenario example based on the Airflow survey data #65172

Open
vikramkoka wants to merge 9 commits into main from aip99_survey_example

@vikramkoka
Contributor

This PR adds a simple example for the common.ai provider, based on public data: the Airflow 2025 survey.

The goal is to demonstrate two LLM use cases: an interactive one, suited to exploratory data analysis, and a scheduled one, which is more representative of expected usage.

Developers can use both as starting points, replacing the data extraction and LLM prompts with other integrations that pull other data sets.


Was generative AI tooling used to co-author this PR?
  • [x] Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding a dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes, create a newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

This example demonstrates how to answer a more complex question about the survey data, one that requires the LLM to synthesize the results of multiple queries.
@kaxil kaxil marked this pull request as ready for review April 14, 2026 18:55
@kaxil kaxil requested a review from gopidesupavan as a code owner April 14, 2026 18:55
Contributor

Copilot AI left a comment

Pull request overview

Adds new common.ai provider example DAGs showcasing LLM-driven analysis of the Apache Airflow 2025 survey CSV, demonstrating interactive (HITL) and scheduled/agentic (mapped fan-out/fan-in) patterns.

Changes:

  • Introduces an interactive + scheduled survey analysis example using LLM-to-SQL generation and DataFusion analytics execution.
  • Introduces an “agentic” example using Dynamic Task Mapping to decompose a question into multiple SQL queries and synthesize results.
  • Extends the docs spelling wordlist with terms used by the new examples.
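
The fan-out/fan-in shape described in the changes above can be sketched in plain Python, without Airflow. This is a minimal illustration under stated assumptions, not the PR's implementation: `decompose_question` and `synthesize` are hypothetical stand-ins for the LLM calls, the table, columns, and rows are invented, and `sqlite3` substitutes for the DataFusion engine the example DAGs use.

```python
import sqlite3

# Invented sample rows; the real example reads the Airflow 2025 survey CSV.
ROWS = [
    ("3.0", "Kubernetes"),
    ("3.0", "Celery"),
    ("2.10", "Celery"),
    ("3.0", "Kubernetes"),
]


def decompose_question(question: str) -> list[str]:
    """Hypothetical stand-in for the LLM step that turns one question into several SQL queries."""
    return [
        "SELECT version, COUNT(*) FROM survey GROUP BY version ORDER BY version",
        "SELECT executor, COUNT(*) FROM survey GROUP BY executor ORDER BY executor",
    ]


def run_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """One unit of the fan-out: execute a single query and return its rows."""
    return conn.execute(sql).fetchall()


def synthesize(question: str, results: list[list[tuple]]) -> str:
    """Hypothetical stand-in for the LLM synthesis step (fan-in); here it only formats strings."""
    parts = [", ".join(f"{key}={count}" for key, count in rows) for rows in results]
    return f"{question} -> " + " | ".join(parts)


def answer(question: str) -> str:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE survey (version TEXT, executor TEXT)")
    conn.executemany("INSERT INTO survey VALUES (?, ?)", ROWS)
    queries = decompose_question(question)
    results = [run_query(conn, sql) for sql in queries]  # fan-out: one call per query
    conn.close()
    return synthesize(question, results)  # fan-in: combine all result sets
```

In an Airflow DAG the list comprehension would typically be replaced by Dynamic Task Mapping, so that each query becomes its own retryable task instance.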

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_analysis.py | New interactive + scheduled survey analysis example DAGs. |
| providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_survey_agentic.py | New mapped multi-query (“agentic”) survey analysis example DAG. |
| docs/spelling_wordlist.txt | Adds “auditable” and “retryable” to satisfy spellcheck. |

Comment on lines +282 to +284
def prepare_csv(csv_text: str) -> None:
    os.makedirs(os.path.dirname(SURVEY_CSV_PATH), exist_ok=True)
    with open(SURVEY_CSV_PATH, "w", encoding="utf-8") as f:
Member

Already moved to module level in e690303.

Comment on lines +361 to +363
    )
else:
    print(f"Survey analysis result:\n{data}")
Member

The data comes from a CSV the user placed on disk, not from untrusted input. HTML-escaping survey results inside a <pre> block is over-engineering for an example DAG.
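
For readers who adapt this example to data they do not control, where the trade-off may differ, the escaping under discussion would look roughly like the sketch below. `render_result` is a hypothetical helper name, not a function from the PR; only `html.escape` from the standard library is used.

```python
import html


def render_result(data: str) -> str:
    """Wrap analysis output in a <pre> block, escaping HTML metacharacters first."""
    return f"<pre>{html.escape(data)}</pre>"
```

For trusted, locally supplied CSV data, printing the result directly as the example does avoids the extra step.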

Comment on lines +283 to +289
os.makedirs(os.path.dirname(SURVEY_CSV_PATH), exist_ok=True)
with open(SURVEY_CSV_PATH, "w", encoding="utf-8") as f:
    f.write(csv_text)

# Write a single-row reference CSV from the schema context so
# LLMSchemaCompareOperator has a structured baseline to compare against.
os.makedirs(os.path.dirname(REFERENCE_CSV_PATH), exist_ok=True)
Member

Same as above -- the default always has a directory. Not worth guarding in an example DAG.
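
The code comment in the snippet above mentions writing a single-row reference CSV as a structured baseline. The PR excerpt does not show how that file is built, so the following is only a sketch under assumptions: `reference_csv` is a hypothetical helper, and the column names and sample values are invented for illustration.

```python
import csv
import io


def reference_csv(columns: dict[str, str]) -> str:
    """Build a header row plus one sample row from a column -> sample value mapping."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns.keys())    # header: the expected column names
    writer.writerow(columns.values())  # single data row: one sample value per column
    return buf.getvalue()
```

A baseline like this gives a schema comparison step something concrete to diff against, since a header plus one row is enough to convey names and rough types.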

Comment on lines +69 to +72
from airflow.providers.common.sql.config import DataSourceConfig
from airflow.providers.common.sql.operators.analytics import AnalyticsOperator
from airflow.providers.standard.operators.hitl import ApprovalOperator

Member

common.sql is a declared dependency of common.ai. Installing common.ai always brings common.sql with it -- no guard needed.

Comment on lines +207 to +211
# ------------------------------------------------------------------
@task
def extract_data(raw: str) -> str:
    results = json.loads(raw)
    data = [row for item in results for row in item["data"]]
Member

Example DAGs should be self-contained and copy-pasteable. Extracting a shared helper makes them harder to use as starting points. The duplication is intentional.
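
The duplicated logic is small. As a standalone sketch, the flattening step from the snippet above looks like this; the payload shape (a JSON list of objects, each with a `data` list of rows) is inferred from the comprehension, and `flatten_results` is a hypothetical name used only here.

```python
import json


def flatten_results(raw: str) -> list[dict]:
    """Merge the "data" rows of every result object into one flat list."""
    results = json.loads(raw)
    return [row for item in results for row in item["data"]]
```

Keeping these few lines inline in each example DAG means either file can be copied on its own without chasing a shared helper module.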

Comment on lines +50 to +59

from airflow.providers.common.ai.operators.llm_schema_compare import LLMSchemaCompareOperator
from airflow.providers.common.ai.operators.llm_sql import LLMSQLQueryOperator
from airflow.providers.common.compat.sdk import dag, task
from airflow.providers.common.sql.config import DataSourceConfig
from airflow.providers.common.sql.operators.analytics import AnalyticsOperator
from airflow.providers.http.operators.http import HttpOperator
from airflow.providers.standard.operators.hitl import ApprovalOperator, HITLEntryOperator
from airflow.sdk import Param

Member

common.sql is a dependency of common.ai, so it's always available. The HttpOperator try/except guard was intentionally removed -- example DAGs should fail clearly on missing deps rather than silently hiding functionality.
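
The difference between the two import styles can be sketched with the standard library alone. Both function names below are hypothetical, and the sketch does not reproduce the guard that was removed from the PR; it only contrasts hiding a missing dependency with letting the failure surface.

```python
import importlib


def import_or_none(module_name: str):
    """Guarded import: swallow the error and return None if the module is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def import_or_fail(module_name: str):
    """Unguarded import: let ImportError propagate so the missing dependency is obvious."""
    return importlib.import_module(module_name)
```

With the guarded form, a DAG can load while quietly lacking a task; with the unguarded form, the DAG fails at import time and the error names the missing package.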



# [START example_llm_survey_scheduled]
@dag(schedule="@monthly", start_date=datetime.datetime(2025, 1, 1))
Member

Good catch -- added catchup=False in ae2aca2.
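
The decorator in the snippet above combines a monthly schedule with a start date of 2025-01-01. With catch-up enabled (whether it is on by default depends on the Airflow version and configuration), the scheduler would create one run per elapsed monthly interval. The plain-Python sketch below only counts those intervals to show the scale; `monthly_intervals` is a hypothetical helper, not an Airflow API, and it assumes the start date falls on the first day of a month.

```python
import datetime


def monthly_intervals(
    start: datetime.date, now: datetime.date
) -> list[tuple[datetime.date, datetime.date]]:
    """List the completed [start, end) monthly intervals between start and now."""
    intervals = []
    current = start
    while True:
        # Compute the first day of the following month.
        if current.month == 12:
            following = datetime.date(current.year + 1, 1, 1)
        else:
            following = datetime.date(current.year, current.month + 1, 1)
        if following > now:
            break
        intervals.append((current, following))
        current = following
    return intervals


# Dates taken from this PR: the DAG start date and the day it was marked ready for review.
intervals = monthly_intervals(datetime.date(2025, 1, 1), datetime.date(2026, 4, 14))
backfilled_runs = len(intervals)  # number of runs a full catch-up would schedule
latest_only = intervals[-1]       # the most recent completed interval
```

For an example DAG that calls an LLM on every run, avoiding that backlog keeps a first deployment from issuing a burst of model requests.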

Comment on lines +283 to +289
os.makedirs(os.path.dirname(SURVEY_CSV_PATH), exist_ok=True)
with open(SURVEY_CSV_PATH, "w", encoding="utf-8") as f:
    f.write(csv_text)

# Write a single-row reference CSV from the schema context so
# LLMSchemaCompareOperator has a structured baseline to compare against.
os.makedirs(os.path.dirname(REFERENCE_CSV_PATH), exist_ok=True)
Member

The default path is /opt/airflow/data/airflow-user-survey-2025.csv which always has a directory component. Setting it to a bare filename is a config error -- no need to guard against it in example code.
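
The behavior under discussion can be reproduced with the standard library. `os.path.dirname` returns an empty string for a bare filename, and `os.makedirs("")` raises `FileNotFoundError`. The sketch below shows what a guard would look like for anyone who wants one in their own code; `ensure_parent_dir` is a hypothetical helper, not part of the PR.

```python
import os


def parent_dir(path: str) -> str:
    """Return the directory component of a path ("" for a bare filename)."""
    return os.path.dirname(path)


def ensure_parent_dir(path: str) -> None:
    """Create the parent directory only when the path actually has one."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
```

With the default path quoted above, `parent_dir` returns `/opt/airflow/data`, so the unguarded `os.makedirs` call in the example succeeds.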

4 participants