ma2za/docqa-stream

DocQA Stream

Self-hosted PDF Q&A API built on FastAPI, with Weaviate or pgvector storage, streaming responses, and a choice of Azure OpenAI, OpenAI-compatible, Vertex AI, or local Ollama models.

DocQA Stream is a small RAG starter for uploading PDFs, indexing them in a local vector store, and asking questions over the indexed content. Answers stream back over server-sent events and finish with citation metadata.

Architecture

Browser or curl
  -> FastAPI
  -> Unstructured PDF parsing
  -> LangChain text splitting
  -> Weaviate text2vec-transformers or PostgreSQL pgvector
  -> Azure OpenAI, OpenAI-compatible, Vertex AI, or Ollama chat model

Features

  • Docker Compose setup for FastAPI, Weaviate, local transformer embeddings, and optional pgvector.
  • PDF upload with chunk metadata: document ID, filename, page, and chunk index.
  • Streaming Q&A endpoint with citation metadata.
  • Document list and delete endpoints.
  • Minimal browser demo at http://localhost:8000.
  • Azure OpenAI by default, with OpenAI-compatible, Vertex AI, and Ollama provider options.
  • VECTOR_STORE=weaviate by default, or VECTOR_STORE=pgvector for PostgreSQL-backed vectors.

Quickstart

1. Prerequisites

  • Docker and Docker Compose
  • Azure OpenAI, OpenAI-compatible, or Vertex AI model credentials

2. Install options

For local development with the default Weaviate and Azure/OpenAI-compatible path:

poetry install --extras "weaviate azure"

For pgvector and Vertex AI:

poetry install --extras "pgvector vertexai"

For pgvector with local embeddings:

poetry install --extras "pgvector local"

For the fully local Ollama example:

poetry install --extras "pgvector local ollama"

For all supported backends and model providers:

poetry install --extras all

3. Configure environment

cp .env.example .env

Fill in these values in .env for the default Weaviate backend.

OPENAI_DEPLOYMENT_NAME=
OPENAI_API_KEY=
OPENAI_API_BASE=

The default .env.example values run FastAPI on 8000, Weaviate on 8080, and Weaviate gRPC on 50051.

For pgvector, also set an embeddings deployment and switch the vector store backend.

VECTOR_STORE=pgvector
OPENAI_EMBEDDING_DEPLOYMENT_NAME=
PGVECTOR_CONNECTION=postgresql+psycopg://docqa:docqa@pgvector:5432/docqa

4. Start the stack

docker compose up --build

To run with the pgvector backend, set VECTOR_STORE=pgvector and include the pgvector profile:

docker compose --profile pgvector up --build

Fully Local Example

This example uses pgvector, local sentence-transformers embeddings, and Ollama. No cloud API key is required.

You need:

  • Docker and Docker Compose
  • Internet access for the first run to download container images, the Ollama model, and the local embedding model
  • Enough memory to run PostgreSQL, the API, the embedding model, and the Ollama model

Start the local stack:

docker compose -f examples/compose.fully-local.yaml --env-file examples/fully-local.env up --build -d

Pull the Ollama model:

docker compose -f examples/compose.fully-local.yaml --env-file examples/fully-local.env exec ollama ollama pull qwen2.5:0.5b

The default model, qwen2.5:0.5b, is intentionally small but still useful. Ollama lists it at 398MB with a 32K context window. smollm2:135m is smaller at 271MB, but it is too weak for reliable answers in this demo.

If the Ollama container image is slow to pull, run Ollama on the host and keep pgvector in Docker:

ollama pull qwen2.5:0.5b
docker compose -f examples/compose.fully-local.yaml --env-file examples/fully-local.env up -d pgvector

Then run the API locally with OLLAMA_BASE_URL=http://localhost:11434 and PGVECTOR_CONNECTION=postgresql+psycopg://docqa:docqa@localhost:5432/docqa.

Run the smoke example:

poetry install --extras "pgvector local ollama"
poetry run python examples/fully_local_smoke.py

The first smoke run can take several minutes while the local embedding model is downloaded and loaded. Later runs reuse the indexed sample PDF and cached embedding model.

Check the API health endpoint.

curl http://localhost:8000/health

Expected response:

{"message":"OK"}

Open the demo UI:

http://localhost:8000

Open the API docs:

http://localhost:8000/docs

API Usage

Upload a PDF.

curl -X POST "http://localhost:8000/files/upload?chunk_size=1000" \
  -F "file=@tests/data/rome_guide.pdf;type=application/pdf"

Expected response:

{
  "document_id": "8f50a3f9-8a95-4c72-b4ad-3e17613d1219",
  "filename": "rome_guide.pdf",
  "status": "queued",
  "chunks_added": 0,
  "error": null
}

Check upload status.

curl "http://localhost:8000/files/uploads/8f50a3f9-8a95-4c72-b4ad-3e17613d1219"

Expected response after indexing finishes:

{
  "document_id": "8f50a3f9-8a95-4c72-b4ad-3e17613d1219",
  "filename": "rome_guide.pdf",
  "status": "completed",
  "chunks_added": 73,
  "error": null
}
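
The queued-then-completed flow above can be scripted as a polling loop. This is a minimal sketch, not part of the repository: wait_for_upload and fetch_upload_status are hypothetical helper names, and the sketch assumes the only terminal statuses are "completed" and "failed", which are the ones this README mentions.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def fetch_upload_status(document_id: str, base_url: str = "http://localhost:8000") -> Dict:
    """Call GET /files/uploads/{document_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/files/uploads/{document_id}") as resp:
        return json.load(resp)


def wait_for_upload(
    fetch_status: Callable[[], Dict],
    poll_seconds: float = 1.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the status is terminal, then return the last status payload."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        # Assumed terminal states; "queued" (and anything else) keeps polling.
        if status["status"] in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("upload did not finish before the timeout")
        sleep(poll_seconds)
```

Against a running stack, this could be called as wait_for_upload(lambda: fetch_upload_status("8f50a3f9-8a95-4c72-b4ad-3e17613d1219")). Passing the fetch function in keeps the loop testable without a server.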

Ask a question.

curl -N "http://localhost:8000/files/query?question=when%20was%20rome%20founded%3F&temperature=0&n_docs=3"
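
The question in the URL above is percent-encoded by hand. When calling the endpoint from code, the standard library can do the encoding. A small sketch, where build_query_url is a hypothetical helper and the parameter names are the ones used in the curl command:

```python
from urllib.parse import quote, urlencode


def build_query_url(
    question: str,
    temperature: float = 0,
    n_docs: int = 3,
    base_url: str = "http://localhost:8000",
) -> str:
    """Build the GET /files/query URL with a percent-encoded question."""
    params = {"question": question, "temperature": temperature, "n_docs": n_docs}
    # quote_via=quote encodes spaces as %20, matching the curl example.
    return f"{base_url}/files/query?" + urlencode(params, quote_via=quote)


print(build_query_url("when was rome founded?"))
# http://localhost:8000/files/query?question=when%20was%20rome%20founded%3F&temperature=0&n_docs=3
```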

The response is an event stream.

event: token
data: {"text":"Rome"}

event: citations
data: {"citations":[{"filename":"rome_guide.pdf","page":1,"chunk_index":0,"score":0.82,"preview":"..."}]}

Provider failures are returned as an error event.

event: error
data: {"message":"provider unavailable"}
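
A client has to split this stream into events before it can tell tokens from citations or errors. The sketch below parses only the subset of server-sent events shown in this README (event: and data: fields, blank-line separators, JSON payloads). parse_sse is a hypothetical name, and the sketch is not a full implementation of the SSE specification.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, decoded_json_data) for each event in the stream."""
    event = "message"  # SSE default when no event: field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        # Other fields and comment lines are ignored in this sketch.
    if data:
        # Flush a final event that was not followed by a blank line.
        yield event, json.loads("\n".join(data))


def collect_answer(lines: Iterable[str]) -> Tuple[str, list]:
    """Join token texts and return them with the citations list."""
    text, citations = [], []
    for event, payload in parse_sse(lines):
        if event == "token":
            text.append(payload["text"])
        elif event == "citations":
            citations = payload["citations"]
        elif event == "error":
            raise RuntimeError(payload["message"])
    return "".join(text), citations
```

Feeding it the lines of an HTTP response body, read line by line, keeps the streaming behavior: tokens can be shown as they arrive and citations are available once the final event is parsed.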

List indexed documents.

curl "http://localhost:8000/files?limit=50&offset=0"
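
To walk the full list, a client can advance offset by limit until a short page comes back. This README does not document the response body, so the sketch takes a caller-supplied fetch_page(limit, offset) function that returns a list of items. Both that function and iter_documents are hypothetical names.

```python
from typing import Callable, Iterator, List


def iter_documents(
    fetch_page: Callable[[int, int], List[dict]],
    limit: int = 50,
) -> Iterator[dict]:
    """Yield every item by requesting pages of `limit` items at increasing offsets."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        # A page shorter than the limit (including an empty one) is the last page.
        if len(page) < limit:
            return
        offset += limit
```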

Delete one indexed document.

curl -X DELETE "http://localhost:8000/files/8f50a3f9-8a95-4c72-b4ad-3e17613d1219"

Configuration

Runtime integrations are optional dependencies:

azure       Azure OpenAI chat and embeddings
openai      OpenAI-compatible chat and embeddings
vertexai    Google Vertex AI chat and embeddings
ollama      Local Ollama chat models
local       Local sentence-transformers embeddings
weaviate    Weaviate vector store
pgvector    PostgreSQL pgvector store
all         All model providers and vector stores

Required Azure OpenAI values:

LLM_PROVIDER=azure
OPENAI_DEPLOYMENT_NAME=
OPENAI_API_VERSION=2023-07-01-preview
OPENAI_API_KEY=
OPENAI_API_BASE=

Pgvector uses client-side embeddings, so it also needs an embeddings deployment:

EMBEDDINGS_PROVIDER=azure
OPENAI_EMBEDDING_DEPLOYMENT_NAME=

Optional OpenAI-compatible provider:

LLM_PROVIDER=openai
EMBEDDINGS_PROVIDER=openai
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=
OPENAI_API_KEY=
OPENAI_BASE_URL=

Optional Vertex AI provider:

LLM_PROVIDER=vertexai
EMBEDDINGS_PROVIDER=vertexai
VERTEXAI_MODEL=
VERTEXAI_EMBEDDING_MODEL=
VERTEXAI_PROJECT=
VERTEXAI_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=

If GOOGLE_APPLICATION_CREDENTIALS is not set, Vertex AI uses Google Application Default Credentials from the environment.

Optional Ollama provider:

LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:0.5b
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_KEEP_ALIVE=30m
OLLAMA_NUM_CTX=2048
OLLAMA_NUM_THREAD=
OLLAMA_NUM_PREDICT=256

Optional local embeddings:

EMBEDDINGS_PROVIDER=local
LOCAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
LOCAL_EMBEDDING_DEVICE=cpu
LOCAL_EMBEDDING_CACHE_FOLDER=
LOCAL_EMBEDDING_LOCAL_FILES_ONLY=False
EMBEDDING_DIMENSIONS=

EMBEDDING_DIMENSIONS is optional. When it is set, pgvector uses it as the fixed vector length, and embedding providers that support it either request that dimension from the model or truncate the returned vectors to it.
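
As an illustration of the truncation case, the sketch below keeps the first N components of a vector and rescales the result to unit length. fit_dimensions is a hypothetical helper, and the renormalization step is an assumption rather than something this README states the project does. Truncated vectors are generally only useful with embedding models that were trained to support shortened outputs.

```python
import math
from typing import List, Optional


def fit_dimensions(vector: List[float], dimensions: Optional[int]) -> List[float]:
    """Return the vector unchanged, or its first `dimensions` components rescaled to unit length."""
    if dimensions is None:
        return list(vector)
    if dimensions <= 0 or dimensions > len(vector):
        raise ValueError("dimensions must be between 1 and the vector length")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    # Assumption: renormalize so cosine and dot-product scores stay comparable.
    return [x / norm for x in head] if norm else head
```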

Vector store backends:

VECTOR_STORE=weaviate
VECTOR_STORE=pgvector
PGVECTOR_CONNECTION=postgresql+psycopg://docqa:docqa@pgvector:5432/docqa
PGVECTOR_COLLECTION=docqa_stream

Scale controls:

WEB_CONCURRENCY=1
MAX_UPLOAD_BYTES=26214400
UPLOAD_READ_CHUNK_BYTES=1048576
MAX_CHUNKS_PER_UPLOAD=5000
VECTORSTORE_ADD_BATCH_SIZE=128
CHUNK_OVERLAP=20
CITATION_PREVIEW_CHARS=240
LIST_DOCUMENT_CHUNK_SCAN_LIMIT=10000
MAX_QUESTION_CHARS=2000

GET /files is paginated with limit and offset. LIST_DOCUMENT_CHUNK_SCAN_LIMIT caps how many vector-store chunks are scanned to produce document summaries. MAX_QUESTION_CHARS bounds query text accepted by GET /files/query.
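
CHUNK_OVERLAP works together with the chunk_size upload parameter. The project uses LangChain text splitting, which also respects separators. The sketch below is a simpler fixed-window version meant only to show what the overlap does, and it assumes both values are measured in characters. split_with_overlap is a hypothetical name.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 20) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each chunk starts 980 characters after the previous one, so the last 20 characters of one chunk repeat at the start of the next. A sentence cut at a boundary then still appears whole in at least one chunk if it is short enough.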

Production Notes

  • Upload jobs are tracked in API worker memory. Keep WEB_CONCURRENCY=1 unless you add a shared job store or route upload-status polling to the same worker.
  • Each worker keeps its own embedding model cache. Raising WEB_CONCURRENCY increases request throughput, but it also multiplies local embedding memory use and splits upload-status state across workers.
  • Keep MAX_UPLOAD_BYTES and MAX_CHUNKS_PER_UPLOAD low enough to prevent one upload from monopolizing memory.
  • Increase VECTORSTORE_ADD_BATCH_SIZE only after checking vector-store write latency and memory use.
  • Use pgvector with fixed EMBEDDING_DIMENSIONS when you want PostgreSQL indexes over a known vector width.
  • Keep OLLAMA_NUM_CTX and OLLAMA_NUM_PREDICT bounded for local models. Large context windows raise memory use even when the model file is small.
  • For very large installations, move document manifests and ingestion jobs into a separate database and queue. This API queues work after the upload response, but job state is still process-local.
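
The WEB_CONCURRENCY advice follows from job state living in each worker's memory. The toy simulation below models that with one dictionary per worker. The Worker class and its methods are hypothetical stand-ins, not the project's code, and how the real API responds to an unknown document ID is not stated in this README.

```python
from typing import Dict, List, Optional


class Worker:
    """Stand-in for one API worker process with its own in-memory job table."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}

    def accept_upload(self, document_id: str) -> None:
        self.jobs[document_id] = "queued"

    def upload_status(self, document_id: str) -> Optional[str]:
        return self.jobs.get(document_id)


def poll_from_each_worker(worker_count: int, document_id: str) -> List[Optional[str]]:
    """Send the upload to worker 0, then ask every worker for its status."""
    workers = [Worker() for _ in range(worker_count)]
    workers[0].accept_upload(document_id)
    return [w.upload_status(document_id) for w in workers]


print(poll_from_each_worker(1, "doc-1"))  # ['queued']
print(poll_from_each_worker(2, "doc-1"))  # ['queued', None]
```

With two workers, a status poll that lands on the second worker finds no job. A shared job store or sticky routing removes that gap.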

Troubleshooting

  • GET /health fails: check that the API container is running and FASTAPI_PORT matches the port mapping.
  • Upload fails with a Weaviate connection error: check that both WEAVIATE_PORT and WEAVIATE_GRPC_PORT are exposed.
  • Upload or query fails with pgvector: check that docker compose --profile pgvector up --build started the pgvector service and that PGVECTOR_CONNECTION uses the postgresql+psycopg:// driver.
  • Upload returns 413: raise MAX_UPLOAD_BYTES or MAX_CHUNKS_PER_UPLOAD, or split the document before uploading.
  • Upload status is failed: check the error field from GET /files/uploads/{document_id}.
  • Query fails with an authentication error: check the model deployment name, endpoint, API key, and API version.
  • Pgvector embedding fails: check EMBEDDINGS_PROVIDER and the embedding model or deployment variable.
  • Local embeddings fail on first run: check network access to download the model, or set LOCAL_EMBEDDING_LOCAL_FILES_ONLY=True after the model is already cached.
  • Local example returns weak answers: use qwen2.5:0.5b or a larger Ollama model. smollm2:135m is mainly useful for proving the wiring on very small machines.
  • Vertex AI fails before model invocation: check VERTEXAI_PROJECT, VERTEXAI_LOCATION, and Google Application Default Credentials or GOOGLE_APPLICATION_CREDENTIALS.
  • Empty PDF extraction: try a text-based PDF first. Scanned PDFs may require OCR-related system dependencies.
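
For the 413 case, the defaults work out to 26214400 bytes = 25 × 1024 × 1024 (25 MiB) for MAX_UPLOAD_BYTES and 1048576 bytes (1 MiB) for UPLOAD_READ_CHUNK_BYTES. The sketch below shows one way a size cap can be enforced while reading in chunks, so an oversized upload is rejected without buffering all of it. read_bounded and UploadTooLarge are hypothetical names, and this is an assumption about the general technique, not a copy of the server's implementation.

```python
import io
from typing import BinaryIO

MAX_UPLOAD_BYTES = 26_214_400        # 25 MiB, the default in this README
UPLOAD_READ_CHUNK_BYTES = 1_048_576  # 1 MiB, the default in this README


class UploadTooLarge(Exception):
    """Raised when the stream exceeds the byte limit (an HTTP API would map this to 413)."""


def read_bounded(
    stream: BinaryIO,
    max_bytes: int = MAX_UPLOAD_BYTES,
    chunk_bytes: int = UPLOAD_READ_CHUNK_BYTES,
) -> bytes:
    """Read the stream chunk by chunk and stop as soon as the limit is exceeded."""
    buffer = bytearray()
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            return bytes(buffer)
        if len(buffer) + len(chunk) > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        buffer.extend(chunk)
```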

Tests

docker exec fastapi-application poetry run pytest .

When dependencies are installed locally:

poetry run pytest .
