Self-hosted PDF Q&A API with FastAPI, Weaviate or pgvector, streaming responses, and Azure OpenAI, OpenAI-compatible models, Vertex AI, or local Ollama.
DocQA Stream is a small RAG starter for uploading PDFs, indexing them in a local vector store, and asking questions over the indexed content. Answers stream back over server-sent events and finish with citation metadata.
Browser or curl
-> FastAPI
-> Unstructured PDF parsing
-> LangChain text splitting
-> Weaviate text2vec-transformers or PostgreSQL pgvector
-> Azure OpenAI, OpenAI-compatible, Vertex AI, or Ollama chat model
- Docker Compose setup for FastAPI, Weaviate, local transformer embeddings, and optional pgvector.
- PDF upload with chunk metadata: document ID, filename, page, and chunk index.
- Streaming Q&A endpoint with citation metadata.
- Document list and delete endpoints.
- Minimal browser demo at http://localhost:8000.
- Azure OpenAI by default, with OpenAI-compatible, Vertex AI, and Ollama provider options.
- VECTOR_STORE=weaviate by default, or VECTOR_STORE=pgvector for PostgreSQL-backed vectors.
- Docker and Docker Compose
- Azure OpenAI, OpenAI-compatible, or Vertex AI model credentials
For local development with the default Weaviate and Azure/OpenAI-compatible path:
poetry install --extras "weaviate azure"
For pgvector and Vertex AI:
poetry install --extras "pgvector vertexai"
For pgvector with local embeddings:
poetry install --extras "pgvector local"
For the fully local Ollama example:
poetry install --extras "pgvector local ollama"
For all supported backends and model providers:
poetry install --extras all
Copy the example environment file:
cp .env.example .env
Fill in these values in .env for the default Weaviate backend.
OPENAI_DEPLOYMENT_NAME=
OPENAI_API_KEY=
OPENAI_API_BASE=
The default .env.example values run FastAPI on 8000, Weaviate on 8080, and Weaviate gRPC on 50051.
For pgvector, also set an embeddings deployment and switch the vector store backend.
VECTOR_STORE=pgvector
OPENAI_EMBEDDING_DEPLOYMENT_NAME=
PGVECTOR_CONNECTION=postgresql+psycopg://docqa:docqa@pgvector:5432/docqa
docker compose up --build
To run with the pgvector backend, set VECTOR_STORE=pgvector and include the pgvector profile:
docker compose --profile pgvector up --build
The fully local example uses pgvector, local sentence-transformers embeddings, and Ollama. No cloud API key is required.
You need:
- Docker and Docker Compose
- Internet access for the first run to download container images, the Ollama model, and the local embedding model
- Enough memory to run PostgreSQL, the API, the embedding model, and the Ollama model
Start the local stack:
docker compose -f examples/compose.fully-local.yaml --env-file examples/fully-local.env up --build -d
Pull the Ollama model:
docker compose -f examples/compose.fully-local.yaml --env-file examples/fully-local.env exec ollama ollama pull qwen2.5:0.5b
The default is intentionally small but still useful. Ollama lists qwen2.5:0.5b at 398MB with a 32K context window. smollm2:135m is smaller at 271MB, but it is too weak for reliable answers in this demo.
If the Ollama container image is slow to pull, run Ollama on the host and keep pgvector in Docker:
ollama pull qwen2.5:0.5b
docker compose -f examples/compose.fully-local.yaml --env-file examples/fully-local.env up -d pgvector
Then run the API locally with OLLAMA_BASE_URL=http://localhost:11434 and PGVECTOR_CONNECTION=postgresql+psycopg://docqa:docqa@localhost:5432/docqa.
Check the API:
curl http://localhost:8000/health
Run the smoke example:
poetry install --extras "pgvector local ollama"
poetry run python examples/fully_local_smoke.py
The first smoke run can take several minutes while the local embedding model is downloaded and loaded. Later runs reuse the indexed sample PDF and cached embedding model.
Open the demo UI:
http://localhost:8000
Check the API health endpoint.
curl http://localhost:8000/health
Expected response:
{"message":"OK"}
Open the demo UI:
http://localhost:8000
Open the API docs:
http://localhost:8000/docs
Upload a PDF.
curl -X POST "http://localhost:8000/files/upload?chunk_size=1000" \
  -F "file=@tests/data/rome_guide.pdf;type=application/pdf"
Expected response:
{
  "document_id": "8f50a3f9-8a95-4c72-b4ad-3e17613d1219",
  "filename": "rome_guide.pdf",
  "status": "queued",
  "chunks_added": 0,
  "error": null
}
Check upload status.
curl "http://localhost:8000/files/uploads/8f50a3f9-8a95-4c72-b4ad-3e17613d1219"
Expected response after indexing finishes:
{
  "document_id": "8f50a3f9-8a95-4c72-b4ad-3e17613d1219",
  "filename": "rome_guide.pdf",
  "status": "completed",
  "chunks_added": 73,
  "error": null
}
Ask a question.
curl -N "http://localhost:8000/files/query?question=when%20was%20rome%20founded%3F&temperature=0&n_docs=3"
The response is an event stream.
event: token
data: {"text":"Rome"}
event: citations
data: {"citations":[{"filename":"rome_guide.pdf","page":1,"chunk_index":0,"score":0.82,"preview":"..."}]}
Provider failures are returned as an error event.
event: error
data: {"message":"provider unavailable"}
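A client has to split the stream into events and react to each event name. The sketch below is one way to do that in Python using only the standard library. It assumes standard server-sent-event framing, where a blank line ends each event, and it relies only on the three event names and payload fields shown above. The function names `build_query_url`, `parse_sse`, and `collect_answer` are illustrative and are not part of this project.

```python
import json
from typing import Iterable, Iterator, List, Tuple
from urllib.parse import quote, urlencode


def build_query_url(base_url: str, question: str, temperature: float = 0, n_docs: int = 3) -> str:
    """Build the GET /files/query URL with a percent-encoded question."""
    params = urlencode(
        {"question": question, "temperature": temperature, "n_docs": n_docs},
        quote_via=quote,  # encode spaces as %20, as in the curl example
    )
    return f"{base_url}/files/query?{params}"


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, json_payload) pairs from decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        # Flush a final event that was not followed by a blank line.
        yield event, json.loads("\n".join(data))


def collect_answer(events: Iterable[Tuple[str, dict]]) -> Tuple[str, List[dict]]:
    """Join token text, keep the citations, and raise on an error event."""
    text: List[str] = []
    citations: List[dict] = []
    for name, payload in events:
        if name == "token":
            text.append(payload["text"])
        elif name == "citations":
            citations = payload["citations"]
        elif name == "error":
            raise RuntimeError(payload["message"])
    return "".join(text), citations
```

Against a running server, the lines could come from `urllib.request.urlopen(build_query_url(...))`, decoding each line as UTF-8 before passing it to `parse_sse`.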
List indexed documents.
curl "http://localhost:8000/files?limit=50&offset=0"
Delete one indexed document.
curl -X DELETE http://localhost:8000/files/{document_id}
Runtime integrations are optional dependencies:
azure     Azure OpenAI chat and embeddings
openai    OpenAI-compatible chat and embeddings
vertexai  Google Vertex AI chat and embeddings
ollama    Local Ollama chat models
local     Local sentence-transformers embeddings
weaviate  Weaviate vector store
pgvector  PostgreSQL pgvector store
all       All model providers and vector stores
Required Azure OpenAI values:
LLM_PROVIDER=azure
OPENAI_DEPLOYMENT_NAME=
OPENAI_API_VERSION=2023-07-01-preview
OPENAI_API_KEY=
OPENAI_API_BASE=
Pgvector uses client-side embeddings, so it also needs an embeddings deployment:
EMBEDDINGS_PROVIDER=azure
OPENAI_EMBEDDING_DEPLOYMENT_NAME=
Optional OpenAI-compatible provider:
LLM_PROVIDER=openai
EMBEDDINGS_PROVIDER=openai
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=
OPENAI_API_KEY=
OPENAI_BASE_URL=
Optional Vertex AI provider:
LLM_PROVIDER=vertexai
EMBEDDINGS_PROVIDER=vertexai
VERTEXAI_MODEL=
VERTEXAI_EMBEDDING_MODEL=
VERTEXAI_PROJECT=
VERTEXAI_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=
If GOOGLE_APPLICATION_CREDENTIALS is not set, Vertex AI uses Google Application Default Credentials from the environment.
Optional Ollama provider:
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:0.5b
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_KEEP_ALIVE=30m
OLLAMA_NUM_CTX=2048
OLLAMA_NUM_THREAD=
OLLAMA_NUM_PREDICT=256
Optional local embeddings:
EMBEDDINGS_PROVIDER=local
LOCAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
LOCAL_EMBEDDING_DEVICE=cpu
LOCAL_EMBEDDING_CACHE_FOLDER=
LOCAL_EMBEDDING_LOCAL_FILES_ONLY=False
EMBEDDING_DIMENSIONS=
EMBEDDING_DIMENSIONS is optional. Pgvector uses it as a fixed vector length, and supported embedding providers use it to truncate or request that dimension.
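To make the truncation case concrete, here is a minimal sketch of fitting a vector to a fixed width on the client side. It is an illustration only: the function name `fit_dimensions` is hypothetical, and the L2 renormalization after truncation is an assumption, since this README does not say how the project or any provider handles it.

```python
import math
from typing import List, Optional, Sequence


def fit_dimensions(vector: Sequence[float], dimensions: Optional[int]) -> List[float]:
    """Truncate a vector to `dimensions` and L2-normalize the result.

    Returns the vector unchanged when `dimensions` is None, which mirrors
    leaving EMBEDDING_DIMENSIONS unset.
    """
    if dimensions is None:
        return list(vector)
    if dimensions <= 0:
        raise ValueError("dimensions must be positive")
    if len(vector) < dimensions:
        raise ValueError("vector is shorter than the requested dimensions")
    head = list(vector[:dimensions])
    # Assumption: renormalize so cosine and dot-product scores stay comparable.
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Truncation only preserves useful similarity for models trained to support shortened vectors, so check the embedding model's documentation before relying on it.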
Vector store backends:
VECTOR_STORE=weaviate
VECTOR_STORE=pgvector
PGVECTOR_CONNECTION=postgresql+psycopg://docqa:docqa@pgvector:5432/docqa
PGVECTOR_COLLECTION=docqa_stream
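A wrong driver prefix in PGVECTOR_CONNECTION is easy to miss, so it can help to check the value before starting the stack. The sketch below parses the URL with the standard library and rejects anything that does not use the postgresql+psycopg scheme. The helper name `check_pgvector_connection` and the fallback port of 5432 are assumptions made for illustration, not project code.

```python
from urllib.parse import urlsplit

EXPECTED_SCHEME = "postgresql+psycopg"


def check_pgvector_connection(url: str) -> dict:
    """Validate a PGVECTOR_CONNECTION value and return its non-secret parts."""
    parts = urlsplit(url)
    if parts.scheme != EXPECTED_SCHEME:
        raise ValueError(
            f"expected scheme {EXPECTED_SCHEME!r}, got {parts.scheme!r}"
        )
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("connection URL needs a host and a database name")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # assumed PostgreSQL default when omitted
        "user": parts.username,
        "database": database,  # the password is deliberately left out
    }
```

The host differs by setup: `pgvector` resolves inside the Compose network, while `localhost` applies when the API runs on the host.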
Scale controls:
WEB_CONCURRENCY=1
MAX_UPLOAD_BYTES=26214400
UPLOAD_READ_CHUNK_BYTES=1048576
MAX_CHUNKS_PER_UPLOAD=5000
VECTORSTORE_ADD_BATCH_SIZE=128
CHUNK_OVERLAP=20
CITATION_PREVIEW_CHARS=240
LIST_DOCUMENT_CHUNK_SCAN_LIMIT=10000
MAX_QUESTION_CHARS=2000
GET /files is paginated with limit and offset. LIST_DOCUMENT_CHUNK_SCAN_LIMIT caps how many vector-store chunks are scanned to produce document summaries. MAX_QUESTION_CHARS bounds query text accepted by GET /files/query.
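The upload limits work in a similar bounded way. As a rough illustration of how MAX_UPLOAD_BYTES and UPLOAD_READ_CHUNK_BYTES could interact, the sketch below reads a stream in fixed-size chunks and stops as soon as the running total would exceed the cap. This is not the project's implementation: `read_bounded` and `UploadTooLarge` are hypothetical names, and a real handler would map the exception to an HTTP 413 response.

```python
import io
from typing import BinaryIO

MAX_UPLOAD_BYTES = 26214400        # 25 MiB, the value shown above
UPLOAD_READ_CHUNK_BYTES = 1048576  # 1 MiB, the value shown above


class UploadTooLarge(Exception):
    """Raised when a stream exceeds the configured byte limit."""


def read_bounded(
    stream: BinaryIO,
    max_bytes: int = MAX_UPLOAD_BYTES,
    chunk_bytes: int = UPLOAD_READ_CHUNK_BYTES,
) -> bytes:
    """Read the whole stream in chunks, failing once it passes max_bytes."""
    buffer = bytearray()
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            return bytes(buffer)
        if len(buffer) + len(chunk) > max_bytes:
            # Stop early: at most one extra chunk is held beyond the limit.
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        buffer.extend(chunk)


# Usage with an in-memory stream standing in for an uploaded file.
payload = read_bounded(io.BytesIO(b"%PDF-1.7 sample"), max_bytes=64, chunk_bytes=8)
```

Reading in chunks keeps the rejection cost proportional to the limit rather than to the size of the oversized file.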
- Upload jobs are tracked in API worker memory. Keep WEB_CONCURRENCY=1 unless you add a shared job store or route upload-status polling to the same worker.
- Each worker keeps its own embedding model cache. Increasing WEB_CONCURRENCY improves concurrency but also multiplies local embedding memory use and upload-status state.
- Keep MAX_UPLOAD_BYTES and MAX_CHUNKS_PER_UPLOAD low enough to prevent one upload from monopolizing memory.
- Increase VECTORSTORE_ADD_BATCH_SIZE only after checking vector-store write latency and memory use.
- Use pgvector with fixed EMBEDDING_DIMENSIONS when you want PostgreSQL indexes over a known vector width.
- Keep OLLAMA_NUM_CTX and OLLAMA_NUM_PREDICT bounded for local models. Large context windows raise memory use even when the model file is small.
- For very large installations, move document manifests and ingestion jobs into a separate database and queue. This API queues work after the upload response, but job state is still process-local.
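Because indexing continues after the upload response, a client usually polls GET /files/uploads/{document_id} until the job settles. The following sketch shows one bounded way to do that. It assumes only the status values that appear in this README (queued, completed, failed) and treats any other value as still pending. The names `fetch_status` and `wait_for_upload`, the timeout, and the polling interval are all illustrative choices.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(base_url: str, document_id: str) -> dict:
    """Fetch the upload status JSON for one document."""
    url = f"{base_url}/files/uploads/{document_id}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_upload(
    get_status: Callable[[], dict],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the upload completes, fails, or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            # Surface the server-provided error field when present.
            raise RuntimeError(status.get("error") or "upload failed")
        if clock() >= deadline:
            raise TimeoutError("upload did not finish before the timeout")
        sleep(interval_s)
```

A typical call is `wait_for_upload(lambda: fetch_status("http://localhost:8000", document_id))`. With more than one worker and no shared job store, a poll may reach a worker that does not know the job, which is the reason for the WEB_CONCURRENCY=1 advice above.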
- GET /health fails: check that the API container is running and FASTAPI_PORT matches the port mapping.
- Upload fails with a Weaviate connection error: check that both WEAVIATE_PORT and WEAVIATE_GRPC_PORT are exposed.
- Upload or query fails with pgvector: check that docker compose --profile pgvector up --build started the pgvector service and that PGVECTOR_CONNECTION uses the postgresql+psycopg:// driver.
- Upload returns 413: raise MAX_UPLOAD_BYTES or MAX_CHUNKS_PER_UPLOAD, or split the document before uploading.
- Upload status is failed: check the error field from GET /files/uploads/{document_id}.
- Query fails with an authentication error: check the model deployment name, endpoint, API key, and API version.
- Pgvector embedding fails: check EMBEDDINGS_PROVIDER and the embedding model or deployment variable.
- Local embeddings fail on first run: check network access to download the model, or set LOCAL_EMBEDDING_LOCAL_FILES_ONLY=True after the model is already cached.
- Local example returns weak answers: use qwen2.5:0.5b or a larger Ollama model. smollm2:135m is mainly useful for proving the wiring on very small machines.
- Vertex AI fails before model invocation: check VERTEXAI_PROJECT, VERTEXAI_LOCATION, and Google Application Default Credentials or GOOGLE_APPLICATION_CREDENTIALS.
- Empty PDF extraction: try a text-based PDF first. Scanned PDFs may require OCR-related system dependencies.
docker exec fastapi-application poetry run pytest .
When dependencies are installed locally:
poetry run pytest .