GitHub - luannamorim/mcp-forge: MCP server for Brazilian personal finance. OFX/CSV parsing with per-bank quirk fixes, PT-BR transaction categorization (rules → LLM cascade), CPF/CNPJ validation, and CNAE lookup. Works with any MCP-compatible client — Claude Desktop, Cursor, Zed, Windsurf, and others.

MCP server for Brazilian personal finance.

A Brazilian personal-finance MCP server with OFX/CSV parsing, PT-BR categorization, and CPF/CNPJ validation — no bank APIs required.

Demo

Two chained tool calls in an MCP client:

User: Importei o OFX do Itaú de fevereiro. Quanto gastei com alimentação
      versus transporte este mês?

Claude: [parse_ofx("/tmp/itau_fev2025.ofx")]
        → 143 transactions, bank_hint: "itau", 0 errors

        [categorize_transaction × 143]
        → rules tier: 98 txs  |  llm tier: 45 txs

        Alimentação + Restaurantes: R$ 1.247,30 (34 transações)
          IFOOD*PEDIDO, PADARIA NOSSA SRA, MCDONALD'S SP...

        Transporte: R$ 389,10 (18 transações)
          UBER *TRIP, 99APP, METRÔ SP BILHETE...

        Total do mês: R$ 8.643,22 em 143 transações.

Every categorize_transaction call returns which tier classified it:

{
  "category": "transporte",
  "confidence": 1.0,
  "path": "rules"
}

path ∈ {rules, embeddings, llm} — the classification source is always explicit. A rules hit at confidence 1.0 means a regex matched; an LLM hit at 0.7 means the model was less certain. The caller decides what to do with that signal.
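One way a caller might act on that signal is sketched below. The `Categorization` dataclass, the `needs_confirmation` helper, and the 0.8 threshold are illustrative assumptions, not part of this project's API; only the three response fields come from the tool output shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Categorization:
    """Mirror of the tool response fields: category, confidence, path."""
    category: str
    confidence: float
    path: str  # "rules", "embeddings", or "llm"


def needs_confirmation(result: Categorization, threshold: float = 0.8) -> bool:
    """Return True when the client should ask the user to confirm.

    Assumption: rules hits are accepted as-is, and anything else is
    accepted only at or above a caller-chosen confidence threshold.
    """
    if result.path == "rules":
        return False
    return result.confidence < threshold


if __name__ == "__main__":
    print(needs_confirmation(Categorization("transporte", 1.0, "rules")))
    print(needs_confirmation(Categorization("lazer", 0.7, "llm")))
```

The threshold belongs to the client, so different agents can be stricter or more lenient without any change on the server side.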

Why this exists

Brazilian bank exports are not standardized. Itaú splits PIX transactions across three OFX lines that must be merged by FITID prefix. Inter exports in cp1252 but labels the file with ENCODING:1252 — a value that ofxparse rejects with an UnboundLocalError. Nubank embeds proprietary metadata in the description field. BB omits payee CNPJ for certain transaction types. Existing tools either skip Brazilian formats entirely or paper over the quirks silently, giving you wrong transaction counts without saying so.

The Model Context Protocol (released by Anthropic in late 2024) has become the standard for connecting LLM clients — Claude Desktop, Cursor, IDE agents — to external tools and data. As of mid-2025, there are effectively zero published MCP servers targeting Brazilian financial workflows. Every BR developer building a personal-finance assistant re-implements OFX parsing from scratch. This project solves the problem once and publishes the solution as a reusable MCP server.

Key features

Bank-specific parsers with documented quirks — cp1252 encoding fix for Itaú/Inter, PIX-split detection for Itaú, header-fingerprint dialect detection for all 5 banks in CSV mode
Transparent categorization — path: rules | llm on every response; no black box; clients see exactly why a category was assigned
PII masking before any LLM call — utils/pii.py:mask_br_documents strips CPF/CNPJ from description text unconditionally before it leaves the process (LGPD alignment)
Decimal everywhere — monetary amounts are never float; rounding errors in downstream aggregations are impossible by construction
Offline-safe core — validate_cpf, validate_cnpj, parse_ofx, parse_csv make zero network calls; only categorize_transaction and lookup_cnae reach the network
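To illustrate the Decimal point above, here is a minimal standalone comparison using only the standard library. It is not taken from the codebase; the amounts are arbitrary.

```python
from decimal import Decimal

# Three identical charges of ten centavos, as they might appear in a statement.
amounts = ["0.10", "0.10", "0.10"]

# Binary floats cannot represent 0.10 exactly, so the sum drifts.
float_total = sum(float(a) for a in amounts)

# Decimal keeps the exact base-10 value that was parsed from the text.
decimal_total = sum((Decimal(a) for a in amounts), Decimal("0"))

if __name__ == "__main__":
    print(float_total == 0.3)
    print(decimal_total == Decimal("0.30"))
```

The float comparison fails while the Decimal one holds, which is the kind of drift that would otherwise surface in monthly totals.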

Quickstart

git clone <this repo> && cd mcp-forge
uv sync
export ANTHROPIC_API_KEY=sk-...   # only needed for categorize_transaction
uv run mcpforge                    # start MCP server on stdio
uv run pytest                      # run the test suite

Offline mode (no Anthropic key)

categorize_transaction can run fully offline against a local Ollama server with Phi-3.5. Selected automatically when ANTHROPIC_API_KEY is unset, or explicitly via MCPFORGE_LLM_BACKEND=ollama.

ollama pull phi3.5
uv sync --extra ollama
unset ANTHROPIC_API_KEY
uv run mcpforge

Override the Ollama host/model with MCPFORGE_OLLAMA_HOST and MCPFORGE_OLLAMA_MODEL. Boot- and call-timeouts: MCPFORGE_OLLAMA_HEALTH_TIMEOUT_S (default 2.0), MCPFORGE_OLLAMA_CALL_TIMEOUT_S (default 30.0).
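The selection logic described above could look roughly like the sketch below. The environment variable names and timeout defaults come from this section; the `LLMSettings` type, the `resolve_llm_settings` function, the `"anthropic"` backend label, and the rule that an explicit backend wins over key detection are assumptions made for illustration.

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    backend: str
    health_timeout_s: float
    call_timeout_s: float


def resolve_llm_settings(env: Mapping[str, str]) -> LLMSettings:
    """Pick a backend from environment variables (hypothetical helper)."""
    explicit = env.get("MCPFORGE_LLM_BACKEND", "").strip().lower()
    if explicit:
        # Assumption: an explicit choice overrides key-based detection.
        backend = explicit
    elif env.get("ANTHROPIC_API_KEY"):
        backend = "anthropic"  # assumed label for the hosted backend
    else:
        backend = "ollama"
    return LLMSettings(
        backend=backend,
        health_timeout_s=float(env.get("MCPFORGE_OLLAMA_HEALTH_TIMEOUT_S", "2.0")),
        call_timeout_s=float(env.get("MCPFORGE_OLLAMA_CALL_TIMEOUT_S", "30.0")),
    )


if __name__ == "__main__":
    import os
    print(resolve_llm_settings(os.environ))
```

Taking the environment as a mapping argument keeps the function testable without mutating `os.environ`.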

OpenTelemetry metrics (optional)

Two instruments are emitted alongside the JSONL trace log when an OTel endpoint is configured:

mcpforge_tool_calls_total{tool, success} — counter per tool invocation
mcpforge_tool_latency_ms{tool} — histogram of tool latency

uv sync --extra otel
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector
export OTEL_SERVICE_NAME=mcpforge   # default: mcpforge
uv run mcpforge

Without the env var or extra, metric emission is a silent no-op; JSONL traces under logs/traces.jsonl remain the default observability surface.

Client config

Add to your MCP client config. Example for Claude Desktop (claude_desktop_config.json) — Cursor, Zed, and Windsurf follow the same command + args + cwd pattern:

{
  "mcpServers": {
    "mcpforge": {
      "command": "uv",
      "args": ["run", "mcpforge"],
      "cwd": "/path/to/mcp-forge"
    }
  }
}

Available tools

| Tool | What it does | Network? |
| --- | --- | --- |
| `validate_cpf(cpf)` | Receita Federal mod-11 checksum | No |
| `validate_cnpj(cnpj)` | Receita Federal mod-11 checksum | No |
| `parse_ofx(file_path)` | Parse BR bank OFX; returns normalized transactions + `bank_hint` + per-tx error list | No |
| `parse_csv(file_path, bank_hint?)` | Parse BR bank CSV with auto dialect detection | No |
| `lookup_cnae(cnpj)` | Primary CNAE classification via BrasilAPI; cached 24 h | Yes (BrasilAPI) |
| `categorize_transaction(description, amount?)` | Rules → LLM cascade; returns `{category, confidence, path}` | Conditionally (Anthropic) |

All inputs validated with Pydantic v2. Errors return MCP-compliant error objects, not Python tracebacks.
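For readers unfamiliar with the mod-11 checksum that validate_cpf relies on, the standard CPF algorithm can be sketched as follows. This is an independent illustration, not the code in validators/cpf.py; the function names are hypothetical.

```python
import re


def _check_digit(digits: list[int], first_weight: int) -> int:
    """Weighted sum with descending weights ending at 2, reduced mod 11."""
    total = sum(d * w for d, w in zip(digits, range(first_weight, 1, -1)))
    remainder = (total * 10) % 11
    return 0 if remainder == 10 else remainder


def is_valid_cpf(cpf: str) -> bool:
    """Check both CPF verification digits; accepts dotted or bare input."""
    bare = re.sub(r"[.\-\s]", "", cpf)
    if not re.fullmatch(r"[0-9]{11}", bare):
        return False
    digits = [int(ch) for ch in bare]
    # Sequences such as 111.111.111-11 pass the arithmetic but are not valid.
    if len(set(digits)) == 1:
        return False
    return (
        _check_digit(digits[:9], 10) == digits[9]
        and _check_digit(digits[:10], 11) == digits[10]
    )


if __name__ == "__main__":
    print(is_valid_cpf("529.982.247-25"))
```

The first check digit is computed over the first nine digits with weights 10 down to 2, and the second over the first ten digits with weights 11 down to 2.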

Architecture

flowchart LR
    A["MCP Client\nClaude Desktop · Cursor · Zed · Windsurf"]
    A <-->|stdio| B["FastMCP Server"]

    B --> P["parsers/\nofx · csv"]
    B --> V["validators/\ncpf · cnpj · cnae"]
    B --> C["classifiers/\ncascade"]

    P --> Banks["bank adapters\nItaú · Bradesco · Nubank · Inter · BB"]
    V --> BrasilAPI["BrasilAPI\ncached 24 h"]
    C --> Rules["rules tier\nzero API cost"]
    C -.->|"roadmap"| Emb["embeddings tier\nBGE-m3"]
    C --> PII["pii.mask_br_documents"]
    PII --> Haiku["Anthropic Haiku 4.5"]

Single-process Python server on stdio transport — compatible with any standard MCP client (Claude Desktop, Cursor, Zed, Windsurf, and others), no proxy needed. Parsers are stateless. CNAE lookup carries an in-memory 24h TTL cache. Every tool call appends a structured trace to logs/traces.jsonl with latency, token counts, and cost in USD.
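An in-memory TTL cache of the kind described for CNAE lookup can be sketched in a few lines. The `TTLCache` class below is a hypothetical illustration, not the project's implementation; the injectable clock exists only to make expiry testable.

```python
import time
from collections.abc import Callable
from typing import Generic, Optional, TypeVar

V = TypeVar("V")


class TTLCache(Generic[V]):
    """Dictionary-backed cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 24 * 60 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, V]] = {}

    def put(self, key: str, value: V) -> None:
        # Store the absolute expiry time next to the value.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[V]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller refetches.
            del self._entries[key]
            return None
        return value


if __name__ == "__main__":
    cache: TTLCache[str] = TTLCache()
    cache.put("00000000000191", "placeholder-cnae-payload")
    print(cache.get("00000000000191"))
```

Because the cache lives in process memory, it is empty after every server restart, which matches the single-process, local-only design.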

Differentiation

1. Brazilian vertical focus, not a translation layer

Bank-specific OFX/CSV quirks are first-class, not an afterthought:

parsers/ofx.py:_fix_encoding_header — remaps the non-standard ENCODING:1252 declaration (produced by Itaú, Inter, and others) to USASCII before ofxparse sees the file, then decodes the content as cp1252. Without this fix, these files throw UnboundLocalError mid-parse — the kind of failure you only discover when a user reports missing transactions.
parsers/itau.py — PIX transactions are detected via FITID prefix pattern; merge logic stub is in place for when real multi-line OFX samples are available (contributions welcome).
validators/cpf.py, validators/cnpj.py — Receita Federal mod-11 checksum, pure Python, sub-millisecond, no external service.
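The encoding-header remap in the first item above can be approximated with a small byte-level rewrite. This is a sketch under assumptions: `normalize_ofx_bytes` is a hypothetical name, it is not the body of `_fix_encoding_header`, and it stops at producing decoded text rather than calling ofxparse.

```python
import re

# Matches a header line such as "ENCODING:1252", leaving the line ending intact.
_BAD_ENCODING = re.compile(rb"^(ENCODING:)[ \t]*1252(?=[ \t]*\r?$)", re.MULTILINE)


def normalize_ofx_bytes(raw: bytes) -> str:
    """Rewrite the non-standard ENCODING value, then decode as cp1252."""
    patched = _BAD_ENCODING.sub(rb"\g<1>USASCII", raw, count=1)
    # Strict decoding: a byte that is undefined in cp1252 raises
    # UnicodeDecodeError instead of being silently replaced.
    return patched.decode("cp1252")


if __name__ == "__main__":
    sample = (
        b"OFXHEADER:100\r\nENCODING:1252\r\nCHARSET:1252\r\n\r\n"
        b"<OFX><MEMO>PADARIA S\xe3o Jo\xe3o</MEMO></OFX>"
    )
    print(normalize_ofx_bytes(sample))
```

Patching only the header line keeps the transaction payload byte-for-byte identical, so any parse difference can be attributed to the header alone.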

Full 25-category PT-BR taxonomy

alimentacao · restaurantes · transporte · combustivel · moradia · utilities · telecom · saude · farmacia · educacao · lazer · streaming · vestuario · viagens · beleza · pets · impostos · tarifas_bancarias · salario · investimentos · transferencias · saques · doacoes · presentes · outros · uncategorized

Defined in classifiers/taxonomy.py. Configurable taxonomy is a v2 feature — fixed taxonomy in v1 so the golden dataset can be labeled before the cascade is tuned.

2. Categorization that shows its work

Most transaction categorizers return a label — you trust it or you don't. MCPForge returns path on every call, so the caller has enough information to decide how to handle uncertainty:

path: "rules"      → regex matched; confidence is 1.0; zero API cost
path: "llm"        → rules missed; Haiku was called; confidence reflects model certainty
path: "embeddings" → (roadmap) nearest-neighbor on labeled set; no LLM cost

An agent can surface an llm/0.6 hit to the user for confirmation while silently accepting a rules/1.0 hit. The classification source is transparent because the distinction matters: a wrong category from a regex miss is a different kind of bug than a wrong category from an LLM that had no good match.

Current state: the cascade is two-tier (rules → LLM). The embeddings tier is stubbed at cascade.py:42 — it slots in without changing the tool signature or path semantics.
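The two-tier flow can be sketched as below. The rule patterns, the `categorize` function, and the `LLMClassifier` callable type are hypothetical stand-ins; the real rules live in the project's classifiers package, and the PII masking step that precedes any LLM call is omitted here.

```python
import re
from collections.abc import Callable

# Hypothetical rules for illustration only; not the project's rule set.
RULES: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"\bUBER\b|\b99APP\b", re.IGNORECASE), "transporte"),
    (re.compile(r"\bIFOOD\b", re.IGNORECASE), "restaurantes"),
]

# Any callable returning (category, confidence); stands in for the LLM tier.
LLMClassifier = Callable[[str], tuple[str, float]]


def categorize(description: str, llm: LLMClassifier) -> dict[str, object]:
    """Try regex rules first; fall back to the LLM only on a miss."""
    for pattern, category in RULES:
        if pattern.search(description):
            return {"category": category, "confidence": 1.0, "path": "rules"}
    category, confidence = llm(description)
    return {"category": category, "confidence": confidence, "path": "llm"}


if __name__ == "__main__":
    def fake_llm(text: str) -> tuple[str, float]:
        return ("outros", 0.6)

    print(categorize("UBER *TRIP", fake_llm))
    print(categorize("PGTO BOLETO 123", fake_llm))
```

A future embeddings tier would sit between the loop and the LLM call and return the same three keys, so callers would not need to change.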

3. Honest evaluation, not headline numbers

Per-category accuracy matters more than overall accuracy. A system with 87% top-1 can have 42% F1 on impostos — common in BR workflows with DARF, IPTU, and IPVA payments — and the aggregate hides it entirely.

The plan: Inspect AI running in CI on a 500-transaction labeled golden set (balanced across all 5 banks and all 25 categories), with per-category F1 published alongside overall accuracy, and a PR gate that blocks on a >2pp regression from the accepted baseline. The golden set will be committed to the repo, not stored separately.

Current state: the eval harness is not yet built. The numbers in Benchmarks are SPEC targets, not measurements. They will be replaced with measured results — including any categories that miss the target — when the harness lands.

Benchmarks / Evaluation

Targets (from SPEC.md — not yet measured)

| Metric | Target |
| --- | --- |
| Top-1 categorization accuracy | ≥ 85% |
| `parse_ofx` p95 on a 1 MB file | < 1.5 s |
| `categorize_transaction` p95 | < 400 ms |
| LLM cost per transaction | < $0.0005 |
| OFX coverage | 5/5 banks, ≥ 95% transactions parsed |

These are calibrated estimates, not measured values. The eval harness is on the roadmap.

Status today

| Subsystem | State |
| --- | --- |
| Unit tests | 181 passing (`uv run pytest`) |
| CSV parsing | 5 banks, header-fingerprint dialect detection; fixtures in `tests/parsers/fixtures/` |
| OFX parsing | Generic ofxparse + cp1252 quirk fix; Itaú PIX-split stub; bank hint detection |
| Categorizer eval | Not yet; Inspect AI + 500-tx golden set on roadmap |
| Latency / cost benchmarks | Not yet; per-call traces land in `logs/traces.jsonl`; no aggregated report |

Targets will be replaced with measured numbers — including misses — when the eval harness lands.

Tech stack

| Choice | Why |
| --- | --- |
| Python 3.11+ | `mcp` SDK is Python-first; `StrEnum` for the taxonomy |
| `uv` | Reproducible, fast installs; all dev commands are `uv run …` |
| FastMCP (`mcp ≥ 1.9`) | Official SDK; stdio transport = compatible with any standard MCP client |
| Pydantic v2 | Every tool input/output is a model: MCP spec compliance and input validation in one layer |
| `ofxparse` + encoding fix | Cheaper than rolling a custom SGML parser; the cp1252 quirk is a 10-line patch |
| `chardet` | OFX file extensions lie about encoding; content is sniffed |
| Anthropic Haiku 4.5 | Cheap and fast for short PT-BR text classification; ~$0.0005/tx target |
| `Decimal` (stdlib) | Money is never `float` |
| `ruff` | Lint + format in one tool |
| `pytest` + `pytest-asyncio` | 181 tests today |

Roadmap

Done in v1:

Inspect AI eval harness (500-tx golden set, CI regression gate, mocked-LLM run)
Three-tier cascade — rules → embeddings (BGE-m3 k-NN) → LLM
Ollama offline backend (Phi-3.5) with bounded timeouts and startup health check
Stateless per-file duplicate annotation (Transaction.duplicate_of)
Optional OpenTelemetry metrics (mcpforge_tool_calls_total, mcpforge_tool_latency_ms)
docs/ARCHITECTURE.md with the full module map and design rationale

Open for v1.x:

Real-LLM eval baseline — current 100% in evals/baseline_run.json is against the deterministic mock in evals/_mock_llm.py. Running once against live Haiku 4.5 (cost ~$0.20–0.30 for the full 500-tx set) is what validates the SPEC ≥85% target. Procedure documented in evals/README.md.
Per-bank postprocess modules for Bradesco, Nubank, Inter, BB (blocked on samples — see Contributing) — today the generic OFX/CSV parser handles all five banks via header-fingerprint detection, but bank-specific quirks (PIX splits, embedded metadata, encoding edge cases) need real exports to extract and test against.
Itaú PIX-split merge (blocked on samples) — stub at parsers/itau.py, passthrough pending a real multi-line OFX sample.

Out of scope for v1 (v2 candidates):

Configurable taxonomy
Cross-import duplicate detection (stateful)
Investment account types (CDB, FIIs, Tesouro Direto)
.xlsx / PDF imports
CNAE LLM-generated summaries on top of BrasilAPI

Known limitations

File imports only. No Open Finance / direct bank API; Itaú/Bradesco API access requires institutional registration and Open Finance compliance — out of scope.
No PDF or .xlsx. OFX and CSV only.
Per-bank quirk fixes are Itaú-only. Bradesco/Nubank/Inter/BB are parsed by the generic OFX/CSV pipeline. Bank-hint detection identifies them correctly, but no bank-specific postprocess runs — pending real anonymised statement samples (see Contributing). The Itaú PIX-split merge is also a passthrough stub for the same reason.
Eval accuracy is mock-LLM only. The 100% baseline in evals/baseline_run.json is against a deterministic keyword mock, not a live Anthropic call. A real-LLM run is needed before claiming the SPEC ≥85% target.
Single-process, single-user, local only. stdio transport; no multi-tenant or hosted deployment.
BrasilAPI has no paid fallback. If it's unavailable, lookup_cnae returns cnae_available: false and does not crash — but there is no alternative data source.

Contributing

Open an issue or PR. Before submitting, uv run pytest && uv run ruff check && uv run ruff format --check must be clean.

Highest-value contribution: anonymised OFX/CSV statement samples from any of the five supported banks. Specifically needed:

| Bank | What unlocks |
| --- | --- |
| Itaú | Real OFX with PIX → implement the multi-line merge in `parsers/itau.py` |
| Bradesco | Any OFX/CSV → create `parsers/bradesco.py` if quirks justify it |
| Nubank | CSV with proprietary description metadata → `parsers/nubank.py` |
| Inter | OFX with cp1252/ISO-8859 edge cases → confirm encoding fix coverage |
| BB | OFX missing payee CNPJ on specific transaction types → `parsers/bb.py` |

Anonymise by masking CPFs/CNPJs (the repo has mask_br_documents if useful), replacing amounts with synthetic values, and clearing personal names/addresses. Structure and field shapes are what matter for fixture-based tests.
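If you prefer to mask documents without importing the package, a regex-based pass along these lines may be enough for sample files. It is a rough sketch, not the repo's mask_br_documents: the `mask_documents` name, the patterns, and the placeholder tokens are assumptions, and any bare 11- or 14-digit number will be masked too.

```python
import re

# CNPJ: 14 digits, optionally written as 00.000.000/0000-00.
_CNPJ = re.compile(r"(?<!\d)\d{2}\.?\d{3}\.?\d{3}/?\d{4}-?\d{2}(?!\d)")
# CPF: 11 digits, optionally written as 000.000.000-00.
_CPF = re.compile(r"(?<!\d)\d{3}\.?\d{3}\.?\d{3}-?\d{2}(?!\d)")


def mask_documents(text: str) -> str:
    """Replace CPF/CNPJ-shaped numbers with fixed placeholder tokens."""
    # CNPJ goes first so the longer pattern is consumed before the CPF pass.
    text = _CNPJ.sub("[CNPJ]", text)
    return _CPF.sub("[CPF]", text)


if __name__ == "__main__":
    print(mask_documents("PIX RECEBIDO JOAO 529.982.247-25"))
    print(mask_documents("PAG 11.222.333/0001-81 LOJA"))
```

Over-masking is the safer failure mode for shared fixtures, since field shapes matter more than the exact digits.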

License

MIT — see LICENSE.
