MCP server for Brazilian personal finance.
A Brazilian personal-finance MCP server with OFX/CSV parsing, PT-BR categorization, and CPF/CNPJ validation — no bank APIs required.
Two chained tool calls in an MCP client:
User: Importei o OFX do Itaú de fevereiro. Quanto gastei com alimentação
versus transporte este mês?
Claude: [parse_ofx("/tmp/itau_fev2025.ofx")]
→ 143 transactions, bank_hint: "itau", 0 errors
[categorize_transaction × 143]
→ rules tier: 98 txs | llm tier: 45 txs
Alimentação + Restaurantes: R$ 1.247,30 (34 transações)
IFOOD*PEDIDO, PADARIA NOSSA SRA, MCDONALD'S SP...
Transporte: R$ 389,10 (18 transações)
UBER *TRIP, 99APP, METRÔ SP BILHETE...
Total do mês: R$ 8.643,22 em 143 transações.
Every categorize_transaction call returns which tier classified it:
{
"category": "transporte",
"confidence": 1.0,
"path": "rules"
}
path ∈ {rules, embeddings, llm}: the classification source is always explicit. A rules hit at confidence 1.0 means a regex matched; an LLM hit at 0.7 means the model was less certain. The caller decides what to do with that signal.
Brazilian bank exports are not standardized. Itaú splits PIX transactions across three OFX lines that must be merged by FITID prefix. Inter exports in cp1252 but labels the file with ENCODING:1252 — a value that ofxparse rejects with an UnboundLocalError. Nubank embeds proprietary metadata in the description field. BB omits payee CNPJ for certain transaction types. Existing tools either skip Brazilian formats entirely or paper over the quirks silently, giving you wrong transaction counts without saying so.
The Model Context Protocol (released by Anthropic in late 2024) has become the standard for connecting LLM clients — Claude Desktop, Cursor, IDE agents — to external tools and data. As of mid-2025, there are effectively zero published MCP servers targeting Brazilian financial workflows. Every BR developer building a personal-finance assistant re-implements OFX parsing from scratch. This project solves the problem once and publishes the solution as a reusable MCP server.
- Bank-specific parsers with documented quirks — cp1252 encoding fix for Itaú/Inter, PIX-split detection for Itaú, header-fingerprint dialect detection for all 5 banks in CSV mode
- Transparent categorization: path: rules | llm on every response; no black box; clients see exactly why a category was assigned
- PII masking before any LLM call: utils/pii.py:mask_br_documents strips CPF/CNPJ from description text unconditionally before it leaves the process (LGPD alignment)
- Decimal everywhere: monetary amounts are never float, so binary floating-point rounding errors cannot leak into downstream aggregations
- Offline-safe core: validate_cpf, validate_cnpj, parse_ofx, parse_csv make zero network calls; only categorize_transaction and lookup_cnae reach the network
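To make the PII-masking bullet concrete, here is a minimal sketch of masking CPF- and CNPJ-shaped substrings before a description leaves the process. The regexes, the function name mask_documents_sketch, and the [CPF]/[CNPJ] placeholder tokens are assumptions for illustration; the real utils/pii.py:mask_br_documents may work differently.

```python
import re

# Assumed patterns: formatted or bare CNPJ (14 digits) and CPF (11 digits).
# The lookarounds stop a match from starting or ending inside a longer digit run.
_CNPJ_RE = re.compile(r"(?<!\d)\d{2}\.?\d{3}\.?\d{3}/?\d{4}-?\d{2}(?!\d)")
_CPF_RE = re.compile(r"(?<!\d)\d{3}\.?\d{3}\.?\d{3}-?\d{2}(?!\d)")


def mask_documents_sketch(description: str) -> str:
    """Replace CPF/CNPJ-shaped substrings with placeholder tokens."""
    # CNPJ goes first so its longer digit run is never half-masked as a CPF.
    masked = _CNPJ_RE.sub("[CNPJ]", description)
    return _CPF_RE.sub("[CPF]", masked)
```

This version masks by shape only, without running the checksum, so any 11- or 14-digit run is masked. For a privacy filter that errs on the side of over-masking, that trade-off seems reasonable.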
git clone <this repo> && cd mcp-forge
uv sync
export ANTHROPIC_API_KEY=sk-... # only needed for categorize_transaction
uv run mcpforge # start MCP server on stdio
uv run pytest # 198 tests
categorize_transaction can run fully offline against a local Ollama server with
Phi-3.5. The Ollama backend is selected automatically when ANTHROPIC_API_KEY is
unset, or explicitly via MCPFORGE_LLM_BACKEND=ollama.
ollama pull phi3.5
uv sync --extra ollama
unset ANTHROPIC_API_KEY
uv run mcpforge
Override the Ollama host/model with MCPFORGE_OLLAMA_HOST and MCPFORGE_OLLAMA_MODEL.
Startup health-check and per-call timeouts are set with MCPFORGE_OLLAMA_HEALTH_TIMEOUT_S
(default 2.0) and MCPFORGE_OLLAMA_CALL_TIMEOUT_S (default 30.0).
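The selection rule above can be sketched as a small pure function over the environment. The LlmSettings dataclass, the resolve_llm_settings name, and the "anthropic" backend label are hypothetical; only the environment variable names and the two timeout defaults come from this README.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LlmSettings:
    backend: str  # "anthropic" or "ollama" (labels assumed for this sketch)
    health_timeout_s: float
    call_timeout_s: float


def resolve_llm_settings(env: Mapping[str, str]) -> LlmSettings:
    """Pick the LLM backend: explicit override first, then API-key presence."""
    explicit = env.get("MCPFORGE_LLM_BACKEND", "").strip().lower()
    if explicit:
        backend = explicit
    elif env.get("ANTHROPIC_API_KEY"):
        backend = "anthropic"
    else:
        backend = "ollama"  # no key set: fall back to the local backend
    return LlmSettings(
        backend=backend,
        health_timeout_s=float(env.get("MCPFORGE_OLLAMA_HEALTH_TIMEOUT_S", "2.0")),
        call_timeout_s=float(env.get("MCPFORGE_OLLAMA_CALL_TIMEOUT_S", "30.0")),
    )
```

Passing the environment in as a mapping (for example os.environ) keeps the rule easy to unit-test without touching process state.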
Two instruments are emitted alongside the JSONL trace log when an OTel endpoint is configured:
- mcpforge_tool_calls_total{tool, success}: counter per tool invocation
- mcpforge_tool_latency_ms{tool}: histogram of tool latency
uv sync --extra otel
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector
export OTEL_SERVICE_NAME=mcpforge # default: mcpforge
uv run mcpforge
Without the env var or extra, metric emission is a silent no-op; JSONL traces
under logs/traces.jsonl remain the default observability surface.
Add to your MCP client config. Example for Claude Desktop (claude_desktop_config.json) — Cursor, Zed, and Windsurf follow the same command + args + cwd pattern:
{
"mcpServers": {
"mcpforge": {
"command": "uv",
"args": ["run", "mcpforge"],
"cwd": "/path/to/mcp-forge"
}
}
}
| Tool | What it does | Network? |
|---|---|---|
| validate_cpf(cpf) | Receita Federal mod-11 checksum | No |
| validate_cnpj(cnpj) | Receita Federal mod-11 checksum | No |
| parse_ofx(file_path) | Parse BR bank OFX; returns normalized transactions + bank_hint + per-tx error list | No |
| parse_csv(file_path, bank_hint?) | Parse BR bank CSV with auto dialect detection | No |
| lookup_cnae(cnpj) | Primary CNAE classification via BrasilAPI; cached 24 h | Yes (BrasilAPI) |
| categorize_transaction(description, amount?) | Rules → LLM cascade; returns {category, confidence, path} | Conditionally (Anthropic) |
All inputs validated with Pydantic v2. Errors return MCP-compliant error objects, not Python tracebacks.
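For readers unfamiliar with the mod-11 checksum behind validate_cpf and validate_cnpj, the sketch below shows the standard algorithm in pure Python. The function names are illustrative and are not the project's actual validators; punctuation is simply ignored.

```python
def _check_digit(digits: list[int], weights: list[int]) -> int:
    """Mod-11 check digit: remainders below 2 map to 0, otherwise 11 - remainder."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def _is_valid(raw: str, length: int, w_first: list[int], w_second: list[int]) -> bool:
    digits = [int(c) for c in raw if "0" <= c <= "9"]
    if len(digits) != length or len(set(digits)) == 1:
        return False  # wrong length, or a repeated-digit run such as 111.111.111-11
    body = digits[:-2]
    first = _check_digit(body, w_first)
    second = _check_digit(body + [first], w_second)
    return digits[-2:] == [first, second]


def is_valid_cpf(cpf: str) -> bool:
    # Weights 10..2 for the first check digit, 11..2 for the second.
    return _is_valid(cpf, 11, list(range(10, 1, -1)), list(range(11, 1, -1)))


def is_valid_cnpj(cnpj: str) -> bool:
    first = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
    return _is_valid(cnpj, 14, first, [6] + first)
```

Repeated-digit sequences pass the arithmetic but are not valid documents, which is why they are rejected before the check digits are computed.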
flowchart LR
A["MCP Client\nClaude Desktop · Cursor · Zed · Windsurf"]
A <-->|stdio| B["FastMCP Server"]
B --> P["parsers/\nofx · csv"]
B --> V["validators/\ncpf · cnpj · cnae"]
B --> C["classifiers/\ncascade"]
P --> Banks["bank adapters\nItaú · Bradesco · Nubank · Inter · BB"]
V --> BrasilAPI["BrasilAPI\ncached 24 h"]
C --> Rules["rules tier\nzero API cost"]
C -.->|"roadmap"| Emb["embeddings tier\nBGE-m3"]
C --> PII["pii.mask_br_documents"]
PII --> Haiku["Anthropic Haiku 4.5"]
Single-process Python server on stdio transport — compatible with any standard MCP client (Claude Desktop, Cursor, Zed, Windsurf, and others), no proxy needed. Parsers are stateless. CNAE lookup carries an in-memory 24h TTL cache. Every tool call appends a structured trace to logs/traces.jsonl with latency, token counts, and cost in USD.
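As a rough picture of the per-call trace, the decorator below appends one JSON line per invocation. The record fields (tool, success, latency_ms) and the traced helper are assumptions; the real trace also carries token counts and cost, which this sketch leaves out, and real MCP tools may be async whereas this version wraps synchronous functions only.

```python
import functools
import json
import time
from pathlib import Path
from typing import Any, Callable


def traced(tool_name: str, trace_path: Path) -> Callable:
    """Decorator sketch: append one JSON line per call with latency and outcome."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            success = False
            try:
                result = func(*args, **kwargs)
                success = True
                return result
            finally:
                # Runs on both success and failure, so errors are traced too.
                record = {
                    "tool": tool_name,
                    "success": success,
                    "latency_ms": round((time.perf_counter() - start) * 1000, 3),
                }
                trace_path.parent.mkdir(parents=True, exist_ok=True)
                with trace_path.open("a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record, ensure_ascii=False) + "\n")

        return wrapper

    return decorator
```

Writing the record in a finally block means a failing tool call still leaves a line in the log, which is usually the call you most want to see.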
Bank-specific OFX/CSV quirks are first-class, not an afterthought:
- parsers/ofx.py:_fix_encoding_header remaps the non-standard ENCODING:1252 declaration (produced by Itaú, Inter, and others) to USASCII before ofxparse sees the file, then decodes the content as cp1252. Without this fix, these files throw UnboundLocalError mid-parse, the kind of failure you only discover when a user reports missing transactions.
- parsers/itau.py detects PIX transactions via FITID prefix pattern; a merge-logic stub is in place for when real multi-line OFX samples are available (contributions welcome).
- validators/cpf.py and validators/cnpj.py implement the Receita Federal mod-11 checksum in pure Python: sub-millisecond, no external service.
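The encoding-header remap described in the first item can be sketched in a few lines. This is an illustration of the idea under assumed details (the regex, the function name, and returning decoded text), not the code in parsers/ofx.py.

```python
import re

# Matches the non-standard header line, e.g. b"ENCODING:1252", at the start of a line.
_BAD_ENCODING_RE = re.compile(rb"(?im)^(ENCODING:)[ \t]*1252\b")


def fix_encoding_header_sketch(raw: bytes) -> str:
    """Rewrite ENCODING:1252 to ENCODING:USASCII, then decode the bytes as cp1252."""
    patched = _BAD_ENCODING_RE.sub(rb"\g<1>USASCII", raw, count=1)
    # errors="replace" guards against the few byte values cp1252 leaves undefined.
    return patched.decode("cp1252", errors="replace")
```

Patching at the byte level before decoding keeps the CHARSET:1252 line and the transaction text untouched; only the header value the parser chokes on is changed.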
Full 25-category PT-BR taxonomy
alimentacao · restaurantes · transporte · combustivel · moradia · utilities · telecom · saude · farmacia · educacao · lazer · streaming · vestuario · viagens · beleza · pets · impostos · tarifas_bancarias · salario · investimentos · transferencias · saques · doacoes · presentes · outros · uncategorized
Defined in classifiers/taxonomy.py. Configurable taxonomy is a v2 feature — fixed taxonomy in v1 so the golden dataset can be labeled before the cascade is tuned.
Most transaction categorizers return a label — you trust it or you don't. MCPForge returns path on every call, so the caller has enough information to decide how to handle uncertainty:
path: "rules" → regex matched; confidence is 1.0; zero API cost
path: "llm" → rules missed; Haiku was called; confidence reflects model certainty
path: "embeddings" → (roadmap) nearest-neighbor on labeled set; no LLM cost
An agent can surface an llm/0.6 hit to the user for confirmation while silently accepting a rules/1.0 hit. The classification source is transparent because the distinction matters: a wrong category from a regex miss is a different kind of bug than a wrong category from an LLM that had no good match.
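A caller-side policy along those lines might look like the sketch below. The 0.8 threshold and the needs_confirmation helper are assumptions chosen for illustration; the server itself only returns category, confidence, and path.

```python
from typing import TypedDict


class Categorization(TypedDict):
    category: str
    confidence: float
    path: str  # "rules", "embeddings", or "llm"


def needs_confirmation(result: Categorization, threshold: float = 0.8) -> bool:
    """Return True when the caller should ask the user to confirm the category."""
    if result["path"] == "rules":
        return False  # a regex matched; accept silently
    return result["confidence"] < threshold
```

With this policy an llm hit at 0.6 is surfaced to the user while a rules hit at 1.0 is accepted without interruption.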
Current state: the cascade is two-tier (rules → LLM). The embeddings tier is stubbed at cascade.py:42 — it slots in without changing the tool signature or path semantics.
Per-category accuracy matters more than overall accuracy. A system with 87% top-1 can have 42% F1 on impostos — common in BR workflows with DARF, IPTU, and IPVA payments — and the aggregate hides it entirely.
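To see how an aggregate can hide a weak category, the sketch below computes F1 per category from parallel lists of gold and predicted labels. It is a plain-Python illustration, not the planned eval harness.

```python
from collections import Counter


def per_category_f1(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """F1 per category, using F1 = 2*TP / (2*TP + FP + FN)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[g] += 1  # true category gains a false negative
    scores = {}
    for cat in set(gold) | set(predicted):
        denom = 2 * tp[cat] + fp[cat] + fn[cat]
        scores[cat] = 2 * tp[cat] / denom if denom else 0.0
    return scores
```

For example, with eight alimentacao transactions all correct and two impostos transactions of which one is mislabeled as alimentacao, overall accuracy is 90% while F1 for impostos is only 2/3.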
The plan: Inspect AI running in CI on a 500-transaction labeled golden set (balanced across all 5 banks and all 25 categories), with per-category F1 published alongside overall accuracy, and a PR gate that blocks on a >2pp regression from the accepted baseline. The golden set will be committed to the repo, not stored separately.
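One way the regression gate could be expressed is as a comparison of metric dictionaries. The sketch assumes scores are fractions in [0, 1] and that the gate applies to every tracked metric (overall and per-category); the plan above does not specify either detail, and the regressions helper is hypothetical.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop_pp: float = 2.0,
) -> dict[str, float]:
    """Return metrics whose score fell by more than max_drop_pp percentage points."""
    failing = {}
    for metric, base in baseline.items():
        # A metric missing from the current run counts as a score of 0.
        drop_pp = (base - current.get(metric, 0.0)) * 100
        if drop_pp > max_drop_pp:
            failing[metric] = round(drop_pp, 2)
    return failing
```

A CI step would fail the build whenever the returned dictionary is non-empty, and print it so the regressed categories are visible in the PR.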
Current state: the eval harness is not yet built. The numbers in Benchmarks are SPEC targets, not measurements. They will be replaced with measured results — including any categories that miss the target — when the harness lands.
Targets (from SPEC.md — not yet measured)
| Metric | Target |
|---|---|
| Top-1 categorization accuracy | ≥ 85% |
| parse_ofx p95 on a 1 MB file | < 1.5 s |
| categorize_transaction p95 | < 400 ms |
| LLM cost per transaction | < $0.0005 |
| OFX coverage | 5/5 banks, ≥ 95% transactions parsed |
These are calibrated estimates, not measured values. The eval harness is on the roadmap.
| Subsystem | State |
|---|---|
| Unit tests | 181 passing (uv run pytest) |
| CSV parsing | 5 banks, header-fingerprint dialect detection; fixtures in tests/parsers/fixtures/ |
| OFX parsing | Generic ofxparse + cp1252 quirk fix; Itaú PIX-split stub; bank hint detection |
| Categorizer eval | Not yet — Inspect AI + 500-tx golden set on roadmap |
| Latency / cost benchmarks | Not yet — per-call traces land in logs/traces.jsonl; no aggregated report |
Targets will be replaced with measured numbers — including misses — when the eval harness lands.
| Choice | Why |
|---|---|
| Python 3.11+ | mcp SDK is Python-first; StrEnum for the taxonomy |
| uv | Reproducible, fast installs; all dev commands are uv run … |
| FastMCP (mcp ≥ 1.9) | Official SDK; stdio transport = compatible with any standard MCP client |
| Pydantic v2 | Every tool input/output is a model: MCP spec compliance and input validation in one layer |
| ofxparse + encoding fix | Cheaper than rolling a custom SGML parser; the cp1252 quirk is a 10-line patch |
| chardet | OFX file extensions lie about encoding; content is sniffed |
| Anthropic Haiku 4.5 | Cheap and fast for short PT-BR text classification; ~$0.0005/tx target |
| Decimal (stdlib) | Money is never float |
| ruff | Lint + format in one tool |
| pytest + pytest-asyncio | 181 tests today |
Done in v1:
- Inspect AI eval harness (500-tx golden set, CI regression gate, mocked-LLM run)
- Three-tier cascade — rules → embeddings (BGE-m3 k-NN) → LLM
- Ollama offline backend (Phi-3.5) with bounded timeouts and startup health check
- Stateless per-file duplicate annotation (Transaction.duplicate_of)
- Optional OpenTelemetry metrics (mcpforge_tool_calls_total, mcpforge_tool_latency_ms)
- docs/ARCHITECTURE.md with the full module map and design rationale
Open for v1.x:
- Real-LLM eval baseline: the current 100% in evals/baseline_run.json is against the deterministic mock in evals/_mock_llm.py. Running once against live Haiku 4.5 (cost ~$0.20–0.30 for the full 500-tx set) is what validates the SPEC ≥85% target. Procedure documented in evals/README.md.
- Per-bank postprocess modules for Bradesco, Nubank, Inter, BB (blocked on samples; see Contributing): today the generic OFX/CSV parser handles all five banks via header-fingerprint detection, but bank-specific quirks (PIX splits, embedded metadata, encoding edge cases) need real exports to extract and test against.
- Itaú PIX-split merge (blocked on samples): stub at parsers/itau.py, passthrough pending a real multi-line OFX sample.
Out of scope for v1 (v2 candidates):
- Configurable taxonomy
- Cross-import duplicate detection (stateful)
- Investment account types (CDB, FIIs, Tesouro Direto)
- .xlsx / PDF imports
- CNAE LLM-generated summaries on top of BrasilAPI
- File imports only. No Open Finance / direct bank API; Itaú/Bradesco API access requires institutional registration and Open Finance compliance — out of scope.
- No PDF or .xlsx. OFX and CSV only.
- Per-bank quirk fixes are Itaú-only. Bradesco/Nubank/Inter/BB are parsed by the generic OFX/CSV pipeline. Bank-hint detection identifies them correctly, but no bank-specific postprocess runs — pending real anonymised statement samples (see Contributing). The Itaú PIX-split merge is also a passthrough stub for the same reason.
- Eval accuracy is mock-LLM only. The 100% baseline in evals/baseline_run.json is against a deterministic keyword mock, not a live Anthropic call. A real-LLM run is needed before claiming the SPEC ≥85% target.
- Single-process, single-user, local only. stdio transport; no multi-tenant or hosted deployment.
- BrasilAPI has no paid fallback. If it's unavailable, lookup_cnae returns cnae_available: false and does not crash, but there is no alternative data source.
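The last limitation, combined with the 24 h in-memory cache mentioned under the architecture, suggests a lookup shaped roughly like the sketch below. TtlCache, lookup_cnae_sketch, and the injected fetch callable are hypothetical stand-ins; the real lookup_cnae and its BrasilAPI client are not shown in this README.

```python
import time
from typing import Callable, Optional

TTL_SECONDS = 24 * 60 * 60


class TtlCache:
    """Minimal in-memory cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_s: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl_s:
            del self._entries[key]  # expired: drop and report a miss
            return None
        return value

    def put(self, key: str, value: dict) -> None:
        self._entries[key] = (self._clock(), value)


def lookup_cnae_sketch(cnpj: str, fetch: Callable[[str], dict], cache: TtlCache) -> dict:
    """Serve from cache when fresh; on upstream failure, report it without raising."""
    cached = cache.get(cnpj)
    if cached is not None:
        return cached
    try:
        payload = fetch(cnpj)  # hypothetical network call, injected by the caller
    except Exception:
        # Upstream unavailable: signal it and do not cache the failure.
        return {"cnae_available": False}
    result = {"cnae_available": True, **payload}
    cache.put(cnpj, result)
    return result
```

Injecting the clock and the fetch function keeps the sketch testable offline; failures are deliberately not cached so the next call retries the upstream.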
Open an issue or PR. Before submitting, uv run pytest && uv run ruff check && uv run ruff format --check must be clean.
Highest-value contribution: anonymised OFX/CSV statement samples from any of the five supported banks. Specifically needed:
| Bank | What unlocks |
|---|---|
| Itaú | Real OFX with PIX → implement the multi-line merge in parsers/itau.py |
| Bradesco | Any OFX/CSV → create parsers/bradesco.py if quirks justify it |
| Nubank | CSV with proprietary description metadata → parsers/nubank.py |
| Inter | OFX with cp1252/ISO-8859 edge cases → confirm encoding fix coverage |
| BB | OFX missing payee CNPJ on specific transaction types → parsers/bb.py |
Anonymise by masking CPFs/CNPJs (the repo has mask_br_documents if useful), replacing amounts with synthetic values, and clearing personal names/addresses. Structure and field shapes are what matter for fixture-based tests.
MIT — see LICENSE.