150 changes: 150 additions & 0 deletions MEETING_PREP.md
# Meeting Prep — Field Extraction Quality

## The Two Root Causes of Poor Scores

All the fields with bad scores trace back to exactly two problems:

---

### Problem 1 — Schema Type Mismatch (Multi-Span Fields)

The model is designed to extract **one value per field**, but many fields in CUAD have **multiple valid answer spans per document** — sometimes up to 55. When the model extracts 1 span and the GT has 6, scoring that as 50% or even 16.7% misrepresents what actually happened: the model found the right thing; it just could not return all of them under a single-value schema.

This is a **schema architecture issue**, not a model quality issue. The fix is to change these fields from `string` → `array` type in the schema, so the model is prompted to return a list and the evaluator knows to score against each span individually (a minimal sketch of the change follows the table below).

**Fields that must become array type (sorted by severity):**

| Field | % of Docs with GT | Docs with Multiple Spans | Max Spans in One Doc |
|-------|-------------------|--------------------------|----------------------|
| Parties | 99.8% | 508 / 509 | **55** |
| Post-Termination Services | 35.7% | 91 / 182 | **23** |
| Rofr/Rofo/Rofn | 16.7% | 62 / 85 | **22** |
| Audit Rights | 42.0% | 136 / 214 | **19** |
| Insurance | 32.5% | 115 / 166 | **17** |
| Cap on Liability | 53.9% | 161 / 275 | **16** |
| License Grant | 50.0% | 164 / 255 | **16** |
| IP Ownership Assignment | 24.3% | 79 / 124 | **15** |
| Minimum Commitment | 32.4% | 98 / 165 | **14** |
| Non-Compete | 23.3% | 61 / 119 | **12** |
| Exclusivity | 35.3% | 104 / 180 | **12** |
| Non-Transferable License | 27.1% | 68 / 138 | **12** |
| Change of Control | 23.7% | 72 / 121 | **11** |
| Joint IP Ownership | 9.0% | 21 / 46 | **11** |
| Affiliate License-Licensor | 4.5% | 15 / 23 | **10** |
| Affiliate License-Licensee | 11.6% | 26 / 59 | **10** |
| Liquidated Damages | 12.0% | 27 / 61 | **10** |
| Irrevocable or Perpetual License | 13.7% | 36 / 70 | **10** |
| Warranty Duration | 14.7% | 37 / 75 | **10** |
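As a minimal sketch of the schema change, assuming a JSON-Schema-style field definition (the project's actual schema format, variable names, and description text here are illustrative), a multi-span field such as `audit_rights` would move from a single string to an array of strings:

```python
# Illustrative only: the real schema definition in this repo may be shaped differently.

# Before: single-value field; the model can only ever return one clause.
AUDIT_RIGHTS_BEFORE = {
    "type": "string",
    "description": "Extract the clause granting audit rights.",
}

# After: array field; the model is prompted to return every distinct clause,
# and the evaluator can score each ground-truth span independently.
AUDIT_RIGHTS_AFTER = {
    "type": "array",
    "items": {"type": "string"},
    "description": (
        "Extract every clause granting or constraining audit rights "
        "(right to inspect, timing, cost allocation, frequency limits, scope). "
        "Return each distinct clause as a separate item."
    ),
}
```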

**Concrete example — Audit Rights (PACIRA doc):**
- GT has **6 spans**: each covering a different obligation (right to inspect, timing, cost allocation, frequency limit, scope, etc.)
- Model extracts **1 span** (the main right-to-audit sentence)
- Current score: **~17%** (but displayed as 50% due to UI bucketing)
- Correct framing: model found the primary clause correctly, but missed 5 supporting clauses
- Fix: schema expects array → model returns array → each span scored independently → Span Recall = 1/6 = 16.7%, which is honest

---

### Problem 2 — Field Descriptions Too Narrow (False Negatives)

The model is looking for structured, clean values (a date, a duration, a yes/no flag) but CUAD's ground truth for several fields is the **full clause text** — a complete sentence or paragraph. The model returns null because it doesn't find a clean value, even though the clause is right there in the document.

This is a **field description issue**. The fix is to broaden the extraction prompt to accept clause-level text, not just structured values.

**Affected fields with GT examples:**

#### `expiration_date`
The model looks for a date like `2024-12-31`. But CUAD GT is:
- `"The term of this Agreement shall be ten (10) years which shall commence on the date..."`
- `"The Contract is valid for 5 years, beginning from and ended on ."`
- `"This Agreement shall commence on the Effective Date and shall continue for a period of six (6) months..."`

The contract describes the term in prose — there is no clean date to extract.

**Current description (too narrow):**
> Extract the date on which the contract's initial term expires.

**Improved description:**
> Extract the expiration date or term duration of the contract. This may be a specific date (e.g. "December 31, 2025"), a duration from the effective date (e.g. "two years from the Effective Date"), or a full clause describing the term length (e.g. "The term of this Agreement shall be ten (10) years commencing on..."). If no explicit expiration is stated but the contract term is described, extract the full term description clause.

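Wiring the broader description in is a one-line swap wherever the field prompts are defined. A hypothetical sketch (the registry shape, location, and entry name are assumptions; only the description text above is the actual proposal):

```python
# Hypothetical field-registry entry; adapt to wherever field descriptions
# actually live in this project.
EXPIRATION_DATE_FIELD = {
    "name": "expiration_date",
    "type": "string",  # stays single-valued; only the description broadens
    "description": (
        "Extract the expiration date or term duration of the contract. "
        "This may be a specific date (e.g. 'December 31, 2025'), a duration "
        "from the effective date (e.g. 'two years from the Effective Date'), "
        "or a full clause describing the term length. If no explicit "
        "expiration is stated but the contract term is described, extract "
        "the full term description clause."
    ),
}
```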
---

#### `warranty_duration`
The model looks for a duration like `24 months`. But CUAD GT is:
- `"If, within the twenty-four (24) month warranty period set forth above..."` (full sentence)
- `"Within 7 days after the arrival of the goods at destination, should the quality... be found not in conformity..."` (a relative timeframe clause, not a clean duration)
- `"Google warrants that the Distribution Products will for a period of [*] from the date of supply..."` (redacted period, full warranty clause)
- Up to **8 spans** in a single document covering different warranty obligations

**Current description (too narrow):**
> Extract the duration of any warranty against defects or errors.

**Improved description:**
> Extract all clauses that describe warranty duration or warranty obligations related to defects, errors, or product performance. This includes specific durations (e.g. "24 months"), relative timeframes (e.g. "within 7 days of delivery"), and full warranty clauses that define the warranty period and its conditions. Extract each distinct warranty clause as a separate item if multiple warranties apply.

---

#### `post_termination_services`
The model may look for a simple yes/no or a short description. But CUAD GT is full clause text describing obligations that survive termination — IP handover, wind-down, continued performance, transition assistance. Up to **23 spans** in a single document.

**Current description (too narrow):**
> Identify if a party has obligations after the contract terminates or expires.

**Improved description:**
> Extract all clauses describing obligations that apply after the contract terminates or expires. This includes: transition assistance, IP or data transfer on termination, wind-down commitments, last-buy rights, continued performance during notice periods, payment obligations surviving termination, and any other post-termination duties explicitly stated. Extract the full clause text for each distinct obligation, not just a yes/no flag.

---

#### `rofr_rofo_rofn`
CUAD GT for this field is multi-span in most documents (62 of the 85 docs with GT have multiple spans, max 22). The full mechanism of a right of first refusal requires multiple clauses to be legally meaningful — the trigger event, the notice period, the response window, what happens if the right is not exercised.

**Current description (too narrow):**
> Does the contract grant a right of first refusal, right of first offer, or right of first negotiation?

**Improved description:**
> Extract all clauses related to any right of first refusal (ROFR), right of first offer (ROFO), or right of first negotiation (ROFN). This includes: the clause granting the right, the triggering event or condition, the notice and response procedures, the time window to exercise the right, what happens if the right is not exercised, and any carve-outs or exceptions. Each component clause should be extracted separately.

---

## What This Means for Scores

Before any fixes, here is the honest state of scoring:

| Issue | What the score shows | What it actually means |
|-------|---------------------|----------------------|
| Parties extracted 2 of 5 spans | 50% (UI bucket) | Model found 2 of 5 named parties — partial, not failed |
| Audit Rights extracted 1 of 6 | 50% (UI bucket) | 1/6 = 16.7% span recall — model found the right clause but missed supporting ones |
| Expiration Date returned null | 0% | Field description mismatch — the clause exists but model couldn't parse it as a date |
| Warranty Duration returned null | 0% | Same — GT is a full clause, model expected a clean duration |

---

## Prioritised Fix List for Tomorrow

### Quick wins (field description changes only — no code changes):
1. [ ] Update `expiration_date` description → accept term duration clauses
2. [ ] Update `warranty_duration` description → accept full warranty clauses, expect array
3. [ ] Update `post_termination_services` description → full clause text, expect array
4. [ ] Update `rofr_rofo_rofn` description → multi-clause extraction, expect array

### Schema changes (require schema + evaluator update):
5. [ ] Change `parties` to array type — this is the most impactful single change (99.8% of docs, up to 55 spans)
6. [ ] Change `audit_rights` to array type
7. [ ] Change `cap_on_liability` to array type
8. [ ] Change `license_grant` to array type
9. [ ] Change `post_termination_services` to array type

### Metric changes (require benchmarking pipeline update):
10. [ ] Add **Span Recall** metric — number of GT spans covered / total GT spans
11. [ ] Add raw field counts — "extracted X of Y expected fields" per document
12. [ ] Fix document score formula inconsistency between `import_to_db.py` and `document_service.py`

---

## What to Show in the Meeting

1. **The multi-span table above** — shows clearly that the scoring problem is architectural, not model quality
2. **The Parties example** — 508 out of 509 documents have multiple party spans (up to 55). Extracting only the first party and scoring it as a failure is wrong by design.
3. **The Audit Rights example** — PACIRA doc has 6 GT spans. Model extracted 1. True Span Recall = 16.7%, not 50% or 0%.
4. **The 4 improved field descriptions above** — concrete, implementable changes that will immediately improve false negative rates on `expiration_date`, `warranty_duration`, `post_termination_services`, `rofr_rofo_rofn`.
5. **The confidence display fix** — already done. Fields now show real values (e.g. 88%) instead of snapping to 50%. New thresholds pending: ≥0.85 green, ≥0.60 amber, <0.60 red.
209 changes: 209 additions & 0 deletions TODO.md
# Forage Project — To-Do

---

## 1. Schema: Multi-Span Fields

**Problem:** Some fields (e.g. Audit Rights) have multiple valid answer spans in the ground truth — up to 6. The model currently extracts only one span, so scoring 1/6 as 50% is wrong; it should be ~16.7%.

**Action Items:**
- [ ] Change the schema format for multi-span fields from a single string to an array/list type (see the sketch after this list)
- [ ] For each document, the schema should declare how many spans are expected per field so the model and evaluator both know what to target
- [ ] Identify which of the 41 fields are multi-span (start with Audit Rights, License Grant, Parties, Governing Law)
- [ ] Stephen to confirm the full list of fields that need array type in schema

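A Pydantic-style sketch of what the array change could look like (purely illustrative; the project may define its extraction schema differently, and the fields shown are just a sample):

```python
from typing import Optional
from pydantic import BaseModel, Field

# Illustrative only: the real extraction schema may not use Pydantic at all.
class ContractExtraction(BaseModel):
    # Multi-span fields become lists so the model is prompted for every clause.
    parties: list[str] = Field(
        default_factory=list,
        description="Every party named in the agreement, one entry per party.",
    )
    audit_rights: list[str] = Field(
        default_factory=list,
        description="Every clause granting or limiting audit rights, one entry per clause.",
    )
    # Fields that stay single-span keep a scalar type.
    expiration_date: Optional[str] = Field(
        default=None,
        description="The expiration date or term duration of the contract.",
    )
```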
---

## 2. Confidence Display Fix (Documents Page)

**Context:** Confidence shown per field in the Documents page currently passes through multiple stages:

1. **Base confidence** — LLM's self-reported certainty (0–1 float in JSON response)
2. **Adjusted confidence** — overwritten by validation pipeline: `base × validation_score × penalty × grounding signal`, then bucket-snapped to `passed=1.0 / partial=0.5 / failed=0.0`
3. **MQS override** — final stored value is the MQS score (`er.confidence = mqs.final_score` at line 838)

The 50% display is a UI quirk: `ui.js` (lines 51–63) snaps any value between 0.49 and 0.98 to "medium → 50%", making nearly everything show as 50%.

**What the score actually is:** The MQS score (measures trustworthiness of extraction — no ground truth involved). It is **not** an accuracy score.

**Action Items:**
- [x] Fix `ConfPill` to display the real float value (e.g. 87%) instead of snapping to 0/50/100
- [ ] Apply updated confidence tier thresholds:

| Tier | Threshold | Rationale |
|------|-----------|-----------|
| High (green) | ≥ 0.85 | Model is confident, well-evidenced extraction |
| Medium (amber) | ≥ 0.60 | Hedged or weak evidence |
| Low (red) | < 0.60 | Unreliable, treat with caution |

> Current `>= 0.99` threshold for "high" is unrealistically tight — almost nothing reaches it.

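The tier mapping itself is a three-way threshold check; sketched in Python below for reference, though the actual change lands in the frontend (`ConfPill` / `ui.js`):

```python
def confidence_tier(score: float) -> str:
    """Map a 0-1 confidence value to the proposed display tiers."""
    if score >= 0.85:
        return "high"    # green: confident, well-evidenced extraction
    if score >= 0.60:
        return "medium"  # amber: hedged or weak evidence
    return "low"         # red: unreliable, treat with caution
```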
---

## 3. New Metrics: Span Recall

**Context:** Three metrics exist today:

| Metric | What it measures | Where it lives |
|--------|-----------------|----------------|
| Token F1 | Word overlap between extraction and full GT text. Partial credit for matching words. | Benchmarks page |
| LLM Judge Score | Correct / Partial / Incorrect vs GT | Benchmarks page |
| MQS | Trustworthiness of extraction (no GT needed) | Documents page |

**Missing metric — Span Recall:** Measures how many GT answer spans are fully covered by the extraction. Binary per span (covered or not).

Example — Audit Rights (6 GT spans, model extracted 1):

| Metric | Score | Why |
|--------|-------|-----|
| Token F1 | ~20–30% | Some word overlap but most words missing |
| Span Recall | 16.7% | 1 of 6 spans covered |

For legal contracts, Span Recall is more meaningful than Token F1 — a partially captured clause may have no legal value at all.

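A minimal sketch of the metric, assuming "fully covered" means the normalised GT span appears inside at least one extracted value (the coverage test and normalisation rules are assumptions to confirm before implementing):

```python
import re

def _norm(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting differences don't count.
    return re.sub(r"\s+", " ", text).strip().lower()

def span_recall(gt_spans: list[str], extracted: list[str]) -> float:
    """Fraction of ground-truth spans fully covered by at least one extracted value."""
    if not gt_spans:
        return 1.0  # no GT spans for this field; nothing to recall
    extracted_norm = [_norm(e) for e in extracted]
    covered = sum(
        1 for span in gt_spans
        if any(_norm(span) in e for e in extracted_norm)
    )
    return covered / len(gt_spans)

# Audit Rights example above: 6 GT spans, 1 covered -> 1/6 ≈ 0.167
```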
**Action Items:**
- [ ] Implement Span Recall metric in the benchmarking pipeline
- [ ] Add Span Recall column to the Benchmarks page per-field breakdown
- [ ] Decide whether Span Recall should also influence the document-level score

---

## 4. Two Scoring Systems — Clarify and Connect

**Context:** There are currently two completely separate scoring systems with no connection between them:

| System | Page | What it measures | Needs GT? |
|--------|------|-----------------|-----------|
| MQS | Documents page | Can we trust this extraction? (agreement, grounding, schema, history) | No |
| LLM Judge + Token F1 | Benchmarks page | Is this extraction actually correct vs CUAD GT? | Yes |

Span Recall doesn't exist anywhere yet — discussed as a potential improvement.

**Important nuance on document score:**
- **Benchmarks page** `document_score` = LLM correct rate (%) if available, otherwise Token F1 (%) — defined in `import_to_db.py`
- **Backend pipeline** `document_score` = average of `field_validation_score × (1 − min(field_penalty_total, 1))` across all fields — defined in `document_service.py`
- The Benchmarks.js UI describes DOC as "penalty-adjusted aggregate", but the import code uses LLM-correct% or F1% — **these are inconsistent and need to be aligned** (the sketch below contrasts the two definitions)

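To make the inconsistency concrete, the two current definitions side by side (a sketch; variable names are illustrative, the real logic lives in `import_to_db.py` and `document_service.py`):

```python
# Benchmarks import (import_to_db.py): accuracy-based.
def document_score_benchmark(llm_correct_rate, token_f1):
    # Prefer the LLM-judge correct rate; fall back to Token F1 when it is missing.
    return llm_correct_rate if llm_correct_rate is not None else token_f1

# Backend pipeline (document_service.py): penalty-adjusted aggregate, no GT involved.
def document_score_pipeline(fields):
    # fields: iterable of (validation_score, penalty_total) pairs, one per field.
    scores = [v * (1 - min(p, 1)) for v, p in fields]
    return sum(scores) / len(scores) if scores else 0.0
```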
**Action Items:**
- [ ] Decide whether MQS and accuracy scores should be shown side-by-side in the UI, or kept separate
- [ ] Align the `document_score` formula between benchmark import and pipeline scoring — pick one definition and apply it consistently
- [ ] Add a tooltip or label in the UI making clear which score is MQS (trust) vs accuracy (vs GT)

---

## 5. Accuracy Prompt Changes

**Context:** The accuracy prompt change (collapsing whitespace, ignoring quotes) applies **only** to the LLM Judge score on the Benchmarks page. It has **no effect on MQS**.

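For the documentation item below, a sketch of the kind of normalisation described (the exact rules should be copied from the judge prompt/code; these two steps are assumptions based on the description above):

```python
import re

QUOTE_CHARS = "\"'“”‘’"

def normalise_for_judge(text: str) -> str:
    """Illustrative pre-comparison normalisation: collapse whitespace, ignore quotes."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.strip(QUOTE_CHARS)            # drop surrounding straight or curly quotes
```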
**Action Items:**
- [ ] Verify the whitespace/quote normalisation changes are working correctly on the next benchmark run
- [ ] Document what normalisations are applied before LLM judge comparison so they're not accidentally changed later

---

## 6. False Positives — Model Over-Extracts

**Context:** The model is biased toward over-extraction. Example: NELNETINC document — model extracted `2020-03-27` as effective date from "Dated: March 27, 2020", but CUAD GT has no effective date for this document. The extraction may actually be correct, but it counts as a false positive against the benchmark.

**Rule:** If GT is absent but model predicted a value → scored as incorrect.

**Action Items:**
- [ ] Review all false positive cases in the current benchmark run
- [ ] Flag cases where CUAD GT may be incomplete/wrong vs cases where the model is genuinely hallucinating
- [ ] Consider adding a "model may be right, GT may be wrong" flag to debatable false positives

---

## 7. False Negatives — Model Misses GT Fields

**Context:** 5 cases across 4 fields where CUAD has a value but model returned nothing (as of last 5-doc run):

| Field | Missed in (docs) | Root Cause |
|-------|-----------------|------------|
| expiration_date | 2/5 | GT is a duration clause ("10 years"), model looks for a date |
| rofr_rofo_rofn | 1/5 | Field prompt/description too narrow |
| post_termination_services | 1/5 | Field prompt/description too narrow |
| warranty_duration | 1/5 | GT is a relative clause, model looks for a structured duration |

**Action Items:**
- [ ] Update field descriptions for `expiration_date`, `rofr_rofo_rofn`, `post_termination_services`, `warranty_duration` to accept duration expressions and full clause sentences, not just structured values
- [ ] Re-run benchmark after description changes and compare recall
- [ ] Watch FP rate alongside recall — broader descriptions can increase false positives

---

## 8. Metrics: Field-Level Recall and False Positive Rate

**Definitions:**
- **Field Recall %** — of all fields where the contract has a clause (GT non-empty), what % did the model extract? (Last run: 89.8%)
- **FP Rate %** — of all fields where the clause is absent (GT empty), what % did the model wrongly extract? (Last run: 5%)

**Also needed — raw counts metric:**
- [ ] Add a metric that shows: fields extracted / fields that should have been extracted (raw integers, not just %); see the sketch below
- [ ] Surface this per-document on the Documents page and per-field on the Benchmarks page

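A sketch that returns the two ratios above together with the raw counts asked for in the action item (the counting conventions are assumptions):

```python
def field_level_metrics(results):
    """
    results: list of (gt_present, predicted) booleans, one pair per field per document.
    Returns recall, false-positive rate, and the raw counts behind them.
    """
    with_gt    = [pred for gt, pred in results if gt]
    without_gt = [pred for gt, pred in results if not gt]
    extracted  = sum(with_gt)      # fields where GT exists and the model extracted something
    false_pos  = sum(without_gt)   # fields extracted where GT says the clause is absent
    return {
        "field_recall": extracted / len(with_gt) if with_gt else None,
        "fp_rate": false_pos / len(without_gt) if without_gt else None,
        "extracted": extracted,    # raw: fields extracted
        "expected": len(with_gt),  # raw: fields that should have been extracted
    }
```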
---

## 9. Fields to QA / Spot-Check

The following fields need manual spot-checking across multiple documents for extraction quality:

- [ ] Parties
- [ ] Termination for Convenience
- [ ] Affiliate License-Licensor
- [ ] Affiliate License-Licensee
- [ ] Governing Law
- [ ] Cap on Liability
- [ ] License Grant
- [ ] Audit Rights
- [ ] Anti-Assignment
- [ ] Exclusivity
- [ ] Minimum Commitment
- [ ] No-Solicit of Customers
- [ ] Non-Compete
- [ ] Revenue/Profit Sharing
- [ ] Renewal Term
- [ ] Rofr/Rofo/Rofn

---

## 10. Dashboard — Evidence and GT Display

**Current state:**
- Amber highlight = evidence (the grounding text from the contract)
- Blue highlight = first match / extracted field value
- Ground truth is **not** shown on the dashboard

**Action Items:**
- [ ] Decide whether GT should be surfaced on the Documents page (useful for debugging, but only relevant for benchmarked documents)
- [ ] If yes, add a GT column/panel that shows the CUAD ground truth span alongside the model extraction

---

## 11. Seed Data

- [ ] Review `seed_data.json` — check what documents and fields are seeded, whether they are representative, and whether any updates are needed

---

## 12. More Loan Agreement Documents

**Context:** Current benchmark runs are on a small set of documents. More loan-specific documents are needed for meaningful evaluation.

**Action Items:**
- [ ] Source additional loan agreement documents (note: CUAD skews toward tech/SaaS — loan-specific clauses like interest rate, collateral, default triggers are not well represented)
- [ ] Run extraction pipeline on the new documents
- [ ] Add them to the benchmark dataset with ground truth labels
- [ ] Check whether CUAD field descriptions map correctly to loan agreement language — some fields may need loan-specific wording

---

## 13. Expand Beyond 41 CUAD Fields

**Context:** CUAD defines 41 fields. The project may need additional fields not covered by CUAD, particularly for loan agreements.

**Action Items:**
- [ ] Audit the CUAD dataset for any unique fields not in the current 41 (check raw data for edge cases)
- [ ] Identify loan-specific fields missing from CUAD (e.g. interest rate, collateral, default triggers, prepayment penalty, covenant compliance)
- [ ] Define schema entries for any new fields: field name, description, answer format, expected type (string / array / date / yes-no)
- [ ] Discuss with Stephen which new fields are in scope for the next milestone