Skip to content

merge: dev → master (admin_users scope migration v1.5 + authorship_review table + reporter ETL + abstract fixes)#94

Open
paulalbert1 wants to merge 32 commits into
masterfrom
dev
Open

merge: dev → master (admin_users scope migration v1.5 + authorship_review table + reporter ETL + abstract fixes)#94
paulalbert1 wants to merge 32 commits into
masterfrom
dev

Conversation

@paulalbert1

Copy link
Copy Markdown
Contributor

Propagates dev to master (prod). You control the merge — opened for review, not auto-merged.

Key contents (dev not in master)

  • setup/alter_add_admin_user_scope_columns_v1.5.sql — idempotent ALTER adding scope_person_types/scope_org_units/proxy_person_ids (JSON NULL) to admin_users. PM-deploy blocker (PM dev SELECTs these on every login). Reviewed APPROVE.
  • setup/table_authorship_review.sql — durable authorship_review table (authorship-review initiative).
  • NIH RePORTER ETL + provenance; abstract-corruption repair fixes; 4 Feature Generator schema fields.

⚠️ Merging does NOT apply the SQL

The pipeline does not run setup/*.sql DDL against the DB. The v1.5 ALTER must be applied directly to each existing DB (the prod reciterdb is being updated separately as part of this work). Merging only updates the repo/fresh-build path.

Divergence

master is 35 commits ahead of dev (bidirectional divergence) — this is a real integration, reconcile on merge. v1.5 ALTER confirmed not present on master.

paulalbert1 and others added 26 commits March 5, 2026 12:16
The analysis_nih table was loaded with ~527K rows (2x duplicates) on
Dec 18, 2025. Since then, every nightly run retrieves the correct ~267K
rows but validation rejects the swap (267K/527K = 50.7% < 80% threshold).

Changes:
- validate_data() now detects duplicate pmids in production and uses the
  unique count for percentage comparison, allowing the swap to self-heal
- create_staging_tables() adds a UNIQUE constraint on pmid for
  analysis_nih_new to prevent future duplicate inserts
- Added check_production_integrity() utility for diagnostic use
- Schema: changed analysis_nih.idx_pmid from KEY to UNIQUE KEY

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update README, cleanup .gitignore, remove unused artifacts
…ass-dev

Fix: NIH validation blocking table swap for 77+ days
…standalone SP

STEP 4: Replace person.title with person_person_type LEFT JOIN chain
(COALESCE) for facultyRank derivation. Replace article counts that
counted all articles with filtered counts (Research Article +
percentileNIH IS NOT NULL only).

STEP 5: Replace all 8 combined UPDATE statements (using wrong
threshold-percentage formula) with 24 separate UPDATE statements
matching the canonical createEventsProceduresReciterDb.sql v2:
- Percentile: AVG of top N articles ranked by percentileNIH DESC
- Denominator: Count of peers at same facultyRank meeting thresholds
- Rank: RANK() OVER (PARTITION BY facultyRank ORDER BY percentile DESC)

All 8 metrics (top5/10 x All/First/Senior/FirstSenior) now match the
canonical logic. Steps 1-3, 6, 7 verified identical and unchanged.

Ref: createEventsProceduresReciterDb.sql lines 3720-4252
Capture new top-level and feedbackEvidence fields emitted by ReCiter:
  - datePublicationAddedToPMC (person_article, analysis_summary_article)
  - feedbackScoreTextSimilarity (person_article)
  - feedbackScoreJournalTitleSimilarity (person_article)
  - feedbackScoreBibliographicCoupling (person_article)

Schema, nightly SP (v2 + inline copy), legacy loose-SQL insert,
CSV transformer, and LOAD DATA column list updated in lockstep.

Adds idempotent migration setup/alter_add_feature_generator_fields_v1.1.sql
for applying to prod and dev DBs out-of-band. ALTER must run BEFORE
deploying the updated ETL image, otherwise LOAD DATA fails on unknown
columns.
Applied to prod + dev on MariaDB 11.4 / 10.6 in ~25s (network-bound).
AFTER placement forces INPLACE algorithm with metadata lock; appending
at end allows INSTANT (no table rewrite, no lock hold).
The v2 populateAnalysisSummaryTables SP was selecting articleYear directly
from person_article without a fallback, leaving 74% of analysis_summary_article
rows with articleYear = 0. The legacy SP had a post-INSERT UPDATE step that
derived the year from publicationDateStandardized in those cases, which the
v2 rewrite dropped.

Push the fallback into the SELECT itself:
IF(articleYear != 0, articleYear, LEFT(publicationDateStandardized, 4))

Verified post-deployment: 0 rows with articleYear = 0 in prod and dev.
The iCite API returns two arrays per record:
  - cited_by: PMIDs that cite the queried article
  - references: PMIDs the queried article cites

The CSV write for cited_by entries used [cited_by_elem, queried_pmid],
but the LOAD DATA columns are (cited_pmid, citing_pmid). Since cited_by
elements are the CITING articles (not the cited), this stored every
cited_by edge with the cited/citing columns swapped. The references
loop was already correct. Same inversion affected analysis_nih_cites_clin.

Effect: queries like
  SELECT COUNT(*) FROM analysis_nih_cites WHERE cited_pmid = <pmid>
returned only the rows sourced from other queried PMIDs' references
arrays, missing the (typically much larger) cited_by set. Example:
PMID 32432483 has iCite citation_count=192 but only 19 rows surfaced
because just 19 WCM-tracked papers happened to reference it.

After this change, on the next nightly run the table swap will load
analysis_nih_cites with semantically correct (cited_pmid, citing_pmid)
columns, matching the join used by downstream consumers
(Scholars-Profile-System/lib/api/publication-detail.ts).
fix(retrieveNIH): correct column inversion in analysis_nih_cites load
abstractImport.py wrote abstracts to a CSV with csv.writer, then bulk-loaded
it with LOAD DATA LOCAL INFILE. The two are incompatible: csv.writer emits
\r\n line endings and doubled-quote escaping (""), while the LOAD used
LINES TERMINATED BY '\n' and ENCLOSED BY '"' with backslash escaping. Any
abstract containing a double quote desynced MySQL's row parser, so ~99.9% of
every file was dropped with no error raised (~43 of ~72,000 rows loaded).
The unresolved PMIDs were re-selected every cycle and the unbounded
while-True loop ran until the 15,000s pipeline timeout, restart-looping
indefinitely and blocking the nightly CronJob.

- Replace CSV + LOAD DATA with a parameterized, batched executemany INSERT
  so abstract text is stored verbatim regardless of content.
- Bound the fetch/insert loop: stop on no progress, cap at 25 cycles, so
  unresolvable PMIDs can no longer hang the pipeline.
- Retry batch_get_item UnprocessedKeys with backoff instead of dropping
  throttled keys.
- Connect with charset=utf8mb4; drop the now-unused local_infile flag.
- Add a --dry-run mode that verifies the fetch/insert path against a
  TEMPORARY table without touching reporting_abstracts.
fix(abstractImport): replace CSV bulk-load that silently dropped rows
Adds a new nightly ETL step (retrieveReporter.py) that pulls grant metadata
and pub-grant linkages from NIH RePORTER (api.reporter.nih.gov/v2) and
reconciles them against the existing PubMed-derived person_article_grant
table.

Three new tables (alter_add_reporter_fields_v1.2.sql):
  - grant_reporter_project — RePORTER /projects/search results, refreshed
    each cycle (truncate-reload). Includes abstract_text for cross-reference.
  - grant_reporter_link — RePORTER /publications/search (pmid, appl_id) pairs.
  - grant_provenance — long-lived per-(person, pmid, grant) audit log with
    source_reporter, source_reciterdb, *_first_seen, last_verified. Survives
    the nightly truncate-reload of person_article_grant. Keyed by
    (personIdentifier, pmid, core_project_num) where core_project_num is the
    normalized NIH grant identifier (e.g. R01DK127777).

ETL strategy:
  - Projects: filtered by org_names=["WEILL MEDICAL COLL OF CORNELL UNIV"],
    partitioned by fiscal year to stay under the 9,999 offset cap (WCM has
    ~15K projects historically). 1 req/sec rate limit honored.
  - Publications: keyed by appl_ids from projects, batched.
  - Reconciliation: bulk INSERT...SELECT ON DUPLICATE KEY UPDATE on both
    sides. Reciterdb side stages normalized grant strings via temp table +
    LOAD DATA LOCAL INFILE; RePORTER side joins grant_reporter_link to
    person_article (userAssertion='ACCEPTED') as the false-positive guard.
  - Subaward caveat: WCM-as-subaward will be missed by the org filter
    (false-negative tradeoff accepted to keep false positives near zero).

Validated on dev: 235K provenance rows from 33K reciterdb + 132K RePORTER
inputs. End-to-end ~18 min (15 min publications fetch is the rate-limited
floor; reconciliation completes in ~10 sec via bulk SQL).
RePORTER /projects/search returns NIH-curated keyword vocabularies
alongside the abstract. Capture them into grant_reporter_project so the
Scholars-Profile-System funding ETL can project them onto grants as a
topical search signal (issue #291).

- alter_add_reporter_terms_v1.3.sql: ADD COLUMN project_terms, pref_terms
  via the information_schema-guarded idiom, safe on the live table.
- v1.2 CREATE TABLE: mirror the two columns for fresh builds.
- retrieveReporter.py: pull `terms` and `pref_terms` from each project
  dict, stored raw (terms angle-bracket-wrapped, pref_terms
  semicolon-delimited).
feat(reporter): NIH RePORTER ETL + project-term capture (#291) [dev]
…t reload

fetch_projects() partitions the search by fiscal year when the corpus
exceeds RePORTER's 9,999 offset cap (WCM has ~15K projects). RePORTER
returns a multi-year project under every fiscal year it was active, so
the same appl_id comes back in multiple FY pages. appl_id is
grant_reporter_project's PRIMARY KEY, so the unguarded reload hit
"IntegrityError 1062: Duplicate entry for key 'PRIMARY'" and the run
aborted after TRUNCATE had already emptied the table.

Dedup the projects loop by appl_id, mirroring the seen_pairs guard the
publications loop already uses.
fix(reporter): dedup projects by appl_id before grant_reporter_project reload [dev]
run_all.py invokes retrieveReporter.py, but the Dockerfile's per-file
COPY list was never updated when the RePORTER ETL step was added (PR #81),
so the built image lacked the script. Every nightly run crashed at step 4
with "can't open file '/usr/src/app/retrieveReporter.py'", halting the
pipeline before nightly indexing, abstractImport, and conflictsImport.
With restartPolicy: OnFailure this produced an indefinite ~95-min restart loop.
fix(docker): add missing COPY for retrieveReporter.py
…SV/LOAD DATA path

The pre-PR-#78 abstractImport.py wrote rows with csv.writer (RFC-4180
doubled-quote escaping) and LOAD DATA LOCAL INFILE (backslash escaping).
On rows whose abstract text contained ", MySQL's parser desynced and
consumed multiple TSV rows into one field until the next recoverable
delimiter, producing a single reporting_abstracts row attributed to one
pmid but containing concatenated text from several other papers in the
same DynamoDB batch.

The May 16 fix (PR #78) stopped new corruption but did not repair
existing rows. fetch_missing_pmids() uses LEFT JOIN ... WHERE a.pmid IS
NULL, so any pmid with a (corrupted) row was permanently skipped.

Audit against prod found 3,124 corrupted rows out of 391,238 total.
84% are pegged at the 65,535-byte BLOB cap with the original pmid's
abstract at the head and text from several unrelated papers at the tail.
181 additional rows are clean-pair duplicates from the same era (no
UNIQUE constraint on pmid; concurrent old-import runs produced identical
content twice). Verification against DynamoDB also spared 464 legitimately
long abstracts (structured / consensus / multi-arm trials, lengths up to
~32K) that a length-only filter would have wrongly purged.

This change ships the tooling and schema for the cleanup:

- update/auditAbstracts.py -- read-only forensic audit. Compares
  reporting_abstracts.abstract against DynamoDB.PubMedArticle (the same
  source abstractImport.py uses) and classifies each row >= 4000 chars
  as CLEAN, PREFIX_CORRUPTED, DISJOINT, EMPTY_IN_DYNAMO, or
  MISSING_IN_DYNAMO. Writes audit_abstracts.csv plus a dump of the worst
  offenders.

- update/repairAbstracts.py -- destructive cleanup. Backs up affected
  rows to a timestamped table, deletes corrupted rows in batches, then
  dedupes any remaining pmid duplicates by keeping MIN(id). Requires
  --apply; default is dry-run. Confirms the v1.4 migration precondition
  (no duplicates) after running.

- setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql -- adds UNIQUE
  KEY on reporting_abstracts.pmid. information_schema-guarded; aborts
  if any duplicates remain. Mirrors the analysis_nih fix from PR #71
  after the Dec 2025 duplicate-loading incident.

- setup/createDatabaseTableReciterDb.sql -- idx_pmid is now UNIQUE for
  fresh installs.

- .gitignore -- excludes audit / repair artifacts at the repo root
  (they contain prod abstract text).

DBA runbook after merge:
  1. python3 update/auditAbstracts.py
  2. python3 update/repairAbstracts.py            (dry-run; review counts)
  3. python3 update/repairAbstracts.py --apply
  4. mysql ... < setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql
  5. Next nightly abstractImport.py backfills the deleted PMIDs via the
     parameterized path.

Refs: #87, PR #78.
…y halts

Two bugs discovered when running the runbook end to end against prod:

1. update/repairAbstracts.py used `CREATE TABLE LIKE` + `INSERT INTO ...
   SELECT *`. Prod added a STORED generated column `abstract_len` (with
   a composite index `idx_abs_pmid_len`) directly without updating the
   repo's schema file. The SELECT * pulled the generated-column value;
   MariaDB's strict mode rejected it with error 1906 and the backup
   aborted with the destination table empty.

   The fix queries information_schema for non-generated columns and
   enumerates them in both the corrupted-rows INSERT and the dedup
   INSERT. The script now works whether or not abstract_len exists.

2. setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql's precondition
   guard used `SELECT 'aborted...' AS error, 1/0 AS force_error`. The
   1/0 only emits a warning in MariaDB's default SQL mode -- it does
   NOT halt the script. So when the DBA uploaded the migration before
   running the repair, the precondition SELECT was printed but
   execution continued to the ALTER, which then failed with
   `Duplicate entry '9182809' for key 'idx_pmid'`.

   The fix synthesizes a SELECT against a non-existent table whose
   name encodes the duplicate count
   (`__migration_aborted_reporting_abstracts_has_N_duplicate_pmids__
   _run_update_repairAbstracts_py_first`). The resulting "Table
   doesn't exist" error halts the SQL client immediately, and the
   error text itself tells the operator what to do.

Schema drift (the live abstract_len + idx_abs_pmid_len that aren't in
createDatabaseTableReciterDb.sql) is a separate concern not addressed
here.
fix(reporting_abstracts): repair cross-paper concatenation from old CSV/LOAD DATA path (#87)
Durable curator-state table for the Publication Manager 'Authorships' review queue
(Curator_All). Survives the nightly truncate-reload like grant_provenance — it is in
no truncate list (updateReciterDB.py all_tables) and touched by no stored procedure;
populated externally by the adversarial-attribution-review producer. CREATE TABLE IF
NOT EXISTS, additive, no change to existing ETL.
Add durable authorship_review table (PM Authorships queue)
Adds scope_person_types, scope_org_units, proxy_person_ids (JSON NULL)
to admin_users via an idempotent information_schema-guarded ALTER, for
existing databases that predate the fresh-build schema (#92, master).

The Publication Manager dev-branch AdminUser model (commit 579d32f)
selects these columns during login; without them findOrcreateAdminUser
fails with 'Unknown column' and authentication returns 401 for every
user. Run before deploying the PM dev branch against an existing
reciterdb (e.g. production).

Additive only; admin_users is durable (not in updateReciterDB.py
all_tables truncate list). No-op on re-run.
feat(setup): add admin_users scope/proxy column migration (v1.5)
@paulalbert1 paulalbert1 self-assigned this Jun 9, 2026
paulalbert1 and others added 3 commits June 14, 2026 12:40
…ble (#95)

Add a nightly ETL step that scans the ReCiter DynamoDB ArticleProvenance
table and loads first-retrieval provenance into a new reciterdb table,
article_provenance, so Publication Manager can display the date a
publication was first retrieved (PM #737).

- update/retrieveArticleProvenance.py: streams the scan into an
  article_provenance_new staging table (INSERT IGNORE per page to bound
  memory and collapse duplicates), validates row counts against
  production, then atomic RENAME-swaps. Converts frd (epoch seconds, UTC)
  to DATETIME. Mirrors the retrieveNIH staging->swap pattern.
- setup/createDatabaseTableReciterDb.sql: article_provenance DDL for
  fresh installs, keyed (pmid, personIdentifier).
- setup/alter_add_article_provenance_v1.6.sql: migration for existing
  dev/prod databases.
- update/run_all.py: run the step before nightly indexing as non-fatal so
  a failure cannot block the indexing SP.
- Dockerfile: COPY the new script.
- README: document the step and backfill the missing retrieveReporter row.

The table is keyed on (pmid, personIdentifier): the source DynamoDB table
has a composite key uid (HASH) + articleId (RANGE), one item per
person+article, so frd is per-person -- not global-per-article as issue
#95 assumed.
feat(etl): scan ArticleProvenance (DynamoDB) -> article_provenance table (#95)
…ce-etl"

This reverts commit acd26bb, reversing
changes made to ffd0631.
paulalbert1 and others added 3 commits June 14, 2026 14:02
Revert #96: drop article_provenance batch ETL in favor of on-the-fly DynamoDB reads
3-places mirror of ReCiter-Publication-Manager schema:
- add admin_permissions, admin_role_permissions, admin_permission_resources
  to createDatabaseTableReciterDb.sql
- add impersonatedByUserID column to admin_feedback_log
- add table_admin_permissions.sql with the permission/role-mapping/nav seed

Mirrors PM scripts/migrations/add-permission-tables.sql and
add-impersonated-by-feedbacklog.sql. Refs #739, #733.
feat(setup): mirror RBAC permission tables + impersonatedByUserID column (#739, #733)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant