merge: dev → master (admin_users scope migration v1.5 + authorship_review table + reporter ETL + abstract fixes) by paulalbert1 · Pull Request #94 · wcmc-its/ReCiterDB

paulalbert1 · 2026-06-09T13:55:54Z

Propagates dev to master (prod). You control the merge — opened for review, not auto-merged.

Key contents (dev not in master)

setup/alter_add_admin_user_scope_columns_v1.5.sql — idempotent ALTER adding scope_person_types/scope_org_units/proxy_person_ids (JSON NULL) to admin_users. PM-deploy blocker (PM dev SELECTs these on every login). Reviewed APPROVE.
setup/table_authorship_review.sql — durable authorship_review table (authorship-review initiative).
NIH RePORTER ETL + provenance; abstract-corruption repair fixes; 4 Feature Generator schema fields.

⚠️ Merging does NOT apply the SQL

The pipeline does not run setup/*.sql DDL against the DB. The v1.5 ALTER must be applied directly to each existing DB (the prod reciterdb is being updated separately as part of this work). Merging only updates the repo/fresh-build path.

Divergence

master is 35 commits ahead of dev (bidirectional divergence) — this is a real integration, reconcile on merge. v1.5 ALTER confirmed not present on master.

The analysis_nih table was loaded with ~527K rows (2x duplicates) on Dec 18, 2025. Since then, every nightly run retrieves the correct ~267K rows but validation rejects the swap (267K/527K = 50.7% < 80% threshold). Changes: - validate_data() now detects duplicate pmids in production and uses the unique count for percentage comparison, allowing the swap to self-heal - create_staging_tables() adds a UNIQUE constraint on pmid for analysis_nih_new to prevent future duplicate inserts - Added check_production_integrity() utility for diagnostic use - Schema: changed analysis_nih.idx_pmid from KEY to UNIQUE KEY Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update README, cleanup .gitignore, remove unused artifacts

…ass-dev Fix: NIH validation blocking table swap for 77+ days

…standalone SP STEP 4: Replace person.title with person_person_type LEFT JOIN chain (COALESCE) for facultyRank derivation. Replace article counts that counted all articles with filtered counts (Research Article + percentileNIH IS NOT NULL only). STEP 5: Replace all 8 combined UPDATE statements (using wrong threshold-percentage formula) with 24 separate UPDATE statements matching the canonical createEventsProceduresReciterDb.sql v2: - Percentile: AVG of top N articles ranked by percentileNIH DESC - Denominator: Count of peers at same facultyRank meeting thresholds - Rank: RANK() OVER (PARTITION BY facultyRank ORDER BY percentile DESC) All 8 metrics (top5/10 x All/First/Senior/FirstSenior) now match the canonical logic. Steps 1-3, 6, 7 verified identical and unchanged. Ref: createEventsProceduresReciterDb.sql lines 3720-4252

Capture new top-level and feedbackEvidence fields emitted by ReCiter: - datePublicationAddedToPMC (person_article, analysis_summary_article) - feedbackScoreTextSimilarity (person_article) - feedbackScoreJournalTitleSimilarity (person_article) - feedbackScoreBibliographicCoupling (person_article) Schema, nightly SP (v2 + inline copy), legacy loose-SQL insert, CSV transformer, and LOAD DATA column list updated in lockstep. Adds idempotent migration setup/alter_add_feature_generator_fields_v1.1.sql for applying to prod and dev DBs out-of-band. ALTER must run BEFORE deploying the updated ETL image, otherwise LOAD DATA fails on unknown columns.

Applied to prod + dev on MariaDB 11.4 / 10.6 in ~25s (network-bound). AFTER placement forces INPLACE algorithm with metadata lock; appending at end allows INSTANT (no table rewrite, no lock hold).

The v2 populateAnalysisSummaryTables SP was selecting articleYear directly from person_article without a fallback, leaving 74% of analysis_summary_article rows with articleYear = 0. The legacy SP had a post-INSERT UPDATE step that derived the year from publicationDateStandardized in those cases, which the v2 rewrite dropped. Push the fallback into the SELECT itself: IF(articleYear != 0, articleYear, LEFT(publicationDateStandardized, 4)) Verified post-deployment: 0 rows with articleYear = 0 in prod and dev.

The iCite API returns two arrays per record: - cited_by: PMIDs that cite the queried article - references: PMIDs the queried article cites The CSV write for cited_by entries used [cited_by_elem, queried_pmid], but the LOAD DATA columns are (cited_pmid, citing_pmid). Since cited_by elements are the CITING articles (not the cited), this stored every cited_by edge with the cited/citing columns swapped. The references loop was already correct. Same inversion affected analysis_nih_cites_clin. Effect: queries like SELECT COUNT(*) FROM analysis_nih_cites WHERE cited_pmid = <pmid> returned only the rows sourced from other queried PMIDs' references arrays, missing the (typically much larger) cited_by set. Example: PMID 32432483 has iCite citation_count=192 but only 19 rows surfaced because just 19 WCM-tracked papers happened to reference it. After this change, on the next nightly run the table swap will load analysis_nih_cites with semantically correct (cited_pmid, citing_pmid) columns, matching the join used by downstream consumers (Scholars-Profile-System/lib/api/publication-detail.ts).

fix(retrieveNIH): correct column inversion in analysis_nih_cites load

abstractImport.py wrote abstracts to a CSV with csv.writer, then bulk-loaded it with LOAD DATA LOCAL INFILE. The two are incompatible: csv.writer emits \r\n line endings and doubled-quote escaping (""), while the LOAD used LINES TERMINATED BY '\n' and ENCLOSED BY '"' with backslash escaping. Any abstract containing a double quote desynced MySQL's row parser, so ~99.9% of every file was dropped with no error raised (~43 of ~72,000 rows loaded). The unresolved PMIDs were re-selected every cycle and the unbounded while-True loop ran until the 15,000s pipeline timeout, restart-looping indefinitely and blocking the nightly CronJob. - Replace CSV + LOAD DATA with a parameterized, batched executemany INSERT so abstract text is stored verbatim regardless of content. - Bound the fetch/insert loop: stop on no progress, cap at 25 cycles, so unresolvable PMIDs can no longer hang the pipeline. - Retry batch_get_item UnprocessedKeys with backoff instead of dropping throttled keys. - Connect with charset=utf8mb4; drop the now-unused local_infile flag. - Add a --dry-run mode that verifies the fetch/insert path against a TEMPORARY table without touching reporting_abstracts.

fix(abstractImport): replace CSV bulk-load that silently dropped rows

Adds a new nightly ETL step (retrieveReporter.py) that pulls grant metadata and pub-grant linkages from NIH RePORTER (api.reporter.nih.gov/v2) and reconciles them against the existing PubMed-derived person_article_grant table. Three new tables (alter_add_reporter_fields_v1.2.sql): - grant_reporter_project — RePORTER /projects/search results, refreshed each cycle (truncate-reload). Includes abstract_text for cross-reference. - grant_reporter_link — RePORTER /publications/search (pmid, appl_id) pairs. - grant_provenance — long-lived per-(person, pmid, grant) audit log with source_reporter, source_reciterdb, *_first_seen, last_verified. Survives the nightly truncate-reload of person_article_grant. Keyed by (personIdentifier, pmid, core_project_num) where core_project_num is the normalized NIH grant identifier (e.g. R01DK127777). ETL strategy: - Projects: filtered by org_names=["WEILL MEDICAL COLL OF CORNELL UNIV"], partitioned by fiscal year to stay under the 9,999 offset cap (WCM has ~15K projects historically). 1 req/sec rate limit honored. - Publications: keyed by appl_ids from projects, batched. - Reconciliation: bulk INSERT...SELECT ON DUPLICATE KEY UPDATE on both sides. Reciterdb side stages normalized grant strings via temp table + LOAD DATA LOCAL INFILE; RePORTER side joins grant_reporter_link to person_article (userAssertion='ACCEPTED') as the false-positive guard. - Subaward caveat: WCM-as-subaward will be missed by the org filter (false-negative tradeoff accepted to keep false positives near zero). Validated on dev: 235K provenance rows from 33K reciterdb + 132K RePORTER inputs. End-to-end ~18 min (15 min publications fetch is the rate-limited floor; reconciliation completes in ~10 sec via bulk SQL).

RePORTER /projects/search returns NIH-curated keyword vocabularies alongside the abstract. Capture them into grant_reporter_project so the Scholars-Profile-System funding ETL can project them onto grants as a topical search signal (issue #291). - alter_add_reporter_terms_v1.3.sql: ADD COLUMN project_terms, pref_terms via the information_schema-guarded idiom, safe on the live table. - v1.2 CREATE TABLE: mirror the two columns for fresh builds. - retrieveReporter.py: pull `terms` and `pref_terms` from each project dict, stored raw (terms angle-bracket-wrapped, pref_terms semicolon-delimited).

feat(reporter): NIH RePORTER ETL + project-term capture (#291) [dev]

…t reload fetch_projects() partitions the search by fiscal year when the corpus exceeds RePORTER's 9,999 offset cap (WCM has ~15K projects). RePORTER returns a multi-year project under every fiscal year it was active, so the same appl_id comes back in multiple FY pages. appl_id is grant_reporter_project's PRIMARY KEY, so the unguarded reload hit "IntegrityError 1062: Duplicate entry for key 'PRIMARY'" and the run aborted after TRUNCATE had already emptied the table. Dedup the projects loop by appl_id, mirroring the seen_pairs guard the publications loop already uses.

fix(reporter): dedup projects by appl_id before grant_reporter_project reload [dev]

run_all.py invokes retrieveReporter.py, but the Dockerfile's per-file COPY list was never updated when the RePORTER ETL step was added (PR #81), so the built image lacked the script. Every nightly run crashed at step 4 with "can't open file '/usr/src/app/retrieveReporter.py'", halting the pipeline before nightly indexing, abstractImport, and conflictsImport. With restartPolicy: OnFailure this produced an indefinite ~95-min restart loop.

fix(docker): add missing COPY for retrieveReporter.py

…SV/LOAD DATA path The pre-PR-#78 abstractImport.py wrote rows with csv.writer (RFC-4180 doubled-quote escaping) and LOAD DATA LOCAL INFILE (backslash escaping). On rows whose abstract text contained ", MySQL's parser desynced and consumed multiple TSV rows into one field until the next recoverable delimiter, producing a single reporting_abstracts row attributed to one pmid but containing concatenated text from several other papers in the same DynamoDB batch. The May 16 fix (PR #78) stopped new corruption but did not repair existing rows. fetch_missing_pmids() uses LEFT JOIN ... WHERE a.pmid IS NULL, so any pmid with a (corrupted) row was permanently skipped. Audit against prod found 3,124 corrupted rows out of 391,238 total. 84% are pegged at the 65,535-byte BLOB cap with the original pmid's abstract at the head and text from several unrelated papers at the tail. 181 additional rows are clean-pair duplicates from the same era (no UNIQUE constraint on pmid; concurrent old-import runs produced identical content twice). Verification against DynamoDB also spared 464 legitimately long abstracts (structured / consensus / multi-arm trials, lengths up to ~32K) that a length-only filter would have wrongly purged. This change ships the tooling and schema for the cleanup: - update/auditAbstracts.py -- read-only forensic audit. Compares reporting_abstracts.abstract against DynamoDB.PubMedArticle (the same source abstractImport.py uses) and classifies each row >= 4000 chars as CLEAN, PREFIX_CORRUPTED, DISJOINT, EMPTY_IN_DYNAMO, or MISSING_IN_DYNAMO. Writes audit_abstracts.csv plus a dump of the worst offenders. - update/repairAbstracts.py -- destructive cleanup. Backs up affected rows to a timestamped table, deletes corrupted rows in batches, then dedupes any remaining pmid duplicates by keeping MIN(id). Requires --apply; default is dry-run. Confirms the v1.4 migration precondition (no duplicates) after running. - setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql -- adds UNIQUE KEY on reporting_abstracts.pmid. information_schema-guarded; aborts if any duplicates remain. Mirrors the analysis_nih fix from PR #71 after the Dec 2025 duplicate-loading incident. - setup/createDatabaseTableReciterDb.sql -- idx_pmid is now UNIQUE for fresh installs. - .gitignore -- excludes audit / repair artifacts at the repo root (they contain prod abstract text). DBA runbook after merge: 1. python3 update/auditAbstracts.py 2. python3 update/repairAbstracts.py (dry-run; review counts) 3. python3 update/repairAbstracts.py --apply 4. mysql ... < setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql 5. Next nightly abstractImport.py backfills the deleted PMIDs via the parameterized path. Refs: #87, PR #78.

…y halts Two bugs discovered when running the runbook end to end against prod: 1. update/repairAbstracts.py used `CREATE TABLE LIKE` + `INSERT INTO ... SELECT *`. Prod added a STORED generated column `abstract_len` (with a composite index `idx_abs_pmid_len`) directly without updating the repo's schema file. The SELECT * pulled the generated-column value; MariaDB's strict mode rejected it with error 1906 and the backup aborted with the destination table empty. The fix queries information_schema for non-generated columns and enumerates them in both the corrupted-rows INSERT and the dedup INSERT. The script now works whether or not abstract_len exists. 2. setup/alter_add_uq_pmid_reporting_abstracts_v1.4.sql's precondition guard used `SELECT 'aborted...' AS error, 1/0 AS force_error`. The 1/0 only emits a warning in MariaDB's default SQL mode -- it does NOT halt the script. So when the DBA uploaded the migration before running the repair, the precondition SELECT was printed but execution continued to the ALTER, which then failed with `Duplicate entry '9182809' for key 'idx_pmid'`. The fix synthesizes a SELECT against a non-existent table whose name encodes the duplicate count (`__migration_aborted_reporting_abstracts_has_N_duplicate_pmids__ _run_update_repairAbstracts_py_first`). The resulting "Table doesn't exist" error halts the SQL client immediately, and the error text itself tells the operator what to do. Schema drift (the live abstract_len + idx_abs_pmid_len that aren't in createDatabaseTableReciterDb.sql) is a separate concern not addressed here.

fix(reporting_abstracts): repair cross-paper concatenation from old CSV/LOAD DATA path (#87)

Durable curator-state table for the Publication Manager 'Authorships' review queue (Curator_All). Survives the nightly truncate-reload like grant_provenance — it is in no truncate list (updateReciterDB.py all_tables) and touched by no stored procedure; populated externally by the adversarial-attribution-review producer. CREATE TABLE IF NOT EXISTS, additive, no change to existing ETL.

Add durable authorship_review table (PM Authorships queue)

Adds scope_person_types, scope_org_units, proxy_person_ids (JSON NULL) to admin_users via an idempotent information_schema-guarded ALTER, for existing databases that predate the fresh-build schema (#92, master). The Publication Manager dev-branch AdminUser model (commit 579d32f) selects these columns during login; without them findOrcreateAdminUser fails with 'Unknown column' and authentication returns 401 for every user. Run before deploying the PM dev branch against an existing reciterdb (e.g. production). Additive only; admin_users is durable (not in updateReciterDB.py all_tables truncate list). No-op on re-run.

feat(setup): add admin_users scope/proxy column migration (v1.5)

…ble (#95) Add a nightly ETL step that scans the ReCiter DynamoDB ArticleProvenance table and loads first-retrieval provenance into a new reciterdb table, article_provenance, so Publication Manager can display the date a publication was first retrieved (PM #737). - update/retrieveArticleProvenance.py: streams the scan into an article_provenance_new staging table (INSERT IGNORE per page to bound memory and collapse duplicates), validates row counts against production, then atomic RENAME-swaps. Converts frd (epoch seconds, UTC) to DATETIME. Mirrors the retrieveNIH staging->swap pattern. - setup/createDatabaseTableReciterDb.sql: article_provenance DDL for fresh installs, keyed (pmid, personIdentifier). - setup/alter_add_article_provenance_v1.6.sql: migration for existing dev/prod databases. - update/run_all.py: run the step before nightly indexing as non-fatal so a failure cannot block the indexing SP. - Dockerfile: COPY the new script. - README: document the step and backfill the missing retrieveReporter row. The table is keyed on (pmid, personIdentifier): the source DynamoDB table has a composite key uid (HASH) + articleId (RANGE), one item per person+article, so frd is per-person -- not global-per-article as issue #95 assumed.

feat(etl): scan ArticleProvenance (DynamoDB) -> article_provenance table (#95)

…ce-etl" This reverts commit acd26bb, reversing changes made to ffd0631.

Revert #96: drop article_provenance batch ETL in favor of on-the-fly DynamoDB reads

3-places mirror of ReCiter-Publication-Manager schema: - add admin_permissions, admin_role_permissions, admin_permission_resources to createDatabaseTableReciterDb.sql - add impersonatedByUserID column to admin_feedback_log - add table_admin_permissions.sql with the permission/role-mapping/nav seed Mirrors PM scripts/migrations/add-permission-tables.sql and add-impersonated-by-feedbacklog.sql. Refs #739, #733.

feat(setup): mirror RBAC permission tables + impersonatedByUserID column (#739, #733)

paulalbert1 and others added 26 commits March 5, 2026 12:16

Merge pull request #74 from wcmc-its/docs/cleanup-and-agent-guide-dev

275b607

Update README, cleanup .gitignore, remove unused artifacts

Merge pull request #72 from wcmc-its/fix/nih-duplicate-validation-byp…

12231dd

…ass-dev Fix: NIH validation blocking table swap for 77+ days

Updated journal impact score inference

4605fc2

fix(migration): drop AFTER clauses to enable ALGORITHM=INSTANT

9bd507c

Applied to prod + dev on MariaDB 11.4 / 10.6 in ~25s (network-bound). AFTER placement forces INPLACE algorithm with metadata lock; appending at end allows INSTANT (no table rewrite, no lock hold).

Merge pull request #76 from wcmc-its/fix/nih-cites-column-inversion-dev

eae7c78

fix(retrieveNIH): correct column inversion in analysis_nih_cites load

Merge pull request #80 from wcmc-its/fix/abstract-import-loop-dev

4520e69

fix(abstractImport): replace CSV bulk-load that silently dropped rows

Merge pull request #82 from wcmc-its/feat/reporter-etl-dev

7cbf2e6

feat(reporter): NIH RePORTER ETL + project-term capture (#291) [dev]

Merge pull request #84 from wcmc-its/fix/reporter-projects-dedup-dev

813662c

fix(reporter): dedup projects by appl_id before grant_reporter_project reload [dev]

Merge pull request #86 from wcmc-its/fix/dockerfile-retrievereporter-dev

12ce6fd

fix(docker): add missing COPY for retrieveReporter.py

Merge pull request #88 from wcmc-its/fix/abstract-corruption-cleanup-dev

8ff6cda

fix(reporting_abstracts): repair cross-paper concatenation from old CSV/LOAD DATA path (#87)

Merge pull request #91 from wcmc-its/feature/authorship-review-table

9cd385b

Add durable authorship_review table (PM Authorships queue)

Merge pull request #93 from wcmc-its/feature/admin-user-scope-columns

ffd0631

feat(setup): add admin_users scope/proxy column migration (v1.5)

paulalbert1 self-assigned this Jun 9, 2026

paulalbert1 and others added 3 commits June 14, 2026 12:40

Merge pull request #96 from wcmc-its/feature/article-provenance-etl

acd26bb

feat(etl): scan ArticleProvenance (DynamoDB) -> article_provenance table (#95)

Revert "Merge pull request #96 from wcmc-its/feature/article-provenan…

fca972a

…ce-etl" This reverts commit acd26bb, reversing changes made to ffd0631.

paulalbert1 and others added 3 commits June 14, 2026 14:02

Merge pull request #97 from wcmc-its/revert/article-provenance-etl

20bacf4

Revert #96: drop article_provenance batch ETL in favor of on-the-fly DynamoDB reads

Merge pull request #100 from wcmc-its/feature/rbac-schema-mirror-dev

407aa44

feat(setup): mirror RBAC permission tables + impersonatedByUserID column (#739, #733)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge: dev → master (admin_users scope migration v1.5 + authorship_review table + reporter ETL + abstract fixes)#94

merge: dev → master (admin_users scope migration v1.5 + authorship_review table + reporter ETL + abstract fixes)#94
paulalbert1 wants to merge 32 commits into
masterfrom
dev

paulalbert1 commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

paulalbert1 commented Jun 9, 2026

Key contents (dev not in master)

⚠️ Merging does NOT apply the SQL

Divergence

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant