diff --git a/README.md b/README.md index f8d6cac..5ce600f 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ This tool provides a method for retrieving figures from NCBI's [PMC](https://www ## Features -- **Automated Species Search**: Searches for publications related to 30+ plant species +- **Automated Species Search**: Searches for publications related to 27 plant species - **Figure Extraction**: Downloads high-quality figures from PMC articles - **Resume Capability**: Caches processed PMC IDs to resume interrupted downloads - **Rate Limiting**: Respects NCBI API limits (3 requests/second, 10 with API key) @@ -88,8 +88,7 @@ build/output/ ├── Arabidopsis_thaliana/ │ ├── PMC123456/ │ │ ├── figure1.jpg -│ │ ├── figure2.png -│ │ └── metadata.json # Article metadata +│ │ └── figure2.png │ └── PMC789012/ │ └── figure1.svg ├── Cannabis_sativa/ diff --git a/docs/architecture/index.md b/docs/architecture/index.md index 0095aaa..4d38d55 100644 --- a/docs/architecture/index.md +++ b/docs/architecture/index.md @@ -8,6 +8,9 @@ The Publication Figure Retrieval Tool follows a modular, pipeline-based architec ```mermaid graph TB + accTitle: High-Level System Architecture + accDescr: Inputs (species configuration, environment variables, and API keys) feed the main orchestrator, which drives the search, fetch, parse, and download modules. The search and fetch modules call the NCBI E-utilities API and PMC database. The download module writes to the output file system, cache, and progress tracking. A throttled queue mediates the search, fetch, and download API requests under an API rate controller. + subgraph "Input Layer" A[Species Configuration] B[Environment Variables] @@ -71,6 +74,9 @@ The central coordinator that manages the entire workflow: ```mermaid flowchart TD + accTitle: Main Orchestrator Control Flow + accDescr: The orchestrator initializes the throttle queue, loads the species list, and processes each species by searching articles. 
If articles are found it fetches article details, otherwise it logs that no articles were found. It then processes the next species, repeating until no species remain, at which point it completes. + A[Initialize Throttle Queue] --> B[Load Species List] B --> C[Process Each Species] C --> D[Search Articles by Species] @@ -97,6 +103,9 @@ Handles publication discovery through NCBI's E-utilities API: ```mermaid sequenceDiagram + accTitle: Search Module Request Sequence + accDescr: The search module constructs a query string and sends a GET request to the NCBI esearch endpoint with the PMC database and species term. The API returns a JSON response containing PMC IDs. The module extracts the ID list and returns the PMC ID array to the caller. + participant SM as Search Module participant API as NCBI E-utilities participant Cache as Local Cache @@ -124,6 +133,9 @@ Retrieves detailed article metadata and content: ```mermaid graph TD + accTitle: Fetch Module Batch Processing + accDescr: The fetch module batches PMC IDs and checks the cache. Already-cached batches are skipped, while uncached batches are fetched as article details, parsed from XML, and recorded in the cache before figures are processed. Each batch leads to the next until all are handled. + A[Batch PMC IDs] --> B[Check Cache] B --> C{Already Cached?} C -->|Yes| D[Skip Batch] @@ -147,6 +159,9 @@ Processes XML article data to extract PMC IDs and orchestrate package downloads: ```mermaid graph LR + accTitle: Parse Module Figure Extraction + accDescr: The parse module takes XML article data, parses its structure, extracts the article metadata, locates the PMC ID, requests the article package, extracts images from the package, and saves the images to the file system. 
+ A[XML Article Data] --> B[Parse XML Structure] B --> C[Extract Article Metadata] C --> D[Locate PMC ID] @@ -175,7 +190,7 @@ The parser locates the PMC identifier in the article front matter (see implement Downloads a complete PMC article package (.tar.gz) and extracts image files. The implementation fetches a package URL from the OA Web Service API, downloads the archive, extracts media, and selects the highest-priority image format per basename before copying results to the output directory (see implementation: [`src/processor/downloadArticlePackage.ts`](../../src/processor/downloadArticlePackage.ts)). -Key implementation behaviors (implementation proof): +Key implementation behaviours (implementation proof): - Fetches OA package metadata via the OA API and converts FTP links to HTTPS (see [`src/processor/fetchPackageUrl.ts`](../../src/processor/fetchPackageUrl.ts)). - Downloads the package archive and extracts it to a temporary directory (see [`src/processor/downloadArticlePackage.ts`](../../src/processor/downloadArticlePackage.ts)). @@ -189,6 +204,9 @@ Console-level messages written by the implementation include `Fetching package U ```mermaid graph TD + accTitle: Primary Data Pipeline + accDescr: The species list generates species queries that drive the PMC search and article-detail API calls. Responses are parsed from XML, the PMC ID is extracted, the article package is requested and its images extracted. Output directories are created and images written to the file system. A progress cache records completed work, and resume logic skips already-processed PMC IDs when fetching article details. + subgraph "Input Processing" A[Species List] --> B[Species Query Generation] end @@ -221,19 +239,21 @@ graph TD ```mermaid graph TD + accTitle: Error Handling and Continuation Flow + accDescr: When an operation runs, the tool checks whether an error occurred. If not, processing continues. If an error occurs, the tool classifies where it happened. 
A species search error is logged with console.error and returns an empty array. An article batch fetch error is logged and processing continues with the next batch. An article package error is logged and processing continues with the next article, with an extra log when the message mentions the Open Access subset. The tool does not retry or apply backoff. + A[Operation Start] --> B{Error Occurred?} B -->|No| C[Continue Processing] - B -->|Yes| D{Error Type?} - D -->|Network| E[Retry with Backoff] - D -->|API Limit| F[Wait and Retry] - D -->|Invalid Data| G[Log and Skip] - D -->|File System| H[Create Directories] - E --> I{Retry Count < 3?} - I -->|Yes| A - I -->|No| G - F --> A - G --> C - H --> A + B -->|Yes| D{Where did it occur?} + D -->|Species search| E[console.error and return empty array] + D -->|Article batch fetch| F[console.error and continue next batch] + D -->|Article package| G[console.error and continue next article] + G --> H{Message mentions Open Access subset?} + H -->|Yes| I[Log article not in Open Access subset] + H -->|No| C + I --> C + E --> C + F --> C C --> J[Operation Complete] ``` @@ -252,6 +272,9 @@ Each module has a single, well-defined responsibility: ```mermaid graph LR + accTitle: Rate Limiting Strategy + accDescr: An API request enters the throttled queue, which checks the rate limit. Requests within the limit execute immediately, while requests that exceed it are queued until a slot is available and then executed. Each executed request updates the rate counter. + A[API Request] --> B[Throttled Queue] B --> C{Rate Limit Check} C -->|Within Limits| D[Execute Request] @@ -271,6 +294,9 @@ graph LR ```mermaid graph TB + accTitle: Caching and Resume Capability + accDescr: At process start the tool checks for the cache file. If it exists, the cached PMC IDs are loaded; if not, an empty cache is initialized. New PMC IDs are filtered from the cache, processed, and the cache is updated and saved to disk. 
+ A[Process Start] --> B[Check Cache File] B --> C{Cache Exists?} C -->|Yes| D[Load Cached PMC IDs] @@ -297,9 +323,12 @@ The system logs operation-level failures and continues processing subsequent spe ```mermaid graph TD + accTitle: Memory Management Through Batching + accDescr: A large dataset is handled through batch processing. The tool processes 50 PMC IDs at a time, so only one batch is held in memory at once and the previous batch becomes eligible for garbage collection, repeating until no batches remain. + A[Large Dataset] --> B[Batch Processing] B --> C[Process 50 PMC IDs] - C --> D[Clear Memory] + C --> D[Previous Batch Eligible for GC] D --> E{More Batches?} E -->|Yes| C E -->|No| F[Complete] diff --git a/docs/architecture/pipelines.md b/docs/architecture/pipelines.md index 9041df9..80fd8a8 100644 --- a/docs/architecture/pipelines.md +++ b/docs/architecture/pipelines.md @@ -8,6 +8,9 @@ This document provides detailed workflow diagrams and explanations of the data p ```mermaid graph TD + accTitle: Main Processing Pipeline + accDescr: The application starts, loads environment variables, configures API rate limiting, loads the species configuration, and initializes the throttled queue. It then loops over species, searching PMC for articles. If no articles are found it logs that and moves on; otherwise it fetches article details, parses the XML, downloads the article package, extracts images, and updates the progress cache. The loop continues until no species remain. + A[Application Start] --> B[Load Environment Variables] B --> C[Configure API Rate Limiting] C --> D[Load Species Configuration] @@ -39,6 +42,9 @@ graph TD ```mermaid sequenceDiagram + accTitle: Species Search Pipeline + accDescr: The main loop calls the search module with a species name. The module constructs a query of the form species[organism] and sends a GET request to the NCBI esearch endpoint. 
The API returns a JSON response whose esearchresult.idlist holds the PMC IDs, which the module returns to the main loop. When no results are found it returns an empty array and the main loop logs that no articles were found. + participant M as Main Loop participant S as Search Module participant API as NCBI E-search API @@ -65,6 +71,9 @@ sequenceDiagram ```mermaid graph TD + accTitle: Article Detail Fetching Pipeline + accDescr: The PMC IDs array is filtered against cached IDs loaded from disk. If a batch has no new IDs it is skipped as already cached. Otherwise the tool builds a batch of 50, constructs the efetch URL, makes a throttled API call, receives the XML response, passes it to the parse module, and updates the cache with the new IDs. Processing continues until all batches are handled. + A[PMC IDs Array] --> B[Check Cached IDs] B --> C[Load Existing Cache] C --> D[Filter Uncached IDs] @@ -89,6 +98,9 @@ graph TD ```mermaid graph LR + accTitle: XML Parsing and Package Extraction + accDescr: Raw XML data is parsed by xml2js into a JavaScript object. The tool extracts the article array and, for each article, gets the PMC ID. It then resolves the OA package URL, downloads the .tar.gz package, extracts the contents, and selects the highest-priority image per basename. Finally it creates the output directory, copies the selected images, and updates progress. 
+ subgraph "XML Processing" A[Raw XML Data] --> B[xml2js Parser] B --> C[JavaScript Object] @@ -121,48 +133,47 @@ graph LR ```mermaid stateDiagram-v2 - [*] --> Validate_URL - Validate_URL --> Check_Directory - Check_Directory --> Create_Directory : Directory Missing - Check_Directory --> Download_Image : Directory Exists - Create_Directory --> Download_Image - - Download_Image --> Success : HTTP 200 - Download_Image --> Not_Found : HTTP 404 - Download_Image --> Rate_Limited : HTTP 429 - Download_Image --> Network_Error : Connection Error - - Success --> [*] - Not_Found --> Log_Skip - Log_Skip --> [*] - - Rate_Limited --> Wait_Retry - Network_Error --> Wait_Retry - Wait_Retry --> Retry_Count - - Retry_Count --> Download_Image : Count < 3 - Retry_Count --> Failed : Count >= 3 - Failed --> [*] + accTitle: Article Package Download and Extraction States + accDescr: For each article the tool resolves the OA package URL, creates the output directory, downloads the .tar.gz package, and extracts it with tar. If extraction fails, the temporary directory is removed and the error is rethrown to the caller. On success the tool selects the highest-priority image per basename, copies the selected images to the output directory, and removes the temporary files. There is no per-figure retry or HTTP status handling. 
+ + [*] --> Resolve_Package_URL + Resolve_Package_URL --> Create_Output_Directory + Create_Output_Directory --> Download_Targz + Download_Targz --> Extract_With_Tar + + Extract_With_Tar --> Extraction_Error : tar fails + Extraction_Error --> Cleanup_Temp + Cleanup_Temp --> Rethrow_Error + Rethrow_Error --> [*] + + Extract_With_Tar --> Select_Preferred_Images : extraction succeeds + Select_Preferred_Images --> Copy_To_Output + Copy_To_Output --> Cleanup_Temp_Final + Cleanup_Temp_Final --> [*] ``` -### Parallel Download Management +### Package Download Management ```mermaid graph TD - A[Figure URL List] --> B[Sequential Processing] - B --> C[Apply Rate Limiting] - C --> D[Create Directory Structure] - D --> E[Download Single Figure] - E --> F[Save to File System] - F --> G{More Figures?} - G -->|Yes| E - G -->|No| H[Article Complete] + accTitle: Package Download Management + accDescr: For each article the tool applies rate limiting through the throttled queue, resolves the OA package URL, downloads the .tar.gz package, and extracts its contents. It selects the highest-priority image per basename, copies the selected images to the output directory, and removes temporary files. Articles are processed sequentially rather than in parallel. 
+ + A[Article PMC ID] --> B[Apply Rate Limiting] + B --> C[Resolve OA Package URL] + C --> D[Download .tar.gz Package] + D --> E[Extract Package Contents] + E --> F[Select Highest-Priority Image Per Basename] + F --> G[Copy Selected Images to Output Directory] + G --> H[Remove Temporary Files] + H --> I{More Articles?} + I -->|Yes| A + I -->|No| J[Species Complete] subgraph "Rate Limiting" - C --> I[Check Queue Status] - I --> J[Wait for Available Slot] - J --> K[Execute Download] - K --> C + B --> K[Check Queue Status] + K --> L[Wait for Available Slot] + L --> M[Execute Request] end ``` @@ -172,30 +183,22 @@ graph TD ```mermaid graph TD + accTitle: Error Recovery Workflow + accDescr: When an operation runs and an error occurs, the tool identifies where it happened and logs it with console.error or console.log, then continues with the next item. A species search error returns an empty array, a batch fetch error continues with the next batch, and an article package error continues with the next article. Missing output and cache directories are created on demand before writes. The tool does not back off or retry. 
+ A[Operation Start] --> B[Execute Operation] B --> C{Error Occurred?} C -->|No| D[Operation Success] - C -->|Yes| E[Identify Error Type] - - E --> F{Network Error?} - E --> G{Rate Limit Error?} - E --> H{Data Format Error?} - E --> I{File System Error?} - - F -->|Yes| J[Exponential Backoff] - G -->|Yes| K[Wait for Rate Reset] - H -->|Yes| L[Log and Skip] - I -->|Yes| M[Create Directories] + C -->|Yes| E[Identify Where It Occurred] - J --> N{Retry Count < 3?} - K --> N - M --> N + E --> F[Species search: log and return empty array] + E --> G[Article batch fetch: log and continue next batch] + E --> H[Article package: log and continue next article] - N -->|Yes| B - N -->|No| L - - L --> O[Continue with Next Item] + F --> O[Continue with Next Item] + G --> O + H --> O D --> P[Update Progress] O --> P P --> Q[Complete] @@ -207,6 +210,9 @@ graph TD ```mermaid sequenceDiagram + accTitle: Cache Read and Write Workflow + accDescr: The application initializes the cache, and the cache manager checks whether cache/id.json exists. If it exists, its contents are read and parsed into an array; if not, the cache directory is created and an empty array is initialized. For each batch the application filters cached IDs, processes the uncached IDs, and asks the cache manager to add the new IDs, which writes the updated cache to the file system. + participant App as Application participant CM as Cache Manager participant FS as File System @@ -247,11 +253,14 @@ sequenceDiagram ```mermaid graph TD + accTitle: Batch Processing Strategy + accDescr: A large PMC ID list is split into batches of 50. Each batch is processed and the previous batch becomes eligible for garbage collection until no batches remain. Within a batch the tool makes the fetch API call, parses the XML, extracts figures, downloads images, and updates the cache. 
+ A[Large PMC ID List] --> B[Split into Batches of 50] B --> C[Process Batch 1] - C --> D[Memory Cleanup] + C --> D[Previous Batch Eligible for GC] D --> E[Process Batch 2] - E --> F[Memory Cleanup] + E --> F[Previous Batch Eligible for GC] F --> G{More Batches?} G -->|Yes| H[Process Next Batch] G -->|No| I[All Batches Complete] @@ -275,6 +284,9 @@ graph TD ```mermaid graph LR + accTitle: Throttling Implementation + accDescr: An API request enters the throttled queue, which checks the current load against the rate limit. Requests within the limit execute immediately; requests over the limit are queued until a slot is available and then executed. Each request updates the rate counter when complete. The configured rate is 10 requests per second when an API key is present and 3 requests per second otherwise. + A[API Request] --> B[Throttled Queue] B --> C[Check Current Load] C --> D{Within Rate Limit?} @@ -304,6 +316,9 @@ graph LR ```mermaid graph TD + accTitle: Progress Tracking and Logging + accDescr: Processing starts and counters are initialized. For each species the tool logs the species start, searches articles, logs the articles found, processes batches while logging batch progress, downloads figures while logging download status, and logs species completion. It repeats for each species and logs a final summary when done. + A[Start Processing] --> B[Initialize Counters] B --> C[Process Species] C --> D[Log Species Start] @@ -324,16 +339,16 @@ graph TD ### Example Log Output Flow ```bash -[INFO] Searching articles for the species: Arabidopsis_thaliana... -[INFO] Found 1,234 articles for Arabidopsis_thaliana -[INFO] Fetching Arabidopsis thaliana article details for batch 1-50... -[INFO] Processing article PMC ID: PMC123456 -[INFO] Fetching package URL for PMC123456... -[INFO] Package downloaded. Extracting images... -[INFO] Extracted image: figure1.jpg (priority: jpg) -[INFO] Successfully extracted 1 images from package. 
-[INFO] All IDs in Arabidopsis thaliana batch 51-100 are already cached. -[INFO] Processing complete for Arabidopsis_thaliana +Searching articles for the species: Arabidopsis_thaliana... +Fetching Arabidopsis thaliana article details for batch 1-50... +Processing article PMC ID: PMC123456 +Fetching package URL for PMC123456... +Downloading package from https://.../PMC123456.tar.gz... +Package downloaded. Extracting images... +Extracted image: figure1.jpg (priority: jpg) +Successfully extracted 1 images from package. +Successfully processed article package for PMC123456 +All IDs in Arabidopsis thaliana batch 51-100 are already cached. ``` ## Resume and Recovery Pipeline @@ -342,6 +357,9 @@ graph TD ```mermaid graph TD + accTitle: Interrupted Process Recovery + accDescr: After an application restart the tool loads the cache file. If the cache exists, the cached PMC IDs are parsed and compared with the species list to identify processed work and continue from the last position, filtering cached IDs and processing the remaining ones while updating the cache incrementally. If no cache exists, a new cache is initialized and processing starts fresh. + A[Application Restart] --> B[Load Cache File] B --> C{Cache Exists?} diff --git a/docs/faq.md b/docs/faq.md index abafc62..337ecc4 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -12,7 +12,7 @@ The Publication Figure Retrieval Tool is an open-source Node.js application that - **Input**: Species names (scientific and common names) - **Output**: Image formats extracted from article packages; supported extensions are defined in the code (`IMAGE_EXTENSIONS`) and include common formats such as `jpg`, `png`, `tiff`, `gif`, and `svg` (see [`src/constants.ts`](../src/constants.ts)). -- **Data**: JSON metadata files with article and figure information +- **Data**: A progress cache file (`build/output/cache/id.json`) that stores processed PMC IDs as a JSON array, written alongside the extracted image files ### Is this tool free to use? 
@@ -98,7 +98,7 @@ For the example above: ### Q: Where are the downloaded figures saved? -**A:** At runtime the tool writes extracted images to the `build/output/` directory (when running the compiled JavaScript). The layout is organized by species and PMC ID; the package extraction and write behavior are implemented in [`src/processor/parseFigures.ts`](../src/processor/parseFigures.ts) and [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts). Example: +**A:** At runtime the tool writes extracted images to the `build/output/` directory (when running the compiled JavaScript). The layout is organized by species and PMC ID; the package extraction and write behaviour are implemented in [`src/processor/parseFigures.ts`](../src/processor/parseFigures.ts) and [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts). Example: ```text build/output/ @@ -120,7 +120,7 @@ build/output/ ```typescript // In src/index.ts -const throttle = throttledQueue(5, 1000); // 5 requests per second +const throttle = throttledQueue({ maxPerInterval: 5, interval: 1000 }); // 5 requests per second ``` 2. **Use an API key** for better rate limits @@ -138,10 +138,10 @@ const throttle = throttledQueue(5, 1000); // 5 requests per second 3. **Reduce request rate**: ```typescript -const throttle = throttledQueue(2, 2000); // Slower rate +const throttle = throttledQueue({ maxPerInterval: 2, interval: 2000 }); // Slower rate ``` -4. **Add retry logic** (already implemented) +4. **Note**: Failed requests are logged and skipped; the tool continues with the next item rather than retrying automatically ### Q: No figures are being downloaded. Why? @@ -153,7 +153,7 @@ const throttle = throttledQueue(2, 2000); // Slower rate 2. **Articles have no figures**: - Some articles may not contain extractable figures - - Check the metadata files for article details + - Check the console output for per-article processing logs 3. 
**Network issues**: - Check console output for error messages diff --git a/docs/index.md b/docs/index.md index 09d397d..e77a3da 100644 --- a/docs/index.md +++ b/docs/index.md @@ -32,6 +32,9 @@ This tool is particularly valuable for researchers in bioinformatics, comparativ ```mermaid graph TD + accTitle: End-to-End Figure Retrieval Workflow + accDescr: The tool starts, loads the species list, and for each species searches PMC articles, gets PMC IDs, fetches article details, parses the XML response, downloads the article package, extracts images, and saves them to a species and PMC ID directory before moving to the next species until all are complete. + A[Start] --> B[Load Species List] B --> C[For Each Species] C --> D[Search PMC Articles] @@ -123,6 +126,9 @@ Get your API key from [NCBI](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/ne ```mermaid sequenceDiagram + accTitle: Data Flow Between Pipeline Functions and PMC + accDescr: The user runs npm run start, which calls main. Main calls searchArticlesBySpecies, which queries the PMC esearch endpoint and returns PMC IDs. Main then calls fetchArticleDetails, which queries the efetch endpoint in batches and receives XML. fetchArticleDetails calls parseFigures, which calls downloadArticlePackage to download and extract images that are saved to disk and returned to the user as organized files. + participant User participant Main participant Search @@ -205,13 +211,16 @@ rm build/output/cache/id.json ## Supported Species -The tool processes species defined in `src/data/species.json`. Currently includes: +The tool processes 27 plant species defined in [`src/data/species.json`](../src/data/species.json). These include: - Arabidopsis thaliana (model plant) -- Cannabis sativa -- Homo sapiens -- Mus musculus -- And many more... +- Cannabis sativa (hemp) +- Oryza sativa (rice) +- Triticum aestivum (wheat) +- Zea mays (maize) +- Glycine max (soybean) +- Solanum lycopersicum (tomato) +- And 20 more... 
Each species entry includes aliases for better search coverage: diff --git a/docs/usage/api/index.md b/docs/usage/api/index.md index 22d2315..66bcd71 100644 --- a/docs/usage/api/index.md +++ b/docs/usage/api/index.md @@ -60,7 +60,7 @@ flowchart TD ### `main(): Promise` - Location: [`src/index.ts`](../../../src/index.ts) -- Behavior: +- Behaviour: - Configures API request throughput via `throttled-queue` - Iterates all species keys in [`src/data/species.json`](../../../src/data/species.json) - Dispatches species-level processing through `searchArticlesBySpecies` and `fetchArticleDetails` @@ -68,7 +68,7 @@ flowchart TD ### `searchArticlesBySpecies(throttle, species): Promise` - Location: [`src/processor/searchArticleBySpecies.ts`](../../../src/processor/searchArticleBySpecies.ts) -- Behavior: +- Behaviour: - Builds an NCBI ESearch query with `term=[organism]` - Calls `esearch.fcgi` with `db=pmc`, `retmode=json`, and `retmax=1000000` - Adds `api_key` when `NCBI_API_KEY` is present @@ -78,7 +78,7 @@ flowchart TD ### `fetchArticleDetails(throttle, pmids, species): Promise` - Location: [`src/processor/fetchArticleDetails.ts`](../../../src/processor/fetchArticleDetails.ts) -- Behavior: +- Behaviour: - Reads/writes cached IDs in `build/output/cache/id.json` - Splits IDs into 50-item batches - Skips IDs already present in cache @@ -89,7 +89,7 @@ flowchart TD ### `parseFigures(throttle, xmlData, species): Promise` - Location: [`src/processor/parseFigures.ts`](../../../src/processor/parseFigures.ts) -- Behavior: +- Behaviour: - Parses XML with `xml2js` - Extracts each article's PMC ID from `article.front[0]["article-meta"][0]["article-id"]` - Creates per-article output directories under `build/output` @@ -99,7 +99,7 @@ flowchart TD ### `downloadArticlePackage(throttle, pmcId, outputDir): Promise` - Location: [`src/processor/downloadArticlePackage.ts`](../../../src/processor/downloadArticlePackage.ts) -- Behavior: +- Behaviour: - Resolves OA package links via `fetchPackageUrl` 
- Downloads a `.tar.gz` package stream - Extracts package contents with `tar` @@ -111,7 +111,7 @@ flowchart TD ### `fetchPackageUrl(pmcId): Promise` - Location: [`src/processor/fetchPackageUrl.ts`](../../../src/processor/fetchPackageUrl.ts) -- Behavior: +- Behaviour: - Calls PMC OA service endpoint `https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi` - Normalizes IDs to `PMC...` - Parses XML response and extracts package links @@ -121,4 +121,5 @@ flowchart TD ## Notes - `extractFigureUrls` in [`src/processor/extractFigureUrls.ts`](../../../src/processor/extractFigureUrls.ts) is currently a standalone utility and is not invoked by the active main pipeline. +- `fetchPackageUrlsBatch` in [`src/processor/fetchPackageUrl.ts`](../../../src/processor/fetchPackageUrl.ts) is also exported but not invoked by the active main pipeline; the pipeline uses `fetchPackageUrl` for one article at a time. - Cache file format is a JSON array of PMC ID strings, not an object. diff --git a/docs/usage/api/searchArticleBySpecies.md b/docs/usage/api/searchArticleBySpecies.md index 1c89aa4..f2fbeab 100644 --- a/docs/usage/api/searchArticleBySpecies.md +++ b/docs/usage/api/searchArticleBySpecies.md @@ -8,6 +8,9 @@ The `searchArticleBySpecies` module handles the discovery of scientific publicat ```mermaid graph TD + accTitle: searchArticlesBySpecies Module Architecture + accDescr: A species name is turned into a query, which is sent to the NCBI esearch API. The JSON response is processed to extract the PMC ID list, which is returned as an array. If the API call raises a network error, the error is logged and an empty array is returned. 
+ A[Species Name Input] --> B[Query Construction] B --> C[NCBI E-search API Call] C --> D[JSON Response Processing] @@ -32,10 +35,10 @@ export async function searchArticlesBySpecies(throttle: ThrottleFunction, specie ### Parameters -| Parameter | Type | Required | Description | -| ---------- | -------- | -------- | ---------------------------------------------------------------- | -| `throttle` | `any` | Yes | Throttling function from `throttled-queue` for API rate limiting | -| `species` | `string` | Yes | Species name in underscore format (e.g., "Homo_sapiens") | +| Parameter | Type | Required | Description | +| ---------- | ------------------ | -------- | ---------------------------------------------------------------- | +| `throttle` | `ThrottleFunction` | Yes | Throttling function from `throttled-queue` for API rate limiting | +| `species` | `string` | Yes | Species name in underscore format (e.g., "Homo_sapiens") | ### Return Value @@ -64,11 +67,11 @@ const params = { ```typescript // Human articles const humanQuery = "Homo_sapiens[organism]"; -// URL: ...esearch.fcgi?db=pmc&term=Homo_sapiens%5Borganism%5D&retmode=json +// URL: ...esearch.fcgi?db=pmc&term=Homo_sapiens%5Borganism%5D&retmode=json&retmax=1000000 // Plant model organism const plantQuery = "Arabidopsis_thaliana[organism]"; -// URL: ...esearch.fcgi?db=pmc&term=Arabidopsis_thaliana%5Borganism%5D&retmode=json +// URL: ...esearch.fcgi?db=pmc&term=Arabidopsis_thaliana%5Borganism%5D&retmode=json&retmax=1000000 ``` ### Usage Examples @@ -201,6 +204,9 @@ console.log(pmcIds); // [] (empty array, not an error) ```mermaid sequenceDiagram + accTitle: searchArticlesBySpecies Pipeline Integration + accDescr: The main process calls searchArticlesBySpecies with a species name. The function sends an HTTP GET request to NCBI esearch and receives a JSON response with PMC IDs, which it returns to the main process. 
If PMC IDs are found, the main process passes them to fetchArticleDetails; otherwise it logs that no articles were found. + participant M as Main Process participant S as searchArticlesBySpecies participant API as NCBI E-search diff --git a/docs/usage/index.md b/docs/usage/index.md index 7fb3df3..0e6d1ea 100644 --- a/docs/usage/index.md +++ b/docs/usage/index.md @@ -64,11 +64,10 @@ The tool will: ```bash Searching articles for the species: Arabidopsis_thaliana... -Found 1,247 articles for Arabidopsis_thaliana Fetching Arabidopsis thaliana article details for batch 1-50... Processing article PMC ID: PMC123456 Fetching package URL for PMC123456... -Downloading package from https://.../PMC123456.tar.gz +Downloading package from https://.../PMC123456.tar.gz... Package downloaded. Extracting images... Extracted image: figure1.jpg (priority: jpg) Extracted image: figure2.png (priority: png) @@ -191,6 +190,9 @@ tail -f output.log # If you redirect output to a log file ```mermaid graph TD + accTitle: Research Dataset Creation Workflow + accDescr: A researcher defines a research question, selects target species, edits species.json, configures an API key, runs the tool, monitors progress, verifies downloads, and analyzes the resulting figures. + A[Define Research Question] --> B[Select Target Species] B --> C[Edit species.json] C --> D[Configure API Key] @@ -214,6 +216,9 @@ graph TD ```mermaid sequenceDiagram + accTitle: Comparative Analysis Workflow + accDescr: A researcher configures target species and the tool searches PMC for articles. PMC returns article lists, the tool downloads figures, and PMC returns figure files. The tool hands the researcher an organized figure dataset, which the researcher loads into analysis software to obtain comparative results. 
+ participant R as Researcher participant T as Tool participant PMC as PMC Database diff --git a/package-lock.json b/package-lock.json index cbaa1be..6293afd 100644 --- a/package-lock.json +++ b/package-lock.json @@ -693,9 +693,9 @@ } }, "node_modules/@eslint/plugin-kit": { - "version": "0.7.1", - "resolved": "https://registry.npmjs.org/@eslint/plugin-kit/-/plugin-kit-0.7.1.tgz", - "integrity": "sha512-rZAP3aVgB9ds9KOeUSL+zZ21hPmo8dh6fnIFwRQj5EAZl9gzR7wxYbYXYysAM8CTqGmUGyp2S4kUdV17MnGuWQ==", + "version": "0.7.2", + "resolved": "https://registry.npmjs.org/@eslint/plugin-kit/-/plugin-kit-0.7.2.tgz", + "integrity": "sha512-+CNAzxglkrpNf/kKywqQfk74QjtceuOE7Qm+AF8miRvPF/wmmK5+OJOgVh3AVTT3RP2mH3+FOaxlE5v72owk0A==", "dev": true, "license": "Apache-2.0", "dependencies": { @@ -2174,9 +2174,9 @@ "license": "MIT" }, "node_modules/baseline-browser-mapping": { - "version": "2.10.32", - "resolved": "https://registry.npmjs.org/baseline-browser-mapping/-/baseline-browser-mapping-2.10.32.tgz", - "integrity": "sha512-wbPvpyjJPC0zdfdKXxqEL3Ea+bOMD/87X4lftiJkkaBiuG6ALQy1SLmEd7BSmVCuwCQsBrCamgBoLyfFDD1EPg==", + "version": "2.10.33", + "resolved": "https://registry.npmjs.org/baseline-browser-mapping/-/baseline-browser-mapping-2.10.33.tgz", + "integrity": "sha512-bA6+tcSLpz2tIEdDXZPpPTIuxBcC4+w6SieaYyfigIa4h8GlFxbA17v22Vx3JUtuZQj9SgOsnbK+aTBzyDyEuw==", "dev": true, "license": "Apache-2.0", "bin": { @@ -2820,9 +2820,9 @@ } }, "node_modules/eslint": { - "version": "10.4.0", - "resolved": "https://registry.npmjs.org/eslint/-/eslint-10.4.0.tgz", - "integrity": "sha512-loXy6bWOoP3EP6JA7jo6p5jMpBJmHmsNZM5SFRHLdh1MGOPurMnNBj4ZlAbaqUAaQWbCr7jHV4P7gzAyryZWkQ==", + "version": "10.4.1", + "resolved": "https://registry.npmjs.org/eslint/-/eslint-10.4.1.tgz", + "integrity": "sha512-AyIKhnOBuOAdueD7RB3xB+YeAWScb9jHsJBgH2Hcde8InP5JYhqrRR6iTMHyTEwgENK54Cp44e4v8BwNhsuHuw==", "dev": true, "license": "MIT", "dependencies": { @@ -2831,7 +2831,7 @@ "@eslint/config-array": "^0.23.5", "@eslint/config-helpers": 
"^0.6.0",
         "@eslint/core": "^1.2.1",
-        "@eslint/plugin-kit": "^0.7.1",
+        "@eslint/plugin-kit": "^0.7.2",
         "@humanfs/node": "^0.16.6",
         "@humanwhocodes/module-importer": "^1.0.1",
         "@humanwhocodes/retry": "^0.4.2",
@@ -6139,9 +6139,9 @@
     },
     "node_modules/react-is-19": {
       "name": "react-is",
-      "version": "19.2.6",
-      "resolved": "https://registry.npmjs.org/react-is/-/react-is-19.2.6.tgz",
-      "integrity": "sha512-XjBR15BhXuylgWGuslhDKqlSayuqvqBX91BP8pauG8kd1zY8kotkNWbXksTCNRarse4kuGbe2kIY05ARtwNIvw==",
+      "version": "19.2.7",
+      "resolved": "https://registry.npmjs.org/react-is/-/react-is-19.2.7.tgz",
+      "integrity": "sha512-kZFnouyVv7eP/Phmrlo9FK+zcAdriZJvzxXHF1Sl1P377WSGe2G/JxVolhTrB/jeV47lKImhNUsijjHAAbcl/A==",
       "dev": true,
       "license": "MIT"
     },
@@ -6642,9 +6642,9 @@
       "license": "MIT"
     },
     "node_modules/tinyglobby": {
-      "version": "0.2.16",
-      "resolved": "https://registry.npmjs.org/tinyglobby/-/tinyglobby-0.2.16.tgz",
-      "integrity": "sha512-pn99VhoACYR8nFHhxqix+uvsbXineAasWm5ojXoN8xEwK5Kd3/TrhNn1wByuD52UxWRLy8pu+kRMniEi6Eq9Zg==",
+      "version": "0.2.17",
+      "resolved": "https://registry.npmjs.org/tinyglobby/-/tinyglobby-0.2.17.tgz",
+      "integrity": "sha512-wXR/dYpcqKmfWpEdZjiKJOwCNFndD0DMnrW/cYjVGttEkBfVgcLFHoNrlj47mjOVic9yyNu65alsgF4NQyTa2g==",
       "dev": true,
       "license": "MIT",
       "dependencies": {
diff --git a/src/constants.ts b/src/constants.ts
index adb1510..9228a76 100644
--- a/src/constants.ts
+++ b/src/constants.ts
@@ -11,7 +11,7 @@
  * - png: Lossless compression, supports transparency
  * - tiff: High quality, but large file sizes
  * - webp: Modern format with good compression
- * - gif: Limited colors, mainly for animations
+ * - gif: Limited colours, mainly for animations
  * - svg: Vector format, scalable
  * - ico: Icon format, typically low resolution
  * - heif: Modern format, not widely supported
diff --git a/src/processor/parseFigures.ts b/src/processor/parseFigures.ts
index d5e7244..8dbbfd0 100644
--- a/src/processor/parseFigures.ts
+++ b/src/processor/parseFigures.ts
@@ -17,7 +17,7
@@ import { downloadArticlePackage } from "./downloadArticlePackage";
  * @returns {Promise} A promise that resolves when all article packages have been processed.
  *
  * @example
- * const throttle = throttledQueue(2, 1000);
+ * const throttle = throttledQueue({ maxPerInterval: 2, interval: 1000 });
  * const xmlData = "mock data";
  * const species = "Homo sapiens";
  * await parseFigures(throttle, xmlData, species);