A Python library and CLI tool for extracting figures and their captions from scientific publications in PDF format. Originally developed for processing PubMed ophthalmology papers.
- Detects figures and their associated captions in multi-page PDFs
- Handles compound figures (multiple image panels sharing one caption)
- Exports cropped figure images (PNG) and structured metadata (JSON)
- Supports parallel processing across pages for faster throughput
- Optional deep-learning-based layout detection via LayoutParser
Note: LayoutParser is installed from a fork that fixed import errors (see here)
Requires Python 3.12+. Install with uv or pip:
pip install .For deep-learning layout detection (optional):
pip install ".[dl]"pmo-parser <input_path> [--output-path <output_path>]input_path— directory containing one or more.pdffiles--output-path— destination directory (defaults to<input_path>/results)
For each PDF, the tool creates a subdirectory under output_path containing:
- One
.pngper detected figure - A
.jsonfile with figure bounding boxes, caption text, page numbers, and confidence scores
A log.text file is written to the output directory listing any processing errors.
from pmo_parser import caption_pdf
figures = caption_pdf("path/to/paper.pdf")
for fig in figures:
serialized, image = fig.serialize()
print(serialized["caption"]) # list of caption dicts with text and bbox
if image is not None:
image.save(f"figure_{fig.page}_{fig.name}.png")caption_pdf accepts either a file path string or a BytesIO object and returns a list of OutputFigure objects.
| Parameter | Default | Description |
|---|---|---|
pdf_path |
— | Path or BytesIO of the PDF |
use_dl |
False |
Use LayoutParser DL model for layout detection |
always_create_screenshots |
False |
Render page screenshots even when not needed |
num_processes |
1 |
Number of parallel worker processes |
Each figure in the JSON output has the following structure:
{
"page": 2,
"name": null,
"type": "FIGURE",
"figure_bbox": {"x0": 50.0, "y0": 100.0, "x1": 300.0, "y1": 400.0},
"caption": [
{
"text": "Figure 1. Example caption text.",
"x0": 50.0, "y0": 405.0, "x1": 300.0, "y1": 420.0
}
],
"caption_scores": [4.5],
"dpi": 150,
"image_path": "results/paper/page_2_figure_None.png"
}Install development dependencies:
pip install -e ".[dev]"Run tests:
pytestMIT — see LICENSE.
@article{hallitschke2026pubmedophtha,
title = {PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature},
author = {Hallitschke, Verena Jasmin and Eickhoff, Carsten and Berens, Philipp},
journal = {arXiv preprint arXiv:2605.02720},
year = {2026}
}