Add indexers for common documentation portals by yarikoptic · Pull Request #16 · magland/qp

yarikoptic · 2025-11-13T21:48:17Z

The overall idea is to make assistant specifications more declarative, so that they just point to the corresponding websites.
Indexers (introduced here) should be able to figure out, per URL, an index of pages to point to. The inspiration was https://github.com/yarikoptic/qp/blob/main/src/assistants/figpack-assistant/retrieveFigpackDocs.tsx#L71 and the doc-pages.json index published by figpack:

❯ curl --silent https://flatironinstitute.github.io/figpack/doc-pages.json | head
{
  "docPages": [
    {
      "title": "API Reference",
      "url": "https://flatironinstitute.github.io/figpack/api/index.html",
      "sourceUrl": "https://flatironinstitute.github.io/figpack/_sources/api/index.md.txt",
      "includeFromStart": true
    },
    {
      "title": "Developer Guide",
...

So with this script (vibe coded with Claude) it is already possible to get such a listing for various doc engines (with -o the output would go to a file, so the logging would not be mixed into it):

mkdocs

❯ ./doc_pages_index.py https://bids-specification.readthedocs.io/en/stable/ --format figpack | head
Detecting documentation type for https://bids-specification.readthedocs.io/en/stable/
✓ Detected: MkDocs documentation
Fetching MkDocs search index from: https://bids-specification.readthedocs.io/en/stable/search/search_index.json
Found 43 unique pages (filtered from 1458 entries)
Attempting to detect source repository from 'Edit this page' links...
✓ Detected source repository: https://github.com/bids-standard/bids-specification/blob/master/src
{
  "docPages": [
    {
      "title": "Changelog",
      "url": "https://bids-specification.readthedocs.io/en/stable/CHANGES.html",
      "sourceUrl": "https://github.com/bids-standard/bids-specification/blob/master/src/CHANGES.md",
      "includeFromStart": true
    },
    {
      "title": "Arterial Spin Labeling",
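
The "Found 43 unique pages (filtered from 1458 entries)" line above reflects that the MkDocs search index holds one entry per section anchor, not one per page. A minimal sketch of that filtering step follows, assuming the usual search_index.json layout (a "docs" list whose items carry "location" and "title"). The function name and the sample entries are hypothetical and are not taken from doc_pages_index.py.

```python
from urllib.parse import urldefrag, urljoin


def unique_mkdocs_pages(index: dict, base_url: str) -> list[dict]:
    """Collapse per-anchor search entries into one entry per page."""
    pages: dict[str, dict] = {}
    for entry in index.get("docs", []):
        # "location" looks like "CHANGES.html#unreleased"; drop the fragment.
        location, _fragment = urldefrag(entry.get("location", ""))
        url = urljoin(base_url, location)
        # Keep the first entry seen per page; in the sample below that is
        # the page-level entry, so its title is the page title.
        pages.setdefault(url, {"title": entry.get("title", ""), "url": url})
    return list(pages.values())


# Hand-written sample data (hypothetical), shaped like a search_index.json.
sample_index = {
    "docs": [
        {"location": "CHANGES.html", "title": "Changelog", "text": "..."},
        {"location": "CHANGES.html#unreleased", "title": "Unreleased", "text": "..."},
        {"location": "glossary.html", "title": "Glossary", "text": "..."},
    ]
}
pages = unique_mkdocs_pages(sample_index, "https://example.org/en/stable/")
print(pages)
```

Keying the dictionary by the fragment-free URL keeps the input order and makes the deduplication a single pass.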

sphinx

❯ ./doc_pages_index.py https://docs.datalad.org/en/stable/ --format figpack | head
Detecting documentation type for https://docs.datalad.org/en/stable/
✓ Detected: Sphinx documentation
Fetching Sphinx search index from: https://docs.datalad.org/en/stable/searchindex.js
Found 143 documents
Checking for _sources/ directory...
✓ Found _sources/ directory (will use for source URLs)
{
  "docPages": [
    {
      "title": "Acknowledgments",
      "url": "https://docs.datalad.org/en/stable/acknowledgements.html",
      "sourceUrl": "https://docs.datalad.org/en/stable/_sources/acknowledgements.rst.txt",
      "includeFromStart": true
    },
    {
      "title": "Background and motivation",
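
For the Sphinx case, the page list can be recovered from searchindex.js, which wraps one object in a Search.setIndex(...) call. The sketch below assumes a Sphinx version that writes valid JSON inside that call, with "docnames", "filenames" and "titles" keys, and a site built with the default html builder and the default ".txt" source link suffix. The function name and the one-document sample are hypothetical; the real script may work differently.

```python
import json


def sphinx_pages(js_text: str, base_url: str) -> list[dict]:
    """List pages and their _sources/ URLs from the text of a searchindex.js."""
    # Keep only the JSON object between the outermost braces.
    start = js_text.index("{")
    end = js_text.rindex("}") + 1
    index = json.loads(js_text[start:end])

    pages = []
    for docname, filename, title in zip(
        index["docnames"], index["filenames"], index["titles"]
    ):
        pages.append(
            {
                "title": title,
                "url": f"{base_url}{docname}.html",
                # Sphinx copies sources to _sources/<filename>.txt by default.
                "sourceUrl": f"{base_url}_sources/{filename}.txt",
            }
        )
    return pages


# Hand-written sample (hypothetical), not a real downloaded index.
sample_js = (
    'Search.setIndex({"docnames": ["acknowledgements"], '
    '"filenames": ["acknowledgements.rst"], '
    '"titles": ["Acknowledgments"]})'
)
result = sphinx_pages(sample_js, "https://docs.datalad.org/en/stable/")
print(json.dumps(result, indent=2))
```

Whether _sources/ is actually served still has to be checked per site, as the log output above does.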

Some other Sphinx sites: the script fails to determine sources on https://www.hedtags.org/hed-resources/ (see https://github.com/magland/qp/pull/18/files#r2525286258):

❯ ./doc_pages_index.py https://www.hedtags.org/hed-resources/ | xclip -i
Detecting documentation type for https://www.hedtags.org/hed-resources/
✓ Detected: Sphinx documentation
Fetching Sphinx search index from: https://www.hedtags.org/hed-resources/searchindex.js
Found 28 documents
Checking for _sources/ directory...
✗ No _sources/ directory found
  Consider providing --source-repo for source file URLs

So if the sources are not discoverable, one should indeed provide --source-repo ... TODO: mention that to Copilot if I review the PR it made ;-)

hugo ???

So we just need to decide how we want to wrap it up. Ideally, I would like most of the assistants to be mostly a .yaml file definition containing the prompts and documentation sources. But I have not yet reviewed/understood what is custom in each one. @magland could you guide me?

Establishing some CI to dump/update them daily should then be trivial. We could also add:

- not bothering to update if the index didn't change
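
A minimal sketch of skipping the update when nothing changed, assuming the dumps are JSON files and that comparing their serialized text is good enough. The function name and the file name used in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_if_changed(path: Path, payload: dict) -> bool:
    """Write payload as JSON only if it differs from what is on disk.

    Returns True when the file was (re)written, False when it was left alone.
    """
    # sort_keys gives a stable serialization, so equal data means equal text.
    new_text = json.dumps(payload, indent=2, sort_keys=True) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "docs.datalad.org_figpack.json"
    payload = {"docPages": [{"title": "Acknowledgments"}]}
    first = write_if_changed(target, payload)   # file did not exist yet
    second = write_if_changed(target, payload)  # identical content, skipped
print(first, second)
```

Leaving unchanged files untouched means a daily CI job would only produce a commit when a site's index really changed.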

…uilds

- Extracts all pages from Sphinx documentation via searchindex.js
- Prioritizes _sources/ directory for reliable source URLs
- Supports three output formats: json, text, and figpack
- Figpack format compatible with https://flatironinstitute.github.io/figpack/
- Optional URL validation with concurrent checking
- Optional repository URL mapping (GitHub, etc.)
- Includes comprehensive documentation in SPHINX_INDEX_README.md

Key features:
- Auto-detects _sources/ directory (View page source)
- No hallucinated URLs (uses actual served sources)
- Field naming: source_url (generic), repo_source_url (specific)
- Concurrent URL validation (10 workers)
- Both _sources/ and repo URLs when both available

Tested with:
- NWB Schema docs (6 pages)
- DataLad docs (143 pages)
- Sphinx docs (152 pages)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Major refactoring to support both Sphinx and MkDocs documentation:

Core Changes:
- Auto-detects documentation type (Sphinx vs MkDocs)
- Added detect_doc_type() function (tries searchindex.js, then search/search_index.json)
- Refactored get_all_pages() to dispatch to type-specific handlers
- New get_mkdocs_pages() for MkDocs documentation parsing
- Renamed get_all_pages() → get_sphinx_pages() (internal)

MkDocs-specific Features:
- Parses search/search_index.json (JSON format)
- Filters anchor entries (page.html#section) to get unique pages
- Auto-detects source repository from "Edit this page" links
- Maps .html pages to .md source files
- Handles sites without _sources/ directory

Unified Interface:
- Same CLI works for both Sphinx and MkDocs
- Output includes 'doc_type' field ('sphinx' or 'mkdocs')
- All three formats (json, text, figpack) work for both types
- Validation works for both types

Documentation Updates:
- Renamed to "Sphinx and MkDocs Documentation Index Extraction"
- Added MkDocs index file descriptions
- Added Sphinx vs MkDocs comparison table
- Added MkDocs examples (BIDS specification)
- Updated command-line options documentation
- Reorganized examples by doc type

Testing:
- Sphinx: NWB (6 pages), DataLad (143 pages) - working
- MkDocs: BIDS (43 pages from 1458 entries) - working
- Auto-detection: Both types detected correctly
- Figpack format: Works for both types

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
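
The probing order described for detect_doc_type() (searchindex.js first, then search/search_index.json) can be sketched as below. This is not the script's code: the function name, its signature, and the injected url_exists callback are assumptions made so the example runs without network access.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# Index files to probe, in the order stated in the commit message.
PROBES = (
    ("sphinx", "searchindex.js"),
    ("mkdocs", "search/search_index.json"),
)


def guess_doc_type(base_url: str, url_exists: Callable[[str], bool]) -> Optional[str]:
    """Return 'sphinx', 'mkdocs', or None depending on which index file exists."""
    # urljoin would drop the last path segment without a trailing slash.
    if not base_url.endswith("/"):
        base_url += "/"
    for doc_type, index_path in PROBES:
        if url_exists(urljoin(base_url, index_path)):
            return doc_type
    return None


# Offline stand-in for an HTTP check: a set of URLs pretended to exist.
known_urls = {"https://example.org/docs/search/search_index.json"}
detected = guess_doc_type("https://example.org/docs", known_urls.__contains__)
print(detected)
```

In real use, url_exists would wrap an HTTP request; supporting another engine (for example the open "hugo ???" question) would mean adding one more entry to PROBES.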

rename to reflect its broader functionality.

Changes:
- Renamed: sphinx_pages_index.py → doc_pages_index.py
- Updated docstring to use new filename
- Updated all references in SPHINX_INDEX_README.md (36 occurrences)

The script now has a more accurate name that represents its ability to extract page indexes from multiple documentation types, not just Sphinx.

yarikoptic · 2025-11-13T21:49:16Z

NB: submitted as a PR from a branch here since my personal fork got "disconnected" from this repo, I guess due to its short trip to the "dark side".

yarikoptic · 2025-11-13T21:52:22Z

@copilot review the assistants codebase for which documentation sites are used across assistants, then

- create some centralized registry (YAML file?) listing all those resources,
- cook up a GitHub CI workflow to run (on PR/branch changes to that registry or the code here, and daily on cron) which would create or update, for each website, a dump in the --format figpack. Filenames should be constructed based on the hostname of the website, e.g. docs.datalad.org_figpack.json for https://docs.datalad.org/en/stable/ and so on.
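
The requested naming rule (docs.datalad.org_figpack.json for https://docs.datalad.org/en/stable/) amounts to taking the hostname of the URL and appending a suffix. A minimal sketch, with a hypothetical function name:

```python
from urllib.parse import urlparse


def figpack_dump_name(url: str) -> str:
    """Build the dump file name from the hostname of a documentation URL."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"no hostname in URL: {url!r}")
    return f"{host}_figpack.json"


name = figpack_dump_name("https://docs.datalad.org/en/stable/")
print(name)
```

One caveat of this rule: sites that share a hostname and differ only by path (for instance project pages under the same github.io host) would map to the same file name, so the registry might need an explicit name override for those.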

Copilot · 2025-11-13T21:52:29Z

@yarikoptic I've opened a new pull request, #17, to work on those changes. Once the pull request is ready, I'll request review from you.

magland · 2025-11-13T22:16:56Z

> @yarikoptic I've opened a new pull request, #17, to work on those changes. Once the pull request is ready, I'll request review from you.

Sounds great!

yarikoptic and others added 3 commits November 13, 2025 16:29

Copilot AI mentioned this pull request Nov 13, 2025

Add centralized documentation registry and CI workflow #17

Draft

yarikoptic mentioned this pull request Nov 13, 2025

Add HED-Assistant with comprehensive documentation support #18

Merged

yarikoptic mentioned this pull request Dec 4, 2025

Add youtube support #30

Open

neuromechanist mentioned this pull request Jan 16, 2026

Declarative YAML registry for modular community onboarding OpenScience-Collective/osa#42

Closed

6 tasks

neuromechanist mentioned this pull request Feb 10, 2026

Remove HED assistant, migrated to OSA platform #46

Merged

6 tasks

yarikoptic wants to merge 3 commits into main from enh-indexers

3 participants
