Add indexers for common documentation portals#16

Draft
yarikoptic wants to merge 3 commits into main from enh-indexers

@yarikoptic yarikoptic commented Nov 13, 2025

Collaborator

The overall idea is to make assistant specifications more declarative, so that they just point to the corresponding websites.
Indexers (introduced here) should be able to work out, for a given URL, an index of pages to point to. The inspiration was https://github.com/yarikoptic/qp/blob/main/src/assistants/figpack-assistant/retrieveFigpackDocs.tsx#L71 and the doc-pages.json index that figpack serves:

❯ curl --silent https://flatironinstitute.github.io/figpack/doc-pages.json | head
{
  "docPages": [
    {
      "title": "API Reference",
      "url": "https://flatironinstitute.github.io/figpack/api/index.html",
      "sourceUrl": "https://flatironinstitute.github.io/figpack/_sources/api/index.md.txt",
      "includeFromStart": true
    },
    {
      "title": "Developer Guide",
...

So with this script (vibe coded with Claude) it is already possible to get such a listing for various doc engines (with -o the listing would go just to a file, without the logging occluding it):

  • mkdocs
❯ ./doc_pages_index.py https://bids-specification.readthedocs.io/en/stable/ --format figpack | head
Detecting documentation type for https://bids-specification.readthedocs.io/en/stable/
✓ Detected: MkDocs documentation
Fetching MkDocs search index from: https://bids-specification.readthedocs.io/en/stable/search/search_index.json
Found 43 unique pages (filtered from 1458 entries)
Attempting to detect source repository from 'Edit this page' links...
✓ Detected source repository: https://github.com/bids-standard/bids-specification/blob/master/src
{
  "docPages": [
    {
      "title": "Changelog",
      "url": "https://bids-specification.readthedocs.io/en/stable/CHANGES.html",
      "sourceUrl": "https://github.com/bids-standard/bids-specification/blob/master/src/CHANGES.md",
      "includeFromStart": true
    },
    {
      "title": "Arterial Spin Labeling",
  • sphinx
❯ ./doc_pages_index.py https://docs.datalad.org/en/stable/ --format figpack | head
Detecting documentation type for https://docs.datalad.org/en/stable/
✓ Detected: Sphinx documentation
Fetching Sphinx search index from: https://docs.datalad.org/en/stable/searchindex.js
Found 143 documents
Checking for _sources/ directory...
✓ Found _sources/ directory (will use for source URLs)
{
  "docPages": [
    {
      "title": "Acknowledgments",
      "url": "https://docs.datalad.org/en/stable/acknowledgements.html",
      "sourceUrl": "https://docs.datalad.org/en/stable/_sources/acknowledgements.rst.txt",
      "includeFromStart": true
    },
    {
      "title": "Background and motivation",
❯ ./doc_pages_index.py https://www.hedtags.org/hed-resources/ | xclip -i
Detecting documentation type for https://www.hedtags.org/hed-resources/
✓ Detected: Sphinx documentation
Fetching Sphinx search index from: https://www.hedtags.org/hed-resources/searchindex.js
Found 28 documents
Checking for _sources/ directory...
✗ No _sources/ directory found
  Consider providing --source-repo for source file URLs

So if the sources are not discoverable, one should indeed provide --source-repo. TODO: mention that to Copilot if I review the PR it made ;-)

  • hugo ???

So we just need to decide how we want to wrap it up. Ideally, I would like most of the assistants to be mostly a .yaml file definition containing the prompts and documentation sources. But I have not yet reviewed/understood what is custom to each one. @magland could you guide me?

Establishing some CI to dump/update them daily should then be trivial. We could also add

  • not bothering to update if nothing changed
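
The "do not update if nothing changed" idea could be sketched as below. This is only an illustration, not code from this PR: the helper name `write_if_changed` is hypothetical, and it assumes the dump is a JSON-serializable dict such as the figpack listing above. Serializing with sorted keys keeps equal data byte-identical, so a daily CI run would only touch (and commit) files whose content really differs.

```python
import json
from pathlib import Path


def write_if_changed(path: Path, index: dict) -> bool:
    """Write `index` as JSON to `path` only when the content differs.

    Hypothetical helper: returns True if the file was (re)written,
    False if the existing file already holds the same content.
    """
    # Deterministic serialization, so equal data gives equal text.
    new_text = json.dumps(index, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True
```

A CI job could then commit only when at least one call returned True.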

yarikoptic and others added 3 commits November 13, 2025 16:29
…uilds

- Extracts all pages from Sphinx documentation via searchindex.js
- Prioritizes _sources/ directory for reliable source URLs
- Supports three output formats: json, text, and figpack
- Figpack format compatible with https://flatironinstitute.github.io/figpack/
- Optional URL validation with concurrent checking
- Optional repository URL mapping (GitHub, etc.)
- Includes comprehensive documentation in SPHINX_INDEX_README.md
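
For illustration, the extraction described in the list above might look roughly like the sketch below. It is not the code of this commit. It assumes a recent Sphinx build whose searchindex.js is `Search.setIndex(...)` wrapped around a plain JSON object with `docnames`, `filenames` and `titles` arrays, the default HTML builder (pages at `<docname>.html`), and a served `_sources/` directory. The function names are hypothetical, and setting `includeFromStart` to true for every page is an assumption based on the sample output in the PR description.

```python
import json


def parse_searchindex(js_text: str) -> dict:
    """Extract the JSON payload from the text of a Sphinx searchindex.js."""
    # Assumption: the file looks like `Search.setIndex({...})` with a JSON body.
    start = js_text.index("{")
    end = js_text.rindex("}") + 1
    return json.loads(js_text[start:end])


def to_figpack(index: dict, base_url: str) -> dict:
    """Build a figpack-style listing from a parsed Sphinx search index."""
    base = base_url.rstrip("/")
    pages = []
    for docname, filename, title in zip(
        index["docnames"], index["filenames"], index["titles"]
    ):
        pages.append(
            {
                "title": title,
                "url": f"{base}/{docname}.html",
                # Sphinx serves the page source as _sources/<filename>.txt
                "sourceUrl": f"{base}/_sources/{filename}.txt",
                "includeFromStart": True,  # assumption, see lead-in
            }
        )
    return {"docPages": pages}
```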

Key features:
- Auto-detects _sources/ directory (View page source)
- No hallucinated URLs (uses actual served sources)
- Field naming: source_url (generic), repo_source_url (specific)
- Concurrent URL validation (10 workers)
- Both _sources/ and repo URLs when both available

Tested with:
- NWB Schema docs (6 pages)
- DataLad docs (143 pages)
- Sphinx docs (152 pages)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Major refactoring to support both Sphinx and MkDocs documentation:

Core Changes:
- Auto-detects documentation type (Sphinx vs MkDocs)
- Added detect_doc_type() function (tries searchindex.js, then search/search_index.json)
- Refactored get_all_pages() to dispatch to type-specific handlers
- New get_mkdocs_pages() for MkDocs documentation parsing
- Renamed get_all_pages() → get_sphinx_pages() (internal)
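
The probing order named above (searchindex.js first, then search/search_index.json) can be sketched as follows. This is a guess at the shape of the logic, not the PR's `detect_doc_type()`: the function name `guess_doc_type` and its `exists` parameter (any callable that reports whether a URL is served) are hypothetical.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# Probe order as described in the commit message: Sphinx first, then MkDocs.
PROBES = (
    ("sphinx", "searchindex.js"),
    ("mkdocs", "search/search_index.json"),
)


def guess_doc_type(base_url: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return 'sphinx', 'mkdocs', or None if no known search index is found."""
    # A trailing slash makes urljoin append to the path instead of replacing its last segment.
    base = base_url if base_url.endswith("/") else base_url + "/"
    for doc_type, relative_path in PROBES:
        if exists(urljoin(base, relative_path)):
            return doc_type
    return None
```

Injecting `exists` separates the decision logic from the network access, so the former can be tested without any requests.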

MkDocs-specific Features:
- Parses search/search_index.json (JSON format)
- Filters anchor entries (page.html#section) to get unique pages
- Auto-detects source repository from "Edit this page" links
- Maps .html pages to .md source files
- Handles sites without _sources/ directory
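
A rough sketch of the anchor filtering and the .html to .md mapping listed above. It assumes the usual MkDocs search index layout (a top-level `docs` array whose entries carry `location` and `title`) and page locations of the form `page.html`, as on the BIDS site. The function name `unique_mkdocs_pages`, the `source_root` parameter and the constant `includeFromStart` value are hypothetical and do not come from the commit.

```python
from urllib.parse import urldefrag, urljoin


def unique_mkdocs_pages(search_index: dict, base_url: str, source_root: str) -> list:
    """Collapse per-section search entries into one figpack-style entry per page."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    root = source_root.rstrip("/")
    titles = {}  # page location -> title, in first-seen order
    for entry in search_index["docs"]:
        location, fragment = urldefrag(entry["location"])
        if not location.endswith(".html"):
            continue  # this sketch only handles the page.html layout
        # Prefer the title of the page-level entry (the one without a #fragment).
        if fragment == "" or location not in titles:
            titles[location] = entry.get("title", location)
    return [
        {
            "title": title,
            "url": urljoin(base, location),
            # Map page.html to page.md below the repository's docs root.
            "sourceUrl": f"{root}/{location[:-len('.html')]}.md",
            "includeFromStart": True,  # assumption, see lead-in
        }
        for location, title in titles.items()
    ]
```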

Unified Interface:
- Same CLI works for both Sphinx and MkDocs
- Output includes 'doc_type' field ('sphinx' or 'mkdocs')
- All three formats (json, text, figpack) work for both types
- Validation works for both types

Documentation Updates:
- Renamed to "Sphinx and MkDocs Documentation Index Extraction"
- Added MkDocs index file descriptions
- Added Sphinx vs MkDocs comparison table
- Added MkDocs examples (BIDS specification)
- Updated command-line options documentation
- Reorganized examples by doc type

Testing:
- Sphinx: NWB (6 pages), DataLad (143 pages) - working
- MkDocs: BIDS (43 pages from 1458 entries) - working
- Auto-detection: Both types detected correctly
- Figpack format: Works for both types

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

rename to reflect its broader functionality.

Changes:
- Renamed: sphinx_pages_index.py → doc_pages_index.py
- Updated docstring to use new filename
- Updated all references in SPHINX_INDEX_README.md (36 occurrences)

The script now has a more accurate name that represents its ability
to extract page indexes from multiple documentation types, not just Sphinx.
@yarikoptic

Collaborator Author

NB submitted as PR from a branch here since my personal fork got "disconnected" from this repo, I guess due to its short trip to the "dark side"

@yarikoptic

Collaborator Author

@copilot review the assistants codebase for which documentation sites are used across assistants, then create

  • some centralized registry (yaml file?) listing all those resources,
  • a GitHub CI workflow (run on PR/branch changes to that registry or to the code here, and daily on cron) that creates or updates, for each website, a dump in the --format figpack. Filenames should be constructed from the hostname of each website, e.g. docs.datalad.org_figpack.json for https://docs.datalad.org/en/stable/ and so on.
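
The file naming rule in this request, which derives the name from the host part of the site URL, could be sketched like this. The helper name `figpack_filename` is hypothetical.

```python
from urllib.parse import urlsplit


def figpack_filename(site_url: str) -> str:
    """Derive the dump file name from the host part of a documentation URL."""
    hostname = urlsplit(site_url).hostname
    if not hostname:
        raise ValueError(f"no hostname in URL: {site_url!r}")
    return f"{hostname}_figpack.json"
```

With this rule, https://docs.datalad.org/en/stable/ maps to docs.datalad.org_figpack.json. Two sites served from the same host under different paths would collide, so the registry might need an explicit name for such cases.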

Copilot AI commented Nov 13, 2025


@yarikoptic I've opened a new pull request, #17, to work on those changes. Once the pull request is ready, I'll request review from you.

magland commented Nov 13, 2025

Owner

@yarikoptic I've opened a new pull request, #17, to work on those changes. Once the pull request is ready, I'll request review from you.

Sounds great!
