Add indexers for common documentation portals #16
Draft
yarikoptic wants to merge 3 commits into
…uilds
- Extracts all pages from Sphinx documentation via searchindex.js
- Prioritizes _sources/ directory for reliable source URLs
- Supports three output formats: json, text, and figpack
- Figpack format compatible with https://flatironinstitute.github.io/figpack/
- Optional URL validation with concurrent checking
- Optional repository URL mapping (GitHub, etc.)
- Includes comprehensive documentation in SPHINX_INDEX_README.md
Key features:
- Auto-detects _sources/ directory (View page source)
- No hallucinated URLs (uses actual served sources)
- Field naming: source_url (generic), repo_source_url (specific)
- Concurrent URL validation (10 workers)
- Both _sources/ and repo URLs when both available
Tested with:
- NWB Schema docs (6 pages)
- DataLad docs (143 pages)
- Sphinx docs (152 pages)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major refactoring to support both Sphinx and MkDocs documentation:
Core Changes:
- Auto-detects documentation type (Sphinx vs MkDocs)
- Added detect_doc_type() function (tries searchindex.js, then search/search_index.json)
- Refactored get_all_pages() to dispatch to type-specific handlers
- New get_mkdocs_pages() for MkDocs documentation parsing
- Renamed get_all_pages() → get_sphinx_pages() (internal)
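As an illustration of the detection order described above (searchindex.js first, then search/search_index.json), here is a minimal sketch. It is not the function from this PR: the `exists` callback and the `INDEX_FILES` table are assumptions made so the probing order can be shown without network access.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# Probe order as described in the commit message: Sphinx first, then MkDocs.
INDEX_FILES = (
    ("sphinx", "searchindex.js"),
    ("mkdocs", "search/search_index.json"),
)


def detect_doc_type(base_url: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return 'sphinx', 'mkdocs', or None if neither index file is served.

    `exists` is a hypothetical callback that reports whether a URL is reachable.
    """
    # urljoin drops the last path segment unless the base ends with a slash.
    base = base_url if base_url.endswith("/") else base_url + "/"
    for doc_type, relative_path in INDEX_FILES:
        if exists(urljoin(base, relative_path)):
            return doc_type
    return None
```

In real use `exists` would issue an HTTP request; here a set membership test is enough to exercise the logic.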
MkDocs-specific Features:
- Parses search/search_index.json (JSON format)
- Filters anchor entries (page.html#section) to get unique pages
- Auto-detects source repository from "Edit this page" links
- Maps .html pages to .md source files
- Handles sites without _sources/ directory
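The anchor filtering step can be sketched as below. This is an assumption-laden illustration rather than the PR's get_mkdocs_pages(): it assumes the index has a top-level "docs" list whose entries carry "location" and "title" keys, and the function name `unique_pages` is hypothetical.

```python
from typing import Dict, List


def unique_pages(search_index: dict) -> List[Dict[str, str]]:
    """Collapse section entries (page.html#section) into one entry per page.

    The title of the entry without an anchor wins; if a page only appears
    through anchored entries, the first section title seen is kept.
    """
    titles: Dict[str, str] = {}
    for entry in search_index.get("docs", []):
        page, _, anchor = entry.get("location", "").partition("#")
        title = entry.get("title", "")
        if not anchor:
            # Page-level entry: its title is authoritative.
            titles[page] = title
        else:
            # Section entry: only record the page if not seen yet.
            titles.setdefault(page, title)
    return [{"location": page, "title": title} for page, title in titles.items()]
```

This kind of reduction is what turns a large number of index entries into a much shorter list of pages, as in the BIDS figures reported under Testing.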
Unified Interface:
- Same CLI works for both Sphinx and MkDocs
- Output includes 'doc_type' field ('sphinx' or 'mkdocs')
- All three formats (json, text, figpack) work for both types
- Validation works for both types
Documentation Updates:
- Renamed to "Sphinx and MkDocs Documentation Index Extraction"
- Added MkDocs index file descriptions
- Added Sphinx vs MkDocs comparison table
- Added MkDocs examples (BIDS specification)
- Updated command-line options documentation
- Reorganized examples by doc type
Testing:
- Sphinx: NWB (6 pages), DataLad (143 pages) - working
- MkDocs: BIDS (43 pages from 1458 entries) - working
- Auto-detection: Both types detected correctly
- Figpack format: Works for both types
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
rename to reflect its broader functionality.
Changes:
- Renamed: sphinx_pages_index.py → doc_pages_index.py
- Updated docstring to use new filename
- Updated all references in SPHINX_INDEX_README.md (36 occurrences)
The script now has a more accurate name that represents its ability to extract page indexes from multiple documentation types, not just Sphinx.
Collaborator
Author
NB: submitted as a PR from a branch here since my personal fork got "disconnected" from this repo, I guess due to its short trip to the "dark side".
Collaborator
Author
@copilot review assistants codebase for which documentation sites it uses across assistants, create
@yarikoptic I've opened a new pull request, #17, to work on those changes. Once the pull request is ready, I'll request review from you.
Owner
Sounds great!
The overall idea is to make assistant specifications more declarative, so that they just point to the corresponding websites.
Indexers (introduced here) should be able to figure out, per URL, an index of pages to point to. The inspiration was https://github.com/yarikoptic/qp/blob/main/src/assistants/figpack-assistant/retrieveFigpackDocs.tsx#L71 and the
So with this script (Claude vibe-coded) it is already possible to get such a listing for various doc engines (with
`-o` would be just to file without occluding logging) so if not discoverable -- should indeed provide source-repo ... TODO: mention that to copilot if I review that PR it made ;-)
So we just need to decide how we want to wrap it up. Ideally, I would like most of the assistants to be mostly a .yaml file definition containing the prompts and documentation sources. But I have not yet reviewed/understood what custom parts each one has. @magland could you guide me?
Establishing some CI to dump/update them daily should then be trivial. We could also add