Adding extensions for generating md version of docs #2173

Open

rcap107 wants to merge 32 commits into skrub-data:main from rcap107:doc-improve-agentic-docs

rcap107 commented Jun 17, 2026
Member

This is a WIP PR where I'm trying to generate the markdown version of the documentation.

I had to add two sphinx extensions (sphinx_llm and sphinx_markdown_builder).

So far I was able to generate the md files, but there are some issues.
In some places, the html styling is bleeding into the markdown, which adds a lot of unnecessary clutter.

Overall, the PR:

  • adds two sphinx addons that deal with building the documentation and the llms.txt file
  • adds a way to skip running jupyterlite for the quick doc build
  • copies the generated markdown files to the doc/_sources directory so that llms.txt points in the right place
  • copies the markdown documentation into skrub/_docs so that it is available in the wheel
  • updates the package init so that it includes a reference to the bundled docs
  • slightly extends some docstrings based on empirical testing with agents
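
Once the markdown files ship inside the wheel, one way a user or an agent could locate them is through `importlib.resources`. The sketch below is only an illustration of that idea, not the PR's code: the helper name `list_bundled_docs` is hypothetical, and the `_docs` subfolder name is taken from the list above.

```python
from importlib.resources import files


def list_bundled_docs(package: str, subdir: str = "_docs") -> list[str]:
    """Hypothetical helper: relative paths of the .md files bundled in a package."""
    root = files(package).joinpath(subdir)
    if not root.is_dir():
        # The package was built without bundled docs.
        return []
    found = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for child in node.iterdir():
            relative = f"{prefix}{child.name}"
            if child.is_dir():
                stack.append((child, relative + "/"))
            elif child.name.endswith(".md"):
                found.append(relative)
    return sorted(found)
```

With the layout described above, `list_bundled_docs("skrub")` would be the call to try; it returns an empty list when the folder is absent.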

rcap107 commented Jun 17, 2026
Member Author

Something more to be done:

Each page's entry in llms.txt includes a short description. If a page defines an html_meta description — via .. meta:: :description: in rST or html_meta: description: in MyST frontmatter — that value is used. Otherwise the extension falls back to the first 100 characters of the page content.

Right now no page defines an html_meta description, so every entry uses the fallback. We might want to fix that.
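
The selection rule quoted above can be sketched as follows. This is a minimal illustration, not the extension's actual implementation: the function name `page_description` and its arguments are hypothetical.

```python
def page_description(html_meta: dict[str, str], page_text: str, limit: int = 100) -> str:
    """Hypothetical helper mirroring the rule described above."""
    description = html_meta.get("description", "").strip()
    if description:
        # An explicit html_meta description always wins.
        return description
    # Fallback: the first `limit` characters of the page content.
    return page_text[:limit]
```

Because the fallback cuts at a fixed length, it can stop mid-sentence, which is one reason to prefer explicit descriptions.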

rcap107 commented Jun 17, 2026
Member Author

After fixing the links in the markdown, I would also like to add the markdown files to a skrub/_doc/ folder so that they are bundled with the package when it's built.

rcap107 commented Jun 18, 2026
Member Author

Now the documentation is built in markdown, in addition to the standard html docs, and the llms.txt file correctly points to the location of the markdown files.

For now, the md files live in generated/doc/sources, which is not the same folder as the html docs. This means that I can't go from StringEncoder.html to StringEncoder.md to get the markdown version. I'm not sure whether that would be a problem in practice.

Additionally, the markdown files are also added to skrub/data/docs so that they are bundled with the wheel to make everything available with the package.

I still need to add a pixi command that does that.
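
To make the path mismatch concrete, here is a small sketch. The `html` root is a made-up placeholder, and the assumption that the markdown tree mirrors the html layout is mine; only `generated/doc/sources` and the `StringEncoder` file names come from the comment above.

```python
from pathlib import PurePosixPath

# Hypothetical location of the html pages (placeholder, not the real build path).
HTML_ROOT = PurePosixPath("html")
# Markdown location mentioned in the comment above.
MD_ROOT = PurePosixPath("generated/doc/sources")


def sibling_markdown(html_page: PurePosixPath) -> PurePosixPath:
    """Swap the suffix only: the lookup a reader would naturally try."""
    return html_page.with_suffix(".md")


def actual_markdown(html_page: PurePosixPath) -> PurePosixPath:
    """Re-root the page under the markdown folder, then swap the suffix."""
    return MD_ROOT / html_page.relative_to(HTML_ROOT).with_suffix(".md")


page = HTML_ROOT / "StringEncoder.html"
```

Here `sibling_markdown(page)` stays under the html root, while `actual_markdown(page)` points into `generated/doc/sources`, so the simple suffix swap does not find the file.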

rcap107 commented Jun 18, 2026
Member Author

Tests are failing for unrelated reasons (#2178).

rcap107 commented Jun 18, 2026
Member Author

We might want to avoid copying some of the files to the skrub/docs folder. A big part of the docs is just the md version of what's already in the docstrings of the modules, so that part can probably be skipped.

Comment thread pyproject.toml Outdated
skrub = [
    "_docs/**/*.md",
    "_docs/**/*.py",
    "_docs/**/*.css",

Member

If these are plain-text docs for LLMs to read, we probably don't need the JavaScript or CSS.

rcap107 marked this pull request as ready for review June 22, 2026 13:52

rcap107 commented Jun 22, 2026
Member Author

The PR is ready for review. Some sticking points:

  • Currently, build-doc-quick also runs install-docs, which may be unnecessary. We may want to run that only for the full doc build.
  • Building the package does not build the docs on its own, so we first have to generate the md files and then move them to the _docs folder. This is a manual step that is easy to forget when doing a release.

@jeromedockes

Member

short summary of IRL conversation: it would be great if having the source user guide files inside the package and listed as package data was enough, because:

  • having a sphinx extension + custom script generate files that get inserted into the python module when building the wheel is a security risk. those files generated during the build would not undergo the scrutiny that source files do. it is better to keep the build as simple and standard as possible.
  • it would keep the build process for releases simple and the wheel similar to editable installs
  • it would mean that the plain-text docs read by agents on users' machines are the same as those in the source repo and editable installs, so improvements to them also benefit contributors to the code or documentation in the same way, and maintainers are more likely to discover ways in which they can be improved (because they need to navigate those docs themselves to edit them)

@rcap107 is going to experiment with this approach. There are probably several ways we can make the docs easier to navigate for both editors and consumers, such as more explicit filenames, more explicit cross-reference targets, inserting numbers at the start of filenames to make the reading order visible, etc.

Comment thread .gitignore

Member Author

adding files copied from the skrub/_docs folder to the gitignore so they don't get counted twice (like the changelog)

Comment thread skrub/conftest.py

Member Author

this ignores all the .py files that are stored in the _docs folder, which are otherwise executed any time the test collection is run
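
For readers unfamiliar with the mechanism, pytest lets a conftest.py exclude paths from collection through a module-level `collect_ignore_glob` list of shell-style patterns. The sketch below shows the general shape only; the pattern is an assumption and may differ from the actual change in this PR, and `is_ignored` is a hypothetical helper for previewing matches.

```python
# conftest.py (sketch)
from fnmatch import fnmatch

# pytest reads this module-level list from conftest.py and skips matching
# paths during collection; patterns are relative to this file's directory.
# The pattern itself is an assumption, not the PR's actual change.
collect_ignore_glob = ["_docs/*.py"]


def is_ignored(relative_path: str) -> bool:
    """Hypothetical helper: preview which paths the patterns above match."""
    # fnmatch's "*" also matches "/", so nested files are covered too.
    return any(fnmatch(relative_path, pattern) for pattern in collect_ignore_glob)
```

Markdown files under `_docs` are unaffected by this pattern, since only paths ending in `.py` match.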

rcap107 commented Jun 25, 2026
Member Author

This PR has become too complex and mixes several unrelated changes, so they have been spun out into #2190 and #2191.

A further PR will deal with moving the documentation to the skrub/_docs folder, but it needs to wait until the other two PRs have been merged.

This PR will eventually be closed.
