Disallow /live/longpoll in the generated robots.txt #6699
Clients that can't open a WebSocket fall back to the LongPoll transport, and search engine crawlers are in that group: their renderers don't open WebSockets. With longPollFallbackMs set (the default in generated apps), they fall back and repeatedly fetch /live/longpoll while rendering each page. That endpoint serves no indexable content, so the requests are wasted crawl budget. Add the rule to the scaffolded robots.txt and assert it in the installer test so a future template edit can't silently drop it.
Clients without WebSocket support fall back to LongPoll, and search engine crawlers are the common case: their renderers don't open WebSockets and repeatedly request the LongPoll endpoint (/live/longpoll for LiveView), which serves no indexable content. Note this next to the option that enables the fallback so the robots.txt advice has a home.
I'm not yet convinced that this is the default we should generate. If a crawler opens the longpoll connection, it is because it's executing the JavaScript. So letting it do that seems just fine to me. As long as we don't see crawls failing due to this?
@SteffenDE Fair pushback. I pulled my site's crawl stats before replying, since my first instinct was the same as yours.
You're right the longpoll fetch means the renderer is running our JS, and we want that. But
And it adds up. With
On "as long as we don't see crawls failing": nothing does. 94% are clean 200s, which is why it's easy to miss. The cost isn't errors, it's crawl budget spent on a transport endpoint instead of pages.
Either way the commits are split on purpose, so if you'd rather not touch the default I'm happy to keep just the
New LiveView apps enable a longpoll fallback by default, but the generated `robots.txt` is empty, so search engine crawlers spend crawl budget polling a transport endpoint that has no indexable content.
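The chain:
1. New apps mount `socket "/live"` with `longpoll:` enabled, and the generated `app.js` sets `longPollFallbackMs: 2500`.
2. Search engine renderers don't open WebSockets, so they fall back to longpoll and fetch `/live/longpoll`.
3. Every fetch is wasted crawl budget.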
I ran into this on a production LiveView site:
This PR adds `Disallow: /live/longpoll` to the scaffolded `robots.txt`, and documents the behavior on `longPollFallbackMs` in `Phoenix.Socket`.
Why robots.txt is the right lever
The renderer honors `robots.txt` for the resources it fetches during rendering (JS/CSS/XHR), so the `Disallow` actually stops the longpoll fetch; it doesn't merely deindex it.
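To make the effect of the rule concrete, here is a minimal sketch of the prefix matching a crawler applies to `Disallow` lines. It is deliberately simplified: it only reads the `User-agent: *` group and plain path prefixes, and ignores `Allow`, wildcards, and per-crawler groups. The function name `isDisallowed`, the `User-agent: *` line, and the query string in the example are assumptions for illustration, not part of this PR.

```javascript
// Simplified robots.txt check (illustrative only): handles the
// "User-agent: *" group and plain-prefix Disallow rules. Real parsers
// also handle Allow, wildcards, "$" anchors, and per-crawler groups.
function isDisallowed(robotsTxt, path) {
  let inWildcardGroup = false;
  const prefixes = [];

  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      inWildcardGroup = value === "*";
    } else if (field === "disallow" && inWildcardGroup && value !== "") {
      prefixes.push(value);
    }
  }

  // A path is blocked when it starts with any disallowed prefix.
  return prefixes.some((prefix) => path.startsWith(prefix));
}

// Assumed shape of the scaffolded file once the rule is added.
const robotsTxt = ["User-agent: *", "Disallow: /live/longpoll"].join("\n");

console.log(isDisallowed(robotsTxt, "/live/longpoll?vsn=2.0.0")); // true
console.log(isDisallowed(robotsTxt, "/")); // false
```

Because the match is a prefix match, the rule covers the longpoll URL with any query string while leaving every page route untouched.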
It also can't hide content. LiveView's disconnected ("dead") render already emits the
page's server-rendered HTML; the socket is for interactivity, not first-paint content.
So for apps that render in `mount`/`render` (the disconnected pass), nothing indexable lives behind the socket. The one exception is an app that gates indexable content behind `connected?(socket)`: it serves that content only over the socket, and this rule removes the renderer's last path to it (WebSocket was already unavailable). That's a discouraged pattern for SEO and such content indexes poorly today regardless, but it's worth calling out.
Scope (a deliberate default)
`/live/longpoll` is correct for the default socket mount (`Phoenix.LiveView.Socket` at `/live`). An app that remounts the socket, runs multiple sockets, or uses a plain `Phoenix.Socket` with longpoll would need to adjust the rule. Since this is a scaffolded file that bakes in the generator's own defaults, I kept it as a single unconditional line rather than templating it: it matches the existing "one shared static `robots.txt`" approach, and it's a harmless no-op for `--no-html`/`--no-live` apps where the `/live` socket is commented out anyway.
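For apps that do need to adjust the rule, the longpoll path is the socket's mount path with `/longpoll` appended. A small sketch of deriving the matching `robots.txt` lines follows; the helper name `longpollDisallowRule` and the `/socket` mount are hypothetical examples, not something this PR generates.

```javascript
// Hypothetical helper: build the robots.txt rule for a socket mount path.
// Assumes the longpoll transport lives at "<mount path>/longpoll".
function longpollDisallowRule(socketMountPath) {
  const base = socketMountPath.replace(/\/+$/, ""); // strip trailing slashes
  return `Disallow: ${base}/longpoll`;
}

// One rule per socket that has longpoll enabled.
const rules = ["/live", "/socket"].map(longpollDisallowRule);

console.log(rules.join("\n"));
// Disallow: /live/longpoll
// Disallow: /socket/longpoll
```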
The two commits are split so the `Phoenix.Socket` doc change can stand on its own if you'd rather take them separately.
Refs
https://developers.google.com/search/docs/crawling-indexing/javascript/fix-search-javascript
`robots.txt` governs the resources the renderer fetches: https://developers.google.com/search/blog/2024/12/crawling-december-resources