Disallow /live/longpoll in the generated robots.txt #6699

Open
patrols wants to merge 2 commits into phoenixframework:main from patrols:robots-disallow-live-longpoll

patrols commented Jun 7, 2026

New LiveView apps enable a longpoll fallback by default, but the generated
robots.txt is empty — so search engine crawlers spend crawl budget polling a
transport endpoint that has no indexable content.

The chain:

  • The generated endpoint mounts socket "/live" with longpoll: enabled, and the
    generated app.js sets longPollFallbackMs: 2500.
  • Googlebot's renderer doesn't open WebSockets, so on every render it falls back to
    longpoll and fetches /live/longpoll.
  • That endpoint is transport-only — it returns serialized socket messages, not HTML.
    Every fetch is wasted crawl budget.
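
The chain above can be sketched as a small decision model. This is a simplified, hypothetical illustration and not how phoenix.js is implemented: `pickTransport` and `transportPath` are made-up names, and the only values taken from this PR are the `/live` mount and the 2500 ms fallback window.

```javascript
// Simplified, hypothetical model of the fallback decision; not phoenix.js internals.
// primaryConnectMs: how long the WebSocket takes to connect (Infinity if it never does).
function pickTransport(primaryConnectMs, longPollFallbackMs) {
  // Without a fallback window, this model keeps waiting on the primary transport.
  if (longPollFallbackMs === undefined) return "websocket";
  // The primary transport wins only if it connects within the window.
  return primaryConnectMs <= longPollFallbackMs ? "websocket" : "longpoll";
}

// Path the chosen transport requests under a given socket mount.
function transportPath(mount, transport) {
  return `${mount}/${transport}`;
}

const browser = pickTransport(120, 2500);      // WebSocket connects quickly
const crawler = pickTransport(Infinity, 2500); // WebSocket never connects

console.log(transportPath("/live", browser)); // /live/websocket
console.log(transportPath("/live", crawler)); // /live/longpoll
```

In this model, a client whose WebSocket never connects always ends up on `/live/longpoll`, which is the case described for the crawler's renderer.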

I ran into this on a production LiveView site:

image

This PR adds Disallow: /live/longpoll to the scaffolded robots.txt, and documents
the crawler behavior next to the longPollFallbackMs option in the Phoenix.Socket docs.

Why robots.txt is the right lever

The renderer honors robots.txt for the resources it fetches during rendering
(JS/CSS/XHR), so the Disallow actually stops the longpoll fetch — it doesn't merely
deindex it.
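
To illustrate what the rule does and does not cover, here is a minimal prefix matcher. It is a rough sketch, not a full robots.txt parser and not how any crawler is implemented: it ignores user-agent groups, wildcards, and Allow lines. The `User-agent: *` line is an assumption, since the exact scaffolded file is not shown here, and `disallowedPrefixes` and `isBlocked` are hypothetical names.

```javascript
// Assumed robots.txt content; the exact scaffolded file is not shown in this PR text.
const robotsTxt = [
  "User-agent: *",
  "Disallow: /live/longpoll",
].join("\n");

// Collect Disallow path prefixes, ignoring comments and blank lines.
// Simplification: every rule is treated as applying to all user agents.
function disallowedPrefixes(text) {
  return text
    .split("\n")
    .map((line) => line.split("#")[0].trim())
    .filter((line) => /^disallow:/i.test(line))
    .map((line) => line.slice("disallow:".length).trim())
    .filter((prefix) => prefix.length > 0);
}

// A path is blocked when it starts with any Disallow prefix.
function isBlocked(text, path) {
  return disallowedPrefixes(text).some((prefix) => path.startsWith(prefix));
}

console.log(isBlocked(robotsTxt, "/live/longpoll?vsn=2.0.0")); // true
console.log(isBlocked(robotsTxt, "/live/websocket"));          // false
console.log(isBlocked(robotsTxt, "/"));                        // false
```

Only the longpoll endpoint matches; ordinary pages and the WebSocket path stay fetchable.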

It also can't hide content. LiveView's disconnected ("dead") render already emits the
page's server-rendered HTML; the socket is for interactivity, not first-paint content.
So for apps that render in mount/render (the disconnected pass), nothing indexable
lives behind the socket. The one exception is an app that gates indexable content behind
connected?(socket) — it serves that content only over the socket, and this rule removes
the renderer's last path to it (WebSocket was already unavailable). That's a discouraged
pattern for SEO, and such content indexes poorly today regardless, but it's worth calling
out.

Scope (a deliberate default)

/live/longpoll is correct for the default socket mount (Phoenix.LiveView.Socket at
/live). An app that remounts the socket, runs multiple sockets, or uses a plain
Phoenix.Socket with longpoll would need to adjust the rule. Since this is a scaffolded
file that bakes in the generator's own defaults, I kept it as a single unconditional line
rather than templating it: it matches the existing "one shared static robots.txt"
approach, and it's a harmless no-op for --no-html/--no-live apps where the /live
socket is commented out anyway.
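
For apps that do not use the default mount, the adjusted rules follow directly from the mount paths. A minimal sketch, assuming each socket serves longpoll at `<mount>/longpoll`; `longpollDisallowRule` is a hypothetical helper, and `/socket` is an illustrative extra mount that does not come from this PR:

```javascript
// Hypothetical helper: build the Disallow line for a socket mounted at `mount`.
function longpollDisallowRule(mount) {
  // Drop trailing slashes so "/live/" and "/live" give the same rule.
  const base = mount.replace(/\/+$/, "");
  return `Disallow: ${base}/longpoll`;
}

// "/live" is the generator default; "/socket" is an illustrative extra mount.
const rules = ["/live", "/socket"].map(longpollDisallowRule);
console.log(["User-agent: *", ...rules].join("\n"));
```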

The two commits are split so the Phoenix.Socket doc change can stand on its own if you'd
rather take them separately.

Refs

patrols added 2 commits June 6, 2026 22:57
Clients that can't open a WebSocket fall back to the LongPoll transport,
and search engine crawlers are in that group: their renderers don't open
WebSockets. With longPollFallbackMs set (the default in generated apps),
they fall back and repeatedly fetch /live/longpoll while rendering each
page. That endpoint serves no indexable content, so the requests are
wasted crawl budget.

Add the rule to the scaffolded robots.txt and assert it in the installer
test so a future template edit can't silently drop it.

Clients without WebSocket support fall back to LongPoll, and search
engine crawlers are the common case: their renderers don't open
WebSockets and repeatedly request the LongPoll endpoint (/live/longpoll
for LiveView), which serves no indexable content. Note this next to the
option that enables the fallback so the robots.txt advice has a home.
patrols changed the title from "Robots disallow live longpoll" to "Disallow /live/longpoll in the generated robots.txt" Jun 7, 2026
@SteffenDE

Member

I'm not yet convinced that this is the default we should generate. If a crawler opens the longpoll connection, it is because it's executing the JavaScript. So letting it do that seems just fine to me. As long as we don't see crawls failing due to this?

patrols commented Jun 8, 2026

Author

@SteffenDE Fair pushback. I pulled my site's crawl stats before replying, since my first instinct was the same as yours.

You're right that the longpoll fetch means the renderer is running our JS, and we want that. But Disallow: /live/longpoll doesn't stop the JS from running. The dead render already emits the full HTML before any socket exists, so Googlebot still runs the JS, renders, and indexes. It just skips the one request that returns transport frames instead of content.

And it adds up. With longPollFallbackMs: 2500 it keeps polling on every render. In my Search Console, 569 of 684 sampled JSON URLs (83%) were /live/longpoll, and JSON was 66% of all Googlebot requests vs 26% for real HTML pages. It's the single biggest thing Googlebot does on the site.
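
As a quick check on those figures, the arithmetic can be reproduced like this. The inputs are the numbers quoted above; the variable names are only illustrative.

```javascript
// Figures quoted in the comment above.
const longpollUrls = 569;
const sampledJsonUrls = 684;
const jsonSharePct = 66; // JSON share of all Googlebot requests
const htmlSharePct = 26; // HTML share of all Googlebot requests

// Share of sampled JSON URLs that were /live/longpoll, as a whole percent.
const longpollShare = Math.round((longpollUrls / sampledJsonUrls) * 100);
console.log(`${longpollShare}%`); // 83%

// How many JSON requests Googlebot made per HTML request.
const jsonPerHtml = (jsonSharePct / htmlSharePct).toFixed(1);
console.log(jsonPerHtml); // 2.5
```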

On "as long as we don't see crawls failing": none are failing. 94% are clean 200s, which is why it's easy to miss. The cost isn't errors, it's crawl budget spent on a transport endpoint instead of pages.

Either way the commits are split on purpose, so if you'd rather not touch the default I'm happy to keep just the Phoenix.Socket doc note.
