Fix possible 100% CPU loop in CivetWeb#2882

Open
DL6ER wants to merge 1 commit into development from fix/spinning-civet
@DL6ER DL6ER commented May 6, 2026

Member

What does this implement/fix?

Try tiny backoff to avoid tight retry loops on idle HTTPS keep-alive connections


Related issue or feature (if applicable): N/A

Pull request in docs with documentation (if applicable): N/A


By submitting this pull request, I confirm the following:

  1. I have read and understood the contributors guide, as well as this entire template. I understand which branch to base my commits and Pull Requests against.
  2. I have commented my proposed changes within the code.
  3. I am willing to help maintain this change if there are issues with it later.
  4. It is compatible with the EUPL 1.2 license
  5. I have squashed any insignificant commits. (git rebase)

Checklist:

  • The code change is tested and works locally.
  • I based my code and PRs against the repositories development branch.
  • I signed off all commits. Pi-hole enforces the DCO for all contributions
  • I signed all my commits. Pi-hole requires signatures to verify authorship
  • I have read the above and my PR is ready for review.

…connections

Signed-off-by: Dominik <dl6er@dl6er.de>
@DL6ER DL6ER marked this pull request as ready for review May 11, 2026 19:54
@DL6ER DL6ER requested a review from a team as a code owner May 11, 2026 19:54
Copilot AI review requested due to automatic review settings May 11, 2026 19:54
DL6ER commented May 11, 2026

Member Author

TODO: Need to submit a PR upstream

Copilot AI left a comment

Contributor

Pull request overview

This pull request introduces a small configurable backoff for non-blocking mbedTLS operations in CivetWeb to prevent tight retry loops that can otherwise drive a worker thread to 100% CPU on idle/keep-alive HTTPS connections.

Changes:

  • Define MG_MBEDTLS_WANT_RETRY_DELAY_MS (default: 5ms) as a tunable backoff interval.
  • Sleep briefly in the mbedTLS handshake loop when WANT_READ/WRITE (or async-in-progress) is returned.
  • Sleep briefly in the mbedTLS read path when WANT_READ/WRITE (or async-in-progress) is returned after a poll-readability event.
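The backoff idea in the list above can be sketched in isolation. This is a minimal, standalone illustration and not the code from this PR: only the macro name `MG_MBEDTLS_WANT_RETRY_DELAY_MS` and its 5 ms default come from the description, while `op_status`, `op_step_fn`, `sleep_ms`, `retry_with_backoff`, and `demo_step` are hypothetical names standing in for the TLS library's "want read/write" results and for one non-blocking handshake or read step.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Name and default taken from the PR description. */
#ifndef MG_MBEDTLS_WANT_RETRY_DELAY_MS
#define MG_MBEDTLS_WANT_RETRY_DELAY_MS 5
#endif

/* Hypothetical result codes standing in for the TLS library's
 * "want read / want write / async in progress" return values. */
enum op_status { OP_DONE = 0, OP_WANT_RETRY = 1, OP_FAILED = -1 };

/* Hypothetical callback: one non-blocking step of a TLS operation. */
typedef enum op_status (*op_step_fn)(void *ctx);

/* Sleep for ms milliseconds, resuming if a signal interrupts the sleep. */
static void
sleep_ms(long ms)
{
    struct timespec ts;
    assert(ms >= 0);
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR) {
        /* ts now holds the remaining time; sleep again */
    }
}

/* Call step() until it finishes, fails, or max_tries is used up.
 * Each "want retry" result is followed by a short sleep, so the
 * calling thread does not spin. *sleeps receives the sleep count. */
static enum op_status
retry_with_backoff(op_step_fn step, void *ctx, int max_tries, int *sleeps)
{
    int i;
    *sleeps = 0;
    for (i = 0; i < max_tries; i++) {
        enum op_status st = step(ctx);
        if (st != OP_WANT_RETRY) {
            return st;
        }
        sleep_ms(MG_MBEDTLS_WANT_RETRY_DELAY_MS);
        (*sleeps)++;
    }
    return OP_WANT_RETRY;
}

/* Demo step: reports "want retry" until the counter in ctx hits zero. */
static enum op_status
demo_step(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return OP_WANT_RETRY;
    }
    return OP_DONE;
}
```

Calling `retry_with_backoff(demo_step, &remaining, 10, &sleeps)` with `remaining` set to 3 sleeps three times before the step completes. The trade-off is a few milliseconds of added latency per retry in exchange for an idle thread instead of a busy one.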

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/webserver/civetweb/mod_mbedtls.inl | Adds a tiny sleep in the mbedTLS handshake retry loop to avoid spinning on non-blocking sockets. |
| src/webserver/civetweb/civetweb.c | Introduces the backoff macro and applies it in the mbedTLS read path when WANT_* is returned. |

@yubiuser yubiuser left a comment

Member

Shouldn't such a change rather be a patch at https://github.com/pi-hole/FTL/tree/master/patch/civetweb?

@rdwebdesign

Member

I think this is the intention:

> TODO: Need to submit a PR upstream

gkuchta commented May 24, 2026


FWIW, I think I may have run into this after issuing a request to the admin UI (clean session; I was just going to add a blocklist entry). I saw a single pihole-FTL thread spike to, and stay at, 100% CPU usage. From strace:

3185 20:44:39.629235 <... select resumed>) = 0 (Timeout) <0.048528>
3185 20:44:39.629253 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000} <unfinished ...>
3191 20:44:39.629265 poll([{fd=35, events=POLLIN}, {fd=36, events=POLLIN}, {fd=37, events=POLLIN}, {fd=38, events=POLLIN}, {fd=34, events=POLLIN}], 5, 2000) = 4 ([{fd=35, revents=POLLIN}, {fd=36, revents=POLLIN}, {fd=37, revents=POLLIN}, {fd=38, revents=POLLIN}]) <0.000007>
3191 20:44:39.629292 poll([{fd=35, events=POLLIN}, {fd=36, events=POLLIN}, {fd=37, events=POLLIN}, {fd=38, events=POLLIN}, {fd=34, events=POLLIN}], 5, 2000) = 4 ([{fd=35, revents=POLLIN}, {fd=36, revents=POLLIN}, {fd=37, revents=POLLIN}, {fd=38, revents=POLLIN}]) <0.000007>
3191 20:44:39.629319 poll([{fd=35, events=POLLIN}, {fd=36, events=POLLIN}, {fd=37, events=POLLIN}, {fd=38, events=POLLIN}, {fd=34, events=POLLIN}], 5, 2000) = 4 ([{fd=35, revents=POLLIN}, {fd=36, revents=POLLIN}, {fd=37, revents=POLLIN}, {fd=38, revents=POLLIN}]) <0.000007>

fd35 = 0.0.0.0:80
fd36 = 0.0.0.0:443
fd37 = [::]:80
fd38 = [::]:443
poll() returns POLLIN for all four listener fds in ~7us
FTL performs no accept/read/write (or any other calls) between polls; basically just an infinite poll() loop.
Recv-Q remains nonzero (Recv-Q was at 2) on listeners
no inbound 80/443 traffic observed via tcpdump
dns resolution continued without interruption

If I run into it again I can try to get some more useful info via gdb or something, but it's just my home network so I just dumped what info I could and HUP'd the process

DL6ER commented Jun 13, 2026

Member Author

Sorry for the long delay in replying - real life has been really busy lately.

Thanks for the detailed dump - that's genuinely useful.

One thing stands out though: the strace shows the spinning thread sitting in a tight poll() loop over the listener sockets (fd 35–38 = your :80/:443 listeners), returning POLLIN on all of them but never calling accept(), with Recv-Q=2 pending. That's the master/accept thread, whereas this PR fixes a busy-spin in the mbedTLS read/handshake path on connections that have already been accepted (pull_inner / mbed_ssl_handshake).

So as it stands this looks like it may be a related but distinct loop rather than the exact one this PR targets — possibly the accept loop or the worker queue getting wedged. It's plausible the two are connected (e.g. workers stuck spinning never return to the pool and starve the accept side), but the trace only shows one hot thread and it's the master on the listeners, so I can't tie it to the TLS path from this alone.
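For readers following along, the symptom described above (poll() returning POLLIN on a listener over and over while nothing calls accept()) follows from poll() being level-triggered: a listening socket stays readable for as long as a connection sits in its backlog. The sketch below is a standalone demonstration of that behavior on a loopback socket, not CivetWeb or FTL code; the helper name `polls_ready_without_accept` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: queue one connection on a loopback listener,
 * poll the listener `rounds` times WITHOUT calling accept(), and
 * return how many of those polls reported POLLIN. Then accept() once
 * and store the result of a final short poll in *ready_after_accept.
 * Returns -1 if any socket call fails. */
static int
polls_ready_without_accept(int rounds, int *ready_after_accept)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    struct pollfd pfd;
    int listener, client, conn, i, ready = 0;

    assert(rounds >= 0 && ready_after_accept != NULL);

    listener = socket(AF_INET, SOCK_STREAM, 0);
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0 || client < 0) {
        goto fail;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) != 0
        || listen(listener, 8) != 0
        || getsockname(listener, (struct sockaddr *)&addr, &len) != 0
        || connect(client, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        goto fail;
    }

    pfd.fd = listener;
    pfd.events = POLLIN;
    for (i = 0; i < rounds; i++) {
        pfd.revents = 0;
        /* The pending connection is never accepted here, so every
         * poll keeps reporting the listener as readable. */
        if (poll(&pfd, 1, 1000) == 1 && (pfd.revents & POLLIN)) {
            ready++;
        }
    }

    /* Draining the backlog is what clears the readiness. */
    conn = accept(listener, NULL, NULL);
    if (conn < 0) {
        goto fail;
    }
    pfd.revents = 0;
    *ready_after_accept = poll(&pfd, 1, 100); /* 0 means timeout */

    close(conn);
    close(client);
    close(listener);
    return ready;

fail:
    if (client >= 0) {
        close(client);
    }
    if (listener >= 0) {
        close(listener);
    }
    return -1;
}
```

With `rounds` set to 5, all five polls report the listener as ready, and only the poll after accept() times out. This shows the general mechanism only; it does not establish why the accept side in the reported trace stopped draining its backlog.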

If you hit it again, the one thing that would nail it down is a backtrace of the 100%-CPU thread under gdb:

gdb -p <pid>
(gdb) info threads
(gdb) thread apply all bt

(or at least bt for the hot LWP). That'll tell us immediately whether it's the mbedTLS loop this PR addresses or the accept side. The Recv-Q nonzero + no accept() symptom in particular makes me suspect the latter.

Appreciate you grabbing what you could before HUP'ing it!
