Skip to content

Make threads thread-safe#24439

Open
iabyn wants to merge 52 commits into
bleadfrom
davem/thread_races
Open

Make threads thread-safe#24439
iabyn wants to merge 52 commits into
bleadfrom
davem/thread_races

Conversation

@iabyn
Copy link
Copy Markdown
Contributor

@iabyn iabyn commented May 24, 2026

A programmer had a problem. They thought to themselves, "I know I'll solve it with threads!". have now problems. two They.

 -- a comment seen in a Reddit discussion.

This branch of nearly 50 commits fixes many thread-related race conditions and similar, and collectively make all the threads and threads::shared tests pass under valgrind's helgrind and drd tools.

These commits are approximately sorted into three batches: 1) fix up the threads distro; 2) fix up threads::shared; 3) add atomic stuff to make threads and signal handlers work together.

Some commits are actual fixes; others are refactoring in preparation for a fix; some are adding comments etc; some are fixing other (non-race related) issues I spotted along the way.

Given the large number of commits, potential reviewers might want to announce in advance that they'll review only a specific subset of the commits; or just proofread; or whatever.

  • This set of changes requires a perldelta entry, and it is included.

iabyn added 15 commits May 22, 2026 12:44
This fixes a race condition that was very occasionally causing
dist/threads/t/free.t to fail.

PL_veto_switch_non_tTHX_context was originally added as a global
boolean-ish variable. It was temporarily set to true during a thread's
freeing, to stop the thread's locale being switched to when that locale
was in the middle of being freed. The variable was then restored to its
previous value at the end of freeing the thread.

Unfortunately this wasn't thread-safe: if multiple threads were being
freed at the same time, the variable's value could get overwritten
and/or restored to the wrong value.

This commit changes the variable to be per-thread. This actually
simplifies things: just set the thread's instance variable to true
when freeing the thread, and you don't even need to restore the value to
false later on.

Given that there isn't always a one-to-one mapping between threads and
interpreters, it's possible that there's some subtlety I've overlooked.
But it seems to work and makes 'valgrind --tool=helgrind' happier.
Add some code comments to some functions and variables whose details
confused me.
Add some code comments to join() and detach(), better explaining when
the thread's interpreter is freed and when the thread itself is freed.
In S_ithread_create(), rename these variables:

    rc_stack_size     => setstack_err
    rc_thread_create  => create_err

They are return codes from function calls (hence the 'rc') but that
wasn't obvious to me at the time, and cognitively I was reading things
like

    if (rc_stack_size) { ...}

as "if a non-default stack size has been specified" rather than
"if there was an error while setting the stack size".

No doubt someone else's cognitive biases will work differently from mine
and in 10 years time they will rename these vars back to rc_foo.
In the ithread_create() XS function, the variable 'thread' is used early
on to refer to the parent thread, and later on to refer to the
just-created thread. This is confusing. Add a separate, tightly scoped
var - parent_thread - for the former purpose.
Currently a newly-created thread struct (not yet associated with a
running OS thread) is added to the list of threads very early.

This commit changes it so that it's added to the list after the thread
is fully created (including creating the OS thread) but hasn't yet
started executing (the underlying OS thread is still waiting on a lock
to be released to indicate that it should start running).

This change shouldn't make any practical difference, as currently the
MY_POOL.create_destruct_mutex mutex is held throughout the creation
process; so no other threads could see the not-fully-formed thread in
the list of threads. But it will become useful during the next few
commits which will try to reduce how long create_destruct_mutex is held
for.
Previously, create_destruct_mutex was locked on the line before
calling S_ithread_create(); that function would then be responsible
for unlocking it later.

This commit moves that locking to (almost) the first line of
S_ithread_create() instead. Since that static function is only called
from one place, this can be done safely without changing functionality.
However, by having both the locking and unlocking done in the same
function, it makes static analysers happier and doesn't require special
hints like PERL_TSA_RELEASE() any more.

And it also makes it easier for humans to understand the locking.
Previously, the XSUB ithread_create() called S_ithread_create(),
which created the thread struct, including its mutex, and locked it
early on. On return from S_ithread_create(), ithread_create() unlocked
that mutex, allowing the child to start running.

This commit moves that unlocking from just after where S_ithread_create()
is called from, to being the last action of S_ithread_create() itself.

This is virtually functionally the same, apart from a couple of lines of
code  in the parent which are now run *after* the unlock, but which are
unaffected by whether the child has just started running.

By having both the locking and unlocking done in the same
function, it makes static analysers happier and doesn't require special
hints like CLANG_DIAG_IGNORE_STMT(-Wthread-safety) any more.
There's a mutex, create_destruct_mutex, which is sort of global,
and which is currently held for the entirely of the duration of creating
a new perl thread (including the time-consuming cloning the parent's
interpreter).

This means that on a multi-processor system, only one new thread can be
created at a time, which is very slow.

I can see no good reason for holding this global lock for an extended
duration: all the cloning is just something done privately by the parent
thread copying stuff into a new interpreter which hasn't even been
allocated a new OS thread yet. Any other thread shouldn't care about
this and should be allowed to create its own child threads at the same
time.

So this commit modifies S_ithread_create() to only have
create_destruct_mutex held during the brief places where the global
thread pool state is modified.

Consider the following somewhat contrived code example:

    use threads;

    sub empty {}

    sub f {
        for (1..10000) {
            threads->new(\&empty)->join();
        }
    }

    for my $i (1..16) {
        push @t, threads->new(\&f);
    }
    $_->join() for @t;

It creates 16 parallel jobs, where each job is just creating and
destroying a thread 10000 times.

On my 8 core laptop, before this commit it took 64 wallclock secs;
after, it takes 10 secs. A similar total CPU (in fact slightly more),
but now spread over all the CPUs in parallel.

As well as a performance boost, part of the motivation for this commit
is to rationalise and simplify when both a thread's mutex and the global
mutex are being held at the same time - currently helgrind is
complaining the ordering of acquiring both locks isn't consistent.
Subsequent commits will address this further.
Following this commit, the global(ish) create_destruct_mutex lock will
now always be locked *before* a thread's mutex in situations where both
need to be held. Previously this wasn't consistent, and deadlocks were
possible. A new test file has been added, zz_deadlock.t, which
demonstrates a reproducible (if somewhat contrived) deadlock.

The biggest change in this commit is that a call to perl_destruct for a
thread is no longer done with the thread's mutex held. This should be
safe, as at that point the thread is no longer running. Other threads
may still hold a pointer to the thread object, and they can still lock
that to get the thread's status, but whether the thread's interpreter
is being actively freed or not at that point wont affect that.

The other change in this commit is a few tweaks to ordering in
S_ithread_create(). First, the newly-created thread struct doesn't get
its mutex locked now until we're about to actually create the OS thread
- it doesn't need locking before then. Second, the new thread struct is
added to the pool slightly earlier, just *before* the thread mutex has
to be locked and held until the end. (It has to be held until the end
because the lock is being used to stop the just-created OS thread from
starting to run).

The issue with perl_destruct() was that it would lock thread A's mutex,
then free all its objects, which might include thread objects if that
thread itself created other threads. If such a thread B was finished,
then deleting the thread B object might cause thread B's refcount to
reach zero and the thread struct would be removed from the pool,
involving locking create_destruct_mutex. Thus we have a situation of
thread A's mutex being locked and held and then the pool mutex being
locked.

Conversely, methods like threads->list() lock the pool, then traverse
the list of of threads, briefly locking each thread in turn. This is an
example of the opposite order. This is what zz_deadlock.t exercises.

The new test file is prefixed 'zz' so that it's the last to run. So if
it deadlocks and hangs, all the other test files at least get to run
first.
Two variants of 'if (...) {' are selected via #ifdef; this commit
moves the '{' at the end of both lines outside of the #ifdef to ensure
balanced braces for text editors.
Clarify the order of locking.
Rename the static function S_ithread_free() to S_ithread_dec_free() to
better indicate that its job is more like SvREFCNT_dec(): i.e. rather
than unconditionally free the thread, decrement its ref count and only
free if it reaches zero.

Then add some more code comments explaining what's happening with the
ref count at various places.
During the last stages of freeing a thread, its interpreter is freed.
Sometimes my_perl is set to MY_POOL.main_thread.interp for the remaining
few steps of thread destruction, to provide an interpreter for those
bits of code which need it. This is on the basis that the main
interpreter is freed last, so is always available.

However, this is a deeply unsatisfactory arrangement. It means that for
a short while both the being-free thread and the main thread share the
same interpreter, and even if that doesn't cause any active harm, it
makes tools like helgrind spit out false positives. In particular, it
was getting upset about free2.t, since the main thread was updating
PL_phase while the other thread was testing its value.

This commit instead resets my_perl to NULL in places where it would
otherwise point to a freed interpreter, and only narrowly and
temporarily sets it to MY_POOL.main_thread.interp in one place where
PerlMemShared_free() needs it.

My rationale is that it is better to set my_perl to NULL and deal with
any immediate and obvious SEGV fallout, than to set it to
MY_POOL.main_thread.interp and silently suffer from rare race
conditions.
@iabyn iabyn added the defer-next-dev This PR should not be merged yet, but await the next development cycle label May 24, 2026
iabyn added 13 commits May 24, 2026 15:09
This commit adds a harmless extra final lock and immediate unlock of
create_destruct_mutex just before the OS thread exits. This is to
workaround a false positive being reported by the

    valgrind --tool=helgrind

thread-race detection tool.

See the code comments for the gory details.
Add some comments at the top of this test file explaining its purpose.
This commit adds a new field, pool, to the thread struct to provide an
alternative way to determine the pool which the thread is a part of.
This avoids a rare race condition where the usual lookup method via
PL_modglobal crashes because PL_modglobal has already been freed.

In more detail. There is a pool of threads created via the 'threads'
module. Typically there is only a single pool, but where the perl
interpreter is embedded within an application that creates its own
interpreters and/or threads, each interpreter's invocation of 'use
threads' creates a new pool which is used by all threads created within
that interpreter via 'threads->new()' etc.

For a particular pool, each thread within that pool has an interpreter
with a PL_modglobal hash, and the "threads::_pool$XS_VERSION" key holds
an SvIV whose value is a pointer to the pool.

Many thread XS functions start with dMY_POOL, which declares and
retrieves the pool pointer via the passed pTHX.

There is a specific issue with S_ithread_dec_free(). This also does
dMY_POOL, but there is a rare race condition where PL_modglobal has
already been freed by that point, and so a SEGV occurs in dMY_POOL.

In particular, it's a race between detach() and the thread ending that
occasionally manifested itself in t/blocks.t. If a thread finishes at
almost exactly the same time as another thread calls detach() on that
finishing thread, the following can happen:

    running thread:             thread calling detach():

    finish perl code and
    return from perl to
    S_ithread_run()

    lock thread
    set PERL_ITHR_FINISHED
    unlock thread

                                ithread_detach():
                                    lock thread
                                    if PERL_ITHR_FINISHED
                                    call S_ithread_clear(), which does:
                                        perl_destruct(thread->interp)
                                        thread->interp = NULL

    lock thread
    call S_ithread_dec_free()
    which does:
        dMY_POOL;
        which tries to access
            thread->interp->Imodglobal
            SEGV

Now, that race could possibly be fixed by not unlocking and relocking
the thread between setting PERL_ITHR_FINISHED and calling
S_ithread_dec_free(),  but in reality its a bit more complex than that
as two locks are actually held and need to be unlocked and relocked in
the right order.

So this commit takes a different approach, and in particular makes
S_ithread_dec_free() no longer reliant on PL_modglobal still being
valid. It achieves this by storing a pool pointer in the thread
structure whenever a new thread is created, and using that in any
functions where a thread pointer is available in addition to pTHX.

The next couple of commits will fix up some of the untidiness.
In S_ithread_clear(), when freeing a perl's interpreter, set
thread->interp() early on, so that nothing else can access the
interpreter in a semi-freed state.

This commit doesn't fix any particular bug; its just a bit of general
defensive coding.
Remove this internal state flag.

When S_ithread_create() fails to successfully create an OS thread, this
flag used to be set to indicate to S_ithread_dec_free() and
S_ithread_clear() that the thread and interpreter should be
unconditionally freed, bypassing the normal reference count and flag
checks. This is a bit of a hack.

Instead, this commit makes it so that if the OS thread creation fails,
the ref count is set to 1, "finished" flags are set, then
S_ithread_dec_free() is called and behaves in a normal fashion (without
special-case handling) to free the otherwise unused interpreter and
threads struct.
whitespace-only: reindent a code block after previous commit removed a
condition.
This internal flag is defined as:

    #define PERL_ITHR_UNCALLABLE  (PERL_ITHR_DETACHED|PERL_ITHR_JOINED)

Eliminate PERL_ITHR_UNCALLABLE and instead explicitly use
(DETATCHED|JOINED) everywhere. The define obfuscates rather than
enlightens (IMHO) - it's not immediately obvious what is meant by an
"uncallable" thread.
Rather than declaring all the threads::foo() XS methods via:

    MODULE = threads    PACKAGE = threads    PREFIX = ithread_

    void
    ithread_foo(...)

just declare them as:

    MODULE = threads    PACKAGE = threads

    void
    foo(...)
        ....

Since we're not wrapping an underlying C library which has ithread_foo()
functions, the PREFIX is pointless: both XS variants above generate the
same threads.c file. It just confuses things. So remove it.
Add a new test file which tests for all of (well, most of) the warnings
and errors which shared.pm and shared.xs can generate.
When cloning an interpreter, the refcount of shared objects was being
incremented without being locked: a potential race condition.

This commit makes it hold the shared interpreter lock before doing so.

That lock is a recursive lock, where currently the only way to unlock it
is that recursive_lock_acquire() does a

    SAVEDESTRUCTOR_X(recursive_lock_release,lock)

so that the lock is automatically released during scope exit. This is
too heavyweight for our needs here (and because we're cloning we don't
necessarily even have a savestack). Also, incrementing an SV's RC can't
possibly croak(), so there's no danger of the lock left dangling.

So this commit adds a boolean arg to  recursive_lock_acquire() to
indicate whether it should do SAVEDESTRUCTOR_X() or not.
The bodies of the two XSUBs cond_wait() and cond_timedwait() are
virtually identical. So this commit moves the common code out into a
new static function, S_do_cond_timedwait().

No functional changes, except that in one warning message
cond_timedwait() no longer misidentifies itself as 'cond_wait'.
Explain better how the perl/XS cond_wait() functions mimic the
equivalent OS functions.
iabyn added 15 commits May 24, 2026 15:09
There are two forms of cond_wait():

    cond_wait($cond);
    cond_wait($cond, $lock);

The first form uses the same variable for both condition and lock.

When signalling, cond_signal() warns if the variable wasn't locked
prior: you're supposed to do:

    lock($cond);
    cond_signal($cond);

and if you skip the locking, you get the warning.

Unfortunately the warning check in threads.xs always checks whether the
$cond lock was locked; but if the second form of wait() was used, then
instead it aught to check whether $lock was locked.

So after doing a 2-arg wait, this:

    lock($lock);
    cond_signal($cond);

is the correct form; unfortunately it was giving a spurious warning:

    cond_signal() called on unlocked variable

A couple of commits ago wait() was made to start tracking what lock was
used as its argument; the presence of this value now allows us to check
the correct lock and thus avoid the incorrect warning.
This is mainly to keep thread-debugging tools like helgrind happy rather
than strictly being necessary. See the added code comments for more
details.
These two XS methods require their first argument to be a reference,
but this wasn't checked for and passing a non-ref argument used to SEGV.
shared_clone() is documented as refusing to clone certain SV types such
as typeglobs: instead it is supposed to warn or croak, depending on the
setting of $threads::shared::clone_warn.

However, the cloning code only did the check for *references* to
disallowed types. So the following:

    my $x = shared_clone(\*ABC);

would correctly produce this error:

    Unsupported ref type: GLOB at ...

while this:

    my $x = shared_clone([*ABC]);

produced this:

    perl: sv.c:3908: S_glob_assign_glob: Assertion `isGV_with_GP(gvgp_)' failed.

This commit changes the code so that non-ref types are ban-checked too.

The checking isn't comprehensive - I've currently made it just look for
GLOB and CODE as these are the documented exceptions.

GLOB in particular needs to be banned as it is hard to successfully
share a typeglob between threads: S_glob_assign_glob() in sv.c does all
sorts of things like adding a backreference from the glob's stash, and
it all goes horribly wrong in relation to which pointers are in which
interpreter. This is even if it gets that far; currently the XS sharing
code upgrades a new SV to a typeglob without giving it a GP, resulting
in the specific assertion failure shown above. This is fixed in the next
commit.

Also, its not very clear what the semantics of sharing or cloning a
typeglob between threads should be. If *foo is gets shared, do $foo,
@foo etc also become shared?

See the next commit for further banning of shared typeglobs.
Previously this code:

    my $sh : shared = *ABC;

produced:

    perl: sv.c:3908: S_glob_assign_glob: Assertion `isGV_with_GP(gvgp_)' failed.

Now it produces:

    Invalid value for shared scalar at ...

This is a follow-up to the previous commit, which did a similar thing
for shared_clone(). That commit worked at the perl level, objecting if a
typeglob was found while recursing through a structure to be cloned.

This commit works at the XS level, croaking on any attempt to copy a
private typeglob value into the shadow SV in the shared interpreter.

In addition, the magic-set code now avoids upgrading the shared SV to
the target type: this is now left to sv_setsv() itself, which will know
the best way to upgrade. In particular, a naive
sv_upgrade(nullsv, SVt_PVGV) left the SV as a GV with no GP, leading to
the assertion failure seen above.

See previous commit for some reasons why sharing a GV is a bad idea.
This debugging code, as used by Devel::Peek::Dump(),  would fail an
assert when dumping a malformed GV which had no GvNAME_HEK().

Now fixed.

I've also made it display the raw GvNAME_HEK() address in addition to
displaying its decoded NAME/NAMELEN values.
Make the code more readable, especially as it has many, and nested,
 #ifdefs.

Its mainly just whitespace changes, although I did put a pair of braces
round a complex single-statement else clause to better delineate it
visually, and I reformatted a code comment.
This commit is essentially a revert of
d525a7b from 2009, which changed

    Newxz(PL_psig_ptr,  SIG_SIZE, SV*);
    Newxz(PL_psig_name, SIG_SIZE, SV*);
to
    Newxz(PL_psig_name, 2 * SIG_SIZE, SV*);
    PL_psig_ptr = PL_psig_name + SIG_SIZE;

The original commit doesn't give any specific reasoning, but I assume it
was for a miniscule startup efficiency boost.

It worked on the assumption that both PL_psig_name and PL_psig_ptr
were SIG_SIZE sized arrays of SV pointers and so could be handled with a
single double-sized alloc(); but I want to break that assumption:
allowing for example, the pointers in PL_psig_ptr to be declared atomic.
For various signal-related vars, change e.g.

    PL_psig_ptr = (SV**)NULL;
to
    PL_psig_ptr = NULL;

We don't need casts on NULL these days, and the casts will get in the
way of declaring some vars to be atomic types in the next few commits.
This allows C variable types to be declared atomic, so that they can be
used in a lock-free thread-safe way.

However, there are are currently issues with portability and C++, so it
isn't enabled by default, nor via a Configure probe: it has to be
explicitly enabled via -Accflags='-DPERL_USE_ATOMIC'.

See the thread http://nntp.perl.org/group/perl.perl5.porters/270882 for
details of the issue.

For now it will be used in the next few commits to add optional
atomicity to PL_sig_pending and similar to achieve the target of making
'valgrind --tool=helgrind' run cleanly. So even if the thread signalling
issues can't yet be fixed in production, at least it can be confirmed
that the fix is conceptually correct, and that there are no other issues
being masked by all the noise from the failing threads signal code.
Declare PL_sig_pending, PL_psig_pend, PL_psig_ptr as PERL_ATOMIC()
types. Note that unless perl is explicitly built with

    -Accflags='-DPERL_USE_ATOMIC'

this macro is currently a NOOP. So by default this commit makes no
change in behaviour. (This macro was added by the previous commit.)

This fixes possible race conditions where the PL_[p]sig_foo variables
are being checked and/or updated by multiple threads. In particular, the
$thread->kill() method works by allowing the caller to update $thread's
PL_sig_foo state. The kill() XSUB locks the target thread's mutex, but
that isn't sufficient, as the target thread isn't also locking its mutex
every time it checks PL_sig_pending in PERL_ASYNC_CHECK().

In any case, normal mutex locking isn't appropriate here, as
PL_sig_pending is checked after every op is executed, so would slow
things down; and some of these variables are modified within a signal
handler, where locking would inappropriate.

The issues show up mainly in the dist/threads/t/kill*.t test files when
run under valgrind --tool=helgrind.
Add some comments at the top of this test file explaining what its
purpose is. This is essentially just a cut+paste of the commit message
which created the file.
Fix a threads race condition.

Perl_rcpv_free() does OP_REFCNT_LOCK then messes with a
reference-counted OP's reference count. A couple of asserts are done
just before the lock is acquired; this commit moves the locking earlier
so that in particular, the

    assert(rcpv->refcount);

is done while the ref count is locked. This fixes a race condition
flagged by valgrind --tool=drd

(It's not a very significant race condition, as it could only be
triggered on DEBUGGING builds).
The other PERL_REENTRANT_foo() macros have this line:

        CLANG_DIAG_IGNORE(-Wthread-safety)

Add it to PERL_REENTRANT_UNLOCK() too. Otherwise building perl under
clang with -Wthread-safety gives zillions of errors along the lines of:

    warning: mutex 'PL_env_mutex.lock' is not held on every path through here.

This is because the static analysis done by -Wthread-safety isn't smart
enough to realise that a recursive lock won't actually be
locked/unlocked when count > 1.

It's not clear to me why this wasn't needed in the UNLOCK macro when the
macros were first created.
@iabyn iabyn force-pushed the davem/thread_races branch from cec0ba9 to 19dbad9 Compare May 24, 2026 14:30
@iabyn
Copy link
Copy Markdown
Contributor Author

iabyn commented May 24, 2026

force-pushed a windows portability fix in commit "add workaround for helgrind false +ve"

@tonycoz
Copy link
Copy Markdown
Contributor

tonycoz commented May 24, 2026

Looks like it's deadlocking on MacOS in threads-shared, and somewhere in Cygwin.

Here's the run and tracebacks on MacOS, could the panic be the problem?:

tony@hermes threads-shared % ../../t/perl -I../.. -MTestInit=U2T t/err.t
ok 1 - loaded
ok 2 - warn bless main
ok 3 - warn _refcnt undef retval
ok 4 - warn _refcnt
ok 5 - warn cond_signal not locked
ok 6 - warn cond_broadcast not locked
ok 7 - $lock acquired
ok 8 - no warn cond_signal not locked
# sleeping for 1 second to acquire $cond
# sleeping for 1 second to acquire $lock
(much more than 1 second)
(lldb) thread list
Process 15293 stopped
* thread #1: tid = 0x95c404, 0x00000001816b89dc libsystem_kernel.dylib`__psynch_mutexwait + 8, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  thread #2: tid = 0x95c406, 0x00000001816b950c libsystem_kernel.dylib`__psynch_cvwait + 8
  thread #3: tid = 0x95c40b, 0x00000001816b89dc libsystem_kernel.dylib`__psynch_mutexwait + 8
(lldb) thread backtrace all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001816b89dc libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x00000001816f6e9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x00000001816f48e0 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 216
    frame #3: 0x00000001816fa1b4 libsystem_pthread.dylib`_pthread_cond_wait + 1120
    frame #4: 0x00000001050bfe4c shared.bundle`recursive_lock_acquire(my_perl=0x0000000104811df0, lock=0x000000010481ad80, defer_unlock=1, file="shared.xs", line=1290) at shared.xs:271:13
    frame #5: 0x00000001050c3f1c shared.bundle`Perl_sharedsv_lock(my_perl=0x0000000104811df0, ssv=0x000000080e63e6f0) at shared.xs:1290:5
    frame #6: 0x00000001050c3c8c shared.bundle`Perl_sharedsv_locksv(my_perl=0x0000000104811df0, sv=0x000000080e577eb8) at shared.xs:1305:5
    frame #7: 0x00000001043c5d20 perl`Perl_pp_lock(my_perl=0x0000000104811df0) at pp.c:7283:5
    frame #8: 0x00000001044ea4b4 perl`Perl_runops_standard(my_perl=0x0000000104811df0) at run.c:43:26
...
  thread #2
    frame #0: 0x00000001816b950c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x00000001816fa128 libsystem_pthread.dylib`_pthread_cond_wait + 980
    frame #2: 0x00000001050c3390 shared.bundle`S_do_cond_timedwait(my_perl=0x000000080e004000, ref_cond=0x000000080e6c2f90, abs=0, ref_lock=0x0000000000000000, timed=false) at shared.xs:1485:9
    frame #3: 0x00000001050bed94 shared.bundle`XS_threads__shared_cond_wait(my_perl=0x000000080e004000, cv=0x000000080e66acf0) at shared.xs:1787:15
    frame #4: 0x0000000104400f30 perl`Perl_rpp_invoke_xs(my_perl=0x000000080e004000, cv=0x000000080e66acf0) at inline.h:1221:9
    frame #5: 0x00000001044007fc perl`Perl_pp_entersub(my_perl=0x000000080e004000) at pp_hot.c:6628:9
    frame #6: 0x00000001044ea4b4 perl`Perl_runops_standard(my_perl=0x000000080e004000) at run.c:43:26
    frame #7: 0x00000001042dbfb4 perl`Perl_call_sv(my_perl=0x000000080e004000, sv=0x000000080e7e43c0, arg_flags=10) at perl.c:3269:13
    frame #8: 0x00000001050acb48 threads.bundle`S_jmpenv_run(my_perl=0x000000080e004000, action=0, thread=0x0000000810048000, len_p=0x000000016bbeef9c, exit_app_p=0x000000016bbeefa4, exit_code_p=0x000000016bbeefa0) at threads.xs:578:27
    frame #9: 0x00000001050abb54 threads.bundle`S_ithread_run(arg=0x0000000810048000) at threads.xs:688:18
    frame #10: 0x00000001816f9c58 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001816b89dc libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x00000001816f6e9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x00000001816f48e0 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 216
    frame #3: 0x00000001045786a8 perl`perl_tsa_mutex_lock(mutex=0x000000010481ad80) at util.c:6766:12
    frame #4: 0x00000001050c06dc shared.bundle`recursive_lock_release(my_perl=0x000000080e005000, ptr=0x000000010481ad80) at shared.xs:218:5
    frame #5: 0x00000001044ec458 perl`Perl_leave_scope(my_perl=0x000000080e005000, base=4) at scope.c:1653:13
    frame #6: 0x00000001043d575c perl`Perl_dounwind(my_perl=0x000000080e005000, cxix=0) at pp_ctl.c:1877:9
    frame #7: 0x00000001043d6620 perl`Perl_die_unwind(my_perl=0x000000080e005000, msv=0x000000080ed20cf0) at pp_ctl.c:2177:17
    frame #8: 0x0000000104577530 perl`Perl_vcroak(my_perl=0x000000080e005000, pat="panic: COND_WAIT (%d) [%s:%d]", args=0x000000016bc7a568) at util.c:1840:5
    frame #9: 0x0000000104577640 perl`Perl_croak_nocontext(pat="panic: COND_WAIT (%d) [%s:%d]") at util.c:1901:5
    frame #10: 0x00000001050c33c8 shared.bundle`S_do_cond_timedwait(my_perl=0x000000080e005000, ref_cond=0x000000080ed8b0c0, abs=0, ref_lock=0x000000080ed8b0f0, timed=false) at shared.xs:1485:9
    frame #11: 0x00000001050bed94 shared.bundle`XS_threads__shared_cond_wait(my_perl=0x000000080e005000, cv=0x000000080ec32cf0) at shared.xs:1787:15
    frame #12: 0x0000000104400f30 perl`Perl_rpp_invoke_xs(my_perl=0x000000080e005000, cv=0x000000080ec32cf0) at inline.h:1221:9
    frame #13: 0x00000001044007fc perl`Perl_pp_entersub(my_perl=0x000000080e005000) at pp_hot.c:6628:9
    frame #14: 0x00000001044ea4b4 perl`Perl_runops_standard(my_perl=0x000000080e005000) at run.c:43:26
    frame #15: 0x00000001042dbfb4 perl`Perl_call_sv(my_perl=0x000000080e005000, sv=0x000000080ed9a4f8, arg_flags=10) at perl.c:3269:13
    frame #16: 0x00000001050acb48 threads.bundle`S_jmpenv_run(my_perl=0x000000080e005000, action=0, thread=0x0000000810048180, len_p=0x000000016bc7af9c, exit_app_p=0x000000016bc7afa4, exit_code_p=0x000000016bc7afa0) at threads.xs:578:27
    frame #17: 0x00000001050abb54 threads.bundle`S_ithread_run(arg=0x0000000810048180) at threads.xs:688:18
    frame #18: 0x00000001816f9c58 libsystem_pthread.dylib`_pthread_start + 136
...

@tonycoz
Copy link
Copy Markdown
Contributor

tonycoz commented May 25, 2026

Cygwin also blocked on threads-shared/t/err.t in the same way:

tony@GANYMEDE ~/dev/perl/git/perl/dist/threads-shared
$ ../../perl -I../.. -MTestInit=U2T t/err.t
ok 1 - loaded
ok 2 - warn bless main
ok 3 - warn _refcnt undef retval
ok 4 - warn _refcnt
ok 5 - warn cond_signal not locked
ok 6 - warn cond_broadcast not locked
ok 7 - $lock acquired
ok 8 - no warn cond_signal not locked
# sleeping for 1 second to acquire $cond
# sleeping for 1 second to acquire $lock
(much more than 1 second)

Cygwin backtraces below, looks like this one also has a panic:

(gdb) info threads
  Id   Target Id                  Frame
  1    Thread 16332.0x24d8 "perl" 0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  2    Thread 16332.0x619c "sig"  0x00007fffd694d714 in ntdll!ZwReadFile ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  3    Thread 16332.0x4594 "perl" 0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  4    Thread 16332.0x7008 "perl" 0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
* 5    Thread 16332.0x1f34        0x00007fffd6951181 in ntdll!DbgBreakPoint ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
(gdb) thread 1
[Switching to thread 1 (Thread 16332.0x24d8)]
#0  0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
(gdb) bt
#0  0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#1  0x00007fffd3f70d10 in WaitForMultipleObjectsEx ()
   from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#2  0x00007fffd3f70c0e in WaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#3  0x00007fffa4596007 in cygwait (object=<optimized out>, timeout=0x0,
    mask=mask@entry=36)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/cygwait.cc:79
#4  0x00007fffa45f3271 in pthread_mutex::lock (this=0xa00724ba0,
    timeout=<optimized out>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/thread.cc:1812
#5  0x00007fffa46ddc54 in _sigfe () at sigfe.s:35
#6  0x00000004d75315bd in recursive_lock_acquire (my_perl=0xa000005e0,
    lock=0xa00702370, defer_unlock=1,
    file=0x4d753a0f0 <fake_rv+8> "shared.xs", line=1290)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:254
#7  0x00000004d7533e24 in Perl_sharedsv_lock (my_perl=0xa000005e0,
    ssv=0xa006e2848)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:1290
#8  0x00000004d7533e93 in Perl_sharedsv_locksv (my_perl=0xa000005e0,
    sv=0xa00638f48)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:1305
#9  0x00000003800fdd90 in Perl_pp_lock (my_perl=0xa000005e0) at pp.c:7283
#10 0x00000003801e719c in Perl_runops_standard (my_perl=0xa000005e0)
    at run.c:43
#11 0x000000038003902e in S_run_body (my_perl=0xa000005e0, oldscope=1)
    at perl.c:2901
#12 0x0000000380038baf in perl_run (my_perl=0xa000005e0) at perl.c:2814
#13 0x000000010040119e in main (argc=4, argv=0xa000004b0, env=0xa00000160)
    at perlmain.c:127
(gdb) thread 2
[Switching to thread 2 (Thread 16332.0x619c)]
#0  0x00007fffd694d714 in ntdll!ZwReadFile ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
(gdb) bt
#0  0x00007fffd694d714 in ntdll!ZwReadFile ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#1  0x00007fffd3f4dc43 in ReadFile ()
   from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#2  0x00007fffa460ec52 in wait_sig ()
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/sigproc.cc:1498
#3  0x00007fffa4594cd0 in cygthread::callfunc (
    this=this@entry=0x7fffa47855c0 <threads>,
    issimplestub=issimplestub@entry=false)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/cygthread.cc:130
#4  0x00007fffa45951c9 in cygthread::stub (
    arg=arg@entry=0x7fffa47855c0 <threads>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/cygthread.cc:173
#5  0x00007fffa4595cc5 in _cygtls::call2 (this=0xe1ce00,
    func=0x7fffa4595160 <cygthread::stub(void*)>,
    arg=0x7fffa47855c0 <threads>, buf=buf@entry=0xe1cd20)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/cygtls.cc:41
#6  0x00007fffa4595d7a in _cygtls::call (func=<optimized out>,
    arg=<optimized out>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/cygtls.cc:28
#7  0x00007fffd5137374 in KERNEL32!BaseThreadInitThunk ()
   from /cygdrive/c/WINDOWS/System32/KERNEL32.DLL
#8  0x00007fffd68fcc91 in ntdll!RtlUserThreadStart ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#9  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 16332.0x4594)]
#0  0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
(gdb) bt
#0  0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#1  0x00007fffd3f70d10 in WaitForMultipleObjectsEx ()
   from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#2  0x00007fffd3f70c0e in WaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#3  0x00007fffa4596007 in cygwait (object=<optimized out>,
    timeout=timeout@entry=0x0, mask=mask@entry=33)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/cygwait.cc:79
#4  0x00007fffa45f40b5 in pthread_cond::wait (this=0xa009e6e50,
    mutex=0xa008b6b80, timeout=0x0)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/thread.cc:1283
#5  0x00007fffa46ddc54 in _sigfe () at sigfe.s:35
#6  0x00000004d75345ba in S_do_cond_timedwait (my_perl=0xa0072e110,
    ref_cond=0xa00a7ebd8, abs=0, ref_lock=0x0, timed=false)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:1485
#7  0x00000004d7536999 in XS_threads__shared_cond_wait (my_perl=0xa0072e110,
    cv=0xa008c2808)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:1787
#8  0x000000038011be0d in Perl_rpp_invoke_xs (my_perl=0xa0072e110,
    cv=0xa008c2808) at /home/tony/dev/perl/git/perl/inline.h:1221
#9  0x000000038012df64 in Perl_pp_entersub (my_perl=0xa0072e110)
    at pp_hot.c:6628
#10 0x00000003801e719c in Perl_runops_standard (my_perl=0xa0072e110)
    at run.c:43
#11 0x0000000380039a75 in Perl_call_sv (my_perl=0xa0072e110, sv=0xa00ab2970,
    arg_flags=10) at perl.c:3269
#12 0x00000005aad22a6e in S_jmpenv_run (my_perl=0xa0072e110, action=0,
    thread=0xa00a77050, len_p=0x7ffdfccb4, exit_app_p=0x7ffdfccac,
    exit_code_p=0x7ffdfccb0)
    at /home/tony/dev/perl/git/perl/dist/threads/threads.xs:578
#13 0x00000005aad22e23 in S_ithread_run (arg=0xa00a77050)
    at /home/tony/dev/perl/git/perl/dist/threads/threads.xs:688
#14 0x00007fffa45f3b3d in pthread::thread_init_wrapper (arg=0xa0077a060)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/thread.cc:2034
#15 0x00007fffa4594781 in pthread_wrapper (arg=<optimized out>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/create_posix_thread.cc:79
#16 pthread_wrapper (arg=<optimized out>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/create_posix_thread.cc:39
#17 0x0000000000000000 in ?? ()
[Switching to thread 4 (Thread 16332.0x7008)]
#0  0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
(gdb) bt
#0  0x00007fffd694e1a4 in ntdll!ZwWaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#1  0x00007fffd3f70d10 in WaitForMultipleObjectsEx ()
   from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#2  0x00007fffd3f70c0e in WaitForMultipleObjects ()
   from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#3  0x00007fffa4596007 in cygwait (object=<optimized out>, timeout=0x0,
    mask=mask@entry=36)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/cygwait.cc:79
#4  0x00007fffa45f3271 in pthread_mutex::lock (this=0xa00724ba0,
    timeout=<optimized out>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/thread.cc:1812
#5  0x00007fffa46ddc54 in _sigfe () at sigfe.s:35
#6  0x00000004d753149a in recursive_lock_release (my_perl=0xa00ab8810,
    ptr=0xa00702370)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:218
#7  0x00000003801ebb74 in Perl_leave_scope (my_perl=0xa00ab8810, base=4)
    at scope.c:1653
#8  0x0000000380109e74 in Perl_dounwind (my_perl=0xa00ab8810, cxix=0)
    at pp_ctl.c:1877
#9  0x000000038010a79c in Perl_die_unwind (my_perl=0xa00ab8810,
    msv=0xa00baf508) at pp_ctl.c:2177
#10 0x0000000380261e43 in Perl_vcroak (my_perl=0xa00ab8810,
    pat=0x4d753a1f1 <fake_rv+265> "panic: COND_WAIT (%d) [%s:%d]",
    args=0x7ffbfc488) at util.c:1840
#11 0x0000000380261ea1 in Perl_croak_nocontext (
    pat=0x4d753a1f1 <fake_rv+265> "panic: COND_WAIT (%d) [%s:%d]")
    at util.c:1901
#12 0x00000004d75345e4 in S_do_cond_timedwait (my_perl=0xa00ab8810,
    ref_cond=0xa00ceb7c0, abs=0, ref_lock=0xa00ceb7f0, timed=false)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:1485
#13 0x00000004d7536999 in XS_threads__shared_cond_wait (my_perl=0xa00ab8810,
    cv=0xa008ecb68)
    at /home/tony/dev/perl/git/perl/dist/threads-shared/shared.xs:1787
#14 0x000000038011be0d in Perl_rpp_invoke_xs (my_perl=0xa00ab8810,
    cv=0xa008ecb68) at /home/tony/dev/perl/git/perl/inline.h:1221
#15 0x000000038012df64 in Perl_pp_entersub (my_perl=0xa00ab8810)
    at pp_hot.c:6628
#16 0x00000003801e719c in Perl_runops_standard (my_perl=0xa00ab8810)
    at run.c:43
#17 0x0000000380039a75 in Perl_call_sv (my_perl=0xa00ab8810, sv=0xa00d1e438,
    arg_flags=10) at perl.c:3269
#18 0x00000005aad22a6e in S_jmpenv_run (my_perl=0xa00ab8810, action=0,
    thread=0xa00ab8790, len_p=0x7ffbfccb4, exit_app_p=0x7ffbfccac,
    exit_code_p=0x7ffbfccb0)
    at /home/tony/dev/perl/git/perl/dist/threads/threads.xs:578
#19 0x00000005aad22e23 in S_ithread_run (arg=0xa00ab8790)
    at /home/tony/dev/perl/git/perl/dist/threads/threads.xs:688
#20 0x00007fffa45f3b3d in pthread::thread_init_wrapper (arg=0xa00b104c0)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/thread.cc:2034
--Type <RET> for more, q to quit, c to continue without paging--
#21 0x00007fffa4594781 in pthread_wrapper (arg=<optimized out>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/create_posix_thread.cc:79
#22 pthread_wrapper (arg=<optimized out>)
    at /usr/src/debug/cygwin-3.6.9-1/winsup/cygwin/create_posix_thread.cc:39
#23 0x0000000000000000 in ?? ()

Thread 5 is a system thread from attaching the debugger. Thread 2 looks like some cygwin signal delivery thread.

@tonycoz
Copy link
Copy Markdown
Contributor

tonycoz commented May 25, 2026

eC_ is 22 (Invalid argument) in the panic threads on both Cygwin and MacOS.

It might be worth pushing a smoke-me branch to get some coverage on the BSDs and Solaris.

(there's also https://github.com/vmactions/ but these are slow)

@jkeenan
Copy link
Copy Markdown
Contributor

jkeenan commented May 25, 2026

Is this p.r. something we're going to try to get in to perl-5.44? (It's large enough that, IMO, it should be deferred to the 5.45 dev cycle.)

@tonycoz
Copy link
Copy Markdown
Contributor

tonycoz commented May 25, 2026

Is this p.r. something we're going to try to get in to perl-5.44? (It's large enough that, IMO, it should be deferred to the 5.45 dev cycle.)

It's already marked defer-next-dev.

Comment thread dist/threads/threads.xs Outdated

/* Allocate thread structure in context of the main thread's interpreter */
{
MUTEX_LOCK(&my_pool->create_destruct_mutex);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these two lines need to be locked?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks they don't need to be - this is vestigal behaviour from when the function itself was invoked while locked.

Comment thread perl.h
# define PERL_ATOMIC(atype) _Atomic(atype)
# endif
# endif
#endif
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just use atomic_int instead? AFAIK that should be portable between C and C++. Or do we really need specific atomic types?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PL_sig_ptr has atomic SV* pointers, and there doesn't seem to be a built-in atomic pointer type?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a bit unfortunate.

iabyn added 4 commits May 25, 2026 13:31
This test checks that a warning is issued when two threads call
cond_wait() for the same condition var, but with different locks.

The problem is that shared.xs then calls the underlying OS's cond_wait()
functions with similarly bad values, triggering undefined behaviour. On
some platforms this caused pthreads_cond_wait() to return an error code,
triggering a panic and deadlock.

So just remove the test.
After some refactoring, there was a vestigial piece of locking around
a PerlMemShared_malloc() call in S_ithread_create(). Remove it, since
its unnecessary now.
A recent commit of mine made threads set their context to NULL when in
the late stages of being freed - previously they were set briefly to the
main thread's context. In theory at that point the context shouldn't be
used at all; if accessed, it likely indicates a bug.  I decided that
dereffing a null pointer was more likely to obviously show up a fault
than randomly accessing a different thread's context at this point.

This worked out on platforms where PERL_SET_CONTEXT() is implemented
directly; on platforms where it falls back to Perl_set_context(), it was
assert-failing due to the NULL argument.
The previous commit fixed the general Perl_set_context() function; it
turns out win32 has its own version of that function: fix that too.
Copy link
Copy Markdown
Contributor

@esabol esabol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the typo in the comment added to perlvars.h.

Comment thread perlvars.h
#endif

#ifdef USE_POSIX_2008_LOCALE
/* This variable is is initialised just once, at process start-up time, to
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo! is is should just be is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

defer-next-dev This PR should not be merged yet, but await the next development cycle

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants