The ISO C++ Standards Committee (WG21) held its first in-person meeting since our February 2020 meeting in Prague during the week of November 6-12 in Kona, Hawaii. This was the first hybrid WG21 meeting, attended both in person and remotely. As usual, Red Hat sent Jason Merrill and me to attend in person, with Jonathan Wakely chairing the Library Working Group remotely. The meeting was primarily focused on completing C++23.
As usual, I spent the majority of the time in the Concurrency and Parallelism study group (SG1). SG1 had a light schedule for this meeting, with early focus on papers pertinent to the overall agenda of completing C++23.
First up was P2588 - Relax std::barrier phase completion step guarantees.
This paper was in response to two national body (NB) comments, DE-135 and US 63-131. For certain architectures, GPUs in particular, there was a desire to relax the original wording for std::barrier phase completion to allow the supplied phase completion handler to run on an arbitrary thread, not necessarily one that arrived at the barrier. The reasoning here is that for locality of reference, it is desirable to give the implementation wide latitude as to where it runs the completion. The counter argument is that this behavior can be quite surprising if the completion handler relies on thread local storage. After discussion, SG1 took the following polls:
- In response to DE-135 and US 63-131, we move to apply the change suggested in P2588R1.
- In response to DE-135 and US 63-131, we move to apply the change suggested in P2588R1 with the words "or it is a new thread" removed.
The results were no consensus for the first poll and consensus for the second. The good news is that barrier phase completion will continue to have the usual surprises for those who enjoy playing with matches in the pool of gasoline that is thread local storage.
Next up was P2559 - Plan for Concurrency Technical Specification Version 2.
This paper and the associated discussion are an attempt to determine what to do with the Concurrency Technical Specification (TS) v2. Some features, Hazard Pointers and Read-Copy-Update for instance, are likely to make it into C++26 outside of the TS, albeit in a less feature-rich form. Other long-languishing features, Concurrent Queues for example, lack an active champion. SG1 had consensus on applying the following papers to the TS:
- P0290R2 - synchronized_value
- P1478R7 - Byte-wise Atomic memcpy
- P2396R0 - Concurrency TS 2 Fixes
- P1202R4 - Asymmetric Fences
Regarding Concurrent Queues lacking a champion, Detlef Vollmann agreed to bring a revised version by the end of the year, incorporating the most recent feedback from the Library Evolution Working Group (LEWG).
Monday concluded with discussion of P2629 - Barrier token-less split arrive/wait.
The goal of this paper is to extend the interface of std::barrier in a way that supports a particularly efficient data pipelining idiom built from pairs of barriers. Much of the discussion focused on whether this was a type with a different name than std::barrier (which would otherwise share almost all of barrier's implementation detail). There was consensus within SG1 that we are interested in this approach, with guidance to the authors to provide more clarity around algorithms that might mix current std::barrier operations with barrier pipelining as proposed by the paper.
First up on Tuesday was P2690 - C++17 parallel algorithms and P2300. The goal of this paper is to unify C++17 parallel algorithms with P2300's schedulers. The paper proposes an execute_on(<scheduler>, <execution policy>) mechanism that can be passed as an argument to a parallel algorithm in place of the existing execution policy types, allowing an algorithm to be customized by scheduler and policy type. Most of the discussion was around whether this approach was preferred, or whether execution policies should instead be extended to accept a scheduler (and potentially other tuning parameters), as is existing practice within Kokkos. The discussion also identified a need to clarify what an execution policy really means, as the current state is a bit nebulous.
Next up on Tuesday was P2079 - Shared execution engine for executors.
The aim of this paper is to define the library surface area of a 'system' thread pool for P2300 Sender/Receivers. The question here is how to expose a system-provided thread pool (e.g. libdispatch, the Windows thread pool, etc.) in a way that meets the structured concurrency aims of P2300. There are many interesting problems here with how to scope the lifetime of work, and what to do if work is discovered to have leaked through those scopes, over the threads the runtime provides.
The topic is made more difficult because there are entirely reasonable expectations that this mechanism might be replaceable within a particular domain; for instance, scheduling work within an application's GUI message pump, substituting a third-party scheduler (e.g. oneTBB), or using a custom scheduler in the case of embedded systems (e.g. FreeRTOS).
SG1 spent some time discussing lifetime concerns, the relationship of the proposal to the (as of yet not formally proposed in a paper) async_scope, how this might work in a freestanding context, and how to perhaps expose an extension point to allow non-stdlib replacement.
I have performed some experiments within libstdc++'s implementation to determine the feasibility of having the implementation gain control at exit of main() to enforce task lifetimes in a runtime thread pool. These experiments currently depend on the internals of how the particular combination of libstdc++ and glibc handles pre/post main() processing. We could perhaps discuss standardizing that, which might simplify the problem, but it would require cooperation with various C runtimes.
Wednesday started with a discussion of three SIMD related papers:
- P1928 - Merge data-parallel types from the Parallelism TS 2
- P2638 - Intel's response to P1915 for std::simd parallelism in TS 2
- P2663 - Proposal to support interleaved complex values in std::simd
I did not attend these sessions, but the results of polling were as follows:
Unanimous consent to forward P1928; SG1 does not believe there are SG1 concerns related to future revisions and papers for std::simd so those papers should go to the Library Evolution Working Group (LEWG) directly.
Next up on Wednesday was P2616 - Making std::atomic notification/wait usable in more situations.
The gist of this paper is that, as currently specified, it is possible to call notify() on an atomic which is in the process of being destroyed, which is undefined behavior. All three major implementations (libstdc++, libc++, Microsoft STL) work in practice because they use the address of the atomic only as a 'bag of bits' to compute a hash (the interested reader can read about libstdc++'s current implementation here).
The paper proposes a few ways to address the situation. SG1's preference was to standardize a new atomic_notify_token that notifiers acquire from an atomic that is known to not be in the process of being destroyed and which can perform that 'bag of bits to hash' dance in a way that is pleasing to the Undefined Behavior deities. This change would see the .notify_one() and .notify_all() members of atomic deprecated in (presumably) C++26.
Last up on Wednesday was P2643 - Improving C++ concurrency features.
This paper proposes several enhancements to the atomic wait functionality. The first is to return the last observed value of the atomic from .wait(). The reasoning is that most uses of .wait() end up performing an atomic load to get the value of the atomic, even though the implementation has already performed that load in order to exit the wait. Another improvement proposed by the paper is to allow a predicate to be passed to .wait(), which will be called with the current value of the atomic inside the library's wait loop. Two of the stdlib implementations already implement wait internally in terms of a function accepting a predicate. P2643 also proposes adding timed waits to atomic. This latter proposal runs into a problem: <atomic> is part of the freestanding C++ specification, but <chrono> is not. P2643 also suggests the possibility of having the algorithmic choices employed by wait() be controllable through some sort of hinting mechanism. The authors of P2643 believe the design surface area of such a feature is sufficiently large that it is beyond the scope of the proposal, and will bring a new version of P2643 to the winter meeting without that aspect included.
First up on Thursday was P2019 - Usability improvements for std::thread.
This proposal aims to introduce attributes for thread name and stack size that can be passed to the constructor of std::thread and std::jthread. Most of the discussion was around what 'stack size' means in terms of the Standard, which is itself fairly mum on the topic of 'stack' as a whole. There were the usual sorts of name and namespace bike shedding to round out the discussion. In the end there was unanimous consent to forward the proposal for the thread name attribute to LEWG and consensus to forward the stack size attribute. SG1 wants to see the paper again when LWG takes up the wording review.
Next up was review of LWG3756, which sought to clarify whether atomic_flag is signal safe.
SG1 agreed that it was, and provided suggested clarifying wording.
Next up on Thursday was P2689 - Atomic Refs Bounded to Memory Orderings & Atomic Accessors.
This paper proposes adding an atomic_accessor type for use with std::mdspan. One issue, however, is that the most common use of a std::atomic_ref returned by such an accessor is atomic arithmetic, and the arithmetic operators enforce sequentially consistent memory ordering where the most common usage would prefer relaxed ordering. SG1 discussed several options and settled on adding three new atomic_ref types:
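- atomic_ref_relaxed (relaxed loads and stores)
- atomic_ref_acq_rel (load acquire, store release)
- atomic_ref_seq_cst (sequentially consistent loads and stores)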
These types will provide no option to supply a different memory order to their atomic operations. Oddball usages where there might be the desire to mix memory models can always construct an atomic_ref over the same address.
Next up on Thursday's schedule was P2633 - thread_local_inherit: Enhancing thread-local storage.
Lacking a champion for the paper, SG1 adjourned early for lunch.
After lunch, Dr. Mark Batty gave a presentation on P1780 Modular Relaxed Dependencies: A new approach to the Out-Of-Thin-Air Problem.
Having not attained sufficient mastery in these dark arts, I did not attend this session, opting instead to retreat pool-side and implement a prototype of atomic_ref_[relaxed, acq_rel, seq_cst] for libstdc++.