
Conversation

@dzarukin (Contributor) commented Nov 7, 2025

This is the final PR that covers all the current needs for asynchronous Eigen threadpool runtime support.

Important notes:

  • Benchdnn treats the async runtime the same way as GPU when it comes to perf measurements. It also gained an abstraction that ensures, for correctness validation, that submission happens before any thread starts execution.
  • Auto-gen-based implementations become disabled for any threadpool implementation, as it's impossible to determine the threadpool flag at creation time.
  • Verbose functionality doesn't properly report times because the call can't be blocking; otherwise, it would lead to a deadlock. This is an area for future improvement.
  • The support itself (if not considering the amount of preliminary work promoted to main) is limited to two major changes:
    • Parallel calls change their capture from capture-by-reference to capture-by-copy. This ensures that all stack variables and objects, as well as the execution context, are used in delayed parallel tasks with their actual values or heap addresses, instead of referencing stack locations that may already be destroyed by the time of execution.
    • Sequential calls performing the last reduction (or any kind of post-parallel work) are wrapped into a parallel call with a single task issued. This is necessary to ensure that dependencies are followed (the runtime itself guarantees proper synchronization between consecutive parallel calls). Paired with a change on the main branch that forces a parallel call with a single task to be redirected into the threadpool runtime, rather than shortcutting execution on-the-spot as for other runtimes, this works. The other runtimes shouldn't be affected performance-wise (besides one extra call).

These two key principles must now be followed for updated CPU implementations (in the future, probably, every primitive and every implementation); see the sketch below.
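
To make the two principles concrete, here is a minimal, self-contained sketch. The `parallel` helper below is a sequential stand-in with the same `(ithr, nthr)` task signature as the library's internal parallel call; the function and variable names are illustrative, not taken from the PR:

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <vector>

// Stand-in for the runtime's parallel call. With the asynchronous threadpool
// the tasks may run after the caller has returned, so nothing captured by
// reference may live on the submitter's stack.
void parallel(int nthr, const std::function<void(int, int)> &fn) {
    for (int ithr = 0; ithr < nthr; ++ithr)
        fn(ithr, nthr); // sequential stand-in; the real runtime defers tasks
}

void execute_sum(const std::shared_ptr<std::vector<float>> &src,
        const std::shared_ptr<float> &dst, int nthr) {
    // Per-thread partial results live on the heap so they outlive this
    // function even if tasks start after it returns.
    auto partial_sums = std::make_shared<std::vector<float>>(nthr, 0.f);

    // Principle 1: capture by copy ([=]), not by reference. Copying the
    // shared_ptrs bumps their ref count and keeps src/partial_sums alive
    // until all tasks have finished.
    parallel(nthr, [=](int ithr, int nthr_) {
        const size_t chunk = (src->size() + nthr_ - 1) / nthr_;
        const size_t start = ithr * chunk;
        const size_t end = std::min(src->size(), start + chunk);
        float s = 0.f;
        for (size_t i = start; i < end; ++i)
            s += (*src)[i];
        (*partial_sums)[ithr] = s;
    });

    // Principle 2: the trailing reduction is not executed inline; it is
    // wrapped into a parallel call with a single task so the runtime orders
    // it after the previous parallel section.
    parallel(1, [=](int, int) {
        float s = 0.f;
        for (float v : *partial_sums)
            s += v;
        *dst = s;
    });
}
```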

In a nutshell, a developer should treat the async Eigen runtime in the following way:

  • The submitter thread comes in, extracts all or most of the pointers from the execution context, creates local variables, and reaches either:
    • the parallel section - in this case it saves all the stack variables created up to that moment and provides this snapshot to the parallel tasks (lambda functions). All objects are saved either by value (POD types) or by ref-count increase, ensuring they won't be destroyed until all tasks have finished;
    • a nested primitive - in this case it prepares a new execution context and scratchpad grantor, ensuring the lifetime of all newly created objects, and steps into the next execute call, either repeating this step or getting into the previous one.
  • The submitter then leaves the execute function, leaving the parallel tasks on the runtime side and destroying everything created along the way, potentially even before the tasks start execution. In reality this is not always the case, and threads can pick up tasks while the submitter is still inside execute, but for simplicity of reasoning, that is the worst-case scenario and the runtime should be treated as such when changes are considered (see the sketch below).
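
A schematic sketch of this worst-case mental model. `exec_ctx_t`, `scratchpad_t`, and `submit` are simplified, illustrative stand-ins for the real library classes and the threadpool submission, not the actual oneDNN types:

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Simplified stand-ins for the real execution context and scratchpad.
struct scratchpad_t {
    std::vector<char> buffer;
};
struct exec_ctx_t {
    const float *src = nullptr;
    float *dst = nullptr;
    std::shared_ptr<scratchpad_t> scratchpad; // shared so tasks keep it alive
};

// Stand-in for asynchronous submission: assume the task may run long after
// submit() returns (here it simply runs immediately).
void submit(const std::function<void()> &task) {
    task();
}

void nested_execute(const exec_ctx_t &parent_ctx, size_t nelems) {
    // The submitter builds a new context on the heap so it survives after
    // this function (and the enclosing execute calls) have returned.
    auto ctx = std::make_shared<exec_ctx_t>();
    ctx->src = parent_ctx.src;
    ctx->dst = parent_ctx.dst;
    ctx->scratchpad = std::make_shared<scratchpad_t>();
    ctx->scratchpad->buffer.resize(nelems * sizeof(float));

    // Worst-case assumption: the task starts only after every execute call on
    // the submitter's stack has returned, so everything it needs is captured
    // by value (here: a copy of the shared_ptr holding the new context).
    submit([ctx, nelems]() {
        for (size_t i = 0; i < nelems; ++i)
            ctx->dst[i] = ctx->src[i] * 2.f;
    });
}
```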

Covered primitives:

  • Reorder
  • Matmul
  • Convolution (all flavors)
  • Lnorm
  • Softmax
  • Eltwise/Binary/Reduction
  • Graph patterns based on those primitives + SDPA.

mgouicem and others added 16 commits November 6, 2025 21:35
Gemm-based infrastructure requires a lot of changes to support the
asynchronous runtime. However, it's impossible to disable the asynchronous
runtime alone, as the threadpool is not available at creation time through
the conventional API. There's an option to introduce an API to register a
threadpool before calling a primitive_desc creation function. If this
gets implemented, then the introduced calls can be modified to check for a
registered threadpool and, if it is present, check its flag to decide
whether an implementation should be enabled or disabled. For simplicity,
this API is not part of this commit.
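
A hypothetical sketch of what such a registration path could look like. `register_threadpool` and `gemm_based_impl_allowed` are invented for illustration and are not part of this PR or the existing API; the `threadpool_iface::ASYNCHRONOUS` flag name from the interop header is assumed:

```cpp
#include "oneapi/dnnl/dnnl_threadpool_iface.hpp"

namespace {
// Hypothetical global registration slot (not part of the existing API).
dnnl::threadpool_interop::threadpool_iface *g_registered_tp = nullptr;
} // namespace

// Hypothetical API: register a threadpool before primitive_desc creation so
// that implementations can inspect its flags at creation time.
void register_threadpool(dnnl::threadpool_interop::threadpool_iface *tp) {
    g_registered_tp = tp;
}

// Hypothetical creation-time check for gemm-based implementations: without a
// registered threadpool the flag is unknown, so stay disabled; with one,
// enable only for a synchronous threadpool.
bool gemm_based_impl_allowed() {
    using tp_iface = dnnl::threadpool_interop::threadpool_iface;
    if (!g_registered_tp) return false;
    return (g_registered_tp->get_flags() & tp_iface::ASYNCHRONOUS) == 0;
}
```
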
Changing the capture is needed to copy state into the lambda function, as
the original objects will be destroyed after the submitter exits their
creation context.
@dzarukin dzarukin requested review from a team as code owners November 7, 2025 06:22
@github-actions github-actions bot added platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 component:graph-api Codeowner: @oneapi-src/onednn-graph component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Nov 7, 2025
@dzarukin (Contributor, Author) commented Nov 7, 2025

make test

@TaoLv (Contributor) left a comment


Thank you Dima for taking care of the graph part. 👍

}

void prolong_temporary_scratchpad_lifetime(const stream_t *g_stream,
        const std::shared_ptr<temporary_scratchpad_t> &scratchpad) {

We may need the following to avoid compiler or tidy complaints:
MAYBE_UNUSED(g_stream);
MAYBE_UNUSED(scratchpad);
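
Applied to the quoted stub, the suggestion would look roughly like this (assuming the parameters are indeed unused in the relevant build configuration):

```cpp
void prolong_temporary_scratchpad_lifetime(const stream_t *g_stream,
        const std::shared_ptr<temporary_scratchpad_t> &scratchpad) {
    // Silence unused-parameter complaints from the compiler and clang-tidy.
    MAYBE_UNUSED(g_stream);
    MAYBE_UNUSED(scratchpad);
}
```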

