
Conversation

@dzarukin (Contributor) commented Nov 7, 2025

This is the final PR that covers all the current needs for asynchronous Eigen threadpool runtime support.

Important notes:

  • Benchdnn treats the async runtime the same way as GPU when it comes to perf measurements. It also gained an abstraction that ensures, for correctness validation, that submission happens before any thread starts execution.
  • Auto-gen-based implementations become disabled for any threadpool implementation, as it's impossible to determine the threadpool flag at creation time.
  • Verbose functionality doesn't properly report times because the call can't be blocking; otherwise, it would lead to a deadlock. This is an area for future improvement.
  • The support itself (if not considering the amount of preliminary work promoted to main) is limited to two major changes:
    • Parallel calls change their capture from capture-by-reference to capture-by-copy. This ensures that all stack variables and objects, as well as the execution context, are used in delayed parallel tasks with their actual values or heap addresses, instead of referencing stack locations that may already be destroyed by the time of execution.
    • Sequential calls performing the last reduction (or any kind of post-parallel work) are wrapped into a parallel call with a single task issued. This is necessary to ensure that dependencies are followed (the runtime itself guarantees proper synchronization between consecutive parallel calls). Paired with a change on the main branch that forces a parallel call with a single task to be redirected into the threadpool runtime, rather than shortcutting execution on-the-spot as for other runtimes, this works. The other runtimes shouldn't be affected performance-wise (besides one extra call).

These two key principles must now be followed for updated CPU implementations (in the future, probably, every primitive and every implementation); see the sketch below.
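
To make the two principles concrete, here is a minimal, self-contained sketch. The `parallel` helper below is a sequential stand-in with the same `(ithr, nthr)` task signature as the library's internal parallel call; the function and variable names are illustrative, not taken from the PR:

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <vector>

// Stand-in for the runtime's parallel call. With the asynchronous threadpool
// the tasks may run after the caller has returned, so nothing captured by
// reference may live on the submitter's stack.
void parallel(int nthr, const std::function<void(int, int)> &fn) {
    for (int ithr = 0; ithr < nthr; ++ithr)
        fn(ithr, nthr); // sequential stand-in; the real runtime defers tasks
}

void execute_sum(const std::shared_ptr<std::vector<float>> &src,
        const std::shared_ptr<float> &dst, int nthr) {
    // Per-thread partial results live on the heap so they outlive this
    // function even if tasks start after it returns.
    auto partial_sums = std::make_shared<std::vector<float>>(nthr, 0.f);

    // Principle 1: capture by copy ([=]), not by reference. Copying the
    // shared_ptrs bumps their ref count and keeps src/partial_sums alive
    // until all tasks have finished.
    parallel(nthr, [=](int ithr, int nthr_) {
        const size_t chunk = (src->size() + nthr_ - 1) / nthr_;
        const size_t start = ithr * chunk;
        const size_t end = std::min(src->size(), start + chunk);
        float s = 0.f;
        for (size_t i = start; i < end; ++i)
            s += (*src)[i];
        (*partial_sums)[ithr] = s;
    });

    // Principle 2: the trailing reduction is not executed inline; it is
    // wrapped into a parallel call with a single task so the runtime orders
    // it after the previous parallel section.
    parallel(1, [=](int, int) {
        float s = 0.f;
        for (float v : *partial_sums)
            s += v;
        *dst = s;
    });
}
```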

In a nutshell, a developer should treat the async Eigen runtime in the following way:

  • The submitter thread comes in, extracts all or most of the pointers from the execution context, creates local variables, and reaches either:
    • the parallel section - in this case it saves all the stack variables created up to that moment and provides this snapshot to the parallel tasks (lambda functions). All objects are saved either by value (POD types) or by ref-count increase, ensuring they won't be destroyed until all tasks have finished;
    • a nested primitive - in this case it prepares a new execution context and scratchpad grantor, ensuring the lifetime of all newly created objects, and steps into the next execute call, either repeating this step or getting into the previous one.
  • The submitter then leaves the execute function, leaving the parallel tasks on the runtime side and destroying everything created along the way, potentially even before the tasks start execution. In reality this is not always the case, and threads can pick up tasks while the submitter is still inside execute, but for simplicity of reasoning, that is the worst-case scenario and the runtime should be treated as such when changes are considered (see the sketch below).
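
A schematic sketch of this worst-case mental model. `exec_ctx_t`, `scratchpad_t`, and `submit` are simplified, illustrative stand-ins for the real library classes and the threadpool submission, not the actual oneDNN types:

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Simplified stand-ins for the real execution context and scratchpad.
struct scratchpad_t {
    std::vector<char> buffer;
};
struct exec_ctx_t {
    const float *src = nullptr;
    float *dst = nullptr;
    std::shared_ptr<scratchpad_t> scratchpad; // shared so tasks keep it alive
};

// Stand-in for asynchronous submission: assume the task may run long after
// submit() returns (here it simply runs immediately).
void submit(const std::function<void()> &task) {
    task();
}

void nested_execute(const exec_ctx_t &parent_ctx, size_t nelems) {
    // The submitter builds a new context on the heap so it survives after
    // this function (and the enclosing execute calls) have returned.
    auto ctx = std::make_shared<exec_ctx_t>();
    ctx->src = parent_ctx.src;
    ctx->dst = parent_ctx.dst;
    ctx->scratchpad = std::make_shared<scratchpad_t>();
    ctx->scratchpad->buffer.resize(nelems * sizeof(float));

    // Worst-case assumption: the task starts only after every execute call on
    // the submitter's stack has returned, so everything it needs is captured
    // by value (here: a copy of the shared_ptr holding the new context).
    submit([ctx, nelems]() {
        for (size_t i = 0; i < nelems; ++i)
            ctx->dst[i] = ctx->src[i] * 2.f;
    });
}
```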

Covered primitives:

  • Reorder
  • Matmul
  • Convolution (all flavors)
  • Lnorm
  • Softmax
  • Eltwise/Binary/Reduction
  • Graph patterns based on those primitives + SDPA.

mgouicem and others added 16 commits November 6, 2025 21:35
Gemm-based infrastructure requires a lot of changes to support the
asynchronous runtime. However, it's impossible to disable the asynchronous
runtime alone, as the threadpool is not available at creation time through
the conventional API. There's an option to introduce an API to register a
threadpool before calling a primitive_desc creation function. If this
gets implemented, then the introduced calls can be modified to check for a
registered threadpool and, if it is present, check its flag to decide
whether an implementation should be enabled or disabled. For simplicity,
this API is not part of this commit.
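
A hypothetical sketch of what such a registration path could look like. `register_threadpool` and `gemm_based_impl_allowed` are invented for illustration and are not part of this PR or the existing API; the `threadpool_iface::ASYNCHRONOUS` flag name from the interop header is assumed:

```cpp
#include "oneapi/dnnl/dnnl_threadpool_iface.hpp"

namespace {
// Hypothetical global registration slot (not part of the existing API).
dnnl::threadpool_interop::threadpool_iface *g_registered_tp = nullptr;
} // namespace

// Hypothetical API: register a threadpool before primitive_desc creation so
// that implementations can inspect its flags at creation time.
void register_threadpool(dnnl::threadpool_interop::threadpool_iface *tp) {
    g_registered_tp = tp;
}

// Hypothetical creation-time check for gemm-based implementations: without a
// registered threadpool the flag is unknown, so stay disabled; with one,
// enable only for a synchronous threadpool.
bool gemm_based_impl_allowed() {
    using tp_iface = dnnl::threadpool_interop::threadpool_iface;
    if (!g_registered_tp) return false;
    return (g_registered_tp->get_flags() & tp_iface::ASYNCHRONOUS) == 0;
}
```
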
Changing the capture is needed to copy state into the lambda function, as
the original objects will be destroyed after the submitter exits their
creation context.
@dzarukin dzarukin requested review from a team as code owners November 7, 2025 06:22
@github-actions github-actions bot added platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 component:graph-api Codeowner: @oneapi-src/onednn-graph component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Nov 7, 2025
@dzarukin (Contributor, Author) commented Nov 7, 2025

make test

@TaoLv (Contributor) left a comment


Thank you Dima for taking care of the graph part. 👍

}

void prolong_temporary_scratchpad_lifetime(const stream_t *g_stream,
        const std::shared_ptr<temporary_scratchpad_t> &scratchpad) {

We may need the following to avoid compiler or tidy complaints:
MAYBE_UNUSED(g_stream);
MAYBE_UNUSED(scratchpad);
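
Applied to the quoted stub, the suggestion would look roughly like this (assuming the parameters are indeed unused in the relevant build configuration):

```cpp
void prolong_temporary_scratchpad_lifetime(const stream_t *g_stream,
        const std::shared_ptr<temporary_scratchpad_t> &scratchpad) {
    // Silence unused-parameter complaints from the compiler and clang-tidy.
    MAYBE_UNUSED(g_stream);
    MAYBE_UNUSED(scratchpad);
}
```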

