Commit 8682bc9

[RFC] API to simplify creation of task arenas constrained to NUMA nodes (#1679)
1 parent 3dfecb5 commit 8682bc9

2 files changed

+234
-4
lines changed


rfcs/proposed/numa_support/README.md

Lines changed: 8 additions & 4 deletions
@@ -30,6 +30,7 @@ to pin threads to different arenas to each of the NUMA nodes available on a syst
3030
across those `task_arena` objects and into associated `task_group` objects, and then wait for work
3131
again using both the `task_arena` and `task_group` objects.
3232

33+
```c++
3334
void constrain_for_numa_nodes() {
3435
std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
3536
std::vector<tbb::task_arena> arenas(numa_nodes.size());
@@ -39,7 +40,7 @@ again using both the `task_arena` and `task_group` objects.
3940
for (int i = 0; i < numa_nodes.size(); i++)
4041
arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]), 0);
4142

42-
// enqueue work to all but the first arena, using the task_group to track work
43+
// enqueue work to all but the first arena, using the task groups to track work
4344
// by using defer, the task_group reference count is incremented immediately
4445
for (int i = 1; i < numa_nodes.size(); i++)
4546
arenas[i].enqueue(
@@ -53,10 +54,11 @@ again using both the `task_arena` and `task_group` objects.
5354
tbb::parallel_for(0, N, [](int j) { f(w); });
5455
});
5556

56-
// join the other arenas to wait on their task_groups
57+
// join the other arenas to wait on their task groups
5758
for (int i = 1; i < numa_nodes.size(); i++)
5859
arenas[i].execute([&task_groups, i] { task_groups[i].wait(); });
5960
}
61+
```
6062
6163
### The need for application-specific knowledge
6264
@@ -108,6 +110,7 @@ Is it reasonable for a developer to expect that a series of loops, such as the o
108110
try to create a NUMA-friendly distribution of tasks so that accesses to the same elements of `b` and `c`
109111
in the two loops are from the same NUMA nodes? Or is this too much to expect without providing hints?
110112
113+
```c++
111114
tbb::parallel_for(0, N,
112115
[](int i) {
113116
b[i] = f(i);
@@ -118,6 +121,7 @@ in the two loops are from the same NUMA nodes? Or is this too much to expect wit
118121
[](int i) {
119122
a[i] = b[i] + c[i];
120123
});
124+
```
121125

122126
## Possible Sub-Proposals
123127

@@ -126,9 +130,9 @@ in the two loops are from the same NUMA nodes? Or is this too much to expect wit
126130
See [sub-RFC for increased availability of NUMA API](tbbbind-link-static-hwloc.org)
127131

128132

129-
### Add NUMA-constrained arenas
133+
### Create NUMA-constrained arenas
130134

131-
See [sub-RFC for creation and use of NUMA-constrained arenas](numa-arenas-creation-and-use.org)
135+
See [sub-RFC for creation of NUMA-constrained arenas](create-numa-arenas.md)
132136

133137
### NUMA-aware allocation
134138

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
1+
# API to simplify creation of task arenas constrained to NUMA nodes
2+
3+
This sub-RFC proposes an API to ease creation of a one-per-NUMA-node set of task arenas.
4+
5+
## Introduction
6+
7+
The code example in the [overarching RFC for NUMA support](README.md) shows the likely
8+
pattern of using task arenas to distribute computation across all NUMA domains on a system.
9+
Let's take a closer look at the part where arenas are created and initialized.
10+
11+
```c++
12+
std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
13+
std::vector<tbb::task_arena> arenas(numa_nodes.size());
14+
std::vector<tbb::task_group> task_groups(numa_nodes.size());
15+
16+
// initialize each arena, each constrained to a different NUMA node
17+
for (int i = 0; i < numa_nodes.size(); i++)
18+
arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]), 0);
19+
```
20+
21+
The first line obtains a vector of NUMA node IDs for the system. Then, a vector of the same size
22+
is created to store `tbb::task_arena` objects, each constrained to one of the NUMA nodes.
23+
Another vector holds `task_group` instances used later to submit and wait for completion
24+
of the work in each of the arenas - it is necessary because `task_arena` does not provide
25+
any work synchronization API. Finally, the loop over all NUMA nodes initializes associated
26+
task arenas with proper constraints.
27+
28+
While not incomprehensible, the code is quite verbose and arguably too explicit for the typical scenario
29+
of creating a set of arenas across all available NUMA domains. There is also a risk of subtle issues.
30+
The default constructor of `task_arena` reserves a slot for an application thread. The arena initialization
31+
on the last line explicitly overrides it with 0 so that TBB worker threads can take all the slots; however,
32+
this nuance might be unknown and easy to miss, potentially resulting in underutilization of CPU resources.
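
To make the slot arithmetic concrete, here is a minimal sketch under a simplifying assumption: the number of slots available to TBB worker threads is the arena concurrency minus the reserved slots. The function `worker_slots` and the 16-thread NUMA node are hypothetical illustrations, not part of oneTBB.

```c++
#include <algorithm>
#include <cassert>

// Simplified model, NOT oneTBB code: how many arena slots remain for TBB worker
// threads once some slots are reserved for application threads.
constexpr unsigned worker_slots(unsigned max_concurrency, unsigned reserved_slots) {
    // Reserved slots cannot exceed the arena size.
    return max_concurrency - std::min(max_concurrency, reserved_slots);
}

void worker_slots_example() {
    // Hypothetical NUMA node with 16 hardware threads.
    assert(worker_slots(16, 1) == 15);  // default of 1: one slot is held back from workers
    assert(worker_slots(16, 0) == 16);  // explicit 0: workers may fill the whole arena
}
```

In the example above, work reaches most arenas only through `enqueue`, so a slot held back for an application thread in those arenas may simply stay idle.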
33+
34+
## Proposal
35+
36+
We propose to introduce a special function to create the set of task arenas, one per NUMA node on the system.
37+
The initialization code equivalent to the example above would be:
38+
39+
```c++
40+
std::vector<tbb::task_arena> arenas = tbb::create_numa_task_arenas();
41+
std::vector<tbb::task_group> task_groups(arenas.size());
42+
```
43+
44+
The rest of the code in that example might be rewritten with the API proposed in
45+
[Waiting in a task arena](../task_arena_waiting/readme.md):
46+
47+
```c++
48+
// enqueue work to all but the first arena, using the task groups to track work
49+
for (int i = 1; i < arenas.size(); i++)
50+
arenas[i].enqueue(
51+
[] { tbb::parallel_for(0, N, [](int j) { f(w); }); },
52+
task_groups[i]
53+
);
54+
55+
// directly execute the work to completion in the remaining arena
56+
arenas[0].execute([] {
57+
tbb::parallel_for(0, N, [](int j) { f(w); });
58+
});
59+
60+
// join the other arenas to wait on their task groups
61+
for (int i = 1; i < arenas.size(); i++)
62+
arenas[i].wait_for(task_groups[i]);
63+
```
64+
65+
### Public API
66+
67+
The function has the following signature:
68+
69+
```c++
70+
// Defined in tbb/task_arena.h
71+
72+
namespace tbb {
73+
std::vector<tbb::task_arena> create_numa_task_arenas(
74+
task_arena::constraints other_constraints = {},
75+
unsigned reserved_slots = 0
76+
);
77+
}
78+
```
79+
80+
It optionally takes a `constraints` argument to change default arena settings such as maximal concurrency
81+
(the upper limit on the number of threads), core type etc.; the `numa_id` value in `other_constraints`
82+
is ignored. The second optional argument allows overriding the number of reserved slots, which by default
83+
is 0 (unlike the `task_arena` construction default of 1) for the reasons described in the introduction.
84+
85+
These arena parameters were selected for pre-setting because there appear to be practical use cases to modify
86+
them uniformly for the whole arena set - e.g., to suppress the use of hyper-threading or to reserve a slot
87+
for a dedicated application thread. For other arena parameters, such as priorities and thread leave policy,
88+
no obvious use cases are seen for uniformly changing default values; these can be addressed on demand.
89+
90+
The function returns a `std::vector` of created arenas. The arenas should not be initialized,
91+
in order to allow changing certain arena settings before use.
92+
93+
### Possible implementation
94+
95+
```c++
96+
std::vector<tbb::task_arena> create_numa_task_arenas(
97+
tbb::task_arena::constraints other_constraints,
98+
unsigned reserved_slots)
99+
{
100+
std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
101+
std::vector<tbb::task_arena> arenas;
102+
arenas.reserve(numa_nodes.size());
103+
for (tbb::numa_node_id nid : numa_nodes) {
104+
other_constraints.numa_id = nid;
105+
arenas.emplace_back(other_constraints, reserved_slots);
106+
}
107+
return arenas;
108+
}
109+
```
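
The effect of this loop on the constraints can be illustrated without oneTBB. The sketch below uses stand-in types (`constraints_model`, `arena_model`) and a stand-in function `create_numa_arena_models`, all hypothetical, and takes the node IDs as an argument instead of querying the system. It only shows that a caller-provided `numa_id` is overwritten while other settings carry over to every arena.

```c++
#include <cassert>
#include <vector>

// Stand-in types for illustration only; they are NOT the oneTBB classes.
struct constraints_model {
    int numa_id = -1;               // -1 plays the role of "not set"
    int max_threads_per_core = -1;  // -1 plays the role of "not set"
};

struct arena_model {
    constraints_model constraints;
    unsigned reserved_slots;
};

// Mirrors the loop above: one arena per node ID, with numa_id overwritten.
std::vector<arena_model> create_numa_arena_models(
    const std::vector<int>& numa_nodes,
    constraints_model other_constraints = {},
    unsigned reserved_slots = 0)
{
    std::vector<arena_model> arenas;
    arenas.reserve(numa_nodes.size());
    for (int nid : numa_nodes) {
        other_constraints.numa_id = nid;  // any caller-provided numa_id is lost here
        arenas.push_back(arena_model{other_constraints, reserved_slots});
    }
    return arenas;
}

void numa_arena_models_example() {
    constraints_model c;
    c.numa_id = 7;               // ignored by the function
    c.max_threads_per_core = 1;  // applied uniformly to every arena
    std::vector<arena_model> arenas = create_numa_arena_models({0, 1}, c);
    assert(arenas.size() == 2);
    assert(arenas[1].constraints.numa_id == 1);
    assert(arenas[1].constraints.max_threads_per_core == 1);
    assert(arenas[1].reserved_slots == 0);
}
```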
110+
111+
### Shortcomings and downsides
112+
113+
The following criticism was raised in the RFC discussion:
114+
115+
- It might be confusing that a single `constraints` object is used to generate multiple arenas,
116+
especially with part of it (`numa_id`) being ignored.
117+
- The proposed API addresses just one, albeit important, use case for creating a set of arenas.
118+
119+
See the "universal function" alternative below for related considerations.
120+
121+
## Considered alternatives
122+
123+
### Sub-classing `task_arena`
124+
125+
The earlier proposal [PR #1559](https://github.com/uxlfoundation/oneTBB/pull/1559) also aimed to simplify
126+
the typical usage pattern of NUMA arenas, with the possibility of extension to other similar cases.
127+
128+
It suggested adding a new class derived from `task_arena` which would have "only necessary methods
129+
to allow submission and waiting of a parallel work", by selectively exposing methods of `task_arena`
130+
and also adding a method to wait for work completion. Instances of such a class could only be created
131+
via a factory function that would instantiate a ready-to-use arena for each of the NUMA domains.
132+
133+
```c++
134+
class constrained_task_arena : protected task_arena {
135+
public:
136+
using task_arena::is_active;
137+
using task_arena::terminate;
138+
using task_arena::max_concurrency;
139+
140+
using task_arena::enqueue;
141+
using task_arena::execute; // not in the original proposal
142+
143+
void wait();
144+
friend std::vector<constrained_task_arena> initialize_numa_constrained_arenas();
145+
};
146+
```
147+
148+
In the code example used in this document, that API (with the method `execute` also exposed)
149+
would fully eliminate explicit use of `task_group`:
150+
151+
```c++
152+
std::vector<tbb::constrained_task_arena> arenas =
153+
tbb::initialize_numa_constrained_arenas();
154+
155+
// enqueue work to all but the first arena
156+
for (int i = 1; i < arenas.size(); i++)
157+
arenas[i].enqueue([] {
158+
tbb::parallel_for(0, N, [](int j) { f(w); });
159+
});
160+
161+
// directly execute the work to completion in the remaining arena
162+
arenas[0].execute([] {
163+
tbb::parallel_for(0, N, [](int j) { f(w); });
164+
});
165+
166+
// wait for the work in the other arenas to complete
167+
for (int i = 1; i < arenas.size(); i++)
168+
arenas[i].wait();
169+
```
170+
171+
While offering a more concise and less error-prone API for the problem in question, that approach has
172+
its downsides:
173+
- Adding a special "flavour" of `task_arena` potentially increases the library learning curve and
174+
might create confusion about which class to use in which conditions, and how these interoperate.
175+
- It seems very specialized, able to address only a specific and quite narrow set of use cases.
176+
- Arenas are pre-initialized and so could not be adjusted after creation.
177+
178+
Our proposal, instead, aims at a single usability aspect with incremental improvements/extensions
179+
to the existing oneTBB classes and usage patterns, leaving other aspects to complementary proposals.
180+
Specifically, the [Waiting in a task arena](../task_arena_waiting/readme.md) RFC improves the joint
181+
use of a task arena and a task group to submit and wait for work. Combined, these extensions will
182+
provide only a slightly more verbose solution for the NUMA use case, while being more flexible
183+
and having greater potential for other useful extensions and applications.
184+
185+
### Universal function to create an arena set
186+
187+
In the discussion of this RFC, it was suggested to generalize the function for creating a set of arenas
188+
with various prescribed characteristics, beyond only NUMA-bound arenas. It would take a "constraint
189+
generator" type which would represent multiple arena constraint objects according to given options,
190+
and each of these constraints would be used to create an arena.
191+
192+
Generally, such a generalized API needs some way to describe which arena parameters should vary
193+
across the arena set and how, and which should remain uniform. Possible ways for that are
194+
to use some descriptive "language" and/or to generate parameter sets by a function.
195+
196+
In the suggestion, construction arguments of a constraint generator (including value sets
197+
and named patterns) serve as the description language.
198+
Another way could be to add named patterns as possible data values for `task_arena::constraints`;
199+
for example, `tbb::task_arena::constraints c{ .numa_id = tbb::task_arena::iterate }`
200+
would serve as a pattern for generating a set of constraints bound to NUMA domains.
201+
202+
Using a function to generate arena parameter sets seems even more flexible compared to a description
203+
language, as that function would not be limited in which parameters to alter, and how.
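
As a rough sketch of the function-based variant, assume a factory that calls a user-provided generator once per arena. Everything here is hypothetical: `create_arena_models`, the stand-in types `constraints_model` and `arena_model`, and the made-up node IDs are illustrations only and do not exist in oneTBB.

```c++
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in types for illustration only; they are NOT the oneTBB classes.
struct constraints_model {
    int numa_id = -1;
    int core_type = -1;
};

struct arena_model {
    constraints_model constraints;
};

// Hypothetical "universal" factory: calls make_constraints(i) for i in [0, count)
// and creates one arena per returned constraints object.
std::vector<arena_model> create_arena_models(
    std::size_t count,
    const std::function<constraints_model(std::size_t)>& make_constraints)
{
    std::vector<arena_model> arenas;
    arenas.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        arenas.push_back(arena_model{make_constraints(i)});
    return arenas;
}

void generator_example() {
    const std::vector<int> numa_nodes = {0, 1, 2};  // made-up node IDs
    std::vector<arena_model> arenas =
        create_arena_models(numa_nodes.size(), [&](std::size_t i) {
            constraints_model c;
            c.numa_id = numa_nodes[i];  // the generator decides what varies per arena
            return c;
        });
    assert(arenas.size() == 3);
    assert(arenas[2].constraints.numa_id == 2);
}
```

The generator is free to vary any field (here only `numa_id`), which is the flexibility this alternative offers, at the cost of the caller writing the generator.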
204+
205+
Compared to the proposal, this approach would necessarily be more verbose because of the need to
206+
describe or generate a set of constraints. That could however be mitigated for common use cases
207+
with presets provided by the library allowing for reasonably concise code. For example:
208+
```c++
209+
std::vector<tbb::task_arena> arenas = tbb::create_task_arenas(tbb::iterate_over_numa_ids);
210+
```
211+
where `iterate_over_numa_ids` is a predefined variable of a type accepted by the function.
212+
213+
Yet it is questionable if such flexibility is useful at the moment, and so if it justifies
214+
extra complexity. As of now, we know just a few arena set patterns with potential practical usage:
215+
besides NUMA arenas, we can think of splitting CPU resources by core type and of creating a set of
216+
arenas with different priorities. We do not however recall any requests to simplify these use cases.
217+
218+
## Open questions
219+
- Instead of a free-standing function in namespace `tbb`, should we consider
220+
a static member function in class `task_arena`?
221+
- The proposal does not consider arena priority, simply keeping the default `priority::normal`.
222+
Are there use cases for pre-setting priorities? Similarly for the experimental thread leave policy.
223+
- Are there more practical use cases which could justify the universal function approach?
224+
- Need to consider alternatives to silently ignoring `numa_id` in constraints, such as an exception
225+
or undefined behavior.
226+
- Are there any reasons for the API to first go out as an experimental feature?
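
For the question about `numa_id`, the exception alternative could look roughly like the check below. This is only a sketch: `constraints_model` is a stand-in type, `reject_preset_numa_id` is a hypothetical helper, and the value -1 is assumed to represent an unset `numa_id`.

```c++
#include <cassert>
#include <stdexcept>

// Stand-in type for illustration only; NOT the oneTBB class.
struct constraints_model {
    int numa_id = -1;  // assumption: -1 means "not set"
};

// Hypothetical check a creation function could run before building arenas.
void reject_preset_numa_id(const constraints_model& other_constraints) {
    if (other_constraints.numa_id != -1)
        throw std::invalid_argument("numa_id must not be set in other_constraints");
}

// Returns true if the check rejects the given constraints.
bool is_rejected(const constraints_model& c) {
    try {
        reject_preset_numa_id(c);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```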
