|
| 1 | +# API to simplify creation of task arenas constrained to NUMA nodes |
| 2 | + |
| 3 | +This sub-RFC proposes an API to ease creation of a one-per-NUMA-node set of task arenas. |
| 4 | + |
| 5 | +## Introduction |
| 6 | + |
| 7 | +The code example in the [overarching RFC for NUMA support](README.md) shows the likely |
| 8 | +pattern of using task arenas to distribute computation across all NUMA domains on a system. |
| 9 | +Let's take a closer look at the part where arenas are created and initialized. |
| 10 | + |
| 11 | +```c++ |
| 12 | + std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes(); |
| 13 | + std::vector<tbb::task_arena> arenas(numa_nodes.size()); |
| 14 | + std::vector<tbb::task_group> task_groups(numa_nodes.size()); |
| 15 | + |
| 16 | + // initialize each arena, each constrained to a different NUMA node |
| 17 | + for (int i = 0; i < numa_nodes.size(); i++) |
| 18 | + arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]), 0); |
| 19 | +``` |
| 20 | +
|
| 21 | +The first line obtains a vector of NUMA node IDs for the system. Then, a vector of the same size |
| 22 | +is created to store `tbb::task_arena` objects, each constrained to one of the NUMA nodes. |
| 23 | +Another vector holds `task_group` instances used later to submit and wait for completion |
| 24 | +of the work in each of the arenas - it is necessary because `task_arena` does not provide |
| 25 | +any work synchronization API. Finally, the loop over all NUMA nodes initializes associated |
| 26 | +task arenas with proper constraints. |
| 27 | +
|
| 28 | +While not incomprehensible, the code is quite verbose and arguably too explicit for the typical scenario |
| 29 | +of creating a set of arenas across all available NUMA domains. There is also a risk of subtle issues. |
| 30 | +The default constructor of `task_arena` reserves a slot for an application thread. The arena initialization |
| 31 | +at the last line explicitly overrides it with 0 to let TBB worker threads take all the slots; however, |
| 32 | +this nuance is easy to overlook, potentially resulting in underutilization of CPU resources. |
| 33 | +
|
| 34 | +## Proposal |
| 35 | +
|
| 36 | +We propose to introduce a special function to create the set of task arenas, one per NUMA node on the system. |
| 37 | +The initialization code equivalent to the example above would be: |
| 38 | +
|
| 39 | +```c++ |
| 40 | + std::vector<tbb::task_arena> arenas = tbb::create_numa_task_arenas(); |
| 41 | + std::vector<tbb::task_group> task_groups(arenas.size()); |
| 42 | +``` |
| 43 | + |
| 44 | +The rest of the code in that example might be rewritten with the API proposed in |
| 45 | +[Waiting in a task arena](../task_arena_waiting/readme.md): |
| 46 | + |
| 47 | +```c++ |
| 48 | + // enqueue work to all but the first arena, using the task groups to track work |
| 49 | + for (int i = 1; i < arenas.size(); i++) |
| 50 | + arenas[i].enqueue( |
| 51 | + [] { tbb::parallel_for(0, N, [](int j) { f(w); }); }, |
| 52 | + task_groups[i] |
| 53 | + ); |
| 54 | + |
| 55 | + // directly execute the work to completion in the remaining arena |
| 56 | + arenas[0].execute([] { |
| 57 | + tbb::parallel_for(0, N, [](int j) { f(w); }); |
| 58 | + }); |
| 59 | + |
| 60 | + // join the other arenas to wait on their task groups |
| 61 | + for (int i = 1; i < arenas.size(); i++) |
| 62 | + arenas[i].wait_for(task_groups[i]); |
| 63 | +``` |
| 64 | +
|
| 65 | +### Public API |
| 66 | +
|
| 67 | +The function has the following signature: |
| 68 | +
|
| 69 | +```c++ |
| 70 | +// Defined in tbb/task_arena.h |
| 71 | +
|
| 72 | +namespace tbb { |
| 73 | + std::vector<tbb::task_arena> create_numa_task_arenas( |
| 74 | + task_arena::constraints other_constraints = {}, |
| 75 | + unsigned reserved_slots = 0 |
| 76 | + ); |
| 77 | +} |
| 78 | +``` |
| 79 | + |
| 80 | +It optionally takes a `constraints` argument to change default arena settings such as maximal concurrency |
| 81 | +(the upper limit on the number of threads), core type etc.; the `numa_id` value in `other_constraints` |
| 82 | +is ignored. The second optional argument allows overriding the number of reserved slots, which by default |
| 83 | +is 0 (unlike the `task_arena` construction default of 1) for the reasons described in the introduction. |
| 84 | + |
| 85 | +These arena parameters were selected for pre-setting because there appear to be practical use cases for modifying |
| 86 | +them uniformly for the whole arena set - e.g., to suppress the use of hyper-threading or to reserve a slot |
| 87 | +for a dedicated application thread. For other arena parameters, such as priorities and thread leave policy, |
| 88 | +no obvious use cases are seen for uniformly changing default values; this can be addressed on demand. |
| 89 | + |
| 90 | +The function returns a `std::vector` of created arenas. The function should not initialize the arenas, |
| 91 | +so that certain arena settings can still be changed before use. |
| 92 | + |
| 93 | +### Possible implementation |
| 94 | + |
| 95 | +```c++ |
| 96 | +std::vector<tbb::task_arena> create_numa_task_arenas( |
| 97 | + tbb::task_arena::constraints other_constraints, |
| 98 | + unsigned reserved_slots) |
| 99 | +{ |
| 100 | + std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes(); |
| 101 | + std::vector<tbb::task_arena> arenas; |
| 102 | + arenas.reserve(numa_nodes.size()); |
| 103 | + for (tbb::numa_node_id nid : numa_nodes) { |
| 104 | + other_constraints.numa_id = nid; |
| 105 | + arenas.emplace_back(other_constraints, reserved_slots); |
| 106 | + } |
| 107 | + return arenas; |
| 108 | +} |
| 109 | +``` |
| 110 | +
|
| 111 | +### Shortcomings and downsides |
| 112 | +
|
| 113 | +The following criticism was raised in the RFC discussion: |
| 114 | +
|
| 115 | +- It might be confusing that a single `constraints` object is used to generate multiple arenas, |
| 116 | + especially with part of it (`numa_id`) being ignored. |
| 117 | +- The proposed API addresses just one, albeit important, use case for creating a set of arenas. |
| 118 | +
|
| 119 | +See the "universal function" alternative below for related considerations. |
| 120 | +
|
| 121 | +## Considered alternatives |
| 122 | +
|
| 123 | +### Sub-classing `task_arena` |
| 124 | +
|
| 125 | +The earlier proposal [PR #1559](https://github.com/uxlfoundation/oneTBB/pull/1559) also aimed to simplify |
| 126 | +the typical usage pattern of NUMA arenas, with the possibility of extending it to other similar cases. |
| 127 | +
|
| 128 | +It suggested adding a new class derived from `task_arena` which would have "only necessary methods |
| 129 | +to allow submission and waiting of a parallel work", by selectively exposing methods of `task_arena` |
| 130 | +and also adding a method to wait for work completion. Instances of such a class could only be created |
| 131 | +via a factory function that would instantiate a ready-to-use arena for each of the NUMA domains. |
| 132 | +
|
| 133 | +```c++ |
| 134 | +class constrained_task_arena : protected task_arena { |
| 135 | +public: |
| 136 | + using task_arena::is_active; |
| 137 | + using task_arena::terminate; |
| 138 | + using task_arena::max_concurrency; |
| 139 | +
|
| 140 | + using task_arena::enqueue; |
| 141 | + using task_arena::execute; // not in the original proposal |
| 142 | +
|
| 143 | + void wait(); |
| 144 | + friend std::vector<constrained_task_arena> initialize_numa_constrained_arenas(); |
| 145 | +}; |
| 146 | +``` |
| 147 | + |
| 148 | +In the code example used in this document, that API (with the method `execute` also exposed) |
| 149 | +would fully eliminate explicit use of `task_group`: |
| 150 | + |
| 151 | +```c++ |
| 152 | + std::vector<tbb::constrained_task_arena> arenas = |
| 153 | + tbb::initialize_numa_constrained_arenas(); |
| 154 | + |
| 155 | + // enqueue work to all but the first arena |
| 156 | + for (int i = 1; i < arenas.size(); i++) |
| 157 | + arenas[i].enqueue([] { |
| 158 | + tbb::parallel_for(0, N, [](int j) { f(w); }); |
| 159 | + }); |
| 160 | + |
| 161 | + // directly execute the work to completion in the remaining arena |
| 162 | + arenas[0].execute([] { |
| 163 | + tbb::parallel_for(0, N, [](int j) { f(w); }); |
| 164 | + }); |
| 165 | + |
| 166 | + // wait for the work enqueued to the other arenas to complete |
| 167 | + for (int i = 1; i < arenas.size(); i++) |
| 168 | + arenas[i].wait(); |
| 169 | +``` |
| 170 | +
|
| 171 | +While suggesting a more concise and less error-prone API for the problem in question, that approach has |
| 172 | +its downsides: |
| 173 | +- Adding a special "flavour" of `task_arena` potentially steepens the library learning curve and |
| 174 | + might create confusion about which class to use in which conditions, and how these interoperate. |
| 175 | +- It seems very specialized, able to address only a specific and quite narrow set of use cases. |
| 176 | +- Arenas are pre-initialized and so could not be adjusted after creation. |
| 177 | +
|
| 178 | +Our proposal, instead, aims at a single usability aspect with incremental improvements/extensions |
| 179 | +to the existing oneTBB classes and usage patterns, leaving other aspects to complementary proposals. |
| 180 | +Specifically, the [Waiting in a task arena](../task_arena_waiting/readme.md) RFC improves the joint |
| 181 | +use of a task arena and a task group to submit and wait for work. Combined, these extensions will |
| 182 | +provide only a slightly more verbose solution for the NUMA use case, while being more flexible |
| 183 | +and having greater potential for other useful extensions and applications. |
| 184 | +
|
| 185 | +### Universal function to create an arena set |
| 186 | +
|
| 187 | +In the discussion of this RFC, it was suggested to generalize the function for creating a set of arenas |
| 188 | +with various prescribed characteristics, beyond only NUMA-bound arenas. It would take a "constraint |
| 189 | +generator" type which would represent multiple arena constraint objects according to given options, |
| 190 | +and each of these constraints would be used to create an arena. |
| 191 | +
|
| 192 | +Generally, such a generalized API needs some way to describe which arena parameters should vary |
| 193 | +across the arena set and how, and which should remain uniform. Possible ways for that are |
| 194 | +to use some descriptive "language" and/or to generate parameter sets by a function. |
| 195 | +
|
| 196 | +In the suggestion, construction arguments of a constraint generator (including value sets |
| 197 | +and named patterns) serve as the description language. |
| 198 | +Another way could be to add named patterns as possible data values for `task_arena::constraints`; |
| 199 | +for example, `tbb::task_arena::constraints c{ .numa_id = tbb::task_arena::iterate }` |
| 200 | +would serve as a pattern for generating a set of constraints bound to NUMA domains. |
| 201 | +
|
| 202 | +Using a function to generate arena parameter sets seems even more flexible compared to a description |
| 203 | +language, as that function would not be limited in which parameters to alter, and how. |
| 204 | +
|
| 205 | +Compared to the proposal, this approach would necessarily be more verbose because of the need to |
| 206 | +describe or generate a set of constraints. That could however be mitigated for common use cases |
| 207 | +with presets provided by the library, allowing for reasonably concise code. For example: |
| 208 | +```c++ |
| 209 | + std::vector<tbb::task_arena> arenas = tbb::create_task_arenas(tbb::iterate_over_numa_ids); |
| 210 | +``` |
| 211 | +where `iterate_over_numa_ids` is a predefined variable of a type accepted by the function. |
| 212 | + |
| 213 | +Yet it is questionable whether such flexibility is useful at the moment, and thus whether it justifies |
| 214 | +the extra complexity. As of now, we know just a few arena set patterns with potential practical usage: |
| 215 | +besides NUMA arenas, we can think of splitting CPU resources by core type and of creating a set of |
| 216 | +arenas with different priorities. We do not however recall any requests to simplify these use cases. |
| 217 | + |
| 218 | +## Open questions |
| 219 | +- Instead of a free-standing function in namespace `tbb`, should we consider |
| 220 | + a static member function in class `task_arena`? |
| 221 | +- The proposal does not consider arena priority, simply keeping the default `priority::normal`. |
| 222 | + Are there use cases for pre-setting priorities? Similarly for the experimental thread leave policy. |
| 223 | +- Are there more practical use cases which could justify the universal function approach? |
| 224 | +- Need to consider alternatives to silently ignoring `numa_id` in constraints, such as an exception |
| 225 | + or undefined behavior. |
| 226 | +- Are there any reasons for the API to first go out as an experimental feature? |
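
Regarding the open question on alternatives to silently ignoring `numa_id`, the exception-based option could look roughly like the sketch below. It uses a stand-in `constraints_stub` type instead of `tbb::task_arena::constraints`, and the names `automatic_stub` and `make_numa_constraints` are hypothetical. The assumption is that "not set" is represented by a sentinel value, and that any other value in the argument is treated as a caller error.

```c++
#include <cassert>
#include <stdexcept>
#include <vector>

// Stand-in for illustration only; NOT the oneTBB constraints type.
constexpr int automatic_stub = -1; // sentinel meaning "not set"

struct constraints_stub {
    int numa_id = automatic_stub;
    int max_concurrency = automatic_stub;
};

// Hypothetical helper: builds one constraints object per NUMA node and
// rejects an explicitly set numa_id instead of silently overwriting it.
std::vector<constraints_stub> make_numa_constraints(
    constraints_stub other, const std::vector<int>& numa_nodes)
{
    if (other.numa_id != automatic_stub)
        throw std::invalid_argument("numa_id must be left unset");

    std::vector<constraints_stub> result;
    result.reserve(numa_nodes.size());
    for (int nid : numa_nodes) {
        other.numa_id = nid; // the only field that varies per arena
        result.push_back(other);
    }
    assert(result.size() == numa_nodes.size());
    return result;
}
```

Throwing makes the misuse visible at the call site, at the cost of adding a failure mode to a function that otherwise cannot fail on its arguments.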