# Adding API for parallel block to task_arena to warm-up/retain/release worker threads

## Introduction

In oneTBB, there has never been an API that allows users to block worker threads within the arena.
This design choice was made to preserve the composability of the application.<br>
Since oneTBB is a dynamic runtime based on task stealing, threads migrate from one arena to
another as long as they have tasks to execute.<br>
Before PR#1352, workers moved to the thread pool to sleep once there were no arenas with active
demand. However, PR#1352 introduced a busy-wait block time that blocks a thread for an
`implementation-defined` duration if there is no active demand in arenas. This change significantly
improved performance when the application runs on systems with a high thread count.<br>
The main idea is that, usually, after one parallel computation ends,
another starts after some time. The default block time is a heuristic that exploits this,
covering most such cases within its duration.

The default behavior of oneTBB with these changes does not affect performance when oneTBB is used
as the single parallel runtime.<br>
However, some cases where several runtimes are used together might be affected. For example, if an
application builds a pipeline where oneTBB is used for one stage and OpenMP is used for a
subsequent stage, there is a chance that oneTBB workers will interfere with OpenMP threads.
This interference might result in slight oversubscription,
which in turn might lead to underperformance.

This problem can be resolved with an API that indicates when parallel computation is done,
allowing worker threads to be released from the arena
and essentially overriding the default block time.

This problem can also be considered from another angle. Essentially, if the user can indicate where
parallel computation ends, they can also indicate where it starts.

<img src="parallel_block_introduction.png" width=800>

With this approach, the user not only releases threads when necessary
but also specifies a programmable block during which worker threads should stick to the
executing arena.

## Proposal

Let's consider the guarantees that an API for an explicit parallel block can provide:
* Start of parallel block:
  * Indicates the point from which the scheduler can use a hint and stick threads to the arena.
  * Serves as a warm-up hint to the scheduler, making some worker threads immediately available
    at the start of the real computation.
* "Parallel block" itself:
  * The scheduler can implement different busy-wait policies to retain threads in the arena.
* End of parallel block:
  * Indicates the point from which the scheduler can drop the hint
    and unstick threads from the arena.
  * Indicates that worker threads should ignore
    the default block time (introduced by PR#1352) and leave.

Start of parallel block:<br>
The warm-up hint should provide guarantees similar to those of `task_arena::enqueue` from a signal
standpoint. Users should expect that the scheduler will do its best to make some threads available
in the arena.

"Parallel block" itself:<br>
The guarantee for retaining threads is a hint to the scheduler;
thus, no real guarantee is provided. The scheduler can ignore the hint and
move threads to another arena or put them to sleep if the conditions are met.

End of parallel block:<br>
It can indicate that worker threads should ignore the default block time, but
if work is submitted immediately after the end of the parallel block,
the default block time is restored.

But what if the user would like to disable the default block time entirely?<br>
The heuristic of an extended block time is unsuitable for tasks submitted in an unpredictable
pattern or with unpredictable duration. In that case, there should be an API to disable
the default block time in the arena entirely.

```cpp
class task_arena {
    void indicate_start_of_parallel_block(bool do_warmup = false);
    void indicate_end_of_parallel_block(bool disable_default_block_time = false);
    void disable_default_block_time();
    void enable_default_block_time();
};

namespace this_task_arena {
    void indicate_start_of_parallel_block(bool do_warmup = false);
    void indicate_end_of_parallel_block(bool disable_default_block_time = false);
    void disable_default_block_time();
    void enable_default_block_time();
}
```
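
A possible usage sketch, assuming the API above is adopted as proposed (the `indicate_*` and
`*_default_block_time` members do not exist in current oneTBB releases, and the surrounding
pipeline code is illustrative only):

```cpp
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/parallel_for.h>
#include <cstddef>
#include <vector>

void run_tbb_stage(std::vector<float>& data) {
    tbb::task_arena arena;

    // Hint that a parallel phase is about to start and request a warm-up,
    // so some workers can already be available when parallel_for begins.
    arena.indicate_start_of_parallel_block(/*do_warmup=*/true);

    arena.execute([&] {
        tbb::parallel_for(std::size_t(0), data.size(), [&](std::size_t i) {
            data[i] *= 2.0f;  // placeholder for the real per-element work
        });
    });

    // The oneTBB stage is over; ask workers to leave immediately instead of
    // busy-waiting for the default block time, so that a subsequent OpenMP
    // stage does not compete with idle oneTBB workers.
    arena.indicate_end_of_parallel_block(/*disable_default_block_time=*/true);
}
```

For an arena whose work arrives in an unpredictable pattern, the standalone
`disable_default_block_time()` / `enable_default_block_time()` calls could be used instead of the
per-block flag.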

If the end of the parallel block is not indicated by the user, it happens automatically when
the last public reference is removed from the arena (i.e., the `task_arena` is destroyed or,
for an implicit arena, its thread is joined). This ensures that correctness is
preserved (threads will not stick to the arena forever).
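
A minimal sketch of relying on this automatic release, again assuming the proposed names
(the arena concurrency limit and the work inside are arbitrary for illustration):

```cpp
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/parallel_for.h>

void scoped_stage() {
    tbb::task_arena arena(4);  // limit this stage to 4 slots

    arena.indicate_start_of_parallel_block();
    arena.execute([] {
        tbb::parallel_for(0, 1000, [](int) { /* some per-iteration work */ });
    });
    // No indicate_end_of_parallel_block() here: when `arena` goes out of scope
    // and is destroyed, the last public reference to the arena is removed, so
    // the parallel block is expected to end automatically.
}
```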

## Considerations

Retaining worker threads should be implemented with care because
it might introduce performance problems if:
* Threads cannot migrate to another arena because they
  stick to the current arena.
* Compute resources are not homogeneous, e.g., the CPU is hybrid.
  Heavier involvement of less performant core types might result in artificial work
  imbalance in the arena.

## Open Questions in Design

Some open questions that remain:
* Are the suggested APIs sufficient?
* Are there additional use cases that we missed in our analysis and that should be considered?