Conversation

@roman-khimov
Contributor

We're using Bolt in https://github.com/nspcc-dev/neofs-node/ (and, as an option, in https://github.com/nspcc-dev/neo-go/ as well) and we recently hit a case in performance tests where some requests processed by the node took much more time than expected. The majority of requests were served in ~5-6ms, but many jumped to 300-500ms, some to 2-4s, and we've seen 20+s once as well. After some debugging we found that in this scenario we have a lot of DB readers involved, and while the "many readers" scenario is exactly where Bolt is good based on our past experience, this time there was a clear difference between the time taken by the func(tx *bolt.Tx) error closure doing its work and the db.View() call wrapping it, which suggested some problem in transaction management by the DB.

Now View() doesn't do a lot, but it still takes some exclusive locks (which was surprising to me), and after some poking at them we found that they are exactly the problem: it's classic lock contention, which wasn't expected at the level of concurrency we have, but that's what it is. And the most annoying thing about it is that goroutines losing the race for the locks can be seriously penalized; seconds of response time for simple requests are not acceptable (given nice machines with plenty of CPU/RAM), and they seriously affected the averages.
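For illustration only, a minimal sketch of how the two timings can be compared (this is not our actual test code; the path and the empty closure body are placeholders):

```go
package main

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

// timeView measures the whole db.View() call and, separately, the closure
// alone; the difference between the two is transaction setup/teardown cost.
func timeView(db *bolt.DB) (total, inner time.Duration, err error) {
	start := time.Now()
	err = db.View(func(tx *bolt.Tx) error {
		innerStart := time.Now()
		// The actual read work (bucket lookups, cursor scans) would go here.
		inner = time.Since(innerStart)
		return nil
	})
	total = time.Since(start)
	return total, inner, err
}

func main() {
	db, err := bolt.Open("test.db", 0600, nil) // placeholder path
	if err != nil {
		panic(err)
	}
	defer db.Close()

	total, inner, _ := timeView(db)
	fmt.Printf("View(): %v, closure: %v, overhead: %v\n", total, inner, total-inner)
}
```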

So after some digging I've cooked up a set of patches that makes our app run fine without these sudden spikes in response time. Now the problem is that this is roughly the first time I'm looking at the Bolt codebase, and even though all tests are green, I may be doing something wrong, so I'd really like to get feedback from people who know this code. My questions are:

  1. Is it technically correct at all?
  2. Can these patches be merged?
  3. How to properly align this with the Bolt release schedule (target branches/release estimates)?
  4. Maybe there is some better way to solve these problems?

Currently these patches are based on and targeted for 1.4 (since that's the version we're using). And while many changes are internal, introducing a configuration option is likely not what people want to see in 1.4.1. So maybe we need to think of 1.5 or something else.

The ConcurrentView test outputs something like this on an AMD Ryzen 7 PRO 7840U:

workers samples min             avg             50%             80%             90%             max
1       10      49.323µs        974.287µs       1.068978ms      1.112882ms      1.131938ms      1.131938ms
10      100     32.592µs        685.315µs       980.5µs         1.125385ms      1.137678ms      1.169789ms
100     1000    31.49µs         219.084µs       77.427µs        353.651µs       656.916µs       1.785808ms
1000    10000   30.668µs        1.639366ms      99.128µs        3.086665ms      5.031354ms      16.315849ms
10000   100000  30.818µs        40.893475ms     36.963667ms     78.650583ms     111.553136ms    302.412177ms

The most interesting data is for 1K-10K workers: with just 8 cores we obviously have some threads waiting for transaction initialization, in the worst case for tens or hundreds of ms.
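(The test code itself isn't reproduced in this conversation; a rough, purely illustrative sketch of a ConcurrentView-style measurement could look like the following, with all names made up here.)

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"

	bolt "go.etcd.io/bbolt"
)

// concurrentView spawns `workers` goroutines, each timing empty db.View()
// calls, and collects per-call latencies for percentile reporting.
func concurrentView(db *bolt.DB, workers, samplesPerWorker int) []time.Duration {
	var (
		mu        sync.Mutex
		latencies []time.Duration
		wg        sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := make([]time.Duration, 0, samplesPerWorker)
			for j := 0; j < samplesPerWorker; j++ {
				start := time.Now()
				_ = db.View(func(tx *bolt.Tx) error {
					return nil // an empty body isolates transaction-management cost
				})
				local = append(local, time.Since(start))
			}
			mu.Lock()
			latencies = append(latencies, local...)
			mu.Unlock()
		}()
	}
	wg.Wait()
	return latencies
}

func main() {
	db, err := bolt.Open("bench.db", 0600, nil) // placeholder path
	if err != nil {
		panic(err)
	}
	defer db.Close()

	lat := concurrentView(db, 100, 10) // e.g. 100 workers, 10 samples each
	sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
	fmt.Printf("min=%v p50=%v max=%v\n", lat[0], lat[len(lat)/2], lat[len(lat)-1])
}
```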

Signed-off-by: Roman Khimov <[email protected]>
It seems this list was more useful before 263e75d, but now it's only used for statistics, which can easily be managed in a different way. I see no other valid purpose for this list; a reference can have some value for GC, but if a DB user loses a reference to a transaction that is not closed, there is not much the DB can do. This improves the ConcurrentView test from

workers samples min             avg             50%             80%             90%             max
1       10      49.323µs        974.287µs       1.068978ms      1.112882ms      1.131938ms      1.131938ms
10      100     32.592µs        685.315µs       980.5µs         1.125385ms      1.137678ms      1.169789ms
100     1000    31.49µs         219.084µs       77.427µs        353.651µs       656.916µs       1.785808ms
1000    10000   30.668µs        1.639366ms      99.128µs        3.086665ms      5.031354ms      16.315849ms
10000   100000  30.818µs        40.893475ms     36.963667ms     78.650583ms     111.553136ms    302.412177ms

to

workers samples min             avg             50%             80%             90%             max
1       10      78.358µs        964.847µs       1.059159ms      1.073256ms      1.07551ms       1.07551ms
10      100     32.802µs        304.922µs       80.924µs        674.54µs        1.069298ms      1.220625ms
100     1000    30.758µs        304.541µs       64.192µs        397.094µs       1.101991ms      2.183302ms
1000    10000   30.558µs        1.05711ms       92.426µs        2.111896ms      3.317894ms      11.790014ms
10000   100000  30.548µs        10.98898ms      90.742µs        21.740659ms     33.020076ms     135.33094ms

Signed-off-by: Roman Khimov <[email protected]>
I think most Bolt users never care about this data, so we're just wasting time maintaining it for nothing. This doesn't change much on its own, but once the other remaining lock is removed, enabling/disabling it clearly affects the result. The default behavior is kept for compatibility.
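(The option's actual name and shape are defined by the patch and not shown here; a hypothetical sketch of the general pattern, with made-up names, is below.)

```go
// Illustrative pattern only (names are hypothetical, not the patch's API):
// stats accounting sits behind a flag decided once at open time, so closing
// a read-only transaction can skip the shared stats lock entirely.
package main

import (
	"fmt"
	"sync"
)

type dbStats struct {
	mu       sync.Mutex
	openTxN  int64
	disabled bool // set once when the DB is opened, never changed later
}

func (s *dbStats) txClosed() {
	if s.disabled {
		return // no shared state touched, so no lock contention on this path
	}
	s.mu.Lock()
	s.openTxN--
	s.mu.Unlock()
}

func main() {
	s := &dbStats{disabled: true}
	s.txClosed() // with stats disabled this is a no-op
	fmt.Println("open transactions:", s.openTxN)
}
```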

Now that stats is a pointer we can revert a part of 26f89a5
as well and make the structure cleaner.

Signed-off-by: Roman Khimov <[email protected]>
metalock is documented to protect meta page access, but its real scope is broader and less obvious (a simplified sketch of the DB's locks follows this list):
 * the meta page itself is accessed in t.init(), but it's not changed there, so an exclusive lock is not needed
 * db.opened is somewhat related, but it's changed only in close(), which takes all the locks anyway
 * db.data is protected by mmaplock
 * db.freelist is in fact changed while metalock is taken, and it has no synchronization of its own
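For reference, the DB struct carries several locks; simplified from the actual source (only the lock fields are shown, comments paraphrased):

```go
package bolt

import "sync"

// Simplified excerpt of the DB type's lock fields; many other fields omitted.
type DB struct {
	rwlock   sync.Mutex   // allows only one writer (RW transaction) at a time
	metalock sync.Mutex   // documented as protecting meta page access
	mmaplock sync.RWMutex // protects the mmap'd region during remapping
	statlock sync.RWMutex // protects Stats access
	// ...
}
```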

So we have a lock that tries to protect the meta page, but in fact it helps the freelist more than anything else. What we're trying to do in the freelist is keep track of open RO transaction IDs by maintaining a slice of them. Transaction IDs, in their turn, are not exactly IDs in that they do not identify transactions uniquely; rather, each is the ID of the latest transaction that changed the state, and at any given point all new readers use the same latest ID. The other important aspect of these IDs is that any RW transaction always uses a new ID incremented from the previous one, so IDs only move forward (not considering uint64 overflow). At the very minimum this means that:
 * the most common case is when this list has exactly one unique ID, which is the latest one
 * occasionally we can have the previous one as well
 * only in rare cases of long-running read transactions can we have some set of older IDs
 * once an old ID is gone it will never return, since new transactions use the latest one

Which in turn means that:
 * keeping a list of IDs wastes memory for nothing, since we usually have a lot of duplicates there
 * what we actually need is some sort of reference counting for a limited number of past IDs
 * no new IDs are possible unless there is an RW transaction

Reference counting can be implemented with atomic variables, requiring no locks; the only problem is to always have an appropriate ID whose users can be counted. And that's exactly where ReleasePendingPages(), which is called by RW transactions, can be leveraged: it can manage the list of ID-ref structures, while RO transactions can rely on this list always having the current ID they can use. This means they only need an RO lock on the list itself while updating the counter.
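A rough sketch of the idea (the data structure and names here are illustrative, not the actual patch):

```go
package main

import (
	"sync"
	"sync/atomic"
)

// txidRef counts readers pinned to a particular transaction ID.
type txidRef struct {
	txid uint64
	refs atomic.Int64
}

// readerTracker replaces the per-reader ID slice: the RW commit path appends
// the new latest ID and prunes fully released old ones under the write lock,
// while RO transactions only take the read lock and bump an atomic counter.
type readerTracker struct {
	mu  sync.RWMutex
	ids []*txidRef // ordered; the last element is always the current ID
}

// acquire pins the current ID for a starting read-only transaction and
// returns the entry so release() can drop the count later.
func (t *readerTracker) acquire() *txidRef {
	t.mu.RLock()
	defer t.mu.RUnlock()
	cur := t.ids[len(t.ids)-1]
	cur.refs.Add(1)
	return cur
}

// release is called when the read-only transaction finishes; no lock needed.
func (t *readerTracker) release(r *txidRef) {
	r.refs.Add(-1)
}

// advance runs on the RW path (conceptually where ReleasePendingPages() is
// called): it publishes the new latest ID and drops entries no reader pins
// any more, returning the oldest ID still potentially in use.
func (t *readerTracker) advance(newTxid uint64) (oldestInUse uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ids = append(t.ids, &txidRef{txid: newTxid})
	// Leading zero-ref entries can never gain readers again: new readers
	// always pin the last (current) entry.
	for len(t.ids) > 1 && t.ids[0].refs.Load() == 0 {
		t.ids = t.ids[1:]
	}
	return t.ids[0].txid
}

func main() {
	t := &readerTracker{ids: []*txidRef{{txid: 1}}}
	r := t.acquire() // a reader pins txid 1
	_ = t.advance(2) // an RW commit publishes txid 2; txid 1 is still pinned
	t.release(r)     // reader done; the next advance() will prune txid 1
}
```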

This removes the final exclusive lock for read-only transactions (with statistics disabled), and this drastically improves the ConcurrentView test results as well as real application behavior with many readers. Before:

workers samples min             avg             50%             80%             90%             max
1       10      78.358µs        964.847µs       1.059159ms      1.073256ms      1.07551ms       1.07551ms
10      100     32.802µs        304.922µs       80.924µs        674.54µs        1.069298ms      1.220625ms
100     1000    30.758µs        304.541µs       64.192µs        397.094µs       1.101991ms      2.183302ms
1000    10000   30.558µs        1.05711ms       92.426µs        2.111896ms      3.317894ms      11.790014ms
10000   100000  30.548µs        10.98898ms      90.742µs        21.740659ms     33.020076ms     135.33094ms

After:

workers samples min             avg             50%             80%             90%             max
1       10      96.093µs        969.267µs       1.063618ms      1.066473ms      1.085629ms      1.085629ms
10      100     32.803µs        252.33µs        73.71µs         827.159µs       934.392µs       1.071142ms
100     1000    31.339µs        239.554µs       38.703µs        613.263µs       888.616µs       1.403793ms
1000    10000   30.518µs        195.822µs       41.538µs        101.031µs       360.474µs       4.075932ms
10000   100000  30.478µs        165.346µs       39.054µs        89.981µs        144.544µs       11.35032ms

The only case potentially not covered by this patch is probably a newly opened DB that hasn't done any RW transactions; it would have an empty refs list. But it shouldn't be a problem, since the first RW transaction shouldn't have any pending pages anyway.

Signed-off-by: Roman Khimov <[email protected]>
metalock is supposed to protect the meta page, but it looks like the only place where we modify it is in fact not protected. Since the page update is not atomic, a concurrent reader (RO transaction) can get an inconsistent page. It would likely fall back to the other meta page in that case, but we'd still better not let this happen.
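(For context on the fallback mentioned above: Bolt keeps two meta pages, and a reader uses the newer one that still validates. A heavily simplified, self-contained sketch of that selection logic, with a stand-in checksum, is below; it is a paraphrase of the idea, not the real source.)

```go
package main

import (
	"errors"
	"fmt"
)

// Heavily simplified stand-in for a meta page; the real one also carries
// magic, version, page size, root bucket and freelist pointers.
type meta struct {
	txid     uint64
	checksum uint64
}

// sum is a stand-in for the real checksum function.
func (m *meta) sum() uint64 { return m.txid ^ 0xdeadbeef }

func (m *meta) validate() error {
	if m.checksum != m.sum() {
		return errors.New("invalid meta page")
	}
	return nil
}

// pick mirrors the idea behind Bolt's meta selection: prefer the meta page
// with the higher txid, but fall back to the other copy if the newer one
// fails validation (e.g. because of a torn concurrent update).
func pick(meta0, meta1 *meta) (*meta, error) {
	newest, backup := meta0, meta1
	if meta1.txid > meta0.txid {
		newest, backup = meta1, meta0
	}
	if newest.validate() == nil {
		return newest, nil
	}
	if backup.validate() == nil {
		return backup, nil
	}
	return nil, errors.New("both meta pages are invalid")
}

func main() {
	good := &meta{txid: 7}
	good.checksum = good.sum()
	torn := &meta{txid: 8} // newer, but its checksum was never finalized
	m, _ := pick(torn, good)
	fmt.Println("selected txid:", m.txid) // falls back to 7
}
```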

Signed-off-by: Roman Khimov <[email protected]>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: roman-khimov
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @roman-khimov. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahrtr
Member

ahrtr commented Jun 5, 2025

Can you break down this PR into smaller ones, so that we can evaluate each change separately?

After a quick review, the second commit (4ad0ebf) makes sense.

@roman-khimov
Contributor Author

Well, it's patch by patch now, but we can go PR after PR as well. Are you interested in the first one, or is it not worth a PR at all?

@ahrtr
Member

ahrtr commented Jun 6, 2025

but we can go PR after PR as well

Yes, I'd prefer that.

@roman-khimov
Contributor Author

Superseded by #973, #977, #989 and #998; this PR has historical value only.

