Conversation

@roman-khimov
Contributor

We're using Bolt in https://github.com/nspcc-dev/neofs-node/ (and, as an option, in https://github.com/nspcc-dev/neo-go/ as well) and we recently hit a case in performance tests where some requests processed by the node took much more time than expected. The majority of requests were served in ~5-6ms, but many jumped to 300-500ms, some to 2-4s, and we've seen 20+s once as well. After some debugging we found that in this scenario we have a lot of DB readers involved, and while the "many readers" scenario is exactly where Bolt is good based on our past experience, this time there was a clear difference between the time taken by the func(tx *bolt.Tx) error closure doing its work and the db.View() call wrapping it, which suggested some problem in transaction management by the DB.

Now View() doesn't do a lot, but it still takes some exclusive locks (which was surprising to me), and after some poking at them we found that they are exactly the problem: it's classic lock contention, which wasn't expected at the level of concurrency we have, but that's what it is. And the most annoying thing about it is that goroutines losing the race for the locks can be seriously penalized; seconds of response time for simple requests are not acceptable (given nice machines with plenty of CPU/RAM), and they seriously affected the averages.
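For illustration only, a minimal sketch of how the two timings can be compared (this is not our actual test code; the path and the empty closure body are placeholders):

```go
package main

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

// timeView measures the whole db.View() call and, separately, the closure
// alone; the difference between the two is transaction setup/teardown cost.
func timeView(db *bolt.DB) (total, inner time.Duration, err error) {
	start := time.Now()
	err = db.View(func(tx *bolt.Tx) error {
		innerStart := time.Now()
		// The actual read work (bucket lookups, cursor scans) would go here.
		inner = time.Since(innerStart)
		return nil
	})
	total = time.Since(start)
	return total, inner, err
}

func main() {
	db, err := bolt.Open("test.db", 0600, nil) // placeholder path
	if err != nil {
		panic(err)
	}
	defer db.Close()

	total, inner, _ := timeView(db)
	fmt.Printf("View(): %v, closure: %v, overhead: %v\n", total, inner, total-inner)
}
```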

So after some digging I've cooked up a set of patches that makes our app run fine without these sudden spikes in response time. Now the problem is that this is roughly the first time I'm looking at the Bolt codebase, and even though all tests are green, I may be doing something wrong, so I'd really like to get feedback from people who know this code. My questions are:

  1. Is it technically correct at all?
  2. Can these patches be merged?
  3. How to properly align this with the Bolt release schedule (target branches/release estimates)?
  4. Maybe there is some better way to solve these problems?

Currently these patches are based on and targeted for 1.4 (since that's the version we're using). And while many changes are internal, introducing a configuration option is likely not what people want to see in 1.4.1. So maybe we need to think of 1.5 or something else.

The ConcurrentView test outputs something like this on an AMD Ryzen 7 PRO 7840U:

workers samples min             avg             50%             80%             90%             max
1       10      49.323µs        974.287µs       1.068978ms      1.112882ms      1.131938ms      1.131938ms
10      100     32.592µs        685.315µs       980.5µs         1.125385ms      1.137678ms      1.169789ms
100     1000    31.49µs         219.084µs       77.427µs        353.651µs       656.916µs       1.785808ms
1000    10000   30.668µs        1.639366ms      99.128µs        3.086665ms      5.031354ms      16.315849ms
10000   100000  30.818µs        40.893475ms     36.963667ms     78.650583ms     111.553136ms    302.412177ms

The most interesting data is for 1K-10K workers: with just 8 cores we obviously have some threads waiting for transaction initialization, in the worst case for tens or hundreds of ms.
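(The test code itself isn't reproduced in this conversation; a rough, purely illustrative sketch of a ConcurrentView-style measurement could look like the following, with all names made up here.)

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"

	bolt "go.etcd.io/bbolt"
)

// concurrentView spawns `workers` goroutines, each timing empty db.View()
// calls, and collects per-call latencies for percentile reporting.
func concurrentView(db *bolt.DB, workers, samplesPerWorker int) []time.Duration {
	var (
		mu        sync.Mutex
		latencies []time.Duration
		wg        sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := make([]time.Duration, 0, samplesPerWorker)
			for j := 0; j < samplesPerWorker; j++ {
				start := time.Now()
				_ = db.View(func(tx *bolt.Tx) error {
					return nil // an empty body isolates transaction-management cost
				})
				local = append(local, time.Since(start))
			}
			mu.Lock()
			latencies = append(latencies, local...)
			mu.Unlock()
		}()
	}
	wg.Wait()
	return latencies
}

func main() {
	db, err := bolt.Open("bench.db", 0600, nil) // placeholder path
	if err != nil {
		panic(err)
	}
	defer db.Close()

	lat := concurrentView(db, 100, 10) // e.g. 100 workers, 10 samples each
	sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
	fmt.Printf("min=%v p50=%v max=%v\n", lat[0], lat[len(lat)/2], lat[len(lat)-1])
}
```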

Signed-off-by: Roman Khimov <[email protected]>
It seems this list was more useful before 263e75d, but now it's only used for statistics, which can easily be managed in a different way. I see no other valid purpose for this list; a reference can have some value for GC, but if a DB user loses a reference to a transaction that is not closed, there is not much the DB can do. This improves the ConcurrentView test from

workers samples min             avg             50%             80%             90%             max
1       10      49.323µs        974.287µs       1.068978ms      1.112882ms      1.131938ms      1.131938ms
10      100     32.592µs        685.315µs       980.5µs         1.125385ms      1.137678ms      1.169789ms
100     1000    31.49µs         219.084µs       77.427µs        353.651µs       656.916µs       1.785808ms
1000    10000   30.668µs        1.639366ms      99.128µs        3.086665ms      5.031354ms      16.315849ms
10000   100000  30.818µs        40.893475ms     36.963667ms     78.650583ms     111.553136ms    302.412177ms

to

workers samples min             avg             50%             80%             90%             max
1       10      78.358µs        964.847µs       1.059159ms      1.073256ms      1.07551ms       1.07551ms
10      100     32.802µs        304.922µs       80.924µs        674.54µs        1.069298ms      1.220625ms
100     1000    30.758µs        304.541µs       64.192µs        397.094µs       1.101991ms      2.183302ms
1000    10000   30.558µs        1.05711ms       92.426µs        2.111896ms      3.317894ms      11.790014ms
10000   100000  30.548µs        10.98898ms      90.742µs        21.740659ms     33.020076ms     135.33094ms

Signed-off-by: Roman Khimov <[email protected]>
I think most Bolt users never care about this data, so we're just wasting time maintaining it for nothing. This doesn't change much on its own, but once the other remaining lock is removed, enabling/disabling it clearly affects the result. The default behavior is kept for compatibility.
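(The option's actual name and shape are defined by the patch and not shown here; a hypothetical sketch of the general pattern, with made-up names, is below.)

```go
// Illustrative pattern only (names are hypothetical, not the patch's API):
// stats accounting sits behind a flag decided once at open time, so closing
// a read-only transaction can skip the shared stats lock entirely.
package main

import (
	"fmt"
	"sync"
)

type dbStats struct {
	mu       sync.Mutex
	openTxN  int64
	disabled bool // set once when the DB is opened, never changed later
}

func (s *dbStats) txClosed() {
	if s.disabled {
		return // no shared state touched, so no lock contention on this path
	}
	s.mu.Lock()
	s.openTxN--
	s.mu.Unlock()
}

func main() {
	s := &dbStats{disabled: true}
	s.txClosed() // with stats disabled this is a no-op
	fmt.Println("open transactions:", s.openTxN)
}
```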

Now that stats is a pointer we can revert a part of 26f89a5
as well and make the structure cleaner.

Signed-off-by: Roman Khimov <[email protected]>
metalock is documented to protect meta page access, but its real scope is broader and less obvious (a simplified sketch of the DB's locks follows this list):
 * the meta page itself is accessed in t.init(), but it's not changed there, so an exclusive lock is not needed
 * db.opened is somewhat related, but it's changed only in close(), which takes all the locks anyway
 * db.data is protected by mmaplock
 * db.freelist is in fact changed while metalock is taken, and it has no synchronization of its own
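For reference, the DB struct carries several locks; simplified from the actual source (only the lock fields are shown, comments paraphrased):

```go
package bolt

import "sync"

// Simplified excerpt of the DB type's lock fields; many other fields omitted.
type DB struct {
	rwlock   sync.Mutex   // allows only one writer (RW transaction) at a time
	metalock sync.Mutex   // documented as protecting meta page access
	mmaplock sync.RWMutex // protects the mmap'd region during remapping
	statlock sync.RWMutex // protects Stats access
	// ...
}
```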

So we have a lock that tries to protect the meta page, but in fact it helps the freelist more than anything else. What we're trying to do in the freelist is keep track of open RO transaction IDs by maintaining a slice of them. Transaction IDs, in their turn, are not exactly IDs in that they do not identify transactions uniquely; rather, each is the ID of the latest transaction that changed the state, and at any given point all new readers use the same latest ID. The other important aspect of these IDs is that any RW transaction always uses a new ID incremented from the previous one, so IDs only move forward (not considering uint64 overflow). At the very minimum this means that:
 * the most common case is when this list has exactly one unique ID, which is the latest one
 * occasionally we can have the previous one as well
 * only in rare cases of long-running read transactions can we have some set of older IDs
 * once an old ID is gone it will never return, since new transactions use the latest one

Which in turn means that:
 * keeping a list of IDs wastes memory for nothing, since we usually have a lot of duplicates there
 * what we actually need is some sort of reference counting for a limited number of past IDs
 * no new IDs are possible unless there is an RW transaction

Reference counting can be implemented with atomic variables, requiring no locks; the only problem is to always have an appropriate ID whose users can be counted. And that's exactly where ReleasePendingPages(), which is called by RW transactions, can be leveraged: it can manage the list of ID-ref structures, while RO transactions can rely on this list always having the current ID they can use. This means they only need an RO lock on the list itself while updating the counter.
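A rough sketch of the idea (the data structure and names here are illustrative, not the actual patch):

```go
package main

import (
	"sync"
	"sync/atomic"
)

// txidRef counts readers pinned to a particular transaction ID.
type txidRef struct {
	txid uint64
	refs atomic.Int64
}

// readerTracker replaces the per-reader ID slice: the RW commit path appends
// the new latest ID and prunes fully released old ones under the write lock,
// while RO transactions only take the read lock and bump an atomic counter.
type readerTracker struct {
	mu  sync.RWMutex
	ids []*txidRef // ordered; the last element is always the current ID
}

// acquire pins the current ID for a starting read-only transaction and
// returns the entry so release() can drop the count later.
func (t *readerTracker) acquire() *txidRef {
	t.mu.RLock()
	defer t.mu.RUnlock()
	cur := t.ids[len(t.ids)-1]
	cur.refs.Add(1)
	return cur
}

// release is called when the read-only transaction finishes; no lock needed.
func (t *readerTracker) release(r *txidRef) {
	r.refs.Add(-1)
}

// advance runs on the RW path (conceptually where ReleasePendingPages() is
// called): it publishes the new latest ID and drops entries no reader pins
// any more, returning the oldest ID still potentially in use.
func (t *readerTracker) advance(newTxid uint64) (oldestInUse uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ids = append(t.ids, &txidRef{txid: newTxid})
	// Leading zero-ref entries can never gain readers again: new readers
	// always pin the last (current) entry.
	for len(t.ids) > 1 && t.ids[0].refs.Load() == 0 {
		t.ids = t.ids[1:]
	}
	return t.ids[0].txid
}

func main() {
	t := &readerTracker{ids: []*txidRef{{txid: 1}}}
	r := t.acquire() // a reader pins txid 1
	_ = t.advance(2) // an RW commit publishes txid 2; txid 1 is still pinned
	t.release(r)     // reader done; the next advance() will prune txid 1
}
```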

This removes the final exclusive lock for read-only transactions (with statistics disabled), and this drastically improves the ConcurrentView test results as well as real application behavior with many readers. Before:

workers samples min             avg             50%             80%             90%             max
1       10      78.358µs        964.847µs       1.059159ms      1.073256ms      1.07551ms       1.07551ms
10      100     32.802µs        304.922µs       80.924µs        674.54µs        1.069298ms      1.220625ms
100     1000    30.758µs        304.541µs       64.192µs        397.094µs       1.101991ms      2.183302ms
1000    10000   30.558µs        1.05711ms       92.426µs        2.111896ms      3.317894ms      11.790014ms
10000   100000  30.548µs        10.98898ms      90.742µs        21.740659ms     33.020076ms     135.33094ms

After:

workers samples min             avg             50%             80%             90%             max
1       10      96.093µs        969.267µs       1.063618ms      1.066473ms      1.085629ms      1.085629ms
10      100     32.803µs        252.33µs        73.71µs         827.159µs       934.392µs       1.071142ms
100     1000    31.339µs        239.554µs       38.703µs        613.263µs       888.616µs       1.403793ms
1000    10000   30.518µs        195.822µs       41.538µs        101.031µs       360.474µs       4.075932ms
10000   100000  30.478µs        165.346µs       39.054µs        89.981µs        144.544µs       11.35032ms

The only case potentially not covered by this patch is probably a newly opened DB that hasn't done any RW transactions; it would have an empty refs list. But it shouldn't be a problem, since the first RW transaction shouldn't have any pending pages anyway.

Signed-off-by: Roman Khimov <[email protected]>
metalock is supposed to protect the meta page, but it looks like the only place where we modify it is in fact not protected. Since the page update is not atomic, a concurrent reader (RO transaction) can get an inconsistent page. It would likely fall back to the other meta page in that case, but we'd still better not let this happen.
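(For context on the fallback mentioned above: Bolt keeps two meta pages, and a reader uses the newer one that still validates. A heavily simplified, self-contained sketch of that selection logic, with a stand-in checksum, is below; it is a paraphrase of the idea, not the real source.)

```go
package main

import (
	"errors"
	"fmt"
)

// Heavily simplified stand-in for a meta page; the real one also carries
// magic, version, page size, root bucket and freelist pointers.
type meta struct {
	txid     uint64
	checksum uint64
}

// sum is a stand-in for the real checksum function.
func (m *meta) sum() uint64 { return m.txid ^ 0xdeadbeef }

func (m *meta) validate() error {
	if m.checksum != m.sum() {
		return errors.New("invalid meta page")
	}
	return nil
}

// pick mirrors the idea behind Bolt's meta selection: prefer the meta page
// with the higher txid, but fall back to the other copy if the newer one
// fails validation (e.g. because of a torn concurrent update).
func pick(meta0, meta1 *meta) (*meta, error) {
	newest, backup := meta0, meta1
	if meta1.txid > meta0.txid {
		newest, backup = meta1, meta0
	}
	if newest.validate() == nil {
		return newest, nil
	}
	if backup.validate() == nil {
		return backup, nil
	}
	return nil, errors.New("both meta pages are invalid")
}

func main() {
	good := &meta{txid: 7}
	good.checksum = good.sum()
	torn := &meta{txid: 8} // newer, but its checksum was never finalized
	m, _ := pick(torn, good)
	fmt.Println("selected txid:", m.txid) // falls back to 7
}
```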

Signed-off-by: Roman Khimov <[email protected]>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: roman-khimov
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @roman-khimov. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahrtr
Member

ahrtr commented Jun 5, 2025

Can you break down this PR into smaller ones, so that we can evaluate each change separately?

After a quick review, the second commit (4ad0ebf) makes sense.

@roman-khimov
Contributor Author

Well, it's patch by patch now, but we can go PR after PR as well. Are you interested in the first one, or is it not worth a PR at all?

@ahrtr
Member

ahrtr commented Jun 6, 2025

but we can go PR after PR as well

Yes, I'd prefer that.

@roman-khimov
Contributor Author

Superseded by #973, #977, #989 and #998; this PR has historical value only.

