@blimmer blimmer commented Aug 13, 2025

Description

We recently ran into an issue where our serverless database had to scale up its capacity dramatically. On reviewing the queries responsible, we found that the load came from this Graphile Worker query:

with j as (
    select jobs.job_queue_id, jobs.priority, jobs.run_at, jobs.id
    from graphile_worker._private_jobs as jobs
    where jobs.is_available = ?
      and run_at <= now()
      and task_id = any (?::int[])
      and (
        jobs.job_queue_id is ?
        or jobs.job_queue_id in (
          select id
          from graphile_worker._private_job_queues as job_queues
          where job_queues.is_available = ?
          for update skip locked
        )
      )
    order by priority asc, run_at asc
    limit ?
    for update skip locked
), q as (
    update graphile_worker._private_job_queues as job_queues
    set locked_by = ?::text, locked_at = now()
    from j
    where job_queues.id = j.job_queue_id
)
update graphile_worker._private_jobs as jobs
set attempts = jobs.attempts + ?, locked_by = ?::text, locked_at = now()
from j
where jobs.id = j.id
returning *

Specifically, the problem is the subquery against the _private_job_queues table:

select id
from graphile_worker._private_job_queues as job_queues
where job_queues.is_available = ?
for update skip locked

We were creating lots of job queues using high-cardinality values (e.g., database row IDs), which caused many thousands of dead queues to build up in this table.

db=> select count(*) from graphile_worker._private_job_queues;
 count
-------
 61106
(1 row)
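
One way to keep the number of queues bounded while still serializing work per record is to hash the high-cardinality key into a fixed number of buckets. The sketch below is illustrative only: `stableHash`, `boundedQueueName`, the `user_sync` prefix, and the bucket count of 16 are assumptions, not part of Graphile Worker or this PR. The trade-off is that unrelated records landing in the same bucket are also serialized with each other.

```typescript
// Hypothetical helpers (not part of Graphile Worker): map an unbounded key,
// such as a database row ID, onto a fixed set of queue names.

// Simple multiplicative string hash; any stable hash would do here.
function stableHash(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    // Keep the running value within unsigned 32-bit range.
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Returns one of at most `buckets` distinct queue names for any key.
function boundedQueueName(prefix: string, key: string, buckets: number): string {
  if (buckets < 1 || Math.floor(buckets) !== buckets) {
    throw new RangeError("buckets must be a positive integer");
  }
  return prefix + "_" + (stableHash(key) % buckets);
}
```

The same key always maps to the same queue name, so jobs for one record still run serially, while the queue table can never hold more than `buckets` rows for that prefix. The result would be passed as the queue name when adding a job, for example `boundedQueueName("user_sync", String(rowId), 16)`, where `rowId` is a placeholder.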

Because the Graphile Worker polling query necessarily scans the whole _private_job_queues table, this caused a significant performance problem for us and forced the database to scale up.

When I reviewed the documentation, I found a warning about not using randomly generated values for queue names:

## Database cleanup
Over time it's likely that graphile_worker's tables will grow with stale values
for old job queue names, task identifiers, or permanently failed jobs. You can
clean up this stale information with the cleanup function, indicating which
cleanup operations you would like to undertake.
:::tip
If you find yourself calling this quite often or on a schedule, it's likely that
you are doing something wrong (e.g. allowing jobs to permafail, using random
values for job queue names, etc).
:::

This was helpful. However, I think it makes sense to also warn about this on the queueName parameter itself, so people know before there's a problem. This PR adds those warnings.
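
As a complement to the documentation warning, an application could check queue names before enqueueing. The sketch below is a heuristic only, under stated assumptions: `looksHighCardinality` and `SUSPICIOUS_PATTERNS` are hypothetical names, not part of Graphile Worker or this PR, and the patterns will miss short values such as small row IDs.

```typescript
// Hypothetical guard (not part of Graphile Worker): flag queue names that
// appear to embed a per-record or per-request value.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  // UUID in the usual 8-4-4-4-12 hexadecimal layout
  /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i,
  // Run of 10 or more digits, e.g. an epoch timestamp or a long numeric ID
  /\d{10,}/,
  // ISO 8601 style date and time
  /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}/,
];

function looksHighCardinality(queueName: string): boolean {
  return SUSPICIOUS_PATTERNS.some(function (pattern) {
    return pattern.test(queueName);
  });
}
```

A caller might log a warning or throw in development when the function returns true for a queue name it is about to use.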

Performance impact

N/A - docs only

Security impact

N/A - docs only

Checklist

  • My code matches the project's code style and yarn lint:fix passes.
  • N/A - docs only. I've added tests for the new feature, and yarn test passes.
  • N/A - docs only. I have detailed the new feature in the relevant documentation.
  • N/A - docs only. I have added this feature to 'Pending' in the RELEASE_NOTES.md file (if one exists).
  • N/A - docs only. If this is a breaking change I've explained why.

- * queue to run serially). (Default: null)
+ * queue to run serially). Avoid using high cardinality values (e.g., random
+ * strings, UUIDs, timestamps) as this degrades performance and requires
+ * periodic database cleanup. (Default: null)
Contributor Author

Feedback is welcome on the phrasing here - I don't feel strongly!

@blimmer blimmer marked this pull request as ready for review August 13, 2025 17:17
benjie previously approved these changes Aug 15, 2025
Member

@benjie benjie left a comment

Thanks!

Comment on lines 34 to 45

:::warning

Avoid using high cardinality values (e.g., random strings, UUIDs,
timestamps) for queue names as this will create many dead queues that
degrade performance and require
[periodic database cleanup](../admin-functions.md#gc_job_queues). If you
find yourself needing to run the `GC_JOB_QUEUES` cleanup task regularly,
you're likely using queue names incorrectly.

:::

Member

@jemgillam Please confirm this renders okay. If not, move it to below the list of options.

Contributor

Seems fine

Screenshot 2025-08-15 at 18 50 47

Member

The lines are all spaced out?

Contributor

Actually, I was only looking at the indentation of the list, not the line spacing between bullets. I have moved the warning outside of the list.

Screenshot 2025-08-15 at 18 57 55

@benjie benjie merged commit 91c950f into graphile:main Aug 17, 2025
19 of 23 checks passed

3 participants