Now and then you receive a report about something not working as expected. This time it was scarier than usual: a job, killed by Docker with an OOM error after consuming too much memory, was disappearing from the queue without a trace. In production we deploy sidekiq-pro for its reliability guarantees, specifically the super_fetch strategy for pulling work from the queue, which should normally re-queue such a job after a restart.
Why Sidekiq stopped recovering terminated jobs
Looking at the startup logs, I would normally expect sidekiq-pro to post messages about recovering jobs, like “SuperFetch recovered job:”, but this was not happening, even though SuperFetch was being initialized:
2022-10-27T10:19:41.187Z 1 TID-3lvl INFO: SuperFetch activated
2022-10-27T10:19:41.194Z 1 TID-3lx9 INFO: Gained leadership of the cluster
2022-10-27T10:19:41.194Z 1 TID-3m15 DEBUG: Tick: [2022-10-27 10:19:41.194801683 +0000]
2022-10-27T10:19:41.199Z 1 TID-3m3d DEBUG: SuperFetch: Registering super process <IDENTITY> with ["queue:sq|<IDENTITY>|critical"]
Something is not working as expected. Let’s dig into how super_fetch works, and where exactly it breaks.
How super_fetch recovers jobs
The inner workings of super_fetch are covered by the blog post “Increase reliability using super_fetch of Sidekiq Pro”. In short, it is a pretty clever use of the RPOPLPUSH Redis command: when Sidekiq picks a job from the queue, it also atomically pushes it into a private queue created specifically for the current worker. If the process crashes abnormally, it is then possible to recover the jobs that would otherwise have been lost and put them back into the original queue for processing.
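To make the trick concrete, here is a toy model of the RPOPLPUSH dance, with plain Ruby arrays standing in for Redis lists (index 0 plays the role of the left end of the list); this is an illustration of the technique, not sidekiq-pro’s actual code:

```ruby
# Ruby arrays stand in for Redis lists: index 0 is the "left" end,
# so RPOP maps to Array#pop and LPUSH maps to Array#unshift.
queue     = ["job1", "job2", "job3"] # public queue
private_q = []                       # per-process private queue

# RPOPLPUSH queue private_q: atomically move one job into the private queue.
private_q.unshift(queue.pop) # the worker now "owns" job3

# ...the process crashes before finishing job3...

# Recovery: RPOPLPUSH private_q queue until the private queue is empty.
queue.unshift(private_q.pop) until private_q.empty?

queue # => ["job3", "job1", "job2"] - nothing was lost
```

Because the pop from one list and the push onto the other happen in a single Redis command, there is no window in which the job exists in neither list.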
The recovery mechanism depends on the heartbeat monitor to ignore live processes. This is what the flow looks like:
- On startup, the newly spawned worker goes through the list of all super processes registered in Redis (the super_processes set), skipping the ones that are still alive (based on heartbeat monitoring)
- If a “dead” process is found, its private queues are picked up for analysis (one private queue per real queue, named like the queue:sq|<IDENTITY>|critical entry seen in the log above)
- The jobs from the private queues get moved back to the original queues, and then the private queues get deleted, along with the record in the super_processes set
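The whole scan can be sketched as a toy Ruby loop; hashes and arrays stand in for Redis structures here, and all names are illustrative rather than sidekiq-pro’s internals:

```ruby
require "set"

# Identities whose heartbeat key still exists in Redis.
alive = Set["host-a:1:deadbeef"]

# Private queues, keyed by name: queue:sq|<identity>|<queue name>.
private_queues = {
  "queue:sq|host-a:1:deadbeef|critical" => ["job1"], # live process: skipped
  "queue:sq|host-b:2:c0ffee|critical"   => ["job2"], # dead process: recovered
}
queues = { "critical" => [] }

private_queues.keys.each do |name|
  _prefix, identity, queue_name = name.split("|")
  next if alive.include?(identity)                # heartbeat present: skip
  queues[queue_name].concat(private_queues[name]) # move jobs back
  private_queues.delete(name)                     # drop the private queue
end

queues["critical"] # => ["job2"]
```

The `next if alive.include?(identity)` guard is the toy counterpart of the “check if the heartbeat key exists” step that breaks later in this story.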
Sidekiq heartbeat implementation
Now, imagine a scenario where the detection of whether a process is alive is broken.
The heartbeat process uses a very simple, yet powerful technique. Every 5 seconds a thread runs, dumping stats for every worker in the process, and the heartbeat metrics are saved into the “identity” key (which usually includes hostname, process PID, and a random string), with an expiration of 60 seconds. If the heartbeat stops, Redis deletes the key after 60 seconds, and the process is officially dead.
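A minimal sketch of the heartbeat idea, using an in-memory stand-in for Redis key expiration so it runs without a server (the identity format and TTL mirror the description above; the class and method names are made up for the illustration):

```ruby
require "json"

# Tiny in-memory stand-in for Redis SET ... EX / EXISTS.
class FakeRedis
  def initialize
    @store = {}
  end

  def set(key, value, ex:)
    @store[key] = [value, Time.now + ex]
  end

  def exists?(key)
    _value, expires_at = @store[key]
    !expires_at.nil? && Time.now < expires_at
  end
end

redis    = FakeRedis.new
identity = "myhost:1234:abc123" # roughly hostname:PID:random

# Every 5 seconds a heartbeat thread would refresh this key with a 60s TTL;
# here we write it once to show the mechanics.
redis.set(identity, { "busy" => 3 }.to_json, ex: 60)

redis.exists?(identity) # => true while the process keeps heartbeating
```

Once writes stop, the key expires on its own, so liveness checking needs no extra bookkeeping beyond “does the key exist?”.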
To check if the process is alive, we just need to check if the key exists. Easy. Except it is not, if you face a backward-incompatible change in the client library. What if “check if the key exists” suddenly started always returning true (or, as we Rubyists often call it, 0)? Well, the super_fetch job recovery flow would go haywire, for one.
Backwards incompatible Redis client update
Historically, redis-rb converted the result of the EXISTS operation to a boolean (where 0 is false, and anything else is true). Then in 4.2.0, a change was made to allow passing multiple keys, switching the return value to an integer, but only if Redis.exists_returns_integer is explicitly enabled. It turns out this change produced a lot of noise in the logs, so it was toned down in 4.2.1 for the case when the configuration option is explicitly set.
Sidekiq picked up the change in record time, releasing version 6.1.0 and switching from conn.exists to conn.exists?. Corresponding changes were also released for sidekiq-pro.
At some point a year later, Akihide Ito noticed the promise to switch the default behavior in 4.3.0, while the Redis client was already at 4.4.0. A change was made for the upcoming 4.5.0 that switched the default behavior, reducing the logging noise along the way.
What happened to us? Well, we upgraded the base redis client, but not sidekiq-pro. And after that, super_fetch stopped working.
# sidekiq-pro 5.0.1
next if conn.exists(x)

# sidekiq-pro 5.1.0
next if conn.exists?(x)
redis-rb now always returns an integer, so sidekiq-pro thinks that all processes are alive and skips job recovery: the integer 0 is truthy, so next if conn.exists(x) fires for every process. This is the most devastating impact of a question mark in history (yes, I exaggerate a little for dramatic effect).
Fixing the issue and recovering the jobs
Luckily for us, the private queues stayed intact in Redis, and all we needed to do was update sidekiq-pro to a more recent version and restart the process. We will still need to go through the jobs that got stuck in the private queues to ensure they are still relevant and will not cause issues if run today (normally, background jobs should be idempotent, but, for example, notifying a customer about an event that has already finished would be undesired).
Dependency management responsibilities
Who is responsible for managing dependencies in our applications? The obvious answer is: we are. But it gets into a grey area with transitive dependencies. If a direct dependency (for example, Sidekiq) brings in a third-party transitive dependency, I would expect Sidekiq to properly specify the version requirements to make sure the versions are compatible.
On the other hand, gems cannot be too strict with version locks. If a gem pins its requirements to a specific version, or to a list of versions below a specific minor or patch version, maintainers would have to react quickly to any version updates, re-test, and release new versions of their software with updated dependencies. This sounds horrible: an unnecessary burden on maintainers, frequent updates for no reason other than dependency version bumps, and so on.
This is why standards like Semantic Versioning are so important. Imagine a world where software could be tested against one version of a dependency, and then safely accept all subsequent updates as long as they are minor versions (new functionality added in a backward-compatible manner) or patch versions (backward-compatible bug fixes). Well, luckily for us Ruby programmers, the future is here. Almost all popular gems follow semantic versioning and allow almost frictionless updates.
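In Bundler this contract is expressed with the pessimistic version operator ~>, which allows exactly the minor and patch updates described above; a quick Gemfile sketch (the version numbers are illustrative, not a recommendation):

```ruby
# Gemfile
source "https://rubygems.org"

gem "redis",   "~> 4.1"   # any release >= 4.1 and < 5.0
gem "sidekiq", "~> 6.1.0" # only 6.1.x patch releases
```

Of course, this safety net only holds if the gem on the other side actually honors semantic versioning, which is exactly where this story went wrong.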
There are, of course, exceptions. The aforementioned Redis client gem does not claim to follow semver:
Semver is not gospel, and redis-rb doesn’t claim to follow semver anyway. That change is absolutely trivial to handle, and it will be a while before the behavior change. It doesn’t require a major version bump.
Jean Boussier
Yes, on one hand, we tend to look for a silver bullet and often turn it into a religion. But I would sign up for any initiative that makes dependency management less fragile and lets developers update more often with a reduced risk of introducing subtle, hard-to-catch bugs.
I don’t know if the Sidekiq maintainers could have handled this change better. After all, it is their software that does not work properly with an updated minor version of the Redis client library. On the other hand, introducing a breaking change in a minor version is a huge problem for any dependency management scheme. I wonder if releasing a patch version of sidekiq-pro that locks the Redis client to the last known working version would have been too much to ask. If you are a maintainer of an open-source library, please consider following semantic versioning, as this will make the lives of your users much simpler.