Summary
The inline stalled job recovery in reserve-batch.lua recovers expired jobs (removes from processing ZSET, re-adds to group ZSET, re-adds group to ready) but does not remove the jobId from the per-group active list (g:{gid}:active). Because the reserve path gates on LLEN(groupActiveKey) == 0 (line 72), the orphaned entry permanently blocks the group from ever being reserved again. The dedicated check-stalled.lua correctly performs LREM (line 50), but by the time it runs, reserve-batch.lua has already cleaned the processing ZSET, so check-stalled.lua finds no candidates and does nothing.
Environment
- GroupMQ version: 1.1.0
- Node.js: v20.20.1
- Redis: 7-alpine (Docker)
Root Cause
There is a race condition between the two stalled-job recovery paths:
reserve-batch.lua — inline recovery (lines 30–54)
This path finds expired jobs and re-queues them, but never touches the per-group active list:
```lua
-- reserve-batch.lua lines 30-54
local expiredJobs = redis.call("ZRANGEBYSCORE", processingKey, 0, now)
if #expiredJobs > 0 then
  for _, jobId in ipairs(expiredJobs) do
    local procKey = ns .. ":processing:" .. jobId
    local procData = redis.call("HMGET", procKey, "groupId", "deadlineAt")
    local gid = procData[1]
    local deadlineAt = tonumber(procData[2])
    if gid and deadlineAt and now > deadlineAt then
      local jobKey = ns .. ":job:" .. jobId
      local jobScore = redis.call("HGET", jobKey, "score")
      if jobScore then
        local gZ = ns .. ":g:" .. gid
        redis.call("ZADD", gZ, tonumber(jobScore), jobId)
        local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
        if head and #head >= 2 then
          local headScore = tonumber(head[2])
          redis.call("ZADD", readyKey, headScore, gid)
        end
        redis.call("DEL", ns .. ":lock:" .. gid)
        redis.call("DEL", procKey)
        redis.call("ZREM", processingKey, jobId)
        -- ❌ MISSING: redis.call("LREM", ns .. ":g:" .. gid .. ":active", 1, jobId)
      end
    end
  end
end
```
check-stalled.lua — dedicated checker (lines 48–50)
This path correctly removes from the active list:
```lua
-- check-stalled.lua lines 48-50
-- BullMQ-style: Remove from per-group active list
local groupActiveKey = ns .. ":g:" .. groupId .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)
```
The conflict
When reserve-batch.lua runs first and cleans the processing ZSET, the subsequent check-stalled.lua invocation finds zero candidates (lines 27–29: ZRANGEBYSCORE returns empty) and exits early, never reaching the LREM cleanup code.
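In rough terms the early exit looks like this; a paraphrase only, since those lines of check-stalled.lua are not quoted in this report and the variable names are assumed:

```lua
-- Paraphrase of check-stalled.lua lines 27-29 (sketch, not the actual source)
local stalled = redis.call("ZRANGEBYSCORE", processingKey, 0, now)
if #stalled == 0 then
  -- reserve-batch.lua already emptied the processing ZSET,
  -- so the LREM at line 50 is never reached
  return 0
end
```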
Steps to Reproduce
- Worker reserves a job for group G → the job goes to the processing ZSET and the g:G:active LIST
- Process crashes (e.g., an uncaught exception) before completing the job
- Process restarts and the worker calls reserveBatch(); the reserve-batch.lua inline recovery finds the expired job in processing
- It removes the job from the processing ZSET, adds it back to the group ZSET, and adds the group to ready
- But it does NOT LREM the jobId from g:G:active
- Later, check-stalled.lua runs but finds nothing in processing (already cleaned by reserve-batch)
- The active list entry persists forever
- All subsequent reserve attempts for group G check LLEN(groupActiveKey) (lines 71–72), see activeCount > 0, and skip it (see the sketch after this list)
- Group G is permanently blocked
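For reference, the gate that keeps skipping the group is roughly the following; again a paraphrase of reserve-batch.lua lines 71–72 rather than the actual source, with variable names taken from the prose above:

```lua
-- Paraphrase of the per-group reserve gate (sketch, not the actual source)
local groupActiveKey = ns .. ":g:" .. gid .. ":active"
local activeCount = redis.call("LLEN", groupActiveKey)
if activeCount > 0 then
  -- the group appears to have an in-flight job, so it is skipped;
  -- with an orphaned entry this branch is taken on every reserve attempt, forever
end
```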
Impact
Per-group FIFO ordering is permanently broken for affected groups. In our production system, 8 conversation groups became permanently blocked after a single process crash. These groups accumulated jobs in their group ZSETs but none were ever processed, requiring manual Redis intervention to unblock them.
Observed State in Redis (production)
```
# Orphaned active list — contains the jobId from the crashed process
> LRANGE gmq:myqueue:g:{conversationId}:active 0 -1
1) "job-12345"

# Processing ZSET is empty — reserve-batch already cleaned it
> ZRANGEBYSCORE gmq:myqueue:processing -inf +inf
(empty array)

# Group is in ready ZSET (worker sees it)
> ZSCORE gmq:myqueue:ready {conversationId}
"1710000000000"

# But jobs accumulate in group ZSET, never processed
> ZCARD gmq:myqueue:g:{conversationId}
(integer) 14
```
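To confirm that a suspected group is in this state, the following read-only check can be run against the queue's namespace. It is a sketch that assumes only the key layout shown above; gmq:myqueue and the groupId argument are placeholders, and the script is not part of GroupMQ:

```lua
-- diagnose-blocked-group.lua (read-only sketch based on the key layout above)
-- Usage: redis-cli --eval diagnose-blocked-group.lua , gmq:myqueue <groupId>
local ns, gid = ARGV[1], ARGV[2]
local report = {}
-- Any jobId sitting in g:{gid}:active with no entry in the processing ZSET is orphaned
for _, jobId in ipairs(redis.call("LRANGE", ns .. ":g:" .. gid .. ":active", 0, -1)) do
  if not redis.call("ZSCORE", ns .. ":processing", jobId) then
    table.insert(report, "orphaned active entry: " .. jobId)
  end
end
table.insert(report, "jobs waiting in group ZSET: " .. redis.call("ZCARD", ns .. ":g:" .. gid))
return report
```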
Suggested Fix
Add LREM to the inline recovery section in reserve-batch.lua, matching what check-stalled.lua already does:
```lua
-- In reserve-batch.lua, inline stalled recovery section (after line 48, before DEL procKey):
local groupActiveKey = ns .. ":g:" .. gid .. ":active"
redis.call("LREM", groupActiveKey, 1, jobId)
```
The full corrected block would be:
```lua
if jobScore then
  local gZ = ns .. ":g:" .. gid
  redis.call("ZADD", gZ, tonumber(jobScore), jobId)
  local head = redis.call("ZRANGE", gZ, 0, 0, "WITHSCORES")
  if head and #head >= 2 then
    local headScore = tonumber(head[2])
    redis.call("ZADD", readyKey, headScore, gid)
  end
  -- Clean up per-group active list (matches check-stalled.lua behavior)
  local groupActiveKey = ns .. ":g:" .. gid .. ":active"
  redis.call("LREM", groupActiveKey, 1, jobId)
  redis.call("DEL", ns .. ":lock:" .. gid)
  redis.call("DEL", procKey)
  redis.call("ZREM", processingKey, jobId)
end
```
Workaround
Manually delete the orphaned active list keys. Note that the blanket command below also clears active lists for jobs that are legitimately in flight, so only run it while workers are stopped:

```
redis-cli KEYS "gmq:*:g:*:active" | xargs -I{} redis-cli DEL {}
```

Or more targeted, inspect and delete a single group's list:

```
redis-cli LRANGE "gmq:myqueue:g:{groupId}:active" 0 -1
redis-cli DEL "gmq:myqueue:g:{groupId}:active"
```
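A less destructive alternative is to remove only the orphaned entries and leave legitimate in-flight entries alone. The script below is a sketch based solely on the key layout documented above and is not part of GroupMQ:

```lua
-- cleanup-orphaned-active.lua (sketch; removes only entries with no matching processing state)
-- Usage: redis-cli --eval cleanup-orphaned-active.lua , gmq:myqueue
local ns = ARGV[1]
local removed = 0
for _, activeKey in ipairs(redis.call("KEYS", ns .. ":g:*:active")) do
  for _, jobId in ipairs(redis.call("LRANGE", activeKey, 0, -1)) do
    -- A live job has both a processing ZSET entry and a processing hash; an orphan has neither
    if not redis.call("ZSCORE", ns .. ":processing", jobId)
       and redis.call("EXISTS", ns .. ":processing:" .. jobId) == 0 then
      removed = removed + redis.call("LREM", activeKey, 1, jobId)
    end
  end
end
return removed
```

Because the keys are not declared through KEYS[], this only works on a standalone Redis like the one in the environment above; it is not cluster-safe.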