Worker failover

Several Nodes can reference the same Worker
Nodes use "worker" table to set an exclusive lock to only one of the Node's dispatcher be active at a time
Workers observer of a Node reads periodicically all worker table rows to detect any worker's fail state (see below). Recover procedure is lunched for each worker failed.

Worker failover with two Nodes

Worker fail state detection and recovery

Worker is in fail state when (worker.locked_until > now() and worker.active) is true.
Worker observer running on each Node (group_id, worker_id don`t matter) lunches recovery procedure for any worker for which (worker.locked_until + config.failed_worker_recovery_delay > now() and worker.active) is true.
Recovery procedure does:

aquires exclusive lock for the worker
calls recover_worker_tasks(worker_id) stored procedure that must re-queue or complete Worker`s Tasks
resets worker.active flag, sets worker.locked_until = NULL
releases lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

failover.md

failover.md

Worker failover

Worker failover with two Nodes

Worker fail state detection and recovery

Files

failover.md

Latest commit

History

failover.md

File metadata and controls

Worker failover

Worker failover with two Nodes

Worker fail state detection and recovery