Status: Active

Distributed Task Scheduler

A fault-tolerant task scheduler built with Go and gRPC, designed for high-throughput workloads on Kubernetes.

Overview

The scheduler is designed to handle millions of jobs per day across a Kubernetes cluster. Workers register dynamically, tasks are dispatched according to configurable priority and retry policies, and the system recovers gracefully from node failures.
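
As an illustration of what a configurable retry policy could look like, here is a minimal sketch in Go. The `RetryPolicy` type, its field names, and the choice of capped exponential backoff are assumptions made for this example; the project's actual policy format is not described here.

```go
package main

import "time"

// RetryPolicy is a hypothetical per-task retry configuration.
type RetryPolicy struct {
	MaxAttempts int           // total attempts allowed, including the first run
	BaseDelay   time.Duration // wait before the first retry
	MaxDelay    time.Duration // upper bound on any single wait
}

// NextDelay reports how long to wait before the next attempt, given how many
// attempts have already been made. The second return value is false when the
// task should not be retried again.
func (p RetryPolicy) NextDelay(attemptsMade int) (time.Duration, bool) {
	if attemptsMade < 1 || attemptsMade >= p.MaxAttempts {
		return 0, false
	}
	delay := p.BaseDelay
	// Double the delay once per previous retry, stopping at the cap so the
	// value cannot grow without bound.
	for i := 1; i < attemptsMade; i++ {
		delay *= 2
		if delay >= p.MaxDelay {
			return p.MaxDelay, true
		}
	}
	if delay > p.MaxDelay {
		delay = p.MaxDelay
	}
	return delay, true
}
```

With `MaxAttempts: 4`, `BaseDelay: 100 * time.Millisecond`, and `MaxDelay: 300 * time.Millisecond`, this sketch waits 100 ms, 200 ms, and 300 ms after the first three failures and gives up after the fourth.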

Problem

Existing solutions like Celery and Sidekiq are tightly coupled to their language ecosystems (Python and Ruby respectively). We needed a language-agnostic scheduler that:

  • Supports heterogeneous workers (different languages, different capabilities)
  • Scales horizontally with zero downtime
  • Provides exactly-once execution guarantees
  • Integrates natively with Kubernetes health checks and scaling

Solution

The scheduler is built as a set of microservices:

  1. API Server — accepts task submissions via gRPC and REST, validates payloads, and writes to the task queue.
  2. Dispatcher — reads from the queue, matches tasks to available workers based on capabilities and priority, and manages the execution lifecycle.
  3. Worker SDK — lightweight libraries in Go, Python, and TypeScript that handle registration, heartbeats, and result reporting.
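
The Dispatcher's matching step can be sketched roughly as follows. This is a simplified, in-memory illustration rather than the project's code: the `Task` and `Worker` shapes, the "higher number means higher priority" convention, and the greedy first-fit strategy are all assumptions.

```go
package main

import "sort"

// Task is a hypothetical pending task.
type Task struct {
	ID       string
	Priority int      // assumption: a higher value is dispatched first
	Requires []string // capabilities the worker must advertise
}

// Worker is a hypothetical registered worker.
type Worker struct {
	ID           string
	Capabilities map[string]bool
	Busy         bool
}

// canRun reports whether w advertises every capability that t requires.
func canRun(w Worker, t Task) bool {
	for _, c := range t.Requires {
		if !w.Capabilities[c] {
			return false
		}
	}
	return true
}

// Assign greedily matches tasks to idle workers, highest priority first.
// It returns a map from task ID to worker ID. Tasks with no capable idle
// worker are left out and would stay queued.
func Assign(tasks []Task, workers []Worker) map[string]string {
	ordered := make([]Task, len(tasks))
	copy(ordered, tasks)
	// Stable sort keeps submission order among tasks of equal priority.
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Priority > ordered[j].Priority
	})

	taken := make(map[string]bool) // workers claimed during this pass
	assigned := make(map[string]string)
	for _, t := range ordered {
		for _, w := range workers {
			if w.Busy || taken[w.ID] || !canRun(w, t) {
				continue
			}
			taken[w.ID] = true
			assigned[t.ID] = w.ID
			break
		}
	}
	return assigned
}
```

First-fit is the simplest strategy to reason about, but it can hand a general-purpose task to a specialised worker that a later task needs. A production dispatcher would likely weigh that trade-off differently.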

Architecture

The system uses a combination of:

  • Redis Streams for the task queue (ordered, persistent, consumer-group support)
  • etcd for worker registration and leader election
  • PostgreSQL for task history and analytics
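
To make the worker registration idea concrete, here is a small in-memory model of lease-style registration, where a worker stays registered only while its heartbeats keep arriving. It is a stand-in for illustration only: it does not use the etcd client, and the `Registry` type and its methods are hypothetical names, not part of the project.

```go
package main

import (
	"sort"
	"sync"
	"time"
)

// Registry is an in-memory stand-in for lease-based worker registration.
type Registry struct {
	mu      sync.Mutex
	ttl     time.Duration
	expires map[string]time.Time // worker ID -> lease expiry
}

// NewRegistry creates a registry whose leases last ttl after each heartbeat.
func NewRegistry(ttl time.Duration) *Registry {
	return &Registry{ttl: ttl, expires: make(map[string]time.Time)}
}

// Heartbeat registers the worker, or renews its lease, as of now.
func (r *Registry) Heartbeat(workerID string, now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expires[workerID] = now.Add(r.ttl)
}

// Expiry returns the lease expiry for a worker and whether it is registered.
func (r *Registry) Expiry(workerID string) (time.Time, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	exp, ok := r.expires[workerID]
	return exp, ok
}

// Live evicts workers whose lease has run out and returns the IDs of the
// remaining workers in sorted order.
func (r *Registry) Live(now time.Time) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	ids := make([]string, 0, len(r.expires))
	for id, exp := range r.expires {
		if !now.Before(exp) { // lease expired: treat the worker as gone
			delete(r.expires, id)
			continue
		}
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}
```

Passing `now` explicitly instead of calling `time.Now()` inside the methods keeps the sketch deterministic and easy to test. A real deployment would rely on the coordination store to expire leases rather than on a local map.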

Key Learnings

  1. Exactly-once is hard. We settled on at-least-once delivery with idempotency keys, so a redelivered task has no additional effect. That is much simpler to implement and reason about than true exactly-once execution.
  2. Backpressure matters. Without rate limiting at the dispatcher, a burst of high-priority tasks could starve lower-priority work indefinitely.
  3. Observability is not optional. We instrumented every component with OpenTelemetry from day one, which saved us countless hours during incidents.
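
The first learning, at-least-once delivery combined with idempotency keys, can be illustrated with a short sketch. The `Deduper` type is hypothetical and keeps its state in memory; a real deployment would need a durable store for recorded keys, and how the project persists them is not covered here.

```go
package main

import "sync"

// Deduper remembers results by idempotency key so that a task delivered more
// than once is executed only once. Hypothetical, in-memory illustration.
type Deduper struct {
	mu   sync.Mutex
	done map[string]string // idempotency key -> recorded result
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{done: make(map[string]string)}
}

// Run executes fn unless a result is already recorded for key, in which case
// the recorded result is returned and fn is not called. Failed runs are not
// recorded, so a retry with the same key executes fn again.
//
// The lock is held while fn runs, which serialises all tasks. That keeps the
// sketch short but would be too coarse for real workloads.
func (d *Deduper) Run(key string, fn func() (string, error)) (string, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if res, ok := d.done[key]; ok {
		return res, nil // duplicate delivery: reuse the recorded result
	}
	res, err := fn()
	if err != nil {
		return "", err
	}
	d.done[key] = res
	return res, nil
}
```

Calling `Run` twice with the same key and a successful handler invokes the handler once and returns the same result both times, which is the effect that exactly-once execution was meant to provide.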

Tech Stack

  • Go 1.22, gRPC, Protocol Buffers
  • Redis 7 (Streams + pub/sub)
  • PostgreSQL 16
  • Kubernetes (Helm charts, HPA)
  • OpenTelemetry, Prometheus, Grafana

Tags: go, grpc, kubernetes, distributed-systems