Autoscaling is one of the most powerful promises of Kubernetes, but also one of the most misunderstood. Many teams rely on default CPU- or memory-based autoscaling, only to discover that their cluster does not scale when it should, especially for asynchronous workloads (e.g., Celery workers, Kafka consumers, ETL services, etc.).
In this guide, we’ll walk through a production-proven autoscaling strategy that goes far beyond basic resource metrics. You’ll learn how to combine:
Custom application metrics;
Prometheus for scraping;
Prometheus Adapter for exposing metrics to Kubernetes;
Horizontal Pod Autoscalers (HPA) for pod-level scaling;
AKS (or any cloud) Node Autoscaler for infrastructure-level scaling.
This is the exact architecture we use to scale Celery-based workloads under high demand, and it has proven to be resilient, predictable, and cost-efficient.
Whether you’re building search pipelines, background processors, event-driven systems, or AI inference services, this tutorial provides a definitive scaling blueprint for Kubernetes.
Why traditional autoscaling falls short
Kubernetes’s default horizontal autoscaling relies on CPU and memory, but with asynchronous or queue-driven systems, those metrics don’t reflect real pressure.
For example:
Your Celery queue grows from 200 → 6,000 tasks;
Workers are fully idle between tasks;
CPU remains at 10–30%;
Kubernetes thinks everything is fine.
Meanwhile, users wait longer and longer.
The truth is: to autoscale correctly, you need semantic metrics that describe the real workload, such as queue depth, worker utilization, and event lag. In this tutorial, we’ll scale based on what actually matters, not what the kernel reports.
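As a concrete sketch, assuming the worker pods expose a `/metrics` endpoint behind a Service labeled `app: celery-worker` (both names are illustrative), a Prometheus Operator ServiceMonitor could scrape them every 15 seconds:

```yaml
# Hypothetical ServiceMonitor; service name, labels, and namespaces are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: celery-worker-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: celery-worker     # must match the labels on the workers' Service
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics          # named port on the Service
      path: /metrics
      interval: 15s          # scrape frequency
```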
With the scrape configuration in place, Prometheus collects celery_queue_depth and celery_workers_busy_ratio every 15 seconds.
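Upstream of the scrape, the application has to compute and publish these gauges. A minimal, pure-Python sketch of the two calculations follows; in production the inputs would come from the broker (e.g., a Redis `LLEN` on each queue) and `celery.control.inspect()`, and the results would be published with `prometheus_client`. All names here are illustrative, not part of Celery's API.

```python
# Sketch: how the two custom metrics could be computed.
# Inputs are plain dicts standing in for broker/inspect() results.

def celery_queue_depth(queue_lengths: dict) -> int:
    """Total number of tasks waiting across all queues."""
    return sum(queue_lengths.values())

def celery_workers_busy_ratio(active_tasks: dict, concurrency: dict) -> float:
    """Fraction of worker slots currently executing a task (0.0 - 1.0)."""
    total_slots = sum(concurrency.values())
    if total_slots == 0:
        return 0.0
    busy = sum(len(tasks) for tasks in active_tasks.values())
    return min(busy / total_slots, 1.0)

# Example: 120 queued tasks; two workers with 4 slots each, 6 tasks running.
depth = celery_queue_depth({"default": 100, "priority": 20})
ratio = celery_workers_busy_ratio(
    {"worker1": ["t1", "t2", "t3", "t4"], "worker2": ["t5", "t6"]},
    {"worker1": 4, "worker2": 4},
)
```

A real exporter would wrap these in `prometheus_client.Gauge` objects and refresh them on a timer, but the scaling logic is exactly this simple.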
3. Exposing metrics via Prometheus Adapter
Kubernetes Horizontal Pod Autoscalers cannot consume Prometheus metrics directly. To bridge this gap, the Prometheus Adapter is deployed as a separate component in the cluster, usually via its own Helm chart and configuration file, commonly in a shared namespace such as monitoring.
The adapter connects to Prometheus, runs predefined queries, and then exposes the results through the Kubernetes External Metrics API (external.metrics.k8s.io). These metric-mapping rules are defined exclusively in the Prometheus Adapter configuration and are what make custom application metrics, such as Celery queue depth or worker utilization, available for autoscaling.
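A sketch of what such rules might look like in the adapter's Helm values, assuming the two Prometheus series named above (the aggregation choices are illustrative):

```yaml
# Hypothetical prometheus-adapter external rules (Helm values.yaml style).
rules:
  external:
    - seriesQuery: 'celery_queue_depth{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "celery_queue_depth"
      metricsQuery: 'max(celery_queue_depth{<<.LabelMatchers>>})'
    - seriesQuery: 'celery_workers_busy_ratio{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "celery_workers_busy_ratio"
      metricsQuery: 'avg(celery_workers_busy_ratio{<<.LabelMatchers>>})'
```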
Once deployed, these metrics can be queried by Kubernetes and referenced directly by Horizontal Pod Autoscalers.
These can be inspected with:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
You should see:
celery_queue_depth
celery_workers_busy_ratio
4. Kubernetes HPA: Autoscaling pods based on real workload
Once custom metrics are exposed through the Prometheus Adapter, Kubernetes can use them to make autoscaling decisions. This is done through the Horizontal Pod Autoscaler (HPA), which adjusts the number of pod replicas based on real workload pressure rather than CPU or memory usage alone.
In this setup, the HPA is configured to scale worker pods using external metrics such as queue depth and worker utilization. These metrics reflect how much work the system is actually processing, making scaling decisions more accurate and responsive.
Pods scale up as the queue grows beyond the target depth;
Pods scale up when workers approach full utilization;
Pods scale down when the system becomes idle.
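A sketch of such an HPA, assuming a Deployment named `celery-worker` and the external metrics defined earlier (replica bounds and target values are illustrative and must be tuned to the workload):

```yaml
# Hypothetical HPA; names and targets are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: celery-worker
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: celery-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: celery_queue_depth
        target:
          type: Value
          value: "500"      # scale up while queue depth exceeds 500 tasks
    - type: External
      external:
        metric:
          name: celery_workers_busy_ratio
        target:
          type: Value
          value: "800m"     # scale up when workers are ~80% busy
```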
Tuning HPA behavior and why this matters
When scaling based on external or workload-driven metrics, it is important to also configure HPA behavior parameters. Without them, the autoscaler may react too aggressively to short-lived metric spikes, leading to rapid scale-up and scale-down cycles (“flapping”).
By defining stabilization windows and scaling policies, you can ensure that scaling decisions are smoother, more predictable, and aligned with sustained workload trends rather than transient noise.
This configuration limits how frequently replicas can be added or removed and gives the system time to absorb changes in demand before making further adjustments. In production environments, especially when autoscaling from queue-based metrics, proper HPA behavior tuning is essential to avoid instability and unnecessary resource churn.
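For example, a `behavior` stanza like the following could be added to the HPA spec above; the windows and rate limits are illustrative starting points, not recommendations for every workload:

```yaml
# Hypothetical HPA behavior tuning; tune the numbers to your traffic pattern.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # ignore spikes shorter than one minute
    policies:
      - type: Percent
        value: 100                    # at most double the replica count...
        periodSeconds: 60             # ...per minute
  scaleDown:
    stabilizationWindowSeconds: 300   # require five minutes of low load
    policies:
      - type: Pods
        value: 2                      # remove at most two pods...
        periodSeconds: 120            # ...every two minutes
```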
This is semantic autoscaling: scaling driven by business logic.
5. Node-level autoscaling (AKS or any cloud provider)
Pod autoscaling only works when the cluster has enough compute capacity. To automatically add or remove nodes, a cluster autoscaler must be enabled. This can be implemented using the platform’s preferred solution (such as the native Kubernetes Cluster Autoscaler, Karpenter, or a cloud-provider managed autoscaler) and configured either via infrastructure-as-code tools or directly through the cloud console. In this guide, node autoscaling is enabled using Terraform, but the same concepts apply regardless of the tooling or autoscaler implementation used.
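As an illustration on AKS, the cluster autoscaler can be enabled on the default node pool of an `azurerm_kubernetes_cluster` resource. This is a minimal sketch; the resource names, VM size, and node bounds are assumptions, and attribute names vary slightly across azurerm provider versions:

```hcl
# Sketch for the Terraform azurerm provider (v3.x attribute names).
resource "azurerm_kubernetes_cluster" "main" {
  name                = "example-aks"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  dns_prefix          = "example"

  default_node_pool {
    name                = "workers"
    vm_size             = "Standard_D4s_v5"
    enable_auto_scaling = true   # let AKS add/remove nodes automatically
    min_count           = 2      # floor during idle periods
    max_count           = 20     # ceiling under peak load
  }

  identity {
    type = "SystemAssigned"
  }
}
```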
This is essential for any scalable production Kubernetes system.
6. Validating the setup
Check HPA decisions:
kubectl describe hpa -n default
Check external metrics:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/celery_queue_depth"
Check pending pods:
kubectl get pods -A | grep Pending
Check node autoscaler actions:
kubectl get nodes
kubectl describe node <name>
Conclusion: A scalable, production-proven autoscaling strategy
Autoscaling Kubernetes isn’t just about turning on HPAs. Real-world autoscaling requires application-aware metrics, smart decision-making, and infrastructure elasticity.
By combining:
Custom metrics
Prometheus
Prometheus Adapter
Kubernetes HPA
Node autoscaling
You build a cluster that reacts to real demand, scales smoothly under pressure, and minimizes cost during idle periods.
If your applications rely on queues, background processing, or any asynchronous workloads, this scaling strategy is not just ideal. It’s essential.
This is the definitive way to autoscale Kubernetes.
If your engineering team wants help implementing this pattern or wants to scale more advanced workloads (AI, search pipelines, ETL, etc.), feel free to reach out!
Passionate about understanding the essence of technology, I specialized in cloud infrastructure and process automation. With extensive involvement in data-driven, mobile, and web applications, I'm dedicated to creating and optimizing environments to provide the best possible experience for development teams.