
Beyond Kubernetes Load Balancing: Why You Need Endpoint Discovery

Rohit Agrawal
4 min read · Dec 31, 2024


When running a distributed system at scale, load balancing becomes critical to system performance, reliability, and cost efficiency. At Databricks, as our platform grew to serve millions of notebooks, jobs, and ML models, we discovered that Kubernetes’ default load balancing mechanisms weren’t enough to handle our complex workload patterns.

In this story, we'll share how we tackled these challenges to improve our system's availability and reduce our overall cloud costs.

The Problem with Kubernetes Load Balancing

Kubernetes, by default, uses a simple random load balancing strategy through kube-proxy’s iptables mode. While this works well for simple applications, it presents several challenges for complex, high-scale systems:

1. Not All Requests Are Created Equal

In a typical data platform like Databricks, different requests can have vastly different resource requirements. Some API calls might be quick metadata operations, while others could involve heavy computation or data movement. Random load balancing doesn’t account for these differences, leading to uneven resource utilization across pods.
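
This skew is easy to see in a toy simulation. The sketch below (our own illustration; the 95/5 request mix and pod count are made-up numbers, not Databricks data) assigns requests of very different costs to pods uniformly at random, roughly what kube-proxy's iptables mode does, and prints the resulting load spread:

```python
import random
from collections import defaultdict

random.seed(42)

PODS = 10
REQUESTS = 10_000

def request_cost():
    # Hypothetical mix: 95% cheap metadata calls, 5% heavy operations
    # that cost 100x as much to serve.
    return 100 if random.random() < 0.05 else 1

load = defaultdict(int)
for _ in range(REQUESTS):
    pod = random.randrange(PODS)  # uniform random pick, like kube-proxy
    load[pod] += request_cost()

print(f"hottest/coldest pod: {max(load.values()) / min(load.values()):.2f}x")
```

Even with a perfectly uniform picker, the heavy requests cluster on some pods by chance, so the hottest pod ends up noticeably hotter than the coldest.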

2. Connection Reuse Creates Hotspots

Modern systems often use connection pooling and keep-alive connections to improve performance. However, this can create an unexpected side effect: once a connection is established, it stays with the same backend pod until it is closed. During deployments or scale events, traffic therefore doesn't shift quickly enough to new pods when we roll out a new version, creating potential bottlenecks.
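
A toy model (not Databricks code; the pod names and pool size are invented) makes the problem obvious: connections are balanced only once, at establishment time, so a pod added afterwards sees no traffic until connections churn:

```python
import random
from collections import Counter

random.seed(7)

# Each pooled keep-alive connection was load-balanced exactly once,
# at connect time, and stays pinned to that pod afterwards.
pods = ["pod-a", "pod-b", "pod-c"]
pool = [random.choice(pods) for _ in range(8)]  # 8 keep-alive connections

pods.append("pod-d")  # scale-up: a new pod joins the Service

# Requests reuse pooled connections instead of dialing again, so the
# new pod receives nothing until connections are closed and reopened.
traffic = Counter(random.choice(pool) for _ in range(1_000))
print(traffic["pod-d"])  # 0
```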

3. Streaming Workloads Create Imbalance

gRPC bidirectional streams and WebSocket connections, which are common in interactive computing platforms, can stick to specific endpoints for extended periods. With random load balancing, these long-lived connections can concentrate on a few pods, overloading them while other pods sit underutilized.

4. No Built-in Failover

Perhaps most critically, when a pod becomes unhealthy, kube-proxy doesn't automatically fail over to healthy pods. This forces clients to implement their own retry logic, which can be inefficient and can overwhelm the remaining pods during partial outages.
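
The client-side logic this forces tends to look like the sketch below (illustrative only; the function name and parameters are ours). Even careful retries add latency, and a stampede of synchronized retries can make a partial outage worse:

```python
import random
import time

def call_with_retries(send, max_attempts=3, base_backoff=0.1):
    """Retry a request with exponential backoff and jitter: the kind of
    logic every client must carry when the load balancer offers no
    failover of its own."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Jitter spreads out retries so clients don't all hammer
            # the surviving pods at the same instant.
            time.sleep(base_backoff * (2 ** attempt) * random.random())
```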

The Cost of Suboptimal Load Balancing

At Databricks, these issues weren’t just theoretical. We found ourselves significantly over-provisioning resources to handle the uneven load distribution. Our analysis showed that optimizing load balancing could save up to $75,000 per month per region in cloud costs.

Enter EDS: A Smart Load Balancing Solution

In September 2022, we began developing the Endpoint Discovery Service (EDS) based on the open-source java-control-plane implementation. The design goals were clear:

  • Make load balancing load-aware and dynamic.
  • Ensure backward compatibility.
  • Maintain high availability with failover capabilities.
  • Keep it simple for service teams to adopt.

EDS works by:

  1. Dynamic Load Awareness: EDS continuously monitors system metrics from each pod, initially using M3 metrics and later moving to direct metric collection for reduced latency.
  2. Intelligent Weight Calculation: Instead of random distribution, EDS calculates a weight for each endpoint based on its current load and capacity, ensuring requests are directed to the least loaded pods. The weight calculation considers multiple factors: `weight = 100 * (1 - load_1m / nr_cpus) / Σ(1 - load_1m(i) / nr_cpus(i))` (see the sketch after this list).
  3. Fast Failover: When pods become unhealthy, EDS quickly removes them from the load balancing pool and redistributes traffic to healthy instances, even across availability zones if necessary.
  4. Safe Rollout Mechanisms: EDS supports various rollout strategies, including traffic splitting, where services can designate a percentage of traffic to use EDS load balancing while keeping the rest on the traditional method.
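
To make the weight formula concrete, here is a minimal Python sketch of that calculation (our own reading of the formula above; the zero-clamping and the sample inputs are assumptions, not EDS internals):

```python
def endpoint_weights(endpoints):
    """weight_i = 100 * headroom_i / sum(headroom_j), where
    headroom = 1 - load_1m / nr_cpus.  `endpoints` maps a pod name
    to its (1-minute load average, CPU count)."""
    headroom = {
        name: max(0.0, 1.0 - load / cpus)  # clamp overloaded pods at 0
        for name, (load, cpus) in endpoints.items()
    }
    total = sum(headroom.values()) or 1.0  # avoid division by zero
    return {name: 100 * h / total for name, h in headroom.items()}

# A lightly loaded pod receives a proportionally larger weight:
print(endpoint_weights({
    "pod-a": (1.0, 4),  # 25% utilized -> headroom 0.75  -> weight ~86
    "pod-b": (3.5, 4),  # 88% utilized -> headroom 0.125 -> weight ~14
}))
```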

Implementation Details

One key design decision was implementing EDS as a soft dependency. We configured API Proxy to use an aggregate cluster setup (sketched below):

  • Primary: EDS-derived cluster with weighted least request load balancing
  • Secondary: Traditional ClusterIP-based cluster as fallback
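
Concretely, an Envoy aggregate-cluster setup like this has roughly the shape below, expressed as the JSON-equivalent Python dicts (a reconstruction from Envoy's public aggregate-cluster API; the cluster names are placeholders, not Databricks' actual configuration):

```python
# Aggregate cluster: tries member clusters in priority order.
aggregate_cluster = {
    "name": "my-service-aggregate",
    "connect_timeout": "1s",
    # Aggregate clusters delegate balancing to their members.
    "lb_policy": "CLUSTER_PROVIDED",
    "cluster_type": {
        "name": "envoy.clusters.aggregate",
        "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.clusters"
                     ".aggregate.v3.ClusterConfig",
            # Priority order: EDS first, ClusterIP fallback second.
            "clusters": ["my-service-eds", "my-service-clusterip"],
        },
    },
}

# Primary member: endpoints and weights come from the EDS server.
eds_cluster = {
    "name": "my-service-eds",
    "type": "EDS",
    "lb_policy": "LEAST_REQUEST",  # weighted least request
    "eds_cluster_config": {
        "eds_config": {"ads": {}, "resource_api_version": "V3"},
    },
}
```

Because the aggregate cluster walks its member clusters in priority order, traffic spills to the ClusterIP fallback only when the EDS-derived cluster has no healthy endpoints.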

This design ensures that if EDS becomes unavailable, traffic automatically falls back to the traditional Kubernetes load balancing without service disruption. We also implemented a “Big Red Button” via our Global Operations Console (GOC) to instantly disable EDS for any service if needed.

Results in Production

After rolling out EDS across our platform in waves (starting with dev environments and gradually moving to production), we observed:

  • Better Load Distribution: CPU utilization variance across pods decreased significantly, with the difference between most and least loaded pods dropping from 3x to 1.2x
  • Faster Recovery: During pod failures, traffic redistribution time improved from seconds to milliseconds
  • Resource Efficiency: We were able to reduce our pod count by up to 50% while maintaining the same performance SLAs
  • Improved Rollouts: New pods during scale-up events now receive their fair share of traffic immediately

Industry Context and Future

Our journey with EDS mirrors similar challenges faced by other large-scale platforms. Google’s Maglev load balancer and Amazon’s Application Load Balancer both implement sophisticated load-aware balancing strategies. However, these solutions are typically not available to companies running their own Kubernetes clusters.

Looking forward, we’re continuing to enhance EDS with:

  • Integration with our RPC clients for service-to-service communication.
  • Support for custom metrics and weighting algorithms.
  • Enhanced observability and debugging capabilities.
  • Potential integration with Cluster Discovery Service (CDS) for more dynamic configuration.

The journey from Kubernetes’ basic load balancing to EDS shows how solving fundamental infrastructure challenges can have far-reaching impacts on system reliability, performance, and cost efficiency. Today, all our workloads at Databricks use EDS, and we’ve built it natively into our RPC clients, so not only our API Gateway (powered by Envoy) but all other clients use the same system.

Remember, while Kubernetes provides a great foundation for container orchestration, as your system grows, you might need to build additional layers to handle your specific scaling challenges. EDS is just one example of how we at Databricks have extended the Kubernetes ecosystem to better serve our needs.
