To log, or not to log — that is the question 🤔

Scaling Envoy Proxy: How We Slashed Logging Costs by 65%

Rohit Agrawal
4 min read · Dec 22, 2024

At Databricks, we face a classic engineering dilemma: our Envoy proxies generate invaluable access logs that help us debug network issues, but these same logs significantly impact performance and costs. In fact, our benchmarks showed that simply turning off access logs could double our throughput.

Here’s how we solved this challenge.

The Setup: Our Envoy Architecture

We deploy Envoy at multiple layers — as our main ingress proxy and as an egress filtering proxy. The access logs and metrics they emit are crucial for understanding network behavior. When a user workload fails, these logs help us quickly identify the root cause: whether it’s an ExtAuthZ error, a rate-limit breach, or an RBAC rule violation.

Figure: Our Ingress Envoy’s HTTP filter chain (request processing flow)

The Challenge: Performance vs. Observability

However, comprehensive logging comes at a steep price:
- Reduced throughput of Envoy proxies
- Significant processing overhead
- High storage costs

In our independent benchmark testing, we discovered something striking: disabling access logs completely could boost overall throughput by up to 2x.

Our Solution: Strategic Logging

The answer isn’t to abandon logging entirely — it’s to log strategically. Envoy provides powerful filtering capabilities that allow us to:
- Write different types of requests and responses to different access logs.
- Intelligently limit our logging footprint by filtering out noise.
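As a minimal sketch of the first point, an HTTP connection manager can carry two access loggers, each with its own filter. This assumes the file access logger; the log paths and runtime key names are hypothetical and are not our production values.

```yaml
# Sketch only: this list sits under the HTTP connection manager's typed_config.
# Paths and runtime keys below are hypothetical.
access_log:
  # Logger 1: every response with status >= 400 goes to an error log.
  - name: envoy.access_loggers.file
    filter:
      status_code_filter:
        comparison:
          op: GE
          value:
            default_value: 400
            runtime_key: example-error-log-min-status
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /var/log/envoy/errors.log
  # Logger 2: a 1% sample of all traffic goes to a separate log.
  - name: envoy.access_loggers.file
    filter:
      runtime_filter:
        runtime_key: example-sampled-log-rate
        percent_sampled:
          numerator: 1
          denominator: HUNDRED
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /var/log/envoy/sampled.log
```

Each logger evaluates its filter independently, so an error response that also falls into the 1% sample would appear in both files.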

Understanding Envoy’s Log Filtering

Envoy’s log filtering offers extensive options based on response flags, status codes, and request duration. Filters can also be combined into more complex expressions with AND/OR operators.

For instance, instead of logging every 200 OK response, we can focus on specific scenarios:
- High-latency requests (> 500ms)
- Requests that got retried upstream
- All 4XX and 5XX responses for post-mortem analysis
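The scenarios above could be combined roughly as follows. This is a sketch under stated assumptions, not our exact configuration: the runtime key names are hypothetical, the duration comparison uses GE (so it matches requests at or above 500 ms), and the URX response flag only marks requests whose upstream retry limit was reached, which is narrower than "retried at all".

```yaml
# Sketch only: thresholds and runtime keys are hypothetical.
filter:
  or_filter:
    filters:
      # Requests whose total duration was at least 500 ms
      - duration_filter:
          comparison:
            op: GE
            value:
              default_value: 500
              runtime_key: example-slow-request-ms
      # Requests that exhausted their upstream retries (URX response flag)
      - response_flag_filter:
          flags:
            - URX
      # All 4XX and 5XX responses
      - status_code_filter:
          comparison:
            op: GE
            value:
              default_value: 400
              runtime_key: example-min-error-status
```

Because the three conditions sit under an or_filter, a request is logged as soon as any one of them matches.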

Implementation: Sample Filtering Configuration

Here’s a real-world example of how we implemented some of these filters:

# This configuration achieves:
# 1. Logs all errors (4XX, 5XX)
# 2. Samples successful requests at a configurable rate
# 3. Allows runtime modification of sampling rates
filter:
  or_filter:
    filters:
      - or_filter:
          filters:
            # Filter for status code 0 (Envoy records 0 when no response was sent,
            # for example when the client disconnected early)
            - status_code_filter:
                comparison:
                  op: EQ
                  value:
                    default_value: 0
                    runtime_key: access-logs-eq-zero-status-code
            # Filter for error responses
            - status_code_filter:
                comparison:
                  op: GE
                  value:
                    default_value: 400
                    runtime_key: access-logs-ge-400-status-codes
      # Sampling filter for remaining traffic
      - runtime_filter:
          percent_sampled:
            denominator: MILLION
            # Default is 1000000 / 1000000 = 100%; the runtime key overrides the numerator
            numerator: 1000000
          runtime_key: access-logs-sampling-rate
          use_independent_randomness: false
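To show how the sampling rate in the configuration above might be overridden, here is a minimal bootstrap sketch using Envoy's layered runtime. The layer names and the 5% value are illustrative assumptions; only the runtime key is taken from the configuration above.

```yaml
# Sketch only: bootstrap fragment with illustrative values.
layered_runtime:
  layers:
    # Static defaults applied at startup.
    - name: static_layer
      static_layer:
        # Overrides the numerator: 50000 / 1000000 = 5% of requests
        # pass the runtime_filter.
        access-logs-sampling-rate: 50000
    # Enables overrides through the admin interface.
    - name: admin_layer
      admin_layer: {}
```

With an admin layer present, a POST to the admin interface's `/runtime_modify` endpoint (for example `/runtime_modify?access-logs-sampling-rate=10000`) would lower sampling to 1% without a restart. Errors and status code 0 responses are still logged in full, because the status code filters sit in the same or_filter.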

Results and Impact

After implementing this strategic logging approach, we observed:

  • Up to 80% reduction in the overall logging volume.
  • Up to 60% improvement in the proxy throughput.
  • Up to 65% decrease in logging storage costs.

Lessons Learned

During this optimization journey, we learned several key lessons:

  1. Not all logs are created equal. Developers should focus on capturing anomalies and errors.
  2. Runtime configurability is crucial for adapting to different traffic patterns.
  3. Regular review of logging patterns helps identify opportunities for further optimization.

Looking Ahead: Limitations and Future Improvements

While Envoy’s filtering capabilities have helped us optimize our logging pipeline and increase the overall throughput, there are several limitations in the current system that drive our future roadmap:

  1. Context-Aware Filtering: The system can’t make decisions based on historical data. For instance, we can’t implement rules like “sample requests from an IP if its previous 200 OK response wasn’t sampled.”
  2. Dynamic Traffic Adaptation: Current filtering rules are static and can’t automatically adjust based on traffic patterns. For example, we can’t automatically increase sampling rates during low-traffic periods or decrease them during traffic spikes.
  3. Real-time Rule Updates: While Envoy supports runtime updates, changing filtering rules requires careful coordination and can’t be done in response to real-time traffic conditions.

To address these limitations, we’re working on:

  1. Implementing dynamic sampling rates that adapt to traffic patterns.
  2. Developing tools to analyze and optimize our sampling coverage.
  3. Creating a feedback loop between our monitoring systems and Envoy configuration.

We’re actively collaborating with the Envoy community to enhance these capabilities, especially around context-aware filtering and more sophisticated sampling strategies.

Advanced Filtering

While we’ve focused on basic filters in this post, Envoy also supports Common Expression Language (CEL) for more complex filtering needs.
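For a flavor of what this looks like, here is a minimal sketch, assuming the CEL access log extension filter is available in the Envoy build in use. The expression is illustrative and is not one of our production rules.

```yaml
# Sketch only: the expression is a hypothetical example.
filter:
  extension_filter:
    name: envoy.access_loggers.extension_filters.cel
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.filters.cel.v3.ExpressionFilter
      # Log only responses with an error status code.
      expression: "response.code >= 400"
```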

We’ll deep dive into CEL expressions in a future post, as they deserve their own detailed exploration.

Key Takeaways

  1. Strategic logging can significantly improve performance while maintaining visibility.
  2. Envoy’s filtering capabilities provide flexible control over logging.
  3. Regular measurement and adjustment of logging strategies is crucial.
  4. Consider both immediate and future needs when designing logging systems.

This is just the beginning of our journey in optimizing Envoy’s observability. Stay tuned for our next post about advanced filtering techniques using CEL expressions.

Written by Rohit Agrawal

Engineer @Databricks focused on scaling network traffic & building service mesh. OSS contributor to Envoy with a focus on new features, optimization, security.
