High Cardinality in Metrics: Challenges, Causes, and Solutions

As engineers, we love data. But there’s a point where having too much data—or more specifically, too many unique values—can become a headache. This is exactly what we mean by "high cardinality." Let's dive into what this term really means, why it’s such a challenge in metrics, and some real-life examples you’ve likely run into.

What Is High Cardinality?

In data terms, cardinality refers to the number of unique values in a dataset. High cardinality means that a column or field has a very large number of unique values. If you think about some of the typical metric labels we handle as DevOps engineers or software developers, things like user IDs, request paths, or device IDs often come with a seemingly endless variety of values.

High cardinality is not inherently bad, but it becomes problematic in observability systems, where it drives up costs, slows down queries, and wastes storage. It’s the difference between collecting insightful metrics and drowning in an avalanche of overly granular data that is difficult to use.
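To see the scale involved: in the worst case, the number of time series a single metric produces is the product of the number of unique values of each of its labels. A quick back-of-the-envelope sketch (the label names and counts below are made up purely for illustration):

# Example (sketch): estimating worst-case series count from label cardinalities
label_cardinalities = {
    'service': 20,         # a handful of services
    'endpoint': 50,        # normalized endpoint templates
    'status_code': 6,      # 2xx, 3xx, 4xx, 5xx, ...
    'user_id': 1_000_000,  # the label that blows everything up
}

series = 1
for label, unique_values in label_cardinalities.items():
    series *= unique_values

print(f'Worst-case time series: {series:,}')  # 6,000,000,000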

How Does High Cardinality Escalate Quickly in Metrics?

High cardinality in metrics can escalate quickly in certain situations, often catching engineers by surprise. This usually happens when fields that are expected to have manageable numbers of unique values start accumulating more unique entries due to changes in usage patterns or increased system complexity. Let’s dive deeper into a few common scenarios where high cardinality in metrics can escalate, with some code examples to illustrate how these situations arise:

  1. User-Specific Labels in Metrics: Consider a situation where metrics are labeled with user IDs to track specific user actions. If millions of users are interacting with your service daily, each user ID represents a unique label value.

# Example: Creating metrics with user-specific labels
from prometheus_client import Counter

user_action_counter = Counter('user_actions', 'Count of user actions', ['user_id'])

def track_user_action(user_id):
    user_action_counter.labels(user_id=user_id).inc()

# Millions of users generating unique metrics
for user_id in range(1, 1000000):
    track_user_action(f'user_{user_id}')

In this example, each user contributes a unique value for the user_id label, producing a massive number of distinct metric series. This can quickly overwhelm Prometheus or any other time-series database due to the sheer number of unique time series being created.
A better approach is to aggregate user actions by cohorts or user segments rather than using individual user IDs as labels, reducing the overall number of unique metric series.
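A minimal sketch of that approach, assuming a simple hash-based cohort assignment (the cohort scheme here is purely illustrative):

# Example (sketch): aggregating user actions by cohort instead of user ID
from prometheus_client import Counter

cohort_action_counter = Counter('user_actions_by_cohort', 'Count of user actions by cohort', ['cohort'])

def cohort_for(user_id):
    # Illustrative bucketing: hash the user ID into a small, fixed set of cohorts.
    return f'cohort_{hash(user_id) % 10}'

def track_user_action(user_id):
    cohort_action_counter.labels(cohort=cohort_for(user_id)).inc()

# A million users now collapse into at most 10 time series.
for user_id in range(1, 1000000):
    track_user_action(f'user_{user_id}')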

  2. Instance-Level Metrics for High-Scale Deployments: If you are monitoring metrics at the instance level across a large-scale deployment, each instance may generate unique metrics. In cloud environments where instances are created and destroyed frequently, this can result in high cardinality.
# Example: Tracking instance metrics
from prometheus_client import Gauge

instance_cpu_usage = Gauge('instance_cpu_usage', 'CPU usage per instance', ['instance_id'])

def log_instance_cpu(instance_id, cpu_usage):
    instance_cpu_usage.labels(instance_id=instance_id).set(cpu_usage)

# Simulating multiple instances with unique IDs
for instance_id in range(1, 10000):
    log_instance_cpu(f'instance_{instance_id}', 75.0)

Here, each cloud instance (instance_id) generates a unique metric series, and as instances are scaled up or down, the cardinality becomes extremely high. Instead of tracking each instance individually, consider aggregating metrics by service or availability zone.
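A sketch of what that aggregation might look like, assuming CPU usage is pre-averaged per service and availability zone before being exported (the service and zone names are placeholders):

# Example (sketch): tracking CPU usage per service and availability zone
from prometheus_client import Gauge

zone_cpu_usage = Gauge('service_zone_cpu_usage', 'Average CPU usage per service and availability zone', ['service', 'zone'])

def report_zone_cpu(service, zone, avg_cpu):
    # avg_cpu is assumed to be aggregated from the instances currently in the zone.
    zone_cpu_usage.labels(service=service, zone=zone).set(avg_cpu)

# Thousands of instances collapse into a handful of service/zone series.
report_zone_cpu('checkout', 'us-east-1a', 68.5)
report_zone_cpu('checkout', 'us-east-1b', 72.0)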

  3. Dynamic Endpoint Monitoring: Metrics that track the performance of individual endpoints can quickly escalate in cardinality if those endpoints contain dynamic components, such as user-specific paths.
# Example: Monitoring dynamic endpoints
from prometheus_client import Histogram

endpoint_latency = Histogram('endpoint_latency', 'Latency of requests to endpoints', ['endpoint'])

def log_endpoint_latency(endpoint, latency):
    endpoint_latency.labels(endpoint=endpoint).observe(latency)

# Logging latency for multiple dynamic endpoints
for i in range(1, 10000):
    log_endpoint_latency(f'/users/{i}/profile', 0.2)

Each unique endpoint (e.g., /users/12345/profile) adds a new metric series, leading to high cardinality. A more efficient approach is to generalize endpoints by stripping out the dynamic components and labeling with the route template (e.g., /users/{user_id}/profile) instead of the concrete path.
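One way to do that normalization before labeling, assuming numeric path segments are the only dynamic parts (the regular expression is a simplistic illustration and would need to match your real URL scheme):

# Example (sketch): normalizing dynamic path segments before labeling
import re
from prometheus_client import Histogram

endpoint_latency = Histogram('endpoint_latency_normalized', 'Latency of requests by endpoint template', ['endpoint'])

def normalize_endpoint(path):
    # Replace numeric IDs so /users/12345/profile and /users/67890/profile
    # map to the same label value: /users/{id}/profile.
    return re.sub(r'/\d+', '/{id}', path)

def log_endpoint_latency(path, latency):
    endpoint_latency.labels(endpoint=normalize_endpoint(path)).observe(latency)

# All of these requests now land in a single series labeled endpoint="/users/{id}/profile"
for i in range(1, 10000):
    log_endpoint_latency(f'/users/{i}/profile', 0.2)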

  4. IoT Device Metrics: In IoT environments, each device often reports metrics independently. With thousands of devices sending frequent telemetry, each with a unique identifier, the number of time series can explode.
# Example: Metrics from IoT devices
from prometheus_client import Gauge

device_temperature = Gauge('device_temperature', 'Temperature reported by IoT devices', ['device_id'])

def report_device_temperature(device_id, temperature):
    device_temperature.labels(device_id=device_id).set(temperature)

# Thousands of devices generating telemetry
for device_id in range(1, 10000):
    report_device_temperature(f'device_{device_id}', 22.5)

Each device_id creates a unique metric series. Aggregating devices by region or model can help reduce the number of unique labels and, consequently, the cardinality of the metrics.
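A sketch of that aggregation, assuming temperatures are averaged per region and device model before being reported (the region and model names are placeholders):

# Example (sketch): reporting temperature by region and device model
from prometheus_client import Gauge

fleet_temperature = Gauge('fleet_avg_temperature', 'Average temperature by device region and model', ['region', 'model'])

def report_fleet_temperature(region, model, avg_temp):
    # avg_temp is assumed to be pre-aggregated from individual device readings.
    fleet_temperature.labels(region=region, model=model).set(avg_temp)

# Thousands of devices collapse into a few region/model combinations.
report_fleet_temperature('eu-west', 'sensor-v2', 22.4)
report_fleet_temperature('us-east', 'sensor-v2', 23.1)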

Hidden Causes of High Cardinality in Metrics

There are also scenarios that developers typically don’t think of when designing metrics, which can lead to unexpected high cardinality:

  1. Frequent Updates with Dynamic Labels: Metrics that include frequently changing label values, such as timestamps or request IDs, can lead to unexpected high cardinality. For example, if you include a request ID in your labels, every request generates a new unique label, leading to an explosion in the number of metric series.
  2. Auto-Scaling and Ephemeral Infrastructure: In environments that use auto-scaling, new instances are frequently spun up and destroyed. If each instance has unique identifiers included in metric labels, such as instance IDs or container IDs, this can cause the number of unique series to grow rapidly. Ephemeral infrastructure makes it particularly easy to underestimate the impact of cardinality.
  3. Debug Labels: Adding detailed debug information as labels to metrics might seem like a good idea during development or troubleshooting, but it can drastically increase cardinality if those labels are not removed afterward. Labels like debug_mode=true or trace_id can introduce an enormous variety of values, especially in production environments.
  4. User-Agent Metrics: Including user-agent strings as labels to track metrics per browser or device type can lead to high cardinality because user-agent strings vary enormously. Instead of including full user-agent strings, consider categorizing by browser type or version, as in the sketch after this list.
  5. Custom Tags: Allowing users to define their own tags or custom attributes can lead to unpredictable cardinality. For example, metrics labeled with user-defined tags can result in a wide variety of unique values, which are difficult to predict and control.
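For the user-agent case in particular, a small normalization step in front of the metric keeps cardinality bounded. A minimal sketch, assuming only a coarse browser family matters (the string matching is deliberately simplistic):

# Example (sketch): collapsing raw user-agent strings into a few browser families
from prometheus_client import Counter

requests_by_browser = Counter('requests_by_browser', 'Requests grouped by browser family', ['browser'])

def browser_family(user_agent):
    ua = user_agent.lower()
    # Order matters: Edge and Chrome user agents both contain "chrome".
    if 'edg' in ua:
        return 'edge'
    if 'chrome' in ua:
        return 'chrome'
    if 'firefox' in ua:
        return 'firefox'
    if 'safari' in ua:
        return 'safari'
    return 'other'

def track_request(user_agent):
    requests_by_browser.labels(browser=browser_family(user_agent)).inc()

# However varied the raw strings are, the label only ever takes five values.
track_request('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36')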

Why Is High Cardinality a Problem in Metrics?

  • Increased Costs: Storage of high-cardinality metrics can get expensive. Every unique label combination generates a separate time series, and if you’re generating millions of distinct time series, the storage costs add up. Observability solutions often charge based on the amount of data ingested and stored, which means uncontrolled cardinality can have a significant financial impact.
  • Slow Queries: High cardinality makes querying metrics significantly slower. Systems like Prometheus struggle when asked to aggregate across many unique time series. Queries that need to aggregate metrics labeled by user IDs or instance IDs, for example, may time out or take an impractical amount of time to execute.
  • Scaling Issues: High cardinality puts immense pressure on the underlying infrastructure, especially in distributed systems. Time-series databases need to manage large volumes of unique series, which can lead to data imbalance across nodes and affect query performance and availability. This imbalance can cause nodes to become overwhelmed with data, impacting system reliability.
  • Operational Complexity: Managing a high number of unique metric series can lead to operational complexity. Maintaining indices, managing query performance, and ensuring system stability all become more challenging as cardinality grows. Engineers need to spend more time on infrastructure maintenance and performance tuning, reducing the time available for feature development or improving user experience.

How to Solve High Cardinality Issues

Addressing high cardinality requires a combination of strategies to reduce the number of unique label values in your metrics. Here are some effective techniques to help you get started:

  1. Aggregate Labels: Instead of using highly unique labels (e.g., user IDs or instance IDs), consider aggregating metrics at a higher level, such as user cohorts, regions, or service tiers. This allows you to retain valuable insights without incurring the cost of extremely high cardinality.
  2. Bucketing: Group data into predefined buckets to reduce the number of unique values. For instance, rather than storing exact latency values or detailed metrics per request, bucket these values into ranges (e.g., 0-100ms, 100-200ms, etc.). This can significantly lower the number of unique metric series while still providing a clear performance overview; see the histogram sketch after this list.
  3. Label Whitelisting: Implement strict policies around which labels can be added to your metrics. Review your metrics and eliminate unnecessary labels that do not provide value. This can prevent metrics from including unpredictable labels, such as request IDs or other dynamically generated values.
  4. Use Exemplars: Instead of adding high-cardinality labels to all metrics, use exemplars to trace specific data points. Exemplars can add additional context to a small sample of metrics without significantly increasing overall cardinality. This approach works well in systems like Prometheus.
  5. Sample Data: Consider sampling your metrics when tracking high-cardinality values. For instance, instead of logging every single request, log only a sample (e.g., 1%) of requests that contain high-cardinality information. This helps to keep cardinality under control while still retaining insight into the behavior of the system.
  6. Normalize Labels: Remove or generalize dynamic parts of labels. For example, instead of labeling metrics with full URLs, replace dynamic components with placeholders (e.g., /users/{user_id}/profile). This approach helps reduce the number of distinct label combinations, significantly lowering cardinality.
  7. Periodic Cleanup: Introduce automated jobs to clean up old, unused, or low-value metrics from your time-series database. High cardinality often results from historical data being retained longer than necessary. Periodic cleanup ensures that only relevant data is retained, which helps maintain system efficiency.
  8. Batch Metrics by Service Level: Aggregate metrics at the service level rather than the individual instance or user level. Metrics like response times or error rates can be tracked by the service or availability zone, which provides sufficient insight for monitoring without the overhead of tracking every individual instance.
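To make the bucketing idea from point 2 concrete: Prometheus histograms already work this way for numeric values. You choose a fixed set of buckets up front, and every observation falls into one of them instead of each distinct value spawning its own series. A minimal sketch (the bucket boundaries are arbitrary examples):

# Example (sketch): bucketing request latencies with fixed histogram buckets
from prometheus_client import Histogram

request_latency = Histogram(
    'request_latency_seconds',
    'Request latency bucketed into fixed ranges',
    ['service'],
    buckets=[0.1, 0.2, 0.5, 1.0, 2.0, 5.0],  # 0-100ms, 100-200ms, ..., plus +Inf
)

def observe_latency(service, latency_seconds):
    request_latency.labels(service=service).observe(latency_seconds)

# No matter how many distinct latency values arrive, the series count stays fixed:
# one series per bucket per service, not one per observed value.
observe_latency('checkout', 0.137)
observe_latency('checkout', 1.42)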