An Introductory Guide to Log Sampling

Introduction

As log data grows in volume, managing it effectively becomes crucial for engineering and DevOps teams. Instead of dealing with the entire set of logs, log sampling allows you to collect and analyze a subset of data that accurately represents the whole. This helps you maintain observability, cut down on storage costs, and reduce the overhead of log processing. In this post, we’ll look at common log sampling techniques, show how to implement them with real-world Fluent Bit and Logstash configurations, and explore the pros and cons of each method.

What Is Log Sampling?

Log sampling is the process of selecting a subset of log data to reduce volume while maintaining the usefulness of the data for debugging, monitoring, and alerting. By sampling intelligently, you can get insights from your logs without ingesting or storing every single log entry, which can be costly and inefficient.

For example, you might be running a system that generates millions of logs per minute. Analyzing all of this data in real-time would be overwhelming and expensive. Instead, you could apply sampling techniques to select a portion of this data for deeper analysis, without sacrificing accuracy.

Types of Log Sampling Techniques With Fluent Bit and Logstash Configurations

1. Random Sampling

In random sampling, each log entry has an equal probability of being chosen. This method is effective when logs are relatively uniform, and there is no need for bias in selection.

Fluent Bit Configuration:

Fluent Bit doesn’t ship a dedicated random-sampling filter, but you can approximate one with the lua filter by dropping each record with a fixed probability. Here’s a sketch of how random sampling could be configured (the script name sample.lua is just an example):

[INPUT]
    Name tail
    Path /var/log/app.log
    Tag app_logs

[FILTER]
    Name lua
    Match app_logs
    Script sample.lua
    Call sample

[OUTPUT]
    Name es
    Match app_logs
    Host 127.0.0.1
    Port 9200
    Index sampled-logs

The referenced Lua script makes the sampling decision:

-- sample.lua: keep roughly 10% of records, drop the remaining 90%
function sample(tag, timestamp, record)
    if math.random(100) <= 10 then
        return 0, timestamp, record   -- keep the record unchanged
    end
    return -1, timestamp, record      -- drop the record
end

In this configuration, Fluent Bit drops roughly 90% of logs at random, keeping only about 10%.

Logstash Configuration:

In Logstash, you can implement random sampling with a small Ruby snippet in the filter block:

input {
    file {
        path => "/var/log/app.log"
        start_position => "beginning"
    }
}

filter {
    ruby {
        # Keep roughly 10% of events at random; cancel (drop) the rest
        code => "
            sample_rate = 10
            event.cancel if rand(100) >= sample_rate
        "
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "sampled-logs"
    }
}

In this configuration, Logstash will drop logs randomly, keeping 10% of them.

Pros:

  • Unbiased selection; every log entry has the same chance of being kept.
  • Simple to implement and effective for a general overview.

Cons:

  • Not always representative, as rare events like errors might be missed (see the sketch after this list for one way to mitigate this).
  • No guarantee of even coverage across time or context, which can skew analysis when either is critical.
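One common way to soften the first drawback is to exempt high-severity events from random sampling and sample only the rest. A minimal Logstash sketch, assuming each event carries a parsed log_level field:

filter {
    if [log_level] != "ERROR" {
        ruby {
            # Non-error events: keep roughly 10% at random, drop the rest
            code => "event.cancel if rand(100) >= 10"
        }
    }
    # ERROR events bypass the sampler and are always kept
}

This keeps the pipeline cheap for routine INFO and DEBUG traffic while guaranteeing that every error reaches your backend.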

2. Systematic Sampling

Systematic sampling involves selecting logs at regular intervals. For example, you might collect every 100th log entry. This is useful when you want to sample logs over time in a consistent manner.

Fluent Bit Configuration:

Fluent Bit doesn’t keep a built-in per-record index, so a simple way to take every Nth record is again the lua filter, this time with a small counter (the script name systematic.lua is just an example):

[INPUT]
    Name tail
    Path /var/log/app.log
    Tag app_logs

[FILTER]
    Name lua
    Match app_logs
    Script systematic.lua
    Call every_nth

[OUTPUT]
    Name es
    Match app_logs
    Host 127.0.0.1
    Port 9200
    Index sampled-logs

-- systematic.lua: keep every 100th record seen by this filter instance
counter = 0

function every_nth(tag, timestamp, record)
    counter = counter + 1
    if counter % 100 == 0 then
        return 0, timestamp, record   -- keep the record unchanged
    end
    return -1, timestamp, record      -- drop the record
end

In this configuration, Fluent Bit keeps every 100th log entry it processes.

Logstash Configuration:

input {
    file {
        path => "/var/log/app.log"
        start_position => "beginning"
    }
}

filter {
    ruby {
        # Count events and keep every 100th one (the interval is approximate
        # when multiple pipeline workers run in parallel)
        init => "@counter = 0"
        code => "
            @counter += 1
            event.cancel if @counter % 100 != 0
        "
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "sampled-logs"
    }
}

This Logstash configuration will keep every 100th log entry, ensuring systematic sampling.

Pros:

  • Consistent coverage over time, making it ideal for long-term trends.
  • Easy to automate, and effective for steady log streams.

Cons:

  • May miss patterns, particularly if log events recur at intervals that line up with the sampling interval.
  • Less flexible for logs generated sporadically or in bursts, potentially missing critical data.

3. Stratified Sampling

Stratified sampling divides logs into groups (strata) based on certain criteria (such as log level, geographical region, or user type) and samples within each group. This method ensures that each subgroup is represented in the sample.

Fluent Bit Configuration:

[INPUT]
    Name tail
    Path /var/log/app.log
    Tag app_logs

[FILTER]
    Name grep
    Match app_logs
    Regex log_level ^(ERROR|WARN)$

[OUTPUT]
    Name es
    Match app_logs
    Host 127.0.0.1
    Port 9200
    Index sampled-logs

In this Fluent Bit configuration, logs are stratified by log level and only ERROR and WARN logs are retained. This assumes the records are parsed so that each one carries a log_level field.

Logstash Configuration:

input {
    file {
        path => "/var/log/app.log"
        start_position => "beginning"
    }
}

filter {
    if [log_level] == "ERROR" or [log_level] == "WARN" {
        mutate { add_field => { "stratified_sample" => "true" } }
    } else {
        drop {}
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "sampled-logs"
    }
}

This Logstash configuration samples logs based on their log level, keeping only ERROR and WARN logs.
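The configurations above keep or drop whole strata. A fuller stratified approach samples within each stratum at its own rate, for example keeping every ERROR but only a fraction of INFO logs. A minimal Logstash sketch, assuming a parsed log_level field and purely illustrative rates:

filter {
    ruby {
        code => "
            # Per-stratum keep rates in percent (illustrative values)
            rates = { 'ERROR' => 100, 'WARN' => 50, 'INFO' => 10 }
            rate = rates.fetch(event.get('log_level'), 5)
            event.cancel if rand(100) >= rate
        "
    }
}

Each event is kept with the probability assigned to its stratum, so every group stays represented while low-value strata are thinned out.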

Pros:

  • Ensures that all subgroups (e.g., error and info logs) are represented, reducing the risk of missing critical information.
  • Ideal for focused analysis of specific log types or regions.

Cons:

  • Requires additional setup to identify and segment the logs.
  • Adds complexity when logs are relatively uniform, which may not justify the overhead.

4. Cluster Sampling

In cluster sampling, logs are grouped into clusters (e.g., by server, geographical location, or time window), and then entire clusters are randomly selected for analysis. This technique is useful when log data is naturally grouped in clusters.

Fluent Bit Configuration:

[INPUT]
    Name tail
    Path /var/log/app.log
    Tag app_logs

[FILTER]
    Name grep
    Match app_logs
    Regex server_name ^server1$

[OUTPUT]
    Name es
    Match app_logs
    Host 127.0.0.1
    Port 9200
    Index sampled-logs

In this Fluent Bit configuration, only logs from the server1 cluster are kept. This assumes each record carries a server_name field identifying its cluster.

Logstash Configuration:

input {
    file {
        path => "/var/log/app.log"
        start_position => "beginning"
    }
}

filter {
    if [server_name] == "server1" {
        mutate { add_field => { "cluster_sample" => "true" } }
    } else {
        drop {}
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "sampled-logs"
    }
}

In this Logstash configuration, logs from server1 are sampled, representing a cluster sample.
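Hard-coding server1 selects a single cluster by hand. To select clusters at random but consistently, so that every log from a chosen server is kept and every log from an unchosen server is dropped, you can hash the cluster key and keep a fixed share of the hash buckets. A Logstash sketch, assuming a server_name field and an illustrative one-in-ten selection:

filter {
    ruby {
        init => "require 'zlib'"
        code => "
            # Stable hash of the cluster key; keep roughly 1 in 10 clusters
            name = event.get('server_name').to_s
            event.cancel if Zlib.crc32(name) % 10 != 0
        "
    }
}

Because the hash is deterministic, the same servers are selected on every run, which keeps each sampled cluster complete.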

Pros:

  • Efficient for naturally grouped logs, reducing complexity when managing large-scale systems.
  • Less overhead as entire clusters can be sampled at once.

Cons:

  • Risk of cluster bias, where selected clusters may not represent the entire system.
  • Requires that logs are already grouped, which might not always be the case.

Tools and Technologies for Log Sampling

Several tools can help automate and streamline the log sampling process:

  1. Elastic Stack (ELK): Elastic Stack lets you configure log collection rules in Logstash or Beats to sample logs based on specific conditions.
  2. Datadog: Datadog offers out-of-the-box log sampling features, allowing you to filter or sample logs based on attributes such as log level, tags, or service.
  3. Fluentd: Fluentd, a popular log collector, can also implement log sampling through configuration.
  4. Splunk: Splunk offers advanced sampling capabilities through its search processing language (SPL), which allows teams to define sampling rules to reduce data volume without sacrificing observability.
  5. AWS CloudWatch Logs: AWS CloudWatch Logs offers sampling features to reduce storage and ingestion costs, allowing teams to sample logs based on specific conditions or thresholds.
  6. Google Cloud Logging: Google Cloud Logging enables log sampling through configuration, helping teams to optimize observability in cloud environments.

Conclusion

Log sampling is a powerful technique that helps developers and DevOps teams manage log volumes effectively while reducing storage and processing costs. Whichever technique you choose, random, systematic, stratified, or cluster, the goal is the same: keep enough signal for debugging, monitoring, and alerting without ingesting and storing every log entry.