Introduction
As log data grows in volume, managing it effectively becomes crucial for engineering and DevOps teams. Instead of dealing with the entire set of logs, log sampling allows you to collect and analyze a subset of data that accurately represents the whole. This helps you maintain observability, cut down on storage costs, and reduce the overhead of log processing. In this post, we’ll go deeper into log sampling techniques, provide examples, showcase how to implement them with real-world configurations using Fluent Bit and Logstash, and explore the pros and cons of each method.
Log sampling is the process of selecting a subset of log data to reduce volume while maintaining the usefulness of the data for debugging, monitoring, and alerting. By sampling intelligently, you can get insights from your logs without ingesting or storing every single log entry, which can be costly and inefficient.
For example, you might be running a system that generates millions of logs per minute. Analyzing all of this data in real-time would be overwhelming and expensive. Instead, you could apply sampling techniques to select a portion of this data for deeper analysis, without sacrificing accuracy.
Random Sampling
In random sampling, each log entry has an equal probability of being chosen. This method is effective when logs are relatively uniform and there is no need to bias the selection toward particular entries.
Fluent Bit Configuration:
A common way to drop a random percentage of logs in Fluent Bit is the lua filter, which runs a small script against every record and can drop a record by returning -1. Here's how you can configure random sampling:
[INPUT]
    Name    tail
    Path    /var/log/app.log
    Tag     app_logs

[FILTER]
    Name    lua
    Match   app_logs
    Script  sampling.lua
    Call    random_sample

[OUTPUT]
    Name    es
    Match   app_logs
    Host    127.0.0.1
    Port    9200
    Index   sampled-logs
In this configuration, the lua filter calls random_sample for every record; the script (shown below) keeps roughly 10% of logs and drops the other 90%.
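A minimal sampling.lua to go with this configuration might look like the following; the file name, the function name, and the 10% rate are placeholders you would adapt to your setup:

-- sampling.lua: keep roughly 10% of records at random
function random_sample(tag, timestamp, record)
    if math.random(100) <= 10 then
        return 0, timestamp, record    -- 0 keeps the record unchanged
    end
    return -1, timestamp, record       -- -1 drops the record
end

Because each record is decided independently, the kept share hovers around 10% rather than being exact.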
Logstash Configuration:
In Logstash, you can implement random sampling using Ruby within the filter block.
input {
  file {
    path => "/var/log/app.log"
    start_position => "beginning"
  }
}

filter {
  ruby {
    code => "
      sample_rate = 10
      event.cancel if rand(100) >= sample_rate
    "
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "sampled-logs"
  }
}
In this configuration, Logstash will drop logs randomly, keeping 10% of them.
Pros: Simple to configure and statistically unbiased, so the sample reflects the overall mix of traffic.
Cons: Rare but important events, such as infrequent errors, can be dropped entirely.
Systematic Sampling
Systematic sampling involves selecting logs at regular intervals. For example, you might collect every 100th log entry. This is useful when you want to sample logs over time in a consistent manner.
Fluent Bit Configuration:
Fluent Bit filters don't have access to a built-in per-record index, so a simple way to keep every Nth record is a counter maintained inside a lua filter:

[INPUT]
    Name    tail
    Path    /var/log/app.log
    Tag     app_logs

[FILTER]
    Name    lua
    Match   app_logs
    Script  sampling.lua
    Call    every_nth

[OUTPUT]
    Name    es
    Match   app_logs
    Host    127.0.0.1
    Port    9200
    Index   sampled-logs
In this configuration, the lua filter counts records as they pass through and keeps every 100th one; the counter function is shown below.
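A minimal every_nth function for the same script file might look like this; the interval of 100 and the function name are assumptions for illustration:

-- Keep every 100th record that passes through the filter
local counter = 0

function every_nth(tag, timestamp, record)
    counter = counter + 1
    if counter % 100 == 0 then
        return 0, timestamp, record    -- keep this record
    end
    return -1, timestamp, record       -- drop the rest
end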
Logstash Configuration:
input {
  file {
    path => "/var/log/app.log"
    start_position => "beginning"
  }
}

filter {
  ruby {
    # Count events and keep only every 100th one
    init => "@counter = 0"
    code => "
      @counter += 1
      event.cancel if @counter % 100 != 0
    "
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "sampled-logs"
  }
}
This Logstash configuration counts events in the ruby filter and keeps every 100th one. With multiple pipeline workers the counter may not advance strictly in input order, so set pipeline.workers to 1 if the interval needs to be exact.
Pros: Gives predictable, evenly spaced coverage over time and is easy to reason about.
Cons: If log traffic has a periodic pattern that lines up with the interval, some event types can be consistently over- or under-represented.
Stratified Sampling
Stratified sampling divides logs into groups (strata) based on certain criteria (such as log level, geographical region, or user type) and samples within each group. This method ensures that each subgroup is represented in the sample.
Fluent Bit Configuration:
[INPUT]
    Name    tail
    Path    /var/log/app.log
    Tag     app_logs

[FILTER]
    # Keep only ERROR and WARN logs (assumes records are parsed to include a log_level field)
    Name    grep
    Match   app_logs
    Regex   log_level ^(ERROR|WARN)$

[OUTPUT]
    Name    es
    Match   app_logs
    Host    127.0.0.1
    Port    9200
    Index   sampled-logs
In this Fluent Bit configuration, logs are stratified by log level, and only ERROR and WARN logs are retained.
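Keeping only ERROR and WARN is really filtering on two strata; if you want every level represented but at different rates, a lua filter can apply a per-level probability. The sketch below is illustrative only: the rates, the log_level field, and the function name are assumptions rather than part of the configuration above.

-- Percent of records to keep for each log level (hypothetical rates)
local rates = { ERROR = 100, WARN = 100, INFO = 10, DEBUG = 1 }

function stratified_sample(tag, timestamp, record)
    local rate = rates[record["log_level"]] or 10   -- default for unknown levels
    if math.random(100) <= rate then
        return 0, timestamp, record    -- keep
    end
    return -1, timestamp, record       -- drop
end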
Logstash Configuration:
input {
  file {
    path => "/var/log/app.log"
    start_position => "beginning"
  }
}

filter {
  if [log_level] == "ERROR" or [log_level] == "WARN" {
    mutate { add_field => { "stratified_sample" => "true" } }
  } else {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "sampled-logs"
  }
}
This Logstash configuration samples logs based on their log level, keeping only ERROR and WARN logs.
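The same per-level idea can be sketched in Logstash with a ruby filter; the rates and the log_level field here are assumptions for illustration rather than a drop-in replacement for the configuration above:

filter {
  ruby {
    code => "
      # Percent of events to keep per log level (hypothetical rates)
      rates = { 'ERROR' => 100, 'WARN' => 100, 'INFO' => 10, 'DEBUG' => 1 }
      rate = rates.fetch(event.get('log_level'), 10)
      event.cancel if rand(100) >= rate
    "
  }
}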
Pros: Guarantees that every stratum (for example, every log level) appears in the sample, so rare but important categories aren't lost.
Cons: Requires well-defined strata and more configuration, and the sample no longer reflects the natural proportions of the traffic.
Cluster Sampling
In cluster sampling, logs are grouped into clusters (e.g., by server, geographical location, or time window), and then entire clusters are randomly selected for analysis. This technique is useful when log data is naturally grouped in clusters.
Fluent Bit Configuration:
[INPUT]
    Name    tail
    Path    /var/log/app.log
    Tag     app_logs

[FILTER]
    # Keep only logs from the selected cluster (assumes records include a server_name field)
    Name    grep
    Match   app_logs
    Regex   server_name ^server1$

[OUTPUT]
    Name    es
    Match   app_logs
    Host    127.0.0.1
    Port    9200
    Index   sampled-logs
In this Fluent Bit configuration, only logs from server1 are sampled as part of cluster sampling.
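Hard-coding server1 fixes the cluster choice in the configuration itself, so the random selection of clusters has to happen outside the pipeline. If you prefer to keep the selected clusters in one place, a lua filter can hold that set, as in this sketch (the server names, the server_name field, and the function name are assumptions):

-- Clusters (servers) selected for this sampling round
local selected = { server1 = true, server4 = true }

function cluster_sample(tag, timestamp, record)
    local cluster = record["server_name"]
    if cluster ~= nil and selected[cluster] then
        return 0, timestamp, record    -- keep logs from selected clusters
    end
    return -1, timestamp, record       -- drop everything else
end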
Logstash Configuration:
input {
  file {
    path => "/var/log/app.log"
    start_position => "beginning"
  }
}

filter {
  if [server_name] == "server1" {
    mutate { add_field => { "cluster_sample" => "true" } }
  } else {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "sampled-logs"
  }
}
In this Logstash configuration, logs from server1 are sampled, representing a cluster sample.
Pros: Efficient when logs are naturally grouped, and simple to configure because whole clusters are kept or dropped.
Cons: If the selected clusters behave differently from the rest of the fleet, the sample can misrepresent the system as a whole.
Conclusion
Several tools can help automate and streamline the log sampling process; Fluent Bit and Logstash, used throughout this post, are two of the most widely adopted.
Log sampling is a powerful technique that helps developers and DevOps teams manage log volumes effectively while reducing observability costs. By selecting the right sampling techniques and using tools like Fluent Bit and Logstash to automate configurations, teams can maintain system visibility and actionable insights without the need for exhaustive data collection. Whether you choose random, systematic, stratified, or cluster sampling, each method offers unique advantages and trade-offs. The key is to align your sampling strategy with your team's specific observability requirements, ensuring a balance between data volume and critical insights.