Creating a Text Histogram of S3 Bucket Object Sizes Using Bash

Andy Rea
Nov 6, 2023

In cloud storage, understanding the distribution of object sizes can be important for optimizing performance and cost. The AWS CLI has no single command that prints this distribution for an Amazon S3 bucket. With a short Bash script, however, we can generate a text histogram that shows how many objects fall into each size category.

The Script

The script takes a single argument: the name of the S3 bucket to analyze. It pipes the output of one AWS CLI command through `tr` and AWK to print a simple text-based histogram of the object sizes in that bucket. It assumes the AWS CLI is installed and configured with credentials that are allowed to list the bucket.

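#!/usr/bin/env bash
# Print a text histogram of object sizes for one S3 bucket.

# Stop on errors, unset variables, and failures anywhere in the pipeline.
set -euo pipefail

# The bucket name is required as the first argument.
bucket_name="${1:?usage: $0 <bucket-name>}"

# Fetch every object size in bytes, one per line, then count sizes per bin.
aws s3api list-objects-v2 --bucket "$bucket_name" --query 'Contents[].Size' --output text | tr '\t' '\n' | \
awk '
# Skip tokens that are not byte counts, such as the "None" the CLI can print for an empty bucket.
!/^[0-9]+$/ { next }
{
# Bin edges use binary units: 1 KB = 1024 bytes, 1 MB = 1024 * 1024 bytes.
if ($1 < 1024) bin["0-1KB"]++;
else if ($1 < 10240) bin["1KB-10KB"]++;
else if ($1 < 102400) bin["10KB-100KB"]++;
else if ($1 < 1048576) bin["100KB-1MB"]++;
else if ($1 < 10485760) bin["1MB-10MB"]++;
else if ($1 < 104857600) bin["10MB-100MB"]++;
else bin["100MB+"]++;
}
END {
# AWK does not define the iteration order, so bins print in no particular order.
for (b in bin) {
print b ": " bin[b]
}
}'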

How It Works

1. Input: The script expects the bucket name as its first argument.

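2. Fetch Object Sizes: It uses the AWS CLI’s `list-objects-v2` command with a `--query` filter to fetch the size, in bytes, of every object in the bucket. The CLI follows pagination automatically, so buckets with more than 1,000 objects are covered, although the listing takes longer as the object count grows.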

3. Text Processing: The output is piped through `tr` to replace tabs with newlines, creating a list where each size is on its own line.

4. AWK Script: The sizes are then piped into an AWK script, which assigns each one to a predefined size bin. Each size is checked against a chain of if-else conditions, and the counter for the first matching bin is incremented.

5. Output: Once all sizes have been categorized, the AWK script prints the bin labels and counts in no particular order.
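Steps 3 to 5 can be tried without touching AWS by feeding sample sizes into the same pipeline. The sketch below is a minimal illustration, not part of the original script: `bin_sizes` is a hypothetical helper name, the bin edges assume binary units (1 MB = 1,048,576 bytes), and the final `sort` is there only to make the demo output stable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): read byte counts
# from stdin, tab- or newline-separated, and print one "label: count" line
# per non-empty bin.
bin_sizes() {
  tr '\t' '\n' | awk '
    # Ignore blank lines and non-numeric tokens.
    !/^[0-9]+$/ { next }
    {
      # Assumed bin edges: binary units, 1 MB = 1024 * 1024 bytes.
      if ($1 < 1024) bin["0-1KB"]++;
      else if ($1 < 10240) bin["1KB-10KB"]++;
      else if ($1 < 102400) bin["10KB-100KB"]++;
      else if ($1 < 1048576) bin["100KB-1MB"]++;
      else if ($1 < 10485760) bin["1MB-10MB"]++;
      else if ($1 < 104857600) bin["10MB-100MB"]++;
      else bin["100MB+"]++;
    }
    END { for (b in bin) print b ": " bin[b] }
  ' | LC_ALL=C sort   # sort only to make the demo output stable
}
```

For example, `printf '512\t2048\t4096\t5242880\n' | bin_sizes` mimics the tab-separated text the AWS CLI emits and reports one object in `0-1KB`, two in `1KB-10KB`, and one in `1MB-10MB`.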

Example Output and Interpretation

The example output of the script is straightforward:


0-1KB: 2
10KB-100KB: 112
100KB-1MB: 109
1KB-10KB: 1
1MB-10MB: 12

This output tells us that the bucket contains:

- Two objects of size between 0 and 1KB.
- One object of size between 1KB and 10KB.
- 112 objects of size between 10KB and 100KB.
- 109 objects of size between 100KB and 1MB.
- Twelve objects of size between 1MB and 10MB.

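No objects of 10MB or larger were found in this case, so the 10MB-100MB and 100MB+ bins do not appear at all: the script only prints bins that contain at least one object.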
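Because the bins arrive unordered and empty bins are missing, the raw output can be awkward to scan. One possible post-processing step, sketched here under assumptions, re-prints the `label: count` lines in ascending size order, fills in zero counts, and draws a bar scaled so the largest bin is 40 characters wide. `ordered_histogram` is a hypothetical helper name, and the label list is hard-coded to match the bins used by the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): read "label: count"
# lines from stdin and print every bin in size order with a scaled "#" bar.
ordered_histogram() {
  awk -F': ' '
    { count[$1] = $2 }
    END {
      # Assumed label list; it must match the labels the script prints.
      n = split("0-1KB 1KB-10KB 10KB-100KB 100KB-1MB 1MB-10MB 10MB-100MB 100MB+", labels, " ")
      max = 0
      for (i = 1; i <= n; i++) {
        c = count[labels[i]] + 0
        if (c > max) max = c
      }
      for (i = 1; i <= n; i++) {
        c = count[labels[i]] + 0
        # Scale the bar to at most 40 characters; non-empty bins get at least one.
        width = (max > 0) ? int(c * 40 / max) : 0
        if (c > 0 && width == 0) width = 1
        bar = ""
        for (j = 0; j < width; j++) bar = bar "#"
        line = sprintf("%-11s %5d", labels[i], c)
        if (bar != "") line = line " " bar
        print line
      }
    }
  '
}
```

Assuming the main script is saved under a name such as `s3-histogram.sh` (a hypothetical filename), it could be used as `./s3-histogram.sh my-bucket | ordered_histogram`. Scaling to the largest bin keeps the bars readable for any bucket size, at the cost of hiding absolute magnitudes, which is why the count is printed next to each bar.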

Use Cases

This histogram can be particularly useful for:

- Cost Analysis: S3 bills storage and data transfer by volume and requests per call, so a bucket dominated by many small objects has a different cost profile from one holding a few large objects.
- Performance Optimization: Knowing the size distribution can help in optimizing performance, as larger files may take longer to process or transfer.
- Storage Management: Knowing the distribution can inform storage lifecycle policies, such as moving rarely accessed large files to cheaper storage classes.
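For these use cases, object counts tell only part of the story, since a handful of large objects can hold most of the stored bytes. As a rough sketch, the same binning logic can also accumulate total bytes per bin. `bin_totals` is a hypothetical helper name, the bin edges again assume binary units, and the input is the one-size-per-line list produced by the `aws ... | tr` part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): read one byte count
# per line from stdin and print object count and total bytes per non-empty bin.
bin_totals() {
  awk '
    # Ignore blank lines and non-numeric tokens.
    !/^[0-9]+$/ { next }
    {
      # Assumed bin edges: binary units, 1 MB = 1024 * 1024 bytes.
      if ($1 < 1024) b = "0-1KB";
      else if ($1 < 10240) b = "1KB-10KB";
      else if ($1 < 102400) b = "10KB-100KB";
      else if ($1 < 1048576) b = "100KB-1MB";
      else if ($1 < 10485760) b = "1MB-10MB";
      else if ($1 < 104857600) b = "10MB-100MB";
      else b = "100MB+";
      count[b]++
      bytes[b] += $1
    }
    END {
      # %.0f avoids integer overflow in AWK builds with a narrow %d.
      for (b in count) printf "%s: count=%d bytes=%.0f\n", b, count[b], bytes[b]
    }
  ' | LC_ALL=C sort   # sort only to make the output stable
}
```

For instance, `printf '100\n200\n2048\n' | bin_totals` reports two objects totalling 300 bytes in `0-1KB` and one object of 2048 bytes in `1KB-10KB`. A bin with few objects but a large byte total is a natural first candidate when reviewing lifecycle rules.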

Conclusion

With a simple Bash script, S3 users can gain valuable insights into their storage patterns, enabling better decision-making regarding cost, performance, and storage management. The histogram is a powerful visualization tool, even in its text-based form, for quick and actionable analytics.
