Clean Up Large S3 Buckets


s3 aws

I found a neat Python tool called s3wipe which significantly speeds up deleting extremely large S3 buckets. It achieves this by using multiple threads and batch deletes.

This really helped me out recently when deleting buckets containing several million objects and versions.
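
To illustrate the two ideas mentioned above (batching and threading), here is a minimal sketch. It is not s3wipe's actual code: `batched`, `parallel_delete` and the `delete_batch` callable are hypothetical names made up for this example, and the worker count is arbitrary. The one hard fact it relies on is that S3's multi-object delete request accepts at most 1000 keys.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

BATCH_SIZE = 1000  # S3's multi-object delete accepts at most 1000 keys per request


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def parallel_delete(object_versions, delete_batch, workers=8):
    """Delete (key, version_id) pairs in batches spread across threads.

    `delete_batch` is a hypothetical callable that removes one batch and
    returns how many entries it deleted. In practice it would wrap a
    multi-object delete request against S3.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each batch becomes one task; the pool runs up to `workers` at a time.
        return sum(pool.map(delete_batch, batched(object_versions, BATCH_SIZE)))
```

Passing `len` as `delete_batch` gives a dry run that only counts what would be deleted, which is a cheap way to check the batching before pointing it at a real bucket.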

Example Usage

Empty a bucket of all objects, and delete the bucket when done.

BUCKET_NAME=project-files-public
docker run -it --rm slmingol/s3wipe \
   --id ${AWS_ACCESS_KEY_ID} \
   --key ${AWS_SECRET_ACCESS_KEY} \
   --path "s3://${BUCKET_NAME}" \
   --delbucket

Remove all objects and versions with a certain prefix, but retain the bucket.

BUCKET_NAME=project-files-public
CLEANUP_PATH=js
docker run -it --rm slmingol/s3wipe \
   --id ${AWS_ACCESS_KEY_ID} \
   --key ${AWS_SECRET_ACCESS_KEY} \
   --path "s3://${BUCKET_NAME}/${CLEANUP_PATH}"

My Use Case

Some time ago we were using an S3 FUSE filesystem for user-uploaded files. The pod liveness probe script would make sure that the filesystem was writable by touching a file named healthz_$random_hash.txt, and then deleting it. We’ve since moved these shared filesystems to EFS, and it was time to decommission the old buckets.

Unfortunately we couldn’t just delete the bucket - AWS insists on buckets being empty before they can be removed. As versioning was enabled on these buckets, every single healthz_ file ever created needed to be listed (to find the version id) and then have its revision explicitly deleted. Let’s do a bit of maths - each environment had 2 buckets, each project had 3 environments, the liveness probe ran every minute, and the S3 filesystem was in place for 2 years before being retired.

That’s about 6,300,000 healthz_ files to be removed. Per project.
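
For anyone who wants to check the figure, the arithmetic can be spelled out as below. It assumes one healthz_ file per bucket per probe run and ignores leap days; the constant names are only for illustration.

```python
BUCKETS_PER_ENVIRONMENT = 2
ENVIRONMENTS_PER_PROJECT = 3
PROBES_PER_DAY = 60 * 24    # the liveness probe runs once a minute
DAYS_IN_SERVICE = 365 * 2   # two years, ignoring leap days

# Assumption: every probe run leaves one healthz_ file behind in each bucket.
files_per_bucket = PROBES_PER_DAY * DAYS_IN_SERVICE
files_per_project = (
    files_per_bucket * BUCKETS_PER_ENVIRONMENT * ENVIRONMENTS_PER_PROJECT
)

print(f"{files_per_bucket:,} files per bucket")    # 1,051,200 files per bucket
print(f"{files_per_project:,} files per project")  # 6,307,200 files per project
```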

The AWS CLI can do this, but it only runs a single thread, meaning the 6 million or so deletes would take several weeks to complete. With s3wipe I was able to clean up and delete the buckets for a project in around 12 hours.
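
For reference, the list-then-delete loop described earlier looks roughly like this when written single-threaded. This is a sketch under assumptions, not what the AWS CLI or s3wipe does internally: `delete_all_versions` is a hypothetical helper, and `s3_client` is assumed to behave like a boto3 S3 client (`boto3.client("s3")`), which is passed in so the function itself has no dependencies.

```python
def delete_all_versions(s3_client, bucket, prefix=""):
    """Delete every object version and delete marker under `prefix`.

    `s3_client` is assumed to behave like boto3.client("s3").
    Returns the number of entries submitted for deletion.
    """
    paginator = s3_client.get_paginator("list_object_versions")
    submitted = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Both real versions and delete markers must go before a
        # versioned bucket counts as empty.
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        if not entries:
            continue
        objects = [{"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries]
        # With the default page size a listing page should hold at most
        # 1000 entries, which fits the limit of one multi-object delete.
        s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": objects, "Quiet": True},
        )
        submitted += len(objects)
    return submitted
```

Every page costs one list request and one delete request in sequence, which is why a single thread struggles once the object count reaches the millions.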