Image serving with Content Addressable Storage, S3, and ImgIx

03 Jun 2018

Marketplacer's websites previously served web-optimised versions of user-provided images from the same web server as the application itself, and stored both the original image and the web-optimised versions on disk. Each image's filename was stored in a column on the related model, and uploads were handled with the Ruby gem carrierwave.

Marketplacer's websites would often have many duplicate images stored against database records and on disk:

a templating system allowed users to copy manufacturer provided images to their own adverts
users would duplicate existing records between stores of the same franchise
different sites with the same models would use the same original images

These duplicate images, along with handling the web-optimisation of the images and storage server-side, presented problems:

storing images on disk made it difficult for us to horizontally scale without using shared filesystems
development & test environments using production data would refer to images and assets that did not exist in the local environment
copying images from templates or between users could take time and use additional disk space
changing our web-optimisation method (eg, image size, compression ratio) required us to re-process every single uploaded image

To correct these problems we decided to rebuild our image processing & serving architecture based on ImgIx, S3 and a concept called "Content Addressable Storage".

ImgIx allowed us to optimise our images for the web on the fly by specifying the desired resolution, compression and image format in the URL provided to the user. For added security, we opted to sign these URLs so that a user could not retrieve the original asset by modifying the URL.
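To illustrate why a signed URL cannot be altered to fetch a different rendition, here is a minimal sketch of URL signing. It assumes the signature is an MD5 hex digest of the secret token followed by the path and query string, appended as an `s` parameter; that scheme should be checked against ImgIx's own documentation before use. The token value, the helper names `signed_image_url` and `valid_signature?`, and the use of `URI.encode_www_form` for parameter encoding are all assumptions made for illustration.

```ruby
require "digest"
require "uri"

# Hypothetical values for illustration only.
IMGIX_HOST  = "marketplacer.imgix.net"
IMGIX_TOKEN = "example-secret-token"

# Build a signed URL for an image path (with leading slash) and rendering params.
# Assumed scheme: s = MD5(token + path + "?" + query), appended as the last param.
def signed_image_url(path, params = {}, token: IMGIX_TOKEN, host: IMGIX_HOST)
  query = URI.encode_www_form(params)
  to_sign = query.empty? ? path : "#{path}?#{query}"
  signature = Digest::MD5.hexdigest(token + to_sign)
  separator = query.empty? ? "?" : "&"
  "https://#{host}#{to_sign}#{separator}s=#{signature}"
end

# Mimic the check the image service would perform: recompute the signature
# from the rest of the URL and compare it with the one supplied.
def valid_signature?(url, token: IMGIX_TOKEN)
  uri = URI.parse(url)
  match = uri.query.to_s.match(/\A(?:(.*)&)?s=([0-9a-f]{32})\z/)
  return false unless match

  unsigned_query, supplied = match[1], match[2]
  to_sign = unsigned_query ? "#{uri.path}?#{unsigned_query}" : uri.path
  Digest::MD5.hexdigest(token + to_sign) == supplied
end

url = signed_image_url("/Cg/qfKmdylCVXq1NV12r0Qvj2XgE", { auto: "format", w: 1600 })
puts url
puts valid_signature?(url)                           # the untouched URL verifies
puts valid_signature?(url.sub("w=1600", "w=9999"))   # a tampered URL does not
```

Because the token never leaves the server, a user who edits or strips the parameters cannot produce a matching signature.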

We created a single S3 bucket for ImgIx to use to source the image originals. This S3 bucket would be used by Marketplacer websites, across development, test, staging and production, and for all asset types. We were able to do this without risk of corrupting production assets due to the nature of Content Addressable Storage.

Content Addressable Storage means that every image's location in S3 is derived from its contents. Two files with different contents will always have a different path & filename, and two files with the same contents will always have the same path & filename.

Here's how it works:

a user uploads an image to a web application
the image is stored in a temporary directory
we create a SHA digest of its content (eg CgqfKmdylCVXq1NV12r0Qvj2XgE)
we infer the image's location in S3 based on that digest (Cg/qfKmdylCVXq1NV12r0Qvj2XgE)
we check S3 to determine whether we already have an image in this location
if we don't, we upload the image to the location
we store the image's location in the database on the relevant database record
we later serve a link to a processed version of that image via ImgIx (eg https://marketplacer.imgix.net/Cg/qfKmdylCVXq1NV12r0Qvj2XgE?auto=format&fm=pjpg&fit=max&w=1600)
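
The digest, location and upload steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the 27-character example digest is consistent with a SHA-1 digest in unpadded URL-safe Base64, so the sketch assumes that encoding, and `MemoryBucket` is a hypothetical in-memory stand-in for the S3 bucket so the example runs without AWS credentials.

```ruby
require "digest"
require "base64"

# Hypothetical stand-in for the S3 bucket: a hash from key to bytes.
class MemoryBucket
  def initialize
    @objects = {}
  end

  def exist?(key)
    @objects.key?(key)
  end

  def put(key, bytes)
    @objects[key] = bytes
  end

  def size
    @objects.size
  end
end

# Derive the storage key from the content alone (assumed encoding: unpadded
# URL-safe Base64 of the SHA-1 digest), using the first two characters as a
# directory prefix.
def content_key(bytes)
  digest = Base64.urlsafe_encode64(Digest::SHA1.digest(bytes), padding: false)
  "#{digest[0, 2]}/#{digest[2..-1]}"
end

# Upload only when nothing exists at the content-derived key, then return the
# key so it can be stored on the relevant database record.
def store_asset(bucket, bytes)
  key = content_key(bytes)
  bucket.put(key, bytes) unless bucket.exist?(key)
  key
end

bucket = MemoryBucket.new
first  = store_asset(bucket, "image bytes")
second = store_asset(bucket, "image bytes") # duplicate upload, nothing new stored
puts first == second
puts bucket.size
```

Since an existing key is never overwritten, an upload from any environment can only add objects to the bucket, never change one that is already there.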

During the first month of rolling this mechanism out to production, 5% of all uploads were duplicates of images already in the bucket.

We don't only use this method for images - since early 2017, all user-provided assets served by Marketplacer applications are served via this method. Non-image assets are now served via signed CloudFront links connecting to this S3 bucket.

We never remove assets from this S3 bucket (as storing small things in S3 is relatively cheap). The entire bucket is backed up to a server external to AWS every day.

Using SHA means that it is effectively impossible for two different assets to end up with the same name. This means that an uploaded asset will never be replaced by a different one in the same location.
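
To put a rough number on "effectively impossible", the birthday bound approximates the chance that any two of n distinct assets share a b-bit digest as n² / 2^(b+1). The sketch below assumes a 160-bit digest (consistent with the 27-character example above) and a purely hypothetical count of one billion assets; neither figure comes from our production data.

```ruby
# Approximate probability that at least two of `count` distinct assets share a
# digest of `bits` bits, using the birthday bound p ≈ n^2 / 2^(bits + 1).
def collision_probability(count, bits)
  (count.to_f**2) / (2.0**(bits + 1))
end

# Hypothetical figures: one billion assets, 160-bit digest.
puts collision_probability(1_000_000_000, 160)
```

With those figures the result is on the order of 10^-31. This estimate covers accidental collisions between ordinary files only.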

As we rely on signed links, and as the S3 bucket is private, we can even use this method to serve & control access to paid content.

In conclusion:

using S3 to store images, and ImgIx & CloudFront to serve them, allows us to scale horizontally without using NFS to share files between servers
as images are stored in a single bucket, development & test environments can link to them without worrying about missing content
there is no risk of development or test corrupting production data, as each image uploaded to S3 is named based on its content, making each individual asset effectively immutable
copying images from templates or between customers is now as simple as copying a string between database records
we only need to change the query parameters sent to ImgIx to change our web-optimisation method