Please note that this post, first published over a year ago, may now be out of date.
If you’re familiar with Dockerfiles, you probably know that they consist of a set of instructions. Each instruction results in a new Docker layer. You can use that fact to optimise your deployment times, as well as storage and bandwidth costs, but it requires some strategic planning. Let’s explore how.
Docker layers
In order to make this plan, one needs to understand what a Docker layer is. You may already know that it’s a basic unit of a container image. Internally, you may think of a container image layer as an image itself with an automatically generated ID - instead of a tag assigned by you.
Separate layers are meant to help you optimise image build time, get faster downloads when part of an image is already cached, and create faster feedback loops. That’s because Docker caches each layer and can reuse it in future builds, but only as long as all the previous layers remain unchanged. Changing an earlier layer invalidates the cache, and every subsequent layer is then rebuilt from scratch.
And here’s when careful planning comes into play. Firstly, you want to place instructions which change less frequently before others. It can mean that you are reusing another base image or taking advantage of multi-stage builds. Another tip for maximising cache benefits is to reduce the number of layers. For example, you can combine multiple instructions into one by concatenating them.
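As a rough sketch of both tips, the snippet below writes a variant Dockerfile in which the package installation is chained into one RUN instruction and placed before the source file is copied in. The file name Dockerfile.combined and the hello:combined tag are made up for this illustration; the base image, packages and hello.c are borrowed from the example further down.

```shell
# Write a variant Dockerfile with fewer, better-ordered layers.
# "Dockerfile.combined" is a hypothetical file name chosen for this sketch.
cat > Dockerfile.combined <<'EOF'
FROM ubuntu:18.04

# Rarely changing: install all packages in a single layer.
RUN apt-get update \
 && apt-get install -y curl gcc build-essential

WORKDIR /root

# Frequently changing: the source file comes last, so editing it
# only invalidates the layers from this point on.
COPY hello.c .
RUN gcc -o helloworld hello.c
EOF

# Build from the variant file (needs Docker and a hello.c next to it):
# docker build -f Dockerfile.combined -t hello:combined .
```

With this ordering, a change to hello.c should leave the cached package layer alone, at the price of a longer and less readable RUN line.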
All these options come with their own disadvantages which can include reducing the readability of your Dockerfile. This may result in a poor developer experience or waste your time when trying to understand what the author originally had in mind. Instead of hitting these kinds of problems you may consider using docker-squash.
docker-squash: how to?
If you’re familiar with git’s squash concept, then you may already have an idea what docker-squash is about. If not, think of it as combining multiple image layers into one. First, let’s see it in action with the following Dockerfile:
FROM ubuntu:18.04 AS compile-image
RUN apt-get update
RUN apt-get -qq -y install curl
RUN apt-get install -y gcc build-essential
WORKDIR /root
COPY hello.c .
RUN gcc -o helloworld hello.c
RUN apt-get remove -y gcc build-essential
RUN apt-get -y autoremove
Imagine that you build an image using docker build -t hello:latest . (the trailing dot is the build context) and then check its size by running docker images - I saw 324 MiB. You can then check which layers it consists of with docker history hello:latest. That command will also display the auto-generated ID for each layer, if recorded.
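Put together, those steps might look like the sketch below. The function names inspect_hello and count_layers are my own, and I’m assuming the Dockerfile and hello.c sit in the current directory.

```shell
# Count the entries that `docker history -q` prints (one per line on stdin).
count_layers() {
  grep -c .
}

# Build the example image, then show its size and its layers.
# Assumes the Dockerfile and hello.c are in the current directory.
inspect_hello() {
  docker build -t hello:latest . &&
  docker images hello:latest &&
  docker history hello:latest &&
  docker history -q hello:latest | count_layers
}
```

Running inspect_hello on a machine with Docker prints the image size, the full history, and finally the number of history entries, which is handy when deciding how many layers to squash.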
Now we have some options available to us: leave it as-is; squash all the layers; squash layers down to the selected layer using the layer’s ID, or squash by specifying how many layers we want to squash. This choice should be always made on a case-by-case basis depending on the image structure. You’ll read more about it below, where I will show how to squash down to one layer, and then how to squash everything but the last few layers.
The simple option is to squash everything: install docker-squash and run docker-squash -t hello:squashed hello:latest. I’ve specified the target image and tag using -t, and the source image plus its tag goes last on the command line. What I get is a new tag (hello:squashed) with just one layer.
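Here is a minimal sketch of that step as a reusable function. Installing with pip is an assumption on my part, and squash_all is a name I made up; only the docker-squash invocation itself comes from the command above.

```shell
# Install the tool. Using pip is an assumption; pick whatever suits your setup.
# pip install docker-squash

# Squash every layer of a source image into a new tag, then show the result.
#   $1: source image and tag, $2: target image and tag
squash_all() {
  docker-squash -t "$2" "$1" &&
  docker history "$2" &&
  docker images "$2"
}
```

For the example image this would be called as squash_all hello:latest hello:squashed, after which the history of the new tag and its size are printed for comparison with the original.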
I tried this with a simple build and my target image’s size was reduced by 62.1%. That 324 MiB has become 120 MiB.
What if I want to leave 3 layers untouched? Knowing that my image has 11 layers, I could run docker-squash -f 8 -t hello:squashed hello:latest, specifying how many layers I want to squash. The count starts from the most recent layer, so the 3 oldest layers stay as they are. Also, if I know the layer ID of layer 8, I could specify that instead.
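The partial squash can be wrapped the same way. This is only a sketch: squash_from is a hypothetical helper name, and it relies on my understanding that a numeric -f value is counted from the newest layer downwards.

```shell
# Squash only the newest layers of an image and leave the older ones alone.
#   $1: number of layers to squash (or a layer ID)
#   $2: source image and tag, $3: target image and tag
squash_from() {
  docker-squash -f "$1" -t "$3" "$2" &&
  # Print how many history entries the squashed image ends up with.
  docker history -q "$3" | grep -c .
}

# Squash 8 of the 11 layers in the example image:
# squash_from 8 hello:latest hello:squashed
```

For the 11-layer example I’d expect the final count to cover the 3 untouched layers plus the one squashed layer, but it’s worth checking against docker history on your own image.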
docker-squash: when to?
Remember that by using docker-squash you may lose the benefits of layer caching, as well as of parallel downloads, because a single squashed layer is fetched in one piece. That’s why you should tailor your solution carefully. Below I’ll present some use cases where docker-squash may be particularly useful.
1. Temporary files
Sometimes you need to download some temporary files in one Docker layer just to remove them in a subsequent one. In such a case, they’ll still contribute to your Docker image size, as deleting files in one layer doesn’t remove them from the previous ones. Once you merge these layers, only the net result of the merged instructions is preserved, and your image gets smaller. The same applies to multiple layers modifying the same files: squashing keeps only the combined delta of all merged layers.
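To see the effect for yourself, something like the following sketch could be used. Everything in it is made up for the demonstration: the Dockerfile.tmpfiles name, the tmpfiles image tags, and the 50 MiB file created with dd.

```shell
# Write a throwaway Dockerfile that creates a 50 MiB file in one layer
# and deletes it in the next. File name and size are arbitrary.
cat > Dockerfile.tmpfiles <<'EOF'
FROM ubuntu:18.04
RUN dd if=/dev/zero of=/tmp/blob.bin bs=1M count=50
RUN rm /tmp/blob.bin
EOF

# Build it, squash it, and print both histories for comparison.
compare_tmpfiles() {
  docker build -f Dockerfile.tmpfiles -t tmpfiles:latest . &&
  docker-squash -t tmpfiles:squashed tmpfiles:latest &&
  docker history tmpfiles:latest &&
  docker history tmpfiles:squashed
}
```

After calling compare_tmpfiles, I’d expect the first history to still list a layer of roughly 50 MiB for the dd step even though the file is gone from the final filesystem, while the squashed image should no longer carry that weight.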
2. Keeping things safe
Ideally, there aren’t any secrets written to your layers. Nevertheless, it’s good to know that if you want to avoid the possibility of retrieving some files from a specific layer, removing them in a separate instruction won’t be enough. That’s a perfect reason to use docker-squash.
3. Partial squashing
Imagine that you have a 190 MiB container image for your app, but actually what changes from one release to another is mostly a few megabytes of JavaScript code. Squashing everything would mean downloading the whole 190 MiB every time. Squashing nothing might mean a much bigger download, for example 990 MiB, not 190 MiB. The sweet spot is a partial squash where, usually, you get an update after downloading a few MiB, and everything feels fast.
4. CI/CD pipelines
Preferably your images are built by a CI/CD pipeline, where the Docker image cache starts from scratch. In such a case, you don’t have to worry much about the caching behaviour, and can simply optimise the image for size.
5. Readable Dockerfiles
As mentioned before, it’s recommended to use as few layers as possible. A common technique is to concatenate RUN commands in a Dockerfile, cutting image size but also decreasing the readability. This may significantly raise the barrier to entry for new hires, and also negatively affect the developers’ experience. Instead, you can keep your Dockerfile instructions separate for the sake of simplicity, and squash layers after the image build to reduce the time taken for image pulls and container launches.
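In a pipeline, that could be a single post-build step along these lines. Treat it as a sketch: build_squash_push is a hypothetical helper, the "-build" suffix for the intermediate tag is a convention I invented, and the registry in the usage comment is a placeholder.

```shell
# Build from a readable, one-instruction-per-line Dockerfile, squash the
# result, and push only the squashed tag.
#   $1: image name without tag, $2: release tag
# The "-build" suffix for the intermediate tag is a made-up convention.
build_squash_push() {
  docker build -t "$1:$2-build" . &&
  docker-squash -t "$1:$2" "$1:$2-build" &&
  docker push "$1:$2"
}

# Example call with a placeholder registry:
# build_squash_push registry.example.com/hello 1.0.0
```

Chaining the commands with && means a failed build or squash stops the step before anything is pushed.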
Summary
docker-squash is a useful thing to have in your toolset to make your container images lighter and deployment times faster, without trading away the readability of your Dockerfile. As shown above, the tool is simple to use, but the real value lies in knowing when to use it for maximum benefit.
This blog is written exclusively by The Scale Factory team. We do not accept external contributions.