Understanding how caching works in GitHub Actions

Keep CI running fast

Joel Clermont

2024-03-29

GitHub provides a first-party action for caching in your workflows.

It's relatively easy to use. You configure it with a path to cache and a key to identify when the cache should be restored.

Here's an example to cache the vendor directory in a PHP project:

  - name: Cache composer dependencies
    uses: actions/cache@v4
    with:
      path: vendor
      key: composer-${{ hashFiles('composer.lock') }}

If nothing changes in your composer.lock file, then your workflow can just reload the vendor folder from the last time your workflow ran. This can save quite a bit of time on your CI runs.
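
The cache step also reports whether it found an exact match, which can be used to skip the install step entirely. Here's a minimal sketch, assuming a step id of composer-cache (a name picked for illustration) and that a plain composer install is how your project installs its dependencies:

  - name: Cache composer dependencies
    id: composer-cache  # hypothetical id, referenced by the next step
    uses: actions/cache@v4
    with:
      path: vendor
      key: composer-${{ hashFiles('composer.lock') }}

  - name: Install composer dependencies
    # cache-hit is 'true' only when the key matched exactly
    if: steps.composer-cache.outputs.cache-hit != 'true'
    run: composer install --no-interaction --prefer-dist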

This is great for tools like Composer and npm that have distinct lock files. But what about tools that use a cache to speed up operations and have no such mechanism for detecting changes?

For example, we use PHPStan (via Larastan) on all our projects. It can take a while to run, so it uses a cache folder to track files that it has already scanned and that haven't changed. How can we leverage the GitHub cache action for this?

Hashing the cache folder itself wouldn't help, since its contents are exactly what we're trying to restore. Instead, we can rely on a feature of the GitHub cache action that lets us specify more than one key to identify a cache. I'll show an example, and then explain it in more detail:

  - name: Cache Larastan result cache
    uses: actions/cache@v4
    with:
      path: .phpstan.cache
      key: "phpstan-result-cache-${{ github.run_id }}"
      restore-keys: |
        phpstan-result-cache-

What is going on with the key and restore-keys here?

GitHub uses these values to decide which cache to restore during this run. It first looks for a cache entry whose key is a complete, exact match for our key value. If it doesn't find one, it then checks each of the restore-keys values in order. For these, a prefix match is enough: a cache entry matches if its key starts with the restore key. The first restore key that finds a match wins, and if multiple cache entries match that prefix, the most recent one is restored.
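
To illustrate the "in order" part, here's a sketch with two restore keys. Including the branch name via github.ref_name is my own assumption for the sake of the example, not something our setup requires. The more specific prefix is tried first, and the broader one is only used if nothing matches it:

  - name: Cache Larastan result cache
    uses: actions/cache@v4
    with:
      path: .phpstan.cache
      # Unique per run, so an earlier run's cache is never an exact match
      key: "phpstan-result-cache-${{ github.ref_name }}-${{ github.run_id }}"
      restore-keys: |
        phpstan-result-cache-${{ github.ref_name }}-
        phpstan-result-cache-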

Knowing how this works, notice that the main key includes the current run_id. This value is unique to each workflow run, so the key will not match a cache saved by any earlier run, and the lookup will always fall back to our restore-keys value.
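
One edge case worth knowing about: re-running a workflow run keeps the same run_id, so on a re-run the key can match exactly, and in that case the action restores that cache and does not save a new one. If that matters for your project, one option is to append github.run_attempt, which increments with each re-run. Treat this as a sketch, not something you necessarily need:

  - name: Cache Larastan result cache
    uses: actions/cache@v4
    with:
      path: .phpstan.cache
      # run_attempt starts at 1 and increments on each re-run of the same run
      key: "phpstan-result-cache-${{ github.run_id }}-${{ github.run_attempt }}"
      restore-keys: |
        phpstan-result-cache-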

Why would we set it up this way? It might make more sense with a concrete example:

Let's say our CI run has a run_id of 123 and the previous run had a run_id of 122.

So when this run starts, we'll have an entry in our cache with the key phpstan-result-cache-122 from the previous run.

Because our current run is 123, GitHub will first try to fetch a cache with the key phpstan-result-cache-123. Because there is no match, it will fall back to our restore-keys value of phpstan-result-cache-. Remember, this only has to match a prefix of the key, so it will find our cache with the key phpstan-result-cache-122 and restore it.

When this run completes, it will save a new cache with the key phpstan-result-cache-123 which will in turn be used by our next run.

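This gives us the best of both worlds. We're able to restore the PHPStan cache from the previous run, speeding up our current run. But we're also constantly saving a new cache with the current run's results. The cache action only saves a new cache when the key had no exact match, so the unique key is what ensures each run saves its own results. This way, our cache doesn't drift further and further out of date.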
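
One thing to watch: as far as I know, actions/cache saves in a post-job step that only runs when the job succeeds, so a run where PHPStan reports errors would not save its cache. A possible workaround, sketched here using the separate restore and save actions from the same repository, is to save explicitly even when the analysis fails. The vendor/bin/phpstan analyse command is an assumption about how the analysis is run in your workflow:

  - name: Restore Larastan result cache
    uses: actions/cache/restore@v4
    with:
      path: .phpstan.cache
      key: "phpstan-result-cache-${{ github.run_id }}"
      restore-keys: |
        phpstan-result-cache-

  - name: Run Larastan
    run: vendor/bin/phpstan analyse  # assumed command, adjust to your setup

  - name: Save Larastan result cache
    uses: actions/cache/save@v4
    # Save even if the analysis step above failed
    if: always()
    with:
      path: .phpstan.cache
      key: "phpstan-result-cache-${{ github.run_id }}"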

This same technique works well with a number of different tools. We use it with Rector and PHP CS Fixer as well.
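
For PHP CS Fixer, a sketch might look like the following. It assumes the tool's default cache file, .php-cs-fixer.cache in the project root, and the key prefix is just a name I picked. Rector would follow the same shape, with path pointing at whatever cache directory your Rector config uses.

  - name: Cache PHP CS Fixer result cache
    uses: actions/cache@v4
    with:
      # Default cache file location, change it if your config overrides it
      path: .php-cs-fixer.cache
      key: "php-cs-fixer-cache-${{ github.run_id }}"
      restore-keys: |
        php-cs-fixer-cache-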

Hope this helps,

Joel

P.S. Can you imagine how awesome your project would be if we were working with you, adding improvements like this every week?
