Diff at the edge with serverless cloud functions
I was recently downloading packages using the npm package manager, and realized that although I often have a previous version of a package already installed, npm has to download the entire tarball for the new version whenever I install an update to a module. This seems very inefficient.
Requesting the difference between two previously cached files — using just a CDN configuration and a serverless cloud compute function — is a great example of exploiting edge and serverless compute services to make your website more efficient and performant, and lower your bandwidth costs. In this post, I’ll present a solution which, for a site hosting versioned downloadable assets such as software, documents, and saved games, can reduce bandwidth consumption dramatically.
Taking one of my own npm modules as an example, the latest version is 11MB gzipped, and 99MB uncompressed. Using bsdiff, we can produce a patch to summarize the changes from the penultimate version to the latest:
$ bsdiff module-3.16.0.tar module-3.17.0.tar module-3.16.0...3.17.0.patch
$ ls -lah
total 424
drwxr-xr-x 5 me staff 170B 18 Apr 15:55 .
drwxr-xr-x 14 me staff 476B 18 Apr 16:32 ..
-rw-r--r-- 1 me staff 209K 18 Apr 17:27 module-3.16.0...3.17.0.patch
-rw-r--r-- 1 me staff 99M 18 Apr 15:54 module-3.16.0.tar
-rw-r--r-- 1 me staff 97M 18 Apr 15:53 module-3.17.0.tar
So if the client already has 3.16.0, getting to 3.17.0 could be done with a download of only 209KB, a mere 1.8% of the full 11MB (gzipped from 99MB) that you'd otherwise need for the full tarball.
However, module hosting services like npm typically store their modules on a static hosting environment like Amazon S3 or Google Cloud Storage, so there is limited or no ability to add this kind of dynamic content generation feature, and pre-generating a diff between every pair of versions of every module seems unlikely to be a good use of compute or storage resources.
Can this be done at the CDN level?
Absolutely. Here’s how this could be done with Fastly’s CDN:
A CDN that allows origin services to be selected based on characteristics of the request could be used to route “diff” requests to a patch-generating service. With Fastly’s CDN, we can do this in VCL (Varnish Configuration Language, which we make accessible to customers). First, define a special backend:
backend be_diff_service {
  .dynamic = true;
  .port = "443";
  .host = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
  .ssl_sni_hostname = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
  .ssl_cert_hostname = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
  .ssl = true;
  .probe = {
    .timeout = 10s;
    .interval = 10s;
    .request = "GET /healthcheck HTTP/1.1" "Host: <<CLOUD-FUNCTIONS-HOSTNAME>>" "Connection: close" "User-Agent: Fastly healthcheck";
  }
}
Now, we can decide on a special syntax to use for patch requests, and make a small addition to vcl_recv that detects this syntax and routes the request to the special backend:
sub vcl_recv {
  ....
  declare local var.diffUrlPrefix STRING;
  declare local var.diffUrlSuffix STRING;
  # Match /<name>/-/<name>-<from>...<to>.tgz.patch (the three literal dots are escaped)
  if (req.url ~ "^(/.*\/\-\/.*)\-(\d+\.\d+\.\d+)\.\.\.(\d+\.\d+\.\d+)(\.tgz)\.patch") {
    # Rebuild the URL of each full tarball as served by this same service
    set var.diffUrlPrefix = if (req.http.Fastly-SSL, "https://", "http://") req.http.Host ".global.prod.fastly.net" re.group.1 "-";
    set var.diffUrlSuffix = re.group.4;
    # Send the request to the diff service instead of the normal origin
    set req.backend = be_diff_service;
    set req.http.Host = "<<CLOUD-FUNCTIONS-HOSTNAME>>";
    set req.http.Backend-Name = "diff";
    set req.url = "/compareURLs?from=" var.diffUrlPrefix re.group.2 var.diffUrlSuffix "&to=" var.diffUrlPrefix re.group.3 var.diffUrlSuffix;
  }
  ....
}
npm’s downloads use URLs such as /module-name/-/module-name-1.2.3.tgz, so I'd like to also support /module-name/-/module-name-1.2.3...1.2.4.tgz.patch as a diff request. The regular expression in the VCL above captures the requests that fall into this category, and then:
- Changes the backend to point to the diff service
- Updates the Host header so we are sending the correct origin’s domain in the request to the service
- Rewrites the path to match the syntax of the diff generator service
(For more information on getting started with running your own VCL on the Fastly edge cloud platform, see our introductory guide to VCL.)
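To make the rewrite easier to follow, here is a rough JavaScript mirror of what the vcl_recv addition does to a diff request. It is only an illustration: parseDiffPath and its origin argument are hypothetical names introduced for this sketch, and the real work happens in VCL at the edge. Like the VCL, it does not URL-encode the query values.

```javascript
// Same pattern as the VCL regular expression, with the three dots escaped.
const DIFF_PATH = /^(\/.*\/-\/.*)-(\d+\.\d+\.\d+)\.\.\.(\d+\.\d+\.\d+)(\.tgz)\.patch/;

// Hypothetical helper: returns the two tarball URLs and the rewritten
// backend path, or null when the path is not a diff request.
function parseDiffPath(path, origin) {
  const match = DIFF_PATH.exec(path);
  if (!match) return null; // ordinary download, leave it alone
  const [, prefix, fromVersion, toVersion, suffix] = match;
  const from = `${origin}${prefix}-${fromVersion}${suffix}`;
  const to = `${origin}${prefix}-${toVersion}${suffix}`;
  return { from, to, backendPath: `/compareURLs?from=${from}&to=${to}` };
}
```

For example, /lodash/-/lodash-4.17.3...4.17.4.tgz.patch yields from and to URLs ending in lodash-4.17.3.tgz and lodash-4.17.4.tgz, while a normal tarball path returns null.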
This is all very well, but the CDN cache nodes cannot generate diffs by themselves. This is a great use case for serverless compute services, such as AWS Lambda or Google Cloud Functions. We’ll use a Google Cloud Function to handle this.
If you want to use GCF and don’t have it set up already, Google has an excellent quick start guide that will get you up and running.
The source of the cloud function that we need looks like this:
const url = require('url');
const zlib = require('zlib');
const fetch = require('node-fetch');
const bsdiff = require('node-bsdiff').diff;

exports.compareURLs = function compareURLs (req, res) {
  Promise.resolve()
    .then(() => {
      // Fetch both files in parallel and buffer each one in memory
      return Promise.all(['from', 'to'].map(param => {
        return fetch(req.query[param])
          .then(resp => {
            const name = url.parse(req.query[param]).pathname.replace(/^.*\/([^\/]+)\/?$/, '$1');
            // Diff the uncompressed data: gunzip anything that looks gzipped
            const isCompressed = Boolean(resp.headers.get('Content-Encoding') === 'gzip' || name.match(/\.(tgz|gz|gzip)$/));
            const respStream = isCompressed ? resp.body.pipe(zlib.createGunzip()) : resp.body;
            const bufs = [];
            respStream.on('data', data => bufs.push(data));
            return new Promise(resolve => {
              // 'end' fires once all data has been read from the stream
              respStream.on('end', () => {
                resolve(Buffer.concat(bufs));
              });
            });
          })
        ;
      }))
    })

    // Create patch and serve it
    .then(([from, to]) => {
      const patch = bsdiff(from, to);
      res.status(200);
      res.send(patch);
    })
  ;
};
I’m using two public npm modules, node-fetch which implements the now-standard WHATWG Fetch API in NodeJS (which at time of writing is not natively supported by Node), and node-bsdiff, which performs the amazing binary diff algorithm invented by Colin Percival.
This code includes no error handling or validation, and we can also improve the patch response by adding appropriate Cache-Control information (the patch can be cached for as long as the least-cacheable of the two files being compared), and also by passing through any surrogate-key headers present on the input files. I’ve uploaded a more comprehensive solution to GitHub with comments, so feel free to make use of that.
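As a minimal sketch of the "least-cacheable of the two files" rule, the helper below derives a Cache-Control value for the patch from the headers of the two inputs. The function names are hypothetical, and two choices here are assumptions rather than anything from the solution on GitHub: a missing max-age is treated as zero, and no-store, no-cache, or private on either input makes the patch uncacheable.

```javascript
// Read max-age (in seconds) from a Cache-Control header value.
// Assumption: absent headers and non-shareable responses count as 0.
function maxAgeOf(cacheControl) {
  const value = cacheControl || '';
  if (/\b(no-store|no-cache|private)\b/i.test(value)) return 0;
  const match = /max-age=(\d+)/i.exec(value);
  return match ? Number(match[1]) : 0;
}

// The patch may be cached only for as long as the shorter-lived input.
function patchCacheControl(fromHeader, toHeader) {
  const ttl = Math.min(maxAgeOf(fromHeader), maxAgeOf(toHeader));
  return ttl > 0 ? `public, max-age=${ttl}` : 'no-store';
}
```

In the cloud function, the two arguments would come from resp.headers.get('Cache-Control') on each fetched file, and the result would be set as a response header before calling res.send.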
Testing
To test the new endpoint, I invented differentnpm.com: a fictitious new domain name for the npm registry for which I could create a Fastly service, and I set it up with the real npm registry as its origin server. A request to download the full tarball of lodash 4.17.4, one of the most popular modules on npm, shows that the new service behaves like the npm registry:
$ curl "http://differentnpm.com.global.prod.fastly.net/lodash/-/lodash-4.17.4.tgz" -vs 1>/dev/null
< HTTP/1.1 200 OK
< Cache-Control: max-age=21600
< Content-Type: application/octet-stream
< Content-Length: 310669
< X-Served-By: cache-sjc3143-SJC, cache-sjc3628-SJC
< X-Cache: HIT, HIT
This request is routed to npm’s real registry and returns a 310KB file (see the Content-Length header). As we’d expect, it is a cache HIT: this is a popular file, so it’s likely to be available at the local CDN cache node.
However, this new registry also transparently supports the new diff URLs:
$ curl "http://differentnpm.com.global.prod.fastly.net/lodash/-/lodash-4.17.3...4.17.4.tgz.patch" -vs 1>/dev/null
< HTTP/1.1 200 OK
< Cache-Control: max-age=21600
< content-type: application/octet-stream
< Content-Length: 1207
< Connection: keep-alive
< X-Served-By: cache-sjc3132-SJC
< X-Cache: HIT
Here the request for the difference between lodash 4.17.3 and 4.17.4 returns a patch of only 1,207 bytes, about 0.4% of the size of the full tarball.
Bsdiff ships with a companion bspatch tool, which can take the old file and the patch, and produce the new one:
$ ls -la
-rw-r--r-- 1 me staff 2254848 18 Apr 16:30 lodash-4.17.3.tar
-rw-r--r-- 1 me staff 1207 19 Apr 17:35 lodash-4.17.3...4.17.4.tgz.patch
$ bspatch lodash-4.17.3.tar lodash-4.17.4.tar lodash-4.17.3...4.17.4.tgz.patch
$ tar tf lodash-4.17.4.tar
package/package.json
package/README.md
package/LICENSE
package/_baseToString.js
....
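On the client side, a package manager would need to build the patch URL from the version it already has, then apply the result. The sketch below shows one way that might look. Both function names are hypothetical, patchPathFor handles only unscoped package names, and applyPatch assumes the bspatch binary is on the PATH. Because the diff service compares uncompressed data, the cached .tgz would have to be gunzipped to a .tar before patching, as in the shell example above.

```javascript
const { execFileSync } = require('child_process');

// Build the diff request path for an upgrade, following the URL scheme
// used in this post: /<name>/-/<name>-<from>...<to>.tgz.patch
function patchPathFor(name, installedVersion, targetVersion) {
  return `/${name}/-/${name}-${installedVersion}...${targetVersion}.tgz.patch`;
}

// Apply a downloaded patch by shelling out to bspatch.
// bspatch's arguments are: <oldfile> <newfile> <patchfile>
function applyPatch(oldTar, newTar, patchFile) {
  execFileSync('bspatch', [oldTar, newTar, patchFile]);
}
```

A client with lodash 4.17.3 installed would request patchPathFor('lodash', '4.17.3', '4.17.4'), and could fall back to the full tarball if the patch request fails.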
Savings
To work out how useful this kind of thing could be, I made a list of npm’s most depended-upon modules, and for each one, gathered the following data:
- Number of downloads over a test period (I used April 2017)
- Size of most recent version tarball
- Size of diff between most recent and penultimate version tarball
One thing we can’t know from public data is how often a user has a prior version of a file in cache locally. Let’s look at the impact if that number were 5%, 15%, and 50%:
Module | Downloads (1000s) | Size (bytes) | Monthly transfer (GB) | Patch size (bytes) | Patch size (% of full) | Monthly savings at 5% cache ratio (GB) | Monthly savings at 15% cache ratio (GB) | Monthly savings at 50% cache ratio (GB) |
---|---|---|---|---|---|---|---|---|
lodash | 42,866 | 310,669 | 12,403 | 1,207 | 0.39% | 618 | 1,853 | 6,177 |
request | 24,756 | 56,636 | 1,306 | 3,248 | 5.73% | 62 | 185 | 615 |
async | 43,923 | 97,968 | 4,008 | 23,083 | 23.56% | 153 | 459 | 1,532 |
express | 11,577 | 52,372 | 565 | 602 | 1.15% | 28 | 84 | 279 |
chalk | 21,045 | 5,236 | 103 | 1,027 | 19.61% | 4 | 12 | 41 |
bluebird | 14,327 | 135,089 | 1,803 | 2,669 | 1.98% | 88 | 265 | 883 |
underscore | 12,229 | 34,172 | 389 | 6,879 | 20.13% | 16 | 47 | 155 |
commander | 26,118 | 13,425 | 327 | 1,309 | 9.75% | 15 | 44 | 147 |
debug | 45,226 | 16,144 | 680 | 588 | 3.64% | 33 | 98 | 328 |
moment | 9,219 | 497,477 | 4,271 | 891 | 0.18% | 213 | 640 | 2,132 |
Total (top 10 modules) | 251,286 | | 25,853 | | | 1,229 | 3,687 | 12,290 |
Relative saving | | | | | | 4.75% | 14.26% | 47.54% |
Diff sizes obviously vary, and the most popular npm modules also tend to be quite small, but if some percentage of npm’s module requests could be diffs, then this data suggests that they would eliminate almost that same percentage of their bandwidth.
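The savings columns can be reproduced with a short calculation: monthly transfer, multiplied by the share of requests that could be served as patches, multiplied by the fraction of bytes a patch avoids. The sketch below is my reading of how the figures fit together, not the original spreadsheet. In particular, the table's GB values appear to be computed as 2^30 bytes, which is an assumption inferred from the numbers.

```javascript
const BYTES_PER_GB = 2 ** 30; // assumption: the table's "GB" is 2^30 bytes

// downloads: monthly download count; sizeBytes: full tarball size;
// patchBytes: patch size; cacheRatio: share of clients holding the prior version.
function monthlySavingsGB({ downloads, sizeBytes, patchBytes }, cacheRatio) {
  const transferGB = (downloads * sizeBytes) / BYTES_PER_GB;
  return transferGB * cacheRatio * (1 - patchBytes / sizeBytes);
}

const lodash = { downloads: 42866000, sizeBytes: 310669, patchBytes: 1207 };
for (const ratio of [0.05, 0.15, 0.5]) {
  console.log(`${ratio * 100}%: ${Math.round(monthlySavingsGB(lodash, ratio))} GB`);
}
```

For lodash this gives 618, 1853, and 6177 GB, matching the first row of the table.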
Other use cases
Package managers are not the only type of business that could benefit from this. Android uses binary diffs to update apps from the Google Play Store, and in any scenario where you need to send a user an update to something they already have, diffs can make your bandwidth use dramatically more efficient.