Reusing backend connections to increase performance
Reusing connections between your Varnish instance and your backends (origins) is a good idea for multiple reasons. If your Varnish is on the same network as your backends and you're doing low volume traffic, you can stop reading, because a) the difference will probably be negligible, and b) you're probably already reusing backend connections.
To find out whether or not connections are being reused, use the following command:
me@example:~$ varnishstat -1 -f backend_conn,backend_reuse
backend_conn 3881846 4.22 Backend conn. success
backend_reuse 191288113 208.34 Backend conn. reuses
The first column is the stat name, where backend_conn is the number of times a connection has been made, and backend_reuse is how many times it has been reused. The second column is the value since your Varnish was started, and the third column is the average per second over that period.
What you're looking for is for backend_reuse to not be zero, and ideally to be an order of magnitude, or more, higher than backend_conn. The values in the example above are pretty ideal: Varnish makes the occasional new connection, but the vast majority of requests go over already established connections.
You can also run the varnishstat command without the -1, and it will give you an additional column showing the change in the values over the past second.
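For example, to watch just those two counters update live (the same counters as above, only without the one-shot flag):
me@example:~$ varnishstat -f backend_conn,backend_reuse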
Before I dive into how you can improve these numbers if they're bad, let's go over some of the reasons why reusing is a good idea.
Why reuse connections?
As I mentioned above, there are two conditions where reusing is important. The most crucial is if there's a high round-trip time (RTT) between your Varnish and the backends. The other is running a high-traffic site, with thousands of requests per second. If you have a RTT that's below a millisecond, you'll find that the improvements of reusing connections are pretty small, but due to the high volume, small improvements can have a big impact.
I will start, however, with the most important reason: you can run out of port numbers. Every connection is uniquely identified by a tuple consisting of the source and destination IPs as well as the source and destination ports. Out of those four, only the source port is not static between your Varnish and a single backend. There are usually thousands of ports available (and with some tweaking of your operating system, about 64K). After a connection is closed, your operating system will keep the tuple around for anywhere between 30 seconds and 2 minutes, and refuse to use that exact tuple for any new connections, to prevent delayed data from throwing a spanner in the works. These connections are in the so-called TIME_WAIT state. Most operating systems allow you to tune how long connections stay in TIME_WAIT after they're closed, but lowering that value is not recommended, and you can avoid the problem entirely by reusing connections.
A bit of back-of-the-paper-napkin math shows that if you have tuned your OS to use all port numbers above 1024 for outgoing connections, and your OS keeps closed connections in the TIME_WAIT state for 60 seconds, you can do roughly 1066 connections per second to a single backend (about 64,000 usable ports divided by 60 seconds) before running out of ports.
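If you want a rough idea of where you stand on a Linux machine, something along these lines will show the ephemeral port range and approximately how many connections are currently sitting in TIME_WAIT (the count includes a header line; other operating systems have their own equivalents):
me@example:~$ sysctl net.ipv4.ip_local_port_range
me@example:~$ ss -tan state time-wait | wc -l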
The second reason to reuse connections is that making the connection has a cost. On the network side, it takes a round trip before a request can be sent. But even if that's not a problem because your RTT is low, it still takes some effort from your operating system. Besides allocating some buffers, it also has to allocate a file descriptor, which usually means a linear search for the lowest available number and can't be parallelized. In benchmarks I've seen, reusing connections made a 300-400% difference in throughput, and that was with Varnish and the backends on the same local network.
The third reason is TCP window scaling. While a TCP connection is in use, both ends adjust the congestion window (the amount of data allowed to be on the wire at once) to allow for greater transfer speeds if possible. Of course, the higher the RTT, the more impact this has, but even on a local network it still matters for multi-megabyte files. When connections are reused, previous requests will usually have pushed the window to its sweet spot already, and responses will be transferred as fast as your network allows.
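On Linux you can get a feel for this by looking at the congestion window (cwnd) that ss reports per connection; for instance, assuming your backend listens on port 8080:
me@example:~$ ss -tin dst :8080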
How to make sure connections are reused
There are several key elements to check if you don't see connections being reused enough.
The protocol version
First of all, make sure that your backends are capable of doing HTTP/1.1. It seems pretty amazing, but 16 years after HTTP/1.1 was introduced, there are still HTTP/1.0 servers in the wild.
To check whether or not your backends are replying with HTTP/1.1 or HTTP/1.0, use the following command:
me@example:~$ varnishtop -b -i RxProtocol
The -b filters the Varnish log down to just backend communications. The -i RxProtocol narrows it down further to just the protocol part of the response.
You should see something like:
list length 2 example
6612.25 RxProtocol HTTP/1.1
1.82 RxProtocol HTTP/1.0
The numbers reflect how many times in the last minute a certain line was seen in the log. Obviously we would like to see nothing but HTTP/1.1, but the occasional spurious HTTP/1.0 can still occur. If the ratio is anything like the example above, then you have nothing to worry about. You're not necessarily looking for perfection here.
However, if you see a larger number of HTTP/1.0 responses than you'd like, you can use the following command to see all the backend transactions that have an HTTP/1.0 response:
me@example:~$ varnishlog -b -m RxProtocol:HTTP/1.0
One of the things to look at is BackendOpen. That gives you the name of the backend in your VCL, as well as the IPs and ports of the connection. You'll also see all the request and response headers.
How to go about making your backend respond using HTTP/1.1 completely depends on your backend. Check its documentation or contact the vendor. Upgrading to a newer version of the software in question can also help.
Persistent connections
HTTP/1.1 considers all connections persistent (able to be reused) unless explicitly marked otherwise. The header used to mark a connection as not persistent is Connection: close. To see how many Connection: close headers you get per second, you can use:
me@example:~$ varnishtop -bC -i RxHeader -I connection.*close
To look at those transactions in detail:
me@example:~$ varnishlog -bC -m RxHeader:connection.*close
Again, how to get your backend to stop sending Connection: close headers and actually reuse connections is very specific to the software you're running. Look in the documentation, contact the vendor, and consider upgrading.
Maximum number of requests
Some web servers will only do a certain number of requests per connection. When the maximum is reached, they will send a Connection: close header and close the connection after the response is completed.
Apache, for instance, has the setting MaxKeepAliveRequests, which defaults to 100. Setting it to 0 allows Apache to handle an unlimited number of requests on a connection.
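As a minimal sketch, the relevant Apache directives would look something like this (KeepAlive is already on by default in recent versions):
KeepAlive On
MaxKeepAliveRequests 0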
Keep-alive timeouts
To conserve resources, web servers close connections after they've been idle for a while. This is often referred to as the keep-alive timeout. By default, Varnish will close connections from clients after 10 seconds of no activity and Apache after 5 seconds. In contrast, Varnish will keep connections to backends open for as long as it can. If random clients (i.e. users on the internet) are not able to connect directly to your backend, there really is no good reason to keep the keep-alive timeout short. Think minutes instead of seconds.
In Apache the setting is KeepAliveTimeout, and most other software will have a similar setting.
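For example, raising it to five minutes in Apache would look like this (the value is in seconds; pick whatever suits your setup):
KeepAliveTimeout 300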
Network device timeouts
Something to keep in mind is that there might be stateful firewalls or NAT devices between your Varnish and your backends. These tend to have timeouts on TCP connections, too. If those are lower than the keep-alive timeout on your backends, the network devices will not know what to do with the traffic on the connection Varnish is trying to reuse. They will either drop it and cause timeouts, or send resets and cause Varnish to retry the request if it can.
Check the whole network path between your Varnish and your backends, and make sure that the timeouts for open connections are higher than the keep-alive timeout on your backends.
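If there's a Linux box doing NAT or stateful filtering in the path, its connection-tracking timeout for established connections is one of the values worth comparing; assuming the conntrack module is loaded, you can check it with:
me@example:~$ sysctl net.netfilter.nf_conntrack_tcp_timeout_established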
TL;DR
Reusing connections between your Varnish and your backends greatly benefits performance. Make sure you're not using HTTP/1.0, that your backends are not sending Connection: close headers, that you raise the keep-alive timeouts, and that you check the connection timeouts on all network devices in the path.