Distributed systems in general, and web services in particular, rely on a number of dependencies behind the front-end layer: databases, queueing infrastructure, distributed locking mechanisms, and all kinds of middle-tier services. The reliability of a distributed system depends strongly on the availability of its critical dependencies: the more dependencies there are, the higher the likelihood that at least one of them is down at any given time. And dependencies do become unavailable, mostly for brief periods during active/standby failovers or network routing blips.
Whatever the reason for the unavailability, you want to make sure that your service handles dependency failures gracefully. One way of verifying this is to simulate an outage. If you're in charge of your dependencies, you can just bring them down and see what happens, e.g. shut down a database. This isn't always feasible: the database might be a resource shared among many users, who shouldn't have to suffer from your testing. The same applies when using a cloud service as a dependency - you can't shut it down at your convenience, yet it will go down, if only for a brief period, at some point.
Here are the two rules I add to the service hosts to simulate a dependency outage for the two respective scenarios:
- Connect timeout: All outbound packets, including those carrying the TCP handshake's SYN, simply get lost.
iptables -A OUTPUT -d www.example.com -j DROP
- Socket read timeout: The handshake appears to succeed from the client's perspective (SYN and SYN+ACK go through), but the final ACK originating from the client gets dropped, along with every subsequent ACK, so established sockets block on reads until they time out.
iptables -A OUTPUT -p tcp --tcp-flags ACK ACK -d www.example.com -j DROP
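Once the test is done, the simulated outage can be lifted by deleting the rule with `-D`, which takes exactly the same rule specification as the `-A` that added it. A minimal sketch of the full cycle for the first scenario (run as root; note that iptables resolves the hostname once, at rule-insertion time, so the rule pins whatever addresses the name resolved to at that moment):

```shell
# Start the simulated outage: silently drop all traffic to the dependency.
iptables -A OUTPUT -d www.example.com -j DROP

# ... exercise the service here and observe how it degrades ...

# Lift the outage: -D removes the rule matching the same specification.
iptables -D OUTPUT -d www.example.com -j DROP
```

If you forget the exact specification, `iptables -L OUTPUT --line-numbers` lists the rules in the chain, and `iptables -D OUTPUT <number>` deletes by position.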