A few customers with SPF enabled on their mail servers experienced “501 5.1.7 Invalid address” errors earlier today, as an indirect result of the power outage yesterday. The problem was resolved this morning Pacific time; what follows is a detailed explanation of what triggered the issue.
When the power outage occurred yesterday at the Amsterdam facility, we directed traffic to other facilities and brought a set of new servers online in another data center to manage the additional load. Each of our message handling servers uses two IP addresses for delivery: one “lower risk” address for normal deliveries, and one “higher risk” address for non-delivery reports (NDRs), since NDRs can sometimes trigger IP address blacklisting by third-party blacklisting services. The lower risk address was placed on the newer of two subnets (18.104.22.168/22) and the higher risk address on the older of the two (22.214.171.124/22).
Although both ranges have been published for over a year, we began to see delivery issues for some customers who had never updated their firewalls to permit the newer address range, so we reacted by switching the two bindings. Concurrent with that change, we updated the affected A and PTR records in the relevant DNS zone files. However, when some of the services were restarted on the servers, the TTL for the host name records had not yet expired, so the cached information was used on service startup. This caused a problem with SRS address rewriting (SRS is used for delivery to customers running SPF on their mail servers): the server name did not resolve properly, so the rewritten addresses were formed with the bracketed IP notation instead of a hostname. Some remote systems rejected these addresses with a “501 5.1.7 Invalid address” error on the delivery attempt.
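To illustrate the difference between the two address forms, here is a minimal Python sketch. It is not our production code: the function names are hypothetical, the hash and timestamp fields are fixed placeholders rather than computed values, and the host name and IP address are documentation examples. It assumes the commonly used SRS0 layout, in which the original domain and local part are folded into the local part of an address at the forwarding host.

```python
def srs_rewrite(sender, forwarder_host, hash_tag="HHH", timestamp="TT"):
    """Build an SRS0-style address (placeholder hash and timestamp)."""
    local, _, domain = sender.partition("@")
    return f"SRS0={hash_tag}={timestamp}={domain}={local}@{forwarder_host}"


def uses_address_literal(address):
    """True if the domain part is a bracketed IP literal such as [192.0.2.10]."""
    domain = address.rpartition("@")[2]
    return domain.startswith("[") and domain.endswith("]")


# Expected form: the forwarding server's name resolved correctly.
good = srs_rewrite("alice@example.org", "mx1.example.net")

# Problem form: no name was available, so the bracketed IP was used instead.
bad = srs_rewrite("alice@example.org", "[192.0.2.10]")

print(good, uses_address_literal(good))
print(bad, uses_address_literal(bad))
```

A receiving system that does not accept a bracketed IP literal as a sender domain would reject the second form, which matches the behavior described above.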
To resolve the problem, our engineers restarted the services on the affected servers, so that the updated A and PTR records would replace the cached information.
Although this circumstance, in which the bindings on the delivery servers were reversed, is unlikely to recur, to mitigate this risk we will investigate forcing SRS hostname resolution if a valid name is not found.
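One way such a safeguard could look is sketched below in Python. This is only an illustration of the idea under investigation, not a committed design: the function names, the fallback order (cached name, then a fresh lookup, then a statically configured name), and the example host names are all assumptions.

```python
import ipaddress
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_hostname(name):
    """Accept only fully qualified names; reject empty values and IP literals."""
    if not name or len(name) > 253:
        return False
    name = name.rstrip(".")
    try:
        ipaddress.ip_address(name.strip("[]"))
        return False  # a bare or bracketed IP address is not a hostname
    except ValueError:
        pass
    labels = name.split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)


def pick_srs_hostname(cached_name, fresh_lookup, configured_name):
    """Return a usable hostname for SRS rewriting, never an IP literal."""
    if is_valid_hostname(cached_name):
        return cached_name
    try:
        fresh = fresh_lookup()  # forces a new resolution attempt
    except OSError:  # socket lookup errors are subclasses of OSError
        fresh = None
    if is_valid_hostname(fresh):
        return fresh
    if is_valid_hostname(configured_name):
        return configured_name
    raise RuntimeError("no valid hostname available for SRS rewriting")


def failing_lookup():
    raise OSError("lookup failed")


recovered = pick_srs_hostname("[192.0.2.10]", lambda: "mx1.example.net", "mail.example.net")
fallback = pick_srs_hostname(None, failing_lookup, "mail.example.net")
print(recovered, fallback)
```

In a real service the `fresh_lookup` argument could wrap a reverse lookup such as `socket.gethostbyaddr(ip)[0]`. Raising an error when no name is found, rather than silently emitting an IP literal, would make the failure visible at startup instead of at delivery time.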