[August 29, 2020] Between 15:22 and 15:41, the 20i datacentre experienced an increased number of HTTP 503 responses for websites hosted in a particular network zone of our shared hosting platform.
The root cause has been identified as a fault in one of 20i's network-attached storage (NAS) nodes, which began performing abnormally at around 15:13. The node did not trigger any hardware or monitoring alerts, and did not crash or fail, so no automatic failover occurred. The storage continued handling requests, but at a much slower rate than normal. This caused a backlog of requests to build up across the web servers in that zone until, at around 15:22, the queue grew large enough that 20i's load balancers began returning 503 responses to any further requests into that area of the 20i network.
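To illustrate why a node that is slow but still responding can slip past checks that only test whether it is up, here is a minimal sketch of a latency-based health check. It is not 20i's monitoring: the function name `is_degraded`, the baseline, the threshold factor and the sample values are all hypothetical, chosen only to show the idea.

```python
from statistics import median


def is_degraded(latencies_ms, baseline_ms, factor=5.0, min_samples=5):
    """Return True when the recent median latency exceeds factor x baseline.

    All parameters are illustrative assumptions, not values from the incident.
    """
    if len(latencies_ms) < min_samples:
        return False  # too few samples to make a judgement
    return median(latencies_ms) > factor * baseline_ms


# Hypothetical samples: in both cases every request succeeds, so a plain
# "is the node answering?" check would pass for both.
healthy = [4.1, 3.9, 4.3, 4.0, 4.2]
slow = [48.0, 52.5, 61.0, 55.2, 49.9]

print(is_degraded(healthy, baseline_ms=4.0))  # False
print(is_degraded(slow, baseline_ms=4.0))     # True
```

A check of this kind compares response times against an expected baseline rather than relying on errors or crashes, which is the sort of signal a slow-but-alive node would produce.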
The on-call system administration team were alerted, and by 15:39 they had isolated the problem storage server and manually removed it from operation. Because of the highly available storage configuration, websites hosted on this storage were immediately served from alternative nodes, and all web servers had returned to normal service by 15:41.
The system administration team have since performed detailed on-site hardware diagnostics and replaced a number of components in the problem NAS node. We can confirm that the server is back in normal operation and that full storage redundancy has been restored. 20i technicians are continuing to monitor the situation but do not expect any further problems. The replaced components will be returned to the hardware vendor for a full investigation into why no monitoring detected any faults, to help prevent such a situation from happening again.
We apologise for any inconvenience caused.
Saturday, August 29, 2020