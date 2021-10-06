At Facebook, engineers are still investigating the extended outage that hit Facebook, Instagram and WhatsApp the day before yesterday, to make sure it never happens again, both for the sake of the services' reputation and because of the evident economic damage, both to the company itself and to its customers. After the initial explanations of the past few hours, the social network has returned to the topic with more details about an unprecedented problem.

Mark Zuckerberg, in a post, called it “the worst service crash we’ve had in years. […] The biggest concern with a blackout like this is not how many people switch to competing services or how much money we lose, but what it means for the people who rely on our services to communicate with loved ones, run their businesses or support their community,” the CEO said.

According to the engineers' reconstruction, the outage was triggered by the system that manages the global backbone, the network Facebook has built to connect all its computing facilities: tens of thousands of kilometers of fiber-optic cable that span the world and link all its datacenters. Data traffic between the datacenters is handled by routers, which decide where incoming and outgoing data is sent.

“Our engineers often need to take part of the backbone offline for maintenance, such as repairing a fiber line, adding capacity, or updating software on the router itself. And this was the source of the outage,” explained Santosh Janardhan, vice president of infrastructure.

More specifically, during routine maintenance “a command was issued with the intention of assessing the availability of global network capacity, and this unintentionally took down all the connections in our backbone, effectively disconnecting Facebook’s datacenters globally.” Normally, Facebook has a system that prevents a command like this from having such a dramatic effect, but “a bug in that control tool prevented the command from being stopped.”
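
The article does not say how Facebook's safeguard works internally, so the following is only a minimal sketch of the general idea of checking a maintenance command before it runs. The function names, the link names and the `min_remaining` threshold are all hypothetical and are not taken from Facebook's tooling.

```python
from typing import Dict, Iterable, Set


def links_after(command_targets: Iterable[str], links: Dict[str, bool]) -> Set[str]:
    """Return the links that would still be up if the targets were taken offline.

    `links` maps a (hypothetical) backbone link name to whether it is currently up.
    """
    targets = set(command_targets)
    return {name for name, up in links.items() if up and name not in targets}


def audit(command_targets: Iterable[str], links: Dict[str, bool],
          min_remaining: int = 1) -> bool:
    """Approve a maintenance command only if enough links stay up afterwards.

    Assumption: a real audit would consider capacity and topology, not just a count.
    """
    return len(links_after(command_targets, links)) >= min_remaining


if __name__ == "__main__":
    backbone = {"link-a": True, "link-b": True, "link-c": False}
    print(audit(["link-a"], backbone))            # one link remains up
    print(audit(["link-a", "link-b"], backbone))  # nothing would remain up
```

A check of this kind only helps if it runs correctly; according to the statement quoted above, it was precisely a bug in the safeguard that let the command through.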

The result was not only the complete disconnection of the datacenters from the Internet, but also a more serious problem involving the relationship between DNS and the Border Gateway Protocol (BGP). DNS is a kind of address book for the Internet, which translates web addresses into specific IP addresses. Those translation queries are answered by “authoritative name servers,” which themselves occupy well-known IP addresses, and those addresses are in turn advertised to the rest of the Internet via BGP.

“To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves cannot speak to our datacenters, since this is an indication of an unhealthy network connection. In the recent outage, the entire backbone was removed from operation, causing these locations to declare themselves unhealthy and withdraw those BGP advertisements. The result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.”
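
The behavior described in this statement can be illustrated with a small simulation. This is a sketch under stated assumptions, not Facebook's implementation: `DnsEdgeNode`, `reconcile` and `resolvable` are hypothetical names, the backbone probe is a stand-in function, and `192.0.2.0/24` is a documentation-only placeholder prefix.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set


@dataclass
class DnsEdgeNode:
    """Hypothetical edge location hosting an authoritative name server."""
    prefix: str                              # prefix the name server answers on
    backbone_reachable: Callable[[], bool]   # probe: can we reach the datacenters?
    advertised: Set[str] = field(default_factory=set)

    def reconcile(self) -> str:
        """Advertise the DNS prefix only while the backbone probe succeeds."""
        if self.backbone_reachable():
            self.advertised.add(self.prefix)
            return "advertise"
        # Unhealthy: withdraw the route even though the DNS process still runs.
        self.advertised.discard(self.prefix)
        return "withdraw"


def resolvable(nodes: Iterable[DnsEdgeNode]) -> bool:
    """The outside world can reach a name server only if some node advertises."""
    return any(node.advertised for node in nodes)


if __name__ == "__main__":
    backbone = {"up": True}
    nodes = [DnsEdgeNode("192.0.2.0/24", lambda: backbone["up"]) for _ in range(3)]

    for node in nodes:
        node.reconcile()
    print("backbone up, resolvable:", resolvable(nodes))

    backbone["up"] = False  # the whole backbone goes away at once
    for node in nodes:
        node.reconcile()
    print("backbone down, resolvable:", resolvable(nodes))
```

The sketch shows why a per-location health check that is sensible in isolation becomes dangerous when every location fails the same probe simultaneously: all of them withdraw, and no route to the name servers remains.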

In practice, Facebook’s systems treated the situation as an attack by malicious actors and protected themselves. Everything happened quickly, and by the time the engineers realized what was going on they faced two big obstacles. “The first is that it was not possible to access our datacenters through our normal means because their networks weren’t working, and the second is that the total loss of DNS broke many of the internal tools that we would normally use to investigate and resolve situations like this.”

Facebook therefore sent technicians to the datacenters, but the security systems made it difficult to get in. Once inside, the technicians then had to contend with hardware and routers equipped with systems that try to prevent physical access. “It therefore took longer to activate the secure access protocols needed to allow people on site to work on the servers. Only then were we able to confirm the problem and get our backbone back online.”

Finally, bringing the services back caused an immediate surge in traffic, which led to swings in power demand that could have put the infrastructure itself at risk. Facebook, however, had prepared for this scenario with exercises called “storm,” in which it simulates “a serious system failure by taking a service, a datacenter or an entire region offline, subjecting all the infrastructure and software involved to stress tests.”

Santosh Janardhan believes that, all in all, the response times were good, not least because Facebook had never simulated a scenario like the one that reality threw at it. “Any failure like this is an opportunity to learn and improve,” said the executive.

“We have done a great deal of work to harden our systems against unauthorized access, and it was interesting to see how that protection slowed us down as we were trying to recover from an outage caused not by malicious activity, but by our own mistake,” he concluded.