“It’s not you, it’s us.”
This message from WhatsApp circulated on Twitter on Monday evening at 7:18 PM Dutch time. With it, the chat service tried to reassure the two billion users who could no longer communicate via the world’s most used chat service.
It soon became apparent that the cause was a malfunction at parent company Facebook. A small but critical mistake disabled all Facebook services, including Instagram, WhatsApp and the software Facebook’s own staff uses to keep the network up and running. It took more than six hours to fix the problem – a long outage by internet standards.
What went wrong?
Border Gateway Protocol
To answer that, we have to descend into the depths of the internet, to the technology that connects separate computer networks. The internet is a collection of more than 65,000 networks that communicate with each other via the Border Gateway Protocol (BGP). BGP maintains a kind of dynamic road map: a table of the preferred routes for connecting computers in network A to computers in network B. This infrastructure has been around since the early days of the internet and has not changed significantly since then.
The BGP database is gigabytes in size and also contains references to Facebook, a giant in data traffic with about three billion users. Companies like Facebook supply BGP updates themselves in order to offer optimal routes. In network jargon, this is called ‘advertising’. That is where something went wrong on Monday: the routes to Facebook were not adjusted but removed entirely.
Facebook had wiped itself off the map with a faulty ‘advertisement’. Because updates to the BGP database are adopted by all the central switching points of the internet, the routes to Facebook, Instagram and WhatsApp disappeared worldwide. A tweet that WhatsApp sent just after 6 p.m. on Monday evening – “we know that some users are experiencing problems” – seemed to underestimate the magnitude of the problem.
The statement from Facebook’s own experts, which appeared online on Tuesday, mentions a “configuration error in the routers that coordinate network traffic between our data centers.”
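The effect of a withdrawn route can be illustrated with a small Python sketch. It is a toy model of a single routing table, not a real BGP implementation: the class `RoutingTable`, its method names and the network names are hypothetical, and the address ranges are reserved documentation prefixes, not Facebook’s real addresses.

```python
import ipaddress


class RoutingTable:
    """Toy model of one router's view of the internet (not real BGP)."""

    def __init__(self):
        # Maps an advertised prefix to the name of the network that announced it.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` advertises a route to `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Remove the route to `prefix`, as a withdrawal update would."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the network that owns `address`, or None if no route exists."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # Prefer the most specific (longest) matching prefix.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "example-social-network")  # placeholder prefix
table.announce("198.51.100.0/24", "another-network")      # placeholder prefix

before = table.lookup("192.0.2.10")  # a route exists, traffic can be delivered
table.withdraw("192.0.2.0/24")       # the faulty update removes the route
after = table.lookup("192.0.2.10")   # no route left: the address is unreachable
```

In this sketch the servers behind the withdrawn prefix could still be running normally; other networks simply no longer know how to reach them.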
Internet service Cloudflare
Internet service Cloudflare, which has itself been hit by a similar malfunction, offers more details. Its engineers noticed at 3:58 PM UTC (Coordinated Universal Time; 5:58 PM in the Netherlands) that something was wrong with the connection to Facebook. “At first we thought it was our fault.”
Eighteen minutes earlier, at 3:40 PM, Cloudflare saw a spike in the number of updates passed to the “BGP routing tables” by Facebook. In those eighteen minutes, Facebook disappeared from the internet.
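How such a spike might be spotted can be sketched in a few lines of Python. This is an illustrative assumption, not Cloudflare’s actual monitoring: the function `find_spike`, its thresholds and the numbers in `updates` are invented for the example.

```python
from statistics import mean


def find_spike(counts_per_minute, baseline_minutes=5, factor=10):
    """Return the index of the first minute whose update count exceeds
    `factor` times the average of the preceding `baseline_minutes`,
    or None if no such minute exists."""
    for i in range(baseline_minutes, len(counts_per_minute)):
        baseline = mean(counts_per_minute[i - baseline_minutes:i])
        # Use at least 1 as the baseline so quiet periods do not trigger on noise.
        if counts_per_minute[i] > factor * max(baseline, 1):
            return i
    return None


# Invented numbers of routing updates per minute, for illustration only.
updates = [3, 2, 4, 3, 2, 3, 250, 400, 120, 5]
spike_minute = find_spike(updates)
```

Comparing each minute with a short moving baseline, rather than with a fixed number, keeps the check usable for networks that are normally either very quiet or very busy.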
What caused the error? It may have something to do with automation that Facebook introduced for setting up ‘peering agreements’ (the way service providers connect to an internet giant like Facebook). Facebook has not yet given a definitive answer.
Manual adjustment needed
In any case, the error had to be fixed manually. Facebook’s network administrators had locked themselves out; the accidental change made it impossible to reconfigure the servers remotely. A team had to physically travel to Facebook’s data center in Santa Clara, California, to fix the problem.
Internal Facebook software also stopped working because of the routing errors; access passes for Facebook offices failed and employees were forced to use other methods to communicate with each other.
Automated systems that register domain names were also thrown off. This made it seem as if the domain name Facebook.com was up for sale to anyone who was interested and had enough money. Jack Dorsey, founder and chief executive of the rival social network Twitter, needed only two words for his sneer: “How much?”
Twitter had a peak day on Monday as Facebook and Instagram users sought refuge elsewhere and decided to dust off their Twitter accounts. Texting also turned out to be a solution for many. KPN, the Dutch provider, noted a doubling of the usual number of text messages on Monday evening. But SMS does not allow group conversations or attachments.
The outage confronted users worldwide with error messages such as “Sorry, something went wrong. Please try again.” That went beyond discomfort, tweeted Eva Cukier, NRC correspondent in Russia. “It’s annoying for the many Russians and other residents of semi-authoritarian countries who use Facebook as a front-line news bulletin.”
Facebook’s outage was resolved during the night from Monday to Tuesday. After that, services cautiously resumed, with apologies for the inconvenience. No new failures have been reported since. The company emphasized on Tuesday that there was no malicious intent, for example by hackers. Business customers – companies that advertise on Facebook – were notified on Tuesday that the ad system was still “recovering”. Apparently Facebook has not quite recovered from the shock yet.