I have a confession to make. I’m a sucker for good architecture. Visiting places like Singapore, London, Rome, Buenos Aires, and New York City, I quickly find myself gravitating towards beautiful archways, spires, and even the voids used in designing some of the world’s most amazing buildings. I have felt a similar sense of awe while working as a network, telecom, and security engineer over the past 20 years.
To design resilient systems, you need a careful balance between redundancy and capacity, and a thoughtful structure that separates the systems you control from the systems you must account for but that are ultimately out of your control. By that I mean, on one side of the equation you have your own routers, load balancers, CPE (customer premises equipment), and the ISP circuits with associated “guarantees” to account for.
On the other side, you have to figure out a plan for ISP/carrier diversity, as well as last-mile diversity in the event of a fiber line disruption. If you’re a large corporation where local construction crews need to be accounted for (because their digging can shred your fiber lines), then you’re probably also thinking about the capacity of your ISP to handle inbound traffic to your datacenter, and for that matter its ability to handle DDoS attacks against that datacenter. Now, what if your ISP is unable to handle a large-scale attack and the attack causes a massive outage across the region of the country your ISP serves? Never happens, right? Wrong.
Let’s take a look at a small handful of events from 2018 as reported by ThousandEyes:
April 13: A DE-CIX issue in Frankfurt, Germany cut Internet access to a major world economy and most of the public Internet users in that region
May 31: AWS outage due to an ISP power outage that impacted AWS US-east-2
June 29: Comcast had a fiber-cut outage slowing or killing service for millions of web users even beyond that of their actual customer base
If you follow the above links, you’ll see the impact of these events on neighboring ISPs as well as on the geographic regions of the Internet they serve.
So, what are we talking about here in terms of requirements for building an environment that’s resilient to DDoS attacks, ISP outages, and BGP routing issues? Let’s handle these one at a time:
Network Based Attacks:
For DDoS mitigation services, the goal is to measure defensive capacity against attacks in gigabits per second (Gbps) or terabits per second (Tbps). The problem is that when DDoS vendors claim total capacity numbers, they often do so on the back of a network that is already carrying a large amount of other traffic. So, in the diagram below, the initial claim from the vendor is that they have 1 Tbps of DDoS mitigation capacity. But once we subtract the “consumed capacity” already being used by other delivery services across that same network, only 100 Gbps of capacity remains available to defend against an attack.
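The arithmetic behind that claim can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 900 Gbps consumed figure is not stated in the text but is implied by the 1 Tbps advertised and 100 Gbps remaining numbers (taking 1 Tbps as 1,000 Gbps).

```python
def available_mitigation_gbps(advertised_gbps: float, consumed_gbps: float) -> float:
    """Headroom left to absorb attack traffic, never below zero."""
    return max(advertised_gbps - consumed_gbps, 0.0)


def can_absorb(attack_gbps: float, advertised_gbps: float, consumed_gbps: float) -> bool:
    """True if the attack fits in the capacity that is actually free."""
    return attack_gbps <= available_mitigation_gbps(advertised_gbps, consumed_gbps)


ADVERTISED_GBPS = 1000.0  # the vendor's 1 Tbps claim
CONSUMED_GBPS = 900.0     # assumption: implied by the 100 Gbps left over

headroom = available_mitigation_gbps(ADVERTISED_GBPS, CONSUMED_GBPS)
print(f"Usable mitigation capacity: {headroom:.0f} Gbps")
print("300 Gbps attack absorbed?", can_absorb(300.0, ADVERTISED_GBPS, CONSUMED_GBPS))
```

The question to put to a vendor is therefore the free headroom on the shared network, not the headline number.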
Web Based Attacks:
Another thing to consider is that this DDoS network (as advertised above) usually refers to a mitigation platform that requires BGP route changes to announce network traffic across a series of scrubbing centers. There is another type of DDoS attack that does not target network IPs, but instead targets the web application or API of a vulnerable website. These attacks require a reverse proxy architecture that intercepts, inspects, and validates traffic across a network of reverse proxies before passing it back to the origin web server environment. Additionally, to keep the origin applications from getting DDoS’d directly, only traffic that has passed through the reverse proxy platform is allowed into the origin environment. This helps ensure that attackers cannot bypass the reverse proxies. The capacity of this kind of platform is usually much higher. For instance, Akamai’s high watermark for traffic traversing its Web platform in 2019 was 80 Tbps (terabits per second).
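The "only traffic from the reverse proxies reaches the origin" rule amounts to an allowlist check on the source address. Here is a minimal sketch using Python's standard ipaddress module. The CIDR ranges are documentation placeholders, not any vendor's real proxy ranges, and the function name is hypothetical. In practice this rule would normally live in a firewall or load balancer rather than in application code.

```python
import ipaddress

# Assumption: placeholder ranges (RFC 5737 documentation blocks) standing in
# for the address ranges your reverse proxy provider publishes.
PROXY_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_from_reverse_proxy(source_ip: str, allowed=PROXY_RANGES) -> bool:
    """Return True only if the connection comes from an allowed proxy range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in allowed)


for ip in ("192.0.2.10", "203.0.113.5"):
    verdict = "allow" if is_from_reverse_proxy(ip) else "drop"
    print(f"{ip}: {verdict}")
```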
ISP Outages:
When it comes to ISP outages, there’s a world of interconnected networks in which outages can originate. Take a look at the attached diagram. When user traffic originates from anywhere in the world, the path it takes from ISP to ISP is driven by business decisions: which ISP relationships exist along that path, and which ISPs give discounts to other ISPs for routing traffic along their network segment. But as you can see from the network path represented by the red dotted line, this path is not always efficient.
With large corporations, just “staying online” during an ISP outage is usually not enough. The goal during an ISP outage is to ensure that the path your users take to reach you remains unaffected by the outage and still performs fast enough to maintain the level of service your users expect. Having access to a CDN platform that sits as an overlay to the Internet is critical, as represented by the diagram with the orange circles. This resilient architecture requires a combination of: CDN services to serve cached content, static pages, or friendly “fail whale” pages in the event of a catastrophic failure; Global Traffic Management services to make sure that user traffic avoids bad routes through failed or congested ISPs; and finally a CDN-based performance optimization mechanism to ensure that users’ application traffic is prioritized and optimized along that path.
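As a rough illustration of the traffic management decision described above, the sketch below picks the fastest path that is still healthy and falls back to a static page when none is. The Path type, the path names, and the latency values are all hypothetical. A real Global Traffic Management service works from live health and performance measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Path:
    name: str          # hypothetical label for a route through an ISP
    healthy: bool      # False if the ISP on this path has failed or is congested
    latency_ms: float  # measured round-trip time on this path


def pick_path(paths: Iterable[Path]) -> Optional[Path]:
    """Choose the lowest-latency healthy path, or None if every path is down."""
    healthy = [p for p in paths if p.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda p: p.latency_ms)


candidates = [
    Path("via-isp-a", healthy=False, latency_ms=20.0),  # fastest, but down
    Path("via-isp-b", healthy=True, latency_ms=45.0),
    Path("via-isp-c", healthy=True, latency_ms=60.0),
]

chosen = pick_path(candidates)
print(chosen.name if chosen else "serve cached 'fail whale' page")
```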
BGP Outages:
As for BGP-related attacks, defending against them usually requires a service, such as a BGP-based DDoS scrubbing service, that can monitor for changes in BGP peering relationships and for unexpected route changes. Using these services, your DDoS vendor can monitor for unusual route changes across the Internet and warn you or take action if something is discovered. For immediate action, your IP space must always be routed through your DDoS vendor. This allows quicker mitigation of observed attacks and does not require you to announce a BGP route change to the Internet, which can take 3-5 minutes to propagate globally. See the screenshot below of a highly distributed DDoS scrubbing center architecture. This allows for redundancy within a region in case one scrubbing center is hit with more traffic than another, and it allows traffic to be backhauled between scrubbing centers to optimize delivery of clean traffic back to the origin location being protected.
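One simple form of route monitoring is to compare the origin AS seen in BGP announcements against the origin you expect for your own prefixes. The sketch below shows that check on hard-coded sample data. The prefixes and AS numbers are documentation placeholders, the function name is hypothetical, and a real monitor would consume a live feed of BGP updates rather than a list.

```python
import ipaddress

# Assumption: the prefix you own and the AS that should originate it
# (documentation prefix and a documentation-range ASN).
EXPECTED_ORIGINS = {ipaddress.ip_network("203.0.113.0/24"): 64500}


def unexpected_origins(announcements, expected=EXPECTED_ORIGINS):
    """Return announcements that overlap an owned prefix but come from another AS."""
    alerts = []
    for prefix, origin_asn in announcements:
        announced = ipaddress.ip_network(prefix)
        for owned, expected_asn in expected.items():
            if announced.overlaps(owned) and origin_asn != expected_asn:
                alerts.append((prefix, origin_asn))
    return alerts


observed = [
    ("203.0.113.0/24", 64500),    # expected announcement
    ("203.0.113.128/25", 64511),  # more-specific prefix from a different AS
    ("198.51.100.0/24", 64511),   # unrelated prefix, ignored
]

for prefix, asn in unexpected_origins(observed):
    print(f"ALERT: {prefix} announced by unexpected AS{asn}")
```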
There are many other considerations when architecting resilient systems. Most of them come down to understanding that the platforms that exist beyond your network, closer to the end user, must be leveraged for maximum results. The other consideration is to use different platforms for these services. The three I recommend are a CDN platform, a network scrubbing platform, and, let us not forget, a cloud-based DNS platform. In my view, combining these three cloud platforms is the most effective way to reduce downtime and outages while making sure user traffic remains a priority when every minute can cost thousands or even millions of dollars, plus lost reputation. When going offline is NOT an option, don’t be afraid to ask for help.
(Tony Lauro manages the Enterprise Security Architecture team at Akamai Technologies. With over 20 years of information security industry experience, Tony has worked and consulted in many verticals including finance, automotive, medical/healthcare, enterprise, and mobile applications. He is currently responsible for Akamai’s North America clients as well as the training of an Akamai internal group whose focus is on Web Application Security and adversarial resiliency disciplines. Tony’s previous responsibilities include consulting with public sector/government clients at Akamai, managing security operations for a mobile payments company, and overseeing security and compliance responsibilities for a global financial software services organization.)