Cascading failures, the downside of “Eat your own dogfood”

You may have heard that there was a pretty substantial outage in Amazon's AWS cloud services on November 25, 2020.  The summary of what happened can be found at https://aws.amazon.com/message/11201/ and is emblematic of how a simple maintenance action can lead to layer upon layer of unanticipated effects.  Amazon is an amazing company, the AWS service is really impressive, and what happened in this case could easily have occurred in any cloud service.  I'm unaware of any direct impact of this particular outage on any NG9-1-1 service, and I'd be very interested in hearing if there was one.  The lessons to be learned here apply much more broadly than cloud services.

A major lesson here is that, while the adage of "eat your own dogfood" is really very good advice, and generally should be followed, it has consequences that must be taken into account if you are trying to build very reliable systems.  And, at this point, I should point out that as far as I know, this outage did NOT violate any regular AWS Service Level Agreements.  They don't claim it's five nines, and it's not.  The outage was limited to a single region, and everyone who uses services like this and needs high availability knows to make their system multi-region.  Still, looking at what happened is helpful to avoid having similar things happen to your system.

The outage started with a service called "Kinesis", which enables real-time processing of streaming data (which includes media like video and audio, but also streams of web clicks and, most importantly to this event, logging data).  Like nearly all AWS services, Kinesis has a large set of (virtual) servers that handle the distributed load.  The trigger for the event was adding new servers to a part of Kinesis.  This was a planned maintenance action, which had been done several times before without incident.  This time, however, the total number of "threads" used by the servers exceeded the limit the operating system was configured for.  As is very common for a problem like this, the system didn't produce a meaningful error report that pointed to the problem.  Instead, what was observed was a very large set of effects of the problem.  It was confusing enough that it took a couple of hours to recognize that the capacity add had triggered the problem.  And then it turned out the only way to fix it was to restart the servers, one by one.  That took many more hours to complete: the problem started at 5:15 AM, and the service was back to a normal state by 10:23 PM the same day.
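The shape of the trigger suggests a lesson: resource limits that grow with fleet size need to be checked before a capacity add, not discovered afterward through opaque failures.  Here is a minimal sketch of such a pre-flight check in Python; the limit being checked, the thread-per-peer assumption, and the numbers are all my illustration, not anything AWS actually runs.

```python
# A minimal sketch (my illustration, not AWS's tooling) of a pre-flight check:
# before a capacity add, compare the projected thread count against the
# OS-configured limit instead of finding out the hard way.
import resource

def capacity_add_is_safe(projected_threads_per_server: int) -> bool:
    # On Linux, RLIMIT_NPROC bounds the processes/threads a user may create;
    # exceeding it produces opaque "can't create thread" failures rather than
    # a clear "you are over your configured limit" report.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return projected_threads_per_server < soft_limit

# Hypothetical numbers: if each server needs one thread per peer, adding
# servers raises the per-server thread count too.
servers_after_add = 120
if not capacity_add_is_safe(servers_after_add - 1):
    raise SystemExit("capacity add would exceed the OS thread limit; abort")
```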

This was bad enough.  But there is much, much more to the story.  Kinesis is used by Cognito, which is the authentication and single sign-on mechanism in AWS.  A latent bug in Cognito caused the problem with Kinesis to snowball and made Cognito unable to handle logins in some services.  In addition, Kinesis is used by the CloudWatch service, which is the monitoring system AWS uses.  That had a lot of knock-on effects, because CloudWatch is used for all kinds of things in AWS.  Specifically, it's used as the source of metrics for the "autoscale" functions that automatically add servers as demand increases and remove servers as demand decreases.  So, any customers using autoscale saw that mechanism not work for most of the day.  It also affected the Lambda service, which is a very low overhead way to implement a simple microservice: simple reactive web servers, seldom-occurring jobs that start from some triggering event, and so on.  Lambdas are used all over the place in AWS and in customer systems.

But wait, there is more!!

Cognito (the user authentication service) was used to authenticate AWS technicians needing access to the customer notification portal.  They couldn't log in to cause notifications to go out!!!  There was a backup system, but the technicians on duty weren't aware of it or trained on it.

Oh my.

So, adding capacity to Kinesis caused all sorts of failures in other services.

This is the "eat your own dogfood" issue.  AWS encourages its engineers to use other AWS services to implement their own AWS services.  "Eat your own dogfood".  It's a good idea in general.  If the service isn't good enough for your own people, why would it be good enough for customers?  But you see the downside: a failure in one service created a failure in another.  As you can read in Amazon's report, their response to this cascade of failures is to make each service capable of dealing with failures itself, mostly as a backup to a failure in another service.  That's a pretty good strategy, but you have to anticipate the possible failures and create workarounds in advance.  And, where the backup is manual, which might be very reasonable for some issues, you have to make sure the people who have to activate the backup know about it and are trained up on it.

It's almost always the case that these kinds of cross-system dependencies arise incrementally.  As new services come up or old ones are improved, intertwining incrementally occurs.  That means you can't just test for this when you first deploy a system; you have to look out for it at every update.

Which brings me to “Chaos Monkeys”.  While we would like to believe we can anticipate these kinds of failures if only we put some time and effort into analyzing the situation, in my experience, that doesn’t work, and also doesn’t happen.  It’s very, very tough to think through all the possibilities and recognize what would happen.  Enter the Chaos Monkey.

When we build 5 nines systems, we use redundancy.  We have backups and work arounds and alternate paths and back stops.  And we test them, but usually only in controlled conditions.  The Chaos Monkey is an idea the Netflix guys came up with that provides a whole other way to make sure your redundancy actually works.  Kill things on the active system.

The Chaos Monkey randomly terminates instances on the production system to ensure that it really, really works.  This induces terror in most engineers and their managers.  What? Deliberately break the running system?  We can’t do THAT!  Yes, you can.  You MUST.  Your system is designed to work with all of this chaos.  It should work.  It HAS to work.  So do it.  Kill random instances and make sure it works. 
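For the curious, the core of a chaos monkey is genuinely small.  Here is a hedged sketch using boto3 against EC2; the "chaos: opt-in" tag is my convention rather than Netflix's tooling, and a real deployment would add scheduling, rate limits, and an off switch.

```python
# Minimal chaos-monkey sketch: terminate one random opted-in EC2 instance.
# The "chaos: opt-in" tag is an assumed convention for marking fair game.
import random
import boto3

ec2 = boto3.client("ec2")

def eligible_instances():
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["opt-in"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]

def kill_one():
    victims = eligible_instances()
    if not victims:
        return None
    victim = random.choice(victims)
    # The whole point: the redundancy should absorb this with no customer impact.
    ec2.terminate_instances(InstanceIds=[victim])
    return victim

if __name__ == "__main__":
    print("terminated:", kill_one())
```

Run something like this on a schedule during business hours, when the people who have to respond are at their desks.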

Some of the failures shown in the AWS incident require some work beyond random process killing.  You have to kill entire systems and see what the effect is on other systems.  This can only be done in non-production systems (or at least at times everyone is prepared to deal with a system being down).  You often have to do that kind of testing at scale, or at least close to scale, to see knock-on effects.  Do it during maintenance periods.  Take half of the system offline and run an at-scale test of each system going down.  Be prepared to stop the test and restore service if you have a failure on the still-running production system, but try it, probably at least once a year.  You will be surprised, as the AWS engineers were, at how failures in one system affect another.

Security Vulnerabilities: Discover, React, Deploy – in 72 Hours

There is a new, particularly nasty vulnerability in a common piece of enterprise networking gear from the industry leader in the class of device called a "load balancer": https://www.wired.com/story/f5-big-ip-networking-vulnerability/

The vendor is a well known, generally well respected supplier, and reacted quickly and effectively to the exploit, producing a patch and correctly advising its customers that the patch needed to be implemented immediately.  The referenced article notes that the patch was released on June 30, and that if it wasn't implemented over the weekend, it may well have been too late.  So, 4 days or less from notice to deployment.

You might ask your current or proposed NG9-1-1 NGCS vendor how they would handle a vulnerability like this.

It’s pretty typical to see 30, 60 or even 90 day commitments for patches, if there is even a commitment at all.  Usually, vendors claim that they must perform extensive testing on the patch to make sure it doesn’t cause more problems, and often, it can take 20-30 days to roll out a patch to all deployments.

Compare that typical response to the need here. 

It would be great if the bad guys would let us have a few months to patch a serious vulnerability.  Unfortunately, they just aren’t that nice.  Darn.

To be able to react this fast requires a few things:

  1. An automated test that can quickly determine whether the patch causes problems with the services (a sketch follows this list)
  2. An expedited approval process that takes hours to run
  3. A deployment plan for emergency patches that takes at most a couple days to implement across all deployments
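To make item 1 concrete, here is a hedged sketch of a post-patch smoke test.  The endpoints are placeholders I made up; a real NGCS check would exercise call routing, location queries and logging end to end, not just health pages.

```python
# Hypothetical post-patch smoke test: hit each critical service in the
# patched staging environment and fail the pipeline if anything is unhealthy.
import sys
import urllib.request

CHECKS = {
    "esrp":    "https://esrp.example.net/health",     # placeholder URLs
    "ecrf":    "https://ecrf.example.net/health",
    "logging": "https://logger.example.net/health",
}

def check(name: str, url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    print(f"{name}: {'ok' if ok else 'FAILED'}")
    return ok

if __name__ == "__main__":
    results = [check(name, url) for name, url in CHECKS.items()]
    sys.exit(0 if all(results) else 1)
```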

A really important thing that a vendor needs to make this work is practice.  This is the kind of thing that goes well only if everyone who has to do their part has practiced it successfully a few times.  It’s almost guaranteed not to work if it’s never been tried.  And refresher practice every year or so is also important.

It gets even more complex, and requires even more practice, when there is more than one company involved.  Suppose the NGCS vendor purchases, say, an ECRF from another company, and a serious vulnerability is discovered in a library the ECRF supplier uses.  Then both the ECRF vendor's and the NGCS vendor's processes have to work in a couple of days, and they have to be practiced together a couple of times to make sure they do.

How about yours?

72 hours from release of patch to deployment is about all we can reasonably expect, but we should demand that.  You can make it dependent on the severity of the exploit.  The Common Vulnerability Scoring System (CVSS) is one common way to measure the severity of a vulnerability.  Scores of 9 or 10 are usually deemed critical.  Certainly, a CVSS score of 10 would necessitate the fastest response.  Whether a score of 9 requires a 72-hour response is something to discuss with vendors or potential vendors.
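If you want to write this into a contract, the mapping can be stated very plainly.  The deadlines below are illustrative numbers for discussion, not an established standard; the CVSS v3 severity bands (9.0 and up critical, 7.0 and up high, 4.0 and up medium) are the real ones.

```python
# Illustrative severity-to-deadline mapping for a patch SLA. The deadlines
# are discussion-starter numbers, not an industry standard.
def patch_deadline_hours(cvss_score: float) -> int:
    if cvss_score >= 9.0:       # CVSS v3 "critical"
        return 72
    if cvss_score >= 7.0:       # "high"
        return 7 * 24
    if cvss_score >= 4.0:       # "medium"
        return 30 * 24
    return 90 * 24              # "low"

print(patch_deadline_hours(10.0))   # -> 72
```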

I wish I knew how to write software that would never have bugs like this one.  I wish there were ways to give vendors more time to respond when this class of bug is discovered.  Alas, no such magic.  So, we have to assume this will happen, plan for it and practice our plan.

Trials and Tribulations with serverless in AWS

So I needed to create a simple database backend with an API for a mobile app. It's a small, simple database, with several classical join operations. Perfect for MySQL. It has an unusual characteristic: it's only going to be used once or twice a month for a few days. It's a problem management system for national fencing tournaments, used by officials to report and resolve problems in the tournament. I have the app working in Xamarin, and this seems like something trivial in AWS.

Because of the very low, sporadic use, serverless seems ideal – only pay when we use it, and we don't use it much. Probably fits in the free tier, but we're willing to pay a bit if it doesn't. Lambda is the obvious choice for the business logic, I'm using Swagger (OpenAPI) to define the interface, and AWS API Gateway to deploy it. It's very easy to trigger the Lambda functions from the API operations. I've already figured out how to use Cognito for app login, and I'm using SNS for the messaging notifications to the app. All Amazon all the time, right?

So, AWS has serverless Aurora, which is a pretty straightforward MySQL system. Price is right, has the right level of scaling, and, again, serverless.

But you just try to make a Lambda function that calls Aurora Serverless. It’s AMAZINGLY difficult.

Aurora Serverless, unlike, say, DynamoDB (AWS's NoSQL database), is all wrapped up in Virtual Private Clouds. The only way to get access to the DB is through a VPC. Okay, I'm a network guy. I've been around the block a few times: I was on the machine room floor when the ARPAnet IMP was winched into Carnegie Mellon. I worked at Xerox PARC in the glory days, and Bob Metcalfe taught me about Ethernet. VPCs aren't all that complicated – subnets, ACLs, yada, yada. Seems like a lot of work just to get at a serverless DB. And it really is a LOT of work. The most surprising thing is that to get Lambda to talk to Aurora, you need a NAT. What? Yeah, two internal AWS services need a NAT between them. It's that VPC thing. Yes, I do want the database secure; only Lambda can access it, plus my provisioning path (SSL from my desktop). But this is 2019. We don't believe in walled gardens anymore, we use TLS, authentication and authorization. We don't need really static IP addresses, DNS handles everything nicely. And we're out of IPv4 addresses, all the mobile devices have IPv6 addresses and we should be doing all-IPv6 for new apps.

Not Amazon. It's private address spaces, ACLs and dynamically managed public IPs. IPv4 only. SSL optional. And it's complicated. I've pored through maybe 100 pages of documentation trying to make this all work. Used the forums, paid for support. Ran wizards. Followed tutorials.

Never got it to work. As close as I came was a standalone DB on a simple VPC with a Bastion Host EC2 instance that let me connect MySQL Workbench to the DB instance. When I tried to hook up Lambda, it was start over: a much more complex VPC, and then the NAT.

Amazon has two NAT thingies – a managed NAT gateway, and a DIY NAT “instance”. The problem with the NAT gateway is cost. It would cost hundreds of dollars a year to run that. The NAT instance is code on an EC2 server. A wizard makes the instance, and it runs on the smallest config. You can trigger cron jobs to turn it up and down.
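As a sketch of the "turn it up and down with cron" idea, two cron entries can call something like the script below around the tournament window. The instance ID and schedule are placeholders, and this assumes the boto3 credentials and route-table plumbing are already in place.

```python
# nat_toggle.py -- start or stop a DIY NAT instance on a schedule.
# Example cron entries (placeholders):
#   0 6 * * 4  python3 nat_toggle.py up      # Thursday morning before an event
#   0 23 * * 0 python3 nat_toggle.py down    # Sunday night after it ends
import sys
import boto3

NAT_INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical instance ID

def main(action: str) -> None:
    ec2 = boto3.client("ec2")
    if action == "up":
        ec2.start_instances(InstanceIds=[NAT_INSTANCE_ID])
    elif action == "down":
        ec2.stop_instances(InstanceIds=[NAT_INSTANCE_ID])
    else:
        raise SystemExit("usage: nat_toggle.py up|down")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "")
```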

But you need an elastic IP (a not quite so static IP), for reasons I don't understand – everything else in AWS for this app is basically happy to use DNS and not need elastic IPs, even the website, hosted in an S3 bucket and Lambda, and running on my own domain. Not Aurora Serverless connecting to Lambda – you need a NAT and a static IP. TLS is optional, and not as easy as it should be. Documentation is roughly half a page. Never got it to even think it was working. Nothing talked to anything else. Couldn't see the DB from Lambda, couldn't SSH to it. Like it wasn't there. No doubt something having to do with the VPC config, because Aurora wasn't happy with the wizard-created VPC.

So I backed it all out. Deleted all the EC2 instances, the databases and the VPCs. Redid the whole thing in DynamoDB. It's not a great fit, but I can make it work as NoSQL. And it's trivial – create an IAM user with the right permissions and I'm off and running with the DynamoDB API. I had a simple Lambda function working in 30 minutes with zero prior experience with DynamoDB.
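For contrast, here is roughly what that 30-minute Lambda looks like; the table and attribute names are placeholders, not the app's actual schema.

```python
# Minimal Lambda handler: record a reported tournament problem in DynamoDB.
# Table name and attributes are illustrative placeholders.
import uuid
import boto3

table = boto3.resource("dynamodb").Table("TournamentProblems")

def lambda_handler(event, context):
    item = {
        "problem_id": str(uuid.uuid4()),
        "tournament": event["tournament"],
        "reported_by": event["official"],
        "description": event["description"],
        "status": "open",
    }
    table.put_item(Item=item)
    return {"statusCode": 200, "body": item["problem_id"]}
```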

What’s up with that?

Getting enough network redundancy at all PSAPs in NG9-1-1

I often hear complaints that in many PSAPs, there isn’t any way to get enough redundancy of network connections to get a reliable ESInet. Often these are rural PSAPs, or larger ones where network diversity wasn’t even considered when siting the facility. I’m here to tell you that I think you CAN get a fair amount of redundancy, and thus availability, but you, and your vendors, have to be creative.

Clearly, if you have fiber coming into the building, then you start there. What you want, and should have as a top 5 priority in siting a new PSAP, is dual fiber entrances, where there are two ducts on opposite sides of the building that have fiber coming in, and no common path or element on those fibers. That’s tough to get in most rural areas, for sure. But one fiber is much better than no fiber.

But then you take a walk around the building and look up. See what kinds of wires you can spot. If there is copper from the local telco, then you want some: whatever high speed service you can get. But what you really want is more than one service, especially if there is a way to get those services from different COs. I'm not a believer in two of the same kind of service from the same vendor. If that's all there is, then that's all you can get, but I think you are better off with different services, and different vendors, or any combination. So, metro Ethernet, yes please; DSL, yes, I'll take it. Lowly T1s, yes, if the price is right. Anything that can get you a megabit or so is good, and more than one is better.

You definitely want a cable modem if at all possible, with enterprise service if you can get it, but whatever you can get from the cable company is great.

You also want as much wireless service as you can get: if 2 or 3 wireless carriers have service that works from your PSAP, get a wireless modem for each of them.

Then, go up on your roof and look out farther: see if there is fiber that is close to, but not in your PSAP. Could you put up point to point radio, microwave or laser, and pipe in some more bandwidth from another source or even just another CO? How far do you have to go to get service from another network supplier? Point to point wireless is getting pretty cheap. You may have some tower access that can help, and even if you need a repeater, that may get you bandwidth when your other connections go down.

I think 8 network connections to every PSAP is definitely not too many, and if there were 4 or 5 different technologies in that 8, that’s good. Sure, you want tens or hundreds of megabits but anything >1 megabit is helpful. One of them might be the “normal” connection that is what gets used most of the time. The others are there for when it dies.
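A back-of-the-envelope calculation shows why the count matters. If each link were independently down 1% of the time, the chance of all of them being down at once drops off fast with the number of links; real links share poles, conduits and operators, so independence is optimistic, which is exactly why the technology and vendor diversity argued for above matters. The 1% figure is illustrative only.

```python
# Illustrative only: probability that every link is down at once, assuming
# independent failures and 1% per-link unavailability. Shared infrastructure
# breaks the independence assumption, which is why diversity matters.
def all_links_down(per_link_unavailability: float, n_links: int) -> float:
    return per_link_unavailability ** n_links

for n in (1, 2, 4, 8):
    p = all_links_down(0.01, n)
    print(f"{n} link(s): effective availability ~ {1 - p:.10f}")
```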

What you may not realize is that IP networks have this very important characteristic: if there exists a path between two points, then regardless of who provides the paths, how tortuous they are, how many routers they go through, who owns the routers, or what technology underlies the IP connection, the path will be found and used. If there is more than one path, the one that appears to be "best" at the time a packet is sent will be used, and the route for one packet can be completely different from the route the very next packet takes. Network operators can, and often do, screw up routing to prohibit this kind of behavior, but they also can be instructed to let IP do its thing, and then you get data flowing if ANY path exists. Without prior planning. Without prior configuration.

You want that. You want lots of paths in the hopes that whomever you need to send packets to or receive packets from, there exists at least one working path, even in a disaster. That's why IP networks tend to be more reliable than other networks when disaster strikes. If a path exists, it will be automatically found and used.

However, as you are no doubt well aware, every network operator has problems, even the best of them. The recent outage at CenturyLink is an example. Many of their services went down at the same time. You should not look at that as a particular shortcoming of CL. CL is a fine network operator. Some version of what happened could and probably will occur on your prime network operator some day. You really, really, really don’t want to be dependent on one vendor to keep your ESInet and your PSAP running.

Now, to be sure, your NGCS vendor may want to run VPN connections on each of your paths to the ESInet. That’s fine. The underlying IP networks will work the way we want them to.

But there is another part of this that I wish the NGCS operators and the originating service providers would work out. If a call comes from one of the networks that you have a connection to, and that network works well enough that packets can flow from the caller to the PSAP, then, no matter what, I want you to be able to answer that call. Most ESInets won't do that. They need a path from the caller to the NGCS and from the NGCS to the PSAP, and they won't allow calls to go direct, even when there is no other path. Most of them insist on putting the ESInet's BCF in the media path, which makes the network connection requirements in disasters even less likely to work. On top of that, the originating service providers often have very restrictive network configurations that wouldn't allow the kind of path I'm advocating, but I think they should.

I'm not a big fan of "sub-IP" mechanisms like MPLS, and I say that as someone who was present at the beginning of MPLS. It's not that I think it's bad; it's actually great. It's that the benefits don't justify the cost. I'd rather see your dollars going into more diversity rather than MPLS. One thing to recognize: in many networks, if you get something like metro Ethernet, that service is often built on top of MPLS. You don't get all the benefits of having dedicated label switched paths, but you get most of them, and metro Ethernet tends to be the least cost way to get bandwidth when there is some decent sized pipe. There is some experience that suggests that at least in some networks, MPLS fails in disasters where basic IP networks keep working. That often has to do with how the MPLS network is configured rather than some inherent defect in the technology.

I expect that most ESInet designers will look at this piece and think I’m nuts. They want to bring two MPLS connections from the same network provider to your PSAP and be done. They would say that trying to deal with 8 different network connections from 7 providers is so much work that it’s not feasible. They would say that each of those connections doesn’t have anywhere near the reliability of their MPLS connection. They will tell you they can’t manage those connections. They would be wrong, of course. Diversity is achievable. Even in the boonies.

Analysis of CenturyLink Dec 2018 outage: Transport Operator/Supplier Diversity is Critical

The FCC has released its report on the December 27, 2018 outage at CenturyLink that affected 9-1-1 service. The problem was a packet storm in a management network that controlled a major part of CL’s optical network. The packets kept multiplying and congesting the system, and since it was in the management network, it was very difficult for CL and its vendor Infinera to deal with.

The FCC report goes into a lot of detail of why the packet storm occurred and how widespread its effects were, what steps were taken, and how it was eventually brought under control. No one knows exactly how it started, but the packets managed to pass all the filters that were supposed to stop this kind of thing from happening. The report makes the point that the feature that failed wasn’t actually used by CL, but the vendor had it enabled by default.

I don't know how to prevent network failures like this from happening. It's a combination of coding errors and other human errors. A packet storm should never be able to sustain itself, especially in a management network. There should have been loop detection. There should never be infinite lifetimes for things like this. There should be shortcuts that let technicians into misbehaving networks no matter what is going on. No features should be on by default if not configured. Lots and lots of examples. But in the end, humans make mistakes. These things happen.

What was avoidable was the consequences of the failures. Lots and lots of systems went down. This happened in the optical network, and lots of transports rode on that common optical network. SS7 networks were affected, packet networks were affected, virtual private networks were affected.

In all of the cases where the consequences were significant, and specifically the 9-1-1 effects, the reason emergency calls were affected was that ALL the paths the services used were on the same optical network. If all your paths are on one network, and that network fails, you are down hard. That’s what happened here, and that is preventable.

This is the real lesson to be learned here. CL is a fine network operator, Infinera is generally well thought of. There is no reason to think that any other network operator, or any other vendor would be better at this kind of thing. They all are dependent on humans, and humans make mistakes. What is wrong is relying entirely on CL, or any other single network operator, for all of your paths.

To get diversity, you need:

  • Physical Diversity (where the actual fiber/cable runs)
  • Operator Diversity (who runs the network)
  • Supplier Diversity (who supplies the systems that underlie the networks)

There have been failures that affect all the switches supplied by one vendor, so if you have multiple paths that are physically diverse, and operator diverse, but all rely on a common set of code, a bug in that code can affect all the paths.

This specific failure shows the effect of both operator diversity and supplier diversity. It used to be that network operators like CL would not allow a single vendor to supply all the equipment in a network – they qualified at least two vendors, and had a reasonable mix of both vendors in the network, just so that a bug like this one did not bring down the whole network. Those days are long gone (anyone remember the AT&T Frame Relay network failure? That was 1998!). Customers of those operators should assume that the ENTIRE network can fail. Here, because the failure was in the optical network, customers were aghast to discover that multiple networks they thought were diverse by virtue of being entirely different technologies (SS7 and packet networks, for example) were all affected, because of the common optical network. And it's not just the optical network that is common in large operators like CL. Management networks are common, fiber bundles are common; there are lots and lots of ways that a single human failure or a single code bug can cause a widespread outage.

So, to me, the root cause of this particular NETWORK failure was coding bugs in the switch complicated by the configuration issue. But the root cause of CALL failures, which is what we and the FCC really care about, was lack of diversity. That was foreseeable, that was preventable, and that is almost universally a critical design fault of 9-1-1 networks, including NG9-1-1 networks today. I don’t know of any ESInets that have sufficient physical, operator and supplier diversity to prevent a catastrophic failure such as this one from happening.

While this problem tends to be most severe when the ILEC is the ESInet operator, it happens pretty much uniformly even when the ESInet operator isn’t a network operator. The ESInet operator tends to partner up with someone who has lots of fiber/IP network capability, and nearly all the paths are from that partner. And since vendor diversity in operator networks is practically non-existent, 9-1-1 Authorities have to assume their ESInet suffers from lack of both operator and supplier diversity, and this can happen to them.

But operator and supplier diversity isn’t all that hard to get. Physical diversity IS hard, both because PSAPs tend to not be in places with sufficiently diverse cable entrances and because suppliers have woefully inadequate documentation on what physical paths the service they are selling takes. But operator and supplier diversity is almost always simple to get and the expense tends to be very small.

How about your ESInet?

And one more thing. It bothers me that it's way too easy for me to criticize an FCC report on a failure. And it's not just the FCC: I've seen reports from other sources about 9-1-1 failures that don't get to the root cause of why the 9-1-1 calls failed. They stop at the proximate cause, not the root cause. They're not asking the Five Whys. I don't think I'm particularly skilled at asking Five Whys. But it does seem obvious to me that they aren't even asking 3.

I invite your comments.

Where do the NG9-1-1 common nodes go?

One of the bigger architectural decisions when designing an #NG9-1-1 implementation is where the NGCS (central services) go, such as the ECRF, LVF, ESRP and bridge. I’ve been consulting on state-wide NG9-1-1 deployments, and have seen proposals from most of the vendors in the space. I see the following choices:

  • State-specific nodes located in state, or in some cases, in an adjacent state
  • Shared nodes in vendor operated data centers
  • Shared nodes in public cloud sites

In the latter two cases, while I believe the implementations I've seen actually do share the functions, it would be easy, and I think slightly better, to have separate instances per state. I've seen a proposal that shared one site between two states, with a second, separate in-state site for each state, but I believe the vendor had separate instances on common hardware for each state in the shared site. I think that's a wise choice.

The advantage to a state-specific instance, whether it's in an in-state site or a public cloud site, is that a failure in one state would not necessarily affect the system in another state. I say "necessarily" because it depends on the failure. In the case of two adjacent states sharing a site, with separate instances, a hardware failure could affect both states. In a public cloud, with separate instances, a failure in the public cloud that affected the region (which is not uncommon) would affect all the states served by that region. Most of the more recent 9-1-1 failures affected more than one state, so this is a very important consideration. I lean heavily towards an in-state solution.

The advantages to shared instances of any flavor are, of course, mostly cost savings, and the price should reflect that. If the data center is a major hub for the vendor, and the vendor maintains staff 24/7 in that data center as well as a full complement of spares, then the Mean Time To Repair (MTTR), which is an important, but often overlooked, component of availability ("nines"), is significantly lower. Sometimes I see one site with on-site technicians, but the other sites are remote, and it takes multiple hours for parts and technicians to arrive. That negates the advantage. State-specific sites most often have techs hours away, and few if any on-site spares. Vendors can fix that, but ... cost.

It's all in the MTTR numbers, which vendors should understand, and customers should inquire about. MTTR in a public cloud site is a complex subject. The hardware, and the effect of a hardware failure on a running system, look to the customer like really, really high availability. However, entire-region failures, which are almost always software issues, are common enough that they tend to wipe out the apparent hardware MTTR advantage.
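For reference, the relationship the MTTR discussion rests on is the standard one: availability = MTBF / (MTBF + MTTR). A quick illustration with made-up numbers shows how much repair time alone moves the needle.

```python
# availability = MTBF / (MTBF + MTTR); the numbers below are illustrative,
# not any vendor's actual figures.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Same failure rate (about one failure a year), different repair times:
print(availability(8760, 0.5))   # on-site techs and spares  -> ~0.99994
print(availability(8760, 6.0))   # parts/techs hours away    -> ~0.99932
```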

The thing that makes me much less comfortable with out-of-state data centers is tromboning of media, and to a lesser extent, signaling. If you have a local caller, on a local originating service provider, calling a local PSAP, the spatial distance between where the call comes from and where it goes (the PSAP) can be a small number of miles. To be sure, mobile networks tend to be more regionalized, so the call may actually be handled a ways away from both ends, but backhauling a local call to a data center multiple states away on its way to a local PSAP is not good.

The effect of tromboning is actually fairly mild. Some extra delay in the media path. It's typically only a few states (maybe a New England call to an Atlanta data center), not all the way across the country, except in extreme circumstances. However, as anyone who has worked with me on NG9-1-1 system design knows, I'm always thinking about how our systems work in disasters. Think Katrina, Loma Prieta, Sandy or 9/11. We know that sometimes we get islands of connectivity. So there might be an IP path working between the caller and the PSAP, but not if it has to go back to Atlanta. I really think at least one close-by data center is a better design choice.

On the other, other hand, having all the data centers in one geographic area means that a single large weather or other natural event can take them all out. That means you really, really want at least one site that is nowhere near you. Yes, that event could take out enough network that it wouldn’t matter anyway, but if you have a robust enough network design, with excellent path, vendor and technology diversity, it might work anyway. IP networks are the most reliable networks in disasters based on experience, if they are designed with diversity in mind. Cost, of course, interferes.

So, ideally you would have two or three data centers in state, one or two across the country, and maybe one in an adjacent state. I have an article coming in NENA’s The Call about availability, but one take-away is that for all the NG9-1-1 system designs I know about, two sites is not enough to get 5 nines. You need more. 5 or 6 is a good number 🙂
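If you are curious where numbers like that come from, the textbook formula for N sites with per-site availability a is 1 - (1 - a)^N, but it assumes independent failures and perfect failover, and real deployments achieve neither; correlated software, provisioning and routing failures (the kind described elsewhere in these posts) are what actually dominate. Treat the sketch below as an optimistic upper bound, which is exactly why more sites than the naive math suggests are needed. The 99.9% per-site figure is illustrative only.

```python
# Idealized N-site redundancy: assumes independent failures and perfect
# failover. Real systems fall short of this, so treat it as an upper bound.
def idealized_availability(per_site: float, n_sites: int) -> float:
    return 1 - (1 - per_site) ** n_sites

for n in (2, 3, 5):
    print(f"{n} sites: {idealized_availability(0.999, n):.8f}")
```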

So where does this leave a state looking to deploy NG9-1-1? Well, as always, it depends. If a vendor proposes a shared system, at least one of its data centers is very close to you, and the cost is significantly lower, then I think that’s a good set of tradeoffs. If there is no significant cost difference, another vendor is offering an in-state solution, and it has a reasonable MTTR plan, then I’d go that direction. Of course this is one of many design decisions that a state would consider when selecting a vendor, so we’re really talking about a score component rather than an actual vendor decision.

Of course, I think there are better ways to do this that address all these issues, but states have to choose among what vendors actually have available.

See my first post to find out more about me, and my point of view.


A bit about me

I will be linking to this post from most of my other posts about #E911 and #NG911. It’s meant to be an introduction to who I am, and where I’m coming from.

I've been around a while. I learned to code in 1965, and I haven't stopped yet, although I consider myself a lousy coder. I can get the job done, but that's about it. I've been a systems architect and hardware/software team leader since around 1978, and I think I'm pretty good at that. I know networking fairly well. I was on the machine room floor when the ARPAnet IMP was winched into Carnegie-Mellon's Computer Science Department. I learned about Ethernet from Bob Metcalfe while I was working at Xerox Palo Alto Research Center (PARC), where I did hardware design on the "D" machines. I've been dealing with network protocol design and implementation since then.

I've been a founder of 5 start-ups, and President/CEO of 3 of them. Nothing hit big, but we made a difference in a few areas. Relevant to what I'll be writing about, two of the start-ups were in the medical devices field. When you create software for medical devices, you learn that while you can't eliminate bugs, you can make sure you don't hurt people. The notion that limiting damage can be more important than correct operation is a bias I carry.

I’ve been working in computer graphics in one form or another nearly my entire professional life. I’ve done hardware design and software for lots of different graphics systems from stroke writers (look it up) to telepresence systems. In the latter, we managed to get a good grounding in high quality user interface design, including human factors research, interaction design, graphics design and usability testing/optimization. I learned that good UI is a science, not just an art and it’s possible to engineer a good UI, test it, and make it work well. I’m appalled at the UI in most of the systems I’ll be writing about.

Very early in my career, I got to go to a seminar on quality assurance that was taught by W. Edwards Deming, his associates and disciples. It was eye opening, and I will never forget listening to him explain what it took to get to high quality results. If you will permit me to relate one of his more memorable examples: he explained “Zero Defects” this way: “How many babies is a delivery nurse allowed to drop?” Does it ever happen? Yes, it does. But what is the requirement? Zero of course. And if it ever does happen, that is a deviation from the standard and requires corrective action. Thus began my fascination with quality.

I did a stint at FORE Systems, which was acquired by Marconi, and was a premier supplier of Asynchronous Transfer Mode (ATM) switches for enterprise, government and Tier 1 ISPs. I was fortunate to get a real education in building 5 nines availability systems at FORE. They employed a skilled reliability engineer who really knew his stuff, and I got to learn what goes into really achieving 5 nines. FORE had (painfully) learned to build such systems and I got there just after they had finally made it work for real. They had the zeal of the recently converted and I got indoctrinated in what it actually takes.

In the last two years, I helped the quality assurance team improve overall quality at Neustar. One of my primary contributions was as a ringleader in our Change Control Board efforts. A couple of us got pretty good at asking the right questions and insisting on meeting the process standards. I did reviews of Root Cause Analysis reports. I also ran a team that built a cloud-based external monitoring system that monitors most of Neustar's many services from outside the company's network and colo sites. So I have recent experience building public cloud-based services. I've also done some work on software development methodology and tools. I'm a big believer in Continuous Integration, Containers and Test Driven Development.

I’ve been a professional standards geek for the last 20 years. My first IETF meeting was IETF 43 in Orlando, back in 1998 and I haven’t missed many since. I started working in Megaco/H.248, and I was the IETF editor for that work (RFC 3015). I then became the co-chair for the sip working group, and I was chair when the base document for SIP, RFC 3261 was published. I was responsible for the ABNF in that document, which, alas, has several errata 🙁 I am presently the co-chair of sipcore, the main SIP protocol working group. I also authored or co-authored many of the IETF documents on emergency calling, which is the basis for Next Generation 9–1–1.

While I was working on SIP, and VoIP in general, we realized that until we made emergency calling (9–1–1, 1–1–2, …) work, VoIP would not take off. About the same time, the 9–1–1 system began receiving calls from VoIP systems. They had just gotten 9–1–1 from mobile phones to work after the mobile operators insisted no one would use a cell phone to call 9–1–1. Today, over 80% of 9–1–1 calls are from mobiles. It was déjà vu all over again. The NENA guys went looking for VoIP expertise and IETF guys went looking for 9–1–1 expertise. We found each other.

There was a meeting arranged by Tom Breen in Atlanta where several IETFers, including me, met with several NENA folks. Out of that meeting came a three part plan:

  • i1: document how the then-current VoIP systems implemented 9–1–1 (badly, if at all)
  • i2: Design and document the right way to support VoIP in the current E9–1–1 system
  • i3: Design an entirely new 9–1–1 system based on Internet Protocol

I was a major contributor to the i2 effort (NENA 08–001), and since its initiation, I have been the chair or co-chair of the i3 working group in NENA, which produces the technical standard for what is called Next Generation 9–1–1. I was the editor for the first version of the standard (NENA 08–003), and I have written a great deal of the text in the current version, NENA-STA-010.2, and the forthcoming Version 3 release, NENA-STA-010.3. I am also a major contributor to several other NENA documents, including the security document NENA-INF-015 and the ESInet design document, NENA-INF-016.2.

Early in my work with NENA, I met Richard Ray and Donna Platt, among others, who work to make sure 9–1–1 works for deaf and hard of hearing persons. I have maintained an active interest in making sure our work significantly improves how we serve this important constituency in emergency calling. As it happened, in my day job at Neustar, we bid on, and won, a contract with the FCC to design, deploy and operate the iTRS Directory, which is a database that allows deaf and hard of hearing persons using Video Relay Service to call each other and to port their telephone numbers from one VRS Provider to another. That service was operated at a 5 nines availability SLA. It had a failure in the first two years where we did not achieve 5 nines, but it has met that standard for 7 years since then. I have excellent relationships with several staffers and officials at the FCC.

I do not consider myself a security expert. I know several, and I’m not one of them. I was at the beanbag lecture by Ron Rivest (the “R” in RSA) where he described public key cryptography for the first time anyone at PARC had heard about it. There were 3 implementations by the next day. I nevertheless frequently find myself the most knowledgeable security guy in the room, which is weird to me. I did nearly all the original security design of Next Generation 9–1–1, although we now have several folks who contribute. Neustar operated one of the largest DDoS mitigation services in the world, and I learned a bit about how that works. So, frequently, I will opine about security in NG9–1–1, but remember, I’m not a real expert.

So, that's me: hardware/software engineer, systems architect, team leader, entrepreneur, graphics/video guy, quality obsessed, standards geek, SIP expert, NG9–1–1 pioneer, and a bit of security. I've designed, built, deployed, sold, maintained and operated dozens of systems. Been there, done that. I retired from Neustar in 2018 and am now consulting, mostly to state governments who are deploying, or planning to deploy, NG9–1–1. I usually work with a larger consulting group for these assignments. I do some other consulting besides that. In my postings, I'm aiming both to get vendors to improve what they are offering, and to help 9–1–1 Authorities ask for the right things. I've always been a vendor, so that's the side I'm coming from. I really do understand what it takes to run a profitable business. But this, like the delivery room Dr. Deming talked about, is 9–1–1. People's lives are at stake. Zero defects and five nines aren't just slogans to me.

You can reach me at br@brianrosen.net. I’m @brian_rosen on Twitter. If you are involved in public safety or emergency communications, join my LinkedIn network.

Lessons Learned – CenturyLink 911 Outage from August 2018

See my post here to learn more about me and my point of view. 

As of this writing, CenturyLink (CL) has recently restored service from a massive outage of their IP network, which affected wireless #E911 service in many areas. We don't know what happened yet. The FCC is investigating, and we will get a report that details what happened. But on August 1 of this year, CL had another outage that affected 9-1-1 calls in several states, including Minnesota. Minnesota has an #NG911 system, so this is one of the first NG9-1-1 outages. The report on this outage is available from the State of Minnesota here. Note that while the document has a CL proprietary notice, it's available on a public website from the State. If you haven't heard about this incident, I suggest you read the CL document, and then come back to this post. I have no information on this incident other than the report and some discussions with some of the Minnesota 9-1-1 people.

The basics are simple: West Corp ("the vendor") fat-fingered a network provisioning operation that broke routing to their Miami node, and prevented fallback to their Englewood CO node, although some calls were (normally) routed through Englewood and were not affected. The outage lasted an hour and 5 minutes, and 693 calls failed in Minnesota. It's not clear from this report how many calls from other states were affected. West finally noticed that the failures started when the provisioning change was done, and rolled back that change. Service was restored quickly (about 5 minutes) after that.

The report notes that this kind of provisioning activity was not at that time considered “maintenance” and thus was not done during the normal maintenance window, which apparently is every night from 10 pm to 6 am.

The corrective actions described are basically changing processes to implement more validation of provisioning changes, breaking changes up into smaller hunks, and improving test procedures when doing that kind of provisioning.

I think West is a good vendor. Probably better than most. It has a long way to go, in my opinion, to address process and system design faults I think this outage exposes. I’d guess most vendors are no better than West in these areas.

So, what can we learn? First of all, I don't know how to prevent people from fat-fingering manual configurations. It happens. So, I don't fault West or their tech for making a mistake like that.

They also were able to revert the change and get the service restored fairly quickly. Now, to be sure, 5 nines is about 5 minutes a year of downtime, so every single minute counts, but they had backout procedures and they worked. That's also good.

This wasn’t a software bug. It wasn’t a hardware failure. This is a system design and operating process problem.

There are several glaring problems not addressed here. This problem will probably happen again, not exactly the same way this one did, but in some similar way, because they did not address the root problems. They did not ask the “Five Whys” that are key to a good Root Cause Analysis (RCA). Or, at least it doesn’t show in this document.

If you don't know, following an outage, there is a meeting held to determine what happened, and how to prevent it from happening again. Running a good Root Cause Analysis meeting is hard. No one likes to admit their shortcomings, and when it's a very visible outage, management tends to focus on blame and not on actually finding and fixing the real root cause. Usually, the outcome of the meeting is a report, like this one, and a set of follow-up corrective actions that should have committed dates and resource commitments.

As an aside, I expect this is not the actual RCA report. It's a sanitized, finance-, sales- and management-approved version of the actual RCA. There are legal and financial reasons why this is true, and I don't think any customer should expect anything different. Unfortunately, while there are legitimate reasons why customers get a sanitized version, it's sometimes the case that things they should know are kept from them because they are embarrassing, or because there is a very low probability that effective corrective action will be taken due to cost considerations. I don't know how to fix that, and I don't know if that happened here. Additionally, CL is the customer of West, and CL probably got a sanitized version of the West RCA. Then CL may have sanitized what they gave to the State.

Why was this manual? In other words, why is the process subject to manual error? One of my frustrations is that network ops people insist on manual configuration via the command line. It's just the way they do things. My response to them is "if fat fingering can take the system down, automate". They don't like that. It's a tool thing; they don't like the available automated provisioning tools. I'd say, tough: manual provisioning is too error-prone.

In fact, my rule of thumb is that if a change procedure takes more than 3 or 4 steps, then much more stringent processes must be in place. The changes, no matter how (in)consequential, have to be approved by a change board in advance (like a week in advance). The document that describes the steps has to be very explicit: exactly what is typed/selected/clicked, what the response looks like exactly. There has to be a set of tests that show the change had the desired effect, and, most importantly, the service(s) that run on the network that changed have to have a comprehensive system check completed following the change. That appears to be far from the process used in this instance, although some of the identified corrective actions cover improvements to their process.

Why was this failure not caught immediately by monitoring? The report says "Vendor became aware of the broader issue when they identified an influx of calls". So their monitoring did not detect a failure. Actually, it explicitly says that because "trunks were available to the Miami ECMC", they didn't know calls were failing to route. They are using an interior monitor (trunk availability) instead of an exterior monitor (calls are failing). Here, there is not only a failure on West's part; CenturyLink was not actively monitoring the system either. Apparently, they expected the origination service providers to notice failures of calls not completing, but that didn't happen. There should be less than 5-minute detection of this kind of problem. Test calls should be automatically generated and automatically traced to both sites. This monitoring failure turned what should have been a 10-15 minute outage into a one hour outage. This system, which should have been 5 nines, isn't even going to make 4 nines this year (the more recent CL failure apparently did affect Minnesota). I'm told that the NOC that serves 9-1-1 was overloaded with service calls, and PSAPs resorted to calling individuals in CL on their cell phones to make them aware of the extent of the problem.
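To make the exterior-monitoring point concrete, here is a sketch. place_test_call() is a placeholder for whatever test-call harness the system actually has (a SIP UA that originates a call toward each site and verifies it is answered and logged); the interval and thresholds are illustrative.

```python
# Black-box monitoring sketch: originate test calls toward each site on a
# schedule and alarm when completions fail. place_test_call() is a placeholder.
import time

SITES = ["miami", "englewood"]      # the two centers in this incident
CHECK_INTERVAL_SECONDS = 60         # keeps detection well under 5 minutes
FAILURES_BEFORE_ALARM = 3           # avoid paging on a single flaky call

def place_test_call(site: str) -> bool:
    raise NotImplementedError("hook up the real test-call harness here")

def alarm(site: str) -> None:
    print(f"ALARM: test calls toward {site} are failing")   # page the NOC

def monitor() -> None:
    consecutive_failures = {site: 0 for site in SITES}
    while True:
        for site in SITES:
            ok = place_test_call(site)
            consecutive_failures[site] = 0 if ok else consecutive_failures[site] + 1
            if consecutive_failures[site] >= FAILURES_BEFORE_ALARM:
                alarm(site)
        time.sleep(CHECK_INTERVAL_SECONDS)
```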

And then we get to the biggie: why did the failure affect the ability to fall back to the alternate site (Englewood)? That should not be possible. There is discussion in the document about remapping cause codes, and it may be that because the wrong cause code was returned, the clients didn’t attempt to try the other site. That seems like a failure in design: it should not be possible for a situation like that to occur. It also seems like a test failure: a good system test would have simulated routing failures and discovered failover didn’t work as planned. What other systems have problems like that which need to be investigated and addressed? What other ways can the system report some failure that causes the alternate site to not be tried? The RCA should have been very explicit about this: a failure of one site didn’t result in the other site getting the calls. In fact, I would say that this is the path to the root cause: anything should be able to go wrong in the Miami center and the Englewood center should have gotten all the traffic. It’s not the fat finger that is the issue, it’s not the change process that’s the issue. It’s the fact that the change in Miami prevented calls from failing over to Englewood. Without knowing more about how this happened, we’re unable to continue asking “Why?” down this path, but that’s what I think is needed more than anything else.

If you look at the report it says the root cause was human error by a vendor. I don’t agree. The root cause appears to me to be related to the inability to fail over. Fat fingers happen, sites go down. Redundancy is supposed to prevent failures in one site from affecting another. The human error was the proximate cause, but the root failure was a system design fault on failover.

That’s how I see it anyway. I invite your comments.