Pipeline Publishing, Volume 3, Issue 11
Carrier Grade: The Myth and the Reality of Five Nines

But consider the “bad apple” example: if one system crashes for 24 hours while all the others run without interruption for the month, that 24-hour outage may be averaged across a large base of installed units. When the installed base is large enough, the result still falls within the required five nines of uptime. More specifically, if a service provider installs a product described as “carrier grade” or as providing five-nines availability, should it expect that standard of every single network element, or should it expect that some elements may perform at a much degraded level, with only a worldwide “law of large numbers” used to measure carrier grade? The problem is not merely that one bad apple in a group of network elements can skew the overall numbers; real customers and real traffic are affected by that “bad apple” element. Far beyond any theoretical measure, the effect on those customers can be severe: certainly outside the guarantees of any customer SLA, almost certainly extracting a penalty payment from the carrier, and likely attracting a fine from the regulator as well. As we shall see later, this problem of “one versus many” is being addressed by several groups.
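To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python; the unit counts, the 720-hour month, and the 24-hour outage are illustrative assumptions, not figures from any vendor report.

# Back-of-the-envelope: how one "bad apple" disappears into a fleet
# average. All figures here are illustrative assumptions.

HOURS_PER_MONTH = 30 * 24  # 720 hours

def fleet_availability(units: int, outage_hours: float) -> float:
    """Average availability when one unit is down for outage_hours
    and the remaining units run uninterrupted for the month."""
    total_hours = units * HOURS_PER_MONTH
    return (total_hours - outage_hours) / total_hours

for units in (10, 100, 1000, 10000):
    a = fleet_availability(units, outage_hours=24)
    print(f"{units:>6} units: fleet average = {a:.7%}")

# The crashed unit itself delivered only (720 - 24) / 720, about 96.7%,
# nowhere near five nines. Yet with roughly 10,000 units installed, the
# fleet-wide average comfortably exceeds 99.999%.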

The bottom line is that units will fail, so five-nines hardware availability is really a design game of building systems that are always covered, that is, redundant. This leads to the next telecom rule of thumb: no “single point of failure” (SPOF) shall exist.
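The standard availability algebra behind that rule can be sketched in a few lines; this assumes independent failures and instantaneous, perfect failover, which real switchover mechanics never quite achieve.

# Sketch of the availability algebra behind the SPOF rule, assuming
# independent failures and instantaneous, perfect failover (real
# switchover time would lower these numbers).

def parallel(a: float, n: int = 2) -> float:
    """Availability of n redundant units; service fails only if all n fail."""
    return 1.0 - (1.0 - a) ** n

unit = 0.999  # a single unit at three nines
print(f"one unit:       {unit:.6%}")
print(f"redundant pair: {parallel(unit):.6%}")  # 99.9999%, six nines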

Supplying Redundancy

Redundancy is the addition of information, resources, or time beyond what is needed for normal system operation, and both hardware and software can be made redundant. Hardware redundancy is the addition of extra hardware, usually as backups or failovers or for tolerating faults. Software redundancy is the addition of extra software, beyond the baseline feature implementation, used to detect and react to faults. Information redundancy is the addition of extra information beyond that required to implement a given function; it includes configuration and data duplication, replication, and backup databases. Hardware and software work together, using this information, to provide redundancy: typically software monitors for abnormalities and initiates the configuration changes necessary to switch service to backup hardware, servers, and standby software programs.
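As a concrete illustration of that monitor-and-switch pattern, here is a minimal Python sketch of a heartbeat monitor promoting a standby unit. The class and method names are invented for illustration; a production system would add hysteresis, state replication, and fencing of the failed unit.

import time

HEARTBEAT_TIMEOUT = 3.0  # seconds; illustrative threshold

class Unit:
    """Stand-in for an active or standby network element."""
    def __init__(self, name: str):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def beat(self) -> None:
        self.last_heartbeat = time.monotonic()

class FailoverMonitor:
    """Software redundancy: watch the active unit, switch to the standby."""
    def __init__(self, active: Unit, standby: Unit):
        self.active = active
        self.standby = standby

    def check(self) -> None:
        silence = time.monotonic() - self.active.last_heartbeat
        if silence > HEARTBEAT_TIMEOUT:
            # Information redundancy (replicated config and state) is
            # what makes this switch transparent to traffic.
            print(f"{self.active.name} silent {silence:.1f}s; "
                  f"promoting {self.standby.name}")
            self.active, self.standby = self.standby, self.active

monitor = FailoverMonitor(Unit("card-A"), Unit("card-B"))
monitor.check()  # card-A healthy: no action
time.sleep(3.1)  # simulate card-A missing its heartbeats
monitor.check()  # card-A silent; card-B promoted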

IBM says of SPOF: “A single point of failure exists when a critical function is provided by a single component. If that component fails, the system has no other way to provide that function and essential services become unavailable. The key facet of a highly available system is its ability to detect and respond to changes that could impair essential services.”

SPOF is defined against an interacting system: one establishes the level and layer of redundancy in the system design. But systems can be decomposed into smaller sub-groups and eventually into units. How far down does the SPOF rule apply? Is it like a fractal, holding its pattern no matter how small a piece you choose to examine? Practically, a base level is reached beyond which failures are not automatically compensated for. This last level is called the “high-availability” unit. Both hardware and software can be built and marketed as “high availability”, a term telecommunications vendors use consistently to describe their products, perhaps even more often than the umbrella term “carrier grade”.

But building a system entirely of high-availability units will not by itself guarantee that the system dynamics remain carrier grade. Systems can shift performance even while each individual unit stays within its tolerance, and during the time a high-availability unit performs a switchover, the larger system may react to the momentary deviation from expected performance. So the corollary to the SPOF rule is: install multi-layered redundancy in the network system as a whole. This does not excuse bad design of a failure-prone network or software element, but it does shield the product and the customer from that “bad apple”.
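A simple composition calculation shows why a chain of in-spec units can still fall short end to end; this sketch uses the textbook series-availability product with illustrative figures.

# Series availability: a service traversing N units in sequence is up
# only when every unit is up. Figures below are illustrative.

def series(availabilities: list[float]) -> float:
    product = 1.0
    for a in availabilities:
        product *= a
    return product

# Ten units, each individually meeting five nines (99.999%)...
chain = [0.99999] * 10
print(f"end-to-end: {series(chain):.6%}")  # about 99.99%: one nine lost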

Network and System Dynamics

We maintain that layering redundancy demands considerable design scope. In practice, network design at service providers is frequently compartmentalized, with one technology group not knowing what another does. This is especially true for vendors specializing in just one technology or compartmentalizing their product teams. The classic example is ATM and SONET. Many networks were built with ATM riding over SONET. SONET rings have very fast failover restoration, but ATM was also designed with restoration of its own, implemented through signaling and routing adjustments. When a transport-layer error was detected, SONET switched fast; meanwhile the ATM network layer began reacting to the outage notification to adjust itself. It might switch unnecessarily, or it might have to react to a routing link-weight model that was no longer appropriate given the changed SONET path.
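One common remedy for exactly this race, which we sketch here only as an assumption rather than as anything the vendors above prescribed, is a hold-off timer: the upper layer waits long enough for the lower layer to restore before triggering its own rerouting.

# Hold-off timer sketch: give the transport layer (SONET) a chance to
# restore before the network layer (ATM) reroutes. Timer values are
# illustrative; real deployments tune them per layer.

SONET_RESTORE_MS = 50    # typical SONET ring protection target
ATM_HOLDOFF_MS = 100     # assumed hold-off before ATM reacts

def handle_failure(restored_after_ms: float) -> str:
    """Decide which layer ends up acting on a transport fault."""
    if restored_after_ms <= ATM_HOLDOFF_MS:
        return "SONET restored within hold-off; ATM takes no action"
    return "hold-off expired; ATM signaling/routing reroutes"

print(handle_failure(SONET_RESTORE_MS))       # lower layer wins the race
print(handle_failure(restored_after_ms=500))  # genuine transport outage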

