Redundant Carrier Systems
The networks of the major carriers are built as redundant systems, meaning that there is a duplicate or backup system immediately available to overcome outages that may occur virtually anywhere on their networks. While the large carriers offer a high degree of redundancy in their networks, smaller competitors may not. Many competitive local exchange carriers (CLECs), for example, have fiber rings in the cities they serve, which move traffic to and from their regional or national data backbones. But not all CLECs have dual-ring architectures that can route traffic in the other direction if a cable cut occurs. Given the high cost of building resilient networks and the shortage of capital for infrastructure enhancement, telecom and IT managers must factor these considerations into their decision-making when selecting carrier services.
15.2.1 Switching Systems
For voice services and low-speed data, local exchange carriers (LECs) operate central offices, which use voice- and data-switching equipment from vendors such as Lucent Technologies, Nortel Networks, Fujitsu, Ericsson, and Siemens. Typically, the availability goal for these systems, counting failures as well as scheduled and unscheduled maintenance, is 99.999% (five 9s). This works out to about 5 minutes of downtime per year. An exception is Lucent’s 5ESS, which performs at 99.9999% (six 9s) availability. This equates to only about 30 seconds of downtime per year. To achieve these high levels of performance, each switch is equipped with dual processors, so that if one processor fails, the second one can take over automatically. In essence, the switch can be viewed as two computers running simultaneously, with the backup ready to take over the full processing load instantly if a problem is detected.
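The downtime implied by an availability target follows directly from the number of minutes in a year. The short sketch below works through that arithmetic; the function name and the choice of an average 365.25-day year are assumptions made for illustration. Under that assumption, five 9s comes to a little over 5 minutes per year and six 9s to roughly half a minute.

```python
# Assumption: an average year of 365.25 days (leap days included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, percent in [("five 9s", 99.999), ("six 9s", 99.9999)]:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label}: {minutes:.2f} minutes/year ({minutes * 60:.1f} seconds/year)")
```

The same function can be used to compare the availability figures quoted by competing carriers on a common basis.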
The near 100% availability of these switches is also achieved with redundant subsystems and continuous internal testing. If an internal test reveals that one or more of a subsystem’s performance metrics have fallen below an established baseline, the backup subsystem takes over while the primary subsystem undergoes a full suite of diagnostic tests to pinpoint the problem. So even though the primary subsystem is not in service, the availability of the switch is not diminished. The switches themselves are closely monitored by on-site technicians as well as remotely from one or more network operations centers (NOCs).
In selecting the services of a CLEC, however, telecom and IT managers should be aware that these carriers’ switches may not be provisioned in the same way as those of the Regional Bell Operating Companies (RBOCs). Many CLECs did not purchase central office switches in the volumes that would qualify them for discounts. Others lacked the negotiating skills to obtain feature parity with the RBOCs. As a result, they do not always have the redundant subsystems and features to provide their equipment with the highest level of reliability. To make a bad situation worse, some CLECs leveraged the reputation for reliability of their vendor’s equipment, while not actually having a configuration that would provide that level of reliability.
15.2.2 Signal Transfer Points
The carriers also operate signal transfer points (STPs), which are the computers that route network inquiries into their signaling networks. These signaling networks are separate from the networks that carry the voice and data traffic of customers; they are packet-switched data networks that use messages to set up calls and support intelligent services. The STPs are configured as mated pairs with separate processors. The load-balanced STP pairs are not collocated but are usually hundreds of miles away from each other and operate at just under 50% capacity. With this architecture, if something happens to one STP, its mate can pick up the full load and operate until repair or replacement of the damaged STP can be made.
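The reason for keeping each STP below 50% of capacity can be shown with a small check: the surviving mate must be able to carry the combined load of both. The sketch below is illustrative only; the function name and the use of load expressed as a fraction of one STP's capacity are assumptions.

```python
def mate_can_absorb(load_a: float, load_b: float, capacity: float = 1.0) -> bool:
    """Return True if one STP alone could carry the combined load of the mated pair.

    Loads and capacity are expressed in the same units; by default, load is a
    fraction of a single STP's capacity (an assumption for this sketch).
    """
    return load_a + load_b <= capacity


# Both STPs just under 50%: either one can take over the full load.
print(mate_can_absorb(0.45, 0.48))

# Both STPs over 50%: a single failure would overload the survivor.
print(mate_can_absorb(0.60, 0.55))
```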
15.2.3 Network Control Points
NCPs are the customer databases for advanced services such as 800 number routing. The NCP nodes process 800-number call-routing requests received from telephone switches in the carrier’s network. They have dual processors, but if the second processor should fail, there is a backup NCP that is called into operation, thus protecting the customer’s intelligent services information. With several levels of redundancy, there is little chance that customer information regarding services and features will be lost. By way of comparison, AT&T alone has 310 NCPs in its network—50% more than its nearest competitor—enabling it to provide the highest level of redundancy to cope with virtually any disaster scenario.
15.2.4 Digital Interface Frames
Digital interface frames (DIFs) provide access to and from Class 4 central office switches for processing calls. The DIFs that handle this work have spare units available to take over immediately should a problem occur. Guiding the overall work of each DIF are two controllers running simultaneously, so that if one experiences a problem, the backup controller can take over without the customer noticing.
In addition, certain switched business services such as 800, which entail large-quantity egress (traffic flowing off the carrier’s network), can make use of an optional capability. This feature sends a customer’s traffic to another DIF at another switch location if the customer’s primary switch encounters a problem.
15.2.5 Power Systems
Carrier switching systems derive power from the local utility companies. Power lines come into the building to supply alternating current (ac) to redundant rectifiers, which convert it to direct current (dc) and distribute that power to the switching equipment and battery banks. If commercial power fails, the batteries, which are kept charged by the rectifiers, provide backup power. The power levels of the batteries are monitored to ensure readiness in case a commercial power outage should occur. An additional stage of redundancy is provided by diesel-fueled generators, which can replace commercial power for days or weeks at a time.
15.2.6 Cable, Building, and Signaling Diversity
A carrier’s network facilities (i.e., cable routes) are built as a series of circles or loops that touch one another to form an interconnecting grid. Should any particular loop be cut, such as by a backhoe operator hitting a fiber cable, a fair amount of traffic can be sent over one or more adjacent fiber loops. Construction of new facilities in recent years has focused on making these loops smaller and smaller to reduce the magnitude of problems when they occur.
Generally, in larger metropolitan areas, carriers are able to offer business customers building diversity. By being able to reach the carrier’s network at two distinct geographic locations, business customers can enhance reliability for their high-capacity switched and/or special services applications.
This is how STPs are protected as well. Each pair of STPs is connected to every other pair of STPs by multiple data links. To ensure that connectivity will always be available, these links are established through three geographically separated routes. Should something happen on one route to disrupt signaling message transfer, the other two routes remain available to keep the carrier’s signaling system operational.
Within each central office switch, there is a device that permits the switch to interface with the carrier’s common channel signaling system to send and accept information that is used to set up and deliver calls. Should this interface device malfunction, the switch can use special data links that are directly connected to one or more “helper” switches to gain access to the signaling network via their interface. In this manner, central office switches can continue to process long-distance calls while a repair is made. AT&T calls this backup signaling capability the alternate signaling transport network (ASTN).
The central office switches can also make use of ASTN should both halves of a mated pair of STPs fail. Each switch normally uses a particular mated pair of STPs to handle call setup. If something happens to the STP pair on which the switch normally relies, the switch can use ASTN to access the signaling network through helper switches that use a different STP pair.
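The fallback order described in this section can be sketched as a small decision routine: a switch uses its own signaling interface and its normal STP pair when both are healthy, and otherwise reaches the signaling network through a helper switch that relies on a different, working STP pair. This is a simplified model, not AT&T's implementation; the `Switch` class, its fields, and the `signaling_path` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Switch:
    """Hypothetical model of a central office switch for this sketch."""
    name: str
    stp_pair: str                      # the mated STP pair this switch normally uses
    interface_ok: bool = True          # state of its own signaling interface device
    helpers: List["Switch"] = field(default_factory=list)


def signaling_path(switch: Switch, failed_stp_pairs: Set[str]) -> Optional[List[str]]:
    """Return the hops used to reach the signaling network, or None if unreachable."""
    # Normal case: own interface and own STP pair are both available.
    if switch.interface_ok and switch.stp_pair not in failed_stp_pairs:
        return [switch.name, switch.stp_pair]
    # Backup case: go through a helper switch whose interface and STP pair work.
    for helper in switch.helpers:
        if helper.interface_ok and helper.stp_pair not in failed_stp_pairs:
            return [switch.name, helper.name, helper.stp_pair]
    return None


helper = Switch("helper-1", "STP-pair-B")
primary = Switch("switch-1", "STP-pair-A", helpers=[helper])

print(signaling_path(primary, set()))            # normal path
print(signaling_path(primary, {"STP-pair-A"}))   # both halves of the usual pair have failed
```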
15.2.7 Real-Time Network Routing
Some of the larger carriers employ very sophisticated call-routing schemes. The network of switches belonging to AT&T, for example, routes calls through a system known as real-time network routing (RTNR). This software system enables every switch within the AT&T network to know the available resource capacity of every other switch in the network on a real-time basis. Since AT&T has more than 130 switches in its U.S. network, customer calls will have more than 130 ways to be routed across the network. This path diversity enables high call-completion rates despite regional congestion and resource constraints. Together, the redundancy and alternate-routing features built into AT&T’s network, and into other public networks, make it possible to offer customers special restoration services.
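The idea of routing on real-time capacity information can be illustrated with a toy route selector. The text does not describe RTNR's actual selection rule, so the rule used here is an assumption: try the direct link first, and otherwise pick the intermediate switch whose two links have the most idle trunks at the bottleneck. The function name and the `idle_trunks` data layout are likewise hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional


def choose_route(origin: str, destination: str,
                 idle_trunks: Dict[FrozenSet[str], int]) -> Optional[List[str]]:
    """Pick a direct or two-link route based on currently idle trunks per link."""

    def idle(a: str, b: str) -> int:
        return idle_trunks.get(frozenset((a, b)), 0)

    # Assumed rule 1: use the direct link whenever it has idle capacity.
    if idle(origin, destination) > 0:
        return [origin, destination]

    # Assumed rule 2: otherwise choose the via switch with the largest bottleneck.
    switches = {name for link in idle_trunks for name in link}
    best_via, best_idle = None, 0
    for via in sorted(switches - {origin, destination}):
        bottleneck = min(idle(origin, via), idle(via, destination))
        if bottleneck > best_idle:
            best_via, best_idle = via, bottleneck
    return [origin, best_via, destination] if best_via is not None else None


status = {
    frozenset(("A", "B")): 0,   # direct link is fully busy
    frozenset(("A", "C")): 5,
    frozenset(("C", "B")): 2,
    frozenset(("A", "D")): 4,
    frozenset(("D", "B")): 3,
}
print(choose_route("A", "B", status))
```

With more than 130 switches, the set of candidate via switches grows accordingly, which is where the path diversity described above comes from.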
Carrier Restoration Services
For most customers, a 5-minute service interruption is within tolerable limits. But other customers, such as businesses that rely extensively on inbound or outbound calling, need a much shorter restoration period. For these customers, carriers offer optional services that can be used to meet individual reliability requirements. These can range from the carrier planning and building a complete private network, to customers using one or more optional reliability features to meet highly specific needs. Of the more than 1,200 carriers operating in the United States today, none has a more comprehensive set of restoration services than AT&T. The following offerings, however, can also be used to make comparisons with other carriers:
- Split-access flexible egress routing (SAFER). For users of dedicated access facilities, SAFER provides a backup mechanism to protect the exit ramp of an organization’s toll-free service to ensure consistent and reliable access to its customers. SAFER protects against network congestion, access facility failure, or a service disruption at the AT&T switch. SAFER can also be used to protect a business from a failed or busy T1 facility. If toll-free calls cannot complete through the normal terminating network switch, SAFER redirects these calls through an alternate switch in the AT&T network. This gives callers an alternate route to the business location. This mechanism is automatically activated in near real time whenever it is needed.
- Alternate destination call routing (ADCR). For customers with toll-free operations in more than one location, ADCR allows the AT&T switch that normally carries the calls to the company’s location to route incoming calls to another business location automatically when a problem arises. For example, if a company’s ACD at the main location is unavailable or too busy with calls, any additional calls would be forwarded automatically to an alternate location. Calls could be directed either through the original AT&T switch or through an alternate switch, thus protecting against disruptions in AT&T switches, local exchange switches, or customer equipment.
- Network protection capability (NPC). For digital service customers, the optional NPC provides a geographically diverse backup facility and will usually switch traffic to this backup route within 20 ms of a service interruption. When the service is fully restored, the NPC automatically routes traffic according to the original configuration. The backup and restoration processes occur so rapidly that the customer will not notice any disruption to service. If data is in transit during the configuration change, no data will be lost.
- Enhanced diversity routing option (EDRO). To protect a business from service disruptions in the event of a cable cut or natural disaster, EDRO provides customers with a documented physical and electrical circuit diversity program. As part of EDRO, AT&T designs and maintains physically separate paths through its network to eliminate common points of failure between circuits. Under this option, diverse circuits are separated by at least 100 feet and avoid common AT&T buildings to further reduce the possibility of a common point of failure.
- Access protection capability (APC). To protect the access portion of a customer’s circuit, APC provides immediate recovery of access circuits from certain network failures by automatically transferring service to a dedicated, separately routed access circuit.
- Customer controlled reconfiguration (CCR). This service is available in conjunction with AT&T’s digital access and cross-connect system (DACS). CCR offers a means to route around failed facilities. The DACS is not a switch (PBX) that can be used for setting up calls in real time or for performing alternate routing on a dynamic basis; it is simply a static routing device. Originally designed to automate circuit provisioning, so that a carrier’s technician would not have to manually patch the customer’s derived 64-Kbps DS0 channels to designated long-haul transport facilities, the DACS allows CCR subscribers to organize and manage their own circuits from an on-premises management terminal. Any changes will take from a few minutes to a half-hour to implement because they must be uploaded to the carrier’s network before they take effect.
- Bandwidth manager services. For data services, AT&T offers network managers the capability of fine-tuning their WAN to handle dynamic application requirements, such as LAN interconnection, videoconferencing, and traffic-load balancing. In addition, bandwidth management services can be used to automatically restore dedicated private-line circuits or redirect private-line and frame relay service circuits to a backup location in the event of a circuit failure and/or a disaster at the primary site.
- T1.5 reserved service. This service supports applications requiring T1 (1.544 Mbps) speeds. AT&T brings a dedicated T1 facility on-line only after the customer requests it by phone. This restoration solution requires that the customer pre-subscribe to the service and that local-access facilities already be in place.
Fiber Network Restoration
The major carriers in the United States operate extensive fiber networks and have implemented architectures with sophisticated protection mechanisms to ensure uninterrupted voice and data services. Many of the smaller carriers with fiber networks, on the other hand, ran out of investment capital while expanding the reach of their backbones and did not have time to build much redundancy into their networks. Companies with mission-critical applications, therefore, should exercise due diligence when considering such carriers.
AT&T is one carrier that provides extensive restoration services for its fiber network. The carrier’s Fast Automatic Restoration (FASTAR) system provides automated facilities restoration for all types of services (special services and switched) traveling over AT&T’s fiber-optic transmission systems. FASTAR is designed to restore 90% to 95% of network circuits within 2 minutes, while FASTAR II is capable of rerouting circuits within 60 ms of a failure.
Specifically, FASTAR is a routing algorithm used to instruct DACS systems to reroute traffic around failed or congested routes. In the event of a fiber-optic cable cut in the core network, FASTAR automatically locates the exact site of the cut and transfers the affected circuits to spare capacity going around the cut. In this way, 72 T3 circuits can be rerouted by FASTAR within 5 minutes. Before FASTAR was introduced in 1992, T3 circuits had to be rerouted manually at patch panels, which could take hours.
When a facilities problem occurs, such as a cable cut, the following activities are typically performed by the FASTAR system:
- The problem is identified;
- The exact location of the problem is determined;
- The amount and location of currently available protection or backup/spare facilities is determined;
- A substitute route is constructed from the available spare facilities;
- The substitute route is tested to ensure it is operational and of high quality;
- The traffic on the damaged route is moved to the substitute route.
The FASTAR system goes through all of the facilities restoration steps outlined above, but at computer speed and on a fully automated basis.
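The middle steps of this sequence (taking stock of spare facilities and building a substitute route around the cut) can be sketched as a search over the links that still have enough spare capacity. This is a generic fewest-hops search used for illustration, not the FASTAR algorithm itself, whose internals are not described here. The function name, the `spare` data layout, and the node names are all hypothetical, and the testing and traffic-transfer steps are not modeled.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def find_restoration_route(spare: Dict[Tuple[str, str], int], cut: Tuple[str, str],
                           src: str, dst: str, needed: int) -> Optional[List[str]]:
    """Find a fewest-hops route from src to dst that avoids the cut span and
    uses only links with at least `needed` spare circuits."""
    failed = frozenset(cut)

    # Build the set of usable links: not the cut span, and enough spare capacity.
    adjacency: Dict[str, List[str]] = {}
    for (a, b), capacity in spare.items():
        if frozenset((a, b)) == failed or capacity < needed:
            continue
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    # Breadth-first search from src, remembering how each node was reached.
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            route = []
            while node is not None:
                route.append(node)
                node = previous[node]
            return route[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # no substitute route with sufficient spare capacity


# Spare T3 circuits per span (illustrative numbers only).
spare_t3 = {("A", "B"): 10, ("A", "C"): 80, ("C", "D"): 75,
            ("D", "B"): 90, ("A", "D"): 20}
print(find_restoration_route(spare_t3, cut=("A", "B"), src="A", dst="B", needed=72))
```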
FASTAR II operates in conjunction with a network consisting of more than 50 double-interconnected SONET rings with ATM switching at crossover points to create a ring and mesh architecture for data services. Using overlapping, self-healing rings, FASTAR II can restore certain types of network failures, such as simple cable cuts, in milliseconds. With this type of outage, clients often do not even notice that traffic was interrupted.
Many other carriers also use SONET for disaster recovery, including RBOCs and CLECs. Fiber is deployed in redundant rings around major metropolitan areas and high-traffic corridors between major cities. SONET fiber facilities are typically configured in a dual counter-rotating ring topology, as illustrated in Figure 15.1. This topology makes use of self-healing mechanisms in SONET-compliant equipment [i.e., add-drop multiplexers (ADMs)] to ensure the highest degree of network availability and reliability. In the event of a break in the line, traffic is automatically switched from one ring to the other, thus maintaining the integrity of the network. In the unlikely event that both the primary and secondary lines fail, the SONET-compliant equipment adjacent to the failures automatically loops the data between rings, thus forming a new C-shaped ring from the operational portions of the original two rings. When the break is fixed, the network automatically returns to its original state.
Figure 15.1: Self-healing SONET-compliant fiber-ring topology. In this scenario, if the inner ring is cut or fails, traffic is rerouted in the opposite direction on the outer ring. The SONET equipment at node D changes the direction of the traffic.
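The rerouting behavior of a dual counter-rotating ring can be sketched with a simple model in which traffic normally travels in one direction around the ring and is sent the opposite way when a span on its working path is cut. The function name, the node labels, and the treatment of a cut as affecting one span between two adjacent nodes are assumptions made for illustration; the sketch assumes a ring of at least three nodes and does not model the protocol used by real ADMs.

```python
from typing import List, Optional, Tuple


def ring_path(nodes: List[str], src: str, dst: str,
              cut_span: Optional[Tuple[str, str]] = None) -> List[str]:
    """Return the node sequence from src to dst on a ring of three or more nodes.

    `nodes` lists the ring in working-direction order. If `cut_span` (a pair of
    adjacent nodes) lies on the working path, the opposite direction is used.
    """
    n = len(nodes)
    start, end = nodes.index(src), nodes.index(dst)

    def walk(step: int) -> List[str]:
        path, k = [nodes[start]], start
        while k != end:
            k = (k + step) % n
            path.append(nodes[k])
        return path

    def uses(path: List[str], span: Tuple[str, str]) -> bool:
        return any({a, b} == set(span) for a, b in zip(path, path[1:]))

    working = walk(+1)
    if cut_span is None or not uses(working, cut_span):
        return working
    # The cut is on the working path: send traffic the other way around.
    return walk(-1)


ring = ["A", "B", "C", "D"]
print(ring_path(ring, "A", "C"))                       # working direction
print(ring_path(ring, "A", "C", cut_span=("B", "C")))  # rerouted the other way
```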
SONET’s embedded management channels give carriers and users alike more capabilities for continuous monitoring and for taking preemptive corrective action on impending trouble conditions. In private SONET networks, network managers can reconfigure channels and facilities without the involvement of telephone companies. Through software programming, it is even possible to map SONET circuits so that they can be automatically rerouted to alternate carrier facilities should a failure occur on the primary circuit(s).
A new generation of optical systems has become available that offers much more bandwidth than SONET. Wavelength division multiplexing (WDM) uses different wavelengths (colors) of light as separate high-speed channels. Each channel can support a particular service, such as T-carrier for private lines, Gigabit Ethernet for LAN interconnectivity, Fibre Channel for storage-area networking, or ESCON for IBM mainframe connectivity. WDM-equipped fiber links can also transport SONET payloads. The WDM systems transparently carry SONET’s embedded overhead channels, which perform link supervision and gather performance statistics, so SONET’s fault-recovery procedures operate as normal to ensure network availability. With a 50-ms recovery time, WDM matches the recovery performance of SONET in case of link failures, allowing both technologies to play complementary roles. And with embedded supervisory channels, WDM systems can report on a number of performance metrics to help diagnose problems with individual channels, as well as with the fiber link.