Sunday, December 9, 2007

LAN Restoration Planning

Network Reliability

A network or resource is reliable when it continues to operate despite the failure of a critical element. The critical elements are different for each network topology: star, ring, and bus. Thus, each topology can be evaluated in terms of its reliability, as well as its suitability for specific applications.

3.2.1 Star Topology

When it comes to link availability, the star topology is highly reliable. In the star topology, all network devices (i.e., nodes) or LAN segments connect to a central hub. Although the loss of a link prevents communication between the hub and the affected node, all other nodes will continue to operate as before unless the hub itself suffers a catastrophic failure.

To ensure a high degree of reliability, the hub has redundant subsystems at critical points: the control logic, backplane, and power supply. The hub’s management system can enhance the fault tolerance of these redundant subsystems by continuously monitoring their operation and reporting any anomalies. With the power supply, for example, monitoring may include hotspot detection and checks on fan operation to detect trouble before it disrupts hub operation. Upon the failure of the main power supply, the redundant unit takes over, either automatically or manually under the network manager’s control, without disrupting the network.
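The benefit of a redundant subsystem can be estimated with the standard parallel-availability formula. The sketch below is illustrative only: it assumes the redundant units fail independently and that switchover is instantaneous, and the function name and sample figures are hypothetical rather than taken from any hub specification.

```python
def redundant_availability(unit_availability: float, units: int = 2) -> float:
    """Availability of a subsystem built from identical redundant units.

    Assumes independent failures: the subsystem is down only when
    every unit is down at the same time.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    # Hypothetical power supply that is available 99% of the time.
    for units in (1, 2):
        print(units, "unit(s):", redundant_availability(0.99, units))
```

Under these assumptions, adding a second unit turns the chance of an outage from the failure probability of one unit into the square of that probability.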

The flexibility of the hub architecture lends itself to varying degrees of fault tolerance, depending on the criticality of the applications. For example, workstations running non-critical applications may share a link to the same LAN module at the hub. Although this configuration might seem economical, it is disadvantageous in that a failure in the LAN module will put all of the workstations on that link out of commission. A slightly higher degree of fault tolerance may be achieved by distributing the workstations among two LAN modules and links. That way, the failure of one module would affect only half the number of workstations. A one-to-one correspondence of workstations to modules offers an even greater level of fault tolerance, because the failure of one module impacts only the workstation connected to it. However, this configuration is also a more expensive solution than the others.
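The trade-off between the number of LAN modules and the impact of a single module failure can be made concrete with a little arithmetic. The following is a minimal sketch, assuming workstations are spread as evenly as possible across modules; the function name and the figure of eight workstations are hypothetical.

```python
def workstations_lost(total_workstations: int, modules: int) -> int:
    """Worst-case number of workstations cut off when one LAN module fails.

    Assumes workstations are distributed as evenly as possible, so the
    worst case is the failure of the most heavily loaded module.
    """
    if total_workstations < 0:
        raise ValueError("total_workstations must not be negative")
    if modules < 1:
        raise ValueError("modules must be at least 1")
    # Ceiling division: the fullest module carries this many workstations.
    return -(-total_workstations // modules)


if __name__ == "__main__":
    total = 8  # hypothetical workgroup size
    for modules in (1, 2, total):
        lost = workstations_lost(total, modules)
        print(f"{modules} module(s): up to {lost} of {total} workstations affected")
```

With one shared module every workstation is exposed, with two modules half of them are, and with one module per workstation only a single workstation is affected, which mirrors the three configurations described above.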

A critical application may demand the highest level of fault tolerance. This can be achieved by connecting the workstation to two LAN modules at the hub with separate links. The ultimate in fault tolerance would be achieved by connecting one of those links to a different hub. In this arrangement, a transceiver is used to split the links from the application’s host computer, enabling each link to connect with a different module in the hub or to a different hub. All of these levels of fault tolerance are summarized in



3.2.2 Ring Topology

In its pure form, the ring topology offers poor resilience to both node and link failures. The ring uses link segments to connect adjacent nodes together. Each node is actively involved in the transmissions of other nodes through token passing. Each node receives the token in turn, at which time it can transmit data before passing the token to the adjacent node. The loss of a link not only results in the loss of a node but brings down the entire network as well. Enhancing the reliability of the ring topology requires adding redundant links between nodes as well as bypass circuitry. Adding such components, however, makes implementing the ring topology more expensive.

3.2.3 Bus Topology

The bus topology also provides poor reliability. If the link fails, that entire segment of the network is rendered useless. A redundant link for each segment will increase the reliability of the bus topology, but at extra cost. Unlike the ring topology, where each node is dependent on the others adjacent to it, the nodes in a bus topology are independent and contend for access to the LAN. If a node fails, the rest of the network continues to operate.
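The single-failure behavior described for the star, ring, and bus topologies can be summarized in a small lookup function. This is only a sketch of the rules stated in the text, assuming a single segment in its pure form with no redundant links, bypass circuitry, or redundant hub subsystems; the function and argument names are hypothetical.

```python
def nodes_still_operating(topology: str, node_count: int, failure: str) -> int:
    """Number of nodes that can still communicate after one failure.

    topology: "star", "ring", or "bus" (pure form, no redundancy).
    failure:  "link", "node", or, for the star only, "hub".
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if topology == "star":
        if failure == "hub":
            return 0                      # catastrophic hub failure
        if failure in ("link", "node"):
            return node_count - 1         # only the affected node is lost
    elif topology == "ring":
        if failure in ("link", "node"):
            return 0                      # the token can no longer circulate
    elif topology == "bus":
        if failure == "link":
            return 0                      # the whole segment is unusable
        if failure == "node":
            return node_count - 1         # the other nodes keep contending
    raise ValueError(f"unsupported combination: {topology!r}, {failure!r}")


if __name__ == "__main__":
    for topology in ("star", "ring", "bus"):
        for failure in ("link", "node"):
            left = nodes_still_operating(topology, 10, failure)
            print(f"{topology:4} {failure:4} failure: {left} of 10 nodes operating")
```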

Network Availability

Availability is a measure of performance dealing with the LAN’s ability to support all users who wish to access it. A network that is highly available provides services immediately to users, whereas a network that suffers from low availability typically forces users to wait for access. The topology of the LAN influences availability.

Availability on the bus topology depends on the load, the access control protocol used, and the length of the bus. With a light load, availability is virtually assured for any user who wishes to access the network. As the load increases, however, so does the chance of collisions. When a collision occurs, the transmitting nodes back off and try again after a short interval. The chance of collisions also increases with bus length.
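The text does not say how the back-off interval is chosen. As an assumption, the sketch below uses the truncated binary exponential backoff commonly associated with CSMA/CD Ethernet, in which the random wait doubles in range after each successive collision. The slot time shown is the value usually quoted for 10 Mbps Ethernet, the function name is hypothetical, and the limit on retransmission attempts is not modeled.

```python
import random

SLOT_TIME_US = 51.2   # assumed slot time in microseconds (10 Mbps Ethernet)
MAX_EXPONENT = 10     # the back-off range stops growing after this many collisions


def backoff_delay_us(collision_count: int, rng=random) -> float:
    """Random wait, in microseconds, after the given number of collisions.

    The node waits a whole number of slot times chosen uniformly from
    0 .. 2**k - 1, where k is the collision count capped at MAX_EXPONENT.
    """
    if collision_count < 1:
        raise ValueError("collision_count must be at least 1")
    exponent = min(collision_count, MAX_EXPONENT)
    slots = rng.randrange(2 ** exponent)
    return slots * SLOT_TIME_US


if __name__ == "__main__":
    for collisions in (1, 2, 3, 10):
        print(f"after collision {collisions}: wait {backoff_delay_us(collisions):.1f} us")
```

Because the range of possible waits widens with each collision, heavily loaded buses spend progressively more time backing off, which is one way load reduces availability.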

A network based on a star topology can only support what the central hub can handle. In any case, each LAN module in the hub can handle only one request at a time, which can impact other users on that segment during heavy load conditions. Hubs equipped with multiple processors and LAN modules can alleviate this situation somewhat, but even with multiple processors, there will not usually be a one-to-one correspondence between users and processors. Such a system would be cost-prohibitive.

In terms of network availability, the ring topology scores higher than either the bus or star topology. This is because each node on the ring has an equal chance at accessing the network, which is governed by the token. However, since each node on the ring must wait for the token before transmitting data, the time interval allotted for transmission decreases as the number of nodes on the ring increases.
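The effect of ring size on access can be illustrated with a simplified model. The sketch below assumes every node holds the token for the same fixed time and that every other node has data to send; the function names, the per-hop latency parameter, and the sample values are hypothetical and do not describe any particular token-passing standard.

```python
def worst_case_token_wait_ms(nodes: int, holding_time_ms: float,
                             hop_latency_ms: float = 0.0) -> float:
    """Longest time a node waits for the token to come back around.

    Assumes all other nodes use their full holding time and that the
    token crosses every link of the ring once per rotation.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return (nodes - 1) * holding_time_ms + nodes * hop_latency_ms


def transmission_share(nodes: int) -> float:
    """Fraction of ring transmission time each node gets when all are busy."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 / nodes


if __name__ == "__main__":
    for nodes in (5, 20, 50):
        wait = worst_case_token_wait_ms(nodes, holding_time_ms=10.0)
        share = transmission_share(nodes)
        print(f"{nodes} nodes: wait up to {wait} ms, share {share:.1%}")
```

In this model access stays fair and bounded, but both the worst-case wait and each node's share of transmission time degrade linearly as nodes are added.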

Recovery Options

The LAN is a data-intensive environment requiring special precautions to safeguard one of the organization’s most valuable assets—information. The procedural aspect of minimizing data loss entails the implementation of manual or automated methods for backing up all data on the LAN to avoid the tedious and costly process of recreating vast amounts of information. The equipment aspect of minimizing data loss entails the use of redundant circuitry, as well as components and subsystems that are activated automatically upon the failure of various LAN devices to prevent data loss and maintain network availability.

In addition to the ability to respond to errors in transmissions by detection and correction, other important aspects of LAN operation are recovery and reconfiguration. Recovery deals with bringing the LAN back to a stable condition after an error, and reconfiguration is the mechanism by which the network is restored to its previous condition after a failure.

LAN reconfigurations involve mechanisms to restore service upon loss of a link or network interface unit. To recover or reconfigure the network after failures or faults requires that the network possess mechanisms to detect that an error or fault has occurred and to determine how to minimize the effect on the system’s performance. Generally, these mechanisms provide the following:

  • Performance monitoring;

  • Fault location;

  • Network management;

  • System availability management;

  • Configuration management.

These mechanisms work in concert to detect and isolate faults, determine their effects on the system, and remedy these conditions to bring the network to a stable state with minimal impact on network availability.

Reconfiguration is a fault management scheme used to bypass major failures of network components. This process entails detecting that a fault condition has occurred that cannot be corrected by merely restarting the equipment. Once it is determined that a fault has occurred, its impact on the network is assessed so that an appropriate reconfiguration can be formulated and implemented. In this way, normal operations can continue under a new configuration until the problem can be fixed and the network restored to its primary configuration.

Fault detection is augmented by logging systems that keep track of failures over a period of time. These logs are examined for trends that adversely affect network performance. They might reveal, for example, that a particular component is continually causing problems on the network, or the monitoring system might detect that a component on the network has a higher-than-normal failure rate.
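Trend analysis of a failure log can be sketched in a few lines. The example below assumes the log is simply a sequence of component names, one entry per recorded failure, and it treats "higher than normal" as more than a chosen multiple of the mean failure count; the function name, the threshold, and the component names are all hypothetical.

```python
from collections import Counter


def flag_failure_prone(failure_log, threshold_factor: float = 2.0):
    """Return components whose failure count is well above the average.

    failure_log: iterable of component names, one per logged failure.
    A component is flagged when its count exceeds threshold_factor times
    the mean count across all components that appear in the log.
    """
    counts = Counter(failure_log)
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(name for name, count in counts.items()
                  if count > threshold_factor * mean)


if __name__ == "__main__":
    # Hypothetical log covering one reporting period.
    log = ["hub-1", "bridge-2", "router-3"] + ["nic-7"] * 9
    print(flag_failure_prone(log))
```

Note that the mean here only covers components with at least one logged failure; a production system would more likely compare against an expected failure rate per component type.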

The configuration assessment component of the reconfiguration system uses information about the current system configuration—including connectivity, component placement, paths, and traffic flows—and maps it against the failed component. This information is analyzed to indicate how that particular failure is affecting the system and to isolate the cause of the failure. Once this assessment has been performed, a solution can be worked out and implemented.

The solution may consist of reconfiguring most of the operational processes to avoid the source of the fault. The solution determination component examines the configuration and the affected hardware or software components, determines how to move resources around to bring the network back to an operational state or indicates what must be eliminated because of the failure, and identifies network components that must be serviced.

Determining the most effective course of action is based on the criticality of keeping certain functions of the network operating and maintaining the resources available to do this. In some environments, nothing can be done to restore service because of device limitations (e.g., lack of redundant subsystems) or the lack of spare bandwidth. In such cases, about all that can be done is to indicate to the servicing agent what must be corrected and keep users informed of the situation.

Once an alternate configuration has been determined, the reconfiguration system implements it. In most cases, this means rerouting transmissions, moving and restarting processes from failed devices, or reinitializing software that has failed because of some intermittent error condition. In some cases, nothing may need to be done except notify affected users that the failure is not severe enough to warrant system reconfiguration.

Geographically distributed LANs can be inter-networked over the WAN using such devices as bridges and routers connected to leased lines and/or switched services. An advantage of using routers for this purpose is that they permit the building of large mesh networks. With mesh networks, the routers can steer traffic around points of congestion or failure and balance the traffic load across the remaining links. In addition, routers have flow control and more comprehensive error protection than bridges.
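Steering traffic around a failed link in a mesh can be illustrated with a breadth-first search over the links that remain. This is a minimal sketch, not a real routing protocol: it assumes every link has equal cost, it does not attempt load balancing, and the function name and site names are hypothetical.

```python
from collections import deque


def find_route(links, source, destination, failed_links=()):
    """Return a fewest-hops path from source to destination, or None.

    links and failed_links are iterables of (node, node) pairs; links are
    treated as bidirectional, and failed links are skipped entirely.
    """
    failed = {frozenset(link) for link in failed_links}
    neighbours = {}
    for a, b in links:
        if frozenset((a, b)) in failed:
            continue
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    previous = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path = []
            while node is not None:    # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None                        # destination unreachable


if __name__ == "__main__":
    # Hypothetical four-site mesh.
    mesh = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("B", "D")]
    print(find_route(mesh, "A", "C"))
    print(find_route(mesh, "A", "C", failed_links=[("B", "C")]))
```

When a link is marked as failed, the search returns an alternate path through the remaining links; when no path survives, it returns None, which corresponds to the case where service cannot be restored by rerouting alone.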

Bridges are useful for reducing the size of sprawling LANs into discrete subnetworks that are easier to control and manage. Through the use of bridges, similar devices, protocols, and transmission media can be grouped together into communities of interest. Such partitioning can yield many advantages, such as eliminating congestion and improving the response time of the entire network. Subnetworks are also useful for testing new applications before making them available over the enterprise network.
