Network Resilience Audit Service
A. Foster, R. J. Proctor and P. A. Smith
Overview
Many organisations need communications with a guaranteed level of service. A resilient service requires a resilient network. Resilience of individual components of a network cannot guarantee resilience of the network as a whole. The network resilience audit service identifies the failure modes of resilient networks. This paper highlights some of them.
© Copyright A. Foster, R. J. Proctor, P. A. Smith 2002
www.discoveryconsultancy.com/services/nras
Introduction
The Problem
Many organisations require resilient communications. Some try to provide this by using resilient components, but this does not address the real problem. The true answer is network resilience, which requires network diversity. Is the diversity real? Will it work when something breaks?
Resilience comes at a price; where the cost of a failure is high, duplication may not be as good as expected. Services can be subject to failures elsewhere in the network. Management and ownership issues can hide problems.
Networks are implemented and maintained by human beings. The most carefully planned network will not deliver its requirement where:
• The network is not built to its paper design;
• Changes in the network are not designed and implemented carefully;
• The network does not adhere to its own naming scheme;
• Records are not kept up to date, either during initial roll out or during network modifications, mergers and upgrades.
The Solution
In an ideal world, with boundless capacity and a workforce with all the time in the world, there are many ways of ensuring that divergence is always delivered and maintained. However, we do not live in a perfect world. Rather, the workforce is fully employed doing their day-to-day tasks. So, what can be done? The Network Resilience Audit Service can aid users and providers of network resilience services. We bring our understanding of the problems and apply it in a phased way.
A Network Resilience Audit Service assesses the resilience of a communications network against failures, from the simple failure to duplicate to esoteric protocol failures.
The rest of this paper highlights some of the failures that can occur and the problems they cause. A Network Resilience Audit could identify and reduce the risks these failures cause.
Resilience Audit Service Description
Customers expect their network services to be both reliable and available when required. In order to deliver both of these, network services must be resilient against failure. This paper examines the problems that must be confronted when providing these resilient network services.
Some of the most serious network resilience issues are identified below. Each layer of a communications network is examined in turn, as each has its own potential problems.
Operators wish to provide a network with resilience, which requires a combination of planning, good network design and methodical management practices. Network planning ensures that sufficient capacity and diversity are provided for the resilient service.
Network Aspect (Table 1): Ownership, Management, Network, Service, Packet, Access, Transmission, Physical
Good design avoids dubious network topologies. It prevents over-reliance on the capabilities of individual network elements.
Network management monitors and maintains the equipment and protection mechanisms. Those circuits that require a level of network resilience greater than that provided by the network itself are also supported.
All protection mechanisms rely upon divergent routing of the connections. Our network resilience service analyses the risks to the service in several ways. In particular, we can identify potential failure conditions caused by traffic problems and problems with network topology. This paper concentrates on the topological problems that may occur.
Anecdotal Evidence 1
Everything in the building was duplicated: the equipment, the power feeds and the air conditioning. However, both air conditioning units were fed from the same power feed. Overheating led to equipment failure.
Table 1: Categories of Resilience Problem
Possible Problems
• Lack of transparency across owners of both routes, or part [...] controlled by the same management
• Network management equipment
• Lack of transparency across management domains
• Failures in the management communication network
• Poor record keeping
• Network overload
• Topology problems
• Inappropriate inter-layer interactions
• Apparently different services using the same infrastructure
• Common equipment being responsible for, or involved in, the routing of both main and standby routes
• Too long to reconfigure
• Distant failures overloading your routes
• Network rerouting depending on capacity available
• Everything is high priority
• Links overload following failures
• Single access from customer
• Inappropriate protection
• Inappropriate interconnection of rings and transmission systems
• Inappropriate resource naming
• Same card, same box, same wavelength, same fibre, same cable, same duct, same street, same building, same power, same air conditioning
Divergent Network Connections
The end customer should see divergence as relatively simple. Figure 1 shows this. Two entirely separate connections are created between the two end points. These connections may be provided by a single network operator or by two different operators. While this design appears to provide the required resilience to failure in theory, things are more complicated in practice.
Figure 1: End Customer View of Diverse Network Connections (labels: Customer Network Element, Network, Customer Network Element, Worker Connection, Protecting Connection)
Ensuring that a network provides resilient connections requires the existence of an alternative route in the network. The alternative route must be ready to carry traffic in the event of a connection failure. Transport networks use protection mechanisms to provide these alternative routes. Packet and cell-based networks have several alternative routes built into their routing tables. In both cases, the use of the alternative route after a failure should not propagate the failure to other routes.
Care must also be taken that the connections are not carried on a common resource in a lower layer.
Networks are traditionally divided into layers, as shown in Figure 2.
Figure 2: Example Layers of a Network (labels: Voice and Service Networks, Network Intelligence and Servers, Packet Networks, Transport Networks, Physical Networks, Infrastructure)
We will now examine network resilience layer by layer.
Physical Networks
Figure 3 shows a much simplified example of how physical resources can be put together to create diverse network paths. It shows a well-formed protected network where two divergent connections are each carried through geographically separate network elements and ducting.
By contrast, Figure 4 shows a badly formed diverse network. While the circuits are routed over completely separate and independent optical fibres, they pass through the same duct. Such ducts are buried underground or located in the superstructure of a building. An uncommon but catastrophic failure can occur when such a duct is damaged by road or building works. Although there are clearly two connections that appear from a user’s perspective to be separate, they are actually susceptible to a single failure. There are many reasons why a common duct may be used, some of which are given in the side boxes marked Anecdotal Evidence numbers 3 and 4.
Another risky situation, illustrated in Figure 5, occurs where two ducts are used but are close together. For example, this may occur in a metropolitan environment where ducts owned by two different companies are routed in parallel along the same street. The danger here is that where one connection is damaged there is a finite possibility that the other may also be affected.
Anecdotal Evidence 2
There were two apparently diverse routes on the two sides of the same road. Unfortunately, a utility company dug a trench straight across both ducts.
Anecdotal Evidence 3
Operator A wanted to make a site resilient. They only had one cable to the area, but leased capacity from operator B, who leased capacity from operator C, who leased capacity from operator A on the same cable used for the original route.
Transport Networks
In a transport network, a trail (an end-to-end sequence of network connections) carries the customer connection. In modern transport networks, failures detected at the end of the trail trigger one of the various in-built protection mechanisms.
These various protection mechanisms ensure that the service is maintained following the failure of a trail. In SDH networks, protection mechanisms include connection protection, multiplex section protection and various ring protection schemes. The precise mechanism employed depends on a number of factors. Many protection schemes are complex. For example, radio transport systems use composite protection as they are less reliable than fibre-based systems.
Protection is based on the availability of alternate routes through a network. Transport networks use protecting trails in a number of ways. For example, traffic can be transmitted simultaneously on both trails, allowing an easy and fast switchover when the receiving end detects an error. The benefit is that very little of the traffic is lost following failure. This type of protection is often used to protect highly valued traffic.
Other protection schemes only route the traffic over the protecting trail after a failure is detected. This allows the protecting trail to carry other (“casual”) traffic when it is not being used for its primary purpose. A failure of the worker trail causes the traffic to be switched to the protecting trail. The casual traffic, usually made up of low-value, non-time-critical traffic, is lost. The disadvantage of this type of protection scheme is that the switching involves an end-to-end protocol. Of necessity, the switchover takes longer than in the other scheme.
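The difference between the two schemes described above can be sketched as a small model. This is an illustration only, not a description of any particular SDH mechanism: the names `Outcome`, `dual_fed` and `switched` are hypothetical, and each trail is reduced to a simple up/down flag.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    protected_delivered: bool  # is the protected (high-value) traffic still delivered?
    casual_delivered: bool     # is the low-priority "casual" traffic still delivered?
    needs_signalling: bool     # does recovery rely on an end-to-end switching protocol?


def dual_fed(worker_up: bool, protecting_up: bool) -> Outcome:
    """Traffic is sent on both trails; the receiving end selects a working copy."""
    return Outcome(
        protected_delivered=worker_up or protecting_up,
        casual_delivered=False,   # assumption: no casual traffic is carried in this scheme
        needs_signalling=False,   # the receiver switches locally
    )


def switched(worker_up: bool, protecting_up: bool) -> Outcome:
    """The protecting trail carries casual traffic until the worker trail fails."""
    if worker_up:
        # Normal operation: casual traffic uses the protecting trail if it is up.
        return Outcome(True, protecting_up, False)
    # Worker failed: protected traffic takes over the protecting trail,
    # so the casual traffic is lost and an end-to-end switch is required.
    return Outcome(
        protected_delivered=protecting_up,
        casual_delivered=False,
        needs_signalling=True,
    )


if __name__ == "__main__":
    print(dual_fed(worker_up=False, protecting_up=True))
    print(switched(worker_up=False, protecting_up=True))
```

In both functions the protected traffic survives a worker failure only if the protecting trail is itself still up, which is why the divergence of the two trails matters so much.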
With both schemes, there is an assumption that the worker trail and the protecting trail have completely divergent routes. This is true of any intra-layer protection. It must rely on the good design of the layers below.
Figure 3: Example of Well-formed Diverse Network Connections (labels: Customer NE, Operator’s NE, Duct, Worker Connection, Protecting Connection)
Figure 4: Example of Badly-formed Diverse Network Connections (labels: Worker Connection, Protecting Connection, Common duct vulnerable to a single incident)
Figure 5: Example where Diversity may be Compromised by Adjacency (labels: Worker Connection, Protecting Connection, Adjacent Ducts)
Figure 6: Rings Ain’t What They Used To Be (labels: Ring; In Situ; In practice, may become; Common duct vulnerable to a single incident; Adjacent ducts?)
Anecdotal Evidence 4
Originally a company purchased the communication between the two sites from two separate operators. These later merged and, in a cost-cutting measure, the new company replaced the two existing transmission links with one higher speed transmission link.
Anecdotal Evidence 5
An IP route that had been carefully tuned was replaced by a much faster route. Following network discovery, routers then started selecting that route preferentially, until it was trying to take half the IP traffic between New York and Europe. An increase in capacity had led to the loss of traffic.
A common topology used in transport networks is the ring, such as those in Figure 6. This would seem to guarantee divergence. However, where the ring is extended to a remote network element, there is always a chance that both of its links are carried in a single duct.
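Where resource records exist, the risk of two apparently separate links sharing a duct can be checked mechanically. The following is a minimal sketch under stated assumptions: the `CARRIED_BY` mapping and all resource names are hypothetical, and real records would need to be accurate for the result to mean anything.

```python
# Hypothetical records: each resource maps to the lower-layer resources that carry it.
CARRIED_BY = {
    "fibre-A": ["cable-1"],
    "fibre-B": ["cable-2"],
    "cable-1": ["duct-X"],
    "cable-2": ["duct-X"],  # both cables run through the same duct
}


def underlying(resource, carried_by):
    """Return the resource together with everything beneath it in the layer stack."""
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(carried_by.get(current, []))
    return seen


def shared_resources(worker, protecting, carried_by):
    """Return the resources, at any layer, used by both trails."""
    worker_all = set().union(*(underlying(r, carried_by) for r in worker))
    protecting_all = set().union(*(underlying(r, carried_by) for r in protecting))
    return worker_all & protecting_all


if __name__ == "__main__":
    # The two trails use different fibres and cables, yet share one duct.
    print(shared_resources(["fibre-A"], ["fibre-B"], CARRIED_BY))
```

An empty result only shows that the records contain no common resource; it says nothing about resources that are missing from the records or recorded under different names.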
Even within the transport layer, ensuring that there is no common bearer is not easily achieved. Wavelength division multiplexing (WDM) allows a single optical fibre to carry a terabit of data per second. This is equivalent to half a million 2 Mbit/s connections. With so much bandwidth available, it is tempting to concentrate everything on to a single fibre, leading to problems such as those described in ‘Anecdotal Evidence 4’. Note also that early WDM equipment did not have network protection.
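The equivalence quoted above is simple arithmetic, shown here as a short check. It assumes decimal units (one terabit is 10^12 bits) and a nominal connection rate of exactly 2 Mbit/s; the variable names are illustrative.

```python
TERABIT_PER_SECOND = 10**12          # capacity of the fibre in bit/s (decimal terabit)
CONNECTION_RATE = 2 * 10**6          # nominal 2 Mbit/s connection in bit/s

connections = TERABIT_PER_SECOND // CONNECTION_RATE
print(connections)  # 500000, i.e. half a million connections
```

A single fibre cut at this capacity therefore affects on the order of half a million such connections at once.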
Packet Networks
Packet networks come in two principal varieties. There are connection-oriented networks, such as ATM, and connectionless networks, such as IP. Looking first at connection-oriented networks, there are a further two varieties. Where the network uses cross-connection, the divergence characteristics can be handled in the same way as we have seen with SDH, with the proviso that there may be an underlying SDH network that already delivers its own protection. Some ATM networks use ATM’s switching capability (PNNI) to provide a mechanism for automatic re-routing after path failure. Its main drawback is that the capacity to carry this “protection” path must be available at the time of the failure. If it is not available at that precise moment, the alternative path cannot be created.
In theory, connectionless networks do not have a single path across them for any particular client flow. Therefore, again in theory, when a part of the network fails, the traffic that would have been carried over the failed section finds another route across the network. Where all the traffic is best effort, this may lead to delays to the traffic on the alternative paths, depending on the traffic volumes involved. Where techniques that prioritise traffic are in place, such as DiffServ, the amount of diverted traffic may cause a planned quality scheme to become overloaded. Given that priority schemes only work where there is lower priority traffic that can be pre-empted, a flood of high priority traffic may cause the prioritisation to fail to meet its requirement.
Figure 7: Example of Collateral Damage in a Connectionless Network (example connection failure; after the failure, link capacity is exceeded: 2 × 5 Mbit/s + routing > 10 Mbit/s, and so EF traffic is delayed or discarded until routing activity settles down)
Service Networks
Apparently different services, such as mobile phones and landlines, might use the same infrastructure; this can cause a loss of resilience.
Some service networks may be dependent on centralised network intelligence such as IN services or Domain Name Servers. These are often duplicated for resilience, but may still share some risks; they might be in the same building or built from non-resilient equipment that was never designed for use in a resilient service. Most service networks have a high degree of resilience under normal conditions, but this can be compromised by overload, upgrades and other maintenance work.
For Internet traffic, there is an important routing function: the domain name servers (DNS). These translate the name by which a site is accessed into its address in the network. The DNS function is normally duplicated for high availability and can be triplicated or more; it can spread the load over a number of servers to cater for failures and load balancing. The DNS needs to be very highly available, but often the duplicates are in adjacent equipment, sometimes on the same machine. Operations dependent on the Internet should have the independence of their DNS checked.
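One way to picture such an independence check is to list, for each name server, the things it depends on and then look for values that more than one server shares. The sketch below assumes that this information has already been gathered; the server names, the attributes `machine` and `building`, and the function `shared_dependencies` are all hypothetical.

```python
from collections import defaultdict


def shared_dependencies(servers):
    """Map each (attribute, value) pair to the servers sharing it, if more than one does."""
    groups = defaultdict(list)
    for name, attributes in servers.items():
        for attribute, value in attributes.items():
            groups[(attribute, value)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical records for three name servers.
SERVERS = {
    "ns1": {"machine": "host-a", "building": "site-1"},
    "ns2": {"machine": "host-a", "building": "site-1"},
    "ns3": {"machine": "host-c", "building": "site-2"},
}

if __name__ == "__main__":
    for (attribute, value), names in shared_dependencies(SERVERS).items():
        print(f"{', '.join(names)} share {attribute} {value}")
```

Here ns1 and ns2 are reported as sharing both a machine and a building, so only ns3 adds real independence.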
Network Issues
Some network vulnerabilities exist because of the resilience features of their technology. As we have seen with connectionless networks, the very act of rerouting the traffic can cause failures, perhaps in traffic originally unaffected by the failure. If there is a processing overload because of other traffic, the network or individual connections might be lost. Even where a network’s signalling traffic is kept apart from the customers’ traffic, there may be problems. Where a node is spending more of its processing power reacting to failures, there is less available to process the customers’ traffic. There is even a danger that a large failure can propagate across a network, escalating as it goes. Where a node has overload protection (a useful characteristic in other circumstances), this might even lead to a network shutting down, however temporarily.
Network Management
Network management is key to the provision of resilient services. One of the most valuable concepts used in the successful design of complex communications solutions is that of domains. In the same way that network layering can reduce a problem space to a reasonable size, domains can reduce the scope of management problems. However, there is a price to pay for this.
In ‘Anecdotal Evidence 3’, separate connections were seen to be purchased from different operators. The use of separate management domains within a single operator may lead to the same result. There is always a temptation to see only the problems within the domain under examination rather than across the whole network. Lack of coordination across domains can lead to network failures.
The data communications network (DCN) that is used by the management systems to communicate with the network elements can also present problems. Where a network does not have a signalling capability, the detection of failures cannot be used to initiate any mitigating behaviour unless network management can play this role. This is not possible where the DCN is itself damaged by the failure. Most DCNs are based on some sort of packet technology. These are susceptible to all the failures of packet networks highlighted above.
Ownership
There are several potential problems with the ownership of networks and their equipment. Owners merge, change and disappear, causing the links to be administered differently. The equipment itself may not be owned by the operators; some is leased from third parties (who might also be leasing the same equipment to the resilient link). Network operators are often very sensitive about where their equipment is and how it is connected. They would be reluctant to inform their customers, and very reluctant to let knowledge about their network reach their competitors.
A trusted independent third party would be able to talk to the operators to obtain sufficient information to audit a network, provided it was used for the audit only.
Anecdotal Evidence 6
An operator possessed a high speed link. It was now carrying so much valuable traffic that the operator decided that an alternative route was required. During its construction the primary link failed. It had been cut through more than once by the company’s own contractors.
Anecdotal Evidence 7
An example of a record keeping failure is an operator who wished to install a modem at a remote location. Their records showed an almost empty equipment room. In reality, there was no space for the modem.
Network Resilience Audit
1. Investigate the networks
2. Identify resilience risks
3. Propose solutions (if needed)
Why Discovery Consultancy?
• Experienced telecommunications professionals
• Understand the technology
• Understand the problem
• Independent from suppliers, operators and customers
• Operates under strict non-disclosure agreements with all parties to obtain the necessary information to audit the network
The Impact of Process Failures
Now we have taken a brief look at some of the physical and topological issues, we ought to examine the principal causes of resilience problems, namely the processes.
Human error impacts even the most carefully planned network.
One example is a network not built to its paper design. This may be due to time and cost constraints or reluctance to change from accepted practice.
Additionally, when staff are under pressure to perform changes to the network quickly, they may introduce problems for the long term stability of the network and its services.
Naming schemes can be difficult to administer, especially where different schemes must be coordinated or they cross organisational boundaries. The use of equipment from different manufacturers can exacerbate the problem.
Perhaps the most insidious problems arise where records are not kept up to date, either during initial roll out or during network modifications, mergers and upgrades. Poor records cause a lack of knowledge about the true physical nature of the network, leading to the topological problems previously described.
Let us look at a simple example. In Figure 2 we saw that we must keep the worker and protecting trails separate. Assuming that divergence is physically possible, two trails must be created that do not at any point use the same resource, such as a fibre, duct or network element. Where all the appropriate resources are identified, the first step is to define trails that do not use the same resources. There are a number of schemes that allow this to be performed automatically, some simpler than others. Note that the more simplistic the scheme, the less able it is to cope with network re-designs.
Assume that we have now created (logically) two trails. Are they truly divergent? If the resource naming contains errors, they may not be. Resource naming can go awry for a number of reasons; human error is one. A more subtle problem can occur where resources are leased from other operators. Naming schemes exist within networks and at their edges. There is no way of knowing whether at some point the same fibre or duct is being used, even though each trail sees a different resource name.
Can Service Level Agreements help? Perhaps, but often the service provider is reluctant to accept what they may regard as rigorous and onerous SLAs, limiting their usefulness. Even where they exist, SLAs can be difficult to monitor and enforce, especially within an organisation.
Conclusions
There are many ways that network resilience can be compromised. This paper has highlighted some of them, from the low-level simple failure of duplication to esoteric network and protocol failures. Things will fail, but provided you have appropriate equipment, appropriately networked, communications will continue after the failure.
Discovery Consultancy offers a network resilience audit service that covers the breadth of your organisation's communications, including all of the cases highlighted in this paper and a lot more besides. For more information please contact us.