
Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications

Phillipa Gill (University of Toronto, phillipa@cs.toronto.edu)
Navendu Jain (Microsoft Research, navendu@microsoft.com)
Nachiappan Nagappan (Microsoft Research, nachin@microsoft.com)
ABSTRACT

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software related faults, (4) failures have potential to cause loss of many small packets such as keep alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

Categories and Subject Descriptors: C.2.3 [Computer-Communication Networks]: Network Operations

General Terms: Network Management, Performance, Reliability

Keywords: Data Centers, Network Reliability

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCOMM'11, August 15-19, 2011, Toronto, Ontario, Canada.
Copyright 2011 ACM 978-1-4503-0797-0/11/08 ...$10.00.

1. INTRODUCTION

Demand for dynamic scaling and benefits from economies of scale are driving the creation of mega data centers to host a broad range of services such as Web search, e-commerce, storage backup, video streaming, high-performance computing, and data analytics. To host these applications, data center networks need to be scalable, efficient, fault tolerant, and easy to manage. Recognizing this need, the research community has proposed several architectures to improve scalability and performance of data center networks [2,3,12-14,17,21]. However, the issue of reliability has remained unaddressed, mainly due to a dearth of available empirical data on failures in these networks.

In this paper, we study data center network reliability by analyzing network error logs collected for over a year from thousands of network devices across tens of geographically distributed data centers. Our goals for this analysis are two-fold. First, we seek to characterize network failure patterns in data centers and understand overall reliability of the network. Second, we want to leverage lessons learned from this study to guide the design of future data center networks.

Motivated by issues encountered by network operators, we study network reliability along three dimensions:

• Characterizing the most failure prone network elements. To achieve high availability amidst multiple failure sources such as hardware, software, and human errors, operators need to focus on fixing the most unreliable devices and links in the network. To this end, we characterize failures to identify network elements with high impact on network reliability, e.g., those that fail with high frequency or that incur high downtime.

• Estimating the impact of failures. Given limited resources at hand, operators need to prioritize severe incidents for troubleshooting based on their impact to end-users and applications. In general, however, it is difficult to accurately quantify a failure's impact from error logs, and annotations provided by operators in trouble tickets tend to be ambiguous. Thus, as a first step, we estimate failure impact by correlating event logs with recent network traffic observed on links involved in the event (a short sketch of such a correlation follows this list). Note that logged events do not necessarily result in a service outage because of failure-mitigation techniques such as network redundancy [1] and replication of compute and data [11,27], typically deployed in data centers.

• Analyzing the effectiveness of network redundancy. Ideally, operators want to mask all failures before applications experience any disruption. Current data center networks typically provide 1:1 redundancy to allow traffic to flow along an alternate route when a device or link becomes unavailable [1]. However, this redundancy comes at a high cost, in both monetary expenses and management overheads, to maintain a large number of network devices and links in the multi-rooted tree topology. To analyze its effectiveness, we compare traffic on a per-link basis during failure events to traffic across all links in the network redundancy group where the failure occurred.
For our study, we leverage multiple monitoring tools put into place by our network operators. We utilize data sources that provide both a static view (e.g., router configuration files, device procurement data) and a dynamic view (e.g., SNMP polling, syslog, trouble tickets) of the network. Analyzing these data sources, however, poses several challenges. First, since these logs track low-level network events, they do not necessarily imply application performance impact or service outage. Second, we need to separate failures that potentially impact network connectivity from high-volume and often noisy network logs, e.g., warnings and error messages emitted even when the device is functional. Finally, analyzing the effectiveness of network redundancy requires correlating multiple data sources across redundant devices and links. Through our analysis, we aim to address these challenges to characterize network failures, estimate the failure impact, and analyze the effectiveness of network redundancy in data centers.
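As a rough illustration of the second challenge, the sketch below shows one plausible cleaning step, assuming each failure event carries a link identifier and a time window: overlapping reports on the same link are merged and events on links that carried no traffic beforehand are dropped. The field names and the traffic-based filter are assumptions for illustration, not the pipeline described in Section 3.

```python
def clean_events(events, traffic_before):
    """Reduce noisy failure logs to events with potential traffic impact.

    events: list of dicts with 'link', 'start', and 'end' keys.
    traffic_before: maps link -> median bytes observed before the event;
                    links absent from the map are treated as idle.
    """
    # Order by link, then time, so duplicate reports of one failure
    # on the same link become adjacent and can be merged.
    events = sorted(events, key=lambda ev: (ev["link"], ev["start"]))
    merged = []
    for ev in events:
        prev = merged[-1] if merged else None
        if prev and prev["link"] == ev["link"] and ev["start"] <= prev["end"]:
            prev["end"] = max(prev["end"], ev["end"])  # same failure, extend it
        else:
            merged.append(dict(ev))
    # Keep only failures on links that were actually carrying traffic.
    return [ev for ev in merged if traffic_before.get(ev["link"], 0) > 0]
```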
1.1 Key observations

We make several key observations from our study:

• Data center networks are reliable. We find that overall the data center network exhibits high reliability, with more than four 9's of availability for about 80% of the links and for about 60% of the devices in the network (Section 4.5.3).

• Low-cost, commodity switches are highly reliable. We find that Top of Rack switches (ToRs) and aggregation switches exhibit the highest reliability in the network, with failure rates of about 5% and 10%, respectively. This observation supports network design proposals that aim to build data center networks using low-cost, commodity switches [3,12,21] (Section 4.3).

• Load balancers experience a high number of software faults. We observe that 1 in 5 load balancers exhibit a failure (Section 4.3) and that they experience many transient software faults (Section 4.7).

• Failures potentially cause loss of a large number of small packets. By correlating network traffic with link failure events, we estimate the amount of packets and data lost during failures. We find that most failures lose a large number of packets relative to the number of lost bytes (Section 5), likely due to loss of protocol-specific keep alive messages or ACKs.

• Network redundancy helps, but it is not entirely effective. Ideally, network redundancy should completely mask all failures from applications. However, we observe that network redundancy is only able to reduce the median impact of failures (in terms of lost bytes or packets) by up to 40% (Section 5.1). A small numeric illustration of this link-level versus group-level comparison follows this list.
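The last observation above compares traffic normalized at two scopes. The following sketch, with made-up helper names and example numbers, shows the idea: if redundancy is effective, aggregate traffic across the redundancy group barely drops even when the failed link's own traffic collapses. The exact normalization used in Section 5.1 may differ.

```python
def normalized_traffic(during, before):
    """Ratio of traffic carried during a failure to traffic before it."""
    return during / before if before else float("nan")

def redundancy_effectiveness(link, group):
    """Compare impact at the failed link vs. across its redundancy group.

    link:  (bytes_during, bytes_before) for the failed link.
    group: list of (bytes_during, bytes_before) for every link in the
           redundancy group, including the failed one.
    """
    link_ratio = normalized_traffic(*link)
    group_ratio = normalized_traffic(sum(d for d, _ in group),
                                     sum(b for _, b in group))
    return link_ratio, group_ratio

# Example: the failed link drops to 20% of its prior traffic, yet the
# four-link redundancy group still carries about 94% in aggregate.
print(redundancy_effectiveness((200, 1000),
                               [(200, 1000), (1100, 400),
                                (900, 900), (800, 900)]))
```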
Limitations. As with any large-scale empirical study, our results are subject to several limitations. First, the best-effort nature of failure reporting may lead to missed events or multiply-logged events. While we perform data cleaning (Section 3) to filter the noise, some events may still be lost due to software faults (e.g., firmware errors) or disconnections (e.g., under correlated failures). Second, human bias may arise in failure annotations (e.g., root cause). This concern is alleviated to an extent by verification with operators, and by the scale and diversity of our network logs. Third, network errors do not always impact network traffic or service availability, due to several factors such as in-built redundancy at netw…

Figure 1: A conventional data center network architecture adapted from figure by Cisco [12]. The device naming convention is summarized in Table 1. [Figure omitted; its labels show the Internet above the data center, Core and AccR devices at Layer 3, and AggS, LB, and ToR devices at Layer 2, with primary and backup paths.]

Table 1: Summary of device abbreviations

  Type | Devices              | Description
  -----|----------------------|----------------------
  AggS | AggS-1, AggS-2       | Aggregation switches
  LB   | LB-1, LB-2, LB-3     | Load balancers
  ToR  | ToR-1, ToR-2, ToR-3  | Top of Rack switches
  AccR | -                    | Access routers
  Core | -                    | Core routers

2. BACKGROUND

Our study focuses on characterizing failure events within our organization's set of data centers. We next give an overview of data center networks and workload characteristics.

2.1 Data center network architecture

Figure 1 illustrates an example of a partial data center network architecture [1]. In the network, rack-mounted servers are connected (or dual-homed) to a Top of Rack (ToR) switch, usually via a 1 Gbps link. The ToR is in turn connected to a primary and a backup aggregation switch (AggS) for redundancy. Each redundant pair of AggS aggregates traffic from tens of ToRs, which is then forwarded to the access routers (AccR). The access routers aggregate traffic from up to several thousand servers and route it to core routers that connect to the rest of the data center network and the Internet.
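For readers who want the hierarchy of Figure 1 in executable form, here is a small sketch that builds an adjacency map for the conventional topology just described: ToRs dual-homed to a primary and backup AggS, AggS pairs feeding access routers, and access routers feeding core routers. The device counts and names are placeholders, and load balancers and servers are omitted; none of this reflects the measured network's scale.

```python
from collections import defaultdict

def build_conventional_topology(num_tors=4, tors_per_aggs_pair=2):
    """Return an adjacency map for a small conventional DC network."""
    adj = defaultdict(set)

    def link(a, b):
        # Links are bidirectional, so record both directions.
        adj[a].add(b)
        adj[b].add(a)

    cores = ["Core-1", "Core-2"]
    accrs = ["AccR-1", "AccR-2"]
    for accr in accrs:
        for core in cores:
            link(accr, core)

    num_pairs = (num_tors + tors_per_aggs_pair - 1) // tors_per_aggs_pair
    for p in range(num_pairs):
        # Each redundancy group is a primary and backup aggregation switch.
        pair = [f"AggS-{2 * p + 1}", f"AggS-{2 * p + 2}"]
        for aggs in pair:
            for accr in accrs:
                link(aggs, accr)
        # Each ToR is dual-homed to both switches in its AggS pair.
        for t in range(p * tors_per_aggs_pair,
                       min((p + 1) * tors_per_aggs_pair, num_tors)):
            for aggs in pair:
                link(f"ToR-{t + 1}", aggs)

    return adj
```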
ways impact network traffic or service availability, due to several All links in our data centers use Ethernet as the link layer
factors such as in-built redundancy at netw
