31 pages
English



Practical Cache Performance Modeling for Computer Architects
Yan Solihin, NCSU, solihin@ncsu.edu
Fei Guo, NCSU, fguo@ncsu.edu
Thomas Puzak, IBM, trpuzak@us.ibm.com
Phil Emma, IBM, pemma@us.ibm.com
Computer Performance Modeling
- Goal: Estimating and understanding the performance of computer systems
- Low-Level Models
  - Various levels of detail: functional, trace, cycle-accurate, etc.
  - Pros
    - Versatile: adaptable to new architectures and workloads
    - Detailed: can embed very detailed statistics
    - Popular: SimpleScalar, SIMICS widely used
  - Cons
    - O(n) overhead, scales with workload size: each instruction/event is simulated to capture its effect
    - Slow: a realistic workload has many billions of instructions

Computer Performance Modeling
- High-Level Models
  - Purpose: Evaluating gross trade-offs of designs
  - Pros
    - Short execution time, sometimes O(1)
    - Requires little coding
    - Reveals basic relationships of variables
    - May reveal non-obvious trends and insights
  - Cons
    - Less versatile
    - Requires performance modeling expertise
  - Uses:
    - Early design cycle: for pruning the design search space
    - Entire design cycle: to re-evaluate the design search space if requirements change

Modeling Methods
- White Box
  - Model incorporates knowledge about the system (e.g., relationships of parameters are known a priori)
  - Analytical or heuristics-based
  - Pros: models reveal insights and explain "why"; no training required
  - Cons: problem-specific solution
- Black Box
  - Model learns knowledge about the system
  - AI-based: neural networks, decision trees, curve fitting, etc.
  - Pros: can model a complex problem/system
  - Cons: prediction without insights; requires training
Focus of this tutorial
- Types:
  - High Level, Hybrid High/Low Level
- A priori knowledge:
  - White Box
- Scope:
  - Miss count & rate
  - Miss cost
  - Bandwidth usage
Program
- 8:30-8:45: Introduction
- 8:45-9:00: Capturing temporal locality behavior
- 9:00-9:30: Modeling cache sharing
- 9:30-10:00: Modeling cache replacement policy
- 10:00-10:30: Coffee Break
- 10:30-11:30: Analysis of the effects of miss clustering on the cost of a cache miss
- 11:30-12:30: Interaction of Caching and Bandwidth Pressure

Capturing Temporal Locality Behavior
Temporal Locality Behavior
- Programs exhibit locality of references
- Spatial locality: the neighbors of recently-accessed data tend to be accessed in the near future
- Temporal locality: recently-accessed data tends to be accessed again (or reused) in the near future
- Significance: temporal reuse & cache parameters determine all non-cold misses
  - If each memory block is accessed exactly once, we only have cold misses
  - Cold misses are affected by block size
- How can we capture temporal locality?
Stack Distance Profiling [Mattson70]
- Early attempt to capture temporal reuse behavior
- Models an LRU stack with a counter for each stack position
- Example: fully-associative cache with an 8-entry stack
  - C1: incremented whenever the MRU block is accessed
  - C2: incremented whenever the 2nd MRU block is accessed
  - C3: incremented whenever the 3rd MRU block is accessed
  - ...
  - C8: incremented whenever the 8th MRU (or LRU) block is accessed
  - C>8: incremented whenever the 9th, 10th, ... block is accessed
Typical Shape
- Empirical observation: a geometric or exponential sequence
- Due to temporal locality
- C_{i+1} = C_i × r, where 0 < r < 1 is the common ratio

[Bar chart: stack distance counters C1 through C8 and C>8, decreasing roughly geometrically from about 30% for C1 (incremented on access to the MRU line) down toward 0%; C>8 is incremented on a miss.]

Stack Distance Properties
- For a fully-associative LRU cache with A blocks, the number of misses of the cache is: Misses = Σ_{i=A+1}^{∞} C_i
- For an A-way set-associative LRU cache, we can collect a set-specific stack distance profile, and the number of misses of the set is: Misses = Σ_{i=A+1}^{∞} C_i
- Alternatively, keep per-set stacks, but use a global set of counters
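The counter mechanism and the miss formula above can be sketched in a few lines of Python. This is an illustrative sketch for a fully-associative LRU cache (the function names are my own, not from the tutorial):

```python
# Sketch of stack distance profiling for a fully-associative LRU cache.
from collections import Counter

def stack_distance_profile(trace):
    """Return counters C[1], C[2], ... where C[i] counts accesses that
    hit the i-th most-recently-used block; C[0] holds cold accesses."""
    stack = []            # index 0 = MRU position
    counters = Counter()
    for block in trace:
        if block in stack:
            pos = stack.index(block)   # 0-based stack distance
            counters[pos + 1] += 1     # increment C_{pos+1}
            stack.pop(pos)
        else:
            counters[0] += 1           # cold access (first touch)
        stack.insert(0, block)         # accessed block becomes MRU
    return counters

def predict_misses(counters, assoc):
    """Misses of an A-block fully-associative LRU cache: cold accesses
    plus every access whose stack distance exceeds A."""
    return counters[0] + sum(c for i, c in counters.items() if i > assoc)

prof = stack_distance_profile(list("ABCBA"))
# prof: 3 cold accesses, one hit at distance 2 (B) and one at distance 3 (A)
```

For the stream A B C B A, `predict_misses(prof, 2)` counts the distance-3 reuse of A as a miss, while `predict_misses(prof, 4)` leaves only the three cold misses.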
Where to Profile
[Diagram: processor P above split L1 Instruction and L1 Data caches, a unified L2 Cache below them, and Mem at the bottom.]
- Profile accesses above the L1 caches to capture temporal reuse patterns at the L1 cache level and predict cache misses for various L1 cache configurations
- Profile accesses between the L1 caches and the L2 cache to capture temporal reuse patterns at the L2 cache level and predict cache misses for various L2 cache configurations
3
Limitations of Stack Distance Profile
- Useful only for predicting cache misses across different cache associativities
- For other purposes, we need to capture temporal reuse patterns in greater detail
- So, use Circular Sequence Profiling [Chandra05]
  - Extends stack distance profiling
  - Counts the occurrences of cseq(d,n)
Definitions
- seq(d,n) = a sequence of n accesses to d distinct addresses (in a cache set)
- cseq(d,n) (circular sequence) = a sequence in which the first and the last accesses are to the same address
- Example (access stream A B C D A E E B): the whole stream is seq(5,8); the span B ... B is cseq(5,7); A B C D A is cseq(4,5); E E is cseq(1,2)

Relationship with Stack Distance Profile
- C_x = number of circular sequences cseq(d=x, n=any value), i.e., C_x = Σ_{n=x}^{∞} cseq(x,n)
- Hence, the stack distance profile is a subset of the circular sequence profile
Collecting Circular Sequence Profile
- Hardware sketch: a 4-entry LRU stack per set (MRU to LRU), with an access counter attached to each stack entry; the profile is a table of counters indexed by (d, n)
- Worked example: access stream for one set is A B C B A
  - Access A: miss; A is pushed as MRU and its access counter starts counting
  - Access B: miss; B becomes MRU; each counter tracks the accesses seen since its block entered the stack
  - Access C: miss; the stack (MRU to LRU) is now C B A
  - Access B: hit at stack depth 2 — found a circular sequence! Since the previous access to B there have been n = 3 accesses to d = 2 distinct addresses, so increment the counter for cseq(2,3); B moves back to MRU
  - Access A: hit at stack depth 3 — found another circular sequence! Since the previous access to A there have been n = 5 accesses to d = 3 distinct addresses, so increment the counter for cseq(3,5)
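The same collection procedure can be sketched in software by remembering each address's previous access position (an illustrative Python sketch, not the tutorial's hardware mechanism; it also shows the stack-distance relationship C_x = Σ_n cseq(x,n)):

```python
# Sketch of circular-sequence profiling for one cache set.
from collections import Counter

def cseq_profile(trace):
    """Count cseq(d, n) occurrences: for every reuse of an address,
    n = accesses from the previous use to this one (inclusive of both),
    d = distinct addresses touched in that window."""
    last_pos = {}
    prof = Counter()
    for t, block in enumerate(trace):
        if block in last_pos:
            window = trace[last_pos[block]:t + 1]
            prof[(len(set(window)), len(window))] += 1
        last_pos[block] = t
    return prof

prof = cseq_profile(list("ABCBA"))
# prof holds cseq(2,3) for B ... B and cseq(3,5) for A ... A

# Recover the stack distance counters C_x by summing over n:
C = Counter()
for (d, n), cnt in prof.items():
    C[d] += cnt
```

On A B C B A this yields the cseq(2,3) and cseq(3,5) occurrences from the worked example above, and collapsing over n reproduces the stack distance counters C2 = C3 = 1.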
Predicting Contention Across Cache
Shared Cache Challenge
[Diagram: a CMP with two processors P, each with a private L1, sharing one L2 Cache.]
- In today's CMPs, the L2 cache is shared by multiple cores
- Applications on different cores compete for L2 cache space
Modeling Goal
- Given n applications, predict the miss rates of any pair of applications
- Input:
  - Behavior of each application
  - Cache parameters
  - Relative speed when the pair runs together
- Output:
  - Number of cache misses for each application in the pair
Impact of Cache Space Contention
[Bar charts: per-application increase in cache misses (up to ~400%) and IPC reduction (up to ~100%) under cache sharing.]
- Application-specific
- Co-schedule-specific
- Significant: up to 4X cache misses, 65% IPC reduction
- How to model the impact of cache sharing?
Assumptions
- LRU Replacement Algorithm
- Applications share nothing
  - Mostly true for sequential apps (except for library and OS code)
- Applications are not similar
  - Parallel apps: threads likely show uniform behavior, so predicting their miss rates is trivial
Circular Sequence Properties
- Thread X runs alone in the system:
  - Given a circular sequence cseqX(dX,nX), the last access is a cache miss iff dX > Assoc
- Thread X shares the cache with thread Y:
  - During cseqX(dX,nX)'s lifetime, if there is a sequence of intervening accesses seqY(dY,nY), the last access of thread X is a miss iff dX + dY > Assoc
Example
- Assume a 4-way associative cache
- X's circular sequence cseqX(2,3): A B A
- Y's intervening access sequence: U V V W
- cseqX's lifetime spans from the first A to the last A
- No cache sharing: A is a cache hit
- Cache sharing: is A a cache hit or a miss?

Example
- Assume a 4-way associative cache:
  - Interleaving A U B V V A: seqY(2,3) intervenes in cseqX's lifetime; dX + dY = 2 + 2 ≤ 4, so the last A is a cache hit
  - Interleaving A U B V V W A: seqY(3,4) intervenes in cseqX's lifetime; dX + dY = 2 + 3 > 4, so the last A is a cache miss
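The two interleavings above can be checked directly with a tiny LRU set simulation (an illustrative sketch; the stream letters follow the slide's example):

```python
# Simulate one 4-way set under LRU to verify the dX + dY > Assoc condition.

def simulate_lru_set(trace, assoc):
    """Return a per-access hit/miss list for one set under LRU."""
    stack = []                      # index 0 = MRU
    results = []
    for block in trace:
        hit = block in stack
        results.append(hit)
        if hit:
            stack.remove(block)
        elif len(stack) == assoc:   # set full: evict the LRU block
            stack.pop()
        stack.insert(0, block)      # accessed block becomes MRU
    return results

# cseqX(2,3) = A B A, with Y's intervening accesses U V V (dY=2) or U V V W (dY=3)
hit1 = simulate_lru_set(list("AUBVVA"), 4)[-1]   # dX + dY = 2 + 2 <= 4
hit2 = simulate_lru_set(list("AUBVVWA"), 4)[-1]  # dX + dY = 2 + 3 > 4
```

The simulation confirms the property: the last A hits when dX + dY ≤ Assoc and misses once the extra intervening address W pushes dX + dY past the associativity.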
Inductive Probability Model
- Define Pmiss(cseqX) = probability that the last access of cseqX is a cache miss
- For each cseqX(dX,nX) of thread X:
  - Compute the number of intervening accesses from thread Y during cseqX's lifetime; denote it as nY
  - It is possible to have dY = 1, 2, ..., nY; compute the probability of each dY, denoted P(seq(dY,nY))
  - For each dY = 1, 2, ..., nY:
    - If dY + dX > Assoc, Pmiss(cseqX) = Pmiss(cseqX) + P(seq(dY,nY))
    - If dY + dX ≤ Assoc, Pmiss(cseqX) is kept the same
  - Misses = old misses + Pmiss(cseqX) × F(cseqX), where F(cseqX) is the occurrence count of cseqX
Computing P(seq(dY, nY))
- Basic idea: seq(d,n) is reached either from seq(d,n-1) plus one access to an already-seen address (one of its d distinct addresses, i.e., forming a circular sequence), or from seq(d-1,n-1) plus one access to a new address
- This is a Markov process with 3 states and 2 edges
- P(seq(d,n)) = P(d) × P(seq(d,n-1)) + P((d-1)+) × P(seq(d-1,n-1)), where P(d) = Σ_{i=1}^{d} C_i / Σ_{i=1}^{∞} C_i and P(d+) = 1 − P(d)
Overall Formula
- Define: P(d) = Σ_{i=1}^{d} C_i / Σ_{i=1}^{∞} C_i and P(d+) = 1 − P(d)
- P(seq(d,n)) is computed by:
  - P(seq(d,n)) = 1, if d = n = 1
  - P(seq(d,n)) = P((d−1)+) × P(seq(d−1,n−1)), if d = n > 1
  - P(seq(d,n)) = P(d) × P(seq(d,n−1)), if n > d = 1
  - P(seq(d,n)) = P(d) × P(seq(d,n−1)) + P((d−1)+) × P(seq(d−1,n−1)), if n > d > 1

Example
- Expanding P(seq(2,3)) recursively:
  - seq(2,3) → P(1+) × seq(1,2) + P(2) × seq(2,2)
  - seq(1,2) → P(1) × seq(1,1); seq(2,2) → P(1+) × seq(1,1)
  - P(seq(1,1)) = 1
  - Hence P(seq(2,3)) = P(1+) × P(1) + P(2) × P(1+)
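The recurrence can be sketched with memoized recursion; the profile values C1..C4 below are toy numbers of my own choosing, and the helper names are illustrative:

```python
# Sketch of the P(seq(d, n)) recurrence driven by a stack distance profile C.
from functools import lru_cache

def make_p_seq(C):
    total = sum(C)
    def P(d):               # prob. the next access reuses one of the d addresses
        return sum(C[:d]) / total
    def P_plus(d):          # prob. the next access goes beyond those d addresses
        return 1.0 - P(d)

    @lru_cache(maxsize=None)
    def p_seq(d, n):
        if d == n == 1:
            return 1.0
        if d == n:          # every access touched a new address
            return P_plus(d - 1) * p_seq(d - 1, n - 1)
        if d == 1:          # n > d = 1: repeated reuses of a single address
            return P(d) * p_seq(d, n - 1)
        # n > d > 1: reuse an old address, or the previous access was new
        return P(d) * p_seq(d, n - 1) + P_plus(d - 1) * p_seq(d - 1, n - 1)
    return p_seq, P, P_plus

C = [30, 25, 20, 15]        # toy geometric-ish profile, C1..C4 (assumed values)
p_seq, P, P_plus = make_p_seq(C)
# Slide example: P(seq(2,3)) = P(1+)P(1) + P(2)P(1+)
expected = P_plus(1) * P(1) + P(2) * P_plus(1)
```

Running the recursion for (d, n) = (2, 3) reproduces the hand expansion from the example slide.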
Final prediction
- After we obtain Pmiss(cseqX(dX,nX)) for all cseqX(dX,nX), predict the total misses for thread X:
- miss_X = old_miss_X + Σ_{dX=1}^{A} Pmiss(cseqX(dX,nX)) × C_{dX}
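The final prediction step is a single weighted sum; a minimal sketch, assuming Pmiss[d] has already been computed by the inductive model and C[d] comes from X's circular sequence profile (all input values below are made-up):

```python
# Sketch of the final miss prediction for thread X under cache sharing.

def predict_shared_misses(old_misses, Pmiss, C, assoc):
    """miss_X = old_miss_X + sum over dX of Pmiss(cseqX(dX, .)) * C_dX."""
    return old_misses + sum(Pmiss[d] * C[d] for d in range(1, assoc + 1))

# Toy inputs: reuses at larger stack distances are more likely to become misses.
extra = predict_shared_misses(old_misses=0,
                              Pmiss={1: 0.0, 2: 0.25, 3: 0.5, 4: 1.0},
                              C={1: 8, 2: 4, 3: 2, 4: 1},
                              assoc=4)
```

Here `extra` is the expected number of additional misses thread X suffers because of the co-runner's intervening accesses.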
Observations
- Based on how vulnerable applications are to the impact of cache sharing:
  - Some are highly vulnerable
  - Some are not vulnerable
  - Many are somewhat / sometimes vulnerable
- Insights:
  - Traditional characterizations are not indicative of the impact of sharing:
    - Low vs. high IPC
    - Integer vs. floating-point
    - High miss rate vs. low miss rate
  - Rather, the interaction of temporal reuse behaviors determines the impact of cache sharing

Modeling Replacement Policy Performance

Motivation
- Cache design is critical to performance
  - Memory wall: a cache miss costs hundreds of processor cycles
  - Capacity pressure: multi-core designs, virtual machines
- Important parameters: size, associativity, block size, and replacement policy
Motivation
- Performance variation due to replacement policy is significant
[Bar charts: L2 miss rates and normalized execution times for art, ammp, and cg under different replacement policies (LRU and others), showing large differences across policies.]
- No agreement on the best implementation
  - Intel Pentium: LRU
  - Intel XScale: FIFO
  - IBM Power 4: tree-based pseudo-LRU
  - Others: round robin, random, replacement hints, etc.
Motivation
- No analytical model exists; past models assume
  - LRU [Cascaval03, Chandra05, Ghosh97, Quong94, Sen02, Singh92, Suh01]
  - or Random [Agarwal89, Berg04, Ladner99]
- LRU/Random simplifies modeling, but
  - Ignores performance variation due to replacement policy
  - Is inaccurate for highly associative caches
Would be useful to model replacement policies
[Diagram: circular sequence profiles for App 1 ... App N, together with a Replacement Probability Function (RPF) for each replacement policy RP 1 ... RP M, feed the replacement policy model, which outputs the predicted miss rate for each app under each replacement policy.]

Outline
- Input of the Model
- Replacement Policy Model
- Case Study
- Conclusions

Outline
- Input of the Model
  - Replacement Probability Function (RPF)
  - Circular Sequence Profiles
- Replacement Policy Model
- Validation
- Conclusions