Practical Cache Performance Modeling for Computer Architects
Yan Solihin, NCSU, solihin@ncsu.edu
Fei Guo, NCSU, fguo@ncsu.edu
Thomas Puzak, IBM, trpuzak@us.ibm.com
Phil Emma, IBM, pemma@us.ibm.com
Computer Performance Modeling
High-Level Models
- Purpose: evaluating gross trade-offs of designs
- Pros:
  - Short execution time, sometimes O(1)
  - Requires little coding
  - Reveals basic relationships of variables
  - May reveal non-obvious trends and insights
- Cons:
  - Less versatile
  - Requires performance modeling expertise
- Uses:
  - Early design cycle: for pruning the design search space
  - Entire design cycle: to re-evaluate the design search space if requirements change
Computer Performance Modeling
Goal: estimating and understanding the performance of computer systems
Low-Level Models
- Various levels of detail: functional, trace-driven, cycle-accurate, etc.
- Pros:
  - Versatile: adaptable to new architectures and workloads
  - Detailed: can embed very detailed statistics
  - Popular: SimpleScalar and SIMICS are widely used
- Cons:
  - O(n) overhead, scales with workload size: each instruction/event is simulated to capture its effect
  - Slow: a realistic workload has many billions of instructions
Modeling Methods
White Box
- Model incorporates knowledge about the system (e.g. relationships of parameters are known a priori)
- Analytical or heuristics-based
- Pros: models reveal insights and explain "why"; no training required
- Cons: problem-specific solution
Black Box
- Model learns knowledge about the system
- AI-based: neural networks, decision trees, curve fitting, etc.
- Pros: can model a complex problem/system
- Cons: prediction without insights; requires training
Program
- 8:30-8:45: Introduction
- 8:45-9:00: Capturing temporal locality behavior
- 9:00-9:30: Modeling cache sharing
- 9:30-10:00: Modeling cache replacement policy
- 10:00-10:30: Coffee break
- 10:30-11:30: Analysis of the effects of miss clustering on the cost of a cache miss
- 11:30-12:30: Interaction of caching and bandwidth pressure
Temporal Locality Behavior
- Programs exhibit locality of references
  - Spatial locality: the neighbors of recently-accessed data tend to be accessed in the near future
  - Temporal locality: recently-accessed data tends to be accessed again (or reused) in the near future
- Significance: temporal reuse and cache parameters determine all non-cold misses
  - If each memory block is accessed exactly once, we only have cold misses
  - Cold misses are affected by block size
- How can we capture temporal locality?
Stack Distance Profiling [Mattson70]
- Early attempt to capture temporal reuse behavior
- Models an LRU stack with a counter for each stack position
- Example: fully associative cache with an 8-entry stack
  - C1: incremented whenever the MRU block is accessed
  - C2: incremented whenever the 2nd MRU block is accessed
  - C3: incremented whenever the 3rd MRU block is accessed
  - ...
  - C8: incremented whenever the 8th MRU (or LRU) block is accessed
  - C>8: incremented whenever the 9th, 10th, ... block is accessed
Stack Distance Properties
- For a fully-associative LRU cache with A blocks, the number of misses of the cache is
  Misses = Σ_{i=A+1}^{∞} C_i
- For an A-way set-associative LRU cache, we can collect a set-specific stack distance profile, and the number of misses of each set is
  Misses = Σ_{i=A+1}^{∞} C_i
- Alternatively, keep per-set stacks, but use a global set of counters
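The counting rule and the miss formula above can be sketched in a few lines. This is an illustrative sketch (the function names are mine, not from the tutorial): it maintains one fully associative LRU stack, bumps a counter per stack distance, and then applies Misses = Σ_{i=A+1}^{∞} C_i.

```python
from collections import defaultdict

def stack_distance_profile(accesses):
    """Collect stack distance counters for one fully associative LRU
    stack. Returns a dict: stack position -> count (1 = MRU hit).
    First-touch accesses are counted under the key 'cold'."""
    stack = []                          # index 0 = MRU
    counters = defaultdict(int)
    for addr in accesses:
        if addr in stack:
            counters[stack.index(addr) + 1] += 1   # 1-based distance
            stack.remove(addr)
        else:
            counters['cold'] += 1
        stack.insert(0, addr)           # move/insert at MRU position
    return counters

def predicted_misses(counters, assoc):
    """Non-cold misses of an `assoc`-block LRU cache: the sum of the
    counters beyond stack position `assoc`."""
    return sum(c for pos, c in counters.items()
               if pos != 'cold' and pos > assoc)

prof = stack_distance_profile(list("ABCABC"))
# A, B, C are cold; each reuse hits at stack distance 3,
# so a 2-block cache misses on all three reuses, a 4-block cache on none.
```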
Typical Shape
- Empirical observation: the counters follow a geometric (exponential) sequence
- Due to temporal locality: C_{i+1} = C_i × r, where 0 < r < 1 is the common ratio
[Figure: bar chart of the profile, e.g. ~30% of accesses increment the MRU counter, ~25% the 2nd MRU counter, ~20% the 3rd MRU counter, decreasing down the stack, with the last bar incremented on a miss]
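Under this geometric assumption, the predicted non-cold miss ratio of an A-block LRU cache has a simple closed form: the tail Σ_{i=A+1}^{∞} C_i divided by the total of the geometric series, which approaches r^A. A small illustrative check (the helper names are mine):

```python
def geometric_counters(c1, r, n):
    """First n stack distance counters under C_{i+1} = C_i * r."""
    return [c1 * r**i for i in range(n)]

def miss_fraction(counters, assoc):
    """Fraction of profiled accesses whose stack distance exceeds the
    associativity, i.e. the predicted non-cold miss ratio."""
    return sum(counters[assoc:]) / sum(counters)

cs = geometric_counters(100.0, 0.5, 60)     # r = 0.5, 60 positions
# For a long geometric profile this approaches r**assoc = 0.5**4:
assert abs(miss_fraction(cs, 4) - 0.5**4) < 1e-9
```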
Where to Profile
- For capturing temporal reuse patterns at the L1 cache level (L1 instruction and L1 data): predict cache misses for various L1 cache configurations
- For capturing temporal reuse patterns at the L2 cache level: predict cache misses for various L2 cache configurations
[Figure: processor P with L1 instruction and L1 data caches, backed by an L2 cache and memory]
Limitations of Stack Distance Profile
- Useful only for predicting cache misses across different cache associativities
- For other purposes, we need to capture temporal reuse patterns in greater detail
- So, use circular sequence profiling [Chandra05]
  - Extends stack distance profiling
  - Counts the occurrences of cseq(d,n)
Relationship with Stack Distance Profile
- C_x = the number of circular sequences cseq(d=x, n=any value), i.e.
  C_x = Σ_{n=x}^{∞} cseq(x,n)
- Hence, the stack distance profile is a subset of the circular sequence profile
Definitions
- seq(d,n) = a sequence of n accesses to d distinct addresses (in a cache set)
- cseq(d,n) (circular sequence) = a seq(d,n) in which the first and the last accesses are to the same address
- Example: the access stream A B C D A E E B is seq(5,8) and contains:
  - A B C D A = cseq(4,5)
  - B C D A E E B = cseq(5,7)
  - E E = cseq(1,2)
Collecting Circular Sequence Profile
Example: one 4-way set with access stream A B C B A, an LRU stack, and a table of access counters indexed by d (rows 1..4) and n (columns 1..5):
- Access A: not in the stack; stack becomes A
- Access B: not in the stack; stack becomes B A
- Access C: not in the stack; stack becomes C B A
- Access B: hit at stack depth 2; found a circular sequence B C B = cseq(2,3), so the (d=2, n=3) counter is incremented; stack becomes B C A
- Access A: hit at stack depth 3; found a circular sequence A B C B A = cseq(3,5), so the (d=3, n=5) counter is incremented; stack becomes A B C
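This walkthrough can be reproduced mechanically. A hedged sketch (my own helper, not the authors' code): on every reuse, the stack depth of the address gives d and the number of accesses since its previous access (inclusive of both endpoints) gives n; summing the counts over n recovers the stack distance counters, as stated earlier.

```python
from collections import defaultdict

def circular_sequence_profile(accesses):
    """Collect cseq(d, n) counts for one cache set: on each reuse of
    an address, d = distinct addresses in the circular sequence and
    n = accesses between (and including) the two accesses to it."""
    stack = []                       # index 0 = MRU
    last_pos = {}                    # address -> index of last access
    counts = defaultdict(int)        # (d, n) -> occurrences
    for i, addr in enumerate(accesses):
        if addr in stack:
            d = stack.index(addr) + 1      # stack depth = distinct addrs
            n = i - last_pos[addr] + 1     # accesses incl. both endpoints
            counts[(d, n)] += 1
            stack.remove(addr)
        stack.insert(0, addr)
        last_pos[addr] = i
    return counts

prof = circular_sequence_profile(list("ABCBA"))
# The stream yields exactly cseq(2,3) for B..B and cseq(3,5) for A..A,
# matching the walkthrough above.
```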
Predicting Contention Across Caches
Shared Cache Challenge
- In today's CMPs, the L2 cache is shared by multiple cores
- Applications on different cores compete for L2 cache space
[Figure: CMP with two processors P, each with a private L1, sharing one L2 cache]
Modeling Goal
- Given n applications, predict the miss rates of any pair of applications
- Input:
  - Behavior of each application
  - Cache parameters
  - Relative speed when the pair runs together
- Output: number of cache misses for each application in the pair
Impact of Cache Space Contention
[Figure: bar charts of miss increase (up to ~400%) and IPC degradation under cache sharing]
- Significant, application-specific, and co-schedule-specific: up to 4X cache misses, 65% IPC reduction
- How to model the impact of cache sharing?
Assumptions
- LRU replacement algorithm
- Applications share nothing
  - Mostly true for sequential apps (except for library and OS code)
- Applications are not similar
  - Parallel apps: threads likely to show uniform behavior, so predicting their miss rates is trivial
Circular Sequence Properties
- Thread X runs alone in the system:
  Given a circular sequence cseqX(dX,nX), the last access is a cache miss iff dX > Assoc
- Thread X shares the cache with thread Y:
  During cseqX(dX,nX)'s lifetime, if there is a sequence of intervening accesses seqY(dY,nY), the last access of thread X is a miss iff dX + dY > Assoc
Example
Assume a 4-way associative cache. X's circular sequence is cseqX(2,3) = A B A, and Y's intervening access sequence is U V V W. During cseqX's lifetime:
- No cache sharing: the last A is a cache hit (dX = 2 ≤ 4)
- Interleaving A U B V V A: seqY(2,3) intervenes, dX + dY = 2 + 2 ≤ 4, so the last A is still a cache hit
- Interleaving A U B V V W A: seqY(3,4) intervenes, dX + dY = 2 + 3 > 4, so the last A becomes a cache miss
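Both interleavings can be checked by directly simulating a 4-way LRU set. A small sketch (the helper name is mine):

```python
def misses(accesses, assoc):
    """Miss count of one `assoc`-way LRU set."""
    stack, miss = [], 0
    for a in accesses:
        if a in stack:
            stack.remove(a)
        else:
            miss += 1
        stack.insert(0, a)
        del stack[assoc:]            # evict blocks beyond the associativity
    return miss

# A U B V V A: only the 4 cold misses (A, U, B, V); the final A hits.
# A U B V V W A: 5 cold misses plus the final A, evicted by W, misses.
```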
Inductive Probability Model
- Define Pmiss(cseqX) = the probability that the last access of cseqX is a cache miss
- For each cseqX(dX,nX) of thread X:
  - Compute the number of intervening accesses from thread Y during cseqX's lifetime; denote it nY
  - The intervening accesses may touch dY = 1, 2, ..., nY distinct addresses; compute the probability of each dY, denoted P(seq(dY,nY))
  - For each dY = 1, 2, ..., nY:
    - If dY + dX > Assoc: Pmiss(cseqX) = Pmiss(cseqX) + P(seq(dY,nY))
    - If dY + dX ≤ Assoc: Pmiss(cseqX) is kept the same
- Misses = old_misses + Σ Pmiss(cseqX) × F(cseqX)
Computing P(seq(dY,nY))
- Basic idea: a seq(d,n) arises from a smaller sequence by appending one access:
  - seq(d,n-1) + 1 access to an already-seen address (one of its d distinct addresses), or
  - seq(d-1,n-1) + 1 access to a new address
- This is a Markov process: P(seq(d,n)) = A × P(seq(d-1,n-1)) + B × P(seq(d,n-1)), where
  B = (Σ_{i=1}^{d} C_i) / (Σ_{i=1}^{∞} C_i) and A = 1 − B
Example
Expanding P(seq(2,3)) with the recurrence:
  P(seq(2,3)) = P(1+) × P(seq(1,2)) + P(2−) × P(seq(2,2))
  P(seq(1,2)) = P(1−) × P(seq(1,1)) = P(1−) × 1
  P(seq(2,2)) = P(1+) × P(seq(1,1)) = P(1+) × 1
Hence P(seq(2,3)) = P(1+) × P(1−) + P(2−) × P(1+)
Overall Formula
Define P(d−) = (Σ_{i=1}^{d} C_i) / (Σ_{i=1}^{∞} C_i) and P(d+) = 1 − P(d−).
P(seq(d,n)) is computed by:
  P(seq(d,n)) = 1, if d = n = 1
  P(seq(d,n)) = P((d−1)+) × P(seq(d−1,n−1)), if d = n > 1
  P(seq(d,n)) = P(d−) × P(seq(d,n−1)) + P((d−1)+) × P(seq(d−1,n−1)), if n > d > 1
  P(seq(1,n)) = P(1−) × P(seq(1,n−1)), if n > 1
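The piecewise recurrence translates directly into a memoized recursive function. A sketch under the definitions of P(d−) and P(d+) above (function names are mine); a useful sanity check is that for any n, the probabilities P(seq(d,n)) over d = 1..n sum to 1, which follows from P(d−) + P(d+) = 1.

```python
from functools import lru_cache

def make_P_seq(counters):
    """Build P(seq(d, n)) from thread Y's stack distance counters
    (counters[i-1] = C_i), following the piecewise recurrence."""
    total = sum(counters)

    def P_minus(d):            # prob. that the next access reuses top-d
        return sum(counters[:d]) / total

    def P_plus(d):
        return 1.0 - P_minus(d)

    @lru_cache(maxsize=None)
    def P_seq(d, n):
        if d == n == 1:
            return 1.0
        if d == n:             # every access was to a new address
            return P_plus(d - 1) * P_seq(d - 1, n - 1)
        if n > d >= 1:         # last access: a reuse, or a new address
            reuse = P_minus(d) * P_seq(d, n - 1)
            fresh = P_plus(d - 1) * P_seq(d - 1, n - 1) if d > 1 else 0.0
            return reuse + fresh
        return 0.0

    return P_seq
```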
Final Prediction
- After we obtain Pmiss(cseqX(dX,nX)) for all cseqX(dX,nX), predict the total misses for thread X:
  missX = old_missX + Σ_{dX=1}^{A} Pmiss(cseqX(dX,nX)) × C_dX
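The final summation can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the profile format, the function names, and in particular the assumption that thread Y contributes about nX intervening accesses per cseqX lifetime (i.e. equal relative speeds) are my simplifications.

```python
def predict_extra_misses(cseq_profile_X, P_seq_Y, assoc):
    """cseq_profile_X maps (dX, nX) -> occurrence count F(cseqX);
    P_seq_Y(d, n) gives the probability that n intervening accesses by
    thread Y touch d distinct addresses. Returns the predicted number
    of additional misses for X due to cache sharing."""
    extra = 0.0
    for (dX, nX), F in cseq_profile_X.items():
        if dX > assoc:
            continue                 # already a miss when X runs alone
        nY = nX                      # assumption: equal relative speeds
        p_miss = sum(P_seq_Y(dY, nY)
                     for dY in range(1, nY + 1)
                     if dX + dY > assoc)
        extra += p_miss * F
    return extra
```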
Observations
- Based on how vulnerable applications are to cache sharing impact:
  - Some are highly vulnerable
  - Some are not vulnerable
  - Many are somewhat / sometimes vulnerable
- Insights:
  - Traditional characterizations are not indicative of the impact of sharing:
    - Low vs. high IPC
    - Int vs. floating-point
    - High vs. low miss rate
  - Rather, the interaction of temporal reuse behaviors determines the impact of cache sharing
Motivation
- Cache design is critical to performance
- Memory wall: a cache miss costs hundreds of processor cycles
- Capacity pressure: multi-core designs, virtual machines
Motivation
- Performance variation due to replacement policy is significant
[Figure: L2 miss rates and normalized execution times of art, ammp, and cg under LRU, Rand, and MRUskw replacement, showing large variation (miss rates ranging from a few percent to 67%)]
- No agreement on the best implementation:
  - Intel Pentium: LRU
  - Intel XScale: FIFO
  - IBM Power4: tree-based pseudo-LRU
  - Others: round robin, random, replacement hints, etc.
Motivation
- No analytical model of replacement policies: past models assume LRU [Cascaval03, Chandra05, Ghosh97, Quong94, Sen02, Singh92, Suh01] or Random [Agarwal89, Berg04, Ladner99]
- LRU/Random simplifies modeling, but:
  - Ignores performance variation due to replacement policy
  - Is inaccurate for highly associative caches
Outline
- Input of the Model
- Replacement Policy Model
- Case Study
- Conclusions
Would be useful to model replacement policies
[Diagram: circular sequence profiles of App 1 .. App N, together with a Replacement Probability Function (RPF) for each replacement policy RP 1 .. RP M, feed the replacement policy model, which outputs a predicted miss rate for each application under each replacement policy]
Outline
- Input of the Model
  - Replacement Probability Function (RPF)
  - Circular Sequence Profiles
- Replacement Policy Model
- Validation
- Conclusions