Practical Cache Performance Modeling for Computer Architects
Yan Solihin, NCSU, solihin@ncsu.edu
Fei Guo, NCSU, fguo@ncsu.edu
Thomas Puzak, IBM, trpuzak@us.ibm.com
Phil Emma, IBM, pemma@us.ibm.com
Computer Performance Modeling
High-Level Models
- Purpose: evaluating gross trade-offs of designs
- Pros:
  - Short execution time, sometimes O(1)
  - Requires little coding
  - Reveals basic relationships of variables
  - May reveal non-obvious trends and insights
- Cons:
  - Less versatile
  - Requires performance modeling expertise
- Uses:
  - Early design cycle: for pruning the design search space
  - Entire design cycle: to re-evaluate the design search space if requirements change
Computer Performance Modeling
Goal: estimating and understanding the performance of computer systems
Low-Level Models
- Various levels of detail: functional, trace-driven, cycle-accurate, etc.
- Pros:
  - Versatile: adaptable to new architectures and workloads
  - Detailed: can embed very detailed statistics
  - Popular: SimpleScalar and SIMICS are widely used
- Cons:
  - O(n) overhead, scales with workload size: each instruction/event is simulated to capture its effect
  - Slow: a realistic workload has many billions of instructions
Modeling Methods
White Box
- Model incorporates knowledge about the system (e.g. relationships of parameters are known a priori)
- Analytical or heuristics-based
- Pros: models reveal insights and explain "why"; no training required
- Cons: problem-specific solution
Black Box
- Model learns knowledge about the system
- AI-based: neural networks, decision trees, curve fitting, etc.
- Pros: can model a complex problem/system
- Cons: prediction without insights; requires training
Program
- 8:30-8:45: Introduction
- 8:45-9:00: Capturing temporal locality behavior
- 9:00-9:30: Modeling cache sharing
- 9:30-10:00: Modeling cache replacement policy
- 10:00-10:30: Coffee break
- 10:30-11:30: Analysis of the effects of miss clustering on the cost of a cache miss
- 11:30-12:30: Interaction of caching and bandwidth pressure
Temporal Locality Behavior
- Programs exhibit locality of references
  - Spatial locality: the neighbors of recently-accessed data tend to be accessed in the near future
  - Temporal locality: recently-accessed data tends to be accessed again (or reused) in the near future
- Significance: temporal reuse and cache parameters determine all non-cold misses
  - If each memory block is accessed exactly once, we only have cold misses
  - Cold misses are affected by block size
- How can we capture temporal locality?
Stack Distance Profiling [Mattson70]
- Early attempt to capture temporal reuse behavior
- Models an LRU stack with a counter for each stack position
- Example: fully associative cache with an 8-entry stack
  - C1: incremented whenever the MRU block is accessed
  - C2: incremented whenever the 2nd MRU block is accessed
  - C3: incremented whenever the 3rd MRU block is accessed
  - ...
  - C8: incremented whenever the 8th MRU (or LRU) block is accessed
  - C>8: incremented whenever the 9th, 10th, ... block is accessed
Stack Distance Properties
- For a fully-associative LRU cache with A blocks, the number of misses of the cache is
  Misses = Σ_{i=A+1}^{∞} C_i
- For an A-way set-associative LRU cache, we can collect a set-specific stack distance profile, and the number of misses of each set is
  Misses = Σ_{i=A+1}^{∞} C_i
- Alternatively, keep per-set stacks, but use a global set of counters
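The counting rule and the miss formula above can be sketched in a few lines. This is an illustrative sketch (the function names are mine, not from the tutorial): it maintains one fully associative LRU stack, bumps a counter per stack distance, and then applies Misses = Σ_{i=A+1}^{∞} C_i.

```python
from collections import defaultdict

def stack_distance_profile(accesses):
    """Collect stack distance counters for one fully associative LRU
    stack. Returns a dict: stack position -> count (1 = MRU hit).
    First-touch accesses are counted under the key 'cold'."""
    stack = []                          # index 0 = MRU
    counters = defaultdict(int)
    for addr in accesses:
        if addr in stack:
            counters[stack.index(addr) + 1] += 1   # 1-based distance
            stack.remove(addr)
        else:
            counters['cold'] += 1
        stack.insert(0, addr)           # move/insert at MRU position
    return counters

def predicted_misses(counters, assoc):
    """Non-cold misses of an `assoc`-block LRU cache: the sum of the
    counters beyond stack position `assoc`."""
    return sum(c for pos, c in counters.items()
               if pos != 'cold' and pos > assoc)

prof = stack_distance_profile(list("ABCABC"))
# A, B, C are cold; each reuse hits at stack distance 3,
# so a 2-block cache misses on all three reuses, a 4-block cache on none.
```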
Typical Shape
- Empirical observation: the counters follow a geometric (exponential) sequence
- Due to temporal locality: C_{i+1} = C_i × r, where 0 < r < 1 is the common ratio
[Figure: bar chart of the profile, e.g. ~30% of accesses increment the MRU counter, ~25% the 2nd MRU counter, ~20% the 3rd MRU counter, decreasing down the stack, with the last bar incremented on a miss]
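Under this geometric assumption, the predicted non-cold miss ratio of an A-block LRU cache has a simple closed form: the tail Σ_{i=A+1}^{∞} C_i divided by the total of the geometric series, which approaches r^A. A small illustrative check (the helper names are mine):

```python
def geometric_counters(c1, r, n):
    """First n stack distance counters under C_{i+1} = C_i * r."""
    return [c1 * r**i for i in range(n)]

def miss_fraction(counters, assoc):
    """Fraction of profiled accesses whose stack distance exceeds the
    associativity, i.e. the predicted non-cold miss ratio."""
    return sum(counters[assoc:]) / sum(counters)

cs = geometric_counters(100.0, 0.5, 60)     # r = 0.5, 60 positions
# For a long geometric profile this approaches r**assoc = 0.5**4:
assert abs(miss_fraction(cs, 4) - 0.5**4) < 1e-9
```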
Where to Profile
- For capturing temporal reuse patterns at the L1 cache level (L1 instruction and L1 data): predict cache misses for various L1 cache configurations
- For capturing temporal reuse patterns at the L2 cache level: predict cache misses for various L2 cache configurations
[Figure: processor P with L1 instruction and L1 data caches, backed by an L2 cache and memory]
Limitations of Stack Distance Profile
- Useful only for predicting cache misses across different cache associativities
- For other purposes, we need to capture temporal reuse patterns in greater detail
- So, use circular sequence profiling [Chandra05]
  - Extends stack distance profiling
  - Counts the occurrences of cseq(d,n)
Relationship with Stack Distance Profile
- C_x = the number of circular sequences cseq(d=x, n=any value), i.e.
  C_x = Σ_{n=x}^{∞} cseq(x,n)
- Hence, the stack distance profile is a subset of the circular sequence profile
Definitions
- seq(d,n) = a sequence of n accesses to d distinct addresses (in a cache set)
- cseq(d,n) (circular sequence) = a seq(d,n) in which the first and the last accesses are to the same address
- Example: the access stream A B C D A E E B is seq(5,8) and contains:
  - A B C D A = cseq(4,5)
  - B C D A E E B = cseq(5,7)
  - E E = cseq(1,2)
Collecting Circular Sequence Profile
Example: one 4-way set with access stream A B C B A, an LRU stack, and a table of access counters indexed by d (rows 1..4) and n (columns 1..5):
- Access A: not in the stack; stack becomes A
- Access B: not in the stack; stack becomes B A
- Access C: not in the stack; stack becomes C B A
- Access B: hit at stack depth 2; found a circular sequence B C B = cseq(2,3), so the (d=2, n=3) counter is incremented; stack becomes B C A
- Access A: hit at stack depth 3; found a circular sequence A B C B A = cseq(3,5), so the (d=3, n=5) counter is incremented; stack becomes A B C
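This walkthrough can be reproduced mechanically. A hedged sketch (my own helper, not the authors' code): on every reuse, the stack depth of the address gives d and the number of accesses since its previous access (inclusive of both endpoints) gives n; summing the counts over n recovers the stack distance counters, as stated earlier.

```python
from collections import defaultdict

def circular_sequence_profile(accesses):
    """Collect cseq(d, n) counts for one cache set: on each reuse of
    an address, d = distinct addresses in the circular sequence and
    n = accesses between (and including) the two accesses to it."""
    stack = []                       # index 0 = MRU
    last_pos = {}                    # address -> index of last access
    counts = defaultdict(int)        # (d, n) -> occurrences
    for i, addr in enumerate(accesses):
        if addr in stack:
            d = stack.index(addr) + 1      # stack depth = distinct addrs
            n = i - last_pos[addr] + 1     # accesses incl. both endpoints
            counts[(d, n)] += 1
            stack.remove(addr)
        stack.insert(0, addr)
        last_pos[addr] = i
    return counts

prof = circular_sequence_profile(list("ABCBA"))
# The stream yields exactly cseq(2,3) for B..B and cseq(3,5) for A..A,
# matching the walkthrough above.
```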
Predicting Contention Across Caches
Shared Cache Challenge
- In today's CMPs, the L2 cache is shared by multiple cores
- Applications on different cores compete for L2 cache space
[Figure: CMP with two processors P, each with a private L1, sharing one L2 cache]
Modeling Goal
- Given n applications, predict the miss rates of any pair of applications
- Input:
  - Behavior of each application
  - Cache parameters
  - Relative speed when the pair runs together
- Output: number of cache misses for each application in the pair
Impact of Cache Space Contention
[Figure: bar charts of miss increase (up to ~400%) and IPC degradation under cache sharing]
- Significant, application-specific, and co-schedule-specific: up to 4X cache misses, 65% IPC reduction
- How to model the impact of cache sharing?
Assumptions
- LRU replacement algorithm
- Applications share nothing
  - Mostly true for sequential apps (except for library and OS code)
- Applications are not similar
  - Parallel apps: threads likely to show uniform behavior, so predicting their miss rates is trivial
Circular Sequence Properties
- Thread X runs alone in the system:
  Given a circular sequence cseqX(dX,nX), the last access is a cache miss iff dX > Assoc
- Thread X shares the cache with thread Y:
  During cseqX(dX,nX)'s lifetime, if there is a sequence of intervening accesses seqY(dY,nY), the last access of thread X is a miss iff dX + dY > Assoc
Example
Assume a 4-way associative cache. X's circular sequence is cseqX(2,3) = A B A, and Y's intervening access sequence is U V V W. During cseqX's lifetime:
- No cache sharing: the last A is a cache hit (dX = 2 ≤ 4)
- Interleaving A U B V V A: seqY(2,3) intervenes, dX + dY = 2 + 2 ≤ 4, so the last A is still a cache hit
- Interleaving A U B V V W A: seqY(3,4) intervenes, dX + dY = 2 + 3 > 4, so the last A becomes a cache miss
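Both interleavings can be checked by directly simulating a 4-way LRU set. A small sketch (the helper name is mine):

```python
def misses(accesses, assoc):
    """Miss count of one `assoc`-way LRU set."""
    stack, miss = [], 0
    for a in accesses:
        if a in stack:
            stack.remove(a)
        else:
            miss += 1
        stack.insert(0, a)
        del stack[assoc:]            # evict blocks beyond the associativity
    return miss

# A U B V V A: only the 4 cold misses (A, U, B, V); the final A hits.
# A U B V V W A: 5 cold misses plus the final A, evicted by W, misses.
```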
Inductive Probability Model
- Define Pmiss(cseqX) = the probability that the last access of cseqX is a cache miss
- For each cseqX(dX,nX) of thread X:
  - Compute the number of intervening accesses from thread Y during cseqX's lifetime; denote it nY
  - The intervening accesses may touch dY = 1, 2, ..., nY distinct addresses; compute the probability of each dY, denoted P(seq(dY,nY))
  - For each dY = 1, 2, ..., nY:
    - If dY + dX > Assoc: Pmiss(cseqX) = Pmiss(cseqX) + P(seq(dY,nY))
    - If dY + dX ≤ Assoc: Pmiss(cseqX) is kept the same
- Misses = old_misses + Σ Pmiss(cseqX) × F(cseqX)
Computing P(seq(dY,nY))
- Basic idea: a seq(d,n) arises from a smaller sequence by appending one access:
  - seq(d,n-1) + 1 access to an already-seen address (one of its d distinct addresses), or
  - seq(d-1,n-1) + 1 access to a new address
- This is a Markov process: P(seq(d,n)) = A × P(seq(d-1,n-1)) + B × P(seq(d,n-1)), where
  B = (Σ_{i=1}^{d} C_i) / (Σ_{i=1}^{∞} C_i) and A = 1 − B
Example
Expanding P(seq(2,3)) with the recurrence:
  P(seq(2,3)) = P(1+) × P(seq(1,2)) + P(2−) × P(seq(2,2))
  P(seq(1,2)) = P(1−) × P(seq(1,1)) = P(1−) × 1
  P(seq(2,2)) = P(1+) × P(seq(1,1)) = P(1+) × 1
Hence P(seq(2,3)) = P(1+) × P(1−) + P(2−) × P(1+)
Overall Formula
Define P(d−) = (Σ_{i=1}^{d} C_i) / (Σ_{i=1}^{∞} C_i) and P(d+) = 1 − P(d−).
P(seq(d,n)) is computed by:
  P(seq(d,n)) = 1, if d = n = 1
  P(seq(d,n)) = P((d−1)+) × P(seq(d−1,n−1)), if d = n > 1
  P(seq(d,n)) = P(d−) × P(seq(d,n−1)) + P((d−1)+) × P(seq(d−1,n−1)), if n > d > 1
  P(seq(1,n)) = P(1−) × P(seq(1,n−1)), if n > 1
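The piecewise recurrence translates directly into a memoized recursive function. A sketch under the definitions of P(d−) and P(d+) above (function names are mine); a useful sanity check is that for any n, the probabilities P(seq(d,n)) over d = 1..n sum to 1, which follows from P(d−) + P(d+) = 1.

```python
from functools import lru_cache

def make_P_seq(counters):
    """Build P(seq(d, n)) from thread Y's stack distance counters
    (counters[i-1] = C_i), following the piecewise recurrence."""
    total = sum(counters)

    def P_minus(d):            # prob. that the next access reuses top-d
        return sum(counters[:d]) / total

    def P_plus(d):
        return 1.0 - P_minus(d)

    @lru_cache(maxsize=None)
    def P_seq(d, n):
        if d == n == 1:
            return 1.0
        if d == n:             # every access was to a new address
            return P_plus(d - 1) * P_seq(d - 1, n - 1)
        if n > d >= 1:         # last access: a reuse, or a new address
            reuse = P_minus(d) * P_seq(d, n - 1)
            fresh = P_plus(d - 1) * P_seq(d - 1, n - 1) if d > 1 else 0.0
            return reuse + fresh
        return 0.0

    return P_seq
```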
Final Prediction
- After we obtain Pmiss(cseqX(dX,nX)) for all cseqX(dX,nX), predict the total misses for thread X:
  missX = old_missX + Σ_{dX=1}^{A} Pmiss(cseqX(dX,nX)) × C_dX
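The final summation can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the profile format, the function names, and in particular the assumption that thread Y contributes about nX intervening accesses per cseqX lifetime (i.e. equal relative speeds) are my simplifications.

```python
def predict_extra_misses(cseq_profile_X, P_seq_Y, assoc):
    """cseq_profile_X maps (dX, nX) -> occurrence count F(cseqX);
    P_seq_Y(d, n) gives the probability that n intervening accesses by
    thread Y touch d distinct addresses. Returns the predicted number
    of additional misses for X due to cache sharing."""
    extra = 0.0
    for (dX, nX), F in cseq_profile_X.items():
        if dX > assoc:
            continue                 # already a miss when X runs alone
        nY = nX                      # assumption: equal relative speeds
        p_miss = sum(P_seq_Y(dY, nY)
                     for dY in range(1, nY + 1)
                     if dX + dY > assoc)
        extra += p_miss * F
    return extra
```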
Observations
- Based on how vulnerable applications are to cache sharing impact:
  - Some are highly vulnerable
  - Some are not vulnerable
  - Many are somewhat / sometimes vulnerable
- Insights:
  - Traditional characterizations are not indicative of the impact of sharing:
    - Low vs. high IPC
    - Int vs. floating-point
    - High vs. low miss rate
  - Rather, the interaction of temporal reuse behaviors determines the impact of cache sharing
Motivation
- Cache design is critical to performance
- Memory wall: a cache miss costs hundreds of processor cycles
- Capacity pressure: multi-core designs, virtual machines
Motivation
- Performance variation due to replacement policy is significant
[Figure: L2 miss rates and normalized execution times of art, ammp, and cg under LRU, Rand, and MRUskw replacement, showing large variation (miss rates ranging from a few percent to 67%)]
- No agreement on the best implementation:
  - Intel Pentium: LRU
  - Intel XScale: FIFO
  - IBM Power4: tree-based pseudo-LRU
  - Others: round robin, random, replacement hints, etc.
Motivation
- No analytical model of replacement policies: past models assume LRU [Cascaval03, Chandra05, Ghosh97, Quong94, Sen02, Singh92, Suh01] or Random [Agarwal89, Berg04, Ladner99]
- LRU/Random simplifies modeling, but:
  - Ignores performance variation due to replacement policy
  - Is inaccurate for highly associative caches
Outline
- Input of the Model
- Replacement Policy Model
- Case Study
- Conclusions
Would be useful to model replacement policies
[Diagram: circular sequence profiles of App 1 .. App N, together with a Replacement Probability Function (RPF) for each replacement policy RP 1 .. RP M, feed the replacement policy model, which outputs a predicted miss rate for each application under each replacement policy]
Outline
- Input of the Model
  - Replacement Probability Function (RPF)
  - Circular Sequence Profiles
- Replacement Policy Model
- Validation
- Conclusions