Understanding and Improving Performance of 3D Graphics Card and Applications

Liu Ren (liuren@cs.cmu.edu)
Hua Zhong (zhonghh@cs.cmu.edu)
Computer Science Department
Carnegie Mellon University

1 Introduction

1.1 Importance of Graphics Hardware

In the past three years, graphics hardware may have been the most active part of the computer hardware market, with new generations of chips appearing almost every 3 months. The newest graphics cards are equipped with a Graphics Processing Unit (GPU) and very fast memory with bandwidth over 8.0GB/sec. The GPU now has as many as 4 pipelines that can work in parallel, and programmable shaders that can take some workload from the CPU. Graphics cards are therefore much more complex than they used to be, and more advanced functions and sophisticated architectures have been introduced into them. Hardware manufacturers such as Nvidia and ATI have invented many new features to make their cards run faster and faster. When buying a PC now, one must consider the graphics card as much as the CPU! Along with the great improvements of graphics hardware, demands for more realistic 3D games that exploit the fast speed of graphics chips also make the game market one of the most profitable markets. The newest games for Microsoft's Xbox, powered by Nvidia's special version of the GeForce3 graphics chip, exceed everyone's imagination of how realistic current real-time 3D games could be (Figure 1).

Figure 1 Xbox's game DoA3

So understanding the architecture of the most advanced graphics hardware, and what the potential bottlenecks of future graphics hardware and applications will be, is more meaningful now than before.
As more and more new features such as programmable shaders, hardware transform and lighting, and hardware skinning have been added [5], it is non-trivial to identify what the future bottleneck of graphics chips will be. And even if we can figure out all the potential bottlenecks of a very complicated graphics card, we still need a clear view of what future graphics applications need most before we can determine which bottleneck is the most important one.

1.2 What Future Graphics Applications Need

More realistic and more natural is the goal of all 3D applications in the future. Since the geometry of most objects is represented by triangles in computer graphics, to get more realistic geometric details we need more triangles, especially for natural objects such as animals and human beings (Figure 2).

Figure 2 More triangles vs. fewer triangles

Another important factor in making 3D applications more realistic is more detailed textures. To make a scene look realistic even in close-up views, we must apply high-resolution textures. The requirements for more triangles and for high-resolution textures both turn into a demand for more memory. We can see this trend in the evolution of storage media for games: from memory cards to CDs to the DVDs used by Xbox and PS2. Besides that, to make the rendered objects more realistic, we need to add more special effects such as fog, bump mapping, reflections and refractions, and more sophisticated lighting models. This means more computation, but the CPU can't take that much burden while still working on AI and other tasks in a computer game. So using programmable shaders is a good way to take these heavy workloads off the CPU.

There are other trends in future graphics applications, such as more natural character motion and more realistic expressions. But here we would like to focus on memory and programmable shaders, which are definitely two of the most important parts for future graphics applications. Next we will identify the corresponding potential bottlenecks by showing the results of some specially designed benchmarks. Then we will provide some analyses based on the results and conclude with some suggestions for both graphics software programmers and hardware designers.

2 Memory Bottleneck

2.1 Potential Bottleneck of Memory

2.1.1 Memory Architecture of Graphics Card

There are three kinds of memory that a graphics application always uses: local video memory (memory on the card), system memory and AGP memory.

Local video memory normally consists of DDR memory or the fastest kind of memory one can get.
It is directly integrated on the board, so it can be accessed by the GPU at very high speed.

AGP means Accelerated Graphics Port. AGP memory is actually a portion of main system memory that can be accessed by the CPU and the graphics controller through a special AGP bus, which is much faster than the standard PCI bus. With the current AGP 2.0 specification, an AGP4x bus can maintain a bandwidth of 1GB/sec. The goal of AGP is to use cheap system memory to compensate for the lack of high-performance but expensive video memory. Another advantage of AGP memory is that both the graphics controller and the CPU can access it very fast. Some AGP tutorials are available at [2].

System memory can be PC133 SDRAM and is accessed by the CPU and the graphics controller through the PCI bus.

These three kinds of memory don't form a hierarchy like a cache-memory architecture. Instead they are independent of and parallel with each other in a graphics application. Basically, there is no restriction on where the data of a graphics application should be put among these three kinds of memory. Most of the time, local video memory is the fastest one, but since video memory is limited, AGP memory and system memory are also frequently used in most graphics applications.

2.1.2 Potential Bottlenecks of the Architecture

The three kinds of memory have different data transfer bandwidths. Here is a list of the bandwidths of video memory, AGP memory and system memory:

Memory Type      Memory Bandwidth                                 Bus Bandwidth
Video Memory     Up to 8G-10G (GeForce3 & Radeon 8500)            8G-10G (128-bit DDR)
AGP Memory       1G (PC133) - 1.6G (PC800) or higher (PC2700)     528M (AGP2x), 1056M (AGP4x)
System Memory    1G (PC133) - 1.6G (PC800) or higher (PC2700)     133M (PCI33) - 512M (PCI66)

Table 1 Different Memory Bandwidths (bytes per second)

Since AGP has a deep pipeline and uncached mechanisms [3], and PCI suffers a lot of contention between memory and other I/O devices, the gap between system and AGP memory is larger than we expected.

From Table 1, we can expect bottlenecks when a graphics application puts data in different memories: data from faster memory must wait for data from slower memory to arrive before the graphics pipeline can continue.

2.1.3 Why this Bottleneck Is Important For Future Graphics Applications

As future graphics applications need more complex geometry models and more detailed textures, the data size will increase dramatically. AGP memory and even system memory may then still be places where these data reside, even with increased video memory on board. And because the dynamic textures and vertices that both the video controller and the CPU update frequently are always put in AGP memory, we think this memory bottleneck will become more and more salient in the future. So having a clear picture of this memory bottleneck will be helpful for both software programmers and hardware designers. Next we will focus on testing: verifying the existence of the memory bottleneck and showing how serious it can be. Finally we will conclude with some suggestions for software programmers and hardware designers based on our testing results.
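The bus figures in Table 1 can be reproduced with simple arithmetic: peak bandwidth is clock rate times bus width times transfers per clock. The sketch below is only an illustration under stated assumptions: a 66MHz AGP base clock with a 4-byte data path (which gives the table's 528M and 1056M), and the 500MHz memory speed listed for the GeForce3 in Table 3 treated as the effective DDR transfer rate on a 128-bit path. The function and constant names are hypothetical.

```python
def peak_bandwidth_mb(clock_mhz, bus_bytes, transfers_per_clock):
    """Peak bandwidth in MB/s, taking 1 MB as 10**6 bytes."""
    return clock_mhz * bus_bytes * transfers_per_clock


AGP_CLOCK_MHZ = 66  # assumption: nominal AGP base clock
AGP_BUS_BYTES = 4   # assumption: 32-bit AGP data path (4B per transfer)

# AGP1x, AGP2x and AGP4x move 1, 2 and 4 transfers per bus clock.
agp = {mode: peak_bandwidth_mb(AGP_CLOCK_MHZ, AGP_BUS_BYTES, mode)
       for mode in (1, 2, 4)}

# 128-bit (16-byte) video memory path at an assumed 500MHz effective rate.
video = peak_bandwidth_mb(500, 16, 1)

print(agp)    # {1: 264, 2: 528, 4: 1056}
print(video)  # 8000
```

These are peak numbers only; sustained throughput on either bus is lower, which is what the benchmarks in the following sections measure.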
2.2 Test Environment

Our tests are based on two machines that are equipped with the most advanced graphics cards:

                 Machine #1            Machine #2
CPU              PIII 1GHz             PIII 1GHz
Graphics Card    ATI Radeon 32M DDR    Nvidia GeForce3 64M DDR
System Mem       512M PC133            1G PC133
OS               Win2K SP2             Win2K SP2

Table 2 Test Environments

Details of the two graphics cards are as below:

                    ATI Radeon 32M DDR    Nvidia GeForce3 64M DDR
Chip clock          183MHz                200MHz
Memory Data Path    128-bit               128-bit
Memory Speed        366MHz                500MHz
Memory Size         32M DDR               64M DDR
AGP support         2x, 4x                2x, 4x

Table 3 Graphics Cards We Used

The software is a specially designed program that we wrote ourselves using Microsoft DirectX 8.1. The drivers of the two cards are:
ATI driver for Radeon: ati3dvag.dll, version 5.13.01.3102 (English)

Nvidia driver for GeForce3: nv4_disp.dll, version 21.83 (English)

2.3 Design of Benchmark Program

To identify the memory bottleneck, we designed and implemented a special benchmark program. To ensure that we correctly identify the memory bottleneck, we must make our program free of other possible bottlenecks of graphics hardware. It must also be able to run on different hardware and to allocate data arbitrarily into different parts of memory.

2.3.1 Other Possible Bottlenecks of Graphics Applications and How to Get Rid of Them

Besides the memory bottleneck mentioned above, there are other bottlenecks. These include the CPU bottleneck, the GPU bottleneck, the video card memory bandwidth bottleneck, and the filling rate bottleneck.

Some graphics applications need a lot of calculation by the CPU, so sometimes the graphics controller and other parts of the application must wait for the CPU to finish those calculations before they can continue their jobs. That is why more and more of the computation for geometry transformation, lighting and other tasks has been moved to the GPU to share the heavy load with the CPU. Here we must avoid this bottleneck since we only want to identify the memory bottleneck in the test. To do this, first, we choose a machine with a fast CPU (PIII 1GHz). Second, we make sure that our benchmark program puts almost no load on the CPU: we just render a static textured square, and there are no animations that require a lot of CPU work. The GPU is not a bottleneck in our application either, because lighting is disabled and there are no special rendering effects that require a lot of GPU work.

Filling rate depends directly on rendering resolution. The common resolution 640x480 is made up of 307,200 pixels, while a high-resolution frame buffer such as 1600x1200 requires 1,920,000 pixels. The 3D chip has to 'render' each pixel of a frame before the frame can be displayed. The 'frame rate' is defined as the number of frames that can be displayed in a certain amount of time.
It's easy to see that it takes more video card memory bandwidth to sustain a given frame rate at a high resolution than at a low resolution. This is why 3D cards typically score high frame rates at 640x480 and lower frame rates at 1600x1200. So in our benchmark program, we render at a small resolution of 400x300, which guarantees that neither video card memory bandwidth nor filling rate affects our final frame rate.

Also, as the geometry of an object becomes more complex and the texture becomes more detailed, the performance of the application will drop. This is because the amount of primitive (triangle) data that the GPU needs to process increases. So the drop in performance is determined only by the number of triangles rendered. In our test, we compare results with the same primitive number or the normalized frame rate (performance/primitive number).

2.3.2 How To Arbitrarily Put Data into Different Memory Locations

We use Microsoft DirectX 8.1 [4] to implement our benchmark program. Microsoft DirectX is an advanced suite of multimedia APIs (application programming interfaces) built into Microsoft Windows operating systems. DirectX provides a standard development platform for Windows-based PCs by enabling software developers to access specialized hardware features without having to write hardware-specific code.

In DirectX, the location of data can be specified by two parameters. One is D3DUSAGE, and the other is D3DPOOL. With different combinations of these two parameters, we can arbitrarily allocate memory in local video memory, AGP memory or system memory.

D3DUSAGE               D3DPOOL               Location of the memory
D3DUSAGE_WRITEONLY     D3DPOOL_DEFAULT       Local Video Memory
D3DUSAGE_DYNAMIC       D3DPOOL_DEFAULT       AGP Memory
D3DUSAGE_DEFAULT       D3DPOOL_SYSTEMMEM     System Memory

Table 4 DirectX Parameter Combinations
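The mapping in Table 4 can be written down as a small lookup, which is roughly how a benchmark might select a placement for each test run. This is a minimal sketch, not the authors' code: the flag names are plain strings copied from Table 4 (with spelling normalized), the Python names are hypothetical, and nothing here calls Direct3D.

```python
# Table 4 as data. The keys are (usage, pool) pairs written as strings;
# this is an illustration, not a Direct3D binding.
PLACEMENT = {
    ("D3DUSAGE_WRITEONLY", "D3DPOOL_DEFAULT"): "local video memory",
    ("D3DUSAGE_DYNAMIC", "D3DPOOL_DEFAULT"): "AGP memory",
    ("D3DUSAGE_DEFAULT", "D3DPOOL_SYSTEMMEM"): "system memory",
}


def memory_location(usage, pool):
    """Return the memory type that Table 4 reports for a usage/pool pair."""
    try:
        return PLACEMENT[(usage, pool)]
    except KeyError:
        raise ValueError(
            f"combination not listed in Table 4: {usage}, {pool}"
        ) from None


def combination_for(location):
    """Return the (usage, pool) pair that Table 4 lists for a memory type."""
    for key, value in PLACEMENT.items():
        if value == location:
            return key
    raise ValueError(f"memory type not listed in Table 4: {location}")


print(memory_location("D3DUSAGE_DYNAMIC", "D3DPOOL_DEFAULT"))  # AGP memory
```

Combinations that Table 4 does not list raise an error rather than guessing, since the table reports only the three placements used in the tests.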

2.3.3 Implementation Details and User Interface (UI)

The benchmark program renders a square that is constructed from a large number of triangles. We can change the number of triangles and map a texture onto this triangle mesh. Triangles are put into the rendering pipeline as primitive streams. Each stream is a vertex array, and every three consecutive vertices represent a triangle. Since the hardware always limits the maximum size of a single primitive stream, when the number of triangles is too large we have to split the big triangle list into several primitive streams and then feed them into the rendering pipeline one by one. This split usually doesn't incur additional cost with DirectX 8.1 (with DirectX 7, too many streams cause a serious performance loss), so it doesn't affect our frame rate results. The memory holding vertex data is called the vertex buffer (VB).

Figure 3 UI of benchmark program

Besides the vertex buffer, we also create textures of different sizes, from a resolution of 256x256 to 2048x2048. The largest texture size is about 12MB.

2.4 Result and Analysis

2.4.1 Speed of Three Kinds of Memories

In this test we put the same data in video memory, AGP memory and system memory respectively, and record the final frame rate of the application. The goal of this test is to give us a clear view of how large the gap is among these three kinds of memory. The result is shown below:

Triangle #    VB Size    FPS(Video)    FPS(AGP2x)    FPS(Sys)    FPS(AGP4x)
5000          0.42M      1007.02       728.29        480.53      1027.26
20000         1.68M      308.7         208.8         136.87      313.1
80000         6.72M      81.36         54.52         35.31       81.63
320000        26.88M     20.66         13.71         8.96        20.68
1280000       107.52M    4.24          3.43          2.25

Table 5 VB in Video, AGP and System memory

VB size here means the size of the vertex buffer. As we can see, the performance is best when data is stored in video memory and AGP4x.
The bandwidth of AGP4x is above 1GB/sec, and with a deep pipeline, the latency of reading data from AGP4x memory can be effectively hidden. In AGP2x, the performance is worse since the bandwidth of AGP2x is half that of AGP4x. For system memory, the result is even worse. Here we can clearly identify the gap between the different kinds of memory. In this case AGP4x seems no worse than video memory and is even a little faster! Does that mean AGP4x is better? The answer is no, as we can see from the next test. One thing we would like to point out is that there were a lot of errors when running the program in AGP4x. To fill the 4 AGP4x slots in Table 5, we had to reboot Machine #2 three times, and the rendering result also had errors. The main reason why AGP4x is so unstable is that it transfers too much data within one bus clock. In AGP1x, 4B of data are transferred at every falling edge. In AGP2x, 4B of data are transferred at each falling and rising edge. In AGP4x, a further 4B of data are transferred between each falling and rising edge.
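The rows of Table 5 are easier to compare once frame rate is converted into triangle throughput and each memory type is expressed relative to video memory. The sketch below is one way to do that, under two assumptions that are not stated in the text: a vertex size of 28 bytes (inferred because 5000 triangles x 3 vertices x 28 bytes gives the listed 0.42M), and throughput defined as triangles times FPS. All names are hypothetical.

```python
BYTES_PER_VERTEX = 28        # assumption: inferred from 0.42M for 5000 triangles
VERTICES_PER_TRIANGLE = 3    # from the text: three consecutive vertices per triangle

# triangles -> frames per second, copied from Table 5
# (the last row has no AGP4x measurement)
TABLE5 = {
    5000:    {"video": 1007.02, "agp2x": 728.29, "sys": 480.53, "agp4x": 1027.26},
    20000:   {"video": 308.7,   "agp2x": 208.8,  "sys": 136.87, "agp4x": 313.1},
    80000:   {"video": 81.36,   "agp2x": 54.52,  "sys": 35.31,  "agp4x": 81.63},
    320000:  {"video": 20.66,   "agp2x": 13.71,  "sys": 8.96,   "agp4x": 20.68},
    1280000: {"video": 4.24,    "agp2x": 3.43,   "sys": 2.25},
}


def vb_size_bytes(triangles):
    """Vertex buffer size for a non-indexed triangle list."""
    return triangles * VERTICES_PER_TRIANGLE * BYTES_PER_VERTEX


def triangles_per_second(triangles, fps):
    """Triangle throughput, one possible normalization of the frame rate."""
    return triangles * fps


def relative_to_video(row):
    """Frame rate of each memory type as a fraction of the video-memory rate."""
    return {kind: fps / row["video"] for kind, fps in row.items()}


for triangles, row in TABLE5.items():
    size_m = vb_size_bytes(triangles) / 1e6
    ratios = {kind: round(r, 2) for kind, r in relative_to_video(row).items()}
    print(triangles, f"{size_m:.2f}M", ratios)
```

With these numbers the computed buffer sizes match the VB Size column, and system memory is slower than AGP2x, which is slower than video memory, in every row.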

2.4.2 Vertex Size Test

In this test, we fix the number of triangles but increase the size of each vertex. This increases the total data size but keeps the number of primitives the same. Since the graphics pipeline processes geometry data one primitive at a time, the workload for geometry computation is always the same. As we didn't change the resolution of the frame buffer, the filling rate and texture coordinate interpolation workloads are also the same all the time. Here we want to see how an increased memory load affects the performance of a graphics application while the GPU workload stays the same. This could happen in future graphics applications when more texture coordinates are supported for each vertex. The result is shown below:

Vertex size (bytes)    Video    AGP(2x)    AGP(4x)
96                     36.62    10.67      17.91
88                     36.67    10.7       17.81
80                     36.45    10.67      17.9
72                     36.34    10.76      17.89
64                     36.38    10.78      17.83
56                     36.63    12.25      20.38
48                     36.68    14.4       23.75
40                     36.7     17.36      28.44
[illegible]            36.84    19.58      31.57
32                     36.84    21.72      35.51
24                     36.84    28.93      36.65

Table 6 Different Vertex Sizes

[Figure: performance versus vertex size for video memory, AGP2x and AGP4x]

We increased the vertex size by adding redundant fields to the vertex structure, such as a normal vector (since we don't have lighting, this field is redundant) and additional texture coordinates (we only apply one texture in our program, so additional texture coordinates are redundant). These redundant fields ensure that we add no computational workload and only increase the memory size. This enables us to focus on the memory bottleneck that affects performance. The triangle number in this test is 180,000. Z depth is 16-bit. Resolution is 400x300.

We can see that when the data is stored in video memory, the performance stays the same as the vertex size increases, even when the vertex buffer's size is four times larger. This means the high bandwidth of video memory isn't the bottleneck.

In both AGP4x and AGP2x, the performance drops linearly as the vertex size increases. After 66 bytes per vertex, the performance stays the same. This is because after 66 bytes, the 5th to 8th sets of texture coordinates are the only components added to each vertex. Since current GeForce3 hardware only supports blending 4 textures within one pass, the graphics controller just ignores any additional texture coordinates. So the actual data transferred is still 66 bytes per vertex even when the vertex size is bigger than 66 bytes.

From this test, we can tell that the speed of AGP4x still can't compare to video memory. As the memory data size increases, the performance of AGP4x becomes worse than that of local video memory. So if some of the data are stored in video memory and some in AGP memory, the speed difference between these two kinds of memory will force the application to wait for the slower one, and the overall performance of the application will deteriorate. From the test result, we can see that the performance of AGP4x is almost half that of video memory.

2.4.3 Texture size test

In this test, we combine different sizes of vertex buffers and textures, and record the performance of each combination.
Since textures and vertices are the two main types of graphics data that may eat up a large portion of memory, and these two types of data are processed by the GPU quite differently, it is interesting to see when each of them is the major source of the memory bottleneck. The result is shown below:

Texture Size / Triangle #    720000    320000    80000    20000     5000       1250       288
256x256                      5.19      20.67     81.76    310.98    1025.62    2196.73    2396.71
512x512                      5.19      20.73     81.75    310.55    1025.52    1901.07    1967.28
1024x1024                    5.18      20.68     81.68    310.58    979.28     1015.88    1050.49
2048x2048                    5.19      20.64     81.63    310.73    986.04     1002.76    1020.63
NAxNA                        6.05      25.04     98.41    364.24    1167.71    2460.8     2764.99

Table 7 Textures & Vertices in GeForce3

Texture Size / Triangle #    720000    320000    80000    20000     5000      1250      288
256x256                      2.92      15.57     60.06    210.59    563.9     768.5     805.43
512x512                      2.92      15.57     60.06    210.59    545.93    553.92    579.87
1024x1024                    2.9       15.57     60.06    210.59    278.91    282.89    293.85
2048x2048                    2.88      15.57     60.06    107.37    111.91    93.27     104.8
NAxNA                        2.93      15.34     59.42    207.59    560.9     798.44    825.39

Table 8 Textures & Vertices in Radeon

From the above results, we can see that the source of the memory bottleneck changes from texture to vertex buffer.
1) When the triangle number is small (the vertex buffer is small), the performance decreases as the texture size increases.
2) When the triangle number is large (the vertex buffer is large), the performance is determined only by the number of triangles and is independent of the size of the texture.

Within one rendering step, the GPU normally accesses the vertex buffer several times to do transformation and lighting, clipping, and the Z-test. For textures, one access is enough when mapping the texture color to each pixel, since nearest-neighbor filtering is the scheme we use for texture mapping here. So, as we can see, the performance decrease caused by texture data and vertex data is not proportional to the memory they take. These tests give us some useful hints for graphics programming, which are described in the next section.

2.5 Suggestions for Software Programmer and Hardware Designers

2.5.1 Try to put important data in Video Memory

This is what the first two tests tell us. The speed gap between AGP memory and video memory is salient. So when a programmer has enough video memory, they should try to make full use of it. The problem is that many cheap cards with small video memories are still in use even though more advanced graphics cards are available now. Game developers must therefore make sure that their applications run smoothly on most cards. (This won't be a problem for game consoles, since all users have the same hardware.) They must be careful about which part of the graphics data should be put in video card memory and which part can be put in AGP memory while still maintaining good performance. As we discussed for the final test result, the vertex buffer always has more data transfers with the GPU, so it is not wise to put it in AGP memory. Texture data seem more suitable for AGP memory. And as faster AGP standards become available, the speed gap for texture data in different kinds of memory will be less and less noticeable.

2.5.2 Increase the AGP speed

This is for hardware designers. Since we can't increase the video memory size without limit, it is quite reasonable to improve the AGP memory bandwidth to minimize the speed gap between video card memory and AGP memory. Now that faster memories (PC2700, PC800) are available, a faster AGP bus is also needed to make full use of these faster main memories.
As we can see, the newest AGP specification 3.0 (5/2001) includes the AGP8x standard. That's good news, because the bandwidth of the AGP bus can reach 2GB/sec with AGP8x. But the stability of high-speed AGP is another problem. As we discussed before, even with AGP4x the transfer of data is not stable. This greatly impedes the use of high-speed AGP memory. Stability is determined by many factors: the motherboard manufacturer, the chipset designer, the video card manufacturer, the driver writers and the standard designers. A good driver especially is what users want most. In our tests, we found that new drivers always boost performance greatly, even with the same hardware. ATI's driver is more stable but seems more conservative. Nvidia's driver is smarter but generates more errors.

2.5.3 32-byte alignment is not necessary

In some graphics seminars, people have said that programmers should make their vertex size 32-byte aligned and that this will increase the speed of a program. But in our Vertex Size test, we can see that this really doesn't help performance. So, based on our test result, we don't need to pay attention to 32-byte alignment.

2.5.4 Complex Geometry or detailed texture

From our last test of vertex buffers and textures, we can get some really useful hints. Many graphics programmers choose a less detailed geometry model with vivid textures in their programs. They claim that this improves performance and causes only small visual defects. But our test result shows that this is not always true. For example, on the Radeon, 288 triangles with a 2048x2048 texture render at 104.8, while 20000 triangles with a 1024x1024 texture render at 210.59, almost two times faster. That is because when geometry memory is not large, texture memory becomes the source of the memory bottleneck. So before programmers reduce the number of triangles in their models, they need to make sure they find a good balance between performance and visual quality.

3 Graphics Processing Unit (GPU) and video card driver impact on performance

In this section, the impact of the GPU on performance is examined in detail. It was not easy to detect that the GPU is an emerging bottleneck for graphics applications that demand much more computation power than before until programmable shaders [4] (a programmable part of the graphics pipeline) recently became available. Thus, programmable shaders in the state-of-the-art graphics pipeline are first introduced in 3.1. Based on the programmable shaders, specific benchmarks can be made to identify bottlenecks caused by the GPU. These issues are addressed in 3.2 and 3.3. Further testing results in 3.4 expose some surprising facts and thus give game developers and graphics programmers some good suggestions. In 3.5, we also expose some weaknesses of current video card drivers.

3.1 Graphics pipeline and programmable shaders

3D graphics rendering involves numerous steps that are performed one after another. These steps can be thought of as an assembly line, or a pipeline (Figure 4). The pipeline can be broken down into two main stages: the geometry processing stage and the rendering stage. When using a graphics chip that supports hardware-accelerated transform and lighting (T&L), the geometry processing stage takes place in the GPU. Geometry processing involves calculations that modify or, in some cases, create new data for vertices. At this point, previous graphics cards use a fixed function pipeline (FFP) to perform transformation and lighting operations. Recently the vertex shader, a kind of programmable shader, was introduced and can be used to replace the FFP to perform a specified set of operations, including T&L, on every incoming vertex. In the rendering stage of this pipeline, another kind of programmable shader, called the pixel shader, can be used to replace the previous fixed function texturing, filtering and blending.

Figure 4 Pipeline

So from the programmer's point of view, a programmable shader is a piece of code that performs different kinds of operations in the GPU, including T&L and texturing. Compared with the FFP, programmable shaders provide more flexibility. They can also provide a new type of benchmark to evaluate GPU performance. In our experiment, all the benchmarks are different types of vertex shaders. By changing the number, the types or the arrangement of instructions in those shaders, we can expose more about GPU performance in different applications.

Figure 6 is the program model for vertex shaders. In this model, the vertex shader architecture streams vertex data into the shader from the graphics pipeline, performs operations on the data using an arithmetic logic unit (ALU), and outputs the transformed vertex data to the graphics pipeline for further processing. Registers are used for inputting data, outputting data, and holding temporary results. Each register holds four floating-point numbers. The vertex shader architecture defines several types of registers, each operating on a different type of data.

v0-v15. Vertex registers are used to stream vertex data into the shader ALU. Vertex data is defined by an application and can contain any data. These registers are read-only.
r0-r11. Temporary registers are used to hold intermediate results. There are no read/write limitations on these registers.
c0-c95. Constant value registers are designed to input constant values to the shader. These registers are read-only too.
a0. The address register can be used to provide an integer offset during the lookup of any constant value register.
oDn, oFog, oPos, oPts, oTn. Output registers output data from the vertex shader to the graphics pipeline for further processing. The registers contain transformed position data, texture coordinates, diffuse color, and specular color. These registers are write-only.

To simplify hardware design, there is no branching in vertex shaders. Read/write limitations also apply to the different types of registers, as listed above. A vertex shader can contain at most 128 instructions.

3.2 Benchmark tools and experiment setting

We built a benchmark tool (Figure 5) that renders the same checkerboard image by reading different kinds of shader benchmarks. The coding is done in VC++. The latest DirectX 8.1 driver is used for rendering, and MFC is used to generate the user interface. All the applications are run on a P4 (1.9GHz) PC with Windows XP, 512M RDRAM (PC800) and AGP 4X mode. Two state-of-the-art video cards are tested with our benchmarks. One is Nvidia's GeForce3 (64M DDR) and the other is ATI's Radeon 8500 (64M DDR). Both cards became available only this year.

For each benchmark, a small number of diffuse colored triangles (8192) are rendered for the checkerboard image. The vertex buffer size and index buffer size for those triangles are only 0.589824MB and 0.049152MB.
During rendering, we use a 16-bit Z buffer and 32-bit color. The frame buffer resolution is limited to 428x382. In our benchmark, the 1.9GHz P4 processor is not a bottleneck for rendering those triangles. Furthermore, since the vertex data size is quite small and no texture mapping is used, only a small portion of video card memory is used. Thus no AGP memory is used in our testing. Also, considering the low resolution of the frame buffer, the video card memory bandwidth will not be a bottleneck in our testing. However, the instruction types, the number of instructions and their arrangement differ from one vertex shader to another. Although the final rendered images are the same, the rendering speed, characterized by FPS (Frames Per Second), varies and exposes details about the computation power of different GPUs.
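The claim that only a small portion of video memory is used can be checked with a back-of-the-envelope estimate built from the figures quoted in this section. The sketch below is illustrative only and rests on assumptions that are labeled in the comments: 1M is taken as 10**6 bytes, and the frame buffer is counted as a single color buffer plus a single Z buffer, although a real swap chain holds more. The variable names are hypothetical.

```python
TRIANGLES = 8192
VB_BYTES = 589_824               # vertex buffer size quoted in the text
IB_BYTES = 49_152                # index buffer size quoted in the text
WIDTH, HEIGHT = 428, 382         # frame buffer resolution
COLOR_BYTES = 4                  # 32-bit color
Z_BYTES = 2                      # 16-bit Z buffer
VIDEO_MEMORY_BYTES = 64 * 10**6  # 64M card; assumption: 1M = 10**6 bytes

# Three indices per triangle; the quotient gives the size of one index.
bytes_per_index = IB_BYTES // (TRIANGLES * 3)

# Vertex data per triangle. 72 bytes would fit, for example, three unshared
# 24-byte vertices, but other layouts give the same total.
vb_bytes_per_triangle = VB_BYTES // TRIANGLES

# Assumption: one color buffer plus one Z buffer.
frame_bytes = WIDTH * HEIGHT * (COLOR_BYTES + Z_BYTES)

total_bytes = VB_BYTES + IB_BYTES + frame_bytes
share = total_bytes / VIDEO_MEMORY_BYTES

print(bytes_per_index)        # 2
print(vb_bytes_per_triangle)  # 72
print(frame_bytes)            # 980976
print(f"{share:.1%} of video memory")
```

Even if the frame buffer estimate were several times too low, the total would remain a small fraction of the 64M on either card, which is consistent with the statement that no AGP memory is needed.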

Figure 6 Program Model

Figure 5 UI of benchmark for shaders

3.3 Detecting GPU clock rate as a bottleneck

Benchmarks: The instruction set of the vertex shader consists of 17 operations. The hardware implementation of the GeForce3 card is assumed to issue one instruction per clock and execute all instructions with the same latency. However, we know that simple instructions like MOV, MUL and ADD are performed by SIMD vector units in the Floating-point Core (FPC) of the GeForce3, while complex instructions like RCP and RSQ are performed by its special function unit within about 1.5 bits of IEEE precision using a two-pass Newton-Raphson iteration from a seeded table. The ATI Radeon 8500 is believed to have different hardware implementations for simple instructions and complex instructions.

    dp4 oPos.x, v0, c0
    dp4 oPos.y, v0, c1
    dp4 oPos.z, v0, c2
    dp4 oPos.w, v0, c3
    mov oD0, v2

Figure 7

Based on these facts, each of our benchmarks has the basic block shown in Figure 7. The major task of this basic block is to perform the transformation for each vertex vector stored in v0. c0, c1, c2 and c3 are the column vectors of the transformation matrix. The mov instruction assigns the diffuse color of each vertex, stored in v2, to the write-only output register oD0. Besides these instructions, each vertex shader benchmark contains either a set of simple instructions or a set of complex instructions. Thus we have two types of shader benchmarks: one is called the simple shader and the other is called the complex shader. We call this set of instructions the additional instructions. The number of additional instructions is 0, 5, 25, 45, 65, 85, or 105 in each shader benchmark. In addition, each shader benchmark contains one more instruction:

    mov oD1, r5.x

All the additional instructions are used to compute the specular color for the vertex. The result is held in r5.x. In our case it can be any value: we don't apply specular effects in our rendering, so the value in register oD1 has no physical meaning here and exists only for testing.
The abovemov instruction is used to put the value into the write only register oD1 for specular color of eachvertex. During the rendering process, the lighting is disabled, so the specular effects wont be seen.However, the computation is still performed. Testing Results: