CUDA Parallel Programming Tutorial


Richard Membarth

richard.membarth@cs.fau.de

Hardware-Software-Co-Design University of Erlangen-Nuremberg

Friedrich-Alexander University of Erlangen-Nuremberg Richard Membarth

19.03.2009

Outline

◮ Tasks for CUDA
◮ CUDA programming model
◮ Getting started
◮ Example codes

Tasks for CUDA

◮ Provide ability to run code on GPU
◮ Manage resources
◮ Partition data to fit on cores
◮ Schedule blocks to cores

Data Partitioning

◮ Partition data in smaller blocks that can be processed by one core
◮ Up to 512 threads in one block
◮ All blocks define the grid
◮ All blocks execute same program (kernel)
◮ Independent blocks
◮ Only ONE kernel at a time
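The partitioning described on this slide can be sketched as follows. This is a minimal illustration, not code from the tutorial: the kernel `scale`, the host helper `scale_on_gpu`, and the block size of 256 threads are assumptions chosen for the example. The block count is rounded up so that every element is covered, and the bounds check handles the last, possibly partial, block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: every block runs this same program,
// and each thread handles exactly one element.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) {                                    // last block may be partial
        data[i] *= factor;
    }
}

// Hypothetical host helper: copies the data to the GPU, launches the grid,
// and copies the result back. Returns false if any CUDA call fails.
bool scale_on_gpu(std::vector<float>& host, float factor) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return true;  // nothing to do
    const size_t bytes = n * sizeof(float);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threadsPerBlock = 256;  // assumed; below the 512-thread limit
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
        scale<<<blocks, threadsPerBlock>>>(dev, n, factor);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host.data(), dev, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Because the blocks are independent, the hardware is free to schedule them onto the cores in any order.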

Memory Hierarchy

Memory types (fastest memory first):

◮ Registers
◮ Shared memory
◮ Device memory (texture, constant, local, global)

Tesla Architecture

◮ 30 cores, 240 ALUs (1 mul-add)
◮ (1 mul-add + 1 mul): 240 * (2+1) * 1.3 GHz = 936 GFLOPS
◮ 4.0 GB GDDR3, 102 GB/s Mem BW, 4 GB/s PCIe BW to CPU

CUDA: Extended C

◮ Function qualifiers
◮ Variable qualifiers
◮ Built-in keywords
◮ Intrinsics
◮ Function calls

Function Qualifiers

◮ Functions: __device__, __host__, __global__

    __global__ void filter(int* in, int* out) {
        ...
    }

◮ Default: __host__
◮ No function pointers
◮ No recursion
◮ No static variables
◮ No variable number of arguments
◮ No return value
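To show how the three qualifiers might fit together, here is a hedged sketch. The slide elides the body of `filter`, so the body below, the extra length parameter `n`, the device helper `clamp_to_byte`, and the host wrapper `launch_filter` are all assumptions made for illustration only.

```cuda
#include <cuda_runtime.h>

// __device__: callable only from GPU code. Unlike a kernel, it may return a value.
// Hypothetical helper, not part of the tutorial.
__device__ int clamp_to_byte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// __global__: a kernel. It is launched from the host and must return void.
// The body and the parameter n are assumed; the slide only shows the signature.
__global__ void filter(int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = clamp_to_byte(in[i]);
    }
}

// __host__ (the default): an ordinary CPU function.
// d_in and d_out are expected to be device pointers holding n ints each.
__host__ cudaError_t launch_filter(int* d_in, int* d_out, int n) {
    if (n <= 0) return cudaSuccess;  // an empty grid is not a valid launch
    const int threads = 128;         // assumed block size
    const int blocks = (n + threads - 1) / threads;
    filter<<<blocks, threads>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();  // reports launch configuration errors
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

Since a kernel cannot return a value, results travel back through device memory, here the `out` array.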

Variable Qualifiers

◮ Variables: __device__, __constant__, __shared__

    __constant__ float matrix[10] = {1.0f, ...};
    __shared__ int [32][2];

◮ Default: Variables reside in registers
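The following sketch shows one way the variable qualifiers could be combined. It is an illustration under stated assumptions: the names `weights`, `tile`, `smooth`, and `smooth_on_gpu`, the 3-tap weighted sum, and the block size of 64 are invented for this example and do not come from the slides.

```cuda
#include <cuda_runtime.h>

#define TAPS 3    // assumed filter width
#define BLOCK 64  // assumed block size; the kernel must be launched with it

// __constant__: read-only on the device, filled in from the host.
__constant__ float weights[TAPS];

// Hypothetical kernel: out[i] = sum over k of weights[k] * in[i - 1 + k],
// with elements outside the array treated as 0.
__global__ void smooth(const float* in, float* out, int n) {
    // __shared__: one copy per block, visible to all threads of that block.
    __shared__ float tile[BLOCK + TAPS - 1];

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // plain local: a register

    // Stage the block's inputs plus one halo element on each side.
    for (int t = threadIdx.x; t < BLOCK + TAPS - 1; t += blockDim.x) {
        int src = blockIdx.x * blockDim.x + t - 1;
        tile[t] = (src >= 0 && src < n) ? in[src] : 0.0f;
    }
    __syncthreads();  // the tile must be complete before any thread reads it

    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < TAPS; ++k) {
            acc += weights[k] * tile[threadIdx.x + k];
        }
        out[i] = acc;
    }
}

// Hypothetical host helper. Returns false if any CUDA call fails.
bool smooth_on_gpu(const float* h_in, float* h_out, int n, const float* h_weights) {
    if (n <= 0) return false;
    const size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    bool ok = cudaMemcpyToSymbol(weights, h_weights, TAPS * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess &&
              cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
        ok = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

The design choice here is to read each input from global memory once per block and serve the overlapping reads from the faster shared memory.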

Built-In Variables

◮ Available inside of kernel code
◮ Thread index within current block: threadIdx.x, threadIdx.y, threadIdx.z
◮ Block index within grid: blockIdx.x, blockIdx.y
◮ Dimension of grid, block: gridDim.x, gridDim.y, blockDim.x, blockDim.y, blockDim.z
◮ Warp size: warpSize
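A common use of these built-in variables is to compute a per-thread position in a 2D problem. The sketch below assumes a row-major image and a 16 x 16 block; the kernel `fill_index` and the host helper `fill_index_on_gpu` are hypothetical names introduced for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: writes each pixel's linear index into a
// row-major width x height image.
__global__ void fill_index(int* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column of this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row of this thread
    if (x < width && y < height) {                  // edge blocks may overhang
        img[y * width + x] = y * width + x;
    }
}

// Hypothetical host helper. Returns false if any CUDA call fails.
bool fill_index_on_gpu(int* h_img, int width, int height) {
    if (width <= 0 || height <= 0) return false;
    const size_t bytes = static_cast<size_t>(width) * height * sizeof(int);

    int* d_img = nullptr;
    if (cudaMalloc(&d_img, bytes) != cudaSuccess) return false;

    dim3 block(16, 16);  // assumed shape: 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,    // blocks per row, rounded up
              (height + block.y - 1) / block.y);  // blocks per column, rounded up
    fill_index<<<grid, block>>>(d_img, width, height);

    bool ok = cudaMemcpy(h_img, d_img, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_img);
    return ok;
}
```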

Intrinsics

◮ void __syncthreads();
◮ Synchronizes all threads of the current block
◮ Use in conditional code may lead to deadlocks
◮ Intrinsics exist for most mathematical functions, e.g. __sinf(x), __cosf(x), __expf(x), ...
◮ Texture functions
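The warning about conditional code can be illustrated with a block-level sum. This is a sketch under assumptions, not the tutorial's own example: `block_sum`, `sum_on_gpu`, and the block size of 128 are hypothetical. The point to note is that every `__syncthreads()` call sits outside the `if`, so all threads of the block reach it.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 128  // assumed block size; must be a power of two for the loop below

// Hypothetical kernel: each block adds up to THREADS inputs
// and stores the result in partial[blockIdx.x].
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float buf[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // out-of-range threads add 0
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        }
        // Outside the if: placing the barrier inside the branch would leave
        // some threads waiting for others that never arrive.
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = buf[0];
    }
}

// Hypothetical host helper: sums the per-block results on the CPU.
// Returns false if any CUDA call fails.
bool sum_on_gpu(const float* h_in, int n, float* result) {
    if (n <= 0) return false;
    const int blocks = (n + THREADS - 1) / THREADS;
    std::vector<float> h_partial(blocks);
    float* d_in = nullptr;
    float* d_partial = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_partial, blocks * sizeof(float)) == cudaSuccess &&
              cudaMemcpy(d_in, h_in, n * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, THREADS>>>(d_in, d_partial, n);
        ok = cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_partial);
    if (!ok) return false;

    *result = 0.0f;
    for (float p : h_partial) *result += p;
    return true;
}
```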
