Universität Karlsruhe (TH)
Rechenzentrum

KaSC Benchmark Suite

Version 1.0

AUGUST 2004





1 DIRECTORY STRUCTURE
1.1 KERNELS
2 BENCHMARK PROGRAMS
2.1 CONFIGURATION OF BENCHMARK PROGRAMS
2.2 COMPILATION OF BENCHMARK PROGRAMS
2.3 RUNNING THE BENCHMARK PROGRAMS
2.4 RUNNING DIFFERENT COMMUNICATION BENCHMARKS
2.4.1 Communication within one SMP node
2.4.1.1 Bandwidth and latency for single ping pong between two tasks within one SMP node
2.4.1.2 Bandwidth and latency for single exchange between two tasks within one SMP node
2.4.1.3 Bisection bandwidth for ping pong within one SMP node
2.4.1.4 Bisection bandwidth for exchange within one SMP node
2.4.2 Communication between two SMP nodes
2.4.2.1 Single ping pong test between two tasks running on different SMP nodes
2.4.2.2 Single exchange between two tasks running on different SMP nodes
2.4.2.3 Multiple ping pong test between all CPUs of two SMP nodes
2.4.2.4 Multiple exchange between all CPUs of two SMP nodes
2.4.3 Bisection Bandwidth
2.4.4 Latency Hiding

3 CONTACT



1 Directory structure

All files belonging to this benchmark are supplied in a compressed tar file benchmark.tar.gz, which
should be uncompressed and untarred using the commands

gunzip benchmark.tar.gz
tar -xvf benchmark.tar
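
With GNU tar, both steps can be combined into a single command:

tar -xzvf benchmark.tar.gz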

The current directory also contains a file make.inc. This file defines the compiler names and options,
library paths, etc. used to compile the various benchmark programs. The Makefile includes this file in
order to read all these definitions. For some systems, sample files make_<SYSTEM>.inc are included in
the current directory.
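
If one of the supplied samples matches your system, it can serve as a starting point, for example
(substitute the actual system name for <SYSTEM>; the concrete file names depend on your installation):

cp make_<SYSTEM>.inc make.inc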


1.1 Kernels

The current directory contains low-level programs that measure specific performance characteristics of
one processor, one SMP node, or the whole system. It contains a set of low-level kernels that are typical
for applications in the field of scientific computing. Some programs run only on a single CPU; others run
on a single CPU as well as on all CPUs of an SMP node. The small benchmark suite also includes programs
to measure the communication performance of typical MPI point-to-point communications.


2 Benchmark Programs

All programs are written in Fortran 90 and call a C function seconds to measure CPU time and wall-clock
time. The C function in seconds.c itself calls the system function gettimeofday to obtain the timing
data. Depending on whether your Fortran compiler appends an underscore to the names of external
functions, you should add the option -DFTNLINKSUFFIX, and depending on whether it converts the name
seconds to uppercase (SECONDS), you should add the option -DUPPERCASE to the list of compiler options
for the C compiler (CFLAGS in make.inc). All programs contain a structure similar to

call SECONDS(tim1,tdum)
do j=1,repfactor
   call DUMMY(x1,x2,x3,n)
   do i=1,n
      x3(i) = x1(i) + x2(i)   ! "simple operation", here e.g. the vector add of bm111
   enddo
enddo
call SECONDS(tim2,tdum)

With this loop construct we want to measure the time needed to complete the inner loop do i=1,n. To
increase the measured time and thereby the accuracy of the measurement, this loop is executed repeatedly
(outer loop do j=1,repfactor). The call to the external subroutine DUMMY guarantees that the outer loop
is executed as often as requested; otherwise the compiler could optimize this loop construct away. Make
sure that neither the compiler nor the linker performs an optimization that changes this sequence and
nesting of loops. In particular, interprocedural optimization should be switched off for linking, i.e.
the options used in FFLAGS (in make.inc) should not enable any interprocedural optimization during the
compile or link step.
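
The timer itself is not reproduced here. A minimal sketch of how such a seconds.c could look, assuming
the name-mangling options described above and assuming (hypothetically) that the first argument receives
the wall-clock time and the second the CPU time, is:

#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Pick the external name the Fortran compiler expects; the macros are
   set via -DUPPERCASE / -DFTNLINKSUFFIX in CFLAGS (see make.inc). */
#ifdef UPPERCASE
#  ifdef FTNLINKSUFFIX
#    define SECONDS SECONDS_
#  endif
#else
#  ifdef FTNLINKSUFFIX
#    define SECONDS seconds_
#  else
#    define SECONDS seconds
#  endif
#endif

void SECONDS(double *wall, double *cpu)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    *wall = tv.tv_sec + 1.0e-6 * tv.tv_usec;   /* wall-clock time in seconds */
    *cpu  = (double)clock() / CLOCKS_PER_SEC;  /* CPU time in seconds */
}

The argument order and the use of clock() for the CPU time are assumptions for this sketch; check the
shipped seconds.c for the actual interface.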

The benchmark suite includes the following serial programs running on one processor (for all of these
programs except bm116, a parallel version exists):



bm111:   vector add                        x3(i) = x1(i) + x2(i)
bm112:   vector multiply                   x3(i) = x1(i) * x2(i)
bm113:   vector divide                     x3(i) = x1(i) / x2(i)
bm114:   vector triad with scalar (saxpy)  x3(i) = x1(i) + s * x2(i)
bm116:   vector compound operation         x7(i) = (x1(i)+x2(i))*x3(i)+(x4(i)-x5(i))/x6(i)
bm117:   dot product                       s = s + x1(i) * x2(i)
bm118:   vector triad                      x4(i) = x2(i) + x3(i) * x4(i)

The benchmark suite includes the following serial programs running only on one processor:

bm111str: vector add with stride           x3(i*istr) = x1(i*istr) + x2(i*istr)
bm111gth: vector add with gather           x3(i) = x1(ix(i)) + x2(i)
bm111sct: vector add with scatter          x3(ix(i)) = x1(i) + x2(i)
bm117gth: dot product with gather          s = s + x1(ix(i)) * x2(i)
bm121:    matrix multiply - rowwise version             C = A * B
bm122:    matrix multiply - dot product version         C = A * B
bm123:    matrix multiply - columnwise version          C = A * B
bm123u4:  matrix multiply - columnwise version          C = A * B
          with 4-fold unrolling
bm124:    matrix multiply - library version             C = A * B
bm131:    scalar performance test 1        a = a * b + c
bm132:    scalar performance test 2        a = b*c + d; b = a - c*b; c = a/b

The benchmark suite includes the following programs that usually run in parallel on one node and measure
the performance on the first processor:

bm111smp:  vector add on all processors                x3(i) = x1(i) + x2(i)
bm112smp:  vector multiply on all processors           x3(i) = x1(i) * x2(i)
bm113smp:  vector divide on all processors             x3(i) = x1(i) / x2(i)
bm114smp:  vector triad with scalar on all processors  x3(i) = x1(i) + s * x2(i)
bm117smp:  dot product on all processors               s = s + x1(i) * x2(i)
bm118smp:  vector triad on all processors              x4(i) = x1(i) + x2(i) * x3(i)
bm1148smp: vector triad with scalar on the first processor, x3(i) = x1(i) + s * x2(i),
           and vector triad with large, constant vector length,
           x4(i) = x1(i) + x2(i) * x3(i), on all other processors
bm1178smp: dot product on the first processor, s = s + x1(i) * x2(i), and vector
           triad with large, constant vector length on all other processors
bm1188smp: vector triad on the first processor, x4(i) = x1(i) + x2(i) * x3(i), and
           vector triad with large, constant vector length on all other processors

The programs measuring the communication performance are:

bmmpi1: MPI ping-pong benchmark
bmmpi2: MPI double ping-pong (exchange)
bmmpi3: MPI overlap test (short messages)


bmmpi4: MPI overlap test (long messages)
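
The sources of these programs are part of the benchmark tree. As an illustration only, a minimal MPI
ping-pong in the spirit of bmmpi1 could look like the following C sketch; the message size, repetition
count, and output format are assumptions for this example, not the suite's actual parameters:

#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong: rank 0 sends a message to rank 1 and waits for the
   echo; half the round-trip time approximates the one-way latency. */
int main(int argc, char **argv)
{
    enum { NBYTES = 1 << 20, REPS = 100 };   /* assumed values */
    static char buf[NBYTES];
    int rank;
    double t0, t1, oneway;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        oneway = (t1 - t0) / (2.0 * REPS);
        printf("one-way time: %g s, bandwidth: %g MB/s\n",
               oneway, NBYTES / oneway / 1.0e6);
    }

    MPI_Finalize();
    return 0;
}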

2.1 Configuration of Benchmark Programs

Before starting the installation process, you should check and adapt the file make.inc. A sample
make.inc is shown below:

# make.inc
#
# Utility file used by the KaSC Benchmark.
#
########################################################################
#
# Part 1: Utilities
#
# Name of the make utility that will be used to compile and link the
# benchmarking program.
# Most of the benchmarking programs are based on using the GNU make utility.
MAKE = gmake

# Number of processors of one node
NP =
# Number of processors for the communication benchmarks
NP_COMM =

# Name of the command to run parallel programs, e.g. mpirun
PAR_CMD =

# Option for the number of processors, placed in front of the executable,
# together with the number of processors for the serial run, e.g. -np 1
PAR_OPTS1 =
# together with the number of processors for the parallel run, e.g. -np $(NP)
PAR_OPTP1 =

# Option for the number of processors, placed behind the executable, together
# with the number of processors for the serial run, e.g. -procs 1
PAR_OPTS2 =
# with the number of processors for the parallel run, e.g. -procs $(NP) -nodes 1
PAR_OPTP2 =
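
For illustration, on a hypothetical Linux cluster whose SMP nodes have four processors and whose MPI
implementation provides mpirun taking its options in front of the executable, this part of make.inc
could be filled in as follows (all values are assumptions for this example, not recommendations):

# hypothetical example: Linux cluster, 4-way SMP nodes, mpirun with -np
MAKE      = gmake
# four processors per node
NP        = 4
# communication benchmarks use e.g. two full nodes
NP_COMM   = 8
PAR_CMD   = mpirun
# mpirun takes its options in front of the executable
PAR_OPTS1 = -np 1
PAR_OPTP1 = -np $(NP)
# nothing is needed behind the executable
PAR_OPTS2 =
PAR_OPTP2 =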
