Optimal global instruction scheduling for the Itanium processor architecture [Elektronische Ressource] / von Sebastian Winkel

universitat_des_saarlandes

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

281 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

Sujets

Informatik

Informations

Publié par	universitat_des_saarlandes
Publié le	01 janvier 2005
Nombre de lectures	22
Langue	English
Poids de l'ouvrage	2 Mo

Extrait

Optimal Global Instruction Scheduling for the
R
Itanium Processor Architecture
Dissertation
zur Erlangung des Grades des
Doktors der Ingenieurwissenschaften (Dr.-Ing.) der
Naturwissenschaftlich-Technischen Fakultäten der
Universität des Saarlandes
von
Diplom-Informatiker
Sebastian Winkel
Saarbrücken
September 2004ii
Das Bildmaterial für den Umschlag und für die Abbildungen auf den Seiten 32, 35 und 42 wurde
freundlicherweise zur Verfügung gestellt von Intel Deutschland GmbH.
Intel, Itanium und Pentium sind eingetragene Warenzeichen der Intel Corporation oder ihrer
Tochterunternehmen in den USA und anderen Ländern. Andere Marken oder Produktnamen
sind Eigentum der jeweiligen Inhaber.
Tag des Kolloquiums: 16.12.2004
Dekan: Prof. Dr. Jörg Eschmeier
Prüfungsausschuss:
Gutachter: Prof. Dr. Reinhard Wilhelm
Priv.-Doz. Dr. Friedrich Eisenbrand,
Max-Planck-Institut für Informatik, Saarbrücken
Prof. Dr. Peter Marwedel, Universität Dortmund
Vorsitzender: Prof. Dr. Raimund Seidel
Akademischer
Mitarbeiter: Dr.-Ing. Stephan Thesingiii
Abstract
On the Itanium 2 processor, effective global instruction scheduling is cru-
cial to high performance. At the same time, it poses a challenge to the
compiler: This code generation subtask involves strongly interdependent
decisions and complex trade-offs that are difﬁcult to cope with for heuris-
tics. We tackle thisNP-complete problem with integer linear program-
ming (ILP), a search-based method that yields provably optimal results.
This promises faster code as well as insights into the potential of the ar-
chitecture. Our ILP model comprises global code motion with compen-
sation copies, predication, and Itanium-speciﬁc features like control/data
speculation.
In integer linear programming, well-structured models are the key to
acceptable solution times. The feasible solutions of an ILP are repre-
sented by integer points inside a polytope. If all vertices of this polytope
are integral, then the ILP can be solved in polynomial time. We deﬁne
two subproblems of global scheduling in which some constraint classes
are omitted and show that the corresponding two subpolytopes of our ILP
model are integral and polynomial sized. This substantiates that the found is of high efﬁciency, which is also conﬁrmed by the reasonable so-
lution times.
The ILP formulation is extended by further transformations like cyclic
code motion, which moves instructions upwards out of a loop, circularly
in the opposite direction of the loop backedges. Since the architecture re-
quires instructions to be encoded in ﬁxed-sized bundles of three, a bundler
is developed that computes bundle sequences of minimal size by means
of precomputed results and dynamic programming.
Experiments have been conducted with a postpass tool that imple-
ments the ILP scheduler. It parses assembly procedures generated by In-
tel’s Itanium compiler and reschedules them as a whole. Using this tool,
we optimize a selection of hot functions from the SPECint 2000 bench-
mark. The results show a signiﬁcant speedup over the original code.ivv
Zusammenfassung
Globale Instruktionsanordnung hat beim Itanium-2-Prozessor großen
Einﬂuß auf die Leistung und stellt dabei gleichzeitig eine Herausforde-
rung für den Compiler dar: Sie ist mit zahlreichen komplexen, wechsel-
seitig voneinander abhängigen Entscheidungen verbunden, die für Heuri-
stiken nur schwer zu beherrschen sind. Wir lösen diesesNP-vollständige
Problem mit ganzzahliger linearer Programmierung (ILP), einer suchba-
sierten Methode mit beweisbar optimalen Ergebnissen. Das ermöglicht
neben schnellerem Code auch Einblicke in das Potential der Itanium-
Prozessorarchitektur. Unser ILP-Modell umfaßt globale Codeverschie-
bungen mit Kompensationscode, Prädikation und Itanium-speziﬁsche
Techniken wie Kontroll- und Datenspekulation.
Bei ganzzahliger linearer Programmierung sind wohlstrukturierte
Modelle der Schlüssel zu akzeptablen Lösungszeiten. Die zulässigen Lö-
sungen eines ILPs werden durch ganzzahlige Punkte innerhalb eines
Polytops repräsentiert. Sind die Eckpunkte dieses Polytops ganzzahlig,
kann das ILP in Polynomialzeit gelöst werden. Wir deﬁnieren zwei Teil-
probleme globaler Instruktionsanordnung durch Auslassung bestimmter
Klassen von Nebenbedingungen und beweisen, daß die korrespondieren-
den Teilpolytope unseres ILP-Modells ganzzahlig und von polynomieller
Größe sind. Dies untermauert die hohe Efﬁzienz des gefundenen Modells,
die auch durch moderate Lösungszeiten bestätigt wird.
Das ILP-Modell wird um weitere Transformationen wie zyklische Co-
deverschiebung erweitert; letztere bezeichnet das Verschieben von Befeh-
len aufwärts aus einer Schleife heraus, in Gegenrichtung ihrer Rückwärts-
kanten. Da die Architektur eine Kodierung der Befehle in Dreierbündeln
fester Größe vorschreibt, wird ein Bundler entwickelt, der Bündelsequen-
zen minimaler Länge mit Hilfe vorberechneter Teilergebnisse und dyna-
mischer Programmierung erzeugt.
Für die Experimente wurde ein Postpassoptimierer erstellt. Er liest
von Intels Itanium-Compiler erzeugte Assemblerroutinen ein und ordnet
die enthaltenen Instruktionen mit Hilfe der ILP-Methode neu an. Ange-
wandt auf eine Auswahl von Funktionen aus dem Benchmark SPECint
2000 erreicht der Optimierer eine signiﬁkante Beschleunigung gegenüber
dem Originalcode.viExtendend Abstract
HP and Intel’s Itanium Processor Family (IPF) is considered as one of the most challenging
processor architectures to generate code for. Its performance is highly compiler-dependent since
it relies on static instruction scheduling. The code generator is responsible for extracting a high
amount of instruction-level parallelism and exposing it to the processor. To achieve this, it must
combine global instruction scheduling (with code motion between basic blocks) with the use of
predication and Itanium-speciﬁc features like explicit speculation.
When applying these techniques, the compiler must deal with strongly interdependent deci-
sions and keep the increasing resource pressure as a result of compensation copies and specula-
tive computations under control: Overuse can spoil the beneﬁt due to execution unit shortage,
but the opposite, a too conservative application, can lead to many unused execution slots. It
is difﬁcult for the greedy scheduling heuristics found in Itanium compilers to ﬁnd a balance.
Moreover, resource-constrained instruction scheduling is—even when limited to basic blocks—
an NP-complete problem, for which heuristics deliver only approximations. It is unclear how
far away these are from exact, optimal solutions.
In this thesis we tackle the global instruction scheduling problem for the Itanium 2 processor
with integer linear programming (ILP), a search-based combinatorial optimization method that
yields provably optimal results. Our ILP model comprises global code motion with automated
generation of compensation code, predication, as well as vital IPF features like control and data
speculative loads. The ILP approach can—within certain limits—resolve the interdependences
between the involved decisions and deliver the global optimum in the form of a schedule with
minimal length. This promises faster code, as well as some theoretically well-founded insights
into the potential of the architecture.
In integer linear programming, the search space of feasible solutions is represented by the in-
teger points inside a polytope. A well-structured model, in which these integer points are tightly
enclosed by the polytope, is the key to acceptable solution times. If all vertices of the polytope
are integral, then the ILP can even be solved in polynomial time. Since we are modeling anNP-
complete problem, however, we cannot expect to ﬁnd such a polytope of polynomial size for the
entire global instruction scheduling problem, but only possibly for subproblems. Therefore we
deﬁne several such subproblems in which some constraint classes are omitted (such as the re-
source constraints, which model the occupation of execution units, or the precedence constraints,
which enforce data dependence preservation). For subproblems that are no longerNP-complete,
ILP formulations may exist that describe integral and polynomial sized subpolytopes—and are
in fact developed in this thesis.
viiviii
For one of these subproblems, this is achieved by reducing it to a node packing problem on
a perfect graph and exploiting a standard result in order to obtain an integral subpolytope. A
further subproblem is modeled as a network ﬂow problem, a problem class for which integral
polytopes are well known, too. In both cases, it is necessary to reduce the complexity of the
resulting ILP formulation afterwards, which is in the former case even exponential. This is
done by removing redundant constraints and by applying integrality-preserving transformations.
These include lifting the polytope to a higher dimension (to reduce the number of constraints
needed to describe it), or inversely, projecting it onto a lower-dimensional subspace (to reduce
the number of needed variables).
As a central theoretical result of this thesis, we