
Efficient Algorithms for Large-Scale Image Analysis [Electronic resource] / Jan Wassenberg. Advisor: Peter Sanders

Efficient Algorithms for Large-Scale Image Analysis

Dissertation approved by the Department of Informatics of the Karlsruhe Institute of Technology for the academic degree of Doctor of Engineering (Doktor der Ingenieurwissenschaften)

by Jan Wassenberg, from Koblenz

Date of oral examination: 24 October 2011
First reviewer: Prof. Dr. Peter Sanders
Second reviewer: Prof. Dr.-Ing. Jürgen Beyerer


Abstract

The past decade has seen major improvements in the capabilities and availability of imaging sensor systems. Commercial satellites routinely provide panchromatic images with sub-meter resolution. Airborne line scanner cameras yield multi-spectral data with a ground sample distance of 5 cm. The resulting overabundance of data brings with it the challenge of timely analysis. Fully automated processing still appears infeasible, but an intermediate step might involve a computer-assisted search for interesting objects. This would reduce the amount of data for an analyst to examine, but remains a challenge in terms of processing speed and working memory.

This work begins by discussing the trade-offs among the various hardware architectures that might be brought to bear upon the problem. FPGA and GPU-based solutions are less universal and entail longer development cycles, hence the choice of commodity multi-core CPU architectures. Distributed processing on a cluster is deemed too costly. We will demonstrate the feasibility of processing aerial images of 100 km × 100 km areas at 1 m resolution within 2 hours on a single workstation with two processors and a total of twelve cores. Because existing approaches cannot cope with such amounts of data, each stage of the image processing pipeline – from data access and signal processing to object extraction and feature computation – will have to be designed from the ground up for maximum performance. We introduce new efficient algorithms that provide useful results at faster speeds than previously possible.

Let us begin with the most time-critical task – the extraction of object candidates from an image, also known as segmentation. This step is necessary because individual pixels do not provide enough information for the screening task. A simple but reasonable model for the objects involves grouping similar pixels together. High-quality clustering algorithms based on mean shift, maximum network flow and anisotropic diffusion are far too time-consuming.

We introduce a new graph-based algorithm with the important property of avoiding both under- and oversegmentation. Its distinguishing feature is the independent parallel processing of image tiles without splitting objects at the boundaries. Our efficient implementation takes advantage of SIMD instructions and outperforms mean shift by a factor of 50 while producing results of similar quality. Recognizing the outstanding performance of its microarchitecture-aware virtual-memory counting sort subroutine, we develop it into a general 32-bit integer sorter, yielding the fastest known algorithm for shared-memory machines.
Because segmentation groups together similar pixels, it is helpful to suppress sensor noise. The Bilateral Filter is an adaptive smoothing kernel that preserves edges by excluding pixels that are distant in the spatial or radiometric sense. Several fast approximation algorithms are known, e.g. convolution in a downsampled higher-dimensional space. We accelerate this technique by a factor of 14 via parallelization, vectorization and a SIMD-friendly approximation of the 3D Gauss kernel. The software is 73 times as fast as an exact computation on an FPGA and outperforms a GPU-based approximation by a factor of 1.8.
Physical limitations of satellite sensors constitute an additional hurdle. The narrow multispectral bands require larger detectors and usually have a lower resolution than the panchromatic band. Fusing both datasets is termed pan-sharpening and improves the segmentation due to the additional color information. Previous techniques are vulnerable to color distortion because of mismatches between the bands' spectral response functions. To reduce this effect, we compute the optimal set of band weights for each input image. Our new algorithm outperforms existing approaches by a factor of 100, improves upon their color fidelity and also reduces noise in the panchromatic band.
Because these modules achieve throughputs on the order of several hundred MB/s, the next bottleneck to be addressed is I/O. The ubiquitous GDAL library is far slower than the theoretical disk throughput. We design an image representation that avoids unnecessary copying, and describe little-known techniques for efficient asynchronous I/O. The resulting software is up to 12 times as fast as GDAL. Further improvements are possible by compressing the data if decompression throughput is on par with the transfer speeds of a disk array. We develop a novel lossless asymmetric SIMD codec that achieves a compression ratio of 0.5 for 16-bit pixels and reaches decompression throughputs of 2700 MB/s on a single core. This is about 100 times as fast as lossless JPEG-2000 and only 20–60% larger on multispectral satellite datasets.

Let us now return to the extracted objects. Additional steps for detecting and simplifying their contours would provide useful information, e.g. for classifying them as man-made. To allow annotating large images with the resulting polygons, we devise a software rasterizer. High-quality antialiasing is achieved by deriving the optimal polynomial low-pass filter. Our implementation outperforms the Gupta-Sproull algorithm by a factor of 24 and exceeds the fill rate of a mid-range GPU.

The previously described processing chain is effective, but electro-optical sensors cannot penetrate cloud cover. Because much of the earth's surface is shrouded in clouds at any given time, we have added a workflow for (nearly) weather-independent synthetic aperture radar. Small, highly-reflective objects can be differentiated from uniformly bright regions by subtracting each pixel's background, estimated from the darkest ring surrounding it. We reduce the asymptotic complexity of this approach to its lower bound by means of a new algorithm inspired by Range Minimum Queries. A sophisticated pipelining scheme ensures the working set fits in cache, and the vectorized and parallelized software outperforms an FPGA implementation by a factor of 100.
These results challenge the conventional wisdom that FPGA and GPU solutions enable significant speedups over general-purpose CPUs. Because all of the above algorithms have reached the lower bound of their complexity, their usefulness is decided by constant factors. It is the thesis of this work that optimized software running on general-purpose CPUs can compare favorably in this regard. The key enabling factors are vectorization, parallelization, and consideration of basic microarchitectural realities such as the memory hierarchy. We have shown these techniques to be applicable towards a variety of image processing tasks. However, it is not sufficient to tune software in the final phases of its development. Instead, each part of the algorithm engineering cycle – design, analysis, implementation and experimentation – should account for the computer architecture. For example, no amount of subsequent tuning would redeem an approach to segmentation that relies on a global ranking of pixels, which is fundamentally less amenable to parallelization than a graph-based method. The algorithms introduced in this work speed up seven separate tasks by factors of 10 to 100, thus dispelling the notion that such efforts are not worthwhile. We are surprised to have improved upon long-studied topics such as lossless image compression and line rasterization. However, the techniques described herein may allow similar successes in other domains.

Acknowledgements

I sincerely thank my advisor, Prof. Peter Sanders, for providing guidance – identifying promising avenues to explore, teaching algorithm engineering, and sharing the lore of clever optimizations. Thank you, Prof. Dr.-Ing. Jürgen Beyerer, for reviewing this thesis.

Looking back earlier, I thank my parents for their love and support, and for allowing me access to a TRS-80 microcomputer. The resulting interest in computing was kindled early on at Randolph School, especially by Dr. Robert Kirchner's physics assignment concerning a model rocket simulator. Thanks to my soccer coach, H. Killebrew Bailey, for instilling the spirit "practice hard, play hard; no regrets!"

I gratefully acknowledge the productive working environment at the FGAN-FOM research institute, now a part of Fraunhofer IOSB. Thanks to my officemates Dominik Perpeet and Sebastian Wuttke for interesting discussions over lunch and fruitful collaboration; Romy Pfeiffer and Anja Blancani for helping with administrative matters; my supervisor Dr. Wolfgang Middelmann and department head Dr. Karsten Schulz for providing guidance and the latitude to work on interesting problems.

This thesis builds upon machine-oriented groundwork laid for the 0 A.D. strategy game project starting in 2002. It has been a pleasure to work with this team of enthusiastic, self-motivated volunteers, especially Philip Taylor.

I am grateful to the authors of GDAL for developing a truly useful tool to read/write nearly any image file format. Thanks to Charles Bloom, Prof. Tanja Schultz and Dominik Perpeet for valuable feedback concerning parts of this work.

I appreciate the patience and understanding of friends, family, and most of all, my beloved Sufen. Her love and support mean so much to me.

This work is dedicated to the scientists/engineers/craftsmen who bridge the gap between theory and practice of computing, devising solutions for previously insurmountable problems and teasing out maximum performance due to a detailed understanding of the underlying hardware. Keep the flame burning!

Contents

I Appetizers

1 Introduction
  1.1 Fundamentals
  1.2 The Need for Speed
  1.3 Image Processing Chain

2 Computer Architecture
  2.1 Brief Architecture Descriptions
  2.2 Datasheet Comparison
  2.3 Our Choice
  2.4 Consequences for the Algorithms
      Memory Hierarchy
      SIMD
      Parallelization
  2.5 Discussion

II Main Course

3 Input/Output
  3.1 Image Representation
  3.2 Efficient I/O
      Synchronous vs. Asynchronous
      Block Size
      Implementation Details
      Throughput
  3.3 File Format
  3.4 Performance
  3.5 Conclusion

4 Lossless Asymmetric SIMD Compression
  4.1 Introduction and Related Work
      Lossless Image Compression
      Entropy Coding
      Asymmetric Compression
  4.2 Fast SIMD Integer Packing
  4.3 SIMD Sliding-Window Compression
  4.4 Measurements
      Hardware and Software
      Datasets
      Throughput
      Compression Ratio
      Further Experiments
  4.5 Conclusion

5 Pan Sharpening
  5.1 Introduction and Related Work
  5.2 Algorithm
  5.3 Noise Reduction
  5.4 Results
  5.5 Quality Metrics
  5.6 Performance
  5.7 Conclusion

6 Image Segmentation
  6.1 Introduction and Related Work
  6.2 Algorithm
  6.3 Results
  6.4 Parallel Algorithm
  6.5 Region Features
  6.6 Performance
  6.7 Conclusion

7 Antialiased Line Rasterization
  7.1 Introduction and Related Work
  7.2 Algorithm
  7.3 Performance
  7.4 Optimal Antialiasing
  7.5 Results
  7.6 Conclusion

8 Synthetic Aperture Radar
  8.1 Hotspot Operator
  8.2 Algorithm
  8.3 Results
  8.4 Performance
  8.5 Conclusion

9 Discussion

III Desserts

A Virtual-Memory Counting Sort
  A.1 Introduction
  A.2 Software Write-Combining
  A.3 Virtual-Memory Counting Sort
  A.4 Radix Sort
  A.5 Performance
  A.6 Conclusion

B Implementation Details
  B.1 Software Engineering
  B.2 Unaligned Memory Accesses
  B.3 LVT File Format

Bibliography

Index

Zusammenfassung (Summary in German)

Lebenslauf (Curriculum Vitae)

Part I

Appetizers

Chapter 1

Introduction

This chapter sets the stage by briefly reviewing fundamentals of digital imaging, explaining the need for automation, and introducing our processing chain for image analysis.

1.1 Fundamentals

We begin with electro-optical imaging, in which an array of detector elements measure the intensity of certain frequencies of electromagnetic radiation (e.g. visible light) that fall upon their surface. Each detector yields a digital number, referred to as a pixel (picture element) because these numbers are typically combined to form a two-dimensional image. When the detectors are sensitive to all frequencies of visible light, the image is described as panchromatic. Placing filters in front of some of the detectors allows them to ascertain the contribution of a certain [spectral] band – a range of frequencies, e.g. what we perceive as blue. Images in which each pixel consists of multiple components (per-band intensity measurements) are termed multispectral. This work is primarily concerned with such images because their color information is particularly useful for automated analysis. However, clouds or rain can obscure objects behind them because visible light is scattered by water molecules or other particles [1].

By contrast, synthetic aperture radar (SAR) is nearly unaffected by atmospheric conditions and weather. These systems illuminate scenes with an antenna and record the multiple echoes. Sophisticated post-processing combines these signals into what might have been measured by a large antenna, which allows the generation of an image with relatively high resolution compared to conventional radar [2]. Because electro-optical and radar images have different and perhaps complementary advantages, this thesis also gives attention to the analysis of SAR data.

1.2 The Need for Speed

The past decade has seen significant improvements in the capabilities of imaging sensor systems. For example, the recently launched WorldView-2 imaging satellite boasts a ground sample distance (GSD)¹ of only 46 cm [3]. This corresponds to NIIRS (National Image Interpretability Rating Scale) level 6 of 9 [4], indicating the images are suitable for a wide range of interpretation tasks. Large format cameras on airborne platforms operating at much lower altitudes and movement speeds allow even finer resolutions, e.g. 17 mm for the DMC II 250 [5]. Such increases in technical capability are invariably accompanied by greater expectations. For example, an image analyst has expressed a desire to count the number of individual dwellings in an area spanning hundreds of square kilometers. Computer assistance is an absolute necessity for tasks of such magnitude [6]. Human analysts remain indispensable, but their workload could be reduced by screening images for relevant objects. Assuming the detection probability is sufficiently high, other regions need not be examined by the analyst. However, even basic screening approaches for wide-area data are challenging in terms of processing time and memory requirements. The author participated in a study of existing algorithms and modules for image interpretation, including co-registration, screening for objects such as vehicles, storage tanks and airplanes, and terrain passability analysis. In 2009, we measured throughputs between 0.01 and 3 MPixel/s on an X5365 CPU for nine software modules delivered by various firms. Let us contrast this with the data rates of recent cameras. The DMC II captures a 252 MPixel image every 1.7 s, and a JAS-150s system scans nine 12000-pixel lines 800 times per second [5]. Real-time processing entails speeding up the existing software by a factor of 100 to 10000. To at least minimize the additional processing time and thereby enable swift responses in disaster relief [7] and other time-critical applications, this thesis develops new, highly efficient algorithms capable of throughputs in excess of 40 MPixel/s.

¹ For convenience, we often refer to this as the [spatial] resolution of an image.

1.3 Image Processing Chain

We have designed a general image processing chain suitable for various applications such as screening images for certain types of objects, classifying them, or reporting changes with respect to a previous image. It begins with receiving data from satellites or other sources, performs noise reduction, extracts objects and computes their features. Because the computational cost of existing algorithms is far too high, each link of the chain has been redesigned from the ground up for efficiency. Chapter 2 gives an overview of computer architectures and explains low-level techniques for maximizing performance. Our processing chain is engineered to take advantage of them, and reduces the pixels to a more compact object-based representation. Subsequent analysis applications no longer require expensive per-pixel operations and therefore need not be as concerned with performance.

The following chapters of this thesis are devoted to the individual links of the processing chain:

Chapter 3 describes our image representation and framework for transferring to and from block storage devices, with emphasis on avoiding copies and maximizing throughput via asynchronous input/output (I/O).

Chapter 4 introduces a novel algorithm for lossless asymmetric compression that accelerates I/O by reducing the amount of data to be transferred. Its decompression is faster than copying the original data in memory.

Chapter 5 presents an efficient approach for fusing high resolution panchromatic and lower resolution multispectral satellite images. A fast edge-preserving filter reduces noise. Objective quality metrics report improved color fidelity in comparison to current algorithms.

Chapter 6 develops a high-quality algorithm for extracting objects from images. Our graph-based approach enables parallelization without any tiling artifacts. It tends to avoid excessive subdivision and merging of objects despite making only local decisions.

Chapter 7 introduces a software line rasterizer, e.g. for separately extracted segment contours, that outperforms the fill rate of a mid-range graphics processor. We derive the optimal cubic polynomial filter for antialiasing, which respondents in a subjective survey preferred over existing approaches.

Chapter 8 presents a highly efficient algorithm for finding point-like objects in infrared and radar images.

Chapter 9 concludes this work by discussing the resulting performance gains and proposing avenues for future work.

Chapter 2

Computer Architecture

As always, high performance comes at a price, including paying careful attention to the computer architecture. This chapter sets forth several options, explains our choice and discusses the implications for our algorithms.

2.1 Brief Architecture Descriptions

We first introduce and briefly describe several possible computer architectures.

Digital Signal Processors (DSP) are tailored towards low-latency signal processing applications. Their specialized architectures often include hardware acceleration for loops, multiply-add sequences and data copying. Single Instructions that apply the same operation to Multiple lanes of Data (SIMD) increase the computational throughput. The deliberate omission of complicated hardware for out-of-order execution and virtual memory management significantly reduces power and cooling requirements, making DSPs suitable for embedded systems [8].

Graphics Processing Units (GPU) have evolved from graphics accelerator chips towards general-purpose processing. Their design emphasizes aggregate throughput, utilizing hundreds of SIMD lanes and over a thousand independent threads of execution to hide memory latency [9]. Multiple interfaces to high-performance GDDR5 memory [10] provide increased bandwidth. The recent Fermi architecture includes several major advances, including full-fledged and fast floating point arithmetic, caches, and error-correction codes for memory. Its unified 64-bit address space and improved support for higher-level languages continue the trend of convergence towards general-purpose architectures [11].

Field Programmable Gate Arrays (FPGA) encompass blocks of programmable logic (typically lookup tables) and configurable interconnects. Their inherent parallelism enables major speedups in comparison to serial processing. Because instructions are implicit in the programmed structure, they need not be fetched from memory nor decoded [12]. Although area and power requirements are an order of magnitude higher than application-specific integrated circuits, FPGAs shorten development time and offer the intriguing possibility of runtime adaptive reconfiguration [13].

Central Processing Units (CPU) are understood to be general-purpose microprocessors. Decades of effort have gone into improving their serial performance by means of caches, prediction and super-scalar pipelining with out-of-order execution [14][p. 1314]. These facilities enable a flexible and simple programming model. However, physical limitations motivated a paradigm shift towards parallelism in the form of multiple processors/cores and SIMD [15]. Recently, special hardware support has been added for applications such as video encoding, cryptography and checksums [16][p. 13], thus blurring the distinction between CPUs and accelerators.
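The SIMD principle mentioned in the descriptions above, one instruction applied to multiple lanes of data, can be illustrated with a short sketch. This is an assumption-laden example rather than code from the thesis: AddLanes is a hypothetical helper name, and the x86 SSE intrinsics are used only when the compiler reports support for them, with a scalar loop as fallback.

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE/SSE2 intrinsics
#define EXAMPLE_USE_SSE 1
#else
#define EXAMPLE_USE_SSE 0
#endif

// Computes sum[i] = a[i] + b[i] for i in [0, count).
void AddLanes(const float* a, const float* b, float* sum, std::size_t count) {
  std::size_t i = 0;
#if EXAMPLE_USE_SSE
  // A single add instruction processes four 32-bit lanes at a time.
  for (; i + 4 <= count; i += 4) {
    const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of four floats
    const __m128 vb = _mm_loadu_ps(b + i);
    _mm_storeu_ps(sum + i, _mm_add_ps(va, vb));
  }
#endif
  // Scalar loop for the remaining elements (or all of them without SSE).
  for (; i < count; ++i) sum[i] = a[i] + b[i];
}
```

The unaligned load/store variants are chosen here so that the sketch works for arbitrary pointers; aligned accesses would require additional guarantees about the buffers.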


2.2 Datasheet Comparison

To gain further insight into the strengths of each architecture, we compare several of their key characteristics. Table 2.1 lists the total cache and memory size available to each architecture.

Table 2.1: Total size of the architectures' caches (or block RAM in the case of FPGAs) and external memory.

Arch.  Model                Cache [MiB]  Mem. [GiB]
DSP    TI TMS320C6678       6.50         8
GPU    NVidia GF100 Fermi   1.75         6
FPGA   Xilinx Virtex-7      10.63        (?)
CPU    Intel Sandy Bridge   9.25         192

The CPU devotes a significant proportion of its transistors to the cache [17]. Although the DSP lacks a third level cache, its other levels match the CPU's capacity [18]. With the advent of 16 GiB DDR3 modules, commodity workstations can accommodate 192 GiB of memory [19]. The limit for a custom FPGA memory interface is unknown, but both other architectures are restricted to a few gigabytes [18, 20]. This is of particular concern for image segmentation, which requires large amounts of random-access memory (cf. Chapter 6).

Table 2.2 provides a rough estimate of attainable performance by listing the advertised¹ floating-point operations per second (FLOPS). The GPU and especially FPGA boast higher values than the other processors due to their massive parallelism [22, 23]. However, despite multiple memory interfaces, their memory bandwidth lags far behind the raw computational power [20, 23]. Amdahl suggested a rule of thumb for balanced computer designs: 1 byte of memory and 1 byte per second of I/O are required for each instruction per second [11]. Interestingly, the CPU is much closer to meeting these guidelines than the other architectures [24, 25].

¹ The CPU's entry is an actual measurement on an overclocked system [21].

Table 2.2: Key performance indicators for each architecture. [SIMD] Lanes are understood to be CUDA cores (DSP slices) in the case of GPUs (FPGAs).

Arch.  Lanes  Mem. BW [GB/s]  GFLOPS
DSP    128    12              160
GPU    512    144             1500
FPGA   5280   336             2737
CPU    64     29              130

That aside, FLOPS are an incomplete characterization of performance. We also wish to provide a measure that is less dependent on the clock rate. It is difficult to compare the irregular execution units of a DSP to the plentiful but severely restricted CUDA cores on a GPU, or simple DSP slices (a multiplier combined with an adder/subtracter and multiplexer) in FPGAs to complex, high performance CPU cores. However, we can consider "lanes", the aggregate number of values that can be computed per clock. There is about a tenfold increase from CPU to GPU to FPGA [9, 23, 26]. This yields the important insight that GPUs and especially FPGAs require large amounts of parallelism to realize their full potential.

Despite our focus on performance, the suitability of an architecture depends heavily on other factors, some of which are listed in Table 2.3. For example, the estimated cost of a Virtex-7 FPGA [27] is about 100 times the price of a DSP or CPU [26]. A more cost-effective means of matching the FPGA's FLOPS may involve an array of DSP boards or a CPU cluster. The high-end Quadro 6000 GPU is also comparatively expensive, presumably due in part to its relatively large GDDR5 memory capacity.

Table 2.3: Non-performance-related characteristics that also affect an architecture's real-world suitability.

Arch.  Process [nm]  Power [W]  Transistors ×10^6  Price [€]
DSP    40            10         (?)                110
GPU    40            225        3000               3500
FPGA   28            40         (?)                19000
CPU    32            95         995                220

Power requirements are another important consideration. The DSP is quite efficient in this regard [28], making it suitable for embedded systems. Conversely, the GPU draws twice the CPU's power [20, 26] and uses three times as many transistors [9, 17]. A fair comparison between GPU and CPU should therefore involve at least a dual-CPU system. The FPGA has been optimized for low power and is extremely efficient in terms of FLOPS/Watt [29]. However, let us note that it is manufactured on a smaller process node [22]. This advantage may soon be reversed, because CPUs with 30 nm physical gate lengths are expected to be available by 2012 [31].

2.3 Our Choice

Having seen the relative strengths and weaknesses of each architecture, we now present a perhaps controversial case for a CPU-based approach. Our envisioned large-scale image analysis pipeline requires the development of new algorithms and approaches for coping with the flood of data. As famously remarked by Wernher Freiherr von Braun: "Basic research is what I am doing when I don't know what I am doing" [32]. This uncertainty calls for exploration, i.e. the development of prototypes. The flexibility and ease of programming of CPUs greatly simplify this task. An initial software implementation that ignores performance can often be constructed and tested more rapidly than an FPGA, and probably developed at lesser cost than GPU or DSP software.
Aside from productivity concerns, recent studies have also dampened the enthusiasm for GPU acceleration. A survey of 14 data-parallel kernels found that a GPU is only about 2.5 times as fast when both implementations are optimized [33]. However, even this advantage is negated by the above argument that a fair comparison (in terms of price, transistors and power dissipation) requires at least two CPUs. The conventional wisdom that GPUs provide a large speedup seems to be a self-fulfilling prophecy, because it leads to an increased awareness of GPU optimization techniques. Indeed, a Google Scholar search in June 2011 for "GPGPU" (general purpose GPU) returned 437 works from that year, whereas only 82 contained the words "optimized, SSE, SIMD". Heeding guidelines for CPUs may be dismissed as tuning that only slightly decreases constant factors. However, the optimization techniques are fundamentally related in that they both call for explicit vectorization [34]. A study taking this into account found that GPUs are only as fast as one or two CPUs in traditional high-performance computing applications [35].
Why does the actual performance of GPUs lag so far behind their theoretical power? A recent simulation found that a representative set of non-graphics applications only used 45% of the GPU's computational resources on average, with a worst case of 5% for one bioinformatics algorithm. Three main causes were identified. The first is waiting for data from memory. GPUs attempt to hide this latency by performing other work in the meantime, but algorithms do not always provide enough parallelism. The second is similar: computations that depend on previous operations must wait for them to have been completed. The final pitfall concerns conditionally executed logic. If the threads in a GPU-defined group (warp) differ in terms of the path taken, they are executed sequentially! [36] These observations confirm the well-known fact that peak FLOPS are an inadequate predictor of performance.
However, there is a more important conclusion to be drawn from these studies. Because similar performance was reported for equally optimized CPU and GPU implementations, the benefits and costs of optimizing an algorithm for a particular architecture should carefully be considered. We believe CPUs hold much untapped potential in this regard. Let us now return to the initial productivity argument. It is relatively easy to transform and optimize software implementations for CPUs. Verifying correctness with built-in logic checks and comparisons with the previous iteration improves reliability. Measuring the actual improvement at each step enables informed decisions when exploring the design space. This cycle of design, analysis, implementation and measurements is the defining characteristic of the emerging discipline of algorithm engineering [37]. It facilitates novel algorithmic transformations that might not arise during straightforward, hardware-oriented development efforts. The following chapters describe multiple cases in which the resulting software surpasses the stated performance of a GPU or FPGA implementation.
Although it is often possible to achieve additional speedups by means of distributed-memory algorithms designed for clusters (multiple independent computers connected by a network), we are somewhat constrained by power, cooling and space considerations. Some applications (e.g. in mobile ground control stations) only permit the use of a single computer. We therefore target commercially available off-the-shelf workstations with dual CPUs. Unless otherwise noted, the test platform is a Dell T5500 with two X5690 CPUs (3.6 GHz) and 48 GiB DDR3 memory running 64-bit Windows 7. With the stated exceptions, our software is compiled with ICC 12.0.1.096 /Ox /Ob2 /Oi /Ot /Qstd=c++0x /GA /GR- /GS- /Gy /EHsc /MD /Qipo /QxSSE4.1 /Qopenmp. The resulting executables also run on AMD processors that support the requisite SSE3 instruction set.

2.4 Consequences for the Algorithms

What implications does our choice of architecture bring about? Because we are not dealing with compute clusters, our algorithms can be designed for the simpler shared memory model instead of having to communicate by passing messages. The prevalent Intel architecture also provides a favorable, i.e. strict, memory consistency model in which processors see memory writes occur in a total global order [38]. Apart from these simplifications, there are three major peculiarities of CPUs to be taken under consideration: a memory hierarchy, SIMD extensions, and multiple cores/processors. These are discussed in the following subsections.
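To illustrate the shared memory model, the following sketch sums an array of pixels with an OpenMP parallel loop: every thread reads the same array directly, and no messages are exchanged. This is an assumed example, not code from the thesis; SumPixels is a hypothetical name, and without OpenMP support the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums all pixel values. With OpenMP enabled (e.g. /Qopenmp or -fopenmp),
// iterations are divided among threads that share the input array.
std::uint64_t SumPixels(const std::vector<std::uint16_t>& pixels) {
  std::int64_t total = 0;
  const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(pixels.size());
  // Each thread accumulates a private partial sum; the reduction clause
  // combines the partial sums once the loop has finished.
#pragma omp parallel for reduction(+ : total)
  for (std::ptrdiff_t i = 0; i < count; ++i) {
    total += pixels[static_cast<std::size_t>(i)];
  }
  return static_cast<std::uint64_t>(total);
}
```

A message-passing version would instead have to partition the data among processes and send the partial sums to one of them, which is the overhead the shared memory model avoids.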

Memory Hierarchy

Current semiconductor technology allows certain levels of integration and signal propagation times. This entails a trade-off between storage size and access latency. In an attempt to bridge the growing gap between computational power and memory bandwidth, CPUs provide a hierarchy of storage including cache and main memory. Caches are small and fast, whereas memory provides plentiful but slow storage. Let us examine their properties in turn.

Cache

Caches are storage areas managed by the CPU that enable faster access to frequently-used data. For concreteness, current microarchitectures provide 32 KiB L1D (first level data) caches with an aggregate throughput of 650 GB/s and 256 KiB L2D capable of 435 GB/s [39]. A comparison with the 29 GB/s memory bandwidth [24] underscores the importance of making good use of the cache. We therefore strive to minimize "misses", i.e. cases where the desired data is not stored within any "line" (a fixed-size portion of the cache). To that effect, let us address each of the potential causes: compulsory, capacity, and conflict [40].

Compulsory.Evenaninfinite-sizedcachewouldincurcompul-
sorymisseswhendataisfirstaccessed.Theirlatencycanbe
hiddenbyprefetching,i.e.accessingmemorybeforeitisactu-
allyneeded.However,thisisnotalwaysfeasibleorworthwhile;
amorepracticalworkaroundistodownsizethedata.Thismay
involvetheuseofsmallertypes(e.g.singleprecisioninsteadof
double)orcompression.Forexample,smallflagsorindicescanbe
embeddedintothelowerbitsofpointers,becausetheirvaluesare
generallyamultipleoftheprocessorswordsize.Aseriesoflarge,
slowlyvaryingvaluescanbedelta-encoded,storingthedifferences
betweenindividualvalues.Theadditionofoccasionalfull-sized
keyframesenablesefficientrandomaccessbyaccumulatingdeltas
sincethepreviousvalue.Inthecaseof64-bitvalueswith328-
bitdeltasbetweenkeyframes,thedataisreducedbyafactorof
six,andtheaverageaccessisstillfasterthanacachemiss.Even
morespectacularsavingsareenabledbyprobabilisticcounting,
whichapproximatessums≤nwhileusingonlyloglognbits.It
hasbeenshownthatincrementingthetruncatedlogarithmlogn
withprobabilityinverselyproportionaltonyieldsanunbiased
estimatorforn[41].
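The delta encoding with keyframes described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the names DeltaSeries, Encode and Get are hypothetical, and the layout (one 64-bit keyframe followed by 32 signed 8-bit deltas) is chosen to mirror the numbers in the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: every 33rd value is a full 64-bit keyframe, followed by
// 32 signed 8-bit differences to the respective predecessor.
constexpr std::size_t kInterval = 33;

// Hypothetical container, for illustration only.
struct DeltaSeries {
    std::vector<std::uint64_t> keyframes;
    std::vector<std::int8_t> deltas;
};

// Returns false if the difference between two neighbors does not fit into 8 bits.
bool Encode(const std::vector<std::uint64_t>& values, DeltaSeries& out) {
    out = DeltaSeries();
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i % kInterval == 0) {
            out.keyframes.push_back(values[i]);
            continue;
        }
        // Unsigned subtraction wraps around; reinterpret as a signed difference.
        const std::int64_t diff = static_cast<std::int64_t>(values[i] - values[i - 1]);
        if (diff < INT8_MIN || diff > INT8_MAX) return false;
        out.deltas.push_back(static_cast<std::int8_t>(diff));
    }
    return true;
}

// Random access: start at the keyframe of the group, then add up to 32 deltas.
std::uint64_t Get(const DeltaSeries& series, std::size_t i) {
    const std::size_t group = i / kInterval;
    const std::size_t offset = i % kInterval;
    const std::size_t first_delta = group * (kInterval - 1);
    std::uint64_t value = series.keyframes[group];
    for (std::size_t k = 0; k < offset; ++k) {
        const std::int64_t delta = series.deltas[first_delta + k];
        value += static_cast<std::uint64_t>(delta);
    }
    return value;
}
```

With this layout, 33 values occupy 8 + 32 = 40 bytes instead of 264 bytes, which corresponds to the reduction by roughly a factor of six mentioned above.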

Capacity. A finite cache size and imperfect replacement strategy give rise to so-called capacity misses when lines are evicted in favor of newer data. The previously mentioned compression improves the utilization of a particular cache. However, algorithms must also exhibit locality of reference to derive any benefit. Temporal locality (i.e. re-using the same memory locations within a short timespan) increases the likelihood of data still residing in the cache. Similarly, spatial locality (accessing nearby locations) decreases the number of cache lines to populate, thus reducing evictions of previous data. Caches are designed to exploit both of these properties. However, their behavior is suboptimal for sequential write-only access patterns. The memory to be written is first loaded into a cache line, which pollutes the cache by replacing its previous contents with data that will not be accessed again. Loading from memory is also unnecessary if the entire cache line will be overwritten. To avoid these problems, algorithms should implement write-only transfers via special instructions that bypass the cache and write directly to memory.

Conflict. Cache lines are associated with a memory location by means of tags that indicate the address. Because it is difficult to examine each line's tag when checking whether data is present in the cache, CPUs typically provide a fixed mapping of addresses to sets of lines. Their cardinality (the cache associativity, e.g. 8) therefore determines the number of memory locations that can map to the same set without evicting a line. Examples of access patterns that exceed this limit include iterating over power-of-two sized matrix rows and writing data to multiple destinations with the same alignment. These problems can be mitigated by offsetting the various addresses by random multiples of the cache line size.
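The effect of power-of-two strides on the set mapping can be made concrete with a small sketch. The cache geometry below (32 KiB, 8-way, 64-byte lines, hence 64 sets) and the simple modulo mapping are assumptions for illustration; real CPUs may index their caches differently. The second call in the usage note applies a deterministic one-line offset per row, a simplified variant of the random offsets mentioned above.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed geometry: 32 KiB cache, 8-way set associative, 64-byte lines.
constexpr std::uintptr_t kLineSize = 64;
constexpr std::uintptr_t kCacheSize = 32 * 1024;
constexpr std::uintptr_t kAssociativity = 8;
constexpr std::uintptr_t kNumSets = kCacheSize / (kLineSize * kAssociativity);

// Index of the set to which an address maps (simple modulo mapping).
constexpr std::uintptr_t SetIndex(std::uintptr_t address) {
    return (address / kLineSize) % kNumSets;
}

// Number of distinct sets touched by `count` accesses spaced `stride` bytes apart.
std::size_t DistinctSets(std::uintptr_t base, std::uintptr_t stride, std::size_t count) {
    bool seen[kNumSets] = {};
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::uintptr_t set = SetIndex(base + i * stride);
        if (!seen[set]) {
            seen[set] = true;
            ++distinct;
        }
    }
    return distinct;
}
```

Under these assumptions, DistinctSets(0, 4096, 16) reports a single set, so 16 rows of 4096 bytes compete for 8 lines, whereas DistinctSets(0, 4096 + kLineSize, 16) spreads the same accesses over 16 sets.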

Memory

To a lesser extent, memory also exhibits some of the same characteristics as the cache. It is faster to access nearby locations in the same row of memory cells that is currently open [42, pp. 8–9]. Non-uniform memory access (NUMA) systems are also characterized by variable latency. For example, the integration of memory controllers into the CPU has resulted in faster accesses to local memory managed by the current processor. Software implementations should be aware of this issue and explicitly allocate their memory from nearby resources, i.e. the current NUMA proximity domain. It is interesting to observe that the memory hierarchy encourages local data accesses despite the trend towards ever larger memory sizes. Reducing data sizes, even with a non-trivial (de)compression overhead, generally also speeds up a program!

SIMD

SuperscalarCPUsenabletheconcurrentexecutionofmultiple
instructionsperclockcycle.However,thiscomesatthecostof
parallelism.complicatedManycontrolarcirchitecturcuitryesandhaveonlytheralloeforwseaaddedlimitedsupportdegreeforof
SIMDextensionssuchas3DNow!,AltiVec,MAX,MDMX,MMX,
MVI,SSE,VIS[43]andmorerecently,AVX,LRBniandNEON.
Theinstructionsconcurrentlyapplyoperationstoallelements
(typically4or8)ofashortvector,thussignificantlyincreasing
peakFLOPS.Algorithmsshouldthereforebedesignedtoutilize
softwtheseareisacapabilities.challengeHo[w44e]ver,andautomaticcompilersvcannotectorizationalwaysoftransforexistingm
codeintoaformsuitablefortheoftenincompleteandirregular
instructionsets.AlibrarysolutionforJavaonlyresultedina34%
fic[speedup45].Weduethertoeforesignificantutilizeoverheadintrinsics,andspecialadditionalfunctionsmemorknoywntraf-
tothemajorC++compilersthattypicallyresultinthegeneration
ofniencesingleofSIMDassemblyinstructions.languageandAlthoughmanualravegisteroidingtheallocation,inconvthee-
syntaxissomewhatverbose,asexemplifiedbymultiplicationus-
ingIntelsStreamingSIMDExtensions(SSE)instructionset:
__m128product=_mm_mul_ps(input,multiplier).
Wherepossible,weusecompiler-providedshortvectorclasseswith
overloadedfunctions,whichaffordsmoreconvenientnotation:
alsoF32vec4allowsproductgenerating=bothvmultiplierectorand*scalarmultiplicand(single-operand).This
variantsofthesamecodebymeansofC++templates,whichis
helpfulfortestingandbenchmarking.Besidesdifferingsyntax,
SIMDraiseschallengesconcerningdependenciesandalignment.

Dependencies. Algorithms must be structured so that operations can proceed in parallel. Although SIMD cannot significantly decrease the latency of tasks such as polynomial evaluation that involve dependencies on previous or intermediate values, it does increase throughput by computing several results in parallel. Even seemingly sequential tasks such as updating a sum can be done in parallel using prefix sums.
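To illustrate how a running sum can be computed without a chain of lane-by-lane dependencies, the following sketch emulates a four-lane vector with std::array and forms the inclusive prefix sum in two shift-and-add steps. The helper names are hypothetical and the code is scalar; on actual SIMD hardware, each helper would correspond to a vector shift or add instruction.

```cpp
#include <array>
#include <cstddef>

// Emulated four-lane vector (an assumption for illustration; real code would
// use a SIMD register type).
using Vec4 = std::array<int, 4>;

// Moves every lane `n` positions towards higher indices and fills with zeros.
Vec4 ShiftUp(const Vec4& v, std::size_t n) {
    Vec4 result{};
    for (std::size_t i = n; i < v.size(); ++i) result[i] = v[i - n];
    return result;
}

// Lane-wise addition.
Vec4 Add(const Vec4& a, const Vec4& b) {
    Vec4 result{};
    for (std::size_t i = 0; i < a.size(); ++i) result[i] = a[i] + b[i];
    return result;
}

// Inclusive prefix sum in log2(4) = 2 steps instead of 3 dependent additions.
Vec4 PrefixSum(Vec4 v) {
    v = Add(v, ShiftUp(v, 1));  // each lane now holds the sum of up to 2 inputs
    v = Add(v, ShiftUp(v, 2));  // each lane now holds the sum of up to 4 inputs
    return v;
}
```

For the input lanes 1, 2, 3, 4 the first step yields 1, 3, 5, 7 and the second 1, 3, 6, 10.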

Alignment. To simplify the hardware, instruction sets may require operands to be aligned, i.e. residing at addresses that are a multiple of the vector size. Later revisions of the SSE instruction set provide separate instructions for loading aligned and possibly unaligned operands. Their relative cost and possible workarounds are discussed in Section B.2. If possible, algorithms should be designed to load and store aligned vectors.

Parallelization

It is well-known that single-core improvements such as speculation, caches and superscalar pipelines have reached the point of diminishing returns. CPU architects therefore began allocating available transistors towards multiple cores and logical processors [15]. This has also been motivated by power and cooling, the importance of which was highlighted when the Pentium 4 processor exceeded a hot plate's thermal power density by a factor of ten [46]. Because dynamic power is proportional to frequency × voltage², a common argument proposes running several processors at a fraction of the frequency, thus also allowing lower voltages [47]. This has the potential for near-cubic reductions in power and may even increase performance. However, both of these assumptions are flawed. First, dynamic power consumption excludes various kinds of leakage in semiconductors. Such static power already accounted for 40% of the total dissipation in a 90 nm process and increases with smaller gate lengths [48]. Subthreshold leakage also grows exponentially with a decrease in threshold voltage [49]. Second, algorithms may require communication or synchronization between processors, thus eroding any performance gains. Amdahl's well-known argument also limits the parallel speedup to the reciprocal of the serial portion of an algorithm.

Looking beyond power, which affects cooling requirements, energy (i.e. power × time) is also a critical factor. One study has found that lower frequencies increase the total energy consumption because other system components are used for a longer period of time [50]. These arguments notwithstanding, our algorithms should make full use of the available hardware, including multiple cores and logical processors. Unfortunately, parallelization also brings with it two challenges: correctness and infrastructure.
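The two quantitative arguments above can be written down as small formulas. The sketch below uses the standard statement of Amdahl's law and the proportionality of dynamic power to frequency × voltage²; the function names are hypothetical, and leakage (static power) is deliberately ignored, which is exactly the simplification criticized in the text.

```cpp
// Upper bound on the speedup with `processors` cores when `serial_fraction`
// of the running time cannot be parallelized (Amdahl's law).
constexpr double AmdahlSpeedup(double serial_fraction, unsigned processors) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors);
}

// Dynamic power relative to a baseline, given scaling factors for frequency
// and voltage. Static (leakage) power is not modeled.
constexpr double RelativeDynamicPower(double frequency_scale, double voltage_scale) {
    return frequency_scale * voltage_scale * voltage_scale;
}
```

For example, halving both frequency and voltage gives RelativeDynamicPower(0.5, 0.5) = 0.125 per processor, the near-cubic reduction of the common argument. With a serial fraction of 10%, AmdahlSpeedup stays below 10 no matter how many processors are added.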

Correctness. It is difficult to guarantee the correctness of parallel programs running on multiple processors. Algorithms must first split up the data into (preferably entirely independent) subtasks and dispatch them to the processors. If the tasks depend on a certain order of execution, the software must take care of synchronization, typically via mutual exclusion or lock-free algorithms. However, the former is prone to deadlocks (multiple processes waiting on each other), whereas the latter requires awareness of the exact memory ordering guarantees made by the compiler and CPU. To avoid most of these difficulties, we strive to process portions of the inputs independently and later accumulate the individual results.
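A minimal sketch of this strategy, assuming a simple summation as the workload: the input is split into independent chunks, each chunk produces its own partial result, and the results are accumulated sequentially afterwards. The function name SumInChunks is hypothetical. The pragma assumes an OpenMP-capable compiler; without OpenMP it is ignored and the loop runs serially with the same result.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sums `input` by splitting it into `num_chunks` independent ranges
// (num_chunks must be at least 1). Each chunk reads only its own part of the
// input and writes only its own slot of `partial`, so no locking is needed.
long long SumInChunks(const std::vector<int>& input, std::size_t num_chunks) {
    std::vector<long long> partial(num_chunks, 0);
    const std::size_t chunk_size = (input.size() + num_chunks - 1) / num_chunks;

    // Parallel part: chunks are distributed among threads if OpenMP is enabled.
    #pragma omp parallel for
    for (long c = 0; c < static_cast<long>(num_chunks); ++c) {
        const std::size_t begin =
            std::min(static_cast<std::size_t>(c) * chunk_size, input.size());
        const std::size_t end = std::min(begin + chunk_size, input.size());
        long long sum = 0;  // local accumulator, written back exactly once
        for (std::size_t i = begin; i < end; ++i) sum += input[i];
        partial[static_cast<std::size_t>(c)] = sum;
    }

    // Sequential part: accumulate the individual results.
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Because the chunks share no mutable state, the result does not depend on the order in which they are executed.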

Infrastructure. Traditional software development tools often provide only limited support for parallelization. For example, the 2003 revision of the C++ standard (ISO/IEC 14882) makes no mention of multiple threads, memory consistency or ordering guarantees. Efforts have been undertaken to develop library solutions, including parallel variants of C++ standard library functions [51] and Threading Building Blocks suitable for common parallel idioms [52]. Although useful, these do not provide the full degree of control necessary to maximize performance. For example, a parallelization scheme should take into account the NUMA and cache topology, e.g. when mapping threads to processors. We provide infrastructure for this purpose that is shared between all parallel algorithms. It is based on the fork-join paradigm (Figure 2.1), which is characterized by one or more phases consisting of initialization, parallel work and sequential reduction. This allows synchronization and safe handling of dependencies between parts of an algorithm while hiding implementation details. In fact, the algorithms can be expressed as if they ran serially, as shown by Figure 2.2. Each worker thread executes Assist, which receives an indication of the phase number and the thread's ID. When all are finished, Supervise is called on a single thread and decides whether to continue. Finally, a reduction is performed by successive calls to Accumulate; this example records the latest time reported by any thread. We use OpenMP parallel regions to launch (fork) the worker threads, which has the advantage of avoiding platform-specific implementations. Threads can also be combined into groups, which can work together on the same subset of data. This improves resource utilization when the group's processors share caches or NUMA memory.

[Flowchart nodes: Begin, Init, Work 0, Work 1, ..., Work N, Done?, Yes, End]

Figure 2.1: Fork-join parallelization model.

    void Assist(size_t phase, size_t id) {
      if (phase == 2) LocalLSD(id);
      else LocalMSD(id);
    }

    static Status Supervise(size_t phase) {
      if (phase == 2) return DONE;
      else return ComputeGlobalRanks();
    }

    void Accumulate(const Group& rhs) {
      endTime = std::max(endTime, rhs.endTime);
    }

Figure 2.2: Simplified example of parallel C++ code using the fork-join model.

2.5 Discussion

We have chosen to develop image processing algorithms for general-purpose CPUs because they are more flexible and require less development effort than specialized architectures. Recent advances in CPU designs have also provided the potential for significant computational power. In contrast to the free lunch previously offered by increasing clock rates [15], developers must take action and account for SIMD parallelization and the memory hierarchy. It may even be difficult to adapt existing designs towards these new requirements. Instead, they are best considered during the design phase.

At this point, three concerns might be raised. Would the additional effort exceed the design and validation cost incurred on other architectures such as FPGAs? We argue that successively refined software has valuable side effects. Prototyping avoids wasting effort on optimizing algorithms that might turn out to be unsuitable, and allows verifying the correctness of each transformation along the way. We do not believe the rather complex Hotspot algorithm described in Chapter 8 would have been forthcoming, or even feasible, without such an approach. A second potential interjection is that these techniques can only improve performance by a constant factor. That is true, but no other improvements are possible for algorithms that are already at the lower bound of their complexity. The previous sections have also hinted at the magnitude of the potential speedups: 4 to 16 for vectorization, 4 to 12 for parallelization, and up to 22 from the cache. In our opinion, such factors are highly relevant. A final concern relates to obsolescence: will these considerations still apply to future microarchitectures? The past being our best predictor of the future, let us examine the evolution of CPUs over the last 10 years. Cache line sizes are an important parameter for cache-aware algorithms, and have remained constant at 64 bytes [53]. The SSE2 SIMD instruction set is still useful, and code written with intrinsics would even benefit from new capabilities in the AVX instruction set after a recompile. Efforts are also underway to develop auto-tuning mechanisms for adapting algorithms to the target hardware [54].

Maximizing performance currently requires an awareness of the system internals, which typically entails manual intervention by the developer. However, it is the thesis of this work that such efforts may be richly rewarded. In the subsequent chapters, note the multiple cases in which our algorithms, running on commodity CPUs, outperform specialized hardware.

Part II

Main Course

Chapter 3

Input/Output

The first and last links of the image processing chain involve loading the pixels into memory and storing them to disk. This chapter describes our representation of images and how to efficiently transfer them to and from block storage devices such as hard disk drives (HDD).

3.1 Image Representation

Images are typically two-dimensional arrays of pixels. In accordance with the C++ standard [55, 8.3.4], we mandate a row-major layout in which the index within a row (the column index) varies fastest. In other words, the pixels constituting a row are stored before those of the next row. An additional constraint arises from SIMD instruction sets. They often require or at least benefit from natural alignment, i.e. ensuring addresses are integral multiples of the operand size. Because we wish to allow parallel processing of images, with each processor responsible for an arbitrary interval of the image rows, the starting address of each row should be aligned to the vector size.

It is convenient and efficient to represent the image as a contiguous virtual address range together with a step, i.e. the offset to the next row. Row n is reached by adding n × step to the starting address. This is expected to be at least as fast as a table lookup [56] and certainly more economical in terms of cache usage. The Intel Performance Primitives (IPP) library [57] also uses such a representation.

Because image processing algorithms often require access to neighboring pixels or each band at a certain pixel position, we choose a band-interleaved-by-pixel layout in which the first pixel's components are followed by those of the next pixel in the row (Figure 3.1). This representation corresponds to some simple file formats such as PM (cf. Section 3.3), which allows reading an entire image into memory and storing it to disk without any reshuffling. We are therefore only concerned with sequential, not random, I/O.

    (1,y)R  (1,y)G  (1,y)B  ...  (w,y)R  (w,y)G  (w,y)B

Figure 3.1: R/G/B component ordering for the w pixels (x, y) in row y.

However, the row-major layout has poor locality for some access patterns because vertically adjacent pixels are stored far apart. This is particularly relevant for compression, which benefits from spatial locality. A common workaround involves splitting the image into small square tiles, each of which is stored in row-major order. Locality is improved because most vertically adjacent pixels are now only spaced one tile row apart. GPU-based rendering of large images also requires splitting the image into tiles due to limits on the maximum texture size. We therefore use a tiled representation for the final result image that is to be compressed and displayed in a viewer (cf. Section 3.3).
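The row-major, band-interleaved-by-pixel layout with an aligned step can be sketched as follows. The descriptor ImageLayout, its field names and the 16-byte vector size are assumptions for illustration and are not taken from the actual implementation. Coordinates are 0-based here, whereas Figure 3.1 counts pixels from 1.

```cpp
#include <cstddef>

constexpr std::size_t kVectorSize = 16;  // assumed SIMD vector size in bytes

// Hypothetical image descriptor, for illustration only.
struct ImageLayout {
    std::size_t width;             // pixels per row
    std::size_t height;            // number of rows
    std::size_t bands;             // components per pixel, e.g. 3 for R/G/B
    std::size_t bytes_per_sample;  // e.g. 2 for 16-bit components
    std::size_t step;              // bytes from one row to the next
};

// Rounds the row size up so that every row starts at a multiple of kVectorSize,
// provided the base address itself is aligned.
ImageLayout MakeLayout(std::size_t width, std::size_t height, std::size_t bands,
                       std::size_t bytes_per_sample) {
    const std::size_t row_bytes = width * bands * bytes_per_sample;
    const std::size_t step = (row_bytes + kVectorSize - 1) / kVectorSize * kVectorSize;
    return ImageLayout{width, height, bands, bytes_per_sample, step};
}

// Byte offset of component `band` of pixel (x, y): row y begins at y * step,
// and the components of a pixel are stored next to each other.
std::size_t ByteOffset(const ImageLayout& layout, std::size_t x, std::size_t y,
                       std::size_t band) {
    return y * layout.step + (x * layout.bands + band) * layout.bytes_per_sample;
}
```

For a 5 × 3 image with three 16-bit bands, a row occupies 30 bytes and the step is padded to 32, so that each processor can begin its interval of rows at an aligned address.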

3.2 Efficient I/O

In our applications, storage devices are accessed through the file system. However, modern operating systems provide multiple I/O interfaces. The chief distinction is whether the application can proceed while a transfer is in progress (asynchronous), as opposed to waiting inside the operating system kernel until I/O is complete (synchronous). Which of these is better suited for our needs, and what techniques can further improve performance? These questions are addressed in the following sections.

Asynchronous vs. Synchronous

Let us measure the rate at which data can be written to disk (throughput) with the synchronous and asynchronous I/O methods provided by the ATTO Disk Benchmark 2.46. The test platform consists of a W3550 CPU running Windows 7 with the page file disabled and a WD6400AAKS HDD. Due to various resource limits in the application, operating system, drivers and hardware, I/O requests will eventually be split into blocks. Table 3.1 shows increasing throughputs for larger application-requested block sizes due to amortization of overhead. There are further, nearly negligible improvements for even larger blocks. However, 1 MiB is a reasonable cutoff point (cf. Section 3.2). As found in previous work [58], asynchronous writes are faster to converge to the disk's maximum throughput. This is because the disk controller can immediately begin the next transfer after the previous one completes without requiring the application to first transition into kernel mode. Asynchronous I/O generally involves higher CPU overhead [59, p. 381], especially on Windows, which only provides Fast I/O driver entry points for synchronous I/O [60]. However, it has the major advantage of allowing the application to perform work (e.g. compression) while waiting on previous transfers. We therefore prefer it to the more commonly used synchronous access method.

Table 3.1: Conventional and asynchronous write throughput measured with the ATTO benchmark on a WD6400AAKS HDD for various block sizes.

    Block size [KiB]   write [MB/s]   async [MB/s]
                   4           24.9           45.2
                   8           42.8           75.6
                  16           68.5          100.9
                  32           91.5          105.2
                  64          103.3          107.9
                 128          105.7          108.9
                 256          104.9          108.5
                 512          105.5          108.2
                1024          106.1          107.4

Block Size

We wish to maximize disk throughput while overlapping computation with I/O. It is straightforward to interleave these two tasks by splitting transfers into blocks. Computations can be carried out for a completed block while waiting for subsequent I/Os. The block size is bounded by the following considerations: Transfers are carried out via Direct Memory Access hardware, which requires contiguous physical memory. Drivers must therefore represent the application-provided memory buffer as a list of physical pages (scatter-gather list). These are stored in nonpaged pool, a small memory area set aside by Windows, and are therefore restricted to 255 entries [61]. The resulting limit is roughly 1 MiB (255 × 4 KiB = 1020 KiB) given a 4 KiB page size. Although it is desirable to amortize system call overhead over large requests, those exceeding this limit incur additional overhead due to splitting. Conversely, there must be a minimum block size because the number of pending I/O requests may be finite. Windows also requires transfer sizes to be sector-aligned, and the Advanced Format industry initiative [62] has introduced drives with 4 KiB sectors, so we consider that to be the minimum. Table 3.2 shows the read and write throughputs measured by ATTO on the previously mentioned HDD and a 128 GB Crucial C300 Solid-State Disk (SSD) over this range of block sizes. Although SSD read throughput tends to increase with larger block sizes, the bar plot representation of these numbers in Figure 3.2 makes apparent a sharp drop at 64 KiB. The cause is unclear; perhaps transfers are being split up due to scatter-gather list limitations or other inefficiencies within the driver or controller. However, write throughputs remain nearly constant. Interestingly, HDD writes can outperform reads due to caching by the controller. We choose 128 KiB blocks as a reasonable compromise that provides good throughput without requiring large buffers that exceed the L2 cache size. Note that this discussion presumes sequential I/O, which is justified in Section 3.1. Random I/O may require larger block sizes to amortize the cost of HDD seeks¹.

¹ Repositioning the read/write head in preparation for reading or writing from another location.

Table 3.2: Asynchronous read and write throughput [MB/s] measured with ATTO on a WD6400AAKS HDD and C300 SSD for various block sizes.

    Block size [KiB]   HD write   HD read   SSD write   SSD read
                   4       45.2     102.9       126.9      202.9
                   8       75.6     102.4       134.2      253.9
                  16      100.9      98.4       135.3      284.1
                  32      105.2     101.7       129.4      304.8
                  64      107.9      77.4       139.8      214.3
                 128      108.9      77.7       142.1      326.6
                 256      108.5      83.2       141.7      323.4
                 512      108.2      83.6       141.3      325.8
                1024      107.4      83.8       140.5      326.6

[Bar plot; x-axis: Block size [KiB] (4 to 1024); y-axis: Throughput [MB/s] (50 to 300); series: HD write, HD read, SSD write, SSD read]

Figure 3.2: Bar-plot representation of HDD and SSD read/write throughputs.

Implementation Details

Let us now briefly examine details of our I/O implementation. To ensure source code portability, we adhere to the POSIX asynchronous I/O interface, which is codified in the 2004 edition of IEEE Standard 1003.1 [63]. These functions are not included with Windows, but the Intel Compiler's libicaio library [64] provides replacements. The implementation in version 12.0 (Parallel Studio 2011) appears to be based on synchronous I/O in helper threads². This approach does not maximize disk throughput, although it does avoid the restrictions mentioned below. We instead implement the POSIX functions in terms of Windows asynchronous I/O. This entails specifying FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING when opening the file. Windows then requires addresses, sizes and offsets to be a multiple of the volume sector size. Our low-level functions pass on these constraints to their callers, which can handle them without penalty. Several lesser-known tricks [65] have also been applied. Contiguous storage for OVERLAPPED structures, the Windows equivalent of POSIX aiocb (asynchronous I/O control blocks), allows pinning them in the kernel address space by means of the SetFileIoOverlappedRange API. This means I/O completion can be handled by any thread, which avoids an asynchronous procedure call and the associated context switch and locking in the kernel. SetFileCompletionNotificationModes is used to avoid unnecessary completion notifications. Finally, disk space is preallocated via SetEndOfFile and SetFileValidData. Without the latter, all writes that extend a file are forced to complete synchronously, which prevents overlapping I/O with computation (e.g. checksums) [66]. To avoid exposing previous disk contents, we deny read sharing when opening files.

² We observed thread suspend/resume operations and found that the functions fail when applied to files opened for Windows asynchronous I/O.

Having gone to great lengths to ensure an efficient implementation of the POSIX aio interface, the application logic is comparatively simple. A ring buffer holds aiocb control blocks. Block I/Os are issued up to a default maximum queue depth of 32. We use aio_suspend to wait until the next I/O is complete and then invoke a user-specified callback (specified as a C++ function object template to avoid call overhead). The loop terminates when all block I/Os have completed. The Windows alignment requirements (similar considerations apply when using the equivalent Linux/BSD O_DIRECT functionality) are satisfied by the memory allocator, which also expands block buffers to a multiple of the sector size. After writing, we trim any excess padding at the end of the file by calling truncate.
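The interplay of block size, sector alignment and final truncation can be sketched with a small planning function. The names BlockRequest and PlanBlocks are hypothetical and no actual I/O is performed; the sketch only computes which sector-aligned requests would be issued for a file of a given size, assuming the block size is itself a multiple of the sector size.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one block transfer.
struct BlockRequest {
    std::uint64_t offset;  // file offset in bytes, multiple of the sector size
    std::uint64_t size;    // transfer size in bytes, multiple of the sector size
};

// Rounds `n` up to the next multiple of `multiple`.
constexpr std::uint64_t RoundUp(std::uint64_t n, std::uint64_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Splits a file of `file_size` bytes into requests of at most `block_size`
// bytes. The last request is padded to a whole number of sectors; the caller
// would afterwards truncate the file back to `file_size`.
std::vector<BlockRequest> PlanBlocks(std::uint64_t file_size, std::uint64_t block_size,
                                     std::uint64_t sector_size) {
    std::vector<BlockRequest> requests;
    const std::uint64_t padded_size = RoundUp(file_size, sector_size);
    for (std::uint64_t offset = 0; offset < padded_size; offset += block_size) {
        const std::uint64_t remaining = padded_size - offset;
        requests.push_back({offset, remaining < block_size ? remaining : block_size});
    }
    return requests;
}
```

For a 300000-byte file with 128 KiB blocks and 4 KiB sectors, this yields two full blocks and a final request of 40960 bytes, leaving 3104 bytes of padding to be trimmed after writing.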

Throughput

To determine the effectiveness of our implementation techniques, we compare the resulting throughput to the output of the ATTO and CrystalDisk 3.0.1 x64 benchmarks. Note that ATTO only allows a queue depth of 10, which may limit performance. CrystalDisk is run in sequential mode with 500 MB blocks, because it cannot match the 256 MB used by both other programs. Our waio (POSIX aio for Windows) implementation and ATTO are configured for the 128 KiB block size established in Section 3.2. To ensure this value is not specific to a particular system configuration, we use different hardware for these tests: dual X5690 CPUs running Windows 7 x64 with a Hitachi HDS721010CLA HDD and Samsung PM810 SSD. Note that ATTO and waio write zero-valued data, whereas CrystalDisk defaults to random-valued data. Disk controllers based on SandForce chipsets improve read and write performance for repetitive data by means of compression [67]. However, to the best of our knowledge, the C300's 88SS9174-BJP2 and PM810's S3C29MAX controllers do not include such an optimization.

As seen in Table 3.3, our waio outperforms both benchmarks in all respects. Despite the straightforward nature of sequential I/O and previous efforts to maximize write throughput, we have improved it by 4%. Measurements of ATTO's memory usage indicate block buffers are being reused, whereas our implementation reads the entire file into memory, which is more expensive. However, waio's reads still turn out to be faster.

Table 3.3: Read and write throughputs [MB/s] reported by our implementation and the ATTO and CrystalDisk benchmarks on a PM810 SSD and HDS721010CLA HDD.

    Benchmark     HD write   HD read   SSD write   SSD read
    CrystalDisk     145.00    146.00      233.70     241.20
    ATTO            144.89    143.34      250.58     255.98
    waio            151.35    146.07      252.75     256.73

3.3 File Format

With the in-memory image representation and I/O method established, we may now decide upon the format of the files to read/write. A multitude of image file formats have been devised. However, our applications and large amounts of data impose exacting requirements, including minimal conversion overhead, support for flexible pixel formats, compression, tiling, image pyramids³ and relevant metadata⁴. Let us briefly review a selection of existing formats and evaluate them in light of these requirements:

PM is a simplistic format that only specifies one or more planes of band-interleaved pixels without any additional features [68]. Application-specific metadata could be stored in the free-form comment field, but we would prefer a standardized approach.

OpenEXR is a newer format for High Dynamic Range (HDR) images that unfortunately lacks support for 8 or 16-bit integers [69].

HFA/IGE are the feature-rich internal file formats of the ERDAS IMAGINE framework for geospatial image processing [70]. However, the HFA format is quite complex and somewhat inefficient (cf. Section 3.4).

NITF is a standardized interchange format that is even more complex than HFA, but limited to 10 GB and lacking support for embedded image pyramids. Note that NSIF (NATO Secondary Image Format) corresponds to NITF with a different version field in the header [71].

BigTIFF expands the well-known TIFF format to 64-bit offsets [72], but inherits its major disadvantage of allowing non-native byte orders and non-tiled pixel formats, which would require expensive conversion when loading.

³ A series of successively spatially subsampled versions of the image, also known as mipmaps. Subsequent to the base (the original image), each level typically halves the resolution. A viewer can reduce the overhead of minifying many image pixels to few screen pixels by interpolating between the two levels whose resolutions are closest to the desired zoom scale.
⁴ Literally data about data, here understood to be additional information about the image such as its geographic location.

Unfortunately, each of these formats is either prone to inefficiency, or lacks some of the required features. We have devised a flexible new format designed with knowledge of low-level details such as SIMD vector and disk sector alignment requirements. It provides support for tiled pyramids ordered according to a novel space-filling curve, the new lossless compression scheme described in Chapter 4, and user-defined metadata. Details are given in Appendix B.3. However, we recognize the value of interoperability and wish to support existing applications and viewers, particularly ERDAS IMAGINE. We therefore provide fast methods for writing NITF and IGE files. The key enabling factor of their high performance is assembling the file in memory and writing it to disk in large chunks. Avoiding unnecessary copying of the data and additional allocations (e.g. for headers) also saves time.

3.4 Performance

Let us now study the real-world performance attained by the methods described in this chapter. We compare the total time required to write NITF and IGE images with our software and the ubiquitous Geospatial Data Abstraction Library (GDAL), version 1.7.3.

To avoid favoring a particular tile size, we generate images with random dimensions in the interval from 2^i to 2^(i+1) for 10 ≤ i < 15. The resulting values are given in Table 3.4. Table 3.5 compares the relative costs of our NITF and IGE codecs vs. GDAL. The current balance of CPU performance and disk throughput means writing NITF images takes about 5–25% longer because pixels must be

Table 3.4: Randomly chosen image dimensions [pixels] for the write throughput test.

HeightidthWDataset103103914031752917
3288921084107505251
35919608244

Table 3.5: Normalized cost of the formats. Elapsed times for NITF and IGE are divided by the I/O time, GDAL measurements are relative to our implementation.

    Drive   Dataset   NITF    IGE   GDAL NITF   GDAL HFA
    HD            0   1.62   2.61        3.97       3.84
    HD            1   1.12   1.36        5.55       5.82
    HD            2   1.05   1.47        5.34       5.06
    HD            3   1.07   1.41        5.44       5.42
    HD            4   1.12   1.49        5.90       3.20
    SSD           0   1.42   2.50        4.31       5.19
    SSD           1   1.15   1.38       11.99       7.53
    SSD           2   1.24   1.45        6.88       7.45
    SSD           3   1.22   1.55        8.26       7.40
    SSD           4   1.20   1.35        7.53       4.04

reshuffled into a tiled layout⁵. The relative cost of this computation is higher on the smallest dataset because less time is required for I/O (possibly due to caching in the disk controller). Our IGE writer performs much more work: computing and storing an image pyramid as well as statistics (standard deviation, minimum, maximum, mean, median, mode and histogram of each band's values). This only requires 35–50% more time than I/O due to our efficient vectorized and parallelized implementation. However, the overhead appears particularly large on the smallest image because the cost of writing the extra metadata file is not amortized. Our NITF implementation is roughly five times as fast as GDAL's when writing to the HDD, and up to 12 times as fast on the SSD (whose higher throughput increases the relative cost of GDAL's less efficient pixel copying). Our IGE writer is only about 5 times as fast as GDAL on the HDD and 7 times as fast on the SSD because GDAL does not compute image statistics. For reasons unknown, GDAL's throughput increases on the largest (3.8 GB) image. The width is a multiple of 32, but a block size of 64 is used. Figure 3.3 shows the speedups vs. GDAL. Although mere constant factors, we believe a 3 to 12-fold improvement to be of major practical relevance.

⁵ Our normative reference for NITF is NATO Standardization Agreement 4545, which requires NSIF images with a dimension exceeding 8192 pixels to be split into tiles. We use a fixed tile dimension of 256.

3.5 Conclusion

This chapter has described a technique for asynchronous I/O that avoids various inefficiencies at the hardware/operating system level, thereby outperforming existing benchmarks by 4%. We build upon this foundation with efficient routines for writing common image file formats. The result is a 3 to 12-fold speedup vs. the well-established GDAL library. Finally, the aligned image layout discussed herein serves to avoid penalties when accessing individual rows via SIMD instructions, thus enabling the high performance of the subsequent modules.

[Bar plot; x-axis: Dataset; y-axis: Speedup vs. GDAL (0 to 12); series: NITF (HD), NITF (SSD), IGE (HD), IGE (SSD)]

Figure 3.3: Speedup of our writers vs. GDAL.

Chapter 4

Lossless Asymmetric SIMD Compression

This chapter introduces a new lossless asymmetric SIMD codec (LASC) designed for extremely efficient decompression of large satellite images. A throughput in excess of 3 GB/s allows decompression to proceed in parallel with asynchronous transfers from fast block devices such as disk arrays. This is made possible by a simple and fast SIMD entropy coder that removes leading null bits. Our main contribution is a new approach for vectorized prediction and encoding. Unlike previous approaches that treat the entropy coder as a black box, we account for its properties in the design of the predictor. The resulting compressed stream is 1.2 to 1.5 times as large as JPEG-2000, but can be decompressed 100 times as quickly, even faster than copying uncompressed data in memory. Applications include streaming decompression for out of core visualization. To the best of our knowledge, this is the first entirely vectorized algorithm for lossless compression.

This chapter has been published in the Software: Practice and Experience journal [73] and is reproduced here with minor formatting and wording clarifications.

4.1 Introduction and Related Work

Displaying images that are too large to fit within main memory necessitates streaming, that is, loading sections of the data from a slower storage medium when they are needed. For interactive performance, it is important to minimize the latency of these requests. Asynchronous I/O allows computation to proceed while waiting on the storage medium. However, panning a 2560×1600 pixel viewport such that 10% of the 16-bit, four component pixels are updated every 16 ms requires a sustained throughput of 196 MB/s, which exceeds the capability of current magnetic media [74]. Such data rates are enabled by drive arrays and top of the line solid-state disks, but these are not always available. Instead, a common remedy involves compression of the data. In contrast to the entertainment sector, some medical and automated image analysis applications cannot tolerate any loss of information.
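The throughput figure quoted above can be reproduced with a short calculation. The sketch assumes that the 16 ms interval stands for a 60 Hz refresh rate and that MB denotes 10^6 bytes; both are interpretations rather than statements from the text, and the function name is hypothetical. With an interval of exactly 16 ms, the same calculation would give about 205 MB/s.

```cpp
#include <cstdint>

// Bytes per second required when `percent_updated` percent of the viewport's
// pixels are replaced in every frame.
constexpr std::uint64_t RequiredBytesPerSecond(std::uint64_t width, std::uint64_t height,
                                               std::uint64_t bands,
                                               std::uint64_t bytes_per_sample,
                                               std::uint64_t percent_updated,
                                               std::uint64_t frames_per_second) {
    return width * height * percent_updated / 100 * bands * bytes_per_sample *
           frames_per_second;
}
```

For the viewport in the text, RequiredBytesPerSecond(2560, 1600, 4, 2, 10, 60) evaluates to 196608000 bytes per second: 409600 updated pixels of 8 bytes each, 60 times per second.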

Lossless Image Compression

By 1993, a general framework for lossless image compression had been established that is still useful today. The intensity of the next pixel to encode is predicted using a context of previously seen pixels. The resulting residuals, that is, prediction errors, are relayed to a statistical coder that may act upon knowledge of their distribution [75]. These components are all interdependent; we briefly discuss them in increasing order of complexity. In most cases, the simple and intuitive raster scan order is used. Surprisingly, the order induced by a Hilbert space-filling curve can increase the residuals' entropy [76], and the rain scan order only yields a 4% improvement [77]. The circular dependency between prediction and coding is often resolved by assuming that prediction errors follow a Laplacian distribution [78], for which a variant of Golomb coding is optimal [79]. With the entropy coder thus established, most efforts have been directed at prediction: using larger contexts [80], combining various predictors [77] or minimizing the squared or absolute prediction error [81]. However, this does not necessarily result in optimal compressed sizes [82], and conventional entropy coders are too slow for our application. A highly-optimized implementation of Rice's independently discovered subset of Golomb codes only decodes 200 MIntegers/s [83]. Prior work on reducing branches in a Huffman decoder reached 90.95 MPixel/s (including a fast DCT) [84]. However, this algorithm is not well-suited for acceleration via GPU, which only manages 570–750 MB/s [85]. Note that Huffman codes are equivalent to a restricted case of arithmetic coding [86], so the latter cannot be expected to be faster. Dictionary-based approaches are neither significantly better in terms of performance [87], nor are they ideally suited for this task because residuals are not drawn from a small alphabet.

Entropy Coding

Having ruled out conventional entropy coders, we must consider alternatives. Variable-length codes are generally inefficient to decode because of their bit-level accesses, and even table-based approaches are not much faster [88]. We therefore turn to fixed-length codes. One interesting approach involves packets of compressed fields and a selector indicating their length [89]. Recently, a similar scheme using 64-bit words with support for values spanning multiple packets was also proposed [90]. These are faster than variable-length codes and improve upon the compression of byte-aligned codes, but suffer from several drawbacks. Extracting the fields still requires bit arithmetic. The varying number of output values per packet complicates single instruction multiple data (SIMD) writes. A single large residual increases the size of all fields in the packet. The latter issue can be addressed by storing exceptions, that is, a list of values to overwrite after decompression and their locations [91]. However, this is unlikely to be useful for 16-bit values because the reduction in size for small packets is roughly equal to the encoded size of an exception. The main aspect of the previously cited work is optimization for superscalar processors that can execute more than one instruction per clock cycle. Whereas this enables a throughput of 1 GB/s, we believe the key to fully utilizing modern CPUs lies in SIMD. Recently, two such schemes for compression by omitting the most-significant zero-valued bits (null suppression [92]) have been introduced. The first [93] uses multiplication and complex alignment logic for SIMD extraction of variable-length fields, which restricts it to 32-bit values due to limitations in the instruction set. The second approach [94] relies on a new instruction for permuting bytes, which requires relatively large lookup tables and is unable to compress fields to less than 8 bits. In Section 4.2, we describe a surprisingly simple but faster alternative that is also suitable for 16-bit pixels and requires no additional memory.

Asymmetric Compression

Our primary focus is on decompression speed, which must match the throughput of high-end solid-state disks. We are willing to accept an asymmetric coder/decoder (codec) that spends more time on compression, because large datasets usually require considerable time to generate anyway. Ideally, the offline encoder would choose the best predictor for each pixel. Despite potentially reducing the encoded size of the prediction errors, the savings are unlikely to exceed the cost of transmitting so much additional information to the decoder. This overhead can be greatly reduced by quantizing predictor vectors to a codebook of frequently used entries [82]. The high computational cost of this method can be reduced by predicting entire 2-D blocks of pixels, similar in principle to video motion compensation. A recent approach employs a brute-force search for matching blocks [95]. The compression time is reduced by resorting to CALIC's prediction of individual pixels [96] in smooth image regions. However, even a simple function of neighboring pixels is relatively costly for the decoder to compute. We propose to eliminate this step entirely and rely upon efficient SIMD matching in a sliding window to maintain acceptable compression throughput. To further speed up the algorithm, we deal with 1-D tuples (as many pixels as will fit in a SIMD register) instead of 4×4 blocks. In contrast to previous approaches, the predictor is designed with full knowledge of the subsequent entropy coder. Section 4.3 introduces our new algorithm, which we believe to be the first SIMD sliding window compressor. The result is a twofold reduction in image size with decompression that outperforms a state-of-the-art integer coder [94].

4.2 Fast SIMD Integer Packing

Let us define 'packing' as reducing an n bit two's complement representation of a value in $[-2^{m-1}, 2^{m-1})$ to m bits, as shown in Figure 4.1.

FFF8 0000 FFFF 0002 0007 FFF9

80F279

Figure 4.1: Hexadecimal representation of six n = 16 bit values, each packed into m = 4 bits by omitting the 12 most significant bits because they carry no information.

This section addresses the question of how to pack (and conversely unpack) tuples of values as quickly as possible using the ubiquitous SSE2 instruction set [97]. In fact, our terminology derives from its mnemonics, which include PACK instructions from n ∈ {16, 32} to m = n/2 and UNPCK instructions that interleave m bit values for purposes of sign- or zero-extension. With their aid, two- and fourfold packing/unpacking of 32-bit values is straightforward. The latency of two back-to-back pack/unpack instructions is higher than a single PSHUFB universal shuffle, but the more recent SSE4.1 instruction set provides for sign-extending 8-bit values to 16 or 32 bits via PMOVSX. Both methods avoid the need for loading shuffle control masks from memory, and more importantly, allow m < 8. For example, we can unpack from m = 4 to n = 16 as expressed by the following intrinsics¹:

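    typedef __m128i V;
    // Duplicate each input byte into both halves of a 16-bit lane.
    V hi_lo16 = _mm_unpacklo_epi8(in, in);
    // Move the lower nibble of each byte to the top of its lane.
    V lo16 = _mm_slli_epi16(hi_lo16, 4);
    // Interleave lanes so that lower and upper nibbles alternate.
    V left16 = _mm_unpacklo_epi16(lo16, hi_lo16);
    // Arithmetic shift: sign-extend the top nibble of each lane.
    return _mm_srai_epi16(left16, 12);

The final arithmetic right shift sign-extends the values to 16 bits. Packing from n = 16 to m = 4 is somewhat more involved: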

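    typedef Iu16vec8 V;
    V zero = _mm_setzero_si128();
    // Saturate to 8 bits; each 16-bit lane now holds two adjacent values.
    V values8 = _mm_packs_epi16(values, zero);
    // Merge the lower nibbles of both bytes into the low byte of the lane.
    V hi = (values8 & _mm_set1_epi16(0x0F00)) >> 4;
    V lo = (values8 & _mm_set1_epi16(0x000F));
    return _mm_packus_epi16(hi | lo, zero);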

The latter code uses the more convenient notation afforded by C++ vector classes with operator overloading. Similar functions for packing/unpacking of other data types are expressed as template specializations so that their caller can simply invoke, for example, Pack2x without any additional type dispatching.

¹ Functions built into three major C++ compilers (GCC, Intel and Microsoft) that generate SIMD instructions while relieving the programmer of instruction scheduling and register allocation.
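To make the fourfold packing concrete, the following is a scalar reference sketch, not the SIMD implementation described above. The function names `Pack4x` and `Unpack4x` are hypothetical, and the nibble order (even-indexed value in the low nibble) is an assumption chosen to agree with the intrinsics shown in this section. Sign extension is written as `(nibble ^ 8) - 8`, which yields the same result as the arithmetic right shift used by the SIMD code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs pairs of 16-bit values from [-8, 8) into one byte each.
// Assumption: the even-indexed value occupies the low nibble.
std::vector<std::uint8_t> Pack4x(const std::vector<std::int16_t>& values) {
  assert(values.size() % 2 == 0);
  std::vector<std::uint8_t> packed;
  for (std::size_t i = 0; i < values.size(); i += 2) {
    assert(values[i] >= -8 && values[i] < 8);
    assert(values[i + 1] >= -8 && values[i + 1] < 8);
    // Keep only the 4 least significant bits of the two's complement value.
    const unsigned lo = static_cast<unsigned>(values[i]) & 0xFu;
    const unsigned hi = static_cast<unsigned>(values[i + 1]) & 0xFu;
    packed.push_back(static_cast<std::uint8_t>((hi << 4) | lo));
  }
  return packed;
}

// Restores the 16-bit values by sign-extending each nibble.
std::vector<std::int16_t> Unpack4x(const std::vector<std::uint8_t>& packed) {
  std::vector<std::int16_t> values;
  for (std::uint8_t byte : packed) {
    const int nibbles[2] = {byte & 0xF, byte >> 4};
    for (int nibble : nibbles) {
      // Flipping bit 3 and subtracting 8 maps 8..15 to -8..-1.
      values.push_back(static_cast<std::int16_t>((nibble ^ 8) - 8));
    }
  }
  return values;
}
```

Applied to the six values of Figure 4.1 (-8, 0, -1, 2, 7, -7), the sketch produces three bytes and unpacking restores the input exactly.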


4.3 SIMD Sliding-Window Compression

We now derive a codec designed for an extremely high decompression throughput while retaining a reasonable compression ratio. Our chief interest lies in compressing high-resolution satellite images, which typically consist of 4 or 8 spectral bands of 16-bit samples. The bands are often interleaved by pixel, for example, Blue0 Green0 Red0 NIR0, Blue1 Green1 Red1 NIR1, ···, where NIR is the near-infrared spectral band. Because inter-band correlation is weaker than spatial correlation [98], and interleaved pixels can more readily be displayed by graphics hardware, we avoid converting to a planar representation. The raw data is not amenable to null suppression, so we combine the previously introduced entropy coder with a predictor. Making full use of the transistors in modern CPUs requires SIMD processing. However, even comparatively simple predictors such as LOCO-I [99] are not suitable in this regard because they access multiple (unaligned) neighbors. We instead predict a tuple of values from a single previous (aligned) tuple. This is effective at reducing spatial redundancy, but assumes that the number of bands evenly divides the SIMD width. Images obtained via synthetic aperture radar, laser scanners and current high-resolution imaging satellites meet this requirement. Otherwise, prediction would rely on the weaker inter-band correlation. It is too expensive to encode an offset for each tuple, so we combine them into larger units called blocks (not to be confused with the 2-D blocks in [95]). Because these are always accessed as a unit, we define them to match the L1 cache line size.

What we have, thus far, is a block of values and a previous block as a frame of reference. In contrast to PFOR [91], each component of a pixel has its own reference value. How, then, is the reference block to be chosen? It is here that we tailor the predictor to fit the entropy coder. Because our maximum packing of n = 16 bit prediction errors allows for the m = 4 bit interval [−8, 8), it would be misleading to minimize the sum of absolute prediction errors as in [95]. Instead, we define the goal function as the actual packed size of the block for a given choice of reference block. This directly minimizes the compressed size instead of just assuming the entropy coder will handle small prediction errors efficiently. The packed size is computed by checking whether all n bit values in a tuple can be packed into m = n/2 or n/4 bits (that is, whether each value plus $2^{m-1}$ is zero when shifted right by m bits). This biased representation of signed numbers allows hardware-assisted decoding via arithmetic right shifts, unlike the sign-in-the-lowest-bit encoding [100]. Our search for the reference block yielding the smallest packed size breaks ties in favor of the most recent block, which is less likely to have been evicted from the decoder's cache. To further improve temporal locality, we restrict the search to a sliding window of the previous outputs. Note the resulting similarity to the Lempel-Ziv family of adaptive dictionary coders, with the distinction that our matches are fixed-length (helpful for SIMD) and approximate (due to the properties of the subsequent entropy coder). Larger sliding windows allow additional matches but decrease encode throughput. We will examine this trade-off in Section 4.4.
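The goal function and tie-breaking rule described above can be sketched in scalar code as follows. This is an illustration under stated assumptions rather than the implementation measured in this chapter: the names `Fits`, `PackedBlockBytes`, `BestReference` and `kNoMatch` are hypothetical, a block is taken to be four tuples of eight 16-bit values (64 bytes), residuals are computed without 16-bit wrap-around, and the parity restriction on selectors introduced later in this section is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTupleValues = 8;   // 16-bit values per 128-bit tuple
constexpr std::size_t kBlockValues = 32;  // four tuples = 64 bytes
constexpr std::size_t kNoMatch = static_cast<std::size_t>(-1);

// True if v lies in [-2^(m-1), 2^(m-1)): after adding the bias 2^(m-1),
// no bits may remain once the sum is shifted right by m bits.
bool Fits(int v, int m) {
  const std::uint32_t biased = static_cast<std::uint32_t>(v + (1 << (m - 1)));
  return (biased >> m) == 0;
}

// Packed size in bytes of `cur` predicted from `ref`, or kNoMatch if some
// residual does not fit in 8 bits (the block would be stored uncompressed).
std::size_t PackedBlockBytes(const std::int16_t* cur, const std::int16_t* ref) {
  assert(cur != nullptr && ref != nullptr);
  std::size_t bytes = 0;
  for (std::size_t t = 0; t < kBlockValues; t += kTupleValues) {
    bool fits4 = true, fits8 = true;
    for (std::size_t i = t; i < t + kTupleValues; ++i) {
      const int residual = cur[i] - ref[i];
      fits4 = fits4 && Fits(residual, 4);
      fits8 = fits8 && Fits(residual, 8);
    }
    if (!fits8) return kNoMatch;
    bytes += fits4 ? 4 : 8;  // eight residuals of 4 or 8 bits each
  }
  return bytes;
}

// Index of the best reference in `window` (0 = oldest), or -1 if none fits.
// Scanning from newest to oldest with a strict comparison favors recent blocks.
int BestReference(const std::vector<const std::int16_t*>& window,
                  const std::int16_t* cur) {
  int best = -1;
  std::size_t bestBytes = kNoMatch;
  for (int i = static_cast<int>(window.size()) - 1; i >= 0; --i) {
    const std::size_t bytes = PackedBlockBytes(cur, window[i]);
    if (bytes < bestBytes) {
      bestBytes = bytes;
      best = i;
    }
  }
  return best;
}
```

The exhaustive scan is linear in the window size, which is consistent with the observation that larger windows lower the encode throughput.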
It remains to be seen how the decoder is notified of the tuples' packing. To maintain the word-alignment of the encoded stream, which avoids microarchitecture-specific penalties², we combine several blocks into a group described by a word-sized header. However, binary encodings of three values (two- or fourfold packing and uncompressed) are wasteful or slow. Because it is rare to encounter a block for which no similar blocks exist, we require all tuples within such blocks to be stored uncompressed. This is communicated by an illegal value (0) for the reference block's offset. Otherwise, a bit field indicates which tuples in a block are packed by a factor of four. To reduce the number of conditional branches and also avoid misalignment, we disallow combinations with odd parity (that is, an odd number of bits with the value one). The encoder maps the bit field to a 4-bit selector indicating the method for unpacking an entire block. This makes decoding blocks extremely efficient, because only one indirect branch, 2–4 word-aligned memory accesses and 8–16 instructions are required. The selector is stored in the lower bits of the 16-bit reference offset, which are zero because blocks are naturally aligned (residing at addresses that are a multiple of their size). Our implementation currently provides for the selectors listed in Table 4.1. For example, selector 4 indicates the first and fourth tuples in a block are packed fourfold, whereas the second and third are packed twofold.

Table 4.1: Selectors are a convenient representation of a bit field indicating whether each of the four tuples in a block is packed fourfold. Our implementation allows the following values:

Selector  Meaning
0         isPacked4x = 0000
1         isPacked4x = 0011
2         isPacked4x = 0101
3         isPacked4x = 0110
4         isPacked4x = 1001
5         isPacked4x = 1010
6         isPacked4x = 1100
7         isPacked4x = 1111
8         Block residuals are 0 and not stored in stream
9         Stream holds an uncompressed block

² Unaligned memory accesses that straddle a cache line or page boundary may incur significant delays depending on the CPU. For example, the Intel Core 2 appears to bypass the L1 cache and TLB in such cases (cf. Section B.2).
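As a small illustration of the header layout described above, the sketch below splits a 16-bit header entry into its offset and selector and maps a bit field to its selector according to Table 4.1. The names `Match`, `DecodeMatch` and `SelectorFromBitField` are hypothetical; the masks merely follow from the statement that the selector occupies the four lowest bits of the offset.

```cpp
#include <cassert>
#include <cstdint>

struct Match {
  unsigned offset;    // backwards distance in bytes (low four bits are zero)
  unsigned selector;  // unpacking method, see Table 4.1
};

// Splits a 16-bit header entry into its two components.
Match DecodeMatch(std::uint16_t header) {
  return Match{header & 0xFFF0u, header & 0x000Fu};
}

// Bit fields with an even number of one bits, indexed by selector 0..7.
const unsigned kIsPacked4x[8] = {0x0, 0x3, 0x5, 0x6, 0x9, 0xA, 0xC, 0xF};

// Returns the selector for a bit field, or -1 for the disallowed
// combinations with odd parity.
int SelectorFromBitField(unsigned isPacked4x) {
  assert(isPacked4x < 16);
  for (int selector = 0; selector < 8; ++selector) {
    if (kIsPacked4x[selector] == isPacked4x) return selector;
  }
  return -1;
}
```

For instance, the header entry 0x0047 decodes to a backwards offset of 64 bytes with selector 7, meaning all four tuples are packed fourfold.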
A final extension simplifies decoding while waiting for the next asynchronous I/O to complete. Combining groups into 'chunks' that fit within an I/O request guarantees each group can be decoded without any bounds checking or copying. The decoder requires an indication of where the chunk ends, for which we prepend its compressed size to the stream. Note that this does not consume any additional space, per the following argument. The first block is always stored uncompressed because there are no preceding blocks to serve as reference values. We copy uncompressed blocks via SIMD instructions that require the operands to be naturally aligned. The group header introduces an 8-byte misalignment and is normally followed by 8 bytes of padding. However, we can use this space within the first block of every chunk to store the compressed size. To clarify the operation of the codec, Figure 4.2 shows an annotated compressed representation of a four band, 16-bit synthetic gradient image in which band i ∈ [1, 4] of pixel n ∈ [0, 32) is 1000 × i − n.

0000000000000080    Compressed size = 128
0047004700470009    Group: 4 × 16-bit offset + selector
0FA00BB807D003E8
0F9F0BB707CF03E7
0F9E0BB607CE03E6    Block 1 of 4:
0F9D0BB507CD03E5    header[0] ⇒ offset 0 + selector 9
0F9C0BB407CC03E4    offset 0 ⇒ no reference block
0F9B0BB307CB03E3    selector 9 ⇒ uncompressed data
0F9A0BB207CA03E2    (64 bytes; address ≡ 0 (mod 16))
0F990BB107C903E1
8888888888888888    Block 2: offset (-)64 + selector 7
8888888888888888    32 × 4-bit residuals -8
8888888888888888    Block 3 of 4: same as Block 2
8888888888888888    (residuals relative to prev. block)
8888888888888888    Block 4 of 4: same as Block 3
8888888888888888    (selector, offset from header[3])

Figure 4.2: Annotated encoding of a 256 byte gradient image. The 16 hexadecimal digits on each line represent 8 bytes stored in little-endian format.

To summarize, the encoded stream is organized according to the following Extended Backus-Naur Form grammar:

    Stream = {Chunk}-;
    Chunk = CompressedSize, {Group}-, ChunkPadding;
    CompressedSize = {Bit}*64;
    ChunkPadding = {{Bit}*8} (* < 128 KiB *);
    Group = GroupHeader, {PackedBlock}*4;
    GroupHeader = {Match}*4 (* offsets are multiples of 16 bytes *);
    Match = Offset, Selector (* added together *);
    Offset = {Bit}*16 (* backwards distance *);
    Selector = {Bit}*4 (* see Table 4.1 *);
    PackedBlock = {PackedTuple}*4 (* omitted if selector = 8 *)
    PackedTuple = packed 1x | 2x | 4x;
    Bit = unsigned integer bit;

4.4 Measurements

This section presents measurements of the speed and compression ratio of our new algorithm for purposes of comparison with existing approaches.

Hardware and Software

The test platform consists of dual W5580 CPUs (3.2 GHz) running Windows XP x64, 48 GiB DDR3-1066 memory and an 80 GB FusionIO card. Our implementation is compiled with ICC 12.0.1.096 /Ox /Ob2 /Oi /Ot /GA /GR- /GS- /Gy /EHsc /MD /Qipo /QxSSE4.1 /Qopenmp /Qstd=c++0x. We use lossless JPEG-2000 and Lempel-Ziv Markov chain compression as a basis for comparison. The former is provided by GeoJasper 1.3.1 [101], compiled with nearly identical settings (our algorithm is not influenced by string merging nor floating-point arithmetic, but we enable /GF /fp:fast=2 for GeoJasper while omitting /Qopenmp /Qstd=c++0x, because it does not use those features). LZMA is represented by the public 64-bit release of 7-Zip, version 9.2 [102]. Both of these algorithms are run with their default parameters.

Datasets

The codec is primarily intended for compression of images with four 16-bit bands. We arbitrarily chose four pan-sharpened [103] satellite datasets and extracted subsets of increasing size based on the interesting areas in the image. Each contains a mix of urban and natural terrain (Figure 4.3). A 16-bit panchromatic Quickbird image of Frankfurt, Germany, is also included. Because 8-bit and/or RGB images are in widespread use, we implement a preliminary test that zero-expands the pixels to 16-bit, adds a fourth component, and then applies the same codec. Searching for large, publicly available images, we found two 8-bit grayscale lunar mosaics [104, 105], two large images (hs-1999-14-b and hs-2004-52-a) from the Hubble spacecraft, two mosaics of the Stanford Memorial Church [106], and the PIA13804 panorama from the Mars Phoenix lander. Their dimensions and format are listed in Table 4.2.

Figure 4.3: Screen capture of 16-bit, 4 channel subsets of pan-sharpened Quickbird images. Clockwise from top left: Wangen (Switzerland) and Neureut, Dorsten, Ettlingen (Germany). Copyright DigitalGlobe Incorporated.

Table 4.2: Test images and their abbreviated identifiers, dimensions, number of bands and bit depth.

Dataset  ID  Width  Height  Bands  Bits
QBQBWNeurangeneutQNQW3273527421230123441616
QBEttlingenQE58083692416
QBQBFrankfurtDorstenQFQD107232336106029520141616
LunarMosaic1LM1110001100018
Hubble1LunarMosaic2H1LM24241890002246240001188
MemChuNightHubble2MCNH2114136184378134563388
MemChuMC16965823038
PIA13804P126180618038

Throughput

After loading the pixels in 256×256 band-interleaved tiles by means of the GDAL library [107], we measured the in-memory encode and decode throughputs on a single CPU core (Table 4.3). The latter varies between 2600 and 3000 MB/s, which exceeds

Table 4.3: Single-threaded encode/decode throughput for tiled images.

MB/sDecodeMB/sEncodeDatasetQBQBWNeurangeneut192.41230.6122701.95650.22
750.072191.40EttlingenQBQBQBFrankfurtDorsten207.13165.1122674.20828.23
995.022198.68LunarMosaic1Hubble1LunarMosaic2241.77194.1122953.05689.83
MemChuNightHubble2171.29168.7433044.30033.28
MemChuPIA13804165.95189.6633070.56132.61

our design goal of keeping up with a 16-drive array (1 GB/s) and Fusion-io Duo (1.4 GB/s). Decompression is 13 to 18 times as fast as compression, underscoring the asymmetric nature of the algorithm. Both compression and decompression throughput increase when the image contains more homogeneous regions.

For a fair comparison with the times reported by GeoJasper's -verbose mode, we also write and read the encoded data to/from disk. Decoding is overlapped with asynchronous reads. The resulting elapsed times and speedups vs. GeoJasper are shown in Table 4.4. LASC compression is 13 to 20 times as fast as JPEG-2000 on the four-band datasets, and decompression is more than 100 times as fast. Because the 7-Zip executable lacks instrumentation, we record its total execution time and therefore also include I/O in the LASC timings. Tiles are read from image files by means of the GDAL library, which is not optimized for speed and falls far short of the

Table 4.4: Elapsed times [s] for compressing data from memory and decompressing from file and the speedup vs. GeoJasper (GJ).

Dataset  Encode+I/O  vs. GJ  I/O+Decode  vs. GJ
QBQBWNeurangeneut0.4720.11720.414.80.0180.062114.6100.0
QBEttlingen1.05819.40.166111.5
QBQBFrankfurtDorsten1.2212.38516.513.30.1990.27488.1103.4
69.60.1368.31.294LunarMosaic1Hubble1LunarMosaic20.1096.5599.313.00.0130.88563.785.9
MemChuNightHubble24.2860.8425.08.20.4240.08346.472.3
PIA13804MemChu8.6746.4786.25.30.8830.69552.445.4

disk throughput. However, as shown in Table 4.5, LASC compression is still between 33 and 72 times as fast as 7-Zip on the satellite data. There is less of a speedup on the other datasets because we expanded them to 16-bit and/or four bands. However, for reasons unknown, 7-Zip is also surprisingly efficient on the Hubble and MemChu datasets. LASC decompression is 15 to 20 times as fast on the multispectral datasets. Note that the LZMA algorithm is partially parallelized, whereas the above LASC results are for a single core. This is important because 60% of surveyed PCs are single- or dual-core [108]. However, more cores might be available for compression, so we process tiles in parallel. This enables a throughput of 1212.46 MB/s on the Ettlingen dataset and 1122.01 MB/s on Dorsten. Because tiles are tightly packed within the output stream, each thread must encode into a temporary buffer and later copy it to the destination. This additional overhead explains why the eight cores only achieve a respective speedup of 6.3 and 6.8 over single-threaded compression. We have not implemented parallel

Table 4.5: Elapsed times [s] for compressing and decompressing files and the speedup vs. 7-Zip (7z).

Dataset  Encode+I/O  vs. 7z  I/O+Decode  vs. 7z
QBQBWNeurangeneut0.9350.21133.435.80.1510.04817.420.8
QBEttlingen1.65449.50.39316.9
QBQBDorstenFrankfurt3.5211.55844.572.10.7310.48412.315.2
6.90.44129.42.362LunarMosaic1Hubble1LunarMosaic20.31111.2471.919.00.0442.3328.510.1
Hubble2MemChuNight7.2431.8059.11.91.2890.2514.37.4
PIA13804MemChu14.03811.23715.66.22.5732.1446.74.3

decompression because the single-core throughput already vastly exceeds the I/O bandwidth on our system.

Compression Ratio

Whereas the algorithm is certainly fast, its usefulness hinges on reasonable compression ratios. Table 4.6 lists the resulting sizes after compressing each image with the three contenders. The bar-plot representation of the compression ratios in Figure 4.4 puts these numbers in perspective. LASC is between 1.2 and 1.54 times as large as JPEG-2000 on the multispectral satellite images that were our primary focus. We believe these results are applicable to other images of the same pixel format, provided they possess a reasonable degree of spatial redundancy. Random images with uncorrelated pixels are, of course, incompressible. Our algorithm also appears suitable for compressing some grayscale images, even 8-bit, with results between 1.59 and 1.92 times as large as JPEG-2000. However,

[Figure 4.4: bar plot. Horizontal axis: Dataset (QW, QN, QE, QD, QF, LM1, LM2, H1, H2, MCN, MC, P1). Vertical axis: Compression Ratio. Series: JP2K, LZMA, LASC.]

Figure 4.4: Bar plot of JPEG-2000, LZMA and LASC compression ratios (compressed divided by original size) on all datasets, whose abbreviations are defined in Table 4.2.

Table 4.6: Compressed sizes [bytes] for lossless JPEG-2000, 7-Zip LZMA and LASC.

Dataset  JP2K  LZMA  LASC
QBQBWNeurangeneut227943404805146308394579604292339274520056256
QBEttlingen709590547873016186682152
QBQBDorstenFrankfurt1006327373387729910881755713880351102134217249560840
LunarMosaic1302770334197610058110352
Hubble1LunarMosaic2262263684137252030435527493163834174652357632096
Hubble2MemChuNight2159475279077177247487005595819217938893736272160
PIA13804MemChu16184150643453231226119358494586493413274182525168352

note that all of these images contain no-data regions in the corners, which results in space savings of 12 to 34.2% due to the additional all-zero-residual selector. The right half of the plot clearly shows the shortcomings of our preliminary approach that expands RGB to four components. It is actually surprising that compression was attained despite having expanded the original data by a factor of 2.6. The future work section proposes an approach for avoiding this overhead. As it is, the algorithm typically results in a two-fold reduction of multispectral data; grayscale images may be reduced by a factor between 1.4 and 2.4.

Further Experiments

Table 4.7 shows the increase in compressed size of the Neureut image for various tile dimensions. Given a 16 KiB sliding window, the largest tile size (512×512) allows access to 512×4 neighboring pixels, an imbalance that noticeably impacts compression. The smallest (64×64) tiles provide a 64×32 window, which is apparently too narrow to exploit much of the horizontal correlation in the image. To better understand these effects, we measured the distribution of match offsets (Figure 4.5) with a tile size of 256×256. The left and upper neighbors of the current block are the most commonly used. However, about 1/3 of the blocks are a closer match with other blocks on the same line, thus underscoring the importance of arbitrary offsets. Because previous lines are not referenced as often, we restrict the sliding window to 2 KiB (the size of a tile line). Each halving of the original 16 KiB size nearly doubled encode throughput while increasing size by about 0.7%.

Table 4.7: Increase in compressed size for various tile dimensions compared to the baseline of 256.

tileDim  Δsize
64       14.2%
128      0.2%
256      0.0%
512      10.0%
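The window shapes quoted above follow directly from the pixel size. As a brief worked example (helper names are hypothetical, and the pixel size of 8 bytes assumes the four 16-bit bands used in these experiments):

```cpp
#include <cassert>
#include <cstddef>

// Four bands of 16-bit samples per pixel.
const std::size_t kBytesPerPixel = 4 * 2;

// Number of pixels covered by a sliding window of the given size.
std::size_t WindowPixels(std::size_t windowBytes) {
  return windowBytes / kBytesPerPixel;
}

// Number of tile lines covered by the window for a square tile.
std::size_t WindowRows(std::size_t windowBytes, std::size_t tileDim) {
  assert(tileDim != 0);
  return WindowPixels(windowBytes) / tileDim;
}

// Size in bytes of one line of a tile.
std::size_t TileLineBytes(std::size_t tileDim) {
  return tileDim * kBytesPerPixel;
}
```

A 16 KiB window thus covers 2048 pixels, which is 4 lines of a 512-pixel-wide tile or 32 lines of a 64-pixel-wide tile, and one line of a 256-pixel-wide tile occupies exactly 2 KiB.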

[Figure 4.5: plot of Number of Matches (×10^4) versus Offset [KiB].]

Figure 4.5: Distribution of match offsets on the Neureut image. To preserve detail, we cut off the peaks of 31×10^4 and 38×10^4 at offsets 64 (previous block) and 2048 (previous line).

4.5 Conclusion

This chapter demonstrates the feasibility of lossless asymmetric SIMD compression (LASC). We propose a new entropy coder based on null suppression via PACK instructions. Despite its simplicity, this approach enables a higher throughput than two recently proposed SIMD integer codecs and is not limited to 32-bit data types. A novel predictor designed with full knowledge of the coder reduces the spatial and intra-band redundancy of band-interleaved pixels. We avoid intricate computation and accesses to multiple neighboring values, instead predicting entire tuples of values by means of component-wise subtraction from a previous tuple. The resulting decompressor is faster than copying the uncompressed data. In contrast to previous approaches that only minimize prediction errors, we use the actual compressed size as the goal function. This results in outputs 20 to 50% larger than lossless JPEG-2000, but two orders of magnitude faster to decompress. Whereas additional parallelization is possible, the single-core throughput of over 2600 MB/s is sufficient for streaming decompression from fast storage media such as Fusion-io solid state disks.

Future Work. Our LASC algorithm enables extremely fast compression and especially decompression, but many avenues for improving its compression remain to be explored. We currently avoid transmitting all-zero blocks, but extending this to individual tuples should improve compression of synthetic images, which often contain exact matches. Ideally, any combination of uncompressed, all-zero, two- and fourfold packed tuples would be allowed. Because $4^4$ selectors overly burden the CPU's indirect branch predictor, the encoder can indicate which subset is the most useful for a particular input dataset. A similar analysis of which reference block offsets are the most frequent could enable a smaller encoding of the matches, significantly speed up the compressor (by checking those offsets first) and also reduce cache evictions in the decompressor. If the encoder explicitly models these evictions, the sliding window could be enlarged (thereby improving compression) without any cost to the decoder. The resulting increase in compression time can be reduced by means of a constant-time search for previous matching blocks, for example, via hashing. Three-component RGB images, for example, from digital cameras, currently require introducing an additional band, which increases the compressed size by a factor of about 7/6. This overhead could be avoided by storing an integral number of RGB triplets in each block and temporarily expanding them to a four-component representation in the predictor. Finally, the codec should be evaluated for data types other than 16-bit values. Adding support for 32-bit integers (useful for document indexing or images from laser scanners) is straightforward. Null suppression of floating-point data is also challenging, but it may be helpful to XOR the representations of the current and previous values [109].

Chapter 5

Pan Sharpening

Imaging satellites typically capture separate high-resolution panchromatic and lower-resolution multispectral datasets. Combining them into a single pan-sharpened image provides subsequent image analysis tasks with color and structural information. This topic has been the focus of extensive research. However, personal communication indicating the operations of an international agency are limited by the speed of its pan-sharpening software has motivated the development of a much faster algorithm. We build upon the Fast IHS technique, using a weighted linear combination of the upsampled multispectral bands to derive a composite image closer to what the panchromatic sensor had seen. The difference to the actual panchromatic image approximates the high-frequency detail signal and is injected into the multispectral bands. However, the fixed band weights typical of previous commercially available algorithms cannot account for differing atmospheric conditions. To further reduce color distortion, we compute the optimal band weights for a given dataset in the sense of minimizing the mean-square difference between the composite and panchromatic images. Because the (possibly multiplicative) noise in the panchromatic image impairs the subsequent graph-based segmentation algorithm described in Chapter 6, an additional denoising step is applied before fusion. We introduce an improved approximation of the Bilateral Filter, which preserves edges and requires only one fast iteration. Both algorithms are shown to be extremely efficient: large satellite images can be processed within seconds. The quality of the fused image is evaluated in a comparative study of pan-sharpening algorithms available in ERDAS IMAGINE 9.3. Objective metrics such as the Q4 quality index show improvements in color fidelity.

This chapter is a major revision of a contribution to the Earth Resources and Environmental Remote Sensing/GIS Applications conference, co-authored by S. Laryea [103].

5.1 Introduction and Related Work

Imaging satellites such as IKONOS provide panchromatic (pan) imagery with sub-meter resolution [110]. However, segmentation benefits from multispectral (MS) information [111]. Limiting photons to individual bands requires larger detectors, so the MS resolution is typically between two and five times as coarse. In the common case where the satellite records both panchromatic and MS images, they can be fused into a high-resolution output that also includes color information. This is called 'resolution merge' or pan sharpening (PS), for which many approaches have been proposed. The popular IHS approach involves transforming colors to Intensity, Hue and Saturation. Principal Component Analysis (PCA) and the related Gram-Schmidt transformation are examples of statistical approaches. The Brovey transformation and wavelet-based techniques are examples of numerical methods. Finally, the Ehlers approach is a combination of IHS with Fast Fourier Transform-based prefiltering [112].

Each of the previously mentioned algorithms has limitations or drawbacks. A common problem relates to color distortion vs. the original MS image, which is caused by the spectral mismatch between the pan and MS bands. The IHS and PCA methods are particularly vulnerable, because they replace a transformed band with the original pan image. The mismatch can be reduced somewhat by equalizing the pan histogram before merging [113]. Another problem relates to the sensor's spectral response function. In the case of the IKONOS satellite, the pan band extends past the NIR frequencies (cf. Figure 5.1). Because the basic IHS transform ignores the NIR band entirely, colors are perceived as distorted, especially in regions with green vegetation [115]. Weighting the MS bands can mostly compensate for this effect [116]. However, knowledge of the sensor's spectral response is required, and fixed weights cannot account for changes in viewing conditions [117]. Whereas Earth observation satellites often operate in sun-synchronous orbits [110], such that each pass occurs at the same local solar time, differences in atmospheric conditions may still affect the spectral response. We avoid these issues by estimating the optimal weights for each input image, as discussed in Section 5.2. The quality metrics in Section 5.5 indicate this decreases the color distortion.

Figure 5.1: IKONOS spectral response function [114]. Note that Pan extends beyond NIR, and that Blue and Green have a significant overlap.

Another important issue concerns noise in the panchromatic image, because its signal-to-noise ratio [118] may be worse than that of the lower-frequency bands [110]. Section 5.3 proposes edge-preserving filtering of the pan image to avoid injecting noise into the MS bands. Section 5.4 shows the resulting increase in smoothness, which is beneficial for the subsequent segmentation step. High computational cost is the final drawback of the existing approaches. Section 5.6 compares execution times and finds that our new approach is orders of magnitude faster.

5.2 Algorithm

Our algorithm is based on the Fast IHS transformation [115]. The multispectral bands are first upsampled to the resolution of the panchromatic band via cubic convolution. In contrast to the fixed weights of previous IHS-based schemes, we compute the optimal band weights for the given image by minimizing the MSE (mean squared error) between the pan image and a linear combination of the multispectral bands [119, 117]. As its name suggests, the MSE is the mean squared difference between an estimation $\hat{X}$ and the true value $X$: $E[(\hat{X} - X)^2]$. There is a closed-form solution for minimizing this metric. Let $X := [B_1, B_2, B_3, B_4, P]^T$ denote the components of each pixel, i.e. the multispectral bands $B_i$ and panchromatic band $P$. We seek the vector of weights $a$ such that

$\hat{P} = \sum_{i=1}^{4} a_i X_i$    (5.1)

is an optimal (in terms of MSE) estimation of $P$. By the orthogonality principle, we have $X^T X a = X^T$ [120]. The optimal band weights $a$ are therefore $(X^T X)^{-1} X^T$. Interestingly, they may be negative, which is plausible because the spectral response functions of some bands overlap (cf. Figure 5.1). The difference $P - \hat{P}$ contains detail information from the panchromatic image and is injected back into each MS band to yield the final fused band $\hat{B}_i = B_i + P - \hat{P}$.

This algorithm is simple and efficient, but the excellent performance of our implementation is due in large part to additional numerical optimizations. Because the outer product $(X^T X)$ is symmetric, we avoid redundant multiplications by computing $P B_i$, $B_4 B_1$, $B_3 B_1$, $B_2 B_3$, $B_1 B_2$, $B_4 B_2$, $B_3 B_4$, $B_i B_i$ ($i \in [0, 3]$). This only requires two SIMD shuffles and four multiplications per pixel. After reassembling the outer product matrix from these terms, we finish the computation of $a$ with the aid of IPP's optimized matrix inversion and multiplication routines. The time-critical computation of $\hat{P}$ is accelerated by means of the SSE4.1 DPPS¹ instruction. When combined with parallelization, these techniques yield a 20-fold speedup, which is of major practical relevance. Note that the negative weights and differences between MS and $P$ may result in values of $\hat{B}$ outside the input data range, which causes problems for the subsequent filtering step. We avoid this issue by clamping all bands, i.e. assigning the nearest permissible value: $\hat{B} := \min(\max(0, \hat{B}), \max P)$.
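The weight estimation, detail injection and clamping can be illustrated with a plain scalar sketch. It is an interpretation of the description above, not the optimized implementation: the multispectral samples $b$ and the pan samples $P$ are kept separate, so that the normal equations read $(\sum b\, b^T)\, a = \sum b\, P$, and a small Gaussian elimination stands in for the library routines mentioned in the text. The names `Bands`, `OptimalWeights` and `Fuse` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kBands = 4;
using Bands = std::array<double, kBands>;

// Least-squares band weights: solves (sum b b^T) a = (sum b P) by
// Gauss-Jordan elimination with partial pivoting.
Bands OptimalWeights(const std::vector<Bands>& ms, const std::vector<double>& pan) {
  assert(ms.size() == pan.size());
  // Augmented matrix [sum b b^T | sum b P].
  std::array<std::array<double, kBands + 1>, kBands> A{};
  for (std::size_t k = 0; k < ms.size(); ++k) {
    for (std::size_t i = 0; i < kBands; ++i) {
      for (std::size_t j = 0; j < kBands; ++j) A[i][j] += ms[k][i] * ms[k][j];
      A[i][kBands] += ms[k][i] * pan[k];
    }
  }
  for (std::size_t col = 0; col < kBands; ++col) {
    std::size_t pivot = col;
    for (std::size_t row = col + 1; row < kBands; ++row) {
      if (std::fabs(A[row][col]) > std::fabs(A[pivot][col])) pivot = row;
    }
    std::swap(A[col], A[pivot]);
    for (std::size_t row = 0; row < kBands; ++row) {
      if (row == col) continue;
      const double f = A[row][col] / A[col][col];
      for (std::size_t j = col; j <= kBands; ++j) A[row][j] -= f * A[col][j];
    }
  }
  Bands a;
  for (std::size_t i = 0; i < kBands; ++i) a[i] = A[i][kBands] / A[i][i];
  return a;
}

// Injects the detail signal P - P_hat into every band and clamps the result
// to [0, maxValue].
Bands Fuse(const Bands& b, double p, const Bands& a, double maxValue) {
  double estimate = 0.0;
  for (std::size_t i = 0; i < kBands; ++i) estimate += a[i] * b[i];
  Bands fused;
  for (std::size_t i = 0; i < kBands; ++i) {
    fused[i] = std::min(std::max(0.0, b[i] + p - estimate), maxValue);
  }
  return fused;
}
```

When the pan samples are an exact linear combination of the bands, the sketch recovers those weights and the injected detail vanishes; for real imagery the residual $P - \hat{P}$ carries the high-frequency detail.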

¹ Dot Product of Packed Single-precision values.

5.3 Noise Reduction

We suppress noise in the panchromatic image by applying a fast approximation of the Bilateral Filter. This adaptive nonlinear filter smoothes pixels, but preserves strong edges. Let $I_p$ denote the pixel value at position $p$. The unnormalized filter result $F_p$ for a pixel with coordinates $p$ is a weighted average of pixels at nearby locations $q$:

$F_p = \sum_q G_s(p - q)\, G_r(I_p - I_q)\, I_q$    (5.2)

Normalization entails division by the sum of weights $W_p$:

$W_p = \sum_q G_s(p - q)\, G_r(I_p - I_q)$    (5.3)

The name 'Bilateral' arises because the influence of a pixel is determined by both its spatial (s) and radiometric (r) distance to the central pixel. $G_{s,r}$ are Gaussians whose respective standard deviations $\sigma_{s,r}$ determine the neighborhood size and sensitivity to intensity differences [121].

In this form, the filter is rather expensive to compute. However, it has recently been recast as a linear 3D convolution followed by nonlinearities (division for normalization and sampling the result at the original pixel's location). The third dimension is introduced by augmenting a pixel's x and y coordinates with its intensity value i. To speed up the convolution, this 3D space is first downsampled into coarse bins. However, an efficient SIMD-capable algorithm is identified as an exciting avenue for future work [122]. We take up this suggestion. The bins can be viewed as small cubes of the 3D space, i.e. volumetric picture elements (voxels). Each counts the number of pixels that fall within its area and stores the sum of their intensities. For an image of $W \times H$ pixels with maximum intensity $R$, we allocate $W/\sigma_s \times H/\sigma_s \times R/\sigma_r$ bins. Pixel coordinates $(x, y, i)$ are mapped to bin coordinates by multiplying with the reciprocal of $(\sigma_s, \sigma_s, \sigma_r)$ and truncating to integers. Providing two empty padding bins in each dimension avoids the need for bounds checking. Each processor is assigned a strip of the image and populates the bins with pixels. We propose a further acceleration of the subsequent 3D Gaussian convolution of the bin counts and sums. Because only ≈ 10% of bins are observed to be occupied ($\sigma_s = 5$, $\sigma_r = 40$, $R = 2047$), the kernel can be approximated by separated 1D second-order binomial filters. The central pixel is weighted by a factor of two and added to its left and right neighbors. However, we store bins as an array of row-major matrices, thus making for poor locality when iterating over the second and third dimensions. We instead compute the weighted sums of each central pixel and its six nearest 3D neighbors in a single pass. Because the resulting values are written sequentially, we use non-temporal streaming stores to avoid cache pollution by writing directly to memory (see Appendix A.2 for a more detailed discussion). Perhaps surprisingly, these numerical and data-layout optimizations have resulted in a 5-fold speedup vs. the separated convolutions. The next step involves normalization, i.e. dividing each bin's intensity sum by the number of pixels it contains. We speed up the division by multiplying with the approximate reciprocal. Masking avoids the singularity at zero. Finally, the filtered pixels are obtained via trilinear interpolation of the average intensities in the eight nearest bins. Our carefully engineered algorithm achieves a 14-fold speedup vs. the reference implementation of the approximated Bilateral Filter [122].

We also measured the throughput for 16-bit satellite images of varying sizes on our test system (cf. Section 2.3). The results are shown in Table 5.1. Performance increases slightly for larger image

Table 5.1: Throughput of our approximated Bilateral Filter for 16-bit satellite images.

MPixelSatelliteMPixel/sQuickBirIKONOSd5474242304
327109dQuickBir316136dQuickBirQuickBirGeoEyed240229336335

sizes due to amortization of startup overhead. For comparison purposes, a Virtex-4 FPGA implementation of bilateral background subtraction processes 4.6 MPixel/s [123]. A separated approximation of the Bilateral Filter running on an NVIDIA GeForce 8800 GTX reaches 189 MPixel/s [124]. The measured throughput of our software implementation exceeds their respective performance by factors of 73 and 1.8.
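The mapping of pixels to bins and the binomial filter described in this section can be sketched as follows. This is a minimal, unoptimized Python illustration, not the SIMD implementation discussed in the text. The function names (`bin_index`, `populate_bins`, `binomial_1d`) and the dictionary-based sparse grid are assumptions made for the example, and the filter weights are left unnormalized.

```python
from collections import defaultdict

def bin_index(x, y, i, sigma_s, sigma_r):
    """Map pixel coordinates (x, y, intensity) to bin coordinates by
    multiplying with the reciprocal of (sigma_s, sigma_s, sigma_r)
    and truncating to integers."""
    inv_s = 1.0 / sigma_s
    inv_r = 1.0 / sigma_r
    return (int(x * inv_s), int(y * inv_s), int(i * inv_r))

def populate_bins(image, sigma_s, sigma_r):
    """Each bin stores [pixel count, sum of intensities].
    'image' is a list of rows of non-negative intensities."""
    bins = defaultdict(lambda: [0, 0])
    for y, row in enumerate(image):
        for x, intensity in enumerate(row):
            b = bins[bin_index(x, y, intensity, sigma_s, sigma_r)]
            b[0] += 1
            b[1] += intensity
    return bins

def binomial_1d(values):
    """Second-order binomial filter: the central value is weighted by
    two and added to its left and right neighbors (zero padding)."""
    padded = [0] + list(values) + [0]
    return [padded[k - 1] + 2 * padded[k] + padded[k + 1]
            for k in range(1, len(padded) - 1)]
```

Applying `binomial_1d` along each of the three bin axes in turn corresponds to the separated convolution; the single-pass variant in the text instead combines each bin with its six nearest neighbors directly.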


5.4 Results

We first assess the quality of our new MSP (MultiSpectral Preprocessing) algorithm by means of a visual comparison of its results to the output of commercially available software. The Modified IHS transformation and Ehlers Fusion algorithms will serve as a basis for comparison. Both are included in version 9.3 of the well-established ERDAS IMAGINE framework.

Modified IHS [116] improves upon the spectral fidelity of classic IHS fusion. The Pan channel is adjusted to match the intensity of the multispectral input imagery. It then replaces the I channel, after which the IHS representation is converted back to RGB. The method may be extended to more than three bands by substituting one of the input bands and repeating the process.

Ehlers Fusion [112] is also based on the IHS transformation with additional filtering in the frequency domain. The I component is filtered with a low pass kernel, whereas the panchromatic band goes through a high pass filter. The results are then transformed back to the spatial domain, after which the low-frequency multispectral and high-frequency panchromatic signals are combined to yield the new intensity component. Finally, IHS is transformed back to RGB.

We run the algorithms on two satellite datasets of Karlsruhe and Feyzabad, recorded by the IKONOS satellite system [110] on 2003-08-06 and 2004-07-05. The 4 m MS images are resampled to 1 m by means of cubic convolution, except for Modified IHS with the Karlsruhe dataset, which requires bilinear interpolation to avoid an apparent software error in ERDAS that causes severe distortion. A visual assessment of the results would ideally involve displaying them under identical conditions. The intention was to stretch each histogram by the same function. However, the green band of the Ehlers Fusion differed significantly, causing a noticeable color shift. We therefore computed the histograms of the Ehlers and IHS outputs via ERDAS with bin function direct, skip factor 1 and including all values. The results are shown in Figure 5.2. Although the cause of the IHS plateau between 0 and 63 is unknown (no such pixel values were observed), the shift between the two histograms is immediately apparent. This seems to indicate a flaw in the Ehlers algorithm, which may have been hidden by the default ERDAS viewer behavior of stretching images for display (i.e. adjusting their histograms). To enable a side-by-side comparison, we display all images with this stretch mode enabled.

Figure 5.2: Histogram plot indicating the frequencies of intensity values [0, 2048) in the green bands of the (a) IHS and (b) Ehlers outputs. A substantial shift is observed.

The resulting screen captures are shown in Figures 5.3 and 5.4. All algorithms provide reasonable outputs, but also include blue borders at the edges of buildings and trees. This effect is caused by the imprecise co-registration of the bands. The reduced noise level in our output (Figure 5.3(d)) is seen when comparing with the panchromatic image and the other results, particularly in the water areas. However, the borders of the fields in Figure 5.3(d) indicate a loss of detail due to excessive smoothing, which can be reduced by choosing smaller σs, σr. Upon closer inspection of the Ehlers result in Figure 5.3(b), we note a color shift: the country roads appear darker than in the original.


Figure 5.3: Screen captures of the Karlsruhe dataset and the algorithms' outputs. Panels: (a) MS, (b) Ehlers, (c) ModIHS, (d) MSP.

Figure 5.4: Screen captures of the Feyzabad dataset and the algorithms' outputs. Panels: (a) MS, (b) Ehlers, (c) ModIHS, (d) MSP.

5.5 Quality Metrics

The preceding qualitative assessment gives a rough indication of how successfully an algorithm preserves the multispectral characteristics of a dataset while improving its spatial resolution. However, we also provide objective measurements by means of the following similarity metrics:

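PD  The Per-pixel Deviation is the difference of each component c of the pixels at coordinates i, j in the multispectral input B vs. those in the pan-sharpened output F after resampling to the original resolution. It is normalized according to the image dimension N × M and number of components C. The best value is zero. [112]

\[ \mathrm{PD} = \frac{1}{NMC} \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{c=1}^{C} \left| B_{i,j,c} - F_{i,j,c} \right| \tag{5.4} \]

RMSE  Root Mean Square Error is simply the square root of the MSE between the fused image and the original multispectral image. Smaller values are better.

\[ \mathrm{RMSE}_c = \sqrt{ \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( B_{i,j,c} - F_{i,j,c} \right)^2 } \tag{5.5} \]

CC  Correlation Coefficient expresses the correlation between the original and fused images and ranges from -1 to +1. Values near 1.0 indicate the images are highly correlated and similar. [125] Let \bar{F}_c denote the average intensity \sum_{i,j} F_{i,j,c} / (NM) of each pixel's component c in F, and similarly \bar{B}_c for B.

\[ \mathrm{Corr}_c = \frac{ \sum_{i=1}^{N} \sum_{j=1}^{M} (B_{i,j,c} - \bar{B}_c)(F_{i,j,c} - \bar{F}_c) }{ \sqrt{ \sum_{i=1}^{N} \sum_{j=1}^{M} (B_{i,j,c} - \bar{B}_c)^2 \; \sum_{i=1}^{N} \sum_{j=1}^{M} (F_{i,j,c} - \bar{F}_c)^2 } } \tag{5.6} \]

ERGAS  The relative dimensionless global error in fusion summarizes the errors in all bands. Smaller values indicate higher image quality. The scaling factor h/l corresponds to the ratio of pixel sizes in the pan and MS imagery. [126]

\[ \mathrm{ERGAS} = 100 \, \frac{h}{l} \sqrt{ \frac{1}{C} \sum_{c=1}^{C} \left( \frac{\mathrm{RMSE}_c}{\bar{B}_c} \right)^2 } \tag{5.7} \]

Q  The Universal Image Quality Index incorporates loss of correlation, luminance distortion, and contrast distortion. It ranges between 0 and 1 and is maximized when the images are identical. [127]

\[ Q_c = \frac{ 4 \, \bar{B}_c \bar{F}_c \sum_{i,j} (B_{i,j,c} - \bar{B}_c)(F_{i,j,c} - \bar{F}_c) }{ (\bar{B}_c^2 + \bar{F}_c^2) \left[ \sum_{i,j} (B_{i,j,c} - \bar{B}_c)^2 + \sum_{i,j} (F_{i,j,c} - \bar{F}_c)^2 \right] } \tag{5.8} \]

Q4  The Quaternions Theory Based Quality Index is a generalization of the Q index to four bands via quaternions, computed on non-overlapping 32 × 32 blocks. The best value is 1. [128]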
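As an illustration of the per-band metrics defined above, the following Python sketch evaluates them for a single band given as nested lists. It is a plain reference computation under stated assumptions, not the evaluation code used for the tables: the function names `band_metrics` and `ergas` are hypothetical, the deviation is taken as an absolute difference, and constant images (zero variance) are not handled.

```python
import math

def band_metrics(B, F):
    """Per-band metrics between a reference band B and a fused band F,
    both given as equally sized lists of rows."""
    b = [v for row in B for v in row]
    f = [v for row in F for v in row]
    n = len(b)                       # N * M pixels
    mean_b = sum(b) / n
    mean_f = sum(f) / n
    # This band's contribution to the Per-pixel Deviation.
    pd = sum(abs(x - y) for x, y in zip(b, f)) / n
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(b, f)) / n)
    cov = sum((x - mean_b) * (y - mean_f) for x, y in zip(b, f))
    var_b = sum((x - mean_b) ** 2 for x in b)
    var_f = sum((y - mean_f) ** 2 for y in f)
    cc = cov / math.sqrt(var_b * var_f)
    q = (4 * mean_b * mean_f * cov
         / ((mean_b ** 2 + mean_f ** 2) * (var_b + var_f)))
    return {"PD": pd, "RMSE": rmse, "CC": cc, "Q": q, "mean": mean_b}

def ergas(per_band, h_over_l):
    """ERGAS from a list of band_metrics() results and the ratio h/l."""
    c = len(per_band)
    total = sum((m["RMSE"] / m["mean"]) ** 2 for m in per_band)
    return 100 * h_over_l * math.sqrt(total / c)
```

For identical inputs the sketch returns PD = RMSE = 0 and CC = Q = 1; a constant offset between B and F leaves CC at 1 but lowers Q, because Q also penalizes luminance distortion.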


Table 5.2: Per-band metrics for the Karlsruhe and Feyzabad datasets. The best value of each metric is encircled.

                Karlsruhe                   Feyzabad
CC       Ehlers   ModIHS   MSP       Ehlers   ModIHS   MSP
B        0.926    0.927    0.956     0.986    0.968    0.979
G        0.956    0.956    0.982     0.993    0.978    0.991
R        0.971    0.970    0.986     0.997    0.984    0.993
NIR      0.743    0.950    0.992     0.994    0.957    0.987
mean     0.899    0.951    0.979     0.992    0.972    0.988

RMSE     Ehlers   ModIHS   MSP       Ehlers   ModIHS   MSP
B        0.330    19.58    13.69     0.713    9.55     5.899
G        1.840    24.06    13.96     1.010    13.42    6.378
R        3.001    23.55    14.66     1.155    14.29    7.035
NIR      0.631    60.13    22.46     1.197    23.30    9.493
mean     1.451    31.83    16.19     1.019    15.14    7.201

Q        Ehlers   ModIHS   MSP       Ehlers   ModIHS   MSP
B        0.417    1.000    1.000     0.944    1.000    1.000
G        0.554    0.999    1.000     0.961    1.000    1.000
R        0.430    0.867    0.942     0.982    1.000    1.000
NIR      0.488    0.929    0.994     0.990    1.000    1.000
mean     0.472    0.949    0.984     0.969    1.000    1.000

The values of the per-band metrics are given in Table 5.2. As expected, most outputs are highly correlated to the inputs. However, the NIR band of the Ehlers result for the Karlsruhe dataset apparently includes some discrepancies, because its correlation coefficient is only 0.7428. RMSE is higher for the IHS-based algorithms. Especially large differences in the ModIHS NIR band are likely due to the original IHS strategy of obtaining the fourth band by substituting for another band and repeating the algorithm. Our approach avoids this issue by adding detail information to all MS bands simultaneously. Although the resulting RMSE is still higher than that of the Ehlers output, the image quality is not necessarily inferior [126]. For example, the underlying L2 norm unduly penalizes outliers. By contrast, the Q index provides a more accurate indication of actual information loss [127]. According to this metric, the IHS-based approaches significantly outperform the Ehlers Fusion. As expected, our optimal weight estimation scheme improves upon the fixed-weight ModIHS in all measurements. Let us now examine the global metrics across all bands, given in Table 5.3. The Ehlers Fusion results in the best ERGAS. However, this metric

Table 5.3: Global metrics for the Karlsruhe and Feyzabad datasets. The best value of each metric is encircled.

                Karlsruhe                   Feyzabad
Metric   Ehlers   ModIHS   MSP       Ehlers   ModIHS   MSP
PD       0.025    15.908   7.722     0.015    7.838    2.817
ERGAS    0.140    1.749    0.953     0.045    0.662    0.316
Q4       0.084    0.724    0.788     0.433    0.891    0.940

cannot rule out spectral distortion [125]. By contrast, the Q4 index accounts for differences in spectral angle by computing the actual multivariate correlation coefficient [119]. Our method significantly outperforms the Ehlers Fusion in terms of this metric. Because the Ehlers algorithm's Q results exceed the values of Q4, we can infer that a spectral shift has occurred. In summary, the Ehlers Fusion yields better values of RMSE, PD and ERGAS, whereas our approach rates higher according to Q and Q4. This kind of discrepancy has motivated the pessimistic conclusion that current metrics are not capable of reliably measuring image quality or even similarity [125]. However, we believe the simplistic RMSE, PD and ERGAS metrics have less bearing on perceived quality than the more elaborate Universal Quality index and Q4.


5.6 Performance

In designing and implementing our approach, we emphasized efficiency. To gain a first impression of the resulting performance, let us compare the runtimes for each of the three methods on a X5365 CPU (3.0 GHz, 32 GiB FB-DDR2 RAM), shown in Table 5.4. Our approach is about 40 times as fast as ModIHS despite doing
Table 5.4: Elapsed time [s] for the three methods and two datasets.

KarlsruheAlgorithmyzabadFeModIHSEhlers135923531285721
69MSP

more work (computing the band weights). Because the algorithms are very similar, the difference is largely due to implementation techniques: vectorization, parallelization and optimizing the numerical calculations. The runtime of the Ehlers algorithm is much higher still. It is unclear why the smaller 61 MPixel Feyzabad image required 25 times as long as the 87 MPixel Karlsruhe dataset. Even disregarding this difference, our algorithm remains over 100 times as fast. We have also measured the throughput of our algorithm on the more recent test system (cf. Section 2.3), shown in Table 5.5. As with the Bilateral Filter, performance tends to increase on larger images due to amortization of overhead. Our software outperforms a similar algorithm's Matlab implementation [119] by a factor of 1134.


Table 5.5: Throughput of our pan-sharpening algorithm for 16-bit, 4 band satellite datasets.

MPixel/sMPixelSatelliteQuickBirIKONOSd7454212211
230109dQuickBir226136dQuickBirQuickBirGeoEyed240229234238

5.7 Conclusion

This chapter has described an IHS-based pan-sharpening algorithm that is capable of processing gigapixel-scale imagery within seconds. Despite requiring two orders of magnitude less computational time, objective metrics indicate its quality is at least comparable to current approaches. In particular, the correlation coefficient and Q4 quality index attest to a higher color fidelity than the Ehlers Fusion. This is made possible by the estimation of optimal band weights for each input image.

We have also proposed edge-preserving pre-filtering of the panchromatic image by means of a fast new approximation of the bilateral filter. A subjective evaluation has shown its usefulness for reducing noise in the output.

Future work may include an additional sub-pixel registration of the pan and multispectral images to avoid artifacts at object boundaries.


Chapter 6

Image Segmentation

The next pipeline stage is responsible for automatically partitioning images into regions (segmentation). This chapter introduces a Minimum Spanning Tree-based algorithm with a novel graph-cutting heuristic, the usefulness of which is demonstrated by promising results obtained on standard images. In contrast to data-parallel schemes that divide images into independently processed tiles, the algorithm is designed to allow parallelization without truncating objects at tile boundaries. A fast parallel implementation for shared-memory machines is shown to significantly outperform existing algorithms. It utilizes a new microarchitecture-aware single-pass sort algorithm, presented in Appendix A, that is likely to be of independent interest.

An initial version of this chapter appeared in the proceedings of the 13th International Conference on Computer Analysis of Images and Patterns [129].

6.1 Introduction and Related Work

Segmentation is an important early stage of some image processing pipelines, e.g. object-based change detection. The final results of such applications are often strongly dependent on the quality of the initial segmentation. Because subsequent processing steps can use higher-level region information instead of having to examine all pixels, the segmentation may also be the limiting factor in terms of performance. Many algorithms have been proposed, but good quality results often come at the price of high computational cost.

One extreme example of this is a multi-scale watershed approach (MSHLK) [130]. Repeated applications of anisotropic diffusion smooth the image and reduce the tendency of the watershed transform to return excessive numbers of segments (oversegmentation). The resulting subjective quality is very good, but its processing speed (1 kPixel/s) is unacceptably low.

An alternative approach uses the Mean-Shift (MS) [131] procedure to locate clusters within a higher-dimensional representation of the image. This is guaranteed to converge on the densest regions in this space and yields good results in practice, but the processing rate (100 kPixel/s) is still inadequate.

In previous work, we have shown that Maximally Stable Extremal Regions (MSER) [132] can be applied towards segmentation of gradient images. Although more efficient (2 MPixel/s), this scheme only detects high-contrast segments and does not provide full coverage of the image. It also seems ill-suited for parallelization because the criterion for stable depends on a global ordering of pixels.

Graph-based segmentation (GBS) [133] increases the amount of data to be handled (multiple graph edges per pixel) but has several attractive properties. Viewing pixels as nodes of a graph allows the reduction of segmentation to cutting a Minimum Spanning Tree (MST). Defining edge weights as some function of the pixels' per-band intensity differences enables the use of color information without having to compute image gradients¹. Finally, an MST can be assembled from partial sub-trees, which provides the possibility of parallelization. In Section 6.2, we develop a new online graph-cutting heuristic for MST-based segmentation. Section 6.3 shows the promising results obtained on well-known images. Section 6.4 introduces PHMSF (Parallel Heuristic for Minimum Spanning Forests), which we believe to be the first non-trivially-parallel segmentation algorithm. Perhaps most importantly, Section 6.6 shows it to significantly outperform existing segmentation techniques.

¹ A measure of the change in intensity for each pixel, e.g. by computing differences to neighboring pixels.

6.2 Algorithm

Segmentation algorithms require (often application-dependent) definitions of image region. We believe homogeneity and high contrast to surrounding pixels are reasonable criteria [134]. Homogeneity can be computed as distances between (vector-valued) pixels; we find the L2 norm to yield better results than L1 or pseudo-norms. Prior work [133] has advocated separate segmentation of the R/G/B component images and intersecting the results. Because object edges are not always visible in all multi-spectral bands [135], it is safer (and certainly faster) to segment once using all bands. Recalling the graph segmentation framework, the above homogeneity measure defines the weight of edges. It remains to be seen how an online graph-cutting heuristic should partition the MST depending on edge weight. A mere threshold is insufficient because it fails to account for noise or the overall homogeneity of a region. One possible solution [133] involves an adaptive threshold that is incremented by a linearly decreasing function of the region size.² The function's slope is a user-defined parameter that must be determined by experimentation because it has no physical explanation. This scheme also underestimates a region's homogeneity by defining it as the maximum weight in its MST, thus tending towards oversegmentation.

We suggest the adoption of an idea from Canny's detector for image edges [136]. In the context of computational edge detection, pixels with large gradient magnitudes are likely to correspond to edges within the image, but there is no single level at which this ceases to be the case. Applying a relatively strict threshold finds safe candidates, which can be augmented by nearby pixels that lie above a second, more generous limit. Returning to segmentation terminology, regions connected by low-weight graph edges represent likely candidates that can subsequently be expanded by following adjoining graph edges with somewhat higher weights. Figure 6.1 illustrates how a region is formed by expanding the initial candidate.

Figure 6.1: A region is obtained by expanding an initial candidate formed from homogeneous pixels.

To avoid potentially unbounded growth, we institute a credit limit on the sum of edge weights that may be added to a candidate region. The motivating principle (how much water can be filled into a basin without overflowing) is shown in Figure 6.2. Because a circle is the most compact two-dimensional shape [137], its circumference √(4π × regionSize) constitutes a lower bound on the perimeter (minPerimeter) of a region whose area is regionSize pixels. Let us also assume additive white Gaussian noise with variance σn², for which several estimators have been proposed [138, 139]. With an eye towards the Gaussian cumulative distribution function, we choose 2σn as an arbitrary cutoff point. It is unlikely for any larger intensity differences to arise from noise. We therefore define minContrast as the smallest edge weight along the border of any interesting region minus 2σn. Putting both these pieces together, the function ComputeCredit := minContrast × minPerimeter estimates the total weight of edges whose endpoint pixels can be added to a region without inadvertently expanding beyond its bounds. This property is important because subsequent region merge decisions can be based upon region features (discussed in Section 6.5), whereas splitting requires re-examination of the pixels or edges. However, the resulting regions are not necessarily too fine because pixels connected by low-weight edges are always merged. We have therefore averted global under- and oversegmentation of the image while using only local information. The algorithm first forms candidate regions by merging the endpoints of low-weight edges, computes their credit, and then calls a simple heuristic (Algorithm 6.1) in increasing order of the remaining edges' weights.

Figure 6.2: Motivation for the credit computation. The gray area denotes a region being filled with water. Spilling beyond its bounds can only occur if the total volume exceeds a function of the perimeter and the minimum wall height (the red lines of varying height suggest boundary edges and their weights).

² This unduly penalizes the growth of large segments; we saw slightly better results when dividing by the logarithm of the region size.


Algorithm 6.1: EdgeHeuristic(edge)
  region1, region2 := Find(edge.endpoints);
  if region1 ≠ region2 then
    credit := min{region1.credit, region2.credit};
    if credit > edge.weight then
      survivor := Union(region1, region2);
      survivor.credit := credit − edge.weight;
    end
  end
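The heuristic of Algorithm 6.1 together with the credit computation of this section can be sketched in Python as follows. This is an illustrative, sequential version under assumptions: the class name `Regions`, the list-based storage and the explicit `min_contrast` argument are choices made for the example and do not reflect the compact 32-bit layout described under Implementation Details.

```python
import math

class Regions:
    """Union-Find over pixels with a per-region size and credit."""
    def __init__(self, num_pixels):
        self.parent = list(range(num_pixels))
        self.size = [1] * num_pixels
        self.credit = [0.0] * num_pixels

    def find(self, node):
        while self.parent[node] != node:
            # Path halving: point the node at its grandparent.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def union(self, a, b):
        # Link the smaller region below the larger one.
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        return a

def compute_credit(region_size, min_contrast):
    # minPerimeter: circumference of a circle whose area is region_size.
    min_perimeter = math.sqrt(4.0 * math.pi * region_size)
    return min_contrast * min_perimeter

def edge_heuristic(regions, u, v, weight):
    """Merge the regions joined by edge (u, v) if their credit allows it.
    Returns True if a merge took place."""
    r1, r2 = regions.find(u), regions.find(v)
    if r1 == r2:
        return False
    credit = min(regions.credit[r1], regions.credit[r2])
    if credit <= weight:
        return False
    survivor = regions.union(r1, r2)
    regions.credit[survivor] = credit - weight
    return True
```

In use, candidate regions would first be formed by merging the endpoints of low-weight edges, `compute_credit` would be evaluated once per candidate, and `edge_heuristic` would then be called for the remaining edges in increasing order of weight.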

Implementation Details

We represent edges as 30-bit integers indicating the index of their originating node together with a 2-bit encoding of their four possible directions³. Nodes (pixels) are organized into disjoint sets (regions) by means of the Union-Find (UF) data structure [140]. Each node is associated with a 32-bit value that typically points to its parent node. The root of each subtree (i.e. region) is termed the [canonical] representative and holds the index of the corresponding region data structure, which stores credit and size in 32-bit integers. We differentiate parents and representatives by means of their sign bit. This avoids the need for auxiliary storage during the initial region merging, because credit is not yet needed and the representative stores the (negated) size. Find traverses the parent links and returns the representatives of the regions adjoining the given edge. To speed up these relatively expensive (due to their poor locality) searches, we halve the subsequent path length in every iteration by reassigning nodes' parents to their grandparents. Union merges two regions; choosing the larger one as the parent also serves to decrease path lengths [140].

We introduce an additional optimization that avoids needing to initialize the parent array and halves the number of allocated region structures. Because Windows' VirtualAlloc returns zeroed memory, we consider 0 to be a valid region index. Recall that nonpositive 32-bit indices are interpreted as representatives. We allocate enough virtual address space to treat indices as unsigned 32-bit offsets and then map a single (read-only) page of zeroed memory at the address of region 0. When a node is first merged, its size therefore appears to be zero, thus causing it to be linked to the (larger) parent. We only need to allocate a region structure when the parent also reports a size of zero. Physical memory for subsequent region structures is committed as needed.

³ Each node has eastern, southern, southwestern and southeastern connections to its neighbors, thus yielding an eight-connected grid graph.
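The compact edge representation described above can be illustrated with a few lines of Python. The exact bit layout is not stated in the text, so placing the 2-bit direction code in the low bits, the order of the codes in `DIRECTIONS`, and the helper names are all assumptions made for this sketch.

```python
# Assumed order of the 2-bit direction codes (hypothetical).
DIRECTIONS = ("E", "S", "SW", "SE")
OFFSETS = {"E": (1, 0), "S": (0, 1), "SW": (-1, 1), "SE": (1, 1)}

def pack_edge(node_index, direction):
    """Store a 30-bit node index and a 2-bit direction in one integer."""
    assert 0 <= node_index < (1 << 30)
    return (node_index << 2) | DIRECTIONS.index(direction)

def unpack_edge(edge):
    """Return (originating node index, direction)."""
    return edge >> 2, DIRECTIONS[edge & 3]

def neighbor(node_index, direction, width):
    """Index of the node at the other end of the edge, assuming a
    row-major image of the given width (no bounds checking)."""
    x, y = node_index % width, node_index // width
    dx, dy = OFFSETS[direction]
    return (y + dy) * width + (x + dx)
```

Because each node only stores its eastern, southern, southwestern and southeastern edges, every edge of the eight-connected grid is represented exactly once.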

6.3 Results

To demonstrate the usefulness of the new segmentation results, we compare them to the outputs of existing algorithms on standard images [141], the results of which are shown in Figures 6.3 and 6.4. MSHLK [130] is known for high-quality results and provides excellent smoothing of the walls (b) but merges the eaves into the sky segment. We also call attention to the oversegmentation of the second image and shock effects [142] in the background (b). MS [131] is more successful at merging the individual objects (c) but also splits some of them (e.g. below the P); spurious segments near edges (c) are its only visible flaws. As with MSHLK, segment borders are delineated by black pixels. MSER [132] produces mostly adequate label images, though the wall is not considered to be a stable region (d); the effects of the gradient filter are clearly visible (d). GBS [133] is satisfactory but results in undersegmentation near the roof lines and oversegmentation of the sky and wall (e). It also merges different-colored objects (e) but fails to return a uniform background. Our new PHMSF algorithm provides results comparable to MSHLK and MS and requires only 1/4000 and 1/50 the computation time, respectively (cf. Section 6.6). The black pixels (f) indicate surface irregularities that resulted in regions smaller

Figure 6.3: Segmentation results of the new PHMSF algorithm and others on USC SIPI [141] image 4.1.05 (House). Panels: (a) Image, (b) MSHLK, (c) MS, (d) MSER, (e) GBS, (f) PHMSF.

Figure 6.4: Segmentation results of the new PHMSF algorithm and others on USC SIPI [141] image 4.1.07 (Jelly beans). Panels: (a) Image, (b) MSHLK, (c) MS, (d) MSER, (e) GBS, (f) PHMSF.

than the minimum size. The segmentation in (f) is quite accurate, correctly separating different-colored objects without introducing spurious boundaries.

6.4 Parallel Algorithm

Despite the efficiency of the new segmentation algorithm, a highly-tuned sequential implementation is still far slower than the collection rates of commercial imaging satellites (e.g. IKONOS with up to 90 km²/s [110]). Because significant reductions of the algorithm's constant factors or major increases in single-core CPU performance (cf. Section 2.4) appear unlikely, our self-set performance goal of 10 MPixel/s requires parallelization. However, embarrassingly parallel schemes that simply split the input into independent tiles are not acceptable because they do not correctly handle objects straddling a border. Nor are overlapping tiles sufficient because there is no upper bound on the size of objects of interest (e.g. rivers or roads). Our first attempt at parallelization addressed the MST computation. The recently introduced Filter-Kruskal scheme [143] combines ideas from Quicksort and Kruskal's algorithm and discards non-MST edges without having to sort them. This filter operation, partitioning and sorting can all be parallelized. However, the total speedup on a quad-core system is only 1.5, chiefly due to the sequential portion of the algorithm, but also because our eight-connected grid graphs are too sparse to derive much benefit from discarding edges. Our second approach (Algorithm 6.2) is designed to allow independent processing of image tiles, but still ensures consistent results irrespective⁴ of the number of processors P. The key observation is that Kruskal's MST algorithm can run in a data-parallel fashion until encountering an edge that crosses a tile

⁴ We ignore the effects of unstable parallel sorting. The relative order of items with the same key depends on the number of processors and the arbitrary manner in which the grid graph is constructed. However, neither appears to have a relevant influence on the results.


Algorithm 6.2: Parallel Segmentation
  parallel foreach tile do
    sort edges in ascending order of weight;
    immediately merge regions connected by edges of weight < minWeight;
  foreach borderEdge do  // connect and mark cross-border regions
    region1, region2 := Find(borderEdge.endpoints);
    survivor := Union(region1, region2);
    Mark(survivor);
    tile.regions := tile.regions ∪ {survivor};
  end
  parallel foreach tile do
    foreach r ∈ tile.regions do
      r.credit := ComputeCredit(r.size);  // see Section 6.2
  parallel foreach tile do
    foreach edge in ascending order of weight do
      region1, region2 := Find(edge.endpoints);
      if edge crosses border then
        Mark(region1), Mark(region2);
      else if IsMarked(region1) or IsMarked(region2) then
        tile.delayQ.Push(edge);
      else EdgeHeuristic(edge);  // see Section 6.2
    end
  foreach tile do
    foreach edge ∈ tile.delayQ do EdgeHeuristic(edge);
  end


Figure 6.5: Top view of a graph representing two square tiles within the input image. Nodes are located at the intersections of the dotted lines, and non-discarded MST edges are rendered as colored lines. Processors can run Kruskal's algorithm independently on their tiles until reaching one of the red edges (i.e. those directly or indirectly connected to a cross-tile edge).

border (cf. Figure 6.5). From then on, MST components using such edges and in turn their incident edges must be delayed until the partial MSTs of both tiles are available. This can be accomplished by adding edges to per-tile queues that are processed in a subsequent sequential phase⁵. We also Mark any regions reachable via delayed edges by setting the most-significant bit of their size, which can be queried by IsMarked. It remains to be seen how many edges are delayed: a long cross-border region of homogeneous pixels could affect a large proportion of a tile. However, high-weight edges at the boundary of such regions often serve as a firewall because they can be discarded without affecting neighboring regions. Only about 5% of edges are delayed in practice, making Amdahl's argument less of a factor than real-world limits on the memory bandwidth and P. To avoid scheduling and locality issues, the manually partitioned loops reside in a single OpenMP parallel region (cf. Section 2.4). A novel variant of counting sort uses paged virtual memory to simulate bins of unlimited size and thus dispenses with a separate counting phase. An explicit buffering technique further increases performance by enabling write-combining without cache pollution. Details are given in Appendix A.2.

⁵ This would be parallelizable if edges indicate which border they cross, but our implementation cannot spare any space within the 32-bit representation.
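The marking scheme mentioned above, which reuses the most-significant bit of a region's size field, can be illustrated as follows. This is a small sketch assuming an unsigned 32-bit size field; the helper names are hypothetical and the interplay with the sign-bit convention of the parent array is not modeled.

```python
MARK_BIT = 1 << 31  # most-significant bit of a 32-bit size field

def mark(size_field):
    """Flag a region as reachable via a delayed or cross-border edge."""
    return size_field | MARK_BIT

def is_marked(size_field):
    return (size_field & MARK_BIT) != 0

def region_size(size_field):
    """Recover the actual pixel count regardless of the mark."""
    return size_field & (MARK_BIT - 1)
```

Storing the flag inside an existing field means that the per-tile loop can test it with the same memory access that retrieves the region's size.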

6.5 Region Features

The algorithm also computes region features. However, it would be wasteful to allocate records for the numerous small regions that are often ignored by applications anyway. We therefore only consider regions whose size lies within a user-defined interval [min, max]. This entails relabeling the per-tile regions and replacing them with a new set of contiguous indices, which is accomplished by Algorithm 6.3. Its separate and very efficient count phase seems preferable to updating the per-tile region count when cross-border merges are performed by our parallel Kruskal algorithm. One of the typical outputs of a segmentation algorithm is a label image, in which the value of a pixel indicates the region to which it belongs. We therefore collapse the array of Union-Find parents such that each node points directly to its representative once all regions have been re-labeled.

Let us now examine the data structure referenced by the new indices. Maintaining a list of member pixels for each region would be costly in terms of time and space. We instead iterate over the image pixels and ascribe their properties to the corresponding region. This improves locality when the region features require less storage than the pixels themselves.⁶ Updating the features

⁶ Our region descriptors currently occupy 64 bytes, whereas a pixel comprises 4 components of 2-byte numbers, and regions usually encompass more than 8 pixels.

Algorithm 6.3: Parallel Relabeling
  parallel foreach tile do  // compress regions
    foreach r ∈ tile.regions do
      r.isValid := r.size ∈ [min, max];
  end
  parallel foreach tile do  // count regions
    tile.numRegions := 0;
    foreach pixel do
      if IsRepresentative(pixel) and Find(pixel).isValid then
        tile.numRegions := tile.numRegions + 1;
      end
    end
  end
  for i := 0 to |tiles| − 1 do
    tiles[i].startIndex := ∑_{0≤j<i} tiles[j].numRegions;
  end
  parallel foreach tile do  // re-label regions
    foreach pixel do
      if IsRepresentative(pixel) and Find(pixel).isValid then
        parents[pixel] := tile.startIndex;
        tile.startIndex := tile.startIndex + 1;
      end
    end
  end

afteraccumulatorsvisitingofeachinterpixelmamediateybevaluesquitethatcostly,willsowlatereprbeorvideefinedfor
∑intoBitheandtheactualsumfeaturofes.theirThesquarsumesof∑Beachi2willbandsyieldpixelthestandarintensitiesd
deviation.Fittinganellipsetoeachregionallowsinferringtheir
orientationandeccentricity(theratioofmajortominoraxes).We
seekanellipsewithidenticalmomentsandthereforeaccumulate
92

mp,q=∑XpYq(p,q∈N0,p+q≤2)foreachoftheregions
pixelswithcoordinates(X,Y).[144]Thesevaluesarestoredas
still64-bitenablingfloatingvpointectorizationnumbersviatoSSE2mitigateinstructions.precisionItisalsoissuespossiblewhile
toitspixels.estimateTothethatregionend,weperimetercountfromnumEqualasingle,thenumbersequentialofscanedgesof
whoseendpointshavethesamelabel.Thecentralpixeliscopied
intoeachlaneofavectorandcomparedtoavectorcomprisingthe
setfourifthesurrcorroundingespondingpixels.valueThisrwasesultsequal.in32-bitAftermaskspackingwiththeallmasksbits
into16-bitrepresentations,wecomputetheirbyte-wisehorizontal
sumbymeansofthePSADBW7SSE2instruction.Afinalsetof
willbeaccumulatorsusedtoinvolvconstructethethemaximumaxis-alignedXandYboundingcoorboxdinates,(AABB).which
Asthewiththeaccumulatorsparentiftheirindices,initialwevcanaluesavaroidezero.explicitThisistheinitializationcaseforof
accumulatorsrepresentingcountersormaximumvalues.However,
caseAABBsforalsotheirrequirinitialevthealues,wminimumeinsteadcoortrackdinates.theToamaximumvoidaadditivspeciale
floatscomplementwithoutofloss,thesocoorweardinates.eableTheirtovupdatealuesthecanbefourreprmaximaesentedwithas
asingleSIMDMAXPS8instruction.
EachCPUcoreisassignedastripoftheimage,forwhichit
updatessuccessivaelysetrofeducedtoaccumulators.asinglePairsglobalofarraybaccumulatorytakingarratheysarmax-e
imumofthecoordinates,andaddingallothervalues.Wethen
computeeachregionsfeaturesfromitsaccumulator.Letn=m0,0
denotetheregionsize.∑Bi2−nTheµi2i-thbandaverageµiis∑Bi/n,with
standarddeviation.Thecentroid,i.e.centerofmass,is
(mn1,0,mn0,1).Fortheellipsenfit,mwerequirethenormmalizedsam-
1,12,0pleandµ0,2central=mn0,2moments−m0,1µm1,10,1=.nThe−m1,0morientation0,1,µ2,0is=thenn−givmen1,0mb1,0y
87PackedMAXimumSumofPackedAbsoluteSingle-prDifferecisionencesv(Bytealue.toWord).
93

21arctan(µ2,02µ−1,1µ0,2)[145].Toformanequal-areaellipse,wedivide
themomentsbyµ2,0µ0,2−µ1,1µ1,1[146,p.283].Solvingforthe
majorandminoraxesyieldsa+c8d,withd=(a−c)2+4b2
[147].TheAABBisconstructedfromtheX,Ymaximaandthe
differencebetweenthelargestpossiblevalueandtheaccumulated
maximaofthecoordinatescomplements.Finally,ameasureofthe
regionscompactnessisusefulfordifferentiatingraggednatural
structuresfrommoreregularman-madeobjects.Theisoperimetric
quotient4π2nisfrequentlyusedinthiscontext[148].Itsmaximum
of1.0isrLeachedinthecaseofacircle.Toestimatetheperimeter
L,letusreviewthepropertiesofan8-connectedgridgraph.A
regiontouches8nedges,andeachboundarypixelaccountsfor1
to7ofthem.Weassumeanaverageoftwosuchedgesforevery
pixel-widthsegmentalongtheregionsboundary.numEqualis
obtainedbydividingthePSADBWaccumulatorby510,becauseit
isthehorizontalsumofpairsof8-bitmaskhalves,eachofwhich
are0or255.Therefore,L≈8n−2numEqual.
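The refinement of accumulated moments into region features can be sketched in Python as follows. This is an illustrative scalar version, not the vectorized implementation: the function names are hypothetical, the moments are accumulated from an explicit pixel list for brevity, and `math.atan2` is substituted for the plain arctangent of the quotient so that the case µ2,0 = µ0,2 does not divide by zero.

```python
import math

def region_features(pixels):
    """pixels: iterable of (X, Y) coordinates belonging to one region."""
    # Accumulate m[p, q] = sum of X^p * Y^q for p + q <= 2.
    m = {(p, q): 0.0 for p in range(3) for q in range(3) if p + q <= 2}
    for x, y in pixels:
        for (p, q) in m:
            m[(p, q)] += (x ** p) * (y ** q)
    n = m[(0, 0)]                                   # region size
    cx, cy = m[(1, 0)] / n, m[(0, 1)] / n           # centroid
    # Normalized sample central moments.
    mu11 = m[(1, 1)] / n - cx * cy
    mu20 = m[(2, 0)] / n - cx * cx
    mu02 = m[(0, 2)] / n - cy * cy
    orientation = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return {"size": n, "centroid": (cx, cy), "orientation": orientation}

def isoperimetric_quotient(n, num_equal):
    """Compactness 4*pi*n / L^2 with the estimate L = 8n - 2*numEqual."""
    perimeter = 8 * n - 2 * num_equal
    return 4.0 * math.pi * n / perimeter ** 2
```

For a horizontal run of pixels the sketch reports an orientation of zero, and for pixels along the main diagonal it reports π/4, which matches the expected direction of the fitted ellipse's major axis.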

6.6 Performance

We first examine the complexity of the proposed algorithm. Counting sort is O(n). Region merges via Union-Find are effectively O(1)⁹ for all practical input sizes [150]. All other operations are also constant-time and reside in loops with iteration counts in O(n), so the complexity is (quasi-)linear in the input size. Because this also applies to the MSER and GBS algorithms, we must compare their implementations. Table 6.1 lists the performance¹⁰ of each algorithm for a representative 8.19 MPixel subset of a 16-bit,

⁹ We view the inverse Ackermann function as a constant ≤ 5 for n < 10^80. Note that an attempt at replacing Union-Find with a true linear algorithm [149] introduces a constant factor of 8.
¹⁰ Measured on a X5365 CPU (3.0 GHz, 32 GiB FB-DDR2 RAM) running Windows XP x64. Our implementation is compiled with ICC 11.0.066 /Ox /Og /Ob2 /Oi /Ot /fp:fast /GR- /Qopenmp /Qftz /QxSSSE3.

Table 6.1: Performance comparison of various segmentation algorithms.

Algorithm   MPixel/s
MSHLK       N/A
MS          0.09
GBS         0.45
MSER        2.53
PHMSF       12.80

4-component(RGB+NIR)QuickbirdimageofKarlsruhe.Our
PHMSFalgorithmdoesmorework(computingregionfeatures
andprocessingtheoriginalfour-component16-bitpixelsrather
thanan8-bitRGBversion),yetsignificantlyoutperformstheother
algorithms.Inthistestitis138timesasfastasMS[151],28times
asfastasGBS[152]and5timesasfastasoursimilarlyoptimized
implementationofMSER.Notethat(32-bit)MSHLKexhaustedits
addressspaceafterasinglediffusioniteration.OurPHMSFimple-
mentationrequiresmuchlessmemory:theworkingsetisabout
7.1GBfora1.97GBimage,whichequatesto13.5bytes/pixel.Its
parallelspeedupvariesbetween2and3.2whenusingfourcores.
Inthelattercase,sequentialprocessingonlyaccountsfor2%ofpro-
cessingtime;thelimitingfactorismemorybandwidth.RightMark
MemoryAnalyzer[153]measuresreadandwritethroughputsof
roughly3500MB/sand2500MB/sonthissystem.Havingana-
lyzedtheelapsedtimesandminimumamountsofdatathatmust
betransferredto/frommemoryduringthecreditcomputation,
regioncompression/counting/relabelingandfeaturecomputation
phases,wecanconcludethateachisatleast85%efficient.Further
increasesinperformanceorscalabilityarecontingentonadditional
memory bandwidth.

We have therefore measured the performance on our newer dual-CPU system. As shown in Table 6.2, the throughput has improved by a factor of two to four. Our NUMA-aware implementation benefits from the higher memory bandwidth enabled by the system's dual memory controllers. Larger images also offer increased parallelism because tile interiors grow faster than their borders. Note that the largest, near gigapixel-scale image is processed within 20 seconds!

Table 6.2: Performance on large 16-bit satellite images, preprocessed by the pansharpening algorithm of Chapter 5.

Satellite   MPixel   MPixel/s
QuickBird   54       28.6
IKONOS      74       43.2
QuickBird   229      46.2
QuickBird   136      50.4
QuickBird   937      48.3

6.7 Conclusion

We have presented a new (quasi-)linear-time segmentation algorithm that provides useful results at previously unmatched speeds. Applications include automatic wide-area appraisal of the suitability of roofs for solar panels, object-based change detection, environmental monitoring and rapid updates of land-use maps. From an algorithm engineering standpoint, we believe this to be the first non-trivially-parallel segmentation algorithm. Its scalability is chiefly limited by the memory bandwidth of current SMP systems. Future work includes statistical estimation of the edge weight thresholds and efficiently computing a segment neighborhood graph. We are also interested in applying this algorithm towards segment-based fusion of high-resolution electro-optical and hyperspectral imagery.

Chapter 7

Antialiased Line Rasterization

This chapter presents an efficient, high-quality software line rasterizer for annotating very large images with segment contours. Although many fast line drawing algorithms are known, most produce thin and jagged lines due to aliasing. Wu's algorithm includes a crude approximation of antialiasing, which still includes noticeable step edges. Even hardware multisampling cannot entirely eliminate aliasing. Instead, the proper solution is to remove high-frequency components by pre-filtering the lines. We improve upon previous ad-hoc filters by deriving the optimal (in the sense of minimizing aliasing) cubic polynomial filter. When combined with our new, optimized variant of the Gupta-Sproull line drawing algorithm, this outperforms Wu's fast approximation while delivering much higher-quality results.

A preliminary version of this chapter appeared in the proceedings of the Fourth Pacific-Rim Symposium on Image and Video Technology [154].

7.1 Introduction and Related Work

Scan-converting line segments for raster displays or images is a basic building block of many computer-based graphics tasks. One application involves plotting the contours of image segments to aid human recognition of man-made objects. Current CPUs can easily annotate high-definition video frames, but the timely processing of gigapixel-scale imagery remains an interesting challenge. GPUs cannot yet handle such large amounts of data due to texture dimension and memory size¹. We therefore consider software line drawing approaches from the literature.

¹ The 4 GiB memory limit on current GPUs is due to DRAM density and interface width. It can be doubled by means of the recent GDDR5 standard's clamshell mode [10], but still falls far short of the 192 GiB available to commodity workstations.

Fast Line Drawing Algorithms

Bresenham's Midpoint algorithm [155] is the foundation of most subsequent line-drawing schemes. The Digital Differential Analyzer is similar, but avoids conditional branches, which are expensive given the deep pipelines of modern CPUs. Several further attempts have been made to speed up the underlying algorithm. Gardner [156] and Boyer/Bourdin [157] take advantage of symmetry by simultaneously drawing from both ends of the line segment. Although the iteration count is halved, this leads to more complex memory access patterns, which may be problematic for hardware prefetchers. Rokne [158] additionally considers two pixels at a time, again halving the iteration count at the expense of many mispredicted conditional branches. Bresenham's run-length slice algorithm [159] avoids redundant per-pixel decisions by computing the length of horizontal pixel runs. However, special cases for every possible run-length [160] would greatly increase the code size. These optimizations appear to be intended for long lines, but a survey of applications [161] has found that 87% of line segments are less than 17 pixels long. This suggests favoring simple main loops over complex strategies for reducing the iteration counts. With regard to output quality, all of the above algorithms produce thin lines with 'jaggies' (a stairstep effect due to aliasing).
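To make the discussion of Bresenham-style main loops concrete, the following is a minimal sketch of the integer midpoint scheme, restricted to the first octant. The function name and the list-of-pixels return value are illustrative assumptions and are not taken from the cited implementations.

```python
def bresenham_first_octant(x0, y0, x1, y1):
    """Return the pixels of a line with 0 <= slope <= 1 (first octant only)."""
    dx, dy = x1 - x0, y1 - y0
    assert 0 <= dy <= dx, "this sketch only handles the first octant"
    d = 2 * dy - dx              # integer decision variable (discriminator)
    y = y0
    pixels = []
    for x in range(x0, x1 + 1):  # one pixel per major-axis coordinate
        pixels.append((x, y))
        if d > 0:                # line passed above the midpoint: step in y
            y += 1
            d += 2 * (dy - dx)
        else:                    # stay on the same scanline
            d += 2 * dy
    return pixels
```

The data-dependent `if` in the loop body is the conditional branch that the text identifies as expensive on deeply pipelined CPUs.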

Antialiasing

Antialiasing is desirable because it removes spurious information and enables subpixel accuracy localization by the human visual system [162]. The cause of aliasing is shown by the sampling theorem, which indicates a function may be faithfully reconstructed from samples spaced 1/(2f_N) apart if it has no energy in frequencies ≥ f_N. Otherwise, the higher frequencies are aliased to lower frequencies. There are three ways to mitigate this [163]. Pre-filtering the image prior to reconstruction can reduce the effects of aliasing, at the cost of losing detail and sharpness. However, we are not willing to presuppose specific reconstruction filters [164] for the monitor/printer/eye. Sampling at a higher resolution is exemplified by hardware multisampling, but has practical limits and cannot entirely avoid aliasing. Instead, pre-filtering the continuous objects prior to sampling is the most promising route.

Wu's antialiasing technique [165] involves shading pairs of pixels straddling a line in proportion to their vertical distance from the line. This corresponds to a box filter, a crude approximation of the requisite low-pass filter that allows some high frequencies to pass through [166]. However, the algorithm has found widespread use due to its simplicity and speed, and efficient implementations [167] using fixed-point arithmetic are available.

Gupta and Sproull (GS) [168] propose low-pass filtering with a conical point-spread function (PSF). Being radially symmetric, its convolution with a line only depends on the perpendicular distance to the line. The distance is incrementally computed by an algorithm similar to Bresenham's, and the result of the convolution retrieved from a small lookup table. This framework is useful because it allows antialiasing with any radially symmetric PSF at little additional cost. However, it is unclear why a conical PSF was chosen; perhaps the numerical integration of a more complex function was too expensive at the time. The use of ad-hoc PSFs is also exemplified by more recent GPU-based prefiltering approaches [169, 170, 171] using conical, Gaussian and exponential PSFs. We point out their weaknesses and derive an optimal PSF (in the sense of minimizing aliasing) in Section 7.4.

Chen [172] suggests a variant of the GS algorithm that supports floating-point endpoint coordinates, which do not arise in our application, and slightly accelerates the main loop by computing perpendicular distances via trigonometry.

We describe further major optimizations that result in a 24.6-fold speedup in Section 7.2. Our implementation therefore outperforms Wu's fast approximation according to the measurements in Section 7.3. However, the new PSF yields much higher-quality lines, as shown by Section 7.5.

7.2 Algorithm

We begin with Chen's [172, p. 23] improved version of GS (Algorithm 7.1). The underlying assumption that lines reside in the first octant can be avoided by transposing/mirroring. To avoid redundant pointer arithmetic, we combine the x, y arguments of IntensifyPixel (defined in Section 7.4) into a current-position pointer; incrementing y is accomplished by adding pitch (the size of a scanline). Expensive bounds checks for every pixel are avoided by special-casing horizontal and vertical lines and otherwise disallowing points lying on the image border. Our main improvement is avoiding the mispredicted conditional branch in line 10 by using a bitmask derived from the sign of the discriminator d to select between possible summands for d and D (the signed perpendicular distance from the line, c.f. Algorithm 7.1). In fact, the common subexpressions allow unconditionally adding the first term 2Δy to d (∈ Z) and then subtracting (2Δx) & mask. Doing the same for D is safe because the IEEE-754 floating-point representation of 0.0 is all zeros. Negating the discriminator d allows obtaining the mask via signed right shift, which replicates the sign bit. We use SSE's fast but approximate reciprocal square root instruction to compute 1/√(Δx² + Δy²). For details, please refer to the C++ source code [173].

Algorithm 7.1: DrawLineChen(x0, y0, x1, y1)
 1  x := x0; y := y0; Δx := x1 − x0; Δy := y1 − y0;
 2  d := 2Δy − Δx;                        // discriminator
 3  D := 0;                               // signed perp. distance
 4  (sin α, cos α) := (Δy, Δx) / √(Δx² + Δy²);
 5  while x ≤ x1 do
 6      IntensifyPixel(x, y − 1, D + cos α);
 7      IntensifyPixel(x, y, D);
 8      IntensifyPixel(x, y + 1, D − cos α);
 9      x := x + 1;
10      if d ≤ 0 then
11          D := D + sin α;
12          d := d + 2Δy;
13      end
14      else
15          D := D + sin α − cos α;
16          d := d + 2(Δy − Δx);
17          y := y + 1;
18      end
19  end

These low-level optimizations are specific to the SSE instruction set and require arithmetic bit shifts. However, both are supported by a large proportion of current and future computer systems, and the overall 24.6-fold speedup (see Section 7.3) may be the decisive factor in determining the feasibility of this algorithm for demanding applications.
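The mask-based replacement for the branch in line 10 of Algorithm 7.1 can be sketched as follows. This is an illustration in Python rather than the SSE code of [173]: the helper names are hypothetical, the 63-bit shift assumes |d| < 2⁶³, and the bit-level AND on a double is emulated with the standard struct module.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def float_and(value, mask):
    """AND the IEEE-754 bit pattern of a double with a 64-bit mask."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return struct.unpack("<d", struct.pack("<Q", bits & (mask & MASK64)))[0]

def step_branching(d, D, dx, dy, sin_a, cos_a):
    """Lines 10-18 of Algorithm 7.1; returns (new d, new D, y increment)."""
    if d <= 0:
        return d + 2 * dy, D + sin_a, 0
    return d + 2 * (dy - dx), D + sin_a - cos_a, 1

def step_branchless(d, D, dx, dy, sin_a, cos_a):
    """Same update without a data-dependent branch."""
    # Arithmetic right shift of -d replicates its sign bit:
    # mask is -1 (all ones) if d > 0, and 0 otherwise.
    mask = (-d) >> 63
    new_d = d + 2 * dy - ((2 * dx) & mask)
    # An all-zero mask yields the bit pattern of 0.0, so nothing is subtracted.
    new_D = D + sin_a - float_and(cos_a, mask)
    return new_d, new_D, mask & 1
```

Both functions return identical results for every sign of d; only the second maps onto straight-line SIMD code.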

7.3 Performance

The complexities of the GS variants and Wu algorithm are linear, because each coordinate on the major axis is visited exactly once and all operations are O(1). However, their constant factors vary according to the number of pixels shaded and the efficiency of the loop bodies. These effects are best observed by measuring² the time required to draw many long lines, thus de-emphasizing function call and setup overhead. Table 7.1 shows the resulting fillrates when drawing 64 Ki parallel lines (sorted by increasing y coordinate) with slope ≈ −1/8 and length ≈ 8 Ki.

Table 7.1: Performance (peak fillrate) of various line rasterizers.

Algorithm                   MPixel/s
Original GS (Table)         107
Parallel GS (Table)         847
Wu (2 pixels)               1898
Optimized GS (Table)        2387
Optimized GS (Polynomial)   2634

Note the large ratio of 24.6 between the original (Chen's improved variant of the GS algorithm) and our final optimized version. Shared-memory parallelization achieves a nearly linear speedup for all algorithms (processors can draw lines independently unless they write to the same cache line, in which case hardware cache coherency incurs some overhead). A careful implementation [167] of Wu's simple line drawing algorithm is 2.2 times as fast, because it only requires a few fixed-point operations per loop and shades two instead of three pixels. However, our optimized variant of GS is even faster, outperforming the original version by a factor of 2.8 and Wu's algorithm by 1.3. Its performance is on par with the fillrate of a mid-range GPU (NVIDIA GeForce 9600 GT) [174].

² Test platform: dual W5580 CPUs (3.2 GHz, 48 GiB RAM) running Windows XP x64. Compiler: ICC 11.1.082 /Ox /Og /Ob2 /Oi /Ot /Qipo /GA /MD /GS- /fp:fast=2 /GR- /Qopenmp /QxSSE4.1 /Quse-intel-optimized-headers.

Table lookup versus arithmetic

Interestingly, the final version of our implementation is an additional 10% faster due to SIMD-based evaluation of the cubic polynomial. This result deserves closer analysis, because conventional wisdom suggests that (small) lookup tables outperform arithmetic. The dependency chain of a Horner scheme ((h3·x + h2)·x + h1)·x + h0 involves three additions and three multiplications. These instructions have had fairly consistent latencies of 3 or 4 cycles in the x86 microarchitectures of the past 10 years [56], for a total of ≈ 24 cycles. This is in contrast to a table lookup that only requires a multiplication, rounding/truncation and load. Whereas memory latency continues to increase with respect to the CPU clock [42], a small, frequently accessed table can be assumed to reside in the L1 cache. The total latency is therefore on the order of ≈ 12 cycles. A first attempt to close this gap might involve vector instructions to speed up the computation of ⟨(h0, h1, h2, h3)ᵀ, (1, x, x², x³)ᵀ⟩. However, the high latency of the SSE4.1 instruction set's horizontal dot product erodes any benefits. To realize the full potential of SIMD, the application must compute several independent results in parallel. When amortized over the four operations per SSE instruction, each evaluation of the polynomial only requires 6 cycles. In this case, we are limited to the three pixels straddling the line, because the computation of subsequent pixels requires different operands. In general, we recommend replacing table lookups with (e.g. cubic) interpolation polynomials whenever multiple independent results can be computed in parallel.
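The difference between one serial Horner chain and several independent chains evaluated side by side can be sketched as below. Plain Python has no SIMD registers, so the tuple of "lanes" only mirrors the data layout; the function names and the coefficient ordering (h0, h1, h2, h3) are assumptions made for illustration.

```python
def horner(coeffs, x):
    """Evaluate h0 + h1*x + h2*x^2 + h3*x^3 as one serial dependency chain."""
    h0, h1, h2, h3 = coeffs
    return ((h3 * x + h2) * x + h1) * x + h0

def horner_lanes(coeffs, xs):
    """Evaluate the same cubic for several independent inputs in lockstep.

    Each loop iteration corresponds to one multiply and one add applied to
    all lanes at once, as a vector instruction pair would do.
    """
    h0, h1, h2, h3 = coeffs
    acc = tuple(h3 for _ in xs)
    for c in (h2, h1, h0):
        acc = tuple(a * x + c for a, x in zip(acc, xs))
    return acc
```

With three lanes holding the distances of the three pixels straddling the line, the length of the dependency chain is unchanged, but its latency is shared among the lanes.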

7.4 Optimal Antialiasing

It remains to be seen how IntensifyPixel computes a pixel's intensity as a function of r, the distance from the line. The antialiasing framework of Section 7.1 calls for convolving the line L(x, y) with a radially symmetric PSF h(r). Because the line's orientation does not affect h, we can assume a vertical line L(x, y) = δ(x − r). Under the common assumption that pixels are regularly-spaced infinitesimal points, the line's influence on them is

    \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} L(x, y)\, h\left(\sqrt{x^2 + y^2}\right) dx\, dy        (7.1)
    = \int_{-\infty}^{\infty} h\left(\sqrt{r^2 + u^2}\right) du                                              (7.2)

Following Turkowski [175], we refer to this function as the radial line transformation RLT(r). As explained in Section 7.3, approximating it with a cubic polynomial allows for efficient computation. We therefore integrate numerically for 1000 uniformly spaced values of r between 0 and our application's maximum distance R = √2 and compute the least-squares fit. This yields the function

    RLT(r) = 0.5344 r^3 − 1.4886 r^2 + 0.0086 r + 1.0014        (7.3)

for use in the modified GS scheme (Algorithm 7.2).

Algorithm 7.2: IntensifyPixel(x, y, r)
    intensity := 210 × RLT(|r|);
    SetPixel(x, y, intensity);

Note that the intensity remains well within its 8-bit range despite RLT(0) exceeding 1.0 because the chosen scaling factor of 210 is fairly low (we find overly bright lines subjectively less appealing). For reasons of efficiency, there is currently no special handling of overlapping lines by blending or setting a pixel to the maximum of the previous and current intensity.
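Algorithm 7.2 translates almost directly into code. The sketch below uses the coefficients of Equation 7.3 and the scaling factor 210 from the text; the dictionary framebuffer and the rounding and clamping to the 8-bit range are our own illustrative assumptions and are not described in the source.

```python
RLT_COEFFS = (1.0014, 0.0086, -1.4886, 0.5344)  # h0..h3 of Equation 7.3
SCALE = 210                                     # scaling factor from the text

def rlt(r):
    """Cubic approximation of the radial line transformation (Equation 7.3)."""
    h0, h1, h2, h3 = RLT_COEFFS
    return ((h3 * r + h2) * r + h1) * r + h0

def intensity(r):
    """Unrounded pixel intensity for signed distance r (Algorithm 7.2)."""
    return SCALE * rlt(abs(r))

def intensify_pixel(framebuffer, x, y, r):
    """Store the intensity; rounding and clamping are assumed here."""
    framebuffer[(x, y)] = max(0, min(255, round(intensity(r))))
```

A pixel lying exactly on the line (r = 0) receives 210 × 1.0014 ≈ 210.3, which is why the scaled value stays below 255.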

We now derive the optimal polynomial (optPoly) PSF h that was used to compute the above RLT(r). The ideal low-pass filter multiplies a function's Fourier transform by a rectangle function, which corresponds to convolution with sinc(x) := sin(πx)/(πx). This is not possible in practice due to its infinite support, and truncating it yields a function whose Fourier transform has considerable ripples in the passband [166]. Another means of constructing a low-pass filter involves minimizing the aliasing energy [164]

    \int_{-\infty}^{-\Omega} |F(\omega) H(\omega)|^2\, d\omega + \int_{\Omega}^{\infty} |F(\omega) H(\omega)|^2\, d\omega        (7.4)

for Ω = 1 [π/pixel], a filter h(r), the image f(ξ) and their respective Fourier transforms H(ω) and F(ω). The prolate-spheroidal wave function is known to concentrate its energy within a minimal interval [−Ω, Ω]² in the spectral domain [176]. However, these functions are difficult to compute and have negative side lobes, which is problematic because negative pixels cannot be represented by current display technology. Barkans [177] proposes a positive bias to allow for pixels darker than the (gray) background, but notes that this workaround reduces the contrast. Clipping negative values incurs ringing [164], and an iterative scheme for diffusing the resulting error to neighboring pixels [178] is too slow. We currently only render white-on-black lines, but wish to leave open the possibility of drawing in color via alpha-blending and therefore require a nonnegative filter kernel. The existence of minimax polynomials with arbitrarily low approximation error [179] motivates restricting our analysis to this simpler class of functions. We build upon the energy concentration approach of Lin et al. [180], which uses the method of Lagrange multipliers for maximization. Please refer to the Mathematica scripts [173] for details. Our more accurate numerical integration yields a difference of 6% vs. the values given for a circular filter with radius R = 2. The largest eigenvalue indicates 99.25% of the filter's energy is concentrated in the lower frequencies, which justifies the simplifying assumption of a non-negative cubic polynomial. However, instead of the stated normalization criterion, we ensure \int_{-R}^{R} h(r)\, dr = 1; the corresponding author has confirmed this was also their intention. For R = √2, we obtain the normalized function

    h(r) = 0.2824 r^3 − 0.6819 r^2 + 0.0120 r + 0.5999        (7.5)

This function is plotted alongside other ad-hoc PSFs in Figure 7.1, and their respective RLTs are shown in Figure 7.2. The cone's RLT (a) falls off rather quickly, leading to thinner lines (c.f. Section 7.5). The Exp2 function (b) has an undesirable rise near the distance cutoff. Mitchell and Netravali's cubic polynomial (c) admits negative values, which is unacceptable per the discussion above.
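The normalization criterion and the non-negativity requirement stated above can be checked numerically for the coefficients of Equation 7.5. The sketch below assumes h is evaluated on |r| and is zero outside [−R, R]; the midpoint-rule integrator and the step count are arbitrary illustrative choices.

```python
import math

H_COEFFS = (0.5999, 0.0120, -0.6819, 0.2824)  # constant..cubic, Equation 7.5
R = math.sqrt(2.0)                            # filter radius from the text

def h(r):
    """Radially symmetric PSF: cubic in |r| inside the support, else zero."""
    r = abs(r)
    if r > R:
        return 0.0
    c0, c1, c2, c3 = H_COEFFS
    return ((c3 * r + c2) * r + c1) * r + c0

def integrate(f, a, b, steps=20000):
    """Composite midpoint rule on [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))

area = integrate(h, -R, R)  # close to 1 if the normalization holds
```

With the rounded four-digit coefficients the integral comes out within about 10⁻³ of 1, and h stays positive up to the cutoff.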

[Figure 7.1 panels: (a) Cone [168]; (b) Exp2 [169]; (c) MN 1/3, 1/3 [181]; (d) h(r) (New)]
Figure 7.1: Our optimal filter polynomial and other ad-hoc kernels; note the differing (application-defined) domains.

[Figure 7.2 panels: (a) Cone [168]; (b) Exp2 [169]; (c) MN 1/3, 1/3 [181]; (d) h(r) (New)]
Figure 7.2: Radial line transforms (RLT) for the above filter kernels.

7.5 Results

Evaluating the quality of an antialiasing scheme would ideally be accomplished by comparing the reconstruction of point samples to the original continuous object representation. However, the reconstruction filter depends on the particular output device. Comparing the sampled points to a supersampled output requires a decimation filter, the choice of which is also tied to the reconstruction. In addition, a perceptual similarity metric remains elusive. We therefore resort to a survey among the research staff. When presented with the randomly ordered algorithm results³ at three zoom scales (Figure 7.3), 4 of 39 preferred the Wu lines, 11 favored GS, and the rest (61%) voted for the proposed approach. In a direct comparison, 25 of 33 respondents described our line as thicker or darker than GS. 14 perceived it to be smoother or more uniform and 2 reported the opposite. Conversely, 9 of 30 indicated GS was thinner, and 8 noticed more jaggies. Our line is perceived as more uniform because its gaps between the middle pixels are less severe, as shown by the difference image in Figure 7.4. The maximum deviation of 45 gray levels is reached at the edges and is caused by our line's increased thickness. Although beneficial for a subsequent segmentation (contours are more likely to be closed), the corresponding blurriness might be deemed detrimental to the human vision system's positional acuity. However, this is not the case: intensity gradients are in fact the basis for sub-pixel object localization [162]. It is therefore natural to consider the number of distinct gray levels, of which ≈ 64 may be distinguished [182]. The GS approach is obviously limited by its 24-entry table. Wu's algorithm generates 38 values, whereas our wider kernel and floating-point arithmetic allow for 55 values, thus explaining the increased smoothness of the resulting lines. Additional results³ for various slopes are shown in Figure 7.5.

³ Pixels are inverted for better visualization on white backgrounds.

Figure 7.3: The results of the Wu, GS and optPoly algorithms at 1x, 2x and 4x magnification with nearest-neighbor resampling.

Figure 7.4: GS result subtracted from the optPoly output; darker pixels indicate larger differences.

[Figure 7.5 panels: (a) Wu [165]; (b) GS [168]; (c) optPoly]
Figure 7.5: Results for lines with slopes ±1, ±1/3, ±3, 0, ∞.

7.6 Conclusion

This chapter has described a highly-optimized variant of the Gupta-Sproull line drawing algorithm. Its value lies in outperforming even Wu's fast approximation algorithm while enabling high-quality antialiasing, which can reduce eye strain when analyzing large datasets.

An analysis of convolution with an ideal line has demonstrated the flaws of commonly used ad-hoc point spread functions. We instead derive an optimal polynomial filter (in the sense of minimizing aliasing) and show the resulting improvement in quality. The filter kernel is equally applicable towards CPU and GPU-based algorithms. Interestingly, our software implementation's throughput reaches the fillrate of a mid-range GPU. This is made possible by SIMD operations, which are now widely available and invalidate some previous design and implementation tradeoffs (e.g. table lookups vs. arithmetic).

Applications of the new, highly efficient algorithm include annotating gigapixel-scale images with segment contours to aid human recognition of man-made objects, or plotting the many productions of the GESTALT system [183]. To ease its adoption and allow for reproducing our results, the source code [173] is being made available.

Future work may involve special-case handling of the line endpoints, and using blending to avoid artifacts in overlapping lines.

Chapter 8

Synthetic Aperture Radar

We have considered the problem of automatically screening for man-made objects (MMO) in infrared (IR) videos and synthetic aperture radar (SAR) imagery. Because such objects are often highly reflective in SAR and distinctive in IR, both problems can be reduced to finding point-like objects. Thresholding (usually locally adaptive) only utilizes the radiometric information and ignores the maximum object size, which means reflection artifacts or large regions often cause false alarms, that is, reporting a point-like object where none exists. Recently, a level-set approach has been proposed that takes speckle (multiplicative noise in SAR images) into account and reliably separates targets from the background [184]. However, its computational cost is almost certainly too high for large datasets or real-time video analysis. An alternative model called the hotspot transform was developed for IR Search and Track applications [185]. This operator (defined in Section 8.1) searches for local maxima that are entirely surrounded by a ring of darker pixels, thus suppressing bright but non-point-shaped regions. Its computational cost for n pixels and maximum target size R is O(nR²). We believe this technique to be suitable for screening in both IR and SAR data and have developed a novel algorithm that reduces its complexity to the lower bound of O(nR). Our sophisticated implementation, described in Section 8.2, reuses previously computed intermediate results, ensures the working set fits in caches via pipelining, and achieves an additional 27-fold speedup via vectorization and parallelization. The attained processing rate of 72 MPixel/s on a single workstation enables screening entire satellite datasets within seconds (c.f. Section 8.4). Results are given for airborne SAR images in Section 8.3. The algorithm is suitable for detection of MMO and as a pre-processing step for multi-class target recognition via support vector machine (SVM).

An earlier version of this chapter was presented at the Advanced Maui Optical and Space Surveillance Technologies conference [186].

8.1 Hotspot Operator

The hotspot operator for extracting point-like regions and suppressing background pixels was introduced in [185]. Because the point texture and shape are generally highly variable, template-based pattern matching cannot be applied. Instead, the hotspot model considers interest points to be pixels that are (without loss of generality) brighter than their surroundings. With the point size unknown (bounded only by a maximum), we consider multiple neighborhoods of concentric square shells

    S(x_c, y_c, r) = \{\, I(y, x) \mid \|(x_c, y_c) - (x, y)\|_\infty = r \,\}

centered on the pixel I(yc, xc) in the image I. Their maximum pixel values are compared with the central pixel. Negative differences indicate the pixel is surrounded by uniformly darker pixels, thus attesting to a point region within that shell. The hotspot transformation is defined by the largest of these values for all shells up to a maximum radius R (clamping negative values to zero):

    \mathrm{hotspot}(x_c, y_c) = \max\left( I(y_c, x_c) - \min_{r=1}^{R} \max S(x_c, y_c, r),\ 0 \right)

This operator suppresses background pixels and thus enhances freestanding point-like regions as desired. It is simple and intuitive, requiring no parameters other than R, which is defined by the maximum object size and sensor resolution. Unfortunately a naïve implementation has complexity proportional to R². This can be improved by taking advantage of a property of the minimum and clamping operations shown in Lemma 8.1:

    \exists\, b \in S(x_c, y_c, r):\ b > I(y_c, x_c) \ \Rightarrow\ \mathrm{hotspot}(x_c, y_c) = 0 \ \lor\ \mathrm{minMax}(x_c, y_c) < b \le \max S(x_c, y_c, r)        (8.1)

If a shell contains a pixel brighter than the central pixel, then it will not affect the hotspot value and the rest of its pixels can be skipped. This has been observed to be 18 times as fast as the original implementation, although the exact speedup depends on the data. Whereas the worst-case quadratic complexity remains unchanged, it is difficult to construct such inputs and they will certainly not be encountered in practice. A drawback of this algorithm is that it cannot make effective use of vectorization due to its reliance on conditional branches. Accumulating shell maxima via 16-way SIMD only resulted in a speedup of two due to unaligned memory access penalties and the overhead of copying ranges into registers.
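The definition of the hotspot operator and the shell-skipping shortcut of Lemma 8.1 can be sketched for a single pixel as follows. The function names are illustrative, the image is assumed to be a list of rows, and the center is assumed to lie at least R pixels from every border; this is not the implementation measured in the text.

```python
def shell_pixels(img, yc, xc, r):
    """Yield the 8r pixels at Chebyshev distance r from (yc, xc)."""
    for x in range(xc - r, xc + r + 1):      # top and bottom edges
        yield img[yc - r][x]
        yield img[yc + r][x]
    for y in range(yc - r + 1, yc + r):      # left and right edges, no corners
        yield img[y][xc - r]
        yield img[y][xc + r]

def hotspot_naive(img, yc, xc, R):
    """Direct evaluation: examines every pixel of every shell."""
    min_max = min(max(shell_pixels(img, yc, xc, r)) for r in range(1, R + 1))
    return max(img[yc][xc] - min_max, 0)

def hotspot_skip(img, yc, xc, R):
    """Skip-shell variant: abandon a shell at its first brighter pixel."""
    center = img[yc][xc]
    min_max = float("inf")
    for r in range(1, R + 1):
        shell_max = float("-inf")
        for p in shell_pixels(img, yc, xc, r):
            if p > center:       # Lemma 8.1: this shell cannot matter
                break
            shell_max = max(shell_max, p)
        else:                    # no brighter pixel: shell is a candidate
            min_max = min(min_max, shell_max)
    return max(center - min_max, 0)
```

If every shell is skipped, `min_max` stays infinite and the clamp yields 0, which matches the direct evaluation because every shell maximum then exceeds the center.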

8.2 Algorithm

We will now build upon related theoretical work to engineer a new and improved algorithm for computing hotspots.

Recall the computation of the maximum of the 8r pixels that constitute a shell of radius r. Given a transposed copy of the image, this operation can be reduced to four Range Maximum Queries RMQ(i, j) = max_{k=i..j} A[k] in an array or image row/column A. Alon and Schieber have shown that such queries (generalizable to any semigroup) can be answered in O(1) time after near-linear time preprocessing [187]. The hotspot operator's complexity is therefore bounded by O(n log n + nR), a significant improvement versus the previous algorithm's O(nR²) cost.

We refer to [188] for a complete presentation of the RMQ algorithm. The basic idea is to pre-calculate the maxima of power-of-two intervals. Each query can be split into two (possibly overlapping) intervals; the result is the larger of the two maxima. Katriel et al. suggest an efficient scheme for preprocessing that computes prefix and suffix maxima and interleaves them into a single array [189]. This only requires O(n log R) preprocessing time and space, because the query lengths are bounded by 2R + 1. Bender and Farach-Colton also describe a scheme that first divides the input array into blocks of size O(log n) [188]. This reduces the preprocessing time to O(n) at the price of more complicated queries with separate handling of inter- or intra-block queries. Fischer and Heun have recently introduced a similar succinct algorithm with optimal space requirements [190], but its queries are also too expensive in practice.

A disadvantage shared by all of these RMQ-based approaches is their mediocre locality: both the interval length and the query indices affect the location of the preprocessed value, which makes for non-sequential accesses. One alternative would be to cast the hotspot operator as a stencil computation, maintaining four separate maximum accumulators for overlapping left, right, up, down intervals. Hotspot values would be computed as the maximum of these shell components, thus achieving the desired and optimal complexity of O(nR). A disadvantage of this method lies in its high space requirements.

To bridge the gap between the redundant calculations of the existing method and the practical costs of theoretically motivated approaches, we have engineered a new algorithm that combines ideas from RMQ and stencil computation. The first key change is to store only a single set of row- and column interval maxima. These are used to generate all shells of a certain range of sizes and are then combined in-place to yield intervals of twice the length. Besides folding preprocessing into the main algorithm and reducing memory use, this also improves locality. The second important step is to organize the algorithm as a pipeline so that the working set fits entirely into common L2 caches. We iterate over image rows exactly once; starting from the current row, previously calculated interval maxima of successively increasing lengths are used to compute the shells for previous rows. The resulting tentative shell maxima are accumulated into the output buffer. This principle is illustrated by Figure 8.1. Because only the last 4R + 2 rows are accessed, a cache of that size can entirely absorb the cost of repeated accesses. Algorithm 8.1 gives an overview of computing the hotspot image H. The actual transformation occurs in Algorithm 8.2, which builds upon Algorithm 8.3 for finding the maximum value on a given shell in constant time. Algorithm 8.4 then combines interval maxima to double their lengths.

[Figure 8.1 diagram labels: iterate over rows; wavefront; 1: read IM; 2: update minMaxima; 3: combine IM]
Figure 8.1: Pipelined iteration loop (wave) over rows: read interval maxima, use them to update the central row's minMaxima, and then combine the oldest (no longer needed) interval maxima.
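The basic RMQ idea mentioned above (pre-calculated maxima of power-of-two intervals, with each query split into two possibly overlapping intervals) can be sketched as a sparse table. This is the textbook O(n log n) preprocessing variant and is shown only for orientation; it is not the pipelined scheme of this chapter, and the function names are our own.

```python
def build_sparse_table(a):
    """table[k][i] holds max(a[i : i + 2**k]) for every valid start i."""
    table = [list(a)]
    k = 1
    while (1 << k) <= len(a):
        prev = table[-1]
        half = 1 << (k - 1)
        # An interval of length 2**k is the union of two halves of 2**(k-1).
        table.append([max(prev[i], prev[i + half])
                      for i in range(len(a) - (1 << k) + 1)])
        k += 1
    return table

def rmq(table, i, j):
    """Maximum of a[i..j] (inclusive) from two overlapping table entries."""
    k = (j - i + 1).bit_length() - 1      # largest power of two <= length
    return max(table[k][i], table[k][j - (1 << k) + 1])
```

The table index depends on both the query length and its position, which illustrates the non-sequential access pattern criticized in the text.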

ShellMax{4,8} computes the maximum pixel value on a shell from row- and column interval maxima, as shown in Figure 8.2. In this case, r = 2 and IL = 4. Because a radius-r shell consists of 8r pixels and interval lengths are powers of two, it is easy to see that this scheme applies to all shells of radius r = 2^i (i ∈ N0). Each of the remaining R − log2 R shells requires eight interval maxima: their four sides are pieced together from the maxima of two overlapping intervals.

Figure 8.2: Assembling a shell from four 1-D intervals (shown for r = 2).
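For a shell of radius r = 2^i, the assembly from four intervals of length IL = 2r can be sketched as below. The pinwheel arrangement of the four intervals (each covering one edge and one corner) is our assumption about Figure 8.2, and the array names RM and CM follow Algorithm 8.4; intervals that would leave the image are simply clipped in this sketch.

```python
def combine_interval_maxima(rm, cm, il):
    """Double the interval length from il to 2*il in place (cf. Algorithm 8.4)."""
    height, width = len(rm), len(rm[0])
    for y in range(height):
        for x in range(width):
            if x + il < width:            # row interval starting at (y, x)
                rm[y][x] = max(rm[y][x], rm[y][x + il])
            if y + il < height:           # column interval starting at (y, x)
                cm[y][x] = max(cm[y][x], cm[y + il][x])

def interval_maxima(img, length):
    """Row (RM) and column (CM) maxima over `length` pixels, length = 2**k."""
    rm = [row[:] for row in img]
    cm = [row[:] for row in img]
    il = 1
    while il < length:
        combine_interval_maxima(rm, cm, il)
        il *= 2
    return rm, cm

def shell_max4(rm, cm, yc, xc, r):
    """Maximum of the radius-r shell from four intervals of length 2r."""
    return max(rm[yc - r][xc - r],        # top edge including top-left corner
               cm[yc - r][xc + r],        # right edge including top-right corner
               rm[yc + r][xc - r + 1],    # bottom edge including bottom-right corner
               cm[yc - r + 1][xc - r])    # left edge including bottom-left corner
```

The four intervals are disjoint and together cover 4 · 2r = 8r pixels, so no shell pixel is missed or counted from outside the shell.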

Algorithm 8.1: Hotspot(I → H)
    for (x, y) do minMax[y, x] := ∞;
    MinMaxima(I);
    for (x, y) do
        H[y, x] := max(I[y, x] − minMax[y, x], 0);
    end

Algorithm 8.2: MinMaxima
    // Compute length 2 interval maxima
    RM := I, CM := I;
    for y := 1 to height do CombineIntervalMaxima(y, 1);
    // Pipelined iteration over rows
    for wavefront := 1 to height do
        row := wavefront;
        for L := 1 to log2 R do
            IL := 2^L;    // intervalLength
            for x := 1 to width do ShellMinMaxima((row, x), IL);
            oldestRow := row − IL/2;
            CombineIntervalMaxima(oldestRow, IL);
            row := oldestRow − 2·IL;
        end
    end

Algorithm 8.3: ShellMinMaxima
input: pos, IL
    // Compute min S for interval maxima of length IL
    minMax[pos] := min(minMax[pos], ShellMax4(pos, IL));
    for r := IL/2 + 1 to IL − 1 do
        minMax[pos] := min(minMax[pos], ShellMax8(pos, r));
    end

Algorithm 8.4: CombineIntervalMaxima
input: y, IL
    for x := 1 to width do
        RM[y, x] := max(RM[y, x], RM[y, x + IL]);
        CM[y, x] := max(CM[y, x], CM[y + IL, x]);
    end
    // Postcondition: IL now doubled

Analysis. Our new scheme requires 2n values of auxiliary storage for the row- and column interval maxima. Because the inputs are copied there and not used afterwards, their storage can be reused for accumulating the minMaxima outputs. The pipelined nature of the algorithm enables a further reduction to 4R + 2 rows by organizing them as a sliding window, but that would require computing the row's position within the window during every access.

We now examine the running time of the algorithm, which is somewhat obscured due to the four nested loops: height × log2 R × width × numIM(IL). Note that rearranging their order is possible because the innermost loop does not depend on width, so we combine that and height into a factor n. The number of interval maxima accesses is defined by ShellMinMaxima: numIM(IL) = 4 + 8(IL/2 − 1) = 4·IL − 4, so:

    \mathrm{timePerPixel} = \sum_{L=1}^{\log_2 R} \left( 4 \cdot 2^L - 4 \right) = O(R)

The total complexity is therefore O(nR), which is optimal because the transformation must examine each shell and pixel.
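The O(R) bound on the per-pixel cost follows from a geometric series. As a worked step, assuming R is a power of two so that log2 R is an integer:

```latex
\sum_{L=1}^{\log_2 R} \left( 4 \cdot 2^{L} - 4 \right)
  = 4 \left( 2^{\log_2 R + 1} - 2 \right) - 4 \log_2 R
  = 8R - 8 - 4 \log_2 R
  = O(R)
```

For example, R = 32 gives 4(2 + 4 + 8 + 16 + 32) − 20 = 228 interval maxima accesses per pixel, in agreement with 8 · 32 − 8 − 4 · 5 = 228.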

Further Improvements. Although the new algorithm is asymptotically optimal, there remains significant room for improvement. The Random-Access Machine (RAM) model underlying typical complexity measures has the virtue of simplicity but often mis-characterizes real-world performance [191]. With cache misses now two orders of magnitude more expensive than basic operations¹, these effects can no longer be ignored. We will discuss some low-level issues in the context of the hotspot operator, but the existence of such techniques and the magnitude of the resulting improvements are likely to be of independent interest.

¹ DDR3 memory modules' 60 ns latency equates to 160 cycles at 2.66 GHz [192].

As explained in Chapter 2, unlocking the full potential of CPUs requires vectorization and parallelization. In this case, the result was a 27-fold speedup. Local filters are generally suitable for data-parallel processing, but the hotspot operator is limited by memory bandwidth due to its numerous and non-sequential memory accesses. Figure 8.3 shows the scalability of the new algorithm on three different SMP systems. Parallel efficiency is only 50% on a 16-core Intel machine. The memory bottleneck hypothesis is confirmed by better scalability on an AMD machine with multiple memory controllers and correspondingly higher bandwidth. Note that such systems have NUMA characteristics, which requires care to ensure each thread's working set is in local memory [193].

[Figure 8.3 plot: parallel efficiency (speedup / N) versus N = number of threads (1 to 8) for three systems: 4x Intel Quad-Core, 4x AMD Dual-Core, Intel Quad-Core]
Figure 8.3: Scalability of the new algorithm on three SMP systems. Memory bandwidth is the limiting factor and is more plentiful on the AMD system.

The next step is vectorization, which is possible because the per-pixel computations are independent and can be mapped to the SSE2 instruction set. We obtain an additional speedup of 3.6 via 8-way SIMD, which is helpful but surprisingly low. It turns out that the cause is a limitation in the Intel Core 2 microarchitecture regarding the handling of unaligned loads, an issue that will be discussed in depth in Appendix B.2. The takeaway is that the new algorithm will benefit from improvements in this area and the move towards multiple memory controllers, further improving its performance and scalability.

Another detail that has been considered is the overhead of so-called page walks. Each memory access requires virtual-to-physical address translation in the memory management unit (MMU), which involves examining multi-level page tables. A Translation Look-aside Buffer (TLB) serves to decrease this overhead by storing the result of the translation for a small number of recently-accessed memory pages. This specialized cache has strict latency requirements and can therefore only accommodate a few entries. If it is overloaded by random accesses in a large memory region, overhead increases dramatically because several accesses to memory are needed [42]. The TLB coverage could be increased by using large memory pages (e.g. 4 MiB instead of 4 KiB on x86 architectures). However, our algorithm rarely accesses memory because it is designed to operate in-cache.

One final microarchitectural issue that has affected the design of the algorithm is also cache-related. The Intel i7 and AMD family 10h processors include a shared L3 cache, whereas Intel Core2 CPUs consist of logical processor pairs sharing an L2 cache. In both cases, the caches are unpartitioned; unnecessary evictions can result from threads stealing each other's space. Having processors that share a cache work together on a task is about 7% faster in some cases due to the reduction in contention. Even if partitioning strategies are improved, the cooperative scheme has the advantage of avoiding replication of common data and increasing the effective size of the cache. For working sets approaching a logical processor's share of the cache, the cache-aware method achieves a speedup of 1.45 due to its avoidance of thrashing.

Results8.3

We show the results of the hotspot transformation on a Dornier-SAR image of Kühlsheim (Figure 8.4(a)), a scene containing both man-made objects and vegetation. We are particularly interested in vehicles and other compact objects. The hotspot transformation (radius R = 32) suppresses uniformly bright regions, because such pixels' shells are generally not darker than the center pixel. After the hotspot transformation, vehicle pixels and the remaining background pixels differ by three orders of magnitude (10^7 vs. 10^4). To improve the visualization, we compute connected components of nonzero pixels and discard objects smaller than an arbitrary cutoff of 12.7 m^2. The result is shown in Figure 8.4(b). Subsequent steps in the image processing pipeline examine the candidate regions, e.g. classifying them via SVM.

Figure 8.4: Airborne SAR image of Kühlsheim (65 cm resolution) and the result of the hotspot transformation. Panels: (a) Logarithm of input; (b) Hotspot-transform.

8.4 Performance

The point of developing a new algorithm for the hotspot operator was to enable near-real-time processing of large datasets. Its success is determined by a performance comparison with the previous skip-shell algorithm, which depends on the properties of the input data. To ensure relevant findings, we measure the throughput for a set of seven typical high-resolution SAR images of different areas captured by air- and spaceborne sensors. The results are shown in Table 8.1 and indicate a maximum speedup of 14.7. Note that the image dimensions influence the running time of our algorithm. Wider datasets increase the working set size, and dimensions divisible by multiples of the cache line size may lead to associativity conflicts. However, the slowest recorded throughput is still 102 times as fast as an in-house FPGA implementation of the basic algorithm on a Virtex-II.

Table 8.1: Comparison of throughputs on various SAR datasets. Our new algorithm is up to 14.7 times as fast as the skip-shell algorithm.

Dataset     Width   Height   Old MPixel/s   New MPixel/s
Diepholz     2928    28810           13.9          205.4
Kühlsheim    4096    30791           27.2          131.8
K.0162       4096    34415           27.3          123.1
K.1882       2928    35560           14.3          203.4
Walldürn     4096    32820           25.8          131.6
TSX579      11624     6656           35.4           72.2
TSX580      10752     6122           33.2           72.7
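The speedup quoted above follows directly from the two throughput columns of Table 8.1. The following C++ fragment is a minimal sketch of that arithmetic; the type Throughput and the functions MaxSpeedup and SlowestNewThroughput are names chosen for illustration and are not part of the implementation described here.

```cpp
#include <cassert>

// One row of Table 8.1 (throughputs in MPixel/s).
struct Throughput {
  const char* dataset;
  double old_mpixel_s;  // skip-shell algorithm
  double new_mpixel_s;  // new algorithm
};

// Throughput columns as listed in Table 8.1.
constexpr Throughput kTable[] = {
    {"Diepholz", 13.9, 205.4}, {"Kühlsheim", 27.2, 131.8},
    {"K.0162", 27.3, 123.1},   {"K.1882", 14.3, 203.4},
    {"Walldürn", 25.8, 131.6}, {"TSX579", 35.4, 72.2},
    {"TSX580", 33.2, 72.7},
};

// Largest ratio of new to old throughput over all datasets.
inline double MaxSpeedup() {
  double best = 0.0;
  for (const Throughput& row : kTable) {
    assert(row.old_mpixel_s > 0.0);
    const double speedup = row.new_mpixel_s / row.old_mpixel_s;
    if (speedup > best) best = speedup;
  }
  return best;
}

// Smallest throughput of the new algorithm over all datasets.
inline double SlowestNewThroughput() {
  double slowest = kTable[0].new_mpixel_s;
  for (const Throughput& row : kTable) {
    if (row.new_mpixel_s < slowest) slowest = row.new_mpixel_s;
  }
  return slowest;
}
```

The maximum is attained by the Diepholz dataset, and the slowest new throughput belongs to one of the two wide TerraSAR-X scenes, in line with the remark on working set size.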

8.5 Conclusion

Automatic screening for man-made objects in SAR or IR datasets entails detecting compact pixel clusters. The hotspot transformation successfully suppresses other pixels, but is computationally expensive. We have introduced a new algorithm with linear complexity in the pixel count and object size, which is asymptotically optimal. Our sophisticated implementation avoids redundant computations by means of a divide and conquer scheme and organizes its memory accesses so the working set fits in the cache. Parallelization and vectorization yield a combined 27-fold speedup. A single workstation is able to process 72 MPixel/s, which allows rapid screening of large datasets. The algorithm is used as a preprocessing step for multi-class target recognition in MSTAR SAR data via support vector machine.

Chapter 9

Discussion

This work has described techniques for maximizing performance on modern CPUs, namely vectorization, parallelization and accounting for the memory hierarchy. They have given rise to 10–100x speedups in seven separate algorithms, thus emphasizing their practical relevance and wide applicability. In several cases, the resulting software exceeds the reported performance of specialized hardware. This provides somewhat unexpected input to the current discussion of which computer architecture is suitable for a given task. General-purpose CPUs can still compare favorably, even when performance goals are ambitious. Although some of our techniques are designed for specific microarchitectures, the past has shown that their basic principles remain valid for a decade or more.

The above conclusions stand for themselves, but our main objective was to design and implement an efficient processing chain for image analysis. Although this work does not constitute progress on understanding the image contents, nor realize a full-fledged demonstration application, it provides useful building blocks for the increasingly accepted object-based image analysis paradigm [194]. We have introduced new algorithms for each step that significantly outperform previous approaches while maintaining high-quality results. This is important because modern imaging sensors deliver ever-increasing amounts of data. Our results demonstrate the feasibility of processing aerial imagery of 100 km × 100 km areas at 1 m resolution within minutes, which goes far beyond our initial goal of 2 hours.

Each link of the chain is designed as part of a coherent whole. For example, the pansharpening algorithm arranges for edge-preserving smoothing to aid the subsequent segmentation, and our image I/O module includes support for statistics and tiled pixel formats to allow for better viewing of large images. The processing chain serves to shoulder the brunt of the expensive pixel-based processing required for various image analysis tasks. Subsequent applications need not be as concerned with performance, because they can draw upon a more compact and higher-level object-based representation of the image. This general approach of optimizing relatively small modules responsible for most of the execution time provides major performance benefits at a reasonable cost.

However, much remains to be done. Building further applications besides our change-detection prototype would indicate whether the current set of image features is sufficient for a wider range of tasks. In particular, extracting and simplifying segment contours would be helpful for matching and classifying objects. We have developed algorithm prototypes for both problems (including vectorization of the inherently sequential polygon simplification task) that lead us to believe a throughput comparable to the rest of the processing chain may be attained.

Some applications also require accuracy guarantees. An analysis of the maximum deviation in the pan-sharpening stage and an error model for the segmentation could prove useful. Both of these steps also require user-defined parameters for the degree of smoothing and minimum object contrast, respectively. It would be helpful to automatically derive both from the input datasets.

Returning once again to the general issue of performance, we believe that many of the techniques developed herein are broadly applicable to other domains. For example, efficient asynchronous transfers can speed up I/O-intensive applications, including external-memory algorithms. An awareness of the memory hierarchy, especially working set size and cache pollution, should improve nearly any algorithm that frequently accesses memory. Finally, modern multi-core CPUs with SIMD instruction sets offer a surprising degree of parallelism. The combination of optimized algorithms and a balanced architecture (including high single-core performance for the serial portion of parallel algorithms) can allow a CPU to remain competitive with other specialized architectures.

Part

III

Desserts

Appendix A

Virtual-Memory Counting Sort

We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 89% of the system's peak memory bandwidth. Our implementation outperforms Intel's recently published radix sort by a factor of 1.64. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit from the techniques described herein.

This chapter has undergone minor revisions since its publication at Euro-Par 2011 [195].

A.1 Introduction

Sorting is a fundamental operation that is a time-critical component of various applications such as databases and search engines. The well-known lower bound of Ω(n log n) for comparison-based algorithms no longer applies when special properties of the keys can be assumed. In this work, we focus on 32-bit integer keys, optionally paired with a 32-bit (or larger) value. This simplifies the implementation without loss of generality, because applications can often replace large records with a pointer or index [196]. The radix sort algorithm is commonly used in such cases due to its O(n) complexity. In this report, we present a 1.64-fold performance increase over results recently published by Intel [197].

The remaining sections are organized in a bottom-up fashion, with Section A.2 dedicated to the basic realities of current and future microarchitectures that affect memory-intensive programs and motivate our approach. We build upon this foundation in Section A.3, showing how to speed up counting sort by taking advantage of virtual memory and write-combining. Section A.4 applies this technique towards a novel variant of radix sort. The performance of our implementation is evaluated in Section A.5. Bandwidth measurements indicate the per-pass throughput is nearly optimal for the given hardware. Its two CPUs outperform a Fermi GPU when accounting for data-transfer overhead.

A.2 Software Write-Combining

We begin with a description of basic microarchitectural realities that are likely to have a serious impact on applications with numerous memory accesses, and show how to avoid performance penalties by means of Software Write-Combining. These topics are not new, but we believe they are often not adequately addressed.

The first problem arises when writing items to multiple streams. An ideal cache with at least as many lines could exploit the writes' spatial locality and entirely avoid non-compulsory misses. However, perfect hit rates are not achievable in practice due to limited set associativity a [198]. Because only a lines can be mapped to a cache set, any further allocations from that set would result in the eviction of one of the previous lines. If possible, applications should avoid writing to many different streams. Otherwise, the various write positions should map to different sets to avoid thrashing and conflict misses. For current L1 caches with a = 8 ways, size C = 32 KiB and lines of B = 64 bytes, there are S = C/(aB) = 64 sets, and bits [lg B, lg B + lg S) of the destination addresses should differ (e.g. by ensuring the write positions are not a multiple of SB = 4 KiB apart).
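The set arithmetic above can be made concrete with a few lines of C++. This is a minimal sketch that assumes a simple modulo-indexed cache (no hashing of address bits); the names CacheSet and SameSet are hypothetical and serve only to illustrate the calculation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Cache geometry from the text: a = 8 ways, C = 32 KiB, B = 64-byte lines.
constexpr std::size_t kWays = 8;
constexpr std::size_t kCacheBytes = 32 * 1024;
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kSets = kCacheBytes / (kWays * kLineBytes);
static_assert(kSets == 64, "S = C / (aB)");
static_assert(kSets * kLineBytes == 4096, "SB = 4 KiB");

// Returns the set index, i.e. address bits [lg B, lg B + lg S).
// Assumption: the cache selects sets by plain modulo indexing.
constexpr std::size_t CacheSet(std::uintptr_t address) {
  return (address / kLineBytes) % kSets;
}

// True if two write positions compete for the same cache set.
constexpr bool SameSet(std::uintptr_t a, std::uintptr_t b) {
  return CacheSet(a) == CacheSet(b);
}
```

Two write positions that are exactly 4 KiB apart yield the same set index under this model, whereas shifting one of them by a single cache line moves it to the neighboring set.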
A second issue is provoked by a large number of write-only accesses. Even if an entire cache line is to be written, the previous destination memory must first be read into the cache. Although the corresponding latency may be partially hidden via prefetching, the cache line allocations are problematic due to capacity constraints and eviction policy. Instead of displacing write-only lines that are not accessed after having been filled, the widespread (pseudo-)Least-Recently-Used strategy displaces previously cached data due to their older timestamp. An attempt to avoid these evictions by explicitly invalidating cache lines (e.g. with the IA-32 CLFLUSH instruction) did not yield meaningful improvements. Instead, applications should use non-temporal streaming store instructions that write directly to memory, thus avoiding cache pollution because they circumvent the cache.

This leads directly to the next concern: single memory accesses involve significant bus overhead. The architecture therefore combines neighboring non-temporal writes into a single burst transfer. However, current microarchitectures only provide four to ten write-combine (WC) buffers [199]. Non-temporal writes to multiple streams may force these buffers to be flushed to memory via partial writes before they are full. The application can prevent this by making use of Software Write-Combining [200]. The data to be written is first placed into temporary buffers, which almost certainly reside in the cache because they are frequently accessed. When full, a buffer is copied to the actual destination via consecutive non-temporal writes, which are guaranteed to be combined into a single burst transfer.

This scheme avoids reading the destination memory, which may incur relatively expensive Read-For-Ownership transactions and would only pollute the cache. It works around the limited number of WC buffers by using L1 cache lines for that purpose. Interestingly, this is tantamount to direct software control of the transparently managed cache. We recommend the use of such Software Write-Combining whenever a core's active write destinations outnumber its write-combine buffers. Fortunately, this can be done at a fairly high level, because only the buffer copying requires special vector loads and non-temporal stores (which are best expressed by the SSE2 intrinsics built into the major compilers).
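The buffering scheme described in this section might be sketched as follows for a single output stream of 32-bit items. This is an illustration rather than the implementation evaluated here: the type Stream and the functions FlushLine, Append and Finish are hypothetical names, the destination is assumed to be 64-byte aligned, and on targets without SSE2 the sketch falls back to an ordinary copy that is not non-temporal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence
#define SWC_HAVE_SSE2 1
#else
#define SWC_HAVE_SSE2 0
#endif

constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kItemsPerLine = kLineBytes / sizeof(std::uint32_t);

// One output stream with a cache-line-sized staging buffer.
struct Stream {
  alignas(64) std::uint32_t buffer[kItemsPerLine];
  std::uint32_t* dest = nullptr;  // assumed to be 64-byte aligned
  std::size_t count = 0;          // items currently held in buffer
};

// Copies one full line to dest, using non-temporal stores where available.
inline void FlushLine(const std::uint32_t* buffer, std::uint32_t* dest) {
#if SWC_HAVE_SSE2
  const __m128i* from = reinterpret_cast<const __m128i*>(buffer);
  __m128i* to = reinterpret_cast<__m128i*>(dest);
  for (std::size_t i = 0; i < kLineBytes / sizeof(__m128i); ++i) {
    _mm_stream_si128(to + i, _mm_load_si128(from + i));
  }
#else
  std::memcpy(dest, buffer, kLineBytes);  // portable fallback
#endif
}

// Stages one item; a full buffer is written out as a single line.
inline void Append(Stream& s, std::uint32_t value) {
  assert(s.dest != nullptr);
  s.buffer[s.count++] = value;
  if (s.count == kItemsPerLine) {
    FlushLine(s.buffer, s.dest);
    s.dest += kItemsPerLine;
    s.count = 0;
  }
}

// Writes a partial line with ordinary stores and orders the streamed data.
inline void Finish(Stream& s) {
  std::memcpy(s.dest, s.buffer, s.count * sizeof(std::uint32_t));
  s.dest += s.count;
  s.count = 0;
#if SWC_HAVE_SSE2
  _mm_sfence();
#endif
}
```

An application with many write destinations would keep one such Stream per destination, so that only the small staging buffers compete for cache lines while the destinations themselves are never read.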

A.3 Virtual-Memory Counting Sort
We now review Counting Sort of n elements with keys in [0, m) and describe an improved variant that makes use of virtual memory and write-combining.

The naïve algorithm first generates a histogram of the n keys. After computing the prefix sum to yield the starting output location for each key, each value is written at its key's output position, which is subsequently incremented.

Our first optimization goal is to avoid the initial counting pass. We could instead insert each value into a per-key container, e.g. a list of data blocks. However, this incurs some overhead for checking whether the current bucket is full. Preallocating space for m arrays of size n is more efficient, because items can simply be written to the next free position (cf. Algorithm A.1, introduced in [201]). This algorithm only writes and reads each item once, a feat that comes at the price of n·m space. Although this appears problematic in the Random-Access-Machine model, it is easily handled by 64-bit CPUs with virtual memory organized into pages of size p.¹ Physical memory is only mapped to pages when they are first accessed, thus reducing the actual memory requirements to O(n + mp). The remainder of the initial allocation only occupies address space, of which multiple terabytes are available on 64-bit systems.

¹ Accesses to non-present pages result in a page fault exception. The application receives such events via signals (POSIX) or Vectored Exception Handling (Windows) and reacts by committing memory, after which the faulting instruction is repeated.

Algorithm A.1: Single-pass counting sort
    storage := ReserveAddressSpace(n·m);
    for i := 0 to m−1 do next[i] := i·n;
    foreach key, value do
        storage[next[key]] := value;
        next[key] := next[key] + 1;
    end

Having avoided the initial counting pass, we now show how to efficiently write values to storage using the write-combining technique described in Section A.2. Our implementation initializes the next pointers to consecutive, naturally aligned, cache-line-sized buffers. A buffer is full when its (post-incremented) position is evenly divisible by its size. When that happens, an unrolled loop of non-temporal writes copies the buffer to its key's current output position within storage. These output positions are also stored in an array of pointers.
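To illustrate Algorithm A.1, the following C++ sketch reserves n·m slots of address space and relies on the operating system to back only the touched pages with physical memory. It makes several assumptions that are not taken from the text: a POSIX system whose mmap supports anonymous mappings and the MAP_NORESERVE hint (as on Linux), demand paging by the kernel instead of the explicit fault handling of the footnote, and no write-combining. The function name VmCountingSort is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <sys/mman.h>  // POSIX mmap/munmap

// Sorts values by their keys (each in [0, m)) with a single pass over the
// input. Returns the values in key order; equal keys keep their input order.
std::vector<std::uint32_t> VmCountingSort(
    const std::vector<std::uint32_t>& keys,
    const std::vector<std::uint32_t>& values, std::size_t m) {
  assert(keys.size() == values.size());
  const std::size_t n = keys.size();
  std::vector<std::uint32_t> sorted;
  if (n == 0 || m == 0) return sorted;

  // Reserve n*m slots of address space. Pages receive physical memory only
  // once they are first written (assumption: demand paging).
  const std::size_t bytes = n * m * sizeof(std::uint32_t);
  void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (mem == MAP_FAILED) throw std::runtime_error("mmap failed");
  std::uint32_t* storage = static_cast<std::uint32_t*>(mem);

  // next[i] := i * n, the first free position of bucket i.
  std::vector<std::size_t> next(m);
  for (std::size_t i = 0; i < m; ++i) next[i] = i * n;

  // Single pass: write each value to the next free slot of its bucket.
  for (std::size_t i = 0; i < n; ++i) {
    assert(keys[i] < m);
    storage[next[keys[i]]++] = values[i];
  }

  // Gather the m buckets in key order.
  sorted.reserve(n);
  for (std::size_t k = 0; k < m; ++k) {
    for (std::size_t pos = k * n; pos < next[k]; ++pos) {
      sorted.push_back(storage[pos]);
    }
  }
  munmap(mem, bytes);
  return sorted;
}
```

The final gathering loop is only needed to return a compact result in this sketch; within a radix sort, the buckets would instead serve directly as the input of the next pass.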

A.4 Radix Sort

After a brief review of radix sorting, we introduce a new variant based on the virtual-memory counting sort described in Section A.3.

A radix sort successively examines D-bit digits of the K-bit keys. Radix sorts are characterized by the order in which digits are processed: starting at the Least Significant Digit (LSD), or Most Significant Digit (MSD).

An MSD radix sort partitions the items according to the current digit, then recursively sorts the resulting buckets. Although it no longer needs to move items whose previously seen key digits are unique, this is not especially helpful when the number of passes K/D is small. In fact, the overhead of managing numerous (nearly empty) buckets makes MSD radix sort less suited for relatively small n.

By contrast, each iteration of the LSD variant partitions all items into buckets by the current key digit. This amortizes the bucket setup cost over the number of elements and avoids the possibility of load imbalance for parallelization at the price of increased data copying.

To reduce this overhead and also parallel communication, we make use of reverse sorting [202], in which one or more MSD passes partition the data into buckets, which are then locally sorted via LSD. This turns out to be even more advantageous for NUMA systems because each processor is responsible for writing a contiguous range of outputs, thus ensuring the operating system allocates those pages from the processor's NUMA node [193].

Let us now examine the pseudocode of the radix sort (Algorithm A.2), choosing K = 32 for brevity and D = 8 to allow extracting key digits without masking. Each Processing Element (PE) first uses counting sort to partition its items into local buckets by the MSD (digit = 3). Note that items consist of a key and value, which are adjacent in memory (ideally within a native 64-bit word, but larger combinations are possible in our implementation via user-defined types). When all are finished, the output index of the first item of a given MSD is computed via prefix sum. Each PE is assigned a range of MSD values, sorting the buckets from all PEs for each value. Skewed MSD distributions can cause load imbalance. However, this could be resolved via special treatment of large buckets.² The local sort entails K/D − 1 iterations in LSD order. The first copies all other PEs' buckets into local memory. The second to last pass also computes the last digit's histogram, which allows writing directly to the output positions in the final pass. Note that three sets of buckets are required, which makes heavy use of virtual memory (3 × 2^D × |PE| = 6144 times the input size). Whereas 64-bit Linux grants each process 128 TiB address space, Windows limits this to 8 TiB, which means only about 1.4 GiB of inputs can be sorted.³

² Sorting buckets larger than n/|PE| using multiple PEs.

Algorithm A.2: Parallel Radix Sort
    parallel foreach item do
        d := Digit(item, 3);
        buckets3[d] := buckets3[d] ∪ {item};
    end
    Barrier;
    foreach i ∈ [0, 2^D) do
        bucketSizes[i] := Σ_PE |buckets3[i]|;
    end
    outputIndices := PrefixSum(bucketSizes);
    parallel foreach bucket3 ∈ buckets3 do
        foreach item ∈ bucket3 ∀PE do
            d := Digit(item, 0);
            buckets0[d] := buckets0[d] ∪ {item};
        end
        foreach bucket0 ∈ buckets0 do
            foreach item ∈ bucket0 do
                d := Digit(item, 1);
                buckets1[d] := buckets1[d] ∪ {item};
                d := Digit(item, 2);
                histogram2[d] := histogram2[d] + 1;
            end
        end
        foreach bucket1 ∈ buckets1 do
            foreach item ∈ bucket1 do
                d := Digit(item, 2);
                i := outputIndices[d] + histogram2[d];
                histogram2[d] := histogram2[d] + 1;
                outputArray[i] := item;
            end
        end
    end
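The digit-wise LSD passes reviewed in this section can be illustrated by a purely sequential C++ sketch for K = 32 and D = 8. It is not the parallel Algorithm A.2: there is no MSD partitioning, no virtual-memory bucket storage and no write-combining, and each pass uses the naïve counting sort (histogram, prefix sum, scatter). The names Digit and LsdRadixSort are chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned kKeyBits = 32;                 // K
constexpr unsigned kDigitBits = 8;                // D
constexpr std::size_t kRadix = 1u << kDigitBits;  // 2^D buckets

// Extracts digit `index` (0 = least significant) of a 32-bit key.
// With D = 8 the shift-and-mask below amounts to selecting one byte.
inline std::uint32_t Digit(std::uint32_t key, unsigned index) {
  assert(index < kKeyBits / kDigitBits);
  return static_cast<std::uint32_t>((key >> (index * kDigitBits)) &
                                    (kRadix - 1));
}

// Sequential LSD radix sort: K/D = 4 stable counting-sort passes.
inline void LsdRadixSort(std::vector<std::uint32_t>& keys) {
  std::vector<std::uint32_t> scratch(keys.size());
  for (unsigned pass = 0; pass < kKeyBits / kDigitBits; ++pass) {
    // Histogram of the current digit.
    std::array<std::size_t, kRadix> next{};
    for (std::uint32_t key : keys) ++next[Digit(key, pass)];
    // Exclusive prefix sum yields each bucket's first output position.
    std::size_t sum = 0;
    for (std::size_t& entry : next) {
      const std::size_t count = entry;
      entry = sum;
      sum += count;
    }
    // Stable scatter into the scratch array, then swap roles.
    for (std::uint32_t key : keys) scratch[next[Digit(key, pass)]++] = key;
    keys.swap(scratch);
  }
}
```

Because the number of passes is even, the sorted keys end up in the caller's vector without a final copy. Stability of each pass is what allows the later, more significant digits to preserve the order established by the earlier ones.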
We briefly discuss additional system-specific considerations. The radix 2^D was motivated by easy access to each digit, but is also limited by the cache and TLB size. Because of the many required TLB entries, we map the buckets with small pages, for which the Intel i7 microarchitecture has 512 second-level TLB entries. To increase TLB coverage, we use large pages for the inputs. The working set consists of 2^D buffers, buffer pointers, output positions, and 32-bit histogram counters. This fits in a 32 KiB L1 data cache if the software write-combine buffers are limited to a single 64-byte cache line. To avoid associativity and aliasing conflicts, these arrays are contiguous in memory. Interestingly, these optimizations do not detract from the readability of the source code. Knowledge of the microarchitecture can also be applied towards middle-level languages and enables principled design decisions.
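The working-set claim can be checked with a short calculation. The C++ sketch below assumes 8-byte buffer pointers and output positions (64-bit addresses) and |PE| = 8, the value implied by the factor 6144; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRadix = 1u << 8;   // 2^D with D = 8
constexpr std::size_t kBufferBytes = 64;  // one cache line per bucket
constexpr std::size_t kPointerBytes = 8;  // assumption: 64-bit pointers

// Bytes of the per-PE working set: write-combine buffers, buffer pointers,
// output positions and 32-bit histogram counters for each of 2^D buckets.
constexpr std::size_t WorkingSetBytes() {
  return kRadix * (kBufferBytes + kPointerBytes + kPointerBytes +
                   sizeof(std::uint32_t));
}
static_assert(WorkingSetBytes() == 21 * 1024, "21 KiB in total");
static_assert(WorkingSetBytes() <= 32 * 1024, "fits in a 32 KiB L1D cache");

// Address-space multiplier: three sets of 2^D buckets per PE, each bucket
// reserved for the whole input.
constexpr std::size_t AddressSpaceFactor(std::size_t num_pe) {
  return 3 * kRadix * num_pe;
}
static_assert(AddressSpaceFactor(8) == 6144, "matches the stated factor");
```

Under these assumptions the working set occupies 21 KiB, which leaves roughly a third of a 32 KiB L1 data cache for the input stream and other data. Doubling the buffer size to two cache lines would already exceed the cache capacity.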

A.5 Performance

We characterize the performance of our sorting implementation by its throughput, defined as n/(t1 − t0), where n is the number of items and t0 and t1 are the earliest start and latest finish times reported by any thread. The test platform consists of dual W5580 CPUs (3.2 GHz, 48 GiB DDR3-1066 memory) running Windows XP x64. Our implementation is compiled with ICC 11.1.082 /Ox /Og /Oi /Ot /Qipo /GA /GR- /GS- /EHsc /Qopenmp /QaxSSE4.2. When sorting 350 M uniformly distributed 32-bit keys generated by the WELL512 algorithm [203], the basic algorithm (VM only) reaches a throughput of 391 M items/s, as shown in the second column of Table A.1. After enabling write-combining (VM+WC), performance nearly doubles to 657 M/s. Intel has reported 240 M/s for the same task and a single but identical CPU [197]. For a fair comparison with our dual-CPU system, we doubled their throughput, which optimistically assumes their algorithm is NUMA-aware, scales perfectly and is not running at a lower memory clock (because our DDR3-1066 is at the lower end of currently available frequencies). We must also divide their result by the given speedup of 1.2 due to hyperthreads, because those are disabled on our machine. This (Intel x2) yields 400 M/s; the proposed algorithm is therefore 1.64 times as fast. A separate publication has also presented results [204] for the Many Integrated Cores architecture. The Knights Ferry processor provides 32 cores, each with four threads and 16-wide SIMD. The simulation (KNF MIC) shows a throughput of 560 M/s. Our scalar implementation is currently 1.17 times as fast when running on 8 cores.

³ This limitation could be circumvented by estimating bounds for bucket sizes via sampling. In the unlikely case that they are exceeded, a new sample would be drawn and the process repeated.

Table A.1: Throughputs [million items per second] for 32-bit keys and optional 32-bit values.

Algorithm    K=32, V=0    K=32, V=32
VM only            391           238
Intel x2           400           307
KNF MIC            560           (?)
GPU+PCIe           501           303
VM+WC              657           452

Recently, a throughput of 1005 M/s was reported on a GTX 480 (Fermi) GPU [205]. However, this excludes driver and data-transfer overhead. For applications in which the data is generated and consumed by the CPU, we must include at least the time required to read and write data over the PCIe 2.0 bus. Assuming the peak per-direction bandwidth of 8 GB/s is reached, the aggregate throughput (GPU+PCIe) is 501 M/s. Our implementation, running on two CPUs, therefore outperforms this algorithm on a current top-of-the-line GPU by a factor of 1.31 despite lower transistor counts (2 × 731 M vs. 3000 M) and thermal design power (2 × 130 W vs. 275–300 W).
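The normalizations behind the Intel x2 and GPU+PCIe figures can be retraced with the following C++ sketch. It assumes that 8 GB/s denotes 8 × 10^9 bytes per second, that 4-byte keys are transferred once in each direction, and that the upload, the sort and the download do not overlap; these assumptions reproduce the stated 501 M/s. The function names are hypothetical.

```cpp
#include <cassert>

// Intel's single-CPU throughput (million items/s) scaled to two CPUs and
// divided by the reported hyperthreading speedup.
inline double IntelX2(double single_cpu_mps, double hyperthread_speedup) {
  assert(hyperthread_speedup > 0.0);
  return 2.0 * single_cpu_mps / hyperthread_speedup;
}

// Aggregate throughput (million items/s) when the GPU sort is preceded and
// followed by one PCIe transfer. The per-item times of the three stages add.
inline double GpuWithPcie(double gpu_mps, double pcie_bytes_per_s,
                          double item_bytes) {
  assert(gpu_mps > 0.0 && pcie_bytes_per_s > 0.0 && item_bytes > 0.0);
  const double transfer_mps = pcie_bytes_per_s / item_bytes / 1e6;
  return 1.0 / (1.0 / gpu_mps + 2.0 / transfer_mps);
}
```

With the values from the text, IntelX2(240, 1.2) evaluates to 400 M/s and GpuWithPcie(1005, 8e9, 4) to about 501 M/s, so that the measured 657 M/s corresponds to the factors 1.64 and 1.31 given above.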
Similar measurements and extrapolations for the case of 32-bit keys associated with V = 32-bit values are given in the third column of Table A.1. Because the slowdown is less than a factor of two, the implementations are at least partially limited by computation instead of bandwidth. Intel's algorithm is more efficient in this regard, with only a 1.3-fold decrease vs. our factor of 1.45. The additional data transfers over PCIe render the GPU algorithm uncompetitive.

Because radix sort is bandwidth-sensitive, it is also interesting to examine performance for a varying number of processors. We manually distribute OpenMP threads across CPU packages and cores (in that order) to make use of all available memory controllers. Our NUMA-aware implementation scales linearly with the number of threads, as shown by Figure A.1. To explain the 95% parallel efficiency, we measured the total traffic at each socket's memory controller. Because this information is not available from current profilers such as VTune (which use per-core performance counters), we have developed a small kernel-mode driver to provide access to the model-specific performance counters in the Intel i7 uncore.⁴ Uncached writes constitute the bulk of the write combiner's memory traffic and are therefore of particular interest. They are apparently reported as Invalid-To-Exclusive transitions and can thus be counted as the total number of reads minus normal reads [206]. We find that 2041 MiB are written, which corresponds to 64 Mi items × 8 bytes per item × 4 passes (slightly less because our final pass cannot use non-temporal writes when the output position is not aligned). Surprisingly, 2272 MiB are read, about 10% more than expected. This amount seems to be influenced by the number of threads. Possible causes may include coherency traffic or page walks and will be investigated in future work. However, we can provide a conservative estimate of the bandwidth utilization. Given the pure read and write bandwidths (38687 MB/s and 28200 MB/s) measured by RightMark [153], the minimum time required for 4 reads and writes of 175 M 8-byte items is 343 ms, which is 89% of the total measured time. This calculation does not include write-to-read turnaround [207, p. 486], so there is even less room for improvement than indicated.

⁴ The part of the socket not associated with a particular core.

[Figure A.1: plot of speedup vs. single thread (vertical axis) against the number of threads (horizontal axis, 1 to 8).]
Figure A.1: Linear scalability on two quad-core CPUs with a NUMA factor of 1.5.
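The lower bound on the transfer time can be reproduced as follows. The sketch assumes that MB denotes 10^6 bytes and that 175 M denotes 175 × 10^6 items, which together yield the stated 343 ms. Relating this bound to the 452 M items/s of Table A.1 is a further assumption made here to recover the 89% figure. Both function names are hypothetical.

```cpp
#include <cassert>

// Minimum time in seconds to read and write all items once per pass, given
// the pure read and write bandwidths in MB/s (assumption: 1 MB = 10^6 bytes).
inline double MinTransferSeconds(double items, double item_bytes, int passes,
                                 double read_mb_per_s, double write_mb_per_s) {
  assert(read_mb_per_s > 0.0 && write_mb_per_s > 0.0);
  const double megabytes = items * item_bytes * passes / 1e6;
  return megabytes / read_mb_per_s + megabytes / write_mb_per_s;
}

// Fraction of the measured sorting time accounted for by the lower bound.
inline double BandwidthUtilization(double min_seconds, double items,
                                   double items_per_second) {
  assert(items_per_second > 0.0);
  return min_seconds / (items / items_per_second);
}
```

For 175 × 10^6 items of 8 bytes and 4 passes, 5600 MB are read and written; at 38687 MB/s and 28200 MB/s this takes about 145 ms and 199 ms, respectively.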
The previous measurements concern large numbers of items. We now study performance over a wider range of input sizes. The elapsed time per item, shown in Figure A.2, varies inversely with the number of items n due to amortization of thread-startup overhead. Performance is within 10% of the best measurement when n ≥ 26 Mi, or n ≥ 21 Mi in the case of the approximated Gaussian distribution [208]. It is initially surprising that this distribution does not require more time to sort than uniformly distributed numbers. However, interleaving buckets in the LSD passes (successive buckets are assigned to different threads) reduces load imbalance, and increased occupancy of the central buckets improves locality at the memory page level.

[Figure A.2: plot of nanoseconds per item (vertical axis) against the number of 32-bit items [×2^20] (horizontal axis) for the uniform and Gaussian distributions.]
Figure A.2: Time per item for various input sizes and distributions.

A.6 Conclusion

We have introduced improvements to counting sort and a novel variant of radix sort for integer key/value pairs. Bandwidth measurements indicate our algorithm's throughput is within 11% of the theoretical optimum for the given hardware. It outperforms the recently published results of Intel's radix sort by a factor of 1.64 and also outpaces a Fermi GPU when data transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms still have their place on current architectures. However, achieving this level of performance requires awareness of the underlying microarchitecture and some degree of tuning. Our implementation encompasses 5700 lines of C++ (including tests), plus 40000 lines of shared infrastructure. A demo executable [209] capable of generating or reading 32-bit integers, sorting and efficiently writing them to disk is being made available so that our measurements may be reproduced.

Future Work. Although carefully engineered, our implementation is not yet a general solution for all possible sorting applications. Radix sort is limited to relatively small integer keys, and we also assume at least one of the key digits (e.g. MSB) is reasonably equally distributed. Skewed (e.g. constant) distributions currently result in load imbalance. This could be avoided by sorting extremely large buckets from the MSD phase using multiple processors.

We are also interested in testing on larger multi-socket machines with higher NUMA factors⁵ and investigating details of the memory subsystem that reduce effective bandwidth. Finally, we believe the general software write-combining technique can provide similar speedups for other memory-intensive applications. In particular, comparison-based sample sort is also expected to benefit from our implementation techniques.

⁵ The ratio between remote and local memory latency.

Appendix B

Implementation Details

B.1 Software Engineering

Building the image processing chains described in this work from the ground up was a sizable undertaking spanning 2008–2011. The author developed over 100000 lines of C++ code (LOC), which are organized into 12 dynamic-link libraries to avoid repetitive compilation. This allows a full rebuild of optimized binaries within 90 s using the Intel compiler on a 12-core system equipped with an SSD. The Microsoft Visual Studio 2010 integrated development environment (IDE) is augmented with Intel's Parallel Studio 2011, which encompasses a compiler, tools for detecting race conditions or memory errors, and a profiler (formerly known as VTune) for measuring where execution time is spent and reading the processor's performance counters.

Eight standalone applications have been developed for testing the modules in isolation. The Subversion (SVN) software configuration management system was used to maintain versioning information, recording a total of 38992 file changes in 2767 revisions. Besides providing information security, this was valuable for showing what changed since the last known-good version and reverting edits made during failed experiments. Extensive pre- and postcondition checks and self-tests built into the software exposed many errors early on. A custom ASSERT macro enabled easier analysis of the problem (even in optimized builds) by displaying error messages with a record of the previously called subroutines and the values of their local variables.

As mentioned in Section 4.2, we use special C++ functions (SIMD intrinsics) and classes provided by the compiler to generate SIMD code. Please refer to the commented source code of the line rasterizer [173] for a complete example of their syntax and some low-level optimization techniques.

B.2 Unaligned Memory Accesses

It was mentioned in Section 8.4 that vectorization of the hotspot operator yields a surprisingly low speedup and that the cause is related to Intel CPUs' poor handling of unaligned memory accesses. Because this issue seriously impacts performance and is likely to affect other applications as well, we will now delve into the details. A preliminary version of this section appeared in [186].

The Intel Core 2 microarchitecture delays SIMD load operations that cross a cache line boundary (splits) [53, p. 83] by 12 cycles. This issue is documented in [200, p. 5-38], which recommends using LDDQU¹ to load two aligned vectors and shift the data into place, thus avoiding a cache line split. An unfortunate design trade-off in the Core 2 microarchitecture has replaced the implementation of this instruction with that of the architecturally equivalent MOVDQU², which remains affected by splits. The newer Intel i7 microarchitecture reduces the cost of splits to 2 cycles.

In the meantime, several workarounds have been attempted for the hotspot operator: substituting two 64-bit loads to decrease the probability of splits is consistently 4% slower. Using PALIGNR³ to emulate LDDQU works but requires the misalignment to be known at compile-time. Realizing that access patterns for each interval length are fixed, several ShellMax functions were generated via templates and called through function pointers. This turns out to be 20% slower, probably due to mis-predicted indirect branches. A final alternative lies in manually aligning accesses, which is feasible because shell maxima computations only require three distinct misalignments. Unfortunately the SSE instruction set does not allow variable shifts of full registers and restricting all operations to the lower halves of registers decreases performance by about 25%. Regardless, the cost of two aligned loads, two shifts and one OR-operation vastly outweighs the expense of cache line splits. It appears that straightforward use of MOVDQU is currently the best option, especially because AMD microarchitectures also handle unaligned loads with only slight penalties.

¹ LoaD Double-Quad Unaligned.
² MOVe Double-Quad Unaligned.
³ Packed ALIGN Right.

We now show the performance impact of cache line and page splits on Core 2 CPUs in the context of the hotspot operator. Assuming 2-byte values and 64-byte L1D cache lines, 7 out of the 32 possible misalignments should cross a cache line boundary. Instrumentation shows that the actual number is 22.13%. This is slightly more than expected because the misalignments are not quite uniformly distributed. Similar arguments apply for page splits; assuming sizes of 4 KiB, we expect a ratio of 7 out of 2048 and observe 0.34%, which is in good agreement. Using the per-split costs of 12 and 224 cycles given in [210] and supposing a 3 GHz processor, we therefore expect 1.42 s of CPU time to be lost due to the splits. A variant of the hotspot algorithm that rounds down all addresses to their natural alignment runs 1.33 s faster than the normal single-core version. This measurement matches the above predictions save for a slight difference due to the overhead of masking the lower address bits. Cache line and page split penalties have therefore been shown to be responsible for increasing total computation from 2223 ms to 3641 ms, i.e. a factor of 1.63!
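The expected split ratios follow from counting the start offsets at which a vector load extends past the end of a cache line or page. The C++ sketch below assumes 16-byte loads (the width of an SSE register) and uniformly distributed start offsets; the function name SplitOffsets is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Counts the start offsets within one block of `block_bytes` (offsets are
// multiples of `value_bytes`) at which a load of `load_bytes` crosses into
// the next block.
inline std::size_t SplitOffsets(std::size_t block_bytes,
                                std::size_t load_bytes,
                                std::size_t value_bytes) {
  assert(value_bytes > 0);
  std::size_t splits = 0;
  for (std::size_t offset = 0; offset < block_bytes; offset += value_bytes) {
    if (offset + load_bytes > block_bytes) ++splits;
  }
  return splits;
}
```

For 64-byte cache lines and 2-byte values, the offsets 50 to 62 cause a split, i.e. 7 of 32 (21.9%), which is close to the measured 22.13%. For 4 KiB pages the same 7 offsets remain, now out of 2048 (0.34%).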
To gain a better understanding of the cause, we have used the VTune profiler to observe certain CPU performance counters. The first surprising observation is a large amount of L1D cache misses despite the fact that these accesses are local. This and a cache line split penalty equal to the L2 access latency leads to the presumption that such loads are simply not serviced by the L1 cache and must go through L2. Page splits apparently have a different effect because they do not cause an excessive amount of L2 misses. Instead we note a significant number of TLB misses even though large pages are used and the working set does not exceed TLB capacity. This seems to point towards page splits requiring a page walk, especially because the overhead is similar to that reported in [42, p. 21]. These findings are in accordance with [210].

Although the above discussion may be deemed highly system-specific, it is also quite relevant for real-world performance. It is safe to say that processors will generally (and perhaps to a surprising degree) penalize unaligned memory accesses. Because access patterns are intimately tied to the design of algorithms, this issue must be kept in mind during their design.

B.3 LVT File Format

Section 3.3 stated our requirements for an image file format, particularly integer and floating-point data types, compression, tiling, image pyramids and flexible metadata. We are not aware of an existing format that covers these needs, aligns data for efficient access and avoids conversion overhead. This has motivated the development of a new Lossless Virtual Texture (LVT) layout. Let us emphasize that it is not intended to replace existing formats. Instead, it can be seen as an optimized alternate representation that provides rapid access to image tiles, thus enabling smooth navigation and zooming within the full-resolution pixels. Its high-level structure is straightforward: the image tiles are followed by an arbitrary number of variable-sized sections containing metadata.

In the following, we provide precise definitions of these components and our design rationale. The data structures are described via C++ syntax, with u8, u16, u32 and u64 respectively denoting unsigned 8, 16, 32 and 64-bit integer fields.

Tiled Pyramid

To allow smooth navigation within large images at low zoom scales, the format provides for a multi-resolution pyramid of levels. Level 0 is defined as the original image embedded within a square whose dimensions are a power of two. Subsequent levels are half as wide and high as their predecessor. Each level is split into individual tiles. We truncate the pyramid after a level fits into a single tile because subsequent levels are never used.
Itisimportanttocarefullyarrangetilestoimprovelocalityand
enableaparallelexternal-memoryalgorithmforcomputingthe
paccoryramiddingfrtooma2-DtheSpaceoriginalFillingimage.CurvAelev(SFC)els[211tiles],canthusbedecrordereasinged
theaveragedistanceofnearbytileswithinthefile,whichmay
reducethenumberandcostofdiskseeks.A3-Dmappingobtained
byonlyfillsincludingasmallthelepartvelofwtheould3-Dbespace.wasteful,Bycontrast,becausethecontiguouspyramidtile
indices(thenumbercorrespondingtoatilespositiononthespace-
fillingcurve)wouldallowsimpleandefficientlookupsofatiles
location.Moreimportantly,definingthecurvetomatchtheorder
winouldwhichminimizehigher-levelmemortilesyaruseewhengeneratedcreatingfromthetheirimageprpyredecessorsamid.
Weintroduceanovelmappingwithbothoftheseproperties.
Consider a 2×2 quartet of level 0 tiles, denoted "quad", from
which one level 1 tile may be computed via downsampling. Afterwards,
the quad's four tiles are no longer needed and may be
removed from memory once they have been written to the file. The
curve we seek must first visit the quad, the resulting level 1 tile,
three other neighboring quads and their level 1 tiles, and then the
resulting level 2 tile. Let us begin with a 2-D Z-order curve (the
Peano curve of [211] rotated 90 degrees clockwise). In accordance
with standard practice, we transform X and Y coordinates to a Z
index by interleaving their bits via SIMD [212]. The resulting value
is shifted left by log(numLevels + 3) bits. In the lower bits,
we encode either the quadrant [0, 3] of the level 0 tiles, or 3 plus
the level index of any higher-level tiles generated from the quad.
Indices of tiles above level 0 are offset by the cumulative sum of
the distance between Z neighbors in previous levels, thus shaping
the 3-D space into a pyramid. A tile at level i+1 immediately
follows the level i tile that is its fourth and final quadrant, which
is the desired property that allows constructing the pyramid with
minimal memory use.

Tiles are stored in the order induced by this curve. Depending
on the tileEncoding field, each either consists of uncompressed,
band-interleaved pixels, or the compressed variable-length LASC
representation of them. Because the next tile's offset is determined
from its predecessor's size, the tiles are stored back-to-back. This
requires a parallel compression pipeline to stall until the sizes of all
preceding tiles are known. We prefer the resulting slight increase
in compression time over larger file sizes because generating
large-scale images is usually an off-line process.
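The bit interleaving that yields the Z index can be sketched as follows. This is a minimal scalar illustration, not the SIMD variant of [212]. The function names and the choice of placing X in the even bits and Y in the odd bits are assumptions made for this example. The subsequent shift and the level encoding in the lower bits are omitted.

```cpp
#include <cassert>
#include <cstdint>

// Spreads the lower 16 bits of v so that bit i moves to bit 2*i.
inline uint32_t SpreadBits(uint32_t v) {
  v &= 0x0000FFFFu;
  v = (v | (v << 8)) & 0x00FF00FFu;
  v = (v | (v << 4)) & 0x0F0F0F0Fu;
  v = (v | (v << 2)) & 0x33333333u;
  v = (v | (v << 1)) & 0x55555555u;
  return v;
}

// Z index of the position (x, y). Assumption: X occupies the even bits
// and Y the odd bits; the text does not specify which is which.
inline uint32_t ZIndex(uint32_t x, uint32_t y) {
  assert(x < 65536 && y < 65536);
  return SpreadBits(x) | (SpreadBits(y) << 1);
}
```

With this convention, the four positions of a 2×2 block map to four consecutive Z indices, which is what allows a quad to be visited in one contiguous run.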

Sections

Metadata within the file is organized into variable-length sections.
Each is identified by a four-character code. Applications may define
for their own use any sequence beginning with ~ and continuing
with three uppercase letters. This definition of the LVT file format
includes six built-in section types, which shall be discussed in turn.
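The rule for application-defined codes can be checked with a few comparisons. The following sketch assumes ASCII encoding; the function name is hypothetical and not part of the format definition.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the four-character code lies in the range reserved for
// applications: '~' followed by three uppercase ASCII letters.
inline bool IsApplicationSection(const uint8_t id[4]) {
  if (id[0] != '~') return false;
  for (int i = 1; i < 4; ++i) {
    if (id[i] < 'A' || id[i] > 'Z') return false;
  }
  return true;
}
```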

LVTD

To allow rapid localization of sections without incurring expensive
hard-disk seek operations, version 3 of the LVTD section is a
directory of fixed-length entries – one per section, including its own.
Entries must be sorted by increasing file offset, and the directory
must reside immediately prior to the end of the file. The number
of entries is derived from the section size, and each includes the
following fields:

    u8 identifier[4];
    u32 version;
    u32 encoding;
    u32 checksum;
    u64 size;
    u64 offset;

identifier is an application-defined character sequence or one
of the paragraph headings in this text. version indicates the
version number of the section definition. Because the format is
intended as a simple intermediate representation, we provide for
neither backward nor forward compatibility. A 32-bit integer is larger
than necessary, but we prefer to use a processor's native integer
type to avoid more complicated instruction encodings for software
reading the fields. encoding must be 0, indicating the section is
uncompressed. checksum must be 0 and is reserved for possible
verification of section integrity in future versions. size indicates
the length [bytes] of the actual section. offset points to its
location in the file. To simplify asynchronous I/O (cf. Section 3.2),
both of these values must be a multiple of sectionAlignment,
which is currently 4 KiB. 64-bit integers avoid restrictions on the
size and position of the section in large files. Note the deliberate
power-of-two size of directory entries, which simplifies address
computations.
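As an illustration of the directory entry layout, the following sketch declares the fields with fixed-width types and verifies the power-of-two entry size at compile time. The struct and function names are assumptions made for this example. Deriving the entry count by plain division assumes the section holds nothing but entries, which the text implies but does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct DirectoryEntry {
  uint8_t identifier[4];  // four-character section code
  uint32_t version;
  uint32_t encoding;      // must be 0 (uncompressed)
  uint32_t checksum;      // must be 0 (reserved)
  uint64_t size;          // section length in bytes
  uint64_t offset;        // section position in the file
};
static_assert(sizeof(DirectoryEntry) == 32, "entry size must be 32 bytes");

constexpr uint64_t kSectionAlignment = 4096;  // 4 KiB, as stated in the text

// Number of entries in a directory section of the given size (assumed to
// contain only entries).
inline size_t NumEntries(uint64_t directorySize) {
  return static_cast<size_t>(directorySize / sizeof(DirectoryEntry));
}

// Checks the constraints the text places on a single entry.
inline bool IsValidEntry(const DirectoryEntry& e) {
  return e.encoding == 0 && e.checksum == 0 &&
         e.size % kSectionAlignment == 0 &&
         e.offset % kSectionAlignment == 0;
}
```

Because each entry occupies 32 bytes, the address of entry i is simply the directory base plus i shifted left by five bits.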

PARA

Version 3 of the PARA section indicates the parameters that
governed the creation of the LVT file:

    u32 interpolation;
    u32 tileEncoding;
    float noDataValue;
    float ignoreValue;
    u32 binFunction;
    u32 numThreads;

interpolation specifies the interpolation method when
downsampling: nearest neighbor (0) or bilinear (1). tileEncoding
indicates whether tiles are uncompressed (0) or encoded with LASC
(1), described in Chapter 4. noDataValue is the pixel component
value used to initialize pixels that lie outside the original image.
Tiles whose pixel components are all equal to this value are
omitted from the file. Setting it in accordance with the most common
luminance in the image may reduce the file size. ignoreValue
allows ignoring all pixel components with a certain value when
computing statistics. To avoid this, specify an impossible value
that does not occur in the image. binFunction indicates whether
the histogram bin function is linear (0) or logarithmic (1) with base
e. numThreads specifies the maximum number of threads in the
parallel pipeline for computing the image pyramid. This value
is of no use to readers of an LVT file, but is written to disk for
convenience.

STAT

Version 1 of the STAT section begins with basic image
characteristics:

    u32 width, height, pixelFormat;

width and height indicate the number of valid pixels in each
dimension, which need not be a multiple of tileDim (256).
pixelFormat is a convenient and compact encoding of the
component type (the representation of a digital number indicating the
intensity within a spectral band for each pixel) and the number of
components per pixel. The size of the component type is stored
within the lower 8 bits to allow efficient computation of storage
requirements. To distinguish between 32-bit integers and
single-precision floating-point numbers, exactly one of three additional
bits must be set. Bit 15 (32768) indicates an unsigned integer, bit 14
denotes a signed integer and bit 13 signals a floating-point number.
The number of components (up to 4096) is stored in bits 16 and
above. The section also stores statistics for each band:

    float ignoreValue;
    u64 numIgnored;
    double min, max, mean, stddev, median, mode;
    u64 histogram[256];

ignoreValue specifies the value of a component to ignore when
computing the statistics. This is useful for images with background
or no-data areas, which would otherwise affect the mean value. To
avoid branching or code duplication, this functionality is always
present. However, it can effectively be disabled by specifying
impossible values such as infinity. numIgnored counts the number
of components that were ignored. min and max are the minimum
and maximum component values encountered. They are initialized
to the largest positive and most negative value representable
as a double, respectively, and remain unchanged if all values are
ignored. mean, stddev (standard deviation) and median are the
eponymous statistical measures. mode is the most frequent value,
computed as the lower bound of the histogram bin whose count is
the largest. histogram indicates how many component values
fall into each of its bins, which are equal-width subdivisions of the
interval [min, max]. The use of 64-bit integers avoids overflow and
inexact counts.
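The bit layout of pixelFormat described above can be illustrated with a small packing and unpacking sketch. All names are hypothetical. The text does not state the unit of the component size, so the example simply passes the value through; treating it as a bit count (for example 32) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Exactly one of these bits must be set in a pixelFormat value.
enum ComponentKind : uint32_t {
  kUnsigned = 1u << 15,  // 32768
  kSigned = 1u << 14,
  kFloat = 1u << 13
};

// Packs component size (lower 8 bits), kind (one of bits 13..15) and the
// number of components (bits 16 and above) into a single u32.
inline uint32_t MakePixelFormat(uint32_t componentSize, ComponentKind kind,
                                uint32_t numComponents) {
  assert(componentSize < 256);
  assert(numComponents >= 1 && numComponents <= 4096);
  return componentSize | kind | (numComponents << 16);
}

inline uint32_t ComponentSize(uint32_t pixelFormat) {
  return pixelFormat & 0xFFu;
}

inline uint32_t NumComponents(uint32_t pixelFormat) {
  return pixelFormat >> 16;
}

inline bool IsFloat(uint32_t pixelFormat) {
  return (pixelFormat & kFloat) != 0;
}
```

Keeping the size in the lowest bits means the storage requirement of a pixel follows from one mask, one shift and one multiplication.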

RANG

Version 6 of the RANG section is a compressed representation of
the range (i.e. offset and size) each tile occupies in the file. Because
tile sizes are always multiples of tileAlignment (which again
corresponds to the minimum 4 KiB sector size), we divide by that
value and store the results in unsigned 16-bit integers referred to as
"quantized sizes". Tile indices include small gaps of unused values
because not all quads generate tiles of levels > 1. To avoid storing
ranges for such indices, we introduce groups of quads denoted
QuadGroup. The data structure describing them is designed to fit
within a single cache line:

    u64 firstTileOffset;
    u16 quadSizes[4];
    u16 tileSizes[24];

firstTileOffset is the file offset of the first tile in this group.
Because this offset is a multiple of tileAlignment, we let its lower
12 bits denote whether this group includes tiles of level > 5. quadSizes
are the quantized total sizes of each quad in the group. tileSizes
stores the quantized sizes of 4+1 tiles in 3 quads, and a total of
4+5 tile sizes for the last quad, because it is the only one that
may generate multiple higher-level tiles. The offset of a tile of a
given index is retrieved by advancing to the group's first offset,
skipping past prior quads within the group and then previous tiles
inside the quad. If the tile's level does not exceed the array capacity,
its size is also retrieved from tileSizes. Otherwise, the tile is
assumed to be uncompressed and its size is computed from the
image pixel format. However, the lower bits of the firstTileOffset
allow eliding tiles whose pixels are all equal to the no-data value;
their sizes are considered to be zero.
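The offset lookup within a QuadGroup can be sketched as follows. The arrangement of tileSizes (five consecutive entries for each of the first three quads, nine for the last) is an assumption consistent with the counts given above, and the function name is hypothetical. Tiles beyond the array capacity and the interpretation of the flag bits are not handled.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kTileAlignment = 4096;  // 4 KiB, as stated in the text

struct QuadGroup {
  uint64_t firstTileOffset;  // lower 12 bits carry flags
  uint16_t quadSizes[4];     // quantized total size of each quad
  uint16_t tileSizes[24];    // quantized sizes: 3 * (4+1) + (4+5) tiles
};
static_assert(sizeof(QuadGroup) == 64, "must fit a 64-byte cache line");

// File offset of tile number tileInQuad within quad number quad.
// Assumed layout: quads 0..2 own five consecutive tileSizes entries each,
// quad 3 owns the remaining nine.
inline uint64_t TileOffset(const QuadGroup& g, unsigned quad,
                           unsigned tileInQuad) {
  assert(quad < 4);
  assert(tileInQuad < (quad == 3 ? 9u : 5u));
  // Strip the flag bits to obtain the aligned offset of the first tile.
  uint64_t offset = g.firstTileOffset & ~(kTileAlignment - 1);
  // Skip past prior quads within the group.
  for (unsigned q = 0; q < quad; ++q) {
    offset += g.quadSizes[q] * kTileAlignment;
  }
  // Skip past previous tiles inside the quad.
  const uint16_t* sizes = g.tileSizes + quad * 5;
  for (unsigned t = 0; t < tileInQuad; ++t) {
    offset += sizes[t] * kTileAlignment;
  }
  return offset;
}
```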
Although QuadGroup minimizes wasted space due to non-present
high-level tiles, embedding images within a square power-of-two
grid for the sake of simple Z coordinate computations may
also lead to large ranges of unused indices. We mitigate this with
an additional QuadChunk data structure:

    u64 firstGroupOffset;
    u64 unused;
    u16 sizes[8];
    u32 validGroups[8];

firstGroupOffset is the offset of the first of eight 32-group
clusters that constitute a chunk. sizes holds the total sizes of
each cluster. validGroups is a bit field indicating which of each
cluster's groups are present. The QuadGroup governing a given
tile is located by starting at the first offset, skipping previous
clusters and then adding the size of QuadGroup times the number of
prior nonzero bits in the cluster's validGroups field. This data
structure enables 256-fold compression of unused QuadGroup at
the expense of a single cache-line access and some minor
computation. The RANG section consists of QuadChunk instances
covering all possible tile indices followed by as many QuadGroup
as needed.
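Locating a QuadGroup through its QuadChunk can be sketched as follows. The unit of the sizes array is not stated in the text; the example assumes bytes. The function names and the 64-byte QuadGroup size constant are likewise assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

struct QuadChunk {
  uint64_t firstGroupOffset;  // offset of the first cluster's first group
  uint64_t unused;
  uint16_t sizes[8];          // total size of each cluster (assumed: bytes)
  uint32_t validGroups[8];    // one presence bit per group in each cluster
};
static_assert(sizeof(QuadChunk) == 64, "must fit a 64-byte cache line");

constexpr uint64_t kQuadGroupSize = 64;  // assumed size of one QuadGroup

// Counts the nonzero bits in v.
inline unsigned CountBits(uint32_t v) {
  unsigned n = 0;
  for (; v != 0; v &= v - 1) ++n;
  return n;
}

// Stores the offset of the requested group in *offset and returns true,
// or returns false if the group is not present.
inline bool FindGroup(const QuadChunk& c, unsigned cluster, unsigned group,
                      uint64_t* offset) {
  assert(cluster < 8 && group < 32);
  if (((c.validGroups[cluster] >> group) & 1u) == 0) return false;
  uint64_t pos = c.firstGroupOffset;
  // Skip previous clusters.
  for (unsigned i = 0; i < cluster; ++i) pos += c.sizes[i];
  // Skip the groups that are present before this one in the same cluster.
  const uint32_t prior = c.validGroups[cluster] & ((1u << group) - 1u);
  pos += kQuadGroupSize * CountBits(prior);
  *offset = pos;
  return true;
}
```

The population count replaces a per-group offset table, which is where the compression of absent groups comes from.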

PROJ

To allow associating pixels with geographic coordinates, version 1
of the optional PROJ section stores information about the map
projection:

    i32 zone;
    double ulx, uly, lrx, lry;
    char band;

ul and lr denote the upper-left and lower-right corners, for
which we store x and y coordinates. zone is -1 if the other
values are invalid/unknown, -2 to indicate the coordinates are
latitude/longitude, or a zone in [1, 60] for Universal Transverse
Mercator (UTM) coordinate systems. band is '?' if unknown,
otherwise a Military Grid Reference System (MGRS) latitude band.

CELL

Version 2 of the optional CELL section provides support for
combining presentation slides or other pictures into one large image.
This allows zooming in on individual slides without requiring
separate LVT files. Each slide resides in a square cell, and the
image consists of a square cell matrix with power-of-two dimensions.
Cells are described by the following:

    u32 flags;
    u32 cellDim;
    u32 upperLeftX, upperLeftY;
    u32 elementWidth, elementHeight;
    u32 marginLeft, marginUpper;

flags has bit 0 set if the cell should not be zoomed. cellDim
indicates the width and height in pixels of the cell, and must be
divisible by tileDim. Each cell must have the same dimension.
upperLeftX and upperLeftY are the coordinates of the cell's
top left pixel within the entire image and are therefore multiples of
cellDim. Cells are arranged according to a C-Scan [211]. Rows
alternate between left to right and right to left ordering; this enables
a simple sliding transition animation without bringing any other
cells into view. elementWidth and elementHeight describe
the size (in pixels) of the image that is embedded within the cell
and must not exceed cellDim. marginLeft and marginUpper
indicate the number of no-data pixels on the left and upper border
of the cell. They must be non-zero multiples of tileDim.
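The alternating row order of the C-Scan can be illustrated by computing a cell's upper-left coordinates from its position in the scan. This sketch assumes that the first row runs left to right and that cells are numbered consecutively along the scan; both the struct and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct CellPos {
  uint32_t upperLeftX;
  uint32_t upperLeftY;
};

// Position of the cell with the given scan index in a square matrix of
// cellsPerRow x cellsPerRow cells, each cellDim pixels wide and high.
// Assumption: even rows run left to right, odd rows right to left.
inline CellPos CellPosition(uint32_t index, uint32_t cellsPerRow,
                            uint32_t cellDim) {
  assert(cellsPerRow != 0 && index < cellsPerRow * cellsPerRow);
  const uint32_t row = index / cellsPerRow;
  uint32_t col = index % cellsPerRow;
  if (row & 1u) col = cellsPerRow - 1 - col;  // reverse direction on odd rows
  return CellPos{col * cellDim, row * cellDim};
}
```

Consecutive scan indices always refer to horizontally or vertically adjacent cells, which is why a sliding transition never passes over an unrelated cell.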

Concluding Notes

The LVT file format has been designed for efficiency and flexibility,
including a multitude of pixel formats. Extensibility is ensured
via versioning and allowing for additional application-defined
metadata. Storing a tiled pyramid allows smooth navigation in
terapixel-scale images. A novel space-filling curve minimizes
memory requirements when creating the pyramid, which is written
sequentially without any disk seeks. An awareness of low-level
alignment issues reduces overhead. Each section and tile resides in
its own disk sector, thus enabling direct I/O without additional
copying (cf. Section 3.2). This also ensures the SIMD alignment
requirements are met when decompressing tiles. A compact directory
avoids seeks when finding sections. The compressed tile lookup data
structure allows retrieving the size of any tile after only two cache
line accesses and modest computation.
It was a pleasure to design a capable, yet simple and highly
efficient layout that avoids the shortcomings of previous formats.
Although chiefly intended as an optimized internal representation
for an image viewer, its efficiency may also lend itself to other
applications.

Bibliography

[1] C. Bohren and A. Fraser. Colors of the sky. The Physics Teacher, pages 267–272, May 1985. Available from: http://homepages.wmich.edu/~korista/colors_of_the_sky-Bohren_Fraser.pdf.

[2] Y. Chan and V. Koo. An introduction to synthetic aperture radar (SAR). Progress In Electromagnetics Research B, 2:27–60, 2008.

[3] DigitalGlobe. DigitalGlobe core imagery products guide. Available from: http://www.digitalglobe.com/digitalglobe2/file.php/811/DigitalGlobe_Core_Imagery_Products_Guide.pdf.

[4] J. Pike. National image interpretability rating scales, January 1998. Available from: http://www.fas.org/irp/imint/niirs.htm.

[5] K. Jacobsen. Recent developments of digital cameras and space imagery, January 2011. Available from: http://www.ipi.uni-hannover.de/uploads/tx_tkpublikationen/2011_GISOSTRAVA_KJ.pdf.

[6] I. Niemeyer, S. Nussbaum, and M. Canty. Automation of change detection procedures for nuclear safeguards-related monitoring purposes. In Geoscience and Remote Sensing Symposium, 2005. IGARSS '05. Proceedings. 2005 IEEE International, volume 3, pages 2133–2136, July 2005. doi:10.1109/IGARSS.2005.1526439.

[7] E. Bjorgo. United aid from the sky – a framework paper on current and potential use of satellite imagery in united nations humanitarian organizations, April 2001. Available from: http://www.humanitarianinfo.org/imtoolbox/03_Mapping_GIS_GPS/Mapping_Reference/Remote_Sensing_Imagery/2001_UN_Remote_Sensing.doc.

[8] S. Smith. The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, 1997. Available from: http://www.dspguide.com/.

[9] J. Nickolls and W. Dally. The GPU computing era. IEEE Micro, 30(2):56–69, 2010. Available from: http://doi.ieeecomputersociety.org/10.1109/MM.2010.41.

[10] ELPIDA. Introduction to GDDR5, March 2010. Available from: http://www.elpida.com/pdfs/E1600E10.pdf.

[11] D. Patterson. The top 10 innovations in the new NVIDIA Fermi architecture, and the top 3 next challenges, September 2009. Available from: http://www.nvidia.com/content/PDF/fermi_white_papers/D.Patterson_Top10InnovationsInNVIDIAFermi.pdf.

[12] S. Sirowy and A. Forin. Where's the beef? why FPGAs are so fast. Technical Report MSR-TR-2008-130, Microsoft Research, September 2008. Available from: http://research.microsoft.com/apps/pubs/default.aspx?id=70636.

[13] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. IEEE Trans. on CAD of Integrated Circuits and Systems, 26(2):203–215, 2007. Available from: http://dx.doi.org/10.1109/TCAD.2006.884574.

[14] J. Chhugani, A. Nguyen, V. Lee, W. Macy, M. Hagog, Y. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. PVLDB, 1(2):1313–1324, 2008. Available from: http://www.vldb.org/pvldb/1/1454171.pdf.

[15] H. Sutter. The free lunch is over: A fundamental turn toward concurrency. Dr. Dobb's Journal, March 2005. Available from: http://www.ddj.com/web-development/184405990.

[16] Intel Corporation. Analyzing business as it happens, April 2011. Available from: http://www.intel.com/en_US/Assets/PDF/whitepaper/mc_sap_wp.pdf.

[17] C. Angelini. Inside of sandy bridge: Cores and cache, January 2011. Available from: http://www.tomshardware.com/reviews/sandy-bridge-core-i7-2600k-core-i5-2500k,2833-2.html.

[18] Texas Instruments. C6000 high performance multicore DSP, 2011. Available from: http://focus.ti.com/docs/prod/folders/print/tms320c6678.html.

[19] L. Nilsson. Intel's romley platform will be available for LGA-1356 and LGA-2011, February 2011. Available from: http://goo.gl/GVRV2.

[20] NVIDIA Corporation. NVIDIA quadro, 2011. Available from: http://www.nvidia.co.uk/object/quadro_buy_now_uk.html.

[21] BallaTheFeared. The true power of sandy bridge? (130 GFlop peak linpack), January 2011. Available from: http://www.overclock.net/intel-cpus/916911-true-power-sandy-bridge-130-gflop.html.

[22] T. Hagen. Parallel and heterogeneous computing, April 2010. Available from: http://www.sintef.no/project/Collab/Presentations/Hagen_CollabWorkshop_HeterogeneousComputing.pdf.

[23] T. Grant. Xilinx redefines power, performance, and design productivity with three innovative 28 nm FPGA families, March 2011. Available from: http://www.xilinx.com/support/documentation/white_papers/wp373_V7_K7_A7_Devices.pdf.

[24] G. Gasior. Exploring the impact of memory speed on sandy bridge performance, February 2011. Available from: http://techreport.com/articles.x/20377/2.

[25] Texas Instruments. DDR3 design requirements for keystone devices, April 2011. Available from: http://focus.ti.com/lit/an/sprabi1a/sprabi1a.pdf.

[26] Intel. Intel core i7-2600k processor, 2011. Available from: http://ark.intel.com/Product.aspx?id=52214.

[27] P. Dillien. Comment on the likely selling price of the 2M LUT device, November 2010. Available from: http://goo.gl/eUr7h.

[28] M. Kreuzer. DSP-Messlatte höher gelegt, November 2010. Available from: http://www.elektroniknet.de/bauelemente/produkte/halbleiter/article/30498/.

[29] J. Hussein, M. Klein, and M. Hart. Lowering power at 28 nm with Xilinx 7 series FPGAs, February 2011. Available from: http://www.xilinx.com/support/documentation/white_papers/wp389_Lowering_Power_at_28nm.pdf.

[30] Texas Instruments. Advanced digital CMOS for embedded processing, 2011. Available from: http://www.ti.com/corp/docs/manufacturing/advancedCMOS.shtml.

[31] ITRS International Roadmap Committee. International technology roadmap for semiconductors, 2009. Available from: http://www.itrs.net/Links/2009ITRS/2009Chapters_2009Tables/2009_ExecSum.pdf.

[32] M. Moncur. Quotation 933, 2010. Available from: http://www.quotationspage.com/quote/933.html.

[33] V. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In Proc. 37th International Symposium on Computer Architecture (37th ISCA '10), pages 451–460, Saint-Malo, France, June 2010. ACM SIGARCH. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.2755&rep=rep1&type=pdf.

[34] N. Dickson, K. Karimi, and F. Hamze. Importance of explicit vectorization for CPU and GPU software performance. CoRR, abs/1004.0024, 2010. Available from: http://arxiv.org/abs/1004.0024.

[35] R. Vuduc, A. Chandramowlishwaran, J. Choi, M. Guney, and A. Shringarpure. On the limits of GPU acceleration. In Proc. HotPar '10 (2nd USENIX Workshop on Hot Topics in Parallelism), Berkeley, CA, June 2010. Usenix Assoc. Available from: https://www.usenix.org/events/hotpar10/tech/full_papers/Vuduc.pdf.

[36] G. Dasika. Power-Efficient Accelerators for High-Performance Applications. PhD thesis, University of Michigan, 2011.

[37] P. Sanders. Algorithm engineering – an attempt at a definition. In S. Albers, H. Alt, and S. Näher, editors, Efficient Algorithms, volume 5760 of Lecture Notes in Computer Science, pages 321–340. Springer, 2009. Available from: http://dx.doi.org/10.1007/978-3-642-03456-5.

[38] P. McKenney. Memory barriers: a hardware view for software hackers, April 2009. Available from: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf.

[39] D. Kanter. Intel's sandy bridge microarchitecture, September 2010. Available from: http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=7.

[40] M. Hill and A. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612–1629, December 1989. Available from: ftp://ftp.cs.wisc.edu/markhill/Papers/toc89_cpu_cache_associativity.pdf.

[41] P. Flajolet. Approximate counting: A detailed analysis. BIT, 25(1):113–134, 1985. Available from: http://algo.inria.fr/flajolet/Publications/Flajolet85c.pdf.

[42] U. Drepper. What every programmer should know about memory, November 2007. Available from: http://people.redhat.com/drepper/cpumemory.pdf.

[43] N. Slingerland and A. Smith. Measuring the performance of multimedia instruction sets. IEEE Trans. Computers, 51(11):1317–1332, 2002. Available from: http://www.cs.berkeley.edu/~slingn/publications/mm_isa_perf/csd-00-1125.pdf.

[44] G. Ren, P. Wu, and D. Padua. A preliminary study on the vectorization of multimedia applications for multimedia extensions. In L. Rauchwerger, editor, LCPC, volume 2958 of LNCS, pages 420–435. Springer, 2003. Available from: http://polaris.cs.uiuc.edu/publications/2003-ren-old.pdf.

[45] J. Parri, D. Shapiro, M. Bolic, and V. Groza. Returning control to the programmer: SIMD intrinsics for virtual machines. Communications of the ACM, 54(4):38–43, April 2011. Available from: http://delivery.acm.org/10.1145/1950000/1945954/p30-parri.pdf.

[46] IBM Corporation. Cell broadband engine, 2006. Available from: http://pcsostres.ac.upc.edu/cellsim/lib/exe/fetch.php/0845-goetz.pdf?id=additional_cell_documents&cache=cache.

[47] T. Mudge. Power: A first-class architectural design constraint. IEEE Computer, 34(4):52–58, 2001. Available from: http://www.eecs.umich.edu/~tnm/papers/hipc.pdf.

[48] S. Naffziger, B. Stackhouse, T. Grutkowski, D. Josephson, J. Desai, E. Alon, and M. Horowitz. The implementation of a 2-core multi-threaded Itanium family processor. In IEEE Journal of Solid-State Circuits, pages 182–183, 2005. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.8221&rep=rep1&type=pdf.

[49] Y. Liu, R. Dick, L. Shang, and H. Yang. Accurate temperature-dependent integrated circuit leakage power estimation is easy. In R. Lauwereins and J. Madsen, editors, DATE, pages 1526–1531. ACM, 2007. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.2961&rep=rep1&type=pdf.

[50] C. Tseng and S. Figueira. An analysis of the energy efficiency of multi-threading on multi-core machines. In Green Computing Conference, 2010 International, pages 283–290, August 2010. Available from: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5598301.

[51] F. Putze, P. Sanders, and J. Singler. MCSTL: The multi-core standard template library. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (22th PPOPP 2007), pages 144–145, San Jose, CA, March 2007. ACM SIGPLAN.

[52] Intel Corporation. Intel threading building blocks design patterns, September 2010. Available from: http://threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Design_Patterns.pdf.

[53] A. Fog. The microarchitecture of Intel and AMD CPUs. Copenhagen University, January 2008. Available from: http://www.agner.org/optimize/microarchitecture.pdf.

[54] PEPPHER Consortium. Performance portability and programmability for heterogeneous many-core architectures, 2010. Available from: http://www.par.univie.ac.at/project/peppher/publications/PEPPHER_Fiche.pdf.

[55] ISO/IEC 14882. Programming languages – C++, October 2003.

[56] A. Fog. Instruction tables. Copenhagen University, February 2010. Available from: http://www.agner.org/optimize/instruction_tables.pdf.

[57] S. Taylor. Intel Integrated Performance Primitives: How to Optimize Software Applications Using Intel IPP. Intel Press, 2004.

[58] J. Wassenberg. Optimizing file accesses via ordering and caching, April 2006. Available from: http://wassenberg.dreamhosters.com/articles/study_thesis.pdf.

[59] S. Bhattacharya, S. Pratt, B. Pulavarty, and J. Morgan. Asynchronous I/O support in Linux 2.5. In Proceedings of the Linux Symposium, July 2003.

[60] OSR Open Systems Resources. Life in the fast I/O lane. The NT Insider, 3(1), February 1996. Available from: http://www.osronline.com/article.cfm?id=166.

[61] Microsoft Corporation. The restore may fail when you restore a backup that is stored in tapes on a SQL server 2000 server that is running windows 2000 datacenter or advanced server. Available from: http://support.microsoft.com/kb/280793.

[62] IDEMA. Advanced format hard disk drives, March 2011. Available from: http://www.idema.org/content/uploads/downloads/2011/03/wp-Advanced-Format-for-Hard-Disk-Drives.pdf.

[63] IEEE. IEEE Std 1003.1-2001 – aio.h, 2004. Available from: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/aio.h.html.

[64] Intel Corporation. Intel's Asynchronous I/O Library for Windows Operating Systems, 2010. Available from: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/win/cref_cls/common/cppref_asynchioC_intro.htm#cppref_asynchioC_intro.

[65] R. Vicik. Designing applications for high performance, June 2008. Available from: http://goo.gl/Nc2lY.

[66] Microsoft Corporation. Asynchronous disk I/O appears as synchronous on Windows NT, Windows 2000, and Windows XP, February 2009. Available from: http://support.microsoft.com/kb/156932.

[67] S. Tsuji. Benchmarks of sandforce based SSDs, 2010. Available from: http://www.thosp.com/PC/SSD_vs_HDD/SSD_benchmark_SandForce/SandForce_en/.

[68] University of Pennsylvania. include file defining constants/macros for PM files, 1991. Available from: http://www.unf.edu/public/cap6400/ychua/xv-2.21/pm.h.

[69] F. Kainz and R. Bogart. Technical introduction to OpenEXR, February 2009. Available from: http://www.openexr.com/TechnicalIntroduction.pdf.

[70] F. Warmerdam. Erdas imagine .ige (large raster spill file) format. Available from: http://home.gdal.org/projects/imagine/ige_format.html.

[71] A. Grønheim. NATO secondary imagery format (NSIF), November 1998. Available from: http://www.nato.int/structur/AC/224/standard/4545/4545_documents/4545_ed1_amd1.pdf.

[72] O. Eichhorn. BigTIFF version of libtiff library, March 2008. Available from: http://www.aperio.com/bigtiff/#FILE_FORMAT.

[73] J. Wassenberg. Lossless asymmetric single instruction multiple data codec. Software: Practice and Experience, 2011. Available from: http://onlinelibrary.wiley.com/doi/10.1002/spe.1109/pdf.

[74] StorageReview.com. Storagereview.com's drive performance resource center, May 2011. Available from: http://www.storagereview.com/php/benchmark/bench_sort.php.

[75] P. Howard and J. Vitter. Fast and efficient lossless image compression. In Data Compression Conference, pages 351–360, 1993.

[76] N. Memon, D. Neuhoff, and S. Shende. An analysis of some common scanning techniques for lossless image coding. IEEE Trans. Image Processing, 9(11):1837–1848, November 2000. Available from: http://dx.doi.org/10.1109/83.877207.

[77] T. Seemann, P. Tischer, and B. Meyer. History-based blending of image sub-predictors. In Picture Coding Symposium, pages 147–151, 1997. Available from: http://www.cs.monash.edu.au/~torsten/publications.shtml.

[78] J. Wang, M. Zhang, and S. Tang. Spectral and spatial decorrelation of Landsat-TM data for lossless compression. Geoscience and Remote Sensing, IEEE Transactions on, 33(5):1277–1285, September 1995. doi:10.1109/36.469492.

[79] N. Merhav, G. Seroussi, and M. Weinberger. Optimal prefix codes for sources with two-sided geometric distributions. IEEE Transactions on Information Theory, 46(1):121–135, 2000.

[80] B. Meyer and P. Tischer. Glicbawls – grey level image compression by adaptive weighted least squares. In Data Compression Conference, page 503, 2001. Available from: http://computer.org/proceedings/dcc/1031/10310503.pdf.

[81] Y. Hashidume and Y. Morikawa. Lossless image coding based on minimum mean absolute error predictors. In SICE, 2007 Annual Conference, pages 2832–2836, September 2007. doi:10.1109/SICE.2007.4421471.

[82] N. Memon and K. Sayood. An asymmetric lossless image compression technique. In ICIP, pages III:97–100, 1995. Available from: http://dx.doi.org/10.1109/ICIP.1995.537589.

[83] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In J. Huai et al., editors, WWW, pages 387–396. ACM, 2008. Available from: http://doi.acm.org/10.1145/1367497.1367550.

[84] J. van Waveren. Real-time texture streaming & decompression. Technical report, Id Software, November 2006. Available from: http://software.intel.com/file/17248/.

[85] R. Fraedrich, M. Bauer, and M. Stamminger. Sequential data compression of very large data in volume rendering. In H. Lensch et al., editors, VMV, pages 41–50. Aka GmbH, 2007.

[86] C. Bloom. Huffman – arithmetic equivalence, August 2010. Available from: http://cbloomrants.blogspot.com/2010/08/08-11-10-huffman-arithmetic-equivalence.html.

[87] M. Mahoney. Large text compression benchmark, January 2011. Available from: http://mattmahoney.net/dc/text.html.

[88] M. Liddell and A. Moffat. Decoding prefix codes. Software: Practice and Experience, 36, 2006.

[89] J. Steim. steim compression, March 1994. Available from: http://www.ncedc.org/qug/software/steim123.ps.Z.

[90] V. Anh and A. Moffat. Index compression using 64-bit words. Software: Practice and Experience, 40(2):131–147, 2010. Available from: http://dx.doi.org/10.1002/spe.948.

[91] M. Zukowski, S. Héman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In L. Liu et al., editors, ICDE, page 59. IEEE Computer Society, 2006. Available from: http://doi.ieeecomputersociety.org/10.1109/ICDE.2006.150.

[92] T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The implementation and performance of compressed databases. SIGMOD Record, 29(3):55–67, September 2000.

[93] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1):385–394, 2009. Available from: http://www.vldb.org/pvldb/2/vldb09-327.pdf.

[94] B. Schlegel, R. Gemulla, and W. Lehner. Fast integer compression using SIMD instructions. In Proceedings of the Sixth International Workshop on Data Management on New Hardware, DaMoN '10, pages 34–40, New York, NY, USA, 2010. ACM. Available from: http://doi.acm.org/10.1145/1869389.1869394.

[95] X. Zhao and Z. He. Lossless image compression using super-spatial structure prediction. Signal Processing Letters, IEEE, 17:383–386, April 2010. doi:10.1109/LSP.2010.2040925.

[96] X. Wu and N. Memon. CALIC – A context based adaptive lossless image codec. IEEE ICASSP, 4:1890–1893, 1996. Available from: ftp://ftp.csd.uwo.edu/pub/from_wu/.

[97] R. Fisher. General-purpose SIMD within a Register: Parallel Processing on Consumer Microprocessors. PhD thesis, Purdue University, January 2003. Available from: http://docs.lib.purdue.edu/dissertations/AAI3108343.

[98] S. Van Assche, W. Philips, and I. Lemahieu. Lossless compression of pre-press images using a novel color decorrelation technique, 1997. Available from: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.9033.

[99] M. Weinberger, G. Seroussi, and G. Sapiro. Loco-I: A low complexity, context-based, lossless image compression algorithm. In Data Compression Conference, pages 140–149, 1996.

[100] J. Ström, P. Wennersten, J. Rasmusson, J. Hasselgren, J. Munkberg, P. Clarberg, and T. Akenine-Möller. Floating-point buffer compression in a unified codec architecture. In Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, GH '08, pages 75–84. Eurographics Association, 2008. Available from: http://portal.acm.org/citation.cfm?id=1413957.1413970.

[101] M. Adams and F. Kossentini. Jasper: A software-based JPEG-2000 codec implementation, May 25 2000. Available from: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.7339.

[102] I. Pavlov. 7-Zip, 2011. Available from: http://www.7-zip.org/.

[103] J. Wassenberg, W. Middelmann, and S. Laryea. Highly optimized weighted-IHS pan sharpening with edge-preserving denoising. In U. Michel and D. Civco, editors, Earth Resources and Environmental Remote Sensing/GIS Applications, volume 7831. SPIE, 2010. Available from: http://publica.fraunhofer.de/eprints/urn:nbn:de:0011-n-1515140.pdf, doi:10.1117/12.865014.

[104] B. Declercq. Lunar mosaic, September 2010. Available from: http://www.astronomie.be/bart.declercq/Mozaiek_20100922.jpg.

[105] NASA. LROC WAC mosaic of the lunar nearside, December 2010. Available from: http://wms.lroc.asu.edu/lroc_browse/view/wac_nearside.

[106] D. Black-Schaffer. Stanford memorial church, high resolution images, 2007. Available from: http://cva.stanford.edu/people/davidbbs/photos/stanford_memorial_church/.

[107] F. Warmerdam. Geospatial data abstraction library, November 2010. Available from: http://www.gdal.org/.

[108] Valve Corporation. Steam hardware & software survey, January 2011. Available from: http://store.steampowered.com/hwsurvey/cpus/.

[109] M. Burtscher and P. Ratanaworabhan. FPC: A high-speed compressor for double-precision floating-point data. IEEE Trans. Computers, 58(1):18–31, 2009. Available from: http://dx.doi.org/10.1109/TC.2008.131.

[110] G. Dial, H. Bowen, F. Gerlach, J. Grodecki, and R. Oleszczuk. IKONOS satellite, imagery, and products. Remote Sensing of Environment, 88(1-2):23–36, November 2003. Available from: http://www.sciencedirect.com/science/article/B6V6V-4B1W13X-3/2/91061af6561718a9cbdfe0233b0c7285.

[111] A. Koschan and W. Skarbek. Colour image segmentation - A survey. Technical Report 94-32, Technical University of Berlin, October 1994. Available from: http://citeseer.ist.psu.edu/78729.html.

[112] S. Klonus and M. Ehlers. Performance of evaluation methods in image fusion. In Information Fusion, 2009. FUSION '09. 12th International Conference, pages 1409–1416, July 2009. Available from: http://isif.org/fusion/proceedings/fusion09CD/data/papers/0136.pdf.

[113] Y. Zhang. Problems in the fusion of commercial high-resolution satellite as well as LANDSAT 7 images and initial solution. Symposium on Geospatial Theory, Processing and Applications, 2002.

[114] GeoEye. IKONOS relative spectral response, 2008. Available from: http://www.geoeye.com/CorpSite/assets/docs/technical-papers/2008/IKONOS_Relative_Spectral_Response.xls.

[115] T. Tu, P. Huang, C. Hung, and C. Chang. A fast intensity-hue-saturation fusion technique with spectral adjustment for IKONOS imagery. IEEE Geoscience and Remote Sensing Letters, 1, 2004.

[116] Y. Siddiqui. The modified IHS method for fusing satellite imagery. In ASPRS 2003 Annual Conference Proceedings, 2003.

[117] B. Aiazzi, S. Baronti, and M. Selva. Improving component substitution pansharpening through multivariate regression of MS+pan data. IEEE Trans. Geoscience and Remote Sensing, 45(10):3230–3239, October 2007. Available from: http://dx.doi.org/10.1109/TGRS.2007.901007.

[118] Cooke Corporation. snr – signal-to-noise-ratio, April 2005. Available from: http://www.pco.de/fileadmin/user_upload/db/download/pco_cooKe_kb_snr_0504.pdf.

[119] A. Garzelli, F. Nencini, and L. Capobianco. Optimal MMSE pan sharpening of very high resolution multispectral images. IEEE Geoscience and Remote Sensing Letters, 46:288–236, 2008.

[120] J. Cockburn. The orthogonality principle, 2009. Available from: http://devserv.rit.edu/Topics/AnalyticalTopics20091/content/enforced/245450-030674001.20091/Lec11a_2x.pdf.

[121] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV, pages 839–846, 1998. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.2091&rep=rep1&type=pdf.

[122] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. Technical report, Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, 2006.

[123] S. Han, M. Jeong, S. Woo, and B. You. Architecture and implementation of real-time stereo vision with bilateral background subtraction. In D. Huang, L. Heutte, and M. Loog, editors, ICIC, volume 4681 of LNCS, pages 906–912. Springer, 2007. Available from: http://dx.doi.org/10.1007/978-3-540-74171-8_91.

[124] A. Langs and M. Biedermann. Filtering video volumes using the graphics hardware. In SCIA, pages 878–887, 2007. Available from: http://www.uni-koblenz.de/~cg/Veroeffentlichungen/LangsBiedermann_SCIA07_LNCS.pdf.

[125] Y. Zhang. Methods for image fusion quality assessment – review comparison and analysis. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXVII, 2008.

[126]

[127]

[128]

[129]

[130]

[131]

[132]

[133]

Q.Du,N.Younan,R.King,andV.Shah.Ontheper-
formanceevaluationofpan-sharpeningtechniques.Geo-
scienceandRemoteSensingLetters,IEEE,4(4):518–522,October
2007.Availablefrom:http://ieeexplore.ieee.org/
doi:10.,stamp/stamp.jsp?tp=&arnumber=4317530.1109/LGRS.2007.896328

Z.IEEEWangSignalandPrA.ocessingBovik.LettersA,9,univ2002.ersalimagequalityindex.

L.Alparone,S.Baronti,A.Garzelli,andF.Nencini.Aglobal
qualitymeasurementofpan-sharpenedmultispectralim-
agery.IEEEGeoscienceandRemoteSensingLetters,1,2004.

J.Wparallelassenberalgorithmg,W.forMiddelmann,graph-basedandP.imageSanders.segmentation.AnefficientIn
CAIP,pages1003–1010,2009.Availablefrom:http://dx.
doi.org/10.1007/978-3-642-03767-2_122.

I.Vanhameletal.Scalespacesegmentationofcolorim-
agesusingwatershedsandfuzzyregionmerging.InICIP
(1),pages734–737,2001.Availablefrom:http://dx.doi.
.org/10.1109/ICIP.2001.959150

D.ComaniciuandP.Meer.Meanshiftanalysisandappli-
cations.InICCV,pages1197–1203,1999.Availablefrom:
.http://dx.doi.org/10.1109/ICCV.1999.790416

J.DeterWassenberminationg,ofD.Bulatomaximallyv,W.stableextrMiddelmann,emalrandegionsP.inSanders.large
images.InSignalProcessing,PatternRecognition,andApplica-
2008.yFebruar,tions

P.FelzenszwalbandD.Huttenlocher.Efficientgraph-based
imagesegmentation.IJCV,59(2):167–181,September2004.
Availablefrom:http://dx.doi.org/10.1023/B:VISI.
.0000022288.19776.77176

[134]

[135]

[136]

[137]

[138]

[139]

[140]

[141]

[142]

R.CVGIPHaralick,and29:100–132,L.ShapirJanuaro.y1985.Imagesegmentationtechniques.

C.Thomas,T.Ranchin,L.Wald,andJ.Chanussot.Syn-
thesisofmultispectralimagestohighspatialresolution:A
criticalreviewoffusionmethodsbasedonremotesensing
physics.IEEETrans.GeoscienceandRemoteSensing,46(5):1301–
1312,May2008.Availablefrom:http://dx.doi.org/10.
.1109/TGRS.2007.912448

J.Canny.Acomputationalapproachtoedgedetection.In
1987.184–203,pages,RCV87

J.Steiner.Einfachebeweisederisoperimetrischenhauptsätze.
JournalfürdiereineundangewandteMathematik,18:281–296,
1838.

D.estimationShin,R.usingPark,S.Yadaptivang,eandgaussianJ.Jung.filtering.Block-basedIEEETnoiserans.
Consum.Electron.,51:218–226,2005.

A.videoAmernoiseandE.estimation.Dubois.IEEEFastTandrans.rCireliablecuitsSyst.structurVideoe-orientedTechn,
15(1):113–118,2005.Availablefrom:http://dx.doi.org/
.10.1109/TCSVT.2004.837017(410)1

R.TarjanandJ.vanLeeuwen.Worst-caseanalysisofset
unionalgorithms.JACM,31(2):245–281,April1984.

A.06.WAveber.ailableThefrom:USC-SIPIImageDatabase.http://sipi.usc.edu/database/Accessed2008-10-.

A.neighborBuades,hoodB.Coll,filtersandandJ.itsMorel.solution.ThestairIEEEcasingTrans.effectImagein
Processing,15(6):1499–1505,June2006.Availablefrom:http:
.//dx.doi.org/10.1109/TIP.2006.871137

177

[143]

[144]

[145]

[146]

[147]

[148]

[149]

[150]

V.Osipov,P.Sanders,andJ.Singler.Thefilter-kruskalmini-
mumspanningtreealgorithm.InI.FinocchiandJ.Hersh-
berger,editors,ALENEX,pages52–61.SIAM,2009.Available
http://www.siam.org/proceedings/alenex/om:fr.2009/alx09_005_osipovv.pdf

J.ZunicandN.Sladoje.Efficiencyofcharacterizingellipses
andellipsoidsbydiscretemoments.IEEETrans.PatternAnal.
Mach.Intell,22(4):407–414,2000.Availablefrom:http://
.www.computer.org/tpami/tp2000/i0407abs.htm

M.IEEEHu.Trans.VisualInformationpatternrTheory,ecognitionb8(2):179–187,ymomentFebruarinvyariants.1962.
Availablefrom:http://ieeexplore.ieee.org/iel5/
.4547527/22787/01057692.pdf

H.Cramér.MathematicalMethodsofStatistics.Princeton
1946.ess,PrersityUniv

E.ResourWce,eisstein.2011.Ellipse.AvailableMathWfrom:orld–AWhttp://mathworld.olframWeb
.wolfram.com/Ellipse.html

J.Iivcombinedarinen,M.shapePeura,descriptorsJ.Sarela,forandirrA.egularVisa.objects.ComparisonInBMVCof,
pages430–439,1997.Availablefrom:http://www.bmva.
.ac.uk/bmvc/1997/papers/062/bmvc97.html

D.stableNisterextrandemalrH.egions.Stewenius.InECCVLinear,pagestimeII:maximally183–196,
978-2008.3-Av540-ailable88688-from:4_14.http://dx.doi.org/10.1007/

G.analysisHarfstoftheandE.Union-FindReingold.dataAstructure.potential-basedSIGACT,amortized31:86–95,
2000.eptemberS

178

[151]

[152]

[153]

[154]

[155]

[156]

[157]

[158]

[159]

[160]

RobustSystem.ImageAccessedUnderstanding2008-09-23.Lab.AvailableEDISONfrom:
http://www.caip.rutgers.edu/riul/research/.code/EDISON/doc/segm.html

P.Felzenszwalb.Efficientgraph-basedimagesegmentation,
March2007.Accessed2008-01-11.Availablefrom:http:
.//people.cs.uchicago.edu/~pff/segment/

D.09.AvBesedin.ailablefrom:RightMarkmemoryhttp://cpu.rightmark.organalyzer.Accessed.2009-01-

J.Wassenberg.Fast,high-qualitylineantialiasingbyprefilter-
ingwithanoptimalcubicpolynomial.InProc.of4thPacific-
RimSymposiumonImageandVideoTechnology(PSIVT2010),
2010.Availablefrom:http://publica.fraunhofer.
.1516338.pdfn-de/eprints/urn:nbn:de:0011-

J.plotterBr.esenham.IBMSystemsAlgorithmJournal,for4(1):25–30,computerJulycontrol1965.ofadigital

Ppla.y.GarIBMdner.Tech.ModificationsDisclosureofBull.Br18,esenhams1975.algorithmfordis-

V.Comput.BoyerandGraph.J.BourForum,din.Fast18(3):377–384,lines:A1999.spanbyspanmethod.

J.ACMRokne,TB.ransactionsWyvill,onandGraphicsX.,Wu.Fast9(4):376–388,lineOctoberscan-conv1990.ersion.

J.Bresenham.Incrementallinecompaction.Comput.J,
1982.25(1):116–120,

.DrM.Abrash.Thegood,thebad,andtherun-sliced.
DobbsJournal,17(11):171–176,November1992.Available
http://downloads.gamedev.net/pdf/gpbb/om:fr.gpbb36.pdf

179

[161]

[162]

[163]

[164]

[165]

[166]

[167]

[168]

[169]

J.statisticsChen,ofX.Wlineang,anddistribution.J.BrIEEEesenham.ComputerTheanalysisGraphicsandand
Applications,22(6):100–107,2002.Availablefrom:http:
.//computer.org/cga/cg2002/g6100abs.htm

J.toFerwassessingerdatheandD.qualityGrofeenberg.antialiasedApsyimages.chophysicalIEEEapprComputeroach
GraphicsandApplications,8(5):85–95,September1988.

F.generatedCrow.shadedTheimages.aliasingproblemCommunicationsinofcomputerthe-
ACM,20(11):799–805,November1977.Available
http://www.cs.northwestern.edu/~ago820/om:fr.cs395/Papers/Crow_1977.pdf

J.KajiyaandM.Ullner.Filteringhighqualitytextfordisplay
onrasterscandevices.InComputerGraphics(SIGGRAPH81
Proceedings),volume15,pages7–15,August1981.

X.Wu.Anefficientantialiasingtechnique.InT.Sederberg,
editor,ComputerGraphics(SIGGRAPH91Proceedings),vol-
ume25,pages143–152,July1991.

IEEEJ.ComputerBlinn.JimGraphicsBlinnsandcorner:Applications,Return9(2):82–89,oftheMarjaggych.1989.

S.ableTiwfrari.om:Antialiasing:Wualgorithm,http://www.codeproject.com/KB/GDI/November2007.Avail-
.antialias.aspx

S.displaGuptays.andComputerR.F.SprGraphicsoull.,15(3),Filtering1981.edgesforgray-scale

J.methodsBærentzen,forS.antialiasedNielsen,wirM.eframeGjøl,draandwingB.withLarsen.hiddenTwo
lineremoval.InK.Myszkowski,editor,Proceedingsof
theSpringConferenceinComputerGraphics,April2008.

180

[170]

[171]

[172]

[173]

[174]

[175]

[176]

Availablefrom:http://orbit.dtu.dk/getResource?
.recordId=219956&objectId=1&versionId=1

E.ChanandF.Durand.Fastprefilteredlines,2005.Avail-
http://http.developer.nvidia.com/om:frable.GPUGems2/gpugems2_chapter22.html

R.McNamara,J.McCormack,andN.Jouppi.Pre-
filteredantialiasedlinesusinghalf-planedistancefunc-
tions.InS.Spencer,editor,Proceedingsofthe2000SIG-
GRAPH/EUROGRAPHICSWorkshoponGraphicsHardware
(EGGH-00),pages77–86,N.Y.,August2000.ACMPress.

J.Chen.Fastfloatingpointlinescan-conversionandantialias-
ing.TechnicalReportTR98-02,GeorgeMasonUniversity,
1998.Aprilcience,SComputer

J.AWugustassenber2010.g.AvLineAAailablesourfrceom:codeandMathematicahttp://algo2.iti.kit.scripts,
.source.zipedu/wassenberg/LineAA/LineAA-

P.Roberts.fillratetestresultforNVIDIAGeForce9600
GT,July2008.Availablefrom:http://www.m3fe.com/
.fillratetestweb/ViewResult.php?id=539

K.Ttransforurkowski.mations.ACMAnti-aliasingTransactionsthroughontheGraphicsuse,ofcoor1(3):215–234,dinate
1982.July

G.WcomputingalterprandolateT.Sspheroleski.oidalAwavneewfunctionsfriendlyandwmethodavelets.of
AppliedandComputationalHarmonicAnalysis,19(3):432–
1.443,A2005.vailablefrom:ComputationalHarmonichttp://www.sciencedirect.Analysis–Part
2/2/4GSTPNJ-com/science/article/B6WB3-doi:DOI:,fc29524fd7683c81c5e708e3b7c3024e.10.1016/j.acha.2005.04.001

181

[177]A.Barkans.Highspeedhighqualityantialiasedvectorgen-
eration.InF.Baskett,editor,ComputerGraphics(SIGGRAPH
90Proceedings),volume24,pages319–326,August1990.

[178]J.errorSny.der.UnitedSystemsStatesandPatentmethods7233963,fordifJunefusing2007.Aclippingvail-
ablefr7233963.htmlom:.http://www.freepatentsonline.com/

[179]W.FraserandJ.Hart.Near-minimaxpolynomialapprox-
theimationsACM,and7(8):486–489,partitioningAugustofinter1964.vals.AvailablefrCommunicationsom:http:of
.//portal.acm.org/citation.cfm?id=364820

[180]Z.Lin,polynomialH.filters.Chen,H.J.Shum,GraphicsandToolsJ.,Wang.10(1):27–38,Optimal2005.
Availablefrom:http://akpeters.metapress.com/
.content/q12213h4v0m36420/

[181]

[182]

[183]

D.MitchellandA.Netravali.Reconstructionfiltersincom-
putergraphics.InJ.Dill,editor,ComputerGraphics(SIG-
GRAPH88Proceedings),volume22,pages221–228,August
1988.

A.Burgess.Effectofquantizationnoiseonvisualsignal
detectioninnoisyimages.J.Opt.Soc.Am.A,2(9):1424–
1428,September1985.Availablefrom:http://josaa.osa.
.14249-2-org/abstract.cfm?URI=josaa-

E.Michaelsen,U.Stilla,U.Sörgel,andL.Doktorski.
ExtractionofbuildingpolygonsfromSARimages:
Groupinganddecision-levelintheGESTALTsystem.
PatternRecognitionLetters,31(10):1071–1076,2010.Pat-
ternRecognitioninRemoteSensing,FifthIAPRWork-
shoponPatternRecognitioninRemoteSensing(PRRS
2008).Availablefrom:http://www.sciencedirect.

182

[184]

[185]

[186]

[187]

[188]

[189]

1/2/4XJG5FM-com/science/article/B6V15-doi:DOI:,b1e3cf73e446d1bfb9d8876ee10635f1.10.1016/j.patrec.2009.10.004

R.Marques,F.deMedeiros,andD.Ushizima.Targetde-
tectioninSARimagesbasedonalevelsetapproach.IEEE
Trans.Systems,ManandCybernetics,39(2):214–222,March
2009.Availablefrom:http://www.osti.gov/bridge/
.cOAlrS/servlets/purl/939133-

A.Kohnle,R.Neuwirth,W.Schuberth,K.Stein,D.Hoehn,
R.Gabler,L.Hofmann,andW.Euing.Evaluationofessential
designcriteriaforIRSTsystems.InfraredTechnologyXIX,
2020:76–92,1993.Availablefrom:http://link.aip.org/
link/?PSI/2020/76/1.doi:10.1117/12.160530,

J.efWficientassenberscrg,eeningW.forMiddelmann,point-liketarandgetsP.viaSanders.concentricHighlyshells.
InConferAdvancedence,SMauieptemberOptical2010.andSpaceSurveillanceTechnologies

N.AlonandB.Schieber.Optimalpreprocessingforanswer-
TingelAvivon-lineUnivprersityoduct,1987.queries.PrTeprint.echnicalAvailableReportfrom:TR71/87,http:
.Schieber.ps//www.cs.tau.ac.il/~zwick/Alon-

M.BenderandM.Farach-Colton.TheLCAproblemrevis-
ited.InProc.ofthe4thLatinAmericanSymp.onTheoretical
Informatics,volume1776ofLNCS,pages88–94.Springer,
2000.Availablefrom:http://citeseer.ist.psu.edu/
.346677.html

I.Katriel,P.Sanders,andJ.Träff.Apracticalminimum
spanningtreealgorithmusingthecycleproperty.InEuropean
SymposiononAlgorithms,volume2832ofLNCS,pages679–
2003.,Springer690.

183

[190]

[191]

[192]

[193]

[194]

[195]

J.FischerandV.Heun.Anewsuccinctrepresentationof
RMQ-informationandimprovementsintheenhancedsuffix
array.InCombinatorics,Algorithms,ProbabilisticandExperi-
mentalMethodologies,volume4614ofLNCS,pages459–470.
Springer,2007.Availablefrom:http://www.bio.ifi.
.lmu.de/~fischer/fischer07new.pdf

NOTICES:C.McGeoch.Experimentalanalysisofalgorithms.
NoticesoftheAmericanMathematicalSociety,48(3):304–311,
2001.Availablefrom:http://www.ams.org/notices/
.mcgeoch.pdf200103/fea-

W.Fink.DDR3vs.DDR2,May2007.Available
http://www.anandtech.com/memory/showdoc.om:fr.aspx?i=2989

D.anMeyandC.Terboven.Affinitymatters!OpenMP
onmulticoreandccNUMAarchitectures.InParallel
volumeComputing:15.ArchitecturForschungszentrumes,AlgorithmsJülichandandRApplicationsWTH,
AachenUniversity,Febuary2008.Availablefrom:
http://www.compunity.org/events/pastevents/.parco07/AffinityMatters_DaM.pdf

C.ListnerandI.Niemeyer.Multiresolutionsegmentation
adaptedforobject-basedchangedetection.ImageandSig-
nalProcessingforRemoteSensingXVI,7830(1),2010.Avail-
http://link.aip.org/link/?PSI/7830/om:frable.doi:10.1117/12.865133,78300U/1

J.radixWsort.assenberIngandEurP.o-ParSanders.2011ParallelEngineeringProcessinga–multi-cor17thIn-e
ternationalConference,2011.Availablefrom:http://www.
.springerlink.com/index/8451700803HUR4G5.pdf

184

[196]P.Bohannon,P.McIlroy,andR.Rastogi.Main-memoryindex
structureswithfixed-sizepartialkeys.InSIGMODConfer-
ence,pages163–174,2001.Availablefrom:http://www.
acm.org/sigs/sigmod/sigmod01/eproceedings/.al.pdfet-Bohannon-papers/Research-[197]N.Satish,C.Kim,J.Chhugani,A.Nguyen,V.Lee,D.Kim,
andP.Dubey.FastsortonCPUsandGPUs:acasefor
bandwidthobliviousSIMDsort.InA.Elmagarmidand
D.Agrawal,editors,SIGMODConference,pages351–362.
ACM,2010.Availablefrom:http://doi.acm.org/10.
.1145/1807167.1807207[198]K.MehlhornandP.Sanders.Scanningmultiplesequences
viacachememory.Algorithmica,35,2003.
[199]IntelCorporation.IntelArchitectureSoftwareDeveloperManual,
2010.SystemProgrammingGuide.Availablefrom:http://
.www.intel.com/Assets/PDF/manual/253668.pdf[200]IntelCorporation.Intel64andIA-32ArchitecturesOp-
timizationReferenceManual,November2007.Available
http://www.intel.com/design/processor/om:fr.manuals/248966.pdf[201]J.Wassenberg,W.Middelmann,andP.Sanders.An
efficientparallelalgorithmforgraph-basedimage
segmentation,June2009.Availablefrom:http:
karlsruhe.de/wassenberg///algo2.iti.uni-.wassenberg09parallelSegmentation.pdf[202]D.Jimenez-Gonzalez,J.Navarro,andJ.Larriba-Pey.Fast
parallelin-memory64-bitsorting.InProceedingsofthe2001
InternationalConferenceonSupercomputing(15thICS01),pages
114–122,Sorrento,Napoli,Italy,June2001.ACM.
185

[203]

[204]

[205]

[206]

[207]

[208]

[209]

[210]

F.Panneton,P.LEcuyer,andM.Matsumoto.Improved
long-periodgeneratorsbasedonlinearrecurrencesmodulo
2.ACMTransactionsonMathematicalSoftware,32,2006.

N.Satish,C.Kim,J.Chhugani,A.Nguyen,V.Lee,D.Kim,
andP.Dubey.FastsortonCPUs,GPUsandIntelMICarchi-
tectures.Technicalreport,Intel,2010.Availablefrom:http:
us///techresearch.intel.com/userfiles/en-.FASTsort_CPUsGPUs_IntelMICarchitectures.pdf

D.MerrillandA.Grimshaw.RevisitingsortingforGPGPU
streamarchitectures.TechnicalReport3,UniversityofVir-
ginia,February2010.Availablefrom:http://www.cs.
.virginia.edu/~dgm4d/papers/RadixSortTR.pdf

D.Levinthal.PerformanceAnalysisGuideforIntelCorei7
ProcessorandIntelXeon5500processors.IntelCorporation.
Availablefrom:http://software.intel.com/sites/
products/collateral/hpc/vtune/performance_.analysis_guide.pdf

MemoryBdisk.Jacob,MorS.ganNg,andKaufmann,D.Wang.2007.

systems:cache,DRAM,D.sortingHelman,algorithmD.Baderwith,andanJ.JáJá.experimentalArandomizedstudy.J.parallelParallel
1998.52(1):1–23,,Comput.Distrib.

J.ableWfrom:assenberg.vmcsortdemo,http://algo2.iti.kit.edu/wassenberg/May2011.Avail-
.vmcsort/demo.html

Cache/pagelinesandLDDQU,March2008.Available
http://softwarecommunity.intel.com/isn/om:fr.US/forums/thread/30244059.aspxCommunity/en-

186

[211]M.MokbelandW.Aref.Irregularityinmulti-dimensional
space-fillingcurveswithapplicationsinmultimedia
databases.InCIKM,pages512–519.ACM,2001.
[212]S.Anderson.Interleavebitsbybinarymagicnum-
bers.Availablefrom:http://graphics.stanford.edu/
.~seander/bithacks.html#InterleaveBMN187

Summary

Imaging sensor technology has advanced considerably in recent years. Large-format aerial cameras achieve ground resolutions in the millimeter range. Expectations, however, are growing along with the new technical possibilities. Because such volumes of data can hardly be evaluated manually any more, at least partial automation is indispensable. The image analyst remains essential, but can be relieved by screening, which reduces the data so that ideally only relevant areas need to be examined. Even this task, which intuitively seems simple, challenges modern systems in terms of computing time and memory consumption.

This work first discusses the advantages and disadvantages of several hardware architectures. FPGA- and GPU-based systems are less adaptable and incur higher development costs, so a commodity PC is preferred. It is shown that an aerial image of a 100 × 100 km area at 1 m resolution can be analyzed within 2 hours on a workstation. Because existing methods are far slower, every link of the image processing chain is redeveloped from the ground up with the aim of minimizing its running time. Algorithms are presented that deliver useful results at previously unattained speeds.

Image segmentation, in which objects are extracted from the image, is a time-critical component of the processing chain. This step is a necessary prerequisite for many analysis tasks because individual pixels are not sufficiently informative. An obvious model for the segments is to merge pixels of similar color. Theoretically well-founded algorithms exist for this purpose, such as mean shift, anisotropic diffusion and maximum network flow, but they are too computationally expensive for large data volumes. A new method is presented whose heuristic tends to avoid segments that are too small or too large. The most important innovation is to guarantee independent processing of individual image tiles without splitting objects at the tile borders. Owing to the parallelization this enables and to SIMD pixel processing, the algorithm is 50 times as fast as mean shift while producing similar output. The segmenter's highly optimized subroutine for sorting integers has proven so efficient that a further development of it is currently regarded as the world's fastest method for sorting 32-bit numbers on a shared-memory machine. This is achieved with the help of virtual memory and details of the processor microarchitecture.

Because segmentation groups similar pixels, it is advantageous to reduce sensor noise beforehand. The bilateral filter is particularly well suited because a single iteration already smooths the image without weakening strong edges. The filter kernel weights pixels according to their similarity and distance. Several approximation algorithms for accelerating the filter are known, for example a convolution in a subsampled higher-dimensional space. This method is accelerated by a factor of about 14 through parallel processing, SIMD instructions and an approximation of the Gaussian kernel with improved locality. According to published performance figures, the new algorithm is 73 times as fast as an FPGA and 1.8 times as fast as a GPU-based approximation.

Besides noise, another property of current satellite systems must be taken into account: to obtain multi-channel images, filters are placed in front of the detectors, which therefore require a larger detector area. A multispectral image consequently has, as a rule, lower resolution than a grayscale (panchromatic) image. The respective advantages of the two image types can be combined by fusion. A pan-sharpened image contains both high-resolution detail and color information, which benefits segmentation. However, the differing detector sensitivities lead to color shifts. An algorithm is described that mitigates this problem by estimating the optimal weights of the individual channels. In addition to better color reproduction, the method suppresses noise and is also 100 times as fast as existing software.

Since the processing stages presented so far reach throughputs of several hundred MB/s, the data transfers should be accelerated as well. The widely used GDAL library reads and writes various image formats but does not come close to the peak throughput of a hard disk. This work describes techniques for performing efficient asynchronous transfers and avoiding unnecessary copying of data. The resulting software writes up to 12 times as fast as GDAL. Further gains are possible through compression, provided that decompression takes less time than reading. A new compression method is introduced that losslessly halves the size of 16-bit multispectral images and decompresses them at a throughput of 2700 MB/s using a single processor core. This is about 100 times as fast as JPEG-2000, with output only 20-60% larger.

After the objects have been extracted, additional steps for contour extraction and simplification would be useful, especially for recognizing man-made structures. To annotate large images with such polygons, an algorithm for rasterizing lines was developed. Deriving the optimal polynomial low-pass filter ensures high-quality anti-aliasing. The method is 24 times as fast as the Gupta-Sproull approach and even exceeds the performance of a mid-range GPU.

The processing chain presented for electro-optical images is useful, but faces the problem that objects can be obscured by clouds and fog. Nearly weather-independent acquisition is possible with radar. Man-made objects such as vehicles often reflect the microwaves strongly, so a method for detecting bright point-like objects is of interest. The hotspot transform suppresses uniformly bright regions by reducing each pixel value by the brightness of the darkest ring surrounding it. An algorithm is described that reduces the complexity of this procedure to the lower bound by means of a special variant of range minimum queries. A sophisticated reordering of the memory accesses ensures high cache locality, so that the vectorized, parallelized software outperforms an FPGA implementation by a factor of 100.

The results of the optimizations described call into question the common belief that FPGAs and GPUs automatically yield large speedups over a CPU implementation. Because all of the algorithms considered have already reached the lower bound of their complexity in O-notation, only the constant factors can still be improved. Commodity microprocessors have turned out to remain competitive. The most important prerequisites are vectorization, parallelization and consideration of fundamental properties of the computer architecture, such as the memory hierarchy. It was shown that these measures carry over to a variety of image processing tasks. After-the-fact tuning is not sufficient, however. Instead, hardware knowledge must enter every stage of the algorithm engineering cycle: design, analysis, implementation and experiments. For example, a highly optimized segmentation algorithm that requires a total ordering of the pixels was outperformed by a more complex but parallelizable method. The practical significance of these measures is underscored by the fact that the algorithms presented here accelerate seven different methods by factors of 10 to 100. It may come as a surprise that progress could be made on topics studied for so long, such as lossless compression and line rasterization. The techniques presented here can, however, also be transferred to other fields.
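
The summary states that the bilateral filter kernel weights pixels by their similarity and distance. The following is a minimal brute-force sketch of that weighting for a small grayscale image stored as nested lists. It is not the accelerated approximation developed in the thesis; the function name `bilateral_filter` and the parameters `radius`, `sigma_space` and `sigma_range` are illustrative assumptions.

```python
import math

def bilateral_filter(image, radius=2, sigma_space=1.5, sigma_range=20.0):
    """Brute-force bilateral filter (illustrative sketch, not the thesis code).

    image: list of rows of gray values. All parameter names are assumptions.
    """
    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            center = image[y][x]
            weighted_sum = 0.0
            weight_total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue  # skip neighbors outside the image
                    neighbor = image[ny][nx]
                    # Spatial weight: falls off with distance from the center.
                    w_space = math.exp(-(dx * dx + dy * dy) / (2 * sigma_space ** 2))
                    # Range weight: falls off with difference in gray value.
                    w_range = math.exp(-((neighbor - center) ** 2) / (2 * sigma_range ** 2))
                    weight = w_space * w_range
                    weighted_sum += weight * neighbor
                    weight_total += weight
            # weight_total > 0 because the center pixel always has weight 1.
            result[y][x] = weighted_sum / weight_total
    return result
```

With a small `sigma_range`, neighbors across a strong edge receive almost no weight, which is why a single pass can smooth noise while leaving such edges largely intact.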

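
The hotspot transform is described in the summary as reducing each pixel value by the brightness of the darkest ring surrounding it. The naive sketch below illustrates that idea under several assumptions that the summary does not confirm: rings are square (Chebyshev distance), the brightness of a ring is taken to be its maximum pixel value, results are clamped at zero, and `max_radius` bounds the rings considered. It recomputes every ring from scratch and is therefore not the range-minimum-query algorithm of the thesis.

```python
def ring_pixels(image, cy, cx, r):
    """Values on the square ring at Chebyshev distance r from (cy, cx),
    restricted to pixels that lie inside the image."""
    height, width = len(image), len(image[0])
    values = []
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            on_ring = max(abs(y - cy), abs(x - cx)) == r
            if on_ring and 0 <= y < height and 0 <= x < width:
                values.append(image[y][x])
    return values

def hotspot_transform(image, max_radius=3):
    """Naive hotspot transform (illustrative sketch, assumptions as stated above)."""
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ring_brightness = []
            for r in range(1, max_radius + 1):
                values = ring_pixels(image, y, x, r)
                if values:
                    # Assumption: a ring is as bright as its brightest pixel.
                    ring_brightness.append(max(values))
            # Fall back to the pixel itself if no ring lies inside the image.
            darkest = min(ring_brightness) if ring_brightness else image[y][x]
            result[y][x] = max(0, image[y][x] - darkest)
    return result
```

Under these assumptions a uniformly bright region maps to zero, because every ring is as bright as the pixel itself, while an isolated bright pixel keeps most of its value.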

Curriculum Vitae

Jan Wassenberg was born in Koblenz in 1983. In 1989 the family moved to the United States for professional reasons. Until returning in 1998 he attended the private Randolph School in Huntsville, Alabama. In 2001 he received his Abitur (grade average 1.2) from the Bischöfliches Cusanus-Gymnasium Koblenz. He completed his computer science studies at what was then Universität Karlsruhe (TH) in 2007 with the overall grade "sehr gut" (very good). Since 2007 Jan Wassenberg has worked as a research associate at the former FGAN-FOM, now Fraunhofer IOSB, where he studies efficient algorithms for automated image analysis.

Education

October 2001 to July 2007: Universität Karlsruhe (TH). Diplom in computer science. Thesis topic: "Automatische Gebäudemodellierung aus Laserscanning-Daten" (automatic building modeling from laser scanning data).

Research Positions

Since August 2007: Universität Karlsruhe (TH)/KIT. Start of the collaboration with Prof. Sanders.

Since June 2007: FGAN-FOM/Fraunhofer IOSB. Research associate.

Professional Experience

August 2006 to April 2007: FGAN-FOM/Fraunhofer IOSB, Ettlingen. Research assistant (Hilfswissenschaftler).

June 2005 to July 2005: Universität Karlsruhe (TH)/KIT, ISAS. Student assistant (implementation of a virtual reality environment).

May 2002: WildfireGames.com. Contributor (development and management) to the open-source real-time strategy game 0 A.D.

Awards

2001: Bundeswettbewerb Informatik, prizewinner.

2001: Jugend forscht, 1st prize (regional).