Benchmarking SAT Solvers for Bounded Model Checking
15 pages
English

Benchmarking SAT Solvers for Bounded Model Checking

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres
15 pages
English
Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

Description

Benchmarking SAT Solvers for Bounded
Model Checking
Emmanuel Zarpas
IBM Haifa Research Laboratory,
zarpas@il.ibm.com
Abstract. Modern SAT solvers are highly dependent on heuristics.
Therefore, benchmarking is of prime importance in evaluating the per-
formances of different solvers. However, relevant benchmarking is not
necessarily straightforward. We present our experiments using the IBM
CNF Benchmark on several SAT solvers. Using the results, we attempt
to define guidelines for a relevant benchmarking methodology, using SAT
solvers for real life BMC applications.
1 Introduction
Over the past decade, formal verification via model checking has evolved from a
theoretical concept to a production-level technique. It is being actively used in
chip design projects across the industry, where formal verification engineers can
now tackle the verification of large industrial hardware designs. Bounded Model
Checking (BMC) [1] has recently enabled the verification of problems that used
to be beyond the reach of formal verification. However, BMC performance relies
heavily on the performance of the underlying SAT solver.
We conducted several experiments using the IBM CNF benchmark [12, 11]
to evaluate SAT tools. Often, we found discrepancies between our results and
the experimental results found in the literature. For example, our results are not
necessarily in line with the results of the SAT03 contest [4]. The main goal of this
paper is to outline a methodology for benchmarking SAT ...

Sujets

Informations

Publié par
Nombre de lectures 42
Langue English

Extrait

BenchmarkingSATSolversforBoundedModelCheckingEmmanuelZarpasIBMHaifaResearchLaboratory,azprsai@.lbi.mocmAbstract.ModernSATsolversarehighlydependentonheuristics.Therefore,benchmarkingisofprimeimportanceinevaluatingtheper-formancesofdifferentsolvers.However,relevantbenchmarkingisnotnecessarilystraightforward.WepresentourexperimentsusingtheIBMCNFBenchmarkonseveralSATsolvers.Usingtheresults,weattempttodefineguidelinesforarelevantbenchmarkingmethodology,usingSATsolversforreallifeBMCapplications.1IntroductionOverthepastdecade,formalverificationviamodelcheckinghasevolvedfromatheoreticalconcepttoaproduction-leveltechnique.Itisbeingactivelyusedinchipdesignprojectsacrosstheindustry,whereformalverificationengineerscannowtackletheverificationoflargeindustrialhardwaredesigns.BoundedModelChecking(BMC)[1]hasrecentlyenabledtheverificationofproblemsthatusedtobebeyondthereachofformalverification.However,BMCperformancereliesheavilyontheperformanceoftheunderlyingSATsolver.WeconductedseveralexperimentsusingtheIBMCNFbenchmark[12,11]toevaluateSATtools.Often,wefounddiscrepanciesbetweenourresultsandtheexperimentalresultsfoundintheliterature.Forexample,ourresultsarenotnecessarilyinlinewiththeresultsoftheSAT03contest[4].ThemaingoalofthispaperistooutlineamethodologyforbenchmarkingSATsolversonBoundedModelCheckingproblems.Thepaperisorganizedasfollows:InSection2,wepresentexperimentalresultsforsomefamousSATsolvers.Section3presentsanewversionoftheIBMBenchmarkanddiscussescorrelationoftheCNFcharacteristicsandsolversperformance.Section4containsadiscussiononSATbenchmarkingforBMC,wherewetrytolearnfromourexperimentalresultsinlightofotherexperimentssuchastheSATcontests.Section5concludesthepaper.2ExperimentswithIBM2002Benchmark:SATSolverComparison[O7]u,rzeCxhpaeriImIe(nztCshcaom2p0a0re4.d5.f1o3u)r[f3a],mBoeurskSMAiTn5s6o1lv[e2r]sa:nzdChSiaegeI(v2400[81]..2.W17evuesresdiotnh)eF.BacchusandT.Walsh(Eds.):SAT2005,LNCS3569,pp.340–354,2005.cSpringer-VerlagBerlinHeidelberg2005
BenchmarkingSATSolversforBoundedModelChecking3412002versionoftheIBMCNFBenchmark[12],whoseCNFsaregeneratedthroughBMCfromtheIBMFormalVerificationBenchmarkLibrary[11].Thislibraryisacollectionofmodelsfromreal-lifeindustrialhardwareverificationprojects.Foreachmodel,weusedtheCNFformulas(SATdat.k.cnf)forboundwithk=1,10,15,20,25,30,35,40,45,50.Thetime-outwassetat10,000secondsonaworkstationwith867841XIntel(R)Xeon(TM)CPU2.40GHz,with512KBfirstlevelcacheand2.5GBphysicalmemory.Weusedthedefaultconfigurationforeachengine.2.1zChaIvs.BerkMin561WeranzChaffI(2001.21.17version)andBerkMin561onthe2002versionoftheIBMCNFBenchmark.Table1isasummaryoftheresultsobtained.MoreTable1.Theresultsaredisplayedinsecondsandincludethetimeout10,000secondszChaffBerkMin561Totaltime(10000sectime-out)344765414094First(#ofCNFwheretheengineisthefastest)298131#ofunsolvedproblems(timeoutreached)2530+(#ofCNFwheretheengineisthefastestbymorethanaminuteand20%)6732Firstbymodel(#ofmodelwheretheengineisthefastest)2818completeexperimentalresultsarepresentedinTable8,whereforeachmodelinthetable,wepresentthesumoftheresultsofSATdat.k.cnfwithk=1,10,15,20,25,30,35,40,45,50(i.e.,theBMCtranslationofeachmodelforthedifferentk).Formoredetails,seethecompleteresultsin[13].BecauseSATsolversdonotalwaysbehaveinhomogeneousmanner(Cf.detailedresultsin[13]ortable2),thecompleteresultsanalysisshouldnotbedisregarded.TheresultsshowthatzChaffandBerkMin561achievedcloseresults.OnsomeCNFs,zChaffrunsfaster,andonothers,BerkMin561runsfaster.Inmostcases,thedifferencesintheirperformanceisnotverysignificant,thehighestspeedisnotfasterbymorethanoneminuteor20%.However,whilezChaffseemstoperformslightlybetteroverall,BerkMin561getsbetterresultsthanzChaffontheUNSATCNFs.Fromtheseresults,itisnotpossibletoconcludethatBerkMin561performsbetterthanzChaff.ThisisnotconsistentwithwhatcanbereadintheliteratureorwiththefinalresultsoftheSAT03contest(Cf.section2.4).ThisisnotaverysatisfyingconclusionandthepreviousmetricsdonottellusinasimplewayhowmuchfasterzChaffisthanBerkmin5612.Weneedsome12BerkMin561wassecondintheSAT03contestindustrialbenchmarkscategoryintheSAT03contest;thewinnerForkliftisnotpubliclyavailable.Itisveryimportanttobeabletogiveclearandconcisevaluesforresultsdissemi-nation.
342E.ZarpasTable2.TheresultsaredisplayedinsecondsfortheCNFsfrom18rulemodel.Thetime-outwassetto10000seconds.zChaffperformstheworstfork=15,20,35;Berk-Min561performstheworstfork=30,40,50;Siegev4hasthebestperformance,exceptfork=15,2518ruleCNFResultzChaBerkMinSiegeSATdat.k1.cnfunsat000SATdat.k10.cnfunsat110SATdat.k15.cnfunsat1345SATdat.k20.cnfunsat654741SATdat.k25.cnfunsat951109251SATdat.k30.cnfsat700755557SATdat.k35.cnfsattime-out22301730SATdat.k40.cnfsat55108060247SATdat.k45.cnfsattime-outtime-out959SATdat.k50.cnfsat5010time-out1370additionalmetricsinordertocomparetheseSATsolvers.Thearithmeticmeanofthespeedupseemsirrelevant.Let’ssaywewanttocomparetheperformancesofsolversAandB.IfthespeedupofAvs.Bis10ontest1and0.0001ontest2,thearithmeticmeanofthespeedup(Avs.B)wouldbe5andthemeanofthespeedup(Bvs.A)wouldbe5000.Thisisclearlynotrelevant.Theglobalspeedup(TotaltimeA/TotalTimeB)isgoodtohavebutthe“longruns”havefarmoreweightthantheshortones.Themedianismuchmoreinteresting.howeveritdoesnottakeintoaccounttheextremevalues(e.g,ifontest1thespeedupisequaltomedian+10ortomedian+0.1,itdoesnotchangethemedianvalue).Ontheotherhand,themedianinsensivitytoextremevaluesmakesitlessdependentonthetime-outs.Thegeometricalmeanseemstobeagoodsolution.Itdefinitelytakesintoaccounttheweightofthe“negativespeedup”(i.e.,aspeedupoflessthan1)3.Wefeltthesevaluesshouldbecomputedthesevaluesonlyonarelevantsubsetofthebenchmark.Forexample,wediscardedallthecaseswherethefourSATsolverstime-out.Inaddition,wediscardedcasesforwhichthefoursolvers(zChaI,Berkmin561,Siegev4,zChaII)runinlessthanveseconds.Therationalisthatitisdifficulttoaccuratelycompareverysmallruntimesandthattheseverysmallruntimesarearguablynotrelevant.TheresultsdisplayedinTable3(b)showthatBerkmin561isnearlytwotimesslowerthanzChaffI,butthatBerkminbehavessomehowbetterinthe“long”cases.2.2Siegev4Siegewashors-concoursfortheSAT03contest,thereforeitdidnotparticipateinthesecondstage.Nevertheless,theSiegeresultswereprettygoodforthe3Notethatthelogarithmofthegeometricalmeanisequaltothearithmeticalmeanofthelogarithms.
BenchmarkingSATSolversforBoundedModelChecking343Table3.Witha10000sec.time-outTotalTime#time-outs(inseconds)zChaI344,76425berkmin561414,03530Siegev419723914zChaII389,30431(a)SpeedupzBCerhkamffinIzCSiheagfefIzzCChhaaffffIIIgmeeodmiaenan00..455122..980711..2239global0.753.580.75(b)firststageandSiegehasareputationforbeingoneofthebestSATsolversavailable.WeranSiegev4[8]onthebenchmarkusing123456789asaseed.Table3displaystheoverallconclusions.FormoredetailsseeTable8and[13].Siegev4isthefastestin298cases(CNFs)outof498.Inmanydicultcases,Siegev4isfastestbyanorderofmagnitude,ormore.Insomecases,Siegev4performedsignicantlyworsethanzChaorBerkMin561.Forexample,for26rule,Siegev4isslowerthanzChaffbyanorderofmagnitudeandslowerthanBerkMinbytwoordersofmagnitude.Inaddition,withinthetime-out,severalCNFscanbesolvedonlybySiegev44.Inconclusion,weseethatSiegev4performssignificantlybetter(roughlyspeaking2.5timesfaster)thanzChaffI,Berkmin561andzChaffIIontheIBMCNF2002benchmark.2.3zChaIITheindustrialcategoryoftheSAT04competition[6]waswonbyzChaffII(zChaff2004.5.13).However“black-box”solverssuchasForklift(the2003edi-tionwinner)were“hors-concours”andassuchnotallowedtoenterthesecondstageofthecompetition.WefocushereontheperformanceofzChaffIIvszChaffI.zChaffIIisfasterthan:zChafffor229CNFs(outof442),Berkmin561for208CNFs(outof442),Siegev4for82CNFs(outof442).Inaddition,zChaIIreachedtime-outmoreoftenthanzChaffI(forsixmorecases).Table3dis-playsasyntheticviewoftheresults.ThezChaffIIcodewasleftunchanged,soitusedrandomseeds.Siegev4isroughlythreetimefasterthanzChaI.Figure1givesusagraphicviewoftheperformanceofzChaffIIvs.zChaffI.Theover-allperformancedifferencebetweenzChaffIandzChaffIIonourbenchmarkissmall5,eventhoughthetwosolverscanbehaveverydifferentlyinspecificcases.45Siegeisarandomizedsolver,therefore,itcouldbearguedthatthecomparisonisnotfairsincethezChaffandBerkMin561,versionsweusedwerenotrandomized.Nevertheless,evenadeterministicsolvercanbeluckyandprovidingSiegewithaseedofSiegemakesitdeterministic.ItshouldbenotethatzChaffIIisrandomized,sotheresultscouldbedifferentforotherruns.ItwouldbeinterestingtohaveastatisticaldescriptionoftherangeofzChaffIIperformances.
344E.Zarpas0000100010010110.11101001000100001.010.0zchaff I runtime08070605040302100niB(a)(b)Fig.1.Histogram(b)givesthedistributionofthedecimallogarithmofzChaffI/zChaffIIspeedupwith10000sectime-out2.4ComparisonwithSATContestResultsInordertounderstandwhetherornotourresultsareconsistentwiththoseoftheSAT03contest6(fordetailsresultssee[5]),wehadalookattheresultsofthefirstandsecondstagesofthecompetition.Firststage.Lookingattheresults7fortheindustrialcategory[5],wenotethefollowing:Series13rule8isover-represented(Cf.Table4).Besides,sinceallsolverstime-outonmostoftheCNFsfromthisseries,itisnotverymeaningful.Exceptforseries13ruleand11rule,theseriesfromtheIBMCNFbench-markareeasyforzCha,BerkMin561,andSiegev1.Resultsforrule07arenotconsistentwithourownexperiences.Thisdis-crepancyisprobablycausedbythe“Lisasyndrome”[4]:CNFswereshuffledforSAT03andsolverperformancecandifferdramaticallybetweenashuffledCNFandtheoriginal.TheSAT03contestresultsforBerkMin561,Forklift,Siegev1,andzChaon(shuffled)seriesfromtheIBMCNFBenchmarkaresummarizedinTable4.Ifresultsfromseries07ruleand13rulearediscarded,zChaffshowsbetterresultsthanBerkMin561.678InthefirststageoftheSAT04competition,zChaffIandzChaffIIhaveverycloseresultsandthesolversthatoutperformedthem(Forklift,Oepir)arenotavailableforexperimentation.TheSiegeversionusedfortheSAT03contestisSiegev1.ThenamesoftheSAT03seriesappearsinitalicsinordertodifferentiatethemfromtheseriesweusedinourexperiments.FortheexactcompositionoftheSAT03series,see[5].
BenchmarkingSATSolversforBoundedModelChecking345Table4.PartialresultsfromfirststageSAT03contestBerkMin561forkliftSiege1zChaffTotal#ofSolvedBenchmarks112112112101TotalCPUtimeneeded(sec)105000103000103000114000without13rule(sec)15105184084160without13ruleand07rule(sec)1410424268553WebelievethatthedifferencesinresultsforthefirststageofSAT03contestresultsforIBMthebenchmarkandourexperimentsareduetoclauseshufflinginSAT03andtothefactthattheSAT03experimentusedasmallertest-bedofCNFsfromtheIBMbenchmark.Secondstage.Duringthesecondstageofthecompetition,solverswererankedaccordingtothenumberofCNFstheycouldsolvefromarestrictedbenchmark.Forkliftwasrankedfirst,Berkmin561second,andzChaffIsixthontheindustrialbenchmark9.Clearly,therespectivezChaffIandBerkMin561rankingdonotcorrespondtoourexperimentalresults(seeTable3).However,inthesecondstage,allsolvers“timed-out”ontheIBMbenchmarksselected.Inotherwords,therankingofthesolversselectedforthesecondstageofthecompetitiondidnottakeintoaccountperformanceontheIBMbenchmarks.ThisprobablyexplainswhyourevaluationofzChaffandBerkMin561ontheIBMCNFBenchmarkgivesresultsthatarenotinlinewiththesecondstageresults(onindustrialbenchmarks)oftheSAT03contest.3ExperimentswithIBM2004Benchmark:SearchforCorrelation3.1DescriptionoftheIBM2004BenchmarkBoundedModelCheckingtranslationtechnologyiscontinuouslyevolving.There-fore,were-generatedtheIBMCNFBenchmarkfromtheIBMModelBenchmarkusinganalternativeBMCtool.Weranthenewtranslationforthefollowingboundsk=0..10,k=11..15,k=16..20,k=21..25,...,k=95..100.The2004CNFsareavailableon-line.Inthispaper,werefertothe2002and2004versionsastheoldandthenewbenchmarks,respectively.Table5and6presentstatisticaldescriptionsoftheoldandthenewbench-marks.Inthesetwotables,averageisthearithmeticalmean,STDEVisthestandarddeviation,clavuasresistheratiobetweenthenumberofclausesandthenumberofvariables,%unitisthepercentofunit(i.e.,oflengthone)clausesinaCNF,%binisthepercentofclausesoflengthtwo,%teristhepercentofclauses9Siegewashors-concours,thereforenotselectedforthesecondstage.
436E.ZarpasTable5.Old(2002)benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average80,167343,8264.120.375.514.23.56.5median54,857220,1804.080.277.012.23.56.6STDEV81,924388,1560.520.36.47.71.01.3max636,0893,172,1075.421.684.555.35.29.1min3,64514,6812.48040.84.30.10.1Table6.New(2004)Benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average73,414305,3013.990.674.515.03.66.3median50,897195,6123.980.476.113.13.66.6STDEV75,553349,2480.450.66.17.30.91.2max565,8892,760,5025.484.684.551.64.99.1min3,60614,1042.55046.54.40.10.1oflengththree,%l=4isthepercentofclausesoflength4and%l>4isthepercentofclauseslongerthan4.ThesetwotablestellusthatthenewbenchmarkCNFsareroughly10%smallerthantheoldbenchmarkCNFs.However,theyencompassroughlythesameproportionoflengthtwo,three,andfourclauses.Thehigherproportionofunitclausesinthenewbenchmarkisduetospecificoptimizations.Wealsoseethattheratioclavuarsesisslightlylowerinthenewbenchmark(seethenextsubsectionforadiscussionontherelevanceofthisratio).Figure2displaysthehistogramof%bin,wheretwomodels(07and131)haveanonstandardpercentofbinaryclausesandareinthe45%-55%andnotinthe70%-85%.WelookedatsomestatisticsforthestructureoftheCNFsofthenewbench-mark.Wefoundthatforanymodelofthebenchmark,therearefourrealnum-bersa,b,c,dsuchthatforalltheCNFsofthemodel,#(variables)ak+band#(clauses)ck+d,withkbeingtheboundusedtogeneratetheCNFs.Inamoregeneralway,foranymodelandforanyintegernstrictlygreaterthan0,therearetworealnumberseandfsuchthatforallCNFsofthemodel#(clausesoflengthn)ek+f(ofcourseeandfcanbenull,forexampleinmostmodels,thenumberofunitclausesisaconstant).Thecorrelations(foradiscussionaboutstatistictechniquesforNP-completeproblemssee[10])be-tweentheseriesfromthebenchmarkandtheseriespredictedwiththepreviousequationsareabout0.99.ThisisnotsurprisingsincetheCNFsaregeneratedfromthemodelinawayessentiallylineartothebound.Inaddition,wefoundoutthat,inourbenchmark,thelengthofthelongestclauseisthesameforallCNFsgeneratedfromthesamemodel.Figure2(b)displaysthehistogramofthelongestclausesinthenewbenchmark.Notethattwomodels,07and131,whichhavethegreaterlongestclauses,arealsotheoneswitha“nonstandard”percentofbinaryclauses.
  • Univers Univers
  • Ebooks Ebooks
  • Livres audio Livres audio
  • Presse Presse
  • Podcasts Podcasts
  • BD BD
  • Documents Documents