Institute on Resource Management and Scheduling
CoreGRID - Network of Excellence
URL: http://www.coregrid.net
CoreGRID is a Network of Excellence funded by the European Commission under the Sixth Framework Programme
Project no. FP6-004265

Scheduling Malleable Applications in Multicluster Systems

Jérémy Buisson
jbuisson@irisa.fr
INRIA/IRISA
Campus de Beaulieu, 35042 Rennes Cedex, France

Ozan Sonmez, Hashim Mohamed, Wouter Lammers, Dick Epema
{o.o.sonmez,h.h.mohamed,d.h.epema}@tudelft.nl, wouter@w78.nl
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands

CoreGRID TR-0092
May 22, 2007
Abstract
In large-scale distributed execution environments such as multicluster systems and grids, resource availability may vary due to resource failures and because resources may be added to or withdrawn from such environments at any time. In addition, single sites in such systems may have to deal with workloads originating from both local users and from many other sources. As a result, application malleability, that is, the property of applications to deal with a varying amount of resources during their execution, may be very beneficial for performance. In this paper we present the design of the support of and scheduling policies for malleability in our KOALA multicluster scheduler with the help of our DYNACO framework for application malleability. In addition, we show the results of experiments with scheduling malleable workloads with KOALA in our DAS multicluster testbed.
1 Introduction
Application malleability, that is, the property of applications to use varying amounts of resources such as processors during their execution, is potentially a very versatile and beneficial feature. Because it allows resource allocations to vary during execution, malleability gives a scheduler the opportunity to revise its decisions even after applications have started executing. By shrinking the resource allocations of running applications, malleability allows new jobs to start sooner, possibly with resources that are not going to be usable during their whole execution. By letting applications benefit from resources that appear during their execution through growing their allocations, malleability also helps applications terminate sooner. In addition to these very general advantages, malleability makes it easier to deal with the dynamic nature of large-scale distributed execution environments such as multicluster systems, and more generally grids. In this paper, we present the design of the support for malleability in our KOALA [21] multicluster scheduler by means of the inclusion of our DYNACO framework [9] for application malleability, and the design and analysis of two scheduling policies for malleable applications.
In execution environments such as multiclusters and grids, the availability of resources varies frequently [16]. In addition to failures, resources may be allocated (or released) by concurrent users, and organizations may add or withdraw (parts of) their resources to/from the resource pool at any time. In any of these cases, malleability allows applications to benefit from newly appearing resources, while gracefully releasing resources that are reclaimed by the environment. Malleability thus holds great promise for achieving better-performing execution in multiclusters. Besides, malleable applications give schedulers the opportunity to increase system utilization.
On the one hand, although several approaches have been proposed to build malleable applications [9, 5, 28, 4, 17, 26], virtually no existing multicluster and grid infrastructures are able to benefit from this property. Consequently, many applications embed their own specific scheduler and submit bulks of jobs in order to build dynamic resource management on top of existing infrastructures. On the other hand, most of the previous work on scheduling malleable applications does not handle the challenges that appear in the context of multicluster systems. Specifically, issues such as the selection of a suitable cluster for each job and resilience to background load due to local users are often not taken into account. Furthermore, many proposed approaches have only been tested with simulations, and an assessment of the overhead due to the implementation of grow and shrink operations is commonly omitted.
Our contributions in this paper are the following. First, we present an architecture and an actual implementation of the support for malleability in grid schedulers, showing the benefits of the modular structure of KOALA as a by-product. Second, we present two policies for managing malleability in the scheduler, one which hands out any additional processors to the malleable jobs that have been running the longest, and one that spreads them equally over all malleable jobs; each of these policies can be combined with one of two approaches which either favor running or waiting jobs. Third, we evaluate these policies and approaches in combination with KOALA's worst-fit load-sharing scheduling policy with experiments in the DAS3 [2] testbed. These experiments show that a higher utilization and shorter execution times can be achieved when malleability is used.
The rest of this paper is structured as follows. Section 2 states more precisely the problem addressed in this paper, and Section 3 reviews the state of the art of malleability in resource management in multicluster and grid systems. Section 4 describes the KOALA grid scheduler and the DYNACO framework, which are the starting points of our work. Section 5 describes how we support malleability in KOALA, and details the malleability management approaches and policies that we propose. Section 6 presents the experimental setup, and Section 7 discusses our experimental results. Finally, Section 8 makes some concluding remarks and points to future work.
2 Problem Statement
In multicluster systems, and more generally in grids, there may be various types of parallel applications that can benefit from being able to change their processor configuration after they have started execution. Malleability of parallel applications may yield improved application performance and better system utilization since it allows more flexible scheduling policies. In this paper, we propose and compare scheduling approaches that take into account malleable applications in order to assess such benefits. In this section, we first classify parallel applications in order to distinguish malleable applications, and then we address several aspects of malleable applications that should be taken into account by a resource management or a scheduling system.
2.1 Classification of Parallel Jobs
Following the well-known parallel job classification scheme presented in [12], we consider three types of jobs, namely, rigid, moldable, and malleable. A rigid job requires a fixed number of processors. When the number of processors can be adapted only at the start of the execution, the job is called moldable. Similar to rigid jobs, the number of processors for moldable jobs cannot be changed during runtime. Jobs that have the flexibility to change the number of assigned processors during their runtime (i.e., they can grow or shrink) are called malleable.
2.2 Specification of Malleable Jobs
A malleable job may specify the minimum and maximum number of processors it requires. The minimum value is the minimum number of processors a malleable job needs to be able to run; the job cannot shrink below this value. The maximum value is the maximum number of processors a malleable job can handle; allocating more than the maximum value would just waste processors. We do not assume that a step size indicating the number of processors by which a malleable application can grow or shrink is defined. We leave the determination of the amount of growing and shrinking to the protocol between the scheduler and the application (see Section 5).
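As an illustration of this specification, the following minimal Python sketch models a malleable job with the minimum and maximum bounds described above; the class and method names are hypothetical and do not correspond to KOALA's actual job description format.

    from dataclasses import dataclass

    @dataclass
    class MalleableJobSpec:
        # Hypothetical model of the processor bounds of a malleable job.
        min_procs: int  # the job cannot run (or shrink) below this value
        max_procs: int  # allocating more than this would waste processors

        def accepted(self, offered: int) -> int:
            """How many processors the job would actually use out of an offer."""
            if offered < self.min_procs:
                return 0  # not enough to run at all
            return min(offered, self.max_procs)

    # Example: a job that runs on anywhere between 2 and 32 processors.
    spec = MalleableJobSpec(min_procs=2, max_procs=32)
    assert spec.accepted(1) == 0 and spec.accepted(48) == 32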
2.3 Initiative of Change
Another aspect that we consider is the party that takes the initiative of changing the size of a malleable job (shrinking or growing). Either the application or the scheduler may initiate grow or shrink requests. An application may do so when the computation it is performing calls for it. For example, a computation may be in need of more processors before it can continue. On the other hand, the scheduler may decide that a malleable job has to shrink or grow based on the availability of free processors in the system. For example, the arrival of new jobs at a heavily loaded system may trigger a scheduler to request currently running malleable jobs to shrink.
2.4 The Obligation to Change
Requests for changing the size of a malleable job may or may not have to be satisfied. A voluntary change means that the change does not have to succeed or does not necessarily have to be executed; it is merely a guideline. A mandatory change, however, has to be accommodated, because either the application cannot proceed without the change, or because the system is in direct need of the reclaimed processors.
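Sections 2.3 and 2.4 together define a small taxonomy of change requests. A hedged sketch of how a scheduler might represent it (the names are illustrative, not KOALA's):

    from enum import Enum
    from dataclasses import dataclass

    class Initiator(Enum):            # Section 2.3: who asks for the change
        APPLICATION = "application"
        SCHEDULER = "scheduler"

    class Obligation(Enum):           # Section 2.4: must the change succeed?
        VOLUNTARY = "voluntary"       # merely a guideline, may be refused
        MANDATORY = "mandatory"       # has to be accommodated

    @dataclass
    class ChangeRequest:
        grow: bool                    # True for grow, False for shrink
        amount: int                   # number of processors concerned
        initiator: Initiator
        obligation: Obligation

    # Example: the scheduler reclaims 4 processors, and the job must comply.
    req = ChangeRequest(False, 4, Initiator.SCHEDULER, Obligation.MANDATORY)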
3 Related work
Much of the previous research on the problem of scheduling and allocating resources to malleable jobs has focused on theoretical aspects [23, 7, 6]. Consequently, results have mostly been obtained in simulated environments, often neglecting the issues that arise in real environments, such as the effective scalability of applications and the cost of growing or shrinking. In this section, we discuss several implementations of malleable applications and their scheduling in real multicluster systems.
As noted in [11] and in our previous work [9], malleability helps applications perform better when resource availability varies. Several approaches have been used to make parallel and/or multicluster applications malleable. While GRADS [28] relies on the SRS [27] checkpoint library, APPLES [5] and ASSIST [4] propose to build applications upon intrinsically malleable skeletons. With AMPI [17], malleability is obtained by translating MPI applications to a large number of CHARM++ objects, which can be migrated at runtime. Utrera et al. [26] propose to make MPI applications malleable by folding several processes onto each processor.
A couple of works [5, 28] have studied how to schedule such applications in conjunction with building malleable applications. Among them, APPLES [5] and GRADS [28] are somewhat specific as they propose that applications be responsible for scheduling themselves on their own. However, this approach raises the question of how the system behaviour and performance would evolve in case several concurrent malleable applications compete for resources. Furthermore, as those approaches rely on checkpointing, it is unclear how an application gets its resources back when it agrees to try a new allocation. System-wide schedulers do not suffer from these drawbacks.
Other approaches rely on a system-wide scheduler. Corresponding to the underlying execution model, AMPI uses an equipartition policy, which ensures that all jobs get almost the same number of processors, while the policy in [26] is based on folding and unfolding the jobs (i.e., doubling or halving the number of allocated processors). However, those two approaches rely on the properties of their underlying execution model. For instance, equipartition assumes that any application can be executed efficiently with any number of processors, as is the case with AMPI; folding, on the other hand, restricts the number of processes to be divisible by the number of processors (often a power of 2 for practical reasons), which is the only way to fold non-malleable applications efficiently. A more general approach such as the one we propose is more appropriate in the context of multiclusters.
McCann and Zahorjan [19] further discuss the folding and equipartition policies. According to their experiments, folding preserves efficiency well, while equipartition provides higher fairness. They have also proposed in [19] a rotation policy in order to increase the fairness of the folding policy. However, rotation is almost impracticable in the context of multiclusters.
As fairness in terms of allocated processors does not imply efficiency, a biased equipartition policy is proposed by Hungershöfer et al. [15] such that the cumulative speedup of the system is maximized. The same group also considers both malleable and rigid jobs in a single system in [14], and guarantees a minimum number of processors to each malleable job, such that malleable jobs are not ruled out by rigid jobs. However, in multiclusters it is common that some of the users bypass the multicluster-level scheduler. The problem of making the scheduler take into account the background load incurred in this way is not addressed in the works of Hungershöfer et al.
Figure 1: Overview of the architecture of the KOALA multicluster scheduler (the scheduler with its co-allocator (CO) and processor claimer (PC), the KOALA information service (KIS) with its providers (PIP, NIP, RLS), and the runners with their framework).
In addition, most of those previous research works [26, 19, 15, 14] have not considered the combination of malleability management and load-sharing policies across clusters, which is an issue specific to multiclusters.
4 Background
In this section, we summarize our previous work on a co-allocating grid scheduler called KOALA and on the implementation of malleable applications, which is the starting point of the work reported in this paper. Section 4.1 presents the KOALA grid scheduler, followed by Section 4.2, which describes the DYNACO framework that we use to implement malleable applications.
4.1 The KOALA multicluster scheduler
The KOALA [21] grid scheduler has been designed for multicluster systems such as the DAS [2]. The KOALA job submission tools employ some of the GLOBUS toolkit [13] services for job submission, file transfer, and security and authentication. On the other hand, the KOALA scheduler has its own mechanisms for data and processor co-allocation, resource monitoring, and fault tolerance.
Figure 1 shows the architecture of KOALA. This architecture includes auxiliary tools called runners, which provide users with an interface for submitting jobs and monitoring their progress for different application types. Runners are built out of a framework, which serves as a frontend between each runner and the centralized scheduler. The latter is made of a co-allocator (CO), which decides on resource allocations, and of a processor claimer (PC), which ensures that processors will still be available when the job starts to run. If processor reservation is supported by local resource managers, the PC can reserve processors immediately after the placement of the components. Otherwise, the PC uses the KOALA claiming policy [20, 22] to postpone claiming of processors to a time close to the estimated job start time. In its tasks, the scheduler is supported by the KOALA information service (KIS), which monitors the status of resources by means of a processor information provider (PIP), a network information provider (NIP), and a replica location service (RLS). Providers connect KOALA to the multicluster monitoring infrastructure, which can be GLOBUS MDS or another system, depending on the resource managers used.
In the KOALA job model, a job comprises either one or multiple components that can each run on a separate cluster. Each job component specifies its requirements and preferences, such as the program it wants to run, the number of processors it needs, and the names of its input files.
Upon receiving a job request from a runner, the KOALA scheduler uses one of the placement policies described below to try to place job components on clusters. With KOALA, users are given the option of selecting one of these placement policies.
Figure 2: Overview of the architecture of the DYNACO framework for adaptability.
• The Worst Fit (WF) policy [8] places each component of a job on the cluster with the largest number of idle processors. The advantage of WF is its automatic load-balancing behaviour; the disadvantage is that large (components of) jobs have less chance of successful placement because WF tends to reduce the number of idle processors per cluster.
• The Close-to-Files (CF) policy [20] uses information about the presence of input files to decide where to place (components of) jobs. Clusters with the necessary input files already present are favoured as placement candidates, followed by clusters for which the transfer of those files takes the least amount of time.
• The Cluster Minimization (CM) and the Flexible Cluster Minimization (FCM) policies [24] have been designed especially for jobs that may use co-allocation, in order to minimize the number of clusters to be combined for a given parallel job, such that the number of inter-cluster messages is reduced. The flexible version in addition decreases the queue time by splitting jobs into components according to the numbers of idle processors.

If the placement of a job succeeds and input files are required, the scheduler informs a runner to initiate the third-party file transfers from the selected file sites to the execution sites of the job components. If a placement try fails, KOALA places the job at the tail of a placement queue. This queue holds all the jobs that have not yet been successfully placed. The scheduler regularly scans this queue from head to tail to see whether any job can be placed. For each job in the queue we record its number of placement tries, and when this number exceeds a certain threshold value, the submission of that job fails.
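The following Python sketch illustrates this flow under simplifying assumptions: clusters are reduced to idle-processor counts, only the Worst Fit policy is modelled, and the retry threshold is an invented value rather than KOALA's actual setting.

    from collections import deque

    RETRY_THRESHOLD = 10  # assumed value; the real threshold may differ

    def worst_fit(clusters, requested):
        """Return the cluster with the most idle processors, or None."""
        name = max(clusters, key=clusters.get)
        return name if clusters[name] >= requested else None

    def scan_placement_queue(queue, clusters):
        """One head-to-tail scan of the placement queue."""
        for _ in range(len(queue)):
            job = queue.popleft()
            site = worst_fit(clusters, job["procs"])
            if site is not None:
                clusters[site] -= job["procs"]      # placement succeeded
            else:
                job["tries"] += 1
                if job["tries"] <= RETRY_THRESHOLD:
                    queue.append(job)               # back to the tail
                # otherwise the submission of the job fails

    clusters = {"delft": 68, "vu": 85, "leiden": 32}   # idle processors
    queue = deque([{"procs": 40, "tries": 0}, {"procs": 90, "tries": 0}])
    scan_placement_queue(queue, clusters)
    assert clusters["vu"] == 45 and len(queue) == 1    # 90-proc job waits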
4.2 The DYNACO framework and its use for malleability
In our previous work [9], we have proposed techniques to implement malleability as a special case of adaptability. Basically, adaptability is an approach to addressing the problem of the dynamicity of large-scale execution environments. It consists of the ability of applications to modify themselves during their execution according to constraints imposed by the execution environment. Our previous work on abstracting adaptability [3] has resulted in DYNACO, a generic framework for building dynamically adaptable applications.
As its architecture in Figure 2 shows, DYNACO decomposes adaptability into four components, similarly to the control loop suggested in [18]: the observe component monitors the execution environment in order to detect any relevant change; relying on this information, the decide component makes the decisions about adaptability: it decides when the application should adapt itself and which strategy should be adopted. When the strategy in use has to be changed, the plan component plans how to make the application adopt the new strategy; finally, the execute component schedules the actions listed in the plan, taking into account the synchronization with the application code. Being a framework, DYNACO is expected to be specialized for each application. In particular, developers provide the decision procedure, the description of planning problems, and the implementation of adaptation actions.
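The division of labour among the four components can be pictured as the control loop below. This is a hedged sketch with hypothetical interfaces, not DYNACO's actual API; in the real framework the decide, plan, and execute parts are supplied per application, as just described.

    class AdaptationLoop:
        """Sketch of an observe/decide/plan/execute loop in the style of [18]."""

        def __init__(self, observe, decide, plan, execute):
            self.observe = observe  # detects relevant environment changes
            self.decide = decide    # picks when to adapt and which strategy
            self.plan = plan        # derives the actions to adopt a strategy
            self.execute = execute  # runs actions, synchronized with the app

        def step(self, strategy):
            event = self.observe()              # e.g. a grow or shrink message
            if event is None:
                return strategy                 # nothing relevant happened
            new_strategy = self.decide(event, strategy)
            if new_strategy != strategy:
                for action in self.plan(strategy, new_strategy):
                    self.execute(action)        # e.g. spawn or retire processes
            return new_strategy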
In addition to the DYNACO framework, we have proposed AFPAC [10] as an implementation of the execute component that is specific to SPMD applications. It ensures that adaptation actions are executed by all of the processes from the same point of progress in the execution of the application code.
As reported in [9], DYNACO and AFPAC have been successfully used to make several existing MPI-based applications malleable, including an FFT numerical kernel and a legacy n-body simulator. As DYNACO and AFPAC foster a clear separation of adaptability from the other concerns in the implementation, only a few modifications are required to make existing MPI applications malleable.
Figure 3: The architecture of the Malleable Runner with DYNACO in the KOALA multicluster scheduler.
5 Designing Support for Malleability in KOALA
In this section, we present our design for supporting malleable applications in KOALA. First, we explain how we include the DYNACO framework into the KOALA multicluster scheduler, and then we present our approaches and policies for managing the execution of malleable applications, respectively.
5.1 Supporting DYNACO applications in KOALA
In order to support DYNACO-based applications in KOALA, we have designed a specific runner called the Malleable Runner (MRunner); its architecture is shown in Figure 3. In the MRunner, the usual control role of the runner over the application is extended in order to handle malleability operations, for which purpose a complete, separate, application-specific instance of DYNACO is included in each copy of the MRunner. A frontend, which is common to all of the runners, interfaces the MRunner to the scheduler. We add a malleability manager to the scheduler, which is responsible for triggering changes of resource allocations.
In the DYNACO framework, the frontend is reflected as a monitor, which generates events as it receives grow and shrink messages from the scheduler. The resulting events are propagated throughout the DYNACO framework and translated into the appropriate messages to GRAM and to the application. In addition, the frontend catches the results of adaptations in order to generate acknowledgments back to the scheduler. It also notifies the scheduler when the application voluntarily shrinks below the number of allocated processors.
Since GRAM is not able to manage malleable jobs, the MRunner manages a malleable job as a collection of GRAM jobs of size 1. Upon growth, DYNACO makes the MRunner submit jobs to GRAM. When it receives active messages as feedback from GRAM, it transmits the new collection of active GRAM jobs (i.e., the collection of held resources) to the application in order to make it use them. In order to reduce the impact of grow messages on the execution time, we make interactions with GRAM overlap with the execution of the application. Suspension of the application (for process management and data redistribution) thus does not occur before all the processors are held. To do so, submissions to GRAM launch an empty stub rather than the application's program. The stub is turned into an application process during the process management phase, when processors are recruited by the application. The latter operation is faster than submitting a job to GRAM, as it is relieved from tasks such as security enforcement and queue management. Conversely, upon shrink, the MRunner first reclaims processors from the application; then, when it receives shrunk feedback messages, it releases the corresponding GRAM jobs. For performance reasons, when handling shrink messages, we let the execution resume before releasing processors to GRAM.
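The grow and shrink paths can be summarized as follows; gram_submit, handle.wait_active, and the other helpers are hypothetical stand-ins for the real GRAM interactions, so this is a sketch of the protocol rather than the MRunner's actual code.

    def grow(held, grow_value, gram_submit, hand_over_to_application):
        """Grow by grow_value processors: one size-1 GRAM job per processor."""
        pending = [gram_submit(program="stub", procs=1) for _ in range(grow_value)]
        for handle in pending:
            handle.wait_active()    # the stub runs; the processor is now held
            held.append(handle)
        # Only now is the application briefly suspended: the stubs become
        # application processes and data is redistributed over the new layout.
        hand_over_to_application(held)

    def shrink(held, shrink_value, reclaim_from_application, gram_release):
        """Shrink by shrink_value processors; the app resumes before release."""
        released = reclaim_from_application(shrink_value)
        for handle in released:     # 'shrunk' feedback received for these
            held.remove(handle)
            gram_release(handle)    # GRAM jobs are released afterwards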
5.2 Job Management
Upon submission of an application to KOALA, whether it is rigid, moldable, or malleable, the initial placement is performed by one of the existing placement policies as described in Section 4.1. In the placement phase of malleable applications, the initial number of processors required is determined considering the number of available processors in the system. Specifically, given a malleable application, the placement policies place it if the number of available processors is at least equal to the minimum processor requirement of the application.
In the job management context, the malleability manager is responsible for initiating the malleability management policies that decide on how to grow or shrink malleable applications. Below, we propose two design choices as to when to initiate malleability management policies, which give Precedence to Running Applications over waiting ones (PRA) or to Waiting Applications over running ones (PWA), respectively.
In the PRA approach, whenever processors become available, for instance, when a job finishes execution, first the running applications are considered. If there are malleable jobs running, one of the malleability management policies is initiated in order to grow them; any waiting malleable jobs are not considered as long as at least one running malleable job can still be grown.
In the PWA approach, when the next job j in the queue cannot be placed, the scheduler applies one of the malleability management policies to shrink running malleable jobs in order to obtain additional processors. If, however, it is impossible to obtain enough available processors to place job j, taking into account the minimum sizes of the running jobs, then the running malleable jobs are considered for growing by one of the malleability management policies. Whenever processors become available, the placement queue is scanned in order to find a job to be placed.
In both approaches, in order to trigger job management, the scheduler periodically polls the KOALA information service. In doing so, the scheduler is able to dynamically take into account the background load due to other users, even if they bypass KOALA. In addition, in order not to stress execution sites when growing malleable jobs, and therefore to always leave a minimal number of available processors to local users, a threshold is set above which KOALA never expands the total set of the jobs it manages.
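The difference between the two approaches can be made concrete with the following sketch of one polling step of the malleability manager; the threshold value and the job/policy interfaces are assumptions made for the illustration.

    GROW_THRESHOLD = 16  # assumed: idle processors always left to local users

    def poll_pra(idle, running_malleable, queue, policy, try_place):
        """PRA: running applications take precedence over waiting ones."""
        spare = idle - GROW_THRESHOLD
        if spare > 0 and any(j.can_grow() for j in running_malleable):
            policy.grow(running_malleable, spare)  # waiting jobs not considered
        else:
            try_place(queue)

    def poll_pwa(idle, running_malleable, queue, policy, try_place):
        """PWA: waiting applications take precedence over running ones."""
        if queue:
            job = queue[0]
            shortage = job.min_procs - idle
            reclaimable = sum(j.size - j.min_procs for j in running_malleable)
            if 0 < shortage <= reclaimable:
                policy.shrink(running_malleable, shortage)  # make room for job
                try_place(queue)
                return
        spare = idle - GROW_THRESHOLD
        if spare > 0:           # the waiting job cannot be helped: grow instead
            policy.grow(running_malleable, spare)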
5.3 Malleability Management Policies
The malleability management policies we describe below determine the means of shrinking and growing malleable jobs during their execution. In this paper, we assume that every application is executed in a single cluster, and so, no co-allocation takes place. Consequently, the policies are applied to each cluster separately.
5.3.1 Favour Previously Started Malleable Applications (FPSMA)
The FPSMA policy favours previously started malleable jobs in that whenever the policy is initiated by the malleability manager, it starts growing from the earliest-started malleable job and starts shrinking from the latest-started malleable job. Figure 4 presents the pseudo-code of the grow and shrink procedures of the FPSMA policy.
In the grow procedure, first, the malleable jobs running on the considered cluster are sorted in increasing order of their start time (line 1); then the number of processors to be allocated on behalf of malleable jobs (growValue) is offered to the next job in the sorted list (line 3). In reply to this offer (the job itself considers its maximum processor requirement), the accepted number of processors is allocated on behalf of that job (lines 4-5). Then growValue is updated, and it is checked whether there are more processors to be offered (lines 6-8).
The shrink procedure runs in a similar fashion; the differences with the grow procedure are that the jobs are sorted in decreasing order of their start time (line 1), and that, rather than being allocated, the accepted numbers of processors are waited for to be released (line 5).
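A compact executable rendering of both procedures, under the simplifying assumption that each job object answers an offer synchronously with the number of processors it accepts:

    def fpsma_grow(jobs, grow_value):
        """Offer processors to malleable jobs, earliest started first."""
        for job in sorted(jobs, key=lambda j: j.start_time):
            accepted = job.offer_grow(grow_value)  # job caps at its maximum
            grow_value -= accepted
            if grow_value == 0:
                break

    def fpsma_shrink(jobs, shrink_value):
        """Reclaim processors from malleable jobs, latest started first."""
        for job in sorted(jobs, key=lambda j: j.start_time, reverse=True):
            accepted = job.offer_shrink(shrink_value)  # job keeps its minimum
            job.wait_released()                        # processors come back
            shrink_value -= accepted
            if shrink_value == 0:
                break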
5.3.2 Equi-Grow & Shrink (EGS)

Our EGS policy attempts to balance processors over the malleable jobs. Figure 5 gives the pseudo-code of the grow and shrink procedures of that policy. When it is initiated by the malleability manager, it distributes available processors (or reclaims needed processors) equally (line 2) over all of the running malleable jobs. In case the number of processors to be distributed or reclaimed is not divisible by the number of running malleable jobs, the remainder is distributed across the least recently started jobs as a bonus, or reclaimed from the most recently started jobs as a malus (line 5).
The EGS policy is similar to the well-known equipartition policy. The two policies, however, differ in the following points. While our EGS policy distributes the available processors equally among the running jobs, the equipartition policy distributes the whole set of processors equally among the running jobs. Consequently, EGS is not expected to make all of the malleable jobs have the same size at all times, while equipartition does. On the other hand, equipartition may mix grow and shrink messages, while EGS consistently either grows or shrinks all of the running jobs.
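The even split with its bonus/malus remainder can be expressed directly with integer division, as in this sketch (the job ordering follows the description above):

    def egs_share(total, jobs_by_start_time, growing):
        """Split 'total' processors evenly over the running malleable jobs.

        The remainder goes as a bonus to the least recently started jobs
        when growing, and as a malus to the most recently started jobs
        when shrinking.
        """
        base, remainder = divmod(total, len(jobs_by_start_time))
        ordered = (jobs_by_start_time if growing             # oldest first
                   else list(reversed(jobs_by_start_time)))  # newest first
        return {job: base + (1 if i < remainder else 0)
                for i, job in enumerate(ordered)}

    # Example: growing by 7 processors over 3 jobs started at t = 0, 10, 20.
    assert egs_share(7, ["j0", "j10", "j20"], growing=True) == \
        {"j0": 3, "j10": 2, "j20": 2}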
procedure FPSMA-GROW(clusterName, growValue)
1. List ← malleable jobs running on clusterName, sorted in increasing order of their start time
2. foreach (Job in List) do
3.     send grow request (growValue) to Job
4.     get accepted number of processors from Job
5.     allocate the accepted number of processors to Job
6.     growValue = growValue - accepted
7.     if growValue == 0 then
8.         break

procedure FPSMA-SHRINK(clusterName, shrinkValue)
1. List ← malleable jobs running on clusterName, sorted in decreasing order of their start time
2. foreach (Job in List) do
3.     send shrink request (shrinkValue) to Job
4.     get accepted number of processors from Job
5.     wait for Job to release the processors
6.     shrinkValue = shrinkValue - accepted
7.     if shrinkValue == 0 then
8.         break

Figure 4: Pseudo-code of the FPSMA policy.
6 Experimental setup
In this section we describe the setup of the experiments we have conducted in order to evaluate the support and the scheduling policies for malleable jobs in KOALA. We present the applications that have been used in Section 6.1, our testbed in Section 6.2, and the details of the workloads in Section 6.3.
6.1 Applications
For the experiments, we rely on two applications that we have previously made malleable with DYNACO. These applications are the NAS Parallel Benchmark FT [29], which is a benchmark for parallel machines based on a fast Fourier transform numerical kernel, and GADGET 2 [25], which is a legacy n-body simulator. Further details on how we have made these two applications malleable can be found in [9]. Figure 6 shows how the execution times of the two applications scale with respect to the number of machines on the Delft cluster (see Table 1). With 2 processors, GADGET 2 takes nearly 10 minutes, while FT lasts almost 2 minutes. The best execution times are 4 minutes for GADGET 2 and 1 minute for FT, respectively.
While GADGET 2 can execute with an arbitrary number of processors, FT only accepts powers of 2. As we have already stated, we propose that the scheduler should not be aware of such constraints, in order to avoid making it implement an exhaustive collection of possible constraints. Consequently, when responding to grow and shrink messages, the FT application accepts only the highest power of 2 processors that does not exceed the allocated number. Additional processors are voluntarily released to the scheduler. In addition, the FT application assumes processors of equal compute power, while GADGET 2 includes a load-balancing mechanism.
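For instance, the number of processors that FT actually uses out of an allocation is the largest power of 2 not exceeding it, which can be computed as follows:

    def ft_usable(allocated: int) -> int:
        """Largest power of 2 that does not exceed the allocated processors."""
        return 1 << (allocated.bit_length() - 1) if allocated > 0 else 0

    assert [ft_usable(n) for n in (2, 3, 7, 8, 9)] == [2, 2, 4, 8, 8]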
6.2 The Testbed
Our testbed is the Distributed ASCI Supercomputer (DAS3) [2], which is a wide-area computer system in the Netherlands that is used for research on parallel, distributed, and grid computing. It consists of five clusters comprising 272 dual AMD Opteron compute nodes in total. The distribution of the nodes over the clusters is given in Table 1. Besides using 1 and 10 Gigabit/s Ethernet, DAS3 employs the novel local high-speed Myri-10G interconnect technology. On each of the DAS clusters, the Sun Grid Engine (SGE) [1] is used as the local resource manager. SGE has been configured to run applications on the nodes in an exclusive fashion, i.e., in space-shared mode.
procedure EGS-GROW(clusterName, growValue)
1. List ← malleable jobs running on clusterName, sorted in increasing order of their start time
2. jobGrowValue ← ⌊growValue / size(List)⌋
3. growRemainder ← remainder(growValue, size(List))
4. foreach (i-th Job in List) do
5.     bonus ← 1 if i ≤ growRemainder; 0 otherwise
6.     send grow request (jobGrowValue + bonus) to Job
7.     get accepted number of processors from Job
8. foreach (Job in List) do
9.     allocate the accepted number of processors to Job

procedure EGS-SHRINK(clusterName, shrinkValue)
1. List ← malleable jobs running on clusterName, sorted in decreasing order of their start time
2. jobShrinkValue ← ⌊shrinkValue / size(List)⌋
3. shrinkRemainder ← remainder(shrinkValue, size(List))
4. foreach (i-th Job in List) do
5.     malus ← 1 if i ≤ shrinkRemainder; 0 otherwise
6.     send shrink request (jobShrinkValue + malus) to Job
7.     get accepted number of processors from Job
8. foreach (Job in List) do
9.     wait for Job to release the processors

Figure 5: Pseudo-code of the EGS policy.
Figure 6: The execution times of the two applications (FT and GADGET 2) depending on the number of machines used.
6.3 The Workloads
The workloads that we employ in our experiments combine the two applications of Section 6.1 with a uniform distribution. The minimum job size is set to 2 processors, while the maximum size is 46 for GADGET 2 and 32 for FT. In both cases, 300 jobs are submitted. Jobs are submitted from a single client site; no staging operation is ordered, even when processors are allocated at remote sites.
For the PRA-based experiments, we have used the two following workloads. Workload Wm is composed exclusively of malleable jobs, while workload Wmr is randomly composed of 50% malleable jobs and 50% rigid jobs. In both cases, the inter-arrival time is 2 minutes. Rigid jobs are submitted with a size of 2 processors, and malleable jobs with an initial size of 2. In our experiments, KOALA employs the WF policy. Apart from workload Wm or Wmr, the only background load during the experiments is the activity of concurrent users. This background load does not disturb the measurements.
When analysing the PWA approach, we have used two workloads W′m and W′mr, which derive from Wm and Wmr, respectively. In these workloads, the inter-arrival time is reduced to 30 seconds in order to increase the load on the system.
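The following sketch shows how such workloads can be generated; the parameters mirror the description above, while the job record layout and the random seed are illustrative assumptions.

    import random

    def make_workload(n_jobs=300, interarrival=120, malleable_fraction=1.0,
                      seed=0):
        """Generate a Wm-style workload (Wmr-style with fraction 0.5)."""
        rng = random.Random(seed)
        max_size = {"GADGET2": 46, "FT": 32}
        jobs = []
        for i in range(n_jobs):
            app = rng.choice(["GADGET2", "FT"])    # uniform over the two apps
            malleable = rng.random() < malleable_fraction
            jobs.append({
                "submit_time": i * interarrival,   # fixed inter-arrival time
                "app": app,
                "malleable": malleable,
                "size": 2,                         # initial (or rigid) size
                "min": 2,
                "max": max_size[app] if malleable else 2,
            })
        return jobs

    w_m = make_workload()                              # PRA: all malleable
    w_mr = make_workload(malleable_fraction=0.5)       # PRA: 50% rigid
    w_m_prime = make_workload(interarrival=30)         # PWA: higher load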
Table 1: The distribution of the nodes over the DAS3 clusters.

Cluster              Nodes   Interconnect
Vrije University      85     Myri-10G & 1/10 GbE
U. of Amsterdam       41     Myri-10G & 1/10 GbE
Delft University      68     1/10 GbE
MultimediaN           46     Myri-10G & 1/10 GbE
Leiden University     32     Myri-10G & 1/10 GbE
7 Experimental Results

Figure 7: Comparison between FPSMA and EGS with the PRA approach of job management (no shrinking). Panels: (a) the cumulative distribution of the number of processors per job averaged over the execution time of jobs; (b) the cumulative distribution of the maximal number of processors reached per job during its execution; (c) the cumulative distribution of the job execution times; (d) the cumulative distribution of the job response times; (e) utilization of the platform during the experiment; (f) activity of the malleability manager. Each panel shows the four combinations FPSMA/Wm, FPSMA/Wmr, EGS/Wm, and EGS/Wmr.
Figure 8: Comparison between FPSMA and EGS with the PWA approach of job management (both growing and shrinking). Panels: (a) the cumulative distribution of the number of processors per job averaged over the execution time of jobs; (b) the cumulative distribution of the maximal number of processors reached per job during its execution; (c) the cumulative distribution of the job execution times; (d) the cumulative distribution of the job response times; (e) utilization of the platform during the experiment; (f) activity of the malleability manager. Each panel shows the four combinations FPSMA/W′m, FPSMA/W′mr, EGS/W′m, and EGS/W′mr.
Figures 8(c) and 8(d) show that despite the fact that the execution time is almost the same for the four executions, the response time is far worse for the combination of the EGS policy and the W′m workload due to a higher wait time. This results from the high utilization that we observe in Figure 8(e) for this policy.
8 Conclusion
In this paper we have presented the design, and the analysis with experiments with the KOALA scheduler in our DAS3 testbed, of the support and policies for malleability of parallel applications in multicluster systems. Our experimental results show that malleability is indeed beneficial for performance. In our design, we have not included grow operations that are initiated by the applications. This feature is mainly useful in case the parallelism pattern is irregular, unlike with our applications. Designing it is, however, not straightforward. For instance, it raises the design choice whether such grow operations can be mandatory or only voluntary, and how much effort the scheduler should make to accommodate mandatory grow operations, for instance by shrinking concurrent malleable jobs. Another element that we have not incorporated in our design and implementation, but that we intend to add, is malleability of co-allocated applications.
Acknowledgment
This work was carried out in the context of the Virtual Laboratory for e-Science project (www.vl-e.nl), which is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W), and which is part of the ICT innovation program of the Dutch Ministry of Economic Affairs (EZ). Part of this work is also carried out under the FP6 Network of Excellence CoreGRID funded by the European Commission (Contract IST-2002-004265).
References
[1] Sun Grid Engine. http://gridengine.sunsource.net.
[2] The Distributed ASCI Supercomputer. http://www.cs.vu.nl/das3/.
[3] Marco Aldinucci, Françoise André, Jérémy Buisson, Sonia Campa, Massimo Coppola, Marco Danelutto, and Corrado Zoccolo. An abstract schema modelling adaptivity management. In Sergei Gorlatch and Marco Danelutto, editors, Integrated Research in GRID Computing. Springer, 2007. Proceedings of the CoreGRID Integration Workshop 2005.
[4] Marco Aldinucci, Massimo Coppola, Marco Danelutto, Marco Vanneschi, and Corrado Zoccolo. ASSIST as a research framework for high-performance grid programming environments. Technical Report TR-04-09, Università di Pisa, Dipartimento di Informatica, February 2004.
[5] Francine Berman, Richard Wolski, Henri Casanova, Walfredo Cirne, Holly Dail, Marcio Faerman, Silvia Figueira, Jim Hayes, Graziano Obertelli, Jennifer Schopf, Gary Shao, Shava Smallen, Neil Spring, Alan Su, and Dmitrii Zagorodnov. Adaptive computing on the grid using AppLeS. IEEE Trans. Parallel Distrib. Syst., 14(4):369–382, April 2003.
[6] Jacek Blazewicz, Mikhail Kovalyov, Maciej Machowiak, Denis Trystram, and Jan Weglarz. Preemptable malleable task scheduling problem. IEEE Transactions on Computers, 55(4):486–490, April 2006. Brief contribution.
[7] Jacek Blazewicz, Maciej Machowiak, Gregory Mounié, and Denis Trystram. Approximation algorithms for scheduling independent malleable tasks. In 7th International Euro-Par Conference, pages 191–197, Manchester, UK, August 2001.
[8] A. I. D. Bucur and D. H. J. Epema. The maximal utilization of processor co-allocation in multicluster systems. In IPDPS '03: Proceedings of the 17th International Symposium on Parallel and Distributed Processing, page 60.1, Washington, DC, USA, 2003. IEEE Computer Society.
[9] Jérémy Buisson, Françoise André, and Jean-Louis Pazat. A framework for dynamic adaptation of parallel components. In International Conference ParCo, pages 65–72, Málaga, Spain, September 2005.
[10] Jérémy Buisson, Françoise André, and Jean-Louis Pazat. AFPAC: Enforcing consistency during the adaptation of a parallel component. Scalable Computing: Practice and Experience (SCPE), 7(3):83–95, September 2006.
[11] Travis Desell, Kaoutar El Maghraoui, and Carlos Varela. Malleable components for scalable high performance computing. In HPC Grid Programming Environment and Components, Paris, France, June 2006.
[12] Dror G. Feitelson and Larry Rudolph. Toward convergence in job schedulers for parallel supercomputers. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 1–26. Springer-Verlag, 1996.
[13] Ian Foster and Carl Kesselman. Globus: a metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997.
[14] Jan Hungershöfer. On the combined scheduling of malleable and rigid jobs. In 16th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 206–213, Foz do Iguaçu, Brazil, October 2004.
[15] Jan Hungershöfer, Achim Streit, and Jens-Michael Wierum. Efficient resource management for malleable applications. Technical Report TR-003-01, Paderborn Center for Parallel Computing, December 2001.
[16] Alexandru Iosup, Dick Epema, Carsten Franke, Alexander Papaspyrou, Lars Schley, Baiyi Song, and Ramin Yahyapour. On grid performance evaluation using synthetic workloads. In E. Frachtenberg and U. Schwiegelshohn, editors, Proc. of the 12th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Lecture Notes in Computer Science. Springer-Verlag, June 2006.
[17] Laxmikant Kalé, Sameer Kumar, and Jayant DeSouza. A malleable-job system for timeshared parallel machines. In 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID), page 230, Berlin, Germany, May 2002.
[18] Jeffrey O. Kephart and David M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, January 2003.
[19] Cathy McCann and John Zahorjan. Processor allocation policies for message-passing parallel computers. In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 19–32, Nashville, Tennessee, USA, May 1994.
[20] Hashim Mohamed and Dick Epema. An evaluation of the close-to-files processor and data co-allocation policy in multiclusters. In 2004 IEEE International Conference on Cluster Computing, pages 287–298. IEEE Society Press, 2004.
[21] Hashim Mohamed and Dick Epema. The design and implementation of the KOALA co-allocating grid scheduler. In European Grid Conference, volume 3470 of Lecture Notes in Computer Science, pages 640–650. Springer-Verlag, 2005.
[22] Hashim Mohamed and Dick Epema. Experiences with the KOALA co-allocating scheduler in multiclusters. In Proceedings of the International Symposium on Cluster Computing and the Grid, Cardiff, UK, May 2005.
[23] Gregory Mounié, Christophe Rapine, and Denis Trystram. Efficient approximation algorithms for scheduling malleable tasks. In 11th ACM Symposium on Parallel Algorithms and Architectures, pages 23–32, Saint-Malo, France, June 1999.
[24] Ozan Sonmez, Hashim Mohamed, and Dick Epema. Communication-aware job placement policies for the KOALA grid scheduler. In E-SCIENCE '06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, page 79, Washington, DC, USA, 2006. IEEE Computer Society.
[25] Volker Springel. The cosmological simulation code GADGET-2. Monthly Notices of the Royal Astronomical Society, 2005. Submitted; astro-ph/0505010.
[26] Gladys Utrera, Julita Corbalán, and Jesús Labarta. Implementing malleability on MPI jobs. In 13th International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 215–224, Antibes, France, September 2004.
[27] Sathish Vadhiyar and Jack Dongarra. SRS: a framework for developing malleable and migratable parallel applications for distributed systems. Parallel Processing Letters, 13(2):291–312, 2003.
[28] Sathish Vadhiyar and Jack Dongarra. Self adaptability in grid computing. Concurrency and Computation: Practice and Experience, 17(2-4):235–257, February 2005.
[29] Rob F. van der Wijngaart. NAS Parallel Benchmarks version 2.4. Technical Report NAS-02-007, NASA Advanced Supercomputing Division, October 2002.