Institute on Resource Management and Scheduling
CoreGRID - Network of Excellence
URL: http://www.coregrid.net
CoreGRID is a Network of Excellence funded by the European Commission under the Sixth Framework Programme
Project no. FP6-004265

Scheduling Malleable Applications in Multicluster Systems

Jérémy Buisson
jbuisson@irisa.fr
INRIA/IRISA
Campus de Beaulieu, 35042 Rennes Cedex, France

Ozan Sonmez, Hashim Mohamed, Wouter Lammers, Dick Epema
{o.o.sonmez,h.h.mohamed,d.h.epema}@tudelft.nl, wouter@w78.nl
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands

CoreGRID TR-0092
May 22, 2007
Abstract
In large-scale distributed execution environments such as multicluster systems and grids, resource availability may vary due to resource failures and because resources may be added to or withdrawn from such environments at any time. In addition, single sites in such systems may have to deal with workloads originating from both local users and from many other sources. As a result, application malleability, that is, the property of applications to deal with a varying amount of resources during their execution, may be very beneficial for performance. In this paper we present the design of the support of and scheduling policies for malleability in our KOALA multicluster scheduler with the help of our DYNACO framework for application malleability. In addition, we show the results of experiments with scheduling malleable workloads with KOALA in our DAS multicluster testbed.
1 Introduction
Application malleability, that is, the property of applications to use varying amounts of resources such as processors during their execution, is potentially a very versatile and beneficial feature. Because it allows resource allocations to vary during execution, malleability gives a scheduler the opportunity to revise its decisions even after applications have started executing. By shrinking the resource allocations of running applications, malleability allows new jobs to start sooner, possibly with resources that are not going to be usable during their whole execution. By letting applications benefit from resources that appear during their execution through growing their allocations, malleability also helps applications terminate sooner. In addition to these very general advantages, malleability makes it easier to deal with the dynamic nature of large-scale distributed execution environments such as multicluster systems, and more generally grids. In this paper, we present the design of the support for malleability in our KOALA [21] multicluster scheduler by means of the inclusion of our DYNACO framework [9] for application malleability, and the design and analysis of two scheduling policies for malleable applications.
In execution environments such as multiclusters and grids, the availability of resources varies frequently [16]. In addition to failures, resources may be allocated (or released) by concurrent users, and organizations may add or withdraw (parts of) their resources to/from the resource pool at any time. In any of these cases, malleability allows applications to benefit from newly appearing resources, while gracefully releasing resources that are reclaimed by the environment. Malleability thus holds great promise for achieving better-performing execution in multiclusters. Besides, malleable applications give schedulers the opportunity to increase system utilization.
On the one hand, although several approaches have been proposed to build malleable applications [9, 5, 28, 4, 17, 26], virtually no existing multicluster and grid infrastructures are able to benefit from this property. Consequently, many applications embed their own specific scheduler and submit bulks of jobs in order to build dynamic resource management on top of existing infrastructures. On the other hand, most of the previous work on scheduling malleable applications does not handle the challenges that appear in the context of multicluster systems. Specifically, issues such as the selection of a suitable cluster for each job and resilience to background load due to local users are often not taken into account. Furthermore, many proposed approaches have only been tested with simulations, and an assessment of the overhead due to the implementation of grow and shrink operations is commonly omitted.
Our contributions in this paper are the following. First, we present an architecture and an actual implementation of the support for malleability in grid schedulers, showing the benefits of the modular structure of KOALA as a by-product. Second, we present two policies for managing malleability in the scheduler, one which hands out any additional processors to the malleable jobs that have been running the longest, and one that spreads them equally over all malleable jobs; each of these policies can be combined with one of two approaches which either favor running or waiting jobs. Third, we evaluate these policies and approaches in combination with KOALA's worst-fit load-sharing scheduling policy with experiments in the DAS3 [2] testbed. These experiments show that a higher utilization and shorter execution times can be achieved when malleability is used.
The rest of this paper is structured as follows. Section 2 states more precisely the problem addressed in this paper, and Section 3 reviews the state of the art of malleability in resource management in multicluster and grid systems. Section 4 describes the KOALA grid scheduler and the DYNACO framework, which are the starting points of our work. Section 5 describes how we support malleability in KOALA, and details the malleability management approaches and policies that we propose. Section 6 presents the experimental setup, and Section 7 discusses our experimental results. Finally, Section 8 makes some concluding remarks and points to future work.
2 Problem Statement
In multicluster systems, and more generally in grids, there may be various types of parallel applications that can benefit from being able to change their processor configuration after they have started execution. Malleability of parallel applications may yield improved application performance and better system utilization since it allows more flexible scheduling policies. In this paper, we propose and compare scheduling approaches that take into account malleable applications in order to assess such benefits. In this section, we first classify parallel applications in order to distinguish malleable applications, and then we address several aspects of malleable applications that should be taken into account by a resource management or a scheduling system.
2.1 Classification of Parallel Jobs
Following the well-known parallel job classification scheme presented in [12], we consider three types of jobs, namely, rigid, moldable, and malleable. A rigid job requires a fixed number of processors. When the number of processors can be adapted only at the start of the execution, the job is called moldable. Similar to rigid jobs, the number of processors for moldable jobs cannot be changed during runtime. Jobs that have the flexibility to change the number of assigned processors during their runtime (i.e., they can grow or shrink) are called malleable.
2.2 Specification of Malleable Jobs
A malleable job may specify the minimum and maximum number of processors it requires. The minimum value is the minimum number of processors a malleable job needs to be able to run; the job cannot shrink below this value. The maximum value is the maximum number of processors a malleable job can handle; allocating more than the maximum value would just waste processors. We do not assume that a step size indicating the number of processors by which a malleable application can grow or shrink is defined. We leave the determination of the amount of growing and shrinking to the protocol between the scheduler and the application (see Section 5).
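As an illustration of this specification, the following minimal Python sketch models a malleable job with the minimum and maximum bounds described above; the class and method names are hypothetical and do not correspond to KOALA's actual job description format.

    from dataclasses import dataclass

    @dataclass
    class MalleableJobSpec:
        # Hypothetical model of the processor bounds of a malleable job.
        min_procs: int  # the job cannot run (or shrink) below this value
        max_procs: int  # allocating more than this would waste processors

        def accepted(self, offered: int) -> int:
            """How many processors the job would actually use out of an offer."""
            if offered < self.min_procs:
                return 0  # not enough to run at all
            return min(offered, self.max_procs)

    # Example: a job that runs on anywhere between 2 and 32 processors.
    spec = MalleableJobSpec(min_procs=2, max_procs=32)
    assert spec.accepted(1) == 0 and spec.accepted(48) == 32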
2.3 Initiative of Change
Another aspect that we consider is the party that takes the initiative of changing the size of a malleable job (shrinking or growing). Either the application or the scheduler may initiate grow or shrink requests. An application may do so when the computation it is performing calls for it. For example, a computation may be in need of more processors before it can continue. On the other hand, the scheduler may decide that a malleable job has to shrink or grow based on the availability of free processors in the system. For example, the arrival of new jobs at a heavily loaded system may trigger a scheduler to request currently running malleable jobs to shrink.
2.4 The Obligation to Change
Requests for changing the size of a malleable job may or may not have to be satisfied. A voluntary change means that the change does not have to succeed or does not necessarily have to be executed; it is merely a guideline. A mandatory change, however, has to be accommodated, because either the application cannot proceed without the change, or because the system is in direct need of the reclaimed processors.
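Sections 2.3 and 2.4 together define a small taxonomy of change requests. A hedged sketch of how a scheduler might represent it (the names are illustrative, not KOALA's):

    from enum import Enum
    from dataclasses import dataclass

    class Initiator(Enum):            # Section 2.3: who asks for the change
        APPLICATION = "application"
        SCHEDULER = "scheduler"

    class Obligation(Enum):           # Section 2.4: must the change succeed?
        VOLUNTARY = "voluntary"       # merely a guideline, may be refused
        MANDATORY = "mandatory"       # has to be accommodated

    @dataclass
    class ChangeRequest:
        grow: bool                    # True for grow, False for shrink
        amount: int                   # number of processors concerned
        initiator: Initiator
        obligation: Obligation

    # Example: the scheduler reclaims 4 processors, and the job must comply.
    req = ChangeRequest(False, 4, Initiator.SCHEDULER, Obligation.MANDATORY)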
3 Related work
Much of the previous research on the problem of scheduling and allocating resources to malleable jobs has focused on theoretical aspects [23, 7, 6]. Consequently, results have mostly been obtained in simulated environments, often neglecting the issues that arise in real environments, such as the effective scalability of applications and the cost of growing or shrinking. In this section, we discuss several implementations of malleable applications and their scheduling in real multicluster systems.
As noted in [11] and in our previous work [9], malleability helps applications perform better when resource availability varies. Several approaches have been used to make parallel and/or multicluster applications malleable. While GRADS [28] relies on the SRS [27] checkpoint library, APPLES [5] and ASSIST [4] propose to build applications upon intrinsically malleable skeletons. With AMPI [17], malleability is obtained by translating MPI applications to a large number of CHARM++ objects, which can be migrated at runtime. Utrera et al. [26] propose to make MPI applications malleable by folding several processes onto each processor.
A couple of works [5, 28] have studied how to schedule such applications in conjunction with building malleable applications. Among them, APPLES [5] and GRADS [28] are somewhat specific as they propose that applications be responsible for scheduling themselves on their own. However, this approach raises the question of how the system behaviour and performance would evolve in case several concurrent malleable applications compete for resources. Furthermore, as those approaches rely on checkpointing, it is unclear how an application gets its resources back when it agrees to try a new allocation. System-wide schedulers do not suffer from these drawbacks.
Other approaches rely on a system-wide scheduler. Corresponding to the underlying execution model, AMPI uses an equipartition policy, which ensures that all jobs get almost the same number of processors, while the policy in [26] is based on folding and unfolding the jobs (i.e., doubling or halving the number of allocated processors). However, those two approaches rely on the properties of their underlying execution model. For instance, equipartition assumes that any application can be executed efficiently with any number of processors, as is the case with AMPI; folding, on the other hand, restricts the number of processes to be divisible by the number of processors (often a power of 2 for practical reasons), which is the only way to fold non-malleable applications efficiently. A more general approach such as the one we propose is more appropriate in the context of multiclusters.
McCann and Zahorjan [19] further discuss the folding and equipartition policies. According to their experiments, folding preserves efficiency well, while equipartition provides higher fairness. They have also proposed in [19] a rotation policy in order to increase the fairness of the folding policy. However, rotation is almost impracticable in the context of multiclusters.
As fairness in terms of allocated processors does not imply efficiency, a biased equipartition policy is proposed by Hungershöfer et al. [15] such that the cumulative speedup of the system is maximized. The same group also considers both malleable and rigid jobs in a single system in [14], and guarantees a minimum number of processors to each malleable job, such that malleable jobs are not ruled out by rigid jobs. However, in multiclusters it is common that some of the users bypass the multicluster-level scheduler. The problem of making the scheduler take into account the background load incurred in this way is not addressed in the works of Hungershöfer et al.
Figure 1: Overview of the architecture of the KOALA multicluster scheduler (the scheduler with its co-allocator (CO) and processor claimer (PC), the KOALA information service (KIS) with its providers (PIP, NIP, RLS), and the runners with their framework).
In addition, most of those previous research works [26, 19, 15, 14] have not considered the combination of malleability management and load-sharing policies across clusters, which is an issue specific to multiclusters.
4 Background
In this section, we summarize our previous work on a co-allocating grid scheduler called KOALA and on the implementation of malleable applications, which is the starting point of the work reported in this paper. Section 4.1 presents the KOALA grid scheduler, followed by Section 4.2, which describes the DYNACO framework that we use to implement malleable applications.
4.1 The KOALA multicluster scheduler
The KOALA [21] grid scheduler has been designed for multicluster systems such as the DAS [2]. The KOALA job submission tools employ some of the GLOBUS toolkit [13] services for job submission, file transfer, and security and authentication. On the other hand, the KOALA scheduler has its own mechanisms for data and processor co-allocation, resource monitoring, and fault tolerance.
Figure 1 shows the architecture of KOALA. This architecture includes auxiliary tools called runners, which provide users with an interface for submitting jobs and monitoring their progress for different application types. Runners are built out of a framework, which serves as a frontend between each runner and the centralized scheduler. The latter is made of a co-allocator (CO), which decides on resource allocations, and of a processor claimer (PC), which ensures that processors will still be available when the job starts to run. If processor reservation is supported by local resource managers, the PC can reserve processors immediately after the placement of the components. Otherwise, the PC uses the KOALA claiming policy [20, 22] to postpone claiming of processors to a time close to the estimated job start time. In its tasks, the scheduler is supported by the KOALA information service (KIS), which monitors the status of resources by means of a processor information provider (PIP), a network information provider (NIP), and a replica location service (RLS). Providers connect KOALA to the multicluster monitoring infrastructure, which can be GLOBUS MDS or another system, depending on the resource managers used.
In the KOALA job model, a job comprises either one or multiple components that can each run on a separate cluster. Each job component specifies its requirements and preferences, such as the program it wants to run, the number of processors it needs, and the names of its input files.
Upon receiving a job request from a runner, the KOALA scheduler uses one of the placement policies described below to try to place job components on clusters. With KOALA, users are given the option of selecting one of these placement policies.
Figure 2: Overview of the architecture of the DYNACO framework for adaptability.
• The Worst Fit (WF) policy [8] places each component of a job on the cluster with the largest number of idle processors. The advantage of WF is its automatic load-balancing behaviour; the disadvantage is that large (components of) jobs have less chance of successful placement because WF tends to reduce the number of idle processors per cluster.
• The Close-to-Files (CF) policy [20] uses information about the presence of input files to decide where to place (components of) jobs. Clusters with the necessary input files already present are favoured as placement candidates, followed by clusters for which the transfer of those files takes the least amount of time.
• The Cluster Minimization (CM) and the Flexible Cluster Minimization (FCM) policies [24] have been designed especially for jobs that may use co-allocation, in order to minimize the number of clusters to be combined for a given parallel job, such that the number of inter-cluster messages is reduced. The flexible version in addition decreases the queue time by splitting jobs into components according to the numbers of idle processors.

If the placement of a job succeeds and input files are required, the scheduler informs a runner to initiate the third-party file transfers from the selected file sites to the execution sites of the job components. If a placement try fails, KOALA places the job at the tail of a placement queue. This queue holds all the jobs that have not yet been successfully placed. The scheduler regularly scans this queue from head to tail to see whether any job can be placed. For each job in the queue we record its number of placement tries, and when this number exceeds a certain threshold value, the submission of that job fails.
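The following Python sketch illustrates this flow under simplifying assumptions: clusters are reduced to idle-processor counts, only the Worst Fit policy is modelled, and the retry threshold is an invented value rather than KOALA's actual setting.

    from collections import deque

    RETRY_THRESHOLD = 10  # assumed value; the real threshold may differ

    def worst_fit(clusters, requested):
        """Return the cluster with the most idle processors, or None."""
        name = max(clusters, key=clusters.get)
        return name if clusters[name] >= requested else None

    def scan_placement_queue(queue, clusters):
        """One head-to-tail scan of the placement queue."""
        for _ in range(len(queue)):
            job = queue.popleft()
            site = worst_fit(clusters, job["procs"])
            if site is not None:
                clusters[site] -= job["procs"]      # placement succeeded
            else:
                job["tries"] += 1
                if job["tries"] <= RETRY_THRESHOLD:
                    queue.append(job)               # back to the tail
                # otherwise the submission of the job fails

    clusters = {"delft": 68, "vu": 85, "leiden": 32}   # idle processors
    queue = deque([{"procs": 40, "tries": 0}, {"procs": 90, "tries": 0}])
    scan_placement_queue(queue, clusters)
    assert clusters["vu"] == 45 and len(queue) == 1    # 90-proc job waits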
4.2 The DYNACO framework and its use for malleability
In our previous work [9], we have proposed techniques to implement malleability as a special case of adaptability. Basically, adaptability is an approach to addressing the problem of the dynamicity of large-scale execution environments. It consists of the ability of applications to modify themselves during their execution according to constraints imposed by the execution environment. Our previous work on abstracting adaptability [3] has resulted in DYNACO, a generic framework for building dynamically adaptable applications.
As its architecture in Figure 2 shows, DYNACO decomposes adaptability into four components, similarly to the control loop suggested in [18]: the observe component monitors the execution environment in order to detect any relevant change; relying on this information, the decide component makes the decisions about adaptability: it decides when the application should adapt itself and which strategy should be adopted. When the strategy in use has to be changed, the plan component plans how to make the application adopt the new strategy; finally, the execute component schedules the actions listed in the plan, taking into account the synchronization with the application code. Being a framework, DYNACO is expected to be specialized for each application. In particular, developers provide the decision procedure, the description of planning problems, and the implementation of adaptation actions.
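The division of labour among the four components can be pictured as the control loop below. This is a hedged sketch with hypothetical interfaces, not DYNACO's actual API; in the real framework the decide, plan, and execute parts are supplied per application, as just described.

    class AdaptationLoop:
        """Sketch of an observe/decide/plan/execute loop in the style of [18]."""

        def __init__(self, observe, decide, plan, execute):
            self.observe = observe  # detects relevant environment changes
            self.decide = decide    # picks when to adapt and which strategy
            self.plan = plan        # derives the actions to adopt a strategy
            self.execute = execute  # runs actions, synchronized with the app

        def step(self, strategy):
            event = self.observe()              # e.g. a grow or shrink message
            if event is None:
                return strategy                 # nothing relevant happened
            new_strategy = self.decide(event, strategy)
            if new_strategy != strategy:
                for action in self.plan(strategy, new_strategy):
                    self.execute(action)        # e.g. spawn or retire processes
            return new_strategy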
In addition to the DYNACO framework, we have proposed AFPAC [10] as an implementation of the execute component that is specific to SPMD applications. It ensures that adaptation actions are executed by all of the processes from the same point of progress in the execution of the application code.
As reported in [9], DYNACO and AFPAC have been successfully used to make several existing MPI-based applications malleable, including an FFT numerical kernel and a legacy n-body simulator. As DYNACO and AFPAC foster a clear separation of adaptability from the other concerns in the implementation, only a few modifications are required to make existing MPI applications malleable.
Figure 3: The architecture of the Malleable Runner with DYNACO in the KOALA multicluster scheduler.
5 Designing Support for Malleability in KOALA
In this section, we present our design for supporting malleable applications in KOALA. First, we explain how we include the DYNACO framework into the KOALA multicluster scheduler, and then we present our approaches and policies for managing the execution of malleable applications, respectively.
5.1 Supporting DYNACO applications in KOALA
In order to support DYNACO-based applications in KOALA, we have designed a specific runner called the Malleable Runner (MRunner); its architecture is shown in Figure 3. In the MRunner, the usual control role of the runner over the application is extended in order to handle malleability operations, for which purpose a complete, separate, application-specific instance of DYNACO is included in each copy of the MRunner. A frontend, which is common to all of the runners, interfaces the MRunner to the scheduler. We add a malleability manager to the scheduler, which is responsible for triggering changes of resource allocations.
In the DYNACO framework, the frontend is reflected as a monitor, which generates events as it receives grow and shrink messages from the scheduler. The resulting events are propagated throughout the DYNACO framework and translated into the appropriate messages to GRAM and to the application. In addition, the frontend catches the results of adaptations in order to generate acknowledgments back to the scheduler. It also notifies the scheduler when the application voluntarily shrinks below the number of allocated processors.
Since GRAM is not able to manage malleable jobs, the MRunner manages a malleable job as a collection of GRAM jobs of size 1. Upon growth, DYNACO makes the MRunner submit jobs to GRAM. When it receives active messages as feedback from GRAM, it transmits the new collection of active GRAM jobs (i.e., the collection of held resources) to the application in order to make it use them. In order to reduce the impact of grow messages on the execution time, we make interactions with GRAM overlap with the execution of the application. Suspension of the application (for process management and data redistribution) thus does not occur before all the processors are held. To do so, submissions to GRAM launch an empty stub rather than the application's program. The stub is turned into an application process during the process management phase, when processors are recruited by the application. The latter operation is faster than submitting a job to GRAM, as it is relieved from tasks such as security enforcement and queue management. Conversely, upon shrink, the MRunner first reclaims processors from the application; then, when it receives shrunk feedback messages, it releases the corresponding GRAM jobs. For performance reasons, when handling shrink messages, we let the execution resume before releasing processors to GRAM.
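The grow and shrink paths can be summarized as follows; gram_submit, handle.wait_active, and the other helpers are hypothetical stand-ins for the real GRAM interactions, so this is a sketch of the protocol rather than the MRunner's actual code.

    def grow(held, grow_value, gram_submit, hand_over_to_application):
        """Grow by grow_value processors: one size-1 GRAM job per processor."""
        pending = [gram_submit(program="stub", procs=1) for _ in range(grow_value)]
        for handle in pending:
            handle.wait_active()    # the stub runs; the processor is now held
            held.append(handle)
        # Only now is the application briefly suspended: the stubs become
        # application processes and data is redistributed over the new layout.
        hand_over_to_application(held)

    def shrink(held, shrink_value, reclaim_from_application, gram_release):
        """Shrink by shrink_value processors; the app resumes before release."""
        released = reclaim_from_application(shrink_value)
        for handle in released:     # 'shrunk' feedback received for these
            held.remove(handle)
            gram_release(handle)    # GRAM jobs are released afterwards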
5.2 Job Management
Upon submission of an application to KOALA, whether it is rigid, moldable, or malleable, the initial placement is performed by one of the existing placement policies as described in Section 4.1. In the placement phase of malleable applications, the initial number of processors required is determined considering the number of available processors in the system. Specifically, given a malleable application, the placement policies place it if the number of available processors is at least equal to the minimum processor requirement of the application.
In the job management context, the malleability manager is responsible for initiating the malleability management policies that decide on how to grow or shrink malleable applications. Below, we propose two design choices as to when to initiate malleability management policies, which give Precedence to Running Applications over waiting ones (PRA) or to Waiting Applications over running ones (PWA), respectively.
In the PRA approach, whenever processors become available, for instance, when a job finishes execution, first the running applications are considered. If there are malleable jobs running, one of the malleability management policies is initiated in order to grow them; any waiting malleable jobs are not considered as long as at least one running malleable job can still be grown.
In the PWA approach, when the next job j in the queue cannot be placed, the scheduler applies one of the malleability management policies to shrink running malleable jobs in order to obtain additional processors. If, however, it is impossible to obtain enough available processors to place job j, taking into account the minimum sizes of the running jobs, then the running malleable jobs are considered for growing by one of the malleability management policies. Whenever processors become available, the placement queue is scanned in order to find a job to be placed.
In both approaches, in order to trigger job management, the scheduler periodically polls the KOALA information service. In doing so, the scheduler is able to dynamically take into account the background load due to other users, even if they bypass KOALA. In addition, in order not to stress execution sites when growing malleable jobs, and therefore to always leave a minimal number of available processors to local users, a threshold is set above which KOALA never expands the total set of the jobs it manages.
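The difference between the two approaches can be made concrete with the following sketch of one polling step of the malleability manager; the threshold value and the job/policy interfaces are assumptions made for the illustration.

    GROW_THRESHOLD = 16  # assumed: idle processors always left to local users

    def poll_pra(idle, running_malleable, queue, policy, try_place):
        """PRA: running applications take precedence over waiting ones."""
        spare = idle - GROW_THRESHOLD
        if spare > 0 and any(j.can_grow() for j in running_malleable):
            policy.grow(running_malleable, spare)  # waiting jobs not considered
        else:
            try_place(queue)

    def poll_pwa(idle, running_malleable, queue, policy, try_place):
        """PWA: waiting applications take precedence over running ones."""
        if queue:
            job = queue[0]
            shortage = job.min_procs - idle
            reclaimable = sum(j.size - j.min_procs for j in running_malleable)
            if 0 < shortage <= reclaimable:
                policy.shrink(running_malleable, shortage)  # make room for job
                try_place(queue)
                return
        spare = idle - GROW_THRESHOLD
        if spare > 0:           # the waiting job cannot be helped: grow instead
            policy.grow(running_malleable, spare)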
5.3 Malleability Management Policies
The malleability management policies we describe below determine the means of shrinking and growing malleable jobs during their execution. In this paper, we assume that every application is executed in a single cluster, and so, no co-allocation takes place. Consequently, the policies are applied to each cluster separately.
5.3.1 Favour Previously Started Malleable Applications (FPSMA)
The FPSMA policy favours previously started malleable jobs in that whenever the policy is initiated by the malleability manager, it starts growing from the earliest-started malleable job and starts shrinking from the latest-started malleable job. Figure 4 presents the pseudo-code of the grow and shrink procedures of the FPSMA policy.
In the grow procedure, first, the malleable jobs running on the considered cluster are sorted in increasing order of their start time (line 1); then the number of processors to be allocated on behalf of malleable jobs (growValue) is offered to the next job in the sorted list (line 3). In reply to this offer (the job itself considers its maximum processor requirement), the accepted number of processors is allocated on behalf of that job (lines 4-5). Then growValue is updated, and it is checked whether there are more processors to be offered (lines 6-8).
The shrink procedure runs in a similar fashion; the differences with the grow procedure are that the jobs are sorted in decreasing order of their start time (line 1), and that, rather than being allocated, the accepted numbers of processors are waited for to be released (line 5).
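A compact executable rendering of both procedures, under the simplifying assumption that each job object answers an offer synchronously with the number of processors it accepts:

    def fpsma_grow(jobs, grow_value):
        """Offer processors to malleable jobs, earliest started first."""
        for job in sorted(jobs, key=lambda j: j.start_time):
            accepted = job.offer_grow(grow_value)  # job caps at its maximum
            grow_value -= accepted
            if grow_value == 0:
                break

    def fpsma_shrink(jobs, shrink_value):
        """Reclaim processors from malleable jobs, latest started first."""
        for job in sorted(jobs, key=lambda j: j.start_time, reverse=True):
            accepted = job.offer_shrink(shrink_value)  # job keeps its minimum
            job.wait_released()                        # processors come back
            shrink_value -= accepted
            if shrink_value == 0:
                break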
5.3.2 Equi-Grow & Shrink (EGS)

Our EGS policy attempts to balance processors over the malleable jobs. Figure 5 gives the pseudo-code of the grow and shrink procedures of that policy. When it is initiated by the malleability manager, it distributes available processors (or reclaims needed processors) equally (line 2) over all of the running malleable jobs. In case the number of processors to be distributed or reclaimed is not divisible by the number of running malleable jobs, the remainder is distributed across the least recently started jobs as a bonus, or reclaimed from the most recently started jobs as a malus (line 5).
The EGS policy is similar to the well-known equipartition policy. The two policies, however, differ in the following points. While our EGS policy distributes the available processors equally among the running jobs, the equipartition policy distributes the whole set of processors equally among the running jobs. Consequently, EGS is not expected to make all of the malleable jobs have the same size at all times, while equipartition does. On the other hand, equipartition may mix grow and shrink messages, while EGS consistently either grows or shrinks all of the running jobs.
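The even split with its bonus/malus remainder can be expressed directly with integer division, as in this sketch (the job ordering follows the description above):

    def egs_share(total, jobs_by_start_time, growing):
        """Split 'total' processors evenly over the running malleable jobs.

        The remainder goes as a bonus to the least recently started jobs
        when growing, and as a malus to the most recently started jobs
        when shrinking.
        """
        base, remainder = divmod(total, len(jobs_by_start_time))
        ordered = (jobs_by_start_time if growing             # oldest first
                   else list(reversed(jobs_by_start_time)))  # newest first
        return {job: base + (1 if i < remainder else 0)
                for i, job in enumerate(ordered)}

    # Example: growing by 7 processors over 3 jobs started at t = 0, 10, 20.
    assert egs_share(7, ["j0", "j10", "j20"], growing=True) == \
        {"j0": 3, "j10": 2, "j20": 2}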
procedure FPSMA-GROW(clusterName, growValue)
1. List ← malleable jobs running on clusterName, sorted in increasing order of their start time
2. foreach (Job in List) do
3.     send grow request (growValue) to Job
4.     get accepted number of processors from Job
5.     allocate the accepted number of processors to Job
6.     growValue = growValue - accepted
7.     if growValue == 0 then
8.         break

procedure FPSMA-SHRINK(clusterName, shrinkValue)
1. List ← malleable jobs running on clusterName, sorted in decreasing order of their start time
2. foreach (Job in List) do
3.     send shrink request (shrinkValue) to Job
4.     get accepted number of processors from Job
5.     wait for Job to release the processors
6.     shrinkValue = shrinkValue - accepted
7.     if shrinkValue == 0 then
8.         break

Figure 4: Pseudo-code of the FPSMA policy.
6 Experimental setup
In this section we describe the setup of the experiments we have conducted in order to evaluate the support and the scheduling policies for malleable jobs in KOALA. We present the applications that have been used in Section 6.1, our testbed in Section 6.2, and the details of the workloads in Section 6.3.
6.1 Applications
For the experiments, we rely on two applications that we have previously made malleable with DYNACO. These applications are the NAS Parallel Benchmark FT [29], which is a benchmark for parallel machines based on a fast Fourier transform numerical kernel, and GADGET 2 [25], which is a legacy n-body simulator. Further details on how we have made these two applications malleable can be found in [9]. Figure 6 shows how the execution times of the two applications scale with respect to the number of machines on the Delft cluster (see Table 1). With 2 processors, GADGET 2 takes nearly 10 minutes, while FT lasts almost 2 minutes. The best execution times are 4 minutes for GADGET 2 and 1 minute for FT, respectively.
While GADGET 2 can execute with an arbitrary number of processors, FT only accepts powers of 2. As we have already stated, we propose that the scheduler should not be aware of such constraints, in order to avoid making it implement an exhaustive collection of possible constraints. Consequently, when responding to grow and shrink messages, the FT application accepts only the highest power of 2 processors that does not exceed the allocated number. Additional processors are voluntarily released to the scheduler. In addition, the FT application assumes processors of equal compute power, while GADGET 2 includes a load-balancing mechanism.
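For instance, the number of processors that FT actually uses out of an allocation is the largest power of 2 not exceeding it, which can be computed as follows:

    def ft_usable(allocated: int) -> int:
        """Largest power of 2 that does not exceed the allocated processors."""
        return 1 << (allocated.bit_length() - 1) if allocated > 0 else 0

    assert [ft_usable(n) for n in (2, 3, 7, 8, 9)] == [2, 2, 4, 8, 8]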
6.2 The Testbed
Our testbed is the Distributed ASCI Supercomputer (DAS3) [2], which is a wide-area computer system in the Netherlands that is used for research on parallel, distributed, and grid computing. It consists of five clusters comprising 272 dual AMD Opteron compute nodes in total. The distribution of the nodes over the clusters is given in Table 1. Besides using 1 and 10 Gigabit/s Ethernet, DAS3 employs the novel local high-speed Myri-10G interconnect technology. On each of the DAS clusters, the Sun Grid Engine (SGE) [1] is used as the local resource manager. SGE has been configured to run applications on the nodes in an exclusive fashion, i.e., in space-shared mode.
procedure EGS-GROW(clusterName, growValue)
1. List ← malleable jobs running on clusterName, sorted in increasing order of their start time
2. jobGrowValue ← ⌊growValue / size(List)⌋
3. growRemainder ← remainder(growValue, size(List))
4. foreach (i-th Job in List) do
5.     bonus ← 1 if i ≤ growRemainder; 0 otherwise
6.     send grow request (jobGrowValue + bonus) to Job
7.     get accepted number of processors from Job
8. foreach (Job in List) do
9.     allocate the accepted number of processors to Job

procedure EGS-SHRINK(clusterName, shrinkValue)
1. List ← malleable jobs running on clusterName, sorted in decreasing order of their start time
2. jobShrinkValue ← ⌊shrinkValue / size(List)⌋
3. shrinkRemainder ← remainder(shrinkValue, size(List))
4. foreach (i-th Job in List) do
5.     malus ← 1 if i ≤ shrinkRemainder; 0 otherwise
6.     send shrink request (jobShrinkValue + malus) to Job
7.     get accepted number of processors from Job
8. foreach (Job in List) do
9.     wait for Job to release the processors

Figure 5: Pseudo-code of the EGS policy.
Figure 6: The execution times of the two applications (FT and GADGET 2) depending on the number of machines used.
6.3 The Workloads
The workloads that we employ in our experiments combine the two applications of Section 6.1 with a uniform distribution. The minimum job size is set to 2 processors, while the maximum size is 46 for GADGET 2 and 32 for FT. In both cases, 300 jobs are submitted. Jobs are submitted from a single client site; no staging operation is ordered, even when processors are allocated at remote sites.
For the PRA-based experiments, we have used the two following workloads. Workload Wm is composed exclusively of malleable jobs, while workload Wmr is randomly composed of 50% malleable jobs and 50% rigid jobs. In both cases, the inter-arrival time is 2 minutes. Rigid jobs are submitted with a size of 2 processors, and malleable jobs with an initial size of 2. In our experiments, KOALA employs the WF policy. Apart from workload Wm or Wmr, the only background load during the experiments is the activity of concurrent users. This background load does not disturb the measurements.
When analysing the PWA approach, we have used two workloads W′m and W′mr, which derive from Wm and Wmr, respectively. In these workloads, the inter-arrival time is reduced to 30 seconds in order to increase the load on the system.
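The following sketch shows how such workloads can be generated; the parameters mirror the description above, while the job record layout and the random seed are illustrative assumptions.

    import random

    def make_workload(n_jobs=300, interarrival=120, malleable_fraction=1.0,
                      seed=0):
        """Generate a Wm-style workload (Wmr-style with fraction 0.5)."""
        rng = random.Random(seed)
        max_size = {"GADGET2": 46, "FT": 32}
        jobs = []
        for i in range(n_jobs):
            app = rng.choice(["GADGET2", "FT"])    # uniform over the two apps
            malleable = rng.random() < malleable_fraction
            jobs.append({
                "submit_time": i * interarrival,   # fixed inter-arrival time
                "app": app,
                "malleable": malleable,
                "size": 2,                         # initial (or rigid) size
                "min": 2,
                "max": max_size[app] if malleable else 2,
            })
        return jobs

    w_m = make_workload()                              # PRA: all malleable
    w_mr = make_workload(malleable_fraction=0.5)       # PRA: 50% rigid
    w_m_prime = make_workload(interarrival=30)         # PWA: higher load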
Table 1: The distribution of the nodes over the DAS3 clusters.

Cluster              Nodes   Interconnect
Vrije University      85     Myri-10G & 1/10 GbE
U. of Amsterdam       41     Myri-10G & 1/10 GbE
Delft University      68     1/10 GbE
MultimediaN           46     Myri-10G & 1/10 GbE
Leiden University     32     Myri-10G & 1/10 GbE
7 Experimental Results

Figure 7: Comparison between FPSMA and EGS with the PRA approach of job management (no shrinking). Panels: (a) the cumulative distribution of the number of processors per job averaged over the execution time of jobs; (b) the cumulative distribution of the maximal number of processors reached per job during its execution; (c) the cumulative distribution of the job execution times; (d) the cumulative distribution of the job response times; (e) utilization of the platform during the experiment; (f) activity of the malleability manager. Each panel shows the four combinations FPSMA/Wm, FPSMA/Wmr, EGS/Wm, and EGS/Wmr.
Figure 8: Comparison between FPSMA and EGS with the PWA approach of job management (both growing and shrinking). Panels: (a) the cumulative distribution of the number of processors per job averaged over the execution time of jobs; (b) the cumulative distribution of the maximal number of processors reached per job during its execution; (c) the cumulative distribution of the job execution times; (d) the cumulative distribution of the job response times; (e) utilization of the platform during the experiment; (f) activity of the malleability manager. Each panel shows the four combinations FPSMA/W′m, FPSMA/W′mr, EGS/W′m, and EGS/W′mr.
Figures 8(c) and 8(d) show that despite the fact that the execution time is almost the same for the four executions, the response time is far worse for the combination of the EGS policy and the W′m workload due to a higher wait time. This results from the high utilization that we observe in Figure 8(e) for this policy.
8 Conclusion
In this paper we have presented the design, and the analysis with experiments with the KOALA scheduler in our DAS3 testbed, of the support and policies for malleability of parallel applications in multicluster systems. Our experimental results show that malleability is indeed beneficial for performance. In our design, we have not included grow operations that are initiated by the applications. This feature is mainly useful in case the parallelism pattern is irregular, unlike with our applications. Designing it is, however, not straightforward. For instance, it raises the design choice whether such grow operations can be mandatory or only voluntary, and how much effort the scheduler should make to accommodate mandatory grow operations, for instance by shrinking concurrent malleable jobs. Another element that we have not incorporated in our design and implementation, but that we intend to add, is malleability of co-allocated applications.
Acknowledgment
This work was carried out in the context of the Virtual Laboratory for e-Science project (www.vl-e.nl), which is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W), and which is part of the ICT innovation program of the Dutch Ministry of Economic Affairs (EZ). Part of this work is also carried out under the FP6 Network of Excellence CoreGRID funded by the European Commission (Contract IST-2002-004265).
References
[1] Sun Grid Engine. http://gridengine.sunsource.net.
[2] The Distributed ASCI Supercomputer. http://www.cs.vu.nl/das3/.
[3] Marco Aldinucci, Françoise André, Jérémy Buisson, Sonia Campa, Massimo Coppola, Marco Danelutto, and Corrado Zoccolo. An abstract schema modelling adaptivity management. In Sergei Gorlatch and Marco Danelutto, editors, Integrated Research in GRID Computing. Springer, 2007. Proceedings of the CoreGRID Integration Workshop 2005.
[4] Marco Aldinucci, Massimo Coppola, Marco Danelutto, Marco Vanneschi, and Corrado Zoccolo. ASSIST as a research framework for high-performance grid programming environments. Technical Report TR-04-09, Università di Pisa, Dipartimento di Informatica, February 2004.
[5] Francine Berman, Richard Wolski, Henri Casanova, Walfredo Cirne, Holly Dail, Marcio Faerman, Silvia Figueira, Jim Hayes, Graziano Obertelli, Jennifer Schopf, Gary Shao, Shava Smallen, Neil Spring, Alan Su, and Dmitrii Zagorodnov. Adaptive computing on the grid using AppLeS. IEEE Trans. Parallel Distrib. Syst., 14(4):369–382, April 2003.
[6] Jacek Blazewicz, Mikhail Kovalyov, Maciej Machowiak, Denis Trystram, and Jan Weglarz. Preemptable malleable task scheduling problem. IEEE Transactions on Computers, 55(4):486–490, April 2006. Brief contribution.
[7] Jacek Blazewicz, Maciej Machowiak, Gregory Mounié, and Denis Trystram. Approximation algorithms for scheduling independent malleable tasks. In 7th International Euro-Par Conference, pages 191–197, Manchester, UK, August 2001.
[8] A. I. D. Bucur and D. H. J. Epema. The maximal utilization of processor co-allocation in multicluster systems. In IPDPS '03: Proceedings of the 17th International Symposium on Parallel and Distributed Processing, page 60.1, Washington, DC, USA, 2003. IEEE Computer Society.
[9] Jérémy Buisson, Françoise André, and Jean-Louis Pazat. A framework for dynamic adaptation of parallel components. In International Conference ParCo, pages 65–72, Málaga, Spain, September 2005.
[10] Jérémy Buisson, Françoise André, and Jean-Louis Pazat. AFPAC: Enforcing consistency during the adaptation of a parallel component. Scalable Computing: Practice and Experience (SCPE), 7(3):83–95, September 2006.
[11] Travis Desell, Kaoutar El Maghraoui, and Carlos Varela. Malleable components for scalable high performance computing. In HPC Grid Programming Environment and Components, Paris, France, June 2006.
[12] Dror G. Feitelson and Larry Rudolph. Toward convergence in job schedulers for parallel supercomputers. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 1–26. Springer-Verlag, 1996.
[13] Ian Foster and Carl Kesselman. Globus: a metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997.
[14] Jan Hungershöfer. On the combined scheduling of malleable and rigid jobs. In 16th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 206–213, Foz do Iguaçu, Brazil, October 2004.
[15] Jan Hungershöfer, Achim Streit, and Jens-Michael Wierum. Efficient resource management for malleable applications. Technical Report TR-003-01, Paderborn Center for Parallel Computing, December 2001.
[16] Alexandru Iosup, Dick Epema, Carsten Franke, Alexander Papaspyrou, Lars Schley, Baiyi Song, and Ramin Yahyapour. On grid performance evaluation using synthetic workloads. In E. Frachtenberg and U. Schwiegelshohn, editors, Proc. of the 12th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Lecture Notes in Computer Science. Springer-Verlag, June 2006.
[17] Laxmikant Kalé, Sameer Kumar, and Jayant DeSouza. A malleable-job system for timeshared parallel machines. In 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID), page 230, Berlin, Germany, May 2002.
[18] Jeffrey O. Kephart and David M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, January 2003.
[19] Cathy McCann and John Zahorjan. Processor allocation policies for message-passing parallel computers. In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 19–32, Nashville, Tennessee, USA, May 1994.
[20] Hashim Mohamed and Dick Epema. An evaluation of the close-to-files processor and data co-allocation policy in multiclusters. In 2004 IEEE International Conference on Cluster Computing, pages 287–298. IEEE Society Press, 2004.
[21] Hashim Mohamed and Dick Epema. The design and implementation of the KOALA co-allocating grid scheduler. In European Grid Conference, volume 3470 of Lecture Notes in Computer Science, pages 640–650. Springer-Verlag, 2005.
[22] Hashim Mohamed and Dick Epema. Experiences with the KOALA co-allocating scheduler in multiclusters. In Proceedings of the International Symposium on Cluster Computing and the Grid, Cardiff, UK, May 2005.
[23] Gregory Mounié, Christophe Rapine, and Denis Trystram. Efficient approximation algorithms for scheduling malleable tasks. In 11th ACM Symposium on Parallel Algorithms and Architectures, pages 23–32, Saint-Malo, France, June 1999.
[24] Ozan Sonmez, Hashim Mohamed, and Dick Epema. Communication-aware job placement policies for the KOALA grid scheduler. In E-SCIENCE '06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, page 79, Washington, DC, USA, 2006. IEEE Computer Society.
[25] Volker Springel. The cosmological simulation code GADGET-2. Monthly Notices of the Royal Astronomical Society, 2005. Submitted; astro-ph/0505010.
[26] Gladys Utrera, Julita Corbalán, and Jesús Labarta. Implementing malleability on MPI jobs. In 13th International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 215–224, Antibes, France, September 2004.
[27] Sathish Vadhiyar and Jack Dongarra. SRS: a framework for developing malleable and migratable parallel applications for distributed systems. Parallel Processing Letters, 13(2):291–312, 2003.
[28] Sathish Vadhiyar and Jack Dongarra. Self adaptability in grid computing. Concurrency and Computation: Practice and Experience, 17(2-4):235–257, February 2005.
[29] Rob F. van der Wijngaart. NAS Parallel Benchmarks version 2.4. Technical Report NAS-02-007, NASA Advanced Supercomputing Division, October 2002.