ITUniversityPhilippeBonnet
Copenhagen,RuedLangaardofCopenhagen
Denmark
Vej7phbo@itu.dk
ABSTRACT
Whilediskshaveofferedastablebehaviorfordecades-thusguaranteeingthetimelessnessofmanydatabasede-signdecisions,flashdeviceskeeponmutating.Theirbe-haviorvariesacrossmodelsandacrossfirmwareupdatesforthesamemodel.Manyresearchershaveproposedtoadaptdatabasealgorithmsforexistingflashdevices;othershavetriedtocapturetheperformancecharacteristicsofflashde-vices.However,today,weneitherhaveareferenceDBMSdesignnoraperformancemodelforflashdevices:databaseresearchersarerunningafterflashmemorytechnology.Inthispaper,wetakethereverseapproachandwedefinehowflashdevicesshouldsupportdatabasemanagement.Wead-vocatethatflashdevicesshouldprovideDBMSwithmorecontroloverIObehaviorwithoutsacrificingcorrectnessorrobustness.Weintroducethenotionofbimodalflashde-vicesthatexposethefullpotentialoftheunderlyingflashchipsaslongasthesubmittedIOsrespectafewwell-definedconstraints.Wesuggesttwoapproachesforimplementingbimodalflashdevices:(a)basedonthenarrowblockde-viceinterface,or(b)basedonarichinterfacethatallowsaDBMStoexplicitlycontrolIObehavior.Webelievethattheseapproachesarenaturalevolutionsofthecurrentgen-erationofflashdevices,whosecomplexityandopacityisill-suitedfordatabasemanagement.Wediscusshowbimodalflashdeviceswouldbenefitmanyexistingtechniquespro-posedbythedatabaseresearchcommunity,andidentifyasetofnewresearchissues.
1.INTRODUCTION
Forsometimenow,flashdeviceshavebeenpoisedtore-placedisksassecondarystorage[12].Today,manydifferenttypesofflashdevicesarefindingtheirwayintothememoryhierarchyofdatabasemanagementsystems(DBMS),fromSSDtoPCI-basedracks(e.g.,fusionIOandRamSan)andenergyefficientFAWNs[5].However,despitesignificantef-forts[2,9,8,17,21,29,31,20,32],areferencedesignfor
ThisarticleispublishedunderaCreativeCommonsAttributionLicense(http://creativecommons.org/licenses/by/3.0/),whichpermitsdistributionandreproductioninanymediumaswellallowingderivativeworks,pro-videdthatyouattributetheoriginalworktotheauthor(s)andCIDR2011.5thBiennialConferenceonInnovativeDataSystemsResearch(CIDR’11)January9-12,2011,Asilomar,California,USA.
INRIAParis-RocquencourtLucBouganim
DomaineLeChesnay,deVoluceau&PRiSM/UVSQ
France
Luc.Bouganim@inria.fr
databasemanagementwithflashdeviceshasyettoemerge1.Indeed,flashdevicesdonotexhibitconsistentcharacter-istics.TheyembedacomplexsoftwarecalledFlashTrans-lationLayer(FTL)inordertohideflashchipconstraints(erase-before-write,limitednumberoferase-writecycles,se-quentialpage-writeswithinaflashblock).AFTLprovidesaddresstranslation,wearlevelingandstrivestohidetheimpactofupdatesandrandomwritesbasedonobservedupdatefrequencies,accesspatterns,temporallocality,etc.Theirperformancecharacteristicsandenergyprofilesvaryacrossdevices[9,8].Forinstance,randomwritesarefasterthanreadsonFusionIO’sioDrive[7]whilerandomwritesaremuchslowerthantheotheroperationsontheSamsungmodel[9].Forsomedevices,performancevariesintimebasedonthehistoryofIOs,e.g.,theperformanceoftheIntelX25-Mvariesbyanorderofmagnitudedependingonwhetherthedeviceisfilledwithrandomwritesornot.WhatisthevalueofaDBMSdesignbasedonastoragesubsys-temwhosebehaviorisnotwellunderstoodandkeepsonmutating?
Bycontrast,successivegenerationsofdiskshavecompliedwithtwosimpleaxioms:(1)localityinthelogicaladdressspaceispreservedinthephysicaladdressspace;(2)sequen-tialaccessismuchfasterthanrandomaccess.Aslongasharddisksremainedthesolemediumforsecondarystorage,theblockdeviceinterfaceprovedtobeaveryrobustabstrac-tionthatallowedtheoperatingsystemtohidethecomplex-ityofIOmanagementwithoutsacrificingperformance.Theblockdeviceinterfaceisasimplememoryabstractionbasedonreadandwriteprimitivesandaflatlogicaladdressspace(i.e.,anarrayofsectors).SincetheadventofUnix[30],thestabilityoftheinterfaceandthestabilityofdiskscharacter-isticshaveguaranteedthetimelessnessofmajordatabasesystemdesigndecisions,i.e.,pagesaretheunitofIOwithanidenticalrepresentationofdataon-diskandin-memory;randomaccessesareavoided(e.g.,queryprocessingalgo-rithms)whilesequentialaccessesarefavored(e.g.,extent-basedallocation,clustering).
Wemustaddressthetensionthatexistsbetweenthede-signgoalsofflashdevicesandDBMS.Flashdevicedesign-ers,especiallySSDandPCI-basedracksdesigners,aimathidingtheconstraintsofflashchipstocompetewithharddisksproviders.Theyalsocompetewitheachother,tweak-ingtheirFTLtoimproveoverallperformance,andmaskingtheirdesigndecisiontoprotecttheiradvantage.Databasesystems,ontheotherhand,controltheIOstheyissue.1
directWedoaccessnotconsidertotheflashinthischips,papere.g.,architecturesembeddedflashproviding[4]
1
Whatdatabasesystemsdesignersneedisaclearandsta-bledistinctionbetweenefficientandinefficientIOpatterns,sothattheycanadapttheirallocationstrategies,datarep-resentationorqueryprocessingalgorithmstothecharacter-isticsoftheunderlyingstoragedevices.Theymightevenbeabletotradeincreasedcomplexityforimprovedperformanceandstablebehavioracrossdevices.
Sotheproblemisthefollowing:HowcanflashdevicesprovideDBMSwithguaranteesoverIObehavior?Inter-estingly,flashchipsalreadyprovideaclear,stabledistinc-tionbetweenefficientpatterns(pagereads,sequentialpage-writeswithinablock),andinefficientpatterns(in-placeup-dates).Ourkeyinsightisthatflashdevicesshouldexposethisdistinctioninsteadofaggressivelymitigatingtheim-pactofinefficientpatternsattheexpenseoftheefficientones(e.g.,tradingreducedreadperformancetoobtainim-provedrandomwrites).Inthispaper,weintroducethenotionofbimodalflashdevicethatexposethisdichotomytotheupperlayers,thusprovidingefficientandpredictableIObehaviorfordatabasesystems.Intermsofdesign,weseetwoapproaches:
1.NarrowInterfaceDevice:Themostimmediateap-proachistokeeptheexistingblockdeviceinterface.Sinceflashdeviceshavenoknowledgeofthemanipu-lateddata,weshould(a)lettheDBMSoptimizeitsaccessesand,(b)avoiduncontrolledFTLoptimiza-tions.Thisapproachissimilar,inspirit,tothewayDBMSsinteractwithvirtualmemory[30].Plagiariz-ingStonebraker,thisconsistsinreplacinga“notquiteright”serviceprovidedbytheflashdevicewithacom-parable,applicationspecific,servicewithintheDBMS.Thekeyquestionshereare:WhatabstractionsshouldtheFTLprovide?Howtohandletheincreasedcom-plexityattheDBMSlevel?2.RichInterfaceDeviceAnalternativeapproachwouldbetorelyonarichinterfacetoletDBMSandflashdevicecollaborateonhowtooptimizeperformances.Indeed,thereisanemergingconsensusthattheblockinterfaceistoonarrow[28,22,23,25,26].Copingwiththeblockinterfaceforcesflashdevicestoperformcomplextasks(i.e.,wearleveling,garbagecollection)independentlyfromtheapplication,possiblyagainstitsbestinterest.Thekeyquestionshereare:WhatinformationshouldbepassedfromtheDBMStotheflashdevicesothatitcanoptimizeitsperformance2?Whatkindofoptimizationscanbedefinedonflashdevices?HowshouldwedesignDBMSstoleveragearichflashdeviceinterface?WeseeNarrowandRichbimodaldevicesasnaturalevo-lutionsofthecurrentgenerationofflashdeviceswhosecom-plexityandopacityisill-suitedfordatabasemanagement.Suchanevolutionisparticularlyimportantfordatabasema-chinesdesigners(e.g.,Oracle’sExadataorNeteeza’sTwin-Fin)thathavetospecifywell-suitedflashcomponentsfortheirsystems.Moregenerally,weunderstandthatSSDmanufacturerswillonlymoveifthegainisclearfortheir2
integratingNotethatwefocusonIOdevices,urallyleadse.g.,high-leveltoactiveactivedisksdatabaseperformance;wedonotconsiderdisks[24].abstractionswithinstorageisaWhethertopicforafuturerichinterfacework.
nat-business.Weseehereanopportunityforthedatabasecom-munitytoinfluencetheevolutionofflashdevicesforthebenefitsofcommoditydatabasesystems.
Inthispaper,wedefinetheguaranteesthataFTLshouldprovidetoaDBMSandderivethenotionofbimodalflashdevice,weoutlinethedesignofNarrowandRichbimodalflashdevicesandweexplorehowtheywillimpactdatabasemanagement.Throughoutthepaper,weillustratehowideasexpressedintheliteraturewouldbenefitfromthesenewclassesofdevices.Wealsodescribenewresearchchallenges.
2.FLASHDEVICES
Attheircore,flashdevicesrelyonNANDflashchipsthatstoredatainindependentarraysofmemorycells.Eachcellaccommodates1,2,or3bitsofinformation(SLC,MLC,TLC).Eacharrayisaflashblock,androwsofmemorycellsareflashpages3.
TheGood.Asingleflashchipcanoffergreatperfor-mance(e.g.,40MB/sReads,10MB/swrite)withlowen-ergyconsumption[8].Thus,tensofflashchipswiredinparallelcandeliverhundredsofthousandsIOspersecond.Atthechiplevel,randomoperationsareasfastassequen-tialones.Recentflashchipscaninterleaveoperationsandincludemultipleindependentplanes,thusprocessingoper-ationsconcurrently[3].Aflashdeviceiscomposedofacollectionofflashchips,wiredinparalleltoacontroller.Thecontrollerincludessomecache(e.g.16-32MB),poten-tiallysafewithrespecttopowerfailure[3]—e.g.,cachecanbeRAMwithcapacitorsorotherNVM(eg.PCM).Fromthisperspective,thepotentialofflashdeviceisimpressive.TheBad.Unfortunately,flashchipshaveseverecon-straints:(C1)Writegranularity.Writesmustbeperformedatapagegranularity4.(C2)Erasebeforewrite.Acostlyeraseoperationmustbeperformedbeforeoverwritingaflashpage.Evenworse,eraseoperationsareonlyperformedatthegranularityofaflashblock(typicallyflashpages).(C3)Sequentialwriteswithinablock.Writesmustbeper-formedsequentiallywithinaflashblockinordertominimizewriteerrorsresultingfromtheelectricalsideeffectsofwrit-ingaseriesofcells5.(C4)Limitedlifetime.SLC,MLCandTLCflashchipscansupportrespectivelyupto106,105,5×104eraseoperationsperflashblock.Thetrendisthatflashchipsstoremorebitspercell(e.g.,TLC)withasmallerprocessgeometry(e.g.,25nm),largerpagesize,largernum-berofpagebyblocksandsmallerlifetime.NoneoftheseevolutionschallengethenatureofC1-C4.
AndtheFTL.Thecontrollerembedstheso-calledFlashTranslationLayer(FTL)inordertohidetheaforementionedconstraints.Typically,theFTLimplementsout-of-placeup-datestohandleC2usingsomereservedflashblockscalledlogblocks.Eachupdateleaves,however,anobsoleteflashpage(thatcontainsthebeforeimage).Overtimesuchobsoleteflashpagesaccumulate,andmustbereclaimedbyagarbagecollector.Amappingbetweenthelogicaladdressspaceex-posedbytheFTLandthephysicalflashspaceisnecessary3
4Flashpagesmayfurtherbebrokenupintoflashsub-pagesforWhilesmallerMLCtheNANDwrite(nogranularitysub-pages),isthethepagereadone(4KB-8KB)cansector[31].Actually,readgranularitydependsonthebe5
size(512B-1KB).ECCmanagedElectricsidelevel.
witheffectserrorcorrectionmaygeneratecodeswrite(ECC)errorsattheeffectivelyhardware2
tohandlewritessmallerthanaflashpage(C1),updates(C2),randomwrites(C3),andtosupportwearlevelingtech-niques(C4),whichdistributeeraseoperationsacrossflashblocksandmaskbadblocks.Thismappingisimplementedbasedonamappingtablelocatedinthecontrollercache,onflashorboth[13].Pagemapping,withamappingentryforeachflashpagegenerateslargemapsthatdonotfitinthecontrollercacheforlargecapacitydevices.Blockmappingreducesdrasticallythemappingtabletooneentryperflashblock[15]—thechallengeisthentominimizetheoverheadforfindingapagewithinablock.Thus,blockmappingdoesnotsupportrandomwritesandupdatesefficiently.Morere-cently,manyHybridmappingtechniques[19,18,13,15]havebeenproposedthatcombineblockmappingasabaselineandpagemappingforlogblocks.Besidesmapping,existingFTLalgorithmsalsodifferontheirwearlevelingalgorithms[10],aswellastheirlogblocksmanagementandgarbagecollec-tionmethods(blockassociative[16],fullyassociative[19],usingdetectedpatterns[18]ortemporallocality[15]).
3.TUNNELINGDBMSIO
OuroverallproblemistoresolvethetensionthatexistsbetweenFTLandDBMSdesigngoals.FromthepointofviewofaDBMSdesigner,thetrivialsolutionistotaketheFTLoutoftheequationandletaDBMSdirectlyaccessflashchips.Obviously,thefirststepthatDBMSdesignerswouldtakeistointroducemodularitytohidethecomplexityofdealingwithflashchips,ineffectre-introducingsomeformofFTL.So,theinterestingquestionisnothowtobypasstheFTL,butwhatkindofFTLcanDBMSdesignersrelyon?
3.1MinimalFTL
Tostartwith,letusdefinetheminimallevelofservicethataFTLshouldprovide—becauseaDBMScannot.TheideaisthataminimalFTLwouldgivemaximalcontrolandmaximalstabilitytotheDBMS.Letuslookbackattheconstraintsimposedbyflashchips.First,aDBMScantriviallyissueIOsatthegranularityofaflashpage(C1).ThiswouldrequiretheFTLtoadvertisehowflashpagesshouldbealignedatthelogicallevel.Second,aDBMScouldrelyontheTrimcommand6totellaflashdevicetoeraseaflashblockbeforeitiswritten(C2).ThiswouldrequiretheFTLtoexposeanabstractionofflashblocksatitsinterface.Third,aDBMScouldwriteflashpagesinsequencewithinflashblocks(C3).ThiswouldrequiretheFTLtoadvertisehowflashblocksarealigned,andhowflashpagesaremappedintoflashblocks.Fourth,aDBMScannotimplementwear-leveling(C4).ItmustbeprovidedbytheFTL.Indeed,wear-levelingisabsolutelynecessarytoguar-anteethelifetimeofadevice.Flashdevicemanufacturerswillneverreleaseaproductthatcanwear-outaftersomeminutesoffocusedandintensivewrite/erasecycles.
OnecanthusimaginebuildingflashdeviceswithanFTLthatimplementswearlevelingandprovidesabstractionsforflashpages(alogicalpage)andflashblocks(alogicalblock).SuchaFTLwouldrelyonablocklevelmapwhichiscom-pactenoughtobeefficientlykeptintheflashdevicesafe6
faceTherangestandardTrimcommandofLBAs[28]aretohasbeenintroducedintheATAinter-nocommunicatelongerusedbytoanaflashapplication
devicethataFigure1:MinimalFTLdoesnotsupportupdatesorrandomwritesbutoffersoptimalperformanceforIOpatternsrespectingC1,C2andC3.
cache(e.g.,16MBcachefor1TBof256KBflashblocks)7WithsuchaFTL,theDBMSwouldhavetohandlecon-straintsC1-C3—i.e.,theDBMScouldnotsubmitIOsforin-placeupdatesorrandomwrites.Ontheotherhand,theFTLwouldguaranteethattheIOsthatrespectthesecon-straintswouldbetunneledtotheunderlyingflashchipsasdirectlyaspossible(seeFigure1).
Wouldflashdevicemanufacturersbeinterestedinpro-vidingsuchFTLs?Probablynot.CouldDBMSdesigneralwayscircumventrandomwritesonflashdevices?Maybeinsomecases(e.g.ifflashdevicesareonlyusedforim-mutabledatasets),butdefinitelynotforgeneral-purposedatabases.
3.2BimodalFTLs
ToresolvethetensionbetweenFTLandDBMSdesigngoals,weproposeabimodalFTLwhichachievesoptimalperformanceaslongastheDBMSmanagesconstraintsC1-C3,whileprovidingbesteffortperformanceforallotherIOpatterns8.Interferencebetweenthesetwomodesofopera-tionsshouldbeminimized(seeFigure2).
Whileflashdevicedesignershavefocusedonefficientlyenforcingtheflashchipconstraintsforupdatesorrandomwrites,therehasnotbeenmuchworkoncharacterizingop-timalperformanceforaflashdevice.Moreprecisely,whatisthemostoptimalmappingthataFTLcanachieve?WehaveseenthataFTLmustatleastimplementaformofblockmappingtosupportwear-leveling.Amappingisthusoptimalif(a)theblocklook-upisperformedinthecon-trollercache,and(b)theoffsetofthepagewithintheblockisderivedfromthelogicaladdress(i.e.,consecutivelogi-caladdressesarewrittensequentiallywithinablock).Inthefollowing,aflashblockforwhichmappingisoptimaliscalledanoptimalblock.
AbimodalFTLmustprovideoptimalmappingforthose7
onOtherwise,8
flash(aswasthetheblockcasemappinginearlyhasblocktomappingbepartiallyFTLswritten[11]).andAbimodalFTLmustimplementaformofwear-levelingresult,garbageequalaDBMScollection.Thisrequiressomework,andasawhenthetotheFTLthroughputcanneverguaranteesofobtainanIOthroughputstrictlyoptimaltheunderlyingmapping.
flashchipseven3
Figure2:BimodalFTLthatcombinesnear-optimalperformanceforIOaslongastheDBMSrespectsconstraintsC1-C3,andbesteffortperformanceforotherIOpatternswheretheFTLmustenforcetheseconstraints.
logicalblocksforwhichtheDBMSguaranteesthatwritesareperformedsequentially(C3)atthegranularityofaflashpage(C1),whileanyupdateisprecededbyaTrimcommand(C2).Theotherlogicalblocks—onwhichallIOpatternsareallowed—aremappedtooneormorephysicalflashblocks,dependingonthealgorithmsusedbytheFTLtomanageconstraintsC1-C3.ThedesignoftheFTLoptimizationsfornon-optimalblocksisnotinthescopeofthispaper.Stateofthearttechniquescanbeused[16,18,19,15,13]aslongastheydonotdirectlyinterferewithoptimalblocks.WearguethefeasibilityofthisapproachinSections4and5.Obviouslysequentialwriteswillresultinanoptimalmap-ping.Interestingly,semi-randomwrite,introducedin[21],i.e.,randomwriteIOsdoneinsuchawaythattheycanbemappedsequentiallyondifferentlogicalblocks,willalsoresultinanoptimalmapping9.Sequentialreadsandran-domreadsonoptimalblocksbenefitequallyfromanoptimalmapping.Aninterestingsideeffectisthatanoptimalblockneverneedstobegarbagecollectedasthesequenceofwritesinanoptimalmappingdoesnotcreateobsoletespacetobereclaimed.Fornon-optimalblocks,thegoalofthegarbagecollectormustbetotendtowardsoptimalmapping.
3.3ImpactonDBMSdesign
BimodalflashdevicesprovideastableandoptimalbasisforDBMSdesign.Evenifmoreworkisneededtoinvestigatetheactualimpactofourapproach,wecanalreadymakesomepreliminaryobservations
Existingworkfocusedonmakingdatabasewritessequen-tial(e.g.,AppendandPack[29],ReSSD[20],NIPU[32])istodayrestrictedtorepairingthebadrandomwriteperfor-manceoflow-endSSDs.SuchatechniquewouldberequiredtoleveragetheoptimalperformanceofsequentialwritesincaseofabimodalFTL.Also,techniquesdesignedforflashchips(e.g.,Lazyadaptivetrees[2],orPage-differentialLog-ging[17]),whicharenottodayadequateinthecontextofSSDs,couldnaturallybeappliedtobimodalFTLssince9
hashingForinstance,approach,arelation)fillinginparallelseveralflashbuckets(fore.g.degreeofparallelismitisthewillnumberresultinsemi-randomwrites.Inourintheofsemi-randomlogicalblockswrites.
thatlimitsthetheywouldgenerateonlyoptimalblocks.Othertechniquesbasedonhashingorsorting(e.g.,onlinemaintenanceofverylargerandomsamples[21])cantodayonlybeappliedtothoseSSDSthatsupport(alimitedformof)semi-randomwrites.Suchtechniquesaswellashash-joinorsort-mergewouldclearlybenefitfromtheperformanceofsemi-randomwritesonabimodalFTLaslongasbuketsarealignedonblockboundaries.Finally,theFlashScanandFlashJoinal-gorithmsproposedin[31],thataggressivelymakeuseofran-domreadIOsofsmallgranularitiesaretodayratherwellsupportedincurrentSSDs,evenifrandomreadsareslowerthansequentialreadsonmanySSDs.AbimodalFTLcouldensureoptimalperformancewithrandomreadsasperfor-mantassequentialreads.
WhileitisimpossibletodevelopaperformancemodelofexistingSSDs(becauseoftheiropacityandcomplexity),itispossibletoenvisagebothanalyticmodelsandsimula-tionmodelsofbimodalflashdevices,inordertoexploretheDBMSdesignspace.
4.NARROWBIMODALFLASHDEVICES
Now,letusfocusonhowwecandesignbimodalflashdevices.Basically,thequestionisthefollowing:Isitpos-sibletoimplementabimodalFTLwithoutviolatingtheconstraintsofablockdeviceinterface,i.e.,fixedsizeIOs,flatnamespaceoflogicaladdresses,interfacereducedtoread/write/trimcommands?Thisquestionboilsdownto(a)Howtorepresentlogicalblocksandpageswithablockde-viceinterface,(b)HowcantheFTLdetectthattheDBMSissubmittingIOsthatrespectconstraintsC1-C3,and(c)Howtoguaranteeoptimalmappinginthiscase?
4.1BimodalDesign
Weproposeanobviousimplicitandimmutableschemeforassociatinglogicaladdressestologicalblocksandpages.Theflashdevicemustexposetwoconstants10:LogicalBlockSizeorLBSandLogicalPageSizeorLPS.LBS(resp.LPS)correspondtothesize,inbytesofaflashblock(resp.aflashpage)11.ForalogicaladdressA,thelogicalblocknumberLBNandlogicalpagenumberLPNareobtainedusingthefollowingtrivialformulas:LBN=A/LBSandLPN=(A−LBN×LBS)/LPS,where/istheintegerdivision.
WhilepagesizeIOsaresubmittedsequentiallywithinalogicalblock(startingfrompage0),flashpagesarewrittenintheordertheIOsaresubmitted.Thisway,thelayoutofflashpageswithinaflashblockisoptimal,andeachread(eithersequentialorrandom)canbetriviallymappedtoanoffsetwithinaflashblock.Thisisdetectedbymaintaininginthesafecache,foreachflashblock,thephysicalpositionofthelastwritewithintheblock.Abitindicateswhetherthemappingisoptimalornot(initially,freeblocksareoptimal).Wethusmaintaininthecontrollercacheamappingwith4bytesperblock(22bitsforphysicalblockid+6-8bitsforcurrentpositionwithinblock(-256pagesperblocks)+1bitflagfortheoptimalmapping).
Anoptimalblockmightbecomenon-optimalifthesub-mittedIOsarenolongersequential(e.g.,updates,unaligned
10
SuchconstantscanberetrievedbytheDBMSusingspecific11
commandlikeGetDriveGeometryLBSParallelismwithintheflashdevicemightleadtodefinetheflash(resp.pageLPsize.
S)asamultipleoftheflashblocksize(resp.4
Figure3:Thestatesofalogicalblockandcorre-spondingactions(transitions).
IOsacrosspageandblockboundaries,randomIOs).Con-versely,classicalgarbagecollectoralgorithms[19,15],trig-geredincaseofupdatesorrandomwrites,willtendtocon-vertnon-optimalblocksintooptimalones.Again,manyexistingtechniquescanbeusedtomitigatetheeffectsofrandomwritesorupdatesonnon-optimalblocks[16,18,19,15,13].Figure3schematizethebehaviorofasinglelogicalblockinabimodalFTL.
TheobviousquestionnowiswhetheranyofthecurrentlyavailableflashdevicesimplementsuchaFTL.
4.2AreExistingFlashDevicesBimodal?
Wecanalmostanswerthisquestionbylookingatthedatasheetsofexistingdevices.First,nodeviceexplicitlyprovidesanotionoflogicalblock–blocksareabstractedawayatthenarrowinterface.Inpractice,devicesmightsilentlyimple-mentthedesignweproposeabove.So,thisisinconclusive.Second,onlyfewdevicesactuallyprovideasafecache(e.g.,Memorightdoes,IntelX25doesnot).ADBMSrelyingonaflashdevicewithoutsafecachemustchoosebetweenper-formance(usingthecache)anddurability(writethroughthecache).Indeed,inourexperience,disablingthecachetypicallyresultsinanorderofmagnitudedegradationofwriteperformance(asneithermappingnordataiscached).Strictlyspeaking,adevicemightbebimodalevenifitdoesnotprovideasafecache.Thisisagaininconclusive.Third,onlytheX25devicesfromIntelsupporttheTRIMcom-mand.Thisismoresignificant.Indeed,aflashdevicethatdoesnotsupportTRIM,doesnotallowtheupperlayerstorespectconstraint(C2),erasebeforewrite.WithoutTRIM,evenacircularlogstructureasadatabaselogwillinduceupdates.Asaresult,onlytheX25couldbebimodal.
WedevisedasimpleexperimenttotestwhethertheX25isbimodal.Theexperimentconsistsinpartitioningthelog-icaladdressspaceinthree(P1,P2,P3).Startingfromatrimmeddevice,theexperimentconsistsoftwosteps:(a)performsequentialwrites(i.e.,C1,C2andC3areenforced)onP2andrandomwritesonP1andP3,and(b)TrimP2(wemakesurethattrimisactuallyperformed).ThosetwostepsarerepeatedthreetimesandwemeasuretheresponsetimeforallsequentialwritesonP2.Notethatwehaveob-servedlongpausesafterrandomwritesonP1andp3,toensurethatthedevicedoesnotperformanybackgroundreorganizationduringsequentialwritesonP2.
Ifthedeviceisbimodal,thentheresponsetimeofse-quentialwritesshouldremainconstantforthethreerunsofsequentialwrites.Indeed,ifthedeviceisbimodal,thenthe
Figure4:Responsetime(logscale)foreachindivid-ualIOs(16KB)onpartitionP2forruns1,2and3.Thesolidhorizontallinerepresentsthemeanvalue.
Figure5:Thegraphshowsthemean,min,maxandstandarddeviationofresponsetimeforthesequen-tialwritesonP2foreachrun.TheIntelX25-Eisnotbimodalbecausethecostofsequentialwritesisnotconstantacrossruns.
largepartitionP2shouldbemostlycomposedofoptimalblocksforwhichallconstraintshavebeenenforced(theremightbenonoptimalblocksattheboundariesoftheparti-tionduetothefactthatlogicalblocksizeandalignmentisnotdocumented).
Figure4showsdetailedresponsetimeofeachindividual16KBIO(logscale),whileFigure5showsaggregatedval-ues(mean,min,maxandstandarddeviation)12.Weobservethatresponsetimeandstabilitydegradewitheachrun.Themeanresponsetimeincreasesbyafactorgreaterthan5be-tweenRun1andRun3.Whiletheminimumresponsetimeremainsapproximatelyconstantat0.1msec,themaximumincreasesfrom2.1msecinthefirstrunto301msecinthethirdrun.ThisexperimentclearlydemonstratesthattheX25isnotbimodal.WecanevensaythataX25equippedwithabimodalFTLshouldbeabletoachieveatmost0.1msecmeanresponsetimeforsequentialwritesonP2.Notethatthisexperimentisanegativetest.Definingabatteriesofexperimenttovalidatethatadeviceindeedisbimodalisfuturework.
12
ningThe(seetheexperimentflashIOutilitywasrunweona4-coreInteli5processorrun-usedhttp://code.google.com/p/flashio/).definedfortheuFlipfortheexperimentisanX25-E80GB.
Theflashbenchmarkdevice5
5.RICHBIMODALFLASHDEVICES
Whileasignificantimprovementwithrespecttotoday’sdevices,thenarrowapproachisnotperfect:(i)Wear-levelingandgarbagecollectionareperformedwithoutanyknowl-edgeofthestoreddata,(ii)Theutilizationofthecontrollercacheisnotoptimized;(iii)AlotofcomplexityrelatedtoflashchipsismanagedattheDBMSlevel.Richbimodalflashdevicesmayaddresstheseshortcomings.
Extensionoftheblockdeviceinterfacehavealreadybeenproposedinthecontextofflashdevices[28,23,22]toman-agecomplexity,controltrade-offsandoptimizeembeddedresources.While,Shuetal.[28]introducetheDataMan-agementSetintheATAprotocol—connectinghostandperipherals—thatallowsanapplicationtocommunicatein-formationaboutitsI/Oaccessbehaviorstotheunderlyingflashdevice,Rasjimwaleetal[23]proposetouseexpressiveinterfacesuchasobject-basedstorage,andPrabhakaranetal.[22]focusonprovidingatomicwrites.Inthefollowing,wedescribehowrichinterfacescouldbeusedtooptimizetheminimalFTLmode–mostoptimizationshavealreadybeenpresentedintheliterature–,andwequicklydiscussnon-optimalblocks.Wedeliberatelyavoiddiscussionsofimplementationorsyntaxissues,whichareleftforfutureworks.
5.1OptimalBlocks
Fromoptimalblockstooptimalchunk:Sincetheblockdeviceinterfacedoesnotprovideanymeanstoexplic-itlydeclareasetofoptimalblocks(calledhereafterchunk),thesehavetobedetectedbytheFTL.Moreimportantly,thisdetectionleadstothemanagementofalargenumberofsmallblocks,havinganimpactonthevolumeofmeta-data(mapping,statistics,anddetectiondata)thatmustbestoredintothesafecache.Explicitdeclarationoflargerchunksmaythusbringsignificantsavingsintermsofsafecache.Forinstance,asingle100MBoptimalchunk,ex-plicitlydeclaredtotheflashdevice,e.g.,forthelogfileofaDBMS,maybringatwoorderofmagnitudereductionofthemetadatawrttheimplicitdetectionof400optimalblocks.Whilethisoptimizationhasnodirectimpactontheperfor-manceofoptimalblocks(themappingisalreadyoptimal),itallowsforimprovedcaching(seebelow).
Providingmetadata:Inablockdevice,wearlevelingandgarbagecollectionproceedblindly,withoutknowledgeoftheaccesspatternsonthemanageddata.Choosinghighlyerasedblockstostoredatawithlowerasefrequencyandconverselywillavoiduselessblocksmovementstobalanceerasecounts.Asimilarstrategy,basedondetection,hasbeenproposedin[10]inthecontextofablockdevicein-terface.Notethatminimizingblocksmovementsincreaseperformanceandlifetimeofthedevice.
Cachingdata:DBMSandFTLcouldcollaboratetoavoidproducingnon-optimalblocksforappenddatastruc-tures(e.g.,logrecords,tablesinappendmode).TheFTLcanavoidupdateoperations(andthusdegradinganoptimalblocktonon-optimal)ifforeachappendstructure,thelastuncompletewrittenflashpagecanbekeptinsidethesafecacheandonlywrittentoflashwhenitiscompleted.
Transmittingandcachingpayload:Arichflashde-vicenolongerhastoberestrictedtoaflat,regularaddressspacebasedonlogicalblockaddresses.WecanconsiderwriteIOswhichonlytransmitusefuldata,calledhereafterthepayload,i.e.,fragmentsofpagesasopposedtopages
(notethatwedonotconsiderherethattheflashdevicemanagesarepresentationofthedatabasepagelayout).Themainadvantageofusingpayloadinsteadofpagesisthatwriteoperationscanbebufferedefficiently.(e.g.,ninsertedtuplesinadatabasetablewillleadtobuffertheaggregatesizeofthetuplesplussomeoffsets-torecomposethepages-,comparedwithnIOsectorswithanarrowinterface!).In-deed,thesafecacheisnotpollutedwithdatathatalreadyexistsintheflashmemory.Notethatevenifthesafecacheisfull,itmaybemoreinterestingtotemporarilyflushthepayloadpartofthecacheinadedicatedflasharea(asaswapfile)ratherthanflushingontheirdestination,inordertokeepthedestinationblocksoptimal.WritepayloadsharessomeideaswiththePage-DifferentialLoggingapproach[17].Readingpayload:Payloadoptimizationisalsoveryin-terestingforreadoperationsinceitavoidstransmittinguse-lessdatafromthechiptothecontrollerandthentothehost(typically,readinga4KBpagecostsaround150µsfromwhich125µsaretransmissioncoststhroughtheserialinterfacefromthechiptothecontroller).FlashScan[31]alreadymentionedinsection3.3couldbenefitfurtherfromreadpayload,readingonlysubpages(i.e.,correspondingtotheECCgranularity).
Namelesswrites:Namelesswriteswereintroducedin[6]:theDBMSdoesnotprovideanydestinationaddressbutlettheFTLdecidesontheplacementofthedataandreturnsahandleforfuturereference.Namelesswritesthusgener-atesoptimalblocksbutaresomehoweasiertomanageattheDBMSlevel,typicallyfortemporarydata.
Reorganizationforfree:AfurtherimprovementwouldconsistinlettingtheFTLexposesomeaspectsofthewear-levelingprocesstotheDBMStosupportadditionalopti-mizations.Thewearlevelingprocesssometimeshastomovestaticdatatobalancetheerasecountacrossthewholede-vice.Inthiscase,theFTLreadsawholeblockthatitrewritesonanotherphysicallocation,thusbringinganop-portunitytoreorganizetheblockforfree.TheFTLcouldinvolvetheDBMSinthisprocessusingacall-backmecha-nisminthewaythatexternalpagersinteractwiththeop-eratingsystems[1].Indeed,toallowusingoptimalblocks,DBMSswillprobablyresorttolog-basedapproaches[2,17]whichmightgreatlybenefitfromsuchcleaning(almost)forfree.
5.2Non-optimalBlocks
Whilethispaperfocusonoptimalblocks,thereisinfactagreaterpotentialforminimizingtheimportantoverheadsgeneratedbythemanagementofnon-optimalblock:map-pingcost(metadatamanagement),garbagecollectioncosts(withstatisticslikeR/Wfrequency,expectedpattern,etc.)andwearlevelingcosts(updatefrequency).Forinstance:Hot-random-chunksandcold-random-chunks:TheFTLcanusedifferenttechniquesformappingdifferentchunkswhenevertheDBMSassociatesaccesspatternstogivenchunks.TheFTLcanmanagehotrandomchunks(i.e.,frequentwrites,scatteredonthewholechunk)withapagemappingcachedinthesafecache,whilemappingcanbeperformedatalargergranularityandonflashforcoldrandomchunks.Thesestrategiessharethebasicprinciplesproposedin[13,14],butarebasedoninformationdeliveredbytheDBMS(andnotdetectedbytheFTL).Investigatingfurtherhowarichinterfacecanbenefitnon-optimalblockmanagementisatopicforfuturework.
6
6.CONCLUSION
WearguedthatflashdevicesshouldprovideguaranteestoaDBMSsothatitcandevisestableandefficientIOman-agementmechanisms.Basedonthecharacteristicsofflashchips,wedefinedabimodalFTLthatdistinguishesbetweenamodewheresequentialwrites,sequentialreadsandran-domreadsareoptimalwhileupdatesandrandomwritesareforbidden,andamodewhereupdatesandrandomwritesaresupportedatthecostofsub-optimalIOperformance.Inter-estingly,theguaranteesofaminimalmodehavebeentakenforgrantedinmanyarticlesfromthedatabaseresearchlit-erature.Ourpointwiththispaperisthattheseguaranteesarenotalawofnature,wemustguidetheevolutionofflashdevicessothattheyareenforced.
Wedescribedthedesignspaceforbimodalflashdevicesinthecontextofablockdeviceinterface,andinthecontextofaricherinterface.WhichoneismostappropriateforaDBMSisanopenissue.Animportantpointisthatprovid-ingoptimalmappingguaranteesdoesnothindercompeti-tionbetweenflashdevicemanufacturers.Onthecontrary,theycancompeteto(a)bringdownthecostofoptimalIOpatterns(e.g.,usingparallelism),and(b)bringdownthecostofnon-optimalpatternswithoutjeopardizingDBMSdesign.Futureworkincludesdesigningandbuildingabi-modalFTLincollaborationwithaflashdevicemanufac-turer.
Wecanalsoderiveafutureworkroadmapfordatabasedesigners:(1)DefineaperformancemodelforbimodalflashdevicesinordertoexploretheDBMSdesignspace.ThereferencehereistheworkofJohnWilkesetal.onharddisksmodels[27];(2)ExploretheDBMSdesignspaceontopofnarrowbimodalflashdevices.Thefirstchallengehereistocompareandintegratethemanyideasalreadyproposedintheliteraturetoestablishabaseline.Theinterestingproblemisthentooptimizethisbaselinedesign;and(3)ExplorethedesignspaceforthecollaborationofDBMSandrichbimodalflashdevices.
Finally,futureworkincludesstudyinghowoperatingsys-temsaswellasmanydata-intensiveapplicationswouldben-efitfromflashdeviceswithoptimalmappingguarantees(e.g.,warehousescaledistributedsystems,gameengines,IO-consciousalgorithms).
7.ACKNOWLEDGMENTS
TheauthorswishtoacknowledgeMatiasBjørlingforhisprecioushelponexperiments,PhilippePucheralforhiscom-mentsonearlierversionsofthispaperandMehulShahforinspiringdiscussionsatSIGMOD2010.
8.REFERENCES
[1]M.J.Accetta,R.V.Baron,W.J.Bolosky,D.B.
Golub,R.F.Rashid,A.Tevanian,andM.Young.Mach:ANewKernelFoundationForUNIXDevelopment.InUSENIXSummer,1986.
[2]D.Agrawal,D.Ganesan,R.K.Sitaraman,Y.Diao,
andS.Singh.Lazy-AdaptiveTree:AnOptimized
IndexStructureforFlashDevices.PVLDB,2(1),2009.[3]N.Agrawal,V.Prabhakaran,T.Wobber,J.D.Davis,
M.Manasse,andR.Panigrahy.DesigntradeoffsforSSDperformance.InUSENIXATC,2008.
[4]T.Allard,N.Anciaux,L.Bouganim,Y.Guo,L.L.
Folgoc,B.Nguyen,P.Pucheral,I.Ray,I.Ray,and
S.Yin.SecurePersonalDataServers:aVisionPaper.PVLDB,3(1),2010.
[5]
D.G.Andersen,J.Franklin,M.Kaminsky,
A.Phanishayee,L.Tan,andV.Vasudevan.FAWN:aFastArrayofWimpyNodes.InACMSOSP,2009.[6]
A.C.Arpaci-dusseau,R.H.Arpaci-dusseau,andV.Prabhakaran.RemovingTheCostsOfIndirectioninFlash-basedSSDswithNamelessWrites.InUSENIXHotStorage,2010.
[7]
M.Bjørling,L.L.Folgoc,A.Mseddi,P.Bonnet,
L.Bouganim,andB.Jónsson.Performingsoundflashdevicemeasurements:somelessonsfromuFLIP.InSIGMODConference,2010.
[8]
M.Bjørling,P.Bonnet,L.Bouganim,andB.Jónsson.UnderstandingtheEnergyConsumptionofFlashDeviceswithuFLIP.IEEEDataEng.Bull.,toappear,2010.
[9]L.Bouganim,B.Jónsson,andP.Bonnet.uFLIP:UnderstandingFlashIOPatterns.InCIDR,2009.[10]
L.-P.ChangandC.-D.Du.Designand
implementationofanefficientwear-levelingalgorithmforsolid-state-diskmicrocontrollers.ACMTrans.Des.Autom.Electron.Syst.,15(1),2009.
[11]E.GalandS.Toledo.Algorithmsanddatastructuresforflashmemories.ACMComput.Surv.,37(2),2005.[12]
J.Gray.Tapeisdead,diskistape,flashisdisk.
http://research.microsoft.com/en-us/um/people/gray/talks/Flash_is_Good.ppt.
[13]
A.Gupta,Y.Kim,andB.Urgaonkar.DFTL:aflashtranslationlayeremployingdemand-basedselectivecachingofpage-leveladdressmappings.InASPLOS,2009.
[14]
J.Hu,H.Jiang,L.Tian,andL.Xu.PUD-LRU:AnErase-EfficientWriteBufferManagementAlgorithmforFlashMemorySSD.MASCOTS,2010.
[15]
D.Jung,J.-U.Kang,H.Jo,J.-S.Kim,andJ.Lee.SuperblockFTL:Asuperblock-basedflashtranslationlayerwithahybridaddresstranslationscheme.ACMTrans.Embed.Comput.Syst.,9(4):1–41,2010.
[16]
J.Kim,J.M.Kim,S.Noh,S.L.Min,andY.Cho.Aspace-efficientflashtranslationlayerfor
CompactFlashsystems.ConsumerElectronics,IEEETransactionson,48(2),may.2002.
[17]
Y.-R.Kim,K.-Y.Whang,andI.-Y.Song.Page-differentiallogging:anefficientand
DBMS-independentapproachforstoringdataintoflashmemoryInSIGMODConference,2010.[18]
S.Lee,D.Shin,Y.-J.Kim,andJ.Kim.LAST:locality-awaresectortranslationforNANDflashmemory-basedstoragesystems.SIGOPSOper.Syst.Rev.,42(6),2008.
[19]
S.-W.Lee,D.-J.Park,T.-S.Chung,D.-H.Lee,S.Park,andH.-J.Song.Alogbuffer-basedflashtranslationlayerusingfully-associativesector
translation.ACMTrans.Embed.Comput.Syst.,6(3),2007.
[20]
Y.Lee,J.KimandS.Maeng.ReSSD:asoftwarelayerforresuscitatingSSDsfrompoorsmallrandomwriteperformance.InSAC,2010.
7
[21]S.NathandP.B.Gibbons.Onlinemaintenanceof
verylargerandomsamplesonflashstorage.VLDBJ.,19(1),2010.
[22]V.Prabhakaran,T.L.Rodeheffer,andL.Zhou.
TransactionalFlash.InOSDI,2008.
[23]A.Rajimwale,V.Prabhakaran,andJ.D.Davis.
Blockmanagementinsolid-statedevices.InUSENIXATC,2009.
[24]E.Riedel,C.Faloutsos,G.A.Gibson,andD.Nagle.
Activedisksforlarge-scaledataprocessing.Computer,34(6),2001.
[25]S.W.SchlosserandG.R.Ganger.MEMS-based
StorageDevicesandStandardDiskInterfaces:ASquarePeginaRoundHole?InUSENIXFAST,2004.
[26]S.W.Schlosser,J.Schindler,A.Ailamaki,andG.R.
Ganger.ExposingandExploitingInternalParallelisminMEMS-basedStorage.CMUTechnicalReportCMU-CS-03-125,2003.
[27]E.A.M.Shriver,A.Merchant,andJ.Wilkes.An
AnalyticBehaviorModelforDiskDrivesWithReadaheadCachesandRequestReordering.InSIGMETRICS,1998.
[28]F.ShuandN.Obr.Datasetmanagementcommands
proposalforATA8-ACS2.http://www.t13.org/,2007.
[29]R.Stoica,M.Athanassoulis,R.Johnson,and
A.Ailamaki.Evaluatingandrepairingwriteperformanceonflashdevices.InDaMoN,2009.[30]M.Stonebraker.Operatingsystemsupportfor
databasemanagement.Commun.ACM,24(7),1981.[31]D.Tsirogiannis,S.Harizopoulos,M.A.Shah,J.L.
Wiener,andG.Graefe.Queryprocessingtechniquesforsolidstatedrives.InSIGMODConference,2009.[32]Y.Wang,K.GodaandM.Kitsuregawa.Evaluating
Non-In-PlaceUpdateTechniquesforFlash-BasedTransactionProcessingSystems.InDEXA,2009.
8
因篇幅问题不能全部显示,请点此查看更多更全内容
Copyright © 2019- yrrf.cn 版权所有 赣ICP备2024042794号-2
违法及侵权请联系:TEL:199 1889 7713 E-MAIL:2724546146@qq.com
本站由北京市万商天勤律师事务所王兴未律师提供法律服务