Thomas Serre        Lior Wolf        Tomaso Poggio
Center for Biological and Computational Learning
McGovern Institute
Brain and Cognitive Sciences Department
Massachusetts Institute of Technology
Cambridge, MA 02142
{serre,liorwolf}@mit.edu, tp@ai.mit.edu

Abstract
We introduce a novel set of features for robust object recognition. Each element of this set is a complex feature obtained by combining position- and scale-tolerant edge-detectors over neighboring positions and multiple orientations. Our system's architecture is motivated by a quantitative model of visual cortex.
We show that our approach exhibits excellent recognition performance and outperforms several state-of-the-art systems on a variety of image datasets including many different object categories. We also demonstrate that our system is able to learn from very few examples. The performance of the approach constitutes a suggestive plausibility proof for a class of feedforward models of object recognition in cortex.
1 Introduction

Hierarchical approaches to generic object recognition have become increasingly popular over the years. These are in some cases inspired by the hierarchical nature of primate visual cortex [10, 25], but, most importantly, hierarchical approaches have been shown to consistently outperform flat single-template (holistic) object recognition systems on a variety of object recognition tasks [7, 10]. Recognition typically involves the computation of a set of target features (also called components [7], parts [24] or fragments [22]) at one step and their combination in the next step. Features usually fall in one of two categories: template-based or histogram-based. Several template-based methods exhibit excellent performance in the detection of a single object category, e.g., faces [17, 23], cars [17] or pedestrians [14]. Constellation models based on generative methods perform well in the recognition of several object categories [24, 4], particularly when trained with very few training examples [3]. One limitation of these rigid template-based features is that they might not adequately capture variations in object appearance: they are very selective for a target shape but lack invariance with respect to object transformations. At the other extreme, histogram-based descriptors [12, 2] are very robust with respect to object transformations. The SIFT-based features [12], for instance, have been shown to excel in the re-detection of a previously seen object under new image transformations. However, as we confirm experimentally (see section 4), with such a degree of invariance, it is unlikely that the SIFT-based features could perform well on a generic object recognition task.

In this paper, we introduce a new set of biologically-inspired features that exhibit a better trade-off between invariance and selectivity than template-based or histogram-based approaches. Each element of this set is a feature obtained by combining the response of local edge-detectors that are slightly position- and scale-tolerant over neighboring positions and multiple orientations (like complex cells in primary visual cortex). Our features are more flexible than template-based approaches [7, 22] because they allow for small distortions of the input; they are more selective than histogram-based descriptors as they preserve local feature geometry. Our approach is as follows: for an input image, we first compute a set of features learned from the positive training set (see section 2). We then run a standard classifier on the vector of features obtained from the input image. The resulting approach is simpler than the aforementioned hierarchical approaches: it does not involve scanning over all positions and scales, it uses discriminative methods and it does not explicitly model object geometry. Yet it is able to learn from very few examples and it performs significantly better than all the systems we have compared it with thus far.
Band Σ   filt. sizes s   σ             λ             grid size NΣ
1        7 & 9           2.8 & 3.6     3.5 & 4.6     8
2        11 & 13         4.5 & 5.4     5.6 & 6.8     10
3        15 & 17         6.3 & 7.3     7.9 & 9.1     12
4        19 & 21         8.2 & 9.2     10.3 & 11.5   14
5        23 & 25         10.2 & 11.3   12.7 & 14.1   16
6        27 & 29         12.3 & 13.4   15.4 & 16.8   18
7        31 & 33         14.6 & 15.8   18.2 & 19.7   20
8        35 & 37         17.0 & 18.2   21.2 & 22.8   22

orient. θ: 0; π/4; π/2; 3π/4
patch sizes ni: 4×4; 8×8; 12×12; 16×16 (×4 orientations)

Table 1. Summary of parameters used in our implementation (see Fig. 1 and accompanying text).
Biological visual systems as guides. Because humans and primates outperform the best machine vision systems by almost any measure, building a system that emulates object recognition in cortex has always been an attractive idea. However, for the most part, the use of visual neuroscience in computer vision has been limited to a justification of Gabor filters. No real attention has been given to biologically plausible features of higher complexity. While mainstream computer vision has always been inspired and challenged by human vision, it seems to never have advanced past the first stage of processing in the simple cells of primary visual cortex V1. Models of biological vision [5, 13, 16, 1] have not been extended to deal with real-world object recognition tasks (e.g., large scale natural image databases) while computer vision systems that are closer to biology like LeNet [10] are still lacking agreement with physiology (e.g., mapping from network layers to cortical visual areas). This work is an attempt to bridge the gap between computer vision and neuroscience.
Our system follows the standard model of object recognition in primate cortex [16], which summarizes in a quantitative way what most visual neuroscientists agree on: the first few hundred milliseconds of visual processing in primate cortex follow a mostly feedforward hierarchy. At each stage, the receptive fields of neurons (i.e., the part of the visual field that could potentially elicit a neuron's response) tend to get larger along with the complexity of their optimal stimuli (i.e., the set of stimuli that elicit a neuron's response). In its simplest version, the standard model consists of four layers of computational units where simple S units, which combine their inputs with Gaussian-like tuning to increase object selectivity, alternate with complex C units, which pool their inputs through a maximum operation, thereby introducing gradual invariance to scale and translation. The model has been able to quantitatively duplicate the generalization properties exhibited by neurons in inferotemporal monkey cortex (the so-called view-tuned units) that remain highly selective for particular objects (a face, a hand, a toilet brush) while being invariant to ranges of scales and positions. The model originally used a very simple static dictionary of features (for the recognition of segmented objects) although it was suggested in [16] that features in intermediate layers should instead be learned from visual experience.
We extend the standard model and show how it can learn a vocabulary of visual features from natural images. We prove that the extended model can robustly handle the recognition of many object categories and compete with state-of-the-art object recognition systems. This work appeared in a very preliminary form in [18]. Our source code as well as an extended version of this paper [20] can be found at http://cbcl.mit.edu/software-datasets.
2 The C2 features
Our approach is summarized in Fig. 1: the first two layers correspond to primate primary visual cortex, V1, i.e., the first visual cortical stage, which contains simple (S1) and complex (C1) cells [8]. The S1 responses are obtained by applying to the input image a battery of Gabor filters, which can be described by the following equation:
G(x, y) = exp(−(X² + γ²Y²) / (2σ²)) × cos(2πX / λ),

where X = x cos θ + y sin θ and Y = −x sin θ + y cos θ.
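For concreteness, here is a minimal NumPy sketch of such a filter. The aspect ratio γ = 0.3 and the zero-mean, unit-norm normalization are our assumptions; the excerpt only fixes the functional form and the (θ, σ, λ) values of Table 1.

```python
import numpy as np

def gabor_filter(size, wavelength, sigma, theta, gamma=0.3):
    # Grid of (x, y) coordinates centered on the filter; sizes in Table 1 are odd.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotated coordinates X = x cos(theta) + y sin(theta), Y = -x sin(theta) + y cos(theta).
    X = x * np.cos(theta) + y * np.sin(theta)
    Y = -x * np.sin(theta) + y * np.cos(theta)
    # G(x, y) = exp(-(X^2 + gamma^2 Y^2) / (2 sigma^2)) * cos(2 pi X / wavelength)
    g = np.exp(-(X ** 2 + gamma ** 2 * Y ** 2) / (2 * sigma ** 2))
    g *= np.cos(2 * np.pi * X / wavelength)
    # Zero mean and unit norm: an assumption, not specified in this excerpt.
    g -= g.mean()
    return g / (np.linalg.norm(g) + 1e-12)

# Example: the smaller filter of band 1 at horizontal orientation (Table 1).
f = gabor_filter(size=7, wavelength=3.5, sigma=2.8, theta=0.0)
```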
We adjusted the filter parameters, i.e., orientation θ, effective width σ, and wavelength λ, so that the tuning profiles of S1 units match those of V1 parafoveal simple cells. This was done by first sampling the space of parameters and then generating a large number of filters. We applied those filters to stimuli commonly used to probe V1 neurons [8] (i.e., gratings, bars and edges). After removing filters that were incompatible with biological cells [8], we were left with a final set of 16 filters at 4 orientations (see Table 1 and [19] for a full description of how those filters were obtained).
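Applying the resulting battery to an image yields the S1 maps; the sketch below assumes a full-wave rectified (absolute-value) response, which is not spelled out in this excerpt.

```python
import numpy as np
from scipy.signal import convolve2d

def s1_maps(image, filters):
    # S1 stage: filter the input image with every Gabor filter in the
    # battery; taking the absolute value of each response (rectification)
    # is our assumption.
    return [np.abs(convolve2d(image, f, mode="same")) for f in filters]
```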
The next stage – C1 – corresponds to complex cells which show some tolerance to shift and size: complex cells tend to have larger receptive fields (twice as large as simple cells), respond to oriented bars or edges anywhere within their receptive field [8] (shift invariance) and are in general more broadly tuned to spatial frequency than simple cells [8] (scale invariance). Modifying the original Hubel & Wiesel proposal for building complex cells from simple cells through pooling [8], Riesenhuber & Poggio proposed a max-like pooling operation for building position- and scale-tolerant C1 units.
Given an input image I, perform the following steps:

S1: Apply a battery of Gabor filters to the input image. The filters come in 4 orientations θ and 16 scales s (see Table 1). Obtain 16 × 4 = 64 maps (S1)sθ that are arranged in 8 bands (e.g., band 1 contains filter outputs of size 7 and 9, in all four orientations, band 2 contains filter outputs of size 11 and 13, etc.).

C1: For each band, take the max over scales and positions: each band member is sub-sampled by taking the max over a grid with cells of size NΣ first and the max between the two scale members second, e.g., for band 1, a spatial max is taken over an 8 × 8 grid first and then across the two scales (size 7 and 9). Note that we do not take a max over different orientations, hence, each band (C1)Σ contains 4 maps.

During training only: Extract K patches Pi=1,...,K of various sizes ni × ni and all four orientations (thus containing ni × ni × 4 elements) at random from the (C1)Σ maps from all training images.

S2: For each C1 image (C1)Σ, compute Y = exp(−γ ||X − Pi||²) for all image patches X (at all positions) and each patch Pi learned during training, for each band independently. Obtain S2 maps (S2)Σi.

C2: Compute the max over all positions and scales for each S2 map type (S2)i (i.e., corresponding to a particular patch Pi) and obtain shift- and scale-invariant C2 features (C2)i, for i = 1, ..., K.

Figure 1. Computation of C2 features.
Figure 2. Scale- and position-tolerance at the complex cells (C1) level: Each C1 unit receives inputs from S1 units at the same preferred orientation arranged in bands Σ, i.e., S1 units in two different sizes and neighboring positions (grid cell of size NΣ × NΣ). From each grid cell (left) we obtain one measurement by taking the max over all positions, allowing the C1 unit to respond to a horizontal edge anywhere within the grid (tolerance to shift). Similarly, by taking a max over the two sizes (right), the C1 unit becomes tolerant to slight changes in scale.
In the meantime, experimental evidence in favor of the max operation has appeared [6, 9]. Again, pooling parameters were set so that C1 units match the tuning properties of complex cells as measured experimentally (see Table 1 and [19] for a full description of how those parameters were obtained).
Fig. 2 illustrates how pooling from S1 to C1 is done. S1 units come in 16 scales s arranged in 8 bands Σ. For instance, consider the first band Σ = 1. For each orientation, it contains two S1 maps: one obtained using a filter of size 7, and one obtained using a filter of size 9. Note that both of these S1 maps have the same dimensions. In order to obtain the C1 responses, these maps are sub-sampled using a grid cell of size NΣ × NΣ = 8 × 8. From each grid cell we obtain one measurement by taking the maximum of all elements. As a last stage we take a max over the two scales, by considering for each cell the maximum value from the two maps. This process is repeated independently for each of the four orientations and each scale band.
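A minimal sketch of this two-step pooling for one band and one orientation, assuming non-overlapping grid cells (the excerpt does not say whether adjacent cells overlap):

```python
import numpy as np

def c1_pool(s1_scale_a, s1_scale_b, grid_size):
    # Spatial max over each grid_size x grid_size cell (tolerance to shift)...
    h, w = s1_scale_a.shape
    out_h, out_w = h // grid_size, w // grid_size
    pooled = np.empty((2, out_h, out_w))
    for k, s1 in enumerate((s1_scale_a, s1_scale_b)):
        for i in range(out_h):
            for j in range(out_w):
                cell = s1[i * grid_size:(i + 1) * grid_size,
                          j * grid_size:(j + 1) * grid_size]
                pooled[k, i, j] = cell.max()
    # ...then max across the band's two scales (tolerance to scale).
    return pooled.max(axis=0)
```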
In our new version of the standard model the subsequent S2 stage is where learning occurs. A large pool of K
patches of various sizes at random positions are extracted from a target set of images at the C1 level for all orientations, i.e., a patch Pi of size ni × ni contains ni × ni × 4 elements, where the factor 4 corresponds to the four possible S1 and C1 orientations. In our simulations we used patches of size ni = 4, 8, 12 and 16 but in practice any size can be considered. The training process ends by setting each of those patches as prototypes or centers of the S2 units which behave as radial basis function (RBF) units during recognition, i.e., each S2 unit response depends in a Gaussian-like way on the Euclidean distance between a new input patch (at a particular location and scale) and the stored prototype. This is consistent with well-known neuron response properties in primate inferotemporal cortex and seems to be the key property for learning to generalize in the visual and motor systems [15]. When a new input is presented, each stored S2 unit is convolved with the new (C1)Σ input image at all scales (this leads to K × 8 (S2)Σi images, where the factor K corresponds to the K patches extracted during learning and the factor 8, to the 8 scale bands). After taking a final max for each (S2)i map across all scales and positions, we get the final set of K shift- and scale-invariant C2 units. The size of our final C2 feature vector thus depends only on the number of patches extracted during learning and not on the input image size. This C2 feature vector is passed to a classifier for final analysis.¹
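The sketch below illustrates this S2/C2 computation for a single stored prototype; the tuning width γ is our placeholder, since its value is not given in this excerpt, and the dense loop over positions stands in for the convolution described above.

```python
import numpy as np

def c2_response(c1_bands, prototype, gamma=1.0):
    # c1_bands: list of (H, W, 4) C1 arrays, one per scale band.
    # prototype: (n, n, 4) patch P_i extracted during training.
    n = prototype.shape[0]
    best = -np.inf
    for c1 in c1_bands:
        H, W, _ = c1.shape
        for i in range(H - n + 1):
            for j in range(W - n + 1):
                patch = c1[i:i + n, j:j + n, :]
                # S2 unit: Gaussian-like tuning to the Euclidean distance
                # between the input patch and the stored prototype.
                s2 = np.exp(-gamma * np.sum((patch - prototype) ** 2))
                best = max(best, s2)
    return best  # C2: max over all positions and scale bands
```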
An important question for both neuroscience and computer vision regards the choice of the unlabeled target set from which to learn – in an unsupervised way – this vocabulary of visual features. In this paper, features are learned from the positive training set for each object category (but see [20] for a discussion on how features could be learned from random natural images).
¹ It is likely that our (non-biological) final classifier could correspond to the task-specific circuits found in prefrontal cortex (PFC) and C2 units with neurons in inferotemporal (IT) cortex [16]. The S2 units could be located in V4 and/or in posterior inferotemporal (PIT) cortex.
Figure 3. Examples from the MIT face and car datasets.

Datasets            Bench.        C2 features
                                  boost    SVM
Leaves (Calt.)      84.0 [24]     97.0     95.9
Cars (Calt.)        84.8 [4]      99.7     99.8
Faces (Calt.)       96.4 [4]      98.2     98.1
Airplanes (Calt.)   94.0 [4]      96.7     94.9
Moto. (Calt.)       95.0 [4]      98.0     97.4
Faces (MIT)         90.4 [7]      95.9     95.3
Cars (MIT)          75.4 [11]     95.1     93.3

Table 2. C2 features vs. other recognition systems (Bench.).
3 Experimental Setup

We tested our system on various object categorization tasks for comparison with benchmark computer vision systems. All datasets we used are made up of images that either contain or do not contain a single instance of the target object; the system has to decide whether the target object is present or absent.
MIT-CBCL datasets: These include a near-frontal (±30°) face dataset for comparison with the component-based system of Heisele et al. [7] and a multi-view car dataset for comparison with [11]. These two datasets are very challenging (see typical examples in Fig. 3). The face patterns used for testing constitute a subset of the CMU PIE database which contains a large variety of faces under extreme illumination conditions (see [7]). The test non-face patterns were selected by a low-resolution LDA classifier as the most similar to faces (the LDA classifier was trained on an independent 19 × 19 low-resolution training set). The full set used in [7] contains 6,900 positive and 13,700 negative 70 × 70 images for training and 427 positive and 5,000 negative images for testing. The car database on the other hand was created by taking street scene pictures in the Boston city area. Numerous vehicles (including SUVs, trucks, buses, etc.) photographed from different view-points were manually labeled from those images to form a positive set. Random image patterns at various scales that were not labeled as vehicles were extracted and used as the negative set. The car dataset used in [11] contains 4,000 positive and 1,600 negative 120 × 120 training examples and 3,400 test examples (half positive, half negative). While we tested our system on the full test sets, we considered a random subset of the positive and negative training sets containing only 500 images each for both the face and the car database.

The Caltech datasets: The Caltech datasets contain 101 objects plus a background category (used as the negative set) and are available at http://www.vision.caltech.edu. For each object category, the system was trained with n = 1, 3, 6, 15, 30 or 40 positive examples from the target object class (as in [3]) and 50 negative examples from the background class. From the remaining images, we extracted 50 images
from the positive and 50 images from the negative set to test the system's performance. As in [3], the system's performance was averaged over 10 random splits for each object category. All images were normalized to 140 pixels in height (width was rescaled accordingly so that the image aspect ratio was preserved) and converted to gray values before processing. These datasets contain the target object embedded in a large amount of clutter and the challenge is to learn from unsegmented images and discover the target object class automatically. For a close comparison with the system by Fergus et al. we also tested our approach on a subset of the 101-object dataset using the exact same split as in [4] (the results are reported in Table 2) and an additional leaf database as in [24] for a total of five datasets that we refer to as the Caltech datasets in the following.
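As an illustration of this preprocessing, here is a sketch using Pillow; the bilinear resampling and the scaling of gray values to [0, 1] are our choices, not specified in the text.

```python
import numpy as np
from PIL import Image

def load_caltech_image(path, target_height=140):
    # Convert to gray values and rescale to 140 pixels in height,
    # preserving the aspect ratio (the resampling filter and the
    # [0, 1] gray-value range are our assumptions).
    img = Image.open(path).convert("L")
    w, h = img.size
    img = img.resize((max(1, round(w * target_height / h)), target_height),
                     Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0
```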
4 Results
Table 2 contains a summary of the performance of the C2 features when used as input to a linear SVM and to gentleAdaBoost (denoted boost) on various datasets. For both our system and the benchmarks, we report the error rate at the equilibrium point, i.e., the error rate at which the false positive rate equals the miss rate. Results obtained with the C2 features are consistently higher than those previously reported on the Caltech datasets. Our system seems to outperform the component-based system presented in [7] (also using SVM) on the MIT-CBCL face database as well as a fragment-based system implemented by [11] that uses template-based features with gentleAdaBoost (similar to [21]).
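A small sketch of how this equilibrium-point measure can be computed from classifier scores (our own helper, not code from the paper):

```python
import numpy as np

def equilibrium_error(pos_scores, neg_scores):
    # Sweep the decision threshold and keep the point where the
    # false-positive rate is closest to the miss rate.
    best_gap, eq_err = np.inf, 1.0
    for t in np.sort(np.concatenate([pos_scores, neg_scores])):
        miss = np.mean(pos_scores < t)   # positives classified as negative
        fp = np.mean(neg_scores >= t)    # negatives classified as positive
        if abs(miss - fp) < best_gap:
            best_gap, eq_err = abs(miss - fp), (miss + fp) / 2
    return eq_err
```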
Fig. 4 summarizes the system performance on the 101-object database. On the left we show the results obtained using our system with gentleAdaBoost (we found qualitatively similar results with a linear SVM) over all 101 categories for 1, 3, 6, 15, 30 and 40 positive training examples (each result is an average of 10 different random splits). Each plot is a single histogram of all 101 scores, obtained using a fixed number of training examples (e.g., with 40 examples the system gets 95% correct for 42% of the object categories). On the right we focus on some of the same object categories as the ones used by Fei-Fei et al. for illustration in [3]: the C2 features achieve error rates very similar to the ones reported in [3] with very few training examples.

Figure 4. C2 features performance on the 101-object database for different numbers of positive training examples: (left) histogram across the 101 categories and (right) performance on sample categories, see accompanying text.

Figure 5. Superiority of the C2 vs. SIFT-based features on the Caltech datasets for different number of features (left) and on the 101-object database for different number of training examples (right).
We also compared our C2 features to SIFT-based features [12]. We selected 1000 random reference key-points from the training set. Given a new image, we measured the minimum distance between all its key-points and the 1000 reference key-points, thus obtaining a feature vector of size 1000 (for this comparison we did not use the position information recovered by the algorithm). While Lowe recommends using the ratio of the distances between the nearest and the second closest key-point as a similarity measure, we found that the minimum distance leads to better performance than the ratio on these datasets. A comparison between the C2 features and the SIFT-based features (both passed to a gentleAdaBoost classifier) is shown in Fig. 5 (left) for the Caltech datasets. The gain in performance obtained by using the C2 features relative to the SIFT-based features is obvious. This is true with gentleAdaBoost – used for classification in Fig. 5 (left) – but we also found
very similar results with SVM. Also, as one can see in Fig. 5 (right), the performance of the C2 features (error at equilibrium point) for each category from the 101-object database is well above that of the SIFT-based features for any number of training examples.
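The sketch below computes this 1000-dimensional min-distance feature vector from descriptor matrices; the SIFT descriptors themselves would come from a key-point detector, which we treat as given.

```python
import numpy as np

def sift_min_distance_features(image_desc, reference_desc):
    # image_desc: (m, 128) descriptors of the new image's key-points.
    # reference_desc: (1000, 128) descriptors of the reference key-points.
    # Pairwise squared Euclidean distances, shape (1000, m).
    d2 = (np.sum(reference_desc ** 2, axis=1)[:, None]
          + np.sum(image_desc ** 2, axis=1)[None, :]
          - 2.0 * reference_desc @ image_desc.T)
    # One feature per reference key-point: distance to the closest
    # key-point of the image (position information is discarded).
    return np.sqrt(np.clip(d2, 0, None)).min(axis=1)
```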
Finally, we conducted initial experiments on the multiple classes case. For this task we used the 101-object dataset. We split each category into a training set of size 15 or 30 and a test set containing the rest of the images. We used a simple multiple-class linear SVM as classifier. The SVM applied the all-pairs method for multiple label classification, and was trained on 102 labels (101 categories plus the background category, i.e., 102 AFC). The number of C2 features used in these experiments was 4075. We obtained a correct classification rate above 35% when using 15 training examples per class, averaged over 10 repetitions, and a 42% correct classification rate when using 30 training examples (chance is below 1%).
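A minimal reproduction of this setup with scikit-learn (not used in the paper; its SVC trains one binary classifier per pair of classes, matching the all-pairs rule) on stand-in data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for the real data: 4075 C2 features per image, 102 labels
# (101 categories plus background); shapes only, values are random.
X_train = rng.normal(size=(300, 4075))
y_train = rng.integers(0, 102, size=300)
X_test = rng.normal(size=(100, 4075))
y_test = rng.integers(0, 102, size=100)

# SVC's multiclass strategy is one-vs-one, i.e., the all-pairs method,
# with prediction by voting among the pairwise classifiers.
clf = SVC(kernel="linear", decision_function_shape="ovo")
clf.fit(X_train, y_train)
print("accuracy:", np.mean(clf.predict(X_test) == y_test))
```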
Figure 6. (left) Sample features learned from different object categories (i.e., first 5 features returned by gentleAdaBoost for each category). Shown are S2 features (centers of RBF units): each oriented ellipse characterizes a C1 (afferent) subunit at matching orientation, while color encodes response strength. (right) Multiclass classification on the 101-object database with a linear SVM.
5 Discussion
This paper describes a new biologically-motivated framework for robust object recognition: our system first computes a set of scale- and translation-invariant C2 features from a training set of images and then runs a standard discriminative classifier on the vector of features obtained from the input image. Our approach exhibits excellent performance on a variety of image datasets and competes with some of the best existing systems.
This system belongs to a family of feedforward models of object recognition in cortex that have been shown to be able to duplicate the tuning properties of neurons in several visual cortical areas. In particular, Riesenhuber & Poggio showed that such a class of models accounts quantitatively for the tuning properties of view-tuned units in inferotemporal cortex (tested with idealized object stimuli on uniform backgrounds), which respond to images of the learned object more strongly than to distractor objects, despite significant changes in position and size [16]. The performance of this architecture on a variety of real-world object recognition tasks (presence of clutter and changes in appearance, illumination, etc.) provides another compelling plausibility proof for this class of models.
While a long-time goal for computer vision has been to build a system that achieves human-level recognition performance, state-of-the-art algorithms have been diverging from biology: for instance, some of the best existing systems use geometrical information about the constitutive parts of objects (constellation approaches rely on both appearance-based and shape-based models and component-based systems use the relative position of the detected components along with their associated detection values). Biology is however unlikely to be able to use geometrical information – at least in the cortical stream dedicated to shape processing and object recognition. The system described in this paper respects the properties of cortical processing (including the absence of geometrical information) while showing performance at least comparable to the best computer vision systems.
The fact that this biologically-motivated model outperforms more complex computer vision systems might at first appear puzzling. The architecture performs only two major kinds of computations (template matching and max pooling) while some of the other systems we have discussed involve complex computations like the estimation of probability distributions [24, 4, 3] or the selection of facial components for use by an SVM [7]. Perhaps part of the model's strength comes from its built-in gradual shift- and scale-tolerance that closely mimics visual cortical processing, which has been finely tuned by evolution over thousands of years. It is also very likely that such hierarchical architectures ease the recognition problem by decomposing the task into several simpler ones at each layer. Finally it is worth pointing out that the set of C2 features that is passed to the final classifier is very redundant, probably more redundant than for other approaches. While we showed that a relatively small number of features (about 50) is sufficient to achieve good error rates, performance can be increased significantly by adding many more features. Interestingly, the number of features needed to reach the ceiling (about 5,000 features) is much larger than the number used by current systems (on the order of 10–100 for [22, 7, 21] and 4–8 for constellation approaches [24, 4, 3]).
Acknowledgments
We would like to thank the anonymous reviewers as well as Antonio Torralba and Yuri Ivanov for useful comments on this manuscript.
This report describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT, as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Sciences & Artificial Intelligence Laboratory (CSAIL).
This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/IM) Contract No. IIS-0085836, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-02092, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1P20MH66239-01A1. Additional support was provided by: Central Research Institute of Electric Power Industry, Center for e-Business (MIT), Daimler-Chrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., ITRI, Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, Mitsubishi Corporation, NEC Fund, Nippon Telegraph & Telephone, Oxygen, Siemens Corporate Research, Inc., Sony MOU, Sumitomo Metal Industries, Toyota Motor Corporation, and WatchVision Co., Ltd.
References

[1] Y. Amit and M. Mascaro. An integrated network for invariant visual detection and recognition. Vision Research, 43(19):2073–2088, 2003.
[2] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 2002.
[3] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR, Workshop on Generative-Model Based Vision, 2004.
[4] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, volume 2, pages 264–271, 2003.
[5] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36:193–201, 1980.
[6] T. J. Gawne and J. M. Martin. Response of primate visual cortical V4 neurons to simultaneously presented stimuli. J. Neurophysiol., 88:1128–1135, 2002.
[7] B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio. Categorization by learning and combining object parts. In NIPS, Vancouver, 2001.
[8] D. Hubel and T. Wiesel. Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J. Neurophys., 28:229–289, 1965.
[9] I. Lampl, D. Ferster, T. Poggio, and M. Riesenhuber. Intracellular measurements of spatial integration and the max operation in complex cells of the cat primary visual cortex. J. Neurophysiol., 92:2704–2713, 2004.
[10] Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR'04. IEEE Press, 2004.
[11] B. Leung. Component-based car detection in street scene images. Master's thesis, EECS, MIT, 2004.
[12] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999.
[13] B. W. Mel. SEEMORE: Combining color, shape and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9(4):777–804, 1997.
[14] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23:349–361, 2001.
[15] T. Poggio and E. Bizzi. Generalization in vision and motor control. Nature, 431:768–774, 2004.
[16] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025, 1999.
[17] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In CVPR, pages 746–751, 2000.
[18] T. Serre, J. Louie, M. Riesenhuber, and T. Poggio. On the role of object-specific features for real world recognition in biological vision. In Biologically Motivated Computer Vision, Second International Workshop (BMCV 2002), pages 387–397, Tuebingen, Germany, 2002.
[19] T. Serre and M. Riesenhuber. Realistic modeling of simple and complex cell tuning in the HMAX model, and implications for invariant object recognition in cortex. Technical Report CBCL Paper 239/AI Memo 2004-017, Massachusetts Institute of Technology, Cambridge, MA, July 2004.
[20] T. Serre, L. Wolf, and T. Poggio. A new biologically motivated framework for robust object recognition. Technical Report CBCL Paper 243/AI Memo 2004-026, Massachusetts Institute of Technology, Cambridge, MA, November 2004.
[21] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.
[22] S. Ullman, M. Vidal-Naquet, and E. Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7):682–687, 2002.
[23] P. Viola and M. Jones. Robust real-time face detection. In ICCV, 2001.
[24] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In ECCV, Dublin, Ireland, 2000.
[25] H. Wersing and E. Korner. Learning optimized features for hierarchical models of invariant recognition. Neural Computation, 15(7), 2003.