Thomas Serre        Lior Wolf        Tomaso Poggio
Center for Biological and Computational Learning
McGovern Institute
Brain and Cognitive Sciences Department
Massachusetts Institute of Technology
Cambridge, MA 02142
{serre,liorwolf}@mit.edu, tp@ai.mit.edu

Abstract
We introduce a novel set of features for robust object recognition. Each element of this set is a complex feature obtained by combining position- and scale-tolerant edge-detectors over neighboring positions and multiple orientations. Our system's architecture is motivated by a quantitative model of visual cortex.
We show that our approach exhibits excellent recognition performance and outperforms several state-of-the-art systems on a variety of image datasets including many different object categories. We also demonstrate that our system is able to learn from very few examples. The performance of the approach constitutes a suggestive plausibility proof for a class of feedforward models of object recognition in cortex.
1 Introduction

Hierarchical approaches to generic object recognition have become increasingly popular over the years. These are in some cases inspired by the hierarchical nature of primate visual cortex [10, 25], but, most importantly, hierarchical approaches have been shown to consistently outperform flat single-template (holistic) object recognition systems on a variety of object recognition tasks [7, 10]. Recognition typically involves the computation of a set of target features (also called components [7], parts [24] or fragments [22]) at one step and their combination in the next step. Features usually fall in one of two categories: template-based or histogram-based. Several template-based methods exhibit excellent performance in the detection of a single object category, e.g., faces [17, 23], cars [17] or pedestrians [14]. Constellation models based on generative methods perform well in the recognition of several object categories [24, 4], particularly when trained with very few training examples [3]. One limitation of these rigid template-based features is that they might not adequately capture variations in object appearance: they are very selective for a target shape but lack invariance with respect to object transformations. At the other extreme, histogram-based descriptors [12, 2] are very robust with respect to object transformations. The SIFT-based features [12], for instance, have been shown to excel in the re-detection of a previously seen object under new image transformations. However, as we confirm experimentally (see section 4), with such a degree of invariance, it is unlikely that the SIFT-based features could perform well on a generic object recognition task.

In this paper, we introduce a new set of biologically-inspired features that exhibit a better trade-off between invariance and selectivity than template-based or histogram-based approaches. Each element of this set is a feature obtained by combining the response of local edge-detectors that are slightly position- and scale-tolerant over neighboring positions and multiple orientations (like complex cells in primary visual cortex). Our features are more flexible than template-based approaches [7, 22] because they allow for small distortions of the input; they are more selective than histogram-based descriptors as they preserve local feature geometry. Our approach is as follows: for an input image, we first compute a set of features learned from the positive training set (see section 2). We then run a standard classifier on the vector of features obtained from the input image. The resulting approach is simpler than the aforementioned hierarchical approaches: it does not involve scanning over all positions and scales, it uses discriminative methods and it does not explicitly model object geometry. Yet it is able to learn from very few examples and it performs significantly better than all the systems we have compared it with thus far.
Band Σ   filt. sizes s   σ             λ             grid size NΣ
1        7 & 9           2.8 & 3.6     3.5 & 4.6     8
2        11 & 13         4.5 & 5.4     5.6 & 6.8     10
3        15 & 17         6.3 & 7.3     7.9 & 9.1     12
4        19 & 21         8.2 & 9.2     10.3 & 11.5   14
5        23 & 25         10.2 & 11.3   12.7 & 14.1   16
6        27 & 29         12.3 & 13.4   15.4 & 16.8   18
7        31 & 33         14.6 & 15.8   18.2 & 19.7   20
8        35 & 37         17.0 & 18.2   21.2 & 22.8   22

orient. θ: 0; π/4; π/2; 3π/4
patch sizes ni: 4×4; 8×8; 12×12; 16×16 (×4 orientations)

Table 1. Summary of parameters used in our implementation (see Fig. 1 and accompanying text).
Biological visual systems as guides. Because humans and primates outperform the best machine vision systems by almost any measure, building a system that emulates object recognition in cortex has always been an attractive idea. However, for the most part, the use of visual neuroscience in computer vision has been limited to a justification of Gabor filters. No real attention has been given to biologically plausible features of higher complexity. While mainstream computer vision has always been inspired and challenged by human vision, it seems to never have advanced past the first stage of processing in the simple cells of primary visual cortex V1. Models of biological vision [5, 13, 16, 1] have not been extended to deal with real-world object recognition tasks (e.g., large scale natural image databases) while computer vision systems that are closer to biology like LeNet [10] are still lacking agreement with physiology (e.g., mapping from network layers to cortical visual areas). This work is an attempt to bridge the gap between computer vision and neuroscience.
Our system follows the standard model of object recognition in primate cortex [16], which summarizes in a quantitative way what most visual neuroscientists agree on: the first few hundred milliseconds of visual processing in primate cortex follow a mostly feedforward hierarchy. At each stage, the receptive fields of neurons (i.e., the part of the visual field that could potentially elicit a neuron's response) tend to get larger along with the complexity of their optimal stimuli (i.e., the set of stimuli that elicit a neuron's response). In its simplest version, the standard model consists of four layers of computational units where simple S units, which combine their inputs with Gaussian-like tuning to increase object selectivity, alternate with complex C units, which pool their inputs through a maximum operation, thereby introducing gradual invariance to scale and translation. The model has been able to quantitatively duplicate the generalization properties exhibited by neurons in inferotemporal monkey cortex (the so-called view-tuned units) that remain highly selective for particular objects (a face, a hand, a toilet brush) while being invariant to ranges of scales and positions. The model originally used a very simple static dictionary of features (for the recognition of segmented objects) although it was suggested in [16] that features in intermediate layers should instead be learned from visual experience.
We extend the standard model and show how it can learn a vocabulary of visual features from natural images. We prove that the extended model can robustly handle the recognition of many object categories and compete with state-of-the-art object recognition systems. This work appeared in a very preliminary form in [18]. Our source code as well as an extended version of this paper [20] can be found at http://cbcl.mit.edu/software-datasets.
2 The C2 features
Our approach is summarized in Fig. 1: the first two layers correspond to primate primary visual cortex, V1, i.e., the first visual cortical stage, which contains simple (S1) and complex (C1) cells [8]. The S1 responses are obtained by applying to the input image a battery of Gabor filters, which can be described by the following equation:
G(x, y) = exp(−(X² + γ²Y²) / (2σ²)) × cos(2πX / λ),

where X = x cos θ + y sin θ and Y = −x sin θ + y cos θ.
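For concreteness, here is a minimal NumPy sketch of such a filter. The aspect ratio γ = 0.3 and the zero-mean, unit-norm normalization are our assumptions; the excerpt only fixes the functional form and the (θ, σ, λ) values of Table 1.

```python
import numpy as np

def gabor_filter(size, wavelength, sigma, theta, gamma=0.3):
    # Grid of (x, y) coordinates centered on the filter; sizes in Table 1 are odd.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotated coordinates X = x cos(theta) + y sin(theta), Y = -x sin(theta) + y cos(theta).
    X = x * np.cos(theta) + y * np.sin(theta)
    Y = -x * np.sin(theta) + y * np.cos(theta)
    # G(x, y) = exp(-(X^2 + gamma^2 Y^2) / (2 sigma^2)) * cos(2 pi X / wavelength)
    g = np.exp(-(X ** 2 + gamma ** 2 * Y ** 2) / (2 * sigma ** 2))
    g *= np.cos(2 * np.pi * X / wavelength)
    # Zero mean and unit norm: an assumption, not specified in this excerpt.
    g -= g.mean()
    return g / (np.linalg.norm(g) + 1e-12)

# Example: the smaller filter of band 1 at horizontal orientation (Table 1).
f = gabor_filter(size=7, wavelength=3.5, sigma=2.8, theta=0.0)
```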
We adjusted the filter parameters, i.e., orientation θ, effective width σ, and wavelength λ, so that the tuning profiles of S1 units match those of V1 parafoveal simple cells. This was done by first sampling the space of parameters and then generating a large number of filters. We applied those filters to stimuli commonly used to probe V1 neurons [8] (i.e., gratings, bars and edges). After removing filters that were incompatible with biological cells [8], we were left with a final set of 16 filters at 4 orientations (see Table 1 and [19] for a full description of how those filters were obtained).
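Applying the resulting battery to an image yields the S1 maps; the sketch below assumes a full-wave rectified (absolute-value) response, which is not spelled out in this excerpt.

```python
import numpy as np
from scipy.signal import convolve2d

def s1_maps(image, filters):
    # S1 stage: filter the input image with every Gabor filter in the
    # battery; taking the absolute value of each response (rectification)
    # is our assumption.
    return [np.abs(convolve2d(image, f, mode="same")) for f in filters]
```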
The next stage – C1 – corresponds to complex cells which show some tolerance to shift and size: complex cells tend to have larger receptive fields (twice as large as simple cells), respond to oriented bars or edges anywhere within their receptive field [8] (shift invariance) and are in general more broadly tuned to spatial frequency than simple cells [8] (scale invariance). Modifying the original Hubel & Wiesel proposal for building complex cells from simple cells through pooling [8], Riesenhuber & Poggio proposed a max-like pooling operation for building position- and scale-tolerant C1 units.
Given an input image I, perform the following steps:

S1: Apply a battery of Gabor filters to the input image. The filters come in 4 orientations θ and 16 scales s (see Table 1). Obtain 16 × 4 = 64 maps (S1)sθ that are arranged in 8 bands (e.g., band 1 contains filter outputs of size 7 and 9, in all four orientations, band 2 contains filter outputs of size 11 and 13, etc.).

C1: For each band, take the max over scales and positions: each band member is sub-sampled by taking the max over a grid with cells of size NΣ first and the max between the two scale members second, e.g., for band 1, a spatial max is taken over an 8 × 8 grid first and then across the two scales (size 7 and 9). Note that we do not take a max over different orientations, hence, each band (C1)Σ contains 4 maps.

During training only: Extract K patches Pi=1,...,K of various sizes ni × ni and all four orientations (thus containing ni × ni × 4 elements) at random from the (C1)Σ maps from all training images.

S2: For each C1 image (C1)Σ, compute Y = exp(−γ ||X − Pi||²) for all image patches X (at all positions) and each patch Pi learned during training, for each band independently. Obtain S2 maps (S2)Σi.

C2: Compute the max over all positions and scales for each S2 map type (S2)i (i.e., corresponding to a particular patch Pi) and obtain shift- and scale-invariant C2 features (C2)i, for i = 1, ..., K.

Figure 1. Computation of C2 features.
Figure 2. Scale- and position-tolerance at the complex cells (C1) level: Each C1 unit receives inputs from S1 units at the same preferred orientation arranged in bands Σ, i.e., S1 units in two different sizes and neighboring positions (grid cell of size NΣ × NΣ). From each grid cell (left) we obtain one measurement by taking the max over all positions, allowing the C1 unit to respond to a horizontal edge anywhere within the grid (tolerance to shift). Similarly, by taking a max over the two sizes (right), the C1 unit becomes tolerant to slight changes in scale.
In the meantime, experimental evidence in favor of the max operation has appeared [6, 9]. Again, pooling parameters were set so that C1 units match the tuning properties of complex cells as measured experimentally (see Table 1 and [19] for a full description of how those parameters were obtained).
Fig. 2 illustrates how pooling from S1 to C1 is done. S1 units come in 16 scales s arranged in 8 bands Σ. For instance, consider the first band Σ = 1. For each orientation, it contains two S1 maps: one obtained using a filter of size 7, and one obtained using a filter of size 9. Note that both of these S1 maps have the same dimensions. In order to obtain the C1 responses, these maps are sub-sampled using a grid cell of size NΣ × NΣ = 8 × 8. From each grid cell we obtain one measurement by taking the maximum of all elements. As a last stage we take a max over the two scales, by considering for each cell the maximum value from the two maps. This process is repeated independently for each of the four orientations and each scale band.
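A minimal sketch of this two-step pooling for one band and one orientation, assuming non-overlapping grid cells (the excerpt does not say whether adjacent cells overlap):

```python
import numpy as np

def c1_pool(s1_scale_a, s1_scale_b, grid_size):
    # Spatial max over each grid_size x grid_size cell (tolerance to shift)...
    h, w = s1_scale_a.shape
    out_h, out_w = h // grid_size, w // grid_size
    pooled = np.empty((2, out_h, out_w))
    for k, s1 in enumerate((s1_scale_a, s1_scale_b)):
        for i in range(out_h):
            for j in range(out_w):
                cell = s1[i * grid_size:(i + 1) * grid_size,
                          j * grid_size:(j + 1) * grid_size]
                pooled[k, i, j] = cell.max()
    # ...then max across the band's two scales (tolerance to scale).
    return pooled.max(axis=0)
```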
In our new version of the standard model the subsequent S2 stage is where learning occurs. A large pool of K
patches of various sizes at random positions are extracted from a target set of images at the C1 level for all orientations, i.e., a patch Pi of size ni × ni contains ni × ni × 4 elements, where the factor 4 corresponds to the four possible S1 and C1 orientations. In our simulations we used patches of size ni = 4, 8, 12 and 16 but in practice any size can be considered. The training process ends by setting each of those patches as prototypes or centers of the S2 units which behave as radial basis function (RBF) units during recognition, i.e., each S2 unit response depends in a Gaussian-like way on the Euclidean distance between a new input patch (at a particular location and scale) and the stored prototype. This is consistent with well-known neuron response properties in primate inferotemporal cortex and seems to be the key property for learning to generalize in the visual and motor systems [15]. When a new input is presented, each stored S2 unit is convolved with the new (C1)Σ input image at all scales (this leads to K × 8 (S2)Σi images, where the factor K corresponds to the K patches extracted during learning and the factor 8, to the 8 scale bands). After taking a final max for each (S2)i map across all scales and positions, we get the final set of K shift- and scale-invariant C2 units. The size of our final C2 feature vector thus depends only on the number of patches extracted during learning and not on the input image size. This C2 feature vector is passed to a classifier for final analysis.¹
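The sketch below illustrates this S2/C2 computation for a single stored prototype; the tuning width γ is our placeholder, since its value is not given in this excerpt, and the dense loop over positions stands in for the convolution described above.

```python
import numpy as np

def c2_response(c1_bands, prototype, gamma=1.0):
    # c1_bands: list of (H, W, 4) C1 arrays, one per scale band.
    # prototype: (n, n, 4) patch P_i extracted during training.
    n = prototype.shape[0]
    best = -np.inf
    for c1 in c1_bands:
        H, W, _ = c1.shape
        for i in range(H - n + 1):
            for j in range(W - n + 1):
                patch = c1[i:i + n, j:j + n, :]
                # S2 unit: Gaussian-like tuning to the Euclidean distance
                # between the input patch and the stored prototype.
                s2 = np.exp(-gamma * np.sum((patch - prototype) ** 2))
                best = max(best, s2)
    return best  # C2: max over all positions and scale bands
```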
An important question for both neuroscience and computer vision regards the choice of the unlabeled target set from which to learn – in an unsupervised way – this vocabulary of visual features. In this paper, features are learned from the positive training set for each object category (but see [20] for a discussion on how features could be learned from random natural images).
¹ It is likely that our (non-biological) final classifier could correspond to the task-specific circuits found in prefrontal cortex (PFC) and C2 units with neurons in inferotemporal (IT) cortex [16]. The S2 units could be located in V4 and/or in posterior inferotemporal (PIT) cortex.
Figure 3. Examples from the MIT face and car datasets.

Datasets            Bench.        C2 features
                                  boost    SVM
Leaves (Calt.)      84.0 [24]     97.0     95.9
Cars (Calt.)        84.8 [4]      99.7     99.8
Faces (Calt.)       96.4 [4]      98.2     98.1
Airplanes (Calt.)   94.0 [4]      96.7     94.9
Moto. (Calt.)       95.0 [4]      98.0     97.4
Faces (MIT)         90.4 [7]      95.9     95.3
Cars (MIT)          75.4 [11]     95.1     93.3

Table 2. C2 features vs. other recognition systems (Bench.).
3 Experimental Setup

We tested our system on various object categorization tasks for comparison with benchmark computer vision systems. All datasets we used are made up of images that either contain or do not contain a single instance of the target object; the system has to decide whether the target object is present or absent.
MIT-CBCL datasets: These include a near-frontal (±30°) face dataset for comparison with the component-based system of Heisele et al. [7] and a multi-view car dataset for comparison with [11]. These two datasets are very challenging (see typical examples in Fig. 3). The face patterns used for testing constitute a subset of the CMU PIE database which contains a large variety of faces under extreme illumination conditions (see [7]). The test non-face patterns were selected by a low-resolution LDA classifier as the most similar to faces (the LDA classifier was trained on an independent 19 × 19 low-resolution training set). The full set used in [7] contains 6,900 positive and 13,700 negative 70 × 70 images for training and 427 positive and 5,000 negative images for testing. The car database on the other hand was created by taking street scene pictures in the Boston city area. Numerous vehicles (including SUVs, trucks, buses, etc.) photographed from different view-points were manually labeled from those images to form a positive set. Random image patterns at various scales that were not labeled as vehicles were extracted and used as the negative set. The car dataset used in [11] contains 4,000 positive and 1,600 negative 120 × 120 training examples and 3,400 test examples (half positive, half negative). While we tested our system on the full test sets, we considered a random subset of the positive and negative training sets containing only 500 images each for both the face and the car database.

The Caltech datasets: The Caltech datasets contain 101 objects plus a background category (used as the negative set) and are available at http://www.vision.caltech.edu. For each object category, the system was trained with n = 1, 3, 6, 15, 30 or 40 positive examples from the target object class (as in [3]) and 50 negative examples from the background class. From the remaining images, we extracted 50 images
from the positive and 50 images from the negative set to test the system's performance. As in [3], the system's performance was averaged over 10 random splits for each object category. All images were normalized to 140 pixels in height (width was rescaled accordingly so that the image aspect ratio was preserved) and converted to gray values before processing. These datasets contain the target object embedded in a large amount of clutter and the challenge is to learn from unsegmented images and discover the target object class automatically. For a close comparison with the system by Fergus et al. we also tested our approach on a subset of the 101-object dataset using the exact same split as in [4] (the results are reported in Table 2) and an additional leaf database as in [24] for a total of five datasets that we refer to as the Caltech datasets in the following.
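As an illustration of this preprocessing, here is a sketch using Pillow; the bilinear resampling and the scaling of gray values to [0, 1] are our choices, not specified in the text.

```python
import numpy as np
from PIL import Image

def load_caltech_image(path, target_height=140):
    # Convert to gray values and rescale to 140 pixels in height,
    # preserving the aspect ratio (the resampling filter and the
    # [0, 1] gray-value range are our assumptions).
    img = Image.open(path).convert("L")
    w, h = img.size
    img = img.resize((max(1, round(w * target_height / h)), target_height),
                     Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0
```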
4 Results
Table 2 contains a summary of the performance of the C2 features when used as input to a linear SVM and to gentleAdaBoost (denoted boost) on various datasets. For both our system and the benchmarks, we report the error rate at the equilibrium point, i.e., the error rate at which the false positive rate equals the miss rate. Results obtained with the C2 features are consistently higher than those previously reported on the Caltech datasets. Our system seems to outperform the component-based system presented in [7] (also using SVM) on the MIT-CBCL face database as well as a fragment-based system implemented by [11] that uses template-based features with gentleAdaBoost (similar to [21]).
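A small sketch of how this equilibrium-point measure can be computed from classifier scores (our own helper, not code from the paper):

```python
import numpy as np

def equilibrium_error(pos_scores, neg_scores):
    # Sweep the decision threshold and keep the point where the
    # false-positive rate is closest to the miss rate.
    best_gap, eq_err = np.inf, 1.0
    for t in np.sort(np.concatenate([pos_scores, neg_scores])):
        miss = np.mean(pos_scores < t)   # positives classified as negative
        fp = np.mean(neg_scores >= t)    # negatives classified as positive
        if abs(miss - fp) < best_gap:
            best_gap, eq_err = abs(miss - fp), (miss + fp) / 2
    return eq_err
```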
Fig. 4 summarizes the system performance on the 101-object database. On the left we show the results obtained using our system with gentleAdaBoost (we found qualitatively similar results with a linear SVM) over all 101 categories for 1, 3, 6, 15, 30 and 40 positive training examples (each result is an average of 10 different random splits). Each plot is a single histogram of all 101 scores, obtained using a fixed number of training examples (e.g., with 40 examples the system gets 95% correct for 42% of the object categories). On the right we focus on some of the same object categories as the ones used by Fei-Fei et al. for illustration in [3]: the C2 features achieve error rates very similar to the ones reported in [3] with very few training examples.

Figure 4. C2 features performance on the 101-object database for different numbers of positive training examples: (left) histogram across the 101 categories and (right) performance on sample categories, see accompanying text.

Figure 5. Superiority of the C2 vs. SIFT-based features on the Caltech datasets for different number of features (left) and on the 101-object database for different number of training examples (right).
We also compared our C2 features to SIFT-based features [12]. We selected 1000 random reference key-points from the training set. Given a new image, we measured the minimum distance between all its key-points and the 1000 reference key-points, thus obtaining a feature vector of size 1000 (for this comparison we did not use the position information recovered by the algorithm). While Lowe recommends using the ratio of the distances between the nearest and the second closest key-point as a similarity measure, we found that the minimum distance leads to better performance than the ratio on these datasets. A comparison between the C2 features and the SIFT-based features (both passed to a gentleAdaBoost classifier) is shown in Fig. 5 (left) for the Caltech datasets. The gain in performance obtained by using the C2 features relative to the SIFT-based features is obvious. This is true with gentleAdaBoost – used for classification in Fig. 5 (left) – but we also found
very similar results with SVM. Also, as one can see in Fig. 5 (right), the performance of the C2 features (error at equilibrium point) for each category from the 101-object database is well above that of the SIFT-based features for any number of training examples.
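The sketch below computes this 1000-dimensional min-distance feature vector from descriptor matrices; the SIFT descriptors themselves would come from a key-point detector, which we treat as given.

```python
import numpy as np

def sift_min_distance_features(image_desc, reference_desc):
    # image_desc: (m, 128) descriptors of the new image's key-points.
    # reference_desc: (1000, 128) descriptors of the reference key-points.
    # Pairwise squared Euclidean distances, shape (1000, m).
    d2 = (np.sum(reference_desc ** 2, axis=1)[:, None]
          + np.sum(image_desc ** 2, axis=1)[None, :]
          - 2.0 * reference_desc @ image_desc.T)
    # One feature per reference key-point: distance to the closest
    # key-point of the image (position information is discarded).
    return np.sqrt(np.clip(d2, 0, None)).min(axis=1)
```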
Finally, we conducted initial experiments on the multiple classes case. For this task we used the 101-object dataset. We split each category into a training set of size 15 or 30 and a test set containing the rest of the images. We used a simple multiple-class linear SVM as classifier. The SVM applied the all-pairs method for multiple label classification, and was trained on 102 labels (101 categories plus the background category, i.e., 102 AFC). The number of C2 features used in these experiments was 4075. We obtained a correct classification rate above 35% when using 15 training examples per class, averaged over 10 repetitions, and a 42% correct classification rate when using 30 training examples (chance is below 1%).
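A minimal reproduction of this setup with scikit-learn (not used in the paper; its SVC trains one binary classifier per pair of classes, matching the all-pairs rule) on stand-in data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for the real data: 4075 C2 features per image, 102 labels
# (101 categories plus background); shapes only, values are random.
X_train = rng.normal(size=(300, 4075))
y_train = rng.integers(0, 102, size=300)
X_test = rng.normal(size=(100, 4075))
y_test = rng.integers(0, 102, size=100)

# SVC's multiclass strategy is one-vs-one, i.e., the all-pairs method,
# with prediction by voting among the pairwise classifiers.
clf = SVC(kernel="linear", decision_function_shape="ovo")
clf.fit(X_train, y_train)
print("accuracy:", np.mean(clf.predict(X_test) == y_test))
```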
Figure 6. (left) Sample features learned from different object categories (i.e., first 5 features returned by gentleAdaBoost for each category). Shown are S2 features (centers of RBF units): each oriented ellipse characterizes a C1 (afferent) subunit at matching orientation, while color encodes response strength. (right) Multiclass classification on the 101-object database with a linear SVM.
5 Discussion
This paper describes a new biologically-motivated framework for robust object recognition: our system first computes a set of scale- and translation-invariant C2 features from a training set of images and then runs a standard discriminative classifier on the vector of features obtained from the input image. Our approach exhibits excellent performance on a variety of image datasets and competes with some of the best existing systems.
This system belongs to a family of feedforward models of object recognition in cortex that have been shown to be able to duplicate the tuning properties of neurons in several visual cortical areas. In particular, Riesenhuber & Poggio showed that such a class of models accounts quantitatively for the tuning properties of view-tuned units in inferotemporal cortex (tested with idealized object stimuli on uniform backgrounds), which respond to images of the learned object more strongly than to distractor objects, despite significant changes in position and size [16]. The performance of this architecture on a variety of real-world object recognition tasks (presence of clutter and changes in appearance, illumination, etc.) provides another compelling plausibility proof for this class of models.
While a long-time goal for computer vision has been to build a system that achieves human-level recognition performance, state-of-the-art algorithms have been diverging from biology: for instance, some of the best existing systems use geometrical information about the constitutive parts of objects (constellation approaches rely on both appearance-based and shape-based models and component-based systems use the relative position of the detected components along with their associated detection values). Biology is however unlikely to be able to use geometrical information – at least in the cortical stream dedicated to shape processing and object recognition. The system described in this paper respects the properties of cortical processing (including the absence of geometrical information) while showing performance at least comparable to the best computer vision systems.
The fact that this biologically-motivated model outperforms more complex computer vision systems might at first appear puzzling. The architecture performs only two major kinds of computations (template matching and max pooling) while some of the other systems we have discussed involve complex computations like the estimation of probability distributions [24, 4, 3] or the selection of facial components for use by an SVM [7]. Perhaps part of the model's strength comes from its built-in gradual shift- and scale-tolerance that closely mimics visual cortical processing, which has been finely tuned by evolution over thousands of years. It is also very likely that such hierarchical architectures ease the recognition problem by decomposing the task into several simpler ones at each layer. Finally it is worth pointing out that the set of C2 features that is passed to the final classifier is very redundant, probably more redundant than for other approaches. While we showed that a relatively small number of features (about 50) is sufficient to achieve good error rates, performance can be increased significantly by adding many more features. Interestingly, the number of features needed to reach the ceiling (about 5,000 features) is much larger than the number used by current systems (on the order of 10–100 for [22, 7, 21] and 4–8 for constellation approaches [24, 4, 3]).
Acknowledgments
We would like to thank the anonymous reviewers as well as Antonio Torralba and Yuri Ivanov for useful comments on this manuscript.
This report describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT, as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Sciences & Artificial Intelligence Laboratory (CSAIL).
This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/IM) Contract No. IIS-0085836, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-02092, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1P20MH66239-01A1. Additional support was provided by: Central Research Institute of Electric Power Industry, Center for e-Business (MIT), Daimler-Chrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., ITRI, Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, Mitsubishi Corporation, NEC Fund, Nippon Telegraph & Telephone, Oxygen, Siemens Corporate Research, Inc., Sony MOU, Sumitomo Metal Industries, Toyota Motor Corporation, and WatchVision Co., Ltd.
References

[1] Y. Amit and M. Mascaro. An integrated network for invariant visual detection and recognition. Vision Research, 43(19):2073–2088, 2003.
[2] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 2002.
[3] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR, Workshop on Generative-Model Based Vision, 2004.
[4] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, volume 2, pages 264–271, 2003.
[5] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36:193–201, 1980.
[6] T. J. Gawne and J. M. Martin. Response of primate visual cortical V4 neurons to simultaneously presented stimuli. J. Neurophysiol., 88:1128–1135, 2002.
[7] B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio. Categorization by learning and combining object parts. In NIPS, Vancouver, 2001.
[8] D. Hubel and T. Wiesel. Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J. Neurophys., 28:229–289, 1965.
[9] I. Lampl, D. Ferster, T. Poggio, and M. Riesenhuber. Intracellular measurements of spatial integration and the max operation in complex cells of the cat primary visual cortex. J. Neurophysiol., 92:2704–2713, 2004.
[10] Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR'04. IEEE Press, 2004.
[11] B. Leung. Component-based car detection in street scene images. Master's thesis, EECS, MIT, 2004.
[12] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999.
[13] B. W. Mel. SEEMORE: Combining color, shape and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9(4):777–804, 1997.
[14] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23:349–361, 2001.
[15] T. Poggio and E. Bizzi. Generalization in vision and motor control. Nature, 431:768–774, 2004.
[16] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025, 1999.
[17] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In CVPR, pages 746–751, 2000.
[18] T. Serre, J. Louie, M. Riesenhuber, and T. Poggio. On the role of object-specific features for real world recognition in biological vision. In Biologically Motivated Computer Vision, Second International Workshop (BMCV 2002), pages 387–397, Tuebingen, Germany, 2002.
[19] T. Serre and M. Riesenhuber. Realistic modeling of simple and complex cell tuning in the HMAX model, and implications for invariant object recognition in cortex. Technical Report CBCL Paper 239/AI Memo 2004-017, Massachusetts Institute of Technology, Cambridge, MA, July 2004.
[20] T. Serre, L. Wolf, and T. Poggio. A new biologically motivated framework for robust object recognition. Technical Report CBCL Paper 243/AI Memo 2004-026, Massachusetts Institute of Technology, Cambridge, MA, November 2004.
[21] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.
[22] S. Ullman, M. Vidal-Naquet, and E. Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7):682–687, 2002.
[23] P. Viola and M. Jones. Robust real-time face detection. In ICCV, 2001.
[24] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In ECCV, Dublin, Ireland, 2000.
[25] H. Wersing and E. Korner. Learning optimized features for hierarchical models of invariant recognition. Neural Computation, 15(7), 2003.