
Methodology for Reliable Schema Development and Evaluation of Manual Annotations

Petra S. Bayerl
Applied and Computational Linguistics
Justus-Liebig-University, Germany
Otto-Behaghel-Strasse 10D, 35394 Giessen
Petra.S.Bayerl@psychol.uni-giessen.de

Ulrike Gut
Department of English
Albert-Ludwigs-University, Germany
Fahnenbergplatz, 79085 Freiburg
ulrike.gut@anglistik.uni-freiburg.de

Harald Lüngen
Applied and Computational Linguistics
Justus-Liebig-University, Germany
Otto-Behaghel-Strasse 10D, 35394 Giessen
Harald.Luengen@germanistik.uni-giessen.de

Karsten I. Paul
Organizational and Social Psychology
Friedrich-Alexander-University, Germany
Lange Gasse 20, 90403 Nürnberg
Paul.Karsten@wiso.uni-erlangen.de

ABSTRACT

The quality of manual annotations of linguistic data depends on the use of reliable coding schemas as well as on the ability of human annotators to handle them appropriately. As is well known from a wide range of previous experiences, annotations using highly complex coding schemas often lead to unacceptable annotation quality. Reducing complexity might make schemas easier to handle, but in this way valuable information needed for more sophisticated applications is excluded as well. In order to deal with this problem, we developed a systematic approach to schema development, which allows for developing coding schemas for fine-grained semantic annotations while systematically securing the quality of such annotations. For illustration, we present examples from two projects where text and speech data are annotated.

Keywords

schema development, reliability, kappa, semantic annotation, speech data

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

INTRODUCTION

Despite efforts to automatize annotations of linguistic data [33, 3], manual annotations still play an important role in the compilation of corpora and linguistic research material. The quality of such manual annotations depends on the use of adequate and reliable coding schemas, which define the categories underlying annotations of linguistic data. Their development and evaluation must therefore be seen as one of the major tasks in annotation projects. Schema development is crucial because unreliable schemas may lead to inconsistencies in the labeling of objects not only by a single coder over time, but also among different coders. Both types of inconsistencies indicate a reduced usability of annotated data.

As many annotation projects have shown, highly complex coding schemas in particular are difficult to use and will thus often lead to unacceptably low quality in manual annotations [29, 4]. Most of the time, researchers will choose to reduce the complexity of schemas to make them more manageable for human annotators. [29], for instance, reduced her schema from originally 31 to seven basic categories for these reasons. But even when the evaluation of an annotation schema leads to good results in terms of reliability, the number of categories might still be reduced to yield even better results [20]. This procedure, however, has the severe drawback of also reducing the amount of information which can be represented in linguistic data and which may be valuable for more complex research questions or applications such as information extraction, word-sense disambiguation, document layout, or in the context of the semantic web [24, 32, 27]. Hence, it seems vital to develop a systematic approach to schema development with the aim of creating highly complex yet reliable coding schemas which are nonetheless manageable for human annotators. The approach presented here consists of measuring the reliability of the newly developed schema, systematically considering and identifying sources of unreliability by statistical means, and thus iteratively evaluating and improving the schema. The resulting methodological framework is believed to be fruitful, especially as most previous approaches seem to lack a systematic methodology of schema development and evaluation, which often makes complex schema usage so dissatisfying. Approaches like the one presented by [8] for a dialogue coding scheme still seem to be an exception.

In the remainder of the article, we first describe the basic principles of the methodological framework and thereafter demonstrate its applicability by presenting data from two separate annotation projects.

Critical Issues in Schema Development

The task of annotating linguistic data can be seen as a categorization task in which objects (e.g. morphemes, words, phrases, sentences) have to be assigned to a single category. The assignment of one object (usually) has to be exclusive and independent from the categorization of other objects. To be valuable, the prerequisite of such an assignment is that identical objects will be assembled in the same category, which leads to the following conclusions:

1. The categories of the coding schema must be defined in a way that enables humans to adequately differentiate among them.

2. The schema must be usable in a consistent way by several persons as well as by one person over time.

Although the first point seems rather trivial, it constitutes exactly the way in which most complex coding schemas fail. Especially in the case of semantically close concepts the interpretations of single coders often vary considerably [2, 35]. This leads to the second point, which refers to the question of consistency in manual annotations and is thus directly related to the issue of reliability. Reliability can be defined as "the complex property of a series of observations or of the measuring process that makes it possible to obtain similar results if the measurement is repeated" [15, p. 51].

In contrast to the use of existing schemas, where inconsistencies in annotations are usually attributed to differences in the application of the schema by annotators or even to characteristics of the human annotators, the sources of inconsistency in schema development must be attributed to a lack of reliability of the schema itself. Accordingly, the steps to be taken when reliability is not high enough differ. Whereas in the case of text annotation intensive training and the application of supporting tools are appropriate measures to secure annotation quality [21], in schema development improvement of the coding schema must be the primary goal.

When asking why schemas might be used inconsistently by different coders or even one coder over time, several reasons may play a role. On a theoretical basis, two major problematic aspects in schema development can be differentiated that lead to systematic variance in annotation behavior [16, 31]. First, as stated above, the interpretation of categories might be ambiguous. Secondly, the probability of assigning an object to a category may differ among coders. In addition, annotators might develop new or slightly aberrant coding habits over time. The latter point refers mainly to the reliability of the annotation process, but it can be hypothesized that such aberrations occur mostly when category definitions and boundaries are diffuse.

As a consequence, the actions taken to improve the quality of a schema must be based on the specific reasons responsible for systematic variance in manual annotations. Problematic features in the schema or its application must thus be analyzed thoroughly in order to ensure or improve the reliability of the coding instrument.

OUTLINE OF METHODOLOGY

The above considerations led to the design of a methodological framework for systematic schema development and evaluation. It comprises five successive steps, which are repeated as long as the results are not considered to be sufficient. These steps are

Step 1: let two or more coders annotate a sufficient amount of data with a preliminary coding schema

Step 2: repeat the annotation with the same coders and material after a certain period of time

Step 3: check both annotations for inter- and intra-coder agreement as a measure of reliability

Step 4: identify sources of inconsistency

Step 5: take appropriate steps to improve the coding schema

Methodological considerations concerning the single steps are discussed in the following sections.
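The control flow of the five steps can be sketched as a simple loop. This is only an illustration under stated assumptions: the callables `annotate`, `measure` and `revise`, the stopping threshold of 0.8 and the cap on the number of cycles are hypothetical placeholders, since the framework itself leaves open when results are "sufficient".

```python
def run_evaluation_cycles(schema, annotate, measure, revise,
                          threshold=0.8, max_cycles=5):
    """Repeat Steps 1-5 until agreement is judged sufficient.

    annotate(schema)               -> annotations of all coders (Steps 1 and 2)
    measure(first, second)         -> (inter-coder kappa, test-retest kappa) (Step 3)
    revise(schema, first, second)  -> improved schema (Steps 4 and 5)
    All three callables and the threshold are assumptions of this sketch.
    """
    history = []
    for cycle in range(1, max_cycles + 1):
        first = annotate(schema)             # Step 1: first annotation
        second = annotate(schema)            # Step 2: repeated annotation
        ica, trr = measure(first, second)    # Step 3: reliability
        history.append((cycle, ica, trr))
        if ica >= threshold and trr >= threshold:
            break                            # results considered sufficient
        schema = revise(schema, first, second)  # Steps 4 and 5
    return schema, history
```

The returned history keeps one (cycle, ICA, TRR) entry per evaluation cycle, so the development of agreement across schema revisions stays documented.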

Preliminary Considerations (Step 1)

Before starting the evaluation process some basic facts have to be considered, such as the number of coders and the amount of material to be annotated, in order to give a sufficient data basis for statistical analyses. Concerning the amount of data needed, to our knowledge, there is no evidence of how many observations, dependent on the overall number of categories, should be obtained to get reliable statistics.¹ Literature on kappa (see below) usually mentions at least 100 observations to get sufficient data for calculating significance [13, 12]. [12] propose a much larger amount of data if considerable differences in the number of cases per category are expected.

¹ Some considerations about suitable sample size could be found in case of equal marginal frequencies or with restriction to only two categories [6]. Both cases are not applicable to our problem, however.

Second Annotation (Step 2)

The second annotation is done a certain time after the first using the same material. The amount of time elapsing between the two annotations is not easy to choose, however.

As known from the social sciences, the time between the first and the second testing, i.e. annotation, may influence the result or the degree of agreement between the two measurements [9, 14]. Since linguistic annotations do not deal with personal traits or attitudes that may change over time, the impact might not be as severe as in the context of the social sciences. Still, certain effects of time should be taken into account. For instance, if the two annotations lie too close together, the coders may remember large parts of their former annotation, which leads to overestimations of retest-reliability. Too long a period, on the other hand, might not only cause undesirable delays in annotations. If the annotation process is stopped throughout this period, unrealistically high negative effects could appear as definitions or whole categories of a coding schema tend to be forgotten (especially in the case of highly complex schemas). Obviously, the calculation of the appropriate period of time that should elapse before starting the second annotation is not as trivial as it may first seem. We, unfortunately, do not have a definite answer to what the 'best' span of time is. This part is still open to research.

Measuring Inter- and Intra-Coder Reliability (Step 3)

Types of Reliability

Two ways to measure the reliability of a coding schema seem feasible, known also from the social sciences for the development of rating instruments. The first possibility is the application of a schema by different coders annotating the same material, which leads to the measurement of inter-coder agreement (ICA). The second is its application two or more times by one coder to the same material, which indicates intra-coder agreement or test-retest reliability (TRR) in terms of measurement theory [14]. These two types are comparable to what [18] named reproducibility and stability, respectively. The first approach thus measures consistency among different persons, whereas the second approach measures consistency of one person over time. These measures can be affected by variations in the coding behavior, leading to inconsistencies, which may both be observable in manual annotations without being necessarily attributable to the same sources. Since features of highly complex schemas may induce both kinds of inconsistencies independently, they should hence be analyzed separately.

Calculating Reliability

Inter-coder agreement and intra-coder agreement (test-retest reliability) can both be calculated by the same means. As annotations of linguistic data primarily consist of annotations with mutually exclusive categories without any ordering (i.e. nominal data), calculations can be done with the κ (kappa) coefficient developed by [10]. κ measures the agreement between two coders while correcting for chance agreement, which is the reason why it should be preferred to the mere calculation of the percentage of agreement [7]. The value of the resulting kappa coefficient indicates the degree of agreement. For the interpretation of the resulting kappa one may refer to rules of thumb like those given by [19], where 0 ≤ κ < 0.2 means slight agreement, 0.2 ≤ κ < 0.4 fair agreement, 0.4 ≤ κ < 0.6 moderate agreement, 0.6 ≤ κ < 0.8 substantial agreement, and 0.8 ≤ κ < 1.0 almost perfect agreement. In spite of several problems with this measure of agreement [1], kappa has the advantage of being widely accepted and easy to calculate.
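As a minimal sketch of the calculation, kappa can be computed as (observed agreement − chance agreement) / (1 − chance agreement), where chance agreement is derived from each coder's category frequencies. The function names and the toy label sequences below are illustrative assumptions; the label for negative values is likewise an assumption, as the bands quoted above start at 0.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Proportion of objects on which both annotations agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from the two marginal distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined if only one category is used")
    return (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa):
    """Verbal label following the rule-of-thumb bands quoted in the text."""
    if kappa < 0.0:
        return "below chance"  # assumption: not covered by the quoted bands
    bands = ((0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
             (0.8, "substantial"))
    for upper, label in bands:
        if kappa < upper:
            return label
    return "almost perfect"


# Toy data: the same six objects labeled by two coders (ICA),
# or by one coder at two points in time (TRR).
first = ["A", "A", "B", "B", "C", "C"]
second = ["A", "A", "B", "C", "C", "C"]
kappa = cohen_kappa(first, second)
print(round(kappa, 2), landis_koch_label(kappa))
```

The same function serves both measures: passing the first annotations of two coders yields inter-coder agreement, passing the first and second annotation of one coder yields test-retest reliability.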

Identifying Sources of Unreliability (Step 4)

To detect possible reasons for lack of agreement, we decided to test the homogeneity of marginal distributions, which can be seen as an indicator of whether coders have different interpretations of the meaning of categories or whether coders just use single categories with different frequencies [31]. The comparison is based on differences in coders' assignments to single categories, e.g. the number of times coder 1 assigned an object to category A (0 times) compared with the number of times coder 2 used category A (1 time) (cp. table 1). Homogeneity is assumed when the distributions, i.e. marginals, do not differ significantly.

Categories           Categories, Coder 2
Coder 1        A     B     C     D     E     F     G      Σ
A              0     0     0     0     0     0     0      0
B              0     1     0     0     0     1     0      2
C              0     4    26     0     1     0     3     34
D              0     0     0     0     0     0     0      0
E              1     1     1     1    46     0     3     53
F              0     0     0     1     0     2     1      4
G              0     5     2     0     3     2    13     25
Σ              1    11    29     2    50     5    20    118

Table 1: Assignment decisions of two coders (imaginary data)

For two coders this check can be done with the non-parametric test by Stuart and Maxwell [28, 22], which calculates the overall homogeneity over all categories. Significance of the test indicates that marginal homogeneity is not given, and thus a different interpretation of categories must be assumed. Furthermore it is important to identify the problematic categories, i.e. those which are interpreted differently by the coders. This can be done with the aid of the McNemar test [23]. This test considers the marginal distributions of schemata with only two categories, i.e. the category under consideration and a compound in which the remaining categories are joined together. Significance of the test indicates different interpretations of the category under consideration. For the calculation of both statistics we resorted to the MH program developed by Uebersax. The tool can be obtained as freeware from http://ourworld.compuserve.com//homepages/jsuebersax/mh.htm.
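Both tests can be sketched directly from a square agreement table. The code below is not the MH program; it is a plain re-implementation under stated assumptions: the McNemar statistic is the uncorrected chi-square (b − c)² / (b + c) with one degree of freedom, the Stuart-Maxwell statistic is d' S⁻¹ d, and categories on which the coders never disagree are left out because they would make S singular. All function names are illustrative.

```python
import math


def marginals(table):
    """Row totals (coder 1) and column totals (coder 2) of a square table."""
    return [sum(r) for r in table], [sum(c) for c in zip(*table)]


def mcnemar_one_vs_rest(table, k):
    """McNemar chi-square (1 df, no continuity correction) for category k
    against a compound of all remaining categories."""
    rows, cols = marginals(table)
    b = rows[k] - table[k][k]  # coder 1 chose k, coder 2 chose another one
    c = cols[k] - table[k][k]  # coder 2 chose k, coder 1 chose another one
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p


def stuart_maxwell(table):
    """Stuart-Maxwell statistic d' S^-1 d and its degrees of freedom."""
    rows, cols = marginals(table)
    # Assumption of this sketch: skip categories without any disagreement.
    active = [i for i in range(len(table))
              if rows[i] + cols[i] - 2 * table[i][i] > 0]
    keep = active[:-1]  # one active category is redundant
    m = len(keep)
    if m == 0:
        return 0.0, 0
    d = [float(rows[i] - cols[i]) for i in keep]
    aug = []  # augmented matrix [S | d]
    for i in keep:
        row = [float(rows[i] + cols[i] - 2 * table[i][i]) if i == j
               else -float(table[i][j] + table[j][i]) for j in keep]
        aug.append(row + [float(rows[i] - cols[i])])
    # Solve S x = d by Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for cc in range(col, m + 1):
                aug[r][cc] -= factor * aug[col][cc]
    x = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (aug[i][m] - tail) / aug[i][i]
    return sum(di * xi for di, xi in zip(d, x)), m


# Imaginary data of table 1 (rows: coder 1, columns: coder 2, categories A-G).
table1 = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 4, 26, 0, 1, 0, 3],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 46, 0, 3],
    [0, 0, 0, 1, 0, 2, 1],
    [0, 5, 2, 0, 3, 2, 13],
]
overall_stat, overall_df = stuart_maxwell(table1)
per_category = {name: mcnemar_one_vs_rest(table1, k)
                for k, name in enumerate("ABCDEFG")}
```

For the imaginary data, category B is used twice by coder 1 but eleven times by coder 2, which gives b = 1 and c = 10 and hence a chi-square of 81/11, roughly 7.36. With small counts like these an exact binomial version of the McNemar test may be preferable to the chi-square approximation used here.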

CASE 1: CODING SCHEMA FOR SEMANTIC TEXT ANNOTATIONS

Setting

The methodological framework for schema development was developed within an annotation project at the University of Giessen. The aim of this project is the analysis of the semantics of document structures.² For this purpose, English and German scientific articles are manually annotated on multiple levels, namely the structural and two semantic levels, called rhetorical and thematic. The thematic structure of the article describes the 'text world' that is referred to by the article; the article's rhetorical structure describes the rhetorical relations that hold between the discourse units of the article.

² Project C1/SemDoc, DFG-Forschergruppe 437/Texttechnologische Informationsmodellierung. For more detailed information about the project see http://www.text-technology.de/

While we could resort to existing coding schemas for the structural and rhetorical level, which only needed to be adjusted to our purposes, the thematic schema had to be developed nearly from scratch. Using existing schemas [17, 30] as well as analyses of sample scientific articles as a starting point, we compiled a coding schema of originally 71 topics such as method, history, and inducements. By applying the schema to a wide range of documents, it was extended to presently 121 different topics. Some of these categories represent very subtle semantic differences (see table 3), which made it necessary to accomplish the annotation task manually. The annotation itself is done by hand in an XML format in the style of [25]. A small part of an annotated document is shown in figure 2.

Category           Definition
assumption         theoretical assumption or supposition by the author
theoreticalBasis   well established theoretical knowledge in the research area
hypothesis         concrete formulation of a statistically testable assumption, which is to be either corroborated or refuted by the results of the study

Table 3: Examples of category definitions

Guidelines defining the topics and clarifying problematic cases were written. At the beginning of the annotation process the quality, measured in terms of inter-coder agreement, was very low. We obtained agreement rates between κ = .09 and κ = .50 (m = .22) for two coders each annotating the same six documents. Since the annotation quality did not improve much during the following annotation sessions, we attributed the problem to the annotation schema itself. Since we did not want to reduce our schema, in order to retain as much information as possible, we decided to develop a methodology to improve the usability of the coding schema instead.

First Evaluation Cycle

Steps 1 and 2: Annotations

As three separate annotation levels are used in the project, the number of annotators for each level was kept to the minimum of two. In order to meet the requirements for kappa (see above) and to ensure a more or less even distribution in the probability of occurrences of topics, we decided to annotate two complete scientific articles for each evaluation cycle. The chosen articles for the first annotation cycle contained between 102 and 192 segments to be annotated, leading to an average number of 293 annotated topics for each coder. The two coders annotated independently from each other. The second annotation was done approximately two weeks after the first.

Step 3: Calculating Reliability

For the two documents in our first evaluation cycle we obtained kappas at the slight to moderate level of agreement (see table 4), which clearly could not be considered satisfying. Inter-coder agreement was calculated with data from the first annotations of each coder. The rather low kappa coefficients led to the question of why such a low agreement was obtained and, in turn, where the causes for the lack of agreement could be found.

                    TRR
          ICA       coder 1   coder 2
text 1    .18       .45       .64
text 2    .26       .55       .62
mean      .22       .50       .63
ICA: inter-coder agreement; TRR: test-retest reliability

Table 4: Degrees of agreement at the first evaluation cycle [kappa-values]

              Number                Significance
Category      Coder 1   Coder 2    Level
2             4         10         0.031
3             9         0          0.004
5             45        79         0.000
6             19        0          0.000
7             19        0          0.000
11            9         0          0.004
12            6         18         0.011
20            9         20         0.001

Table 5: Differently interpreted categories

Step 4: Identifying Sources of Unreliability

As our results from the first evaluation cycle show, interpretations turned out to differ considerably. The Stuart-Maxwell test was highly significant (χ² = 90.42; p < 0.001). The McNemar test for single categories showed that in the first evaluation cycle eight categories were interpreted differently by the two coders (table 5). Two differences, however, occurred because coder 1 introduced new topics which were therefore not known to coder 2 (categories 6 and 7).

By further checking the types and number of categories annotated by both coders, we found that the first coder annotated many more different categories than the second coder. In text 1 and text 2 the first coder annotated 71 and 51 categories, respectively, whereas coder 2 chose 39 categories in text 1 and 34 categories in text 2. In this light the higher TRR values of coder 2 do not seem so surprising anymore.

Step 5: Adjustment of the Schema

Starting from the statistical evidence, we now began to adjust our coding schema. First, we discussed the problematic categories from table 5 with the annotators to clarify their understanding. Definitions were adjusted and fixed in the annotation guidelines, as in the case of category 23 (table 6). The two newly invented categories 6 and 7 were dropped because discussion showed that they could be subsumed in two existing categories. The differences in annotation behavior of the two coders concerning the use of a different number of categories were also discussed and more rigorous guidelines established.

In these situations, it does not matter whether he or she mentions any information during discussion.
However, Stasser (1988; see also Stasser et al., 1989) identified a type of information distribution in which the best decision is not apparent to members prior to discussion.
This is termed a hidden profile.

Table 2: Part of an annotation at the thematic level

Category        Definition
textual (old)   statements of the author's intentions or about the organization of text or text parts
textual (new)   statements of the author's intentions or about the organization of text or text parts, also information for further reading; table captions are excluded

Table 6: Adaptation of definition for category 23

                    TRR
          ICA       coder 1   coder 2
text 1    .44       .80       .55
text 2    .40       .74       .64
mean      .42       .77       .60
ICA: inter-coder agreement; TRR: test-retest reliability

Table 7: Degrees of agreement at the second evaluation cycle [kappa-values]

Second Evaluation Cycle

After the modifications of the coding schema a new evaluation cycle started, which included the same steps as described above. In the second evaluation cycle we obtained the results stated in table 7. ICA values were nearly twice as high as in cycle 1. TRR values for coder 1 also increased considerably. (Data for the second annotation of coder 2 was not available in time, but will be ready shortly.) According to [19] the test-retest reliability for coder 1 could now be considered as substantial to almost perfect, indicating that the schema may be used consistently over time by a single coder. Inter-coder agreement rose from fair to moderate.

The test for marginal homogeneity was still highly significant (χ² = 153.02; p < 0.001). The comparison of single categories, however, showed that three instead of the former eight categories were not used in accordance (table 8). Hence, further evaluation cycles will follow in the near future to further improve the coding schema.

              Number                Significance
Category      Coder 1   Coder 2    Level
12            0         15         0.000
23            1         9          0.011
25            10        2          0.021

Table 8: Differently interpreted categories

CASE 2: EVALUATION OF ANNOTATIONS OF SPEECH DATA

We also tested our methodological approach for coding schema evaluation with data from another project. The LeaP (http://leap.lili.uni-bielefeld.de) project is concerned with the acquisition of prosody by foreign language learners and has set up a large corpus of annotated speech files. These were annotated by six coders using a six-tier coding schema. On the first tier, type of phrases (e.g. complete, interrupted) and intervening non-speech events such as laughter and noise are coded. The second tier consists of an orthographic annotation of words. On the third tier, syllables are annotated in SAMPA [34], and on the fourth tier vowel and consonant boundaries are annotated. On tier 5, tones are annotated using the ToBI [26] system, and on the sixth tier, initial highs, final lows and intermediate highs and lows of pitch are marked. For one speech file, an average of 1000 annotations are carried out. All annotators were trained for two months at the beginning of the project.

For the calculation of inter-coder agreement one speech file consisting of 368 words was annotated separately by three to four annotators. For a measure of overall agreement the median of all pairwise comparisons per tier (kappa values) was calculated. Since orthographic environment, i.e. words and syllables, cannot be considered as categories, no agreement was calculated for the second and third tier. The results of pairwise and overall agreement for each tier are shown in table 9. Kappa values clearly indicate that certain tiers are more difficult to annotate in agreement than others, e.g. tones and phrases. These differences seem attributable mainly to the complexity of the underlying schemas, as the numbers of categories for tiers 1, 4, 5 and 6 are seven (phrases), three (vowels), 34 (tones) and four (pitch), respectively. For the calculation of retest-reliability the first file annotated was annotated again two years later by each coder. Results for tier 1 and tier 4 show that kappa values are on a moderate level of agreement (table 10). In the light of the long period of time that elapsed between the first and the second annotation this must still be seen as a rather good result.
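The overall agreement per tier is simply the median over all available coder pairs. A minimal sketch, using the pairwise kappa values reported in table 9 (the dictionary layout and variable names are illustrative assumptions):

```python
from statistics import median

# Pairwise kappa values per tier in the order 1-2, 1-3, 1-4, 2-3, 2-4, 3-4.
# For the pitch tier only three coder pairs are available.
pairwise_kappa = {
    "1-phrases": [0.40, 0.39, 0.43, 0.57, 0.63, 0.60],
    "4-vowels": [0.46, 0.46, 0.52, 0.46, 0.46, 0.49],
    "5-tones": [0.21, 0.20, 0.29, 0.30, 0.35, 0.25],
    "6-pitch": [0.58, 0.68, 0.62],
}

# Median of all pairwise comparisons as the overall agreement of a tier.
overall = {tier: round(median(values), 2)
           for tier, values in pairwise_kappa.items()}

for tier, value in overall.items():
    print(f"{tier}: {value:.2f}")
```

The median is less sensitive than the mean to a single deviating coder pair, which matters when only three to six pairs are available per tier.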

               Coder Pair
Tier           1-2    1-3    1-4    2-3    2-4    3-4    Median
1-phrases      .40    .39    .43    .57    .63    .60    .50
4-vowels       .46    .46    .52    .46    .46    .49    .46
5-tones        .21    .20    .29    .30    .35    .25    .27
6-pitch        .58    .68    –      .62    –      –      .62

Table 9: Inter-coder agreement at different annotation tiers [kappa-values]

               Coder
Tier           1      2      3      4
1-phrases      .53    .24    .51    .65
4-vowels       .58    .35    .46    .53

Table 10: Retest-reliability at different annotation tiers [kappa-values]

An evaluation of the reasons for disagreement will be presented here only for the pair coder 1–coder 3 in the first tier (inter-coder agreement), since this is the pair with the lowest agreement on this level. Procedure and interpretation are identical to those described in case 1. As the only marginally significant Stuart-Maxwell test (χ² = 12.404; p < 0.05) suggests, the overall interpretation of categories can be considered as nearly identical. This leads to the conclusion that the differences are attributable primarily to a systematic variance in assigning objects to different categories. Additionally, however, the McNemar test reveals that there is one category in the schema (category 2) that is interpreted differently (χ² = 7.36; p < 0.05). The implication in this case would be to first clarify the definition of the problematic category with both coders, and then to resume training with the aim of improving the differentiation between objects.

PRACTICAL PROBLEMS

In applying the methodology to the two projects described above we encountered some practical problems, which might be worth noting, since they are likely to occur in other applications as well and in quite a similar way.

Coder Characteristics

In our case studies we assumed that coder characteristics were stable or had no direct influence on annotation quality. This of course is an overly optimistic view. Individual characteristics of coders such as familiarity with the material, amount of former training, but also motivation and interest may clearly have a varying impact on their work. In both studies we tried to keep these variables as stable as possible by providing equal training for every coder, choosing annotators familiar with the subject or material and giving guidelines for the annotation process aimed at reducing effects of fatigue (e.g. restricting the annotation time to a maximum of three hours per session). Nonetheless, as interaction effects of coder characteristics and coding task cannot be excluded, the choice of a group of similar coders should be aimed for.

Kappa as Measure of Reliability

One of the major problems when employing kappa is that the coefficient depends on the actual marginal distribution [11, 5]. In cases with heterogeneous marginal distributions kappa may not have the originally intended range of −1 to +1, but a more restricted one. This will not only reduce the kappa values obtained, but also the interpretability of the coefficient, since rules of thumb for interpreting the goodness of the coefficient [19] do not apply anymore.

In this case some authors suggest the calculation of the possible maximum that kappa can reach (κmax) with the given marginal distribution [10, 1]. The expression κ/κmax will then lead to a corrected κ with the original range of −1 to +1 [10, 1]. Even though this procedure would have the big advantage of not only tremendously improving the kappa values (see table 11 for an example), but also of restoring the original interpretation of kappa, we refrained from using it in the context of our framework. Severe aberrations from the homogeneity of marginal distributions often indicate underlying problems with the use of the categories. By correcting kappa, valuable information would be discarded.

               Coder Pairs
Tier           1-2    1-3    1-4    2-3    2-4    3-4    Median
1-phrases      .86    .92    .88    .89    .93    .90    .90
4-vowels       .99    1.00   .99    .99    1.00   .99    .99
5-tones        .44    .44    .58    .56    .58    .51    .54
6-pitch        .96    .94    –      1.00   –      –      .96

Table 11: Inter-coder agreement in case 2 [corrected kappa-values]
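To illustrate how strongly the correction depends on the marginals, the following sketch computes κ, κmax and κ/κmax for a square agreement table. It assumes the usual definition of the maximum attainable agreement, namely the sum over categories of the smaller of the two marginal proportions; the function name is illustrative and the data are the imaginary counts of table 1, not project data.

```python
def kappa_and_kappa_max(table):
    """Return (kappa, kappa_max) for a square agreement table
    (rows: coder 1, columns: coder 2)."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]        # marginals of coder 1
    col_p = [sum(col) / n for col in zip(*table)]  # marginals of coder 2
    p_obs = sum(table[i][i] for i in range(len(table))) / n
    p_exp = sum(r * c for r, c in zip(row_p, col_p))
    # Highest agreement the two marginal distributions allow at all.
    p_max = sum(min(r, c) for r, c in zip(row_p, col_p))
    kappa = (p_obs - p_exp) / (1.0 - p_exp)
    kappa_max = (p_max - p_exp) / (1.0 - p_exp)
    return kappa, kappa_max


# Imaginary data of table 1 (categories A-G).
table1 = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 4, 26, 0, 1, 0, 3],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 46, 0, 3],
    [0, 0, 0, 1, 0, 2, 1],
    [0, 5, 2, 0, 3, 2, 13],
]
kappa, kappa_max = kappa_and_kappa_max(table1)
corrected = kappa / kappa_max
print(f"kappa={kappa:.2f}  kappa_max={kappa_max:.2f}  corrected={corrected:.2f}")
```

Reporting κmax next to the uncorrected κ, rather than only the ratio, keeps visible how far the marginal distributions of the two coders diverge, which is the information the framework relies on in Step 4.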

CONCLUSIONS

The aim of the work presented here was to present hands-on experience with the development of highly complex coding schemas for manual annotations of linguistic data. The methodological framework we created in order to solve our problems with poor annotation quality, caused by the high complexity of the annotation task, proved fruitful not only in the context of our original project aiming at the semantic annotation of text documents, but also when transferred to the annotation of speech data. We therefore feel confident that the systematic and iterative process presented here can profitably be applied in other annotation projects where complex coding schemas have to be developed and evaluated.

REFERENCES

1. Brennan, R.L. and Prediger, D.J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.
2. Bruce, R. and Wiebe, J. (1999). Recognizing subjectivity: A case study in manual tagging. Natural Language Engineering, 5(2), 187-205.
3. Bulyko, I. and Ostendorf, M. (2002). A bootstrapping approach to automating prosodic annotation for limited-domain synthesis. Proceedings of the IEEE Workshop on Speech Synthesis, 11-13 September, Santa Monica, California, USA.
4. Butler, T., Fisher, S., Coulombe, G., Clements, P., Brown, S., Grundy, I., Carter, K., Harvey, K. and Wood, J. (2000). Can a team tag consistently? Experiences on the Orlando project. Markup Languages, 2(2), 111-125.
5. Byrt, T., Bishop, J. and Carlin, J.B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46(5), 423-429.
6. Cantor, A.B. (1996). Sample-size calculations for Cohen's kappa. Psychological Methods, 1(2), 150-153.
7. Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249-254.
8. Carletta, J., Isard, A., Isard, S., Kowtko, J.C., Doherty-Sneddon, G. and Anderson, A.H. (1997). The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1), 13-31.
9. Carmines, E.G. and Zeller, R.A. (1979). Reliability and validity assessment. Sage Publications: Beverly Hills and London. Paper series on Quantitative Applications in the Social Sciences, 07-017.
10. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.
11. Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: I. The problem of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543-549.
12. Flack, V.F., Afifi, A.A., Lachenbruch, P.A. and Schouten, H.J.A. (1988). Sample size determinations for the two rater kappa statistics. Psychometrika, 53(3), 321-325.
13. Hanley, J.A. (1987). Standard error of the kappa statistic. Psychological Bulletin, 102(2), 315-321.
14. Helmstadter, G.C. (1964). Principles of psychological measurement. Meredith Publishing: New York.
15. Hollnagel, E. (1993). Human Reliability Analysis: Context and Control. Academic Press: London.
16. Hoyt, W. and Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403-424.
17. Kando, N. (1997). Text-level structure of research papers: Implications for text-based information processing systems. Proceedings of the British Computer Society Annual Colloquium of Information Retrieval Research, Aberdeen, Scotland, 8-9 April 1997, 68-81.
18. Krippendorff, K. (1980). Content analysis: An introduction. Sage Publications: Beverly Hills and London.
19. Landis, J.R. and Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
20. Maier, E. (1997). Evaluating a Scheme for Dialogue Annotation. VERBMOBIL Report 193. DFKI GmbH, Saarbrücken.
21. Marcu, D., Romera, M. and Amorrortu, E. (1999). Experiments in Constructing a Corpus of Discourse Trees: Problems, Annotation Choices, Issues. The Workshop on Levels of Representation in Discourse, Edinburgh, Scotland, 71-87.
22. Maxwell, A. (1970). Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 116, 651-655.
23. McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153-157.
24. Ng, H.T., Lim, C.Y. and Foo, S.K. (1999). A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation. Proceedings of the ACL SIGLEX Workshop: Standardizing Lexical Resources, College Park, Maryland, USA, 21-22 June 1999, 9-13.
25. O'Donnell, M. (2000). RST-Tool 2.4 - A Markup Tool for Rhetorical Structure Theory. Proceedings of the International Natural Language Generation Conference (INLG'2000), Mitzpe Ramon, Israel, 12-16 June 2000, 253-256.
26. Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J. and Hirschberg, J. (1992). ToBI: A standard for labeling English prosody. Proceedings of the 1992 International Conference on Spoken Language Processing, Denver, Colorado, USA, 16-20 September 1992, 867-870.
27. Staab, S., Maedche, A. and Handschuh, S. (2001). Creating metadata for the semantic web: An annotation framework and the human factor. Technical Report 412. Institute AIFB, University of Karlsruhe.
28. Stuart, A. (1955). A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412-416.
29. Teufel, S. (1999). Argumentative zoning: Information extraction from scientific text. PhD Thesis, University of Edinburgh.
30. Teufel, S., Carletta, J. and Moens, M. (1999). An annotation scheme for discourse-level argumentation in research articles. Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), Bergen, 8-12 June 1999.
31. Uebersax, J. (2001). Statistical Methods for Rater Agreement. Available online: http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm.
32. Veronis, J. (2000). Sense tagging: Don't look for the meaning but for the use. Workshop on Computational Lexicography and Multimedia Dictionaries (COMLEX'2000), 22-23 September 2000, Patras, Greece, 1-9.
33. Vorsterman, A., Martens, J.-P. and Coile, B. van. (1996). Automatic segmentation and labeling of multi-lingual speech data. Computational Linguistics, 19(4), 271-293.
34. Wells, J.C., Barry, W., Grice, M., Fourcin, A. and Gibbon, D. (1992). Standard Computer-Compatible Transcription. SAM Stage Report Sen.3 SAM UCL-037, University College London.
35. Wiebe, J.M., Bruce, R.F. and O'Hara, T.P. (1999). Development and use of a gold standard data set for subjectivity classifications. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 20-26 June 1999, University of Maryland, 246-253.
