Evaluation of Manual Annotations

Petra S. Bayerl
Applied and Computational Linguistics, Otto-Behaghel-Strasse 10D, 35394 Giessen
Justus-Liebig-University, Germany
Petra.S.Bayerl@psychol.uni-giessen.de

Ulrike Gut
Department of English
Fahnenbergplatz, 79085 Freiburg, Albert-Ludwigs-University, Germany
ulrike.gut@anglistik.uni-freiburg.de
ABSTRACT
The quality of manual annotations of linguistic data depends on the use of reliable coding schemas as well as on the ability of human annotators to handle them appropriately. As is well known from a wide range of previous experiences, annotations using highly complex coding schemas often lead to unacceptable annotation quality. Reducing complexity might make schemas easier to handle, but in this way valuable information needed for more sophisticated applications is excluded as well. In order to deal with this problem, we developed a systematic approach to schema development, which allows for developing coding schemas for fine-grained semantic annotations while systematically securing the quality of such annotations. For illustration, we present examples from two projects where text and speech data are annotated.
Keywords
schema development, reliability, kappa, semantic annotation, speech data
INTRODUCTION
Despite efforts to automatize annotations of linguistic data [33, 3], manual annotations still play an important role in the compilation of corpora and linguistic research material. The quality of such manual annotations depends on the use of adequate and reliable coding schemas, which define the categories underlying annotations of linguistic data. Their development and evaluation must therefore be seen as one of the major tasks in annotation projects. Schema development is crucial because unreliable schemas may lead to inconsistencies in the labeling of objects not only by a single coder over time, but also to inconsistencies in the labeling among
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
K-CAP'03, October 23-26, 2003, Sanibel Island, Florida, USA. Copyright 2003 ACM 1-58113-583-1/03/0010...$5.00
Harald Lüngen
Applied and Computational Linguistics
Otto-Behaghel-Strasse 10D, 35394 Giessen
Justus-Liebig-University, Germany
Harald.Luengen@germanistik.uni-giessen.de

Karsten I. Paul
Organizational and Social Psychology, Lange Gasse 20, 90403 Nürnberg
Friedrich-Alexander-University, Germany
Paul.Karsten@wiso.uni-erlangen.de
different coders. Both types of inconsistencies indicate a reduced usability of annotated data.
As many annotation projects have shown, especially highly complex coding schemas are difficult to use and will thus often lead to an unacceptably low quality in manual annotations [29, 4]. Most of the time, researchers will choose to reduce the complexity of schemas to make them more manageable for human annotators. [29], for instance, reduced her schema from originally 31 to seven basic categories for these reasons. But even when the evaluation of an annotation schema leads to good results in terms of reliability, the number of categories might still be reduced to yield even better results [20]. This procedure, however, has the severe drawback of also reducing the amount of information which can be represented in linguistic data and which may be valuable for more complex research questions or applications such as information extraction, word-sense disambiguation, document layout, or in the context of the semantic web [24, 32, 27]. Hence, it seems vital to develop a systematic approach to schema development with the aim of creating highly complex, reliable coding schemas which are nonetheless manageable for human annotators. The approach presented here consists of measuring the reliability of the newly developed schema, systematically considering and identifying sources of unreliability by statistical means, and thus iteratively evaluating and improving the schema. The resulting methodological framework is believed to be fruitful, especially as most previous approaches seem to lack a systematic methodology of schema development and evaluation, which often makes complex schema usage so dissatisfying. Approaches like the one presented by [8] for a dialogue coding scheme still seem to be an exception.
In the remainder of the article, we first want to describe the basic principles of the methodological framework and thereafter demonstrate its applicability by presenting data from two separate annotation projects.
Critical Issues in Schema Development
The task of annotating linguistic data can be seen as a categorization task in which objects (e.g. morphemes, words, phrases, sentences) have to be assigned to a single category. The assignment of one object (usually) has to be exclusive and independent from the categorization of other objects. To be valuable, the prerequisite of such an assignment is that identical objects will be assembled in the same category, which leads to the following conclusions:
1. The categories of the coding schema must be defined in a way that enables humans to adequately differentiate among them.
2. The schema must be usable in a consistent way by several persons as well as by one person over time.

Although the first point seems rather trivial, it constitutes exactly the way in which most complex coding schemas fail. Especially in the case of semantically close concepts, the interpretations of single coders often vary considerably [2, 35]. This leads to the second point, which refers to the question of consistency in manual annotations and is thus directly related to the issue of reliability. Reliability can be defined as "the complex property of a series of observations or of the measuring process that makes it possible to obtain similar results if the measurement is repeated" [15, p. 51].
In contrast to the use of existing schemas, where inconsistencies in annotations are usually attributed to differences in the application of the schema by annotators or even to characteristics of the human annotators, the sources of inconsistency in schema development must be attributed to a lack of reliability of the schema itself. Accordingly, the steps to be taken when reliability is not high enough are supposed to be different. Whereas in the case of text annotation intensive training and the application of supporting tools are appropriate measures to secure annotation quality [21], in schema development improvement of the coding schema must be the primary goal.
When asking why schemas might be used inconsistently by different coders or even one coder over time, several reasons may play a role. On a theoretical basis, two major problematic aspects in schema development can be differentiated that lead to systematic variance in annotation behavior [16, 31]. As stated above, the interpretation of categories might be ambiguous. Secondly, the probability of assigning an object to a category may differ among coders. In addition, annotators might develop new or slightly aberrant coding habits over time. The latter point refers mainly to the reliability of the annotation process, but it can be hypothesized that such aberrations occur mostly when category definitions and boundaries are diffuse.
As a consequence, the actions taken to improve the quality of a schema must be based on the specific reasons responsible for systematic variance in manual annotations. Problematic features in the schema or its application must thus be analyzed thoroughly in order to ensure or improve the reliability of the coding instrument.
OUTLINE OF METHODOLOGY
The above considerations led to the design of a methodological framework for systematic schema development and evaluation. It comprises five successive steps, which are repeated as long as the results are not considered to be sufficient. These steps are:
Step 1: let two or more coders annotate a sufficient amount of data with a preliminary coding schema

Step 2: repeat the annotation with the same coders and material after a certain period of time

Step 3: check both annotations for inter- and intra-coder agreement as a measure of reliability

Step 4: identify sources of inconsistency

Step 5: take appropriate steps to improve the coding schema

Methodological considerations concerning the single steps will be discussed further in the following sections.
Preliminary Considerations (Step 1)
Before starting the evaluation process, some basic facts have to be considered, such as the number of coders and the amount of material to be annotated, in order to provide a sufficient data basis for statistical analyses. Concerning the amount of data needed, to our knowledge there is no evidence of how many observations, dependent on the overall number of categories, should be obtained to get reliable statistics.1 Literature on kappa (see below) usually mentions at least 100 observations to get sufficient data for calculating significance [13, 12]. [12] propose a much larger amount of data if considerable differences in the number of cases per category are expected.
Second Annotation (Step 2)
The second annotation is done a certain time after the first, using the same material. The amount of time elapsing between the two annotations is not easy to choose, however.
As known from the social sciences, the time between the first and the second testing, i.e. annotation, may influence the result or the degree of agreement between the two measurements [9, 14]. Since linguistic annotations do not deal with personal traits or attitudes that may change over time, the impact might not be as severe as in the context of the social sciences. Still, certain effects of time should be taken into account. For instance, if the two annotations lie too close together, the coders may remember large parts of their former annotation, which leads to overestimations of retest-reliability. Too long
1 Some considerations about suitable sample size can be found for the case of equal marginal frequencies or with a restriction to only two categories [6]. Neither case is applicable to our problem, however.
a period, on the other hand, might not only cause undesirable delays in annotations. If the annotation process is stopped throughout this period, unrealistically high negative effects could appear, as definitions or whole categories of a coding schema tend to be forgotten (especially in the case of highly complex schemas). Obviously, the calculation of the appropriate period of time that should elapse before starting the second annotation is not as trivial as it may first seem. We, unfortunately, do not have a definite answer to what the 'best' span of time is. This part is still open to research.
Measuring Inter- and Intra-Coder Reliability (Step 3)
Types of Reliability

Two ways to measure the reliability of a coding schema seem feasible, known also from the social sciences for the development of rating instruments. The first possibility is the application of a schema by different coders annotating the same material, which leads to the measurement of inter-coder agreement (ICA). The second is its application two or more times by one coder to the same material, which indicates intra-coder agreement or test-retest reliability (TRR) in terms of measurement theory [14]. These two types are comparable to what [18] named stability and reproducibility. The first approach thus measures consistency among different persons, whereas the second approach measures consistency of one person over time. Both measures can be affected by variations in coding behavior, leading to inconsistencies which may be observable in manual annotations without being necessarily attributable to the same sources. Since features of highly complex schemas may induce both kinds of inconsistencies independently, they should hence be analyzed separately.
Calculating Reliability

Inter-coder agreement and intra-coder agreement (test-retest reliability) can both be calculated by the same means. As annotations of linguistic data primarily involve mutually exclusive categories without any ordering (i.e. nominal data), calculations can be done with the κ (kappa) coefficient developed by [10]. κ measures the agreement between two coders while correcting for chance agreement, which is the reason why it should be preferred to the mere calculation of the percentage of agreement [7]. The value of the resulting kappa coefficient indicates the degree of agreement. For the interpretation of the resulting kappa one may refer to rules of thumb like those given by [19], where 0 ≤ κ < 0.2 means slight agreement, 0.2 ≤ κ < 0.4 fair agreement, 0.4 ≤ κ < 0.6 moderate agreement, 0.6 ≤ κ < 0.8 substantial agreement, and 0.8 ≤ κ ≤ 1.0 almost perfect agreement. In spite of several problems with this measure of agreement [1], kappa has the advantage of being widely accepted and easy to calculate.
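To make the calculation concrete, the κ-coefficient can be computed directly from two coders' label sequences. The following Python sketch is our own minimal illustration (function name and example labels are invented, not taken from either project):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders' labels over the same items (nominal data)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed agreement: proportion of items both coders labeled identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of the coders' marginal proportions per category
    fa, fb = Counter(labels_a), Counter(labels_b)
    p_e = sum(fa[c] * fb.get(c, 0) for c in fa) / n ** 2
    # undefined if p_e == 1 (a degenerate, single-category annotation)
    return (p_o - p_e) / (1 - p_e)

# e.g. two coders agreeing on 3 of 4 items:
# cohens_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"])  -> 0.5
```

With p_o = .75 observed and p_e = .5 expected by chance, κ = .5, i.e. moderate agreement under the rules of thumb of [19]; the raw percentage of agreement (.75) would have overstated the result, which is exactly why the chance correction matters.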
Identifying Sources of Unreliability (Step 4)
To detect possible reasons for lack of agreement, we decided to test the homogeneity of marginal distributions, which can be seen as an indicator of whether coders have different interpretations of the meaning of categories or whether coders
                 Categories - Coder 2
              A    B    C    D    E    F    G    Σ
Categories A  0    0    0    0    0    0    0    0
Coder 1    B  0    1    0    0    0    1    0    2
           C  0    4   26    0    1    0    3   34
           D  0    0    0    0    0    0    0    0
           E  1    1    1    1   46    0    3   53
           F  0    0    0    1    0    2    1    4
           G  0    5    2    0    3    2   13   25
           Σ  1   11   29    2   50    5   20  118

Table 1: Assignment decisions of two coders (imaginary data)
just use single categories with different frequencies [31]. The comparison is based on differences in coders' assignments to single categories, e.g. comparing the number of times coder 1 assigned an object to category A (0 times) with the number of times coder 2 used category A (1 time) (cp. table 1). Homogeneity is assumed when the distributions, i.e. marginals, do not differ significantly.
For two coders this check can be done with the non-parametric test by Stuart and Maxwell [28, 22], which calculates the overall homogeneity over all categories. Significance of the test indicates that marginal homogeneity is not given, and thus a different interpretation of categories must be assumed. Furthermore, it is important to identify the problematic categories, i.e. those which are interpreted differently by the coders. This can be done with the aid of the McNemar test [23]. This test considers the marginal distributions of schemata with only two categories, i.e. the category under consideration and a compound in which the remaining categories are joined together. Significance of the test indicates different interpretations of the category under consideration. For the calculation of both statistics we resorted to the MH program developed by Uebersax. The tool can be obtained as freeware from http://ourworld.compuserve.com//homepages/jsuebersax/mh.htm.
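The per-category comparison can be illustrated with a short sketch: each coder's labels are collapsed into a binary decision ('category X' vs. the compound of all other categories), and only the discordant item pairs enter the statistic. This Python fragment is our own minimal rendering of the classical McNemar χ² (without continuity correction), not the MH program itself:

```python
def mcnemar_chi2(labels_a, labels_b, category):
    """McNemar chi-square (1 df) for one category vs. the compound of the rest.

    b = items coder 1 assigned to `category` but coder 2 did not;
    c = items coder 2 assigned to `category` but coder 1 did not.
    Under marginal homogeneity, b and c should be about equal.
    """
    b = sum(1 for x, y in zip(labels_a, labels_b) if x == category and y != category)
    c = sum(1 for x, y in zip(labels_a, labels_b) if x != category and y == category)
    if b + c == 0:
        return 0.0  # no discordant pairs: nothing to test
    # compare against the chi-square(1) critical value, e.g. 3.84 for p = .05
    return (b - c) ** 2 / (b + c)
```

Running this once per category reproduces the kind of category-by-category diagnosis shown in the tables below; note that agreements (both coders say X, or both say not-X) cancel out and only the asymmetry between the two kinds of disagreement drives the statistic.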
CASE 1: CODING SCHEMA FOR SEMANTIC TEXT ANNOTATIONS

Setting
The methodological framework for schema development was developed within an annotation project at the University of Giessen. The aim of this project is the analysis of the semantics of document structures.2 For this purpose, English and German scientific articles are manually annotated on multiple levels, namely the structural and two semantic levels, called rhetorical and thematic. The thematic structure of the article describes the 'text world' that is referred to by the article; the article's rhetorical structure describes the rhetorical relations that hold between the discourse units of the article.
2 Project C1/SemDoc, DFG-Forschergruppe 437/Texttechnologische Informationsmodellierung. For more detailed information about the project see http://www.text-technology.de/
Category          Definition
assumption        theoretical assumption or supposition by the author
theoreticalBasis  well-established theoretical knowledge in the research area
hypothesis        concrete formulation of a statistically testable assumption, which is to be either corroborated or refuted by the results of the study

Table 3: Examples of category definitions
While we could resort to existing coding schemas for the structural and rhetorical levels, which only needed to be adjusted to our purposes, the thematic schema had to be developed nearly from scratch. Using existing schemas [17, 30] as well as analyses of sample scientific articles as a starting point, we compiled a coding schema of originally 71 topics such as method, history, and inducements. By applying the schema to a wide range of documents, it was extended to presently 121 different topics. Some of these categories represent very subtle semantic differences (see table 3), which made it necessary to accomplish the annotation task manually. The annotation itself is done by hand in an XML format in the style of [25]. A small part of an annotated document is shown in table 2.
Guidelines defining the topics and clarifying problematic cases were written. At the beginning of the annotation process the quality, measured in terms of inter-coder agreement, was very low. We obtained agreement rates between κ = .09 and κ = .50 (m = .22) for two coders each annotating the same six documents. Since the annotation quality did not improve much during the following annotation sessions, we attributed the problem to the annotation schema itself. Since we did not want to reduce our schema, in order to retain as much information as possible, we decided to develop a methodology to improve the usability of the coding schema instead.
First Evaluation Cycle

Steps 1 and 2: Annotations

As three separate annotation levels are used in the project, the number of annotators for each level was kept to the minimum of two. In order to meet the requirements for kappa (see above) and to ensure a more or less even distribution in the probability of occurrence of topics, we decided to annotate two complete scientific articles for each evaluation cycle. The articles chosen for the first annotation cycle contained between 102 and 192 segments to be annotated, leading to an average number of 293 annotated topics for each coder. The two coders annotated independently from each other. The second annotation was done approximately two weeks after the first.
Step 3: Calculating Reliability

For the two documents in our first evaluation cycle we obtained kappas at the slight to moderate level of agreement (see table 4), which clearly could not be considered as satisfying. Inter-coder agreement was
         ICA    TRR
                coder 1  coder 2
text 1   .18    .45      .64
text 2   .26    .55      .62
mean     .22    .50      .63

ICA: inter-coder agreement; TRR: test-retest reliability

Table 4: Degrees of agreement at the first evaluation cycle [kappa values]
Category   Number            Significance
           Coder 1  Coder 2  Level
2          4        10       0.031
3          9        0        0.004
5          45       79       0.000
6          19       0        0.000
7          19       0        0.000
11         9        0        0.004
12         6        18       0.011
20         9        20       0.001

Table 5: Differently interpreted categories
calculated with data from the first annotations of each coder. The rather low kappa coefficients led to the question of why such a low agreement was obtained and, in turn, where the causes for the lack of agreement could be found.
Step 4: Identifying Sources of Unreliability

As our results from the first evaluation cycle show, interpretations turned out to differ considerably. The Stuart-Maxwell test was highly significant (χ² = 90.42; p < 0.001). The McNemar test for single categories showed that in the first evaluation cycle eight categories were interpreted differently by the two coders (table 5). Two of these differences, however, occurred because coder 1 introduced new topics which were therefore not known to coder 2 (categories 6 and 7).
By further checking the types and number of categories annotated by both coders, we found that the first coder annotated many more different categories than the second coder. In text 1 and text 2 the first coder annotated 71 and 51 categories, respectively, whereas coder 2 used only 39 categories in text 1 and 34 categories in text 2. In this light the higher TRR values of coder 2 do not seem so surprising anymore.
Step 5: Adjustment of the Schema

Starting from the statistical evidence, we began to adjust our coding schema. First, we discussed the problematic categories from table 5 with the annotators to clarify their understanding. Definitions were adjusted and fixed in the annotation guidelines, as in the case of category 23 (table 6). The two newly invented categories 6 and 7 were dropped because discussion showed that they could be subsumed under two existing categories. The differences in the annotation behavior of the two coders concerning the number of categories used were also discussed, and more rigorous guidelines were established.
Table 2: Part of an annotation at the thematic level

Category       Definition
textual (old)  statements of the author's intentions or about the organization of text or text parts
textual (new)  statements of the author's intentions or about the organization of text or text parts, also information for further reading; table captions are excluded

Table 6: Adaptation of the definition for category 23

         ICA    TRR
                coder 1  coder 2
text 1   .44    .80      .55
text 2   .40    .74      .64
mean     .42    .77      .60

ICA: inter-coder agreement; TRR: test-retest reliability

Table 7: Degrees of agreement at the second evaluation cycle [kappa values]

Second Evaluation Cycle

After the modifications of the coding schema a new evaluation cycle started, which included the same steps as described above. In the second evaluation cycle we obtained the results stated in table 7. ICA values were nearly twice as high as in cycle 1. Also, TRR values for coder 1 increased considerably. (Data for the second annotation of coder 2 was not available in time, but will be ready shortly.) According to [19], the test-retest reliability for coder 1 could now be considered as substantial to almost perfect, indicating that the schema may be used consistently over time by a single coder. Inter-coder agreement turned from fair to moderate.

The test for marginal homogeneity was still highly significant (χ² = 153.02; p < 0.001). The comparison of single categories, however, showed that three instead of the former eight categories were not used in accordance (table 8). Hence, further evaluation cycles will follow in the near future to further improve the coding schema.
Category   Number            Significance
           Coder 1  Coder 2  Level
1          20       15       0.000
23         1        9        0.011
25         10       2        0.021

Table 8: Differently interpreted categories

CASE 2: EVALUATION OF ANNOTATIONS OF SPEECH DATA

We also tested our methodological approach for coding schema evaluation with data from another project. The LeaP project (http://leap.lili.uni-bielefeld.de) is concerned with the acquisition of prosody by foreign language learners and has set up a large corpus of annotated speech files. These were annotated by six coders using a six-tier coding schema. On the first tier, types of phrases (e.g. complete, interrupted) and intervening non-speech events such as laughter and noise are coded. The second tier consists of an orthographic annotation of words. On the third tier, syllables are annotated in SAMPA [34], and on the fourth tier vowel and consonant boundaries are annotated. On tier 5, tones are annotated using the ToBI [26] system, and on the sixth tier, initial highs, final lows, and intermediate highs and lows of pitch are marked. For one speech file, an average of 1000 annotations are carried out. All annotators were trained for two months at the beginning of the project.

For the calculation of inter-coder agreement, one speech file consisting of 368 words was annotated separately by three to four annotators. For a measure of overall agreement, the median of all pairwise comparisons per tier (kappa values) was calculated. Since orthographic material, i.e. words and syllables, cannot be considered as categories, no agreement was calculated for the second and third tiers. The results of pairwise and overall agreement for each tier are shown in table 9. Kappa values clearly indicate that certain tiers are more difficult to annotate in agreement than others, e.g. tones and phrases. These differences seem attributable mainly to the complexity of the underlying schemas, as the numbers of categories from tier 1 to tier 6 are seven (phrases), three (vowels), 34 (tones), and four (pitch). For the calculation of retest-reliability, the first file annotated was annotated again two years later by each coder. Results for tier 1 and tier 4 show that kappa values are on a moderate level of agreement (table 10). In the light of the long period of time that elapsed between the first and the second annotation, this must still be seen as a rather good result.
Tier          Coder pair
              1-2   1-3   1-4   2-3   2-4   3-4   Median
1 - phrases   .40   .39   .43   .57   .63   .60   .50
4 - vowels    .46   .46   .52   .46   .46   .49   .46
5 - tones     .21   .20   .29   .30   .35   .25   .27
6 - pitch     .58   .68   -     .62   -     -     .62

Table 9: Inter-coder agreement at different annotation tiers [kappa values]

Tier          Coder
              1     2     3     4
1 - phrases   .53   .24   .51   .65
4 - vowels    .58   .35   .46   .53

Table 10: Retest-reliability at different annotation tiers [kappa values]

An evaluation of the reasons for disagreement will be presented here only for the pair coder 1 - coder 3 in the first tier (inter-coder agreement), since this is the pair with the lowest agreement on this level. Procedure and interpretation are identical to those described in case 1. As the only tendentially significant Stuart-Maxwell test (χ² = 12.404; p < 0.05) suggests, the overall interpretation of categories can be considered as nearly identical. This leads to the conclusion that the differences are attributable primarily to systematic variance in assigning objects to categories. Additionally, however, the McNemar test reveals that there is one category in the schema (category 2) that is interpreted differently (χ² = 7.36; p < 0.05). The implication in this case would be to first clarify the definition of the problematic category with both coders, and then to resume training with the aim of improving the differentiation between objects.

PRACTICAL PROBLEMS

In applying the methodology to the two projects described above, we encountered some practical problems which might be worth noting, since they are likely to occur in other applications as well, and in quite a similar way.
Coder Characteristics

In our case studies we assumed that coder characteristics were stable or had no direct influence on annotation quality. This, of course, is an overly optimistic view. Individual characteristics of coders such as familiarity with the material and amount of former training, but also motivation and interest, may clearly have a varying impact on their work. In both studies we tried to keep these variables as stable as possible by providing equal training for every coder, choosing annotators familiar with the subject or material, and giving guidelines for the annotation process aimed at reducing effects of fatigue (e.g. restricting the annotation time to maximally three hours per session). Nonetheless, as interaction effects of coder characteristics and coding task cannot be excluded, the choice of a group of similar coders should be aspired to.

Tier          Coder pair
              1-2   1-3   1-4   2-3   2-4   3-4   Median
1 - phrases   .86   .92   .88   .89   .93   .90   .90
4 - vowels    .99   1.00  .99   .99   1.00  .99   .99
5 - tones     .44   .44   .58   .56   .58   .51   .54
6 - pitch     .96   .94   -     1.00  -     -     .96

Table 11: Inter-coder agreement in case 2 [corrected kappa values]

Kappa as Measure of Reliability

One of the major problems when employing kappa is that the coefficient depends on the actual marginal distributions [11, 5]. In cases with heterogeneous marginal distributions kappa may not have the originally intended range of −1 to +1, but a more restricted one. This will not only reduce the kappa values obtained, but also the interpretability of the coefficient, since rules of thumb for interpreting the goodness of the coefficient [19] do not apply anymore.

In this case some authors suggest calculating the possible maximum that kappa can reach (κmax) with the given marginal distributions [10, 1]. The expression κ/κmax then leads to a corrected κ with the original range of −1 to +1 [10, 1]. Even though this procedure would have the big advantage of not only tremendously improving the kappa values (see table 11 for an example), but also of restoring the original interpretation of kappa, we refrained from using it in the context of our framework. Severe aberrations from the homogeneity of marginal distributions often indicate underlying problems with the use of the categories. By correcting kappa, valuable information would be discarded.
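For illustration, the κmax correction can be sketched as follows; this is our own Python rendering of the formula in [10, 1], where the observed agreement is bounded by the overlap of the two coders' marginal distributions:

```python
from collections import Counter

def kappa_max(labels_a, labels_b):
    """Maximum attainable kappa given both coders' marginal distributions.

    The observed agreement p_o can at most reach sum_c min(p_a(c), p_b(c)).
    Plugging this bound into the kappa formula yields kappa_max; dividing an
    observed kappa by kappa_max gives the corrected coefficient with the
    full -1..+1 range.
    """
    n = len(labels_a)
    fa, fb = Counter(labels_a), Counter(labels_b)
    # best case: every item that could match under the marginals does match
    p_o_max = sum(min(fa[c], fb.get(c, 0)) for c in fa) / n
    p_e = sum(fa[c] * fb.get(c, 0) for c in fa) / n ** 2
    return (p_o_max - p_e) / (1 - p_e)
```

With identical marginals κmax equals 1 and the correction changes nothing; the more the marginal distributions diverge, the larger the correction becomes, and it is exactly that divergence which we preferred to keep visible rather than divide away.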
CONCLUSIONS

The aim of the work presented here was to present hands-on experience with the development of highly complex coding schemas for manual annotations of linguistic data. The methodological framework we created in order to solve our problems with poor annotation quality caused by the high complexity of the annotation task proved fruitful not only in the context of our original project aiming at the semantic annotation of text documents, but also in its transfer to the annotation of speech data. We therefore feel confident that the systematic and iterative process presented here can profitably be applied in other annotation projects where complex coding schemas have to be developed and evaluated.

REFERENCES

1. Brennan, R. L. and Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.
2. Bruce, R. and Wiebe, J. (1999). Recognizing subjectivity: A case study in manual tagging. Natural Language Engineering, 5(2), 187-205.
3. Bulyko, I. and Ostendorf, M. (2002). A bootstrapping approach to automating prosodic annotation for limited-domain synthesis. Proceedings of the IEEE Workshop on Speech Synthesis, 11-13 September 2002, Santa Monica, California, USA.
4. Butler, T., Fisher, S., Coulombe, G., Clements, P., Brown, S., Grundy, I., Carter, K., Harvey, K. and Wood, J. (2000). Can a team tag consistently? Experiences on the Orlando project. Markup Languages, 2(2), 111-125.
5. Byrt, T., Bishop, J. and Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46(5), 423-429.
6. Cantor, A. B. (1996). Sample-size calculations for Cohen's kappa. Psychological Methods, 1(2), 150-153.
7. Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249-254.
8. Carletta, J., Isard, A., Isard, S., Kowtko, J. C., Doherty-Sneddon, G. and Anderson, A. H. (1997). The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1), 13-31.
9. Carmines, E. G. and Zeller, R. A. (1979). Reliability and validity assessment. Sage Publications: Beverly Hills and London. Paper series on Quantitative Applications in the Social Sciences, 07-017.
10. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.
11. Feinstein, A. R. and Cicchetti, D. V. (1990). High agreement but low kappa: I. The problem of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543-549.
12. Flack, V. F., Afifi, A. A., Lachenbruch, P. A. and Schouten, H. J. A. (1988). Sample size determinations for the two rater kappa statistic. Psychometrika, 53(3), 321-325.
13. Hanley, J. A. (1987). Standard error of the kappa statistic. Psychological Bulletin, 102(2), 315-321.
14. Helmstadter, G. C. (1964). Principles of psychological measurement. Meredith Publishing: New York.
15. Hollnagel, E. (1993). Human Reliability Analysis: Context and Control. Academic Press: London.
16. Hoyt, W. and Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403-424.
17. Kando, N. (1997). Text-level structure of research papers: Implications for text-based information processing systems. Proceedings of the British Computer Society Annual Colloquium of Information Retrieval Research, Aberdeen, Scotland, 8-9 April 1997, 68-81.
18. Krippendorff, K. (1980). Content analysis: An introduction. Sage Publications: Beverly Hills and London.
19. Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
20. Maier, E. (1997). Evaluating a Scheme for Dialogue Annotation. VERBMOBIL Report 193. DFKI GmbH, Saarbrücken.
21. Marcu, D., Romera, M. and Amorrortu, E. (1999). Experiments in constructing a corpus of discourse trees: Problems, annotation choices, issues. The Workshop on Levels of Representation in Discourse, Edinburgh, Scotland, 71-87.
22. Maxwell, A. E. (1970). Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 116, 651-655.
23. McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153-157.
24. Ng, H. T., Lim, C. Y. and Foo, S. K. (1999). A case study on inter-annotator agreement for word sense disambiguation. Proceedings of the ACL SIGLEX Workshop: Standardizing Lexical Resources, College Park, Maryland, USA, 21-22 June 1999, 9-13.
25. O'Donnell, M. (2000). RST-Tool 2.4 - A markup tool for Rhetorical Structure Theory. Proceedings of the International Natural Language Generation Conference (INLG'2000), Mitzpe Ramon, Israel, 12-16 June 2000, 253-256.
26. Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J. and Hirschberg, J. (1992). ToBI: A standard for labeling English prosody. Proceedings of the 1992 International Conference on Spoken Language Processing, Denver, Colorado, USA, 16-20 September 1992, 867-870.
27. Staab, S., Maedche, A. and Handschuh, S. (2001). Creating metadata for the semantic web: An annotation framework and the human factor. Technical Report 412, Institute AIFB, University of Karlsruhe.
28. Stuart, A. (1955). A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412-416.
29. Teufel, S. (1999). Argumentative zoning: Information extraction from scientific text. PhD thesis, University of Edinburgh.
30. Teufel, S., Carletta, J. and Moens, M. (1999). An annotation scheme for discourse-level argumentation in research articles. Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), Bergen, 8-12 June 1999.
31. Uebersax, J. (2001). Statistical Methods for Rater Agreement. Online available: http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm.
32. Veronis, J. (2000). Sense tagging: Don't look for the meaning but for the use. Workshop on Computational Lexicography and Multimedia Dictionaries (COMLEX'2000), 22-23 September 2000, Patras, Greece, 1-9.
33. Vorsterman, A., Martens, J.-P. and Van Coile, B. (1996). Automatic segmentation and labeling of multi-lingual speech data. Computational Linguistics, 19(4), 271-293.
34. Wells, J. C., Barry, W., Grice, M., Fourcin, A. and Gibbon, D. (1992). Standard Computer-Compatible Transcription. SAM Stage Report Sen.3 SAM UCL-037, University College London.
35. Wiebe, J. M., Bruce, R. F. and O'Hara, T. P. (1999). Development and use of a gold standard data set for subjectivity classifications. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 20-26 June 1999, University of Maryland, 246-253.