An Analysis of Model-based Clustering, Competitive Learning, and Information Bottleneck


Shi Zhong

Department of Computer Science and Engineering

Florida Atlantic University

777 Glades Road, Boca Raton, FL 33431

zhong@cse.fau.edu

Abstract

This paper provides a general formulation of probabilistic model-based clustering with deterministic annealing (DA), which leads to a unifying analysis of k-means, EM clustering, soft competitive learning algorithms (e.g., self-organizing map), and information bottleneck. The analysis points out an interesting yet not well-recognized connection between k-means and EM clustering: they are just two different stages of a DA clustering process, with different temperatures. The demonstrated relationships between model-based clustering, competitive learning, and information bottleneck can potentially generate a series of new algorithms.

1 Introduction

Simulated annealing (Kirkpatrick et al., 1983) is a stochastic optimization technique motivated by the annealing processes in physical chemistry. Certain chemical systems can be driven to their low-energy states by annealing, a gradual temperature-decreasing process. The annealing schedule, i.e., the rate at which the temperature is lowered, is critical to reaching the global optimum. Geman and Geman (1984) have theoretically shown that the global optimum can be achieved by a schedule following T ∝ 1/log m, where m is the current iteration number. Such schedules, however, are very slow and unrealistic in many applications.

Deterministic annealing (DA), as the name suggests, is a deterministic version of simulated annealing (Rose, 1998). Derived from an information-theoretic view of optimization problems, DA is not guaranteed to reach the global optimum. Rather, it is a heuristic strategy that avoids many locally optimal solutions and enjoys a faster temperature schedule. It has been successfully used in a wide range of applications (Rose, 1998). Its applications to clustering, however, have been largely restricted to vector data (Rose et al., 1993; Hofmann & Buhmann, 1997). This paper provides a probabilistic model-based DA formulation, which helps us make insightful connections between mixture models, competitive learning, and information bottleneck.

Mixture models (McLachlan & Basford, 1988) have been widely used for density estimation and clustering applications. The EM algorithm (Dempster et al., 1977) is usually used to estimate maximum likelihood (or maximum a posteriori) parameters for mixture models. When used for clustering, it is called EM clustering. The standard k-means algorithm (Forgy, 1965; MacQueen, 1967) is another well-known clustering technique that has its root in statistical analysis and that is closely related to vector quantization in pattern recognition and signal processing. The standard k-means algorithm is often viewed as a special case of spherical Gaussian model-based EM clustering with variance σ² → 0 (Mitchell, 1997), yet more sophisticated analyses exist. For example, an information-theoretic analysis was provided by Kearns et al. (1997), who use the term "k-means" to mean a generic algorithm that is not limited to vector spaces with Gaussian models. We follow this conceptual notation in our discussion. In this paper, we shall introduce a more intuitive explanation for the relationship between k-means and EM clustering, under a model-based deterministic annealing framework.

The online version of k-means also has a close connection to competitive learning techniques (Ahalt et al., 1990), in which the k-means algorithm is viewed as using the Winner-Take-All learning rule while EM clustering uses a SoftMax learning rule. Self-organizing maps (SOM) (Kohonen, 1997) and Neural-Gas (Martinetz et al., 1993) are two sophisticated competitive learning methods that use an annealing mechanism to avoid locally optimal solutions. Connections between SOM, Neural-Gas, and DA have been made for vector quantization applications by Hofmann and Buhmann (1998), who showed that the three types of algorithms are three different implementations of a continuation method (Allgower & Georg, 1990) for vector quantization, with different competitive learning rules. We shall show that SOM and Neural-Gas can be interpreted as variations of DA clustering and can be extended to employ different probabilistic models.

Recently, information bottleneck (IB) (Tishby et al., 1999) has attracted a lot of interest and gained popularity in text clustering applications. An explanation of its connection to maximum likelihood algorithms has been attempted by Slonim and Weiss (2003). By deriving the IB formulation from multinomial model-based deterministic annealing, we show that IB text clustering, explored recently by many authors, implicitly assumes a multinomial distribution for clusters.

In short, this paper extends our previous work (Zhong & Ghosh, 2003) with a more general formulation of model-based deterministic annealing and a thorough discussion. Through the formulation, we demonstrate some useful connections between maximum likelihood estimation of mixture models, competitive learning, and information bottleneck. We believe these connections may lead to a series of new algorithms.

Section 2 presents model-based deterministic annealing. Section 3 demonstrates k-means and EM clustering as special stages of a model-based DA clustering process. Sections 4 and 5 discuss the relationships between model-based DA clustering and competitive learning as well as information bottleneck. Section 6 presents possible new applications and algorithms based on our general formulation of model-based DA clustering. Finally, Section 7 summarizes this paper with remarks on future work.

2 Model-based Deterministic Annealing

In this section, we present a general formulation of model-based deterministic annealing and demonstrate the connection to Neal and Hinton's EM interpretation (1998). Although the EM and DA algorithms apply to general optimization problems, for simplicity, here we restrict our discussion to probabilistic model-based clustering.

Consider a central clustering problem, in which X = {x1, ..., xN} is a set of N data objects, Z = {z1, ..., zN} a set of N (hidden or unobserved) cluster indices, and Θ a set of parameters associated with a probabilistic model of X and Z, two random variables governing the distribution of x and z. Note z ∈ {1, ..., K} is a discrete variable and K is the number of clusters. For each cluster z, the corresponding model for X is P(x|Θ, z) = P(x|θ_z), which measures the likelihood of object x coming from cluster z. The parameter θ_z is the set of parameters for cluster z. Our objective is to maximize the following average log-likelihood:

E = Σ_x Σ_z P(x, z|Θ) log P(x|θ_z)
  = Σ_x P(x|Θ) Σ_z P(z|x, Θ) log P(x|θ_z)
  ∝ Σ_{x,z} P(z|x, Θ) log P(x|θ_z),    (1)

where the (parameterized) data prior P(x|Θ) is unavoidably set to be the constant 1/N in practice as we always deal with a finite sample (but as N → ∞, using 1/N is asymptotically correct). For simplicity, we ignore this constant and use the simplified notation in (1) in later analysis.

Maximizing E directly over P(z|x, Θ) leads to a k-means type partitioning of data objects, i.e., hard data assignment. That is, for each data object x, the posterior P(z|x, Θ) has a value of 1 for z = argmax_{z'} P(x|θ_{z'}) and 0 otherwise. To consider soft partitioning, one can introduce entropy constraints or penalties into the (scaled) objective E:

F = Σ_{x,z} P(z|x, Θ) log P(x|θ_z) − γ·H(Z|Θ) + T·H(Z|X, Θ),    (2)

where H(Z|Θ) = −Σ_z P(z|Θ) log P(z|Θ) is the cluster prior entropy,

H(Z|X, Θ) = −Σ_{x,z} P(z|x, Θ) log P(z|x, Θ)

is the average cluster posterior entropy, and γ and T are coefficients used to trade off between maximizing E, minimizing H(Z|Θ), and maximizing H(Z|X, Θ). Intuitively, minimizing H(Z|Θ) has the effect of favoring a smaller number of clusters, while maximizing H(Z|X, Θ) favors soft posterior assignment of data objects to different clusters (to avoid hard decisions, leading to so-called maximum entropy clustering). The H(Z|Θ) term is not in the original DA formulation (Rose, 1998) but was mentioned as a possible entropy-constrained extension that has been used in vector quantization. We shall see how it is useful in making connections to EM clustering. The parameter T has a temperature interpretation, as explained later in this section.

The function F can be rewritten as

F = Σ_{x,z} P(z|x, Θ) [log P(x|θ_z) + γ log P(z|Θ) − T log P(z|x, Θ)].    (3)

Using the classic Lagrange optimization method, together with the constraint Σ_z P(z|x, Θ) = 1, one can derive the maximum entropy solution of the posterior P(z|x, Θ) (for maximizing F) to be the well-known Gibbs distribution

P(z|x, Θ) = exp{[log P(x|θ_z) + γ log P(z|Θ)] / T} / Σ_{z'} exp{[log P(x|θ_{z'}) + γ log P(z'|Θ)] / T}
          = P(x|θ_z)^{1/T} P(z|Θ)^{γ/T} / Σ_{z'} P(x|θ_{z'})^{1/T} P(z'|Θ)^{γ/T}.    (4)

Several interesting choices of γ are discussed in the next section. In general, when the temperature T is high, the posterior entropy is high, meaning the probability of each data object being assigned to different clusters is balanced (or uniform). Conversely, as T decreases to 0, the posterior probability P(z|x, Θ) will be close to either 0 or 1, leading to a hard data assignment. A DA algorithm naturally starts at a high temperature and then gradually decreases the temperature towards 0. At each temperature, the EM algorithm is used to maximize the objective (2), with an E-step in (4) and an M-step

θ_z^(new) = argmax_θ Σ_x P(z|x, Θ) log P(x|θ).    (5)

A generic model-based DA clustering algorithm was constructed in Zhong and Ghosh (2003). Note that in this formulation, the temperature T only affects the calculation of the posterior P(z|x, Θ), i.e., the E-step of the EM algorithm. Annealing the M-step is possible but computationally much more expensive than the DA formulation.
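To make the generic procedure concrete, here is a minimal sketch (ours, not the algorithm listing from Zhong and Ghosh, 2003) of model-based DA clustering with unit-variance spherical Gaussian models, iterating the E-step (4) and a weighted maximum-likelihood M-step in the spirit of (5); the function names, the geometric cooling schedule, and the prior re-estimation rule are illustrative assumptions.

```python
import numpy as np

def da_cluster(X, K, gamma=None, T0=5.0, T_min=1e-3, cool=0.9, n_inner=20, seed=0):
    """Model-based DA clustering sketch with unit-variance spherical Gaussians.

    gamma=None ties gamma to the temperature (the gamma = T case of Section 3);
    pass gamma=0.0 or gamma=1.0 for the other two scenarios discussed there.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]            # initial cluster means (theta_z)
    prior = np.full(K, 1.0 / K)                        # cluster priors P(z|Theta)
    T = T0
    while T > T_min:
        g = T if gamma is None else gamma
        for _ in range(n_inner):                       # EM at a fixed temperature
            # E-step, Eq. (4): Gibbs posterior with temperature T
            loglik = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
            logits = (loglik + g * np.log(prior)) / T
            logits -= logits.max(axis=1, keepdims=True)        # numerical stability
            post = np.exp(logits)
            post /= post.sum(axis=1, keepdims=True)            # P(z|x,Theta)
            # M-step in the spirit of Eq. (5): weighted ML estimate of each mean;
            # the prior update below is the usual EM rule (an assumption here)
            w = post.sum(axis=0) + 1e-12
            mu = (post.T @ X) / w[:, None]
            prior = w / w.sum()
        T *= cool                                      # anneal the temperature towards 0
    return mu, prior, post

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 0], [0, 3])])
    mu, prior, _ = da_cluster(X, K=3)
    print(np.round(mu, 2), np.round(prior, 2))
```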

The relationship to maximum likelihood estimation of mixture models with the EM algorithm is demonstrated as follows. According to the view of Neal and Hinton (1998), the objective function that the EM algorithm maximizes is

F1 = E[log P(X, Z|Θ) | X, Θ] + H(Z|X, Θ)    (6)
   = Σ_{x,z} P(z|x, Θ) log P(x, z|Θ) + H(Z|X, Θ)    (7)
   = Σ_{x,z} P(z|x, Θ) log P(x|θ_z) − H(Z|Θ) + H(Z|X, Θ),    (8)

where Z is the (hidden) cluster indices of all data objects, the first term in (6) is an expectation of the complete data log-likelihood over P(z|x, Θ), and the second term in (6) is the posterior entropy. Derivation of (7) from (6) can be found in Bilmes (1998), thus is omitted here. From (7) to (8) is straightforward. Therefore, the objective function in (2) is the same as that used in Neal and Hinton (1998) if we set γ = T = 1. Neal and Hinton (1998) showed that the optimal value of F1 over P(z|x, Θ) is exactly the incomplete data log-likelihood log P(X|Θ).
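As a quick numerical illustration of Neal and Hinton's result (our own check, not from the paper), the snippet below evaluates F1 in the form (7) at the exact posterior of a small two-component Gaussian mixture and confirms that it equals the incomplete data log-likelihood; the data and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                                    # a few 1-d data points
mu, prior = np.array([-1.0, 2.0]), np.array([0.4, 0.6])   # a fixed 2-component mixture

# log P(x|theta_z) for unit-variance Gaussians, and the exact posteriors P(z|x,Theta)
loglik = -0.5 * (x[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)
joint = loglik + np.log(prior)                            # log P(x,z|Theta)
log_px = np.logaddexp.reduce(joint, axis=1)               # log P(x|Theta)
post = np.exp(joint - log_px[:, None])

# F1 = sum_{x,z} P(z|x,Theta) log P(x,z|Theta) + H(Z|X,Theta), Eq. (7)
F1 = (post * joint).sum() - (post * np.log(post)).sum()
print(np.allclose(F1, log_px.sum()))                      # True: F1 = log P(X|Theta)
```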

3 K-means vs. EM Clustering

Let us now discuss several choices of the γ parameter, which elucidate an intuitive relationship between k-means and EM clustering.

• γ = 0:

The objective F becomes

Σ_{x,z} P(z|x, Θ) log P(x|θ_z) + T·H(Z|X, Θ),

which is the same as the original DA formulation with the distortion function set to be −log P(x|θ_z). As T → ∞, P(z|x, Θ) becomes uniform, which makes sense when there is no prior knowledge about the distribution of z. As T tends to 0, the clustering algorithm becomes k-means, which maximizes the first term in (2).

• γ = 1:

The objective F becomes

Σ_{x,z} P(z|x, Θ) log P(x, z|Θ) + T·H(Z|X, Θ),

which further reduces to standard EM clustering (7) if T = 1. As T → ∞, P(z|x, Θ) becomes uniform, same as the γ = 0 case. As T tends to 0, the clustering algorithm becomes a hard clustering algorithm that is slightly different from k-means. Specifically, P(z|x, Θ) is 1 for z = argmax_{z'} P(z'|Θ) P(x|θ_{z'}) and 0 otherwise. This hard clustering version was observed and analyzed in Banerjee et al. (2003). It seems to maximize a lower bound of the incomplete data (log-)likelihood.

• γ = T:

This is the scenario we are mainly interested in. The objective F becomes

Σ_{x,z} P(z|x, Θ) log P(x|θ_z) − T·I(X; Z|Θ),    (9)

where one can easily see that an interpretation of maximizing F is to minimize the mutual information between X and Z, i.e., to compress X into Z as much as possible, and meanwhile to maximize the average log-likelihood. The temperature T is used to adjust such a tradeoff. The E-step now becomes

P(z|x, Θ) = P(z|Θ) P(x|θ_z)^{1/T} / Σ_{z'} P(z'|Θ) P(x|θ_{z'})^{1/T}.

As T → ∞, the cluster posteriors become equal to the priors, i.e., P(z|x, Θ) = P(z|Θ). This is useful when one intends to enforce some prior knowledge on the distribution of cluster sizes (which cannot be achieved for the previous two scenarios). As T lowers to 1, the clustering algorithm is equivalent to EM clustering; as T further goes to 0, the algorithm becomes k-means. Therefore, we can view EM clustering and k-means as two different stages of a model-based DA clustering process, with T = 1 and T = 0, respectively. This is a better interpretation than the traditional one in Mitchell (1997), where one has to artificially diminish the variance of Gaussian models, destroying reasonable Gaussian model descriptions for the clusters generated by the standard k-means algorithm.
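The behavior of the E-step (4) under these three choices of γ can be seen directly from a toy computation; the snippet below (ours, with made-up log-likelihoods and priors) prints the posterior of one data object for γ ∈ {0, 1, T} at a high, unit, and near-zero temperature.

```python
import numpy as np

def posterior(loglik, log_prior, T, gamma):
    """Gibbs posterior of Eq. (4) for a single data object; loglik[z] = log P(x|theta_z)."""
    logits = (loglik + gamma * log_prior) / T
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

loglik = np.array([-1.0, -2.0, -4.0])            # toy log-likelihoods under 3 clusters
log_prior = np.log(np.array([0.5, 0.3, 0.2]))    # toy cluster priors P(z|Theta)

for T in (100.0, 1.0, 0.01):
    for g, name in ((0.0, "gamma=0"), (1.0, "gamma=1"), (T, "gamma=T")):
        print("T=%-6g %s" % (T, name), np.round(posterior(loglik, log_prior, T, g), 3))
```

At T = 100 the γ = T posterior reproduces the priors while the γ = 0 posterior is nearly uniform; at T = 1 the γ = 1 and γ = T cases coincide with the EM posterior; at T = 0.01 all three concentrate on a single cluster.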

4 Relation to Competitive Learning

In this section, we discuss two related competitive learning algorithms that have an annealing flavor: self-organizing map (Kohonen, 1997) and Neural-Gas (Martinetz et al., 1993), both of which have been used for clustering. To see the connection clearly, we first discuss batch versions of the two algorithms.

For the original SOM algorithm, the model parameters are just the means of each cluster, i.e., Θ = {µ_z}_{z=1,...,K}. A distinct feature of SOM is the use of a topological map, in which each cluster has a fixed coordinate. Let the map location of cluster z be φ_z and α(φ_1, φ_2) = exp(−‖φ_1 − φ_2‖² / 2α²) a neighborhood function. Let z(x) = argmin_z ‖x − µ_z‖, where ‖·‖ is the L2 norm. The batch SOM algorithm amounts to iterating between the following two steps:

P(z|x, Θ) = α(φ_z, φ_{z(x)}) / Σ_{z'=1}^{K} α(φ_{z'}, φ_{z(x)})    (10)

and

µ_z^(new) = Σ_x P(z|x, Θ) x / Σ_x P(z|x, Θ),    (11)

where α is a parameter controlling the width of the neighborhood function and decreases gradually during the clustering process. As α goes to 0, P(z|x, Θ) becomes either 0 or 1 and the algorithm reduces to k-means.
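The following is a minimal sketch (ours) of this batch SOM iteration with a one-dimensional topological map, making the correspondence to the E-step (10) and the mean update (11) explicit; the map layout, width schedule, and initialization are illustrative choices.

```python
import numpy as np

def batch_som(X, K, widths=(2.0, 1.0, 0.5, 0.1, 0.01), n_iter=20, seed=0):
    """Batch SOM sketch: neighborhood-based E-step (10) and weighted mean update (11)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]          # cluster means mu_z
    phi = np.arange(K, dtype=float)                  # fixed 1-d map coordinates phi_z
    for alpha in widths:                             # alpha shrinks like a temperature
        for _ in range(n_iter):
            # winner node z(x): the closest centroid in the data space
            winner = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            # Eq. (10): soft assignment from the map-space neighborhood of the winner
            post = np.exp(-(phi[None, :] - phi[winner][:, None]) ** 2 / (2 * alpha ** 2))
            post /= post.sum(axis=1, keepdims=True)
            # Eq. (11): weighted mean update of each cluster centre
            w = post.sum(axis=0) + 1e-12
            mu = (post.T @ X) / w[:, None]
    return mu

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(200, 2))
    print(np.round(batch_som(X, K=5), 2))
```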

It is immediately noted that the α has the same functionality as a temperature parameter in deterministic annealing. The difference from standard model-based deterministic annealing is that here the calculation of P(z|x, Θ) is constrained by a topological map structure, which gives SOM the advantage that all resulting clusters are structurally related according to the pre-specified topological map. Obviously, this modified E-step makes SOM not a strict EM algorithm; the maximum likelihood statement, as well as the convergence theorem of the EM algorithm (Dempster et al., 1977), does not apply. Loosely speaking, however, the convergence of SOM is guaranteed by the convergence of k-means during the final stage of the annealing process, where α becomes close to 0 (thus P(z|x, Θ) becomes hard and the algorithm becomes k-means).

Comparing the batch SOM algorithm with model-based DA clustering, we notice that a natural extension of SOM is to use probabilistic models in the M-step (11). Heskes (2001) discussed this direction and explored multinomial model-based SOM for analyzing market basket data.

The Neural-Gas algorithm differs from SOM clustering only in the E-step, i.e., how P(z|x, Θ) is calculated. It uses a rank-based scheme

P(z|x, Θ) = e^{−r(x,z)/α} / Σ_{z'} e^{−r(x,z')/α},    (12)

where α is again an equivalent temperature parameter and r(x, z) is a rank function that takes value k − 1 if µ_z is the k-th closest cluster centroid to data vector x (according to L2 distance). Convergence of this algorithm has been analyzed and interpreted using diffusion dynamics (Martinetz et al., 1993). Similar to SOM, however, the convergence is obvious at the final stage of annealing as α lowers to 0. Though it seems natural to extend Neural-Gas to use probabilistic models and rank clusters based on the log-likelihood measure P(x|θ_z), we have not seen any model-based Neural-Gas algorithms so far. Such model-based Neural-Gas algorithms can be interesting future work and useful in analyzing non-Gaussian data.
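A compact sketch (ours) of the rank-based assignment (12) is given below, together with the model-based variant suggested above, which simply ranks clusters by log P(x|θ_z) instead of by L2 distance; both functions and their names are illustrative, since no model-based Neural-Gas algorithm is defined here.

```python
import numpy as np

def neural_gas_posterior(X, mu, alpha):
    """Rank-based E-step of Eq. (12): r(x,z) = k-1 if mu_z is the k-th closest centroid."""
    dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # squared L2 distances
    rank = dist.argsort(axis=1).argsort(axis=1)                 # r(x, z) in {0, ..., K-1}
    post = np.exp(-rank / alpha)
    return post / post.sum(axis=1, keepdims=True)

def model_based_ng_posterior(loglik, alpha):
    """Hypothetical model-based variant: rank clusters by log P(x|theta_z)."""
    rank = (-loglik).argsort(axis=1).argsort(axis=1)            # most likely cluster gets rank 0
    post = np.exp(-rank / alpha)
    return post / post.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, mu = rng.normal(size=(5, 2)), rng.normal(size=(4, 2))
    print(np.round(neural_gas_posterior(X, mu, alpha=0.5), 3))
```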

Competitive learning techniques are usually presented in online form: as each data item is presented, multiple clusters (in the clustering context) compete for it and the cluster centroids get updated according to the competition results. The updating part basically follows a gradient descent approach and the learning rate needs to be carefully selected (based on stochastic approximation principles, see Cherkassky & Mulier, 1998). Online algorithms are better equipped than batch versions to handle very large data sets that cannot fit into memory. The connection demonstrated above indicates that an online version of any model-based DA clustering algorithm can be readily constructed following the practice used in competitive learning methods.
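As one possible reading of that remark, here is a minimal online sketch (entirely ours, not an algorithm from this paper): each incoming item is soft-assigned with the γ = T posterior (4) and every Gaussian mean is nudged toward it in proportion to its posterior, which reduces to the Winner-Take-All (online k-means) update as T → 0; the learning-rate and temperature schedules are arbitrary illustrative choices.

```python
import numpy as np

def online_step(x, mu, prior, T, lr):
    """One online update: soft assignment (Eq. (4) with gamma = T), then a SoftMax-style
    move of every mean toward x weighted by its posterior (gradient-style centroid update)."""
    loglik = -0.5 * ((x - mu) ** 2).sum(axis=1)      # unit-variance Gaussian log-likelihoods
    logits = loglik / T + np.log(prior)
    logits -= logits.max()
    post = np.exp(logits)
    post /= post.sum()
    mu += lr * post[:, None] * (x - mu)
    return mu, post

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, size=(200, 1)) for c in (-2.0, 2.0)])
    rng.shuffle(X)
    mu, prior = rng.normal(size=(2, 1)), np.full(2, 0.5)
    for t, x in enumerate(X, start=1):
        mu, _ = online_step(x, mu, prior, T=max(0.01, 1.0 / t), lr=1.0 / t)
    print(np.round(mu.ravel(), 2))                   # the two means after one pass over the data
```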

5 Relation to Information Bottleneck

In this section, we show an equivalence between IB clustering and multinomial model-based DA clustering. This result indicates that, when applied to clustering, the IB framework implicitly assumes a discrete multinomial distribution for the representation of data. Even though the original IB framework does not specify discrete distributions, continuous applications have not been seen yet and may involve considerable computational difficulty.

Let P(x|θ_z) be a multinomial distribution parameterized by θ_z = {P(y|z)}, where Y is a random variable associated with the set of possible symbols from which x is drawn. P(y|z) is the probability of drawing symbol y in cluster z. Assume that each x is formed by drawing n times from the set of symbols and n_x(y) is the number of times symbol y is drawn. We then have

log P(x|θ_z) = log Π_y P(y|z)^{n_x(y)}
             = Σ_y n_x(y) log P(y|z)
             = n [ Σ_y P(y|x) log (P(y|z) / P(y|x)) − H_{y|x} ]
             = −n ( D_KL(P(y|x) ‖ P(y|z)) + H_{y|x} ),    (13)

where D_KL(P ‖ Q) is the Kullback-Leibler (KL) divergence between probability distributions P and Q, and H_{y|x} = −Σ_y P(y|x) log P(y|x) can be seen as a constant w.r.t. Z and Θ. On the other hand, assuming that Y is independent of Z given X (which is true for clustering since Z is just a compressed version of X), it is easy to show that (Tishby et al., 1999; Dhillon et al., 2002)

I(X; Y) − I(Z; Y) ∝ Σ_{x,z} P(z|x) D_KL(P(y|x) ‖ P(y|z)).

Therefore, when multinomial models are used, plugging (13) into (9), we can show that

F = Σ_{x,z} P(z|x, Θ) log P(x|θ_z) − T·I(Z; X|Θ)
  = −n Σ_{x,z} P(z|x, Θ) D_KL(P(y|x) ‖ P(y|z)) − n·H_{y|x} − T·I(Z; X|Θ)
  = −n′ (I(X; Y|Θ) − I(Z; Y|Θ)) − n·H_{y|x} − T·I(Z; X|Θ)
  = −T (I(Z; X|Θ) − β I(Z; Y|Θ)) + C,    (14)

where C = −n′·I(X; Y|Θ) − n·H_{y|x} is a constant w.r.t. Z and Θ, and n′ absorbs the constant of proportionality in the relation above. Note that I(Z; X|Θ) − β I(Z; Y|Θ) is a parameterized version of the IB objective (to be minimized), and β = n′/T is the tradeoff coefficient, analogous to an inverse temperature parameter. This is a more accurate relationship than the one defined in Slonim and Weiss (2003), who attempted a mapping between information bottleneck and maximum likelihood estimation of mixture models. As we showed, the latter is simply one specific stage of a model-based DA process, so the mapping defined in Slonim and Weiss (2003) would be exact only when β = 1, with multinomial models. It is not surprising to see the equivalence between multinomial model-based DA clustering and IB clustering since deterministic annealing was explicitly mentioned to be related in the original IB paper (Tishby et al., 1999).
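The decomposition (13) underlying this equivalence is easy to verify numerically; the toy check below (ours, with arbitrary Dirichlet-sampled distributions) confirms that the multinomial log-likelihood of a document under each cluster equals −n(D_KL(P(y|x) ‖ P(y|z)) + H_{y|x}) up to the count-only multinomial coefficient, so ranking clusters by likelihood and by KL divergence is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, K = 6, 50, 3                                        # vocabulary size, doc length, clusters
p_yz = rng.dirichlet(np.ones(V), size=K)                  # cluster word distributions P(y|z)
counts = rng.multinomial(n, rng.dirichlet(np.ones(V)))    # word counts n_x(y) for one document x
p_yx = counts / n                                         # empirical distribution P(y|x)

# Left side of Eq. (13): log P(x|theta_z) (dropping the z-independent multinomial coefficient)
loglik = counts @ np.log(p_yz.T)                          # one value per cluster z

# Right side of Eq. (13): -n * (KL(P(y|x) || P(y|z)) + H_{y|x})
nz = p_yx > 0                                             # skip 0 * log 0 terms
H_yx = -(p_yx[nz] * np.log(p_yx[nz])).sum()
kl = (p_yx[nz, None] * (np.log(p_yx[nz, None]) - np.log(p_yz.T[nz, :]))).sum(axis=0)

print(np.allclose(loglik, -n * (kl + H_yx)))              # True
```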

It is worth mentioning that some recent works on KL clustering (Dhillon et al., 2002; Dhillon & Guan, 2003) are just (hard) k-means versions of IB clustering with β → ∞, or multinomial model-based DA clustering with T → 0. As β goes to infinity, minimizing I(Z; X) − βI(Z; Y) is equivalent to maximizing I(Z; Y), or the same as minimizing the information loss I(X; Y) − I(Z; Y) (since I(X; Y) is a constant).

6 Clustering with An Auxiliary Space

In this section, we discuss how the framework of model-based DA clustering can be useful in an interesting clustering scenario, in which each data item has two representations, one of which characterizes some auxiliary information about the data item. For example, in gene expression data analysis, auxiliary functional category information may be available for genes. One may desire to cluster genes according to the auxiliary representation while still using the expression information during the clustering process (Sinkkonen & Kaski, 2001). A similar situation exists for text clustering, where a large amount of auxiliary ontology information about text documents exists and can be exploited. Sinkkonen and Kaski (2001) proposed to minimize a KL-divergence-based loss function in the auxiliary space while using primary space models to calculate the posterior probabilities P(z|x, Θ). In their work, multinomial models were used in the discrete auxiliary space and von Mises-Fisher models were used for the continuous primary space. Rewritten in our notation, the objective they are maximizing is

Σ_{x,z} P(z|x, Θ) log P(x|λ_z),

where Θ is a set of von Mises-Fisher (vMF) model parameters and the λ's are parameters for the multinomial models in the auxiliary space. This interpretation immediately suggests many variations of clustering data with two representations. A general objective function could be

Σ_{x,z} P(z|x, Θ, Λ) log P(x|θ_z, λ_z),

which extends the parameter Θ in (2) to include models in both spaces. Sinkkonen and Kaski (2001) chose P(x|θ_z, λ_z) = P(x|λ_z), i.e., to cluster in the auxiliary space. A design that considers both spaces is

log P(x|θ_z, λ_z) = log P(x|θ_z) + η log P(x|λ_z),

where η adjusts the tradeoff between modeling data in the two spaces. Similarly, many choices exist for the form of the posterior probability P(z|x, Θ, Λ). It can be based on Θ alone, on Λ alone, on a topological map, or on model ranks in either space. Which choice to pick will depend on one's intention and practical applications.
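For concreteness, a small sketch (ours) of the two-space E-step implied by this design is given below: the combined log-likelihood log P(x|θ_z) + η log P(x|λ_z) is plugged into the γ = T posterior (4); the per-space log-likelihoods are passed in as plain arrays, so the actual primary and auxiliary models (vMF, multinomial, or otherwise) are left unspecified.

```python
import numpy as np

def two_space_posterior(loglik_primary, loglik_aux, prior, eta, T):
    """E-step for log P(x|theta_z, lambda_z) = log P(x|theta_z) + eta * log P(x|lambda_z),
    used inside the gamma = T Gibbs posterior (4)."""
    logits = (loglik_primary + eta * loglik_aux) / T + np.log(prior)
    logits -= logits.max(axis=1, keepdims=True)
    post = np.exp(logits)
    return post / post.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, K = 4, 3
    loglik_primary = rng.normal(-5, 1, size=(N, K))   # stand-ins for log P(x|theta_z)
    loglik_aux = rng.normal(-8, 1, size=(N, K))       # stand-ins for log P(x|lambda_z)
    print(np.round(two_space_posterior(loglik_primary, loglik_aux,
                                       np.full(K, 1.0 / K), eta=0.5, T=1.0), 3))
```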

7 Conclusion

We have presented a general formulation for model-based deterministic annealing, which takes k-means and EM clustering as special cases. The formulation shows some interesting connections to learning mixture models with the EM algorithm, to competitive learning techniques such as self-organizing map and Neural-Gas, and to information bottleneck clustering. These connections may suggest a series of possible new algorithms that are useful in handling large volumes of data and data with multiple representations. In the future, we shall mainly investigate the efficacy of the suggested new algorithms in text clustering and gene expression data analysis applications.

The information bottleneck, as a general principle, has been extended by Friedman et al. (2001) to address interesting bi-clustering or co-clustering (Dhillon, 2001; Dhillon et al., 2003) and multi-clustering problems. How model-based DA clustering can be adapted (or combined with the IB principle) for these problems seems to be another promising future direction.

References

Ahalt, S. C., Krishnamurthy, A. K., Chen, P., & Melton, D. E. (1990). Competitive learning algorithms for vector quantization. Neural Networks, 3, 277–290.

Allgower, E. L., & Georg, K. (1990). Numerical continuation methods: An introduction. Berlin Heidelberg: Springer-Verlag.

Banerjee, A., Dhillon, I., Sra, S., & Ghosh, J. (2003). Generative model-based clustering of directional data. Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 19–28).

Bilmes, J. A. (1998). A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models (Technical Report). University of California at Berkeley.

Cherkassky, V., & Mulier, F. (1998). Learning from data: Concepts, theory, and methods. New York: John Wiley & Sons.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Dhillon, I., & Guan, Y. (2003). Information theoretic clustering of sparse co-occurrence data. Proc. IEEE Int. Conf. Data Mining (pp. 517–520). Melbourne, FL.

Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 269–274).

Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 446–455).

Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 89–98).

Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21, 768.

Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. Proc. 17th Conf. Uncertainty in Artificial Intelligence.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6, 721–741.

Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling. IEEE Trans. Neural Networks, 12, 1299–1305.

Hofmann, T., & Buhmann, J. (1997). Pairwise data clustering by deterministic annealing. IEEE Trans. Pattern Anal. Machine Intell., 19, 1–14.

Hofmann, T., & Buhmann, J. (1998). Competitive learning algorithms for robust vector quantization. IEEE Trans. Signal Processing, 46, 1665–1675.

Kearns, M., Mansour, Y., & Ng, A. Y. (1997). An information-theoretic analysis of hard and soft assignment methods for clustering. Proc. 13th Conf. Uncertainty in Artificial Intelligence (pp. 282–293).

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.

Kohonen, T. (1997). Self-Organizing Map. New York: Springer-Verlag.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. Math. Statistics and Probability (pp. 281–297).

Martinetz, T. M., Berkovich, S. G., & Schulten, K. J. (1993). "Neural-Gas" network for vector quantization and its application to time-series prediction. IEEE Trans. Neural Networks, 4, 558–569.

McLachlan, G., & Basford, K. (1988). Mixture models: Inference and applications to clustering. New York: Marcel Dekker.

Mitchell, T. (1997). Machine learning. McGraw Hill.

Neal, R., & Hinton, G. (1998). A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in graphical models, 355–368. Kluwer Academic Publishers.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of The IEEE, 86, 2210–2239.

Rose, K., Gurewitz, E., & Fox, G. C. (1993). Constrained clustering as an optimization method. IEEE Trans. Pattern Anal. Machine Intell., 15, 785–794.

Sinkkonen, J., & Kaski, S. (2001). Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14, 217–239.

Slonim, N., & Weiss, Y. (2003). Maximum likelihood and the information bottleneck. Advances in Neural Information Processing Systems 15 (pp. 335–342). Cambridge, MA: MIT Press.

Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. Proc. 37th Annual Allerton Conf. Communication, Control and Computing (pp. 368–377).

Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4, 1001–1037.
