Information Bottleneck

Shi Zhong
Department of Computer Science and Engineering
Florida Atlantic University
777 Glades Road, Boca Raton, FL 33431
zhong@cse.fau.edu
Abstract

This paper provides a general formulation of probabilistic model-based clustering with deterministic annealing (DA), which leads to a unifying analysis of k-means, EM clustering, soft competitive learning algorithms (e.g., the self-organizing map), and information bottleneck. The analysis points out an interesting yet not well-recognized connection between k-means and EM clustering: they are just two different stages of a DA clustering process, with different temperatures. The demonstrated relationships between model-based clustering, competitive learning, and information bottleneck can potentially generate a series of new algorithms.
1 Introduction

Simulated annealing (Kirkpatrick et al., 1983) is a stochastic optimization technique motivated by the annealing processes in physical chemistry. Certain chemical systems can be driven to their low-energy states by annealing, a gradual temperature-decreasing process. The annealing schedule, i.e., the rate at which the temperature is lowered, is critical to reaching the global optimum. Geman and Geman (1984) have theoretically shown that the global optimum can be achieved by a schedule following T ∝ 1/log m, where m is the current iteration number. Such schedules, however, are very slow and unrealistic in many applications.

Deterministic annealing (DA), as the name suggests, is a deterministic version of simulated annealing (Rose, 1998). Derived from an information-theoretic view of optimization problems, DA is not guaranteed to reach the global optimum. Rather, it is a heuristic strategy that avoids many locally optimal solutions and permits a faster temperature schedule. It has been successfully used in a wide range of applications (Rose, 1998). But its applications to clustering have been largely restricted to vector data (Rose et al., 1993; Hofmann & Buhmann, 1997). This paper provides a probabilistic model-based DA formulation, which helps us make insightful connections between mixture models, competitive learning, and information bottleneck.

Mixture models (McLachlan & Basford, 1988) have been widely used for density estimation and clustering applications. The EM algorithm (Dempster et al., 1977) is usually used to estimate maximum likelihood (or maximum a posteriori) parameters for mixture models. When used for clustering, it is called EM clustering. The standard k-means algorithm (Forgy, 1965; MacQueen, 1967) is another well-known clustering technique that has its roots in statistical analysis and that is closely related to vector quantization in pattern recognition and signal processing. The standard k-means algorithm is often viewed as a special case of spherical Gaussian model-based EM clustering with variance σ² → 0 (Mitchell, 1997), yet more sophisticated analyses exist. For example, an information-theoretic analysis was provided by Kearns et al. (1997), who use the term "k-means" to mean a generic algorithm that is not limited to vector space with Gaussian models. We follow this conceptual notation in our discussion. In this paper, we shall introduce a more intuitive explanation for the relationship between k-means and EM clustering, under a model-based deterministic annealing framework.

The online version of k-means also has a close connection to competitive learning techniques (Ahalt et al., 1990), in which the k-means algorithm is viewed as using a Winner-Take-All learning rule and EM clustering a SoftMax learning rule. Self-organizing maps (SOM) (Kohonen, 1997) and Neural-Gas (Martinetz et al., 1993) are two sophisticated competitive learning methods that use an annealing mechanism to avoid locally optimal solutions. Connections between SOM, Neural-Gas, and DA have been made for vector quantization applications by Hofmann and Buhmann (1998), who showed that the three types of algorithms are three different implementations of a continuation method (Allgower & Georg, 1990) for vector quantization, with different competitive learning rules. We shall show that SOM and Neural-Gas can be interpreted as variations of DA clustering and can be extended to employ different probabilistic models.

Recently, the information bottleneck (IB) (Tishby et al., 1999) has attracted a lot of interest and gained popularity in text clustering applications. An explanation of its connection to maximum likelihood algorithms has been attempted by Slonim and Weiss (2003). By deriving the IB formulation from multinomial model-based deterministic annealing, we show that the IB text clustering explored recently by many authors implicitly assumes a multinomial distribution for clusters.

In short, this paper extends our previous work (Zhong & Ghosh, 2003) with a more general formulation of model-based deterministic annealing and a thorough discussion. Through the formulation, we demonstrate some useful connections between maximum likelihood estimation of mixture models, competitive learning, and information bottleneck. We believe these connections may lead to a series of new algorithms.

Section 2 presents model-based deterministic annealing. Section 3 demonstrates k-means and EM clustering as special stages of a model-based DA clustering process. Sections 4 and 5 discuss the relationships between model-based DA clustering and competitive learning as well as information bottleneck. Section 6 presents possible new applications and algorithms based on our general formulation of model-based DA clustering. Finally, Section 7 summarizes this paper with remarks on future work.
2 Model-based Deterministic Annealing
In this section, we present a general formulation of model-based deterministic annealing and demonstrate the connection to Neal and Hinton's EM interpretation (1998). Although the EM and DA algorithms apply to general optimization problems, for simplicity, here we restrict our discussion to probabilistic model-based clustering.
Consider a central clustering problem, in which X = {x_1, ..., x_N} is a set of N data objects, Z = {z_1, ..., z_N} a set of N (hidden or unobserved) cluster indices, and Θ a set of parameters associated with a probabilistic model of X and Z, two random variables governing the distribution of x and z. Note that z ∈ {1, ..., K} is a discrete variable and K is the number of clusters. For each cluster z, the corresponding model for X is P(x|Θ, z) = P(x|θ_z), which measures the likelihood of object x coming from cluster z. The parameter θ_z is the set of parameters for cluster z. Our objective is to maximize the following average log-likelihood:
E = Σ_{x,z} P(x,z|Θ) log P(x|θ_z)
  = Σ_x P(x|Θ) Σ_z P(z|x,Θ) log P(x|θ_z)
  ∝ Σ_{x,z} P(z|x,Θ) log P(x|θ_z),    (1)

where the (parameterized) data prior P(x|Θ) is unavoidably set to be the constant 1/N in practice, as we always deal with a finite sample (but as N → ∞, using 1/N is asymptotically correct). For simplicity, we ignore this constant and use the simplified notation in (1) in later analysis.
Maximizing E directly over P(z|x,Θ) leads to a k-means type partitioning of data objects, i.e., hard data assignment. That is, for each data object x, the posterior P(z|x,Θ) has a value of 1 for z = arg max_{z′} P(x|θ_{z′}) and 0 otherwise. To consider soft partitioning, one can introduce entropy constraints or penalties into the (scaled) objective E:
F = Σ_{x,z} P(z|x,Θ) log P(x|θ_z) − γ·H(Z|Θ) + T·H(Z|X,Θ),    (2)

where H(Z|Θ) = −Σ_z P(z|Θ) log P(z|Θ) is the cluster prior entropy,

H(Z|X,Θ) = −Σ_{x,z} P(z|x,Θ) log P(z|x,Θ)

is the average cluster posterior entropy, and γ and T are coefficients used to trade off between maximizing E, minimizing H(Z|Θ), and maximizing H(Z|X,Θ). Intuitively, minimizing H(Z|Θ) has the effect of favoring a smaller number of clusters, while maximizing H(Z|X,Θ) favors soft posterior assignment of data objects to different clusters (avoiding hard decisions, which leads to so-called maximum entropy clustering). The H(Z|Θ) term is not in the original DA formulation (Rose, 1998) but was mentioned there as a possible entropy-constrained extension that has been used in vector quantization. We shall see how it is useful in making connections to EM clustering. The parameter T has a temperature interpretation, as explained later in this section.
The function F can be rewritten as

F = Σ_{x,z} P(z|x,Θ) [log P(x|θ_z) + γ log P(z|Θ) − T log P(z|x,Θ)].    (3)
Using the classic Lagrange optimization method, together with the constraint Σ_z P(z|x,Θ) = 1, one can derive the maximum entropy solution of the posterior P(z|x,Θ) (for maximizing F) to be the well-known Gibbs distribution

P(z|x,Θ) = exp[(log P(x|θ_z) + γ log P(z|Θ)) / T] / Σ_{z′} exp[(log P(x|θ_{z′}) + γ log P(z′|Θ)) / T]
         = P(x|θ_z)^{1/T} P(z|Θ)^{γ/T} / Σ_{z′} P(x|θ_{z′})^{1/T} P(z′|Θ)^{γ/T}.    (4)

Several interesting choices of γ are discussed in the next section. In general, when the temperature T is high, the posterior entropy is high, meaning the probability of each data object being assigned to different clusters is balanced (or uniform). Conversely, as T decreases to 0, the posterior probability P(z|x,Θ) will be close to either 0 or 1, leading to a hard data assignment. A DA algorithm naturally starts at a high temperature and then gradually decreases the temperature towards 0. At each temperature, the EM algorithm is used to maximize the objective (2), with an E-step given by (4) and an M-step
θ_z^{(new)} = arg max_θ Σ_x P(z|x,Θ) log P(x|θ).    (5)
A generic model-based DA clustering algorithm was constructed in Zhong and Ghosh (2003). Note that in this formulation the temperature T only affects the calculation of the posterior P(z|x,Θ), i.e., the E-step of the EM algorithm. Annealing the M-step is possible but computationally much more expensive than this DA formulation.
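As a concrete illustration, the sketch below is ours (not code from Zhong and Ghosh (2003)); it instantiates the generic DA clustering loop for unit-variance spherical Gaussian cluster models, with the E-step given by the Gibbs posterior (4) and the M-step by the weighted-mean solution of (5). The function name da_cluster, the geometric cooling schedule, and the default constants are illustrative assumptions.

import numpy as np

def da_cluster(X, K, T0=5.0, T_min=1e-3, cool=0.8, gamma=None, n_inner=20, seed=0):
    """Sketch of model-based DA clustering with unit-variance spherical Gaussians.

    E-step: Gibbs posterior of Eq. (4); M-step: weighted means as in Eq. (5).
    gamma=None means gamma = T, the case emphasized in Section 3.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]          # cluster means (theta_z)
    prior = np.full(K, 1.0 / K)                      # P(z|Theta)
    T = T0
    while T > T_min:
        g = T if gamma is None else gamma
        for _ in range(n_inner):
            # log P(x|theta_z) for a unit-variance spherical Gaussian (up to a constant)
            loglik = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
            # E-step, Eq. (4): P(z|x) proportional to P(x|theta_z)^(1/T) P(z|Theta)^(gamma/T)
            logits = (loglik + g * np.log(prior + 1e-12)) / T
            logits -= logits.max(axis=1, keepdims=True)   # numerical stability
            post = np.exp(logits)
            post /= post.sum(axis=1, keepdims=True)
            # M-step, Eq. (5): weighted maximum-likelihood means; re-estimate priors too
            mu = (post.T @ X) / (post.sum(axis=0)[:, None] + 1e-12)
            prior = post.mean(axis=0)
        T *= cool                                    # cooling schedule (illustrative)
    return mu, post

Stopping the schedule at T = 1 with gamma = T corresponds to EM clustering, while driving T toward 0 hardens the posteriors and recovers k-means, matching the discussion in Section 3.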
The relationship to maximum likelihood estimation of mixture models with the EM algorithm is demonstrated as follows. According to the view of Neal and Hinton (1998), the objective function that the EM algorithm maximizes is

F_1 = E[log P(X,Z|Θ) | X, Θ] + H(Z|X,Θ)    (6)
    = Σ_{x,z} P(z|x,Θ) log P(x,z|Θ) + H(Z|X,Θ)    (7)
    = Σ_{x,z} P(z|x,Θ) log P(x|θ_z) − H(Z|Θ) + H(Z|X,Θ),    (8)
where Z is the set of (hidden) cluster indices of all data objects, the first term in (6) is the expectation of the complete-data log-likelihood over P(z|x,Θ), and the second term in (6) is the posterior entropy. The derivation of (7) from (6) can be found in Bilmes (1998) and is thus omitted here. The step from (7) to (8) is straightforward. Therefore, the objective function in (2) is the same as that used in Neal and Hinton (1998) if we set γ = T = 1. Neal and Hinton (1998) showed that the optimal value of F_1 over P(z|x,Θ) is exactly the incomplete-data log-likelihood log P(X|Θ).
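The last statement is easy to check numerically. The following toy snippet (our own illustration, not part of Neal and Hinton's work) verifies that F_1 computed as in (7) coincides with log P(X|Θ) when the exact posterior is plugged in, for a small one-dimensional Gaussian mixture:

import numpy as np

# Toy check that F1 in (6)-(8) equals the incomplete-data log-likelihood log P(X|Theta)
# when P(z|x,Theta) is the exact posterior, for a two-component 1-d Gaussian mixture
# with unit variance (all numbers are made up for illustration).
x = np.array([0.3, 1.7, -0.5])
means, prior = np.array([0.0, 2.0]), np.array([0.6, 0.4])

log_gauss = -0.5 * (x[:, None] - means[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)
log_joint = log_gauss + np.log(prior)               # log P(x, z | Theta)
log_px = np.logaddexp.reduce(log_joint, axis=1)     # log P(x | Theta)
post = np.exp(log_joint - log_px[:, None])          # exact posterior P(z | x, Theta)

F1 = (post * log_joint).sum() - (post * np.log(post)).sum()   # Eq. (7): expectation + H(Z|X)
print(np.allclose(F1, log_px.sum()))                # True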
3 K-means vs. EM Clustering
Let us now discuss several choices of the γ parameter, which elucidate an intuitive relationship between k-means and EM clustering.
• γ = 0:
The objective F becomes

Σ_{x,z} P(z|x,Θ) log P(x|θ_z) + T·H(Z|X,Θ),

which is the same as the original DA formulation with the distortion function set to be −log P(x|θ_z). As T → ∞, P(z|x,Θ) becomes uniform, which makes sense when there is no prior knowledge about the distribution of z. As T tends to 0, the clustering algorithm becomes k-means, which maximizes the first term in (2).

• γ = 1:
The objective F becomes

Σ_{x,z} P(z|x,Θ) log P(x,z|Θ) + T·H(Z|X,Θ),

which further reduces to standard EM clustering (7) if T = 1. As T → ∞, P(z|x,Θ) becomes uniform, the same as in the γ = 0 case. As T tends to 0, the clustering algorithm becomes a hard clustering algorithm that is slightly different from k-means. Specifically, P(z|x,Θ) is 1 for z = arg max_{z′} P(z′|Θ) P(x|θ_{z′}) and 0 otherwise. This hard clustering version was observed and analyzed in Banerjee et al. (2003). It seems to maximize a lower bound of the incomplete-data (log-)likelihood.
• γ = T:
This is the scenario we are mainly interested in. The objective F becomes

Σ_{x,z} P(z|x,Θ) log P(x|θ_z) − T·I(X;Z|Θ),    (9)

where one can easily see that an interpretation of maximizing F is to minimize the mutual information between X and Z, i.e., to compress X into Z as much as possible, while at the same time maximizing the average log-likelihood. The temperature T is used to adjust this tradeoff. The E-step now becomes

P(z|x,Θ) = P(z|Θ) P(x|θ_z)^{1/T} / Σ_{z′} P(z′|Θ) P(x|θ_{z′})^{1/T}.

As T → ∞, the cluster posteriors become equal to the priors, i.e., P(z|x,Θ) = P(z|Θ). This is useful when one intends to enforce some prior knowledge on the distribution of cluster sizes (which cannot be achieved in the previous two scenarios). As T lowers to 1, the clustering algorithm is equivalent to EM clustering; as T further goes to 0, the algorithm becomes k-means. Therefore, we can view EM clustering and k-means as two different stages of a model-based DA clustering process, with T = 1 and T = 0, respectively. This is a better interpretation than the traditional one in Mitchell (1997), where one has to artificially diminish the variance of the Gaussian models, destroying reasonable Gaussian model descriptions for the clusters generated by the standard k-means algorithm. (A small numeric illustration of the three regimes follows this list.)
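The following minimal snippet (our own illustration) evaluates the γ = T posterior P(z|x,Θ) ∝ P(z|Θ) P(x|θ_z)^{1/T} for a single data point at three temperatures; the log-likelihood and prior values are made up for illustration.

import numpy as np

def gamma_eq_T_posterior(loglik, log_prior, T):
    """Posterior of the gamma = T case: P(z|x) proportional to P(z) * P(x|theta_z)^(1/T)."""
    logits = log_prior + loglik / T
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

loglik = np.array([-1.0, -2.0, -4.0])        # log P(x|theta_z) for K = 3 clusters
log_prior = np.log([0.5, 0.3, 0.2])          # log P(z|Theta)

for T in [100.0, 1.0, 0.01]:
    print(T, gamma_eq_T_posterior(loglik, log_prior, T))
# T = 100 : posterior is close to the prior (0.5, 0.3, 0.2)   -- high temperature
# T = 1   : standard EM posterior P(z)P(x|theta_z)/sum        -- EM clustering stage
# T = 0.01: approximately (1, 0, 0), hard assignment          -- k-means stage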
4 Relation to Competitive Learning
In this section, we discuss two related competitive learning algorithms that have an annealing flavor: the self-organizing map (Kohonen, 1997) and Neural-Gas (Martinetz et al., 1993), both of which have been used for clustering. To see the connection clearly, we first discuss batch versions of the two algorithms.
For the original SOM algorithm, the model parameters are just the means of the clusters, i.e., Θ = {µ_z}_{z=1,...,K}. A distinct feature of SOM is the use of a topological map, in which each cluster has a fixed coordinate. Let the map location of cluster z be φ_z and let

K_α(φ_1, φ_2) = exp(−||φ_1 − φ_2||² / (2α²))

be a neighborhood function. Let z(x) = arg min_z ||x − µ_z||, where ||·|| is the L2 norm. The batch SOM algorithm amounts to iterating between the following two steps:

P(z|x,Θ) = K_α(φ_z, φ_{z(x)}) / Σ_{z′} K_α(φ_{z′}, φ_{z(x)})    (10)

and

µ_z^{(new)} = Σ_x P(z|x,Θ) x / Σ_x P(z|x,Θ),    (11)

where α is a parameter controlling the width of the neighborhood function and decreases gradually during the clustering process. As α goes to 0, P(z|x,Θ) becomes either 0 or 1 and the algorithm reduces to k-means.
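A compact sketch of the batch SOM iteration (10)-(11) is given below, assuming a one-dimensional map with unit spacing between the coordinates φ_z; the map layout, the cooling schedule, and the function name batch_som are our illustrative choices.

import numpy as np

def batch_som(X, K, alpha0=2.0, alpha_min=0.01, cool=0.9, n_inner=10, seed=0):
    """Batch SOM: map-constrained E-step (10) plus mean-update M-step (11).

    The map is a 1-d chain: cluster z has coordinate phi_z = z.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]
    phi = np.arange(K, dtype=float)                 # fixed map coordinates
    alpha = alpha0
    while alpha > alpha_min:
        for _ in range(n_inner):
            # winner index z(x) = argmin_z ||x - mu_z||
            winner = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            # Eq. (10): neighborhood-weighted "posterior", normalized over z
            k = np.exp(-((phi[None, :] - phi[winner][:, None]) ** 2) / (2 * alpha ** 2))
            post = k / k.sum(axis=1, keepdims=True)
            # Eq. (11): means are posterior-weighted averages of the data
            mu = (post.T @ X) / post.sum(axis=0)[:, None]
        alpha *= cool                               # shrink the neighborhood (annealing)
    return mu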
It is immediately noted that α has the same functionality as a temperature parameter in deterministic annealing. The difference from standard model-based deterministic annealing is that here the calculation of P(z|x,Θ) is constrained by a topological map structure, which gives SOM the advantage that all resulting clusters are structurally related according to the pre-specified topological map. Obviously, this modified E-step means that SOM is not a strict EM algorithm; the maximum likelihood statement, as well as the convergence theorem of the EM algorithm (Dempster et al., 1977), do not apply. Loosely speaking, however, the convergence of SOM is guaranteed by the convergence of k-means during the final stage of the annealing process, where α becomes close to 0 (and thus P(z|x,Θ) becomes hard and the algorithm becomes k-means).
Comparing the batch SOM algorithm with model-based DA clustering, we notice that a natural extension of SOM is to use probabilistic models in the M-step (11). Heskes (2001) discussed this direction and explored multinomial model-based SOM for analyzing market basket data.
The Neural-Gas algorithm differs from SOM clustering only in the E-step, i.e., how P(z|x,Θ) is calculated. It uses a rank-based scheme

P(z|x,Θ) = e^{−r(x,z)/α} / Σ_{z′} e^{−r(x,z′)/α},    (12)

where α is again an equivalent temperature parameter and r(x,z) is a rank function that takes value k−1 if µ_z is the k-th closest cluster centroid to data vector x (according to L2 distance). Convergence of this algorithm has been analyzed and interpreted using diffusion dynamics (Martinetz et al., 1993). Similar to SOM, however, the convergence is obvious at the final stage of annealing as α lowers to 0. Though it seems natural to extend Neural-Gas to use probabilistic models and rank clusters based on the log-likelihood measure P(x|θ_z), we have not seen any model-based Neural-Gas algorithms so far. Such model-based Neural-Gas algorithms can be interesting future work and would be useful for analyzing non-Gaussian data.
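The following sketch (ours) implements the rank-based weighting (12); the second function is a hypothetical model-based variant that ranks clusters by log-likelihood instead of L2 distance, in the spirit of the extension suggested above, not an existing algorithm.

import numpy as np

def neural_gas_weights(X, mu, alpha):
    """Eq. (12): P(z|x) proportional to exp(-r(x,z)/alpha), r = rank of ||x - mu_z|| (0 = closest)."""
    dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    ranks = dist.argsort(axis=1).argsort(axis=1)             # rank of each cluster per point
    w = np.exp(-ranks / alpha)
    return w / w.sum(axis=1, keepdims=True)

def model_based_neural_gas_weights(loglik, alpha):
    """Hypothetical variant: rank clusters by log P(x|theta_z) instead of distance."""
    ranks = (-loglik).argsort(axis=1).argsort(axis=1)         # 0 = most likely cluster
    w = np.exp(-ranks / alpha)
    return w / w.sum(axis=1, keepdims=True)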
Competitive learning techniques are usually presented in online form: as each data item is presented, multiple clusters (in the clustering context) compete for it and the cluster centroids get updated according to the result of the competition. The updating part basically follows a gradient descent approach and the learning rate needs to be carefully selected (based on stochastic approximation principles; see Cherkassky & Mulier, 1998). Online algorithms are better equipped than batch versions to handle very large data sets that cannot fit into memory. The connection demonstrated above indicates that an online version of any model-based DA clustering algorithm can be readily constructed following the practice used in competitive learning methods.
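As one possible instantiation (our own sketch, not a prescribed algorithm), an online update for spherical Gaussian models could move every centroid toward each arriving point in proportion to its tempered posterior, with a stochastic-approximation learning rate; at T → 0 this reduces to Winner-Take-All and at T = 1 to SoftMax.

import numpy as np

def online_da_kmeans(stream, mu, prior, T=1.0, lr0=0.5):
    """Online competitive-learning counterpart of the E-step/M-step pair (a sketch).

    Each data point x updates every centroid toward x, weighted by the tempered
    posterior of the gamma = T case; the 1/t learning rate is an illustrative choice.
    """
    for t, x in enumerate(stream, start=1):
        loglik = -0.5 * ((x - mu) ** 2).sum(axis=1)            # unit-variance Gaussian
        logits = (loglik + T * np.log(prior)) / max(T, 1e-8)   # gamma = T posterior
        logits -= logits.max()
        post = np.exp(logits)
        post /= post.sum()
        lr = lr0 / t                                           # stochastic-approximation rate
        mu += lr * post[:, None] * (x - mu)                    # gradient-style centroid update
        prior = (1 - lr) * prior + lr * post                   # running estimate of P(z|Theta)
    return mu, prior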
5 Relation to Information Bottleneck
In this section, we show an equivalence between IB clustering and multinomial model-based DA clustering. This result indicates that, when applied to clustering, the IB framework implicitly assumes a discrete multinomial distribution for the representation of data. Even though the original IB framework does not specify discrete distributions, continuous applications have not been seen yet and may involve considerable computational difficulty.
Let P(x|θ_z) be a multinomial distribution parameterized by θ_z = {P(y|z)}, where Y is a random variable associated with the set of possible symbols from which x is drawn. P(y|z) is the probability of drawing symbol y in cluster z. Assume that each x is formed by drawing n times from the set of symbols and n_x(y) is the number of times symbol y is drawn. We then have
log P(x|θ_z) = log Π_y P(y|z)^{n_x(y)}
             = Σ_y n_x(y) log P(y|z)
             = n ( Σ_y P(y|x) log [P(y|z) / P(y|x)] − H_{y|x} )
             = −n ( D_KL(P(y|x) ‖ P(y|z)) + H_{y|x} ),    (13)
where D_KL(P ‖ Q) is the Kullback-Leibler (KL) divergence between probability distributions P and Q, and H_{y|x} = −Σ_y P(y|x) log P(y|x) can be seen as a constant w.r.t. Z and Θ. On the other hand, assuming that Y is independent of Z given X (which is true for clustering since Z is just a compressed version of X), it is easy to show that (Tishby et al., 1999; Dhillon et al., 2002)
I(X;Y) − I(Z;Y) ∝ Σ_{x,z} P(z|x) D_KL(P(y|x) ‖ P(y|z)).
Therefore, when multinomial models are used, plugging (13) into (9), we can show that

F = Σ_{x,z} P(z|x,Θ) log P(x|θ_z) − T·I(Z;X|Θ)
  = −n Σ_{x,z} P(z|x,Θ) D_KL(P(y|x) ‖ P(y|z)) − n·H_{y|x} − T·I(Z;X|Θ)
  = −n (I(X;Y|Θ) − I(Z;Y|Θ)) − n·H_{y|x} − T·I(Z;X|Θ)
  = −T (I(Z;X|Θ) − β·I(Z;Y|Θ)) + C,    (14)
where C = −n·I(X;Y|Θ) − n·H_{y|x} is a constant w.r.t. Z and Θ. Note that I(Z;X|Θ) − β·I(Z;Y|Θ) is a parameterized version of the IB objective (to be minimized), and β = n/T is the tradeoff coefficient, analogous to an inverse temperature parameter. This is a more accurate relationship than the one defined in Slonim and Weiss (2003), who attempted a mapping between information bottleneck and maximum likelihood estimation of mixture models. As we have shown that the latter is simply one specific stage of a model-based DA process, the mapping defined in Slonim and Weiss (2003) would be exact only when β = 1, with multinomial models. It is not surprising to see the equivalence between multinomial model-based DA clustering and IB clustering, since deterministic annealing was explicitly mentioned to be related in the original IB paper (Tishby et al., 1999).
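Before moving on, here is a quick numeric check (our own illustration) of identity (13) for a small four-symbol vocabulary; all counts and parameters are made up.

import numpy as np

# Check Eq. (13): log P(x|theta_z) = -n * ( D_KL(P(y|x) || P(y|z)) + H_{y|x} ),
# where P(y|x) = n_x(y)/n is the empirical symbol distribution of data item x.
n_x = np.array([3, 1, 0, 2])                       # symbol counts, n = 6
p_yz = np.array([0.4, 0.2, 0.1, 0.3])              # multinomial parameters P(y|z)

n = n_x.sum()
loglik = (n_x * np.log(p_yz)).sum()                # log P(x|theta_z) as in (13)

p_yx = n_x / n
mask = p_yx > 0                                    # 0 * log 0 = 0 by convention
kl = (p_yx[mask] * np.log(p_yx[mask] / p_yz[mask])).sum()
h = -(p_yx[mask] * np.log(p_yx[mask])).sum()
print(np.allclose(loglik, -n * (kl + h)))          # True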
It is worth mentioning that some recent works on KL clustering (Dhillon et al., 2002; Dhillon & Guan, 2003) are just (hard) k-means versions of IB clustering with β → ∞, or multinomial model-based DA clustering with T → 0. As β goes to infinity, minimizing I(Z;X) − β·I(Z;Y) is equivalent to maximizing I(Z;Y), or the same as minimizing the information loss I(X;Y) − I(Z;Y) (since I(X;Y) is a constant).
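A sketch of that hard limit, i.e., k-means over symbol distributions with KL divergence as the distortion (the function name, initialization, and smoothing constant are our choices):

import numpy as np

def kl_kmeans(P_yx, K, n_iter=50, eps=1e-12, seed=0):
    """Hard (T -> 0) multinomial clustering: assign each row distribution P(y|x) to the
    cluster distribution P(y|z) with the smallest KL divergence, then re-estimate
    P(y|z) as the average of its members' distributions."""
    rng = np.random.default_rng(seed)
    N, V = P_yx.shape
    centers = P_yx[rng.choice(N, K, replace=False)]
    for _ in range(n_iter):
        # KL(P(y|x) || P(y|z)) for every pair (x, z), smoothed to avoid log 0
        kl = (P_yx[:, None, :] * (np.log(P_yx[:, None, :] + eps)
                                  - np.log(centers[None, :, :] + eps))).sum(-1)
        assign = kl.argmin(axis=1)
        for z in range(K):
            members = P_yx[assign == z]
            if len(members):
                centers[z] = members.mean(axis=0)
    return assign, centers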
6 Clustering with an Auxiliary Space
In this section, we discuss how the framework of model-based DA clustering can be useful in an interesting clustering scenario, in which each data item has two representations, one of which characterizes some auxiliary information about the data item. For example, in gene expression data analysis, auxiliary functional category information may be available for genes. One may desire to cluster genes according to the auxiliary representation while still using the expression information during the clustering process (Sinkkonen & Kaski, 2001). A similar situation exists for text clustering, where a large amount of auxiliary ontology information about text documents exists and can be exploited. Sinkkonen and Kaski (2001) proposed to minimize a KL-divergence-based loss function in the auxiliary space while using primary-space models to calculate the posterior probabilities P(z|x,Θ). In their work, multinomial models were used in the discrete auxiliary space and von Mises-Fisher models in the continuous primary space. Rewritten in our notation, the objective they are maximizing is
Σ_{x,z} P(z|x,Θ) log P(x|λ_z),
where Θ is a set of von Mises-Fisher (vMF) model parameters and the λ's are parameters for the multinomial models in the auxiliary space. This interpretation immediately suggests many variations of clustering data with two representations. A general objective function could be

Σ_{x,z} P(z|x,Θ,Λ) log P(x|θ_z, λ_z),

which extends the parameter Θ in (2) to include models in both spaces. Sinkkonen and Kaski (2001) chose P(x|θ_z, λ_z) = P(x|λ_z), i.e., to cluster in the auxiliary space. A design that considers both spaces is

log P(x|θ_z, λ_z) = log P(x|θ_z) + η log P(x|λ_z),

where η adjusts the tradeoff between modeling the data in the two spaces. Similarly, many choices exist for the form of the posterior probability P(z|x,Θ,Λ). It can be based on Θ alone, on Λ alone, on a topological map, or on model ranks in either space. Which choice to pick will depend on one's intentions and practical applications.
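As one concrete sketch (ours, not Sinkkonen and Kaski's algorithm), the combined E-step under the additive design above can use the γ = T tempered posterior of (4); the function name and arguments are illustrative.

import numpy as np

def two_space_posterior(loglik_primary, loglik_aux, log_prior, T=1.0, eta=1.0):
    """E-step for log P(x|theta_z, lambda_z) = log P(x|theta_z) + eta * log P(x|lambda_z),
    with the gamma = T tempered posterior of Eq. (4).

    loglik_primary, loglik_aux: (N, K) arrays of per-cluster log-likelihoods in each space.
    """
    logits = (loglik_primary + eta * loglik_aux + T * log_prior) / T
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    post = np.exp(logits)
    return post / post.sum(axis=1, keepdims=True)

Setting eta = 0 recovers clustering in the primary space only, while replacing loglik_primary with zeros mimics Sinkkonen and Kaski's choice of clustering in the auxiliary space alone.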
7 Conclusion
We have presented a general formulation for model-based deterministic annealing, which takes k-means and EM clustering as special cases. The formulation shows some interesting connections to learning mixture models with the EM algorithm, to competitive learning techniques such as the self-organizing map and Neural-Gas, and to information bottleneck clustering. These connections may suggest a series of possible new algorithms that are useful in handling large volumes of data and data with multiple representations. In the future, we shall mainly investigate the efficacy of the suggested new algorithms in text clustering and gene expression data analysis applications.
The information bottleneck, as a general principle, has been extended by Friedman et al. (2001) to address interesting bi-clustering or co-clustering (Dhillon, 2001; Dhillon et al., 2003) and multi-clustering problems. How model-based DA clustering can be adapted (or combined with the IB principle) for these problems seems to be another promising future direction.
References
Ahalt, S. C., Krishnamurthy, A. K., Chen, P., & Melton, D. E. (1990). Competitive learning algorithms for vector quantization. Neural Networks, 3, 277–290.

Allgower, E. L., & Georg, K. (1990). Numerical continuation methods: An introduction. Berlin Heidelberg: Springer-Verlag.

Banerjee, A., Dhillon, I., Sra, S., & Ghosh, J. (2003). Generative model-based clustering of directional data. Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 19–28).

Bilmes, J. A. (1998). A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models (Technical Report). University of California at Berkeley.

Cherkassky, V., & Mulier, F. (1998). Learning from data: Concepts, theory, and methods. New York: John Wiley & Sons.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Dhillon, I., & Guan, Y. (2003). Information theoretic clustering of sparse co-occurrence data. Proc. IEEE Int. Conf. Data Mining (pp. 517–520). Melbourne, FL.

Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 269–274).

Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 446–455).

Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (pp. 89–98).

Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21, 768.

Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. Proc. 17th Conf. Uncertainty in Artificial Intelligence.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6, 721–741.

Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling. IEEE Trans. Neural Networks, 12, 1299–1305.

Hofmann, T., & Buhmann, J. (1997). Pairwise data clustering by deterministic annealing. IEEE Trans. Pattern Anal. Machine Intell., 19, 1–14.

Hofmann, T., & Buhmann, J. (1998). Competitive learning algorithms for robust vector quantization. IEEE Trans. Signal Processing, 46, 1665–1675.

Kearns, M., Mansour, Y., & Ng, A. Y. (1997). An information-theoretic analysis of hard and soft assignment methods for clustering. Proc. 13th Conf. Uncertainty in Artificial Intelligence (pp. 282–293).

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.

Kohonen, T. (1997). Self-Organizing Maps. New York: Springer-Verlag.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. Math. Statistics and Probability (pp. 281–297).

Martinetz, T. M., Berkovich, S. G., & Schulten, K. J. (1993). "Neural-Gas" network for vector quantization and its application to time-series prediction. IEEE Trans. Neural Networks, 4, 558–569.

McLachlan, G., & Basford, K. (1988). Mixture models: Inference and applications to clustering. New York: Marcel Dekker.

Mitchell, T. (1997). Machine learning. McGraw Hill.

Neal, R., & Hinton, G. (1998). A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in graphical models, 355–368. Kluwer Academic Publishers.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Rose, K., Gurewitz, E., & Fox, G. C. (1993). Constrained clustering as an optimization method. IEEE Trans. Pattern Anal. Machine Intell., 15, 785–794.

Sinkkonen, J., & Kaski, S. (2001). Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14, 217–239.

Slonim, N., & Weiss, Y. (2003). Maximum likelihood and the information bottleneck. Advances in Neural Information Processing Systems 15 (pp. 335–342). Cambridge, MA: MIT Press.

Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. Proc. 37th Annual Allerton Conf. Communication, Control and Computing (pp. 368–377).

Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4, 1001–1037.