andimplementation
MaurizioMartina,MemberIEEE,GuidoMasera,MemberIEEE
Abstract
ThispaperdescribestheanalysisoftheMumfordandShahfunctionalfromtheimplementationpointofview.Ourgoalistoshowresultsintermsofcomplexityforrealtimeapplications,suchasmotionestimationbasedonsegmentationtechniques,oftheMumfordandShahfunctional.Moreoverthesensitivitytofiniteprecisionrepresentationisaddressed,afastVLSIarchitectureisdescribedandresultsobtainedforitscompleteimplementationona0.13µmstandardcellstechnologyarepresented.
IndexTerms
ImageSegmentation,MumfordandShah,PerformanceEvaluation,VLSIimplementation
I.INTRODUCTION
RecentlytheuseoftheMumfordandShahfunctional[1]hasbeeninvestigatedindifferentapplications[2],[3],[4].In[4]anovelinterpretationoftheopticflowhasledtoanextensionoftheMumfordandShahfunctionaltomotionsegmentation.ThisapproachnamedMotionCompetitionisintendedtojoinmotionestimationandsegmentationtoderiveavariationalapproachforthesegmentationoftheimagedomainintoregionsofhomogeneousmotion.Intermsofimplementationseveraliterativealgorithmscanbefoundintheliterature,e.g.[3],[5],[6],[7],[8],[9]and[10].HoweverthecomputationalcomplexitybehindtheMumfordandShahmodelhasnotbeenstressedyet.Themultigridapproachisaninterestingtechniquetospeed-upiterativemethodsforsolvingellipticproblems:thebasicideaistofindfirstasolutiontotheproblem,solvingitonacoarsemeshandthentorefinetheoriginalproblemonafine
TheauthorsarewithCERCOM(CenterforMultimediaRadioCommunications)-Dip.diElettronica-PolitecnicodiTorino,CorsoDucadegliAbruzzi24,I-10129Torino(Italy)-correspondingauthor:MaurizioMARTINA-E-mail:maurizio.martina(guido.masera)@polito.it,Tel:+390112276512/+390115644102,Fax:+390115644099
August4,2005
DRAFT
1
mesh.ThisapproachhasbeensuccessfullyemployedfortheMumfordandShahfunctionalin[7]provingnoteworthyspeed-upintermsofiterationsreduction.Itsmaindrawbackistheamountofmemoryrequiredtostorecoarsesolutionsthatwillbeemployedduringthesolutionoffinergrids.Moreovertomovefromacoarsergridtoafineroneandvice-versa,restrictionandexpansionoperatorsarerequired.
OthersolutionsarebasedontheSteepestDescent(e.g.[3]and[9])toreducethenumberofiterations.BesidesthePreconditionedConjugateGradient(PCG)methodisaveryeffectivetechniqueforsolvingsparselinearequationssystemsasAx=b.ItisbasedonpreconditioningthesocalledA-orthogonalitythatimprovestheconvergetothesolutionemployingA-orthogonalsearchdirectionsd(i)(dT(i)Ad(j)=0).Itsusetospeed-uptheMumfordandShahfunctionalimplementationhasbeenaddressed[10],showinginterestingresults.ThemaindrawbackofthePCGmethodisitssensitivitytofiniteprecisionrepresentation.Infactworkingwithhighprecisionfloatingpointtools(e.gMatlab)thisphenomenonisverylimited.OnthecontraryweareinterestedinfixedpointimplementationsthatcouldtacklePCGmethodeffectiveness.Asimplenumericaliterativesolutionbasedonnon-linearGauss-Seidelmethodcanbeem-ployed[5]toreducethenumberofiterations:thecomputationalcomplexityisstillproportionaltothenumberofiterationskandtotheimagesizer×c,whereristhenumberofrowsandcisthenumberofcolumns.RunningaCmodelona3.06GHzIntelXeonwith2GBofRAMitcanbeobservedthatfor255graylevelsimages,270-280msarerequiredonaQCIF(176×144pixels)frameand1120-1130msonaCIF(352×288pixels)frame.Thepowerconsumption[11],withasupplyvoltageof1.5Visestimatedin60-70Wthatmeansmorethan30µJpersampleperiteration:evenwithamodernprocessor,thecomputationalcomplexitytoperformrealtimesegmentationistooinadequateforageneralpurposeCPU.
Thecontributionofthisworkistwofold:firsttofocusonthecomputationalcomplexityofsolvingtheMumfordandShahfunctional,givingresultsonexecutiontimeandonfiniteprecisionrepresentationeffects.Secondtoproposeafasthardwareimplementationsuitablefornewmotionestimationtechniques,asMotionCompetition,withhighframeratevideosequences.Theproposedarchitecturekeypointsare:1)afullyparalleldata–pathwithtwohardwaredividers;2)animproveddiagonalscanningordertoeffectivelyfeedthedata–pathpipeline.ThecompleteVLSIflowperformedfortheproposedarchitectureincludesthegenerationofactualpostplaceandroutefiguresintermsofcomplexity,area,clockfrequencyandpowerconsumption.
August4,2005
DRAFT
2
II.THEORETICALFRAMEWORK
TheMumfordandShahapproachisbasedonamathematicalmodelthatconsidersthesegmentationproblemasapartitionoftheimagedomainΩ⊂R2inopensubsetsΩi.Givenanimage,whoseintensityfunctionisdefinedasg:Ω→R,theMumfordandShahfunctional[1]aimstofindasmoothapproximationuofgineachsub-domainΩi:
E(u,K)=(u−g)2dx+α|∇u|2dx+βK
Ω
Ω\\K
(1)
whereKisthesetofΩiboundariesandKdenotesKsetlength,αisaparameterrelatedtothescale[9],βisaparameterrelatedtothecontrast[9],x=(x1,x2),dxistheLebesgue
2∂u∂u2
measureintheplaneand|∇u|=∂x1+∂x2.In[12]AmbrosioandTortorellidemonstratedthatthefunctionaldescribedinequation1canbeapproximated,intheΓ-convergencesenseby
Eε(u,z)=(u−g)2+αz2|∇u|2+βΦ(ε,z)dx(2)
Ω
withΦ(ε,z)=ε|∇z|2+
wherethefunctionzyieldsanapproximatedescriptionofthe
2
∂z∂zsetofcurvesKand|∇z|2=∂x+.ThusthesolutionobtainedminimizingEε(u,z)tends∂x21tothesolutionobtainedminimizingE(u,K)asε→0.Asequation2isanellipticproblemitsminimizerssatisfytheEuler-Lagrangedifferentialequationbothforuandz.Sothatweneedto:1)developuandzEuler-Lagrangeequations;2)discretizeg,uandzonauniformgridofN×Nnodeswithmeshsizeh-thustheybecomearraysgi,j,ui,j,zi,j.Togethertheycorrespondtoanonlinearsystemofequationsforthe2N2unknownvaluesui,j,zi,j.EmployinganiterativealgorithmasthenonlinearGauss-Seidelmethod[5]weobtain
ui,j
4ˆzi,j+h2/ε2u˜i,j/h2+gi,j/α
,zi,j==
1/α+z˜i,j/h216+uˆi,j+h2/ε2(3)
(1−z)2,4ε222
whereu˜i,j=zi2+1,jui+1,j+zi2−1,jui−1,j+zi,j˜i,j=zi2+1,j+zi2−1,j+zi,j+1ui,j+1+zi,j−1ui,j−1,z+1+22zi,j,zˆ=z+z+z+zanduˆ=4α|∇u|i,ji+1,ji−1,ji,j+1i,j−1i,j−1i,j/βε.Fromequation3itcanbe
observedthatnoteworthycomputationalcomplexityisrequiredtocomputeui,jandzi,j.Foracompleteformalderivationofequation3referto[5].
III.PARAMETERS,ITERATIONSANDFINITEPRECISIONEFFECTS
A.Parametersrangechoice
Obtainingoptimalvaluesforα,βandεisnotstraightforward,howeverdifferentapproachescanbefoundintheliterature[3],[9],[13]andthisselectiontaskisnotagoalofthiswork.
August4,2005
DRAFT
3
Neverthelessselectingα,βandεaspowersof2wouldleadtostronglyreducethecomputationalcomplexity[14].Inthispaperwewilldiscusstherangessuggestedin[14],namelyε∈[1/32,1/2],β∈[1/256,1/2]andα∈[1,64]withtheimagevaluesintherange[0,1].Figure1showstheresultsobtainedvaryingαasapowerof2in[1,64]withh=1,ε=1/16andβ=1/256ontheimage“foreman”:awidespectrumofimagesfromsmoothertosharperarecovered.Figure2showsthemeanerrorobtainedvaryingεandβaspowersof2intheranges[1/32,1/2]and[1/256,1/2]respectively,withα=8forthesameimage:itcanbenoticedthattheerrorisratherlimited;moreoverasimilarbehaviorhasbeenobservedforseveralotherimagesandparameterschoices,thustheperformedanalysisshowsthatthechoiceofα,βandεaspowersof2resultsinacceptableapproximationsoftheMumfordandShahfunctional.Thesolutionofequations2withthenonlinearGauss-Seidelmethodimpliescertaininitialconditions(i.e.thevaluesatiteration0),thatcanbechosen,accordingto[5],asui,j=gi,jandzi,j=1.Moreoverapropernumberofiterationsisrequiredtoobtaingoodresults.Fromthedirectimplementationofequation3onsomeQCIFandCIFimages,withh=1,α=8,β=1/256andε=1/16,wecanobservethat:afterapproximately20iterationstheincreaseofaccuracyonuandz(fromthekthtothek+1thiteration)tendstosaturateintermsofMeanSquareError(MSE)andPeakSignaltoNoiseRatio(PSNR).Infigure3thisbehaviorisshownfordifferentimages(“Foreman”,“Mobile”,“Tempete”and“Paris”)togetherwiththemeanvalue(dashedcurves).Thesolidcurves,representingthePSNRobtainedfromthemeanMSE,bettershowthesaturationphenomenon.
InthefollowingwewillfixtheMumfordandShahfunctionalparametersash=1,α=8,β=1/256,ε=1/16andthemaximumnumberofiterationsto20:withthissetofvalueswewillconcentrateontheMumfordandShahfunctional(equation3)sensitivitytofiniteprecisioneffects.
B.Finiteprecisioneffects
Sincegi,j,ui,jandzi,jareintherange[0,1],andα≥1,thefollowinginequalitiesholdtrueforthenumerator(Nu)andthedenominator(Du)ofui,jinequation3:0≤Nu≤4+1/αand1/α≤Du≤4+1/α,sothat,3bitsareneededfortheintegerpartofNuandDu.Analogouslywehave:1/ε2≤Nz≤16+1/ε2and16+1/ε2≤Dz≤16+2α/βε,whereNzandDzarethenumeratorandthedenominatorofzi,jinequation3.Sinceα≥1,0≤β≤1and0≤ε≤1
August4,2005
DRAFT
(0)
(0)
4
(a)uwithα=1(b)uwithα=64(c)zwithα=1
(d)zwithα=2(e)zwithα=4(f)zwithα=8
(g)zwithα=16
Fig.1.
(h)uwithα=32(j)zwithα=64
MumfordandShahresultsontheQCIFimage“Foreman”withα∈[1,64]
x 1016−5x 1060.05−40.05140.10.150.20.250.360.350.40.450.54120.10.150.20.255410β8β30.30.350.410.450.5220.050.10.150.20.250.30.350.40.450.50.050.10.150.20.250.30.350.40.450.5εε(a)βandεerroranalysisonu(b)βandεerroranalysisonz
Fig.2.Erroranalysisonβandε
August4,2005DRAFT
5
Number of iterations0x 10−3Number of iterations8010055ForemanMobileTempeteParismean102030405060709000.3102030405060708090650Mean Square Error5454Mean Square Error0.25ForemanMobileTempeteParismean1004540350.2302520PSNR30.15352300.1151250.05100101234Number of iterations567892010234Number of iterations5678910(a)uMSEandmeanPSNRVsnumberofiterations(b)zMSEandmeanPSNRVsnumberofiterations
Fig.3.uandzaccuracyincreaseandthemeanPSNRVsnumberofiterationsfortheQCIFimage“Foreman”,theCIFimages“Mobile”,“Tempete”,“Paris”
thenumberofbitsforcorrectlyrepresentingtheintegerpartofNzandDzwouldberatherhigherthanthenumberobtainedfortheintegerpartofNuandDu.Inparticularwiththeworstcontributionsofα,βandε,around20bitsfortheintegerpartofDzareneeded.Toreducethisnumberwerewriteequation3(zi,j)withh=1as
zi,j=
1α1α+4εzˆαi,j
2
2
PSNR402ε
+16ε+4|∇u|i,jαβ(4)
Nowwehave:1/α≤Nz≤16ε2/α+1/αand1/α+16ε2/α≤Dz≤1/α+16ε2/α+2ε/β.Withtherangeconsideredforα,βandε,NzandDzneedatleast3and9bitsrespectivelyfortheintegerpart.Althoughdifferentchoicesofparameterscouldalsobeofinterest,intherestofthepaperanalysisandsynthesisresultswillbelimitedtothecaseh=1,α=8,β=1/256andε=1/16:theMatlabimplementationoftheMumfordandShahfunctionalinthiscasewillbeassumedasthereferencemodelandusedasacomparisontermwhileinvestigatingthefiniteprecisioneffectsofui,jandzi,jcalculation.
Infigure4(a,b)meanvaluesforMSEandPSNRversusthenumberofbitsdevotedtothefractionalpartrepresentation(m)overdifferentimagesareshown.After16bitsforthefractionalpartthedifferencebetweenthefloatingpointreferencemodelandthefiniteprecisionimplementationisnegligible.Moreoverafterm=18−20bits,thevaluesforthePSNRmightnotbecompletelyreliableduetothejointeffectoffiniteprecisionandalgorithmiterations.
August4,2005
DRAFT
6
7x 10−40.090.0818017016018016014060.07PSNR [dB] U0.06MSE U40.05MSE Z140130120110100Z12010080U30.040.0320.021U6040Z0.0190807081012141618202208101214number of bit161820220number of bit(a)uandzMSEVsm0.0314 bits16 bits18 bits0.0250.030.035(b)uandzPSNRVsm14 bits16 bits18 bits0.0250.02mean = 1.6169e−08var = 2.8872e−120.02mean = −2.9939e−08var = 2.0795e−011mean = −5.1001e−08var = 1.476e−10pdf0.015pdf0.0150.01mean = 1.3202e−07var = 3.3002e−100.01mean = −1.0302e−07var = 4.1e−10mean = −2.433e−07var = 6.5702e−090.0050.0050−5−4−3−2ufloating−point−ufixed−point−101234x 105−50−4−3−2zfloating−point−zfixed−point−10123x 10−4(c)uerrordistributionwithm=14,16,18(d)zerrordistributionwithm=14,16,18
Fig.4.meanuandzdifferencebetweenthefloatingpointreferencemodelandfiniteprecisionCmodelVsnumberofbits
forthefractionalpart(m)(a),(b).Errordistributiononuandzwithm=14,16,18(c),(d)fortheQCIFimage“foreman”
Furthermoreinfigure4(c,d)theerrordistributionsonuandzform=14,16,18areshown.Theseresultshavebeenobtainedcalculatingthedifferencebetweenthefloatingpointresultsandthefixedpointones,andrepresenttheerrordistribution(pdf).Theframeusedis“foreman”andtheparametersvaluesareh=1,α=8,β=1/256andε=1/16.Itisworthnoticingthatsimilarresultshavebeenobtainedfordifferentchoicesofparametervaluesintheaforementionedranges.Fromtheresultsshowninfigure4wecanconcludethatahardwarededicatedimplementation,employingm≥16wouldbeabletoachieveverygoodperformance.
August4,2005DRAFT
PSNR [dB] Z451507
zNuNzSuSzEuEzWuWstartaddressgenerationunitjig−dbus>>addressgenerationunitjiw−abus#colsgbufferubufferzbufferg−abusu−dbusz−dbusmemoryinterface4<<4<<εβ<<4ε2αu−abusuNzNuSzSuEzEuWzWgi,jz−abusε2ααβε1>>αdata−pathui,jzi,jgi,j>>αNuu−divDzDuz−divNzui,jzi,j(a)Proposedarchitectureblockscheme(b)Data–pathblockscheme
Fig.5.Proposedarchitectureblockscheme(a)anddetailofthethedata–path(b)
IV.IMPLEMENTATION
ISSUES
In[14]softwareimplementationsonmodernmicroprocessorsoftheMumfordandShahfunc-tionalareaddressed,showingthathighframeratesequencescannotbeprocessedinreal–time;infacttoperform20iterationsonaQCIFframe4sand12.5sonthehighperformanceTexasInstrumentsTMS320C6711andontheStrongARMSA-1100respectively.Thusinasystemwhererealtimeprocessingisneeded,asthevideoscenario,adedicatedVLSIimplementationwouldberequired.FromtheanalysisperformedinsectionIIIitcanbeobservedthatitisdifficulttofixthevaluesoftheMumfordandShahfunctionalparameters,thenumberofiterationsandthenumberofbitsmdevotedtotherepresentationofthefractionalpartofthedata.ToobtainahighlyreusablearchitectureaparametricVHDLmodelhasbeendeveloped,focusingonhighperformanceintermsofnumberofsustainedframe,whilegrantinghighqualityoftheresultinguandz.
IntheliteraturetheimplementationoftheGauss-Seidelmethodforthesolutionofequationssystemshasalreadybeenaddressed.HoweversolutionsproposedintheliteraturecannotbestraightforwardlyadaptedtothecaseoftheMumfordandShahfunctional.InfactmanyworksdescribetheimplementationoftheGauss-Seidelmethodwithparticularemphasisonserialandsharedmemoryparallelcomputing[15],[16]oronsystolicsolutions[17].Aspointedoutin[18]thereisastrongdatadependencybetweenthepixelsproducedwiththeGauss-Seidelalgorithm.Inthefollowingwewillrefertotheneighborsoftheconsideredpixelasnorth,south,eastand
August4,2005
DRAFT
8
13610259487kk=310k=24jk=118952367i(a)Diagonalscanningorder
Fig.6.
Pixelscanningorder
(b)Improveddiagonalscanningorder
west:xi−1,j=xN,xi+1,j=xS,xi,j+1=xEandxi,j−1=xWwherexcanbeeitheruorz.Theproposedarchitecture,depictedinfigure5(a),iscomposedby:a)threebufferstostorerespectivelytheoriginalimageg,theapproximationuandtheedgesz;b)twoaddressgenerationunitsdevotedtomanagethecorrectscanningordertoelaboratethesamples;c)acomplexdata–pathtoimplementequations3and4;d)asimplememoryinterfaceunittoloadallthedatarequiredtoperformtheoperationsdescribedbyequations3and4.A.Addressunit
Asitcanbeobservedfromequations3and4ui,j(andzi,j)dependsonnorth,south,eastandwestvalues.Thusscanningtheimagebyrowsandcolumns(orvice-versabycolumnsandrows)toevaluatethenewvalueofui,j+1thecomputationofui,jandzi,jmustbecompleted.Infactforui,j+1thepixelsui,jandzi,jactasuWandzWrespectively.Insteadofscanningtheimagerow-wiseandcolumn-wise(orvice-versa)a“diagonal”scanningordercanbeemployed(figure6(a)):ahighlatencyforgeneratingthefirstvaluesisrequired,whileforthefollowingpixelsthelatencydecreasesasthepipelineisfilled.Moreoverthealgorithmiterativenaturecanbeexploitedtopipelinealsoontheiterations[18].Thisimproveddiagonalscanningorderisshowninfigure6(b)whereiandjrepresentrowsandcolumnsdirectionsandkrepresentstheiterationstep;thenumberassociatedtoeachpixelrepresentstheorderusedtoprocessthe
August4,2005DRAFT
9
startenstartenAddress generation order
[1,1]
[1,2][2,1][1,1]
[1,3][2,2][3,1][1,2][2,1][1,1]
[1,4][2,3][3,2][4,1][1,3][2,2][3,1][1,2][2,1][1,1]...
tc#iterationiterationcounter(−)entcstartenstartenstart−cnttc(+)startstop−cnttc(+)startstart−cnttc(+)startstop−cnttc(+)startstartstart−vcnt(−)entcstop−vcnt(−)enFSMstart−vcnt(−)tcentcstop−vcnt(−)enstartvaluecnt(+)ienstartvaluecnt(−)jeni−cntj−cntFig.7.Addressgenerationunit
differentvalues.Thisscanningorderleadstothegenerationoftheaddressesdepictedinfigure7ontheleft;theycanbegeneratedresortingtosomecountersandsmallcontrollogic,asdepictedinfigure7ontheright,wherestart−cntandstop−cntareup-counterstogeneraterows(i)andcolumns(j)currentindexlimits.start−vcntandstop−vcntaredown-counterstogeneratenewindexlimitstakingintoaccountboththediagonalorderandthealgorithmiterations.value−cntareanup-counter(i)forrowindicesgenerationandadown-counter(j)forcolumnsindicesgeneration.Thefinitestatemachinereadscountersterminalcount(tc)signalsandgeneratesstartandenable(en)signalsalsocheckingrowsandcolumnsbounds.Finallyiterationcounterisadown-counterdevotedtoperformthenumberofiterationsprogrammed.ThisarchitecturehasbeensynthesizedwithSynopsysdesigncompiler,placedandroutedwithCadenceencounterona0.13µmstandardcelltechnologybothforQCIFandCIFframes.TheaddressgenerationunitforQCIFframescanrunat550MHzrequiringanareaof9170µm2,whereasforCIFframescanrunat510MHzrequiringanareaof10655µm2.B.Data–path
Theproposedarchitectureimplementsequations3and4respectivelytogeneratebothui,jandzi,j.Asitcanbeinferredfromequations3and4,thearchitecturebottleneckisthedivision.Inthecaseoftheproposedarchitecturethedivisionalgorithmneedstobeunrolledandpipelinedinordertoobtainaparalleldividerabletosustainhighthroughput.Moreover,sinceui,jandzi,jarein[0,1]thenumeratorwillneverbegreaterthanthedenominator,thuswedonotneedtointroduceoperandsnormalization,asinmostSRTdividersimplementations[19].
Asdepictedinfigure5(b),toperformtheseoperations,thedata–pathcanprocessatthesametimeuN,uS,uW,uE,zN,zS,zW,zEandgi,j.Asdescribedbyequations3and4zN,zS,zE
August4,2005
DRAFT
10
andzWcanbeemployedtoevaluatethesquarevaluesrequiredbyu˜i,jandui,jdenominator;moreovertheyareneededtocalculatezˆi,j.Employingfourmultipliersthefoursquarevaluescanbecalculatedatthesametime(top-leftinfigure5(b)),meanwhilethefourzvaluesareaddedtogethertoobtainzˆi,j(top-rightinfigure5(b)).DuringthesameclockcycleuN,uS,uEanduWareemployedtoevaluatethefirstpartofthediscretegradient|∇u|2i,jandarestoredintofourregisters.Sincethediscretegradientissquaredandmultipliedbyfouritcanbecomputedsimplyaddingtogetherthesquareddifferences,infactthefollowingexpression
22holdstrue4|∇u|2i,j=(uS−uN)+(uE−uW).Thediscretegradientcalculationhasbeensplit
intothreeparts:1)subtractionsuSN=uS−uNanduEW=uE−uW,2)squaresuy=u2SNandux=u2EW,3)additionuy+ux,whereeachpartisimplementedinaseparateclockcycle(top-centerinfigure5(b)).Whilethediscretegradientsecondpartisevaluated,thesquaredzvaluesareaddedtogethertoobtainui,jdenominatorandmultipliedbytheproperuinput(uN,uS,uEoruW)tocomputethepartialvaluesrequiredbyu˜i,j.Moreoverinthesameclockcyclezˆi,jisshiftedby4ε2/α(seefigure5(b)inthemiddle).
Duringthefollowingclockcycletheevaluationofu˜i,jiscompletedandgi,j(shiftedbyα)isaddedtogenerateui,jnumerator.Inthesameclockcyclethecomputationofui,jdenominatoriscompletedaddingtozsquaredvaluestheterm1/α.Moreoverzi,jnumeratoranddenominatorarecalculated:theformeraddingto4ε2/αzˆtheterm1/α,thelattershiftingthediscretegradientbyε/βandadding1/α+16ε2/β.Finallytwodividersareemployedtocomputeinparallelui,jandzi,jstartingfromtheirnumeratorsanddenominators(bottompartoffigure5(b)).Fromtheresultsshowninfigure4itisworthnoticingthatprobablymorethan12bitswillbeemployedfortherepresentationofdatafractionalpart.Sinceui,jandzi,jarein[0,1],thedata–pathhasbeenimplemented(synthesized,placedandrouted)varyingthedatawidthrepresentation(m)from14to24bits.BesidesconsideringtherangesdetailedinsectionIIIforα,βandε,internaldataneedupto9bitsfortheintegerpartrepresentation.Resultsdetailedinfigure8showhowthefrequencyachievablebythedata–pathchangesvaryingthenumberofbitstorepresenttheoutputresults.Fromtheresultsreportedinfigure8itcanbeobservedthattomakethedata–pathabletoprocessanewsampleateachclockcycle,theaddressgenerationunitneedstobeconnectedtoafastmemoryinterfaceblock.
August4,2005DRAFT
11
Frequency451 bit per clock2 bit per clock3.5x 106Area1 bit per clock2 bit per clock403352.5MHz30µm22251.520115141516171819m20212223240.5141516171819m2021222324(a)Frequency
Fig.8.
Proposeddata–path:areaVsnumberofbitperoutputresults(m)
(b)Area
C.Memoryinterfaceblock
Itisreasonabletosupposethatu,zandgvaluesarestored(atleastpartially)inthreebuffers.Thememoryinterfaceblockisasimplefinitestatemachinethatstartingfromthematrixrowaddress(i)andthematrixcolumnaddress(j)generatedbytheaddressgenerationunit,isabletoloaduN,uS,uE,uW,zN,zS,zE,zWandgi,jfromthebuffersandtofeedthedata–pathwiththesevalues.Exploitingthelowerfrequencyachievablebythedata–path,thememoryinterfaceblockbehavesasasortofserialtoparallelcomponent.Infigure9(a)ablockschemeofthememoryinterfaceblockisshown.Thefinitestatemachine(FSM)isdevotedtoreceivetherowandcolumnindices,iandj,calculatedbytheaddressgenerationunit.StartingfromthesevaluestheFSMselectstoadd1,0or-1toinordertoimplementi+1,iori−1(analogouslywithj).Theresultobtainedonthecolumnindexjoughttobemultipliedbythenumberofcolumns(#COLS)andaddedwiththeresultobtainedontherowindexi;whenanaddressiscorrectlygenerateditisvalidatedbytheFSM.Givenanaddressontheproperaddressbus(gabus,uabusandzabus),thedatabuffers(g,uandz)latchthedatarequiredbythememoryinterfaceblockonthedatabus(gdbus,udbusandzdbus);finallytheFSMloadseachvalueintoitsregister(bottom-rightinfigure9(a)).Thememoryinterfaceblockhasbeensynthesized,placedandroutedona0.13µmstandardcelltechnologyforQCIFandCIFframesvaryingm
August4,2005
DRAFT
12
x 104AreaQCIFCIF5.65.45.25j#cols−1j01−101g_abusu_abusz_abusiivalidFSMµm24.84.64.44.243.83.614uNzNuSzSuEzEuWzWgi,j
u_dbusz_dbusg_dbus1516171819m2021222324(a)Memoryinterfaceblockscheme(b)MemoryinterfaceblockareaVsdatawidth
Fig.9.MemoryinterfaceblockschemeandareaVsdatawidthforQCIFandCIFframes
Sincethecriticalpath(forthisimplementation)isintheaddressgeneration(j×#COLS+i),themaximumachievableclockfrequencydependsonlyontheframesizeandnotonthedatawidth.Infactthemaximumachievablefrequencyisabout215MHzand190MHzforQCIFandCIFframesrespectively.Experimentalresultsshowthattheamountofgatesrequiredincreasesasthedatawidthgrows(seefigure9(b)).Finally,sincethedata–pathproducestwodataperclockcycle,namelyui,jandzi,j,theycanbesenttotheproperbufferresortingtoafurtheraddressgeneratoramultiplierandanadder.
V.PROPOSEDARCHITECTUREPERFORMANCE
Inthefollowingwewillfocusonthecaseofm=16andonebitperclockcycle,andwewilldiscusstheglobalarchitectureperformance.SimulationsperformedontheproposedarchitectureVHDLdescriptionshowthatthedata–pathneeds20clockcyclestoevaluateanewcoupleofresults(uandz).Duetotheimproveddiagonalscanningthedata–pathlatencydecreasesasLat0
i=1i.Infactafterthe20thcolumnisprocessed,thesimulationshowsthatthearchitectureproducesacoupleofnewvalues(ui,j,zi,j)ateachclockcycle.Consideringpostplaceandroutefrequencyandthatthememoryinterfaceshouldrunfasterthanthedata–pathtofetchtherequireddata,theproposedarchitecturecansustainmorethan25QCIFframespersecondandabout15CIFframespersecond.Infigure10(c)theperformanceachievablewiththeproposed
August4,2005
DRAFT
13
g
low addr
g
Data−path
Data−path
z divider
Data−path
u divider
low addr
Data−path
g
high addr
QCIF (144 x 176)CIF (288 x 352)
Proposedarchitecture1 iterationProposedarchitecture20 iterations[15]20 iterations[7]
λin [10, 1000]
z
low addr
g
high addr
676µs3 ms
z
high addr
addrunit
wbunit
u
Data−path
u divider
high addr
wbunit
z
high addr
Data−path
z divider
13.5 ms61 ms
u
low addr
u
low addr
z
low addr
~530 ms~2.1 s
u
Memory Interfacehigh addr
Memory Interface
addrunit
~393 ms~1.74 s
(a)QCIF
Fig.10.
(b)CIF(c)Performancecomparison
PostplaceandroutearchitecturesforQCIFandCIFframes(a)and(b).Performancecomparisonwith[14]and[6]
architecturearesummarizedandcomparedwiththeruntimeofthefixedpointCmodel[20]describedin[14]andofacompetitivesoftwareimplementation[6]withintheMegawavepackage(Megawave2.31a)[21]runningonaLinuxMandrake10-PentiumIVat1.6GHzwith512MBofRAM.Postplaceandrouteofthewholearchitectureshowsthat4.72mm2and4.73mm2arerequiredforQCIFandCIFframesrespectively(seefigure10(a,b)).
ThankstotheparametricVHDLdescriptionthetwoimplementationshavebeenachievedsimplychangingsomeparametersintheVHDLsourceandre-performingthedesignflow.Infigure10(a,b)itcanbeobservedthat:sincetheQCIFdesignisslightlysmallerthantheCIFone,alsotheQCIFchipisslightlylessdensethantheCIFone.Asfarasthepowerconsumptionisconcernedpostplaceandrouteanalysisshowsthatabout337mWarerequiredfortheQCIFarchitectureand395mWfortheCIFarchitecture.
VI.CONCLUSIONS
InthispapertheMumfordandShahfunctionalhasbeeninvestigatedfromtheimplementationpointofview.Adetailedstudyoffiniteprecisionrepresentationeffecthasbeencarriedout,afastVLSIarchitecturehasbeendescribedandresultsobtainedthroughtheimplementationona0.13µmstandardcellstechnologyhavebeenpresented.Finallytheauthorswouldliketothankthereviewersfortheirsuggestionsthathaveactuallyimprovedthequalityofthiswork.
August4,2005DRAFT
14
REFERENCES
[1]D.MumfordandJ.Shah,“Optimalapproximationsbypiecewisesmoothfunctionsandassociatedvariationalproblems,”
Comm.PureAppl.Math.,vol.42,pp.577–685,1989.
[2]F.SroubekandJ.Flusser,“Multichannelblinditerativeimagerestoration,”IEEETrans.onImageProcessing,vol.12,
no.9,pp.1094–1106,sep.2003.
[3]A.Brook,R.Kimmel,andN.A.Sochen,“Variationalrestorationandedgedetectionforcolorimages,”Journalof
MathematicalImagingandVision,vol.18,no.3,pp.247–268,2003.
[4]D.CremersandS.Soatto,“Motioncompetition:Avariationalapproachtopiecewiseparametricmotionsegmentation,”
InternationalJournalofComputerVision,vol.62,no.3,pp.249–265,2005.
[5]R.March,“Visualreconstructionwithdiscontinuitiesusingvaraitionalmethods,”ImageandVisionComputing,vol.10,
pp.30–38,1992.
[6]G.Koepfler,C.Lopez,andJ.M.Morel,“Amultiscalealgorithmforimagesegmentationbyvariationalmethod,”SIAMJ.
Numer.Anal.,vol.31,no.1,pp.282–299,feb.1994.
[7]D.Cremers,F.Tishh¨auser,J.Weickert,andC.Shn¨orr,“Diffusionsnakes:introducingstatisticalshapeknowledgeintothe
MumfordandShahfunctional,”InternationalJournalofComputerVision,vol.50,no.3,pp.295–313,2002.
[8]T.P.Vogl,J.W.Mangis,A.K.Rigler,W.T.Zink,andD.L.Alkon,“Acceleratingtheconvergenceoftheback-propagation
method,”BiologicalCybrnetics,vol.59,pp.257–263,1988.
[9]W.Vanzella,F.A.Pellegrino,andV.Torre,“Self-adaptiveregularization,”IEEETrans.onPatternAnalysisandMachine
Intelligence,vol.26,no.6,pp.804–809,jun.2004.
[10]F.Gibou,D.Levy,C.Cardenas,P.Liu,andA.Boyer,“Partialdifferentialequationbasedsegmentationforradiotherapy
treatmentplanning,”MathematicalBiosciencesandEngineering,vol.2,pp.209–226,2005.[11]Intel,“http://www.intel.com/design/xeon/datashts/252135.htm.”
[12]L.AmbrosioandV.M.Tortorelli,“ApproximationoffunctionalsdependingonjumpsbyellipticfunctionalsviaΓ-convergence,”Comm.PureAppl.Math.,vol.43,pp.999–1036,1990.
[13]T.F.ChanandL.A.Vese,“Activecontourswithoutedges,”IEEETrans.onImageProcessing,vol.10,no.2,pp.266–277,
feb.2001.
[14]M.MartinaandG.Masera,“MumfordandShahfunctional:Finiteprecisionanalysisandsoftwareimplementation,”in
IEEEInternationalSymposiumonSignalProcessingandInformationTechnology,2004.
[15]M.F.Adams,“AdistributedmemoryunstructuredGauss-Seidelalgorithmformultigridsmoothers,”inACM/IEEE
ProceedingsofSC2001:HighPerformanceNetworkingandComputing,2001,pp.1–4.
[16]M.M.Strout,L.Carter,J.Ferrante,J.Freeman,andB.Kreaseck,“CombinigperformanceaspectsofirregularGauss-Seidel
viasparsetiling,”in15thWorkshoponLanguagesandCompilersforParallelComputing,2002,pp.1–4.
[17]F.Z.Hadjam,A.Rahmoun,andM.Benmohammed,“Ondesigningasystolicnetworkfortheresolutionoflinearsystems
usingtheGauss-Seidelmethod,”inIEEEInternationalConferenceonComputerSystemsandApplications,2001,pp.283–286.
[18]J.A.YangandY.Choo,“Formalderivationofanefficientparallel2-DGauss-Siedelmethod,”inProceedingsofthe6th
InternationalParallelProcessingSymposium,1992,pp.204–207.
[19]J.E.Robertson,“Anewclassofdigitaldivisionmethods,”Trans.onElectronicComputers,vol.7,pp.218–222,1958.[20]M.Martina,“MumfordandShahCmodel,”downloadableatwww.vlsilab.polito.it/∼martina.[21]MegaWave2,“availableat:http://www.cmla.ens-cachan.fr/cmla/megawave/index.html.”
August4,2005DRAFT
因篇幅问题不能全部显示,请点此查看更多更全内容