您的当前位置：首页正文

Mumford and Shah functional VLSI analysis

来源：爱go旅游网

MumfordandShahfunctional:VLSIanalysis

andimplementation

MaurizioMartina,MemberIEEE,GuidoMasera,MemberIEEE

Abstract

ThispaperdescribestheanalysisoftheMumfordandShahfunctionalfromtheimplementationpointofview.Ourgoalistoshowresultsintermsofcomplexityforrealtimeapplications,suchasmotionestimationbasedonsegmentationtechniques,oftheMumfordandShahfunctional.Moreoverthesensitivitytoﬁniteprecisionrepresentationisaddressed,afastVLSIarchitectureisdescribedandresultsobtainedforitscompleteimplementationona0.13µmstandardcellstechnologyarepresented.

IndexTerms

ImageSegmentation,MumfordandShah,PerformanceEvaluation,VLSIimplementation

I.INTRODUCTION

RecentlytheuseoftheMumfordandShahfunctional[1]hasbeeninvestigatedindifferentapplications[2],[3],[4].In[4]anovelinterpretationoftheopticﬂowhasledtoanextensionoftheMumfordandShahfunctionaltomotionsegmentation.ThisapproachnamedMotionCompetitionisintendedtojoinmotionestimationandsegmentationtoderiveavariationalapproachforthesegmentationoftheimagedomainintoregionsofhomogeneousmotion.Intermsofimplementationseveraliterativealgorithmscanbefoundintheliterature,e.g.[3],[5],[6],[7],[8],[9]and[10].HoweverthecomputationalcomplexitybehindtheMumfordandShahmodelhasnotbeenstressedyet.Themultigridapproachisaninterestingtechniquetospeed-upiterativemethodsforsolvingellipticproblems:thebasicideaistoﬁndﬁrstasolutiontotheproblem,solvingitonacoarsemeshandthentoreﬁnetheoriginalproblemonaﬁne

TheauthorsarewithCERCOM(CenterforMultimediaRadioCommunications)-Dip.diElettronica-PolitecnicodiTorino,CorsoDucadegliAbruzzi24,I-10129Torino(Italy)-correspondingauthor:MaurizioMARTINA-E-mail:maurizio.martina(guido.masera)@polito.it,Tel:+390112276512/+390115644102,Fax:+390115644099

August4,2005

DRAFT

mesh.ThisapproachhasbeensuccessfullyemployedfortheMumfordandShahfunctionalin[7]provingnoteworthyspeed-upintermsofiterationsreduction.Itsmaindrawbackistheamountofmemoryrequiredtostorecoarsesolutionsthatwillbeemployedduringthesolutionofﬁnergrids.Moreovertomovefromacoarsergridtoaﬁneroneandvice-versa,restrictionandexpansionoperatorsarerequired.

OthersolutionsarebasedontheSteepestDescent(e.g.[3]and[9])toreducethenumberofiterations.BesidesthePreconditionedConjugateGradient(PCG)methodisaveryeffectivetechniqueforsolvingsparselinearequationssystemsasAx=b.ItisbasedonpreconditioningthesocalledA-orthogonalitythatimprovestheconvergetothesolutionemployingA-orthogonalsearchdirectionsd(i)(dT(i)Ad(j)=0).Itsusetospeed-uptheMumfordandShahfunctionalimplementationhasbeenaddressed[10],showinginterestingresults.ThemaindrawbackofthePCGmethodisitssensitivitytoﬁniteprecisionrepresentation.Infactworkingwithhighprecisionﬂoatingpointtools(e.gMatlab)thisphenomenonisverylimited.OnthecontraryweareinterestedinﬁxedpointimplementationsthatcouldtacklePCGmethodeffectiveness.Asimplenumericaliterativesolutionbasedonnon-linearGauss-Seidelmethodcanbeem-ployed[5]toreducethenumberofiterations:thecomputationalcomplexityisstillproportionaltothenumberofiterationskandtotheimagesizer×c,whereristhenumberofrowsandcisthenumberofcolumns.RunningaCmodelona3.06GHzIntelXeonwith2GBofRAMitcanbeobservedthatfor255graylevelsimages,270-280msarerequiredonaQCIF(176×144pixels)frameand1120-1130msonaCIF(352×288pixels)frame.Thepowerconsumption[11],withasupplyvoltageof1.5Visestimatedin60-70Wthatmeansmorethan30µJpersampleperiteration:evenwithamodernprocessor,thecomputationalcomplexitytoperformrealtimesegmentationistooinadequateforageneralpurposeCPU.

Thecontributionofthisworkistwofold:ﬁrsttofocusonthecomputationalcomplexityofsolvingtheMumfordandShahfunctional,givingresultsonexecutiontimeandonﬁniteprecisionrepresentationeffects.Secondtoproposeafasthardwareimplementationsuitablefornewmotionestimationtechniques,asMotionCompetition,withhighframeratevideosequences.Theproposedarchitecturekeypointsare:1)afullyparalleldata–pathwithtwohardwaredividers;2)animproveddiagonalscanningordertoeffectivelyfeedthedata–pathpipeline.ThecompleteVLSIﬂowperformedfortheproposedarchitectureincludesthegenerationofactualpostplaceandrouteﬁguresintermsofcomplexity,area,clockfrequencyandpowerconsumption.

August4,2005

DRAFT

II.THEORETICALFRAMEWORK

TheMumfordandShahapproachisbasedonamathematicalmodelthatconsidersthesegmentationproblemasapartitionoftheimagedomainΩ⊂R2inopensubsetsΩi.Givenanimage,whoseintensityfunctionisdeﬁnedasg:Ω→R,theMumfordandShahfunctional[1]aimstoﬁndasmoothapproximationuofgineachsub-domainΩi:

󰀂󰀂

E(u,K)=(u−g)2dx+α|∇u|2dx+β󰀈K󰀈

Ω

Ω\\K

(1)

whereKisthesetofΩiboundariesand󰀈K󰀈denotesKsetlength,αisaparameterrelatedtothescale[9],βisaparameterrelatedtothecontrast[9],x=(x1,x2),dxistheLebesgue

󰀃󰀃2󰀃∂u∂u󰀃2

measureintheplaneand|∇u|=󰀃∂x1+∂x2󰀃.In[12]AmbrosioandTortorellidemonstratedthatthefunctionaldescribedinequation1canbeapproximated,intheΓ-convergencesenseby

󰀂

Eε(u,z)=(u−g)2+αz2|∇u|2+βΦ(ε,z)dx(2)

Ω

withΦ(ε,z)=ε|∇z|2+

wherethefunctionzyieldsanapproximatedescriptionofthe

󰀃2

󰀃∂z∂z󰀃setofcurvesKand|∇z|2=󰀃∂x+.ThusthesolutionobtainedminimizingEε(u,z)tends∂x2󰀃1tothesolutionobtainedminimizingE(u,K)asε→0.Asequation2isanellipticproblemitsminimizerssatisfytheEuler-Lagrangedifferentialequationbothforuandz.Sothatweneedto:1)developuandzEuler-Lagrangeequations;2)discretizeg,uandzonauniformgridofN×Nnodeswithmeshsizeh-thustheybecomearraysgi,j,ui,j,zi,j.Togethertheycorrespondtoanonlinearsystemofequationsforthe2N2unknownvaluesui,j,zi,j.EmployinganiterativealgorithmasthenonlinearGauss-Seidelmethod[5]weobtain

ui,j

4ˆzi,j+h2/ε2u˜i,j/h2+gi,j/α

,zi,j==

1/α+z˜i,j/h216+uˆi,j+h2/ε2(3)

(1−z)2,4ε󰀃222

whereu˜i,j=zi2+1,jui+1,j+zi2−1,jui−1,j+zi,j˜i,j=zi2+1,j+zi2−1,j+zi,j+1ui,j+1+zi,j−1ui,j−1,z+1+22zi,j,zˆ=z+z+z+zanduˆ=4α|∇u|i,ji+1,ji−1,ji,j+1i,j−1i,j−1i,j/βε.Fromequation3itcanbe

observedthatnoteworthycomputationalcomplexityisrequiredtocomputeui,jandzi,j.Foracompleteformalderivationofequation3referto[5].

III.PARAMETERS,ITERATIONSANDFINITEPRECISIONEFFECTS

A.Parametersrangechoice

Obtainingoptimalvaluesforα,βandεisnotstraightforward,howeverdifferentapproachescanbefoundintheliterature[3],[9],[13]andthisselectiontaskisnotagoalofthiswork.

August4,2005

DRAFT

Neverthelessselectingα,βandεaspowersof2wouldleadtostronglyreducethecomputationalcomplexity[14].Inthispaperwewilldiscusstherangessuggestedin[14],namelyε∈[1/32,1/2],β∈[1/256,1/2]andα∈[1,64]withtheimagevaluesintherange[0,1].Figure1showstheresultsobtainedvaryingαasapowerof2in[1,64]withh=1,ε=1/16andβ=1/256ontheimage“foreman”:awidespectrumofimagesfromsmoothertosharperarecovered.Figure2showsthemeanerrorobtainedvaryingεandβaspowersof2intheranges[1/32,1/2]and[1/256,1/2]respectively,withα=8forthesameimage:itcanbenoticedthattheerrorisratherlimited;moreoverasimilarbehaviorhasbeenobservedforseveralotherimagesandparameterschoices,thustheperformedanalysisshowsthatthechoiceofα,βandεaspowersof2resultsinacceptableapproximationsoftheMumfordandShahfunctional.Thesolutionofequations2withthenonlinearGauss-Seidelmethodimpliescertaininitialconditions(i.e.thevaluesatiteration0),thatcanbechosen,accordingto[5],asui,j=gi,jandzi,j=1.Moreoverapropernumberofiterationsisrequiredtoobtaingoodresults.Fromthedirectimplementationofequation3onsomeQCIFandCIFimages,withh=1,α=8,β=1/256andε=1/16,wecanobservethat:afterapproximately20iterationstheincreaseofaccuracyonuandz(fromthekthtothek+1thiteration)tendstosaturateintermsofMeanSquareError(MSE)andPeakSignaltoNoiseRatio(PSNR).Inﬁgure3thisbehaviorisshownfordifferentimages(“Foreman”,“Mobile”,“Tempete”and“Paris”)togetherwiththemeanvalue(dashedcurves).Thesolidcurves,representingthePSNRobtainedfromthemeanMSE,bettershowthesaturationphenomenon.

InthefollowingwewillﬁxtheMumfordandShahfunctionalparametersash=1,α=8,β=1/256,ε=1/16andthemaximumnumberofiterationsto20:withthissetofvalueswewillconcentrateontheMumfordandShahfunctional(equation3)sensitivitytoﬁniteprecisioneffects.

B.Finiteprecisioneffects

Sincegi,j,ui,jandzi,jareintherange[0,1],andα≥1,thefollowinginequalitiesholdtrueforthenumerator(Nu)andthedenominator(Du)ofui,jinequation3:0≤Nu≤4+1/αand1/α≤Du≤4+1/α,sothat,3bitsareneededfortheintegerpartofNuandDu.Analogouslywehave:1/ε2≤Nz≤16+1/ε2and16+1/ε2≤Dz≤16+2α/βε,whereNzandDzarethenumeratorandthedenominatorofzi,jinequation3.Sinceα≥1,0≤β≤1and0≤ε≤1

August4,2005

DRAFT

(0)

(a)uwithα=1(b)uwithα=64(c)zwithα=1

(d)zwithα=2(e)zwithα=4(f)zwithα=8

(g)zwithα=16

Fig.1.

(h)uwithα=32(j)zwithα=64

MumfordandShahresultsontheQCIFimage“Foreman”withα∈[1,64]

x 1016−5x 1060.05−40.05140.10.150.20.250.360.350.40.450.54120.10.150.20.255410β8β30.30.350.410.450.5220.050.10.150.20.250.30.350.40.450.50.050.10.150.20.250.30.350.40.450.5εε(a)βandεerroranalysisonu(b)βandεerroranalysisonz

Fig.2.Erroranalysisonβandε

August4,2005DRAFT

Number of iterations0x 10−3Number of iterations8010055ForemanMobileTempeteParismean102030405060709000.3102030405060708090650Mean Square Error5454Mean Square Error0.25ForemanMobileTempeteParismean1004540350.2302520PSNR30.15352300.1151250.05100101234Number of iterations567892010234Number of iterations5678910(a)uMSEandmeanPSNRVsnumberofiterations(b)zMSEandmeanPSNRVsnumberofiterations

Fig.3.uandzaccuracyincreaseandthemeanPSNRVsnumberofiterationsfortheQCIFimage“Foreman”,theCIFimages“Mobile”,“Tempete”,“Paris”

thenumberofbitsforcorrectlyrepresentingtheintegerpartofNzandDzwouldberatherhigherthanthenumberobtainedfortheintegerpartofNuandDu.Inparticularwiththeworstcontributionsofα,βandε,around20bitsfortheintegerpartofDzareneeded.Toreducethisnumberwerewriteequation3(zi,j)withh=1as

zi,j=

1α1α+4εzˆαi,j

PSNR402ε

+16ε+4|∇u|i,jαβ(4)

Nowwehave:1/α≤Nz≤16ε2/α+1/αand1/α+16ε2/α≤Dz≤1/α+16ε2/α+2ε/β.Withtherangeconsideredforα,βandε,NzandDzneedatleast3and9bitsrespectivelyfortheintegerpart.Althoughdifferentchoicesofparameterscouldalsobeofinterest,intherestofthepaperanalysisandsynthesisresultswillbelimitedtothecaseh=1,α=8,β=1/256andε=1/16:theMatlabimplementationoftheMumfordandShahfunctionalinthiscasewillbeassumedasthereferencemodelandusedasacomparisontermwhileinvestigatingtheﬁniteprecisioneffectsofui,jandzi,jcalculation.

Inﬁgure4(a,b)meanvaluesforMSEandPSNRversusthenumberofbitsdevotedtothefractionalpartrepresentation(m)overdifferentimagesareshown.After16bitsforthefractionalpartthedifferencebetweentheﬂoatingpointreferencemodelandtheﬁniteprecisionimplementationisnegligible.Moreoverafterm=18−20bits,thevaluesforthePSNRmightnotbecompletelyreliableduetothejointeffectofﬁniteprecisionandalgorithmiterations.

August4,2005

DRAFT

7x 10−40.090.0818017016018016014060.07PSNR [dB] U0.06MSE U40.05MSE Z140130120110100Z12010080U30.040.0320.021U6040Z0.0190807081012141618202208101214number of bit161820220number of bit(a)uandzMSEVsm0.0314 bits16 bits18 bits0.0250.030.035(b)uandzPSNRVsm14 bits16 bits18 bits0.0250.02mean = 1.6169e−08var = 2.8872e−120.02mean = −2.9939e−08var = 2.0795e−011mean = −5.1001e−08var = 1.476e−10pdf0.015pdf0.0150.01mean = 1.3202e−07var = 3.3002e−100.01mean = −1.0302e−07var = 4.1e−10mean = −2.433e−07var = 6.5702e−090.0050.0050−5−4−3−2ufloating−point−ufixed−point−101234x 105−50−4−3−2zfloating−point−zfixed−point−10123x 10−4(c)uerrordistributionwithm=14,16,18(d)zerrordistributionwithm=14,16,18

Fig.4.meanuandzdifferencebetweentheﬂoatingpointreferencemodelandﬁniteprecisionCmodelVsnumberofbits

forthefractionalpart(m)(a),(b).Errordistributiononuandzwithm=14,16,18(c),(d)fortheQCIFimage“foreman”

Furthermoreinﬁgure4(c,d)theerrordistributionsonuandzform=14,16,18areshown.Theseresultshavebeenobtainedcalculatingthedifferencebetweentheﬂoatingpointresultsandtheﬁxedpointones,andrepresenttheerrordistribution(pdf).Theframeusedis“foreman”andtheparametersvaluesareh=1,α=8,β=1/256andε=1/16.Itisworthnoticingthatsimilarresultshavebeenobtainedfordifferentchoicesofparametervaluesintheaforementionedranges.Fromtheresultsshowninﬁgure4wecanconcludethatahardwarededicatedimplementation,employingm≥16wouldbeabletoachieveverygoodperformance.

August4,2005DRAFT

PSNR [dB] Z451507

zNuNzSuSzEuEzWuWstartaddressgenerationunitjig−dbus>>addressgenerationunitjiw−abus#colsgbufferubufferzbufferg−abusu−dbusz−dbusmemoryinterface4<<4<<εβ<<4ε2αu−abusuNzNuSzSuEzEuWzWgi,jz−abusε2ααβε1>>αdata−pathui,jzi,jgi,j>>αNuu−divDzDuz−divNzui,jzi,j(a)Proposedarchitectureblockscheme(b)Data–pathblockscheme

Fig.5.Proposedarchitectureblockscheme(a)anddetailofthethedata–path(b)

IV.IMPLEMENTATION

ISSUES

In[14]softwareimplementationsonmodernmicroprocessorsoftheMumfordandShahfunc-tionalareaddressed,showingthathighframeratesequencescannotbeprocessedinreal–time;infacttoperform20iterationsonaQCIFframe4sand12.5sonthehighperformanceTexasInstrumentsTMS320C6711andontheStrongARMSA-1100respectively.Thusinasystemwhererealtimeprocessingisneeded,asthevideoscenario,adedicatedVLSIimplementationwouldberequired.FromtheanalysisperformedinsectionIIIitcanbeobservedthatitisdifﬁculttoﬁxthevaluesoftheMumfordandShahfunctionalparameters,thenumberofiterationsandthenumberofbitsmdevotedtotherepresentationofthefractionalpartofthedata.ToobtainahighlyreusablearchitectureaparametricVHDLmodelhasbeendeveloped,focusingonhighperformanceintermsofnumberofsustainedframe,whilegrantinghighqualityoftheresultinguandz.

IntheliteraturetheimplementationoftheGauss-Seidelmethodforthesolutionofequationssystemshasalreadybeenaddressed.HoweversolutionsproposedintheliteraturecannotbestraightforwardlyadaptedtothecaseoftheMumfordandShahfunctional.InfactmanyworksdescribetheimplementationoftheGauss-Seidelmethodwithparticularemphasisonserialandsharedmemoryparallelcomputing[15],[16]oronsystolicsolutions[17].Aspointedoutin[18]thereisastrongdatadependencybetweenthepixelsproducedwiththeGauss-Seidelalgorithm.Inthefollowingwewillrefertotheneighborsoftheconsideredpixelasnorth,south,eastand

August4,2005

DRAFT

13610259487kk=310k=24jk=118952367i(a)Diagonalscanningorder

Fig.6.

Pixelscanningorder

(b)Improveddiagonalscanningorder

west:xi−1,j=xN,xi+1,j=xS,xi,j+1=xEandxi,j−1=xWwherexcanbeeitheruorz.Theproposedarchitecture,depictedinﬁgure5(a),iscomposedby:a)threebufferstostorerespectivelytheoriginalimageg,theapproximationuandtheedgesz;b)twoaddressgenerationunitsdevotedtomanagethecorrectscanningordertoelaboratethesamples;c)acomplexdata–pathtoimplementequations3and4;d)asimplememoryinterfaceunittoloadallthedatarequiredtoperformtheoperationsdescribedbyequations3and4.A.Addressunit

Asitcanbeobservedfromequations3and4ui,j(andzi,j)dependsonnorth,south,eastandwestvalues.Thusscanningtheimagebyrowsandcolumns(orvice-versabycolumnsandrows)toevaluatethenewvalueofui,j+1thecomputationofui,jandzi,jmustbecompleted.Infactforui,j+1thepixelsui,jandzi,jactasuWandzWrespectively.Insteadofscanningtheimagerow-wiseandcolumn-wise(orvice-versa)a“diagonal”scanningordercanbeemployed(ﬁgure6(a)):ahighlatencyforgeneratingtheﬁrstvaluesisrequired,whileforthefollowingpixelsthelatencydecreasesasthepipelineisﬁlled.Moreoverthealgorithmiterativenaturecanbeexploitedtopipelinealsoontheiterations[18].Thisimproveddiagonalscanningorderisshowninﬁgure6(b)whereiandjrepresentrowsandcolumnsdirectionsandkrepresentstheiterationstep;thenumberassociatedtoeachpixelrepresentstheorderusedtoprocessthe

August4,2005DRAFT

startenstartenAddress generation order

[1,1]

[1,2][2,1][1,1]

[1,3][2,2][3,1][1,2][2,1][1,1]

[1,4][2,3][3,2][4,1][1,3][2,2][3,1][1,2][2,1][1,1]...

tc#iterationiterationcounter(−)entcstartenstartenstart−cnttc(+)startstop−cnttc(+)startstart−cnttc(+)startstop−cnttc(+)startstartstart−vcnt(−)entcstop−vcnt(−)enFSMstart−vcnt(−)tcentcstop−vcnt(−)enstartvaluecnt(+)ienstartvaluecnt(−)jeni−cntj−cntFig.7.Addressgenerationunit

differentvalues.Thisscanningorderleadstothegenerationoftheaddressesdepictedinﬁgure7ontheleft;theycanbegeneratedresortingtosomecountersandsmallcontrollogic,asdepictedinﬁgure7ontheright,wherestart−cntandstop−cntareup-counterstogeneraterows(i)andcolumns(j)currentindexlimits.start−vcntandstop−vcntaredown-counterstogeneratenewindexlimitstakingintoaccountboththediagonalorderandthealgorithmiterations.value−cntareanup-counter(i)forrowindicesgenerationandadown-counter(j)forcolumnsindicesgeneration.Theﬁnitestatemachinereadscountersterminalcount(tc)signalsandgeneratesstartandenable(en)signalsalsocheckingrowsandcolumnsbounds.Finallyiterationcounterisadown-counterdevotedtoperformthenumberofiterationsprogrammed.ThisarchitecturehasbeensynthesizedwithSynopsysdesigncompiler,placedandroutedwithCadenceencounterona0.13µmstandardcelltechnologybothforQCIFandCIFframes.TheaddressgenerationunitforQCIFframescanrunat550MHzrequiringanareaof9170µm2,whereasforCIFframescanrunat510MHzrequiringanareaof10655µm2.B.Data–path

Theproposedarchitectureimplementsequations3and4respectivelytogeneratebothui,jandzi,j.Asitcanbeinferredfromequations3and4,thearchitecturebottleneckisthedivision.Inthecaseoftheproposedarchitecturethedivisionalgorithmneedstobeunrolledandpipelinedinordertoobtainaparalleldividerabletosustainhighthroughput.Moreover,sinceui,jandzi,jarein[0,1]thenumeratorwillneverbegreaterthanthedenominator,thuswedonotneedtointroduceoperandsnormalization,asinmostSRTdividersimplementations[19].

Asdepictedinﬁgure5(b),toperformtheseoperations,thedata–pathcanprocessatthesametimeuN,uS,uW,uE,zN,zS,zW,zEandgi,j.Asdescribedbyequations3and4zN,zS,zE

August4,2005

DRAFT

andzWcanbeemployedtoevaluatethesquarevaluesrequiredbyu˜i,jandui,jdenominator;moreovertheyareneededtocalculatezˆi,j.Employingfourmultipliersthefoursquarevaluescanbecalculatedatthesametime(top-leftinﬁgure5(b)),meanwhilethefourzvaluesareaddedtogethertoobtainzˆi,j(top-rightinﬁgure5(b)).DuringthesameclockcycleuN,uS,uEanduWareemployedtoevaluatetheﬁrstpartofthediscretegradient|∇u|2i,jandarestoredintofourregisters.Sincethediscretegradientissquaredandmultipliedbyfouritcanbecomputedsimplyaddingtogetherthesquareddifferences,infactthefollowingexpression

22holdstrue4|∇u|2i,j=(uS−uN)+(uE−uW).Thediscretegradientcalculationhasbeensplit

intothreeparts:1)subtractionsuSN=uS−uNanduEW=uE−uW,2)squaresuy=u2SNandux=u2EW,3)additionuy+ux,whereeachpartisimplementedinaseparateclockcycle(top-centerinﬁgure5(b)).Whilethediscretegradientsecondpartisevaluated,thesquaredzvaluesareaddedtogethertoobtainui,jdenominatorandmultipliedbytheproperuinput(uN,uS,uEoruW)tocomputethepartialvaluesrequiredbyu˜i,j.Moreoverinthesameclockcyclezˆi,jisshiftedby4ε2/α(seeﬁgure5(b)inthemiddle).

Duringthefollowingclockcycletheevaluationofu˜i,jiscompletedandgi,j(shiftedbyα)isaddedtogenerateui,jnumerator.Inthesameclockcyclethecomputationofui,jdenominatoriscompletedaddingtozsquaredvaluestheterm1/α.Moreoverzi,jnumeratoranddenominatorarecalculated:theformeraddingto4ε2/αzˆtheterm1/α,thelattershiftingthediscretegradientbyε/βandadding1/α+16ε2/β.Finallytwodividersareemployedtocomputeinparallelui,jandzi,jstartingfromtheirnumeratorsanddenominators(bottompartofﬁgure5(b)).Fromtheresultsshowninﬁgure4itisworthnoticingthatprobablymorethan12bitswillbeemployedfortherepresentationofdatafractionalpart.Sinceui,jandzi,jarein[0,1],thedata–pathhasbeenimplemented(synthesized,placedandrouted)varyingthedatawidthrepresentation(m)from14to24bits.BesidesconsideringtherangesdetailedinsectionIIIforα,βandε,internaldataneedupto9bitsfortheintegerpartrepresentation.Resultsdetailedinﬁgure8showhowthefrequencyachievablebythedata–pathchangesvaryingthenumberofbitstorepresenttheoutputresults.Fromtheresultsreportedinﬁgure8itcanbeobservedthattomakethedata–pathabletoprocessanewsampleateachclockcycle,theaddressgenerationunitneedstobeconnectedtoafastmemoryinterfaceblock.

August4,2005DRAFT

Frequency451 bit per clock2 bit per clock3.5x 106Area1 bit per clock2 bit per clock403352.5MHz30µm22251.520115141516171819m20212223240.5141516171819m2021222324(a)Frequency

Fig.8.

Proposeddata–path:areaVsnumberofbitperoutputresults(m)

(b)Area

C.Memoryinterfaceblock

Itisreasonabletosupposethatu,zandgvaluesarestored(atleastpartially)inthreebuffers.Thememoryinterfaceblockisasimpleﬁnitestatemachinethatstartingfromthematrixrowaddress(i)andthematrixcolumnaddress(j)generatedbytheaddressgenerationunit,isabletoloaduN,uS,uE,uW,zN,zS,zE,zWandgi,jfromthebuffersandtofeedthedata–pathwiththesevalues.Exploitingthelowerfrequencyachievablebythedata–path,thememoryinterfaceblockbehavesasasortofserialtoparallelcomponent.Inﬁgure9(a)ablockschemeofthememoryinterfaceblockisshown.Theﬁnitestatemachine(FSM)isdevotedtoreceivetherowandcolumnindices,iandj,calculatedbytheaddressgenerationunit.StartingfromthesevaluestheFSMselectstoadd1,0or-1toinordertoimplementi+1,iori−1(analogouslywithj).Theresultobtainedonthecolumnindexjoughttobemultipliedbythenumberofcolumns(#COLS)andaddedwiththeresultobtainedontherowindexi;whenanaddressiscorrectlygenerateditisvalidatedbytheFSM.Givenanaddressontheproperaddressbus(gabus,uabusandzabus),thedatabuffers(g,uandz)latchthedatarequiredbythememoryinterfaceblockonthedatabus(gdbus,udbusandzdbus);ﬁnallytheFSMloadseachvalueintoitsregister(bottom-rightinﬁgure9(a)).Thememoryinterfaceblockhasbeensynthesized,placedandroutedona0.13µmstandardcelltechnologyforQCIFandCIFframesvaryingm

August4,2005

DRAFT

x 104AreaQCIFCIF5.65.45.25j#cols−1j01−101g_abusu_abusz_abusiivalidFSMµm24.84.64.44.243.83.614uNzNuSzSuEzEuWzWgi,j

u_dbusz_dbusg_dbus1516171819m2021222324(a)Memoryinterfaceblockscheme(b)MemoryinterfaceblockareaVsdatawidth

Fig.9.MemoryinterfaceblockschemeandareaVsdatawidthforQCIFandCIFframes

Sincethecriticalpath(forthisimplementation)isintheaddressgeneration(j×#COLS+i),themaximumachievableclockfrequencydependsonlyontheframesizeandnotonthedatawidth.Infactthemaximumachievablefrequencyisabout215MHzand190MHzforQCIFandCIFframesrespectively.Experimentalresultsshowthattheamountofgatesrequiredincreasesasthedatawidthgrows(seeﬁgure9(b)).Finally,sincethedata–pathproducestwodataperclockcycle,namelyui,jandzi,j,theycanbesenttotheproperbufferresortingtoafurtheraddressgeneratoramultiplierandanadder.

V.PROPOSEDARCHITECTUREPERFORMANCE

Inthefollowingwewillfocusonthecaseofm=16andonebitperclockcycle,andwewilldiscusstheglobalarchitectureperformance.SimulationsperformedontheproposedarchitectureVHDLdescriptionshowthatthedata–pathneeds20clockcyclestoevaluateanewcoupleofresults(uandz).Duetotheimproveddiagonalscanningthedata–pathlatencydecreasesas󰀁Lat0

i=1i.Infactafterthe20thcolumnisprocessed,thesimulationshowsthatthearchitectureproducesacoupleofnewvalues(ui,j,zi,j)ateachclockcycle.Consideringpostplaceandroutefrequencyandthatthememoryinterfaceshouldrunfasterthanthedata–pathtofetchtherequireddata,theproposedarchitecturecansustainmorethan25QCIFframespersecondandabout15CIFframespersecond.Inﬁgure10(c)theperformanceachievablewiththeproposed

August4,2005

DRAFT

low addr

Data−path

z divider

Data−path

u divider

low addr

Data−path

high addr

QCIF (144 x 176)CIF (288 x 352)

Proposedarchitecture1 iterationProposedarchitecture20 iterations[15]20 iterations[7]

λin [10, 1000]

low addr

high addr

676µs3 ms

high addr

addrunit

wbunit

Data−path

u divider

high addr

wbunit

high addr

Data−path

z divider

13.5 ms61 ms

low addr

~530 ms~2.1 s

Memory Interfacehigh addr

Memory Interface

addrunit

~393 ms~1.74 s

(a)QCIF

Fig.10.

(b)CIF(c)Performancecomparison

PostplaceandroutearchitecturesforQCIFandCIFframes(a)and(b).Performancecomparisonwith[14]and[6]

architecturearesummarizedandcomparedwiththeruntimeoftheﬁxedpointCmodel[20]describedin[14]andofacompetitivesoftwareimplementation[6]withintheMegawavepackage(Megawave2.31a)[21]runningonaLinuxMandrake10-PentiumIVat1.6GHzwith512MBofRAM.Postplaceandrouteofthewholearchitectureshowsthat4.72mm2and4.73mm2arerequiredforQCIFandCIFframesrespectively(seeﬁgure10(a,b)).

ThankstotheparametricVHDLdescriptionthetwoimplementationshavebeenachievedsimplychangingsomeparametersintheVHDLsourceandre-performingthedesignﬂow.Inﬁgure10(a,b)itcanbeobservedthat:sincetheQCIFdesignisslightlysmallerthantheCIFone,alsotheQCIFchipisslightlylessdensethantheCIFone.Asfarasthepowerconsumptionisconcernedpostplaceandrouteanalysisshowsthatabout337mWarerequiredfortheQCIFarchitectureand395mWfortheCIFarchitecture.

VI.CONCLUSIONS

InthispapertheMumfordandShahfunctionalhasbeeninvestigatedfromtheimplementationpointofview.Adetailedstudyofﬁniteprecisionrepresentationeffecthasbeencarriedout,afastVLSIarchitecturehasbeendescribedandresultsobtainedthroughtheimplementationona0.13µmstandardcellstechnologyhavebeenpresented.Finallytheauthorswouldliketothankthereviewersfortheirsuggestionsthathaveactuallyimprovedthequalityofthiswork.

August4,2005DRAFT

REFERENCES

[1]D.MumfordandJ.Shah,“Optimalapproximationsbypiecewisesmoothfunctionsandassociatedvariationalproblems,”

Comm.PureAppl.Math.,vol.42,pp.577–685,1989.

[2]F.SroubekandJ.Flusser,“Multichannelblinditerativeimagerestoration,”IEEETrans.onImageProcessing,vol.12,

no.9,pp.1094–1106,sep.2003.

[3]A.Brook,R.Kimmel,andN.A.Sochen,“Variationalrestorationandedgedetectionforcolorimages,”Journalof

MathematicalImagingandVision,vol.18,no.3,pp.247–268,2003.

[4]D.CremersandS.Soatto,“Motioncompetition:Avariationalapproachtopiecewiseparametricmotionsegmentation,”

InternationalJournalofComputerVision,vol.62,no.3,pp.249–265,2005.

[5]R.March,“Visualreconstructionwithdiscontinuitiesusingvaraitionalmethods,”ImageandVisionComputing,vol.10,

pp.30–38,1992.

[6]G.Koepﬂer,C.Lopez,andJ.M.Morel,“Amultiscalealgorithmforimagesegmentationbyvariationalmethod,”SIAMJ.

Numer.Anal.,vol.31,no.1,pp.282–299,feb.1994.

[7]D.Cremers,F.Tishh¨auser,J.Weickert,andC.Shn¨orr,“Diffusionsnakes:introducingstatisticalshapeknowledgeintothe

MumfordandShahfunctional,”InternationalJournalofComputerVision,vol.50,no.3,pp.295–313,2002.

[8]T.P.Vogl,J.W.Mangis,A.K.Rigler,W.T.Zink,andD.L.Alkon,“Acceleratingtheconvergenceoftheback-propagation

method,”BiologicalCybrnetics,vol.59,pp.257–263,1988.

[9]W.Vanzella,F.A.Pellegrino,andV.Torre,“Self-adaptiveregularization,”IEEETrans.onPatternAnalysisandMachine

Intelligence,vol.26,no.6,pp.804–809,jun.2004.

[10]F.Gibou,D.Levy,C.Cardenas,P.Liu,andA.Boyer,“Partialdifferentialequationbasedsegmentationforradiotherapy

treatmentplanning,”MathematicalBiosciencesandEngineering,vol.2,pp.209–226,2005.[11]Intel,“http://www.intel.com/design/xeon/datashts/252135.htm.”

[12]L.AmbrosioandV.M.Tortorelli,“ApproximationoffunctionalsdependingonjumpsbyellipticfunctionalsviaΓ-convergence,”Comm.PureAppl.Math.,vol.43,pp.999–1036,1990.

[13]T.F.ChanandL.A.Vese,“Activecontourswithoutedges,”IEEETrans.onImageProcessing,vol.10,no.2,pp.266–277,

feb.2001.

[14]M.MartinaandG.Masera,“MumfordandShahfunctional:Finiteprecisionanalysisandsoftwareimplementation,”in

IEEEInternationalSymposiumonSignalProcessingandInformationTechnology,2004.

[15]M.F.Adams,“AdistributedmemoryunstructuredGauss-Seidelalgorithmformultigridsmoothers,”inACM/IEEE

ProceedingsofSC2001:HighPerformanceNetworkingandComputing,2001,pp.1–4.

[16]M.M.Strout,L.Carter,J.Ferrante,J.Freeman,andB.Kreaseck,“CombinigperformanceaspectsofirregularGauss-Seidel

viasparsetiling,”in15thWorkshoponLanguagesandCompilersforParallelComputing,2002,pp.1–4.

[17]F.Z.Hadjam,A.Rahmoun,andM.Benmohammed,“Ondesigningasystolicnetworkfortheresolutionoflinearsystems

usingtheGauss-Seidelmethod,”inIEEEInternationalConferenceonComputerSystemsandApplications,2001,pp.283–286.

[18]J.A.YangandY.Choo,“Formalderivationofanefﬁcientparallel2-DGauss-Siedelmethod,”inProceedingsofthe6th

InternationalParallelProcessingSymposium,1992,pp.204–207.

[19]J.E.Robertson,“Anewclassofdigitaldivisionmethods,”Trans.onElectronicComputers,vol.7,pp.218–222,1958.[20]M.Martina,“MumfordandShahCmodel,”downloadableatwww.vlsilab.polito.it/∼martina.[21]MegaWave2,“availableat:http://www.cmla.ens-cachan.fr/cmla/megawave/index.html.”

August4,2005DRAFT

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文