APhDDissertationProposal
submittedby
DepartmentofEE-Systems
UniversityofSouthernCalifornia,LosAngeles,CAUSA
zbaker@halcyon.usc.edu
ZacharyBaker
ViktorK.Prasanna(Chairman)
PeterA.BeerelKaiHwang
ShrikanthNarayanan
MichaelS.Waterman(OutsideMember)
GuidanceCommittee
Monday,Dec6th,2004
1-3pmEEB539
Abstract
Weproposetodevelopalgorithmsandarchitecturesforenergy,time,andareaefficienthardwareimplementationsofvariouscomputationanddataintensivekernels,namely,stringmatchingandcorrelationmining.Wewilldemonstrateourarchitecturesinthecontextofnetworksecurity,butthekernelshavewideapplicationsinnotonlynetworksecurity,butcomputationalbiology,wiretaptextmining,frauddetection,etc.Throughtheuseofreconfigurablehardwarewecanachievesignificantperformanceimprovementsoversoftwareapproaches,and,throughintelligentalgorithmdesignwedemonstrateimprovementsintime,area,andenergy-efficiencyoverna¨ıvehardwareapproaches.Inthisproposalwewilldescribethegeneralapplicationareas,giveanoverviewofourcompletedresearch,andintroducethreepotentialresearchtracks.
Contents
1Introduction
2Background
2.1FPGA............................................2.2TextMatching.......
................................2.2.1Unalignedmatching................................2.2.2AlignedMatching.................................2.3DataMining........................................2.4NetworkSecurity......................................3CompletedResearch
3.1StringMatching.....
.................................3.1.1KMP........................................3.1.2Shift-and-Compare.................................3.2DataMining.......
.................................3.2.1IntroductiontotheAprioriAlgorithm......................3.2.2OurContributions
...........
......................
4ProposedResearch
4.1Low-risk,MediumReward:AcceleratingDataMiningKernels.............4.2Medium-riskEngineering:AlignedTextMining.....................4.3High-risk,HighReward:AutomaticAttackDetection.................5ResearchPlan6OverallThesisGoals
1
233334456667101011131314151617
1Introduction
Computersystemsecurityisprogressivelybecomingalargerandmorepressingconcerntotheworld.AsmoresectorsofoureconomybecomedependentontheInternet,theybecomeincreasinglyexposedtoattacksfromoutside.Thevarietyofattacksisonlymatchedbytheirpotentialtargetswithinthesystem.OneofthemoresimpleattacksisaDenialofServiceattack,whichfloodsanetworkwithsomanyrequeststhatitisimpossibletoprovideservicetothelegitimaterequests.Morecomplicatedattacksactuallyaddresstheweaknessesofparticularapplicationsrunningonaserver.Theseattacks,mostcommonlyimplementedthroughbufferoverflowandotherwiseunacceptableinputs,exploitpoorlydesignedsoftwaretotakecontrolofacomputer.Aftercontrolisgained,thecompromisedcomputercanbeusedinadistributedattackagainstothercomputers,searchacomputerforpasswords,personalinformation,creditcardnumbers,etc.Often,theseapplication-levelattacksaredistinguishedbyaparticularfingerprintstringthatiscommontoallvariationsoftheattack.
Detectingtheseattacksreducestoastring-matchingproblem,ahighlycomparison-intensiveproblemtoattemptinreal-time.Whilethekernelstringmatchingoperationisasimpleandwell-understoodproblem,theparticularrequirementsofnetworksecuritymakeitaninterestingresearchproblem.Thekeycharacteristicisconcurrentmatchingagainstthousandsofpatternsatlineratesofupto10Gb/s.Unfortunately,thisisahighlycomputationallyintensiveproblem,andreal-timeperformanceisnecessarytobothpreventbufferoverflowsandprovidetimelyattackinformation.Itisimpossibletomeettheperformancerequirementswithoutspecialhardware.Forinstance,aDual1GHzPentiumIIIsystem,using845patterns,runsatonly50Mbps.Thehighlevelofparallelismandthroughputrequiredtransformasimplesequentialproblemintoaninterestingstudyofarchitecturaldetail,graph-basedoptimizationtechniques,andhardwareutilization.
Forthisreason,itisnecessarytomovetohardwareimplementationstoincreasethelevelofparallelismpossible.FPGAdevicesprovideasurprisinglypowerfulandconvenientplatformforimplementingouralgorithms.Forthesekerneldomains,FPGAsprovidemanyoftheadvantagesofASICdesign,includingpotentialforparallelism,veryefficientbit-leveloperations,andhighbandwidth.UnlikeASICs,though,reconfigurablehardwarecanbereprogrammedtosuitaspecificsetofinputdataoroperatingsituations,allowingtheflexibilityofsoftwarewiththepoweroffixedhardware.
Theotheralgorithmareaweaddress,correlationmining,hasapplicationsinbusinessdatapro-cessing,biosequencemining,andwehaverecentlywrittenawhitepaperdealingwiththeprocessingofspeech-to-textdataforwiretapmining.Miningisadataintensiveproblem,withlargedatabasesspanningGigabytes.Ourobjectiveistodeveloparchitecturesallowingforreal-timeanalysisofdatawindows,requiringsignificantlyhigherperformancethaniscurrentlyavailable.
AproblemofthestringmatchingapproachtoIntrusionDetectionhasalwaysbeentheadditionofnewpatterns;theyaredeterminedmanuallybyinspectingnewattacksforidentifying“fingerprints.”Ouroverallgoalistoleverageourhigh-performancekernelsinasystemthatcandetermine,basedonitscurrentdatabaseandthetemporallocalityofattackingpackets,thatanewattackiscontainedwithinanotherwiseinnocent-appearingpacket.
WefirstgiveanintroductiontocurrentFPGAtechnologiesandtheresourcesandperformanceavailablewiththem,andthengiveanoverviewofthetextmatchinganddataminingalgorithmsweutilize.InSection3,wedescribetheworkcompletedthusfar.InSection4,ourproposedworkisdiscussed,alongwithsomeoftheproblemswefaceinachievingourgoals.
2
2
2.1
Background
FPGA
FieldProgrammableGateArrays(FPGA)provideafabricuponwhichapplicationscanbebuilt.FPGAs,inparticular,SRAMbasedFPGAsfromXilinx[40]orAltera[4]arebasedonalook-uptables,flip-flops,andmultiplexers.Inthesedevices,aSRAMbankservesasaconfigurationmemorythatcontrolsallofthefunctionalityofthedevice,fromthelogicimplementedtothesignalingstandardsoftheIOpins.Thevaluesinthelook-uptablescanproduceanycombinationallogicfunctionalitynecessary,theflip-flopsprovideintegratedstateelements,andtheSRAM-controlledroutingdirectlogicvaluesintotheappropriatepathstoproducethedesiredarchitecture.Thedeviceiscomposedofmanythousandsofbasiclogiccellsthatincludethebasiclogicelements,andbasedonthedevicevariety,includesfastASICmultipliers,ethernetMACs,localRAMs,andclockmanagers.FPGAsstartedoutasprototypingdevices,allowingforconvenientdevelopmentofglue-logic-typeapplicationsforconnectingASICcomponentswithouthighVLSIdesigncostsorlargenumbersofdiscretestandardlogicgates.AsthegatedensityofFGPAdevicesincreasedandapplicationspecificASICblockswereadded,theapplicationsshiftedfromgluelogictoawidevarietyofsolutionsforsignalprocessingandnetworkproblems,andthedeviceshavebeendeployedinthefieldasthefinal,butstillflexible,solution.BecausethedeviceiscontrolledbythestateoftheSRAMbits,thefunctionalitycanbechangedbychangingthememorystate.Thiscanbeuseful,aslogiccanbecustomizedforaparticularsetofinputdata.
FPGAscanprovideanefficientsolutionformanyproblems.Reconfigurablelogichasbecomeaverypopularapproachfornetworkapplications.Cryptography[42],error-tolerantcodeprocess-ing[25],routerapplicationssuchaspacketscheduling[22]andflowmonitoring[27],andvariousstrategiesforintrusiondetectionallhaveFPGAimplementationsforacceleratingcomputationalintensiveportionsoftheircode.However,byteorientedoperationssuchasstringmatchingisoneparticularstrengthofFPGAfabrics.Intheearliestworkinregularexpressionmatching[30,31],amethodwaspresentedformatchingregularexpressionsusingaNon-deterministicFiniteAutomaton,implementedonanFPGA.
Webelievethatintelligentarchitecturaldesignisworthmorethanasimpleimplementation.Therearemanytoolsforautomaticallyconvertingahigh-levellanguageprogramintoalow-leveldesign.Withrecentadvancesincompilerandsynthesistechnology,itisnowpossibletomapthecomputationallyintensivemodulesofaprograminFortran,C(SRCCarte[37]andCeloxica[12]),orJava(XilinxForge[41])tohardware.Duetothedevelopmentofthesehigh-leveldesigntools[4,40],thescientificcomputingcommunitycanutilizereconfigurabledeviceswithoutsteeplearningcurves.
However,whiletheeaseandpopularityofusingFPGAsforapplicationdesignhasincreased,theautomaticallygeneratedarchitecturestendtobeinefficientcomparedtowell-researchedandthought-outarchitecturesforcomplexapplications.Inthesesituations,thecreativityanddomainknowledgerelevanttoadesignprovidedbyahumandesignerisavaluableasset.
2.2TextMatching
Textmatchingisafundamentalkernelofmanyapplications.Webreakthetextmatchingdisciplineintotwomainsections;unalignedandalignedmatching.Thesecategoriescorrespondroughlytoourcurrentandproposedworkinstringmatchingfornetworksecurityandtextmining,respectively.2.2.1
Unalignedmatching
Unalignedmatchingisachallengingproblembecauseitisimpossibletopredeterminewhereapatternwillstartinaninputstream.Thus,inthena¨ıveapproach,thetimecomplexityisO(nk)foraklengthpatterninannlengthinputstream.Thena¨ıveapproachassumesaworstcaseinwhichthe
3
Figure1:Devicecapabilitiesvs.time
inputpatternmatchesforthefirstk−1charactersandthenfailsinthelastcharacter;thesystemthenadvancesthebeginningcomparisonpointertothenextcharacterandproceedestomatchtothek−1patterncharacteragain.Anexamplesituationofthissortisaaaaaaabwheretheinputstreamsconsistsonlyofa.
Thisisvitaltoourarchitectureaswewillseelater,butitisimportanttonetworksecurityingeneralbecauseanattackermightattempttocausetheIDStodroppacketsbyfloodingitwithspeciallydesignedpackets.IfaIDSisoverwhelmedbytraffic,mostconfigurationswillallowsomepacketsthroughwithoutscreening,orsubjectthemonlytoacursoryexamination.Thisprovidesanopportunityforanattackertosubvertthesystem.Thistypeofattackisshowntobeuselessagainstourdesigninourpaper([8]).2.2.2
AlignedMatching
Alignedmatchingisadifferentsituationaltogether.Becauseeachwordisdelimitedbythespacebetweenwords,itispossibletotakeamuchmoreparallelapproachtotheproblem.Insteadoftestingonealignmentofthestreamineachcycle,itispossibletoincreasethroughputbyacceptingmultiplewordsfromthestreamaswellasallowingconcurrentevaluationofpatterns.Onestrategyistoaserialshift-inbufferwithaparallelreadout;thisallowsawholewordtobeshiftedinandthenreadoutandcomparedinasinglestep.Becauseweknowthatthematchisbasedonawholewordinsteadofasomearbitrarysectionoftext,wecanacceptcharactersandfasterandreduceunnecessarywork.
Thisisrelevantbecausetherearemanyalignedmatchingproblems.Whiletheunalignedproblemisrelevanttonetworksecuritymatching,alignedmatchinghasapplicationstotextmining,websearching,indexing,speech-to-textapplications,etc.Asweareinterestedindevelopingourworkinthetextminingfield,weareinterestedinexploringthealignedmatchingproblem.
2.3DataMining
Recentadvancesinstorageanddatasensinghasrevolutionizedourtechnologicalcapabilityforcollectingandstoringdata.Serverlogsforpopularwebsites,customertransactiondatafromnetworkrouters,creditcardpurchases,customerloyaltycards,etc.produceterabytesofdatainthespanofadaythatisusefulashistoricalrecordbutnotasusefulasitcouldbewereiteffectivelyprocessedforpatternsandtrends.Correlation-baseddataminingisthefieldofalgorithmstoprocessthisdataintomoreusefulforms,inparticular,connectionsbetweensetsofitems.Wehaveinvestigated
4
theApriorialgorithm[2],apopularstrategydesignedforprogressivelygroupingtogetherfrequentitemsetsinlargedatabasesgivenaparticularfrequencycutoff.
WhilethecomputationanddatacomplexityoftheApriorialgorithmisveryhigh,littleresearchhasbeendoneinefficientimplementationsforhardwareacceleration.Wefeelthatmuchofthedis-interestindoingthisworkliesinthechallengeofimplementingsetmembershipfunctionsefficiently,aswellasinthecomplexityofcontrol.
Thedevicechoiceisimportant,asthesystolicarrayisdependentonitsdatasourceandsink,andforthe“softwaredeceleration”[19],providedbyamicroprocessorforcontrol-intensiveoperationsthattakeupasmallfractionofthetotalrunningtimeofthealgorithm.
K-means
K-meansclusteringisadataminingstrategythatgroupstogetherelementsbasedonadistancemeasure.ThedistancecanbeanactualmeasureofEuclideandistanceorcanbemappedfromanymannerofotherdatatypes.Eachiteminasetisrandomlyassignedtoacluster,thecentersoftheclustersarecomputed,andthenelementsareaddedandremovedfromclusterstomoreefficientlymovethemclosertothecentersoftheclusters.In[16,39]thek-meansclusteringalgorithminimplementedasanexampleofaspecialreconfigurablefabricintheformofacellulararrayconnectedtoahostprocessor.
K-meansisrelatedtotheApriorialgorithmasbotharedependentonefficientsetadditionsandcomputationsperformedonallelementsofthosesets,butaddsthedistancecomputationandsignificantlychangeshowthesetsarebuiltup.Besidesdifferingintheoverallalgorithm,thestructureofthecomputationisalsosignificantlydifferent,asthesystemrequirestheuseofglobalmemory,inwhicheachunit’spersonalmemoryisaccessiblebythehostcontroller.Byavoidingglobalconnectionsthatviolatetheprinciplesofsystolicdesign,wecanincreaseoverallsystemclockfrequencyandeaseroutingproblems.
2.4NetworkSecurity
Methodscommonlyusedtoprotectagainstsecuritybreachesincludefirewallswithfilteringmecha-nismstoscreenoutobviouslydangerouspackets,andintrusiondetectionsystemswhichusemuchmoresophisticatedrulesandpatternmatchingtosensepotentialmaliciouspackets.Thesetechniquesrequiresignificantcomputationalresources,and,usinghighly-paralleladaptivesoftprocessors,pro-videopportunitiesfordramaticimprovementsinperformance.
Afirewall’sfunctionistofilterattheheaderlevel;ifaconnectionisattemptedtoadisallowedport,suchasFTP,theconnectionisrefused.Thiscatchesmanyobviousattacks,butinordertodetectmoresubtleattacks,anIntrusionDetectionSystem(IDS)isutilized.TheIDSdiffersfromafirewallinthatitgoesbeyondtheheader,actuallysearchingtheapplicationlayerstreamcontentsforvariouspatternsthatimplyanattackistakingplace,orthatsomedisallowedcontentisbeingtransferredacrossthenetwork.CurrentIDSpatterndatabasesreachintothethousandsofpatterns,providingforadifficultcomputationaltask.
BecausetheIDSmustinspectatthelinerateofitsdataconnection,IDSpatternmatchingdemandsexceptionallyhighperformance.Thisperformanceisdependentontheabilitytomatchagainstalargesetofpatterns,andthustheabilitytoautomaticallyoptimizeandsynthesizelargedesignsisvitaltoafunctionalnetworksecuritysolution.Muchworkhasbeendoneinthefieldofstringmatchingfornetworksecurity[14,17,18,26,35].However,thestudyoftheautomaticdesignofefficient,flexible,andpowerfulsystemarchitecturesisstillinitsinfancy.
AcompleteIntrusionDetectionSystems(IDS)basedontheSnortrules[34]requiresasystemoptimizedforthousandsofrules,manyofwhichrequirestringmatchingagainsttheentiredatasegmentofapacket.Highlevelsofperformancearenecessarytoprovidereal-timestringmatchingattrunklinespeeds.Snort,theopen-sourceIDS[34]hasthousandsofcontent-basedrules.Eachoftheserulesrequirethatapacketbesearchedinitsentiretyfortheoccurrenceofsome“fingerprint”string.Usingna¨ıvemethods,thisisunworkable.
5
3
3.1
3.1.1
CompletedResearch
StringMatching
KMP
IntrusionDetectionSystemsrelyonthepatternmatchingofthousandsoffingerprintstringstodetectpotentialmaliciouspackets.Thisfilteringrequiressignificantcomputationalresourcesandisdifficultusingna¨ıvemethods.
Ournovelapproachtoruntimeadaptabilityusesapipelined,two-comparator,bufferedimple-mentationoftheKnuth-Morris-Prattalgorithm(KMP)[21]toimplementhigh-performancepatternmatching,whileyieldingaunitdesignthatishighlyareaefficient.
OverviewofKMP
KMP,developedbyKnuth,Morris,andPratt[21]utilizesapre-computedtabletopreventredundantcomparisons,reducingtheworst-caserunningtimefromO(kn)toO(n+k).Thepre-computedtable,orπ-table1,containsthelengthofthelargestnumberofcharactersintheprefixofthepatternPthatmatchthesuffix:
π[q]=max{j:j (1) Thatis,theπ-tabletellshowmuchofthebeginningofthepatternhasalreadybeenmatchedatanypositioninthepattern.Ofcourse,thisonlyaffectsthesystemwhenthepatternhasrepeatedstrings;apatternsuchas“abcdefg”wouldhavenousefulinformationintheπ-table,while“abaab”wouldhaveusefulinformationbecausethebeginningofthepatternshowsuplaterinthepattern.Aftertheπ[q]definitionisapplied,thereissomepost-processingontheπ-table,discussedlaterinthissection. Theπ-tablebelowisfortheworst-casepattern,aFibonaccistring[21].π[1]=0,meaningthereisnopossiblematchandtheinputpointershouldbeadvanced.π[4]=2,meaningthenextcharactercomparisonisagainstP[2]=‘b’.Ifthisfails,thenextcomparisonisagainst‘a’.Thesecondcomparisonof‘a’isunavoidablebecauseKMPishistoryless,meaningthatthesystemcannotrememberwhatithascomparedearlier.Animportantpointisthathard-codedFiniteAutomata-basedimplementationscanworkmoreefficientlythanamemory-basedapproachbecausetheFAcancompareallofthepossibletransitionsinparallel,whiletheRAM-basedapproachmustsequentiallycomparetheincomingcharacteragainst(potentially)severalpatterncharacters. qP[q]π[q]1a02b13a04a25b16a07b48a09a2OneimportantnoteisthatwhenP(i)=P(π(i)),theoptimizationπ(i)<=π(π(i))canbemade.ThisviolatesEquation1butprovidesmuchhigherperformancebyremovingcomparisonsthatcanneverbetrue.Thisoptimizationwasintroducedintheoriginalpaper[21]butwasomittedin[38]. OurapproachtointrusiondetectionusesamodifiedversionoftheKMPalgorithmandmatchingarchitectureoptimizedforrunningonalineararray.TheKnuth-Morris-Prattalgorithm(KMP)[21]isasophisticatedapproachtostringmatching,providingO(n+k)intheworstcase.Theexactrunningworstcaserunningtimeis2n−k. OurContributions Ourarchitectureenableshigh-throughput,easilyconfigurableintrusiondetectionforavarietyofhardwareplatforms(FPGAorASIC).Unlikeotherstate-of-the-artarchitectures,ourapproach usetheπ-tableasequivalenttotheusageofthenextfunctionin[21];[38]alwayscomparesagainstπ+1 whereastheoriginalKMPcomparesagainstπ 1We 6 doesnotusehard-wiredcircuitrytoimplementpatternmatchingonanFPGAorotherwise,butusesconfigurablememoriesstoringpatternsandprecompiledjumptablestoprovideexceptionalflexibility. Byminimizingthenumberofcomparatorsrequiredtomatchanewinputcharactereachcycle,weuselesshardwareresourcesthanotherapproaches.Afterfindingthemaximumbuffersizerequirementsthroughcarefulanalysisofthealgorithm,wecanproduceapatternmatchingunitthatusesfewFPGAresources,allowingmoreunitstobeintegratedontoasinglechip.Byallowingonlyone-waycommunicationbetweenneighboringunits,thearchitectureissuited,bydesign,foruseinalineararrayandingridsofunits. Asignificantcontributionofthispaperisademonstrationofthemaximumsizebufferrequiredtoimplementastringmatchercapableofacceptingacharacterfromtheinputineachcyclewithoutresortingtokparallelcomparators,wherekisthepatternsize,orannelementbuffer,wherenistheinputstreamsize.ThearchitecturethatenablesthisisabufferedstringmatchingsystemimplementingaKMP-likepre-computationalgorithmutilizingtwocomparators.Thisallowsa√ matchingunittoacceptonecharactereachcycleintoabufferofsizelogφk,whereφ=1+25.Bybufferingtheinputweallowalineararrayofpatternmatchingunitstobedaisy-chainedtogether.Theunit-levelbufferingisakeyidea,becausewithoutbufferingwheneveranindividualunitstallstorecoverfromafailingcomparison,theentiresystemmuststallaswell.Thisisunacceptable.ThelineararrayweenablereducesthefanoutandtheinterconnectdistancebyallowingunitstoberegularlyarrangedonanFPGAwithoutlonginterconnects.Thedetailsofourarchitecturearecoveredin[8]. Whilethenetworkintrusiondetectionproblemiswelldefined,theperformancemeasuresarenot.Therearemanyapproachestotheintrusiondetectionproblem,andeachhasitsownstrengthsandweaknesses.Weassumethatthemostimportantmeasuresarethroughput,areaefficiency,andtotalworkperformed.Throughput,becausethefilteringneedstokeeppacewiththelinerate.Areaefficienyisimportantbecausethereareafewthousandpatternsthatneedtobematched.Weproposethatreconfigurabilityisthethirdvitalmetric:thepatterndatabasedoeschangefromtimetotime,andthedowntimeandprocessingtimeforreconfigurationshouldbelow.Hardwiredapproaches[5,6,7,18,36]requireplaceandrouteoftheentiredevice,andthenadelayforbitstreamreconfiguration.Theidealapproachoffersonlinereconfiguration,withlittleornodelaytothedatastream. Wehavedefinedperformanceintermsofoverallworkperformed,andpatterncharactersperlogiccells.Shift-and-compareapproaches,asdiscussedinSection3.1.2below,requirethattheentirepatternbecomparedagainsttheinputstringineverycycle.Thisequalstheworst-casesituationinthena¨ıveapproach,nkcomparisons.ThroughamortizedanalysisweshowthattheKMParchitecturerequiresonly2nbytecomparisons,amuchmoreefficientstrategythantheshift-and-compareapproach.Intermsofmeasuredbypatterncharactersperlogiccell,theKMPoutperformsallbutthehardwiredapproaches;inordertoincreasethelengthofthepatternstring,allthatisrequiredisenoughmemorytostorethepatternandRAMsufficienttostore1/2lengthofthepatternforthebuffer.Becausethesizeofthecontrollogicandthenumberofcomparatorsdoesnotincreasewiththesizeofthepattern(exceptingcountersizes,increasingaslg(n)ofthepatternsize),theKMPapproachisveryefficient.3.1.2 Shift-and-Compare In[5,6,7],wedevelopedanautomaticsynthesismethodologyandtoolsetforhighlyareaandtime-efficientintrusiondetectionsystemsusingahigh-level,graph-basedpartitioningmethodology.Automatedsystemleveldesignallowsmoreefficientcommunicationandextensivereuseofhardwarecomponentsfordramaticincreasesinarea-timeperformance.Throughpre-processing,thistool-basedmethodologyyieldsdesignswithcompetitiveclockfrequenciesthataremoreareaefficientthananyothershift-and-comparearchitectures. 7 Previousapproachestostringmatching,excepting[15]havecenteredaroundabyte-levelviewofcharacters.Highperformancedesignsevenincreasedthebasecomparisonsizeto32bits,providinghighthroughputbyprocessingfourcharacterspercycle.Increasingthenumberofbitsprocessedatasinglecomparatorunitincreasesthefan-intosinglegates.Ourapproachmovesintheoppositedirection,tosingle-bit,orunary,comparisons.Wedecodeanincomingcharacterintoa“one-hot”bitvector,inwhichacharactermapstosinglebit.Thisearlydecodingisreferredtoas“shareddecoding”in[15]. Unfortunately,withoutsomereductioninthecharacterset,unaryrepresentationsarealmostentirelyuselessduetothelevelofinefficiencycausedbythehugenumberofbitlinesrequiredforthe256characterASCIIset.However,ifthecharactersetcanbereduced,thenumberofbitlinescanbedramaticallyreduced.ThemosttrivialexampleofreducedsetsisDNAmatching,wheretheonlycharactersrelevantareA,T,C,G,representedasfourone-hotbits.Amoreinterestingexampleisstringmatchingfornetworksecurity,wherethousandsofpatternsneedtobematchedsimultaneouslyathighthroughputrates. Becauseintrusiondetectionrequiresamixofcasesensitiveandinsensitivealphabeticcharacters,numbers,punctuation,andhexadecimal-specifiedbytes,thereisaninterestinglevelofcomplexity.However,eachstringonlycontainsafewdozencharacters,andthosecharacterstendtorepeatacrossstrings.Usingtechniquesfromgraphtheory,thepatternsarepartitionedn-wayssuchthatthenumberofrepeatedcharacterswithinapartitionismaximized,whilethenumberofcharactersrepeatedbetweenpartitionsisminimized,thesystemcanbecomposedofnpipelines,eachwithaminimumofbitlines. Thefirstandobviousproblemisthenumberofcharactersthatmightneedtobematchedforanyinputpattern.Dataisencodedforareason,afterall,andhavingabitforeverycharacterisnotefficient.Theroutingofalargeamountofbitlinesisfairlyexpensive,butbecauseofthepipeliningbetweenunits,thereislittletimeperformancepenalty.Ifthedesignautomationtoolscanpartitionarulesetto30bits,orroughlythesameamountofdataroutingasin[35],andthenmaketheactualmatchingunitsatleast8xsmaller(32downto4bitsfor[14]),andremovethebuffersandcontrollogicofKMP-styledesignsuchas[18],wecanexpecttoseeperformance(area-time)increases. InthewholeoftheSnortdatabase,thereareonlyabout100differentcharactersevermatchedagainst.Someofthosearecaseinsensitive,orcanbemadecaseinsensitivewithoutlossofgenerality,reducingthenumberofuniquecharacterstoroughly75.Hexadecimal-specifiednon-characterbytesareexpandedtoequivalentcharacterrepresentations.Thenextstepistopartitionthepatternsintoseveralgroupssuchthattheminimumnumberofcharactershavetobepipedthroughthecircuit;thatis,wegiveeachgroupofpatternsapipeline,andgothroughvariousheuristictrickstoattempttoreducethepipelineregisterwidth. Thegraphcreationstrategyisasfollows.Westartwithacollectionofpatterns,representedasnodesofagraph.Eachpatterniscomposedofcharacters.Everynodewithagivenletterisconnectedbyanedgetoeveryothernodewiththatletter. Thisproducesadenselyconnectedgraph,almost40,000edgesina361vertexgraph.Ourobjectiveistopartitionthebiggraphintotwoormoresmallergroupssuchthatthenumberofedgesbetweennodeswithinthegroupismaximized,andthenumberofedgesbetweennodesindifferentgroupsisminimized.Eachpipelinesuppliesdataforasinglegroup.Bymaximizingtheedgesinternaltoagroupandminimizingedgesoutsidethegroup(thatmustbeduplicatedbetweenpartitions),wereducethewidthofthepipelineregisters,butimprovetheusageofanygivencharacterwithinthepipeline. Thisproblemisastandardgraphtheoryprobleminphysicaldesignautomationknownas“min-cut”problem.Asinglecircuitneedstobesplitintotwochips,butthetwosectionsareinterdepen-dent.Themincutsolutionminimizesthenumberofwiresconnectingthetwochips,thusdecreasingthenumberofpinsandreducingenergyconsumptionandclockperiod.Weuseaniterativeim-provementheuristicbyKernighanandLin[20](1970). Akeycontributionofthispaperisthedramaticallyreducedcomplexityofthemincutoperation 8 Design USC(nopipelining)USC(pipelined)USCUnaryUSCUnary(1byte)USCUnary(4byte)USCUnary(8byte)USCUnary(Prefilter)USCUnary(Tree)LosAlamos(FPL’03)[17]UCLA(FPL’02)[14] UCLAw/Reuse(FCCM’04)[13] U/Crete(FPL’03)[35]U/Crete(FCCM’04)[36]GATech(FCCM’04)[15]Throughput1.8Gb/s2.4Gb/s2.07Gb/s1.79Gb/s6.1Gb/s10.3Gb/s6.4Gb/s2.00Gb/s2.2Gb/s2.88Gb/s3.2Gb/s10.8Gb/s9.7Gb/s7.0Gb/sUnitSize911207.35.722.332.09.46.624316011.42695750Performance 19.8202833152713226823039.118.028040.1170140 Table1:Unitsizeistheaverageunitsizefora16characterpattern(inlogiccells;onesliceistwologiccells),andperformance(inMb/s/cell).Throughputisassumedtobeconstantovervariationsinpatternsize. bypartitioningnotnetlists,butthesetofpatterns.Becausetheoperationisatamuchhigherlevel,atthelayerofthepatternsandcharactersimilaritiesratherthangatesandwires,thesystemnotonlyrequiresmuchlesspruningbythesynthesistool,butallowsformucheasierpre-synthesisperformanceestimation. Partitioningreducesthenumberofcharacterswithinanysinglepartition,whichreducestheinterconnectdensitywithinthatpartition.Wehavedemonstratedanaverageimprovementintimeperformanceof20%foralldatabasesizes.Unfortunately,becauseasthenumberofpartitionsincreasesthenumberofreplicatedcharactersalsoincreases,multiplepartitionscancauseanincreaseinoverallareausage.However,thesystemasawholewilloperatemorequicklybecauseeachpipelineisindependent. Wehavealsoexploredtheuseofatreestructureforreducestheamountofcharactersredundantlycomparedbetweenmultiplepatterns.Eachpattern(oflengthgreaterthaneightcharacters,asotherwisethewholepatternwouldfitinthetwoprefixes)iscomposedofafourbytefirst-levelprefix,afourbytesecond-levelprefix,andtheremainderofthepattern.Eachprefixismatchedindependentlyoftheremainderofthepattern.Thisiseffectiveinreducingtheareaofthedesignbecauselargesectionsoftherulesetsshareprefixes.Themostcommonprefixis/scripts,wherethefirstandsecond-levelprefixesareusedtogether.The4-characterprefixwasdeterminedtofiteasilyintotheVirtex-style4-bitlookuptable,butitturnsoutthatfour-charactergroupsarehighlyrelevanttointrusiondetectionaswell.Patternswithdirectorynamessuchas/cgi-binand/cgi-wincansharethesamefirst-levelprefix,andtheneachhaveafewdozenpatternsthatsharethe-binor-winsecond-levelprefix.Redundantprefixcomparisonscanbereducedby2to3xthroughtheuseofthetreearchitecture. Overall,thehardwiredsystemoperatesatmuchareaefficienciesthattheKMPsystem,packingupto5patterncharactersinasingleslice.Thehardwiringoflogicallowsforveryhighpatterndensity,fortworeasons.First,thelookup-tablearchitectureallowsforarbitrarylogicspecification,allowingveryefficientcomparators,whereasamoreflexiblesystemwouldrequireXORgatesandex-trastorageforthepattern,reducingefficiencybymorethan2xevenwithouttheotheroptimizations.Second,becausethesystemiscomposedwithknowledgeofthepatternstobematched,inter-patternredundancycanbeexploited,whereastheKMPcanonlyexploitintra-patternredundancy. 9 3.2DataMining Dataminingisarapidlyadvancingfieldofstrategiesforfindingconnectionsbetweenelementsinlargedatabases.TheApriorialgorithm[23]isapopularandfoundationalmemberofthecorrelation-baseddataminingkernelsusedtoday.However,itisacomputationallyexpensivealgorithmandrunningtimescanstretchtodaysforlargedatabases,asdatabasesizescanreachfromGigabytesandcomputationrequiresmultiplepasses.3.2.1 IntroductiontotheAprioriAlgorithm WebreaktheApriori[2]algorithmintothreesections,asillustratedinFigure2.Initialfrequentitemsetsarefedintothesystem,andthecandidategeneration,candidatepruning,andcandidatesupportstepsareexecutedinturn.Thesupportinformationisfedbackintothecandidategeneratorandthecyclecontinuesuntilthefinalcandidatesetisdetermined.Wewillfirstintroducesomeofthedatamininglexiconandthendescribetheoperationalphasesinmoredetail. Intheliterature,ananalogytoashoppingcartisused:thesetofitemspurchasedatonetime,checkedoutfromthelibrary,orotherwisegroupedtogetherbasedonsomecriterasuchastime,customer,etc.isreferredtoasabasket.Theitemswithinthebasketcanbetheentiretransaction,ortheremaybemultipletransactionswithinthebasket.Afrequentitemsetistheasetofoneormoreitemsthatoftenoccurinadatabaseoneitem,andoftenoccurstogetherinthesamebasketwithinthedatabaseifitconsistsofmorethanoneitem.Thecutoffofhowoftenasetmustoccurbeforeitisincludedinthecandidatesetisthesupport. Inthisway,aresearchercanrequestaparticularsupportvalueandfindtheitemswhichoccurtogetherinabasketaminimumnumberoftimeswithinthedatabase,guaranteeingaminimumconfidenceintheresults.Apopularexampleintheliterature(possiblyapocryphal)isprocessingthesupermarkettransactionsofworkingmenwithyoungchildren:whentheygotothestoreafterworktopickupdiapers,theytendtopurchasebeeratthesametime.Thus,itmakessensestatistically,ifnotsociallyresponsibly,toputabeerrefrigeratorinthediaperaisle. Figure2:Processflowofthedataminingsystem Candidategenerationistheprocessinwhichonegenerationofcandidatesarebuiltintothenextgeneration.ThisbuildingprocessisfromwheretheApriorinamederives.Becauseeachnewcandidateisbuiltfromcandidatesthathavebeendeterminedapriori(inthepreviousgeneration)tohaveahighlevelofsupport,theycanbeconfidentlyexpandedintonewpotentialfrequentitemsets.Thisisexpressedformallyasfollows:∀f1,f2∈Fkdo withf1=(i1,...,ik−1,ik)andf2=(i1,...,ik−1,i∗k) ∗ andik Itshouldbenotedthatonlyorderedsetsareutilized.Thus,whenfisgeneratedfromf1andf2,thesetsremainordered.Candidategenerationpairsupanycandidatesthatdifferonlyintheirfinalelementtogeneratethenextcandidategeneration. 10 Thenextstepofcandidategenerationguaranteesthateachnewcandidateisnotonlyformedfromtwocandidatesfromthepreviousgeneration,butthateverysubsetofitisalsopresentinthepreviousgeneration,asfollows:∀i∈f:f−{i}∈Fk Thus,ourinitialcandidategenerationprovesbydesignthatifweremoveeitherofthelasttwoitems(ik,i∗k)fromthenewcandidate,wewillgetcandidatesfromthepreviousgeneration,namely,f1andf2.Thesecondstepprovesthatifweremoveanyoftheotheritemsfromthenewcandidate,wemustfindacandidatefromthepreviousgeneration.Thisprogressivebuild-upofcandidatesistheheartoftheApriorialgorithm. Thethirdphaseofthealgorithmisthesupportcalculation.Itisbyfarthemosttimeconsuminganddataintensivepartoftheapplication,asitisduringthisphasethedatabaseisstreamedintothesystem.Eachpotentialcandidate’ssupport,ornumberofoccurrencesoverthedatabaseset,isdeterminedbycomparingeachcandidatewitheachtransactioninthedatabase.Ifthesetofitemsthatmakeupthecandidateappearinthetransaction,thesupportcountforthatcandidateisincremented,asfollows:∀t∈Tdo ∀c∈Cdo ifc∈t support(c)++ ThemainproblemwiththeApriorialgorithmisthisdatacomplexity.Eachcandidatemustbecomparedagainsteverytransactiondata,andcandidategenerationmustseetheentiredatabasetransactionset.Thisgivesalargerunningtimeforasinglegeneration,O(|T||C||t|),assumingthesubsetfunctioncanbeimplementedinconstanttime|t|.However,theparallelismcontainedintheloopsallowsforsomeinterestingaccelerationinhardware,particularlywhenimplementedasasystolicarray.3.2.2 OurContributions Throughtheuseofanewtypeofsystolicarrayarchitecture,thetimerequiredforprocessingcanbedramaticallyreduced.OurarrayarchitectureimplementationonaXilinxVirtex-IIPro100providesa4xperformanceimprovementoverthefastestsoftwareimplementationavailablerunningonadual2.8GHzXeonsystemwith3GBRAM. Duetoourstreamingimplementation,thecandidategenerationphaseofthealgorithmrequireordersofmagnitudelesstimethanthesupportcalculation,andoverallrequiresroughly25%ofthetimerequiredbythefastestnon-supercomputerimplementation[11].Theoff-chipmemoryrequiredisnegligiblebeyondthesizeofthedatabaseandthebandwidthbetweenmemoryandthesystolicarrayisonly250MB/s. WhilethearchitecturecouldeasilybeimplementedinacustomASIC–infact,thesimpleunitsthatmakeupthesystolicarrayaredesignedexplicitlyforeaseofASICimplementation–theuseofFPGAallowstheusertoutilizeparameterizeddesignswhichallowforvariablesizeitemdescriptorsaswellasoptimizedmemorysizesforaparticularproblem.Aswell,FPGAsallowthedesigntobescaledupwardeasilyasprocesstechnologyallowsforever-largergatecounts. Weaddresstheseissuesthroughhybridsystolicarray-microcontrolleddatapathsandefficientdesignprinciples.ThispaperpresentsseveralstrategieswehavedevelopedforadaptingtheApriorialgorithmtouseinasystolicarray[23].Wealsopresentastrategycalled“systolicinjection,”asignificanttechnicalcontributionstothegeneraluseofsystolicarraysthatallowsawiderangeofpreviouslychallengingapplicationstobeimplementedefficientlythroughanefficientstalling 11 procedure(Figure3).Themethodalsoforreportingunpredictablygeneratedmid-arrayresultstoacontrollerwithoutchanceofcollisionorexcessivestalling. Figure3illustratesthedatamovementthroughthepipelineandthedatastoragebehaviorwithinaunit. Figure3:SystolicinjectionoccursinStage3(cycle1),andStage2(cycle2)butdoesnotcauseunrecoverablecollisions Throughtheuseofthesystolicarrayweallowforincreasedfrequencyperformance,decreasedinterconnect,andsimple,easilyscalablepackagedunits.Weimplementalldataandcomputationintensiveoperationswithinthesystolicarrayandimplementallserialandcontrolintensiveopera-tionswithinamicroprocessor,namelythePowerPC405integratedwiththeVirtex-IIProdevice.Theoff-chipbandwidthrequiredis250MB/s,allowingforpracticalimplementationwithoff-the-shelfmemorycontrollers.Theoff-chipmemoryrequirements,beyondthedatabase,arenegligible. Becausethesystolicarrayimplementationdoesnotrequirereconfiguration,modifyingthesetsismuchfaster,asintheKMPstringmatchingsystolicarray,thereconfigurationtimeismk,thatis,theproductofthenumberofsetstobemodifiedandthesizeoftheset.Thisisveryfast,butincreasesthesizeofthesubsetunitoverwhatwouldberequiredinahardwiredimplemenation.Becauseperformanceofthesystolicarrayislargelydeterminedbyproductofthenumberofunitsandtheelementsthatcanbeacceptedineachcycle,aninterestingoptimizationproblemisencountered: 12 becausethereislittleareaconsumptionbeyondthesubsetarray,increasingthenumberofelementsacceptedineachcyclethoughelement-levelparallelismleadstoanequaldecreaseinthenumberofunitsonadevice.Weaddresstheseproblemsanddefinenewapplicationsforthesealgorithmsinthenextsection. 4ProposedResearch AllworkinSection4.1istowardthedevelopmentofgeneralarchitecturesuitablefortheuseintheapplicationsdescribedinSections4.2and4.3.LikethebasicIntrusionDetectionSystem,eachapplicationprovidesaparticularsetofproblemdefinitions,parameters,andperformancemetricstooptimize. 4.1Low-risk,MediumReward:AcceleratingDataMiningKernels Whilewehavemadesomeinterestingadvancesinthefieldofdatamining,inparticulartheimple-mentation[9]ofa4xperformanceimprovementoftheApriorialgorithm[2],webelievethereisstillmuchroominthefieldforresearchcontributions.First,ourpreviouswork,whilecontributingsomeinterestingarchitecturalenhancementstothefieldofsystolicarrays,didnotachievetheperformanceimprovementsthatmakehardwareimplementationssignificantlyworthwhile.Thisismostlyduetothenatureofthedataminingprobleminthestandardapproach;itrequiresseveralpassesofalargedataset,wherethecomputationalneedsarenotdense,butdifficulttoeasilyparallelize. Oneofthekeyweaknessesofthecurrentimplementationistheoneitem-per-clockbandwidthlimitation.Thatis,ineachcycle,onlyoneelementofthedatabasecanbeprocessedduetothemergesort-likesubsetoperationcurrentlyimplemented.Wearecurrentlyinvestigatingalternateimplementationsofthefunctionalitythatallowformultipleelementstobeacceptedinacyclethroughhashingofinputdataandsmallbuffers.Thesestrategiesdonotaddresstherealproblemoflowcomputedensity,though,andwearesearchingforstrategiestoincreasethedensityoftheproblem.Oneapproach,introducedin[3]istore-encodecommonitemsetsassingleitemstoallowthereductionofthetotalnumberofelements.Ifitispossibletoextendthistoahardware-levelre-encodingandhardwiring,similartotheapproachwedevelopedforstringmatchingin[5],largeperformancegainscouldberealized. Thereareseveralproblemswithextractingalargeamountofparallelismandredundancyre-ductionthatmustbeaddressed.ThefirstisaninabilitytotakeadvantageofthereconfigurableaspectoftheFPGA.TheIntrustionDetectionproblemisanidealoneforFPGAbecausethepatternsetsareassumedtonotchangemorethanafewtimespermonth.Thisallowsthepatternstobehard-codedintothedevicearchtecture.Thisislessusefulfordatamining.AmainfunctionalityoftheApriorialgorithmisindeterminingifasmallsetofelementsareasubsetoflargersetsofelements(transactions)overthousandstomillionsofdatabaserecords.Weassumetheelementstohave16bitidentifiers;thispreventsusfromefficientlyusingreencodingtechniquesasusedinthestringmatchingproblem.Itisinefficienttogenerateandloadnewbitsreamstomatchforasingleinstanceofaset.TheVirtex-IIPro100systolicarrayimplementationrequiresroughly24hourstoplace-and-route,removingtheoptionofchangingconfigurationevery15secondsforthe100,000transactionbenchmarkingset. Wearecurrentlyinvestigatingtwooptions:increasedthroughhash-basedtechniques,andmemory-basedre-encoding.Hash-based Increasingelement-levelparallelismpasttheone-elementpercyclebarrierofthecurrentsystolicarrayarchitectureisnecessarytoincreaseoverallthroughputofthesystem.Becausethecontentsofatransactionareunpredictable,itisimpossibletohaveacounter-controlledmultiplexerinflexiblyassignelementstoseparatecomparatorsub-systems.However,withahash-basedsystem,onecandemultiplextheincomingsetelementsbasedontheelementID,allowingforapredictable 13 parallelization(predictableinthissensemeansthatweknowbeforehandwhichelementswillbehandledbyagivencomparatorsubsystem).Whiletheelement-wiseresponsibilityispredictable,theloadingisnotasnothingisguaranteedaboutthecontentsofagiventransaction;thatis,thehashingofelement-IDscan,intheworst-case,directanentiretransactionintoasinglecomparator.Ofcourse,arandomizedhashingalgorithmguaranteesthattheaveragecasewillbeperfectlydistributedacrossthefieldofcomparatorsub-systems.However,inpractice,buffersandsynchronizationwillberequiredtohandleworst-casesituations.Assumingthattheparallelizationsystemexistswithinthelargersystolicarrayarchitecture,ifabufferofsomearbitrarysizedoesbecomefull,thesystemasawholecanbestalledusingthepre-existingsystolicinjectionhardware,althoughthisreducesaveragethroughput. Memory-based Inthisarchitecture,theperformanceistiedtothedetailsofthedevice.WeconsidertheXilinxVirtex-IIPro100device,apopulardeviceusedforresearchsimulations.Thisdevicehas55018kbtwo-portblockRAMmemories.Inordertofittheentireitemsetintoasinglememory,weconstraintheelementIDsizeto14bits,using16kbofablock.TheRAMisconfiguredasa16,384elementby1bitarray.Eachcandidateitemsetisconfiguredastheelementsofamemory,wheretheaddressofanelementistheelementIDandthepresenceofanitemisrepresentedbyasetbit.Supportisdeterminedbycountingthenumberofbitsmatched.Afterprogramming,writeenableisdeassertedandthedatabaseisstreamedpastthememoryblocks.Now,theelementIDsfromthedatabaseareutilizedasaddresses;amatchisthentheoutputofthememory.Thesubsetoperationissimplyasummationoftheoutputbits;ifthesumisequaltothesizeofthecandidateitemset,thesubsetiffoundtobecomplete. Thisisasimpleenoughtransformationfromthemergestylearchitecture,butprovidesthepotentialforseveralsubstantiveincreasesinperformance.First,thememoryhardwareisbuiltintothedevice,soitdoesnothaveanassociatedcostinlogiccells.Next,theblockRAMisatwo-portmemory,meaningthatwegeta2xincreaseinbandwidthforfree.Third,thesummationoperationoverln(m)bits,wheremisthesizeofthecandidateitemset,isafastoperation.BecausetheRAMprovidesitsoutputatthebeginningofacycle,pipeliningoftheitemcomparisonsisnecessarybutshouldnotpresentproblems.GiventhattheblockRAMhasaonecyclepipelinedelay,theremaybesomeNOPcyclesinsertedaftertransactionstoallowforsummationtocompletebutoveralltheclockfrequencyshouldbehigher,aswellasallowingtwoelementspercycle.Asmentionedabove,theinversetradeoffswhenweattempttoincreasebandwidthbyusinglargernumbersofportsresultinaproportionalreductioninthenumberofcandidateitemsetsabletoconcurrentlyexecute. 4.2Medium-riskEngineering:AlignedTextMining Textminingisanewareathatbuildsonourexperienceintextmatchinganddatamining.Thefocusofthisworkisonprocessingtextualinformationovermultiplechannelsathighspeeds(10Gbit/sandup).Theobjectivesofthistextsurveyapplication[28]includeclusteringmessagesbysimilarity(withouttraining)accordingtogeographicinformation(eitheraccordingtocontentandtopicswithinthemessage,ortosendingorreceivingmetadata),selectionofmessagesbelongingtouserselectedclusters.Theselectionandclusteringalgorithmsshouldbeabletoadapttochangesintopicsofinterestandbeabletoquicklyrescanhistoricalrecordstofindmessagesthatmatchthecurrenttopicofinterest. Oneoftheresearchdirectionsweareinterestedinpursuingisthedevelopmentofanarchitecturalsuiteforprocessingtextintoencodeditemsets,efficientfiltering,andcorrelationanalysis.ThefilteringstepofaneffectivetextprocessingsystemcangreatlybenefitfromthereconfigurableaspectofFPGAs.Overtime,commonwordsbecomemeaningless,andrelevantkeywordschangeastimepasses;moreover,anidealfiltermustbeabletodetectcodedcommunicationandnotrejectaconversationbecauseisnotovertlyrelatedtothetopicofinterest.Moreover,itmustabletodothisathighdatarates.Wewilldeterminethemosteffectivestrategiestoprovidethebenefitsof 14 hardwareaccelerationandguaranteetheadaptiveflexibilityrequiredforconstantlychangingtopicsofinterestandcustomerobjectives. Utilizingourcurrentworkincorrelation-miningalgorithmsasabasis,wewilldevelopnewkeyword-baseddistanceformulationsandweightingmeasuresforautonomousclusteringstrategies,aswellasitemsetfilteringtoreducefalsecorrelations.Oneofthegoalsofthisproposedworkistoprovidereconfigurablehardwareimplementationsfastenoughtoenableonlinedatabaseprocessing,allowingfor“keyworddriftanalysis,”thatis,allowingpreviouslyunseentopicstoberecognizedandgivennewcorrelationcategories,aswellasremovingoldcategoriestobeprunedfromthedatabase. Giventhedynamickeywordanalysisdevelopedfortextfiltering,wecanfurthercorrelatemes-sagesacrosstimetofindmessagesinthepastthatrelatetocurrentlyrelevantkeywordsandtopics.Thekeyelementofthisanalysisisfastsetanalysisandtheabilitytohandlehighinputdatarates. Overall,thisresearchtrackisfeasibleandcanbecompletedinthenextyear. 4.3High-risk,HighReward:AutomaticAttackDetection Runtimeadaptationtonetworkattacksanexcitingfieldthathasonlyrecentlybeenmadefeasiblethroughadvancesinarchitectureanddevicecapabilities.Discoveringnetworkattackswithoutbeingtoldwhattolookforisadifficultproblem,withadealofpotentialresearchdirections[29,33].Wewillleverageourexperienceinhigh-performancestringmatchingforintrusiondetectionsystems[5,8]anddataminingtocreatesystemsthatarecapableofautomaticallygeneratingnewpatternsandextendingtheirthreatcoveragewithouthumanintervention. Weproposetocorrelationminethestringmatchingdata,alongwithvariousotherdatafromthestream,suchassourceanddestinationIP,portnumbers,protocoltype,andpotentiallyrouteinformationprovidedbytraceback.Aftersomeperiodoftraffic,thevariousdataisanalyzedviatheApriorialgorithmtoproduceaseriesofsetsthatoccurmorethanaspecifiednumberoftimeswithinthetrafficperiod.Theinterestingitemsarethosethatoccurnearandwithintheknownattacks–anattemptedattackcommonlyisnotasinglepacketonasinglecomputer,butaseriesofscanningattacksoveranentiresubnet[24].Bysearchingfortheseunknownpacketstemporaly-locatednearknownattacks,wecanautomaticallydeterminestreampatternsthatmayneedtobeconsideredasanewattack.Thegoalistodetermineanewattackandintegrateitintothefilteringsystembeforethescanfindacomputerthattheattackissusceptibleto.Theseproblemshavenotbeenfeasibletoaddressbeforetheadventoflarge,fastFPGAscapableofhandlingthecomputationallyexpensiveoperationsnecessaryforonlineoperation. Inordertocomposeanautomatedresponsesystemasdescribed,high-levelsystemarchitecturedetailsaswellaslow-levelarchitecturaldetailswillbeaddressed.Whiletheworkisaddressedtowardnetworksecurityissues,thecombinationofstringmatchingandcorrelationminingisstronglyconnectedtobioinformaticsproblemsandHomelandSecurityinterestsaswell. Thereareseveralproblemsthatmustbeaddressedbeforeweeffectivelyaddresstheproblem.First,goodsampledatamustbearranged.Whilethearchitecturaldesignoftherelevantkernelsisofmoreinterestthantheoverallsystemdesign,wewillrequiresampledatatoevaluatethefeasibilityofourapproach.Thereareseveralsourcesforthisdata,namely,theLincolnLabsdatasetsandtheDeternetwork[10].Becauseourapproachassumesatapatthechockpointofasubnet,weneeddatathatrepresentsthetrafficoverawiderangeofaddresses.TheDeternetworkmaybearelevantsourcefortestdatatoexplorethebehaviorofthedetectionsystem.Thesecondproblemtoaddressistheidentificationofstreamsthatdonothaveamatchingpattern.Inna¨ıvepatternmatching-basedintrusiondetection,thereisnointerestinastreamthatdoesnotmatchoneofthepatternsinthedatabase.Thosestreamsthatdohaveamatchareeasilycataloged,andtheyareeasilyusedasinputstoadataminingkernel.However,streamsthatdonotmatchaparticularpatternhavenoeasywayofbeingcollectedtogether. WeillustratethisprobleminFigure4.Streams1and3arebotheasilyrecognized.Whiletheidentifyingsequence“Pattern1”isnotlocatedatthesameoffsetwithineachstream,becausethe 15 Figure4:OffsetstreamsthatmatchapatterninthedatabaseareeasilyassignedID’s.Streamsnotmatchingpatternsarechallengingtocharacterize. patternmatchingsystemsearchesallpossibleoffsets,thesystemeasilydeterminesthatStream1andStream3are,infact,thesameattack. Stream2andStream4areadifferentsituation.Becausetheydonotmatchanycurrentlyknownattack,theycannotbegivenanidentifyingnumber.However,theybothcontainasequenceofcharacters“PatternA”that,whileunknowntotheIDSatthepresenttime,cansuccessfullycompromiseaserverifdelivered.Becausetheyattackerisintelligent,heknowsthatitisbesttosendtherelevantattackingstringindifferentflavorssoastoavoiddetection.Differentoffsets,anddifferentpaddingcharacterscanmaketwoequivalentattacksappearverydifferent. Thisisadifficultproblemtoaddress.Oneapproachistofindcommonsubstringswithintheunknownstreams.Unfortunately,astherearemanymoreinnocentstreamsthanattackingstreams,comparingthemallwithoutsomealgorithmicmagicwouldrequireimmenseamountsofcomputa-tionalresources.Assumingthattheattacksareplacedwithinthestreamidenticallyandthatwecancomparetheentirestreamatonce,thisrequiresO(m2)time.Assuming,morerealistically,thatweessentiallyhavetousetheSmith-Watermanalgorithm[32]atO(n2)perstream,wemovetoadif-ficultO(m2n2)time.Thisisinconvenientunlesswecanfindanotherapproachtotheidentificationproblem. Oneapproachistosimplyreducen.Anomalydetectionmethodsusestatisticalmodelsofserverusagethatcouldbeuseful.Failingrequestsonmultipleportsisacommoncharacteristicofascript-basedattack,butconnectionsfromaninnocentuserareunlikelytotraversemorethanahandfulofcommonlyusedports.Thus,anypacketthatishandledwithouterrorbytheapplicationlayercanprobablybeignored.Wecandetermineanerrorconditionbylookingatapplicationlayerreturns;packetsthatgotoaclosedportorreturnastandarderrorinanactiveportcanbereasonablyassumedtobesuspiciousandworthinvestigatingfurther.Thesepacketstreamsareloggedformining. Thisresearchtrackcouldbeconsideredapartialfulfillmentoftheholygrailofnetworksecurity.Whileitcannotprovidethelevelofsecurityavailableanall-knowing,completelycontrolledandbehaviorallyunderstoodsystem,ifsuccessfulitwouldbeasubstantivecontributiontothedetectionofnewattacks.Acomplete,testablesystemaccordingtothisdesignwouldbetheworkofmanyPhDstudents,butmodelingbasedonkernelsimulationsisfeasibleinthetimeframeofSection5. 5ResearchPlan Theresearchweplanonexploringdevelopsfromthemesthatwealreadyhaveexperiencein.Thekeyresearchcontributionscomefromperformanceoptimizedforaparticularapplicationdomain. 16 Webrieflyoutlineanapproximatetimelineforthenextyearbelow: •Spring2005 Developandimplementenhanceddataminingalgorithms,increasingthroughputbyatleastdoublingdatapathwidthandreducingclockperiod.ImplementinterfacecircuitryforPowerPC405hardprocessorormicroBlazesoftprocessorforVirtex-IIProboard.•Summer2005 Extenddataminingalgorithmsthroughfront-endpreprocessingtoworkwithstringmatchingarchitectures.UnderstandandaddressissuesofhardwareinterfacingwithmemorysystemofML-310board.Completesmall-scalehardwareverificationofdataminingarchitecturewithsoftwarecontrolleddataandcontrolflow.AcquireIPcoresforsubsequencematchingforstreamcomparisons.•Fall2005 Completeintegrationofdataminingandstringmatchingarchitecturesforautonomouspatternidentification. Wehopetosubmitpaperstothefollowingconferencesoverthenextyear: •IPDPS’05EfficientDataminingusingtheAprioriAlgorithmonFPGAs(SubmittedOct2004) •FCCM’05Hash-AcceleratedDataminingusingtheAprioriAlgorithm(DueJan142005)•FPL’05DataminingforTextAnalysis(DueMarch2005) •ConferenceonCyberSecurityAutomatedSignatureDiscoveryUsingPacketMining(Nov2005) 6OverallThesisGoals Ourworkdiffersfromotherresearchintheareainthatwepursueanintelligentapproachtoarchitec-turaldesignthatgoesbeyondimplementationtoanunderstandingofwhatconstrainsperformance.Whetherthismeansreducinginterconnectburdenthroughgraph-basedpartitioning,developinganunderstandingofmaximumbuffersizes,orextendingthefunctionalityofsystolicarrayarchitecture,thesecontributionsaremorethanimplementationsofaparticularalgorithm.Theyarecontributionstothewiderarchitecturalfield,astheunderlyingdesigndecisionsandperformanceenhancementsareapplicabletoapplicationsoutsideourparticulardomain. Thecontributionsofmyproposedthesisareasfollows: •Ourminimumbuffersizeproofallowsstringmatchinginsystolicarrayswithoutstallingforef-ficientalgorithmssuchasKMP.Thetwocomparatorapproachdoessignificantlylessworkthatana¨ıveapproachandallowson-the-flyreconfiguration.Thislow-area,runtime-reconfigurable,exact-matcharraycanbeutilizedinmanyapplicationstoprovidefast,efficientstringmatchingforalargevolumeofpatterns. •ThesystemlevelviewoftheIntrusionDetectionProblemallowsforexploitationofpatternredundancy.Throughpreprocessing,treestructuresandencodingsaredesignedsuchthatbothfrequencyandperformanceishigher.Thisapproachisvalidforanysetofpatternsthatareknownatdesigntimeanddonotoftenrequiremodifications. 17 •Ouruseofpartitioning,alsodependentondesign-timepatternknowledge,allowsthousandsofpatternstoexistinthesamesystemwithoutinterferingwitheachother.Thatis,the“pre-decoding”strategycreatesalargenumberofconnectionstoberouted;ifallofthepatternsareforcedtoexistinthesamesectionofthedevice,amajorityofthelinesnearaparticularcomparatorwillnotberequiredbythatcomparator.Thiscausesareductionintimeperformanceasfanoutandinterconnectdelayincreases,aparticularproblemininterconnect-limitedreconfigurabledevices.Bypartitioningtoseparatesectionsofthedevice,frequencyperformancecanbeincreased.Thisstrategyisapplicabletoanyproblemthatcanbeefficientlymin-cutforFPGAimplementation. •Our“SystolicInjection”methodologyallowssystolicarraystobeusedforlesspredictableproblems.Whileveryregularproblemssuchasmatrixmultiplicationareeasilyimplementedinasystolicarrayarchitecture,problemswithunpredictablegenerationofresultscancauseanotherwiseregulararraytofallintochaos.Ourinjectionallowsefficientstallingwithlittlereductioninperformance. 18 References [1]Bloom,B.,Space/timeTrade-offsinHashCodingwithAllowableErrors.Communicationsof theACM13(1970)422–426.[2]R.Agrawal,T.Imielinski,andA.Swami.MiningAssociationRulesbetweenSetsofItemsin LargeDatabases.InProceedingsoftheACMSIGMODConference,1993.[3]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProceedingsofthe11thInternational ConferenceonDataEngineering,1995.[4]Altera,Inc.http://www.altera.com. [5]Z.K.BakerandV.K.Prasanna.AMethodologyfortheSynthesisofEfficientIntrusion DetectionSystemsonFPGAs.InProceedingsoftheTwelfthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2004(FCCM’04),2004.[6]Z.K.BakerandV.K.Prasanna.AutomatedIncrementalDesignofFlexibleIntrusionDe-tectionSystemsonFPGAs.InTheEighthAnnualWorkshoponHighPerformanceEmbeddedComputing(HPEC’04),2004.[7]Z.K.BakerandV.K.Prasanna.AutomaticSynthesisofEfficientIntrusionDetectionSystems onFPGAs.InProceedingsofthe14thAnnualInternationalConverenceonField-ProgrammableLogicandApplications(FPL’04),2004.[8]Z.K.BakerandV.K.Prasanna.TimeandAreaEfficientPatternMatchingonFPGAs. InTheTwelfthAnnualACMInternationalSymposiumonField-ProgrammableGateArrays(FPGA’04),2004.[9]Z.K.BakerandV.K.Prasanna.EfficientParallelDataMiningwiththeAprioriAlgorithmon FPGAs.InSubmittedtotheIEEEInternationalParallelandDistributedProcessingSymposium(IPDPS’05),2005.[10]T.Benzel,B.Braden,C.Neuman,andD.Kim.CyberDefenseTechnologyExperimental Research(DETER)Network.http://www.isi.edu/deter/,2003.[11]F.Bodon.AFastAprioriImplementation.InProceedingsoftheIEEEICDMWorkshopon FrequentItemsetMiningImplementations,2003.[12]CeloxicaMATLAB/SimulinkInterface. asp. http://www.celoxica.com/methodology/matlab. [13]Y.ChoandWilliamH.Mangione-Smith.DeepPacketFilterwithDedicatedLogicandRead OnlyMemories.InProceedingsoftheTwelfthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2004(FCCM’04),2004.[14]Y.H.Cho,S.Navab,andW.H.Mangione-Smith.SpecializedHardwareforDeepNetwork PacketFiltering.InProceedingstheTenthACM/SIGDAInternationalConferenceonField-ProgrammableLogicandApplications(FPL’02),2002.[15]C.R.ClarkandD.E.Schimmel.ScalableParallelPatternMatchingonHighSpeedNetworks.In ProceedingsoftheTwelfthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2004(FCCM’04),2004.[16]M.Estlick,M.Leeser,J.Szymanski,andJ.Theiler.AlgorithmicTransformationsintheIm-plementationofK-meansClusteringonReconfigurableHardware.InProceedingsoftheNinthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2001(FCCM’01),2001. 19 [17]M.Gokhake,D.Dubois,A.Dubois,M.Boorman,S.Poole,andV.Hogsett.Granidt:Towards GigabitRateNetworkIntrusionDetection.InProceedingstheEleventhAnnualACM/SIGDAInternationalConferenceonField-ProgrammableLogicandApplications(FPL’03),2002.[18]B.L.Hutchings,R.Franklin,andD.Carver.AssistingNetworkIntrusionDetectionwith ReconfigurableHardware.InProceedingsoftheTenthAnnualField-ProgrammableCustomComputingMachines(FCCM’02),2002.[19]P.James-Roxby,G.Brebner,andD.Bemmann.Time-CriticalSoftwareDecelerationinan FCCM.InProceedingsoftheTwelfthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2004(FCCM’04),2004.[20]B.KernighanandS.Lin.AnEfficientHeuristicProcedureforPartitioningGraphs,1970.Bell SystemTech.[21]D.E.Knuth,J.Morris,andV.R.Pratt.FastPatternMatchinginStrings.InSIAMJournal onComputing,1977.[22]R.Krishnamurthy,S.Yalamanchili,K.Schwan,andR.West.ShareStreams:AScalableAr-chitectureandHardwareSupportforHigh-SpeedQoSPacketSchedulers.InProceedingsoftheTwelfthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2004(FCCM’04),2004.[23]H.T.KungandC.E.Leiserson.Systolicarrays(forVLSI).InSparseMatrixProceedings,1979.[24]S.Lau.TheSpinningCubeofPotentialDoom.http://www.nersc.gov/nusers/security/ TheSpinningCube.php.[25]D.-U.Lee,W.Luk,C.Wang,C.Jones,M.Smith,andJ.Villasenor.AFlexibleHardware EncoderforLow-DensityParity-CheckCodes.InProceedingsoftheTwelfthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2004(FCCM’04),2004.[26]J.Moscola,J.Lockwood,R.P.Loui,andM.Pachos.ImplementationofaContent-Scanning ModuleforanInternetFirewall.InProceedingsoftheEleventhAnnualIEEESymposiumonField-ProgrammableCustomComputingMachines(FCCM’03),2003.[27]D.Nguyen,J.Zambreno,andG.Memik.FlowMonitoringinHigh-SpeedNetworkswith 2DHashTables.InProceedingsofthe14thAnnualInternationalConverenceonField-ProgrammableLogicandApplications(FPL’04),2004.[28]Dept.oftheAirForce.CFWP:RapidProcessingofIntelligenceData, http://http://www2.eps.gov/spg/USAF/AFMC/AFRLRRS/Reference-Number-BAA-0%4-09-IFKA/SynopsisP.html. 2004. [29]M.QinandK.Hwang.FrequentEpisodeRulesforInternetAnomalyDetection.Proceedings oftheIEEEInternationalSymposiumonNetworkComputingandApplications,2004.[30]R.Sidhu,A.Mei,andV.K.Prasanna.StringMatchingonMulticontextFPGAsusingSelf-Reconfiguration.InProceedingsoftheSeventhAnnualACM/SIGDAInternationalSymposiumonFieldProgrammableGateArrays(FPGA’99),1999.[31]R.SidhuandV.K.Prasanna.FastRegularExpressionMatchingusingFPGAs.InProceedings oftheNinthAnnualIEEESymposiumonField-ProgrammableCustomComputingMachines(FCCM’01),2001.[32]T.F.SmithandM.S.Waterman.IdentificationofCommonMolecularSubsequences.Journal ofMolecularBiology,147:195–197,1981. 20 [33]B.So,M.W.Hall,andP.C.Diniz.UsingDataMiningtoDiscoverSignaturesinNetwork-based IntrusionDetection.InProceedingsoftheFirstInternationalConferenceonMachineLearningandCybernetics,2002.[34]Sourcefire.Snort:TheOpenSourceNetworkIntrusionDetectionSystem.http://www.snort. org,2003.[35]I.SourdisandD.Pnevmatikatos.Fast,Large-ScaleStringMatchfora10GbpsFPGA-Based NetworkIntrusionDetectionSystem.InProceedingstheEleventhAnnualACM/SIGDAInter-nationalConferenceonField-ProgrammableLogicandApplications(FPL’03),2003.[36]I.SourdisandD.Pnevmatikatos.AMethodologyfortheSynthesisofEfficientIntrusionDe-tectionSystemsonFPGAs.InProceedingsoftheTwelfthAnnualIEEESymposiumonFieldProgrammableCustomComputingMachines2004(FCCM’04),2004.[37]SRCComputers,Inc.http://www.srccomputers.com. [38]T.H.Cormen,C.E.Leiserson,andR.L.Rivest.IntroductiontoAlgorithms.MITPress,McGraw-Hill,1990.[39]C.Wolinski,M.Gokhale,andK.McCabe.AReconfigurableComputingFabric.InProceedings ofthe,2004.[40]Xilinx,Inc.http://www.xilinx.com. [41]Xilinx,Inc.http://www.xilinx.com/ise/advanced/forge_get.htm. [42]J.Zambreno,D.Nguyen,andA.Choudhary.ExploringArea/DelayTradeoffsinanAES FPGAImplementation.InProceedingsofthe14thAnnualInternationalConverenceonField-ProgrammableLogicandApplications(FPL’04),2004. 21 因篇幅问题不能全部显示,请点此查看更多更全内容