Dynamically Reconfigurable Architecture for Image Processor Applications

Alexandro M. S. Adário

Eduardo L. Roehe

Sergio Bampi

Institute for Informatics – Federal University at Porto Alegre
Av. Bento Gonçalves, 9500 – Porto Alegre, RS – Brazil

Cx. Postal 15064 – CEP: 91501-970

{adario, roehe, bampi}@inf.ufrgs.br

ABSTRACT

This work presents an overview of the principles that underlie the speed-up achievable by dynamic hardware reconfiguration, proposes a more precise taxonomy for the execution models of reconfigurable platforms, and demonstrates the advantage of dynamic reconfiguration in the new implementation of a neighborhood image processor, called DRIP. It achieves real-time performance, three times faster than its pipelined, non-reconfigurable version.

Keywords

Reconfigurable architecture, image processing, FPGA

1. INTRODUCTION

Advanced RISC microprocessors can solve complex computing tasks through a programming paradigm based on fixed hardware resources. For most computing tasks it is cheaper and faster to develop a program for a general-purpose processor (GPP) specifically to solve them. While GPPs are designed with this aim, focusing on performance and general functionality, the total cost of designing and fabricating RISC GPPs is increasing fast. These costs involve three parts:

a) Hardware costs: GPPs are larger and more complex than necessary for the execution of a specific task. Developing application-specific processors for highly specialized algorithms is warranted only for large-volume applications that may require high power efficiency at the expense of great hardware design cost;

b) Design costs: functional units that may be rarely used in a given application may be present in GPPs, and may consume a substantial part of the design effort;

c) Energy costs: too much power is spent on functional units or blocks not used during a large fraction of the processing time.

For specific applications, or for demanding requirements in terms of power, speed or cost, one may rely on either dedicated processors or reused core processors, which may be well suited to the application or optimized for a given set of performance


requirements. In the former case, only the necessary functional units, highly optimized for a specific range of problems, may be present, which results in unsurpassed power and area efficiency for the application-specific algorithm. Until recently, application-specific processors (ASPs) implemented in user-programmable devices were not feasible, but with increasing levels of FPGA integration (above 50K usable equivalent gates), as well as RISC cores and RAM merging into the reconfigurable arrays, the feasibility picture for user-configured ASPs has changed dramatically.

By tightly coupling a programmable device (e.g. an FPGA) to a GPP, we can exploit with higher efficiency the potential of the so-called reconfigurable architecture. This structure can also aggregate special on-chip macroblocks, such as a shared memory. The dynamic reconfiguration of the hardware has become a competitive alternative, in terms of performance, to a GPP software implementation, and it offers a significant time-to-market advantage over the conventional ASP approach.

Reconfigurable architectures allow the designer to create new functions and perform operations that would take too many cycles on a GPP. With a reconfigurable architecture, a GPP does not need to include most of the complex functional units often designed in, and considerable power and time can be saved. These units may be implemented on the programmable device when demanded by the application. Moreover, we can configure an application-specific subset of functional units from a larger set of units or a larger instruction set, and wire them up at execution time, increasing hardware density. Such density is the ratio between the sum of the resources used by all functional units that can be mapped and the resources available on the programmable device.
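As a restatement of this definition (the symbols are introduced here only for illustration), the hardware density can be written as

$$D_{hw} = \frac{\sum_{i=1}^{M} R_i}{R_{avail}}$$

where $R_i$ is the amount of programmable resources used by the $i$-th mappable function unit, $M$ is the number of units in the mappable set, and $R_{avail}$ is the total resource count of the device. A density above 1 means the mapped functionality exceeds what a single static configuration could hold at once.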

This work is organized as follows. Section 2 presents a classification of reconfigurable architectures based on their execution models and reconfiguration scheme. Section 3 shows some examples of implemented architectures. The neighborhood processor is explained in Section 4. The DRIP processor is introduced in Section 5. Section 6 discusses the results of the DRIP synthesis.

2. EXECUTION MODELS AND PROGRAMMABILITY

Page [10] reports five design strategies by which programs may be embedded in reconfigurable architectures: pure hardware, application-specific, sequential reuse, multiple simultaneous use, and on-demand usage. Each model exploits a different part of the cost-performance spectrum of implementations and is well suited for a specific range of applications.

In the pure hardware model, a given algorithm is converted into a single hardware description, which is loaded into the FPGA. This model makes no relevant contribution to reconfigurable architectures, since the configuration is fixed at design time. It can be implemented using conventional HDLs and the currently available synthesis tools.

PRISM [3] is an example of the application-specific microprocessor (ASMP) model. In this system, the algorithm is compiled into two parts (Figure 1.a): an abstract machine code and an abstract processor. In the next step, the two are optimized to produce a description of an ASP and the machine-code-level implementation of the algorithm.


Figure 1. Example of execution models

Very often an algorithm is too large to be implemented on the available devices, or the design is area-constrained for engineering or economic reasons. To overcome this constraint, the design is split into several parts, which are moved in and out of the devices, increasing the hardware density and producing a set of reconfiguration steps (Figure 1.b). This model is called sequential reuse.

If there is a large availability of programmable devices, many algorithms can be resident and execute simultaneously, interacting with various degrees of coupling (tight or loose) with the host processor. The multiple simultaneous use model (Figure 1.c) is less common and requires more area than sequential reuse, but it is certainly an interesting way to exploit reconfigurable computing.

The last model, on-demand usage (Figure 1.d), is very interesting for reconfigurable computing and can be applied to a wide range of applications. This model is suitable for real-time systems and for systems that have a large number of functions or operations which are not used concurrently, like the DISC (Dynamic Instruction Set Computer) implementation.
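As an illustration of the on-demand model, the sketch below shows a hypothetical instruction manager that loads a custom instruction's configuration only when it is first requested, in the spirit of DISC. The data structures and the load_bitstream stub are assumptions for illustration, not part of DISC or any other cited system.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical descriptor for one custom instruction's configuration. */
typedef struct {
    const char *bitstream;   /* configuration data identifier   */
    bool loaded;             /* currently resident in the FPGA? */
} custom_instr_t;

/* Placeholder for the device-specific reconfiguration call. */
static void load_bitstream(const char *bitstream)
{
    printf("reconfiguring FPGA region with %s\n", bitstream);
}

/* On-demand usage: reconfigure only when the requested custom
 * instruction is not already resident, then execute it.        */
static void execute_custom_instr(custom_instr_t *ci)
{
    if (!ci->loaded) {                   /* miss: pay the RTR overhead once */
        load_bitstream(ci->bitstream);
        ci->loaded = true;
    }
    /* ... issue the operation to the now-resident hardware ... */
}

int main(void)
{
    custom_instr_t edge_detect = { "edge_detect.cfg", false };
    execute_custom_instr(&edge_detect);  /* triggers reconfiguration   */
    execute_custom_instr(&edge_detect);  /* already resident: no cost  */
    return 0;
}
```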

We generalize the execution models presented by Page, looking at reconfigurability from the point of view of the reconfigurable architecture design. This classification divides the design models into three programmability classes, considering the number of configurations and the time at which reconfiguration takes place:

a) Static design (SD): the circuit has a single configuration, which is never changed, either before or after system reset.

The programmable device is fully programmed to perform only one functionality, which remains unchanged during the system lifetime. This class does not exploit the reconfiguration flexibility, taking advantage only of the implementation/prototyping facilities.

b) Statically reconfigurable design (SRD): the circuit has several configurations (N) and the reconfiguration occurs only at the end of each processing task. This can be classified as run-time reconfiguration, depending on the granularity of the tasks performed between two successive reconfigurations. In this way, the programmable devices are better used and the circuit can be partitioned, aiming at resource reusability. This class of architecture is called an SRA (statically reconfigurable architecture).

c) Dynamically reconfigurable design (DRD): the circuit also has N configurations, but the reconfiguration takes place at run time (RTR, Run-Time Reconfiguration). This kind of design uses the reconfigurable architecture more efficiently. The timing overhead associated with this RTR procedure has to be well characterized within the domain of the possible set of run-time configurations. The overall performance is determined by the overhead-to-computing ratio. The implementation may use partially programmable devices or a set of conventional programmable devices (while one processes, the others are reconfigured). The resulting architecture is called a DRA (dynamically reconfigurable architecture).

SRD and DRD run-time reconfiguration advantages depend largely on the specific algorithm and its partitioning into sizable-grain tasks. The reconfiguration overhead is heavily dependent on the FPGA microarchitecture, and it will be significantly decreased by FPGA + RISC core + SRAM integration within the same die, an area certainly open to recent innovations [7]. The SRD hardware will certainly show better performance when compared to a GPP software implementation, given the large time overhead incurred for reconfiguration in current commercial FPGAs. The DRD hardware will benefit the most from innovations in the fast-reconfiguration arena, while requiring significantly more effort in developing compiler optimizations. The main characteristics of all programmability classes are presented in Table 1.
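The overhead-to-computing ratio mentioned above can be made explicit with a simple first-order model (the symbols below are ours, not the paper's):

$$S = \frac{T_{sw}}{\sum_{k=1}^{N} T_{cfg,k} + T_{hw}}$$

where $T_{sw}$ is the pure-software execution time, $T_{hw}$ is the total execution time of the $N$ hardware configurations, and $T_{cfg,k}$ is the overhead of the $k$-th reconfiguration. RTR only pays off while the accumulated reconfiguration time stays small relative to $T_{hw}$.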

Design   Number of Configurations   Reconfiguration at
SD       1                          Design time
SRD      N                          End of task
DRD      N                          Execution checkpoint

Table 1. Summary of Programmability Classes

3. RECONFIGURABLE ARCHITECTURES IMPLEMENTATION

Several reconfigurable architectures were designed in the last decade, showing that this approach is feasible: DISC [13], PRISM [3], SPLASH [6], PAM [4] and Garp [7].

DISC is a processor that loads complex application-specific instructions as required by a program. It uses a National Semiconductor CLAy FPGA and is divided into two parts: a global controller and a custom-instruction space. Initially, a library of image processing instructions was created for DISC.

PRISM (Processor Reconfiguration through Instruction Set Metamorphosis) is a reconfigurable architecture for which specific tools have been developed such that, for each application, new processor instructions are synthesized. The tools of the PRISM environment use some concepts inherited from hardware/software codesign methods. Two prototypes, PRISM-I and PRISM-II, have been built using Xilinx XC3090 and XC4010 FPGAs, respectively.

SPLASH is a reconfigurable systolic array developed by the Supercomputing Research Center in 1988. The basic computing engine of SPLASH is the Xilinx XC3090 FPGA. The second version of SPLASH, Splash 2, is a more general-purpose reconfigurable processor array based on XC4010 FPGA modules.

PAM (Programmable Active Memories) is a project developed by DEC PRL and consists of an array of Xilinx FPGAs. With dynamic reconfiguration, it demonstrated the fastest implementation of RSA cryptography to that date [12].

Garp is a reconfigurable architecture that sets a trend of incorporating RISC cores with FPGA arrays. It combines a MIPS-II instruction-set compatible core with a reconfigurable array that may implement co-processor functions as a slave computational unit located on the same die as the processor. Garp simulation results have shown a 24x speed-up of the DES encryption algorithm over a software implementation on an UltraSparc 1/170. For an image dithering algorithm on a 640x480-pixel frame, the speed-up obtained by Garp was 9.4 times.

4. NEIGHBORHOOD PROCESSOR NP9

4.1 Array Processors

Array processors are a special class of parallel architecture, consisting of simple processors called cells or processor elements (PEs). This class of processors has wide application in problems with spatially defined data structures, such as mathematical modeling and digital image processing. The PEs often have [5]:

a) a 2D matrix layout;

b) operation in a bit-serial mode;

c) access to a local memory;

d) connection to their nearest neighbors;

e) a synchronization scheme to execute the same instruction at any given cycle.

A neighborhood processor is a special device that simulates an array processor. It processes an input image, generating a new image in which each output pixel is a function of its correspondent in the input image and of the nearest neighbors. Using a standard neighborhood (e.g. 3x3, 5x5 or 7x7 pixels), it scans the image line by line. NP9 [1] is based on the neighborhood processor architecture and is organized to process 3x3 (nine-pixel) neighborhoods. Its architecture was first proposed by Leite [9].
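For reference, a neighborhood operation of this kind can be stated in a few lines of C. This is a generic software restatement of the scan just described (border handling and function names are our own choices), not NP9 code:

```c
/* Generic 3x3 neighborhood pass: each output pixel is a function f()
 * of the corresponding input pixel and its eight nearest neighbors.
 * Border pixels are simply copied here; real systems pad or clamp.  */
typedef unsigned char pixel_t;

void neighborhood_pass(const pixel_t *in, pixel_t *out, int w, int h,
                       pixel_t (*f)(const pixel_t nbh[9]))
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                out[y * w + x] = in[y * w + x];      /* border handling */
                continue;
            }
            pixel_t nbh[9];
            int k = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    nbh[k++] = in[(y + dy) * w + (x + dx)];
            out[y * w + x] = f(nbh);   /* image function of the 3x3 window */
        }
    }
}
```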

4.2 The processor elements

The processor elements (PEs) of NP9 are functionally simple and can execute just two basic operations: addition (ADD), representing the class of linear algorithms, and maximum (MAX), representing the class of non-linear algorithms. Each PE has two inputs (pixels X1 and X2), two weights (W1 and W2) associated with those inputs, and one output S.
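A behavioral model of such a PE reduces to a one-line computation per operation. The sketch below is only a software restatement of Figure 2, anticipating the restricted weights (-1, 0, 1) described in Section 5.1:

```c
/* Behavioral model of an NP9/DRIP processor element (PE):
 * op selects ADD (linear class) or MAX (non-linear class);
 * w1 and w2 are restricted weights taken from {-1, 0, 1}.  */
typedef enum { PE_ADD, PE_MAX } pe_op_t;

int pe_eval(pe_op_t op, int x1, int w1, int x2, int w2)
{
    int a = w1 * x1;                          /* weighted input X1 */
    int b = w2 * x2;                          /* weighted input X2 */
    return (op == PE_ADD) ? (a + b)           /* S = a + b         */
                          : (a > b ? a : b);  /* S = max(a, b)     */
}
```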

Figure 2. Model of NP9 Processor Element (inputs X1, X2; weights W1, W2; function f; output S)

4.3 Data Flow Graph

The PE interconnection matrix of NP9 follows a data flow graph defined by a class of non-linear filters [11]. This class is widely used in digital image processing, and the kernel of its data structure is represented by a sorting algorithm. The data flow graph (Figure 3) is based on the odd-even transposition sorting algorithm [8]. The hardware implementation of this algorithm is straightforward. The structure defined achieves a good trade-off between complexity, parallelism, area cost and execution time.

Figure 3. NP9 Data Flow Graph (inputs X1..X9, outputs Y1..Y9)
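For context, the odd-even transposition network on nine elements that underlies Figure 3 can be sketched in C as follows. This is the textbook algorithm [8], not the NP9 netlist itself; each compare-exchange simply produces the minimum and maximum of a pair:

```c
/* Odd-even transposition sort on nine elements, the sorting network
 * underlying the NP9 data flow graph: nine passes of alternating
 * even/odd compare-exchange stages sort any 3x3 pixel window.      */
#define NP9_N 9

static void compare_exchange(int *a, int *b)
{
    if (*a > *b) { int t = *a; *a = *b; *b = t; }  /* emit min, max */
}

void odd_even_transposition_sort(int v[NP9_N])
{
    for (int pass = 0; pass < NP9_N; pass++) {
        int start = pass % 2;                      /* even or odd stage */
        for (int i = start; i + 1 < NP9_N; i += 2)
            compare_exchange(&v[i], &v[i + 1]);
    }
}
```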

4.4 General Structure of NP9

The general structure of NP9, shown at a high level of abstraction in Figure 4, has three basic components: the program register (PrgReg), the execution pipeline and the output mux. The execution pipeline corresponds to the data flow graph plus a stage register at each cell output. The external interface has nine input pixels (X1 to X9) and one serial output pixel (X_Out).

Figure 4. Structure of NP9 (signals Prg_In, Clk, Stat, Cell_Prg, Cell_W, Sel, X1..X9, X_Out; blocks PrgReg, Pipeline, Mux)

There are two possible operation states for NP9, indicated by the processor status signal (Stat): programming (PROG) or execution (EXEC). During the PROG stage, NP9 receives, through the programming channel, the data corresponding to the functions (Cell_Prg) to be executed by the cells, the input weights (Cell_W) and the output selector (Sel). The entire program is stored in a shift register (PrgReg). In the EXEC stage, the previously stored algorithm is executed, and the output (X_Out) is selected by a multiplexer.
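A minimal behavioral sketch of these two operation states is given below, assuming a one-bit serial programming channel and an illustrative program length; the field layout and widths are not taken from the actual NP9 implementation:

```c
#include <stdint.h>
#include <string.h>

#define PRG_BITS 64                 /* illustrative program length */

typedef enum { STAT_PROG, STAT_EXEC } np9_stat_t;

typedef struct {
    uint8_t prg_reg[PRG_BITS];      /* shift register holding the program */
    int     shifted;                /* program bits received so far       */
    np9_stat_t stat;                /* current operation state            */
} np9_model_t;

/* One clock tick: in PROG, shift a program bit in through Prg_In;
 * once the program is complete, switch to EXEC.                    */
void np9_clock(np9_model_t *p, int prg_in)
{
    if (p->stat == STAT_PROG) {
        memmove(&p->prg_reg[1], &p->prg_reg[0], PRG_BITS - 1);
        p->prg_reg[0] = (uint8_t)(prg_in & 1);
        if (++p->shifted == PRG_BITS)
            p->stat = STAT_EXEC;    /* stored program now drives the pipeline */
    } else {
        /* EXEC: the pipeline runs with the stored Cell_Prg/Cell_W/Sel values */
    }
}
```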

4.5 Applications

Considering the primitive functions of NP9 and the programming flexibility associated with the NP9 data flow graph, one can configure a large number of low-level image processing algorithms onto its structure. Some algorithms that can be implemented on this processor are the following [9]: linear (convolution), non-linear and hybrid filters; binary and gray-level morphological operations (dilation, erosion, thinning and thickening algorithms, morphological edge detectors, gradient, "hit or miss" operator, "top hat" operators); binary and gray-level geodesic operations (geodesic dilation and erosion, image reconstruction); etc.
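As one concrete example from this list, gray-level dilation with a flat 3x3 structuring element reduces to the maximum over the neighborhood, an operation built entirely from MAX primitives; erosion is obtained analogously by taking the minimum. The C kernel below is a generic restatement, not DRIP/NP9 code:

```c
/* Gray-level dilation with a flat 3x3 structuring element: the output
 * pixel is the maximum over the 3x3 neighborhood. A kernel of this
 * shape is exactly the per-window function used by a neighborhood pass. */
unsigned char dilate3x3(const unsigned char nbh[9])
{
    unsigned char m = nbh[0];
    for (int k = 1; k < 9; k++)
        if (nbh[k] > m)
            m = nbh[k];
    return m;
}
```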

4.6 Implementation

NP9 was completely modeled in VHDL, compiled with QuickHDL under the Mentor Graphics environment, and synthesized with AutoLogic II. After that, Max+Plus II processed the AutoLogic netlists, generating the NP9 architecture, implemented onto Flex10K FPGAs.

This final structure is a static design, and each algorithm generated for NP9 is implemented as a program. NP9 did not reach the desired performance for real-time processing, considering 256 gray-level digital images of 1,024 x 1,024 pixels at a rate of 30 frames/s.

In addition, the resource utilization was also inefficient: the processor used 6,526 logic elements in two FPGA devices (one Flex10K70 and one Flex10K100). A dynamically reconfigurable design approach could reduce the resource utilization, allowing a cheaper final board, and better performance was possible using reconfiguration. With this aim, DRIP was designed as a new reconfigurable architecture for digital image processing.

5. DRIP ARCHITECTURE

DRIP (Dynamically Reconfigurable Image Processor) is a reconfigurable architecture based on NP9. The DRIP design goal is to produce a digital image processing system using a dynamic reconfiguration approach (currently in an SRD scheme). Based on the previous NP9 design, we were expecting to obtain a minimum operating frequency of 32 MHz for real-time processing.
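The 32 MHz figure follows from the real-time requirement stated above for NP9 (1,024 x 1,024 pixels at 30 frames/s), assuming one output pixel is produced per clock cycle:

$$1024 \times 1024 \times 30 \approx 31.5 \times 10^{6}\ \text{pixels/s} \approx 32\ \text{MHz}$$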

5.1 Customization of the Processor Elements

The first step in the definition of the DRIP architecture was the customization of its PEs. As mentioned above, each PE can implement two functions (ADD and MAX). The current model of the NP9/DRIP PE operates with restricted weights, using only three values: -1, 0 and 1. Such parameters allow us to implement 8 distinct functions from a set of 18 possible configurations, not considering functions symmetrical in their inputs (e.g. max(X1*0, X2*1) is equivalent to max(X1*1, X2*0) with inputs exchanged). These functions are summarized in Table 2.

With this information, we have designed optimized components, each one representing a distinct function. Together, these functions form a basic function library, used for algorithm implementation during the synthesis step (Figure 5). According to our design, the worst-case delay and resource usage among the functions occurs for max(Xa, Xb), which defines the DRIP maximum clock.
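A direct software restatement of this library, useful as a reference model when checking a synthesized configuration, is sketched below; the enum and function names are ours, and the behavior simply follows the Function column of Table 2:

```c
/* Software model of the DRIP basic function library: the eight
 * customized PE functions obtained from the restricted weights
 * {-1, 0, 1}, following the Function column of Table 2.         */
typedef enum {
    F_ZERO,       /* add(0,0), max(0,0)                       */
    F_X,          /* add(0,X2), add(X1,0)                     */
    F_NEG_X,      /* add(0,-X2), add(-X1,0)                   */
    F_SUM,        /* Xa + Xb                                  */
    F_DIFF,       /* Xa - Xb                                  */
    F_POS_PART,   /* if positive(X) then X else 0             */
    F_NEG_PART,   /* if negative(X) then X else 0             */
    F_MAX         /* max(Xa, Xb): worst case, sets the clock  */
} drip_func_t;

int drip_func_eval(drip_func_t f, int xa, int xb)
{
    switch (f) {
    case F_ZERO:     return 0;
    case F_X:        return xa;
    case F_NEG_X:    return -xa;
    case F_SUM:      return xa + xb;
    case F_DIFF:     return xa - xb;
    case F_POS_PART: return xa > 0 ? xa : 0;
    case F_NEG_PART: return xa < 0 ? xa : 0;
    case F_MAX:      return xa > xb ? xa : xb;
    default:         return 0;
    }
}
```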

Original Operation            Function
add(0,0), max(0,0)            Zero
add(0,X2), add(X1,0)          X
add(0,-X2), add(-X1,0)        -X
add(X1,X2), add(-X1,-X2)      Xa + Xb
add(X1,-X2), add(-X1,X2)      Xa - Xb
max(0,X2), max(X1,0)          If positive(X) then X else 0
max(0,-X2), max(-X1,0)        If negative(X) then X else 0
other possibilities of max    Max(Xa, Xb)

Table 2. Customized functions of DRIP PE

5.2 Algorithm Mapping

Figure 5. Design Flow of DRIP system (blocks: Algorithm, Synthesis, Function Library, Configuration, Configuration Library, DRIP)

The typical design flow for configuring an algorithm onto DRIP is shown in Figure 5. First, an image processing algorithm is specified and simulated to verify its functionality. The specification can be done graphically, using an interface that represents the full DRIP/NP9 data flow graph. The algorithm can also be described using a high-level language like C and translated to an intermediate representation that matches the DRIP architecture.

After specification, the algorithm is compiled/synthesized using the previously designed function library, fully optimized to achieve better performance. During algorithm synthesis, some optimizations are performed to reduce the complexity of the functions and to eliminate redundant or unused PEs. The configuration bitstream that customizes the FPGA for the optimized algorithm implementation is stored in a configuration library. The reuse of the modules of this library is essential for the efficient implementation of several image processing functions. Once the configuration bitstream data is stored, it can be used repeatedly and over several modules of the entire architecture. The synthesis and optimization steps for the configuration library elements may be slow, but this is counterbalanced by the reuse of the configuration library to implement more complex algorithms. As in a software environment, the design and compilation times are sometimes large, but the massive use of the software can compensate for its design costs and development time. This situation occurs in low-level image processing, where common algorithms are employed several times in distinct applications, and the size of the images and the number of frames require a great number of iterations.
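The reuse mechanism can be pictured as a simple lookup-or-synthesize step, sketched below under the assumption of an in-memory library indexed by algorithm name; all names and the synthesize() stub are illustrative, and no real tool API is implied:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of configuration-library reuse: a synthesized bitstream is
 * stored once and looked up by algorithm name on later requests, so
 * the slow synthesis step is paid only on first use.                */
typedef struct {
    const char *name;            /* algorithm identifier           */
    const unsigned char *bits;   /* synthesized configuration data */
    size_t len;
} config_entry_t;

#define MAX_CONFIGS 64
static config_entry_t library[MAX_CONFIGS];
static int n_configs;

/* Stand-in for the synthesis and optimization step of Figure 5. */
static const unsigned char *synthesize(const char *algorithm, size_t *len)
{
    static const unsigned char dummy[] = { 0xAA, 0x55 };  /* placeholder bits */
    (void)algorithm;
    *len = sizeof dummy;
    return dummy;
}

const unsigned char *get_configuration(const char *algorithm, size_t *len)
{
    for (int i = 0; i < n_configs; i++)          /* reuse path */
        if (strcmp(library[i].name, algorithm) == 0) {
            *len = library[i].len;
            return library[i].bits;
        }
    config_entry_t *e = &library[n_configs++];   /* first use: synthesize, store */
    e->name = algorithm;
    e->bits = synthesize(algorithm, &e->len);
    *len = e->len;
    return e->bits;
}
```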

5.3 Digital Image Processing System

In a complete digital image processing system (Figure 6), DRIP is connected to a visualization and acquisition system through a neighborhood generator. The generator receives the image pixels in serial form, line by line, and sends a complete neighborhood to be processed by DRIP. The configuration interface connects DRIP to the host computer and is responsible for controlling its configuration.
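One way to picture the generator is as a pair of line buffers plus a 3x3 register window, as in the sketch below; widths and naming are illustrative and not taken from the DRIP board design:

```c
/* Sketch of a 3x3 neighborhood generator: pixels arrive serially,
 * line by line; two line buffers plus a 3x3 register window supply
 * one complete neighborhood per pixel clock once the window is full. */
#define LINE_W 1024                   /* image line width (illustrative) */

typedef struct {
    unsigned char line1[LINE_W];      /* pixels of line y-1 */
    unsigned char line2[LINE_W];      /* pixels of line y-2 */
    unsigned char win[3][3];          /* current 3x3 window */
    int x, y;
} nbh_gen_t;

/* Push one pixel; returns 1 when win[][] holds a valid 3x3 neighborhood. */
int nbh_push(nbh_gen_t *g, unsigned char pix)
{
    for (int r = 0; r < 3; r++) {     /* shift the window left by one column */
        g->win[r][0] = g->win[r][1];
        g->win[r][1] = g->win[r][2];
    }
    g->win[0][2] = g->line2[g->x];    /* pixel from line y-2 */
    g->win[1][2] = g->line1[g->x];    /* pixel from line y-1 */
    g->win[2][2] = pix;               /* pixel from line y   */

    g->line2[g->x] = g->line1[g->x];  /* age the line buffers */
    g->line1[g->x] = pix;

    int valid = (g->y >= 2 && g->x >= 2);
    if (++g->x == LINE_W) { g->x = 0; g->y++; }
    return valid;
}
```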

Figure 6. A Digital Image Processing System using DRIP (blocks: Visualization and Acquisition System, Neighborhood Generator, DRIP, Configuration Interface, Host Computer)

Currently, DRIP is being proposed as part of the entire system of Figure 6, in order to achieve a high-performance image processor relying on the dynamically reconfigurable features of DRIP. This goal is particularly challenging, since the neighborhood generator is a special memory with a very high bandwidth requirement. The best solution is to design a single chip containing DRIP and the generator, including a substantial frame buffer memory, similar to the Garp implementation approach.

5.4 Potential Applications

A dynamic image processing system that relies on DRIP flexibility can take advantage of a web-based environment for remote processing. This possibility is not so far from the current state of the art, given the increasing bandwidth available on the Internet. A web-based framework can allow the distribution of the design/execution tasks: algorithm synthesis, program library storage, image acquisition, algorithm execution and image visualization. Each of these tasks can be performed on distinct sites throughout the web. The motivation for using the reconfigurable hardware herein proposed via the Internet may be explained as architecture on demand.

A user may not have a machine with the minimum requirements to execute the compile/synthesis tasks, or the software may not be available for local use, only on a centralized algorithm server. A user system may have insufficient speed to perform more complex image transformations exclusively in software. A system including the DRIP processor can be a server for digital image processing, and remote users can submit images and receive the visualization in their client browsers. The feasibility of such an implementation depends on issues that are not addressed in this paper.

6. RESULTS

The customized functions of DRIP were compiled for Altera Flex10K FPGAs. We individually analyzed the performance of each function. After that, we chose the worst-case function in resource utilization and performance to implement a full DRIP data flow graph. The preliminary estimated performance (51.28 MHz) is 60% greater than the design target performance (32 MHz) and almost 200% faster than the fixed-hardware (non-reconfigurable) NP9 implementation (17.3 MHz). The comparison between these performances is presented in Figure 7.

Regarding resource usage, DRIP also achieved much better results. It used only 1,113 logic elements of a Flex10K30, 83% less area than the NP9 pipelined implementation. The reconfiguration time was also reduced due to the FPGA device used, as the SRAM-based Flex10K FPGA technology provides much faster reconfiguration. Unfortunately, this family does not yet support partial reconfiguration, as the Xilinx XC6200 family does.
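The figures quoted above are consistent with the reported measurements:

$$\frac{51.28}{32} \approx 1.60, \qquad \frac{51.28}{17.3} \approx 2.96, \qquad 1 - \frac{1113}{6526} \approx 0.83$$

i.e. about 60% above the target frequency, almost a 3x (≈200%) gain over the non-reconfigurable NP9, and an 83% reduction in logic elements.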

Figure 7. Graph of Processed Pixels (millions) x Frames/s, comparing NP9 (12.34 MHz), Partitioned NP9 (17.3 MHz), the target performance (32 MHz) and DRIP (51.28 MHz)

7. CONCLUSIONS AND FUTURE WORK

Dynamic reconfiguration can provide a considerable gain in area, performance, and cost for an application-specific system. We demonstrate such advantages with the implementation of a dynamically reconfigurable image processor. Compared with the equivalent statically configured (SD) design of NP9, an 80% area reduction and a 3-times-faster clock rate were achieved. The DRIP processor is thus suitable for high-performance real-time image processing.

The design flow emphasizes a fundamental requirement in reconfigurable design: the performance gain and a long application life cycle must overcome the design costs and development time. A relevant trend in FPGAs and reconfigurable architectures is also bound to change the picture considerably in favor of DRD, as the integration of ASP, GPP, and memory in one single chip becomes economically viable.

The future directions of the DRIP project will require an algorithm/program specification framework to allow a user-friendly interaction between the high-level abstraction and the processor architecture. The development of a system board with full support for dynamic reconfiguration of the processor and high-bandwidth memory availability is planned.

We are considering the implementation of a web-based framework for remote image processing. This framework will allow us to define a methodology for dynamic reconfiguration in a widely distributed environment. That could become a relevant market for reconfigurable computing, for the efficient supply of an architecture on demand. Hardware upgrading, maintenance, and adaptation could be performed from a remote host.

8. REFERENCES

[1] Adário, A. M. S.; Côrtes, M. L.; Leite, N. J. "A FPGA Implementation of a Neighborhood Processor for Digital Image Applications". In: 10th Brazilian Symposium on Integrated Circuit Design, Aug. 1997. Proceedings..., 1997, p. 125-134.

[2] ALTERA. Data Book. Altera Corporation, San Jose, California, 1996.

[3] Athanas, P.; Silverman, H. F. "Processor Reconfiguration Through Instruction Set Metamorphosis". IEEE Computer, Mar. 1993, p. 11-18.

[4] Bertin, P. et al. Introduction to Programmable Active Memories. Paris: Digital Equipment Corp., Paris Research Lab, June 1989. (PRL Report 3).

[5] Fountain, T. J. Processor Arrays: Architecture and Applications. Academic Press, London, 1987.

[6] Gokhale, M. et al. "Building and Using a Highly Parallel Programmable Logic Array". Computer, vol. 24, no. 1, Jan. 1991, p. 81-89.

[7] Hauser, J. R.; Wawrzynek, J. "Garp: A MIPS Processor with a Reconfigurable Coprocessor". In: IEEE Symposium on FPGAs for Custom Computing Machines, 1997. Proceedings..., p. 24-33.

[8] Knuth, D. E. The Art of Computer Programming. Reading, Massachusetts: Addison-Wesley, 1973.

[9] Leite, N. J.; Barros, M. A. "A Highly Reconfigurable Neighborhood Image Processor Based on Functional Programming". In: IEEE International Conference on Image Processing, Nov. 1994. Proceedings..., p. 659-663.

[10] Page, I. "Reconfigurable Processor Architectures". Microprocessors and Microsystems, May 1996. (Special Issue on Codesign).

[11] Pitas, I.; Venetsanopoulos, A. N. "A New Filter Structure for Implementation of Certain Classes of Image Processing Operations". IEEE Trans. on Circuits and Systems, vol. 35, no. 6, June 1988, p. 636-647.

[12] Shand, M.; Vuillemin, J. "Fast Implementations of RSA Cryptography". In: 11th Symposium on Computer Arithmetic, 1993, Los Alamitos, California. Proceedings..., p. 252-259.

[13] Wirthlin, M. J.; Hutchings, B. L. "A Dynamic Instruction Set Computer". In: IEEE Symposium on FPGAs for Custom Computing Machines, Apr. 1995. Proceedings..., p. 92-103.
