## INTRODUCTION ##
Stream processing, widely used in communications and digital signal processing applications, requires high throughput, which in most cases is achieved using ASIC designs. However, the lack of programmability is an issue, especially in space applications, which use on-board components with long life cycles that require application updates. An FPGA, on the other hand, allows reconfigurable hardware design at gate level, offering more flexibility than an ASIC at the expense of higher power consumption, more silicon area and a lower maximum clock frequency [2]. However, this fine granularity reduces FPGA performance because of the complexity of the programmable connections used to build logic blocks [3].
As a result, architectures are evolving towards hardware with reconfigurable capabilities that integrates modules to efficiently perform frequently used operations. The eXtreme Processing Platform (XPP) is the core of the High Performance Data Processors (HPDP) architecture [4]. The XPP allows runtime reconfiguration of a network of coarse-grained computation and storage elements. The algorithm's data-flow graph is implemented in configurations, in which each node is mapped to fundamental machine operations executed by a configurable ALU [5].
The present work aims to determine the effectiveness, portability and performance of an image processing algorithm in the HPDP architecture.
Space debris is a major issue for operational satellites and spacecraft. A Space Based Space Surveillance (SBSS) mission using an optical telescope has been proposed [1] in order to detect and track such debris. The required frame rate for the instrument calls for an efficient on-board image processing implementation. Such on-board data reduction can be implemented by detecting features of interest (debris, stars) while omitting the remaining image content (noise, space background).
## THE XPP AS THE CORE OF THE HPDP ##
The XPP Core consists of three types of Processing Array Elements (PAE): arithmetic logic unit PAE (ALU-PAE), random access memory with I/O PAE (RAM-PAE) and the Function PAE (FNC-PAE). ALU-PAE and RAM-PAE objects are arranged in a rectangular array, called the XPP Data-flow Array [6].
For the implementation of the feature detection algorithm, the XPP-III 40.16.2 core is used, consisting of 40 ALU-PAE objects arranged in a 5x8 array, 16 RAM-PAE objects and two FNC-PAE objects. Airbus DS selected the XPP core for the HPDP project because, among other reasons, it is available as HDL source code. This enables implementation in STM 65 nm semiconductor technology using a radiation-hardened library, which makes the resulting HPDP chip suitable for operation in all Earth orbits and beyond. Development of the chip is ongoing, and the first prototypes are expected in the first half of 2016.
## MAPPING THE SPACE DEBRIS DETECTION DATA-FLOW GRAPH ##
The boundary tensor has been chosen as the feature detection algorithm. It is constructed by combining the results of applying a set of polar separable filters to the input image [7]. The convolution process accounts for 80% of the required data processing. The proposed implementation exploits properties of the filter kernels and simplifies the data representation in order to reduce the number of operations, the use of XPP array resources and the cycle count of data transactions with the system's main memory:
- The reference design [8] requires floating-point arithmetic. In this work, however, convolution is implemented using fixed-point arithmetic in order to reduce hardware resource use and guarantee a predictable execution time of operations, which is critical for real-time applications.
- A trade-off between accuracy and performance is possible: if the least significant bit (LSB) of the input pixels is truncated, all computations fit into 16 bits and the implementation only has to transfer half the data volume, at the expense of introducing an error in the detection result.
- Symmetry in a kernel is advantageous for the implementation because it reduces the number of multiplications between kernel elements and pixels by as much as half.
- Because the kernels used are derived from normalised Gaussian functions, it can be shown that no overflow occurs in the convolution operations.
Finally, the boundary tensor calculation is performed only once, at the end of the algorithm, and its complete implementation fits in a single XPP array configuration. Consequently, no intermediate values have to be stored temporarily in the system memory and streamed back to the XPP array for further processing, so the calculations are carried out using the full bit width.
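The effect of kernel symmetry and of fixed-point coefficients can be illustrated with a minimal scalar sketch. It is not the XPP data-flow configuration: the function names, the 14-bit fractional scaling (`FRAC_BITS`), the 8-bit pixel range and the two 5-tap kernels (a binomial approximation of a Gaussian and a simple odd-symmetric kernel) are all assumptions chosen for illustration only.

```python
FRAC_BITS = 14  # assumed fixed-point scaling of the kernel taps (hypothetical)


def quantise(kernel, frac_bits=FRAC_BITS):
    """Scale floating-point taps to integers with frac_bits fractional bits."""
    return [round(k * (1 << frac_bits)) for k in kernel]


def conv_direct(pixels, taps):
    """Reference 1-D filtering: one multiplication per tap and output sample."""
    n = len(taps)
    return [sum(taps[j] * pixels[i + j] for j in range(n))
            for i in range(len(pixels) - n + 1)]


def conv_folded(pixels, taps, odd=False):
    """Same result for a symmetric kernel with roughly half the multiplications.

    Even symmetry: taps[j] == taps[n-1-j]; odd symmetry: taps[j] == -taps[n-1-j].
    """
    n = len(taps)
    half = n // 2
    sign = -1 if odd else 1
    out = []
    for i in range(len(pixels) - n + 1):
        acc = 0
        for j in range(half):
            # Combine the two mirrored pixels first, then multiply once.
            acc += taps[j] * (pixels[i + j] + sign * pixels[i + n - 1 - j])
        if n % 2 == 1:
            acc += taps[half] * pixels[i + half]  # centre tap (zero if odd)
        out.append(acc)
    return out


def worst_case_magnitude(taps, max_pixel, frac_bits=FRAC_BITS):
    """Upper bound on |output| after rescaling: max_pixel * sum(|taps|)."""
    return (sum(abs(t) for t in taps) * max_pixel) >> frac_bits


# Illustrative kernels and pixel row, not the filters or data used in this work.
EVEN = quantise([1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16])
ODD = quantise([-1 / 8, -2 / 8, 0.0, 2 / 8, 1 / 8])
ROW = [12, 200, 255, 0, 37, 90, 255, 255, 3]

print(conv_folded(ROW, EVEN) == conv_direct(ROW, EVEN))
print(conv_folded(ROW, ODD, odd=True) == conv_direct(ROW, ODD))
print(worst_case_magnitude(EVEN, 255), worst_case_magnitude(ODD, 255))
```

For these 5-tap kernels the folded form needs three multiplications per output sample instead of five. The bound also shows why normalisation matters: when the absolute values of the taps sum to at most one, the rescaled output cannot exceed the largest input pixel, so the output stays within the input word length.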
## RESULTS ##
The runtime estimates are derived from a cycle-accurate simulation of the XPP array. The maximum average throughput is 3.98 Bytes/cycle, achieved by the configurations that compute convolution with odd-symmetry kernels. The expected HPDP hardware specification includes, among other elements, a single-port SRAM achieving 800 MBytes/s and an XPP array running at a 200 MHz clock. At this clock rate, the minimum bandwidth needed to supply a continuous data stream to the XPP array and store the results back is 1592 MBytes/s. The performance of the algorithm on the specified HPDP hardware is therefore limited by the memory speed. Based on the number of read and write operations needed for the complete algorithm using sub-image processing, the estimated execution time on the expected HPDP hardware is 734 ms for a 16-bit input image of 2048x2048 pixels.
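The bandwidth figure above can be reproduced with a short calculation. The sketch below assumes that the factor of two comes from one input stream and one output stream running at the same rate, and that 1 MByte equals 10^6 bytes; both are interpretations of the text, and the variable and function names are hypothetical.

```python
THROUGHPUT_BYTES_PER_CYCLE = 3.98  # peak average throughput reported in the text
XPP_CLOCK_HZ = 200e6               # XPP array clock from the text
SRAM_MBYTES_PER_S = 800.0          # single-port SRAM bandwidth from the text


def stream_rate_mbytes_per_s(bytes_per_cycle, clock_hz):
    """Data rate of one stream in MBytes/s (1 MByte = 1e6 bytes assumed)."""
    return bytes_per_cycle * clock_hz / 1e6


one_way_mbytes = stream_rate_mbytes_per_s(THROUGHPUT_BYTES_PER_CYCLE, XPP_CLOCK_HZ)

# Assumption: results are written back at the same rate as the input is read.
required_mbytes = 2 * one_way_mbytes

# The array is memory bound if it can consume data faster than the SRAM delivers.
memory_bound = required_mbytes > SRAM_MBYTES_PER_S

print(f"one-way stream : {one_way_mbytes:.0f} MBytes/s")
print(f"read + write   : {required_mbytes:.0f} MBytes/s")
print(f"ratio to SRAM  : {required_mbytes / SRAM_MBYTES_PER_S:.2f}")
print(f"memory bound   : {memory_bound}")
```

Under these assumptions the XPP array could sustain roughly twice the bandwidth that the single-port SRAM provides, which is consistent with the memory being the limiting factor.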
In terms of effectiveness, each streak detected in the HPDP simulation contains approximately 10% fewer detected pixels than in the reference implementation, as shown in the following image for an input containing a streak with an SNR of 7.19 dB. The error is negligible because the detection information per object can be used to store the full streak pixel values, so that no accuracy in position and brightness is lost in a later processing step on the ground.
![Comparison between reference and HPDP implementation.][1]
## CONCLUSIONS ##
It has been shown that the boundary tensor algorithm can be mapped to a data-flow graph and that simple control flow is required only for filter kernel update, border replication and pipeline cleaning tasks. The XPP array is therefore appropriate for its implementation because it makes it possible to exploit pipeline parallelism in the multiplication and addition operations of the convolution, as well as task parallelism, since four consecutive streams are used to compute the convolution of four pixels simultaneously without data dependencies. The implementation reaches 4.7 GOp/s on average for 16-bit fixed-point operations. The utilisation of 99% of the XPP array computation elements (e.g. ALU-PAE) and the use of the maximum transfer mode of the 4D-DMA show that this implementation takes advantage of all the capabilities of the architecture.
Finally, it has been demonstrated that LSB truncation is an effective way to meet the real-time requirement: the implementation runs twice as fast, while the detection error amounts to a loss of only 10% of high-detection pixels.
## REFERENCES ##
[1] Utzmann, J., Wagner, A., Silha, J., Schildknecht, T., Willemsen, P., Teston, F., Flohrer, T., Space-Based Space Surveillance and Tracking Demonstrator: Mission and System Design. 65th International Astronautical Congress, Toronto, Canada (2014).
[2] Bailey, D., Design for Embedded Image Processing on FPGAs. John Wiley & Sons (2011).
[3] Bobda, C., Introduction to Reconfigurable Computing: Architectures, Algorithms, and Applications. Springer Netherlands (2007).
[4] Syed, M., Acher, G., Helfers, T., A High Performance Reliable Dataflow Based Processor for Space Applications. In: Proceedings of the ACM International Conference on Computing Frontiers, 1-4, ACM, New York, USA (2013).
[5] Schüler, E., Weinhardt, M., XPP-III: Reconfigurable Processor Core. In: Dynamic System Reconfiguration in Heterogeneous Platforms: The MORPHEUS Approach, Chap. 6, Springer Netherlands (2009).
[6] PACT XPP Technologies AG, XPP-III Processor Overview White Paper. Germany (2006).
[7] Köthe, U., Integrated edge and junction detection with the boundary tensor. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, 424-431 (2003).
[8] VIGRA Homepage, Heidelberg Collaboratory for Image Processing. http://ukoethe.github.io/vigra/
[1]: http://s15.postimg.org/43vmmj1cb/Results.png