12–16 Jun 2016
Gothenburg, Sweden
Europe/Amsterdam timezone

RC64: High Performance Rad-Hard Manycore DSP

15 Jun 2016, 15:15
30m
Gothenburg, Sweden

DSP Day: Rad-hard DSP chips and boards

Session 1: Rad-Hard DSP Chips

Speaker

Prof. Ran Ginosar (Ramon Chips)

Description

The RC64 project has been inspired by the ESA NG-DSP roadmap and by the plans of the European space industry. In addition to very high performance DSP capabilities, RC64 offers ESA-supported SpaceFibre links, interfaces to European space-qualified ADCs and DACs, and a variety of other peripherals useful in European on-board DSP. RC64 is supported by three European FP7/Horizon2020 projects (QI2S: real-time material identification on RC64; MacSpace: signal processing on RC64; and S3NET: formation flight of multiple small satellites) in consortia with several European companies and academia. RC64 will be ITAR-free.

RC64 is designed as a rad-hard, high-performance many-core signal processor comprising 64 CEVA X1643 DSP cores, 4 MBytes of on-chip shared memory, telecom FEC accelerators for DVB-S2X/RCS2 modems and high-bandwidth I/O. Two packaging options will be offered: a hermetically sealed CCGA-624 (addressing ESCC9000 qualification) and a PBGA-624. It is designed for a variety of space applications.

RC64 includes twelve integrated SpaceFibre full-duplex high-speed serial links (HSSL) using on-chip 6.25 Gbps SERDES interfaces, for a combined data rate of up to 120 Gbps. Four links also double as SRIO. The HSSLs enable efficient connectivity among multiple RC64 chips as well as FPGAs, ASICs and future ADCs and DACs. The DDR2/3 SDRAM interfaces include Reed-Solomon ECC to protect against SDRAM SEFI and SEE. The 32+16 bit wide DDR2/3 interface supports up to 25 Gbps throughput. Other I/O interfaces in RC64 include two SpaceWire links for control and four for instrument data, parallel LVDS interfaces for ADC and DAC connectivity, and an ECC-protected interface to ten 8-bit flash memories.

The on-chip shared memory system of RC64 complements the write-through data cache, the instruction cache, and the private store of each core.
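
The 120 Gbps aggregate figure quoted for the twelve serial links can be cross-checked with a little arithmetic. The sketch below assumes 8b/10b line coding (8 payload bits per 10 line bits) and counts both directions of each full-duplex link; neither assumption is stated in the abstract, and the function name is illustrative only.

```python
# Back-of-envelope check of the aggregate serial-link rate.
# ASSUMPTION (not stated in the abstract): 8b/10b line coding, and the
# quoted total counts both directions of every full-duplex link.

LINKS = 12                   # SpaceFibre links (from the abstract)
LINE_RATE_GBPS = 6.25        # SERDES line rate per link (from the abstract)
CODING_EFFICIENCY = 8 / 10   # assumed 8b/10b coding efficiency
DIRECTIONS = 2               # full duplex: transmit plus receive


def aggregate_link_rate_gbps(links=LINKS, line_rate=LINE_RATE_GBPS,
                             efficiency=CODING_EFFICIENCY,
                             directions=DIRECTIONS):
    """Return the combined payload rate of all links in Gbps."""
    return links * line_rate * efficiency * directions


if __name__ == "__main__":
    print(f"aggregate link rate: {aggregate_link_rate_gbps():.1f} Gbps")
```

Under these assumptions each link carries 5 Gbps of payload per direction, which reproduces the quoted total.
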
To support the unique task-oriented programming (TOP) model, the 64 cores access the single shared memory through a high-throughput, low-latency multistage interconnection network with 64-by-256 ports, enabling simultaneous access of all processors to the shared memory with very few conflicts. Thanks to the caches, access to shared memory happens either for fetching a complete cache line (the interfaces and the interconnection network are optimized for transferring complete cache lines rather than individual words) or for writing a single word, due to the write-through mechanism. While write-through may result in a higher traffic rate to memory than write-back, it eliminates the need for complex inter-core cache coordination mechanisms such as snooping, locking, coherency checking and directories. Instead, the programming model minimizes memory conflicts and prevents software from relying on shared memory synchronization. The on-chip 4 MByte shared memory acts as a local-store memory. Access to off-chip DDR2/3 memory is facilitated by software-controlled DMA. This approach simplifies software development and is found to be very useful for DSP applications, which favor streaming over cache-based access to memory. Most DSP applications are implemented without resorting to external DDR memory at all.

A hardware scheduler assigns tasks to processors. Each processor executes its task from its cache storage, accessing shared memory only when needed. When task execution is completed, the processor notifies the scheduler, which can subsequently assign a new task to that processor.

RC64 can operate on its own, booting its code from a flash device. It is better utilized with an attached control processor, such as the dual-core LEON GR712RC, which controls RC64 via SpW RMAP ports. A single external processor may control multiple RC64 chips, chained via a SpW ring. RC64 contains several FDIR capabilities, such as tracing, error monitoring and full access to internal state.
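
The handshake between the scheduler and the cores described above (assign a task to an idle core, run it to completion, notify the scheduler, assign the next task) can be modelled in a few lines. This is a software simulation under simplifying assumptions, not a description of the RC64 hardware: tasks are independent, run times are known in advance, and the name `run_tasks` is hypothetical.

```python
import heapq
from collections import deque


def run_tasks(durations, num_cores):
    """Simulate non-preemptive dispatch of independent tasks to idle cores.

    durations: list of task run times in arbitrary time units.
    Returns (makespan, assignment), where assignment[i] is the core that
    executed task i.
    """
    if num_cores < 1:
        raise ValueError("at least one core is required")

    ready = deque(range(len(durations)))   # tasks waiting for dispatch
    idle = deque(range(num_cores))         # cores with nothing to do
    running = []                           # min-heap of (finish_time, core)
    assignment = [None] * len(durations)
    now = 0

    while ready or running:
        # Scheduler side: hand ready tasks to idle cores.
        while ready and idle:
            task = ready.popleft()
            core = idle.popleft()
            assignment[task] = core
            heapq.heappush(running, (now + durations[task], core))
        # Core side: the earliest-finishing core notifies the scheduler
        # and becomes available for a new task.
        now, core = heapq.heappop(running)
        idle.append(core)

    return now, assignment


if __name__ == "__main__":
    makespan, assignment = run_tasks([3, 1, 2, 2], num_cores=2)
    print(makespan, assignment)
```

Because a core is never interrupted once it has a task, the only scheduling decisions occur at completion notifications, which keeps the model (and, plausibly, a hardware implementation) simple.
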
The control processor has full access to RC64 and its FDIR features.

No operating system is used on RC64. A run-time executive manages boot and FDIR, handles all DMA-based I/O, supports task start/stop and task control by the hardware scheduler, handles all error detection and correction, communicates with the control processor, and provides networking to other RC64 chips and other devices. For instance, the run-time executive performs the following sequence of actions upon input of a data item: (1) the DMA controller is programmed to receive a block of data from input and store it into shared memory, (2) the DMA controller issues an interrupt, (3) an interrupt handling routine is preemptively invoked on one of the cores, (4) the interrupt handling routine enqueues a descriptor (containing a pointer to the received data block) into an input control queue and generates a signal (called “software event”) for the scheduler, (5) the hardware scheduler, triggered by that signal, enables a task waiting for that data and enqueues it into the queue of tasks that are ready for execution, (6) eventually, the hardware scheduler dispatches that task to some processor, (7) the task dequeues the descriptor from the input control queue and consumes the data.

The PRAM-like programming model of RC64 is based on non-preemptive execution of multiple sequential tasks. The programmer defines the tasks in sequential C code, and defines their dependencies and priorities in a (directed) task graph. Tasks are executed by cores and the task graph is ‘executed’ by the hardware scheduler. In this shared-memory model, concurrent tasks do not communicate. Concurrent tasks may share read-only data but they cannot share data that is written by any one of them. Concurrent tasks do not necessarily execute at the same time: they may execute together or in any order, as determined by the scheduler.
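
To make the task-graph idea concrete, the sketch below computes one legal dispatch order for a small graph: a task becomes ready once all of its predecessors have completed, and among ready tasks the one with the best priority is dispatched first. It is an illustrative software model only. The task names, the convention that a lower number means higher priority, and the single-core serialization are all assumptions, not details from the abstract.

```python
import heapq


def dispatch_order(deps, priority):
    """Return one dispatch order that respects dependencies and priorities.

    deps:     dict mapping each task to a list of its predecessor tasks
              (every predecessor must itself be a key of deps).
    priority: dict mapping each task to an int; lower values go first
              (an assumed convention).
    """
    pending = {task: len(set(preds)) for task, preds in deps.items()}
    successors = {task: [] for task in deps}
    for task, preds in deps.items():
        for pred in set(preds):
            successors[pred].append(task)

    # Tasks without predecessors are ready immediately.
    ready = [(priority[t], t) for t, n in pending.items() if n == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, task = heapq.heappop(ready)
        order.append(task)              # runs to completion, non-preemptively
        for succ in successors[task]:
            pending[succ] -= 1
            if pending[succ] == 0:      # last predecessor finished: enable it
                heapq.heappush(ready, (priority[succ], succ))

    if len(order) != len(deps):
        raise ValueError("task graph contains a cycle")
    return order


if __name__ == "__main__":
    deps = {"acquire": [], "fft": ["acquire"],
            "filter": ["acquire"], "detect": ["fft", "filter"]}
    priority = {"acquire": 0, "fft": 1, "filter": 2, "detect": 0}
    print(dispatch_order(deps, priority))
```

In this example `fft` and `filter` are concurrent tasks: neither depends on the other, so swapping their priorities changes the order without affecting correctness, provided they do not write-share data.
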
Some tasks, typically amenable to independent data parallelism, may be duplicable, accompanied by a quota that determines the number of instances that should be executed. All instances of the same duplicable task are mutually independent (they do not write-share any data) and concurrent. These instances are distinguishable from each other by their instance number. Ideally, their execution time is short (fine granularity).

Special pipeline techniques are available for multi-stage signal processing of streams of continuous data, assuring very high core utilization and processing that employs the on-chip shared memory and avoids time-consuming access to external DDR memory.

A set of tools is being developed to help write software for RC64. The tool chain encompasses CEVA tools for the individual core (compiler, assembler, linker and a set of DSP libraries) and contains the following enhancements for manycore programming and for RC64: a task compiler (converting task graphs to scheduler tables), a manycore emulator (for developing parallel applications on standard workstations), a manycore cycle-accurate simulator and debugger, a tracer and event recorder, a parallel program profiler, and a set of parallel DSP libraries.

When a single RC64 is not sufficiently powerful for the application, multiple RC64 chips can be joined together. The multiple RC64 chips are interconnected with high-speed serial links using SpaceFibre. A networking software layer in the run-time executive facilitates easy and virtualized communications among the many chips. RC64 has been designed for integration with tens or hundreds of other RC64 chips, enabling very powerful digital signal processing in space.

RC64 will be implemented in 65 nm CMOS. It will dissipate a maximum of 10 W when all 64 DSP cores are active at 300 MHz and all 12 SpaceFibre links are transmitting. Power scales down in proportion to the number of active cores, active outputs and clock frequency.
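
The duplicable-task idea described above can be sketched as a data-parallel kernel in which each instance uses its instance number and the quota to pick a disjoint slice of the data. The code is a plain software model under stated assumptions: the function names, the contiguous slicing scheme and the simple gain kernel are hypothetical and are not taken from the RC64 tool chain.

```python
def instance_slice(instance, quota, length):
    """Return the [start, stop) index range owned by one instance.

    The data is split into `quota` contiguous slices whose sizes differ
    by at most one element.
    """
    base, extra = divmod(length, quota)
    start = instance * base + min(instance, extra)
    stop = start + base + (1 if instance < extra else 0)
    return start, stop


def scale_task(instance, quota, src, dst, gain):
    """One instance of a duplicable task: scale its own slice of src."""
    start, stop = instance_slice(instance, quota, len(src))
    for i in range(start, stop):
        dst[i] = gain * src[i]   # no other instance writes dst[i]


def run_duplicable(quota, src, gain, order=None):
    """Run all instances, in any order; the result must not depend on it."""
    dst = [None] * len(src)
    for instance in (order if order is not None else range(quota)):
        scale_task(instance, quota, src, dst, gain)
    return dst


if __name__ == "__main__":
    data = list(range(10))
    print(run_duplicable(4, data, 2, order=[3, 1, 0, 2]))
```

Because the slices never overlap, instances share `src` read-only and never write-share `dst`, so any execution order (or true parallel execution) yields the same result.
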
RC64 is designed for operation at 300 MHz and will achieve 38 GFLOPS (single precision) and 76 GMAC (16-bit). The 12 high-speed serial links offer a total bandwidth of 120 Gbps. Additional high bandwidth is enabled for memories (a 25 Gbps DDR3 interface, 32 bits wide at 800 Mword/s, with an additional 16 bits for ECC) and for high-performance ADCs and DACs (38 Gbps over 48 LVDS channels of 800 Mbps each).

RC64 is implemented using RadSafe™ rad-hard-by-design (RHBD) technology and library. RadSafe™ is designed for a wide range of space missions, enabling TID tolerance to 300 kRad(Si), no latchup and a very low SEU rate. All on-chip memories are protected by various means and varying levels of error correction and detection. Special protection is designed for registers that hold data for extended time, such as configuration registers.

RC64 implements extensive means for fault detection, isolation and recovery (FDIR). An external host can reset, boot and scrub the device through dual RMAP SpaceWire ports. RC64 contains numerous error counters and monitors that collect and report error statistics. Trace buffers, allocated in shared memory as desired, enable rollback and analysis, in addition to helping with debugging. Faulty sub-systems may be shut down, and the scheduler is designed to operate with partial configurations.

![abstract with figures][1]

[1]: http://www.ramon-chips.com/papers/RC64HighPerformanceRadHardManycoreDSPDay2016.pdf
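
The headline performance and bandwidth figures in the description can be reproduced from the stated clock rate, core count and interface widths. The sketch below is a consistency check only. The per-core rates of 2 floating-point operations and 4 16-bit MACs per cycle are assumptions inferred from the quoted totals; the abstract does not state them.

```python
# Consistency check of the quoted peak figures.
CORES = 64                  # DSP cores (from the abstract)
CLOCK_HZ = 300e6            # 300 MHz (from the abstract)
FLOPS_PER_CORE_CYCLE = 2    # ASSUMPTION inferred from the 38 GFLOPS figure
MACS_PER_CORE_CYCLE = 4     # ASSUMPTION inferred from the 76 GMAC figure


def peak_gflops():
    """Peak single-precision rate in GFLOPS under the assumptions above."""
    return CORES * CLOCK_HZ * FLOPS_PER_CORE_CYCLE / 1e9


def peak_gmacs():
    """Peak 16-bit multiply-accumulate rate in GMAC/s."""
    return CORES * CLOCK_HZ * MACS_PER_CORE_CYCLE / 1e9


def ddr_gbps(width_bits=32, mwords_per_s=800):
    """DDR payload bandwidth in Gbps (ECC bits excluded)."""
    return width_bits * mwords_per_s / 1000


def lvds_gbps(channels=48, mbps_per_channel=800):
    """Aggregate parallel LVDS bandwidth in Gbps."""
    return channels * mbps_per_channel / 1000


if __name__ == "__main__":
    print(f"{peak_gflops():.1f} GFLOPS, {peak_gmacs():.1f} GMAC/s")
    print(f"DDR {ddr_gbps():.1f} Gbps, LVDS {lvds_gbps():.1f} Gbps")
```

Under these assumptions the computed values are 38.4 GFLOPS, 76.8 GMAC/s, 25.6 Gbps and 38.4 Gbps, which the description appears to round down to 38, 76, 25 and 38.
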

Primary author

Prof. Ran Ginosar (Ramon Chips)

Co-authors

Mr Fredy Lange (Ramon Chips)

Mr Peleg Aviely (Ramon Chips)

Mr Tsvika Israeli (Ramon Chips)
