Scalable FPGA fabric for parallelising 2D-surface trajectory cost calculations: Design and Evaluation of Application and Hardware

One way to simplify the two-dimensional trajectory cost computation is to partition the 2D domain (i.e. the map) into a grid of unit squares, called map segments, and to approximate the cost functions by constants within each segment. The trajectory is likewise replaced by a piecewise-linear approximation. The contribution of each map segment is then accumulated from the constant cost functions of that segment and the length (and possibly direction) of the trajectory within it, both of which are easy to compute because of the piecewise-linear approximation. In hardware, the map segments map naturally onto a 2D array of processing nodes connected by a network-on-chip (NoC). Each node holds the cost data for its map segment, computes its local cost contribution, adds it to a data field of a packet representing a trajectory, and passes the packet on to a neighbor, so that the packet traverses a path in the NoC that matches the trajectory it represents. If the packet starts its journey with a zero data field, then once the final processing node has added its contribution, the field contains the cost of that trajectory. This architecture is scalable and parallelizes the computation, but it has drawbacks. Because communication between nodes must occur in all possible directions (to model all possible directions of the trajectory), deadlocks are a real possibility. Probable deadlocks can be detected as a lack of progress within a timeout interval and resolved by dropping a waiting packet, but the packet drops must be communicated to the external application. An auxiliary low-bandwidth NoC, called the injection-ejection network (IENW), is planned for this purpose, alongside the main network, called the computation network (CNW).
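The grid-based accumulation described above can be illustrated in software. The following is a minimal sketch, not the thesis implementation: the function name `trajectory_cost`, the `cost_map[row][col]` layout, and the choice of a single direction-independent cost per unit length in each segment are all assumptions made for illustration. Each straight piece of the trajectory is cut at the grid lines it crosses, and every resulting sub-piece contributes its length times the constant cost of the map segment it lies in, which mirrors what each processing node would add to the packet's data field.

```python
import math

def trajectory_cost(cost_map, waypoints):
    """Cost of a piecewise-linear trajectory over a grid of unit squares.

    Assumed (hypothetical) layout: cost_map[row][col] is the constant cost
    per unit length in the square col <= x < col + 1, row <= y < row + 1.
    waypoints is a list of (x, y) points inside the map.
    """
    total = 0.0  # plays the role of the packet's data field, starting at zero
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        if seg_len == 0.0:
            continue
        # Collect the parameters t in (0, 1) at which this straight piece
        # crosses a vertical or horizontal grid line.
        cuts = {0.0, 1.0}
        for start, delta in ((x0, dx), (y0, dy)):
            if delta != 0.0:
                lo, hi = sorted((start, start + delta))
                for k in range(math.floor(lo) + 1, math.ceil(hi)):
                    cuts.add((k - start) / delta)
        ts = sorted(cuts)
        # Each sub-piece lies in exactly one map segment; use its midpoint
        # to identify the segment and add that segment's contribution.
        for ta, tb in zip(ts, ts[1:]):
            tm = 0.5 * (ta + tb)
            col = math.floor(x0 + tm * dx)
            row = math.floor(y0 + tm * dy)
            total += cost_map[row][col] * (tb - ta) * seg_len
    return total
```

For example, on a 2x2 map with costs `[[1.0, 2.0], [3.0, 4.0]]`, the L-shaped trajectory `[(0.5, 0.5), (1.5, 0.5), (1.5, 1.5)]` spends half a unit in the segment with cost 1, a full unit in the segment with cost 2, and half a unit in the segment with cost 4. A direction-dependent cost, which the text allows for, would replace the single lookup with a function of the segment and the direction `(dx, dy)`.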
The IENW is also designed to carry packets to the correct start points in the processing array and out from the end points, reducing the load on the CNW. Another problem is that the size of the hardware processing array is tied to the map divisions, which makes reuse of the hardware difficult. It may also be hard for applications to exploit the hardware optimally when it is too highly parallel, because the application then has to produce packets at a high throughput. These problems are solved by letting more than one map segment be mapped onto the same processing node, using a structured approach introduced in Section 1.2.1.

In the previous semester, a SystemC design was developed to model this hardware accelerator. In the present semester, a high-level C model incorporating an external application and a high-level model of the accelerator was developed to study its performance at the highest possible level, both to demonstrate the effectiveness of the design and to provide guidelines for application development, e.g. how to ensure the best utilization of the hardware from the application perspective and how to accommodate packet dropping in the accelerator. This activity successfully demonstrates the existence of practical applications that can benefit from this design, and thereby its utility.

Also in the present semester, a detailed micro-architecture of the communication infrastructure involving the CNW and IENW was developed and implemented in Verilog RTL. This was used for synthesis and timing analysis, targeting a Xilinx Virtex-7 FPGA. The results showed that a practical processing array of 8x8 processing nodes can be comfortably accommodated at a clock speed of about 245 MHz. These findings provide another level of confirmation of the feasibility of the design.
The accelerator would also contain processors, and software running on them, to implement the cost computation algorithm, packet routing, etc. These could not be implemented due to lack of time, but some guidelines for their development have been worked out. During synthesis, the processors were replaced by a standard MicroBlaze microcontroller system for area estimation, on the assumption that they would have similar area. Thus the feasibility and utility of the design have been convincingly demonstrated, and its development has been placed on a clearly defined track.

Publisher

Institutt for elektronikk og telekommunikasjon