Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier

3D graphic accelerators are often limited by their floating-point performance. A Graphic Processing Unit (GPU) has several specialized floating-point units to achieve high throughput and performance. The floating-point units consume a large part of total area, and power consumption, and hence architectural choices are important to evaluate when implementing the design. GPUs are specially tuned for performing a set of operations on large sets of data. The task of a 3D graphic solution is to render a image or a scene. The scene contains geometric primitives as well as descriptions of the light, the way each object reflects light and the viewer position and orientation. This thesis evaluates four different pipelined, vectorized floating-point multipliers, supporting 16-bit, 32-bit and 64-bit floating-point numbers. The architectures are compared concerning area usage, power consumption and performance. Two of the architectures are implemented at Register Transfer Level (RTL), tested and synthesized, to see if assumptions made in the estimation methodologies are accurate enough to select the best architecture to implement given a set of architectures and constraints. The first architecture trades area for lower power consumption with a throughput of 38.4 Gbit/s at 300 MHz clock frequency, and the second architecture trades power for smaller area with equal throughput. The two architectures are synthesized at 200 MHz, 300 MHz and 400 MHz clock frequency, in a 65 nm low-power standard cell library and a 90 nm general purpose library, and for different input data format distributions, to compare area and power results at different clock frequencies, input data distributions and target technology. Architecture one has lower power consumption than architecture two at all clock frequencies and input data format distributions. At 300 MHz, architecture one has a total power consumption of 1.9210 mW at 65 nm, and 15.4090 mW at 90 nm. Architecture two has a total power consumption of 7.3569 mW at 65 nm, and 17.4640 mW at 90 nm. Architecture two requires less area than architecture one at all clock frequencies. At 300 MHz, architecture one has a total area of 59816.4414 um^2 at 65 nm, and 116362.0625 um^2 at 90 nm. Architecture two has a total area of 50843.0 um^2 at 65 nm, and 95242.0469 um^2 at 90 nm.

Utgiver

Institutt for elektronikk og telekommunikasjon