Norwegian University of Science and Technology

# Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier

## Espen Stenersen

Master of Science in Electronics
Submission date: June 2008
Supervisor: Per Gunnar Kjeldsberg, IET
Co-supervisor: Torstein Dybdal, ARM Norway AS

## Problem Description

In 3D graphics, several floating-point formats are used in computations. The task is to make a floating-point multiplier with the following features:

- 256-bit input vector and 128-bit output.
- Supporting FP16/FP32/FP64 inputs.
- IEEE754 conforming.
- 5 step pipeline.
- Simple handshake interface.

Depending on input formats the following operations should be performed:

- Vec4 FP16 multiply uses a 128-bit input vector, and produces a 64-bit output vector.
- Vec4 FP32 multiply uses a 256-bit input vector, and produces a 128-bit output vector.
- Vec2 FP64 multiply uses a 256-bit input vector, and produces a 128-bit output vector.

The assignment is a continuation of the project task, where different floating-point multiplier architectures were proposed, analyzed and evaluated. Based on this, further analysis has to be made before an architecture is chosen. Implement the chosen architecture at register transfer level, for testing and synthesis.

Assignment given: 15. January 2008
Supervisor: Per Gunnar Kjeldsberg, IET

## Abstract

3D graphic accelerators are often limited by their floating-point performance. A Graphic Processing Unit (GPU) has several specialized floating-point units to achieve high throughput and performance. The floating-point units account for a large part of total area and power consumption, and hence architectural choices are important to evaluate when implementing the design. GPUs are specially tuned for performing a set of operations on large sets of data. The task of a 3D graphic solution is to render an image or a scene. The scene contains geometric primitives as well as descriptions of the light, the way each object reflects light and the viewer position and orientation.

This thesis evaluates four different pipelined, vectorized floating-point multipliers, supporting 16-bit, 32-bit and 64-bit floating-point numbers. The architectures are compared concerning area usage, power consumption and performance. Two of the architectures are implemented at Register Transfer Level (RTL), tested and synthesized, to see if assumptions made in the estimation methodologies are accurate enough to select the best architecture to implement given a set of architectures and constraints. The first architecture trades area for lower power consumption with a throughput of 38.4 Gbit/s at 300 MHz clock frequency, and the second architecture trades power for smaller area with equal throughput. The two architectures are synthesized at 200 MHz, 300 MHz and 400 MHz clock frequency, in a 65 nm low-power standard cell library and a 90 nm general purpose library, and for different input data format distributions, to compare area and power results at different clock frequencies, input data distributions and target technology.

Architecture one has lower power consumption than architecture two at all clock frequencies and input data format distributions. At 300 MHz, architecture one has a total power consumption of 1.9210 mW at 65 nm, and 15.4090 mW at 90 nm. Architecture two has a total power consumption of 7.3569 mW at 65 nm, and 17.4640 mW at 90 nm. Architecture two requires less area than architecture one at all clock frequencies. At 300 MHz, architecture one has a total area of $59816.4414\,\mu\mathrm{m}^{2}$ at 65 nm, and $116362.0625\,\mu\mathrm{m}^{2}$ at 90 nm. Architecture two has a total area of $50843.0\,\mu\mathrm{m}^{2}$ at 65 nm, and $95242.0469\,\mu\mathrm{m}^{2}$ at 90 nm.

## Preface

This thesis concludes my Master's degree in Electrical Engineering at Norwegian University of Science and Technology (NTNU), and is a continuation of my 2007 autumn project. The assignment is given by ARM Norway, and involves research, implementation, testing and synthesis of a vectorized floating-point multiplier. The work was carried out from January 2008 to June 2008, and the topic was interesting, challenging and very instructive.

I spent a lot of time researching floating-point implementations in hardware, especially floating-point rounding, in addition to power consumption in sub-micron technologies. IEEE specifies a detailed standard for binary floating-point arithmetic, but leaves the implementation completely to the designer. Two different vectorized floating-point multipliers were implemented using the Verilog Hardware Description Language, which I had little knowledge of before starting this assignment. A significant amount of time was spent developing a sufficient test plan for the designs, and researching and understanding the tools used for synthesis and the Tcl scripting language. Working on this thesis, I learned much about floating-point arithmetic in hardware, the synthesis and optimization process, and power consumption in different target technologies. I also gained further knowledge of digital design in general, and the Verilog Hardware Description Language.

Special thanks go to my supervisors, Associate Professor Per Gunnar Kjeldsberg (NTNU), and Torstein Hernes Dybdahl (ARM) for their guidance, feedback and interest in this assignment. I would also like to thank my fellow students for comments and constructive questions, as well as my family and friends.

Espen Stenersen
Trondheim, June 2008.

## Contents

1 Introduction
1.1 Floating-Point Multiplication
1.2 Power and Area Optimized Designs
1.2.1 Low Power Design
1.2.2 Area Optimized Design
1.3 High-Speed Multiplication
1.4 Architecture Search-space Exploration
1.4.1 Power Consumption
1.4.2 Area Usage
1.4.3 Throughput and Delay
1.5 Proposed Architectures
1.5.1 Architecture One
1.5.2 Architecture Two
1.5.3 Architecture Three
1.5.4 Architecture Four
1.6 Thesis Organization and Main Contributions
2 Architecture Estimations
2.1 Power Estimation Methodology
2.2 Power Estimation
2.3 Area Estimation
2.4 Performance Estimation
2.5 Trade-Off Considerations
3 Implementation
3.1 Choosing Architecture
3.2 Vectorized Floating-Point Multiplier
3.2.1 Inputs
3.2.2 Outputs
3.2.3 Architecture Description
3.3 Testing and Simulation
3.3.1 Reference Circuit
3.3.2 Simulations
4 Synthesis Results
4.1 Synopsys®
4.1.1 Static Power
4.1.2 Dynamic Power
4.1.3 Capturing Switching Activity for Synthesis
4.1.4 Setting Design Constraints
4.2 Architecture One
4.2.1 Power
4.2.2 Area
4.3 Architecture Two
4.3.1 Power
4.3.2 Area
4.4 Power Comparison
4.5 Area Comparison
5 Conclusions
5.1 Estimation Methodologies
5.2 Power Results
5.3 Area Results
5.4 Future Work
A Architecture One Verilog Sources
B Architecture Two Verilog Sources
C Test Data Generator
D Simulation Sources
D.1 Vectorized DesignWare floating-point multiplier Source
D.2 Testbench Sources
D.3 Switching Activity Simulation Source

## List of Tables

2.1 Normalized leakage current for logic gates [1].
2.2 Significand multipliers static power consumption.
2.3 Significand multipliers dynamic power consumption.
2.4 Static power estimation of proposed architectures.
2.5 Total power consumption, 256-bit input vector.
2.6 Total power consumption, 128-bit input vector.
2.7 Architecture area comparison, FA-cells and equivalent register size.
3.1 Trade-off considerations.
3.2 Format encoding.
3.3 Rounding modes encoding.
3.4 Rounding mode reduction.
4.1 Architecture one, 65 nm CMOS total power consumption.
4.2 Architecture one, 65 nm CMOS building blocks power consumption.
4.3 Architecture one, 90 nm CMOS total power consumption.
4.4 Architecture one, 90 nm CMOS building blocks power consumption.
4.5 Architecture one, 65 nm CMOS area usage.
4.6 Architecture one, 90 nm CMOS area usage.
4.7 Architecture two, 65 nm CMOS total power consumption.
4.8 Architecture two, 65 nm CMOS building blocks power consumption.
4.9 Architecture two, 90 nm CMOS total power consumption.
4.10 Architecture two, 90 nm CMOS building blocks power consumption.
4.11 Architecture two, 65 nm CMOS area usage.
4.12 Architecture two, 90 nm CMOS area usage.

## List of Figures

2.1 Full-Adder gate-level model.
2.2 Ratio of leakage power to total power in a 65 nm CMOS library at different process corners, supply voltages and temperatures [2].
2.3 Architecture power comparison.
2.4 Architecture area comparison.
2.5 Architecture latency comparison.
3.1 Vectorized floating-point multiplier block diagram.
3.2 First input vector layout.
3.3 Second input vector layout.
3.4 Clear register layout.
3.5 Product vector layout.
3.6 Exception register layout.
3.7 Vectorized floating-point multiplier simple timing diagram.
3.8 Vectorized floating-point multiplier architecture drawing.
3.9 Architecture one exponent unit.
3.10 Architecture two exponent unit.
3.11 Architecture one significand multiplier unit.
3.12 Architecture two significand multiplier unit.
3.13 Architecture one rounding and exception unit.
3.14 Architecture two rounding and exception unit.
3.15 DW_vec_fp_mult block diagram.
4.1 Architecture one, 65 nm CMOS power consumption.
4.2 Architecture one, 90 nm CMOS power consumption.
4.3 Architecture one, 90 nm and 65 nm CMOS power comparison.
4.4 Architecture two, 65 nm CMOS power consumption.
4.5 Architecture two, 90 nm CMOS power consumption.
4.6 Architecture two, 90 nm and 65 nm CMOS power comparison.
4.7 65 nm architecture power comparison.
4.8 90 nm architecture power comparison.
4.9 Estimated vs. real power comparison.
4.10 65 nm CMOS architecture area comparison.
4.11 90 nm CMOS architecture area comparison.

## Chapter 1

## Introduction

Floating-point numbers are frequently used in scientific calculations, digital signal processing applications and in 3D graphics. In 3D graphics, floating-point performance is especially demanding and several floating-point number formats are used in computations. 3D graphics accelerators have a highly parallel structure that makes them more efficient for certain algorithms than general purpose processors. The 16-bit, 32-bit and 64-bit floating-point formats FP16, FP32 and FP64 are used for high dynamic range textures, that is, where light and dark textures are spanned over a large area. All formats can be used as vertex coordinates, and the FP64 format is the minimum for graphic processing units (GPUs) to be used in scientific calculations.

ARM Norway develops hardware graphic accelerators, specifically tuned for embedded system environments, supporting the OpenGL ES and OpenVG APIs, which focus on high performance and low power consumption [3]. Mali™200 with GP2 fully supports OpenGL ES v2.0, v1.1 and OpenVG v1.0. Detailed information about the Mali™ 3D Graphics System Solution can be found in [4]. OpenGL ES is a royalty-free cross-platform API for full function 2D and 3D graphics on embedded systems [5], and OpenVG is a royalty-free, cross-platform API that provides a low-level hardware acceleration interface for vector graphics libraries such as Flash and SVG [6].

The purpose of this floating-point multiplier is to support three different floating-point number formats, FP16, FP32 and FP64. It is a vectorized floating-point multiplier in the sense that the input vector is a vector of operands, where three different types of input vectors are supported. For the FP16 format, the input vector should be

$$
[127: 0]=[D 1, D 0, C 1, C 0, B 1, B 0, A 1, A 0]
$$

and the output will become

$$
[63: 0]=[D 1 \times D 0, C 1 \times C 0, B 1 \times B 0, A 1 \times A 0]
$$

For the FP32 format, the input vector should be

$$
[255: 0]=[D 1, D 0, C 1, C 0, B 1, B 0, A 1, A 0]
$$

and the output will become

$$
[127: 0]=[D 1 \times D 0, C 1 \times C 0, B 1 \times B 0, A 1 \times A 0]
$$

For the FP64 format, the input vector should be

$$
[255: 0]=[D 1, D 0, C 1, C 0]
$$

and the output will become

$$
[127: 0]=[D 1 \times D 0, C 1 \times C 0]
$$

Depending on the input vector data format, the output vector will be a 64-bit or 128-bit vector of floating-point products in the IEEE 754 format.
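The FP16 lane layout above can be sketched in Python. This is a behavioral illustration only, not the RTL: it assumes that the rightmost element in the bracket notation (A0) occupies the least significant bits, and the helper names `split_lanes`, `join_lanes` and `vec4_fp16_multiply` are hypothetical. Overflow and NaN handling are left to Python's `struct` module and do not model the exception logic of the design.

```python
import struct

def split_lanes(vector, lane_bits, lanes):
    """Return the lanes of a packed vector, least significant lane first."""
    mask = (1 << lane_bits) - 1
    return [(vector >> (i * lane_bits)) & mask for i in range(lanes)]

def join_lanes(values, lane_bits):
    """Pack lane values into one integer, first value in the lowest bits."""
    vector = 0
    for i, value in enumerate(values):
        vector |= value << (i * lane_bits)
    return vector

def fp16_bits(x):
    """Encode a Python float as an FP16 bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def fp16_value(bits):
    """Decode an FP16 bit pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def vec4_fp16_multiply(vector):
    """128-bit input [D1, D0, C1, C0, B1, B0, A1, A0] -> 64-bit product vector."""
    lanes = split_lanes(vector, 16, 8)  # A0, A1, B0, B1, C0, C1, D0, D1
    products = []
    for i in range(0, 8, 2):
        x0 = fp16_value(lanes[i])
        x1 = fp16_value(lanes[i + 1])
        # The product of two FP16 values is exact in double precision,
        # so packing it back to FP16 rounds only once.
        products.append(fp16_bits(x1 * x0))
    return join_lanes(products, 16)
```

Packing the operands A0 = 1.5, A1 = 2.0 into the two lowest lanes yields the product 3.0 in the lowest 16 bits of the output vector.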

This thesis is a continuation of my 2007 autumn project [7], which presents four possible vectorized floating-point multiplier architectures with different area, power and throughput profiles. These architectures are evaluated and compared concerning area, power, throughput and latency. This thesis will further investigate power consumption of the four architectures and two architectures will be selected for RTL implementation. The implemented architectures will be tested and synthesized to see if assumptions and methodologies used to compare area and power are sufficient to select the best alternative given a set of constraints.

In a 3D graphic processing application, throughput is very important because it is operating on large data sets describing a frame or scene, where for example shading, lighting, positions and the viewer's perspective are considered. In any hardware implementation, area and power are usually important constraints. In a handheld, battery powered device, both area and power consumption are very important. Because of highly parallel computations, and the pipelined architecture of graphic accelerators, clock frequency is typically much lower than in a modern general purpose CPU.

This chapter will first present the floating-point multiplication algorithm in Section 1.1. Then design strategies for low power and small area will be discussed in Section 1.2. In Section 1.3 some high-speed multiplier schemes are presented, and in Section 1.4, the vectorized floating-point multiplier architecture search-space will be explored. Section 1.5 presents the architectures evaluated in [7], and in Section 1.6, the outline and main contributions of this thesis are presented.

### 1.1 Floating-Point Multiplication

The IEEE standard for binary floating-point arithmetic specifies a detailed standard for floating-point representation in computers [8]. Floating-point numbers are represented by a sign, an exponent and a significand, and are written as follows

$$
\begin{equation*}
\text{floating-point number}=(-1)^{s} \times f \times \beta^{e-\text{bias}}, \tag{1.1}
\end{equation*}
$$

where $s$ represents the sign, $f$ the significand, $e$ the exponent and $\beta$ the base or radix. In the binary IEEE 754 formats the base is always 2. Exponents in IEEE 754 format are stored with a bias so that the stored exponent is never negative, which makes comparison between numbers easier. The exponent determines the range, and the significand the precision of the number.

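Given two floating-point numbers $n_{1}=(-1)^{s_{1}} \times f_{1} \times 2^{e_{1}-\text{bias}}$ and $n_{2}=(-1)^{s_{2}} \times f_{2} \times 2^{e_{2}-\text{bias}}$ with biased exponents $e_{1}$ and $e_{2}$, the floating-point product, whose biased exponent is $e_{1}+e_{2}-\text{bias}$, is computed as

$$
n=(-1)^{s_{1}+s_{2}} \times\left(f_{1} \times f_{2}\right) \times 2^{\left(e_{1}+e_{2}-\text{bias}\right)-\text{bias}}
$$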

The floating-point multiplication algorithm is straightforward: exponents are added and the bias subtracted, significands are multiplied, and the sign is computed by an XOR operation. Because the result of the significand multiplication is of width $2n$, where $n$ is the width of each significand, rounding has to be performed to obtain a final product in the IEEE specified format. The algorithm is given below.

```
// Sign, exponent and significand computation.
sign = sign_1 ^ sign_2;
exponent = exponent_1 + exponent_2 - bias;
significand = significand_1 * significand_2;
// Normalizing: the product is outside the normalized range.
if (normalize)
{
    significand = significand >> 1;
    exponent = exponent + 1;
}
// Rounding: add one unit in the last place.
if (roundup)
{
    significand = significand + 1;
}
// Post-normalizing: rounding carried out of the significand.
if (postnormalize)
{
    significand = significand >> 1;
    exponent = exponent + 1;
}
product = {sign, exponent, significand};
```

For numbers in scientific notation, the fractional part has to be normalized if it is outside the interval $[1,10)$. For a product of two normalized numbers, normalizing is performed by incrementing the exponent by one and dividing the fractional part by ten. Likewise, in binary IEEE arithmetic, if the significand is outside the interval $[1,2)$ it has to be normalized, and normalizing is performed by incrementing the exponent by one and dividing the significand by two. The decision for normalizing is simple: if the most significant bit in the result after significand multiplication equals one, every bit in the significand is shifted one position to the right and the exponent is incremented by one. If the significand is to be rounded up, a '1' is added to the significand. If the significand is not normalized after rounding, post-normalizing occurs: the bits in the significand are shifted one position to the right, and the exponent is incremented by one.
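The steps described above can be traced in a small Python model for the FP32 format. This is a minimal sketch under stated assumptions, not the implementation from this thesis: it handles only normal operands and normal results (no zeros, subnormals, infinities, NaNs, overflow or underflow), it uses round-to-nearest-even only, and the function name `fp32_multiply` is hypothetical.

```python
import struct

BIAS = 127       # FP32 exponent bias
FRAC_BITS = 23   # stored fraction bits; the significand is 24 bits wide

def fp32_multiply(a, b):
    """Multiply two FP32 bit patterns (normal operands and results only)."""
    sign = (a >> 31) ^ (b >> 31)
    exponent = ((a >> 23) & 0xFF) + ((b >> 23) & 0xFF) - BIAS
    # Restore the hidden leading one before multiplying.
    sig_a = (a & 0x7FFFFF) | (1 << FRAC_BITS)
    sig_b = (b & 0x7FFFFF) | (1 << FRAC_BITS)
    product = sig_a * sig_b  # 48 bits wide, represents a value in [1, 4)

    # Normalizing: the product is >= 2 when its top bit (bit 47) is set.
    shift = FRAC_BITS
    if product >> 47:
        shift += 1
        exponent += 1

    significand = product >> shift            # truncated 24-bit significand
    remainder = product & ((1 << shift) - 1)  # discarded low-order bits
    half = 1 << (shift - 1)

    # Rounding to nearest even: round up above one half, or at exactly
    # one half when the least significant kept bit is one.
    if remainder > half or (remainder == half and significand & 1):
        significand += 1

    # Post-normalizing: rounding may carry the significand up to 2.0.
    if significand >> 24:
        significand >>= 1
        exponent += 1

    return (sign << 31) | (exponent << 23) | (significand & 0x7FFFFF)

def f32_bits(x):
    """Encode a Python float as an FP32 bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]
```

For example, `fp32_multiply(f32_bits(1.5), f32_bits(2.5))` returns the bit pattern of 3.75; the product 3.75 lies in $[2,4)$, so the normalizing step is taken.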

IEEE specifies four rounding modes: round to nearest even, round to positive infinity, round to negative infinity and round to zero. The rounding decision is based on the rounding mode and the guard digits. In round to nearest even mode, three guard digits are needed. In round to positive infinity and round to negative infinity, two guard digits in addition to the sign bit are needed to make a correct rounding decision. Rounding decisions for the different rounding modes and guard digits are further described in [9].
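A common textbook formulation of the rounding decision can be sketched as follows. It is an illustration and not necessarily the scheme used in [9] or in this design: it assumes the result has already been normalized, so that the discarded bits are summarized by a round bit (the first discarded bit) and a sticky bit (the OR of all remaining discarded bits). The mode names are hypothetical and do not correspond to the encoding used by the multiplier.

```python
def round_up(mode, sign, lsb, round_bit, sticky):
    """Return True when the significand magnitude must be incremented.

    sign: 1 for negative results; lsb: least significant kept bit;
    round_bit: first discarded bit; sticky: OR of the other discarded bits.
    """
    inexact = round_bit or sticky
    if mode == "nearest_even":
        # Above one half, or exactly one half with an odd kept value.
        return bool(round_bit and (sticky or lsb))
    if mode == "toward_positive":
        # Only positive inexact results grow in magnitude.
        return bool(inexact and not sign)
    if mode == "toward_negative":
        # Only negative inexact results grow in magnitude.
        return bool(inexact and sign)
    if mode == "toward_zero":
        # Truncation never increments.
        return False
    raise ValueError("unknown rounding mode: " + mode)
```

Note how the directed modes need the sign in addition to the discarded bits, while round to nearest even needs the least significant kept bit to break ties.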

### 1.2 Power and Area Optimized Designs

Low power and small area can be contradicting requirements, but both are very important in handheld devices. Low power design exploits numerous techniques such as dynamic voltage and frequency scaling as well as different coding schemes and number representations to reduce the overall power consumption. Low area can be achieved by, for example, resource sharing, but is a trade-off between area, speed and latency.

### 1.2.1 Low Power Design

The average power dissipation in a CMOS circuit is given by the equation [10]

$$
\begin{align*}
P_{\text{average}} & =P_{\text{static}}+P_{\text{short-circuit}}+P_{\text{dynamic}}  \tag{1.2}\\
& =V_{DD} I_{\text{static}}+V_{DD} I_{\text{short-circuit}}+\alpha C_{L} V_{DD}^{2} f_{\text{clk}}
\end{align*}
$$

where $\alpha$ corresponds to the average number of $0 \rightarrow 1$ transitions at a given node each clock cycle, $V_{D D}$ the supply voltage, $C_{L}$ the capacitive load switched each cycle and $f_{\text {clk }}$ the clock frequency.
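Equation 1.2 can be evaluated directly, as in the sketch below. The function names and all numeric values used with them are hypothetical and chosen only to show how the terms scale; they are not measurements from this thesis.

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """Dynamic term of Equation 1.2: alpha * C_L * V_DD^2 * f_clk, in watts."""
    return alpha * c_load * v_dd ** 2 * f_clk

def average_power(v_dd, i_static, i_short_circuit, alpha, c_load, f_clk):
    """Sum of the static, short-circuit and dynamic terms of Equation 1.2."""
    static = v_dd * i_static
    short_circuit = v_dd * i_short_circuit
    return static + short_circuit + dynamic_power(alpha, c_load, v_dd, f_clk)
```

With the illustrative values $\alpha = 0.5$, $C_{L} = 2$ pF, $V_{DD} = 1.0$ V and $f_{\text{clk}} = 300$ MHz, the dynamic term is 0.3 mW. Halving $V_{DD}$ reduces it to a quarter, whereas halving $f_{\text{clk}}$ only halves it.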

## Static Power Consumption

The static power dissipation is technology dependent, and increases as the transistor dimensions and threshold voltages decrease. The static power consumption is an increasing problem in deep sub-micron technologies, and is proportional to the number of transistors in a given design. $I_{\text{static}}$ is composed of leakage currents due to tunneling effects and sub-threshold conduction. Static power dissipation can be reduced by optimizing the supply voltage and threshold voltage, or by reducing the number of transistors, and hence area. Other techniques such as channel engineering and changing the doping profile of the transistors may also be used. In order to eliminate the static power dissipation, the supply voltage needs to be turned off when parts of the circuit are not used.

## Short-circuit Power Consumption

The short-circuit current contributes to the average power consumption when both the PMOS and NMOS transistor conduct simultaneously, creating a direct path from $V_{D D}$ to ground. Short-circuit currents can be minimized by designing PMOS and NMOS transistors with equal fall- and rise times.

## Dynamic Power Consumption

The dynamic power dissipation is the main contributor to the average power consumption when the circuit is operating. To reduce the power consumption, either the switching activity, the capacitive load, the supply voltage, the clock frequency or a combination of these can be reduced. [11] describes three architectural techniques to reduce the power consumption in CMOS circuits, trading area for lower power dissipation through hardware duplication, pipelining or a combination of these. Through hardware duplication both supply voltage and clock frequency can be reduced at the cost of additional registers at the input, and a multiplexer at the output. Through hardware pipelining, the clock frequency or supply voltage can be reduced while still maintaining the same throughput as a similar non-pipelined circuit at a higher supply voltage. In graphic processing implementations, pipelining is often used to improve throughput. [11] also describes techniques for reducing the switching activity through algorithmic optimization. Statistical knowledge of the input data can be exploited to lower the power dissipation by choosing the number representation that gives the lowest switching activity.
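As a rough illustration of the duplication argument (a simplified derivation under stated assumptions, not a result taken from [11]): suppose two copies of a unit each run at $f_{\text{clk}}/2$, the total switched capacitance doubles to $2 C_{L}$, the supply voltage is lowered from $V_{DD}$ to $\tilde{V}_{DD}$, and the additional input registers and output multiplexer are neglected. The dynamic term of Equation 1.2 then becomes

$$
\tilde{P}_{\text{dynamic}}=\alpha\left(2 C_{L}\right) \tilde{V}_{DD}^{2} \frac{f_{\text{clk}}}{2}=\alpha C_{L} \tilde{V}_{DD}^{2} f_{\text{clk}}, \qquad \frac{\tilde{P}_{\text{dynamic}}}{P_{\text{dynamic}}}=\left(\frac{\tilde{V}_{DD}}{V_{DD}}\right)^{2}
$$

The throughput is unchanged, and the saving comes entirely from the quadratic dependence on the supply voltage. For a hypothetical reduction to $\tilde{V}_{DD}=0.7\, V_{DD}$, the dynamic power drops to 49 % of its original value, at the cost of roughly twice the area and therefore higher static power.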

In floating-point multipliers, the significand multipliers consume the larger part of the area. Therefore, these should be implemented as area and power efficiently as possible in order to minimize both static and dynamic power dissipation. The numerous techniques for multiplier design differ in power consumption and area usage.

### 1.2.2 Area Optimized Design

Area can be reduced at the expense of larger latency and lower throughput, or by reusing or sharing computational units efficiently. If throughput or latency is an absolute demand due to some timing constraints, there may be a limit to how much area can be reduced without violating those constraints. Area is an important design parameter in handheld devices due to the size of the devices, and the energy consumption.

In floating-point multipliers supporting several formats, area can be reduced mainly by using one multiplier to compute the significands for every supported format. This affects the power consumption as well: the dynamic power consumption increases unless measures are taken to minimize it, and the static power consumption is reduced due to fewer transistors.

### 1.3 High-Speed Multiplication

Multiplication involves two basic operations, generating partial products and accumulating the partial products. The time to perform a multiplication can be reduced by either reducing the number of partial products or speeding up their accumulation [9]. High-speed multipliers can be divided into two categories, bit-parallel and bit-serial multipliers. Bit-parallel multipliers can be further divided into three categories [12].

- Shift-and-add multipliers.
- Parallel multipliers.
- Array multipliers.

Shift-and-add multipliers generate partial products sequentially and accumulate them successively. This type of multiplier requires the least amount of area, but is also the slowest. It can be implemented using only one bit-parallel adder and successively adding the partial products row- or column-wise. The shift-and-add multiplier requires $n^{2}$ AND operations, and $n-1$ shift operations, where $n$ is the width of the operands.
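The sequential scheme can be modeled in a few lines of Python. This is a behavioral sketch for unsigned operands with a hypothetical function name; it counts the AND operations to show where the $n^{2}$ figure comes from, with each of the $n$ rows requiring $n$ AND gates.

```python
def shift_and_add_multiply(a, b, n):
    """Multiply two n-bit unsigned integers one partial product at a time.

    Returns the product and the number of single-bit AND operations used.
    """
    accumulator = 0
    and_operations = 0
    for i in range(n):
        bit = (b >> i) & 1
        # Row i: every bit of a is ANDed with bit i of b (n AND operations).
        partial_product = a if bit else 0
        and_operations += n
        # The row is shifted to its weight and accumulated by the single adder.
        accumulator += partial_product << i
    return accumulator, and_operations
```

For two 4-bit operands, `shift_and_add_multiply(13, 11, 4)` returns the product 143 after 16 AND operations.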

Parallel multipliers generate all partial products in parallel, and use an adder tree for their accumulation. Thus a parallel multiplier can be partitioned into three parts: partial product generation or reduction, partial product accumulation (carry-free addition) and carry-propagation addition for the final result. Partial product reduction is most often performed by some version of Booth's algorithm, and partial product accumulation by a Dadda [13] or Wallace [14] tree. The carry-propagation addition is often performed by a carry-lookahead adder. Tree-based multipliers have a latency proportional to $O\left(\log _{2}(n)\right)$, where $n$ is the width of the operands.
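The carry-free accumulation can be illustrated with a word-level model of 3:2 compressors (carry-save adders). The sketch below uses a simple Wallace-style grouping of rows into threes and omits Booth recoding; it is a behavioral illustration with hypothetical function names, not the multiplier structure used in this thesis.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: three addends in, a sum word and a carry word out."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry  # partial_sum + carry == x + y + z

def tree_multiply(a, b, n):
    """Multiply n-bit unsigned integers; return (product, compressor levels)."""
    # All partial products are generated at once.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    levels = 0
    while len(rows) > 2:
        reduced = []
        full_groups = len(rows) - len(rows) % 3
        for i in range(0, full_groups, 3):
            reduced.extend(carry_save_add(*rows[i:i + 3]))
        reduced.extend(rows[full_groups:])  # leftover rows pass through
        rows = reduced
        levels += 1
    # Final carry-propagate addition of the two remaining rows.
    return sum(rows), levels
```

With this grouping, 8 partial products are reduced in 4 levels and 24 partial products in 7 levels, which shows the logarithmic growth of the tree depth with the operand width.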

Array multipliers consist of almost identical cells for the generation of partial products and their accumulation. Compared to tree-based multipliers, the array multiplier utilizes the least amount of area, but has larger latency. Array multipliers are good candidates for pipelining, and relatively easy to implement. The cells for partial product generation and accumulation are adders, most often implemented as carry-save adders to make them more efficient. Array multipliers have a latency proportional to $O(n)$, where $n$ is the width of the operands.

High-speed multipliers and multiplier schemes are further described and elaborated in [7].

### 1.4 Architecture Search-space Exploration

Given the specification, a vectorized, pipelined IEEE compliant floating-point multiplier supporting 16-bit, 32-bit and 64-bit floating-point numbers, there is a minimum requirement of computational units: one significand multiplier, one exponent adder, and rounding and exception logic capable of handling every supported format, in addition to input, output and pipeline registers.

### 1.4.1 Power Consumption

Power consumption consists of both a static and a dynamic component. The static component is hard to estimate because it is strongly technology dependent, but is directly related to the chip area. The dynamic component depends on variables such as switching activity and glitching. Glitching activity can be much higher than functional activity in certain datapath modules such as adders and multipliers, and in a 32-bit multiplier, the power dissipation due to glitches can be three times higher than that due to functional activities [15]. Glitching can be reduced by balancing signal paths, and hence reducing uneven arrival times.

The optimized minimum power solution is difficult to obtain because the probability distribution of the different formats is unknown, and because static power dissipation can be a large contributor to the overall power and energy consumption. The choice of using only one significand multiplier for every supported format, or several significand multipliers for every supported format is crucial for both the power and energy consumption as well as the area usage and throughput. However, the FP32 format is assumed to be the main data format, and used frequently compared to the FP16 and FP64 formats. Because the use of the different supported formats is unknown, and only assumptions can be made, it is difficult to optimize the overall floating-point multiplier concerning power consumption. If FP16 computations are performed very infrequently compared to FP32 computations, the FP16 significand computations can be performed in the FP32 significand multiplier with little power overhead in the long run. This favors a solution with at least one 24-bit significand multiplier for the FP32 (and FP16) format, and one 53-bit significand multiplier for the FP64 format. However, even if the power dissipation seems to be low, the total energy consumed by computing an entire input vector has to be considered.

Reducing the input vector also reduces area requirements due to fewer computational units and registers, and hence gives less static power dissipation and total power dissipation. However, energy consumption is not significantly reduced, because the reduced throughput means that additional cycles are needed to compute an entire input vector.
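The distinction between power and energy can be made explicit with a simple relation. This is an idealized illustration, with $N_{\text{cycles}}$ denoting the number of cycles needed to process one complete input vector (a symbol introduced here, not in the original text):

$$
E_{\text{vector}}=P_{\text{average}} \cdot \frac{N_{\text{cycles}}}{f_{\text{clk}}}
$$

If halving the datapath roughly halves $P_{\text{average}}$ but doubles $N_{\text{cycles}}$, then $E_{\text{vector}}$ stays roughly the same. In practice the two factors will not cancel exactly, since static power and register overhead do not scale linearly with the datapath width.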

### 1.4.2 Area Usage

A minimum area solution would have only one XOR-gate computing the sign, one exponent adder and one significand multiplier supporting every format, rounding and normalizing logic supporting all three formats and a 256-bit input register and a 128-bit output register, in addition to exception logic handling exceptions raised during computation. Pipeline registers will incur a significant increase in area, and should not be used in a minimum area solution. This architecture will suffer from very low throughput and clock speed, in addition to a high power consumption due to glitching in very long and possibly uneven signal paths plus functional switching. This floating-point multiplier is inefficient and energy consuming, and will not be suited for a battery powered graphic solution.

Power and energy consumption, as well as throughput and critical path delay, can be improved at the expense of additional pipeline registers. The area consuming part of any floating-point multiplier is the significand multiplier. Different multiplier schemes may be used to reduce the overall area usage. Amongst bit-parallel multipliers, the array multiplier requires the least amount of area, but is also the slowest [12].

### 1.4.3 Throughput and Delay

A vectorized floating-point multiplier optimized concerning throughput and delay requires pipelining to reduce the critical path delay and parallel computation to increase the data processed each cycle. However, parallelizing the computations requires additional computational units, which increases both area and static power dissipation significantly. In a graphic processing application, high throughput is an important criterion; however, in a battery powered graphic processing application, performance has to be a compromise between throughput, area and energy consumption.

To maximize the throughput, at least two significand multipliers and exponent adders supporting the FP32 and FP16 formats, and two significand multipliers and exponent adders supporting every format, are needed, in addition to four XOR-gates computing the signs, and rounding and normalizing logic capable of handling four products in parallel. The exception logic also needs to be able to handle exceptions from four products simultaneously. A 128-bit input bus may not only reduce area and power consumption, it may also reduce the throughput.

Critical path delay is limited by the FP64 significand multiplier, assuming registers at the input and output of this multiplier. Techniques for fast multiplication can be applied to speed up the multiplication. Compression multipliers such as Dadda [13] and Wallace [14], or versions of these, in addition to techniques for reducing partial products, speed up the multiplication at the expense of area and possibly power overhead.

### 1.5 Proposed Architectures

The architectures presented in [7] lie somewhere in between the solutions discussed above, and have different power consumptions, areas, throughputs and latencies, where latency is measured in cycles before a product vector is ready at the output. Four architectures are presented.

### 1.5.1 Architecture One

This architecture attempts to be a throughput and power optimized solution at the cost of increased area. Achieving a high throughput requires parallel computation of input vectors. To minimize the dynamic power consumption, two 53-bit multipliers, four 24-bit multipliers and four 11-bit multipliers are used to compute the significands of the FP64, FP32 and FP16 formats respectively. In addition, two 11-bit adders and subtractors, four 8-bit adders and subtractors and four 5-bit adders and subtractors are used to compute the exponents of the FP64, FP32 and FP16 formats respectively. Four XOR-gates are used to compute the signs. By using components that exactly fit the operand widths, unnecessary switching is reduced when computing the different formats. Architecture one has a latency of four cycles, assuming a 256-bit input bus, and throughput is 256 bits per clock cycle. If the input bus is reduced to 128 bits, however, throughput is reduced to 128 bits per cycle and latency increases to five cycles. In addition, only one 53-bit multiplier, two 24-bit multipliers and two 16-bit multipliers, one 11-bit exponent adder and subtractor, two 8-bit adders and subtractors and two 5-bit adders and subtractors are needed if the input bus is reduced to 128 bits. An architectural drawing of architecture one is given in [7].

### 1.5.2 Architecture Two

Architecture two attempts to be a throughput and area optimized solution by using more general significand multipliers and exponent adders than architecture one. Two 53-bit multipliers and two 24-bit multipliers are used to compute the significands of all supported formats. Two 11-bit adders and subtractors and two 8-bit adders and subtractors are used to compute the exponents of the FP16, FP32 and FP64 data formats, in addition to four XOR-gates computing the signs. By reducing the area, static power dissipation is reduced, but dynamic power is increased due to functional switching. Significands have to be extended to fit the width of the multipliers for the FP32 and FP16 formats. The 11-bit exponent adders have to support subtraction of three different bias values, and the 8-bit exponent adders have to support subtraction of the FP16 and FP32 bias values. Like architecture one, this architecture has a latency of four cycles and a throughput of 256 bits per cycle, assuming a 256-bit input bus. If the input bus is reduced to 128-bit, latency increases to five cycles and throughput decreases to 128 bits per cycle. As for architecture one, the number of significand multipliers and exponent adders and subtractors is halved. The architectural layout of architecture two is also given in [7].
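The bias subtraction that a shared exponent adder has to support can be illustrated with a minimal sketch. It assumes the standard IEEE 754 bias values (15, 127 and 1023) and ignores overflow, underflow and the normalization adjustment. The names `BIAS` and `product_exponent` are hypothetical and are not taken from the implementation.

```python
# Exponent biases of the IEEE 754 binary16, binary32 and binary64 formats.
BIAS = {"FP16": 15, "FP32": 127, "FP64": 1023}


def product_exponent(exp_a: int, exp_b: int, fmt: str) -> int:
    """Biased exponent of a product before normalization.

    Both operands carry the bias once, so their sum carries it twice
    and one bias has to be subtracted again.
    """
    return exp_a + exp_b - BIAS[fmt]


# 2.0 * 4.0 in FP32: biased exponents 128 and 129 give 130, i.e. 2**3.
print(product_exponent(128, 129, "FP32"))
```

A dedicated adder as in architecture one only needs one of the three constants, whereas a shared adder has to select the constant from the format input.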

### 1.5.3 Architecture Three

Architecture three attempts to be an area and power optimized architecture, where throughput is traded for smaller area. One 53-bit multiplier, one 24-bit multiplier and one 11-bit multiplier compute the significands of the FP64, FP32 and FP16 formats respectively. One 11-bit adder and subtractor, one 8-bit adder and subtractor and one 5-bit adder and subtractor are used to compute the exponents of the FP64, FP32 and FP16 formats respectively. One XOR-gate is used to compute the signs. By reducing area, static power is reduced, and by using components that fit the operand width of their designated format, functional switching, and hence dynamic power consumption, is reduced. This architecture has a latency of six cycles, assuming a 256-bit input bus. The throughput of this architecture is 64 bits per cycle. If the input bus is reduced to 128-bit, neither latency nor throughput is affected, because only one product is computed each cycle. However, the input register size may be reduced, and hence area and static power dissipation. Architecture three should have a 64-bit input bus to avoid wait cycles, which reduces the number of registers required and hence the area even further. The architectural layout of architecture three is given in [7].

### 1.5.4 Architecture Four

This architecture is close to an area optimized solution, and almost identical to architecture three, except that only one 53-bit multiplier is used to compute the significands of all supported formats, one 11-bit adder and subtractor is used to compute the exponents, and one XOR-gate computes the sign. The exponent subtractor supports the FP16, FP32 and FP64 bias values. By reducing area to the minimum of components needed for computing the products of all formats, static power is reduced even further, but at the cost of functional switching. Like architecture three, this architecture has a latency of six cycles and a throughput of 64 bits per cycle, assuming a 256-bit input bus. If the input bus is reduced, latency and throughput are unaffected.

As discussed in [7], and above, area, and hence static power, can be reduced for architecture one and two by reducing the input bus from 256 -bit to 128 -bit. This does not change the overall energy consumption significantly because an additional cycle is needed to compute an entire input vector. Architecture three and four are not affected by reducing the input bus. The rounding, normalizing/post-normalizing and exception logic are equal for all four architectures presented in [7].

### 1.6 Thesis Organization and Main Contributions

The rest of this thesis is organized as follows. In Chapter 2, a power estimation methodology is presented and used to compare the architectures presented in [7] concerning power consumption. The architectures are further compared concerning area usage and performance, including latency and throughput. Chapter 2 also discusses trade-off considerations when choosing an architecture to implement. In Chapter 3, two architectures are selected for implementation, and the implemented architectures are presented and described. In addition, testing and simulation of the two architectures are discussed. Chapter 4 describes how synthesis has been performed, and presents the synthesis power and area results. The two architectures are further compared concerning power consumption and area usage. Chapter 5 concludes this thesis.

The main contributions of this thesis are:

- A power estimation methodology for comparing the relative differences in power consumption of the architectures proposed in [7].
- Comprehensive RTL implementation of two vectorized floating-point multiplier architectures.
- Synthesis results of the two architectures realized in a 65 nm low-power library, and a 90 nm general purpose library, for comparison with estimations performed in this thesis, and in [7], in two different target technologies.


## Chapter 2

## Power, Area and Performance Estimation

Power, area and performance estimations are important to consider when choosing an architecture to implement. Especially power can be hard to estimate because it is strongly technology dependent, and both static and dynamic power dissipation have to be taken into account. When moving into deep sub-micron technology, static power dissipation can be a significant contributor to the total power consumption.

This chapter first presents a power estimation methodology based on power dissipated by significand multipliers in Section 2.1. In Section 2.2, this power estimation methodology is used to compare the power consumption of the four architectures presented in Section 1.5. Section 2.3 compares area requirements of the proposed architectures, and Section 2.4 discusses their latency, throughput and clock frequency. Trade-off considerations for choosing an architecture to implement are presented in Section 2.5.

### 2.1 Power Estimation Methodology

The significand multiplier is the major computational unit in any floating-point multiplier. Therefore, estimating the power consumed by the significand multipliers will give a good indication of the total power consumption of the overall floating-point multiplier.

When computing the resulting significand of two floating-point numbers of size $n$-bit, the $n$ most significant bits of the $n \times n$-bit product are the bits of interest. This means that if, for example, an FP32 significand is computed in an FP64 significand multiplier, the FP32 significand has to be extended to fit the width of the FP64 significand multiplier. If the additional bits are appended as the most significant bits, either shifting has to be performed after the multiplication, or multiplexers connected to the output register of the significand multiplier have to select the correct bits for further computations such as rounding and normalizing. Alternatively, the additional bits can be appended as the least significant bits, which avoids the shifting or multiplexing.
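The effect of appending the additional bits as least significant bits can be checked with a small sketch. It models the significands as plain integers and assumes zero padding. The helper names `extend_lsb` and `top_bits` are hypothetical.

```python
def extend_lsb(sig: int, width: int, mult_width: int) -> int:
    """Pad a width-bit significand with zeros on the least significant side."""
    return sig << (mult_width - width)


def top_bits(product: int, product_width: int, n: int) -> int:
    """Return the n most significant bits of a product_width-bit product."""
    return product >> (product_width - n)


W32, W64 = 24, 53              # FP32 and FP64 significand widths
a, b = 0xC00000, 0xA00000      # 1.5 and 1.25 as 24-bit significands

# Product computed in a dedicated 24-bit multiplier.
narrow = top_bits(a * b, 2 * W32, W32)
# Same product computed in a 53-bit multiplier with LSB-padded operands.
wide = top_bits(extend_lsb(a, W32, W64) * extend_lsb(b, W32, W64), 2 * W64, W32)

print(hex(narrow), hex(wide))
```

Both results are taken from the same position, the top of the product register, so no format-dependent shift or multiplexer is needed in this model.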

In order to estimate the power consumption, both static and dynamic power consumption, a power model or methodology is needed. In [1] simulations of leakage currents for different logic gates are performed for a 65 nm CMOS library, with standard threshold transistors and standard cells with a driving force of one. The result is given in Table 2.1. In [2], simulations are performed to analyze the ratio of static power dissipation to total power dissipation. The simulations are performed in a 65 nm CMOS library for different process corners and different supply voltages and temperatures. In the simulated circuit it is assumed that $95 \%$ of the gates are quiet and $5 \%$ are switching. The simulation result is given in Figure 2.2.

| Input | NAND | AND | XOR |
| :---: | :---: | :---: | :---: |
| L L | 1 | 5.3 | 17.9 |
| L H | 5.9 | 10.2 | 17.9 |
| H L | 7.1 | 11.4 | 9.1 |
| H H | 4.5 | 14.5 | 9.1 |

Table 2.1: Normalized leakage current for logic gates [1].

As can be seen from Table 2.1, static power can be reduced by setting unused bits to other values than zero. However, this simple power model aims to highlight the relative differences between the four architectures, and not their exact power consumptions. As seen from Equation 1.2, static power is given by $I_{\text {static }} \times V_{D D}$. Assuming equal $V_{D D}$ for all architectures, $V_{D D}$ can be eliminated from the equation, and the total static power of a Full-Adder can be computed as

$$
2 \times 17.9+2 \times 1+1 \times 4.5=\underline{42.3}
$$

assuming unused bits are set to '0'. The Full-Adder model used for the static power computation is given in Figure 2.1, and differs from the Full-Adder model used in [7, 16]. The model in Figure 2.1 uses six fewer transistors, which reduces the area and also makes it possible to use the simulated data from Table 2.1.

Figure 2.2 shows that for a typical 65 nm CMOS process the static power dissipation is approximately $30 \%$ of total dissipated power. Hence if the static power is 42.3 , the dynamic power will be

$$
42.3 \times(7 / 3)=\underline{98.7}
$$
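The two Full-Adder figures can be reproduced from Table 2.1 with a short sketch. The gate states listed in `fa_states` are an assumption inferred from the terms of the sum above (two XOR gates and two NAND gates with both inputs low, one NAND gate with both inputs high), since Figure 2.1 is not reproduced here. All names are hypothetical.

```python
# Normalized leakage currents from Table 2.1, indexed by gate and input state.
LEAKAGE = {
    "NAND": {"LL": 1.0, "LH": 5.9, "HL": 7.1, "HH": 4.5},
    "AND": {"LL": 5.3, "LH": 10.2, "HL": 11.4, "HH": 14.5},
    "XOR": {"LL": 17.9, "LH": 17.9, "HL": 9.1, "HH": 9.1},
}

# Assumed gate states of one Full-Adder with all inputs at '0'.
fa_states = [
    ("XOR", "LL"), ("XOR", "LL"),
    ("NAND", "LL"), ("NAND", "LL"),
    ("NAND", "HH"),
]

fa_static = sum(LEAKAGE[gate][state] for gate, state in fa_states)

# Static power is taken as 30 % of total power, so dynamic power is 70 %.
STATIC_SHARE = 0.30
fa_dynamic = fa_static * (1 - STATIC_SHARE) / STATIC_SHARE

print(round(fa_static, 1), round(fa_dynamic, 1))
```

Changing `STATIC_SHARE` shows how strongly the dynamic estimate depends on the ratio read from Figure 2.2.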



Figure 2.1: Full-Adder gate-level model.


Figure 2.2: Ratio of leakage power to total power in a 65 nm CMOS library at different process corners, supply voltages and temperatures [2].

Assuming that the significand multipliers are implemented as array multipliers as described in [7], the FP16 multiplier requires

$$
11 \times 10=\underline{110}
$$

FAs, the FP32 multiplier

$$
24 \times 23=\underline{552}
$$

FAs, and the FP64 multiplier

$$
53 \times 52=\underline{2756}
$$

FAs. The static power consumption of the three different multipliers is given in Table 2.2, normalized to the value of the FP16 multiplier, assuming all input bits equal '0'.

| Multiplier | Static Power | Normalized Static Power |
| :---: | :---: | :---: |
| FP16 | 4653.0 | 1.0 |
| FP32 | 23349.6 | 5.0 |
| FP64 | 116578.8 | 25.1 |

Table 2.2: Significand multipliers static power consumption.

As shown in Table 2.2, the 53-bit FP64 significand multiplier dissipates 25.1 times more static power than the 11-bit FP16 significand multiplier, and the 24 -bit FP32 significand multiplier 5 times more than the FP16 multiplier. Assuming dynamic dissipated power equals approximately $70 \%$ of total power consumption, dynamic power consumption is computed and given in Table 2.3, where the dynamic power is normalized to the static power dissipation of the FP16 significand multiplier.

| Multiplier | Dynamic Power | Normalized Dynamic Power |
| :---: | :---: | :---: |
| FP16 | 10857.0 | 2.3 |
| FP32 | 54482.4 | 11.7 |
| FP64 | 272017.2 | 58.5 |

Table 2.3: Significand multipliers dynamic power consumption.
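The entries of Table 2.2 and Table 2.3 follow directly from the Full-Adder counts and the per-Full-Adder figures of 42.3 and 98.7. The sketch below recomputes them; the function and variable names are hypothetical.

```python
FA_STATIC = 42.3    # normalized static power of one Full-Adder
FA_DYNAMIC = 98.7   # normalized dynamic power of one Full-Adder

SIGNIFICAND_BITS = {"FP16": 11, "FP32": 24, "FP64": 53}


def full_adders(n: int) -> int:
    """Full-Adder count of an n x n array multiplier, as counted in the text."""
    return n * (n - 1)


static = {f: full_adders(n) * FA_STATIC for f, n in SIGNIFICAND_BITS.items()}
dynamic = {f: full_adders(n) * FA_DYNAMIC for f, n in SIGNIFICAND_BITS.items()}

# Both tables are normalized to the static power of the FP16 multiplier.
for fmt in SIGNIFICAND_BITS:
    print(
        fmt,
        round(static[fmt], 1), round(static[fmt] / static["FP16"], 1),
        round(dynamic[fmt], 1), round(dynamic[fmt] / static["FP16"], 1),
    )
```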

This estimation methodology has several sources of error, which may lead to the wrong conclusions. The most severe source of error is probably the assumption that $95 \%$ of the gates are quiet during the simulations given in Figure 2.2. Floating-point multiplications are frequently performed in a graphic application, and in the proposed architectures $95 \%$ of the gates will not be quiet during computation. In addition, static power consumption is very technology dependent, and may be different for a low-power and a general purpose CMOS process, and may even vary between vendors. Because the leakage current simulations are taken from [1] and the static power ratio from [2], combining them may amplify the error, and lead to not choosing the best architecture for implementation given a set of area, power and throughput constraints.

### 2.2 Power Estimation

The architectures presented in [7] have different power consumptions, areas and throughputs. In Table 2.4, the static power dissipation for each of the four architectures is computed, assuming none of the significand multipliers are performing any computation.

$$
\begin{equation*}
\begin{aligned}
P_{\text {static }}= & \# \text { FP64 multipliers } \times P_{\text {static }}(\mathrm{FP64})+ \\
& \# \text { FP32 multipliers } \times P_{\text {static }}(\mathrm{FP32})+ \\
& \# \text { FP16 multipliers } \times P_{\text {static }}(\mathrm{FP16})
\end{aligned} \tag{2.1}
\end{equation*}
$$

The values in Table 2.4 are computed according to Equation 2.1, and the values are normalized to architecture four.

| Architecture | Static Power | Normalized Static Power |
| :---: | :---: | :---: |
| One | 345168.0 | 3.0 |
| Two | 279856.8 | 2.4 |
| Three | 144581.4 | 1.2 |
| Four | 116578.8 | 1.0 |

Table 2.4: Static power estimation of proposed architectures.
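Equation 2.1 can be evaluated with a few lines of code. The multiplier counts are those given for each architecture in Section 1.5 for a 256-bit input bus, and the per-multiplier values are taken from Table 2.2. The names `MULTIPLIERS` and `static_power` are hypothetical.

```python
# Static power per significand multiplier (Table 2.2).
P_STATIC = {"FP16": 4653.0, "FP32": 23349.6, "FP64": 116578.8}

# Significand multipliers per architecture, 256-bit input bus (Section 1.5).
MULTIPLIERS = {
    "One": {"FP64": 2, "FP32": 4, "FP16": 4},
    "Two": {"FP64": 2, "FP32": 2},
    "Three": {"FP64": 1, "FP32": 1, "FP16": 1},
    "Four": {"FP64": 1},
}


def static_power(arch: str) -> float:
    """Equation 2.1: multiplier count times static power, summed over types."""
    return sum(n * P_STATIC[fmt] for fmt, n in MULTIPLIERS[arch].items())


reference = static_power("Four")
for arch in MULTIPLIERS:
    power = static_power(arch)
    print(arch, round(power, 1), round(power / reference, 1))
```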

The total power consumption is given by both the static power consumption and the dynamic power consumption, where the dynamic power consumption is given by

$$
\begin{equation*}
\begin{aligned}
P_{\text {dynamic }}= & \# \text { FP64 multipliers } \times P_{\text {dynamic }}(\mathrm{FP64})+ \\
& \# \text { FP32 multipliers } \times P_{\text {dynamic }}(\mathrm{FP32})+ \\
& \# \text { FP16 multipliers } \times P_{\text {dynamic }}(\mathrm{FP16})
\end{aligned} \tag{2.2}
\end{equation*}
$$

and the total power consumption is given by

$$
\begin{equation*}
P_{\text {total }}=P_{\text {static }}+P_{\text {dynamic }} \tag{2.3}
\end{equation*}
$$

The methodology presented in Section 2.1 is simplified and inaccurate. However, the relative differences between the architectures evaluated in [7] and described in Section 1.5 are well highlighted by it. The static and dynamic power consumption computed in Table 2.2 and Table 2.3 are used to compute the total significand multiplier power consumption for each of the four architectures. In Table 2.5 and Table 2.6, the total power dissipated per cycle and the total power dissipated per multiplication are computed for the different supported formats. Total power per multiplication is important because if the input bus is reduced to 128-bit, two cycles are needed to compute the significands of an entire input vector for architecture one and two. It is also important to consider how the input data format affects power dissipation of the four architectures, because the input data distribution is unknown. It can only be assumed that the FP32 format is used more frequently than the FP16 and FP64 formats. This knowledge may be important when choosing architecture. Total power includes both static and dynamic power dissipation, where the values are normalized to the purely static power dissipation of architecture four.

| Data format | Architecture | Total Power | Normalized <br> Power per <br> Cycle | Normalized <br> Power per <br> Multiplication |
| :---: | :---: | :---: | :---: | :---: |
| FP16 | One | 388596.0 | 3.33 | 3.33 |
|  | Two | 932856.0 | 8.00 | 8.00 |
|  | Three | 155438.4 | 1.33 | 5.33 |
|  | Four | 388596.0 | 3.33 | 13.33 |
| FP32 | One | 563097.6 | 4.83 | 4.83 |
|  | Two | 932856.0 | 8.00 | 8.00 |
|  | Three | 199063.8 | 1.71 | 6.83 |
|  | Four | 388596.0 | 3.33 | 13.33 |
| FP64 | One | 889202.4 | 7.63 | 7.63 |
|  | Two | 823891.2 | 7.07 | 7.07 |
|  | Three | 416598.6 | 3.57 | 14.29 |
|  | Four | 388596.0 | 3.33 | 13.33 |

Table 2.5: Total power consumption, 256-bit input vector.
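The rows of Table 2.5 can be reproduced by combining Equations 2.1 to 2.3. The sketch below assumes that all installed multipliers leak, while only the multipliers used for the current format switch. This set of active multipliers is inferred from the table values and is not stated explicitly in the text. It further assumes that architecture three and four need four cycles per input vector, as stated for one product per cycle. All names are hypothetical.

```python
P_STATIC = {"FP16": 4653.0, "FP32": 23349.6, "FP64": 116578.8}    # Table 2.2
P_DYNAMIC = {"FP16": 10857.0, "FP32": 54482.4, "FP64": 272017.2}  # Table 2.3

# Installed significand multipliers, 256-bit input bus (Section 1.5).
INSTALLED = {
    "One": {"FP64": 2, "FP32": 4, "FP16": 4},
    "Two": {"FP64": 2, "FP32": 2},
    "Three": {"FP64": 1, "FP32": 1, "FP16": 1},
    "Four": {"FP64": 1},
}
CYCLES_PER_VECTOR = {"One": 1, "Two": 1, "Three": 4, "Four": 4}


def active(arch: str, fmt: str) -> dict:
    """Multipliers assumed to switch while format fmt is computed."""
    units = INSTALLED[arch]
    if arch in ("One", "Three"):
        return {fmt: units[fmt]}          # dedicated multipliers only
    if arch == "Two" and fmt != "FP64":
        return units                      # all four shared multipliers
    return {"FP64": units["FP64"]}        # FP64 multipliers only


def total_power(arch: str, fmt: str) -> float:
    """Equation 2.3: static power of all units plus dynamic power of active ones."""
    static = sum(n * P_STATIC[t] for t, n in INSTALLED[arch].items())
    dynamic = sum(n * P_DYNAMIC[t] for t, n in active(arch, fmt).items())
    return static + dynamic


REFERENCE = P_STATIC["FP64"]  # purely static power of architecture four
for fmt in ("FP16", "FP32", "FP64"):
    for arch in INSTALLED:
        per_cycle = total_power(arch, fmt) / REFERENCE
        per_mult = per_cycle * CYCLES_PER_VECTOR[arch]
        print(fmt, arch, round(total_power(arch, fmt), 1),
              round(per_cycle, 2), round(per_mult, 2))
```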

Figure 2.3 displays the differences in power consumption per cycle and power consumption per multiplication for the four architectures. It shows that in addition to reducing the overall chip area, reducing the input bus also reduces the power dissipated each clock cycle for architecture one and two. The total power dissipated by architecture three and four is unchanged by reducing the input bus, because the number of computational units is not reduced as it is for architecture one and two. However, even though total power consumption per cycle is reduced for architecture one and two, the power consumption per multiplication is not reduced, because an additional cycle is needed to compute an entire 256-bit input vector.

| Data format | Architecture | Total Power | Normalized <br> Power per <br> Cycle | Normalized <br> Power per <br> Multiplication |
| :---: | :---: | :---: | :---: | :---: |
| FP16 | One | 194298.0 | 1.67 | 3.33 |
|  | Two | 466428.0 | 4.00 | 8.00 |
|  | Three | 155438.4 | 1.33 | 5.33 |
|  | Four | 388596.0 | 3.33 | 13.33 |
| FP32 | One | 281548.8 | 2.42 | 4.83 |
|  | Two | 466428.0 | 4.00 | 8.00 |
|  | Three | 199063.8 | 1.71 | 6.83 |
|  | Four | 388596.0 | 3.33 | 13.33 |
| FP64 | One | 444601.2 | 3.81 | 7.63 |
|  | Two | 411945.6 | 3.53 | 7.07 |
|  | Three | 416598.6 | 3.57 | 14.29 |
|  | Four | 388596.0 | 3.33 | 13.33 |

Table 2.6: Total power consumption, 128-bit input vector.

The relative differences in power consumption of the four architectures are well highlighted in Figure 2.3. Architecture three and four dissipate the least amount of power per cycle, but suffer from high total power consumption when computing an entire input vector compared to architecture one and two. Because only one product is computed each cycle, four cycles are needed to compute an entire input vector. Architecture one dissipates slightly more power than architecture three per cycle assuming a 128-bit input bus, but has significantly lower total power consumption when an entire input vector is considered. Architecture one has the lowest power consumption per multiplication for all data formats except FP64. Because the FP32 format is assumed to be the most used data format, this should be an important consideration when choosing the architectures to implement. Total power consumption per multiplication is more important to consider than power dissipation per cycle. Because the rounding and exception logic, which is a significant part of the architectures, is not considered when computing power consumption, the relative differences may be greater or smaller.

### 2.3 Area Estimation

Figure 2.3: Architecture power comparison.

Area estimations are performed following the methodology described in [7]. The number of Full-Adders and equivalent 1-bit register cells are used to compute the total area requirements. Control logic and additional computational logic require little area compared to the significand multipliers, exponent adders and registers. The rounding logic differs somewhat for the architectures evaluated. Architecture one and two require additional rounding logic due to parallel computing of product vectors. The ratio of transistors required by the Full-Adder model in Figure 2.1 and the register model presented in [7] is given by Equation 2.4.

$$
\begin{equation*}
\text { transistor ratio }=\frac{\# \text { transistors in } F A}{\# \text { transistors in register }}=\frac{40}{36}=\underline{1.11} \tag{2.4}
\end{equation*}
$$

| Architecture | Input-bus | \# FA-cells | Eq. register-size | \# Transistors |
| :---: | :---: | :---: | :---: | :---: |
| One | 256-bit | 8160 | 924 | 9990.7 |
|  | 128-bit | 4080 | 530 | 5063.3 |
| Two | 256-bit | 6616 | 1134 | 8485.1 |
|  | 128-bit | 3308 | 635 | 4310.6 |
| Three | 256-bit | 3418 | 612 | 4409.8 |
|  | 128-bit | 3418 | 484 | 4281.8 |
| Four | 256-bit | 2756 | 612 | 3674.2 |
|  | 128-bit | 2756 | 484 | 3546.2 |

Table 2.7: Architecture area comparison, FA-cells and equivalent register size.
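The last column of Table 2.7 appears to be obtained by weighting the FA-cells with the transistor ratio of Equation 2.4 and adding the equivalent register cells. This combination rule is inferred from the table values rather than stated in the text, so the sketch below should be read under that assumption. The names are hypothetical.

```python
FA_TRANSISTORS = 40
REGISTER_TRANSISTORS = 36


def equivalent_area(fa_cells: int, register_cells: int) -> float:
    """FA-cells scaled by the ratio of Equation 2.4, plus register cells."""
    return fa_cells * FA_TRANSISTORS / REGISTER_TRANSISTORS + register_cells


# (architecture, input bus width) -> (FA-cells, equivalent register size)
ROWS = {
    ("One", 256): (8160, 924), ("One", 128): (4080, 530),
    ("Two", 256): (6616, 1134), ("Two", 128): (3308, 635),
    ("Three", 256): (3418, 612), ("Three", 128): (3418, 484),
    ("Four", 256): (2756, 612), ("Four", 128): (2756, 484),
}

for (arch, bus), (fa, regs) in ROWS.items():
    print(arch, bus, round(equivalent_area(fa, regs), 1))
```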


Figure 2.4: Architecture area comparison.

Figure 2.4 illustrates the area usage of the different architectures as a function of required transistors as presented in Table 2.7. Figure 2.4 shows that for a 256-bit input bus, architecture one requires more than twice as many transistors as architecture three and four, and architecture two almost twice as many as architecture three. For a 128-bit input bus, the relative differences are much smaller, at most approximately 1500 transistors (architecture one compared to architecture four). The area reduction of architecture one and two is large compared to architecture three and four, because the number of computational units such as significand multipliers and exponent adders is reduced, while only the number of equivalent 1-bit register cells is reduced in architecture three and four. From an area point of view, a 128-bit input bus is favored.

Because logic not included in this area estimation methodology differs somewhat for the different architectures, this is a source of error. The largest computational unit not considered in this methodology is the rounding and exception unit, and because this unit is larger, and equal, for architecture one and two compared to architecture three and four, the differences will be greater than displayed in Figure 2.4. However, the relative difference in area usage of the proposed architectures is still well highlighted, because the rounding and exception logic is small compared to the significand multipliers.

### 2.4 Performance Estimation

Performance is measured by clock frequency and data processed each clock cycle. The maximum clock frequency will be approximately equal for all architectures, and is determined by the critical path delay. The clock frequency is given by the inverse of the delay through the 53-bit significand multiplier.

The data processed each cycle, or the throughput, is determined by the ability to process data in parallel. The architectures described in [7] have different throughputs and latencies. Throughput is measured as the number of products computed each clock cycle. Reducing the input bus also reduces the throughput for architecture one and two, but not for architecture three and four. The ARM 3D graphic solutions typically run at a clock frequency of 300 MHz. Assuming a clock frequency of 300 MHz and a 256-bit input bus, the throughput of architecture one and two will be

$$
256 \text { bit } \times 300 \mathrm{MHz}=76800 \frac{\mathrm{Mbit}}{s}
$$

and for architecture three and four

$$
64 \text { bit } \times 300 \mathrm{MHz}=19200 \frac{\mathrm{Mbit}}{s}
$$

If the input bus is reduced to 128 -bit, the throughput of architecture one and two will become

$$
128 \text { bit } \times 300 \mathrm{MHz}=38400 \frac{\mathrm{Mbit}}{s}
$$

and for architecture three and four the throughput will be unchanged.

The computations above show that architecture one and two have higher throughput than architecture three and four. If the input bus is reduced to 128-bit, architecture one and two still have higher throughput, but it is reduced by $50 \%$ compared to a 256-bit input bus, while the throughput of architecture three and four remains the same.
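The throughput figures above are simply the number of bits processed per cycle multiplied by the clock frequency. A minimal sketch, with hypothetical names and the 300 MHz clock assumed in the text:

```python
CLOCK_MHZ = 300

# Bits processed per clock cycle for each architecture group and input bus width.
BITS_PER_CYCLE = {
    ("One/Two", 256): 256,
    ("One/Two", 128): 128,
    ("Three/Four", 256): 64,
    ("Three/Four", 128): 64,
}


def throughput_mbit_per_s(bits_per_cycle: int, clock_mhz: int = CLOCK_MHZ) -> int:
    """Throughput in Mbit/s: bits per cycle times clock frequency in MHz."""
    return bits_per_cycle * clock_mhz


for (group, bus), bits in BITS_PER_CYCLE.items():
    print(group, bus, throughput_mbit_per_s(bits))
```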

Latency is in this context defined as the number of clock cycles from when a vector arrives at the input until the product vector is ready at the output. The latencies for the different architectures are given in Figure 2.5.


Figure 2.5: Architecture latency comparison.

The delay through the 53-bit significand multiplier is equal to the delay through 106 full-adder cells, assuming the multiplier is implemented as an array multiplier. For a typical low-power 65 nm CMOS process the delay through one full-adder cell equals 0.11 ns, which gives a maximum clock frequency of 90.9 MHz. To achieve higher clock frequencies, the significand multipliers must be implemented using a faster multiplier scheme. The Dadda or Wallace multiplier, with or without Booth recoded input, will achieve this as described in Section 1.3. In the power and area estimations, the significand multipliers are assumed to be implemented as array multipliers. However, changing the significand multiplier scheme does not change the relative difference between the architectures, as long as the change is equal for all four architectures.

### 2.5 Trade-Off Considerations

When choosing the architecture to implement, design constraints have to be considered. Because throughput is very important in a graphic application, throughput should be kept as high as possible. In a handheld, battery powered device, area and power are also very important. Hence, the decision of which architectures to implement should be a trade-off between area, power and throughput. A weight-function could be used to help the decision, where area usage, power consumption and throughput are weighted according to importance. However, because total power consumption has a static and a dynamic component, where the dynamic component depends on which format is being computed and the static component is directly related to area usage, the weight-function can become complex. In addition, the data format distribution may vary from user to user, which makes the decision even harder. The FP32 format is nevertheless expected to be the most used data format, and should therefore be weighted as more important than the FP16 and FP64 formats.

The area and power estimation methodologies contain error sources, such as logic not considered and the assumption of quiet gates in the static power calculation described in Section 2.1, and this should be kept in mind when choosing an architecture. For the same reason, two architectures should be implemented and compared to see how well the area and power methodologies predicted the relative differences in area usage and power consumption.
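One possible form of the weight-function mentioned above is sketched below. It is purely illustrative: the cost definition, the restriction to FP32 power and the equal weights are all assumptions made for this example, not a method used in the thesis. The input values are the 128-bit estimates from Table 2.6, Table 2.7 and Section 2.4.

```python
# Estimates for a 128-bit input bus: area (Table 2.7), normalized FP32 power
# per multiplication (Table 2.6), throughput in Mbit/s at 300 MHz (Section 2.4).
ESTIMATES = {
    "One": {"area": 5063.3, "power": 4.83, "throughput": 38400},
    "Two": {"area": 4310.6, "power": 8.00, "throughput": 38400},
    "Three": {"area": 4281.8, "power": 6.83, "throughput": 19200},
    "Four": {"area": 3546.2, "power": 13.33, "throughput": 19200},
}

# Hypothetical weights; equal importance is assumed purely for illustration.
WEIGHTS = {"area": 1.0, "power": 1.0, "throughput": 1.0}


def cost(arch: str) -> float:
    """Weighted sum of metrics, each scaled so that the best value scores 1."""
    e = ESTIMATES[arch]
    best_area = min(v["area"] for v in ESTIMATES.values())
    best_power = min(v["power"] for v in ESTIMATES.values())
    best_throughput = max(v["throughput"] for v in ESTIMATES.values())
    return (
        WEIGHTS["area"] * e["area"] / best_area
        + WEIGHTS["power"] * e["power"] / best_power
        + WEIGHTS["throughput"] * best_throughput / e["throughput"]
    )


# Lower cost is better.
ranking = sorted(ESTIMATES, key=cost)
print([(arch, round(cost(arch), 2)) for arch in ranking])
```

Other weights, or a power term that mixes the three formats according to an assumed data distribution, can change the ordering, which is the complexity the text refers to.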

## Chapter 3

## Implementation

An IEEE compliant, pipelined, vectorized floating-point multiplier is to be implemented at RTL for testing and synthesis. In Section 3.1, two architectures are selected for implementation based on the analysis and trade-off considerations performed in Chapter 2. Section 3.2 presents the implemented architectures, describes the differences between them, and provides user information. Section 3.3 describes the testing and simulation, and what has been tested.

### 3.1 Choosing Architecture

The width of the input bus affects area usage, power consumption and throughput for the evaluated architectures. Area and power consumption can be significantly reduced if the input bus is reduced from 256-bit to 128-bit. However, this lowers the throughput and increases the latency. The total power consumed by computing an entire input vector does not change if the input bus is reduced from 256-bit to 128-bit, following the assumptions made in the methodologies presented in Chapter 2. The total energy consumption may be reduced somewhat if the input data is highly correlated, but this cannot be assumed. By reducing the input bus, both area and power consumption are reduced significantly for architecture one and two. Area is slightly reduced for architecture three and four as well. Table 3.1 presents a summary of estimated area usage, power consumption, latency and throughput (at 300 MHz) for architecture one, two, three and four, assuming a 128-bit input. Power is presented for only FP16 computations, only FP32 computations and only FP64 computations, where the power dissipated by computing an entire input vector is considered.

|  | Architecture One | Architecture Two | Architecture Three | Architecture Four |
| :--- | :---: | :---: | :---: | :---: |
| Area | 5063.3 | 4310.6 | 4281.8 | 3546.2 |
| FP16 Power | 3.33 | 8.00 | 5.33 | 13.33 |
| FP32 Power | 4.83 | 8.00 | 6.83 | 13.33 |
| FP64 Power | 7.63 | 7.07 | 14.29 | 13.33 |
| Throughput (Mbit/s) | 38400 | 38400 | 19200 | 19200 |
| Latency (cycles) | 5 | 5 | 6 | 6 |

Table 3.1: Trade-off considerations.

Total power consumption for computing an entire input vector and area are the most important criteria when choosing an architecture to implement, in addition to the throughput. But, as seen from Table 3.1, the dissipated power depends on which format is being computed, and the input data format distribution should be considered when choosing architecture. Architecture one has lower power consumption than the other architectures for only FP16 and FP32 computations. But when only FP64 computations are performed, architecture two has lower power consumption than architecture one. This is because of the static power dissipated by the significand multipliers in architecture one. If static power is modeled too high, architecture one might have lower power consumption than architecture two for only FP64 computations as well. Architecture one and two have lower latency and higher throughput than architecture three and four. Architecture one and two have larger area than architecture three and four, but architecture four suffers from significantly higher power consumption. Architecture three has higher power consumption than architecture one for all input formats, but lower than architecture two for FP16 and FP32 input data.

Based on the analysis above, and the estimations performed in Sections 2.2, 2.3 and 2.4, the input bus should be 128-bit and architecture one should be implemented to balance area against power consumption, while keeping a relatively high throughput. Because only power dissipated in the multipliers is considered, and because of the sources of error discussed in Chapter 2, the differences may be greater or smaller due to power dissipated in registers, logic not considered, and fan-out effects in multiplexers. To see if the analysis made in Chapter 2 is accurate enough to make a correct implementation decision, given a set of constraints, architecture one and two should be implemented and compared concerning area and power.

### 3.2 Vectorized Floating-Point Multiplier

Two partially IEEE compliant, vectorized floating-point multipliers have been implemented. Architecture one and two were selected for implementation in RTL. The vectorized floating-point multipliers do not support denormalized inputs. If denormalized input vectors are provided to the floating-point multiplier, these are treated as zero. Otherwise, the floating-point multiplier complies with the IEEE 754 specification concerning delivering the correct result and exception generation.
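The treatment of denormalized inputs can be modeled at bit level as follows. This is a behavioral sketch, not the RTL: it assumes that a denormalized operand is replaced by a zero of the same sign, which the text does not specify. `FIELDS` and `flush_denormal` are hypothetical names.

```python
# (exponent bits, fraction bits) of the IEEE 754 binary16/32/64 formats.
FIELDS = {"FP16": (5, 10), "FP32": (8, 23), "FP64": (11, 52)}


def flush_denormal(bits: int, fmt: str) -> int:
    """Replace a denormalized encoding by a zero with the same sign bit."""
    exp_bits, frac_bits = FIELDS[fmt]
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    if exponent == 0 and fraction != 0:
        # Denormalized number: keep only the sign bit (assumption).
        return bits & (1 << (exp_bits + frac_bits))
    return bits


print(hex(flush_denormal(0x80000001, "FP32")))  # smallest negative FP32 denormal
print(hex(flush_denormal(0x3F800000, "FP32")))  # 1.0 is passed through unchanged
```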

The general block diagram of the vectorized floating-point multiplier top module is given in Figure 3.1.


Figure 3.1: Vectorized floating-point multiplier block diagram.

### 3.2.1 Inputs

The vectorized floating-point multipliers have five inputs, vectors, format, mode, clear and start in addition to clock and reset inputs. The format input tells the floating-point multiplier which format to compute, FP16, FP32 or FP64, and the mode input tells which rounding mode to apply. The clear input is used to clear exceptions, and the start input tells the floating-point multiplier that vectors are ready at the input. start must be kept high as long as input vectors are ready at the input. Input vectors should be given as shown in Figure 3.2 and Figure 3.3.

$$
\begin{array}{|l|l|l|l|}
\hline \text { B1 } & \text { B0 } & \text { A1 } & \text { A0 } \\
\hline
\end{array}
$$

Figure 3.2: First input vector layout.

$$
\begin{array}{|l|l|l|l|}
\hline \text { D1 } & \text { D0 } & \text { C1 } & \text { C0 } \\
\hline
\end{array}
$$

Figure 3.3: Second input vector layout.

Because the input bus is 128-bit and the input vector for the FP32 and FP64 formats is 256-bit, the input vector has to be provided in two cycles, where A0, A1, B0 and B1 should be given in the first cycle, and C0, C1, D0 and D1 should be given in the second cycle. The same has been done for the FP16 format; therefore, the upper 64 bits of the input vector should be set to zero when FP16 computations are performed.
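A testbench could assemble the bus word of one cycle as sketched below. The sketch assumes that the rightmost element of Figure 3.2 and Figure 3.3 (A0 or C0) occupies the least significant bits, which is consistent with the upper 64 bits being unused for FP16 but is not stated explicitly. It covers the four-element FP16 and FP32 cases only. `pack_cycle` is a hypothetical helper.

```python
ELEMENT_BITS = {"FP16": 16, "FP32": 32}


def pack_cycle(elements: list, fmt: str) -> int:
    """Pack [e0, e1, e2, e3] into one 128-bit bus word, e0 in the lowest bits.

    For the first cycle the elements are [A0, A1, B0, B1],
    for the second cycle [C0, C1, D0, D1].
    """
    width = ELEMENT_BITS[fmt]
    word = 0
    for index, value in enumerate(elements):
        if not 0 <= value < (1 << width):
            raise ValueError("element does not fit the selected format")
        word |= value << (index * width)
    return word


# Four FP16 encodings of 1.0 occupy only the lower 64 bits of the bus.
first_cycle = pack_cycle([0x3C00, 0x3C00, 0x3C00, 0x3C00], "FP16")
print(hex(first_cycle))
```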

Data formats and rounding modes are encoded as given in Table 3.2 and Table 3.3 respectively.

| Format | Encoding |
| :---: | :---: |
| FP16 | 00 |
| FP32 | 01 |
| FP64 | 10 |

Table 3.2: Format encoding.

| Mode | Encoding |
| :---: | :---: |
| Round-to-nearest even | 00 |
| Round-to-plus infinity | 01 |
| Round-to-minus infinity | 10 |
| Round-to zero | 11 |

Table 3.3: Rounding modes encoding.
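The encodings in Table 3.2 and Table 3.3 can be written down as enumerations, which is convenient when driving a simulation model from a script. The class and member names below are illustrative and not taken from the RTL.

```python
from enum import IntEnum


class Format(IntEnum):
    """Data format encoding, Table 3.2."""
    FP16 = 0b00
    FP32 = 0b01
    FP64 = 0b10


class Mode(IntEnum):
    """Rounding mode encoding, Table 3.3."""
    NEAREST_EVEN = 0b00
    PLUS_INFINITY = 0b01
    MINUS_INFINITY = 0b10
    ZERO = 0b11
```

The encoding 11 is unused for the format input, so `Format(0b11)` raises a `ValueError` in this sketch.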

Exceptions should be cleared by setting the corresponding bits on the clear input bus to one. The layout of the clear register is given in Figure 3.4.

| Underflow |  | Overflow |  | Inexact |  | Invalid |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 |

Figure 3.4: Clear register layout.

invalid[0] corresponds to the product $A 0 \times A 1$, invalid[1] to the product $B 0 \times B 1$, invalid[2] to the product $C 0 \times C 1$ and invalid[3] to the product $D 0 \times D 1$. The same holds for the inexact, overflow and underflow exceptions, except that the index is offset as shown in Figure 3.4. However, this functionality has not been implemented properly, and exceptions are not cleared as specified in the IEEE 754 standard.

### 3.2.2 Outputs

The vectorized floating-point multiplier has three outputs, products, exceptions and ready. The ready output is set to one whenever a product vector is ready at the output. Products are laid out as given in Figure 3.5.

| $D 1 \times D 0$ | $C 1 \times C 0$ | $B 1 \times B 0$ | $A 1 \times A 0$ |
| :--- | :--- | :--- | :--- |

Figure 3.5: Product vector layout.

The exception layout is exactly the same as the clear register layout, as given in Figure 3.6.

| Underflow |  | Overflow |  | Inexact |  | Invalid |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 |

Figure 3.6: Exception register layout.

invalid[0], inexact[4], overflow[8] and underflow[12] correspond to the product $A 0 \times A 1$. Exceptions for the products $B 0 \times B 1, C 0 \times C 1$ and $D 0 \times D 1$ are found by incrementing the index.
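The bit positions in Figure 3.6 can be summarized by a small decoder for the 16-bit exception register. The same indexing gives the mask to write on the clear input. The names `decode_exceptions` and `clear_mask`, and the product labels A to D, are hypothetical helpers for illustration.

```python
FLAG_BASE = {"invalid": 0, "inexact": 4, "overflow": 8, "underflow": 12}
PRODUCTS = ("A", "B", "C", "D")  # A0 x A1, B0 x B1, C0 x C1, D0 x D1


def decode_exceptions(reg: int) -> dict:
    """Split the 16-bit exception register into per-product flags."""
    result = {}
    for index, product in enumerate(PRODUCTS):
        result[product] = {
            flag: bool((reg >> (base + index)) & 1)
            for flag, base in FLAG_BASE.items()
        }
    return result


def clear_mask(product: str, flags) -> int:
    """Build the value for the clear input that addresses the given flags."""
    index = PRODUCTS.index(product)
    mask = 0
    for flag in flags:
        mask |= 1 << (FLAG_BASE[flag] + index)
    return mask
```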

A typical scenario with only one input vector pair is given in Figure 3.7.


Figure 3.7: Vectorized floating-point multiplier simple timing diagram.

### 3.2.3 Architecture Description

The architectures have some minor changes from the ones described in [7]. These changes do not affect the relative differences between the area, power and performance estimations performed in Chapter 2 for the two implemented architectures. Figure 3.8 shows a more detailed architecture diagram than the one provided in [7].


Figure 3.8: Vectorized floating-point multiplier architecture drawing.

## Building Blocks

The major building blocks in the design are the select input demultiplexer, the sign unit, exponent unit, multiplier unit, check special unit, rounding and exception unit and the select output demultiplexer.

The select input demultiplexer provides the sign unit, exponent unit and multiplier unit registers with correct data, and selects parts of the input registers based on which data format is being computed. The sign unit, exponent unit and multiplier unit compute the resulting signs, exponents and significands respectively. The check special unit checks for special inputs like NaNs, infinities and zeroes, which are used by the rounding and exception unit to generate the correct result and exceptions. The select output demultiplexer selects which part of the exception registers and output registers to load, in addition to setting the correct value of the ready register. These units are equal for both implemented architectures.

There are some differences between the architectures of the two implementations. In architecture two, the exponent unit and the multiplier unit need to know which format is being computed, in addition to the actual content of the building blocks as described in Section 1.5. As described in [7], the bus width and register size between the multiplier unit and the computed significands register differ between the two architectures. In architecture one this is 106 bits, and in architecture two this is 154 bits, because the combined width of the products of the 53-bit significand multiplier (106 bits) and the 24-bit significand multiplier (48 bits) equals 154 bits. This also implies a wider bus between the computed signs, exponents and significands registers and the rounding and exception unit. The rounding and exception unit differs for the two architectures, and will be discussed later.
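The two bus widths follow from the significand widths of the formats. The sketch below reproduces the arithmetic under the assumption that in architecture one only one result group (one FP64, two FP32 or two FP16 products) drives the bus at a time, while in architecture two the outputs of the 53-bit and the 24-bit multiplier sit side by side.

```python
# Significand widths including the hidden bit.
SIGNIFICAND_BITS = {"FP16": 11, "FP32": 24, "FP64": 53}


def product_bits(fmt: str) -> int:
    """Width of the full product of two significands."""
    return 2 * SIGNIFICAND_BITS[fmt]


# Architecture one: the widest result group determines the bus width.
arch_one_bus = max(
    1 * product_bits("FP64"),
    2 * product_bits("FP32"),
    2 * product_bits("FP16"),
)

# Architecture two: 53-bit and 24-bit multiplier outputs concatenated.
arch_two_bus = product_bits("FP64") + product_bits("FP32")
```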

## Exponent Unit

In the exponent unit, exponent adders and subtractors are implemented using carry-lookahead adders in the DesignWare ${ }^{\circledR}$ library from Synopsys [17]. One adder computes the sum of the two exponents, and a subtractor computes the sum minus the bias. The exponent unit is different in architecture one and two. In architecture one, one 11-bit adder and subtractor, two 8-bit adders and subtractors and two 5-bit adders and subtractors are used for computing the resulting exponents of the FP64, FP32 and FP16 formats respectively. The exponent unit of architecture one is given in Figure 3.9.

In architecture two, one 11-bit adder and subtractor and one 8-bit adder and subtractor are used to compute the exponents. The 11-bit subtractor supports subtraction of all FP16, FP32 and FP64 bias values, and the 8-bit subtractor supports subtraction of FP16 and FP32 bias values. The exponent unit of architecture two is given in Figure 3.10.

Figure 3.9: Architecture one exponent unit.

Figure 3.10: Architecture two exponent unit.

The output bus from the exponent unit includes four extra bits in addition to the actual exponents. These are overflow bits from the exponent additions and bias subtractions, used by the rounding and exception unit to generate correct exceptions, and are equal for both architectures. An input demultiplexer selects the input bits from the exponent register to supply the correct adder, and an output multiplexer puts the result from the correct adder on the output bus.
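The add-then-subtract-bias structure of the exponent unit can be sketched as follows. The two range flags returned here are a simplification chosen for illustration; they do not model the four overflow bits of the actual output bus, whose exact meaning is not given in this section.

```python
EXP_BITS = {"FP16": 5, "FP32": 8, "FP64": 11}


def exponent_unit(exp_a: int, exp_b: int, fmt: str):
    """Return the biased result exponent before normalization, plus range flags."""
    n = EXP_BITS[fmt]
    bias = (1 << (n - 1)) - 1        # 15, 127 or 1023
    total = exp_a + exp_b            # adder: sum of the biased exponents
    result = total - bias            # subtractor: remove one bias
    max_normal = (1 << n) - 2        # largest exponent of a finite number
    too_large = result > max_normal  # candidate for overflow
    too_small = result < 1           # candidate for underflow
    return result, too_large, too_small
```

For example, multiplying two FP32 numbers with biased exponent 127 (values in the range [1, 2)) gives a biased result exponent of 127 before any post-normalization shift.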

## Multiplier Unit

The significand multipliers are implemented as unsigned parallel-prefix multipliers provided by the DesignWare ${ }^{\circledR}$ datapath and building block IP library [18] from Synopsys, to obtain clock frequencies higher than the 90.9 MHz estimated in Section 2.4. This implementation is flexible, and dynamically generated based on context, e.g., area and timing constraints, and technology library. It exploits the characteristics of different implementations and generates the optimal architecture [19]. The content of the multiplier unit differs for the two architectures. In architecture one, one 53-bit, two 24-bit and two 11-bit unsigned multipliers are used to compute the significands of the FP64, FP32 and FP16 data formats respectively. The significand multiplier unit of architecture one is given in Figure 3.11.

In architecture two, one 53-bit multiplier and one 24-bit multiplier are used to compute the significands. The 53-bit multiplier is used to compute the resulting significand of all formats, and the 24-bit multiplier is used to compute the resulting significand of the FP32 and FP16 formats. The significand multiplier unit of architecture two is given in Figure 3.12.

An input demultiplexer selects which bits should go to which multiplier. In architecture one, an output multiplexer selects which multiplier group result, FP16, FP32 or FP64, should be put on the output bus. In architecture two, the significands are extended to fit the width of the 53-bit and 24-bit multiplier input buses. Zeroes are appended as least significant bits to avoid shifting or demultiplexing in the rounding and exception unit.
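The effect of appending zeroes as least significant bits can be checked with plain integer arithmetic: the narrow product then appears in the most significant bits of the wide product, so no shift is needed afterwards. The helper names are hypothetical, and the sketch only models the arithmetic, not the multiplier implementation.

```python
def extend(significand: int, width: int, unit_width: int) -> int:
    """Append zeroes as least significant bits to fill a wider multiplier input."""
    return significand << (unit_width - width)


def multiply_on_wide_unit(a: int, b: int, width: int, unit_width: int):
    """Multiply two narrow significands on a wider multiplier.

    Returns the top 2*width bits of the wide product and the remaining low bits.
    """
    wide = extend(a, width, unit_width) * extend(b, width, unit_width)
    pad = 2 * (unit_width - width)
    return wide >> pad, wide & ((1 << pad) - 1)
```

With two 11-bit FP16 significands on the 53-bit multiplier, the upper 22 bits of the 106-bit result equal the 11-bit by 11-bit product and the lower 84 bits are zero.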

## Rounding and Exception Unit

In the original architecture proposals in [7], the rounding and exception unit is equal for all four architectures. However, this has been implemented differently in architecture one and two to better highlight the differences between them concerning power. In architecture one, specialized rounding units for each format are used, as for the multiplier unit and exponent unit: one rounding and exception block handles the FP64 format, two rounding and exception blocks handle the FP32 format and two handle the FP16 format. The rounding and exception unit in architecture one is given in Figure 3.13. In architecture two, one rounding and exception block handles every format, and one handles the FP32 and FP16 formats. A simple rounding algorithm has been implemented. The rounding and exception unit of architecture two is given in Figure 3.14.

Figure 3.11: Architecture one significand multiplier unit.

Figure 3.12: Architecture two significand multiplier unit.


Figure 3.13: Architecture one rounding and exception unit.

The implemented rounding algorithm is basically the same as the one presented in Section 1.1, except that a demultiplexer is used in the post-normalizing step to select the appropriate significand bits, as in the simple algorithm presented in [20]. Rounding could be performed faster and more efficiently if, for example, the QFT algorithm presented in [20] were used. However, this requires a significand multiplier that outputs the sum and carry vectors as separate carry-save encoded vectors. The four rounding modes, round-to-nearest even, round-to positive infinity, round-to negative infinity and round-to zero, have been reduced to three, round-to-nearest even, round-to infinity and round-to zero, as in [21]. Round-to positive infinity, round-to negative infinity and round-to zero can be reduced to round-to infinity and round-to zero based on the sign, as given in Table 3.4.


Figure 3.14: Architecture two rounding and exception unit.

| IEEE Rounding mode | Positive Number | Negative Number |
| :---: | :---: | :---: |
| Round-to-nearest even | Round-to-nearest even | Round-to-nearest even |
| Round-to positive infinity | Round-to infinity | Round-to zero |
| Round-to negative infinity | Round-to zero | Round-to infinity |
| Round-to zero | Round-to zero | Round-to zero |

Table 3.4: Rounding mode reduction.
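Table 3.4 translates directly into a small function of the rounding mode and the sign of the result. The mode strings below are illustrative labels, not signal names from the design.

```python
def reduce_mode(mode: str, negative: bool) -> str:
    """Reduce the four IEEE rounding modes to three, following Table 3.4."""
    if mode in ("nearest_even", "zero"):
        return mode  # independent of the sign
    if mode not in ("plus_infinity", "minus_infinity"):
        raise ValueError(f"unknown rounding mode: {mode}")
    # Directed rounding goes away from zero only when it points in the
    # direction of the sign of the result.
    away_from_zero = (mode == "minus_infinity") == negative
    return "infinity" if away_from_zero else "zero"
```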

In the rounding and exception unit, a demultiplexer supplies the different rounding and exception blocks with signs, exponents and significands computed in their respective units from their registers, as well as information computed in the check special unit.

### 3.3 Testing and Simulation

Testing has been performed using the open source Verilog simulator and synthesis tool, Icarus Verilog [22]. Test cases have been generated using the C-code in Appendix C. "Random" floating-point numbers are created, and special values are included "randomly" to ensure simulation of exceptional cases like NaN times any number, and zero times infinity. 500,000 test cases have been simulated for both architectures and for all supported data formats and rounding modes.

### 3.3.1 Reference Circuit

The DesignWare ${ }^{\circledR}$ library from Synopsys provides a simulation model of a fully IEEE compliant floating-point multiplier [23, 24]. This has been used to create a vectorized version that computes four products in parallel. The block diagram of the DesignWare vectorized floating-point multiplier is given in Figure 3.15.

The Verilog code for the DesignWare vectorized floating-point multiplier can be found in Appendix D.1. In addition, because the DesignWare floating-point multiplier supports denormalized numbers, the output is set to zero if the product is denormalized, and an inexact exception is generated. The correctness of the DesignWare vectorized multiplier can easily be verified by looking at the code.


Figure 3.15: DW_vec_fp_mult block diagram.

### 3.3.2 Simulations

Whenever a product is ready, the computed product and exceptions are compared to the product vector and exceptions computed by the DesignWare floating-point multiplier. The testbench used for simulating the two architectures can be found in Appendix D.2.

Both architectures have been tested with FP16, FP32 and FP64 input vectors. For each format, the rounding modes round-to-nearest even, round-to positive infinity, round-to negative infinity and round-to zero have been tested with 500,000 test cases. The testbench does not change data format or rounding mode during simulation; this is believed to work, but it has to be tested to verify correct operation of the implemented architectures. However, the emphasis of this assignment is not verification, but rather highlighting the differences between the architectures concerning power and area. When finished, the testbench used for simulation prints statistics about input vectors, output vectors and exceptions generated, to ensure that every exceptional case has been covered by input vectors during simulation. In addition, behavioral testing has been performed at module level, to ensure correct behavior of lower level modules such as the rounding and exception unit, exponent unit, demultiplexers etc.

One error in the rounding unit has been detected with the FP16 format in round-to-nearest even and round-to positive infinity mode. This error is believed to be format independent, but has not been detected when the FP32 and FP64 formats have been tested. The error arises in post-normalization when the result should be rounded to the smallest representable normalized number, but is flushed to zero instead. This error has not been corrected because the emphasis of this thesis lies on power and area comparison, and on choosing the best architecture to implement, given a set of constraints. The correction of this error will probably not cause any significant increase in area or power consumption.
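The boundary case behind this error can be reproduced with a software conversion to FP16. The sketch below uses Python's `struct` module, which rounds to nearest even, to show the expected result for a value just below the smallest normalized FP16 number. It illustrates the expected IEEE behaviour only; it does not model the implemented rounding unit, and the chosen test value is an example.

```python
import struct

MIN_NORMAL_FP16 = 2.0 ** -14  # smallest normalized FP16 value, pattern 0x0400


def to_fp16_bits(value: float) -> int:
    """Round a Python float to FP16 (round-to-nearest even) and return the bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


# A product slightly below the smallest normalized number. Its significand
# rounds up to 10.0...0, so post-normalization should yield 1.0 x 2^-14.
just_below = MIN_NORMAL_FP16 - 2.0 ** -27
expected_bits = to_fp16_bits(just_below)
```

Here `expected_bits` is the bit pattern of the smallest normalized number, whereas the error described above produces zero for this kind of input.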

## Chapter 4

## Synthesis Results


Synopsys Design Compiler ${ }^{\text {TM }}$ [25] and Power Compiler ${ }^{\text {TM }}[26]$ are used to synthesize the designs, and perform area, timing and power analysis. A typical general purpose low-power standard cell library is used to map the design into a 65 nm technology, and a general purpose standard cell library to map the design into a 90 nm technology. Because the 65 nm library is a low-power library, and the 90 nm library is a general purpose library, somewhat different power results are expected. However, this will in addition highlight the effect of the target technology in which the architectures are realized.


This chapter will first present how Synopsys Design Compiler ${ }^{\text {TM }}$ and Power Compiler ${ }^{\mathrm{TM}}$ calculate power, how to capture switching activity in the implemented architectures, and how design constraints are set to optimize the result, in Section 4.1. The power consumption and area usage of architecture one is presented in Section 4.2. In Section 4.3 power consumption and area usage of architecture two will be presented. In Section 4.4 the power consumption of the two architectures will be compared, and in Section 4.5 area usage of the two architectures will be compared.

### 4.1 Synopsys $^{\circledR}$

It is important to understand how Synopsys models and computes power to obtain useful information from the synthesis reports. The following descriptions of how static and dynamic power are computed are taken from the Power Products Reference Manual [27] by Synopsys. The power analysis tool calculates and reports power based on equations given in [27]. DesignPower and Power Compiler ${ }^{\mathrm{TM}}$ use these equations and information modeled in the technology library to evaluate the power of the design.

### 4.1.1 Static Power

Static power is the power dissipated by a gate when it is not switching. It is dissipated in several ways, mostly due to source-to-drain leakage currents caused by reduced threshold voltages that prevent the gate from turning off completely. Other leakage currents also contribute, and hence it is often called leakage power. For designs that are active most of the time, leakage power is less than $1 \%$ of the total power.

### 4.1.2 Dynamic Power

Dynamic power is dissipated when the circuit is active. Dynamic power has two sources, internal power and switching power. Internal power is any power dissipated within the boundary of a cell. During switching, a circuit dissipates internal power by the charging or discharging of any existing capacitances internal to the cell. The definition of internal power includes power dissipated by a momentary short circuit between the PMOS and NMOS transistors of a gate, called short circuit power. The switching power of a driving cell is the power dissipated by the charging and discharging of the load capacitance at the output of the cell. The total load capacitance at the output of a driving cell is the sum of the net and gate capacitances on the driving output.

### 4.1.3 Capturing Switching Activity for Synthesis

Synopsys provides several ways of including simulated switching activity in the power calculations. These are described in the Power Products Reference Manual [27]. The testbench used to capture the switching activity of the different nets in the two architectures is given in Appendix D.3. This testbench has been used for simulating typical switching activity, FP16-only switching activity, FP32-only switching activity and FP64-only switching activity. When typical switching activity is captured, FP32 computations are assumed to be performed $60 \%$ of the time, FP16 computations $20 \%$ of the time and FP64 computations $20 \%$ of the time. This distribution is chosen to ensure switching in all nets and registers, and has not been provided by ARM or any other party. However, the FP32 format has been indicated to be the main format used in computations. To capture switching activity, the method described in Appendix B of the Power Products Reference Manual has been used. The function rtl2saif creates a switching activity file (SAIF) from the Verilog RTL design files in the Synopsys dc_shell. dc_shell is the command line interface of the Synopsys tools. The UNIX utility saif2trace is used to create a forward-annotation trace file based on the information about non-combinational and combinational elements in the SAIF file. This file is included in the testbench to generate switching information as a value change dump (VCD) file of the different design elements. The VCD file is converted to a backward-annotation SAIF file by the UNIX utility vcd2saif, which uses the set_switching_activity command in dc_shell to set the static probability and toggle rate for elements in the design. The backward-annotation SAIF file is read in dc_shell before compilation by the function read_saif, which incorporates information about switching activity into the compilation and optimization process performed by the Design Compiler ${ }^{\mathrm{TM}}$ and Power Compiler ${ }^{\mathrm{TM}}$.
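The typical input data distribution can be expressed as a weighted random choice of the format for each computation. The sketch below shows one way a stimulus script could draw such a sequence; it is not the testbench in Appendix D.3, and the function name and seed handling are assumptions.

```python
import random

# Typical distribution assumed in the text: 60 % FP32, 20 % FP16, 20 % FP64.
TYPICAL_MIX = {"FP32": 0.60, "FP16": 0.20, "FP64": 0.20}


def format_sequence(count: int, mix=TYPICAL_MIX, seed: int = 0):
    """Draw a reproducible sequence of data formats following the given mix."""
    rng = random.Random(seed)
    return rng.choices(list(mix), weights=list(mix.values()), k=count)
```

The single-format cases correspond to a mix with all weight on one format, for example `format_sequence(100, {"FP16": 1.0})`.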

### 4.1.4 Setting Design Constraints

Area and power constraints are set by the dc_shell commands set_max_area, and set_max_total_power. Maximum dynamic power and leakage power may be set individually by the commands set_max_dynamic_power and set_max_leakage_power, respectively. Timing constraints can be set by the set_max_transition command from input ports or pins to output ports or pins. However, if the design is clocked, Design Compiler ${ }^{\text {TM }}$ assumes single cycle datapaths between registers and the create_clock command can be used to set timing. To synthesize and optimize the design the set_max_area and set_max_total_power have been set to zero. To set timing constraints of combinational logic between registers, the create_clock command has been used.

Architecture one and two have been synthesized for $200 \mathrm{MHz}, 300 \mathrm{MHz}$ and 400 MHz clock frequency, and for different input data format distributions, as described in Section 4.1.3. When simulating switching activity, all four rounding modes have been simulated for each format to capture switching in every register and combinational unit. The compile_ultra command in dc_shell enables the Design Compiler Ultra optimizations available from Synopsys as described in [28], which, among other things, include advanced arithmetic optimization and obtain better quality of result for timing and area. Design Compiler Ultra and Power Compiler work side by side. Power Compiler optimizes for timing, area and power simultaneously and includes switching activity information to obtain better results concerning power.
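Since the timing constraint is set through the clock definition, each target frequency corresponds to a clock period. The conversion below assumes that the period is expressed in nanoseconds, which depends on the time unit of the technology library and is not stated in the text.

```python
def clock_period_ns(frequency_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


# Periods for the three synthesized clock frequencies.
PERIODS_NS = {f: clock_period_ns(f) for f in (200, 300, 400)}
```

The three targets correspond to periods of 5 ns, approximately 3.33 ns and 2.5 ns.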

### 4.2 Architecture One

Architecture one attempts to be a power optimized vectorized floating-point multiplier. In this Section, the area usage and power consumption of this architecture, realized in 65 nm and 90 nm CMOS, will be investigated. Power units are given in $m W$, and area units in $\mu m^{2}$.

### 4.2.1 Power

Tables 4.1 and 4.3 present internal, switching, leakage and total power dissipated by architecture one in 65 nm and 90 nm CMOS technology respectively, with typical input data distribution, as described in Section 4.1.3, at $200 \mathrm{MHz}, 300 \mathrm{MHz}$ and 400 MHz clock frequency. Tables 4.2 and 4.4 show which parts of the circuit dissipate the largest amount of power.

| Clock frequency | Component | Power (mW) | Share | Share basis |
| :---: | :--- | :--- | :--- | :--- |
| 200 MHz | Internal | 1.0800 | 85.17 | \% of dynamic power |
|  | Switching | 0.1880 | 14.83 | \% of dynamic power |
|  | Leakage | 0.0185 | 1.44 | \% of total power |
|  | Total | 1.2860 | 100.00 | \% of total power |
| 300 MHz | Internal | 1.6190 | 85.30 | \% of dynamic power |
|  | Switching | 0.2790 | 14.70 | \% of dynamic power |
|  | Leakage | 0.0228 | 1.19 | \% of total power |
|  | Total | 1.9210 | 100.00 | \% of total power |
| 400 MHz | Internal | 2.1700 | 84.83 | \% of dynamic power |
|  | Switching | 0.3880 | 15.17 | \% of dynamic power |
|  | Leakage | 0.0286 | 1.11 | \% of total power |
|  | Total | 2.5870 | 100.00 | \% of total power |

Table 4.1: Architecture one, 65 nm CMOS total power consumption.

From Table 4.1 it can be seen that, in 65 nm low power CMOS, leakage power is much less than estimated, on average $1.25 \%$ of total power compared to the estimated value of $30 \%$. This is because no idle simulation has been performed as in [2], and because the target library is optimized for low power. The major power component, internal power, is due to charging and discharging of capacitive loads internal to the cells, where the cells represent the instantiated Verilog modules. The average increase in total power consumption equals $0.6505 \mathrm{~mW} / 100 \mathrm{MHz}$. From Table 4.2 it can be seen that over $85 \%$ of total power is consumed by registers in the 65 nm circuit. Significand multipliers only account for $4.63 \%$ of total power on average. This is a surprising result, which contradicts the assumption made in the power estimation methodology that the significand multipliers are the most power consuming units in the design. This result is partially caused by datapath optimizations performed by the Synopsys tools. It is also possible that the sequential elements are not optimized for low power in the same manner as the datapath elements. This should be investigated further.
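The two averages quoted above can be recomputed from the values in Table 4.1. The sketch below does so, taking the slope as the difference between the 400 MHz and 200 MHz totals divided by the frequency span in units of 100 MHz; the dictionary layout and function names are illustrative.

```python
# Table 4.1: architecture one, 65 nm CMOS, power in mW.
TABLE_4_1 = {
    200: {"internal": 1.0800, "switching": 0.1880, "leakage": 0.0185, "total": 1.2860},
    300: {"internal": 1.6190, "switching": 0.2790, "leakage": 0.0228, "total": 1.9210},
    400: {"internal": 2.1700, "switching": 0.3880, "leakage": 0.0286, "total": 2.5870},
}


def power_slope(table: dict) -> float:
    """Average increase in total power in mW per 100 MHz."""
    freqs = sorted(table)
    span = (freqs[-1] - freqs[0]) / 100
    return (table[freqs[-1]]["total"] - table[freqs[0]]["total"]) / span


def mean_leakage_share(table: dict) -> float:
    """Leakage power as a percentage of total power, averaged over frequencies."""
    shares = [100 * row["leakage"] / row["total"] for row in table.values()]
    return sum(shares) / len(shares)
```

`power_slope(TABLE_4_1)` reproduces the 0.6505 mW per 100 MHz stated above, and `mean_leakage_share(TABLE_4_1)` agrees with the quoted 1.25 % to within rounding of the table entries.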

| Clock frequency | Building block | \% of total power |
| :---: | :--- | :---: |
| 200 MHz | Registers | 87.6 |
|  | Multiplier Unit | 4.3 |
|  | Rounding Unit | 0.7 |
|  | Select Output | 3.0 |
|  | Select Input | 3.9 |
| 300 MHz | Registers | 86.9 |
|  | Multiplier Unit | 4.7 |
|  | Rounding Unit | 0.9 |
|  | Select Output | 3.1 |
|  | Select Input | 3.6 |
| 400 MHz | Registers | 86.6 |
|  | Multiplier Unit | 4.9 |
|  | Rounding Unit | 0.9 |
|  | Select Output | 3.0 |
|  | Select Input | 3.8 |

Table 4.2: Architecture one, 65 nm CMOS building blocks power consumption.

From Table 4.3 it can be seen that, in 90 nm CMOS, power consumption is much larger than in 65 nm CMOS. All power components, internal, switching and leakage, have increased. The most important increases are in the ratio of switching power to total dynamic power and in the ratio of leakage power to total power. Switching power is on average, at the different clock frequencies, $36 \%$ of total dynamic power, and leakage power $9.62 \%$ of total power. The large increase in power consumption is partially because the 65 nm library is a low-power library, and Synopsys Design Compiler ${ }^{\mathrm{TM}}$ and Power Compiler ${ }^{\mathrm{TM}}$ exploit features in the low-power library to obtain lower power consumption, and hence internal, switching and leakage power are reduced. The average increase in total power is $5.1850 \mathrm{~mW} / 100 \mathrm{MHz}$. In Table 4.4 the power dissipated by the major units, when realized in 90 nm CMOS, is presented. The results presented in Table 4.4 are more as expected, where the significand multipliers account for the larger part of the total power consumption. Approximately $60 \%$ of total power is dissipated in the significand multipliers, and approximately $28 \%$ by the registers, compared to the 65 nm results where on average $87 \%$ is dissipated in registers and $4.6 \%$ in multipliers.

| Clock frequency | Component | Power (mW) | Share | Share basis |
| :---: | :--- | :--- | :--- | :--- |
| 200 MHz | Internal | 5.5030 | 64.15 | \% of dynamic power |
|  | Switching | 3.0760 | 35.85 | \% of dynamic power |
|  | Leakage | 1.1500 | 11.81 | \% of total power |
|  | Total | 9.7340 | 100.00 | \% of total power |
| 300 MHz | Internal | 8.9450 | 64.17 | \% of dynamic power |
|  | Switching | 4.9950 | 35.83 | \% of dynamic power |
|  | Leakage | 1.4700 | 9.54 | \% of total power |
|  | Total | 15.4090 | 100.00 | \% of total power |
| 400 MHz | Internal | 11.8440 | 63.70 | \% of dynamic power |
|  | Switching | 6.7500 | 36.30 | \% of dynamic power |
|  | Leakage | 1.5100 | 7.51 | \% of total power |
|  | Total | 20.1040 | 100.00 | \% of total power |

Table 4.3: Architecture one, 90 nm CMOS total power consumption.

| Clock frequency | Building block | \% of total power |
| :---: | :--- | :---: |
| 200 MHz | Registers | 28.2 |
|  | Multiplier Unit | 61.1 |
|  | Rounding Unit | 7.0 |
|  | Select Output | 1.0 |
|  | Select Input | 1.2 |
| 300 MHz | Registers | 28.4 |
|  | Multiplier Unit | 60.0 |
|  | Rounding Unit | 8.0 |
|  | Select Output | 1.0 |
|  | Select Input | 1.1 |
| 400 MHz | Registers | 27.0 |
|  | Multiplier Unit | 60.4 |
|  | Rounding Unit | 9.3 |
|  | Select Output | 0.9 |
|  | Select Input | 1.1 |

Table 4.4: Architecture one, 90 nm CMOS building blocks power consumption.

Figure 4.1: Architecture one, 65 nm CMOS power consumption.

Figure 4.1 shows power consumption of architecture one at 200 MHz, 300 MHz and 400 MHz in 65 nm CMOS for typical input data distribution, only FP16 input data, only FP32 input data and only FP64 input data. At 200 MHz a strange case occurs. When only FP16 computations are performed, power consumption is much larger than for the other input data distributions. From the synthesis report it can be seen that the large power consumption is mostly due to high internal and switching power in the 53-bit multiplier and one of the 24-bit multipliers. Architecture one has been simulated and synthesized at 200 MHz for only FP16 computations several times to locate the reason for this strange behavior, without success. The behavior is strange because it does not happen at either 300 MHz or 400 MHz clock frequency, where power consumption when performing FP16 computations is as expected. It may have happened because of insufficient control in the synthesis process: because only power, area and timing constraints are set, unexpected optimizations may have occurred. Figure 4.1 shows that FP32 computations are the most power consuming, except for the strange case when performing FP16 computations at 200 MHz. However, it is important to remember that switching activity information from the simulation is included in the optimization process performed by Design Compiler ${ }^{\mathrm{TM}}$ and Power Compiler ${ }^{\mathrm{TM}}$, which may lead to somewhat different circuits and hence different power consumption.

Figure 4.2 shows power consumption of architecture one at 200 MHz, 300 MHz and 400 MHz in 90 nm CMOS for typical input data distribution, only FP16 input data, only FP32 input data and only FP64 input data. Figure 4.2 better highlights the effect of increasing clock frequency than Figure 4.1 because the 90 nm library is a general purpose library and not optimized for low power. Figure 4.2 shows that for any of the three clock frequencies, FP16 computations are the least power consuming. Typical input data distribution is the second least power consuming, FP32 computations the second largest and FP64 computations the largest. Internal power is significantly higher for typical input data distribution than for any other because the capacitance switched internal to the multiplier unit is higher, and switching power is significantly lower because several multipliers are now driving the output. When performing only FP16, FP32 or FP64 computations the internal load capacitance is reduced because only some multipliers are used, and hence internal power is reduced. Switching power is increased because the outputs of the used multipliers have to drive a wide bus and the gates connected to the bus.

Figure 4.3 compares the power dissipated by architecture one in 65 nm and 90 nm CMOS assuming typical input data distribution at $200 \mathrm{MHz}, 300 \mathrm{MHz}$ and 400 MHz. The differences are large. In 90 nm CMOS, on average at the different clock frequencies, 7.79 times more power is dissipated than in 65 nm CMOS. It can also be seen that in 90 nm CMOS, switching power is a significantly larger part of the total dynamic power consumption. In addition, leakage power is much higher in the 90 nm circuit. However, many of the differences are probably mostly due to the fact that the 90 nm library is a general purpose library not optimized for low power, as the 65 nm library is. Hence, different optimizations are performed by the Synopsys tools to meet the constraints of lowest possible total power consumption and smallest possible area at a given clock frequency.
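The factor of 7.79 can be recomputed from the total power rows of Table 4.1 and Table 4.3 as the mean of the per-frequency ratios. The variable names below are illustrative.

```python
# Total power in mW for typical input data distribution (Tables 4.1 and 4.3).
TOTAL_65NM = {200: 1.2860, 300: 1.9210, 400: 2.5870}
TOTAL_90NM = {200: 9.7340, 300: 15.4090, 400: 20.1040}


def mean_ratio(numerator: dict, denominator: dict) -> float:
    """Mean of the per-frequency ratios numerator / denominator."""
    ratios = [numerator[f] / denominator[f] for f in sorted(numerator)]
    return sum(ratios) / len(ratios)


power_ratio = mean_ratio(TOTAL_90NM, TOTAL_65NM)
```

The individual ratios are approximately 7.57, 8.02 and 7.77 at 200 MHz, 300 MHz and 400 MHz.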


Figure 4.2: Architecture one, 90 nm CMOS power consumption.

(a) Power comparison at 200 MHz.

(b) Power comparison at 300 MHz.

(c) Power comparison at 400 MHz.

Figure 4.3: Architecture one, 90 nm and 65 nm CMOS power comparison.

### 4.2.2 Area

Tables 4.5 and 4.6 present the area usage of the registers, significand multiplier unit, exponent unit and rounding and exception unit, as well as the total area usage, of architecture one in 65 nm and 90 nm CMOS technology, with typical input data distribution.

| Clock frequency | Building block | Area ($\mu m^{2}$) | Share | Share basis |
| :---: | :---: | :---: | :---: | :---: |
| 200 MHz | Registers | 5347.6470 | 9.46 | \% of total area |
|  | Multiplier unit | 42684.6992 | 75.50 | \% of total area |
|  | Exponent unit | 827.8377 | 1.47 | \% of total area |
|  | Rounding unit | 4378.4097 | 7.74 | \% of total area |
|  | Total | 56536.7109 | 100 | \% of total area |
| 300 MHz | Registers | 5357.0068 | 8.96 | \% of total area |
|  | Multiplier unit | 45084.0664 | 75.37 | \% of total area |
|  | Exponent unit | 829.3976 | 1.39 | \% of total area |
|  | Rounding unit | 5220.324 | 8.73 | \% of total area |
|  | Total | 59816.4414 | 100 | \% of total area |
| 400 MHz | Registers | 5406.4063 | 8.6 | \% of total area |
|  | Multiplier unit | 47699.4180 | 75.84 | \% of total area |
|  | Exponent unit | 828.8778 | 1.32 | \% of total area |
|  | Rounding unit | 5557.2886 | 8.84 | \% of total area |
|  | Total | 62899.0469 | 100 | \% of total area |

Table 4.5: Architecture one, 65 nm CMOS area usage.

Differences in area usage for the four input data distributions considered in Section 4.2.1 are small compared to the differences in power consumption, around $1 \%$ at the different clock frequencies. The registers, multiplier unit and rounding and exception unit are the most area consuming building blocks in architecture one when realized in both 65 nm and 90 nm CMOS technology. The significand multipliers are by far the largest building block, and together with the registers and the rounding and exception logic they account for over $90 \%$ of total area. The ratios of registers, multiplier unit and rounding and exception unit to total area do not change significantly between 65 nm CMOS and 90 nm CMOS. On average, at 200 MHz, 300 MHz and 400 MHz, the 90 nm circuit is 1.91 times larger than the 65 nm circuit. This is a bit larger than expected, because the gate length in 90 nm CMOS is approximately 1.4 times larger than in 65 nm CMOS. However, area is also dependent on the available gates and macro blocks in the target library. Area usage is, like power consumption, dependent on clock frequency, because area is traded to meet the timing constraints, mainly in the 53-bit significand multiplier. The increase in total area is more linear with clock frequency in the 65 nm circuit than in the 90 nm circuit.
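Both the area ratio and the remark on linearity can be checked against the total area rows of Table 4.5 and Table 4.6. The sketch below computes the mean 90 nm to 65 nm ratio and the area added per 100 MHz step; the names are illustrative.

```python
# Total area in square micrometres (Tables 4.5 and 4.6).
AREA_65NM = {200: 56536.7109, 300: 59816.4414, 400: 62899.0469}
AREA_90NM = {200: 108863.2578, 300: 116362.0625, 400: 117933.8281}


def mean_area_ratio() -> float:
    """Mean of the per-frequency ratios of 90 nm area to 65 nm area."""
    ratios = [AREA_90NM[f] / AREA_65NM[f] for f in sorted(AREA_65NM)]
    return sum(ratios) / len(ratios)


def increments(area: dict) -> list:
    """Area added for each 100 MHz step in clock frequency."""
    freqs = sorted(area)
    return [area[b] - area[a] for a, b in zip(freqs, freqs[1:])]
```

In the 65 nm circuit the two steps add roughly 3280 and 3083 $\mu m^{2}$, while in the 90 nm circuit they add roughly 7499 and 1572 $\mu m^{2}$, which is the less linear growth referred to above.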

| Clock frequency | Area |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| 200 MHz | Registers | 9983.7744 | 9.17 | \% of total area |
|  | Multiplier unit | 83370.6250 | 76.58 | \% of total area |
|  | Exponent unit | 1564.0778 | 1.44 | \% of total area |
|  | Rounding unit | 8086.0181 | 7.43 | \% of total area |
|  | Total | 108863.2578 | 100 | \% of total area |
| 300 MHz | Registers | 9995.8486 | 8.59 | \% of total area |
|  | Multiplier unit | 89636.4297 | 77.03 | \% of total area |
|  | Exponent unit | 1571.7610 | 1.35 | \% of total area |
|  | Rounding unit | 8897.1494 | 7.65 | \% of total area |
|  | Total | 116362.0625 | 100 | \% of total area |
| 400 MHz | Registers | 9989.2627 | 8.47 | \% of total area |
|  | Multiplier unit | 89948.2031 | 76.27 | \% of total area |
|  | Exponent unit | 1568.4681 | 1.33 | \% of total area |
|  | Rounding unit | 9979.4297 | 8.46 | \% of total area |
|  | Total | 117933.8281 | 100 | \% of total area |

Table 4.6: Architecture one, 90 nm CMOS area usage.
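The average area ratio quoted for architecture one can be re-derived from the totals in Tables 4.5 and 4.6. The following is a minimal sketch, assuming the average is taken over the three per-frequency ratios. The dictionary and function names are illustrative only, and the quadratic scaling figure printed at the end is an idealized reference value that is not taken from the synthesis reports.

```python
# Total area in um^2, copied from Table 4.5 (65 nm) and Table 4.6 (90 nm).
AREA_65NM = {200: 56536.7109, 300: 59816.4414, 400: 62899.0469}
AREA_90NM = {200: 108863.2578, 300: 116362.0625, 400: 117933.8281}


def mean_area_ratio(large, small):
    """Average of the per-frequency area ratios large/small."""
    ratios = [large[f] / small[f] for f in sorted(small)]
    return sum(ratios) / len(ratios)


ratio = mean_area_ratio(AREA_90NM, AREA_65NM)

# Idealized scaling (assumption): linear dimensions scale with the
# feature size, so area scales with the square of that ratio.
linear_scale = 90 / 65
ideal_area_scale = linear_scale ** 2

print(f"mean 90 nm / 65 nm area ratio: {ratio:.3f}")
print(f"linear feature-size ratio:     {linear_scale:.3f}")
print(f"ideal quadratic area scaling:  {ideal_area_scale:.3f}")
```

Printing the linear and the quadratic figure side by side makes it easier to judge which of the two the measured ratio should be compared against.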

### 4.3 Architecture Two

Architecture two trades power for area, and aims to be an area- and throughput-optimized vectorized floating-point multiplier. This Section investigates the power consumption and area usage of architecture two, realized in 65 nm and 90 nm CMOS at 200 MHz, 300 MHz and 400 MHz clock frequency. As for architecture one, power is given in mW and area in $\mu \mathrm{m}^{2}$.

### 4.3.1 Power

Tables 4.7 and 4.9 present the internal, switching, leakage and total power dissipated by architecture two realized in 65 nm low-power CMOS and 90 nm CMOS, with the typical input data distribution described in Section 4.1.3.

In the 65 nm circuit, power is mostly dissipated by charging and discharging of capacitances internal to the cells (internal power) and of capacitances at the cell outputs (switching power). Leakage power is very low, and the ratio of leakage power to total power decreases with increasing clock frequency because the dynamic power component grows faster than the static power component. The estimated leakage power is over 90 times larger on average, at the different clock frequencies. Table 4.8 shows which building blocks in the design dissipate the most power. The registers, the significand multipliers and the rounding and exception logic account for approximately $95 \%$ of total power consumption, where the multiplier unit is the most power-consuming building block. The average increase in power is

| Clock frequency | Power |  |  |  |
| :---: | :--- | :--- | :--- | :--- |
| 200 MHz | Internal | 2.4660 | 57.46 | \% of dynamic power |
|  | Switching | 1.8260 | 42.54 | \% of dynamic power |
|  | Leakage | 0.0168 | 0.39 | \% of total power |
|  | Total | 4.3088 | 100 | \% of total power |
| 300 MHz | Internal | 4.2240 | 57.59 | \% of dynamic power |
|  | Switching | 3.1100 | 42.41 | \% of dynamic power |
|  | Leakage | 0.0229 | 0.31 | \% of total power |
|  | Total | 7.3569 | 100 | \% of total power |
| 400 MHz | Internal | 5.3430 | 56.61 | \% of dynamic power |
|  | Switching | 4.0950 | 43.39 | \% of dynamic power |
|  | Leakage | 0.0215 | 0.23 | \% of total power |
|  | Total | 9.4595 | 100 | \% of total power |

Table 4.7: Architecture two, 65 nm CMOS total power consumption.

| Clock frequency | Power | \% of total power |
| :---: | :--- | :---: |
| 200 MHz | Registers | 36.2 |
|  | Multiplier Unit | 52.9 |
|  | Rounding Unit | 6.8 |
|  | Select Output | 1.6 |
|  | Select Input | 1.2 |
| 300 MHz | Registers | 35.3 |
|  | Multiplier Unit | 54.8 |
|  | Rounding Unit | 7.2 |
|  | Select Output | 1.5 |
|  | Select Input | 1.1 |
| 400 MHz | Registers | 33.2 |
|  | Multiplier Unit | 54.2 |
|  | Rounding Unit | 8.1 |
|  | Select Output | 1.5 |
|  | Select Input | 1.1 |

Table 4.8: Architecture two, 65 nm CMOS building blocks power consumption.

| Clock frequency | Power |  |  |  |  |
| :---: | :--- | :--- | :--- | :--- | :---: |
| 200 MHz | Internal | 6.4050 | 64.16 | \% of dynamic power |  |
|  | Switching | 3.5780 | 35.84 | \% of dynamic power |  |
|  | Leakage | 0.9980 | 9.09 | \% of total power |  |
|  | Total | 10.9810 | 100.00 | \% of total power |  |
| 300 MHz | Internal | 10.5920 | 64.97 | \% of dynamic power |  |
|  | Switching | 5.7120 | 35.03 | \% of dynamic power |  |
|  | Leakage | 1.1600 | 6.64 | \% of total power |  |
|  | Total | 17.4640 | 100.00 | \% of total power |  |
| 400 MHz | Internal | 14.1970 | 64.08 | \% of dynamic power |  |
|  | Switching | 7.9570 | 35.92 | \% of dynamic power |  |
|  | Leakage | 1.3100 | 5.58 | \% of total power |  |
|  | Total | 23.4620 | 100.00 | \% of total power |  |

Table 4.9: Architecture two, 90 nm CMOS total power consumption.

| Clock frequency | Power | \% of total power |
| :---: | :--- | :---: |
| 200 MHz | Registers | 26.8 |
|  | Multiplier Unit | 63.2 |
|  | Rounding Unit | 5.7 |
|  | Select Output | 1.3 |
|  | Select Input | 1.1 |
| 300 MHz | Registers | 25.1 |
|  | Multiplier Unit | 61.3 |
|  | Rounding Unit | 7.8 |
|  | Select Output | 1.4 |
|  | Select Input | 1.0 |
| 400 MHz | Registers | 25.4 |
|  | Multiplier Unit | 63.0 |
|  | Rounding Unit | 7.8 |
|  | Select Output | 1.3 |
|  | Select Input | 1.0 |

Table 4.10: Architecture two, 90 nm CMOS building blocks power consumption.

$2.5735 \mathrm{~mW} / 100 \mathrm{MHz}$.

Table 4.9 presents the internal, switching, leakage and total power dissipated by architecture two realized in 90 nm CMOS. Compared to the 65 nm circuit, the 90 nm circuit has significantly higher total power consumption at all clock frequencies. The internal power percentage is higher, while the switching power percentage is lower. Leakage power is also higher than in the 65 nm circuit, and is on average responsible for $7.10 \%$ of total power consumption. The ratio of leakage power to total power is reduced as clock frequency increases, because internal and switching power grow faster than leakage power. Leakage power is independent of clock frequency but proportional to area. As clock frequency increases, a larger area is required, mainly by the significand multipliers, as a tradeoff between area, timing and power. The average increase in power is $6.2405 \mathrm{~mW} / 100 \mathrm{MHz}$, which is $3.6670 \mathrm{~mW} / 100 \mathrm{MHz}$ higher than for the 65 nm circuit. Table 4.10 shows which parts of the circuit dissipate the most power. Compared to the 65 nm circuit, a smaller share of the power is consumed by the registers and a larger share by the multiplier unit. On average, the multiplier unit's share of total power is 8.53 percentage points higher, and the registers' share is 9.13 percentage points lower. As in the 65 nm circuit, the rounding unit is the third most power-consuming unit. The differences in leakage, internal and switching power between the 65 nm circuit and the 90 nm circuit are probably due to different optimizations performed by the Synopsys tools based on the cells available in the target library.
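The average power increase per 100 MHz and the average leakage share discussed above can be recomputed from Tables 4.7 and 4.9. The sketch below assumes that the average increase is the end-point slope between 200 MHz and 400 MHz, which equals the mean of the two 100 MHz steps. All names are illustrative and not part of the thesis.

```python
# Power in mW, copied from Table 4.7 (65 nm) and Table 4.9 (90 nm).
TOTAL_65NM = {200: 4.3088, 300: 7.3569, 400: 9.4595}
TOTAL_90NM = {200: 10.9810, 300: 17.4640, 400: 23.4620}
LEAKAGE_90NM = {200: 0.9980, 300: 1.1600, 400: 1.3100}


def slope_per_100mhz(total):
    """Average increase in total power per 100 MHz between the end points."""
    f_lo, f_hi = min(total), max(total)
    return (total[f_hi] - total[f_lo]) / ((f_hi - f_lo) / 100)


def mean_leakage_share(leakage, total):
    """Mean of leakage/total, in percent, over all clock frequencies."""
    shares = [100 * leakage[f] / total[f] for f in total]
    return sum(shares) / len(shares)


slope_65 = slope_per_100mhz(TOTAL_65NM)
slope_90 = slope_per_100mhz(TOTAL_90NM)
leak_90 = mean_leakage_share(LEAKAGE_90NM, TOTAL_90NM)

print(f"65 nm: {slope_65:.4f} mW per 100 MHz")
print(f"90 nm: {slope_90:.4f} mW per 100 MHz")
print(f"90 nm mean leakage share: {leak_90:.2f} % of total power")
```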

Figure 4.4 shows the power consumption of architecture two at 200 MHz, 300 MHz and 400 MHz in 65 nm CMOS for the typical input data distribution, only FP16 input data, only FP32 input data and only FP64 input data. Internal, switching, leakage and total power are included to show how the input data distribution affects the power consumption of architecture two. FP16 computations dissipate the least power at all clock frequencies, mostly because of less charging and discharging of load capacitances, both internal to the multiplier cells and at their outputs. The differences between the four input data distributions increase with clock frequency and are clearest at 400 MHz. Figure 4.5 shows the power consumption of architecture two realized in 90 nm CMOS, where the differences at increasing clock frequency are even larger than in the 65 nm circuit. The effect of the data format is somewhat different in the 65 nm circuit and the 90 nm circuit. The most significant difference is in the internal and switching power components of the two circuits. In the 65 nm circuit, the internal power is almost equal for typical input data, FP32 input data and FP64 input data; in the 90 nm circuit, however, the internal power is significantly larger for typical input data than for FP32 and FP64 input data. In both circuits, the internal power is smallest for FP16 input data. In the 65 nm circuit, the switching power is almost equal for typical input data, FP32 and FP64


Figure 4.4: Architecture two, 65 nm CMOS power consumption.


Figure 4.5: Architecture two, 90 nm CMOS power consumption.


Figure 4.6: Architecture two, 90 nm and 65 nm CMOS power comparison.

input data as well, but in the 90 nm circuit typical input data has the lowest switching power, followed by FP16 input data, while FP32 and FP64 input data have almost equal switching power. These differences are probably due to the cells available in the target library, and hence to the optimization performed in the datapaths. In the simplified power estimation methodology used to compare the architectures, architecture two was estimated to have equal power consumption for only FP16, only FP32 and only FP64 computations. This is not the case. In the estimation methodology, it was assumed that every bit in the multipliers had equal switching for all formats. As seen from Figure 4.4 and Figure 4.5, FP16 computations have both less switching power and less internal power than FP32 and FP64 computations. This should be considered if the power estimation methodology is to be improved. Concerning total power, in both circuits FP16 computations require the least power, followed by typical input data computations, while FP32 and FP64 computations have almost equal total power consumption.

Figure 4.6 compares the power dissipated by architecture two in 65 nm and 90 nm CMOS assuming typical input data distribution. The power consumption of the 90 nm circuit is on average 10.26 mW higher than that of the 65 nm circuit, at 200 MHz, 300 MHz and 400 MHz clock frequency. The differences in leakage power of the two circuits are well highlighted in Figure 4.6. Internal and switching power are closer to equal in the 65 nm circuit than in the 90 nm circuit. This is due to more switching internal to the cells in the 90 nm circuit, and less switching of outputs. These differences are probably due to different optimizations because of the cells available in the target library.
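Both observations in this comparison, the average gap in total power and the balance between internal and switching power, follow directly from Tables 4.7 and 4.9. A minimal sketch, with illustrative names that are not part of the thesis:

```python
# (internal, switching, total) power in mW per clock frequency in MHz,
# copied from Table 4.7 (65 nm) and Table 4.9 (90 nm).
POWER_65NM = {200: (2.4660, 1.8260, 4.3088),
              300: (4.2240, 3.1100, 7.3569),
              400: (5.3430, 4.0950, 9.4595)}
POWER_90NM = {200: (6.4050, 3.5780, 10.9810),
              300: (10.5920, 5.7120, 17.4640),
              400: (14.1970, 7.9570, 23.4620)}

# Gap in total power between the 90 nm and the 65 nm circuit.
gaps = {f: POWER_90NM[f][2] - POWER_65NM[f][2] for f in POWER_65NM}
mean_gap = sum(gaps.values()) / len(gaps)

# Internal-to-switching ratio; a value of 1.0 would mean equal shares.
balance_65 = {f: p[0] / p[1] for f, p in POWER_65NM.items()}
balance_90 = {f: p[0] / p[1] for f, p in POWER_90NM.items()}

print(f"mean total power gap: {mean_gap:.2f} mW")
for f in sorted(POWER_65NM):
    print(f"{f} MHz: internal/switching = "
          f"{balance_65[f]:.2f} (65 nm), {balance_90[f]:.2f} (90 nm)")
```

A ratio closer to 1.0 corresponds to the "closer to equal" wording used in the text.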

### 4.3.2 Area

The area usage of architecture two is presented in Tables 4.11 and 4.12, which list the area required by the registers, the multiplier unit, the exponent unit and the exception and rounding unit, in addition to the total area.

Differences in area usage for the four input data distributions are small compared to the differences in power consumption, around $1 \%$ at the different clock frequencies. The significand multiplier unit is by far the largest unit in both the 65 nm circuit and the 90 nm circuit. The ratio of multiplier unit area to total area is approximately 2 percentage points larger in the 90 nm circuit than in the 65 nm circuit. The 90 nm circuit is on average 1.93 times larger than the 65 nm circuit at 200 MHz, 300 MHz and 400 MHz clock frequency. The area usage of architecture two increases more linearly with clock frequency when realized in the 90 nm general purpose library. When realized in the 65 nm low-power library, the largest increase in area occurs when going from 200 MHz to 300 MHz clock frequency. When going from

| Clock frequency | Area |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| 200 MHz | Registers | 5739.7617 | 12.27 | \% of total area |
|  | Multiplier unit | 33117.7344 | 70.77 | \% of total area |
|  | Exponent unit | 656.7595 | 1.40 | \% of total area |
|  | Rounding unit | 3883.3469 | 8.30 | \% of total area |
|  | Total | 46796.8789 | 100 | \% of total area |
| 300 MHz | Registers | 5775.6426 | 11.36 | \% of total area |
|  | Multiplier unit | 36380.5391 | 71.55 | \% of total area |
|  | Exponent unit | 659.3595 | 1.30 | \% of total area |
|  | Rounding unit | 4614.4980 | 9.08 | \% of total area |
|  | Total | 50843.0000 | 100 | \% of total area |
| 400 MHz | Registers | 5800.0767 | 11.53 | \% of total area |
|  | Multiplier unit | 35370.0977 | 70.30 | \% of total area |
|  | Exponent unit | 675.4794 | 1.34 | \% of total area |
|  | Rounding unit | 4896.8638 | 9.73 | \% of total area |
|  | Total | 50314.1602 | 100 | \% of total area |

Table 4.11: Architecture two, 65 nm CMOS area usage.

| Clock frequency | Area |  |  |  |  |
| :---: | :--- | :--- | :--- | :--- | :---: |
| 200 MHz | Registers | 10668.7070 | 11.7 | \% of total area |  |
|  | Multiplier unit | 66626.7500 | 73.06 | \% of total area |  |
|  | Exponent unit | 1250.1659 | 1.37 | \% of total area |  |
|  | Rounding unit | 6541.7031 | 7.17 | \% of total area |  |
|  | Total | 91194.0938 | 100 | \% of total area |  |
| 300 MHz | Registers | 10696.1465 | 11.23 | \% of total area |  |
|  | Multiplier unit | 68993.0703 | 72.44 | \% of total area |  |
|  | Exponent unit | 1247.9708 | 1.31 | \% of total area |  |
|  | Rounding unit | 8062.9741 | 8.47 | \% of total area |  |
|  | Total | 95242.0469 | 100 | \% of total area |  |
| 400 MHz | Registers | 10759.8096 | 10.81 | \% of total area |  |
|  | Multiplier unit | 71746.6641 | 72.10 | \% of total area |  |
|  | Exponent unit | 1255.6536 | 1.26 | \% of total area |  |
|  | Rounding unit | 8995.9385 | 9.04 | \% of total area |  |
|  | Total | 99506.2188 | 100 | \% of total area |  |

Table 4.12: Architecture two, 90 nm CMOS area usage.

300 MHz to 400 MHz, area is reduced somewhat for the 65 nm circuit. This is probably because at 300 MHz, the 65 nm circuit has traded area for better power results.
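How area changes with clock frequency for architecture two can be checked by differencing the totals in Tables 4.11 and 4.12. The sketch below uses illustrative names and simply reports the area change for each 100 MHz step, together with the mean 90 nm to 65 nm ratio.

```python
# Total area in um^2, copied from Table 4.11 (65 nm) and Table 4.12 (90 nm).
AREA_65NM = {200: 46796.8789, 300: 50843.0000, 400: 50314.1602}
AREA_90NM = {200: 91194.0938, 300: 95242.0469, 400: 99506.2188}


def area_steps(area):
    """Area change for each step between consecutive clock frequencies."""
    freqs = sorted(area)
    return [area[hi] - area[lo] for lo, hi in zip(freqs, freqs[1:])]


steps_65 = area_steps(AREA_65NM)
steps_90 = area_steps(AREA_90NM)
ratios = [AREA_90NM[f] / AREA_65NM[f] for f in sorted(AREA_65NM)]
mean_ratio = sum(ratios) / len(ratios)

print("65 nm steps (um^2):", [round(s, 1) for s in steps_65])
print("90 nm steps (um^2):", [round(s, 1) for s in steps_90])
print(f"mean 90 nm / 65 nm area ratio: {mean_ratio:.2f}")
```

Two steps of similar size indicate roughly linear growth, while a negative second step corresponds to the area reduction between 300 MHz and 400 MHz.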

### 4.4 Power Comparison

Because architecture one trades area for better power results, and architecture two trades power for better area results, different power and area results were expected, as estimated in Sections 2.2 and 2.3. This Section compares the power results of the implemented architectures. In addition, because the 65 nm circuits are realized using a low-power CMOS process, and the 90 nm circuits are realized using a general purpose CMOS process, differences due to the target process are highlighted. Figure 4.7 compares the power dissipated by architecture one and two at 200 MHz, 300 MHz and 400 MHz for typical input data distribution realized in a 65 nm low-power CMOS process, and Figure 4.8 compares the power dissipated by the two architectures realized in a 90 nm general purpose CMOS process.

From Figure 4.7 it can be seen that architecture one has much better power results than architecture two. On average, at 200 MHz, 300 MHz and 400 MHz, the power consumed by architecture two is 5.1104 mW larger than that consumed by architecture one. Both internal power and switching power are significantly lower in architecture one, due to reduced switching inside the cells and at their outputs. Hence, to obtain the best power results, architecture one should be chosen. The difference in power consumption between the two architectures grows as clock frequency increases. However, because the vectorized floating-point multipliers are reaching the limit of how much clock frequency can be increased without introducing pipeline registers in the significand multipliers, or multicycle multipliers, power results may be different at higher frequencies. Because of the surprising result that registers are more power consuming in the 65 nm circuit of architecture one, power may be further reduced if low-power registers are used, assuming this is the cause of the result.

If the two architectures are realized in a 90 nm general purpose CMOS process, the differences between architecture one and architecture two are less distinct, as seen in Figure 4.8. On average, at the different clock frequencies, architecture two consumes 2.2200 mW more power than architecture one. The difference in power consumption between the two architectures is much smaller when they are realized in a general purpose process than when a low-power process is used. However, the difference grows as clock frequency increases, because dynamic power becomes more dominant over leakage power. At

(c) Power comparison at 400 MHz.

Figure 4.7: 65 nm architecture power comparison.


Figure 4.8: 90 nm architecture power comparison.

400 MHz, the difference in power consumption between the two architectures is approximately 4 mW, while at 200 MHz the difference is approximately 1 mW. As for the 65 nm circuits, to obtain the best power result, architecture one should be chosen.


Figure 4.9: Estimated vs. real power comparison.

Architecture one and two were selected for implementation based on the estimations performed in Chapter 2. Figure 4.9 compares the estimated power consumption of architecture one and two to the real power consumption obtained from synthesis, for only FP16 input data, only FP32 input data and only FP64 input data. The numbers for the real power consumption are from the 65 nm circuits at 300 MHz, but the same relative difference between the architectures would be obtained at 200 MHz and 400 MHz, and by the 90 nm circuits, except at 200 MHz in the 65 nm architecture one circuit, where power consumption is surprisingly high when computing only FP16 input data. As seen from Figure 4.9, the estimated power consumption gives a good picture of the relative difference between the two architectures, except when only FP64 input data are computed. When only FP64 input data are computed, architecture one is estimated to have higher power consumption than architecture two. This is because static power was assumed to be $30 \%$ of total power consumption, which is much higher than in both the 65 nm and the 90 nm circuits, where average static power is less than $1.5 \%$ and less than $10 \%$, respectively. In addition, the significand multipliers are not implemented as array multipliers, as assumed in the estimation methodology, but as parallel-prefix multipliers which exploit low-power features in the target library to obtain better power results. The relative estimated difference in power consumption between the two architectures is largest when only FP16 input data is computed, and decreases when only FP32 data is computed. However, as seen in Figure 4.9, the real difference in power consumption grows larger when only FP32 and only FP64 input data is computed compared to only FP16 input data. Hence, the estimation methodology predicted correctly in two of three cases, and has a fidelity of $66 \%$.
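The fidelity figure can be read as the fraction of input formats for which the estimate ranks the two architectures in the same order as synthesis does. The sketch below encodes that reading. The definition of fidelity used here and all names are assumptions made for illustration; the rankings are the ones stated in the text (estimation favours architecture two only for FP64, synthesis favours architecture one for all formats).

```python
# Which architecture has the lower power consumption per input format.
ESTIMATED_LOWER = {"FP16": "one", "FP32": "one", "FP64": "two"}
MEASURED_LOWER = {"FP16": "one", "FP32": "one", "FP64": "one"}


def fidelity(estimated, measured):
    """Fraction of cases where the estimated ranking matches the measured one."""
    hits = sum(estimated[fmt] == measured[fmt] for fmt in estimated)
    return hits / len(estimated)


power_fidelity = fidelity(ESTIMATED_LOWER, MEASURED_LOWER)
print(f"power estimation fidelity: {power_fidelity:.2f}")
```

Two matching rankings out of three give a fidelity of two thirds, which the text quotes as 66 %.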

### 4.5 Area Comparison

Architecture one trades area for better power results, and architecture two trades power for better area results. By using multipliers, adders and subtractors, and rounding and exception logic that exactly fit the width of the operands being computed, architecture one reduces total power consumption at the cost of additional multipliers, adders and subtractors, and rounding and exception logic. This additional logic increases total area significantly compared to architecture two. Figure 4.10 compares the area usage of the 65 nm circuits, and Figure 4.11 compares the 90 nm circuits. The relative difference in area usage between the two architectures is approximately equal in the 65 nm and 90 nm circuits. On average, in the 65 nm realization, architecture one is $10432.7200 \mu \mathrm{~m}^{2}$ larger than architecture two. In the 90 nm realization, architecture one is $19072.2630 \mu \mathrm{~m}^{2}$ larger than architecture two. On average over the clock frequencies, the multiplier unit in architecture two accounts for approximately $71 \%$ of total area in 65 nm CMOS and approximately $73 \%$ in 90 nm CMOS (Tables 4.11 and 4.12). In architecture one, the multiplier unit accounts for approximately $76 \%$ of total area in both 65 nm and 90 nm CMOS. Hence, the multiplier unit is the largest unit in both architectures. Figures 4.10 and 4.11 show that the difference in the multiplier unit accounts for almost all of the difference between the architectures. The difference in area usage by the registers, the exponent unit and the rounding and exception unit is very small. As discussed in Section 4.2.2, the area of architecture one increases more linearly with clock frequency in the 65 nm circuits than in the 90 nm circuits. For architecture two, area increases more linearly with increasing clock frequency in the 90 nm circuits than in the 65 nm circuits. This is probably due to the nature of the architectures and the optimization performed by the Synopsys tools based on the target library.

As can be seen from Tables 4.5, 4.6, 4.11 and 4.12, the significand multipliers account for more than $70 \%$ of total area, and the registers for more than $8 \%$ of total area. The estimations performed in Section 2.3 are based on the transistors used in the significand multipliers, exponent adders and registers, assuming multipliers implemented as array multipliers. In the synthesized circuits, the significand multipliers are implemented as parallel-prefix multipliers from the DesignWare® library provided by Synopsys. In the estimations performed in Section 2.3, architecture one is estimated to require approximately $15 \%$ larger area than architecture two. In the 90 nm realization of the two architectures, architecture one requires, on average at the different clock frequencies, $16.7 \%$ larger area than architecture two. In the 65 nm circuits, on average, architecture one requires $17.5 \%$ larger area than architecture two. Hence, the estimation methodology has a fidelity of $100 \%$, even though the significand multipliers are implemented differently than assumed. This is because the multipliers are by far the largest building blocks of the design, and together with the registers account for approximately $80 \%$ of total area at the different clock frequencies and target technologies.
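The area differences discussed above can be recomputed from the totals in Tables 4.5, 4.6, 4.11 and 4.12. The text does not state which architecture's area is used as the base for the percentages, so the sketch below reports the mean difference normalized both ways. That choice, and all names used, are assumptions made for illustration.

```python
# Total area in um^2 at 200, 300 and 400 MHz, copied from
# Tables 4.5 and 4.6 (architecture one) and 4.11 and 4.12 (architecture two).
AREA = {
    "65 nm": {"one": [56536.7109, 59816.4414, 62899.0469],
              "two": [46796.8789, 50843.0000, 50314.1602]},
    "90 nm": {"one": [108863.2578, 116362.0625, 117933.8281],
              "two": [91194.0938, 95242.0469, 99506.2188]},
}


def mean(values):
    return sum(values) / len(values)


def summarize(one, two):
    """Mean absolute and relative area difference, architecture one minus two."""
    diffs = [a - b for a, b in zip(one, two)]
    return {
        "mean_diff_um2": mean(diffs),
        "pct_of_arch_one": 100 * mean([d / a for d, a in zip(diffs, one)]),
        "pct_of_arch_two": 100 * mean([d / b for d, b in zip(diffs, two)]),
    }


summary = {tech: summarize(a["one"], a["two"]) for tech, a in AREA.items()}
for tech, s in summary.items():
    print(f"{tech}: {s['mean_diff_um2']:.2f} um^2, "
          f"{s['pct_of_arch_one']:.1f} % of architecture one, "
          f"{s['pct_of_arch_two']:.1f} % of architecture two")
```

With the table values, the figure normalized by architecture one's area appears to land closest to the percentages quoted in the text, which is worth keeping in mind when comparing them with the $15 \%$ estimate.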

(a) Area usage at 200 MHz.

(b) Area usage at 300 MHz.

(c) Area usage at 400 MHz.

Figure 4.10: 65 nm CMOS architecture area comparison.

(a) Area usage at 200 MHz.

(b) Area usage at 300 MHz.

(c) Area usage at 400 MHz.

Figure 4.11: 90 nm CMOS architecture area comparison.

## Chapter 5

## Conclusions

This Chapter concludes the thesis. Four partially IEEE-compliant, pipelined, vectorized floating-point multipliers supporting FP16, FP32 and FP64 input data were proposed in [7], and evaluated concerning area, power, latency and throughput. A methodology for estimating power has been developed to help choose the best architecture to implement given a set of constraints. Two architectures with different area usage and power consumption have been implemented in RTL. Architecture one trades area for better power results, and architecture two trades power for smaller area. The two architectures have equal latency and throughput. The implemented architectures have a latency of five clock cycles, and a throughput of $38400 \mathrm{Mbit} / \mathrm{s}$ at 300 MHz clock frequency.
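The throughput figure is consistent with one 128-bit output vector being produced per clock cycle once the pipeline is full. The sketch below makes that assumption explicit. The function names are illustrative, and the latency in nanoseconds is a derived value that is not quoted in the thesis.

```python
OUTPUT_BITS = 128      # width of the output vector
PIPELINE_STAGES = 5    # latency in clock cycles


def throughput_mbit_per_s(clock_mhz, bits_per_cycle=OUTPUT_BITS):
    """Throughput assuming one full output vector per clock cycle."""
    return clock_mhz * bits_per_cycle


def latency_ns(clock_mhz, stages=PIPELINE_STAGES):
    """Time from input to output for one operation, in nanoseconds."""
    return stages * 1000 / clock_mhz


for clock in (200, 300, 400):
    print(f"{clock} MHz: {throughput_mbit_per_s(clock)} Mbit/s, "
          f"latency {latency_ns(clock):.1f} ns")
```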

The architectures have been tested with 500,000 test cases for each supported format and rounding mode to ensure correct behavior according to the IEEE standard for binary floating-point arithmetic. The simulations revealed an error in the rounding logic, which in rare cases rounds the product to zero when it should be rounded to the smallest representable normalized number, in round-to-nearest-even and round-to-positive-infinity mode. The error is believed to be format independent, but has only been detected when performing FP16 computations.

Architecture one and two have been synthesized at $200 \mathrm{MHz}, 300 \mathrm{MHz}$ and 400 MHz clock frequency, and for typical input data distribution, assuming 20\% FP16 computations, 60\% FP32 computations and 20\% FP64 computations. In addition, the architectures have been synthesized for only FP16 computations, only FP32 computations and only FP64 computations to see how input data distribution affects power consumption. The architectures have been synthesized using a 65 nm low-power standard cell library, and a 90 nm general purpose standard cell library, to see how target technology affects the architectures concerning power.

### 5.1 Estimation Methodologies

An area estimation methodology was developed in [7], and a power estimation methodology has been developed in this thesis. The estimation methodologies have been used to select architecture one and two for implementation.

Power is estimated based on the power dissipated by the significand multipliers, and on simulation results from [1] and [2]. The static power consumed by a full-adder cell was computed using results from [1]. Assuming 30\% static power consumption, in accordance with simulations performed in [2], dynamic power was computed from static power. The total power dissipated by the significand multipliers in the four proposed architectures was computed assuming multipliers implemented as array multipliers, using full adders. Because static power is strongly technology dependent, and varies between process technologies, this estimation methodology has several uncertainties and sources of error. The power estimation methodology predicted architecture one to have the lowest power consumption for FP16 and FP32 input data, and architecture two to have the lowest power consumption for FP64 input data. From the synthesis power reports it is seen that architecture one has lower power consumption than architecture two for all input data distributions, clock frequencies, and in both 65 nm and 90 nm technology. Hence, the power estimation methodology predicted correctly in two of the three estimated input data cases and has a fidelity of $66 \%$.
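The step from static to dynamic power under the 30\% assumption can be written out as follows. This is only the arithmetic implied by that assumption; the symbols $P_{\text{static}}$, $P_{\text{dynamic}}$ and $P_{\text{total}}$ are introduced here for illustration and are not notation from [1] or [2].

$$
P_{\text{static}} = 0.3 \, P_{\text{total}}, \qquad
P_{\text{dynamic}} = P_{\text{total}} - P_{\text{static}} = \frac{0.7}{0.3} \, P_{\text{static}} \approx 2.33 \, P_{\text{static}}
$$

If the static share is instead closer to $1 \%$ or $10 \%$ of total power, as in the synthesized circuits, the same relation gives a much larger dynamic-to-static ratio, which illustrates how sensitive the estimate is to this assumption.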

Because the area required by registers and significand multipliers accounts for the larger part of the vectorized floating-point multipliers proposed in [7], this is used to compare the architectures concerning area. Area estimations are performed based on the transistor count in the significand multipliers and registers. The ratio of the number of transistors in a full-adder cell to the number of transistors in a 1-bit register is used to compute the area required by the significand multipliers and registers of the different architectures. This gives a good picture of the relative difference in area usage between the four architectures. Architecture one was estimated to be $15 \%$ larger than architecture two. From the synthesis area reports it is seen that architecture one is $16.7 \%$ larger than architecture two in 90 nm technology, and $17.5 \%$ in 65 nm technology, on average at 200 MHz, 300 MHz and 400 MHz clock frequency. Hence, the estimation methodology predicted a quite accurate relative difference between the architectures, and has a fidelity of $100 \%$.

### 5.2 Power Results

The implemented architectures are designed to have different power consumptions, independent of target technology. However, because the 65 nm library is a low-power process and the 90 nm library is a general purpose process, the results also highlight how the difference in power consumption between the two architectures depends on target technology, in addition to the architectural differences. When realized in a 65 nm low-power library, architecture one has a total power consumption of 1.9200 mW at 300 MHz, and architecture two a total power consumption of 7.3569 mW. The average increase in total power consumption is $0.6505 \mathrm{~mW} / 100 \mathrm{MHz}$ for architecture one, and $2.5735 \mathrm{~mW} / 100 \mathrm{MHz}$ for architecture two. When realized in a 90 nm general purpose library, architecture one has a total power consumption of 15.4090 mW at 300 MHz, and an average increase in total power consumption of $5.1850 \mathrm{~mW} / 100 \mathrm{MHz}$. Architecture two has a total power consumption of 17.4640 mW at 300 MHz, and an average increase of $6.2405 \mathrm{~mW} / 100 \mathrm{MHz}$.

The difference in power consumption between the two architectures is larger when they are realized in a low-power process than in a general purpose process. In the 65 nm low-power library, the difference at 300 MHz is 5.4369 mW, because the synthesis tools exploit the low-power properties of the library when performing circuit optimization. In the 90 nm general purpose library, the difference in total power consumption is 2.0550 mW at 300 MHz. Hence, to obtain the best power result, architecture one should be realized in a low-power process.

### 5.3 Area Results

Because architecture one trades area for better power results, it was estimated to use $15 \%$ larger area than architecture two. When realized in the 65 nm library, the area usage at 300 MHz is $59816.4414 \mu \mathrm{~m}^{2}$ for architecture one and $50843.0000 \mu \mathrm{~m}^{2}$ for architecture two. When realized in the 90 nm library, the area usage is $116362.0625 \mu \mathrm{~m}^{2}$ for architecture one and $95242.0469 \mu \mathrm{~m}^{2}$ for architecture two. Area is affected by clock frequency because area is traded to meet timing constraints, mainly in the 53-bit significand multiplier. On average, architecture one is $17.5 \%$ larger than architecture two in the 65 nm circuits, and $16.7 \%$ larger than architecture two in the 90 nm circuits. Hence, the relative difference in area usage is approximately equal when realized in a low-power library and in a general purpose library.

### 5.4 Future Work

The implemented architectures can be improved in several ways. Sticky exceptions, and clearing of exceptions, have not been implemented properly. The implemented vectorized floating-point multipliers generate exceptions according to the IEEE standard for binary floating-point arithmetic, but the standard requires that exceptions shall be sticky and explicitly cleared by the user. Exceptions should be cleared by writing to a clear-register. This has not been implemented according to the standard, and should be implemented to comply with the IEEE standard.

An error in the rounding logic has been detected when simulating only FP16 input data in round-to-nearest-even and round-to-positive-infinity mode. The error is believed to be format independent, but has only been detected by the FP16 input vectors. The result should be rounded to the smallest representable normalized number, but is rounded to zero. This error also has to be corrected to make the vectorized floating-point multipliers IEEE compliant. Rounding could be performed more effectively if the QFT algorithm presented in [20] is used. This requires the sum and carry from the significand multipliers to be delivered as carry-save encoded vectors. The DesignWare® library provides a multiplier with carry-save encoded sum and carry output [29], which could be used when implementing this algorithm.
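A software reference can help construct directed test vectors for this class of result. The sketch below is an illustration only: the operands are hypothetical and are not known to trigger the error in the implemented design. It uses the binary16 support of Python's `struct` module, which in CPython rounds to nearest with ties to even, to show a product that lies just below the smallest normalized FP16 number and is rounded up to it.

```python
import struct


def fp16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def fp16_to_bits(value: float) -> int:
    """Round a Python float to binary16 and return the bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


SMALLEST_NORMAL_BITS = 0x0400  # 2**-14, smallest normalized FP16 number

# Hypothetical operands: a is the largest FP16 value below 1.0,
# b is the smallest normalized FP16 value.
a = fp16_from_bits(0x3BFF)
b = fp16_from_bits(SMALLEST_NORMAL_BITS)

exact_product = a * b                      # exact in double precision
rounded_bits = fp16_to_bits(exact_product)  # round-to-nearest-even reference

print(f"exact product:     {exact_product!r}")
print(f"rounded FP16 bits: 0x{rounded_bits:04X}")
print("rounds to smallest normal:", rounded_bits == SMALLEST_NORMAL_BITS)
```

A reference of this kind only covers round-to-nearest-even; a reference for round-to-positive-infinity would need a separate rounding routine.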

Because the power estimation methodology did not predict the correct relative difference in power consumption in all cases, it should be improved. To improve the power estimation methodology, target technology has to be taken into account, because static power differs significantly between a low-power library and a general purpose library. In addition, the power consumption of architecture two is not equal for FP16, FP32 and FP64 computations, as was estimated. As seen from the 65 nm synthesis results of architecture one, the multiplier unit is not the most power-consuming unit. This should be investigated further. If this is the case, the power estimation methodology cannot be based on the significand multipliers alone; the power dissipated by the registers also has to be included. A weight function should be developed, where input format distribution and target technology are included when estimating power. Power, area and throughput should be weighted for a given set of architectures and constraints to give a better basis for choosing the best architecture to implement. Clock frequency should perhaps be included in the methodology as well, because the difference in power consumption between the two architectures grows with increasing clock frequency.

The architectures are generically implemented and can relatively easily be changed to a 256-bit input vectorized floating-point multiplier. The differences in area usage will then be greater, and it might be interesting to look at the power consumption of the two architectures, especially in a general-purpose process where static power is a significant contributor to total power. Hence, architecture two might have better power results than architecture one, due to lower static power dissipation and because FP16, FP32 and FP64 computations do not have equal dynamic power consumption, as was assumed in the power estimation methodology.

## Bibliography

[1] S. T. Oskuii, Design of Low-Power Reduction-Trees in Parallel Multipliers. PhD thesis, Norwegian University of Science and Technology, 2008.
[2] Q. X. et al., "Efficient subthreshold leakage current optimization - leakage current optimization and layout migration for 90- and 65-nm ASIC libraries," Circuits and Devices Magazine, IEEE, vol. 22, no. 5, pp. 39-47, Sept.-Oct. 2006.
[3] ARM, "Mali™ graphics solution." http://www.arm.com/products/esd/multimediagraphics_malioverview.html.
[4] A. Stevens, "ARM® Mali™ 3D graphics system solution." http://www.arm.com/miscPDFs/16514.pdf, December 2006.
[5] Khronos, "OpenGL ES - the standard for embedded accelerated 3D graphics." http://www.khronos.org/opengles/.
[6] Khronos, "OpenVG - the standard for vector graphics acceleration." http://www.khronos.org/openvg/.
[7] E. Stenersen, "Vectorized 256-bit input FP16/FP32/FP64 floating-point multiplier." Norwegian University of Science and Technology, 2007.
[8] IEEE, IEEE Standard for Binary Floating-Point Arithmetic. IEEE, 1985.
[9] I. Koren, Computer Arithmetic Algorithms. Natick, MA, USA: A. K. Peters, Ltd., 2001.
[10] T. Njølstad, "Introduction to SIE40AA low power digital design," NTNU, 2002.
[11] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design. Springer, 1995.
[12] L. Wanhammar, DSP Integrated Circuits. Academic Press, 1999.
[13] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza 34, pp. 349-356, May 1965.
[14] C. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Comp., pp. 14-17, Feb. 1964.
[15] M. Pedram, "Power minimization in IC design: principles and applications," ACM Trans. Des. Autom. Electron. Syst., vol. 1, no. 1, pp. 3-56, 1996.
[16] D. D. Gajski, Principles of Digital Design. Prentice Hall, 1997.
[17] Synopsys, "DesignWare®." http://www.synopsys.com/dw/buildingblock.php.
[18] Synopsys, "DesignWare®." http://www.synopsys.com/products/designware/docs/doc/dwf/datasheets/dw02_mult.pdf.
[19] Synopsys, "DesignWare®." http://www.synopsys.com/products/designware/dwtb/articles/multiplier_bldg_block/multiplier_bldg_block.html.
[20] G. Even and P.-M. Seidel, "A comparison of three rounding algorithms for IEEE floating-point multiplication," IEEE Trans. Comput., vol. 49, no. 7, pp. 638-650, 2000.
[21] N. T. Quach, N. Takagi, and M. J. Flynn, "Systematic IEEE rounding method for high-speed floating-point multipliers," IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 5, pp. 511-521, 2004.
[22] S. Williams, "Icarus Verilog." http://www.icarus.com/eda/verilog/.
[23] Synopsys, "DesignWare®." http://www.synopsys.com/products/designware/docs/doc/dwf/datasheets/dw_fp_mult.pdf.
[24] Synopsys, "DesignWare®." http://www.synopsys.com/products/designware/docs/doc/dwf/datasheets/fp_overview2.pdf.
[25] Synopsys, "Design Compiler™." http://www.synopsys.com/products/logic/design_compiler.html.
[26] Synopsys, "Power Compiler™." http://www.synopsys.com/products/power/power_ds.html.
[27] Synopsys, Power Products Reference Manual, 1999.
[28] Synopsys, "Design Compiler Ultra performance capabilities." http://www.analogy.com/products/logic/adc_ultratech_bgr.pdf.
[29] Synopsys, "Partial product multiplier." http://synopsys.com/products/designware/docs/doc/dwf/datasheets/dw02_multp.pdf.

## Appendix A

## Architecture One Verilog Sources

The defines file contains definitions used in the design files.

```
// File.......: defines.v
// Author.....: Espen Stenersen
// Date.......: Wed May 14 11:45:28 CEST 2008
// Revision...: 1.0
// Description: Contains definitions used in the design files.
//              Operand widths, exponent widths, significand widths,
//              bias values and bus widths.
//
`define FP16 0
`define FP32 1
`define FP64 2
`define FP16W 16
`define FP32W 32
`define FP64W 64
`define FP16SW 10
`define FP32SW 23
`define FP64SW 52
`define FP16EW 5
`define FP32EW 8
`define FP64EW 11
`define FP16BIAS 15
`define FP32BIAS 127
`define FP64BIAS 1023
`define FRACBUS 2*(`FP64SW+1)
`define EXPBUS 4*`FP32EW
`define SIGNBUS 4
`define BUS 128
`define EVEN 0
`define PINF 1
`define NINF 2
```
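To make the role of the width and bias definitions concrete, the following Python sketch splits a raw FP16, FP32 or FP64 word into sign, exponent and fraction fields using the same field widths as `defines.v`. It is a software illustration only; the names `FORMATS`, `split_fields` and `classify` are hypothetical and do not appear in the design files.

```python
# Field widths and biases mirror FP16EW/FP16SW/FP16BIAS etc. in defines.v.
FORMATS = {
    "FP16": {"width": 16, "exp_bits": 5, "frac_bits": 10, "bias": 15},
    "FP32": {"width": 32, "exp_bits": 8, "frac_bits": 23, "bias": 127},
    "FP64": {"width": 64, "exp_bits": 11, "frac_bits": 52, "bias": 1023},
}


def split_fields(word, fmt):
    """Return (sign, biased exponent, stored fraction) of a raw word."""
    f = FORMATS[fmt]
    frac = word & ((1 << f["frac_bits"]) - 1)
    exp = (word >> f["frac_bits"]) & ((1 << f["exp_bits"]) - 1)
    sign = (word >> (f["width"] - 1)) & 1
    return sign, exp, frac


def classify(word, fmt):
    """Classify a word as 'nan', 'inf', 'zero' or 'finite'."""
    f = FORMATS[fmt]
    _, exp, frac = split_fields(word, fmt)
    if exp == (1 << f["exp_bits"]) - 1:   # exponent field all ones
        return "nan" if frac else "inf"
    if exp == 0 and frac == 0:
        return "zero"
    return "finite"


sign, exp, frac = split_fields(0x3C00, "FP16")   # FP16 encoding of 1.0
print(sign, exp - FORMATS["FP16"]["bias"], frac)
```

For the FP16 encoding of 1.0 the biased exponent equals the bias, so the unbiased exponent printed by the example is zero.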

```
// File.......: vec_fp_mult.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 10:30:15 CEST 2008
// Revision...: 1.0
// Description: Vectorized FP16/FP32/FP64 floating-point multiplier
//              top module. Assembles the architecture.
//
`include "defines.v"
module vec_fp_mult
(
    start,      // Input. Starts computation.
    vectors,    // Input. FP vectors to be computed.
    format,     // Input. Format of vectors.
    mode,       // Input. Rounding mode.
    clear,      // Input. Clears specified exceptions.
    products,   // Output. Computed products.
    exceptions, // Output. Exceptions raised.
    ready,      // Output. Output vector ready.
    clk,
    reset_n
);
    // input(s)
    input clk;
    input reset_n;
    input start;
    input [`BUS-1:0] vectors;
    input [1:0] format;
    input [1:0] mode;
    input [15:0] clear;
    // output(s)
    output [`BUS-1:0] products;
    output [15:0] exceptions;
    output ready;
    // wire(s)
    wire reset;
    wire [`BUS/2-1:0] DRH_to_stage2;
    wire [`BUS/2-1:0] DRL_to_stage2;
    wire [1:0] IR_to_stage2;
    wire [1:0] IR_to_stage3;
    wire [1:0] IR_to_stage4;
    wire [1:0] M_to_stage2;
    wire [1:0] M_to_stage3;
    wire [1:0] M_to_stage4;
    wire [15:0] S0_to_stage4;
    wire [`FRACBUS-1:0] DRF_to_stage3;
    wire [`FRACBUS-1:0] DRF_to_stage4;
    wire [`EXPBUS-1:0] DRE_to_stage3;
    wire [`EXPBUS/2+3:0] DRE_to_stage4;
    wire [`SIGNBUS-1:0] DRS_to_stage3;
    wire [`SIGNBUS/2-1:0] DRS_to_stage4;
    wire start_to_stage1;
    wire start_to_stage2;
    wire start_to_stage3;
    wire start_to_stage4;
    wire load_ST0;

    // Module instantiation.
    // Registers. Keep track of start signal to set ready signal when
    // needed.
    reg_enable #(1) ST0
    (
        .d (start), // Data in.
        .q (start_to_stage1), // Data out.
        .enable (load_ST0), // Enable bit.
        .clk (clk),
        .reset (reset)
    );
    ff #(1) ST1
    (
        .d (start_to_stage1), // Data in.
        .q (start_to_stage2), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(1) ST2
    (
        .d (start_to_stage2), // Data in.
        .q (start_to_stage3), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(1) ST3
    (
        .d (start_to_stage3), // Data in.
        .q (start_to_stage4), // Data out.
        .clk (clk),
        .reset (reset)
    );
    // Pipeline stage 1.
    stage1 stage1
    (
        .vectors (vectors), // Input. Vectors.
        .start (start), // Input. Start computing.
        .format (format), // Input. Data format.
        .mode (mode), // Input. Rounding mode.
        .DRH0_out (DRH_to_stage2), // Output. [127:64] of input.
        .DRL0_out (DRL_to_stage2), // Output. [63:0] of input.
        .IR0_out (IR_to_stage2), // Output. Format.
        .M0_out (M_to_stage2), // Output. Rounding mode.
        .clk (clk),
        .reset (reset)
    );
    // Pipeline stage 2.
    stage2 stage2
    (
        .DRH0 (DRH_to_stage2), // Input from input register DRH0.
        .DRL0 (DRL_to_stage2), // Input from input register DRL0.
        .format (IR_to_stage2), // Input from format register IR0.
        .mode (M_to_stage2),
        .DRF0_out (DRF_to_stage3),
        .DRE0_out (DRE_to_stage3),
        .DRS0_out (DRS_to_stage3),
        .IR1_out (IR_to_stage3),
        .M1_out (M_to_stage3),
        .clk (clk),
        .reset (reset)
    );
    // Pipeline stage 3.
    stage3 stage3
    (
        .DRF0 (DRF_to_stage3), // Input from register DRF0.
        .DRE0 (DRE_to_stage3),
        .DRS0 (DRS_to_stage3),
        .format (IR_to_stage3),
        .mode (M_to_stage3),
        .DRS1_out (DRS_to_stage4),
        .DRE1_out (DRE_to_stage4),
        .DRF1_out (DRF_to_stage4),
        .S0_out (S0_to_stage4),
        .M2_out (M_to_stage4),
        .IR2_out (IR_to_stage4),
        .clk (clk),
        .reset (reset)
    );
    // Pipeline stage 4 & 5.
    stage4 stage4
    (
        .start (start_to_stage4), // Input.
        .DRF1 (DRF_to_stage4), // Input from register DRF1.
        .DRE1 (DRE_to_stage4), // Input from register DRE1.
        .DRS1 (DRS_to_stage4), // Input from register DRS1.
        .specials (S0_to_stage4), // Input from register S0.
        .format (IR_to_stage4),
        .mode (M_to_stage4),
        .clear_excps (clear),
        .products (products),
        .exceptions (exceptions),
        .ready (ready),
        .clk (clk),
        .reset (reset)
    );
    // Internal active high reset.
    assign reset = !reset_n;
    assign load_ST0 = start;
endmodule // vec_fp_mult
```

```
// File.......: stage1.v
// Author.....: Espen Stenersen
// Date.......: Fri Apr 18 16:11:23 CEST 2008
// Revision...: 1.0
// Description: Stage one in pipeline.
//
`include "defines.v"
module stage1
(
    vectors,  // Input. Vectors.
    start,    // Input. Start computing.
    format,   // Input. Data format.
    mode,     // Input. Rounding mode.
    DRH0_out, // Output. [127:64] of input.
    DRL0_out, // Output. [63:0] of input.
    IR0_out,  // Output. Format.
    M0_out,   // Output. Rounding mode.
    clk,
    reset
);
    // input(s)
    input [`BUS-1:0] vectors;
    input [0:0] start;
    input [1:0] format;
    input [1:0] mode;
    input clk;
    input reset;
    // output(s)
    output [`BUS/2-1:0] DRH0_out;
    output [`BUS/2-1:0] DRL0_out;
    output [1:0] IR0_out;
    output [1:0] M0_out;
    // wire(s)
    wire load_drh;
    wire load_drl;
    wire load_ir0;
    wire load_m0;
    // reg(s)
    // Module instantiation.
    // Registers.
    reg_enable #(`BUS/2) DRH0
    (
        .d (vectors[`BUS-1:`BUS/2]), // Data in.
        .q (DRH0_out), // Data out.
        .enable (load_drh), // Enable bit.
        .clk (clk),
        .reset (reset)
    );
    reg_enable #(`BUS/2) DRL0
    (
```

```
// File.......: stage2.v
// Author.....: Espen Stenersen
// Date.......: Fri Apr 18 16:29:31 CEST 2008
// Revision...: 1.0
// Description: Stage two of pipeline.
//
`include "defines.v"
module stage2
(
    DRH0,     // Input from input register DRH0.
    DRL0,     // Input from input register DRL0.
    format,   // Input from format register IR0.
    mode,     // Input from rounding mode register M0.
    DRF0_out, // Output to significand multipliers.
    DRE0_out, // Output to exponent adders.
    DRS0_out, // Output to sign computation.
    IR1_out,  // Output.
    M1_out,   // Output.
    clk,
    reset
);
    // input(s)
    input [`BUS/2-1:0] DRH0;
    input [`BUS/2-1:0] DRL0;
    input [1:0] format;
    input [1:0] mode;
    input clk;
    input reset;
    // output(s)
    output [`FRACBUS-1:0] DRF0_out;
    output [`EXPBUS-1:0] DRE0_out;
    output [`SIGNBUS-1:0] DRS0_out;
    output [1:0] IR1_out;
    output [1:0] M1_out;
    // wire(s)
    wire [`FRACBUS-1:0] fracs;
    wire [`EXPBUS-1:0] exps;
    wire [`SIGNBUS-1:0] signs;
    // reg(s)
    // Module instantiations.
    //
    // Registers.
    ff #(`FRACBUS) DRF0
    (
        .d (fracs), // Data in.
        .q (DRF0_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(`EXPBUS) DRE0
    (
        .d (exps), // Data in.
        .q (DRE0_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(`SIGNBUS) DRS0
    (
        .d (signs), // Data in.
        .q (DRS0_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(2) IR1
    (
        .d (format), // Data in.
        .q (IR1_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(2) M1
    (
        .d (mode), // Data in.
        .q (M1_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    // Input mux / selector.
    sel_input sel_input
    (
        .drh (DRH0), // Input from data-register high (DRH0).
        .drl (DRL0), // Input from data-register low (DRL0).
        .format (format), // Input from instruction (format) register.
        .signs (signs), // Output to sign bus.
        .exps (exps), // Output to exponent bus.
        .fracs (fracs) // Output to significand bus.
    );
    defparam sel_input.WIDTH = `BUS/2;
    defparam sel_input.SIGNBUS = `SIGNBUS;
    defparam sel_input.EXPBUS = `EXPBUS;
    defparam sel_input.FRACBUS = `FRACBUS;
endmodule // stage2
```

```
// File.......: stage3.v
// Author.....: Espen Stenersen
// Date.......: Fri Apr 18 16:51:03 CEST 2008
// Revision...: 1.0
// Description: Stage three of pipeline.
//
`include "defines.v"
module stage3
(
    DRF0,     // Input from fraction register.
    DRE0,     // Input from exponent register.
    DRS0,     // Input from sign register.
    format,   // Input from format register.
    mode,     // Input from rounding mode register.
    DRS1_out, // Output to sign register.
    DRE1_out, // Output to exponent register.
    DRF1_out, // Output to fraction register.
    S0_out,   // Output to special values register.
    M2_out,   // Output to rounding mode register.
    IR2_out,  // Output to format register.
    clk,
    reset
);
    // input(s)
    input [`FRACBUS-1:0] DRF0;
    input [`EXPBUS-1:0] DRE0;
    input [`SIGNBUS-1:0] DRS0;
    input [1:0] format;
    input [1:0] mode;
    input clk;
    input reset;
    // output(s)
    output [`FRACBUS-1:0] DRF1_out;
    output [`SIGNBUS/2-1:0] DRS1_out;
    output [`EXPBUS/2+3:0] DRE1_out; // + overflow/underflow bits.
    output [15:0] S0_out;
    output [1:0] M2_out;
    output [1:0] IR2_out;
    // wire(s)
    wire [`FRACBUS-1:0] prods;
    wire [`SIGNBUS/2-1:0] signs;
    wire [`EXPBUS/2+3:0] sums;
    wire [15:0] specials;
    wire [3:0] ints;
    wire [3:0] infs;
    wire [3:0] nans;
    wire [3:0] zeroes;
    // reg(s)
    // Module instantiations.
    // Registers.
    ff #(`FRACBUS) DRF1
    (
        .d (prods), // Data in.
        .q (DRF1_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(`EXPBUS/2+4) DRE1
    (
        .d (sums), // Data in.
        .q (DRE1_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(`SIGNBUS/2) DRS1
    (
        .d (signs), // Data in.
        .q (DRS1_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(16) S0
    (
        .d (specials), // Data in.
        .q (S0_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(2) IR2
    (
        .d (format), // Data in.
        .q (IR2_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    ff #(2) M2
    (
        .d (mode), // Data in.
        .q (M2_out), // Data out.
        .clk (clk),
        .reset (reset)
    );
    // Computational units.
    chk_special #(`FRACBUS, `EXPBUS) chk_special
    (
        .fracs (DRF0), // Input from significand bus.
        .exps (DRE0), // Input from exponent bus.
        .format (format), // Input.
        .infs (infs), // Output.
        .ints (ints), // Output.
        .nans (nans), // Output.
        .zeroes (zeroes) // Output.
    );
    mult_unit mult_unit
    (
        .fracs (DRF0), // Input from significand bus.
        .format (format), // Input from instruction register.
        .prods (prods) // Output to significand bus.
    );
    exp_unit exp_unit
    (
        .exps (DRE0), // Input from exponent bus.
        .format (format), // Input from instruction register.
        .sums (sums) // Output to exponent bus.
    );
    sign_unit sign_unit
    (
        .signs (DRS0), // Input signs from sign bus.
        .signs_comp (signs) // Output to sign bus.
    );
    assign specials = {ints, zeroes, infs, nans};
endmodule // stage3
```

```
// File......: sel_output_tb.v
// Author.....: Espen Stenersen
// Date......: Thu Apr 24 21:39:26 CEST 2008
// Revision...: 1.0
// Description: For testing select output logic.
//
`include "defines.v"
module stage4
(
    start,      // Input.
    DRF1,       // Input from fraction register DRF1.
    DRE1,       // Input from exponent register DRE1.
    DRS1,       // Input from sign register DRS1.
    specials,   // Input from special values register S0.
    format,     // Input from format register IR2.
    mode,       // Input from rounding mode register M2.
    clear_excps,// Input. Clear exceptions.
    products,   // Output. Final result.
    exceptions, // Output. Exceptions.
    ready,      // Output. Result ready.
    clk,
    reset
);
    // Input(s)
    input start;
    input [`FRACBUS-1:0] DRF1;
    input [`EXPBUS/2+3:0] DRE1; // + overflow/underflow bits.
    input [`SIGNBUS/2-1:0] DRS1;
    input [15:0] specials;
    input [1:0] format;
    input [1:0] mode;
    input [15:0] clear_excps;
    input clk;
    input reset;
    // Output(s)
    output [`BUS-1:0] products;
    output [15:0] exceptions;
    output ready;
    // wire(s)
    wire [`BUS-1:0] products;
    wire [15:0] exceptions_tmp;
    wire load_drh;
    wire load_drlh;
    wire load_drll;
    wire load_excep_l;
    wire load_excep_h;
    wire [15:0] ex;
    wire [15:0] clear;
    wire [7:0] clear_l;
    wire [7:0] clear_h;
    wire [7:0] ex_h_in;
    wire [7:0] ex_l_in;
    wire [7:0] ex_h_out;
    wire [7:0] ex_l_out;
    wire [`BUS-1:0] prods;
    wire ready_tmp;
```

```
wire [`BUS/2-1:0] result;
wire [7:0] exceps;
wire [1:0] format;
// reg(s)
// Module instantiation.
//
// Product registers.
reg_enable #(32) DRLL
(
    .d (prods[31:0]), // Data in.
    .q (products[31:0]), // Data out.
    .enable (load_drll), // Enable bit.
    .clk (clk),
    .reset (reset)
);
reg_enable #(32) DRLH
(
    .d (prods[63:32]), // Data in.
    .q (products[63:32]), // Data out.
    .enable (load_drlh), // Enable bit.
    .clk (clk),
    .reset (reset)
);
reg_enable #(64) DRH
(
    .d (prods[127:64]), // Data in.
    .q (products[127:64]), // Data out.
    .enable (load_drh), // Enable bit.
    .clk (clk),
    .reset (reset)
);
// Exception registers.
reg_enable #(8) EXCPL
//reg_excep #(8) EXCPL
(
    .d (ex_l_in), // Data in.
    .q (ex_l_out), // Data out.
    .enable (load_excep_l), // Enable bit.
    //.clear (clear_l),
    .clk (clk),
    .reset (reset)
);
reg_enable #(8) EXCPH
//reg_excep #(8) EXCPH
(
    .d (ex_h_in), // Data in.
    .q (ex_h_out), // Data out.
    .enable (load_excep_h), // Enable bit.
    //.clear (clear_h),
    .clk (clk),
    .reset (reset)
);
```

```
// Clear register. Written to in order to clear exceptions.
// [unf p3 .. p0, ovf p3 .. p0, inx p3 .. p0, nan p3 .. p0]
ff #(16) CLEAR
(
    .d (clear_excps), // Data in.
    .q (clear), // Data out.
    .clk (clk),
    .reset (reset)
);
// Ready register.
reg_set #(1) READY
(
    .set (ready_tmp),
    .q (ready),
    .clk (clk),
    .reset (reset)
);
// Rounding unit.
rne_unit rne_unit
(
    .fracs (DRF1), // Input from fraction bus.
    .exps (DRE1), // Input from exponent bus.
    .signs (DRS1), // Input from sign bus.
    .format (format), // Input from instruction register.
    .special (specials), // Input from check special.
    .mode (mode), // Input from mode register.
    .exceps (exceps), // Output exceptions.
    .result (result) // Output. Rounded result.
);
// Output selector.
sel_output sel_output
(
    .result (result), // Input from rounding logic.
    .exceps (exceps), // Input from rounding logic.
    .format (format), // Input from format register.
    .start (start), // Input from start register.
    .products (prods), // Output to output register.
    .load_drh (load_drh), // Output to output register.
    .load_drlh (load_drlh), // Output to output register.
    .load_drll (load_drll), // Output to output register.
    .exceptions (ex), // Output to exception register.
    .load_excep_l (load_excep_l),// Output to exception register.
    .load_excep_h (load_excep_h),// Output to exception register.
    .reset (reset),
    .clk (clk)
);
// Assigns.
//
assign ready_tmp =
    (format == `FP16) ? load_drlh&start : load_drh&start;
assign clear_l =
    {clear[13:12], clear[9:8], clear[5:4], clear[1:0]};
assign clear_h =
    {clear[15:14], clear[11:10], clear[7:6], clear[3:2]};
```

```
endmodule // stage4
```
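The bit selection performed by the `clear_l` and `clear_h` assignments in `stage4` can be checked with a short software model. The Python sketch below assumes the bit layout given in the clear register comment, `[unf p3 .. p0, ovf p3 .. p0, inx p3 .. p0, nan p3 .. p0]`; the helper names `bits` and `split_clear` are hypothetical and are not part of the design.

```python
def bits(value, hi, lo):
    """Extract value[hi:lo], like a Verilog part-select."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def split_clear(clear):
    """Split the 16-bit clear vector into low and high 8-bit halves.

    clear_l collects the flags of products p1 and p0 of each exception
    type, clear_h collects the flags of products p3 and p2.
    """
    clear_l = ((bits(clear, 13, 12) << 6) | (bits(clear, 9, 8) << 4) |
               (bits(clear, 5, 4) << 2) | bits(clear, 1, 0))
    clear_h = ((bits(clear, 15, 14) << 6) | (bits(clear, 11, 10) << 4) |
               (bits(clear, 7, 6) << 2) | bits(clear, 3, 2))
    return clear_l, clear_h


# Clearing only the flags of products p1 and p0 affects only clear_l.
print([hex(v) for v in split_clear(0x3333)])
```

Each 8-bit half therefore lines up with one of the two 8-bit exception registers, `EXCPL` and `EXCPH`.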

```
// File.......: chk_special.v
// Author.....: Espen Stenersen
// Date......: Tue Apr 15 11:30:08 CEST 2008
// Revision...: 1.0
// Description: Checks if inputs equals special values such as
// infinity, nan, zero or int. Result is used for
// exception generation.
//
`include "defines.v"
module chk_special
(
    fracs, // Input from significand bus.
    exps, // Input from exponent bus.
    format, // Input.
    infs, // Output.
    ints, // Output.
    nans, // Output.
    zeroes // Output.
);
    parameter FRACBUS = `FRACBUS;
    parameter EXPBUS = `EXPBUS;
    // input(s)
    input [FRACBUS-1:0] fracs;
    input [EXPBUS-1:0] exps;
    input [1:0] format;
    // output(s)
    output [3:0] infs;
    output [3:0] ints;
    output [3:0] nans;
    output [3:0] zeroes;
    // wire(s)
    wire [EXPBUS/2-1:0] exponent_a;
    wire [EXPBUS/2-1:0] exponent_b;
    wire [FRACBUS/2-1:0] significand_a;
    wire [FRACBUS/2-1:0] significand_b;
    wire nan_a0;
    wire nan_a1;
    wire nan_b0;
    wire nan_b1;
    wire inf_a0;
    wire inf_a1;
    wire inf_b0;
    wire inf_b1;
    wire int_a0;
    wire int_a1;
    wire int_b0;
    wire int_b1;
    wire zero_a0;
    wire zero_a1;
    wire zero_b0;
    wire zero_b1;
    // reg(s)
```

```
// Combinational assigns.
// fracs[1*(`FP16SW+1)-2:0*(`FP16SW+1)] because significands are
// now extended to 11, 24 and 53 bits, including the implicit bit.
// Assign invalid inputs.
assign nan_a0 =
    (format == `FP16) ?
        (&exps[1*(`FP16EW)-1:0*(`FP16EW)]) &
        (|fracs[1*(`FP16SW+1)-2:0*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[1*(`FP32EW)-1:0*(`FP32EW)]) &
        (|fracs[1*(`FP32SW+1)-2:0*(`FP32SW+1)]) :
    (format == `FP64) ?
        (&exps[1*(`FP64EW)-1:0*(`FP64EW)]) &
        (|fracs[1*(`FP64SW+1)-2:0*(`FP64SW+1)]) : 1'b0;
assign nan_b0 =
    (format == `FP16) ?
        (&exps[2*(`FP16EW)-1:1*(`FP16EW)]) &
        (|fracs[2*(`FP16SW+1)-2:1*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[2*(`FP32EW)-1:1*(`FP32EW)]) &
        (|fracs[2*(`FP32SW+1)-2:1*(`FP32SW+1)]) :
    (format == `FP64) ?
        (&exps[2*(`FP64EW)-1:1*(`FP64EW)]) &
        (|fracs[2*(`FP64SW+1)-2:1*(`FP64SW+1)]) : 1'b0;
assign nan_a1 =
    (format == `FP16) ?
        (&exps[3*(`FP16EW)-1:2*(`FP16EW)]) &
        (|fracs[3*(`FP16SW+1)-2:2*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[3*(`FP32EW)-1:2*(`FP32EW)]) &
        (|fracs[3*(`FP32SW+1)-2:2*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
assign nan_b1 =
    (format == `FP16) ?
        (&exps[4*(`FP16EW)-1:3*(`FP16EW)]) &
        (|fracs[4*(`FP16SW+1)-2:3*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[4*(`FP32EW)-1:3*(`FP32EW)]) &
        (|fracs[4*(`FP32SW+1)-2:3*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
// Assign infinity inputs.
assign inf_a0 =
    (format == `FP16) ?
        (&exps[1*(`FP16EW)-1:0*(`FP16EW)]) &
        (~|fracs[1*(`FP16SW+1)-2:0*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[1*(`FP32EW)-1:0*(`FP32EW)]) &
        (~|fracs[1*(`FP32SW+1)-2:0*(`FP32SW+1)]) :
    (format == `FP64) ?
        (&exps[1*(`FP64EW)-1:0*(`FP64EW)]) &
        (~|fracs[1*(`FP64SW+1)-2:0*(`FP64SW+1)]) : 1'b0;
```

```
assign inf_b0 =
    (format == `FP16) ?
        (&exps[2*(`FP16EW)-1:1*(`FP16EW)]) &
        (~|fracs[2*(`FP16SW+1)-2:1*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[2*(`FP32EW)-1:1*(`FP32EW)]) &
        (~|fracs[2*(`FP32SW+1)-2:1*(`FP32SW+1)]) :
    (format == `FP64) ?
        (&exps[2*(`FP64EW)-1:1*(`FP64EW)]) &
        (~|fracs[2*(`FP64SW+1)-2:1*(`FP64SW+1)]) : 1'b0;
assign inf_a1 =
    (format == `FP16) ?
        (&exps[3*(`FP16EW)-1:2*(`FP16EW)]) &
        (~|fracs[3*(`FP16SW+1)-2:2*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[3*(`FP32EW)-1:2*(`FP32EW)]) &
        (~|fracs[3*(`FP32SW+1)-2:2*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
assign inf_b1 =
    (format == `FP16) ?
        (&exps[4*(`FP16EW)-1:3*(`FP16EW)]) &
        (~|fracs[4*(`FP16SW+1)-2:3*(`FP16SW+1)]) :
    (format == `FP32) ?
        (&exps[4*(`FP32EW)-1:3*(`FP32EW)]) &
        (~|fracs[4*(`FP32SW+1)-2:3*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
// Assign zero inputs.
assign zero_a0 =
    (format == `FP16) ?
        (~|exps[1*(`FP16EW)-1:0*(`FP16EW)]) &
        (~|fracs[1*(`FP16SW+1)-2:0*(`FP16SW+1)]) :
    (format == `FP32) ?
        (~|exps[1*(`FP32EW)-1:0*(`FP32EW)]) &
        (~|fracs[1*(`FP32SW+1)-2:0*(`FP32SW+1)]) :
    (format == `FP64) ?
        (~|exps[1*(`FP64EW)-1:0*(`FP64EW)]) &
        (~|fracs[1*(`FP64SW+1)-2:0*(`FP64SW+1)]) : 1'b0;
assign zero_b0 =
    (format == `FP16) ?
        (~|exps[2*(`FP16EW)-1:1*(`FP16EW)]) &
        (~|fracs[2*(`FP16SW+1)-2:1*(`FP16SW+1)]) :
    (format == `FP32) ?
        (~|exps[2*(`FP32EW)-1:1*(`FP32EW)]) &
        (~|fracs[2*(`FP32SW+1)-2:1*(`FP32SW+1)]) :
    (format == `FP64) ?
        (~|exps[2*(`FP64EW)-1:1*(`FP64EW)]) &
        (~|fracs[2*(`FP64SW+1)-2:1*(`FP64SW+1)]) : 1'b0;
assign zero_a1 =
    (format == `FP16) ?
        (~|exps[3*(`FP16EW)-1:2*(`FP16EW)]) &
        (~|fracs[3*(`FP16SW+1)-2:2*(`FP16SW+1)]) :
    (format == `FP32) ?
        (~|exps[3*(`FP32EW)-1:2*(`FP32EW)]) &
        (~|fracs[3*(`FP32SW+1)-2:2*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
```

```
assign zero_b1 =
    (format == `FP16) ?
        (~|exps[4*(`FP16EW)-1:3*(`FP16EW)]) &
        (~|fracs[4*(`FP16SW+1)-2:3*(`FP16SW+1)]) :
    (format == `FP32) ?
        (~|exps[4*(`FP32EW)-1:3*(`FP32EW)]) &
        (~|fracs[4*(`FP32SW+1)-2:3*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
// Assign integer inputs.
assign int_a0 =
    (format == `FP16) ?
        (|exps[1*(`FP16EW)-1:0*(`FP16EW)]) &
        (~|fracs[1*(`FP16SW+1)-2:0*(`FP16SW+1)]) :
    (format == `FP32) ?
        (|exps[1*(`FP32EW)-1:0*(`FP32EW)]) &
        (~|fracs[1*(`FP32SW+1)-2:0*(`FP32SW+1)]) :
    (format == `FP64) ?
        (|exps[1*(`FP64EW)-1:0*(`FP64EW)]) &
        (~|fracs[1*(`FP64SW+1)-2:0*(`FP64SW+1)]) : 1'b0;
assign int_b0 =
    (format == `FP16) ?
        (|exps[2*(`FP16EW)-1:1*(`FP16EW)]) &
        (~|fracs[2*(`FP16SW+1)-2:1*(`FP16SW+1)]) :
    (format == `FP32) ?
        (|exps[2*(`FP32EW)-1:1*(`FP32EW)]) &
        (~|fracs[2*(`FP32SW+1)-2:1*(`FP32SW+1)]) :
    (format == `FP64) ?
        (|exps[2*(`FP64EW)-1:1*(`FP64EW)]) &
        (~|fracs[2*(`FP64SW+1)-2:1*(`FP64SW+1)]) : 1'b0;
assign int_a1 =
    (format == `FP16) ?
        (|exps[3*(`FP16EW)-1:2*(`FP16EW)]) &
        (~|fracs[3*(`FP16SW+1)-2:2*(`FP16SW+1)]) :
    (format == `FP32) ?
        (|exps[3*(`FP32EW)-1:2*(`FP32EW)]) &
        (~|fracs[3*(`FP32SW+1)-2:2*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
assign int_b1 =
    (format == `FP16) ?
        (|exps[4*(`FP16EW)-1:3*(`FP16EW)]) &
        (~|fracs[4*(`FP16SW+1)-2:3*(`FP16SW+1)]) :
    (format == `FP32) ?
        (|exps[4*(`FP32EW)-1:3*(`FP32EW)]) &
        (~|fracs[4*(`FP32SW+1)-2:3*(`FP32SW+1)]) :
    (format == `FP64) ? 1'b0 : 1'b0;
// Assign outputs.
assign infs[0] = inf_a0;
assign infs[1] = inf_b0;
assign infs[2] = inf_a1;
assign infs[3] = inf_b1;
assign ints[0] = int_a0;
assign ints[1] = int_b0;
assign ints[2] = int_a1;
assign ints[3] = int_b1;
assign nans[0] = nan_a0;
```

assign nans[1] = nan_b0;
assign nans[2] = nan_a1;
assign nans[3] = nan_b1;
assign zeroes[0] = zero_a0;
assign zeroes[1] = zero_b0;
assign zeroes[2] = zero_a1;
assign zeroes[3] = zero_b1;
endmodule // chk_special

```
//
// File.......: exp_unit.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:40:17 CEST 2008
// Revision...: 1.0
// Description: Exponent adder unit.
//
`include "defines.v"
module exp_unit
(
    exps,   // Input from exponent bus.
    format, // Input from instruction register.
    sums    // Output to exponent bus.
);
    // input(s)
    input [`EXPBUS-1:0] exps;
    input [1:0] format;
    // output(s)
    output [`EXPBUS/2+3:0] sums;
    // wire(s)
    wire [`FP16EW-1:0] fp16_a_0;
    wire [`FP16EW-1:0] fp16_b_0;
    wire [`FP16EW-1:0] fp16_sum_0;
    wire fp16_ovf_ab_0;
    wire fp16_ovf_biased_0;
    wire [`FP16EW-1:0] fp16_a_1;
    wire [`FP16EW-1:0] fp16_b_1;
    wire [`FP16EW-1:0] fp16_sum_1;
    wire fp16_ovf_ab_1;
    wire fp16_ovf_biased_1;
    wire [`FP32EW-1:0] fp32_a_0;
    wire [`FP32EW-1:0] fp32_b_0;
    wire [`FP32EW-1:0] fp32_sum_0;
    wire fp32_ovf_ab_0;
    wire fp32_ovf_biased_0;
    wire [`FP32EW-1:0] fp32_a_1;
    wire [`FP32EW-1:0] fp32_b_1;
    wire [`FP32EW-1:0] fp32_sum_1;
    wire fp32_ovf_ab_1;
    wire fp32_ovf_biased_1;
    wire [`FP64EW-1:0] fp64_a_0;
    wire [`FP64EW-1:0] fp64_b_0;
    wire [`FP64EW-1:0] fp64_sum_0;
    wire fp64_ovf_ab_0;
    wire fp64_ovf_biased_0;
    // reg(s)
    // Module instantiation.
    exp_add #(`FP16EW, `FP16BIAS) fp16_add_0
    (
        .a (fp16_a_0),
        .b (fp16_b_0),
        .sum (fp16_sum_0),
        .ovf_ab (fp16_ovf_ab_0),
        .ovf_biased (fp16_ovf_biased_0)
    );
    exp_add #(`FP16EW, `FP16BIAS) fp16_add_1
    (
        .a (fp16_a_1),
        .b (fp16_b_1),
        .sum (fp16_sum_1),
        .ovf_ab (fp16_ovf_ab_1),
        .ovf_biased (fp16_ovf_biased_1)
    );
    exp_add #(`FP32EW, `FP32BIAS) fp32_add_0
    (
        .a (fp32_a_0),
        .b (fp32_b_0),
        .sum (fp32_sum_0),
        .ovf_ab (fp32_ovf_ab_0),
        .ovf_biased (fp32_ovf_biased_0)
    );
    exp_add #(`FP32EW, `FP32BIAS) fp32_add_1
    (
        .a (fp32_a_1),
        .b (fp32_b_1),
        .sum (fp32_sum_1),
        .ovf_ab (fp32_ovf_ab_1),
        .ovf_biased (fp32_ovf_biased_1)
    );
    exp_add #(`FP64EW, `FP64BIAS) fp64_add_0
    (
        .a (fp64_a_0),
        .b (fp64_b_0),
        .sum (fp64_sum_0),
        .ovf_ab (fp64_ovf_ab_0),
        .ovf_biased (fp64_ovf_biased_0)
    );
    // Combinational assign.
    // Input demux.
    assign fp16_a_0 =
        (format == `FP16) ? exps[1*`FP16EW-1:0*`FP16EW] : 0;
    assign fp16_b_0 =
        (format == `FP16) ? exps[2*`FP16EW-1:1*`FP16EW] : 0;
    assign fp16_a_1 =
        (format == `FP16) ? exps[3*`FP16EW-1:2*`FP16EW] : 0;
    assign fp16_b_1 =
        (format == `FP16) ? exps[4*`FP16EW-1:3*`FP16EW] : 0;
    assign fp32_a_0 =
        (format == `FP32) ? exps[1*`FP32EW-1:0*`FP32EW] : 0;
    assign fp32_b_0 =
        (format == `FP32) ? exps[2*`FP32EW-1:1*`FP32EW] : 0;
    assign fp32_a_1 =
        (format == `FP32) ? exps[3*`FP32EW-1:2*`FP32EW] : 0;
    assign fp32_b_1 =
        (format == `FP32) ? exps[4*`FP32EW-1:3*`FP32EW] : 0;
    assign fp64_a_0 =
        (format == `FP64) ? exps[1*`FP64EW-1:0*`FP64EW] : 0;
    // ...
endmodule // exp_unit
```

```
// File.......: exp_add.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 10:44:27 CEST 2008
// Revision...: 1.0
// Description: Exponent adder. Adds the two inputs, and subtracts
//              the bias.
//
`include "defines.v"
module exp_add
(
    a,         // Input operand.
    b,         // Input operand.
    sum,       // Output sum.
    ovf_ab,    // Overflow after addition.
    ovf_biased // Overflow after subtraction.
);
    parameter WIDTH = 1;
    parameter BIAS = `FP32BIAS;
    // input(s)
    input [WIDTH-1:0] a;
    input [WIDTH-1:0] b;
    // output(s)
    output [WIDTH-1:0] sum;
    output ovf_ab;
    output ovf_biased;
    // wire(s)
    wire [WIDTH:0] a_plus_b_tmp;
    wire [WIDTH:0] biased_tmp;
    assign a_plus_b_tmp = a + b;
    assign biased_tmp = a_plus_b_tmp - BIAS;
    assign sum = biased_tmp[WIDTH-1:0];
    assign ovf_ab = a_plus_b_tmp[WIDTH];
    assign ovf_biased = biased_tmp[WIDTH];
endmodule // exp_add
```
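The two overflow flags of the exponent adder are easiest to read from concrete operands. The following self-checking testbench is a minimal sketch, assuming FP16 parameters (width 5, bias 15). The names `exp_add_sketch` and `tb_exp_add_sketch` are hypothetical, and the adder body only repeats the arithmetic of the listing so that the block simulates on its own.

```verilog
// Sketch only: exp_add_sketch and tb_exp_add_sketch are hypothetical names.
module exp_add_sketch (a, b, sum, ovf_ab, ovf_biased);
    parameter WIDTH = 5;   // assumed: FP16 exponent width
    parameter BIAS  = 15;  // assumed: FP16 bias
    input  [WIDTH-1:0] a;
    input  [WIDTH-1:0] b;
    output [WIDTH-1:0] sum;
    output             ovf_ab;
    output             ovf_biased;

    wire [WIDTH:0] a_plus_b_tmp;   // one extra bit keeps the carry
    wire [WIDTH:0] biased_tmp;     // one extra bit keeps the borrow

    assign a_plus_b_tmp = a + b;
    assign biased_tmp   = a_plus_b_tmp - BIAS;
    assign sum          = biased_tmp[WIDTH-1:0];
    assign ovf_ab       = a_plus_b_tmp[WIDTH];
    assign ovf_biased   = biased_tmp[WIDTH];
endmodule

module tb_exp_add_sketch;
    reg  [4:0] a, b;
    wire [4:0] sum;
    wire       ovf_ab, ovf_biased;

    exp_add_sketch #(5, 15) dut
        (.a(a), .b(b), .sum(sum), .ovf_ab(ovf_ab), .ovf_biased(ovf_biased));

    initial begin
        // 15 + 16 - 15 = 16: in range, both flags clear.
        a = 5'd15; b = 5'd16; #1;
        $display("in range : sum=%0d ovf_ab=%b ovf_biased=%b", sum, ovf_ab, ovf_biased);
        // 30 + 30 - 15 = 45: too large for 5 bits, both flags set.
        a = 5'd30; b = 5'd30; #1;
        $display("too large: sum=%0d ovf_ab=%b ovf_biased=%b", sum, ovf_ab, ovf_biased);
        // 3 + 4 - 15 = -8: negative, only ovf_biased set.
        a = 5'd3; b = 5'd4; #1;
        $display("too small: sum=%0d ovf_ab=%b ovf_biased=%b", sum, ovf_ab, ovf_biased);
        $finish;
    end
endmodule
```

In the rounding unit `rne.v`, the combination `ovf_ab & ovf_biased` feeds the overflow logic and `~ovf_ab & ovf_biased` feeds the underflow logic, which matches the second and third cases above.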

```
// File.......: mult_unit.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:37:56 CEST 2008
// Revision...: 1.0
// Description: Significand multiplier unit.
//
`include "defines.v"
module mult_unit
(
    fracs,  // Input from significand bus.
    format, // Input from instruction register.
    prods   // Output to significand bus.
);
    // input(s)
    input [`FRACBUS-1:0] fracs;
    input [1:0] format;
    // output(s)
    output [`FRACBUS-1:0] prods;
    // wire(s)
    wire [`FP16SW:0] fp16_a_0;
    wire [`FP16SW:0] fp16_b_0;
    wire [2*`FP16SW+1:0] fp16_p_0;
    wire [`FP16SW:0] fp16_a_1;
    wire [`FP16SW:0] fp16_b_1;
    wire [2*`FP16SW+1:0] fp16_p_1;
    wire [`FP32SW:0] fp32_a_0;
    wire [`FP32SW:0] fp32_b_0;
    wire [2*`FP32SW+1:0] fp32_p_0;
    wire [`FP32SW:0] fp32_a_1;
    wire [`FP32SW:0] fp32_b_1;
    wire [2*`FP32SW+1:0] fp32_p_1;
    wire [`FP64SW:0] fp64_a_0;
    wire [`FP64SW:0] fp64_b_0;
    wire [2*`FP64SW+1:0] fp64_p_0;
    // reg(s)
    // Module instantiations.
    uns_mult #(`FP16SW+1) uns_mult_fp16_0
    (
        .a(fp16_a_0),
        .b(fp16_b_0),
        .p(fp16_p_0)
    );
    uns_mult #(`FP16SW+1) uns_mult_fp16_1
    (
        .a(fp16_a_1),
        .b(fp16_b_1),
        .p(fp16_p_1)
    );
    uns_mult #(`FP32SW+1) uns_mult_fp32_0
    (
        .a(fp32_a_0),
        .b(fp32_b_0),
        .p(fp32_p_0)
    );
    uns_mult #(`FP32SW+1) uns_mult_fp32_1
    (
        .a(fp32_a_1),
        .b(fp32_b_1),
        .p(fp32_p_1)
    );
    uns_mult #(`FP64SW+1) uns_mult_fp64_0
    (
        .a(fp64_a_0),
        .b(fp64_b_0),
        .p(fp64_p_0)
    );
    // Combinational assigns.
    // Input demux.
    assign fp16_a_0 =
        (format == `FP16) ? fracs[1*(`FP16SW+1)-1:0*(`FP16SW+1)] : 0;
    assign fp16_b_0 =
        (format == `FP16) ? fracs[2*(`FP16SW+1)-1:1*(`FP16SW+1)] : 0;
    assign fp16_a_1 =
        (format == `FP16) ? fracs[3*(`FP16SW+1)-1:2*(`FP16SW+1)] : 0;
    assign fp16_b_1 =
        (format == `FP16) ? fracs[4*(`FP16SW+1)-1:3*(`FP16SW+1)] : 0;
    assign fp32_a_0 =
        (format == `FP32) ? fracs[1*(`FP32SW+1)-1:0*(`FP32SW+1)] : 0;
    assign fp32_b_0 =
        (format == `FP32) ? fracs[2*(`FP32SW+1)-1:1*(`FP32SW+1)] : 0;
    assign fp32_a_1 =
        (format == `FP32) ? fracs[3*(`FP32SW+1)-1:2*(`FP32SW+1)] : 0;
    assign fp32_b_1 =
        (format == `FP32) ? fracs[4*(`FP32SW+1)-1:3*(`FP32SW+1)] : 0;
    assign fp64_a_0 =
        (format == `FP64) ? fracs[1*(`FP64SW+1)-1:0*(`FP64SW+1)] : 0;
    assign fp64_b_0 =
        (format == `FP64) ? fracs[2*(`FP64SW+1)-1:1*(`FP64SW+1)] : 0;
    // Output mux.
    assign prods =
        (format == `FP16) ? {fp16_p_1, fp16_p_0} :
        (format == `FP32) ? {fp32_p_1, fp32_p_0} :
        (format == `FP64) ? fp64_p_0 : 0;
endmodule // mult_unit
```

```
//
// File.......: uns_mult.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 10:40:36 CEST 2008
// Revision...: 1.0
// Description: Unsigned multiplier used for significand
//              multiplication.
//
`include "defines.v"
module uns_mult
(
    a, // Input, multiplicand.
    b, // Input, multiplier.
    p  // Output, product.
);
    parameter WIDTH = `FP64SW+1;
    // input(s)
    input [WIDTH-1:0] a;
    input [WIDTH-1:0] b;
    // output(s)
    output [2*WIDTH-1:0] p;
    assign p = a * b;
endmodule // uns_mult
```
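Because both significands carry the hidden bit and therefore lie in [1, 2), their product lies in [1, 4), and the top bit of the double-width product tells whether it has reached [2, 4). The testbench below is a minimal sketch of that property, assuming the FP16 fraction width of 10 bits. The module name `tb_sig_mult_sketch` is hypothetical, and the multiplication is written inline instead of instantiating `uns_mult`.

```verilog
// Sketch only: tb_sig_mult_sketch is a hypothetical name.
module tb_sig_mult_sketch;
    parameter SW = 10;              // assumed: FP16 fraction width

    reg  [SW:0]     a, b;           // 1.f significands, hidden bit at [SW]
    wire [2*SW+1:0] p;              // double-width product

    assign p = a * b;               // same operation as uns_mult

    initial begin
        // 1.0 * 1.0 = 1.0: product stays in [1,2), top bit clear.
        a = 11'b10000000000; b = 11'b10000000000; #1;
        $display("1.0*1.0: p=%b top=%b", p, p[2*SW+1]);
        // 1.5 * 1.5 = 2.25: product is in [2,4), top bit set.
        a = 11'b11000000000; b = 11'b11000000000; #1;
        $display("1.5*1.5: p=%b top=%b", p, p[2*SW+1]);
        $finish;
    end
endmodule
```

The rounding unit reads the same bit as `frac[2*SW+1]` to decide whether the result must be normalized.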

```
// File.......: sign_unit.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:42:25 CEST 2008
// Revision...: 1.0
// Description: Sign computation unit.
//
module sign_unit
(
    signs,     // Input signs from sign bus.
    signs_comp // Output to sign bus.
);
```

```
// File.......: rne_unit.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 12:19:50 CEST 2008
// Revision...: 1.0
// Description: Rounding, normalizing and exception unit.
//
`include "defines.v"
module rne_unit
(
    fracs,   // Input from fraction bus.
    exps,    // Input from exponent bus.
    signs,   // Input from sign bus.
    format,  // Input from instruction register.
    special, // Input from check special.
    mode,    // Input from mode register.
    exceps,  // Output exceptions.
    result   // Output. Rounded result.
);
    // input(s)
    input [`FRACBUS-1:0] fracs;
    input [`EXPBUS/2+3:0] exps;
    input [`SIGNBUS/2-1:0] signs;
    input [1:0] format;
    input [1:0] mode;
    input [15:0] special;
    // output(s)
    output [`BUS/2-1:0] result;
    output [7:0] exceps;
    // wire(s)
    wire [2*`FP64SW + 1:0
    wire
    wire ['FP64EW+1:0]
    wire [7:0]
    wire ['FP64SW+'FP64EW :0]
    wire [3:0]
    wire [2*'FP64SW + 1:0]
    wire
    wire ['FP64EW+1:0]
    wire [7:0]
    wire ['FP64SW+'FP64EW :0]
    wire [3:0]
    wire [1:0]
    wire [1:0]
    wire [1:0]
    wire [1:0]
    wire [1:0]
    wire [1:0] fp32_underflow;
    wire [1:0] fp32_inexact;
    wire [1:0] fp32_invalid;
    wire [1:0] fp64_overflow;
    wire [1:0] fp64_underflow;
    wire [1:0] fp64_inexact;
    wire [1:0] fp64_invalid;
    wire [3:0] exceps_fp16_rne_0;
    wire [3:0] exceps_fp16_rne_1;
    wire [3:0] exceps_fp32_rne_0;
```



```
rne #(`FP32SW, `FP32EW) fp32_rne_1
(
    .frac (frac_fp32_rne_1),
    .sign (sign_rne_1),
    .exp (exp_fp32_rne_1),
    .specials (specials_rne_1),
    .mode (mode),
    .result (result_fp32_rne_1),
    .exceps (exceps_fp32_rne_1)
);
rne #(`FP64SW, `FP64EW) fp64_rne_0
(
    .frac (frac_fp64_rne_0),
    .sign (sign_rne_0),
    .exp (exp_fp64_rne_0),
    .specials (specials_rne_0),
    .mode (mode),
    .result (result_fp64_rne_0),
    .exceps (exceps_fp64_rne_0)
);
// Combinational assign.
// Inputs to rounding logic.
assign frac_fp16_rne_0 = fracs[2*(`FP16SW+1)-1:0*(`FP16SW+1)];
assign frac_fp16_rne_1 = fracs[4*(`FP16SW+1)-1:2*(`FP16SW+1)];
assign frac_fp32_rne_0 = fracs[2*(`FP32SW+1)-1:0*(`FP32SW+1)];
assign frac_fp32_rne_1 = fracs[4*(`FP32SW+1)-1:2*(`FP32SW+1)];
assign frac_fp64_rne_0 = fracs[2*(`FP64SW+1)-1:0*(`FP64SW+1)];
// The two MSBs represent the overflow bits from the exponent
// addition.
assign exp_fp16_rne_0 = exps[1*(`FP16EW+1):0*(`FP16EW+2)];
assign exp_fp16_rne_1 = exps[2*(`FP16EW+1)+1:1*(`FP16EW+2)];
assign exp_fp32_rne_0 = exps[1*(`FP32EW+1):0*(`FP32EW+2)];
assign exp_fp32_rne_1 = exps[2*(`FP32EW+1)+1:1*(`FP32EW+2)];
assign exp_fp64_rne_0 = exps[1*(`FP64EW+1):0*(`FP64EW+2)];
assign sign_rne_0 = signs[0];
assign sign_rne_1 = signs[1];
assign specials_rne_0 =
    {special[13], special[12], special[9], special[8],
    special[5], special[4], special[1], special[0]};
assign specials_rne_1 =
    {special[15], special[14], special[11], special[10],
    special[7], special[6], special[3], special[2]};
// Outputs from rounding logic.
assign result_fp16_rne = {result_fp16_rne_1, result_fp16_rne_0};
assign result_fp32_rne = {result_fp32_rne_1, result_fp32_rne_0};
assign result_fp64_rne = result_fp64_rne_0;
assign fp16_underflow =
    {exceps_fp16_rne_1[3], exceps_fp16_rne_0[3]};
assign fp16_overflow =
    {exceps_fp16_rne_1[2], exceps_fp16_rne_0[2]};
assign fp16_inexact =
    {exceps_fp16_rne_1[1], exceps_fp16_rne_0[1]};
assign fp16_invalid =
    {exceps_fp16_rne_1[0], exceps_fp16_rne_0[0]};
assign fp32_underflow =
    {exceps_fp32_rne_1[3], exceps_fp32_rne_0[3]};
assign fp32_overflow =
    {exceps_fp32_rne_1[2], exceps_fp32_rne_0[2]};
assign fp32_inexact =
    {exceps_fp32_rne_1[1], exceps_fp32_rne_0[1]};
assign fp32_invalid =
    {exceps_fp32_rne_1[0], exceps_fp32_rne_0[0]};
assign fp64_underflow =
    {1'b0, exceps_fp64_rne_0[3]};
assign fp64_overflow =
    {1'b0, exceps_fp64_rne_0[2]};
assign fp64_inexact =
    {1'b0, exceps_fp64_rne_0[1]};
assign fp64_invalid =
    {1'b0, exceps_fp64_rne_0[0]};
assign fp16_exceps =
    {fp16_underflow, fp16_overflow, fp16_inexact, fp16_invalid};
assign fp32_exceps =
    {fp32_underflow, fp32_overflow, fp32_inexact, fp32_invalid};
assign fp64_exceps =
    {fp64_underflow, fp64_overflow, fp64_inexact, fp64_invalid};
assign exceps =
    (format == `FP16) ? fp16_exceps :
    (format == `FP32) ? fp32_exceps :
    (format == `FP64) ? fp64_exceps : 0;
assign result =
    (format == `FP16) ? result_fp16_rne :
    (format == `FP32) ? result_fp32_rne :
    (format == `FP64) ? result_fp64_rne : 0;
endmodule // rne_unit
```

```
// File.......: rne.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:10:54 CEST 2008
// Revision...: 1.0
// Description: Rounding and exception unit. Rounds, normalizes and
//              generates exceptions if needed.
//
`include "defines.v"
module rne
(
frac, // Input. Fractional part from significand multiplication.
sign, // Input. Sign from sign computation.
exp, // Input. Biased exponent from exponent addition.
specials, // Input. NaNs, infinities, zeros.
mode, // Input. Rounding mode.
result, // Output. Rounded result or special value.
exceps // Output. Exceptions.
);
parameter SW = 52;
parameter EW = 11;
// input(s)
input [2*SW+1:0] frac;
input [EW+1:0] exp;
input sign;
input [7:0] specials;
input [1:0] mode;
// output(s)
output [SW+EW:0] result;
output [3:0] exceps;
// wire(s)
wire normalize;
wire postnormalize;
wire lsb;
wire round;
wire sticky;
wire roundup;
wire rounded;
wire ovf_ab;
wire ovf_biased;
wire ovf_postnorm;
wire round_to_nearest_even;
wire round_to_infinity;
wire round_to_zero;
wire nan_a;
wire nan_b;
wire int_a;
wire int_b;
wire inf_a;
wire inf_b;
wire zero_a;
wire zero_b;
wire int_times_inf;
wire invalid;
wire overflow;
wire overflow_tmp;
wire underflow;
wire underflow_tmp;
wire inexact;
wire exp_zero;
wire [SW:0] significand;
wire [SW:0] significand_tmp;
wire [SW:0] significand_plus_ulp;
wire [EW:0] exponent;
wire [SW+EW:0] result_tmp;
wire [SW+EW:0] product_nan;
wire [SW+EW:0] product_zero;
wire [SW+EW:0] product_large;
wire [SW+EW:0] product_overflow;
// reg(s)
// Round and normalize / Postnormalize.
// Normalize if result from multiplier lies in [2,4)
assign normalize = frac[2*SW+1];
assign significand_tmp=
    normalize ?
    frac[2*SW:SW] >> 1: frac [2*SW:SW];
assign exponent =
    normalize ?
    exp[EW-1:0] + 1 : exp[EW-1:0];
// Assign rounding bits.
assign lsb =
    normalize ?
    frac [SW+1] :
    frac [SW];
assign round =
    normalize ?
    frac [SW] :
    frac[SW-1];
assign sticky=
    normalize ?
    | frac [SW-1:0] :
    | frac [SW-2:0];
// Reduce to three rounding modes.
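assign round_to_nearest_even =
    // The tail of this expression is cut off in the source; the
    // usual tie-to-even term and mode == `EVEN are assumed here.
    (round & (lsb | sticky)) & ~|mode;
assign round_to_infinity =
    (!sign&(!mode[1]&mode[0]) | sign&(mode[1]&!mode[0])) &
    (round|sticky);
assign round_to_zero =
    (sign&(~mode[1]&mode[0]) | ~sign&(mode[1]&~mode[0])) | &mode;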
// Round-up if necessary.
assign significand_plus_ulp = significand_tmp + 1'b1;
assign roundup = round_to_infinity | round_to_nearest_even;
assign significand =
    roundup ?
        significand_plus_ulp : significand_tmp;
// Post-normalize if result after rounding lies in [2,4).
assign postnormalize = !significand[SW] & significand_tmp[SW];
assign result_tmp =
    postnormalize ?
    {sign, exponent[EW-1:0] + 1'b1, significand[SW:1]} :
    {sign, exponent[EW-1:0], significand[SW-1:0]};
// Inexact if result was rounded.
assign rounded = round | sticky;
assign ovf_postnorm =
    exponent[EW] | &exponent[EW-1:0]&(normalize|postnormalize);
// Generate exceptions.
assign ovf_ab = exp[EW+1];
assign ovf_biased = exp[EW];
// Invalid inputs from chk_special.
assign nan_a = specials[0];
assign nan_b = specials[1];
assign inf_a = specials[2];
assign inf_b = specials[3];
assign zero_a = specials[4];
assign zero_b = specials[5];
assign int_a = specials[6];
assign int_b = specials[7];
// Generate exceptions.
assign int_times_inf = (int_a&inf_b)|(int_b&inf_a);
assign invalid =
    (nan_a | nan_b)|
    (zero_a&inf_b | zero_b&inf_a)|
    (inf_a | inf_b)&!int_times_inf;
assign inexact =
    (rounded & (!invalid)|
    overflow_tmp|
    round_to_zero&overflow_tmp|
    underflow&(!(zero_a|zero_b)))&!int_times_inf;
assign underflow =
    (~ovf_ab&ovf_biased)|
    (~|result_tmp[SW+EW-1:SW])&!(ovf_ab&ovf_biased|ovf_postnorm)&
    !overflow&!invalid | (zero_a|zero_b)&!(nan_a|nan_b|inf_a|inf_b);
// If overflow occurs and the rounding mode is round-to-zero, the
// result shall be rounded to the largest representable number,
// e.g. 0111101111111111.
assign overflow_tmp =
    ((ovf_ab&ovf_biased | ovf_postnorm&!underflow)|
    &result_tmp[SW+EW-1:SW]&!underflow)&!invalid;
assign overflow = overflow_tmp&!round_to_zero | int_times_inf;
// Compute special results.
assign product_nan =
    {1'b0, {EW{1'b1}}, {(SW-1){1'b0}}, 1'b1};
assign product_zero =
    {result_tmp[SW+EW], {(SW+EW){1'b0}}};
assign product_overflow =
    {result_tmp[SW+EW], {EW{1'b1}}, {(SW){1'b0}}};
assign product_large =
    {result_tmp[SW+EW], {(EW-1){1'b1}}, 1'b0, {(SW){1'b1}}};
// Final product decided by exceptions.
assign result =
    invalid ? product_nan :
    overflow ? product_overflow :
    underflow ? product_zero :
    round_to_zero & overflow_tmp & !int_times_inf ? product_large :
    result_tmp;
assign exceps[0] = invalid;
assign exceps[1] = inexact;
assign exceps[2] = overflow;
assign exceps[3] = underflow;
endmodule // rne
```
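The role of the `lsb`, `round` and `sticky` bits in round-to-nearest-even can be checked on a small fixed-point value. The sketch below assumes the textbook rule `round & (lsb | sticky)` and a 4-bit kept field followed by three discarded bits. The names `rne_round_sketch` and `tb_rne_round_sketch` are hypothetical, and the block is not taken from the design.

```verilog
// Sketch only: rne_round_sketch and tb_rne_round_sketch are hypothetical names.
module rne_round_sketch (x, y);
    input  [6:0] x;   // kept bits x[6:3], round bit x[2], sticky bits x[1:0]
    output [4:0] y;   // rounded value, one bit wider to hold a carry

    wire lsb     = x[3];
    wire round   = x[2];
    wire sticky  = |x[1:0];
    wire roundup = round & (lsb | sticky);   // ties go to the even neighbour

    assign y = {1'b0, x[6:3]} + roundup;
endmodule

module tb_rne_round_sketch;
    reg  [6:0] x;
    wire [4:0] y;

    rne_round_sketch dut (.x(x), .y(y));

    initial begin
        x = 7'b0100_011; #1 $display("4.375 -> %0d", y);  // below half: 4
        x = 7'b0100_100; #1 $display("4.5   -> %0d", y);  // tie, even lsb: 4
        x = 7'b0101_100; #1 $display("5.5   -> %0d", y);  // tie, odd lsb: 6
        x = 7'b0100_101; #1 $display("4.625 -> %0d", y);  // above half: 5
        x = 7'b1111_100; #1 $display("15.5  -> %0d", y);  // carry out: 16
        $finish;
    end
endmodule
```

The last case produces a carry out of the kept field, which is the situation the `postnormalize` signal in `rne.v` handles.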

```
// File.......: sel_input.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 10:54:28 CEST 2008
// Revision...: 1.0
// Description: Selects data from the input registers and puts it on
//
//
`include "defines.v"
module sel_input
(
    drh,    // Input from data-register high (DRHO).
    drl,    // Input from data-register low (DRLO).
    format, // Input from instruction (format) register.
    signs,  // Output to sign bus.
    exps,   // Output to exponent bus.
    fracs   // Output to significand bus.
);
    parameter WIDTH = `BUS/2;
    parameter SIGNBUS = `SIGNBUS;
    parameter EXPBUS = `EXPBUS;
    parameter FRACBUS = `FRACBUS;
    // input(s)
    input [WIDTH-1:0] drh;
    input [WIDTH-1:0] drl;
    input [1:0] format;
    // output(s)
    output [SIGNBUS-1:0] signs;
    output [EXPBUS-1:0] exps;
    output [FRACBUS-1:0] fracs;
    // wire(s)
    // reg(s)
    reg [SIGNBUS-1:0] signs_tmp;
    reg [EXPBUS-1:0] exps_tmp;
    reg [FRACBUS-1:0] fracs_tmp;
    // Combinational logic.
    always @ (drh or drl or format) begin
        signs_tmp = 0;
        exps_tmp = 0;
        fracs_tmp = 0;
        case (format)
            `FP16: begin
                signs_tmp =
                    {drl[4*`FP16W-1], drl[3*`FP16W-1],
                    drl[2*`FP16W-1], drl[1*`FP16W-1]};
                exps_tmp =
                    {drl[4*(`FP16W)-2:3*`FP16W+`FP16SW],
                    drl[3*(`FP16W)-2:2*`FP16W+`FP16SW],
    // ...
endmodule // sel_input
```

```
//
// File.......: sel_output.v
// Author.....: Espen Stenersen
// Date.......: Thu Apr 24 23:42:46 CEST 2008
// Revision...: 1.0
// Description: Loads the correct locations in output register and
//
//
`include "defines.v"
module sel_output
(
    result,       // Input from rounding logic.
    exceps,       // Input from rounding logic.
    format,       // Input from format register.
    start,        // Input from start register.
    products,     // Output to output register.
    load_drh,     // Output to output register.
    load_drlh,    // Output to output register.
    load_drll,    // Output to output register.
    exceptions,   // Output to exception register.
    load_excep_l, // Output to exception register.
    load_excep_h, // Output to exception register.
    reset,
    clk
);
    // input(s)
    input [`BUS/2-1:0] result;
    input [7:0] exceps;
    input [1:0] format;
    input start;
    input clk;
    input reset;
    // output(s)
    output [`BUS-1:0] products;
    output [15:0] exceptions;
    output load_drh;
    output load_drlh;
    output load_drll;
    output load_excep_l;
    output load_excep_h;
    // wire(s)
    // reg(s)
    reg [`BUS-1:0] products;
    reg [15:0] exceptions;
    reg load_drh;
    reg load_drlh;
    reg load_drll;
    reg load_excep_l;
    reg load_excep_h;
    reg counter;
    always @ (posedge clk) begin
        if (reset) begin
            counter <= 0;
        end
        else begin
            if (start) begin
                counter <= counter + 1;
            end
            else begin
                counter <= 0;
            end
        end
    end
    always @ (result or exceps or format or counter or start) begin
        products = 0;
        exceptions = 0;
        load_drh = 0;
        load_drlh = 0;
        load_drll = 0;
        load_excep_l = 0;
        load_excep_h = 0;
        case (format)
            `FP16: begin
                case (counter)
                    0: begin
                        products[31:0] = result[31:0];
                        exceptions[7:0] = exceps;
                        load_drll = 1;
                        load_excep_l = 1;
                    end
                    1: begin
                        products[63:32] = result[31:0];
                        exceptions[15:8] = exceps;
                        load_drlh = 1;
                        load_excep_h = 1;
                    end
                endcase
            end
            `FP32: begin
                case (counter)
                    0: begin
                        products[63:0] = result[63:0];
                        exceptions[7:0] = exceps;
                        load_drll = 1;
                        load_drlh = 1;
                        load_excep_l = 1;
                    end
                    1: begin
                        products[127:64] = result[63:0];
                        exceptions[15:8] = exceps;
                        load_drh = 1;
                        load_excep_h = 1;
                    end
                endcase
            end
            `FP64: begin
                case (counter)
                    0: begin
                        products[63:0] = result[63:0];
                        exceptions[7:0] = exceps;
                        load_drll = 1;
                        load_drlh = 1;
                        // ...
```

```
// File.......: reg_enable.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 10:31:28 CEST 2008
// Revision...: 1.0
// Description: Generic register with synchronous reset and enable.
//
`include "defines.v"
module reg_enable
(
    d, // Data in.
    q, // Data out.
    enable, // Enable bit.
    clk,
    reset
);
    parameter WIDTH = `BUS;
    // input(s)
    input [WIDTH-1:0] d;
    input enable;
    input clk;
    input reset;
    // output(s)
    output [WIDTH-1:0] q;
    // wire(s)
    // reg(s)
    reg [WIDTH-1:0] q;
    always@ (posedge clk) begin
        if (reset) begin
            q<= 0;
        end
        else if (enable) begin
            q<= d;
        end
    end
endmodule // reg_enable
```

```
//
// File.......: reg_set.v
// Author.....: Espen Stenersen
// Date.......: Thu Apr 24 23:07:57 CEST 2008
// Revision...: 1.0
// Description: Generic register with synchronous set.
//
module reg_set
(
    set, // Input.
    q, // Output.
    clk,
    reset
);
    parameter WIDTH = 1;
    // input(s)
    input set;
    input clk;
    input reset;
    // output(s)
    output q;
    // wire(s)
    // reg(s)
    reg q;
    always@ (posedge clk) begin
        if (reset) begin
        q<= 0;
        end
        else if (set) begin
            q<= 1;
        end
        else begin
            q<= 0;
        end
    end
endmodule // reg_set
```

```
// File.......: ff.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:45:16 CEST 2008
// Revision...: 1.0
// Description: Generic register with synchronous reset.
//
module ff
(
    d, // Data in.
    q, // Data out.
    clk,
    reset
);
    parameter WIDTH = 1;
    // input(s)
    input [WIDTH-1:0] d;
    input clk;
    input reset;
    // output(s)
    output [WIDTH-1:0] q;
    // reg(s)
    reg [WIDTH-1:0] q;
    always @ (posedge clk) begin
        if (reset)
            q <= 0;
        else
            q <= d;
    end
endmodule // ff
```


## Appendix B

## Architecture Two Verilog Sources

Only the sources that differ between the two architectures are included in this appendix: the exponent unit building blocks, the multiplier unit building blocks, and the rounding and exception unit building blocks.

```
// File.......: defines.v
// Author.....: Espen Stenersen
// Date.......: Wed May 14 11:45:28 CEST 2008
// Revision...: 1.0
// Description: Contains definitions used in the design files.
//              Operand widths, exponent widths, significand widths,
//              bias values and bus widths.
`define FP16 0
`define FP32 1
`define FP64 2
`define FP16W 16
`define FP32W 32
`define FP64W 64
`define FP16SW 10
`define FP32SW 23
`define FP64SW 52
`define FP16EW 5
`define FP32EW 8
`define FP64EW 11
`define FP16BIAS 15
`define FP32BIAS 127
`define FP64BIAS 1023
`define FRACBUS 2*(`FP64SW+1)
`define FRACBUSOUT 154
`define EXPBUS 4*`FP32EW
`define EXPBUSOUT 20
`define SIGNBUS 4
`define BUS 128
`define EVEN 0
// PINF and NINF are also defined here; their values are illegible in the source.
```
```
// File.......: exp_unit.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:40:17 CEST 2008
// Revision...: 1.0
// Description: Exponent adder unit.
//
`include "defines.v"
module exp_unit
(
    exps, // Input from exponent bus.
    format, // Input from instruction register.
    sums // Output to exponent bus.
);
    // input(s)
    input [`EXPBUS-1:0] exps;
    input [1:0] format;
    // output(s)
    output [`EXPBUSOUT-1:0] sums;
    // wire(s)
    wire [`FP32EW-1:0] fp32_a_0;
    wire [`FP32EW-1:0] fp32_b_0;
    wire [`FP32EW-1:0] fp32_sum_0;
    wire fp32_ovf_ab_0;
    wire fp32_ovf_biased_0;
    wire [`FP64EW-1:0] fp64_a_0;
    wire [`FP64EW-1:0] fp64_b_0;
    wire [`FP64EW-1:0] fp64_sum_0;
    wire fp64_ovf_ab_0;
    wire fp64_ovf_biased_0;
    // reg(s)
    // Module instantiation.
    exp_add8 #(`FP32EW) fp32_add_0
    (
        .a (fp32_a_0),
        .b (fp32_b_0),
        .sum (fp32_sum_0),
        .format (format),
        .ovf_ab (fp32_ovf_ab_0),
        .ovf_biased (fp32_ovf_biased_0)
    );
    exp_add11 #(`FP64EW) fp64_add_0
    (
        .a (fp64_a_0),
        .b (fp64_b_0),
        .format (format),
        .sum (fp64_sum_0),
        .ovf_ab (fp64_ovf_ab_0),
        .ovf_biased (fp64_ovf_biased_0)
    );
```


```
// File.......: exp_add8.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 10:44:27 CEST 2008
// Revision...: 1.0
// Description: Exponent adder. Adds the two inputs, and subtracts
//              the bias.
//
`include "defines.v"
module exp_add8
(
    a,         // Input operand.
    b,         // Input operand.
    format,    // Input.
    sum,       // Output sum.
    ovf_ab,    // Overflow after addition.
    ovf_biased // Overflow after subtraction.
);
    parameter WIDTH = `FP32EW;
    // input(s)
    input [WIDTH-1:0] a;
    input [WIDTH-1:0] b;
    input [1:0] format;
    // output(s)
    output [WIDTH-1:0] sum;
    output ovf_ab;
    output ovf_biased;
    // wire(s)
    wire [WIDTH:0] a_plus_b_tmp;
    wire [WIDTH:0] biased_tmp;
    // Exponent1 + exponent2
    assign a_plus_b_tmp = a + b;
    // Subtract bias.
    assign biased_tmp =
        (format == `FP16) ? a_plus_b_tmp - `FP16BIAS :
        (format == `FP32) ? a_plus_b_tmp - `FP32BIAS : 0;
    // Select part of sum.
    assign sum =
        (format == `FP16) ? biased_tmp[`FP16EW-1:0] :
        (format == `FP32) ? biased_tmp[`FP32EW-1:0] : 0;
    // Compute overflow / underflow detection bits.
    assign ovf_ab =
        (format == `FP16) ? a_plus_b_tmp[`FP16EW] :
        (format == `FP32) ? a_plus_b_tmp[`FP32EW] : 0;
    // Compute overflow / underflow detection bits.
    assign ovf_biased =
        (format == `FP16) ? biased_tmp[`FP16EW] :
        (format == `FP32) ? biased_tmp[`FP32EW] : 0;
endmodule // exp_add8
```

```
// File.......: exp_add11.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 10:44:27 CEST 2008
// Revision...: 1.0
// Description: Exponent adder. Adds the two inputs, and subtracts
//              the bias.
//
`include "defines.v"
module exp_add11
(
    a,         // Input operand.
    b,         // Input operand.
    format,    // Input.
    sum,       // Output sum.
    ovf_ab,    // Overflow after addition.
    ovf_biased // Overflow after subtraction.
);
    parameter WIDTH = `FP64EW;
    // input(s)
    input [WIDTH-1:0] a;
    input [WIDTH-1:0] b;
    input [1:0] format;
    // output(s)
    output [WIDTH-1:0] sum;
    output ovf_ab;
    output ovf_biased;
    // wire(s)
    wire [WIDTH:0] a_plus_b_tmp;
    wire [WIDTH:0] biased_tmp;
    // Exponent1 + exponent2
    assign a_plus_b_tmp = a + b;
    // Subtract bias.
    assign biased_tmp =
        (format == `FP16) ? a_plus_b_tmp - `FP16BIAS :
        (format == `FP32) ? a_plus_b_tmp - `FP32BIAS :
        (format == `FP64) ? a_plus_b_tmp - `FP64BIAS : 0;
    // Select part of sum.
    assign sum =
        (format == `FP16) ? biased_tmp[`FP16EW-1:0] :
        (format == `FP32) ? biased_tmp[`FP32EW-1:0] :
        (format == `FP64) ? biased_tmp[`FP64EW-1:0] : 0;
    // Compute overflow / underflow detection bits.
    assign ovf_ab =
        (format == `FP16) ? a_plus_b_tmp[`FP16EW] :
        (format == `FP32) ? a_plus_b_tmp[`FP32EW] :
        (format == `FP64) ? a_plus_b_tmp[`FP64EW] : 0;
    // Compute overflow / underflow detection bits.
    assign ovf_biased =
        (format == `FP16) ? biased_tmp[`FP16EW] :
        (format == `FP32) ? biased_tmp[`FP32EW] :
        (format == `FP64) ? biased_tmp[`FP64EW] : 0;
endmodule // exp_add11
```

```
// File.......: mult_unit.v
// Date.......: Tue Apr 15 11:37:56 CEST 2008
// Revision...: 1.0
// Description: Significand multiplier unit.
//
`include "defines.v"
module mult_unit
(
    fracs,  // Input from significand bus.
    format, // Input from instruction register.
    prods   // Output to significand bus.
);
    // input(s)
    input [`FRACBUS-1:0] fracs;
    input [1:0] format;
    // output(s)
    output [`FRACBUSOUT-1:0] prods;
    // wire(s)
    wire [`FP32SW:0] fp32_a_0;
    wire [`FP32SW:0] fp32_b_0;
    wire [2*`FP32SW+1:0] fp32_p_0;
    wire [`FP64SW:0] fp64_a_0;
    wire [`FP64SW:0] fp64_b_0;
    wire [2*`FP64SW+1:0] fp64_p_0;
    // reg(s)
    // Module instantiations.
    uns_mult #(`FP32SW+1) uns_mult_fp32_0
    (
        .a(fp32_a_0),
        .b(fp32_b_0),
        .p(fp32_p_0)
    );
    uns_mult #(`FP64SW+1) uns_mult_fp64_0
    (
        .a(fp64_a_0),
        .b(fp64_b_0),
        .p(fp64_p_0)
    );
    // Combinational assigns.
    // Input demux.
    assign fp32_a_0 =
        (format == `FP16) ?
            {fracs[1*(`FP16SW+1)-1:0*(`FP16SW+1)],
            {(`FP32SW-`FP16SW){1'b0}}} :
        (format == `FP32) ?
            fracs[1*(`FP32SW+1)-1:0*(`FP32SW+1)] : 0;
    assign fp32_b_0 =
        (format == `FP16) ?
            {fracs[2*(`FP16SW+1)-1:1*(`FP16SW+1)],
            {(`FP32SW-`FP16SW){1'b0}}} :
        (format == `FP32) ?
            fracs[2*(`FP32SW+1)-1:1*(`FP32SW+1)] : 0;
    assign fp64_a_0 =
        (format == `FP16) ?
            {fracs[3*(`FP16SW+1)-1:2*(`FP16SW+1)],
            {(`FP64SW-`FP16SW){1'b0}}} :
        (format == `FP32) ?
            {fracs[3*(`FP32SW+1)-1:2*(`FP32SW+1)],
            {(`FP64SW-`FP32SW){1'b0}}} :
        (format == `FP64) ?
            fracs[1*(`FP64SW+1)-1:0*(`FP64SW+1)] : 0;
    assign fp64_b_0 =
        (format == `FP16) ?
            {fracs[4*(`FP16SW+1)-1:3*(`FP16SW+1)],
            {(`FP64SW-`FP16SW){1'b0}}} :
        (format == `FP32) ?
            {fracs[4*(`FP32SW+1)-1:3*(`FP32SW+1)],
            {(`FP64SW-`FP32SW){1'b0}}} :
        (format == `FP64) ?
            fracs[2*(`FP64SW+1)-1:1*(`FP64SW+1)] : 0;
    assign prods = {fp64_p_0, fp32_p_0};
endmodule // mult_unit
```

```
// File.......: uns_mult.v
// Date.......: Tue Apr 15 10:40:36 CEST 2008
// Revision...: 1.0
// Description: Unsigned multiplier used for significand
//
//
`include "defines.v"
module uns_mult
(
    a, // Input, multiplicand.
    b, // Input, multiplier.
    p // Output, product.
);
    parameter WIDTH = `FP64SW+1;
    // input(s)
    input [WIDTH-1:0] a;
    input [WIDTH-1:0] b;
    // output(s)
    output [2*WIDTH-1:0] p;
    assign p = a * b;
endmodule // uns_mult
```
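The input demux in `mult_unit` left-aligns a narrow significand in the wider multiplier by appending zeros, which is why the rounding units read the FP16 rounding bits at fixed high positions of the product. The C sketch below models the FP16 case on the FP32 multiplier under stated assumptions: the fraction widths 10 and 23 are the IEEE 754 values (the contents of `defines.v` are not shown here), and the function and type names are hypothetical. The bit positions 47, 37, 36, 35 and the sticky ranges are the ones used in the `rne32` listing.

```c
#include <assert.h>
#include <stdint.h>

enum { FP16_SW = 10, FP32_SW = 23 };  /* assumed IEEE 754 fraction widths */

typedef struct { int normalize, lsb, round, sticky; } round_bits;

/* a, b: 11-bit FP16 significands, hidden one included. */
static uint64_t fp16_product_on_fp32(uint32_t a, uint32_t b)
{
    unsigned pad = FP32_SW - FP16_SW;  /* 13 zero bits appended to each */
    assert(a < (1u << (FP16_SW + 1)) && b < (1u << (FP16_SW + 1)));
    return (uint64_t)(a << pad) * (uint64_t)(b << pad);  /* 48-bit product */
}

/* Pick the rounding bits at the positions used for FP16 in rne32. */
static round_bits fp16_round_bits(uint64_t p)
{
    round_bits r;
    r.normalize = (int)((p >> 47) & 1u);        /* product in [2,4) */
    if (r.normalize) {
        r.lsb    = (int)((p >> 37) & 1u);
        r.round  = (int)((p >> 36) & 1u);
        r.sticky = ((p >> 26) & 0x3FFu) != 0;   /* bits 35..26 */
    } else {
        r.lsb    = (int)((p >> 36) & 1u);
        r.round  = (int)((p >> 35) & 1u);
        r.sticky = ((p >> 25) & 0x3FFu) != 0;   /* bits 34..25 */
    }
    return r;
}
```

Because each operand carries 13 appended zeros, the low 26 bits of the product are always zero, so a sticky range of ten bits is enough for FP16 operands.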

```
// File.......: rne_unit.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 12:19:50 CEST 2008
// Revision...: 1.0
// Description: Rounding, normalizing and exception unit.
//
`include "defines.v"
module rne_unit
(
    fracs,   // Input from fraction bus.
    exps,    // Input from exponent bus.
    signs,   // Input from sign bus.
    format,  // Input from instruction register.
    special, // Input from check special.
    mode,    // Input from mode register.
    exceps,  // Output exceptions.
    result   // Output. Rounded result.
);
    // input(s)
    input [`FRACBUSOUT-1:0] fracs;
    input [`EXPBUSOUT-1:0] exps;
    input [`SIGNBUS/2-1:0] signs;
    input [1:0] format;
    input [1:0] mode;
    input [15:0] special;
    // output(s)
    output [`BUS/2-1:0] result;
    output [7:0] exceps;
    // wire(s)
    wire sign_rne_0;
    wire sign_rne_1;
    wire [2*`FP32SW+1:0] frac_rne_0;
    wire [2*`FP64SW+1:0] frac_rne_1;
    wire [`FP32EW+1:0] exp_rne_0;
    wire [`FP64EW+1:0] exp_rne_1;
    wire [7:0] specials_rne_0;
    wire [7:0] specials_rne_1;
    wire [`FP32SW+`FP32EW:0] result_rne_0;
    wire [`FP64SW+`FP64EW:0] result_rne_1;
    wire [3:0] exceps_rne_0;
    wire [3:0] exceps_rne_1;
    // reg(s)
    // Module instantiation.
    rne32 #(`FP32SW, `FP32EW) rne_0
    (
    .frac (frac_rne_0),
    .sign (sign_rne_0),
    .exp (exp_rne_0),
    .specials (specials_rne_0),
    .mode (mode),
    .format (format),
    .result (result_rne_0),
    .exceps (exceps_rne_0)
);
rne64 #(`FP64SW, `FP64EW) rne_1
(
    .frac (frac_rne_1),
    .sign (sign_rne_1),
    .exp (exp_rne_1),
    .specials (specials_rne_1),
    .mode (mode),
    .format (format),
    .result (result_rne_1),
    .exceps (exceps_rne_1)
);
// Combinational assign.
// Inputs to rounding logic.
assign frac_rne_0 = fracs[2*(`FP32SW+1)-1:0];
assign frac_rne_1 = fracs[`FRACBUSOUT-1:2*(`FP32SW+1)];
// The two MSBs represent the overflow bits from the exponent
// addition.
assign exp_rne_0 =
    (format == `FP16) ?
        exps[`FP16EW+1:0] :
    (format == `FP32) ?
        exps[`FP32EW+1:0] :
    0;
assign exp_rne_1 =
    (format == `FP16) ?
        exps[`EXPBUSOUT-1:`FP16EW+2] :
    (format == `FP32) ?
        exps[`EXPBUSOUT-1:`FP32EW+2] :
    (format == `FP64) ?
        exps[`FP64EW+1:0] :
    0;
assign sign_rne_0 = signs[0];
assign sign_rne_1 =
    (format == `FP64) ?
        signs[0] :
    signs[1];
assign specials_rne_0 =
    {special[13:12], special[9:8], special[5:4], special[1:0]};
assign specials_rne_1 =
    (format == `FP64) ?
        {special[13:12], special[9:8], special[5:4], special[1:0]} :
    {special[15:14], special[11:10], special[7:6], special[3:2]};
```


```
//
// File.......: rne32.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:10:54 CEST 2008
// Revision...: 1.0
// Description: Rounding and exception unit. Rounds, normalizes and
//
//
`include "defines.v"
module rne32
(
    frac, // Input. Fractional part from multiplication.
    sign, // Input. Sign from sign computation.
    exp, // Input. Biased exponent from exponent addition.
    specials, // Input. NaNs, infinities, zeros..
    format, // Input.
    mode, // Input. Rounding mode.
    result, // Output. Rounded result or special value.
    exceps // Output. Exceptions.
);
parameter SW = `FP32SW;
parameter EW = `FP32EW;
// input(s)
input [2*SW+1:0] frac;
input [EW+1:0] exp;
input sign;
input [7:0] specials;
input [1:0] mode;
input [1:0] format;
// output(s)
output [SW+EW:0] result;
output [3:0] exceps;
// wire(s)
wire normalize;
wire postnormalize;
wire lsb;
wire round;
wire sticky;
wire roundup;
wire rounded;
wire ovf_ab;
wire ovf_biased;
wire ovf_postnorm;
wire round_to_nearest_even;
wire round_to_infinity;
wire round_to_zero;
wire nan_a;
wire nan_b;
wire int_a;
wire int_b;
wire inf_a;
wire inf_b;
wire zero_a;
wire zero_b;
wire int_times_inf;
wire invalid;
wire overflow;
wire overflow_tmp;
wire underflow;
wire inexact;
wire [SW:0] significand;
wire [SW:0] significand_tmp;
wire [SW:0] significand_plus_ulp;
wire [EW:0] exponent;
wire [EW:0] exponent_tmp;
wire [SW+EW:0] result_tmp;
wire [SW+EW:0] product_nan;
wire [SW+EW:0] product_zero;
wire [SW+EW:0] product_large;
wire [SW+EW:0] product_overflow;
// reg(s)
// Round and normalize / Postnormalize.
// Normalize if result from multiplier lies in [2,4)
assign normalize = frac[2*SW+1];
assign significand_tmp =
    normalize ?
        frac[2*SW:SW] >> 1 :
        frac[2*SW:SW];
assign exponent_tmp =
    (format == `FP16) ?
        normalize ?
            exp[`FP16EW-1:0] + 1 : exp[`FP16EW-1:0] :
    (format == `FP32) ?
        normalize ?
            exp[`FP32EW-1:0] + 1 : exp[`FP32EW-1:0] : 0;
// Assign rounding bits.
assign lsb =
    (format == `FP16) ?
        normalize ?
            frac[37] :
            frac[36] :
    (format == `FP32) ?
        normalize ?
            frac[24] :
            frac[23] : 0;
assign round =
    (format == `FP16) ?
        normalize ?
            frac[36] :
            frac[35] :
    (format == `FP32) ?
        normalize ?
            frac[23] :
            frac[22] : 0;
assign sticky =
    (format == `FP16) ?
        normalize ?
            |frac[35:26] :
            |frac[34:25] :
    (format == `FP32) ?
        normalize ?
            |frac[22:1] :
            |frac[21:0] : 0;
// Reduce to three rounding modes.
assign round_to_nearest_even =
    (round & (lsb | sticky)) & !(|mode);
assign round_to_infinity =
    (!sign&(!mode[1]&mode[0])|sign&(mode[1]&!mode[0])) &
    (round|sticky);
assign round_to_zero =
    (sign&(~mode[1]&mode[0])|~sign&(mode[1]&~mode[0]))|&mode;
// Round-up if necessary.
assign significand_plus_ulp =
    (format == `FP16) ?
        significand_tmp[SW:SW-`FP16SW] + 1'b1 :
    (format == `FP32) ?
        significand_tmp[SW:SW-`FP32SW] + 1'b1 : 0;
assign roundup = round_to_infinity | round_to_nearest_even;
assign significand =
    (format == `FP16) ?
        roundup ?
            significand_plus_ulp : significand_tmp[SW:SW-`FP16SW] :
    (format == `FP32) ?
        roundup ?
            significand_plus_ulp : significand_tmp[SW:SW-`FP32SW] : 0;
// Post-normalize if result after rounding lies in [2,4).
assign postnormalize =
    (format == `FP16) ?
        !significand[`FP16SW]&significand_tmp[SW] :
    (format == `FP32) ?
        !significand[`FP32SW]&significand_tmp[SW] : 0;
assign exponent =
    postnormalize ?
        exponent_tmp + 1 :
        exponent_tmp;
assign result_tmp =
    (format == `FP16) ?
        postnormalize ?
            {sign, exponent[`FP16EW-1:0], significand[`FP16SW-1:0]} :
            {sign, exponent[`FP16EW-1:0], significand[`FP16SW-1:0]} :
    (format == `FP32) ?
        postnormalize ?
            {sign, exponent[`FP32EW-1:0], significand[`FP32SW-1:0]} :
            {sign, exponent[`FP32EW-1:0], significand[`FP32SW-1:0]} : 0;
// Inexact if result was rounded.
assign rounded = round | sticky;
assign ovf_postnorm =
    (format == `FP16) ?
        exponent[`FP16EW] |
        &exponent[`FP16EW-1:0]&(normalize|postnormalize) :
    (format == `FP32) ?
        exponent[`FP32EW] |
        &exponent[`FP32EW-1:0]&(normalize|postnormalize) :
    0;
// Generate exceptions.
assign ovf_ab =
    (format == `FP16) ?
        exp[`FP16EW+1] :
    (format == `FP32) ?
        exp[`FP32EW+1] : 0;
assign ovf_biased =
    (format == `FP16) ?
        exp[`FP16EW] :
    (format == `FP32) ?
        exp[`FP32EW] : 0;
// Invalid inputs from chk_special.
assign nan_a = specials[0];
assign nan_b = specials[1];
assign inf_a = specials[2];
assign inf_b = specials[3];
assign zero_a = specials[4];
assign zero_b = specials[5];
assign int_a = specials[6];
assign int_b = specials[7];
// Generate exceptions.
assign int_times_inf = (int_a&inf_b)|(int_b&inf_a);
assign invalid =
    (nan_a|nan_b)|
    (zero_a&inf_b|zero_b&inf_a)|
    (inf_a|inf_b)&!int_times_inf;
assign inexact =
    (rounded&(!invalid)|
    overflow_tmp|
    round_to_zero&overflow_tmp|
    underflow&(!(zero_a|zero_b)))&!int_times_inf;
assign underflow =
    (format == `FP16) ?
        (~ovf_ab&ovf_biased)|
        (~|result_tmp[`FP16SW+`FP16EW-1:`FP16SW]) &
        !(ovf_ab&ovf_biased|ovf_postnorm) &
        !overflow&!invalid|(zero_a|zero_b) &
        !(nan_a|nan_b|inf_a|inf_b) :
    (format == `FP32) ?
        (~ovf_ab&ovf_biased)|
        (~|result_tmp[`FP32SW+`FP32EW-1:`FP32SW]) &
        !(ovf_ab&ovf_biased|ovf_postnorm) &
        !overflow&!invalid|(zero_a|zero_b) &
        !(nan_a|nan_b|inf_a|inf_b) : 0;
// If overflow occurs and the rounding mode is round-to-zero, the
// result shall be rounded to the largest representable number,
// e.g. 0111101111111111.
assign overflow_tmp =
    (format == `FP16) ?
        ((ovf_ab&ovf_biased|ovf_postnorm&!underflow)|
        &result_tmp[`FP16SW+`FP16EW-1:`FP16SW]&!underflow) &
        !invalid :
    (format == `FP32) ?
        ((ovf_ab&ovf_biased|ovf_postnorm&!underflow)|
        &result_tmp[`FP32SW+`FP32EW-1:`FP32SW]&!underflow) &
        !invalid : 0;
assign overflow = overflow_tmp&!round_to_zero | int_times_inf;
// Compute special results.
assign product_nan =
    (format == `FP16) ?
        {1'b0, {`FP16EW{1'b1}}, {(`FP16SW-1){1'b0}}, 1'b1} :
    (format == `FP32) ?
        {1'b0, {`FP32EW{1'b1}}, {(`FP32SW-1){1'b0}}, 1'b1} :
        0;
assign product_zero =
    (format == `FP16) ?
        {result_tmp[`FP16SW+`FP16EW], {(`FP16SW+`FP16EW){1'b0}}} :
    (format == `FP32) ?
        {result_tmp[`FP32SW+`FP32EW], {(`FP32SW+`FP32EW){1'b0}}} :
        0;
assign product_overflow =
    (format == `FP16) ?
        {result_tmp[`FP16SW+`FP16EW],
        {`FP16EW{1'b1}}, {(`FP16SW){1'b0}}} :
    (format == `FP32) ?
        {result_tmp[`FP32SW+`FP32EW],
        {`FP32EW{1'b1}}, {(`FP32SW){1'b0}}} :
        0;
assign product_large =
    (format == `FP16) ?
        {result_tmp[`FP16SW+`FP16EW],
        {(`FP16EW-1){1'b1}}, 1'b0, {(`FP16SW){1'b1}}} :
    (format == `FP32) ?
        {result_tmp[`FP32SW+`FP32EW],
        {(`FP32EW-1){1'b1}}, 1'b0, {(`FP32SW){1'b1}}} :
        0;
// Final product decided by exceptions.
assign result =
    invalid ? product_nan :
    overflow ? product_overflow :
    underflow ? product_zero :
    round_to_zero & overflow_tmp & !int_times_inf ? product_large :
    result_tmp;
assign exceps[0] = invalid;
assign exceps[1] = inexact;
assign exceps[2] = overflow;
assign exceps[3] = underflow;

endmodule // rne32
```
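The round-up decision in `rne32` reduces to a small Boolean function of the sign, the three rounding bits and the two-bit mode. The C sketch below restates it for checking expected results by hand. It is an illustration and not part of the thesis sources: the function name `round_up` is hypothetical, and the mode encoding (0 nearest even, 1 toward plus infinity, 2 toward minus infinity, 3 toward zero) is read off the `round_to_nearest_even`, `round_to_infinity` and `round_to_zero` equations in the listing.

```c
#include <assert.h>

/* Returns 1 when one unit in the last place must be added to the
 * truncated significand, mirroring
 *   roundup = round_to_infinity | round_to_nearest_even
 * Mode encoding as inferred from the listing:
 *   0 = nearest even, 1 = toward +inf, 2 = toward -inf, 3 = toward zero. */
static int round_up(int sign, int lsb, int round, int sticky, unsigned mode)
{
    int nearest_even, to_infinity;

    assert(mode <= 3u);
    /* Nearest even: above the halfway point, or a tie with an odd lsb. */
    nearest_even = (mode == 0u) && round && (lsb || sticky);
    /* Directed modes round the magnitude up only when the sign matches
     * the direction and some discarded bit is set. */
    to_infinity  = ((!sign && mode == 1u) || (sign && mode == 2u))
                   && (round || sticky);
    return nearest_even || to_infinity;
}
```

A tie (`round = 1`, `sticky = 0`) rounds up only when `lsb = 1`, which keeps the result even. Mode 3 never rounds up, which corresponds to truncation.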

```
//
// File.......: rne64.v
// Author.....: Espen Stenersen
// Date.......: Tue Apr 15 11:10:54 CEST 2008
// Revision...: 1.0
// Description: Rounding and exception unit. Rounds, normalizes and
//
//
`include "defines.v"
module rne64
(
    frac,     // Input. Fractional part from multiplication.
    sign,     // Input. Sign from sign computation.
    exp,      // Input. Biased exponent from exponent addition.
    specials, // Input. NaNs, infinities, zeros..
    format,   // Input.
    mode,     // Input. Rounding mode.
    result,   // Output. Rounded result or special value.
    exceps    // Output. Exceptions.
);
parameter SW = 52;
parameter EW = 11;
// input(s)
input [2*SW+1:0] frac;
input [EW+1:0] exp;
input sign;
input [7:0] specials;
input [1:0] mode;
input [1:0] format;
// output(s)
output [SW+EW:0] result;
output [3:0] exceps;
// wire(s)
wire normalize;
wire postnormalize;
wire lsb;
wire round;
wire sticky;
wire roundup;
wire rounded;
wire ovf_ab;
wire ovf_biased;
wire ovf_postnorm;
wire round_to_nearest_even;
wire round_to_infinity;
wire round_to_zero;
wire nan_a;
wire nan_b;
wire int_a;
wire int_b;
wire inf_a;
wire inf_b;
wire zero_a;
wire zero_b;
wire int_times_inf;
wire invalid;
wire overflow;
wire overflow_tmp;
wire underflow;
wire inexact;
wire [SW:0] significand;
wire [SW:0] significand_tmp;
wire [SW:0] significand_plus_ulp;
wire [EW:0] exponent;
wire [EW:0] exponent_tmp;
wire [SW+EW:0] result_tmp;
wire [SW+EW:0] product_nan;
wire [SW+EW:0] product_zero;
wire [SW+EW:0] product_large;
wire [SW+EW:0] product_overflow;
wire [SW+EW:0] product_min;
// reg(s)
// Round and normalize / Postnormalize.
// Normalize if result from multiplier lies in [2,4)
assign normalize = frac[2*SW+1];
assign significand_tmp =
    normalize ?
    frac[2*SW:SW] >> 1 : frac[2*SW:SW];
assign exponent_tmp =
    (format == `FP16) ?
    normalize ?
    exp[`FP16EW-1:0] + 1 : exp[`FP16EW-1:0] :
    (format == `FP32) ?
    normalize ?
    exp[`FP32EW-1:0] + 1 : exp[`FP32EW-1:0] :
    (format == `FP64) ?
    normalize ?
    exp[`FP64EW-1:0] + 1 : exp[`FP64EW-1:0] : 0;
// Assign rounding bits.
assign lsb =
    (format == `FP16) ?
    normalize ?
    frac[95] :
    frac[94] :
    (format == `FP32) ?
    normalize ?
    frac[82] :
    frac[81] :
    (format == `FP64) ?
    normalize ?
    frac[53] :
    frac[52] : 0;
assign round =
    (format == `FP16) ?
    normalize ?
    frac[94] :
    frac[93] :
    (format == `FP32) ?
    normalize ?
    frac[81] :
    frac[80] :
    (format == `FP64) ?
    normalize ?
    frac[52] :
    frac[51] : 0;
assign sticky =
    (format == `FP16) ?
    normalize ?
    |frac[93:84] :
    |frac[92:83] :
    (format == `FP32) ?
    normalize ?
    |frac[80:59] :
    |frac[79:58] :
    (format == `FP64) ?
    normalize ?
    |frac[51:1] :
    |frac[50:0] : 0;
// Reduce to three rounding modes.
assign round_to_nearest_even =
    (round & (lsb | sticky)) & !(|mode);
assign round_to_infinity =
    (!sign&(!mode[1]&mode[0])|sign&(mode[1]&!mode[0])) &
    (round|sticky);
assign round_to_zero =
    (sign&(~mode[1]&mode[0])|~sign&(mode[1]&~mode[0]))|&mode;
// Round-up if necessary.
assign significand_plus_ulp =
    (format == `FP16) ?
    significand_tmp[SW:SW-`FP16SW] + 1'b1 :
    (format == `FP32) ?
    significand_tmp[SW:SW-`FP32SW] + 1'b1 :
    (format == `FP64) ?
    significand_tmp[SW:SW-`FP64SW] + 1'b1 : 0;
assign roundup = round_to_infinity | round_to_nearest_even;
assign significand =
    (format == `FP16) ?
    roundup ?
    significand_plus_ulp : significand_tmp[SW:SW-`FP16SW] :
    (format == `FP32) ?
    roundup ?
    significand_plus_ulp : significand_tmp[SW:SW-`FP32SW] :
    (format == `FP64) ?
    roundup ?
    significand_plus_ulp : significand_tmp[SW:SW-`FP64SW] : 0;
// Post-normalize if result after rounding lies in [2,4).
assign postnormalize =
    (format == `FP16) ?
    !significand[`FP16SW]&significand_tmp[SW] :
    (format == `FP32) ?
    !significand[`FP32SW]&significand_tmp[SW] :
    (format == `FP64) ?
    !significand[`FP64SW]&significand_tmp[SW] : 0;
assign exponent =
    postnormalize ?
    exponent_tmp + 1 :
    exponent_tmp;
assign result_tmp =
    (format == `FP16) ?
    postnormalize ?
        {sign, exponent[`FP16EW-1:0], significand[`FP16SW-1:0]} :
        {sign, exponent[`FP16EW-1:0], significand[`FP16SW-1:0]} :
    (format == `FP32) ?
    postnormalize ?
        {sign, exponent[`FP32EW-1:0], significand[`FP32SW-1:0]} :
        {sign, exponent[`FP32EW-1:0], significand[`FP32SW-1:0]} :
    (format == `FP64) ?
    postnormalize ?
        {sign, exponent[`FP64EW-1:0], significand[`FP64SW-1:0]} :
        {sign, exponent[`FP64EW-1:0], significand[`FP64SW-1:0]} :
    0;
// Inexact if result was rounded.
assign rounded = round | sticky;
assign ovf_postnorm =
    (format == `FP16) ?
        exponent[`FP16EW] |
        &exponent[`FP16EW-1:0]&(normalize|postnormalize) :
    (format == `FP32) ?
        exponent[`FP32EW] |
        &exponent[`FP32EW-1:0]&(normalize|postnormalize) :
    (format == `FP64) ?
        exponent[`FP64EW] |
        &exponent[`FP64EW-1:0]&(normalize|postnormalize) :
    0;
// Generate exceptions.
//
assign ovf_ab =
    (format == `FP16) ?
        exp[`FP16EW+1] :
    (format == `FP32) ?
        exp[`FP32EW+1] :
    (format == `FP64) ?
        exp[`FP64EW+1] : 0;
assign ovf_biased =
    (format == `FP16) ?
        exp[`FP16EW] :
    (format == `FP32) ?
        exp[`FP32EW] :
    (format == `FP64) ?
        exp[`FP64EW] : 0;
// Invalid inputs from chk_special.
assign nan_a = specials[0];
assign nan_b = specials[1];
assign inf_a = specials[2];
assign inf_b = specials[3];
assign zero_a = specials[4];
assign zero_b = specials[5];
assign int_a = specials[6];
assign int_b = specials[7];
// Generate exceptions.
assign int_times_inf = (int_a&inf_b)|(int_b&inf_a);
assign invalid =
    (nan_a|nan_b)|
    (zero_a&inf_b|zero_b&inf_a)|
    (inf_a|inf_b)&!int_times_inf;
assign inexact =
    (rounded&(!invalid)|
    overflow_tmp|
    round_to_zero&overflow_tmp|
    underflow&(!(zero_a|zero_b)))&!int_times_inf;
assign underflow =
    (format == `FP16) ?
        (~ovf_ab&ovf_biased)|
        (~|result_tmp[`FP16SW+`FP16EW-1:`FP16SW]) &
        !(ovf_ab&ovf_biased|ovf_postnorm) &
        !overflow&!invalid|(zero_a|zero_b) &
        !(nan_a|nan_b|inf_a|inf_b) :
    (format == `FP32) ?
        (~ovf_ab&ovf_biased)|
        (~|result_tmp[`FP32SW+`FP32EW-1:`FP32SW]) &
        !(ovf_ab&ovf_biased|ovf_postnorm) &
        !overflow&!invalid|(zero_a|zero_b) &
        !(nan_a|nan_b|inf_a|inf_b) :
    (format == `FP64) ?
        (~ovf_ab&ovf_biased)|
        (~|result_tmp[`FP64SW+`FP64EW-1:`FP64SW]) &
        !(ovf_ab&ovf_biased|ovf_postnorm) &
        !overflow&!invalid|(zero_a|zero_b) &
        !(nan_a|nan_b|inf_a|inf_b) : 0;
// If overflow occurs and the rounding mode is round-to-zero, the
// result shall be rounded to the largest representable number,
// e.g. 0111101111111111.
assign overflow_tmp =
    (format == `FP16) ?
        ((ovf_ab&ovf_biased|ovf_postnorm&!underflow)|
        &result_tmp[`FP16SW+`FP16EW-1:`FP16SW]&!underflow)&!invalid :
    (format == `FP32) ?
        ((ovf_ab&ovf_biased|ovf_postnorm&!underflow)|
        &result_tmp[`FP32SW+`FP32EW-1:`FP32SW]&!underflow)&!invalid :
    (format == `FP64) ?
        ((ovf_ab&ovf_biased|ovf_postnorm&!underflow)|
        &result_tmp[`FP64SW+`FP64EW-1:`FP64SW]&!underflow)&!invalid :
    0;
assign overflow = overflow_tmp&!round_to_zero | int_times_inf;
// Compute special results.
assign product_nan =
    (format == `FP16) ?
        {1'b0, {`FP16EW{1'b1}}, {(`FP16SW-1){1'b0}}, 1'b1} :
    (format == `FP32) ?
        {1'b0, {`FP32EW{1'b1}}, {(`FP32SW-1){1'b0}}, 1'b1} :
    (format == `FP64) ?
        {1'b0, {`FP64EW{1'b1}}, {(`FP64SW-1){1'b0}}, 1'b1} :
    0;
assign product_zero =
    (format == `FP16) ?
        {result_tmp[`FP16SW+`FP16EW], {(`FP16SW+`FP16EW){1'b0}}} :
    (format == `FP32) ?
        {result_tmp[`FP32SW+`FP32EW], {(`FP32SW+`FP32EW){1'b0}}} :
    (format == `FP64) ?
        {result_tmp[`FP64SW+`FP64EW], {(`FP64SW+`FP64EW){1'b0}}} :
    0;
assign product_overflow =
    (format == `FP16) ?
        {result_tmp[`FP16SW+`FP16EW],
        {`FP16EW{1'b1}}, {(`FP16SW){1'b0}}} :
    (format == `FP32) ?
        {result_tmp[`FP32SW+`FP32EW],
        {`FP32EW{1'b1}}, {(`FP32SW){1'b0}}} :
    (format == `FP64) ?
        {result_tmp[`FP64SW+`FP64EW],
        {`FP64EW{1'b1}}, {(`FP64SW){1'b0}}} :
    0;
assign product_large =
    (format == `FP16) ?
        {result_tmp[`FP16SW+`FP16EW],
        {(`FP16EW-1){1'b1}}, 1'b0, {(`FP16SW){1'b1}}} :
    (format == `FP32) ?
        {result_tmp[`FP32SW+`FP32EW],
        {(`FP32EW-1){1'b1}}, 1'b0, {(`FP32SW){1'b1}}} :
    (format == `FP64) ?
        {result_tmp[`FP64SW+`FP64EW],
        {(`FP64EW-1){1'b1}}, 1'b0, {(`FP64SW){1'b1}}} :
    0;
assign product_min =
    (format == `FP16) ?
        {result_tmp[`FP16SW+`FP16EW],
        {(`FP16EW-1){1'b0}}, 1'b1, {(`FP16SW){1'b0}}} :
    (format == `FP32) ?
        {result_tmp[`FP32SW+`FP32EW],
        {(`FP32EW-1){1'b0}}, 1'b1, {(`FP32SW){1'b0}}} :
    (format == `FP64) ?
        {result_tmp[`FP64SW+`FP64EW],
        {(`FP64EW-1){1'b0}}, 1'b1, {(`FP64SW){1'b0}}} :
    0;
// Final product decided by exceptions.
assign result =
    invalid ? product_nan :
    overflow ? product_overflow :
    underflow ? product_zero :
    round_to_zero & overflow_tmp & !int_times_inf ? product_large :
    result_tmp;
assign exceps[0] = invalid;
assign exceps[1] = inexact;
assign exceps[2] = overflow;
assign exceps[3] = underflow;

endmodule // rne64
```


## Appendix C

## Test Data Generator

```
// Author: Espen Stenersen.
// Date: Spring 2008.
// Description: Generates FP16, FP32 and FP64 testvectors including
//
//
//
//
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
#define FP16 0
#define FP32 1
#define FP64 2
#define FP16WIDTH 16
#define FP32WIDTH 32
#define FP64WIDTH 64
#define FP16EXPONENT 5
#define FP32EXPONENT 8
#define FP64EXPONENT 11
void usage();
void generate(int format, int testcases);
void generate_nan(int width, int exponent);
void generate_zero(int width, int exponent);
void generate_infinity(int width, int exponent);
void generate_special(int width, int exponent);
void generate_random(int width, int exponent);
FILE *f;
int main (int argc, char const *argv[])
{
    int format;
    // Initializes the random generator.
    srand(time(0) * getpid());
    if (argc < 3) usage();
    else
    {
        if (strcmp("-fp16", argv[1]) == 0) format = FP16;
        else if (strcmp("-fp32", argv[1]) == 0) format = FP32;
        else if (strcmp("-fp64", argv[1]) == 0) format = FP64;
        else usage();
        generate(format, atoi(argv[2]));
    }
    return 0;
}
void generate(int format, int testcases)
{
    int i = 0;
    int width = 0;
    int exponent = 0;
    int random = 0;
    switch (format)
    {
        case FP16: width = FP16WIDTH; exponent = FP16EXPONENT;
            f = fopen("fp16testcases.txt", "wt"); break;
        case FP32: width = FP32WIDTH; exponent = FP32EXPONENT;
            f = fopen("fp32testcases.txt", "wt"); break;
        case FP64: width = FP64WIDTH; exponent = FP64EXPONENT;
            f = fopen("fp64testcases.txt", "wt"); break;
    }
    // Generates nan x nan.
    generate_nan(width, exponent);
    generate_nan(width, exponent);
    // Generates infinity x infinity.
    generate_infinity(width, exponent);
    generate_infinity(width, exponent);
    // Generates zero x zero.
    generate_zero(width, exponent);
    generate_zero(width, exponent);
    // Generates zero x infinity.
    generate_zero(width, exponent);
    generate_infinity(width, exponent);
    // Generates infinity x nan.
    generate_infinity(width, exponent);
    generate_nan(width, exponent);
    // Generates zero x nan.
    generate_zero(width, exponent);
    generate_nan(width, exponent);
    // Twelve vectors have been written so far.
    i = 12;
    while (i < testcases)
    {
        random = rand() % 999;
        if (random == 14)
        {
            generate_special(width, exponent);
        }
        else
        {
            generate_random(width, exponent);
        }
        i++;
    }
    fclose(f);
}
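// Generate a random vector whose exponent field is never all zeros.
void generate_random(int width, int exponent)
{
    int j = 0;
    int normalized = 0;
    int bit = 0;
    while (j < width)
    {
        bit = rand() % 2;
        // Sign bit.
        if (j < 1) fprintf(f, "%d", bit);
        // Exponent bits except the last one; count the ones.
        else if (j < exponent)
        {
            if (bit == 1) normalized++;
            fprintf(f, "%d", bit);
        }
        else
        {
            // Last exponent bit: force a one if none was set so far.
            if (j == exponent)
            {
                if (normalized < 1)
                    fprintf(f, "%d", 1);
                else
                    fprintf(f, "%d", bit);
            }
            else
                fprintf(f, "%d", bit);
        }
        j++;
    }
    fprintf(f, "\n");
}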
// Generate random special input vectors, e.g. zero x infinity.
void generate_special(int width, int exponent)
{
    int random = rand() % 6;
    // Generates nan x nan.
    if (random == 0)
    {
        generate_nan(width, exponent);
        generate_nan(width, exponent);
    }
    // Generates infinity x infinity.
    else if (random == 1)
    {
        generate_infinity(width, exponent);
        generate_infinity(width, exponent);
    }
    // Generates zero x zero.
    else if (random == 2)
    {
        generate_zero(width, exponent);
        generate_zero(width, exponent);
    }
    // Generates zero x infinity.
    else if (random == 3)
    {
        generate_zero(width, exponent);
        generate_infinity(width, exponent);
    }
    // Generates infinity x nan.
    else if (random == 4)
    {
        generate_infinity(width, exponent);
        generate_nan(width, exponent);
    }
    // Generates zero x nan.
    else if (random == 5)
    {
        generate_zero(width, exponent);
        generate_nan(width, exponent);
    }
}
// Generate NaN vectors.
void generate_nan(int width, int exponent)
{
    int i = 0;
    int bit = rand() % 2;
    while (i < width)
    {
        if (i< 1) fprintf(f, "%d", bit);
        else if (i< exponent + 1) fprintf(f, "%d", 1);
        else if (i < width - 1) fprintf(f, "%d", 0);
        else fprintf(f, "%d", 1);
        i ++;
    }
    fprintf(f, "\n");
}
// Generate zero vectors.
void generate_zero(int width, int exponent)
{
    int i = 0;
    int bit = rand() % 2;
    while (i < width)
    {
        if (i< 1) fprintf(f, "%d", bit);
        else fprintf(f, "%d", 0);
        i++;
    }
    fprintf(f, "\n");
}
// Generate infinity vectors.
void generate_infinity(int width, int exponent)
{
    int i = 0;
    int bit = rand() % 2;
    while (i < width)
    {
        if (i < 1) fprintf(f, "%d", bit);
        else if (i < exponent + 1) fprintf(f, "%d", 1);
        else fprintf(f, "%d", 0);
        i++;
    }
    fprintf(f, "\n");
}
// Prints user info.
void usage()
{
}
```
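The three special-value generators share one bit layout: a sign bit, an exponent field that is all ones for infinity and NaN, and a fraction that is zero except for the last bit of a NaN. The following C sketch builds the same patterns into a string buffer so they can be inspected without writing a file. It is an illustration under stated assumptions and not part of the generator: the names `make_special` and `special_kind` are hypothetical, and the sign is passed in explicitly where the generator draws it at random.

```c
#include <assert.h>
#include <string.h>

typedef enum { SPECIAL_ZERO, SPECIAL_INFINITY, SPECIAL_NAN } special_kind;

/* Writes a width-character bit string, MSB first, into buf.
 * buf must have room for width + 1 characters. */
static void make_special(char *buf, int width, int exponent,
                         int sign, special_kind kind)
{
    assert(width > exponent + 1 && exponent > 0);
    memset(buf, '0', (size_t)width);       /* start from all zeros      */
    buf[width] = '\0';
    if (sign)
        buf[0] = '1';                       /* sign bit                  */
    if (kind != SPECIAL_ZERO)
        memset(buf + 1, '1', (size_t)exponent);  /* exponent all ones   */
    if (kind == SPECIAL_NAN)
        buf[width - 1] = '1';               /* nonzero fraction marks NaN */
}
```

For FP16 (`width = 16`, `exponent = 5`) a positive infinity comes out as `0111110000000000` and a positive NaN as `0111110000000001`, the same strings that `generate_infinity` and `generate_nan` print when their random sign bit is zero.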


## Appendix D

## Simulation Sources

## D.1 Vectorized DesignWare Floating-Point Multiplier Source

```
// File......: dw_vec_fp16_mult.v
// Author.....: Espen Stenersen
// Date......: Thu Apr 24 16:40:38 CEST 2008
// Revision...: 1.0
// Description: Vectorized FP16 floating-point multiplier based on
//              the DesignWare simulation model.
//
`include "defines.v"
module dw_vec_fp_mult
(
    dw_vectors, // Input from testbench.
    dw_mode, // Input from testbench.
    format, // Input from testbench.
    dw_products, // Output to testbench.
    dw_exceptions // Output to testbench.
);
    // input(s)
    input [2*`BUS-1:0] dw_vectors;
    input [2:0] dw_mode;
    input [1:0] format;
    // output(s)
    output [`BUS-1:0] dw_products;
    output [15:0] dw_exceptions;
    // wire(s)
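    wire [7:0] fp16_mult0_status;
    wire [7:0] fp16_mult1_status;
    wire [7:0] fp16_mult2_status;
    wire [7:0] fp16_mult3_status;
    wire [`FP16SW+`FP16EW:0] fp16_mult0_z;
    wire [`FP16SW+`FP16EW:0] fp16_mult1_z;
    wire [`FP16SW+`FP16EW:0] fp16_mult2_z;
    wire [`FP16SW+`FP16EW:0] fp16_mult3_z;
    wire [`FP16SW+`FP16EW:0] fp16_mult0_a;
    wire [`FP16SW+`FP16EW:0] fp16_mult0_b;
    wire [`FP16SW+`FP16EW:0] fp16_mult1_a;
    wire [`FP16SW+`FP16EW:0] fp16_mult1_b;
    wire [`FP16SW+`FP16EW:0] fp16_mult2_a;
    wire [`FP16SW+`FP16EW:0] fp16_mult2_b;
    wire [`FP16SW+`FP16EW:0] fp16_mult3_a;
    wire [`FP16SW+`FP16EW:0] fp16_mult3_b;
    wire [`FP16SW+`FP16EW:0] fp16_mult0_z_tmp;
    wire [`FP16SW+`FP16EW:0] fp16_mult1_z_tmp;
    wire [`FP16SW+`FP16EW:0] fp16_mult2_z_tmp;
    wire [`FP16SW+`FP16EW:0] fp16_mult3_z_tmp;
    wire [7:0] fp32_mult0_status;
    wire [7:0] fp32_mult1_status;
    wire [7:0] fp32_mult2_status;
    wire [7:0] fp32_mult3_status;
    wire [`FP32SW+`FP32EW:0] fp32_mult0_z;
    wire [`FP32SW+`FP32EW:0] fp32_mult1_z;
    wire [`FP32SW+`FP32EW:0] fp32_mult2_z;
    wire [`FP32SW+`FP32EW:0] fp32_mult3_z;
    wire [`FP32SW+`FP32EW:0] fp32_mult0_a;
    wire [`FP32SW+`FP32EW:0] fp32_mult0_b;
    wire [`FP32SW+`FP32EW:0] fp32_mult1_a;
    wire [`FP32SW+`FP32EW:0] fp32_mult1_b;
    wire [`FP32SW+`FP32EW:0] fp32_mult2_a;
    wire [`FP32SW+`FP32EW:0] fp32_mult2_b;
    wire [`FP32SW+`FP32EW:0] fp32_mult3_a;
    wire [`FP32SW+`FP32EW:0] fp32_mult3_b;
    wire [`FP32SW+`FP32EW:0] fp32_mult0_z_tmp;
    wire [`FP32SW+`FP32EW:0] fp32_mult1_z_tmp;
    wire [`FP32SW+`FP32EW:0] fp32_mult2_z_tmp;
    wire [`FP32SW+`FP32EW:0] fp32_mult3_z_tmp;
    wire [7:0] fp64_mult0_status;
    wire [7:0] fp64_mult1_status;
    wire [`FP64SW+`FP64EW:0] fp64_mult0_z;
    wire [`FP64SW+`FP64EW:0] fp64_mult1_z;
    wire [`FP64SW+`FP64EW:0] fp64_mult0_a;
    wire [`FP64SW+`FP64EW:0] fp64_mult0_b;
    wire [`FP64SW+`FP64EW:0] fp64_mult1_a;
    wire [`FP64SW+`FP64EW:0] fp64_mult1_b;
    wire [`FP64SW+`FP64EW:0] fp64_mult0_z_tmp;
    wire [`FP64SW+`FP64EW:0] fp64_mult1_z_tmp;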
// reg(s)
/// Module instantiation.
// Module instantiation
dw_fp_mult \#('FP16SW, 'FP16EW, 1) fp16_mult0
(
    .a (fp16_mult0_a),
    .b (fp16_mult0_b),
    .rnd (dw_mode),
    . z (fp16_mult0_z),
    .status (fp16_mult0_status)
) ;
dw_fp_mult \#('FP16SW, 'FP16EW, 1) fp16_mult1
```

```
(
    .a (fp16_mult1_a),
    .b (fp16_mult1_b),
    .rnd (dw_mode),
    .z (fp16_mult1_z),
    .status (fp16_mult1_status)
);
dw_fp_mult #(`FP16SW, `FP16EW, 1) fp16_mult2
(
    .a (fp16_mult2_a),
    .b (fp16_mult2_b),
    .rnd (dw_mode),
    .z (fp16_mult2_z),
    .status (fp16_mult2_status)
);
dw_fp_mult #(`FP16SW, `FP16EW, 1) fp16_mult3
(
    .a (fp16_mult3_a),
    .b (fp16_mult3_b),
    .rnd (dw_mode),
    .z (fp16_mult3_z),
    .status (fp16_mult3_status)
);
dw_fp_mult #(`FP32SW, `FP32EW, 1) fp32_mult0
(
    .a (fp32_mult0_a),
    .b (fp32_mult0_b),
    .rnd (dw_mode),
    .z (fp32_mult0_z),
    .status (fp32_mult0_status)
);
dw_fp_mult #(`FP32SW, `FP32EW, 1) fp32_mult1
(
    .a (fp32_mult1_a),
    .b (fp32_mult1_b),
    .rnd (dw_mode),
    .z (fp32_mult1_z),
    .status (fp32_mult1_status)
);
dw_fp_mult #(`FP32SW, `FP32EW, 1) fp32_mult2
(
    .a (fp32_mult2_a),
    .b (fp32_mult2_b),
    .rnd (dw_mode),
    .z (fp32_mult2_z),
    .status (fp32_mult2_status)
);
dw_fp_mult #(`FP32SW, `FP32EW, 1) fp32_mult3
(
    .a (fp32_mult3_a),
    .b (fp32_mult3_b),
    .rnd (dw_mode),
    .z (fp32_mult3_z),
    .status (fp32_mult3_status)
);
dw_fp_mult #(`FP64SW, `FP64EW, 1) fp64_mult0
(
    .a (fp64_mult0_a),
    .b (fp64_mult0_b),
    .rnd (dw_mode),
    .z (fp64_mult0_z),
    .status (fp64_mult0_status)
);
dw_fp_mult #(`FP64SW, `FP64EW, 1) fp64_mult1
(
    .a (fp64_mult1_a),
    .b (fp64_mult1_b),
    .rnd (dw_mode),
    .z (fp64_mult1_z),
    .status (fp64_mult1_status)
);
// ...

// Set exceptions.
// Invalid. dw_status[2].
assign dw_exceptions[0] =
    (format == `FP16) ? fp16_mult0_status[2] :
    (format == `FP32) ? fp32_mult0_status[2] :
    (format == `FP64) ? fp64_mult0_status[2] : 0;
assign dw_exceptions[1] =
    (format == `FP16) ? fp16_mult1_status[2] :
    (format == `FP32) ? fp32_mult1_status[2] :
    (format == `FP64) ? fp64_mult1_status[2] : 0;
assign dw_exceptions[2] =
    (format == `FP16) ? fp16_mult2_status[2] :
    (format == `FP32) ? fp32_mult2_status[2] :
    (format == `FP64) ? 0 : 0;
assign dw_exceptions[3] =
    (format == `FP16) ? fp16_mult3_status[2] :
    (format == `FP32) ? fp32_mult3_status[2] :
    (format == `FP64) ? 0 : 0;
// Inexact. dw_status[5].
assign dw_exceptions[4] =
    (format == `FP16) ? fp16_mult0_status[5] | fp16_mult0_status[3] :
    (format == `FP32) ? fp32_mult0_status[5] | fp32_mult0_status[3] :
    (format == `FP64) ? fp64_mult0_status[5] | fp64_mult0_status[3] : 0;
assign dw_exceptions[5] =
    (format == `FP16) ? fp16_mult1_status[5] | fp16_mult1_status[3] :
    (format == `FP32) ? fp32_mult1_status[5] | fp32_mult1_status[3] :
    (format == `FP64) ? fp64_mult1_status[5] | fp64_mult1_status[3] : 0;
assign dw_exceptions[6] =
    (format == `FP16) ? fp16_mult2_status[5] | fp16_mult2_status[3] :
    (format == `FP32) ? fp32_mult2_status[5] | fp32_mult2_status[3] :
    (format == `FP64) ? 0 : 0;
assign dw_exceptions[7] =
    (format == `FP16) ? fp16_mult3_status[5] | fp16_mult3_status[3] :
    (format == `FP32) ? fp32_mult3_status[5] | fp32_mult3_status[3] :
    (format == `FP64) ? 0 : 0;
// Overflow. dw_status[1].
assign dw_exceptions[8] =
    (format == `FP16) ? fp16_mult0_status[1] :
    (format == `FP32) ? fp32_mult0_status[1] :
    (format == `FP64) ? fp64_mult0_status[1] : 0;
assign dw_exceptions[9] =
    (format == `FP16) ? fp16_mult1_status[1] :
    (format == `FP32) ? fp32_mult1_status[1] :
    (format == `FP64) ? fp64_mult1_status[1] : 0;
assign dw_exceptions[10] =
    (format == `FP16) ? fp16_mult2_status[1] :
    (format == `FP32) ? fp32_mult2_status[1] :
    (format == `FP64) ? 0 : 0;
assign dw_exceptions[11] =
    (format == `FP16) ? fp16_mult3_status[1] :
    (format == `FP32) ? fp32_mult3_status[1] :
    (format == `FP64) ? 0 : 0;
// Underflow. dw_status[0] | dw_status[3] (underflow/denormal).
assign dw_exceptions[12] =
    (format == `FP16) ? fp16_mult0_status[0] | fp16_mult0_status[3] :
    (format == `FP32) ? fp32_mult0_status[0] | fp32_mult0_status[3] :
    (format == `FP64) ? fp64_mult0_status[0] | fp64_mult0_status[3] : 0;
assign dw_exceptions[13] =
    (format == `FP16) ? fp16_mult1_status[0] | fp16_mult1_status[3] :
    (format == `FP32) ? fp32_mult1_status[0] | fp32_mult1_status[3] :
    (format == `FP64) ? fp64_mult1_status[0] | fp64_mult1_status[3] : 0;
assign dw_exceptions[14] =
    (format == `FP16) ? fp16_mult2_status[0] | fp16_mult2_status[3] :
    (format == `FP32) ? fp32_mult2_status[0] | fp32_mult2_status[3] :
    (format == `FP64) ? 0 : 0;
assign dw_exceptions[15] =
    (format == `FP16) ? fp16_mult3_status[0] | fp16_mult3_status[3] :
    (format == `FP32) ? fp32_mult3_status[0] | fp32_mult3_status[3] :
    (format == `FP64) ? 0 : 0;
// Flush product to zero if denormal output from dw_fp_mult.
assign fp16_mult0_z_tmp = fp16_mult0_status[3] ?
    {fp16_mult0_z[`FP16W-1], fp16_mult0_z[`FP16W-2:0] & 1'b0} :
    fp16_mult0_z;
assign fp16_mult1_z_tmp = fp16_mult1_status[3] ?
    {fp16_mult1_z[`FP16W-1], fp16_mult1_z[`FP16W-2:0] & 1'b0} :
    fp16_mult1_z;
assign fp16_mult2_z_tmp = fp16_mult2_status[3] ?
    {fp16_mult2_z[`FP16W-1], fp16_mult2_z[`FP16W-2:0] & 1'b0} :
    fp16_mult2_z;
assign fp16_mult3_z_tmp = fp16_mult3_status[3] ?
    {fp16_mult3_z[`FP16W-1], fp16_mult3_z[`FP16W-2:0] & 1'b0} :
    fp16_mult3_z;
assign fp32_mult0_z_tmp = fp32_mult0_status[3] ?
    {fp32_mult0_z[`FP32W-1], fp32_mult0_z[`FP32W-2:0] & 1'b0} :
    fp32_mult0_z;
assign fp32_mult1_z_tmp = fp32_mult1_status[3] ?
    {fp32_mult1_z[`FP32W-1], fp32_mult1_z[`FP32W-2:0] & 1'b0} :
    fp32_mult1_z;
assign fp32_mult2_z_tmp = fp32_mult2_status[3] ?
    {fp32_mult2_z[`FP32W-1], fp32_mult2_z[`FP32W-2:0] & 1'b0} :
    fp32_mult2_z;
assign fp32_mult3_z_tmp = fp32_mult3_status[3] ?
    {fp32_mult3_z[`FP32W-1], fp32_mult3_z[`FP32W-2:0] & 1'b0} :
    fp32_mult3_z;
assign fp64_mult0_z_tmp = fp64_mult0_status[3] ?
    {fp64_mult0_z[`FP64W-1], fp64_mult0_z[`FP64W-2:0] & 1'b0} :
    fp64_mult0_z;
assign fp64_mult1_z_tmp = fp64_mult1_status[3] ?
    {fp64_mult1_z[`FP64W-1], fp64_mult1_z[`FP64W-2:0] & 1'b0} :
    fp64_mult1_z;
// Output mux.
//
assign dw_products = (format == `FP16) ?
    {fp16_mult3_z_tmp, fp16_mult2_z_tmp,
     fp16_mult1_z_tmp, fp16_mult0_z_tmp} :
    (format == `FP32) ?
    {fp32_mult3_z_tmp, fp32_mult2_z_tmp,
     fp32_mult1_z_tmp, fp32_mult0_z_tmp} :
    (format == `FP64) ?
    {fp64_mult1_z_tmp, fp64_mult0_z_tmp} : 0;
endmodule // dw_vec_fp_mult
```


## D.2 Testbench Sources

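The testbench in this section flags each operand as NaN, infinity or zero from its exponent and fraction fields (the `nan`, `inf` and `zero` wires). A minimal C sketch of the same predicates is given here as a reading aid. The names `fp_class` and `classify` are hypothetical, and the field widths in the usage note (10 fraction bits for FP16, 23 for FP32, 52 for FP64) are assumed to match the `SW` macros in `defines.v`, which is not reproduced in this appendix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type mirroring the testbench's nan/inf/zero wires. */
typedef struct {
    bool nan;
    bool inf;
    bool zero;
} fp_class;

/*
 * Classifies a `w`-bit floating-point word `a` whose fraction field is
 * `sw` bits wide; the exponent field occupies bits [w-2:sw].
 */
static fp_class classify(uint64_t a, int w, int sw)
{
    assert(sw >= 1 && w >= sw + 2 && w <= 64);
    int ew = w - 1 - sw;                              /* exponent width   */
    uint64_t frac_mask = (UINT64_C(1) << sw) - 1;
    uint64_t exp_mask  = (UINT64_C(1) << ew) - 1;
    uint64_t frac = a & frac_mask;                    /* A[SW-1:0]        */
    uint64_t exp  = (a >> sw) & exp_mask;             /* A[W-2:SW]        */

    fp_class c;
    c.nan  = (exp == exp_mask) && (frac != 0);        /* &exp  &  |frac   */
    c.inf  = (exp == exp_mask) && (frac == 0);        /* &exp  & ~|frac   */
    c.zero = (exp == 0)        && (frac == 0);        /* ~|exp & ~|frac   */
    return c;
}
```

Under those width assumptions, `classify(0x7C00, 16, 10)` reports infinity, `classify(0x7C01, 16, 10)` reports NaN and `classify(0x8000, 16, 10)` reports zero. The sign bit is ignored, as in the testbench.
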
```
// File......: vec_fp_mult_tb.v
// Author.....: Espen Stenersen
// Date......: Thu Apr 17 13:49:28 CEST 2008
// Revision...: 1.0
// Description: Testbench for top module vec_fp_mult.
//
`include "../rtl/defines.v"
//`timescale
`define CLK_PERIOD 1
module vec_fp_mult_tb;
    `include "../tb/defines_tb.v"
    `include "../tb/debug.v"
parameter W = `FP16W;
parameter SW = `FP16SW;
parameter EW = `FP16EW;
parameter FORMAT = `FP16;
parameter MODE = `ZERO;
parameter VECTORS = 100000;
// wire(s)
wire [`BUS-1:0] products;
wire [15:0] exceptions;
wire ready;
wire exceptions_failed;
wire products_failed;
wire [3:0] nan, inf, zero;
wire [`BUS-1:0] dw_products;
wire [15:0] dw_exceptions;
wire [2*`BUS-1:0] dw_vectors;
// reg(s)
reg [W-1:0] testmem [0:VECTORS-1];
reg [`BUS-1:0] vectors;
reg [W-1:0] A0, B0, A1, B1;
reg [1:0] format;
reg [1:0] mode;
reg [15:0] clear;
reg start;
reg clk;
reg reset_n;
reg [2:0] dw_mode;
// Counters.
integer i_vec, i_ans, i_passed, i_failed, i_total;
integer i_nan, i_zero, i_inf, i_inf_times_zero;
integer step, i_ovf, i_unf, i_inv, i_inx;
integer i_nan_times_any;

// Module instantiation.
//
vec_fp_mult DUT
(
.start (start), // Input. Starts computation.
.vectors (vectors), // Input. FP vectors.
.format (format), // Input. Format of vectors.
.mode (mode), // Input. Rounding mode.
.clear (clear), // Input. Clears exceptions.
.products (products), // Output. Computed products.
.exceptions (exceptions), // Output. Exceptions raised.
.ready (ready), // Output. Output ready.
.clk (clk),
.reset_n (reset_n)
);
dw_vec_fp_mult dw_vec_fp_mult
(
.dw_vectors (dw_vectors), // Input from testbench.
.dw_mode (dw_mode), // Input from testbench.
.format (format), // Input from testbench.
.dw_products (dw_products), // Output to testbench.
.dw_exceptions (dw_exceptions) // Output to testbench.
);
/// Initials.
//
// Generate stimuli.
initial begin
    // Verbosity levels:
    //   0: Only final report.
    //   1: Signal events and updates.
    //   2: Error messages.
    //   3: Elaborated error messages with product vectors,
    //      exception vectors and input vectors that caused the
    //      error.
    //   4: 1 and 3 combined.
    verbosity(3);
    initialize; // Call to initialize task.
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    @ (posedge clk) reset_n = 1;
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    for (i_vec = 0; i_vec < VECTORS; i_vec = i_vec + step) begin
        @ (posedge clk) begin
            start = 1;
            if (format == `FP16) begin
                vectors[1*`FP16W-1:0*`FP16W] <= testmem[i_vec + 0];
                vectors[2*`FP16W-1:1*`FP16W] <= testmem[i_vec + 1];
                vectors[3*`FP16W-1:2*`FP16W] <= testmem[i_vec + 2];
                vectors[4*`FP16W-1:3*`FP16W] <= testmem[i_vec + 3];
                A0 <= testmem[i_vec + 0];
                B0 <= testmem[i_vec + 1];
                A1 <= testmem[i_vec + 2];
                B1 <= testmem[i_vec + 3];
            end
            else if (format == `FP32) begin
                vectors[1*`FP32W-1:0*`FP32W] <= testmem[i_vec + 0];
                vectors[2*`FP32W-1:1*`FP32W] <= testmem[i_vec + 1];
                vectors[3*`FP32W-1:2*`FP32W] <= testmem[i_vec + 2];
                vectors[4*`FP32W-1:3*`FP32W] <= testmem[i_vec + 3];
                A0 <= testmem[i_vec + 0];
                B0 <= testmem[i_vec + 1];
                A1 <= testmem[i_vec + 2];
                B1 <= testmem[i_vec + 3];
            end
            else if (format == `FP64) begin
                vectors[1*`FP64W-1:0*`FP64W] <= testmem[i_vec + 0];
                vectors[2*`FP64W-1:1*`FP64W] <= testmem[i_vec + 1];
                A0 <= testmem[i_vec + 0];
                B0 <= testmem[i_vec + 1];
            end
            else begin
                vectors <= 0;
                A0 <= 0;
                B0 <= 0;
                A1 <= 0;
                B1 <= 0;
            end
        end
    end
    // Empty pipeline.
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    start = 0;
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    print_report;
    $finish;
end
// Sequential test logic.
//
// Clock generator.
always #`CLK_PERIOD clk = !clk;
// Monitor / checker.
always @ (ready) begin
    if (reset_n == 0) begin
        i_ans <= 0;
        i_total <= 0;
    end
    // When products and exceptions are ready at output.
    if (ready == 1) begin
        i_total <= i_total + 1;
        if (format == `FP64) begin
            i_ans <= i_ans + 4;
        end
        else begin
            i_ans <= i_ans + 8;
        end
        if ((products != dw_products) |
            (exceptions != dw_exceptions)) begin
            i_failed = i_failed + 1;
        end
        else begin
            i_passed = i_passed + 1;
        end
    end
end
// Combinational test logic.
//
// Clears exceptions when raised.
/*always @ (ready) begin
    if (ready) begin
        clear = 'hffff;
    end
    else if (!ready) begin
        clear = 'h0000;
    end
end*/
// Produces test statistics.
// Counts infinity inputs.
integer i0;
always @ (inf or start) begin
    for (i0 = 0; i0 < 4; i0 = i0 + 1) begin
        if (inf[i0] & start) i_inf = i_inf + 1;
    end
end
// Counts zero inputs.
integer i1;
always @ (zero or start) begin
    for (i1 = 0; i1 < 4; i1 = i1 + 1) begin
        if (zero[i1] & start) i_zero = i_zero + 1;
    end
end
// Counts invalid inputs.
integer i2;
always @ (nan or start) begin
    for (i2 = 0; i2 < 4; i2 = i2 + 1) begin
        if (nan[i2] & start) i_nan = i_nan + 1;
    end
end
// Counts infinity times zero inputs.
integer i3;
always @ (inf or zero or start) begin
    for (i3 = 0; i3 < 2; i3 = i3 + 2) begin
        if (inf[i3] & zero[i3+1] & start)
            i_inf_times_zero = i_inf_times_zero + 1;
    end
    for (i3 = 0; i3 < 2; i3 = i3 + 2) begin
        if (inf[i3+1] & zero[i3] & start)
            i_inf_times_zero = i_inf_times_zero + 1;
    end
end
// Counts invalid times any number (not invalid times invalid).
integer i4;
always @ (nan or start) begin
    for (i4 = 0; i4 < 2; i4 = i4 + 1) begin
        if (nan[i4] & !nan[i4+1] & start)
            i_nan_times_any = i_nan_times_any + 1;
    end
end
// Counts underflows.
integer o0;
always @ (ready or exceptions) begin
    for (o0 = 0; o0 < 4; o0 = o0 + 1) begin
        if (exceptions[12 + o0] & ready) i_unf = i_unf + 1;
    end
end
// Counts overflows.
integer o1;
always @ (ready or exceptions) begin
    for (o1 = 0; o1 < 4; o1 = o1 + 1) begin
        if (exceptions[8 + o1] & ready) i_ovf = i_ovf + 1;
    end
end
// Counts inexacts.
integer o2;
always @ (ready or exceptions) begin
    for (o2 = 0; o2 < 4; o2 = o2 + 1) begin
        if (exceptions[4 + o2] & ready) i_inx = i_inx + 1;
    end
end
// Counts invalids.
integer o3;
always @ (ready or exceptions) begin
    for (o3 = 0; o3 < 4; o3 = o3 + 1) begin
        if (exceptions[0 + o3] & ready) i_inv = i_inv + 1;
    end
end
// Assigns.
assign dw_vectors =
    (format == `FP16) ?
    {testmem[i_ans + 7], testmem[i_ans + 6],
     testmem[i_ans + 5], testmem[i_ans + 4],
     testmem[i_ans + 3], testmem[i_ans + 2],
     testmem[i_ans + 1], testmem[i_ans + 0]} :
    (format == `FP32) ?
    {testmem[i_ans + 7], testmem[i_ans + 6],
     testmem[i_ans + 5], testmem[i_ans + 4],
     testmem[i_ans + 3], testmem[i_ans + 2],
     testmem[i_ans + 1], testmem[i_ans + 0]} :
    (format == `FP64) ?
    {testmem[i_ans + 3], testmem[i_ans + 2],
     testmem[i_ans + 1], testmem[i_ans + 0]} : 0;
assign nan[0] = (&A0[W-2:SW]) & (|A0[SW-1:0]);
assign inf[0] = (&A0[W-2:SW]) & (~|A0[SW-1:0]);
assign zero[0] = (~|A0[W-2:SW]) & (~|A0[SW-1:0]);
assign nan[1] = (&A1[W-2:SW]) & (|A1[SW-1:0]);
assign inf[1] = (&A1[W-2:SW]) & (~|A1[SW-1:0]);
assign zero[1] = (~|A1[W-2:SW]) & (~|A1[SW-1:0]);
assign nan[2] = (&B0[W-2:SW]) & (|B0[SW-1:0]);
assign inf[2] = (&B0[W-2:SW]) & (~|B0[SW-1:0]);
assign zero[2] = (~|B0[W-2:SW]) & (~|B0[SW-1:0]);
assign nan[3] = (&B1[W-2:SW]) & (|B1[SW-1:0]);
assign inf[3] = (&B1[W-2:SW]) & (~|B1[SW-1:0]);
assign zero[3] = (~|B1[W-2:SW]) & (~|B1[SW-1:0]);
// Tasks.
//
task initialize;
    begin
        set_mode(MODE); format = FORMAT;
        clk = 1; reset_n = 0; clear = 0; start = 0;
        A0 = 0; B0 = 0; B1 = 0; A1 = 0; vectors = 0;
        i_vec = 0; i_ans = 0; i_passed = 0; i_failed = 0;
        i_nan = 0; i_zero = 0; i_inf = 0; i_inf_times_zero = 0;
        i_nan_times_any = 0; i_ovf = 0; i_unf = 0; i_inv = 0;
        i_inx = 0;
        // Opens correct testcase readfile.
        if (format == `FP16) begin
            step = 4;
            $readmemb(`FP16TESTCASES, testmem);
        end
        else if (format == `FP32) begin
            step = 4;
            $readmemb(`FP32TESTCASES, testmem);
        end
        else if (format == `FP64) begin
            step = 2;
            $readmemb(`FP64TESTCASES, testmem);
        end
        $display(" ");
    end
endtask
// Sets rounding mode for both floating-point multipliers.
task set_mode;
    input [1:0] r_mode;
    begin
        case (r_mode)
            `EVEN: begin
                mode = `EVEN;
                dw_mode = `DW_EVEN;
            end
            `PINF: begin
                mode = `PINF;
                dw_mode = `DW_PINF;
            end
            `NINF: begin
                mode = `NINF;
                dw_mode = `DW_NINF;
            end
            `ZERO: begin
                mode = `ZERO;
                dw_mode = `DW_ZERO;
            end
            default: begin
                $display("Not valid rounding mode!");
                $finish;
            end
        endcase
    end
endtask
endmodule // vec_fp_mult_tb

```
```

//
// File......: debug.v
// Author.....: Espen Stenersen
// Date......: Sat Apr 26 01:04:00 CEST 2008
// Revision...: 1.0
// Description: Tasks for debugging design. Reports errors and signal
//              status at different verbosity levels.
//
reg v1;
reg v2;
reg v3;
reg v4;
// Verbosity levels:
//   0: Only final report.
//   1: Signal events and updates.
//   2: Error messages.
//   3: Elaborated error messages with product vectors, exception
//      vectors and input vectors that caused the error.
//   4: 1 and 3 combined.
task verbosity;
    input [2:0] verbosity;
    begin
        case (verbosity)
            0: begin
                v1 = 0; v2 = 0; v3 = 0; v4 = 0;
            end
            // Signal updates.
            1: begin
                v1 = 1;
                print_header;
            end
            // Error messages.
            2: begin
                v2 = 1;
            end
            // Elaborated messages.
            3: begin
                v3 = 1;
            end
            // Elaborated messages with signal updates.
            4: begin
                v4 = 1;
            end
            default: begin
                v1 = 0; v2 = 0; v3 = 0; v4 = 0;
            end
        endcase
    end
endtask
// Prints elaborated error messages.
always @ (ready) begin
    if ((v2 | v3) & (ready == 1)) begin
        if ((products != dw_products) |
            (exceptions != dw_exceptions)) begin
            print_error;
        end
    end
end
// Updates the signal status print-out.
always @ (reset_n) begin
    if (v1 | v4)
        $display("@ %0d\t\t| reset_n\t\t| %b", ($time)/2, reset_n);
end
always @ (start) begin
    if (v1 | v4)
        $display("@ %0d\t\t| start\t\t\t| %b", ($time)/2, start);
end
always @ (ready) begin
    if (v1)
        $display("@ %0d\t\t| ready\t\t\t| %b", ($time)/2, ready);
    if (v4 & ready) print_error;
    if (v4 & !ready)
        $display("@ %0d\t\t| ready\t\t\t| %b", ($time)/2, ready);
end
always @ (clear) begin
    if (v1 | v4)
        $display("@ %0d\t\t| clear\t\t\t| %b %b %b %b", ($time)/2,
            clear[15:12], clear[11:8], clear[7:4], clear[3:0]);
end
always @ (ready) begin
    if (ready == 1) begin
        if ((v1) & (products != dw_products))
            $display("@ %0d\t\t| products\t\t| ERROR!", ($time)/2);
    end
end
always @ (ready) begin
    if (ready == 1) begin
        if ((v1) & (exceptions != dw_exceptions))
            $display("@ %0d\t\t| exceptions\t\t| ERROR!", ($time)/2);
    end
end
always @ (format) begin
    if (v1 | v4) begin
        case (format)
            `FP16: begin
                $display("@ %0d\t\t| format\t\t| 16-bit floating-point (\'b%b)",
                    ($time)/2, format);
            end
            `FP32: begin
                $display("@ %0d\t\t| format\t\t| 32-bit floating-point (\'b%b)",
                    ($time)/2, format);
            end
            `FP64: begin
                $display("@ %0d\t\t| format\t\t| 64-bit floating-point (\'b%b)",
                    ($time)/2, format);
            end
        endcase
    end
end
always @ (mode) begin
    if (v1 | v4) begin
        case (mode)
            `EVEN: begin
                $display("@ %0d\t\t| mode\t\t\t| Round-to-nearest even (\'b%b)",
                    ($time)/2, mode);
            end
            `PINF: begin
                $display("@ %0d\t\t| mode\t\t\t| Round-to-positive infinity (\'b%b)",
                    ($time)/2, mode);
            end
            `NINF: begin
                $display("@ %0d\t\t| mode\t\t\t| Round-to-negative infinity (\'b%b)",
                    ($time)/2, mode);
            end
            `ZERO: begin
                $display("@ %0d\t\t| mode\t\t\t| Round-to zero (\'b%b)",
                    ($time)/2, mode);
            end
        endcase
    end
end
always @ (exceptions) begin
    if (v4) begin
        $display("@ %0d\t\t| exceptions\t\t| %b %b %b %b", $time/2,
            exceptions[15:12], exceptions[11:8], exceptions[7:4],
            exceptions[3:0]);
    end
end
// ...
task print_error;
    begin
        // ...
        $display("DW  [127:64]\t| %b", dw_products[127:64]);
        $display("DW  [ 63:0 ]\t| %b", dw_products[63:0]);
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        $display("DUT [15:0]\t| %b %b %b %b (underflow overflow inexact invalid)",
            exceptions[15:12], exceptions[11:8], exceptions[7:4],
            exceptions[3:0]);
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        $display("DW  [15:0]\t| %b %b %b %b (underflow overflow inexact invalid)",
            dw_exceptions[15:12], dw_exceptions[11:8], dw_exceptions[7:4],
            dw_exceptions[3:0]);
        $write("----------------------------------------");
        $write("----------------------------------------\n");
    end
endtask
// Prints final report.
task print_report;
    begin
        $display("\n\n");
        $write("********************************************");
        $write("****************************************\n");
        $display("\nFINAL REPORT\n");
        print_format(format);
        print_mode(mode);
        $display("Input statistics");
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        $display("Total invalid inputs\t\t: %0d", i_nan);
        $display("Total zero inputs\t\t: %0d", i_zero);
        $display("Total infinity inputs\t\t: %0d", i_inf);
        $display("Total infinity times zero\t: %0d", i_inf_times_zero);
        $display("Total invalid times any number\t: %0d",
            i_nan_times_any);
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        $display("Total input vectors\t\t: %0d", VECTORS);
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        $display(" ");
        $display("Output statistics");
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        $display("Total overflowed products\t: %0d", i_ovf);
        $display("Total underflowed products\t: %0d", i_unf);
        $display("Total invalid products\t\t: %0d", i_inv);
        $display("Total inexact products\t\t: %0d", i_inx);
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        if (format == `FP64) begin
            // Times two because each product vector consists of
            // two products.
            $display("Total products\t\t\t: %0d", 2*i_total);
            //$display("Total products passed\t\t: %0d", 2*i_passed);
            //$display("Total products failed\t\t: %0d", 2*i_failed);
            $display("Total product vectors\t\t: %0d", i_total);
            $display("Total products vectors passed\t: %0d", i_passed);
            $display("Total products vectors failed\t: %0d", i_failed);
        end
        else begin
            // Times four because each product vector consists of
            // four products.
            $display("Total products\t\t\t: %0d", 4*i_total);
            //$display("Total products passed\t\t: %0d", 4*i_passed);
            //$display("Total products failed\t\t: %0d", 4*i_failed);
            $display("Total product vectors\t\t: %0d", i_total);
            $display("Total products vectors passed\t: %0d", i_passed);
            $display("Total products vectors failed\t: %0d", i_failed);
        end
        $write("----------------------------------------");
        $write("----------------------------------------\n");
        $display("\n\n");
        if (i_failed > 0) begin
            $display("Test finished without success!");
        end
        else begin
            $display("Test finished successfully!");
        end
        $display(" ");
        $write("****************************************");
        $write("*****************************************\n");
    end
endtask
task print_header;
    begin
        $write("========================================");
        $write("========================================\n");
        $display("Time (cycle)\t| Signal\t\t| Event");
        $write("========================================");
        $write("========================================\n");
    end
endtask
// Prints rounding mode.
//
task print_mode;
    input [1:0] mode;
    begin
        case (mode)
            `EVEN: begin
                $display("Rounding mode\t\t\t: Round-to-nearest even\n");
            end
            `PINF: begin
                $display("Rounding mode\t\t\t: Round-to-positive infinity\n");
            end
            `NINF: begin
                $display("Rounding mode\t\t\t: Round-to-negative infinity\n");
            end
            `ZERO: begin
                $display("Rounding mode\t\t\t: Round-to zero\n");
            end
        endcase
    end
endtask
// Prints data format tested.
task print_format;
    input [5:0] data_format;
    begin
        if (data_format == `FP16) begin
            $display("Data format\t\t\t: 16-bit floating-point (FP16)");
        end
        else if (data_format == `FP32) begin
            $display("Data format\t\t\t: 32-bit floating-point (FP32)");
        end
        else if (data_format == `FP64) begin
            $display("Data format\t\t\t: 64-bit floating-point (FP64)");
        end
    end
endtask
```


## D.3 Switching Activity Simulation Source

```
// File......: vec_fp_mult_stimuli_tb.v
// Author.....: Espen Stenersen
// Date......: Tue Apr 29 14:19:15 CEST 2008
// Revision...: 1.0
// Description: Generates switching activity information to the
//
synopsys power analasys tools.
/
'include "defines.v"
'timescale 1ps/1ps
'define CLK_PERIOD 5000
module vec_fp_mult_stimuli_tb;
parameter FP16STEP = 4;
parameter FP32STEP = 4;
parameter FP64STEP = 2;
parameter FP16VECTORS = 100;
parameter FP32VECTORS = 0;
parameter FP64VECTORS = 0;
// wire(s)
wire [127:0] products;
wire [15:0] exceptions;
wire ready;
// reg(s)
reg start;
reg [127:0] vectors;
reg [1:0] format;
reg [1:0] mode;
reg [15:0] clear;
reg clk;
reg reset_n;
reg ['FP16W-1:0] fp16testmem [0:FP16VECTORS];
reg ['FP32W-1:0] fp32testmem [0:FP32VECTORS];
reg ['FP64W-1:0] fp64testmem [0:FP64VECTORS];
integer i_vec;
// Module instantiation.
//
    vec fp mult DUT
    (
        .start (start), // Input. Starts computation.
        .vectors (vectors), // Input. FP vectors to be computed.
        .format (format), // Input. Format of vectors.
        .mode (mode), // Input. Rounding mode.
        .clear (clear), // Input. Clears exceptions.
        products (products), // Output. Computed products.
        . exceptions (exceptions), // Output. Exceptions raised.
        .ready (ready), // Output. Output vector ready.
        .clk (clk),
```

```
);.reset_n (reset_n)
);
initial begin
    clk = 1; reset_n = 0; start = 0; vectors = 0; clear = 0;
    format = 0; möde = 'EVEN;
    'include "d1 tracefile.v"
    $dumpfile("toggle1_200_fp16.vcd");
    $readmemb("fp16testcases.txt", fp16testmem);
    $readmemb("fp32testcases.txt", fp32testmem);
    $readmemb("fp64testcases.txt", fp64testmem);
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    @ (posedge clk) reset_n = 1;
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    // Round-to-nearest even.
    for (i_vec = 0; i_vec < FP16VECTORS; i_vec = i_vec + FP16STEP)
        begin
        @ (posedge clk) begin
            format = `FP16;
            start = 1;
            vectors[1*`FP16W-1:0*`FP16W] <= fp16testmem[i_vec + 0];
            vectors[2*`FP16W-1:1*`FP16W] <= fp16testmem[i_vec + 1];
            vectors[3*`FP16W-1:2*`FP16W] <= fp16testmem[i_vec + 2];
            vectors[4*`FP16W-1:3*`FP16W] <= fp16testmem[i_vec + 3];
        end
    end
    for (i_vec = 0; i_vec < FP32VECTORS; i_vec = i_vec + FP32STEP)
        begin
        @ (posedge clk) begin
            format = `FP32;
            start = 1;
            vectors[1*`FP32W-1:0*`FP32W] <= fp32testmem[i_vec + 0];
            vectors[2*`FP32W-1:1*`FP32W] <= fp32testmem[i_vec + 1];
            vectors[3*`FP32W-1:2*`FP32W] <= fp32testmem[i_vec + 2];
            vectors[4*`FP32W-1:3*`FP32W] <= fp32testmem[i_vec + 3];
        end
    end
    for (i_vec = 0; i_vec < FP64VECTORS; i_vec = i_vec + FP64STEP)
        begin
        @ (posedge clk) begin
            format = `FP64;
            start = 1;
            vectors[1*`FP64W-1:0*`FP64W] <= fp64testmem[i_vec + 0];
            vectors[2*`FP64W-1:1*`FP64W] <= fp64testmem[i_vec + 1];
        end
    end
    // Round-to-positive infinity.
    mode = `PINF;
    for (i_vec = FP16VECTORS; i_vec < 2*FP16VECTORS; i_vec = i_vec +
        FP16STEP) begin
        @ (posedge clk) begin
            format = `FP16;
            start = 1;
            vectors[1*`FP16W-1:0*`FP16W] <= fp16testmem[i_vec + 0];
            vectors[2*`FP16W-1:1*`FP16W] <= fp16testmem[i_vec + 1];
            vectors[3*`FP16W-1:2*`FP16W] <= fp16testmem[i_vec + 2];
            vectors[4*`FP16W-1:3*`FP16W] <= fp16testmem[i_vec + 3];
        end
    end
    for (i_vec = FP32VECTORS; i_vec < 2*FP32VECTORS; i_vec = i_vec +
        FP32STEP) begin
        @ (posedge clk) begin
            format = `FP32;
            start = 1;
            vectors[1*`FP32W-1:0*`FP32W] <= fp32testmem[i_vec + 0];
            vectors[2*`FP32W-1:1*`FP32W] <= fp32testmem[i_vec + 1];
            vectors[3*`FP32W-1:2*`FP32W] <= fp32testmem[i_vec + 2];
            vectors[4*`FP32W-1:3*`FP32W] <= fp32testmem[i_vec + 3];
        end
    end
    for (i_vec = FP64VECTORS; i_vec < 2*FP64VECTORS; i_vec = i_vec +
        FP64STEP) begin
        @ (posedge clk) begin
            format = `FP64;
            start = 1;
            vectors[1*`FP64W-1:0*`FP64W] <= fp64testmem[i_vec + 0];
            vectors[2*`FP64W-1:1*`FP64W] <= fp64testmem[i_vec + 1];
        end
    end
    // Round-to-negative infinity.
    mode = `NINF;
    for (i_vec = 2*FP16VECTORS; i_vec < 3*FP16VECTORS; i_vec = i_vec
        + FP16STEP) begin
        @ (posedge clk) begin
            format = `FP16;
            start = 1;
            vectors[1*`FP16W-1:0*`FP16W] <= fp16testmem[i_vec + 0];
            vectors[2*`FP16W-1:1*`FP16W] <= fp16testmem[i_vec + 1];
            vectors[3*`FP16W-1:2*`FP16W] <= fp16testmem[i_vec + 2];
            vectors[4*`FP16W-1:3*`FP16W] <= fp16testmem[i_vec + 3];
        end
    end
    for (i_vec = 2*FP32VECTORS; i_vec < 3*FP32VECTORS; i_vec = i_vec
        + FP32STEP) begin
        @ (posedge clk) begin
            format = `FP32;
            start = 1;
            vectors[1*`FP32W-1:0*`FP32W] <= fp32testmem[i_vec + 0];
            vectors[2*`FP32W-1:1*`FP32W] <= fp32testmem[i_vec + 1];
            vectors[3*`FP32W-1:2*`FP32W] <= fp32testmem[i_vec + 2];
            vectors[4*`FP32W-1:3*`FP32W] <= fp32testmem[i_vec + 3];
        end
    end
    for (i_vec = 2*FP64VECTORS; i_vec < 3*FP64VECTORS; i_vec = i_vec
        + FP64STEP) begin
        @ (posedge clk) begin
            format = `FP64;
            start = 1;
            vectors[1*`FP64W-1:0*`FP64W] <= fp64testmem[i_vec + 0];
            vectors[2*`FP64W-1:1*`FP64W] <= fp64testmem[i_vec + 1];
        end
    end
    // Round-to-zero.
    mode = `ZERO;
    for (i_vec = 3*FP16VECTORS; i_vec < 4*FP16VECTORS; i_vec = i_vec
        + FP16STEP) begin
        @ (posedge clk) begin
            format = `FP16;
            start = 1;
            vectors[1*`FP16W-1:0*`FP16W] <= fp16testmem[i_vec + 0];
            vectors[2*`FP16W-1:1*`FP16W] <= fp16testmem[i_vec + 1];
            vectors[3*`FP16W-1:2*`FP16W] <= fp16testmem[i_vec + 2];
            vectors[4*`FP16W-1:3*`FP16W] <= fp16testmem[i_vec + 3];
        end
    end
    for (i_vec = 3*FP32VECTORS; i_vec < 4*FP32VECTORS; i_vec = i_vec
        + FP32STEP) begin
        @ (posedge clk) begin
            format = `FP32;
            start = 1;
            vectors[1*`FP32W-1:0*`FP32W] <= fp32testmem[i_vec + 0];
            vectors[2*`FP32W-1:1*`FP32W] <= fp32testmem[i_vec + 1];
            vectors[3*`FP32W-1:2*`FP32W] <= fp32testmem[i_vec + 2];
            vectors[4*`FP32W-1:3*`FP32W] <= fp32testmem[i_vec + 3];
        end
    end
    for (i_vec = 3*FP64VECTORS; i_vec < 4*FP64VECTORS; i_vec = i_vec
        + FP64STEP) begin
        @ (posedge clk) begin
            format = `FP64;
            start = 1;
            vectors[1*`FP64W-1:0*`FP64W] <= fp64testmem[i_vec + 0];
            vectors[2*`FP64W-1:1*`FP64W] <= fp64testmem[i_vec + 1];
        end
    end
    // Empty pipeline.
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    start = 0;
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
    @ (posedge clk) // wait cycle.
        $finish;
    end
    // Toggles clearing of exceptions.
    always @ (ready) begin
        if (ready == 1) begin
            clear = 16'b1111111111111111;
        end
        else begin
            clear = 16'b0000000000000000;
        end
    end
    // clock generator.
    always #(`CLK_PERIOD/2) clk = !clk;
endmodule // vec_fp_mult_stimuli_tb
```

