Fused floating-point arithmetic for application specific processors

Min, Jae Hong

Fused floating-point arithmetic for application specific processors

Date

2013-12

Authors

Min, Jae Hong

Abstract

Floating-point computer arithmetic units are used for modern-day computers for 2D/3D graphic and scientific applications due to their wider dynamic range than a fixed-point number system with the same word-length. However, the floating-point arithmetic unit has larger area, power consumption, and latency than a fixed-point arithmetic unit. It has become a big issue in modern low-power processors due to their limited power and performance margins. Therefore, fused architectures have been developed to improve floating-point operations. This dissertation introduces new improved fused architectures for add-subtract, sum-of-squares, and magnitude operations for graphics, scientific, and signal processing.
A low-power dual-path fused floating-point add-subtract unit is introduced and compared with previous fused add-subtract units such as the single path and the high-speed dual-path fused add-subtract unit. The high-speed dual-path fused add-subtract unit has less latency compared with the single-path unit at a cost of large power consumption. To reduce the power consumption, an alternative dual-path architecture is applied to the fused add-subtract unit. The significand addition, subtraction and round units are performed after the far/close path. The power consumption of the proposed design is lower than the high-speed dual-path fused add-subtract unit at a cost in latency; however, the proposed fused unit is faster than the single-path fused unit. High-performance and low-power floating-point fused architectures for a two-term sum-of-squares computation are introduced and compared with discrete units. The fused architectures include pre/post-alignment, partial carry-sum width, and enhanced rounding. The fused floating-point sum-of-squares units with the post-alignment, 26 bit partial carry-sum width, and enhanced rounding system have less power-consumption, area, and latency compared with discrete parallel dot-product and sum-of-squares units. Hardware tradeoffs are presented between the fused designs in terms of power consumption, area, and latency. For example, the enhanced rounding processing reduces latency with a moderate cost of increased power consumption and area. A new type of fused architecture for magnitude computation with less power consumption, area, and latency than conventional discrete floating-point units is proposed. Compared with the discrete parallel magnitude unit realized with conventional floating-point squarers, an adder, and a square-root unit, the fused floating-point magnitude unit has less area, latency, and power consumption. The new design includes new designs for enhanced exponent, compound add/round, and normalization units. In addition, a pipelined structure for the fused magnitude unit is shown.