For general computational applications (i.e. not for special graphics cases), implementing double-precision operations using single-precision operations is considerably more complicated than implementing quadruple-precision operations using double-precision operations.

The reason is that it is not enough to extend the precision of the 32-bit FP numbers. The exponent range must also be extended. The standard double-precision numbers have an exponent range that is large enough to make underflow and overflow very unlikely in most algorithms. With the very small exponent range of FP32 numbers, underflow and overflow are very likely, and this must be corrected in any double-precision implementation.

So it is not enough to use two FP32 numbers to represent one FP64 number. One must either use a third number for the exponent, or at least one of the two 32-bit numbers must be an integer partitioned into exponent and significand parts. Both approaches lead to much more complex algorithms and a much worse speed ratio for FP64 implemented with FP32.

In physics there are many universal constants or material constants with magnitudes between 10^10 and 10^40, and their reciprocals are between 10^-10 and 10^-40. Some of these cannot be represented in single precision, while for the others one or two multiplications or divisions are enough to cause underflows and overflows. Such wide ranges are unavoidable in complex physical simulations, because they originate in the ratios between quantities at human or astronomical scales and quantities at atomic or molecular scales.

Single-precision values are perfectly adequate to represent the input data and the final results of any computation, because 24 bits is about the limit for any analog-to-digital or digital-to-analog conversion, and the exponent range is also sufficient for the physical quantities that can be measured directly. But when you simulate any semiconductor device, and even when you simulate just an electrical circuit with discrete components, intermediate results very frequently have values far outside the range that can be represented in single precision, even up to 10^60 or 10^-60. When computing a high-order polynomial in order to approximate the solution of some problem, some intermediate values may be even outside that range.
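
To make the range problem concrete, here is a minimal Python sketch rather than a definitive implementation. It emulates FP32 arithmetic by rounding every operand and result through `struct`, then evaluates a textbook pn-junction built-in voltage, V_bi = (kT/q) ln(N_A N_D / n_i^2), in both FP64 and emulated FP32. The formula, the doping values and the helper names (`f32`, `mul32`, `div32`, `built_in_voltage`) are assumptions chosen for illustration and do not come from the text above. The inputs and the final result fit easily in FP32, but the intermediate product N_A N_D does not. The last part shows that a pair of FP32 values gains precision but no exponent range.

```python
import math
import struct


def f32(x: float) -> float:
    """Round an FP64 value to the nearest FP32 value (emulates storing to a float)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # struct rejects finite values that do not fit in FP32;
        # IEEE 754 arithmetic would return an infinity instead.
        return math.copysign(math.inf, x)


def mul32(a: float, b: float) -> float:
    """FP32 multiply: round the operands, then round the product."""
    return f32(f32(a) * f32(b))


def div32(a: float, b: float) -> float:
    """FP32 divide (the caller must make sure b does not round to zero)."""
    return f32(f32(a) / f32(b))


# Illustrative values (assumptions, SI units).
K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C
H = 6.62607015e-34      # Planck constant, J*s
T = 300.0               # temperature, K
N_A = N_D = 1e24        # assumed doping densities, m^-3
N_I = 1e16              # assumed intrinsic carrier density, m^-3


def built_in_voltage(mul, div):
    """V_bi = (k*T/q) * ln(N_A*N_D / n_i^2), using the given multiply/divide."""
    thermal_voltage = div(mul(K_B, T), Q_E)
    ratio = div(mul(N_A, N_D), mul(N_I, N_I))   # N_A*N_D = 1e48 exceeds FP32
    return mul(thermal_voltage, math.log(ratio))


vbi64 = built_in_voltage(lambda a, b: a * b, lambda a, b: a / b)
vbi32 = built_in_voltage(mul32, div32)
h_squared32 = mul32(H, H)   # about 4.4e-67 in exact arithmetic

# A pair of FP32 values (hi + lo) adds precision but not exponent range.
x = 1.0 / 3.0
hi = f32(x)
lo = f32(x - hi)
single_error = abs(x - hi)
pair_error = abs(x - (hi + lo))
big_hi = f32(1e48)          # the high word alone already overflows

print("V_bi in FP64:", vbi64)
print("V_bi in FP32:", vbi32)
print("h*h  in FP32:", h_squared32)
print("error with one FP32:", single_error, "with two:", pair_error)
print("high word of 1e48:", big_hi)
```

In this sketch the FP64 path yields a built-in voltage somewhat below 1 V, while the emulated FP32 path overflows to infinity at N_A N_D and never recovers, and h*h silently underflows to zero. The two-word split of 1/3 is far more accurate than a single FP32 value, yet the high word of 1e48 is already infinite, which is why a software FP64 built on FP32 needs separate exponent handling.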