### Introduction to Scientific Programming: Computational Problem Solving Using Maple and C / Mathematica and C

Author: Joseph L. Zachary

Online Resources: Maple/C Version, Mathematica/C Version

# Floating-Point Number Tutorial

In this tutorial we will explore the nature of floating-point numbers, as explained in Chapter 2. The tutorial will help you understand the significance of mantissa size and exponent range and the meaning of underflow, overflow, and roundoff error.

## Simulation

We will be using a floating-point number simulator throughout this tutorial. You can start it by clicking on the following button.

[Java applet: floating-point number simulator]

As discussed in Chapter 2, a floating-point number system is characterized by

• a maximum mantissa size (Digits) and
• a range in which exponents must lie ([minexp..maxexp]).

In such a system, the positive floating-point numbers consist of all real numbers that can be written in the form

$$m \times 10^{e}$$

where

• 1 <= m < 10,
• m consists of Digits digits, and
• e is an integer and minexp <= e <= maxexp.

The system also contains the corresponding negative numbers and zero.

The floating-point number simulator allows you to design your own floating-point number systems. When the simulator first appears, it will be simulating a system with a maximum mantissa size of 1, a minimum exponent of -1, and a maximum exponent of 1. You can verify this by looking at the three controls at the top of the simulator window.

The simulator also displays a list of the valid mantissas, and both a list and a graph of the valid positive floating-point numbers. Finally, the simulator allows you to calculate the values of arithmetic expressions in the floating-point system. (Initially, it displays the result of adding 2 and 3.)

Locate all of these pieces of information in the simulator window before going on.

## Mantissas and Exponents

With only one-digit mantissas permitted, the simulator reveals that there are only nine valid mantissas. Because there are three possible choices for an exponent, however, there are 27 different positive floating-point numbers. For example, suppose that we choose 3 as a mantissa. We can then make the numbers .3 (using an exponent of -1), 3 (using an exponent of 0), and 30 (using an exponent of 1).
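The count can be checked by enumerating the system directly. The following Python sketch is not part of the simulator; the function name `positive_floats` and its parameters are hypothetical, chosen to mirror the Digits, minexp, and maxexp controls. It uses the standard `decimal` module so that values such as .3 are held exactly rather than as binary approximations.

```python
from decimal import Decimal

def positive_floats(digits, minexp, maxexp):
    """Return every positive number m * 10**e in the system, in increasing order.

    Hypothetical helper: digits is the maximum mantissa size and
    minexp..maxexp is the permitted exponent range.
    """
    numbers = []
    for e in range(minexp, maxexp + 1):
        # Mantissas 1.0...0 through 9.9...9, written as integers with `digits` digits.
        for k in range(10 ** (digits - 1), 10 ** digits):
            # k * 10**(e - digits + 1) equals m * 10**e with m = k / 10**(digits - 1).
            numbers.append(Decimal(k).scaleb(e - digits + 1))
    return numbers

system = positive_floats(digits=1, minexp=-1, maxexp=1)
print(len(system))  # 27
# The three numbers built from the mantissa 3:
print([format(x, "f") for x in system if x in (Decimal("0.3"), 3, 30)])
```

Calling the function with other arguments lists the numbers of any other system you set up in the simulator.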

The floating-point numbers between .1 and .9 are separated by intervals of .1; the floating-point numbers between 1 and 9 are separated by intervals of 1; and the floating-point numbers between 10 and 90 are separated by intervals of 10. This is reflected by the graph at the bottom of the simulator.

You can zoom in on the graph by clicking the button on the left end of the graph. If you zoom in once or twice you can get a clearer picture of what's going on between 0 and 1. You can zoom out by clicking the button on the right end of the graph. You can also use the mouse to select a region of the graph that you would like to examine in more detail.

Let's modify the floating-point number system being simulated and see how that changes the collection of available numbers. Use the controls at the top of the window to change the minimum exponent to -2 and the maximum exponent to 2. If you can't figure out how to do this, or if what you do doesn't seem to work, click below and it will be done automatically.

Increasing the range of the exponents increases the range of the positive floating-point numbers. In addition to the 27 numbers that we had before, we now have nine numbers between .01 and .09 and nine numbers between 100 and 900. But notice that although we have enlarged the range of our floating-point numbers, we have not changed their spacing. Between .1 and 90, the numbers are exactly as they were before.

We can increase the detail of our numbers by increasing the maximum mantissa size. Use the control at the top of the window to increase the maximum mantissa size to 2. As before, if you can't figure out how to do this, or if what you do doesn't work properly, click below and it will be done automatically.

Notice what has and has not changed. (It may help to use the two buttons above to quickly switch back and forth between the simulations of one-digit and two-digit mantissas.)

• The range of the numbers has changed only slightly. Before they ranged from .01 to 900; now they range from .01 to 990.

• With one-digit mantissas, the possible mantissas ranged from 1 to 9 in increments of 1. With two-digit mantissas, the possible mantissas range from 1 to 9.9 in increments of .1.

• The gaps between the positive floating-point numbers are now ten times smaller. For example, with one-digit mantissas the numbers between 10 and 90 occurred in steps of 10; now they occur in steps of 1.
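These observations follow from two closed-form expressions: a system has 9 × 10^(Digits-1) mantissas for each exponent, and neighboring numbers with exponent e lie 10^(e-Digits+1) apart. A minimal Python sketch, with hypothetical function names `count_positive` and `gap`, compares the one-digit and two-digit systems with exponents from -2 to 2:

```python
from fractions import Fraction

def count_positive(digits, minexp, maxexp):
    """Number of positive floating-point numbers in the system (hypothetical helper)."""
    mantissas = 9 * 10 ** (digits - 1)
    return mantissas * (maxexp - minexp + 1)

def gap(digits, e):
    """Spacing between neighboring numbers whose exponent is e."""
    return Fraction(10) ** (e - digits + 1)

for digits in (1, 2):
    # Columns: mantissa size, how many positive numbers, spacing between 10 and 90.
    print(digits, count_positive(digits, -2, 2), gap(digits, 1))
# prints "1 45 10" and then "2 450 1"
```

Each extra mantissa digit multiplies the number of values by ten and divides every gap by ten, while the exponent range alone decides how far the numbers extend.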

There are two aspects of the graph that you should notice.

• Zoom out so that you can see the region between 0 and 10,000. You will notice that beyond approximately 1000, there are no floating-point numbers whatsoever.

• Now zoom in so that you can see the region between 0 and .1 in detail. You will notice that between 0 and .01, there is a noticeable gap where there are no floating-point numbers whatsoever. This is called the hole at zero.

Let's try to fill the hole at zero by making the minimum exponent -3 instead of -2. You can make the change yourself, or click below and it will be done automatically.

If you zoom in close to zero, you will discover that while the hole at zero is now smaller, it is still there. No matter how small an exponent we allow, there will always be a range of numbers close to zero for which no good approximation exists in our floating-point number system. Similarly, no matter how large an exponent we allow, there will always be a range of large numbers for which no good approximation exists.
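Both limits can be written down from the definition of the number system. As a worked formula (the symbols $x_{\min}$ and $x_{\max}$ are notation introduced here for illustration, not taken from the text), the smallest and largest positive floating-point numbers are

$$
x_{\min} = 1 \times 10^{\text{minexp}}, \qquad
x_{\max} = \left(10 - 10^{\,1-\text{Digits}}\right) \times 10^{\text{maxexp}}.
$$

With Digits = 2, minexp = -3, and maxexp = 2, this gives $x_{\min} = 0.001$ and $x_{\max} = 9.9 \times 10^{2} = 990$. Because $10^{\text{minexp}} > 0$ for every integer minexp, the hole between 0 and $x_{\min}$ shrinks as minexp decreases but never closes. Likewise, $x_{\max}$ is finite for every maxexp.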

## Exercises

1. Try to get a feel for how the maximum mantissa size and permissible exponent range control what floating-point numbers exist and how they are distributed. To do this, you will need to adjust the controls at the top of the simulator window. (The graph at the bottom of the simulator is sometimes slow to redisplay. If you get impatient, you can interrupt it by using the Halt menu at the top of the simulator window.)

2. Below are several sets of numbers. For each, what is the simplest floating-point number system that contains each number in the set? Use the floating-point simulator to verify your answers.
• {5, 50, 500, 5000}
• {4, 4.4, 4.44}
• {4, .4, .04, .004}
• {3, 33, 333, 3333}

## Overflow, Underflow, and Roundoff Error

Now let's turn our attention to what happens when we do arithmetic calculations with floating-point numbers. Let's return to a simple system with one-digit mantissas and exponents ranging from -1 to 1. Either do it yourself, or click below.

We will now begin using the expression input region, which should be displaying the result of calculating 2+3. Delete the contents of the input region and enter the sum "5 + 7" instead. When you hit the return key, the sum of the two numbers in the current floating-point system will be displayed.

The sum that is displayed is 10 instead of 12. Because 12 is not a valid floating-point number in the current system, it is rounded to the nearest number that is a floating-point number. In this case, the closest floating-point number to the sum is 10. (Notice that the expression status region reports that one rounding was performed when computing the sum.) The error in the result that is caused by rounding is called roundoff error.

Now increase the mantissa size from 1 to 2. This time the sum will be displayed as 12, and the status area will report that no roundings were performed. This is because, in the new floating-point system with two digits of mantissa, 12 is a valid floating-point number.

Of course, it is possible to observe roundoff error even in this new system. For example, try computing the quotient of 1 and 8. You will see that the true answer, 0.125, is rounded to the two-digit 0.13. If you increase the mantissa size to 3 digits, the exact three-digit answer will be displayed.
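The rounding results above (5 + 7 with one and two digits, 1 / 8 with two and three digits) can be reproduced with a short Python sketch. It is an illustration rather than the simulator's actual code: the function name `round_to_digits` is hypothetical, and rounding halves away from zero is an assumption inferred from 0.125 becoming 0.13. The exponent range is deliberately ignored here.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_to_digits(x, digits):
    """Round x to `digits` significant decimal digits.

    Returns (rounded value, True if rounding changed the value).
    Ties round away from zero (assumed, to match 0.125 -> 0.13).
    """
    x = Decimal(x)
    if x == 0:
        return x, False
    # adjusted() is the exponent e when x is written as m * 10**e with 1 <= |m| < 10.
    quantum = Decimal(1).scaleb(x.adjusted() - digits + 1)
    rounded = x.quantize(quantum, rounding=ROUND_HALF_UP)
    return rounded, rounded != x

total = Decimal(5) + Decimal(7)        # exact sum, 12
quotient = Decimal(1) / Decimal(8)     # exact quotient, 0.125

for value, digits in [(total, 1), (total, 2), (quotient, 2), (quotient, 3)]:
    rounded, was_rounded = round_to_digits(value, digits)
    print(value, digits, format(rounded, "f"), was_rounded)
```

The second element of the result plays the role of the simulator's status report: it is true exactly when the true answer is not a valid number at the given mantissa size.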

Let's return to the simple number system. Reduce the mantissa size to 1.

Now calculate the sum of 50 and 70. This time, the status area will report that the true sum, 120, is too large. When the result of a calculation is too large to represent in a floating-point number system, we say that overflow has occurred. This happens when a number's exponent is larger than allowed. For example, the exponent of 120 is 2 (in other words, 120 = 1.2e2), and in the current floating-point system no exponents larger than 1 are allowed.

Increase the mantissa size from 1 to 2. You will notice that the result is still too large to represent as a floating-point number. The mantissa size has no effect on overflow errors, only on roundoff errors.

Now increase the maximum exponent from 1 to 2. This time the correct sum will be displayed. This, of course, is because the exponent (2) of the sum is now within the maximum allowed.

Just as overflow occurs when an exponent is too big, underflow occurs when an exponent is too small. To see it, try calculating the quotient of 1 and 50. This time the simulation complains that the quotient (0.02) is too small, which is an underflow error. The problem is that the exponent of 0.02 is -2, which is less than the permitted -1. If you decrease the minimum exponent to -2, the underflow error will be eliminated.
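Overflow and underflow both come down to comparing the exponent of the true result with the permitted range. The Python sketch below makes that comparison explicit. The function name `classify` is hypothetical, and the sketch is a simplification: it inspects the exact result only, so it does not cover a value that rounding would carry into the next exponent.

```python
from decimal import Decimal

def classify(x, minexp, maxexp):
    """Report whether the exponent of x lies in minexp..maxexp.

    Hypothetical helper; the mantissa size is not a parameter because
    it plays no part in this check.
    """
    x = Decimal(x)
    if x == 0:
        return "ok"
    e = x.adjusted()  # exponent of x written as m * 10**e with 1 <= |m| < 10
    if e > maxexp:
        return "overflow"
    if e < minexp:
        return "underflow"
    return "ok"

print(classify(Decimal(50) + Decimal(70), -1, 1))  # overflow: 120 = 1.2e2
print(classify(Decimal(50) + Decimal(70), -1, 2))  # ok
print(classify(Decimal(1) / Decimal(50), -1, 2))   # underflow: 0.02 = 2e-2
print(classify(Decimal(1) / Decimal(50), -2, 2))   # ok
```

Leaving the mantissa size out of the function mirrors the experiment above, in which changing the mantissa size from 1 to 2 did not remove the overflow.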

## Exercises

1. Use the simulator to get a better feel for what roundoff error, overflow, and underflow are. Experiment with different mantissa sizes, exponent ranges, and arithmetic calculations.

2. Below are several arithmetic calculations. What is the simplest floating-point number system in which each calculation can be performed without underflow, overflow, or roundoff error occurring? Use the floating-point simulator to verify your answers.
• 2.2 + 3.45
• 1 / 80
• 4e2 + 7e2
• 1e1 + 1e-1