Most awk implementations use double-precision floating point to represent every kind of numeric value. This can be worrying when summing large numbers from very large log files: when is it safe to rely on awk's arithmetic, and when should one shell out to dc or bc for arbitrary-precision arithmetic?
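For concreteness, here is the pattern in question (the file name `numbers.txt` is a stand-in): a plain awk sum is exact only while the running total stays below 2^53, whereas routing the same column through bc keeps arbitrary precision:

```
# Sum column 1 in awk; exact only while the total stays below 2^53.
awk '{ sum += $1 } END { printf "%.0f\n", sum }' numbers.txt

# The same sum via bc, which is not limited by double precision.
awk '{ print $1 }' numbers.txt | paste -sd+ - | bc
```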
The easiest way to investigate loss of precision is to find out when some number N is no longer distinct from N+1:
awk 'BEGIN{for (i = 0; i < 64; i++) printf "%s\t%19.0f\t%s\n", i, 2^i, (((2^i+1) == (2^i))? "in" : "") "accurate"}'
This prints, for each power of two, the exponent i, the value 2^i, and whether 2^i is still distinguishable from 2^i+1.
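On an IEEE 754 machine the interesting transition happens at i = 53; the output will look roughly like this (the exact spacing comes from the %19.0f format):

```
...
51       2251799813685248       accurate
52       4503599627370496       accurate
53       9007199254740992       inaccurate
54      18014398509481984       inaccurate
...
```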
The largest reliable value that this test finds, for my instance of gawk 3.1.5 running under 32-bit Linux, is 2^53-1. The 53 is the 52-bit size of the fraction field of an IEEE 754 double, plus 1 for the implicit leading bit that every normalized value gets (explained below).
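That boundary is easy to spot-check directly; any awk built on IEEE 754 doubles should print the same two answers:

```
awk 'BEGIN {
  print (2^53 - 1 == 2^53)     # 0: still distinguishable
  print (2^53     == 2^53 + 1) # 1: collapsed into one representation
}'
```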
IEEE 754 double precision floating point numbers are formatted thusly:
1 bit | 11 bits | 52 bits |
---|---|---|
sign | exponent | fraction |
Note that it says "fraction" above, not "mantissa". This is because the fraction field is interpreted differently in different circumstances.
If all of the exponent bits are 0, there is no implicit leading bit: the fraction field is taken as a plain 52-bit unsigned value scaled by the smallest exponent, giving the subnormal numbers. (Unsigned, because the sign bit supplies the overall sign; yes, this means there's both +0 and -0. Thanks, IEEE!) If the exponent field has any non-zero bits, the value is assumed to have been normalized so that the highest bit of the mantissa is 1. Since that highest bit is always 1, there's no need to actually store it, so the 52 stored fraction bits provide 53 significant bits, and every integer up to 2^53 can be represented exactly. One step beyond that, however, the gap between adjacent doubles grows to 2, and N and N+1 get encoded into the same representation: 2^53 and 2^53+1 both round to the same value.
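You can watch the spacing change directly in awk: just below 2^53 consecutive integers are distinct, but above it only every second integer is representable, so adding 1 is silently lost while adding 2 lands on the next representable double:

```
awk 'BEGIN {
  printf "%.0f\n", 2^53 - 1   # 9007199254740991: still exact
  printf "%.0f\n", 2^53 + 1   # 9007199254740992: rounded back to 2^53
  printf "%.0f\n", 2^53 + 2   # 9007199254740994: the gap is now 2
}'
```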
The following table shows the in-memory representation of several illustrative values, with the sign bit and the 11-bit biased exponent printed as three hex digits and the 52-bit fraction as thirteen:

value | sign+exponent | fraction |
---|---|---|
2^51 | 432 | 0000000000000 |
2^52 | 433 | 0000000000000 |
2^53-1 | 433 | FFFFFFFFFFFFF |
2^53 | 434 | 0000000000000 |
2^53+1 | 434 | 0000000000000 |
Notice how the last two rows are identical (9007199254740992 in decimal)? Starting at 2^53, you can no longer tell from the stored value which integer was actually intended. You lose precision.
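This is precisely the failure mode to worry about when summing a huge log file: once an accumulator reaches 2^53, adding 1 no longer changes it at all. A minimal sketch:

```
awk 'BEGIN {
  x = 2^53
  for (i = 0; i < 1000; i++) x += 1   # a thousand increments...
  printf "%.0f\n", x                  # ...and x is still 9007199254740992
}'
```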
What are floating point numbers?
Not all numbers can be represented exactly in floating point, even ones that look short and innocent in decimal.
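The decimal fraction 0.1 is the classic example: it has no finite binary expansion, so awk stores and computes with the nearest double instead:

```
awk 'BEGIN {
  printf "%.20f\n", 0.1      # 0.10000000000000000555...: the stored approximation
  print (0.1 + 0.2 == 0.3)   # 0: the two sides round differently
}'
```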