Characteristics · DoubleFloats.jl

This package provides extended precision versions of Float64, Float32, Float16.

type name	significand	exponent	◊	base type	significand	exponent
`Double64`	106 bits	11 bits	◊	`Float64`	53 bits	11 bits
`Double32`	48 bits	8 bits	◊	`Float32`	24 bits	8 bits
`Double16`	22 bits	5 bits	◊	`Float16`	11 bits	5 bits

Double64 is a magnitude ordered, nonoverlapping pair of Float64s
Double32 is a magnitude ordered, nonoverlapping pair of Float32s
Double16 is a magnitude ordered, nonoverlapping pair of Float16s

arithmetic (+, -, *, /) is exact
elementary functions are quite accurate
often better than C "double-double" libraries

For Double64 arguments within 0.0..2.0           except tan(x), cot(x) as they approach ±Inf

expect the abserr of elementary functions to be 1e-30 or less
expect the relerr of elementary functions to be 1e-28 or less

When used with reasonably sized values, expect successive DoubleFloat ops to add no more than 10⋅𝘂² to the cumulative relative error (𝘂 is the relative rounding unit, usually 𝘂 = eps(x)/2). Relative error can accrue steadily. After 100,000 DoubleFloat ops with reasonably sized values, the relerr could approach 100,000 * 10⋅𝘂². In practice these functions are considerably more resiliant: our algorithms come frome seminal papers and extensive numeric investigation.

            should you encounter a situation where either error grows strongly in one direction, please submit an issue