Characteristics

This package provides extended precision versions of Float64, Float32, Float16.


type name   significand   exponent   base type   significand   exponent
Double64    106 bits      11 bits    Float64     53 bits       11 bits
Double32    48 bits       8 bits     Float32     24 bits       8 bits
Double16    22 bits       5 bits     Float16     11 bits       5 bits

Double64 is a magnitude ordered, nonoverlapping pair of Float64s

Double32 is a magnitude ordered, nonoverlapping pair of Float32s

Double16 is a magnitude ordered, nonoverlapping pair of Float16s
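To make "magnitude ordered, nonoverlapping pair" concrete, here is a minimal Python sketch that works on plain Python floats (IEEE binary64, the same format as Float64). It is not this package's code: the names `two_sum` and `is_nonoverlapping` are hypothetical, and the error-free addition shown is Knuth's classic two-sum, used here only as an illustration of how such a pair can arise.

```python
import math
from fractions import Fraction


def two_sum(a, b):
    """Knuth's two-sum: return (hi, lo) where hi is the rounded sum a + b
    and lo is the rounding error, so that hi + lo equals a + b exactly."""
    hi = a + b
    b_virtual = hi - a
    a_virtual = hi - b_virtual
    lo = (a - a_virtual) + (b - b_virtual)
    return hi, lo


def is_nonoverlapping(hi, lo):
    """Illustrative check: lo is zero, or no larger in magnitude than half
    a unit in the last place of hi."""
    return lo == 0.0 or abs(lo) <= math.ulp(hi) / 2


# 2**-60 is too small to survive in the rounded sum, so it lands in lo.
hi, lo = two_sum(1.0, 2.0**-60)   # hi == 1.0, lo == 2.0**-60

# Exact rational arithmetic confirms that the pair loses nothing.
exact = Fraction(1.0) + Fraction(2.0**-60)
pair_is_exact = Fraction(hi) + Fraction(lo) == exact
```

The high part carries the leading 53 bits and the low part carries what rounding would otherwise discard, which is why a pair of Float64s can hold about 106 significand bits.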



For Double64 arguments within 0.0..2.0, except tan(x) and cot(x) as they approach ±Inf


When used with reasonably sized values, expect each successive DoubleFloat operation to add no more than 10⋅𝘂² to the cumulative relative error (𝘂 is the relative rounding unit, usually 𝘂 = eps(x)/2). Relative error can accrue steadily: after 100,000 DoubleFloat operations with reasonably sized values, the relative error could approach 100,000 * 10⋅𝘂². In practice these functions are considerably more resilient: our algorithms come from seminal papers and extensive numeric investigation.

Should you encounter a situation where either error grows strongly in one direction, please submit an issue.
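The worst-case accumulation described above is simple arithmetic, and a short Python sketch can put a number on it. The sketch assumes that 𝘂 is the unit roundoff of the Float64 base type, 𝘂 = eps/2 = 2⁻⁵³, which is an interpretation for the Double64 case rather than something stated here. The function names are hypothetical.

```python
import sys

# Assumption: u is the unit roundoff of Float64, i.e. eps / 2 = 2**-53.
U_FLOAT64 = sys.float_info.epsilon / 2


def per_op_bound(u):
    """Worst-case relative error added by one operation: 10 * u**2."""
    return 10 * u * u


def accumulated_bound(n_ops, u):
    """Worst-case relative error after n_ops operations, assuming every
    operation contributes its full bound in the same direction."""
    return n_ops * per_op_bound(u)


single = per_op_bound(U_FLOAT64)                # about 1.2e-31
worst = accumulated_bound(100_000, U_FLOAT64)   # about 1.2e-26
```

Under that assumption, even the pessimistic bound after 100,000 operations stays about ten orders of magnitude below the relative precision of a single Float64 (about 1.1e-16).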