M episode10/README.md +25 -24
@@ 75,14 75,10 @@ copy as is and just chop off the end bec
the first few bits. We add the following to the second switch statement:
func (e *Encoder) writeFloat(input float64) error {
+ switch {
...
- var (
- exp, frac = unpackFloat64(input)
- trailingZeros = bits.TrailingZeros64(frac)
- )
- ...
- switch {
case math.IsNaN(input):
+ var _, frac = unpackFloat64(input)
var f = frac >> (float64FracBits - float16FracBits)
return e.writeFloat16(math.Signbit(input), 1<<float16ExpBits-1, f)
...
@@ 140,44 136,47 @@ between subnormal numbers and regular floating point numbers is the fractional’s
prefix. The regular ones are prefixed with a 1, while the subnormal ones have a
0 as prefix.
-Here’s the formula for regular floating point numbers:
+Here’s the formula for regular 16-bit floating point numbers:
> (−1)<sup>signbit</sup> × 2<sup>exponent−15</sup> × 1.significantbits<sub>2</sub>
-With subnormal numbers the formula turns into:
+With 16-bit subnormal numbers the formula turns into:
> (−1)<sup>signbit</sup> × 2<sup>−14</sup> × 0.significantbits<sub>2</sub>
-!!WIP!!
-
-Subnormal numbers don’t start with a 1, but with a 0. This means we can
-represent number with exponents lower that -14 with subnormal numbers by
-shifting the bits to the left. We’ll use the smallest 16 bits subnormal number:
-5.960464477539063e-8 as a example. Its regular floating point representation is:
+Because subnormal numbers start with a 0, we can represent numbers with exponents
+lower than -14 by shifting the fractional part’s bits to the left. Let’s study
+the smallest 16-bit subnormal number, 5.960464477539063e-8, as an example. Its
+regular floating point binary representation is:
> 2<sup>-24</sup> × 1.0000000000<sub>2</sub>
-The fractional part is all zeros and the exponent is -24. How can we represent
-it as a 16 bits floating point number when the exponent is set to -14 and can’t
-be changed? We shift the fractional part to the left and lower the exponent by
-the same amount. Every time we shift left the fractional part by 1 bit it’s
+Its fractional part is all zeros and its exponent is -24. How can we represent
+a number with a -24 exponent when 16-bit subnormal numbers have their exponent
+set to -14? By shifting the fractional part to the left.
+Every time we shift the fractional part left by 1 bit it’s
equivalent to lowering the exponent by 1.
-For our example we shift the fractional part by 10 bits, which is equivalent to
-lowering the exponent by 10 to -24:
+
+Shifting by several bits lowers the exponent by
+the same amount.
+
+In our example the fractional part is ‘shifted’ by 10 bits because it’s all
+zeros; this is equivalent to lowering the exponent by 10, from -14 to -24:
> 2<sup>-24</sup> × 1.0000000000<sub>2</sub> = 2<sup>-14</sup> ×
> 0.0000000001<sub>2</sub>
As long as we can shift the fractional part to the left without dropping any 1’s
-we can represent the number as a 16 bits subnormal number.
+we can represent the number as a 16-bit subnormal number, provided its exponent
+is between -24 and -14.
-In summary to encode float16 subnormal numbers we have to:
+So to encode float16 subnormal numbers we have to:
1. Check the exponent and the number of trailing zeros to ensure we can encode
the number without losing precision
-2. Add a trailing 1 at the head of the fractional, since subnormal numbers don’t
- have the leading 1 like regular numbers do
+2. Add an explicit leading 1 at the head of the fractional part, since a subnormal
+ number’s fractional part isn’t prefixed with a 1
3. Shift the fractional part to match the number’s exponent
It turns out that the smallest possible 16-bit subnormal number is one of the
@@ 187,6 186,8 @@ example in the CBOR spec. We add it to t
{Value: 5.960464477539063e-8, Expected: []byte{0xf9, 0x00, 0x01}},
...
+!!WIP!!
+
To check if we have a number that we can encode as a subnormal number we add a
predicate function subnumber(). This function takes two parameters: the
exponent, and the number of trailing zeros in the fractional part. It then
M episode10/cbor.go +4 -3
@@ 184,6 184,10 @@ func (e *Encoder) writeFloat(input float
return e.writeFloat16(math.Signbit(input), 0, 0)
case math.IsInf(input, 0):
return e.writeFloat16(math.Signbit(input), 1<<float16ExpBits-1, 0)
+ case math.IsNaN(input):
+ var _, frac = unpackFloat64(input)
+ var f = frac >> (float64FracBits - float16FracBits)
+ return e.writeFloat16(math.Signbit(input), 1<<float16ExpBits-1, f)
}
var (
exp, frac = unpackFloat64(input)
@@ 193,9 197,6 @@ func (e *Encoder) writeFloat(input float
trailingZeros = float64FracBits
}
switch {
- case math.IsNaN(input):
- var f = frac >> (float64FracBits - float16FracBits)
- return e.writeFloat16(math.Signbit(input), 1<<float16ExpBits-1, f)
case float16MinBias <= exp && exp <= float16MaxBias && trailingZeros >= float16MinZeros:
return e.writeFloat16(math.Signbit(input), uint16(exp+float16ExpBias), frac)
case subnumber(exp, trailingZeros):