Compare radix sort and Timsort performance on doubles between 0.0 and 1.0
`fixed copy&paste typo`
`changed version from 1.0 to 1.0`
`updated the documentation`

tip
browse log

### clone

read-only
https://hg.sr.ht/~dalke/radix_sort_benchmark
read/write
ssh://hg@hg.sr.ht/~dalke/radix_sort_benchmark
```       radix_sort_benchmark 1.1 -- 18 December 2020

Compare various sort methods when using a distribution of a relatively
small number of doubles between 0.0 and 1.0.

NOTE: this program does not support negative values!

The comparison-base sort methods are the Tim sort, heap sort,
quicksort, and merge sort implementations from sort.h , copied from
https://github.com/swenson/sort/ and distributed under the MIT
license.

The radix sort implementation, radixsort.h, is derived from NumPy's
numpy/core/npysort/radixsort.c.src , modified to work more like
sort.h.

There are three variants of radix sort, designed for the limited range
of double values between 0.0 and 1.0 from computing the small
rationals  a/b where 0 <= a <= b <= N and b != 0.  These are
the values in the Farey sequence for size N.

The implementations are:
farey65537 - Farey(65537); uses three 12-bit histogram tables.
farey23204 - Farey(23204); uses three 11-bit histogram tables.
farey1480 - Farey(1480); uses two 12-bit histogram tables.

NOTE: these "farey*" functions will likely not sort other values
correctly.

The timing driver is radix_sort_benchmark.c , which depends on sort.h,
radixsort.h, radixsort_farey65537.h, radixsort_farey23204.h and
radixsort_farey1480.h.

To compile the radix_sort_benchmark binary, do:

make

When the program runs, it initializes a histogram of double values and
their counts, either from a file or from a uniform distribution
between 0.0 and 1.0. A set of --sample doubles are randomly sampled,
weighted by count, to create an unordered array. This is then sorted
using the two selected sort methods.

(If two methods are compared then a copy is made of the original
(unsorted) data, and the execution order of the two sort functions
swapped.)

The default uses a histogram of uniformally distributed counts from
0.0 to 1.0 in steps of 1/1,000,000 (specified by --uniform
1000001). Use --uniform N to specify values 0.0, 1/(N-1), 2/(N-1),
..., (N-2)/(N-1), 1.0. Each value has a weight of 1.

Use --histogram or -f to read the histogram information from a
file. That format is described below.

The histogram is used to make a weighted sample of size --sample,
which defaults to 1,000,000. The program then times the two selected
sort functions, and reports the results.

By default the timings is done only once. Use --repeat N to have it
make a new sample and corresponding timings N times.

The option --sort1 and --sort2 tell the program which sort functions
to compare. By default these are 'tim' and 'radix'. Use 'none' to time
only one function. This is useful if you want to see if timings for a
given implementation have improved, or see if the sort execution order
or memory use order makes a difference.

Histogram file format
====================

The histogram file format is:

1. The 8 bytes of the magic string "#hist/1\n".

2. The uint32_t value "8", which is the number of bytes in a double.

3. The uint32_t value for the number of records (doubles) in the
histogram. Call it "num_records"

4. A sequence of "num_records" uint32_t values, which is the weight
for each double. (They should all be 1 for uniform sample weighting.)

5. A sequence of "num_records" double values, which are copied
into the vector of double values to sort.

The program "make_farey_histogram.py" will create a histogram
containing all the values of the Farey sequence N, with a uniform
distribution. The default value of N "1024" creates the file
"farey_1024.hist". Use "--output" to specify an alternate name.

Examples
========

1) Repeat the uniform benchmark 6 times (using 1,000,001 elements
sampled from the values 0.0, 1.0/1000000, 2.0/1000000, ... 1.0, with
equal sample weights):

% ./radix_sort_benchmark --repeat 6
0: 1000000 tim: 111151 radix: 23387 ratio: 4.753
1: 1000000 tim: 106555 radix: 17408 ratio: 6.121
2: 1000000 tim: 112764 radix: 15980 ratio: 7.057
3: 1000000 tim: 111826 radix: 18868 ratio: 5.927
4: 1000000 tim: 112968 radix: 16153 ratio: 6.994
5: 1000000 tim: 113387 radix: 16913 ratio: 6.704
total tim: 668651 / 6  average: 111441.833 microseconds
total radix: 108709 / 6  average: 18118.167 microseconds
final ratio: 6.151

2) Use 10M values instead of the default of 1M values:

% ./radix_sort_benchmark --repeat 6 --sample 10000000
0: 10000000 tim: 1195228 radix: 234226 ratio: 5.103
1: 10000000 tim: 1263916 radix: 204892 ratio: 6.169
2: 10000000 tim: 1259000 radix: 203893 ratio: 6.175
3: 10000000 tim: 1203476 radix: 190616 ratio: 6.314
4: 10000000 tim: 1226246 radix: 199124 ratio: 6.158
5: 10000000 tim: 1189984 radix: 190335 ratio: 6.252
total tim: 7337850 / 6  average: 1222975.000 microseconds
total radix: 1223086 / 6  average: 203847.667 microseconds
final ratio: 5.999

3) The first timing is consistently an outlier. Swap the execution
order to see if that changes anything:

% ./radix_sort_benchmark --repeat 6 --sample 10000000 --sort1 radix --sort2 tim
0: 10000000 radix: 226493 tim: 1219231 ratio: 0.186
1: 10000000 radix: 198439 tim: 1217108 ratio: 0.163
2: 10000000 radix: 188666 tim: 1207624 ratio: 0.156
3: 10000000 radix: 195995 tim: 1186020 ratio: 0.165
4: 10000000 radix: 189485 tim: 1194760 ratio: 0.159
5: 10000000 radix: 192722 tim: 1193631 ratio: 0.161
total radix: 1191800 / 6  average: 198633.333 microseconds
total tim: 7218374 / 6  average: 1203062.333 microseconds
final ratio: 0.165

(It doesn't.)

4) Compare radix sort to itself:

% ./radix_sort_benchmark --repeat 6 --sample 10000000 --sort1 radix --sort2 radix
0: 10000000 radix: 219384 radix: 190688 ratio: 1.150
1: 10000000 radix: 190351 radix: 190268 ratio: 1.000
2: 10000000 radix: 193739 radix: 194958 ratio: 0.994
3: 10000000 radix: 190002 radix: 189679 ratio: 1.002
4: 10000000 radix: 185730 radix: 187030 ratio: 0.993
5: 10000000 radix: 191739 radix: 191309 ratio: 1.002
total radix: 1170945 / 6  average: 195157.500 microseconds
total radix: 1143932 / 6  average: 190655.333 microseconds
final ratio: 1.024

(Looks like the first run is always significantly biased!)

5) Use values and counts from a historgram file. The benchmark
includes a histogram generated from a Tanimoto comparison of randomly
selected 2048-bit RDKit Morgan2 fingerprints generated from ChEMBL.

% ./radix_sort_benchmark --repeat 10 -f tanimoto.hist
0: 1000000 tim: 107167 radix: 21188 ratio: 5.058
1: 1000000 tim: 110632 radix: 17259 ratio: 6.410
2: 1000000 tim: 107168 radix: 17001 ratio: 6.304
3: 1000000 tim: 109849 radix: 19866 ratio: 5.529
4: 1000000 tim: 112524 radix: 17985 ratio: 6.257
5: 1000000 tim: 106386 radix: 17633 ratio: 6.033
6: 1000000 tim: 105665 radix: 17190 ratio: 6.147
7: 1000000 tim: 106074 radix: 18728 ratio: 5.664
8: 1000000 tim: 106340 radix: 17066 ratio: 6.231
9: 1000000 tim: 107126 radix: 17999 ratio: 5.952
total tim: 1078931 / 10  average: 107893.100 microseconds
total radix: 181915 / 10  average: 18191.500 microseconds
final ratio: 5.931

6) These values are in Farey(2048) so compare the general-purpose
radix sort with a specialized sort:

% ./radix_sort_benchmark --repeat 10 -f tanimoto.hist --sort1 radix --sort2 farey23204
0: 1000000 radix: 21118 farey23204: 9682 ratio: 2.181
1: 1000000 radix: 17892 farey23204: 10731 ratio: 1.667
2: 1000000 radix: 16864 farey23204: 10566 ratio: 1.596
3: 1000000 radix: 17243 farey23204: 9432 ratio: 1.828
4: 1000000 radix: 16308 farey23204: 9544 ratio: 1.709
5: 1000000 radix: 19113 farey23204: 10627 ratio: 1.799
6: 1000000 radix: 17394 farey23204: 9799 ratio: 1.775
7: 1000000 radix: 18040 farey23204: 9139 ratio: 1.974
8: 1000000 radix: 19263 farey23204: 9898 ratio: 1.946
9: 1000000 radix: 18258 farey23204: 10493 ratio: 1.740
total radix: 181493 / 10  average: 18149.300 microseconds
total farey23204: 99911 / 10  average: 9991.100 microseconds
final ratio: 1.817

7) Compare two specialized radix sort functions:

% ./radix_sort_benchmark --repeat 10 -f tanimoto.hist --sort1 farey23204 --sort2 farey65537
0: 1000000 farey23204: 13555 farey65537: 10907 ratio: 1.243
1: 1000000 farey23204: 9426 farey65537: 10231 ratio: 0.921
2: 1000000 farey23204: 9205 farey65537: 10657 ratio: 0.864
3: 1000000 farey23204: 10412 farey65537: 11365 ratio: 0.916
4: 1000000 farey23204: 11498 farey65537: 11606 ratio: 0.991
5: 1000000 farey23204: 9554 farey65537: 10037 ratio: 0.952
6: 1000000 farey23204: 9681 farey65537: 9971 ratio: 0.971
7: 1000000 farey23204: 10128 farey65537: 10272 ratio: 0.986
8: 1000000 farey23204: 9483 farey65537: 9835 ratio: 0.964
9: 1000000 farey23204: 9112 farey65537: 10139 ratio: 0.899
total farey23204: 102054 / 10  average: 10205.400 microseconds
total farey65537: 105020 / 10  average: 10502.000 microseconds
final ratio: 0.972

8) Compare the two specialized sorts on the uniformly distributed
Farey(2048) values.

% python make_farey_histogram.py 2048
Wrote 1275587 terms to 'farey_2048.hist'

% ./radix_sort_benchmark --repeat 10 -f farey_2048.hist --sort1 farey23204 --sort2 farey65537
0: 1000000 farey23204: 16659 farey65537: 11857 ratio: 1.405
1: 1000000 farey23204: 10521 farey65537: 12636 ratio: 0.833
2: 1000000 farey23204: 9854 farey65537: 12877 ratio: 0.765
3: 1000000 farey23204: 9729 farey65537: 10972 ratio: 0.887
4: 1000000 farey23204: 11620 farey65537: 16739 ratio: 0.694
5: 1000000 farey23204: 11142 farey65537: 11119 ratio: 1.002
6: 1000000 farey23204: 10931 farey65537: 11489 ratio: 0.951
7: 1000000 farey23204: 9825 farey65537: 11174 ratio: 0.879
8: 1000000 farey23204: 10648 farey65537: 11974 ratio: 0.889
9: 1000000 farey23204: 10844 farey65537: 12637 ratio: 0.858
total farey23204: 111773 / 10  average: 11177.300 microseconds
total farey65537: 123474 / 10  average: 12347.400 microseconds
final ratio: 0.905
```