[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#68244: hash-table improvements
From: |
Mattias Engdegård |
Subject: |
bug#68244: hash-table improvements |
Date: |
Mon, 8 Jan 2024 19:26:21 +0100 |
7 jan. 2024 kl. 20.10 skrev Stefan Monnier <monnier@iro.umontreal.ca>:
> The change gives good results for small tables but less so for big ones.
> I don't have a good intuition for why that would be: none of the
> operations directly involved seem to be more costly for large tables, so
> my best guess is that it leads to more collisions somehow, tho I don't
> have a good idea about why that would be.
Yes, I wondered about that too. It could simply be bad sampling. The benchmarks
with bad results used sequential numbers (0, 1, ...) as keys so let's start
with that:
| count | size idxsiz 1st% avg max | size idxsiz 1st% avg max
|--------+------------------------------+-----------------------------
| 10000 | 12466 15343 100.0 1.00 1 | 12288 16384 100.0 1.00 1
| 20000 | 28048 34523 100.0 1.00 1 | 24576 32768 98.8 1.01 2
| 30000 | 42072 51781 100.0 1.00 1 | 49152 65536 99.1 1.01 2
| 40000 | 42072 51781 100.0 1.00 1 | 49152 65536 94.5 1.06 2
| 50000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1
| 60000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1
| 70000 | 94662 116507 100.0 1.00 1 | 98304 131072 100.0 1.00 1
| 80000 | 94662 116507 100.0 1.00 1 | 98304 131072 96.0 1.04 2
| 90000 | 94662 116507 100.0 1.00 1 | 98304 131072 89.3 1.11 2
| 100000 | 141993 174761 100.0 1.00 1 | 196608 262144 92.9 1.07 2
| 110000 | 141993 174761 100.0 1.00 1 | 196608 262144 90.9 1.09 2
| 120000 | 141993 174761 100.0 1.00 1 | 196608 262144 89.3 1.11 2
| 130000 | 141993 174761 100.0 1.00 1 | 196608 262144 87.9 1.12 2
| 140000 | 141993 174761 100.0 1.00 1 | 196608 262144 86.7 1.13 2
| 150000 | 212989 262141 100.0 1.00 1 | 196608 262144 85.7 1.14 2
| 160000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.8 1.15 2
| 170000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.0 1.16 2
| 180000 | 212989 262141 100.0 1.00 1 | 196608 262144 83.3 1.17 2
| 190000 | 212989 262141 100.0 1.00 1 | 196608 262144 82.7 1.17 2
| 200000 | 212989 262141 100.0 1.00 1 | 393216 524288 100.0 1.00 1
The left columns are for the standard hash tables with remainder-based range
reduction, the right ones with Knuth reduction.
'size' is the table size, 'idxsiz' the index vector size, '1st%' the portion of
entries accessed with a single lookup, 'avg' the average number of accesses and
'max' the maximum number of accesses required.
The old code looks perfect (no collisions!), but the new shiny code is a
disappointment.
Then again, sequential numbers are best-case for remainder-based indexing:
Emacs hashes (smallish) fixnums to themselves so we are guaranteed a minimum
number of collisions, actually zero since the index vector is always larger
than the number of entries.
But sequential keys would be a somewhat unusual use of hash tables; a vector is
a lot more efficient. Let's try with random fixnums instead, using the same
seed for each table:
| count | size idxsiz 1st% avg max | size idxsiz 1st% avg max
|--------+-----------------------------+----------------------------
| 10000 | 12466 15343 73.5 1.32 6 | 12288 16384 75.4 1.30 6
| 20000 | 28048 34523 75.7 1.29 6 | 24576 32768 75.2 1.30 6
| 30000 | 42072 51781 75.8 1.29 5 | 49152 65536 80.6 1.22 5
| 40000 | 42072 51781 69.6 1.39 7 | 49152 65536 75.0 1.30 6
| 50000 | 63108 77671 73.6 1.32 6 | 98304 131072 83.5 1.19 6
| 60000 | 63108 77671 69.6 1.39 7 | 98304 131072 80.5 1.23 6
| 70000 | 94662 116507 75.2 1.30 6 | 98304 131072 77.5 1.27 6
| 80000 | 94662 116507 72.4 1.34 7 | 98304 131072 75.0 1.30 7
| 90000 | 94662 116507 69.8 1.38 7 | 98304 131072 72.6 1.34 7
| 100000 | 141993 174761 76.2 1.29 6 | 196608 262144 83.3 1.19 6
| 110000 | 141993 174761 74.2 1.32 6 | 196608 262144 81.8 1.21 6
| 120000 | 141993 174761 72.4 1.34 7 | 196608 262144 80.3 1.23 6
| 130000 | 141993 174761 70.6 1.37 7 | 196608 262144 78.8 1.25 6
| 140000 | 141993 174761 68.8 1.40 7 | 196608 262144 77.6 1.27 6
| 150000 | 212989 262141 76.1 1.29 6 | 196608 262144 76.2 1.29 6
| 160000 | 212989 262141 74.8 1.31 6 | 196608 262144 74.9 1.30 6
| 170000 | 212989 262141 73.5 1.33 6 | 196608 262144 73.6 1.32 6
| 180000 | 212989 262141 72.3 1.35 7 | 196608 262144 72.3 1.34 6
| 190000 | 212989 262141 71.1 1.36 7 | 196608 262144 71.1 1.36 7
| 200000 | 212989 262141 69.9 1.38 7 | 393216 524288 83.1 1.19 5
Here the new code looks a bit better, but it could just be the bigger index
vectors.
Not sure what to make of this.
We could try switching to a high-quality hash function (or family thereof),
like Murmur or Jenkins. Then range reduction is just a matter of masking off
the required number of bits.
- bug#68244: hash-table improvements, (continued)
- bug#68244: hash-table improvements, Mattias Engdegård, 2024/01/05
- bug#68244: hash-table improvements, Dmitry Gutov, 2024/01/05
- bug#68244: hash-table improvements, Mattias Engdegård, 2024/01/06
- bug#68244: hash-table improvements, Eli Zaretskii, 2024/01/06
- bug#68244: hash-table improvements, Dmitry Gutov, 2024/01/06
- bug#68244: hash-table improvements, Stefan Monnier, 2024/01/07
- bug#68244: hash-table improvements, Dmitry, 2024/01/07
- bug#68244: hash-table improvements, Mattias Engdegård, 2024/01/07
- bug#68244: hash-table improvements, Stefan Monnier, 2024/01/07
- bug#68244: hash-table improvements,
Mattias Engdegård <=
- bug#68244: hash-table improvements, Stefan Monnier, 2024/01/08
- bug#68244: hash-table improvements, Mattias Engdegård, 2024/01/09
- bug#68244: hash-table improvements, Mattias Engdegård, 2024/01/13
- bug#68244: hash-table improvements, Gerd Möllmann, 2024/01/14
- bug#68244: hash-table improvements, Mattias Engdegård, 2024/01/14
- bug#68244: hash-table improvements, Stefan Kangas, 2024/01/21