bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#68244: hash-table improvements


From: Mattias Engdegård
Subject: bug#68244: hash-table improvements
Date: Mon, 8 Jan 2024 19:26:21 +0100

7 jan. 2024 kl. 20.10 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

> The change gives good results for small tables but less so for big ones.
> I don't have a good intuition for why that would be: none of the
> operations directly involved seem to be more costly for large tables, so
> my best guess is that it leads to more collisions somehow, tho I don't
> have a good idea about why that would be.

Yes, I wondered about that too. It could simply be bad sampling. The benchmarks 
with bad results used sequential numbers (0, 1, ...) as keys so let's start 
with that:

|  count |   size idxsiz  1st%  avg max |   size idxsiz  1st%  avg max
|--------+------------------------------+-----------------------------
|  10000 |  12466  15343 100.0 1.00   1 |  12288  16384 100.0 1.00   1
|  20000 |  28048  34523 100.0 1.00   1 |  24576  32768  98.8 1.01   2
|  30000 |  42072  51781 100.0 1.00   1 |  49152  65536  99.1 1.01   2
|  40000 |  42072  51781 100.0 1.00   1 |  49152  65536  94.5 1.06   2
|  50000 |  63108  77671 100.0 1.00   1 |  98304 131072 100.0 1.00   1
|  60000 |  63108  77671 100.0 1.00   1 |  98304 131072 100.0 1.00   1
|  70000 |  94662 116507 100.0 1.00   1 |  98304 131072 100.0 1.00   1
|  80000 |  94662 116507 100.0 1.00   1 |  98304 131072  96.0 1.04   2
|  90000 |  94662 116507 100.0 1.00   1 |  98304 131072  89.3 1.11   2
| 100000 | 141993 174761 100.0 1.00   1 | 196608 262144  92.9 1.07   2
| 110000 | 141993 174761 100.0 1.00   1 | 196608 262144  90.9 1.09   2
| 120000 | 141993 174761 100.0 1.00   1 | 196608 262144  89.3 1.11   2
| 130000 | 141993 174761 100.0 1.00   1 | 196608 262144  87.9 1.12   2
| 140000 | 141993 174761 100.0 1.00   1 | 196608 262144  86.7 1.13   2
| 150000 | 212989 262141 100.0 1.00   1 | 196608 262144  85.7 1.14   2
| 160000 | 212989 262141 100.0 1.00   1 | 196608 262144  84.8 1.15   2
| 170000 | 212989 262141 100.0 1.00   1 | 196608 262144  84.0 1.16   2
| 180000 | 212989 262141 100.0 1.00   1 | 196608 262144  83.3 1.17   2
| 190000 | 212989 262141 100.0 1.00   1 | 196608 262144  82.7 1.17   2
| 200000 | 212989 262141 100.0 1.00   1 | 393216 524288 100.0 1.00   1

The left columns are for the standard hash tables with remainder-based range 
reduction, the right ones with Knuth reduction.
'size' is the table size, 'idxsiz' the index vector size, '1st%' the portion of 
entries accessed with a single lookup, 'avg' the average number of accesses and 
'max' the maximum number of accesses required.

The old code looks perfect (no collisions!), but the new shiny code is a 
disappointment.

Then again, sequential numbers are best-case for remainder-based indexing: 
Emacs hashes (smallish) fixnums to themselves so we are guaranteed a minimum 
number of collisions, actually zero since the index vector is always larger 
than the number of entries.

But sequential keys would be a somewhat unusual use of hash tables; a vector is 
a lot more efficient. Let's try with random fixnums instead, using the same 
seed for each table:

|  count |   size idxsiz 1st%  avg max |   size idxsiz 1st%  avg max
|--------+-----------------------------+----------------------------
|  10000 |  12466  15343 73.5 1.32   6 |  12288  16384 75.4 1.30   6
|  20000 |  28048  34523 75.7 1.29   6 |  24576  32768 75.2 1.30   6
|  30000 |  42072  51781 75.8 1.29   5 |  49152  65536 80.6 1.22   5
|  40000 |  42072  51781 69.6 1.39   7 |  49152  65536 75.0 1.30   6
|  50000 |  63108  77671 73.6 1.32   6 |  98304 131072 83.5 1.19   6
|  60000 |  63108  77671 69.6 1.39   7 |  98304 131072 80.5 1.23   6
|  70000 |  94662 116507 75.2 1.30   6 |  98304 131072 77.5 1.27   6
|  80000 |  94662 116507 72.4 1.34   7 |  98304 131072 75.0 1.30   7
|  90000 |  94662 116507 69.8 1.38   7 |  98304 131072 72.6 1.34   7
| 100000 | 141993 174761 76.2 1.29   6 | 196608 262144 83.3 1.19   6
| 110000 | 141993 174761 74.2 1.32   6 | 196608 262144 81.8 1.21   6
| 120000 | 141993 174761 72.4 1.34   7 | 196608 262144 80.3 1.23   6
| 130000 | 141993 174761 70.6 1.37   7 | 196608 262144 78.8 1.25   6
| 140000 | 141993 174761 68.8 1.40   7 | 196608 262144 77.6 1.27   6
| 150000 | 212989 262141 76.1 1.29   6 | 196608 262144 76.2 1.29   6
| 160000 | 212989 262141 74.8 1.31   6 | 196608 262144 74.9 1.30   6
| 170000 | 212989 262141 73.5 1.33   6 | 196608 262144 73.6 1.32   6
| 180000 | 212989 262141 72.3 1.35   7 | 196608 262144 72.3 1.34   6
| 190000 | 212989 262141 71.1 1.36   7 | 196608 262144 71.1 1.36   7
| 200000 | 212989 262141 69.9 1.38   7 | 393216 524288 83.1 1.19   5

Here the new code looks a bit better, but it could just be the bigger index 
vectors.
Not sure what to make of this.

We could try switching to a high-quality hash function (or family thereof), 
like Murmur or Jenkins. Then range reduction is just a matter of masking off 
the required number of bits.






reply via email to

[Prev in Thread] Current Thread [Next in Thread]