bug-gperf

Re: Ann: Working on mph algos


From: Reini Urban
Subject: Re: Ann: Working on mph algos
Date: Wed, 23 Feb 2022 15:59:51 +0100



Bruno Haible <bruno@clisp.org> wrote on Wed., 23 Feb 2022, 11:02:
Hello Reini,

> The original code: https://github.com/rurban/nbperf (BSD3 Licensed)
> I'll bug assign@gnu for the assignment.
> Do we need permission from the original author also, or is BSD enough?

Per GNU policy, if you want code to be included upstream, a copyright
assignment to the FSF is required. In special cases, GNU packages also
incorporate code without copyright assignment, but each such case
weakens the legal defense against companies and institutions that want
to abuse the code and that treat the copyleft with contempt. And the number
of such parasitic companies and institutions has grown a lot in the last
three years.

I know that. I have much more dangerous enemies in my own GNU project: an extremely litigious company, the first one that actually took the FSF to court.
But this is a code generator, where the generated code is not under the GPL. And multiple similar generators with much weaker licenses have existed for years. It's merely a convenience not having to switch tools.

I already sent my assignment form. The question is whether we also need the assignment of the original author whose code I started from. Did you need an assignment from the original author of gperf?

I can check with Jörg.


Another problem is that we want source code in the packages, not binary
blobs. The 5 KB of data used by mi_vector_hash do not look like source code.

This is code generated from the .c file, as explained in its first line. The MPHs need to bring their own hash helper function with them, and I'd rather use xxd -i than write my own C quoter.
Just like configure is generated from configure.ac.
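
For illustration, xxd -i turns a data file into a C array of roughly this shape (the file name below is just a placeholder, not the actual nbperf output):

    /* hypothetical output of `xxd -i hash_data.bin` */
    unsigned char hash_data_bin[] = {
      0x7f, 0x03, 0x21, 0x0a, 0x00, 0x42, 0x9c, 0x5d,
      0x11, 0xe4, 0x08, 0x7b, 0x36, 0xaf, 0xd0, 0x2e
    };
    unsigned int hash_data_bin_len = 16;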


Recall: "The source code for a work means the preferred form of the work
for making modifications to it."

Sure, that's why it's documented as such, and read-only.
You need to look at the current code.

> Of course it's much slower than the default gperf algo, but for large lists
> gperf is not well suited. Also the ordering property of CHM is nice.

I agree that gperf is not well suited for large keyword sets; this is
documented in
https://www.gnu.org/software/gperf/manual/html_node/Bugs.html

A solution to this problem should IMO come
  - as an alternative algorithm,
  - with maintainable source code.


Both fulfilled.


What I don't see at this point:

  - Why would it be better to have 3 algorithms rather than one?
    I mean, the user of gperf does not want to experiment a lot before
    getting a satisfying result. The fewer command line options the
    user has to try, the better.


Because these 3 are for huge sets, starting at 15,000 and going up to several million entries.
The best algo should be automatically selected from the input set.
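
As a rough sketch of what that selection could look like (the 15,000 threshold is the figure above; the names and the exact cutoff are only assumptions, not actual gperf/nbperf code):

    #include <stddef.h>

    /* hypothetical dispatcher: choose a generator from the key count */
    enum mph_algo { ALGO_GPERF, ALGO_CHM, ALGO_BPZ };

    enum mph_algo
    select_algo(size_t nkeys)
    {
        if (nkeys < 15000)
            return ALGO_GPERF;  /* small sets: classic gperf search */
        return ALGO_CHM;        /* huge sets: CHM/BPZ minimal perfect hash */
    }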


  - What the "ordering property of CHM" means and why it is "nice"?
    I mean, a hash table is meant to implement a map
       string -> some DATA or NULL
    What is the abstraction that a hash function with ordering would
    provide?

You don't need the word list in the hash table; you can optionally use your own array. The hash just returns the index into this array. You can also implement paging and ranges.
This is important for huge tables, e.g. you can keep them out of main memory.
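
A minimal sketch of that usage, assuming a generated order-preserving hash function (the name mph_hash and its signature are placeholders, not the actual generated API):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    extern uint32_t mph_hash(const void *key, size_t len);  /* generated code */

    /* the keyword data stays in the caller's own array (or on disk) */
    static const char *const words[] = { "alpha", "beta", "gamma" };

    const char *
    lookup(const char *key)
    {
        uint32_t i = mph_hash(key, strlen(key));
        /* an MPH is only defined for keys from the original set,
           so verify the candidate before using it */
        if (i < sizeof words / sizeof words[0] && strcmp(words[i], key) == 0)
            return words[i];
        return NULL;
    }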


  - Maintainability: The code should also be understandable by someone
    who has not read a 13-page paper by Botelho.

So, for the goal of solving gperf's problem with large key sets:

  What are, in the current research, the algorithms for generating
  a decent (O(strlen(s))) hash function with an O(1) lookup - not
  necessarily a perfect hash function? The code in nbperf is from 2009.
  Likely some other algorithms have been published in the last 13 years?
  How do they compare?

No recent improvements have been made there. CHM and BPZ are still the best for big MPHs, and gperf for small ones. All of them have their speed/size tradeoffs; with gperf we just have the limitation of not being able to handle large sets at all.
For special cases I wrote a faster tool a couple of years ago, but it is not portable. 1000x faster than memcmp. I haven't found much need for it though. Good HTTP parsers use this trick also.

The cmph library has a few more algos, but nothing is better than those 3, and gperf/nbperf-generated hashes are faster than cmph hashes.
I benchmarked all of them a few years ago in my perfect hash tool.
