help-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Counting words, fast!


From: Leonid Isaev (ifax)
Subject: Re: Counting words, fast!
Date: Tue, 16 Mar 2021 22:12:50 +0000

On Tue, Mar 16, 2021 at 04:31:59PM -0500, Jesse Hathaway wrote:
> Ben Hoyt wrote a fun blog post on counting words,
> https://benhoyt.com/writings/count-words, I wrote a Bash version with
> as many optimizations as I could find[1], I would love any suggestions
> on any substantial optimizations, the code is attached. The full
> repo and test file is here: https://github.com/benhoyt/countwords

Nice, but I don't understand why did you choose to implement insertion sort
in bash? Replacing it with GNU sort(1) significantly cuts down execution time:
-----8<-----
[lisaev@hyena tmp]$ cat a.sh 
#!/bin/bash

set -o errexit
set -o nounset

declare -Ai freqs
declare -a line
declare w

while read -ra line; do
        for w in "${line[@]}"; do
                freqs["${w,,}"]+=1
        done
done

for w in "${!freqs[@]}"; do
        printf "%s %s\n" "${freqs[$w]}" "$w"
done

[lisaev@hyena tmp]$ bash a.sh < kjvbible.txt | sort -rnk1 | head
64015 the
51313 and
34634 of
13567 to
12784 that
12503 in
10261 he
9838 shall
8987 unto
8810 for

[lisaev@hyena tmp]$ time { bash a.sh < kjvbible.txt | sort -rnk1 &> /dev/null; 
} 
real    0m9.721s
user    0m8.874s
sys     0m0.813s
----->8-----
(notice, this test was done in a VM, on a file in tmpfs, so there is no caching
associated with filesystem access).

If using sort(1) destroys "pure-bash-ness" of the task, you should program a
quick-/merge-sort/...-sort subroutine with $O (n \ln n)$ performance.

Sincerely,
L.

--
Leonid Isaev



reply via email to

[Prev in Thread] Current Thread [Next in Thread]