help-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Counting words, fast!


From: Koichi Murase
Subject: Re: Counting words, fast!
Date: Sat, 20 Mar 2021 01:30:16 +0800

2021年3月17日(水) 5:32 Jesse Hathaway <jesse@mbuki-mvuki.org>:
>
> Ben Hoyt wrote a fun blog post on counting words,
> https://benhoyt.com/writings/count-words, I wrote a Bash version with
> as many optimizations as I could find[1], I would love any suggestions
> on any substantial optimizations, the code is attached. The full
> repo and test file is here: https://github.com/benhoyt/countwords
>
> Yours kindly, Jesse
>
> [1]: https://github.com/benhoyt/countwords/pull/66

Maybe, it is a bit late but some comments.

* Bash indexed arrays are implemented by a sparse linked list (of
key-value pairs), i.e., Bash indexed arrays are already a kind of
sorted dictionary of (integer-)key & value. So, you don't have to
construct the associative array and sort its keys by yourself, but you
can just construct an indexed array and print it by "${array[@]}". A
non-trivial thing is that the problem requires inputting the result in
descending order with respect to the frequency. For this one needs to
revert the array elements.

* Usually shorter variable names are faster in Bash scripts. The
speedup by this one is small for this script because this is not the
bottleneck of this script, but there is indeed a difference.

* Usually, `mapfile' is much faster than `read'. But somehow in this
case, the implementation by `read -N 65536' seems to be faster than
`mapfile' implementation. Because ${arr[*]} seems to have the
quadratic time complexity O(N^2) under some conditions, there is a
finite suitable size of a block to process at once which is something
between 32768..65536 in this problem. `mapfile' cannot control the
data size of the processed input through its options.

* «while LANG=C IFS= read ...; do ...» is slow because Bash
reinitializes the locales of the C library every time any shell
variables LANG or LC_* is changed. One can instead set LANG outside
the while statement and restore it after the while statement.

I haven't thoroughly tested it, but here's my optimization (to the
latest version using read -N 65536):

--------------------
#!/bin/bash

# load
declare -iA h
set -f
LANG=C E=
until [[ $E ]]; do
  IFS= read -N 65536 -r a || E=1
  IFS= read -r b || E=1
  for w in ${a@L}${b@L}; do
    h[$w]+=1
  done
done

# construct outputs
for w in "${!h[@]}"; do
  f=${h[$w]}
  o[f]+=$w' '$f$'\n'
done

# reverse
((i=0,j=${#o[@]}-1))
while ((j>=0)); do r[i++]=${o[j--]}; done

printf '%s' "${r[@]}"
--------------------

--
Koichi



reply via email to

[Prev in Thread] Current Thread [Next in Thread]