help-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Counting words, fast!


From: Jesse Hathaway
Subject: Re: Counting words, fast!
Date: Fri, 19 Mar 2021 14:28:49 -0500

On Fri, Mar 19, 2021 at 12:30 PM Koichi Murase <myoga.murase@gmail.com> wrote:
> Maybe, it is a bit late but some comments.

Never too late for some Bash fun!

> * Bash indexed arrays are implemented by a sparse linked list (of
> key-value pairs), i.e., Bash indexed arrays are already a kind of
> sorted dictionary of (integer-)key & value. So, you don't have to
> construct the associative array and sort its keys by yourself, but you
> can just construct an indexed array and print it by "${array[@]}". A
> non-trivial thing is that the problem requires inputting the result in
> descending order with respect to the frequency. For this one needs to
> revert the array elements.

This didn't seem to work for me, because ${#o[@]} is the sparse length?:

  ((i=0,j=${#o[@]}-1))
  while ((j>=0)); do r[i++]=${o[j--]}; done

Instead I needed to do this, which ended up being slower than my sorting, I
assume since it needed to iterate over so many unset values?:

  read -r _ j <<<"${o[-1]}"
  ((i=0))
  while ((j>=0)); do r[i++]=${o[j--]}; done

> * Usually shorter variable names are faster in Bash scripts. The
> speedup by this one is small for this script because this is not the
> bottleneck of this script, but there is indeed a difference.

I didn't realize that, but as you said I only saw a very small time reduction
by reducing variable length. However, not quoting the associative array
subscript saved 2s!  I didn't realize associative array subscripts don't need
to be quoted, is that specified in the man page somewhere?

> * Usually, `mapfile' is much faster than `read'. But somehow in this
> case, the implementation by `read -N 65536' seems to be faster than
> `mapfile' implementation. Because ${arr[*]} seems to have the
> quadratic time complexity O(N^2) under some conditions, there is a
> finite suitable size of a block to process at once which is something
> between 32768..65536 in this problem. `mapfile' cannot control the
> data size of the processed input through its options.

I was hoping reading into an array with read would be faster than word
splitting, but it is not for some reason, perhaps memory allocation?

> * «while LANG=C IFS= read ...; do ...» is slow because Bash
> reinitializes the locales of the C library every time any shell
> variables LANG or LC_* is changed. One can instead set LANG outside
> the while statement and restore it after the while statement.

This was a surprising slowdown, just moving LANG outside the loop save 7s.

Thanks for the reply Koichi, in my tests the script went from 27s to 16s!



reply via email to

[Prev in Thread] Current Thread [Next in Thread]