help-bash
Re: Counting words, fast!


From: Koichi Murase
Subject: Re: Counting words, fast!
Date: Sat, 20 Mar 2021 11:18:00 +0800

On Sat, 20 Mar 2021 at 3:29, Jesse Hathaway <jesse@mbuki-mvuki.org> wrote:
> I didn't realize associative array subscripts don't need
> to be quoted, is that specified in the man page somewhere?

I assume you are referring to `$w' in `h[$w]+=1'. I searched the Bash
manual but could not find any related description. Something that might
be related is the description of variable assignments (*word* in the
quotations below denotes italics):

https://www.gnu.org/software/bash/manual/bash.html#Shell-Parameters
> A variable may be assigned to by a statement of the form
>
>     *name*=[*value*]
>
> If *value* is not given, the variable is assigned the null string. All
> *value*s undergo tilde expansion, parameter and variable expansion,
> command substitution, arithmetic expansion, and quote removal
> (detailed below).

`h[$w]+=1' is a variant of the variable assignment, so the right-hand
side needs to be quoted when it contains shell special characters,
because it is subject to the various shell expansions. But the manual
does not mention the assignment syntax for associative arrays. There
is a description for indexed arrays:

https://www.gnu.org/software/bash/manual/bash.html#Arrays
> An indexed array is created automatically if any variable is
> assigned to using the syntax
>
>     *name*[*subscript*]=*value*
>
> The subscript is treated as an arithmetic expression that must
> evaluate to a number.

Maybe *subscript* is considered subject to the various shell
expansions because it is an arithmetic expression. But since it is
still an arithmetic expression, the quotes that would otherwise be
needed to prevent pathname expansion and word splitting are not
required.
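
To illustrate the indexed-array case, here is a minimal sketch (the
names `a' and `i' are made up for this example). It only shows the
behavior I observe: the unquoted subscript is evaluated as an
arithmetic expression, with or without `$'.

```shell
#!/usr/bin/env bash
# Sketch: subscripts of indexed-array assignments are arithmetic
# expressions, so they can be left unquoted.
a=()
i=3
a[i+1]=x     # bare variable name inside the subscript: index 4
a[$i*2]=y    # `*' is multiplication here, not a glob: index 6
a[i-3]=z     # index 0

# List the indices that were actually assigned.
printf '%s\n' "${!a[*]}"   # prints: 0 4 6
```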

For associative arrays, I could not even find a statement in the
manual that they support assignments of the form
`name[subscript]=value'.
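
Although I cannot point to the manual for this, the behavior can be
checked directly. The following is a small sketch (the keys are
arbitrary examples) showing that an unquoted `$w' in the subscript of
an associative-array assignment is neither split into words nor
treated as a glob pattern:

```shell
#!/usr/bin/env bash
# Sketch: unquoted $w as an associative-array subscript in an assignment.
declare -iA h            # -i: values are integers, so += adds arithmetically

w='two words'
h[$w]+=1                 # the key is the whole string 'two words'
h[$w]+=1

w='x*y'
h[$w]+=1                 # `*' is kept literally in the key

# Print each key with its count.
for k in "${!h[@]}"; do
  printf '[%s] = %s\n' "$k" "${h[$k]}"
done
```

Note that this only covers the assignment form used in the script;
other contexts, such as subscripts inside `((...))', may behave
differently.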

> I was hoping reading into an array with read would be faster than word
> splitting, but it is not for some reason, perhaps memory allocation?

It's related to how the data read from the file descriptor is
buffered. `read' must stop exactly at the end of the data it was asked
for (the delimiter or the character count), so buffered reads are
enabled only under certain conditions (though the current
implementation of the `read' builtin might still be optimized).
`mapfile' is supposed to read until the end of the stream, so it can
usually perform buffered reads.
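
Here is a small sketch of the constraint on `read' (the variable names
`first' and `rest' are made up for the example). It does not measure
speed; it only shows that `read' on a pipe consumes exactly one line
and leaves the remainder for the next reader, which is why it cannot
freely read ahead:

```shell
#!/usr/bin/env bash
# Sketch: `read' must not consume input beyond the line it returns.
{
  IFS= read -r first     # takes only the first line from the pipe
  rest=$(cat)            # the next reader still sees everything after it
} < <(printf 'a\nb\nc\n')

printf 'first=%s\n' "$first"
printf 'rest=%s\n' "$rest"
```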


On Sat, 20 Mar 2021 at 6:43, Koichi Murase <myoga.murase@gmail.com> wrote:
>
> > This didn't seem to work for me, because ${#o[@]} is the sparse length?:
>
> Oh, sorry. It seems o=("${#o[@]}") has been dropped while testing a
> different way of reverting the array.

Sorry again. I meant `o=("${o[@]}")', not `o=("${#o[@]}")'. Here is
the fixed version:

----------

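# h: word -> count (integer values, so `+=1' increments)
declare -iA h
# disable pathname expansion so the unquoted expansions below do not glob
set -f
LANG=C E=
until [[ $E ]]; do
  # read a large chunk, then the rest of the current line so that no
  # word is cut in half at the chunk boundary
  IFS= read -N 65536 -r a || E=1
  IFS= read -r b || E=1
  # ${var@L} converts to lowercase (requires Bash 5.1 or later)
  for w in ${a@L}${b@L}; do
    h[$w]+=1
  done
done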

# construct outputs for each freq
for w in "${!h[@]}"; do
  f=${h[$w]}
  o[f]+=$w' '$f$'\n'
done

# reverse
o=("${o[@]}")
((i=0,j=${#o[@]}-1))
while ((j>=0)); do r[i++]=${o[j--]}; done

printf '%s' "${r[@]}"
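
To show why the `o=("${o[@]}")' line matters, here is a minimal sketch
with made-up data. `${#o[@]}' counts the elements that are set, not
the highest index plus one, so indexing a sparse array from
`${#o[@]}-1' downward misses elements until the array is re-packed:

```shell
#!/usr/bin/env bash
# Sketch: re-pack a sparse indexed array before reversing it.
o=()
o[2]='low '
o[7]='high '

printf 'count=%s indices=%s\n' "${#o[@]}" "${!o[*]}"   # count=2 indices=2 7

# Re-assigning from the expanded elements renumbers them from 0.
o=("${o[@]}")
printf 'count=%s indices=%s\n' "${#o[@]}" "${!o[*]}"   # count=2 indices=0 1

# Now the reverse loop from the script visits every element.
r=()
((i=0,j=${#o[@]}-1))
while ((j>=0)); do r[i++]=${o[j--]}; done

printf '%s' "${r[@]}"; echo                            # prints: high low
```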

--
Koichi


