Re: split a string into an array?
From: Greg Wooledge
Subject: Re: split a string into an array?
Date: Fri, 11 Mar 2022 08:13:53 -0500
On Fri, Mar 11, 2022 at 02:02:40PM +0900, Koichi Murase wrote:
> I guess Peng is interested in a solution without forks.
It's sad that we have to guess, but those who haven't killfiled Peng
yet are surely used to it by now.
The traditional way that people try to split a line of input into
fields is this:
    IFS=$'\t' read -ra fields <<< "$line"
This does not fork, but it creates a temporary file (bash 5.1 and later
may use a pipe instead when the here-string is small).
This fails if there is a trailing empty field, as noted in
<https://mywiki.wooledge.org/BashPitfalls#pf47>. A workaround exists,
for when the delimiter is something like comma. However, this workaround
DOES NOT WORK when the delimiter is tab, or any other IFS whitespace
character, because read ALWAYS strips leading and trailing IFS
whitespace characters from the input line, if they're still in IFS.
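To make the failure concrete, here is a small sketch of the trailing
empty field problem, the comma workaround, and why the same workaround
does nothing for tabs. The variable names (naive_count, fixed_count,
tab_count) are mine, chosen only for illustration.

```bash
#!/usr/bin/env bash
# Trailing empty field: read -a silently drops it.
line='a,b,'
IFS=, read -ra fields <<< "$line"
naive_count=${#fields[@]}

# Workaround for a non-whitespace delimiter: append one extra delimiter.
IFS=, read -ra fields <<< "$line,"
fixed_count=${#fields[@]}

# Same trick with tab: read strips trailing IFS whitespace first,
# so the appended tab (and the original trailing tab) both vanish.
tline=$'a\tb\t'
IFS=$'\t' read -ra tfields <<< "$tline"$'\t'
tab_count=${#tfields[@]}

printf 'comma: %s  comma+workaround: %s  tab+workaround: %s\n' \
    "$naive_count" "$fixed_count" "$tab_count"
# prints: comma: 2  comma+workaround: 3  tab+workaround: 2
```

The input has three fields in both cases (the last one empty), so only
the middle number is correct.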
Note also that I used the key word "line" in my initial text. This
solution uses read with its default newline terminator, which means
it stops reading as soon as a newline character is encountered. If
one wishes to generalize this to handle arbitrary input *records*
which are not necessarily a single line (i.e. which may have embedded
newline characters within fields), then the delimiter must be changed.
    IFS=$'\t' read -r -d '' -a fields < <(printf '%s\0' "$line")
This one does not create a temporary file, but it forks a subshell,
which is far worse for speed. It also has the same trailing empty
field problem as before, so it still cannot handle tabs correctly.
It'll work for commas, though.
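As a rough illustration of the difference the NUL delimiter makes, the
sketch below splits a comma-delimited record whose second field contains
a newline. The sample record and the array names (by_line, by_record)
are made up for the demonstration.

```bash
#!/usr/bin/env bash
# A record with three comma-separated fields; the middle one spans two lines.
record=$'one,two\nlines,three'

# Default newline terminator: read stops at the embedded newline.
IFS=, read -ra by_line <<< "$record"

# NUL terminator: the whole record is read, newline included.
IFS=, read -r -d '' -a by_record < <(printf '%s\0' "$record")

printf 'newline-terminated: %s fields\n' "${#by_line[@]}"
printf 'NUL-terminated:     %s fields\n' "${#by_record[@]}"
# prints 2 fields for the first form and 3 for the second
```

The second form still drops a trailing empty field, exactly as
described above.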
Another approach would use a loop:
    fields=()
    tmp=$line
    while [[ $tmp = *$'\t'* ]]; do
        fields+=("${tmp%%$'\t'*}")
        tmp=${tmp#*$'\t'}
    done
    fields+=("$tmp")
This doesn't create a temp file, and doesn't fork, so it should be the
best solution of them all, right? It also doesn't have any problem with
trailing delimiters, even if the delimiter is an IFS whitespace character.
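For anyone who wants to check that claim, here is the same loop wrapped
in a function and run against input with an empty middle field and an
empty trailing field. The function name split_on is hypothetical, and
having it fill a global array called fields is just one possible design.

```bash
#!/usr/bin/env bash
# split_on DELIM STRING
# Hypothetical helper: splits STRING on DELIM into the global array "fields".
split_on() {
    local delim=$1 tmp=$2
    fields=()
    # Quoting "$delim" inside the patterns makes it match literally.
    while [[ $tmp = *"$delim"* ]]; do
        fields+=("${tmp%%"$delim"*}")   # everything before the first delimiter
        tmp=${tmp#*"$delim"}            # drop that field and its delimiter
    done
    fields+=("$tmp")                    # whatever is left is the last field
}

split_on $'\t' $'a\t\tb\t'
printf '%s fields\n' "${#fields[@]}"
# prints: 4 fields  (a, empty, b, empty)
```

Note that an empty input string yields one empty field, not zero
fields. Whether that is what you want depends on your data.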
People will still hate it. They will hate it with an all-consuming
fiery passion. Why? Because it doesn't look "elegant". Because it's
got a terrible "golfing" score. Because they've sworn a jihad against the
appearance of iteration (but it's totally OK if some other tool iterates
behind the curtains, so long as it can't be seen by the public eye). Or
because *gasp* it's not ONE PIPELINE. Some people will accept a
solution of any length, as long as it's a single pipeline. But put
two commands in sequence, and they start bringing out the pitchforks.
I can't fix those people.
It might also be slower than the temp-file-based solutions, which will
turn many people off, even though slowness is the price they must pay
to get a solution that actually works, in bash. If speed is an issue,
they're writing in the wrong language in the first place.
The reader may also wish to explore an iterative solution which
traverses the input string one character at a time, and either appends
the character to the current field, or starts a new field. That's
the sort of solution I'd use in C. It's wickedly efficient when you
have pointers in your toolbox.
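A minimal sketch of that character-at-a-time approach in bash follows,
assuming a single-character delimiter. The name split_chars is
hypothetical. In bash each ${input:i:1} is a substring expansion, not a
pointer dereference, so I would not expect this to be fast.

```bash
#!/usr/bin/env bash
# split_chars DELIM STRING
# Hypothetical helper: walks STRING one character at a time and fills
# the global array "fields". DELIM is assumed to be a single character.
split_chars() {
    local delim=$1 input=$2 i char current=
    fields=()
    for ((i = 0; i < ${#input}; i++)); do
        char=${input:i:1}
        if [[ $char = "$delim" ]]; then
            fields+=("$current")    # delimiter: close the current field
            current=
        else
            current+=$char          # anything else: extend the current field
        fi
    done
    fields+=("$current")            # the final field, possibly empty
}

split_chars $'\t' $'a\t\tb\t'
printf '%s fields\n' "${#fields[@]}"
# prints: 4 fields
```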
However, the biggest issue of all is that people try to use these
techniques on ACTUAL CSV FILES. These techniques cannot be used on
CSV files. They're not capable of handling fields with embedded commas
(or whatever delimiters) inside them, with various quoting mechanisms
employed to mark the embedded delimiters as literals. Or embedded
newlines (record separators). They also don't remove the field quoting
that's used to permit such embedded delimiters/separators.
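Here is a short demonstration of that failure, using a made-up line in
the usual CSV quoting style (double quotes around a field, doubled
double quotes for a literal quote). A CSV parser would be expected to
give three fields: Smith, John / 42 / said "hi".

```bash
#!/usr/bin/env bash
# Made-up sample line: the first field contains an embedded comma.
line='"Smith, John",42,"said ""hi"""'

IFS=, read -ra fields <<< "$line"

printf '%s fields\n' "${#fields[@]}"
printf '<%s>\n' "${fields[@]}"
# The quoted field is cut in two at its embedded comma, and none of the
# quote characters are removed, so this reports 4 fields, not 3.
```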
So, if what Peng is ACTUALLY trying to do is parse a CSV file, then
this entire thread has been a waste of time.
If you want to parse CSV files, upgrade to a real language, with a
dedicated CSV library.