help-bash

Re: [Help-bash] split


From: Seth David Schoen
Subject: Re: [Help-bash] split
Date: Sat, 26 May 2018 15:29:23 -0700
User-agent: Mutt/1.9.4 (2018-02-28)

Val Krem writes:

> Hi All,
> I wanted to split a big file based on the value of the first column of
> this file.
> File 1 (atest.dat):
>      1 ah1 25
>      1 ah2 26
>      4 ah5 35
>      4 ah6 36
>      2 ah2 54
> I want to split this file into three files based on the first column.
> File 1 will be
>      1 ah1 25
>      1 ah2 26
> File 2 would be
>      4 ah5 35
>      4 ah6 36
> File 3 would be
>      2 ah2 54
> The range of the first column could vary from 1 up to 100.
> I tried the following script
> ################################################
> #! bin/bash
> numb=($(seq 1 1 10))
> for i in "${numb[@]}"
>    do
>      awk '{if($1=='"${i}"') print $0}' atest.dat   > numb${i}.txt
>    done
> #################################################

A more idiomatic use of awk here would be simply

awk '$1=="'$i'"'

because awk allows you to set a condition before the action, and its
default action is {print}.
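
If splicing the shell variable into the awk program through quotes feels
fragile, one alternative is awk's -v option, which hands the value to awk
as an awk variable. A minimal sketch, assuming the sample rows from your
message and a scratch directory (both are only there for illustration):

```shell
#!/bin/bash
# Illustrative setup: a scratch directory and the sample rows from the question.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

i=4
# A bare pattern with no action prints every line that matches it.
# -v copies the shell value into the awk variable i, so no quote splicing.
awk -v i="$i" '$1 == i' atest.dat > "numb${i}.txt"
cat "numb${i}.txt"
```

The comparison itself is unchanged; only the way the value reaches awk
differs.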

One idea to limit the number of files created might be to start with

numb=($(cut -d' ' -f1 < atest.dat | sort -u))

(assuming the fields are separated by single spaces; cut's default
delimiter is a tab), which will run the command only for the values that
actually occur in the first column of atest.dat (whatever they are,
whether they are in the range 1 to 100 or not).
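
Put together, the loop over only the values that occur could look
something like the sketch below. It uses awk '{print $1}' in place of
cut, so that any run of blanks separates the fields; that swap, the
sample rows, and the scratch directory are my assumptions.

```shell
#!/bin/bash
# Illustrative setup: a scratch directory and the sample rows from the question.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

# Distinct first-column values; awk splits fields on any run of blanks.
numb=($(awk '{print $1}' atest.dat | sort -u))

# One pass over the file per distinct value, one output file per value.
for i in "${numb[@]}"; do
    awk -v i="$i" '$1 == i' atest.dat > "numb${i}.txt"
done

ls numb*.txt
```

No file is created for values such as 3 that never appear in the first
column.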

One inefficient thing about this is that you read the file once per
candidate value (one hundred times if your loop covers the whole range
1 to 100) or one more than the number of distinct values in the first
column (in my modified version).  There are various alternatives, such
as writing a script in some language that appends to the appropriate
file in each case, which could even be a loop in bash.  For example

while read -r -a line; do
   echo "${line[@]}" >> numb"${line[0]}".txt
done < atest.dat

This might not have faster overall I/O performance than the original
version because it will have to constantly open and close each file,
and also won't be able to do buffered writes.  However, it will only
read through the original file once.

Another option could be to write the script in another language and
hold all of the files open, with references to them in a hash table,
and generate appropriate writes on each file by looking up its file
descriptor in the hash table.
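
awk is one such language: it keeps every file named in an output
redirection open until the program exits or calls close(), so its
internal table of open files plays the role of the hash table. A sketch,
again assuming the sample rows from your message and a scratch directory:

```shell
#!/bin/bash
# Illustrative setup: a scratch directory and the sample rows from the question.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

# Single pass: each line goes to the file named after its first field.
# The parentheses make sure the whole concatenation is the file name.
awk '{ print > ("numb" $1 ".txt") }' atest.dat

ls numb*.txt
```

Unlike >> in the bash loop, > here truncates each file when it is first
opened and then keeps appending for the rest of the run. Some awk
implementations may limit how many files can be open at once; with at
most 100 distinct values that seems unlikely to matter.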

Another option could be to sort the file first.  Then an advantage is
that you know when to switch to a new output file because the first
field of the input line changes.  However, with most sorts you may lose
the original relative order of the input lines.
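
A sketch of the sorted approach, assuming a sort that supports -s (a
stable-sort option in GNU and BSD sort, not required by POSIX), which
keeps lines with equal keys in their input order. The shuffled sample
rows and the scratch directory are illustrative assumptions.

```shell
#!/bin/bash
# Illustrative setup: the sample rows from the question, deliberately shuffled.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '4 ah6 36' '1 ah2 26' '2 ah2 54' '4 ah5 35' '1 ah1 25' > atest.dat

# Sort numerically on the first field only; -s keeps equal keys in input order.
# awk then switches output files whenever the first field changes, closing
# the previous file so that only one output file is open at a time.
sort -s -n -k1,1 atest.dat |
awk 'NR == 1 || $1 != prev {
         if (out != "") close(out)
         out = "numb" $1 ".txt"
         prev = $1
     }
     { print > out }'

cat numb1.txt
```

Because each output file is opened exactly once, this avoids both the
repeated open and close of the bash loop and any limit on simultaneously
open files.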

-- 
Seth David Schoen <address@hidden>      |  No haiku patents
     http://www.loyalty.org/~schoen/        |  means I've no incentive to
  8F08B027A5DB06ECF993B4660FD4F0CD2B11D2F9  |        -- Don Marti


