Re: Import large field-delimited file with strings and numbers
From: Philip Nienhuis
Subject: Re: Import large field-delimited file with strings and numbers
Date: Thu, 11 Sep 2014 14:21:34 -0700 (PDT)
Joao Rodrigues wrote:
> On 08-09-2014 17:49, Philip Nienhuis wrote:
>>>
> <snip>
>>> Yet, csv2cell is orders of magnitude faster. I will break the big file
>>> into chunks (using fileread, strfind to determine newlines and fprintf)
>>> and then apply csv2cell chunk-wise.
>> Why do you need to break it up using csv2cell? AFAICS that reads the
>> entire
>> file and directly translates the data into "values" in the output cell
>> array, using very little temporary storage (the latter quite unlike
>> textscan/strread).
>> It does read the entire file twice, once to assess the required
>> dimensions
>> for the cell array, the second (more intensive) pass for actually reading
>> the data.
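
The two-pass strategy described above can be sketched in plain Octave. This is a simplified, illustrative stand-in, not the actual io-package implementation (the real csv2cell is compiled code and far faster); it reads the whole file into memory once, then makes one pass to size the output and one to fill it:

```octave
% Sketch of a two-pass delimited reader (illustrative only).
function c = two_pass_read (fname, sep)
  txt = fileread (fname);
  % Pass 1: find all lines and pre-size the cell array,
  % so it never has to grow incrementally.
  lns   = strsplit (strtrim (txt), "\n");
  nrows = numel (lns);
  ncols = numel (strfind (lns{1}, sep)) + 1;
  c = cell (nrows, ncols);
  % Pass 2: split each line into fields, converting numbers where possible.
  for i = 1:nrows
    flds = strsplit (lns{i}, sep);
    for j = 1:min (numel (flds), ncols)
      v = str2double (flds{j});
      if (isnan (v))
        c{i, j} = flds{j};   % non-numeric field: keep as string
      else
        c{i, j} = v;
      endif
    endfor
  endfor
endfunction
```

Pre-sizing the cell array in the first pass is the point of the two passes: appending rows to a growing cell array would reallocate repeatedly and burn far more temporary memory.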
> The file I want to read has around 35 million rows, 15 columns and takes
> 200 MB of disk space: csv2cell simply ate up all memory and the computer
> stopped responding.
>
> I tried to feed it small chunks of increasing size and found out that it
> behaved well until it received a chunk of 500 million rows (when memory
> use went through the stratosphere).
>
> So I opted for the clumsy solution of breaking the file into small
> pieces and spoon-feeding csv2cell.
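
That chunk-wise workaround can be sketched roughly as follows (illustrative only: the file name and chunk size are assumptions, and this presumes every chunk of that size contains at least one newline):

```octave
pkg load io                            % for csv2cell
chunk_bytes = 50e6;                    % ~50 MB of text per chunk (assumption)
fid  = fopen ("2013.annual.singlefile.csv", "r");
rest = "";                             % carry-over: trailing partial line
while (! feof (fid))
  buf = [rest, fread(fid, chunk_bytes, "*char")'];
  nl  = strfind (buf, "\n");
  if (isempty (nl))
    rest = buf;
    continue;                          % no complete line in this chunk yet
  endif
  rest = buf(nl(end)+1:end);           % save the partial last line for next round
  tmp  = tempname ();
  tfid = fopen (tmp, "w");
  fputs (tfid, buf(1:nl(end)));        % write only whole lines to a temp file
  fclose (tfid);
  part = csv2cell (tmp);               % parse just this chunk
  unlink (tmp);
  % ... process "part" here and discard it before reading the next chunk
endwhile
fclose (fid);
```

Keeping only one parsed chunk in the workspace at a time bounds peak memory at roughly the chunk size plus csv2cell's overhead for that chunk.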
>
> But then I found out something interesting. If I saved a cell with
> 35 million rows and only 3 columns in gzip format it took very
> little disk space (20 MB or so), but when I tried to open it... it
> again took forever and ate up GBs of memory.
>
> Bottom line: I think it has to do with the way Octave allocates memory
> to cells, which is not very efficient (as opposed to dense or sparse
> numerical data, which it handles very well).
>
> I managed to solve the problem I had, thanks to the help of you guys.
>
> However, I think it would probably be nice if in future versions of
> Octave there was something akin to ulimit installed by default to
> prevent a process from eating up all available memory.
>
> If someone wants to check this issue the data I am working with is public:
>
> http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip
>
> where * = 1990:2013
>
> http://www.bls.gov/cew/datatoc.htm explains the content.
I downloaded the 2013 file and gave it a try with csv2cell in a 64-bit
Octave. csv2cell() didn't even need the new headerlines parameter - it is a
neat .csv file from top to bottom.
Results:
>> tic; data = csv2cell ('2013.annual.singlefile.csv'); toc
Elapsed time is 20.2152 seconds.
>> size (data)
ans =

   3565139        15

>> whos
Variables in the current scope:

   Attr Name        Size                 Bytes  Class
   ==== ====        ====                 =====  =====
        ans         1x2                     16  double
        data        3565139x15       354645851  cell

Total is 53477087 elements using 354645867 bytes
>>
...and Octave's memory usage is ~4.6 GB (total occupied RAM on my Win7-64b
box was 5.75 GB). So you'd need at least a 64-bit Octave + 64-bit OS. For
Windows an (experimental but IMO fairly good) 64-bit Octave is available
these days.
Even after stripping away the rightmost columns, saving the result to a .mat
file, restarting Octave and reading back the .mat file, Octave still needs >
4 GB to read the file. Once in the workspace the data occupies > 2 GB RAM,
while according to "whos" the cell array (3565139 x 4) occupies ~100 MB.
Puzzling numbers... as you say, Octave apparently needs a lot more RAM
behind the scenes to hold such big cell arrays.
Philip
--
View this message in context:
http://octave.1599824.n4.nabble.com/Import-large-field-delimited-file-with-strings-and-numbers-tp4666380p4666469.html
Sent from the Octave - General mailing list archive at Nabble.com.