From: João Rodrigues
Subject: Re: Import large field-delimited file with strings and numbers
Date: Mon, 08 Sep 2014 18:54:22 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
On 08-09-2014 17:49, Philip Nienhuis wrote:
>> The file I want to read has around 35 million rows, 15 columns and takes 200 MB of disk space: csv2cell would simply eat up all memory and the computer stopped responding.
>> <snip>
>> Yet, csv2cell is orders of magnitude faster. I will break the big file into chunks (using fileread, strfind to determine newlines and fprintf) and then apply csv2cell chunk-wise.
>
> Why do you need to break it up for csv2cell? AFAICS that reads the entire file and directly translates the data into "values" in the output cell array, using very little temporary storage (the latter quite unlike textscan/strread). It does read the entire file twice, once to assess the required dimensions for the cell array, and a second (more intensive) pass to actually read the data.
I tried to feed it small chunks of increasing size and found that it behaved well until it received a chunk of 500 million rows (at which point memory use went through the stratosphere).
So I opted for the clumsy solution of breaking the file into small pieces and spoon-feeding them to csv2cell.
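
For reference, here is a minimal sketch of that chunk-wise approach (fileread + strfind to locate newlines, fprintf into temporary files, then csv2cell on each piece). The file name, separator and chunk size are only placeholders, and it assumes the io package is installed:

  pkg load io                               # csv2cell comes from the io package

  fname = "2013.annual.singlefile.csv";     # placeholder file name
  txt   = fileread (fname);                 # whole file as one char array
  nl    = strfind (txt, "\n");              # positions of all newlines
  rows_per_chunk = 1e6;                     # placeholder chunk size

  data  = {};
  start = 1;
  for k = rows_per_chunk : rows_per_chunk : numel (nl)
    stop = nl(k);
    tmp  = tempname ();
    fid  = fopen (tmp, "w");
    fprintf (fid, "%s", txt(start:stop));   # dump this chunk to a temp file
    fclose (fid);
    data  = [data; csv2cell(tmp, ",")];     # let csv2cell parse the small file
    delete (tmp);
    start = stop + 1;
  endfor
  if (start <= numel (txt))                 # trailing partial chunk, if any
    tmp = tempname ();
    fid = fopen (tmp, "w");
    fprintf (fid, "%s", txt(start:end));
    fclose (fid);
    data = [data; csv2cell(tmp, ",")];
    delete (tmp);
  endif

(The first chunk still carries the header row; all chunks have the same number of columns, so the vertical concatenation works.)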
But then I found something interesting: if I saved a cell array with 35 million rows and only 3 columns in gzip format, it took very little disk space (20 MB or so), but when I tried to load it again it took forever and ate up GBs of memory.
Bottom line: I think it has to do with the way Octave allocates memory to cells, which is not very efficient (as opposed to dense or sparse numerical data, which it handles very well).
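
A quick way to see that per-cell overhead (the sizes here are just illustrative, and much smaller than my data so it runs quickly):

  x = rand (1e6, 3);   # dense matrix: 3e6 doubles, about 24 MB
  c = num2cell (x);    # same values as a cell array: each element carries its
                       # own bookkeeping, so memory use is several times larger
  whos x c             # compare the reported byte counts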
I managed to solve the problem I had, thanks to the help of you guys. However, I think it would be nice if future versions of Octave shipped with something akin to ulimit enabled by default, to prevent a single process from eating up all available memory.
If someone wants to check this issue, the data I am working with is public: http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip where * = 1990:2013. http://www.bls.gov/cew/datatoc.htm explains the content.