From: Philip Nienhuis
Subject: Re: Textscan and csv fitness data problem
Date: Wed, 3 Jan 2018 23:47:50 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 SeaMonkey/2.48
Ben Abbott wrote:
> On Jan 3, 2018, at 12:19 AM, PhilipNienhuis <address@hidden> wrote:
>> bpabbott wrote:
>>> On Jan 1, 2018, at 11:51 AM, PhilipNienhuis <pr.nienhuis@> wrote:
>>>> NJank wrote:
>>>>> On Jan 1, 2018 12:13 PM, "PhilipNienhuis" <pr.nienhuis@> wrote:
>>>>>> NJank wrote:
>>>>>>> On Jan 1, 2018 9:06 AM, "Thomas Esbensen"
>>>>>>
>>>>>> As to textscan, Dan did a lot of good work lately; I think the bugs you implied have been fixed in the development branch.
>>>>>
>>>>> Yeah, I noticed that. Would those make it into a 4.2.2, or not until 4.4.0? I've been keeping my fingers crossed that it would suddenly "just work" and I wouldn't have to dive into his data again.
>>>>
>>>> Have a look in the log: http://hg.savannah.gnu.org/hgweb/octave
>>>> Bugs 52116 and 52479 have been fixed on stable; the last one (bug 52550) has not. If you want, you can ask in the latter bug report to backport it to stable.
>>>> As to csv2cell's erroneous column conversion, I've fixed that stupid bug and pushed it. To use it, get csv2cell.cc from here:
>>>> http://hg.code.sf.net/p/octave/io/file/31b7ff5ee040/src/csv2cell.cc
>>>> and then do "mkoctfile csv2cell.cc" to build a fixed version. Swap it into place: use "pkg load io; which csv2cell" to find out where it should live, followed by "pkg unload io; clear -f" to clear the way for copying (otherwise csv2cell.oct is locked), and then copy csv2cell.oct into place.
>>>> Philip
>>>
>>> The original file has lines with a varied number of columns. As a result …
>>> error: csv2cell: incorrect CSV file, line 2 too short
>>> The first row holds the column labels (127 of them), and the 2nd row only has 19 columns (18 commas). There are other rows deep in the file with 127 columns too.
>>
>> Sure, but if you try csv2cell with a spreadsheet-style range as 2nd argument, it'll read .csv files with a varying nr. of data fields per row just fine; see my first answer in this thread. If you want to read all of the file, just supply a sufficiently large range; it'll fill empty fields beyond the current line length with "" (empty string).
>> See "help csv2cell". The only practical limit is the max line length of 4096 chars (a #DEFINEd setting; changing that is easy, as csv2cell() is just an .oct file).
>> (Of course, as usual I can only vouch for csv2cell() to work fine on the 4 boxes I have access to: my 2 multiboot Linux/Win7/Win10 boxes + 2 Win7 boxes at work.)
>> Philip
>
> Ok. I'm not familiar with the history behind the default behavior. I was expecting the default behavior to load the full csv file. Given the current design, I don't know how to determine the actual size of the file. Meaning when a range of rows/cols is specified, there is no way to be sure all the information is included.
Sure, I sympathize with your (and probably anyone else's) expectation. But csv2cell isn't so flexible yet.
I usually inspect csv files with e.g. Notepad++ before feeding them to Octave. For csv2cell one can always specify a range that is sufficiently "wide" to contain all possible columns (max 4096, see [*] below). Afterwards one can invoke parsecell.m from the io package to separate text and numerical info; the resulting arrays are stripped of enveloping empty columns/rows. Alternatively, strip the empty outer columns by hand (e.g., by re-using the code in parsecell.m).
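That workflow might look as follows; a minimal sketch, in which the file name fitness.csv and the range width are assumptions (see "help csv2cell" and "help parsecell" for the exact signatures and return values):

```octave
pkg load io

## Read a ragged .csv file by giving an explicitly over-sized range:
## "A1:GR100000" spans 200 columns by 100000 rows, which merely needs
## to be at least as wide and long as the data. Fields beyond a short
## line's length come back as "" (empty string).
c = csv2cell ("fitness.csv", "A1:GR100000");

## Split the raw cell array into numeric and text parts; parsecell
## also strips the enveloping empty rows/columns.
[numarr, txtarr] = parsecell (c);
```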
csv files can also be read into Octave through LibreOffice (or Excel), using xlsread or odsread.
> If the behavior were changed such that "error: csv2cell: incorrect CSV file, line 2 too short" were replaced by "warning: csv2cell: line 2 has fewer columns than the prior lines", and the entire file were read, would there be an adverse impact on compatibility?
Compatibility? You mean with the competition? Matlab has neither csv2cell nor my other "easy" function to read mixed-type delimited files.
When implementing csv2cell's "range" option some io package releases ago, I changed the actual reading part so that variable numbers of fields per line are now easily coped with. The crux is efficiently finding the required number of columns, so that the output array can also be preallocated efficiently. For a small file, an initial 4096 columns and resizing afterwards could be fine, but I've read files up to GB size with csv2cell, and cutting down on the initial output array size becomes vital then.
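The two-pass idea described above (first find the required number of columns, then preallocate) can be sketched in plain Octave. This is only an illustration of the approach, not the actual csv2cell internals; the helper name read_ragged_csv is made up, and this naive version ignores quoted fields that contain separators:

```octave
function c = read_ragged_csv (fname, sep)
  ## First pass: count lines and the maximum number of fields on any
  ## line, so the output cell array can be preallocated in one go.
  fid = fopen (fname, "r");
  nrows = 0;
  ncols = 0;
  ln = fgetl (fid);
  while (ischar (ln))
    nrows++;
    ncols = max (ncols, numel (strfind (ln, sep)) + 1);
    ln = fgetl (fid);
  endwhile
  ## Second pass: fill the preallocated array; fields beyond a short
  ## line's length stay "" (empty string), as csv2cell does.
  frewind (fid);
  c = repmat ({""}, nrows, ncols);
  for ii = 1:nrows
    flds = strsplit (fgetl (fid), sep);
    c(ii, 1:numel (flds)) = flds;
  endfor
  fclose (fid);
endfunction
```

A real implementation also has to honor quoting/escaping rules and a line-length limit, which is where most of csv2cell's complexity lives.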
(10^6 lines (not extreme) times 4096 columns already needs 64-bit indexing.)
I'm open to suggestions, but the implementation will be in a future io-2.4.10 release, as I think this needs careful thinking over. (FYI, yesterday I opened a ticket for an io-2.4.9 release.)
Philip

[*] The line buffer currently is 4096 characters. A line can consist of just consecutive separators, separating 4096 empty fields (or is it 4097 empty fields?). So that is the default max nr. of columns to take into account.