From: Philip Nienhuis
Subject: Re: Textscan and csv fitness data problem
Date: Wed, 3 Jan 2018 23:47:50 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 SeaMonkey/2.48
Ben Abbott wrote:
> On Jan 3, 2018, at 12:19 AM, PhilipNienhuis <address@hidden> wrote:
>> bpabbott wrote:
>>> On Jan 1, 2018, at 11:51 AM, PhilipNienhuis <pr.nienhuis@> wrote:
>>>> NJank wrote:
>>>>> On Jan 1, 2018 12:13 PM, "PhilipNienhuis" <pr.nienhuis@> wrote:
>>>>>> NJank wrote:
>>>>>>> On Jan 1, 2018 9:06 AM, "Thomas Esbensen"
>>>>>>
>>>>>> As to textscan, Dan did a lot of good work lately; I think the bugs you implied have been fixed in the development branch.
>>>>>
>>>>> Yeah, I noticed that. Would those make it into a 4.2.2, or not until 4.4.0? I've been keeping my fingers crossed that it would suddenly "just work" and I wouldn't have to dive into his data again.
>>>>
>>>> Have a look in the log: http://hg.savannah.gnu.org/hgweb/octave
>>>> Bugs 52116 and 52479 have been fixed on stable; the last one (bug 52550) has not. If you want, you can ask in the latter bug report to backport it to stable.
>>>> As to csv2cell's erroneous column conversion, I've fixed that stupid bug and pushed it. To use it, get csv2cell.cc from here:
>>>> http://hg.code.sf.net/p/octave/io/file/31b7ff5ee040/src/csv2cell.cc
>>>> and then do "mkoctfile csv2cell.cc" to build a fixed version. Swap it into place: use "pkg load io; which csv2cell" to find out where it should live, followed by "pkg unload io; clear -f" to clear the way for copying (otherwise csv2cell.oct is locked), and then copy csv2cell.oct into place.
>>>> Philip
>>>
>>> The original file has lines with a varied number of columns. As a result …
>>> error: csv2cell: incorrect CSV file, line 2 too short
>>> The first row holds the column labels (127 of them), and the 2nd row only has 19 columns (18 commas). There are other rows deep in the file with 127 columns too.
>>
>> Sure, but if you try csv2cell with a spreadsheet-style range as 2nd argument, it'll read .csv files with a varying nr. of data fields per row just fine; see my first answer in this thread. If you want to read all of the file, just supply a sufficiently large range; it'll fill empty fields beyond the current line length with "" (empty string).
>> See "help csv2cell". The only practical limit is the max line length of 4096 chars (a #DEFINEd setting; changing that is easy, as csv2cell() is just an .oct file).
>> (Of course, as usual I can only vouch for csv2cell() to work fine on the 4 boxes I have access to: my 2 multiboot Linux/Win7/Win10 boxes + 2 Win7 boxes at work.)
>> Philip
>
> Ok. I'm not familiar with the history behind the default behavior. I was expecting the default behavior to load the full csv file. Given the current design, I don't know how to determine the actual size of the file. Meaning when a range of rows/cols is specified, there is no way to be sure all the information is included.
Sure, I sympathize with your (and probably anyone else's) expectation. But csv2cell isn't so flexible yet.
I usually inspect csv files with e.g. Notepad++ before feeding them to Octave. For csv2cell one can always specify a range that is sufficiently "wide" to contain all possible columns (max 4096, see [*] below). Afterwards one can invoke parsecell.m from the io package to separate text and numerical info; the resulting arrays are stripped of enveloping empty columns/rows. Alternatively, strip the empty outer columns by hand (e.g., by re-using the code in parsecell.m).
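That workflow might look as follows; a minimal sketch, in which the file name fitness.csv and the range width are assumptions (see "help csv2cell" and "help parsecell" for the exact signatures and return values):

```octave
pkg load io

## Read a ragged .csv file by giving an explicitly over-sized range:
## "A1:GR100000" spans 200 columns by 100000 rows, which merely needs
## to be at least as wide and long as the data. Fields beyond a short
## line's length come back as "" (empty string).
c = csv2cell ("fitness.csv", "A1:GR100000");

## Split the raw cell array into numeric and text parts; parsecell
## also strips the enveloping empty rows/columns.
[numarr, txtarr] = parsecell (c);
```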
csv files can also be read into Octave through LibreOffice (or Excel), using xlsread or odsread.
> If the behavior were changed such that "error: csv2cell: incorrect CSV file, line 2 too short" were replaced by "warning: csv2cell: line 2 has fewer columns than the prior lines", and the entire file were read, would there be an adverse impact on compatibility?
Compatibility? You mean with the competition? Matlab has neither csv2cell nor my other "easy" function to read mixed-type delimited files.
When implementing csv2cell's "range" option some io package releases ago, I changed the actual reading part so that variable numbers of fields per line are now easily coped with. The crux is efficiently finding the required number of columns, so that the output array can also be preallocated efficiently. For a small file, an initial 4096 columns and resizing afterwards could be fine, but I've read files up to GB size with csv2cell, and cutting down on the initial output array size becomes vital then.
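The two-pass idea described above (first find the required number of columns, then preallocate) can be sketched in plain Octave. This is only an illustration of the approach, not the actual csv2cell internals; the helper name read_ragged_csv is made up, and this naive version ignores quoted fields that contain separators:

```octave
function c = read_ragged_csv (fname, sep)
  ## First pass: count lines and the maximum number of fields on any
  ## line, so the output cell array can be preallocated in one go.
  fid = fopen (fname, "r");
  nrows = 0;
  ncols = 0;
  ln = fgetl (fid);
  while (ischar (ln))
    nrows++;
    ncols = max (ncols, numel (strfind (ln, sep)) + 1);
    ln = fgetl (fid);
  endwhile
  ## Second pass: fill the preallocated array; fields beyond a short
  ## line's length stay "" (empty string), as csv2cell does.
  frewind (fid);
  c = repmat ({""}, nrows, ncols);
  for ii = 1:nrows
    flds = strsplit (fgetl (fid), sep);
    c(ii, 1:numel (flds)) = flds;
  endfor
  fclose (fid);
endfunction
```

A real implementation also has to honor quoting/escaping rules and a line-length limit, which is where most of csv2cell's complexity lives.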
(10^6 lines (not extreme) times 4096 columns already needs 64-bit indexing.)
I'm open to suggestions, but the implementation will be in a future io-2.4.10 release, as I think this needs careful thinking over. (FYI, yesterday I opened a ticket for an io-2.4.9 release.)
Philip

[*] The line buffer currently is 4096 characters. A line can consist of just consecutive separators, separating 4096 empty fields (or is it 4097 empty fields?). So that is the default max nr. of columns to take into account.