[Octave-patch-tracker] [patch #8140] Speed up importdata() ASCII CSV pro

From:	Dan Sebald
Subject:	[Octave-patch-tracker] [patch #8140] Speed up importdata() ASCII CSV processing using dlmread() as core
Date:	Wed, 31 Jul 2013 04:36:43 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 SeaMonkey/2.15

URL:
  <http://savannah.gnu.org/patch/?8140>

                 Summary: Speed up importdata() ASCII CSV processing using
dlmread() as core
                 Project: GNU Octave
            Submitted by: sebald
            Submitted on: Wed 31 Jul 2013 04:36:42 AM GMT
                Category: None
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

I reworked the importdata script to use dlmread() as the core.  Last night I
was working with a relatively small CSV file and thought the loading times
were much greater than they should be.

Here are some CPU times for various parts of the importdata routine (the size
of the data is 7383 x 5):

ans =  0.0099990
ans =  0.089986
ans = 0
ans = 0
ans = 0
ans =  0.097985
ans =  0.49592
ans =  3.6494

The main thing to note is that the first stages, which involve the regexp
routine, are rather efficient, while the last stages, which involve double
looping, are quite the opposite.

By contrast there is dlmread(), which I think has enough flexibility with its
arguments to handle the importdata CSV ASCII case.  It is so efficient that I
think a better approach is to:

1) Just fscanf the first header lines of the file (as opposed to reading in
the whole data file)

2) Use dlmread() to do all the work, which places NaN for the cases where the
conversion failed

3) Look at the data matrix for any NaN and then retroactively read in the data
file and then compute where the associated lines are.  I think I've done it
efficiently so that every entry of the file need not be extracted, just the
lines where the NaN occurred.
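
The three steps above can be sketched roughly as follows.  This is only an
illustration, not the code in the attached patch: the sample file contents,
the fixed header count, and the variable names are all assumptions, and it
relies on dlmread() filling fields it cannot convert with the "emptyvalue".

```octave
# Hypothetical sample file: one header row, one text entry among the numbers.
fn = tempname ();
fid = fopen (fn, "w");
fputs (fid, "A,B,C\n1,2,3\n4,abc,6\n");
fclose (fid);

header_rows = 1;   # assumed known here; a real routine would detect it
delim = ",";

# Step 1: read only the header lines, not the whole file.
fid = fopen (fn, "r");
header = cell (header_rows, 1);
for i = 1:header_rows
  header{i} = fgetl (fid);
endfor
fclose (fid);

# Step 2: let dlmread() do the numeric work; fields that fail to
# convert are assumed to come back as the "emptyvalue" (NaN here).
data = dlmread (fn, delim, header_rows, 0, "emptyvalue", NaN);

# Step 3: only if NaN is present, re-read the file as text and pick
# out just the lines where a NaN occurred.
textdata = {};
[r, c] = find (isnan (data));
if (! isempty (r))
  lines = strsplit (fileread (fn), "\n");
  for k = 1:numel (r)
    fields = strsplit (lines{header_rows + r(k)}, delim);
    textdata{r(k), c(k)} = fields{c(k)};
  endfor
endif

delete (fn);
```

The point of the design is that the expensive text handling in step 3 runs
only for the rows that dlmread() could not fully convert.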

The last step slows things down, but it is still pretty efficient.  Here is
the CPU consumption for stages of the revamped importdata:


octave:460> aa = importdata_new ('foo.csv');
ans =    1.0000e-03
ans =  0.029996
ans = 0


Here are the results when I place a couple text strings amongst the data
columns:


octave:461> aa = importdata_new ('foo_b.csv');
ans = 0
ans =  0.033995
ans =  0.18297


Having to pull the data back in and apply regexp adds some time, but compared
to the current importdata.m it is still rather minuscule.

There are three tests that fail after applying the patch.  We can discuss
those.  Basically, I don't agree with some of the results:



%!test
%! # Header
%! A.data = [3.1 -7.2 0; 0.012 6.5 128];
%! A.textdata = {"This is a header row."; \
%!               "this row does not contain any data, but the next one does."};
%


I think that treating text containing spaces as textdata rules out both using
the space character as a delimiter and automatically recognizing column names.
For example, if the first lines of my data file were


TIME VOLTAGE DISPLACEMENT
0 3.3 0.137
0.25 3.4 0.148
0.5 3.6 0.150


how can we tell whether the first line holds data column titles or is just
some textdata?
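
A small sketch of the ambiguity, using the lines above plus one made-up
free-text line (the "Recorded last Tuesday" string is hypothetical): when
split on spaces, both produce as many tokens as there are data columns, so
the token count alone cannot separate column titles from plain text.

```octave
# Number of columns in the first data row from the example above.
ncols = numel (strsplit ("0 3.3 0.137", " "));

# A line of column titles and a hypothetical free-text remark.
titles = strsplit ("TIME VOLTAGE DISPLACEMENT", " ");
remark = strsplit ("Recorded last Tuesday", " ");

# Both lines match the column count once split on spaces.
titles_match = (numel (titles) == ncols);
remark_match = (numel (remark) == ncols);
printf ("titles match: %d, remark matches: %d\n", titles_match, remark_match);
```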


%!test
%! # Missing values
%! A = [3.1 NaN 0; 0.012 6.5 128];


The above test produces the correct output.data.  However, the test expects
only the numeric data, whereas the new routine also creates output.textdata
for the NaN entry, which happens to be an empty string.  Isn't that the proper
result?



%!test
%! # CR for line breaks
%! A = [3.1 -7.2 0; 0.012 6.5 128];
%! fn  = tmpnam ();
%! fid = fopen (fn, "w");
%! fputs (fid, "3.1\t-7.2\t0\r0.012\t6.5\t128");


The new version of importdata fails on the above test, and it would be easy to
correct as a first step by searching and replacing any \r with \n.  However, I
wonder if the proper fix for this would be a simple addition to dlmread().  So
let's hold off on this test until we are certain where it should be fixed.
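
For reference, the search-and-replace workaround could look something like
the sketch below.  It is not part of the patch; the variable names are made
up, and the pattern also folds a \r\n pair into a single \n so that
DOS-style files do not end up with doubled line breaks.

```octave
# Text as written by the failing test: rows separated by a bare CR.
raw = "3.1\t-7.2\t0\r0.012\t6.5\t128";

# Replace each CR, or CR followed by LF, with a single LF.
fixed = regexprep (raw, "\r\n?", "\n");

# The text now splits into the two expected rows.
rows = strsplit (fixed, "\n");
```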



    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Wed 31 Jul 2013 04:36:42 AM GMT  Name:
octave-importdata_rework-2013jul30.patch  Size: 9kB   By: sebald

<http://savannah.gnu.org/patch/download.php?file_id=28717>

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/patch/?8140>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
