[Octave-patch-tracker] [patch #8140] Speed up importdata() ASCII CSV pro
From: Dan Sebald
Subject: [Octave-patch-tracker] [patch #8140] Speed up importdata() ASCII CSV processing using dlmread() as core
Date: Wed, 31 Jul 2013 04:36:43 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 SeaMonkey/2.15
URL:
<http://savannah.gnu.org/patch/?8140>
Summary: Speed up importdata() ASCII CSV processing using
dlmread() as core
Project: GNU Octave
Submitted by: sebald
Submitted on: Wed 31 Jul 2013 04:36:42 AM GMT
Category: None
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Email:
Open/Closed: Open
Discussion Lock: Any
_______________________________________________________
Details:
I reworked the importdata script to use dlmread() as the core. Last night I
was working with a relatively small CSV file and thought the loading times
were much greater than they should be.
Here are some CPU times for various parts of the importdata routine (the size
of the data is 7383 x 5):
ans = 0.0099990
ans = 0.089986
ans = 0
ans = 0
ans = 0
ans = 0.097985
ans = 0.49592
ans = 3.6494
The main thing to note from this is that the first stages, which involve
the regexp routine, are rather efficient, while the last stages, which involve
double looping, are quite the opposite.
That led me to dlmread(), which I think has enough flexibility with its
arguments to handle the importdata ASCII CSV case. It is so efficient that I
think a better approach is to
1) Just fscanf the first header lines of the file (as opposed to reading in
the whole data file)
2) Use dlmread() to do all the work, which places NaN for the cases where the
conversion failed
3) Look at the data matrix for any NaN and then retroactively read in the data
file and then compute where the associated lines are. I think I've done it
efficiently so that every entry of the file need not be extracted, just the
lines where the NaN occurred.
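The three steps above can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the sample file contents, the variable names, the use of fgetl() instead of fscanf() for the header, and the "emptyvalue" option (used here to make dlmread() fill unconverted fields with NaN) are all assumptions. It also assumes the file contains no blank lines, so that data row r sits on file line nheader + r.

```octave
## Made-up sample file: one header line and one text field among the data.
fname = tempname ();
fid = fopen (fname, "w");
fputs (fid, "TIME,VOLTAGE,DISPLACEMENT\n0,3.3,0.137\n0.25,bad,0.148\n0.5,3.6,0.150");
fclose (fid);
delim = ",";
nheader = 1;   # number of header lines, assumed known here

## 1) Read only the header lines rather than the whole file.
textdata = cell (nheader, 1);
fid = fopen (fname, "r");
for k = 1:nheader
  textdata{k} = fgetl (fid);
endfor
fclose (fid);

## 2) Let dlmread() do the bulk conversion, skipping the header rows.
##    Fields that are empty or fail to convert are filled with NaN.
data = dlmread (fname, delim, nheader, 0, "emptyvalue", NaN);

## 3) Go back to the file only for the rows that contain a NaN.
text_fields = {};
bad_rows = find (any (isnan (data), 2));
if (! isempty (bad_rows))
  file_lines = strsplit (fileread (fname), "\n", "collapsedelimiters", false);
  for r = bad_rows.'
    fields = strsplit (file_lines{nheader + r}, delim, ...
                       "collapsedelimiters", false);
    for c = find (isnan (data(r,:)))
      if (c <= numel (fields))
        text_fields{end+1} = fields{c};   # original text behind the NaN
      endif
    endfor
  endfor
endif

unlink (fname);
```

Only the flagged lines are split into fields, so a file with no failed conversions never reaches the inner loops.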
The last step slows things down, but it is still pretty efficient. Here is
the CPU consumption for stages of the revamped importdata:
octave:460> aa = importdata_new ('foo.csv');
ans = 1.0000e-03
ans = 0.029996
ans = 0
Here are the results when I place a couple text strings amongst the data
columns:
octave:461> aa = importdata_new ('foo_b.csv');
ans = 0
ans = 0.033995
ans = 0.18297
Having to pull the data back in and apply regexp adds some time, but compared
to the current importdata.m it is still minuscule.
There are three tests that fail after applying the patch. We can discuss
those. Basically, I don't agree with some of the results:
%!test
%! # Header
%! A.data = [3.1 -7.2 0; 0.012 6.5 128];
%! A.textdata = {"This is a header row."; \
%! "this row does not contain any data, but the next one does."};
%
I think that treating text with spaces rules out using space characters as
delimiter and automatically recognizing column names. For example, if the
first lines of my data file were
TIME VOLTAGE DISPLACEMENT
0 3.3 0.137
0.25 3.4 0.148
0.5 3.6 0.150
how can we tell that the first line should be data column titles or just some
textdata?
%!test
%! # Missing values
%! A = [3.1 NaN 0; 0.012 6.5 128];
For the above test the new routine produces the correct output.data. However,
the test expects only the data matrix, whereas the new routine also creates
output.textdata for that NaN entry, which happens to be an empty string. Isn't
that the proper result?
%!test
%! # CR for line breaks
%! A = [3.1 -7.2 0; 0.012 6.5 128];
%! fn = tmpnam ();
%! fid = fopen (fn, "w");
%! fputs (fid, "3.1\t-7.2\t0\r0.012\t6.5\t128");
The new version of importdata fails on the above test, and it would be easy to
correct as a first step by searching and replacing any \r with \n. However, I
wonder if the proper fix for this would be a simple addition to dlmread(). So
let's hold off on this test until we are certain where it should be fixed.
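For reference, the search-and-replace workaround mentioned above might look like the sketch below. It is an assumption about how the first-step fix could be done, not part of the patch; it uses regexprep() rather than a plain replacement so that CRLF pairs collapse to a single newline instead of producing blank lines.

```octave
## The CR-only sample from the failing test.
raw = "3.1\t-7.2\t0\r0.012\t6.5\t128";

## Turn a lone CR, or a CR followed by LF, into a single LF.
fixed = regexprep (raw, "\r\n?", "\n");

## Write the normalized text out and let dlmread() parse it.
fn = tempname ();
fid = fopen (fn, "w");
fputs (fid, fixed);
fclose (fid);
A = dlmread (fn, "\t");
unlink (fn);
```

If dlmread() itself learned to treat a bare CR as a line break, this preprocessing step would be unnecessary.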
_______________________________________________________
File Attachments:
-------------------------------------------------------
Date: Wed 31 Jul 2013 04:36:42 AM GMT Name:
octave-importdata_rework-2013jul30.patch Size: 9kB By: sebald
<http://savannah.gnu.org/patch/download.php?file_id=28717>
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/patch/?8140>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/