Re: xlsread in Octave 3.6.4

On Mon, Sep 2, 2013 at 12:10 AM, Markus Bergholz <address@hidden> wrote:

On Sun, Sep 1, 2013 at 11:42 PM, PhilipNienhuis <address@hidden> wrote:

Markus Bergholz wrote

> now it's faster than matlab!!
> matlab takes ~100 seconds
> xlsxread in octave ~80 seconds
> http://p.osuv.de/index.php/ZuBLam/ (autodelete after 5 days)
> i will push my modifications later.
>
>

> On Sun, Jun 2, 2013 at 10:25 PM, Markus Bergholz <markuman@> wrote:
>
>>
>>
>>
>> On Sun, May 12, 2013 at 9:26 PM, Philip Nienhuis <pr.nienhuis@> wrote:
>>
>>> Markus Bergholz wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Wed, May 8, 2013 at 10:06 AM, PhilipNienhuis <pr.nienhuis@> wrote:
>>>>
>>>> Markus Bergholz wrote
>>>> > I haven't followed this thread and its issue, but i've written an
>>>> > xlsxread function which doesn't need java.
>>>> > but it's very very rudimentary, works only on linux and is a
>>>> > quick&dirty write-down.
>>>> > furthermore, you have to remove the string-analysis part if your
>>>> > sheet doesn't contain strings.
>>>> > but maybe it helps someone else, or someone wants to improve it, or
>>>> > someone rewrites it in c/c++ as an oct file to get it even faster
>>>> > than matlab (for me it's still faster than the java stuff atm).
>>>> >
>>>> > http://git.osuv.de/Octave/tree/functions/xlsxread.m
>>>>
>>>> The Java based options are relatively slow as they offer maximum
>>>> flexibility as regards data types.
>>>>
>>>> Before venturing into COM/ActiveX and Java based solutions for the io
>>>> pkg 4 years ago I looked at a few other solutions, similar to yours.
>>>> IIRC the most promising one was posted in an OpenWatcom news group.
>>>> All of them (i.e. the "free solutions") suffered from the same
>>>> limitations: lack of flexibility, lack of documentation, dependency on
>>>> some very specific development framework, and/or bound to specific
>>>> .xls formats (BIFF5, BIFF8, OOXML, what not).
>>>>
>>>> If you want I can look if your code can somehow be absorbed in the io
>>>> pkg as a sort of fall-back option.
>>>>
>>>>
>>>> i don't think that this is a good idea :D as i said, it only works on
>>>> linux (i'm using sed and unzip through the 'system' command).
>>>> furthermore, i made my own quick&dirty tmp-dir (mktemp -d would be
>>>> better). aaaaaand so on :)
>>>>
>>>> To that end it needs a suitable license
>>>>
>>>>
>>>> i don't care about the licence as long as it's a free licence.
>>>>
>>>> and someone should support/maintain it (my C/C++ skills are
>>>> rudimentary).
>>>>
>>>> Philip
>>>>
>>>>
>>>> my c/c++ skills are rudimentary too :)
>>>> if you like, we could code together on github on an xlsxread function,
>>>> for example.
>>>> it is not so difficult but it is extremely time-consuming to parse the
>>>> shitty ms xml format!! (i haven't read any specs yet, just did some
>>>> lousy reverse engineering).
>>>>
>>>
>>> Weighing the amount of work needed to build a good, robust and
>>> fool-proof C/C++-based xlsread backend versus already having available
>>> a well-tested choice of working (albeit relatively slow [1]) solutions,
>>> I just fail to see the benefits of reinventing the wheel.
>>>
>>> Just for the record & to emphasize an important aspect, I myself don't
>>> use xlsread (or xlswrite), I usually invoke the much more flexible

>>> xlsopen-xls2oct-[parsecell-]oct2xls-xlsclose sequences. So we'd be

>>> talking about another interface in xlsopen/xls2oct/xlsclose rather than
>>> xlsread.
>>>
>>> Philip
>>>
>>> [1] OpenOffice / LibreOffice are really fast for large spreadsheets, I
>>> doubt a 2-person amateur team can beat the OOo/LO devs as regards speed
>>> tuning; the only problem is start-up time of OOo/LO.
>>> Oh and there's a currently unsolvable Java-UNO issue outlined when you
>>> use it for the first time.
>>> BTW a while ago I had a try with Starbasic (& ActiveX) invoking
>>> LibreOffice for spreadsheet I/O. I already had some success, but I had
>>> to put it away due to lack of time. Maybe next summer I can look at it
>>> again. Maybe that can be made cross-platform too.
>>>
>>
>>
>> I've done a rewrite of my xlsxread function and pushed it to github
>> https://github.com/markuman/xlsxread/
>> it is ~10% faster now (still faster than the java version, but still
>> slow!)
>> Theoretically this could work on windows now too, but the unzip command
>> in octave doesn't accept the .xlsx extension:
>> warning: unrecognized file type, .xlsx
>> So i have to use a system command again (see lines 47-51
>> https://github.com/markuman/xlsxread/blob/master/xlsxread.m )
>> strings are not recognized either atm, so it's still limited.
>> if someone has an idea how to improve it, i'd like to see some forks :D
>>
>>
>>
>>
>>
>
>
> --
> icq: 167498924
> XMPP|Jabber: address@hidden
>
> _______________________________________________
> Help-octave mailing list
> Help-octave@
> https://mailman.cae.wisc.edu/listinfo/help-octave

Hi Markus,

Tonight I had a brief glance at your code and tried a few command lines from
your .m files. Nice stuff.
I encountered a few hurdles (e.g., no unzip binaries in the MXE builds for
Windows) but OK that was easily solved.

yes, this is already fixed. see: http://savannah.gnu.org/bugs/index.php?39148

A first try, concerning a simple xlsx file from my test suite with one text
string inside a square, otherwise numerical cell range, breaks in the
reshape stage because your regexp line doesn't recognize and thus skips
<f></f> tags that AFAICS seem to be used for booleans (rather than <v></v>
tags).
Note that the enclosing <c...> (column) tags indicate the cell type, so in
principle text strings can be extracted as well.

yes :) it's all not supported atm.
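
To illustrate the point about cell types, here is a minimal sketch in Python (not the xlsxread.m code). It relies on the SpreadsheetML convention that the `t` attribute of each `<c>` (cell) element names the type (`s` for a shared-string index, `b` for a boolean, numeric when absent), while `<f>` holds formula text with the cached result in `<v>`. The worksheet fragment, the `SHARED_STRINGS` list and the function name `read_cells` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hypothetical worksheet fragment, invented for illustration only.
SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><v>1.5</v></c>
      <c r="B1" t="s"><v>0</v></c>
      <c r="C1" t="b"><v>1</v></c>
      <c r="D1"><f>A1*2</f><v>3</v></c>
    </row>
  </sheetData>
</worksheet>"""

# Stand-in for the strings normally stored in xl/sharedStrings.xml.
SHARED_STRINGS = ["hello"]


def read_cells(sheet_xml, shared_strings):
    """Return {cell reference: value}, converted according to the t attribute."""
    cells = {}
    root = ET.fromstring(sheet_xml)
    for c in root.iterfind(".//m:c", NS):
        v = c.find("m:v", NS)
        if v is None or v.text is None:
            continue  # empty cell (inline strings are not handled in this sketch)
        ref = c.get("r")
        kind = c.get("t", "n")  # numeric is the default when t is absent
        if kind == "s":
            cells[ref] = shared_strings[int(v.text)]  # index into shared strings
        elif kind == "b":
            cells[ref] = v.text == "1"
        elif kind == "n":
            cells[ref] = float(v.text)  # for formula cells this is the cached result
        else:
            cells[ref] = v.text  # other types: keep the raw text
    return cells


cells = read_cells(SHEET_XML, SHARED_STRINGS)
```

A full XML parse like this is expected to be slower than a single regexp pass, which is the speed penalty discussed in this thread.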

I'd expect a next hurdle to be "merged" cells. But maybe that is easy.

It is probably not so hard to properly parse the xml worksheet files so that
text strings and booleans + probably formulas are read. But I am sure it
will induce a speed penalty.

yes, it will :)
my very first quick and dirty version ran one sed command to parse the file line by line.
http://git.osuv.de/Octave/tree/functions/xlsxread.m

this is the easiest but slowest (but still faster than java!) way to parse it.
i made the last changes ~3 months ago https://github.com/markuman/xlsxread/
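
The unzip-then-regexp idea can be sketched without any external `unzip` or `sed` binaries. The following Python sketch is an illustration under assumptions, not a port of xlsxread.m: it assumes the first worksheet lives at `xl/worksheets/sheet1.xml` (the usual location, but not guaranteed), that every cell is numeric, and that no cells are empty. `make_demo_xlsx` and `read_numeric_sheet` are hypothetical helper names; the demo archive only mimics the layout of a real .xlsx file.

```python
import io
import re
import zipfile


def make_demo_xlsx():
    """Build a tiny in-memory zip that mimics the layout of an .xlsx file."""
    xml = (
        "<worksheet><sheetData>"
        '<row r="1"><c r="A1"><v>1</v></c><c r="B1"><v>2.5</v></c></row>'
        '<row r="2"><c r="A2"><v>3</v></c><c r="B2"><v>4</v></c></row>'
        "</sheetData></worksheet>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/worksheets/sheet1.xml", xml)
    buf.seek(0)
    return buf


def read_numeric_sheet(xlsx, member="xl/worksheets/sheet1.xml"):
    """Read one worksheet straight from the archive and pull out <v> values.

    An .xlsx file is a zip archive, so no temporary directory is needed.
    Only <v> elements are matched, so strings, booleans and formulas are
    not interpreted (the same limitation discussed in this thread).
    """
    with zipfile.ZipFile(xlsx) as zf:
        text = zf.read(member).decode("utf-8")
    rows = []
    for row in re.findall(r"<row\b[^>]*>(.*?)</row>", text, flags=re.S):
        rows.append([float(v) for v in re.findall(r"<v>([^<]*)</v>", row)])
    return rows


data = read_numeric_sheet(make_demo_xlsx())
```

`read_numeric_sheet` also accepts a file name in place of the in-memory buffer, since `zipfile.ZipFile` takes either.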

but i've never pushed my last commit with a 10% working range-read regexp part (that's another breaking part).
So xlsxread is always on my mind, but i did roughly nothing in my semester break ;)
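
For the range-read part, one possible first step is turning a range string in A1 notation into numeric bounds. This is a minimal Python sketch of that step only; it is not taken from the repository, and the names `col_to_index` and `parse_range` are hypothetical.

```python
import re


def col_to_index(letters):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    idx = 0
    for ch in letters.upper():
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx


def parse_range(rng):
    """Split a range such as 'B2:D10' into (first_row, first_col, last_row, last_col)."""
    m = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", rng)
    if m is None:
        raise ValueError("not a valid A1-style range: %r" % rng)
    c1, r1, c2, r2 = m.groups()
    return (int(r1), col_to_index(c1), int(r2), col_to_index(c2))


bounds = parse_range("B2:D10")
```

With the bounds known, cells whose `r` attribute falls outside them can be skipped while reading the worksheet.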

In ~2-3 weeks i'll be more active again.

All in all I think the blazing speed you claim (a claim I believe as-is)
comes at the cost of robustness and some flexibility. To be able to be
included in the io package I think some of the speed has to be sacrificed to
get some more robust code that won't provoke too many bug reports.
BTW I saw str2num being used to convert text to doubles. Any reason for
that? I ask because str2double is known to be much faster.

indeed: https://github.com/markuman/xlsxread/search?q=str2num&ref=cmdform
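
A likely reason for the speed difference: Octave's str2num evaluates its argument with eval, whereas str2double only parses numbers and returns NaN for anything else. The Python sketch below is a rough analogy of that distinction, not Octave code; `to_number_eval` and `to_number_parse` are hypothetical names.

```python
import math


def to_number_eval(text):
    """Evaluate the text as an expression, as an eval-based converter would.

    This runs whatever expression it is given, so it must never be used
    on untrusted input.
    """
    return float(eval(text, {"__builtins__": {}}, {}))


def to_number_parse(text):
    """Parse the text as a plain number; return NaN if it is not one."""
    try:
        return float(text)
    except ValueError:
        return math.nan


evaluated = to_number_eval("1+1")   # the expression is executed
parsed = to_number_parse("3.14")    # plain numeric parsing
rejected = to_number_parse("1+1")   # not a number literal, so NaN
```

For spreadsheet cells that are already known to hold plain numbers, the parsing variant does less work per value and cannot execute cell content by accident.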

I don't know when I can have another look. Your code is promising though;
I'd like to amend and include it in the near future in the io package.
But to that end I hope you can make up your mind about the license. Would
you agree with GPL 3? I don't know if the current "do what the f**k you want
to" license is compatible with GPL 3 and thus compatible with the rest of
the io package.

GPL3 is fine too for me.
feel free to fork it on github and commit it with a new licence and the str2double replacement :P

i've made a few quick and dirty changes, changed to the gpl licence and committed the broken range part too.

https://github.com/markuman/xlsxread

it's now platform independent and - once again - faster than before (~58 seconds). now it's nearly twice as fast as matlab (~110 seconds).

enough time to waste on ranges, strings etc. in future.

Philip

--
View this message in context: http://octave.1599824.n4.nabble.com/xlsread-in-Octave-3-6-4-tp4652046p4656979.html

Sent from the Octave - General mailing list archive at Nabble.com.

--
icq: 167498924
XMPP|Jabber: address@hidden

From:	Markus Bergholz
Subject:	Re: xlsread in Octave 3.6.4
Date:	Mon, 2 Sep 2013 11:38:50 +0200