|
From: | Markus Bergholz |
Subject: | Re: Test files [WAS: Re: xlsread in Octave 3.6.4] |
Date: | Tue, 17 Sep 2013 22:02:29 +0200 |
Markus,
Before answering, just a quick question:
What mailer do you use?
- Its indentation seems tabified and (IMO) screws up readibility;
- Could you please strip my email address from your replies? I know it is easily reconstructed, but to just shout it all over the place....
Thanks
Read on...
Markus Bergholz wrote:
On Sun, Sep 15, 2013 at 9:56 PM, Philip Nienhuis wrote<snip>
:
<snip>
:
This weekend I'll sent you a few test Excel files with empty
row/columns, merged cells, dates, times, booleans, formulas,
erroneous
formulas and -consequently- error values in the cached value
attributes,
etc.
Attached. I've combined a few test spreadsheets I had lying around.
I used most of these for exploring ranges of occupied cells (which
in OOXML turns out to be extremely easy, the range is in one of the
first nodes of the worksheets) and cell value types. You'll find a
few spreadsheet formulas which -intentionally- contain errors.
Good luck with it,
Sorry, Matlab is actually right.
>
> * Mergedranges.xlsx
> looks nearly the same as with matlab R2010 at work. One difference is
> Matlab start read in column J. My xlsxread start at A as defined in the
> sheet. So i guess matlab ignore the real defined size.
> the values are the same. (as far as i quickly see).
I deliberately merged the ten "topleftmost" rows and columns into one *empty* cell. So your xlsxread.m fails at that hurdle. Oopsie...
Here is what xlsfinfo.m (in Octave's OF-io package, using LibreOffice ("UNO") interface) says:
=============================================================
>> xlsfinfo ("Mergedranges.xlsx", "uno")
1: Sheet1 (Used range ~ J1:AMJ15)
ans = Microsoft Excel Spreadsheet
============================================================
Now, in the expanded xml (expanded using 7-zip) I see:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
- <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="A1:AMJ15" />
Hey.....?? ^^^^^^^^^^^^^^^^^^^ ??
Octave (io package) does:
>
> * testOOXML.xlsx
> same for this file.
> the values are the same.
=============================================================
>> xlsfinfo ("testOOXML.xlsx", "uno")
1: FirstSheet (Used range ~ B2:AD9)
2: ThirdSheet (Used range ~ A4:AMJ39)
3: Sheet#5 (Used range ~ F2:Q8)
4: Sheet5 (Used range ~ B2:F14)
5: Sixth (Used range ~ A1:B3)
6: 7th (Used range ~ B2:K9)
7: Sheet_nr_3 (Used range ~ A2:M12)
ans = Microsoft Excel Spreadsheet
>>
=============================================================
...and that exactly matches the worksheets in question.
In the xml itself (again, expanded using 7-zip) I see:
testOOXML.xlsx, for example 4th worksheet ("Sheet5"):
---------------
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
- <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="B1:F65536" />
==> Wrong again. Should be B2:F14
I think this points to a more general issue/problem/bug with MS-Excel: it actually hands you the range of formatted cells (incl. empty ones!), rather than the range of cells with actual data contents.
So detecting the occupied cell range isn't quite as easy as I hoped it would be; it appears one cannot avoid scanning the entire worksheet :-(
Oh and this is the very reason that Matlab often gives you rows and columns of NaNs surrounding the actual data matrix. IMO Matlab behaves extremely dumb here.
In the xls2oct.m, ods2oct.m and parsecell.m functions in the OF io package I've added (default) options to strip away such empty rows and columns (and to keep track of which spreadsheet cell ranges the stripped data originate from).
Another issue I think needs some attention is that reading data values from raw spreadsheets as we do here, carries the risk of actually reading cached formula values rather than recalculated ones as MS-Excel and LibreOffice would do
(if invoked through ActiveX or Java). IOW, one might read outdated values (depending on recalculation settings in the spreadsheet program used to create the file).
Note that the same applies to JExcelAPI (JXL) and OpenXLS (OXS) for .xls/.xlsx, and ODF Toolkit (OTK) and jOpenDocument (JOD) for .ods.
Of the "pure" Java interfaces, only Apache POI (POI; .xls/.xlsx) has a built-in formula evaluator (that I do invoke in the relevant io package functions).
As you can guess, I don't think "we" are done yet :-)
>
> the question is now if (get rid of|change) line 102-109+122 to get the
> same wrong result as matlab or ignore this and we are done?
Philip
[Prev in Thread] | Current Thread | [Next in Thread] |