From: Przemek Klosowski
Subject: Re: Loading a large and unusually formatted dataset into an Octave matrix
Date: Wed, 19 Jun 2013 11:52:35 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130514 Thunderbird/17.0.6
On 06/18/2013 09:23 PM, Ben Abbott wrote:
> On Jun 18, 2013, at 02:01 PM, Przemek Klosowski wrote:
>> command="perl -F'\"' -lane 'print \"$F[5] $F[9]\"' /tmp/bitcoin";
>> a=reshape(fscanf(popen(command,'r'),"%f"),2,[]);
> From a bash prompt, your perl command works as expected.
Am I to infer that those two Octave commands fail to work for you? They
read the file for me (Octave 3.6.4 on 64-bit Fedora 19).
> If you modify this code (sscanf() for fputs()) to load a lot of data,
> the array(s) will be resized on each sscanf(). That will be
> inefficient.
I didn't check, but I think that there would be no Octave-level resizing
in the above a=reshape(...) command, unless fscanf does it internally.
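To illustrate the single-read point, here is a minimal sketch that uses sscanf() on an in-memory string in place of the popen() pipe, so it runs without the data file. The sample numbers are made up; only the call pattern mirrors the a=reshape(...) line above.

```octave
% Made-up stand-in for the perl output: one "x y" pair per line.
txt = "1 10\n2 20\n3 30\n";

% A single sscanf() call returns all values as one column vector,
% so no array is grown inside an Octave-level loop.
v = sscanf(txt, "%f");

% reshape() only reinterprets that vector: 2 rows, one column per pair.
a = reshape(v, 2, []);
```

Here a ends up with one input pair per column, which is the layout the fscanf()/reshape() combination produces for the real file.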
Actually, I found another way of reading the file that avoids external perl,
using textread()'s "delimiter" option. Sorry for droning on about it,
but I hope it will be useful to others. I often need to read odd files
into Octave and often find it awkward; this looks like a pretty
general approach:
b=textread("/tmp/bitcoin","%f","delimiter",'"');
It's the same trick of breaking each line on double-quotes to extract
the content of quoted strings. The resulting array contains a lot of
junk, including one extra character from the last line, so I reshape()
it and extract a subset of rows containing valid numbers:
reshape(b(1:end-1),[],3)([6,10,14,18,22],:)
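The reshape-and-select step can be tried on a small synthetic vector. This is only a sketch: the values, the four-fields-per-line layout, and the row positions 2 and 4 are invented for illustration and are not the real file's layout; the zeros stand in for whatever junk the non-numeric fields produce.

```octave
% Made-up stand-in for textread() output: 3 input lines of 4 fields each,
% where only fields 2 and 4 are real numbers, plus one trailing junk value.
b = [0; 101; 0; 5.5;  0; 102; 0; 6.5;  0; 103; 0; 7.5;  0];

% Drop the trailing element, then let reshape() work out the row count
% so that each of the 3 columns holds the fields of one input line.
m = reshape(b(1:end-1), [], 3);

% Keep only the rows (field positions) that hold valid numbers.
good = m([2, 4], :);
```

The one-liner above chains the index directly onto the reshape() call, which Octave allows; the sketch splits it into two statements only for readability.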
This is rather unreadable code but it's actually quite easy to arrive at
by trial and error on the Octave command line. In order to find out the
required reshaping and to find out what row indexes to use, I found it
useful to print pieces of the array using
format + +-.
which shows a compact representation of the jumble that makes it easy to
read various required counts off the screen output.
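As a small illustration of that display mode, the sketch below applies it to a made-up matrix. It uses the functional form of the format call, which passes the same two string arguments as the command form shown above; the matrix values are hypothetical.

```octave
% Made-up matrix with positive, zero and negative entries.
m = [1 0 -1; -1 0 1];

% Same as the command form: show '+' for positive, '-' for negative
% and '.' for zero entries instead of the numbers themselves.
format("+", "+-.");
disp(m)

% Restore the default numeric display afterwards.
format
```

Each element is shown as a single character, so the positions of the non-zero entries can be counted directly from the output.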