Re: data access
From: John Darrington
Subject: Re: data access
Date: Sun, 2 Jul 2006 08:23:36 +0800
User-agent: Mutt/1.5.4i
I think that part of the problem is that the casefiles are designed to
be a) fast and b) potentially very large. One of the costs of these
design criteria is that they're not quite so flexible.
As Ben has explained to me before, accessing a casefile in random
order is much less efficient than doing so in sequential order. I'm
not sure that a 10^8 x 300 gsl_matrix would be very efficient.
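For a rough sense of scale: a gsl_matrix holds doubles, so a dense
10^8 x 300 matrix needs 10^8 * 300 * 8 bytes, about 240 GB, before any
allocator overhead. A minimal sketch of that arithmetic, assuming
8-byte doubles (the function name dense_matrix_bytes is only
illustrative and is not part of GSL or PSPP):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold a dense ROWS x COLS matrix of doubles.
   Hypothetical helper for illustration; ignores allocator overhead. */
static uint64_t dense_matrix_bytes(uint64_t rows, uint64_t cols)
{
    /* Guard against overflow of the 64-bit product. */
    assert(cols == 0 || rows <= UINT64_MAX / sizeof(double) / cols);
    return rows * cols * (uint64_t) sizeof(double);
}
```

Calling dense_matrix_bytes(100000000, 300) gives 240000000000 bytes
where a double is 8 bytes, which is far more than can be held in
memory on typical hardware.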
I'm not a statistician, but I cannot envisage any situation where a
matrix operation (e.g. pre-multiplication, inversion etc.) would need
to be performed on a casefile as a whole; it wouldn't make sense in
the general case, because of non-numeric data.
Having said that, I'm working on abstracting the interface for the
casefiles right now. It might be possible to devise a casefile type
that is more convenient for math routines, but probably not one that
would be quite as flexible as gsl_matrix.
Can you give me an example of a particular problem you've encountered?
I'll see if I can come up with any suggestions.
J'
On Sat, Jul 01, 2006 at 02:01:45PM -0400, Jason Stover wrote:
I'm wrestling with reading data via casefiles again. We've all said
it would be nice to make reading the data easier, and Ben has
complained about every procedure's need to pass the entire data
set.
I thought of what might be a simple approach: Each time a procedure
reads the data via casefiles, it stores them in a gsl_matrix, along
with some other information about variable names, etc. Then the next
time a procedure needs the data, it uses that gsl_matrix, if it's
available and contains the necessary information. If not, it reads the
data via casefiles.
Filling up and using a gsl_matrix is easy. I don't know how easy it
would be to store the meta-data the procedures would need.
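The approach described above might be sketched roughly as follows.
This is only an illustration under stated assumptions: struct
dense_matrix is a hypothetical stand-in for a gsl_matrix (so the
sketch compiles without GSL), the metadata is reduced to one variable
name per column, and data_cache, cache_find_var, cache_covers and
cache_fill are made-up names, not PSPP or GSL functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a gsl_matrix: dense, row-major doubles. */
struct dense_matrix {
    size_t n_rows, n_cols;
    double *data;
};

/* Hypothetical cache: the matrix plus the variable name of each column. */
struct data_cache {
    struct dense_matrix m;     /* m.data is NULL while the cache is empty */
    const char **var_names;    /* var_names[j] labels column j */
};

/* Return the column holding NAME, or -1 if the cache does not have it. */
static int cache_find_var(const struct data_cache *c, const char *name)
{
    if (c->m.data == NULL)
        return -1;
    for (size_t j = 0; j < c->m.n_cols; j++)
        if (strcmp(c->var_names[j], name) == 0)
            return (int) j;
    return -1;
}

/* Return 1 if all N requested variables are cached, so a procedure can
   skip the pass over the casefile; otherwise return 0. */
static int cache_covers(const struct data_cache *c,
                        const char *const *names, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cache_find_var(c, names[i]) < 0)
            return 0;
    return 1;
}

/* Fill the cache from values already read sequentially into ROWS
   (n_rows * n_cols doubles, row-major).  Returns 0 on success, -1 if
   memory cannot be allocated. */
static int cache_fill(struct data_cache *c, const double *rows,
                      size_t n_rows, size_t n_cols, const char **var_names)
{
    assert(rows != NULL && var_names != NULL);
    double *copy = malloc(n_rows * n_cols * sizeof *copy);
    if (copy == NULL)
        return -1;
    memcpy(copy, rows, n_rows * n_cols * sizeof *copy);
    free(c->m.data);            /* drop any previously cached data */
    c->m.n_rows = n_rows;
    c->m.n_cols = n_cols;
    c->m.data = copy;
    c->var_names = var_names;   /* borrowed; the caller keeps them alive */
    return 0;
}
```

A procedure would call cache_covers first and fall back to reading
the casefile, followed by cache_fill, only when it returns 0. The
sketch leaves out everything that makes the real problem hard, such as
non-numeric variables, missing values, and data sets too large to hold
in memory.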
Pardon me if this is an old idea. But the difficulty of using
casefiles prevents other people from contributing mathematical code,
whereas gsl_matrices are easy to handle.
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.