[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Gnumed-devel] forms handling / NetEpi
From: |
Tim Churches |
Subject: |
Re: [Gnumed-devel] forms handling / NetEpi |
Date: |
Wed, 27 Jun 2007 07:02:27 +1000 |
User-agent: |
Thunderbird 1.5.0.12 (Windows/20070509) |
Karsten Hilbert wrote:
> Tim,
>
> thanks for your invaluable insights.
>
> Let me try to get a few things cleared up in my mind:
>
>>> Tim, what were your reasons to store the form definition as
>>> an XML-based blob rather than in tables (roughly) like so:
> ...
>> Easier extensibility of the form definitions/metadata,
> I can see that. Put IOW: it seems safer to update data in
> deployed databases rather than requiring the databases
> themselves to be modified, eg. when one needs a new "field"
> in the form metadata definition.
>
> It *can* be done via ALTER TABLE but I agree it sounds safer
> to rather do UPDATE ... WHERE ... ;
What if the update to a form which has already been used to collect data
involves the removal of a column? If you use ALTER TABLE, all the data
collected in that form field/column is now gone... where's the undo button?
>
>> and
>> the form definitions need to be serialised anyway in order
>> to share them around.
> True. I was thinking of using
>
> COPY ... (SELECT ...)
>
> to create singled out dumps for sharing forms (COPY these
> days is able to dump the result of a SELECT statement :-)
That would work. XML is not always best, but if you tell people that
your form definitions are shareable as CSV files, they will look at you
very strangely, whereas if you say "shareable as XML" then they won't be
surprised at all.
>>> One disadvantage for this is that the table might become
>>> large and thus slower. Which could be alleviated by using
>>> constraint-based partitioning (eg by age) and putting the
>>> partitions onto more spindles.
>
> Let me rephrase this:
>
> You store form data in a-table-per-form because EAV for
> all-form-data-in-one-table was tested to be too slow ?
In 2003 on a typical $1,000 machine used as a server, yes it was too
slow with large volumes of data - especially analytical queries and
bulk export of the data into a table form.
> If testing did NOT prove the approach to be too slow at the
> time would you have chosen it over the approach you are
> taking now ?
Probably.
> Did you test the explicit-table-for-form-definitions, too ?
> If so, I assume that wasn't a bottleneck ? (apart from that
> "parsing XML into Python objects was faster then parsing a
> SELECT result into Python objects")
No, we didn't test that.
>> We tested your approach, which is a form of the well-known
>> EAV (entity-attribute-value) approach,
> Correct.
>
>> and found that performnce was lacking at around the
>> 500,000 record mark with about a ten forms of 50 fields each
>> i.e. about 250 million rows in the values tables.
> Let's see:
>
> We see about 150 patients per day. Let's say 200. Each
> receives, on average, one form printout to go. Let's say 2.
>
> That's 400 form instances per day.
>
> Let's assume each form has 50 fields (rather high for the
> average form I deal with). That's 20000 form field values
> per day.
>
> That'd make 800 000 form field values per year assuming 400
> days in a year.
>
> Or 8 000 000 form field values over 10 years. Let's double
> that up for increase of bureaucracy: 16 million. Double
> again just for good measure: 32 million.
>
> Nothing which PostgreSQL cannot handle these days judging
> from what people shove around on the PG mailing list. Also
> nowhere near the 250 million rows you were benchmarking for
> back in 2003 with a then-current PostgreSQL and that's over
> a 10 year period which sees increases in hardware and
> software speed.
Sure - at a clinic or individual institution level, the EAV approach is
fine performance-wise. But not at a jurisdictional level, which is what
we were benchmarking as a worst-case scenario.
> Still really good to be able to learn from your numbers.
>
> BTW, one reason we don't reach your numbers is that GNUmed
> isn't an epidemiologist's tool which NetEpi decidedly is.
Yes, and bulk analytical queries and bulk data export to stats packages
is vital for the latter role, and EAV is slow for these things.
>> If we were starting stoday, we would use openEHR
>> archetypes or a variation thereof, but we'd need to spend
>> 6-12 months building the engine ourselves.
> If I had the manpower I'd go that route, too, yes.
>
>>>> Data from the table for the previous versions of the form
>>>> are rolled forward into the table for the new version of the form when
>>>> it is deployed.
> ...
>> There is a need for custom transformations in some rare
>> roll-forward instances, which have to be written in Python
>> outside the Web environment.
> Sure, that's to be expected. It likely could be done via
> dedicated stored procedures, perhaps written in pl/Python ?
Written in Python, but running in one of the middle layers of the
application stack - we haven't used triggers or other database-specific
processing, preferring to put all logic into the middle layers and
keeping as much database independence as possible. Having said that, we
have PG dependencies, but they are minor and could be easily removed,
although we have no reason to do so at the moment.
Tim C