gnumed-devel

Re: [Gnumed-devel] experiments with gnumed - multiusers vnc, importing


From: Tim Churches
Subject: Re: [Gnumed-devel] experiments with gnumed - multiusers vnc, importing
Date: Wed, 26 Apr 2006 17:53:23 +1000
User-agent: Thunderbird 1.5.0.2 (Windows/20060308)

Syan Tan wrote:
> it would be better if the records were synthetic, based on some statistics
> about the EHR, e.g. age, sex, health issues, episodes per health issue,
> encounter frequency, medication names prescribed, blood pressure, test names
> ordered and frequency, clusters of frequencies of appointments and health
> issues dealt with, specialty names mentioned in the narrative text, symptom
> names mentioned, sign words like "chest clear, basal, wheeze, nil added,
> pulse, regular, irregular, abdo, lax, mass, no masses, sclera, pallor,
> conjunctiva, well, unwell, fwt, nad, nitrites, wcc, rcc".
>
> synthetic records can be fairly sure of being "deidentified". Maybe some
> configuring statistics and terms, and a program would be better.
>
> Not sure if they lose their value, because maybe someone wants to use real
> statistical record patterns for research. Probably good enough for load
> testing though?

I agree that synthetic records are the way to go - if using real records
you need to be very, very, very careful about de-identifying or
perturbing them - and it can be more trouble than synthesising records
in the first place.

Clearly for authentic-seeming records you need multivariate
distributions (i.e. joint or conditional probabilities). If the source
data is limited in size, you need to be careful how far you go in
creating those distributions, as you can easily create a system which
synthesises records which are almost identical to the source records, in
which case you have gained nothing. I would suggest using conditional
probabilities based on just age group and sex, although medications
should be diagnosis specific too, I suppose. Maybe just stick to the
more common diagnoses and medications.
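
As a rough illustration of the idea above, here is a minimal Python sketch
that draws an (age group, sex) stratum first, then a diagnosis conditional on
that stratum, then a medication conditional on the diagnosis. Every name,
category and probability in it is a made-up placeholder, not something
estimated from GNUmed data; in practice the tables would be estimated from
the source records, restricted to the more common diagnoses and medications.

```python
import random

# Hypothetical marginal distribution over (age group, sex) strata.
STRATA = {
    ("18-44", "F"): 0.30,
    ("18-44", "M"): 0.30,
    ("65+", "F"): 0.22,
    ("65+", "M"): 0.18,
}

# Hypothetical P(diagnosis | age group, sex), common diagnoses only.
DIAGNOSIS_GIVEN_STRATUM = {
    ("18-44", "F"): {"asthma": 0.15, "hypertension": 0.05, "no diagnosis": 0.80},
    ("18-44", "M"): {"asthma": 0.12, "hypertension": 0.08, "no diagnosis": 0.80},
    ("65+", "F"): {"asthma": 0.08, "hypertension": 0.52, "no diagnosis": 0.40},
    ("65+", "M"): {"asthma": 0.07, "hypertension": 0.58, "no diagnosis": 0.35},
}

# Hypothetical P(medication | diagnosis): medications are diagnosis specific.
MEDICATION_GIVEN_DIAGNOSIS = {
    "asthma": {"salbutamol": 0.7, "budesonide": 0.3},
    "hypertension": {"ramipril": 0.5, "amlodipine": 0.4, "none": 0.1},
    "no diagnosis": {"none": 1.0},
}


def draw(dist, rng):
    """Draw one key from a {value: probability} table."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def synthesise_record(rng):
    """Build one synthetic record from the conditional tables above."""
    age_group, sex = draw(STRATA, rng)
    diagnosis = draw(DIAGNOSIS_GIVEN_STRATUM[(age_group, sex)], rng)
    medication = draw(MEDICATION_GIVEN_DIAGNOSIS[diagnosis], rng)
    return {
        "age_group": age_group,
        "sex": sex,
        "diagnosis": diagnosis,
        "medication": medication,
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run can be repeated
    for _ in range(5):
        print(synthesise_record(rng))
```

Keeping the conditioning to age group and sex (plus diagnosis for
medications) keeps each table small, which is the point: the fewer cells in
the tables, the less chance that a synthesised record is a near-copy of one
source record.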

Tim C

> *On Tue Apr 25 13:23 , Tim Churches sent:
> 
> *
> 
>     Syan Tan wrote:
>      > i've processed 360,000 rows of clin.clin_narrative and parsed out all
>      > the words containing letters. I was thinking of using a stoplist method
>      > where any word appearing on the stoplist will be replaced by 'xxxx'.
>      > The stoplist would also include all the names listed out from
>      > dem.names.lastnames and dem.names.firstnames.
>      >
>      > BTW - what about a secondary structure for clin.clin_narrative, where
>      > the narrative consists of a list of indexes pointing into a table of
>      > words. this is the simplest step before having some sort of semantic
>      > linking at the word level (but not at the phrase level).
>      >
>      > whilst trying to recreate the gnumed database using a pg_dump, the dump
>      > reload seems to stall; I tried to turn off logging, table constraints,
>      > removing internal log table data, and fsync, which all finally worked,
>      > but I'm not sure what causes the stall.
>      >
>      >
>      >
>      >
>      > *On Mon Apr 24 18:53 , Karsten Hilbert sent:
>      >
>      > *
>      >
>      > On Thu, Apr 20, 2006 at 09:47:54AM +0800, Syan Tan wrote:
>      >
>      > > thinking about it, the only correct thing to do seems to be to
>      > > preserve the structure of the instance data and the health issue +
>      > > episode headings, but to scramble the text with word substitution, as
>      > > well as name substitution, date fudging, and address random
>      > > relinking. would that be de-identified enough ?
>      > Well, I tend to think that "de-identified enough" is a range
>      > from "acceptably so" to "beyond use" rather than a cutoff.
>      > The exact value used within that range depends on what sort
>      > of protection you need.
>      >
>      > Yes, if you want to hide a patient's data securely from your
>      > fellow doctor next door you will have to scramble the medical
>      > content, too, as she might be able to match "real patient"
>      > to "problems/operations listed" by her own medical skills
>      > and thereby gain knowledge via the now re-identified EMR.
>      >
>      > But if you want to protect a patient's privacy from, say,
>      > me, it's enough to falsify the identities. I do not have
>      > access to your patients. I also have no idea how to find out
>      > who your patients actually are in order to start matching
>      > EMRs to patients. Hence proper protection is ensured, I dare
>      > say. It is akin to not storing patient names with any
>      > medical data and hold the EMR ID <-> patient identity
>      > mapping elsewhere in a secure space (say, the patient's
>      > brain).
>      >
>      > In a recent discussion on the openhealth list this topic was
>      > chanced upon and the OpenEHR guys thought the latter
>      > approach would be the most secure that's practically useful
>      > - and they were talking real live patient data in actual
>      > care.
> 
>     I didn't mention it on the openEHR list (maybe I should) but merely
>     removing the direct identifiers (names, DOB etc) does not de-identify or
>     anonymise that data. For example, if the record reveals "32 yr old male,
>     with medical visits on 23/4/04, 12/6/05 and 14/01/06" then that record
>     has a very high probability of being unique to an individual in even a
>     large population. Hence if I know your age and sex (easily discovered or
>     ascertained) and I know that you had medical appointments on those dates
>     (eg if I had access to your work leave records, as staff in the
>     personnel department of your employer may have), then I can fairly
>     easily work out which record belongs to you. Disclosure control in microdata
>     almost always involves some degree of obfuscation, perturbation or
>     allocation to broad categories - in other words, a lot of detail needs
>     to be removed to make real data truly anonymous (in that it cannot be
>     re-identified). Also, anonymity of data is a continuum - it is not
>     dichotomous, and often it comes down to a risk judgement and some
>     assumptions about what additional information an 'attacker' who might
>     try to re-identify records might possess. If the data are to be made
>     publicly available, you can't make any assumptions about what an
>     attacker might or might not already know about a person, so you need to
>     be very conservative.
> 
>     Tim C
> 
> 
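
The uniqueness point in the quoted message above (age, sex and visit dates
singling out one person) can be checked mechanically. The following is a
minimal Python sketch, assuming records are plain dicts with hypothetical
"age", "sex" and "visits" fields; the four records are invented for
illustration. It measures what fraction of records are unique on a chosen
combination of fields, before and after coarsening to broad categories.

```python
from collections import Counter


def fraction_unique(records, key_fn):
    """Fraction of records whose quasi-identifier key occurs exactly once."""
    counts = Counter(key_fn(r) for r in records)
    unique = sum(1 for r in records if counts[key_fn(r)] == 1)
    return unique / len(records)


def detailed_key(record):
    """Exact age, sex and exact visit dates: what an attacker might know."""
    return (record["age"], record["sex"], tuple(sorted(record["visits"])))


def coarse_key(record):
    """Ten-year age band, sex, and only the years in which visits occurred."""
    band = record["age"] // 10 * 10
    years = tuple(sorted({date[:4] for date in record["visits"]}))
    return (band, record["sex"], years)


# Invented example records; dates are ISO strings (YYYY-MM-DD).
RECORDS = [
    {"age": 32, "sex": "M", "visits": ["2004-04-23", "2005-06-12", "2006-01-14"]},
    {"age": 35, "sex": "M", "visits": ["2004-03-02", "2005-11-30", "2006-02-01"]},
    {"age": 38, "sex": "M", "visits": ["2004-09-17", "2005-01-05", "2006-05-09"]},
    {"age": 61, "sex": "F", "visits": ["2005-07-07"]},
]

if __name__ == "__main__":
    print("unique on detailed key:", fraction_unique(RECORDS, detailed_key))
    print("unique on coarse key:  ", fraction_unique(RECORDS, coarse_key))
```

A record that is still unique after coarsening remains a re-identification
risk, so a check like this is only a first screen, not a guarantee of
anonymity.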




