Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract

From:	Karsten Hilbert
Subject:	Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
Date:	Mon, 25 Jan 2010 23:41:03 +0100

> Some discussion of PDF indexing and scraping of PDFs makes me ask about
> GNUmed's ability to search for text across a patient record:
> 
> 1) when a PDF was generated from source text (such as a word processor and
> "print to pdf") the text within the PDF remains recognizable to software,
> albeit not in human readable form.

AFAIK, that depends entirely on the mode in which it was generated.
It well behooves PDF generators to choose a mode that preserves the
text in some form, but AFAIK there are other modes in which no text survives.
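
One way to find out which kind of PDF one is dealing with is to try
extracting text and see whether anything comes back. A minimal sketch,
assuming the pdftotext command line tool from poppler-utils is installed;
the function names and the injectable extractor parameter are made up for
illustration and are not part of GNUmed:

```python
import subprocess


def extract_text(pdf_path):
    """Return whatever text pdftotext finds in the PDF (may be empty)."""
    # "pdftotext FILE -" writes the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(pdf_path, extractor=extract_text):
    """True if the extractor returns any non-whitespace text.

    The extractor is a parameter (an assumption of this sketch) so that
    another tool can be swapped in without touching the check itself.
    """
    # Image-only PDFs typically yield nothing but page breaks and blanks.
    return bool(extractor(pdf_path).strip())
```

A scan without OCR would be expected to come back False here, while a
"print to pdf" export from a word processor would usually come back True.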

> Is GNUmed presently only able to query
> information stored-as-human-readable text?

Even worse, it cannot query over *any* information in any
of the documents in the archive regardless of format.

> 2) there exists apparently a form of PDF called "searchable" in which a
> PDF can be created (or appended) to contain both an image layer (such as a
> scanned paper document) but to *also* hold, in a separate layer within the
> same document (file), ASCII or perhaps UTF-8 text, as may have been generated
> through OCR or perhaps when the PDF did already contain identifiable text
> (only non-human-readable within the PDF format), into a layer of
> human-readable text.

That sounds mighty useful to me.

> For GNUmed to be able to access such a layer in within-patient searches,
> would it be necessary for such PDFs to have been imported twice, and/or to
> use some additional tool to "split" the document into two parts (one an
> image part, and one the text part)?

It would be possible to implement the access to the text part inside
GNUmed. Actually using that in a search would, however, presently
require exporting each and every document and trying to search it.

That could, indeed, only be mitigated by splitting the text part
into a separate for-search table upon import.

Except that GNUmed already has that table: blobs.doc_desc, of which
there can be any number per document. In fact, we should probably
extend the per-patient and across-patients search to look at those!

Which would then enable practices to implement just what you wanted -
they'd have to import the text version themselves, but it'd be usable
for finding stuff.
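
The idea can be sketched roughly as follows. This is an illustration only:
it uses an in-memory SQLite database as a stand-in for GNUmed's PostgreSQL
backend, and apart from the name doc_desc every table, column, and function
name is an assumption rather than the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: one row per archived document ...
    CREATE TABLE doc (
        pk INTEGER PRIMARY KEY,
        fk_patient INTEGER NOT NULL,
        filename TEXT NOT NULL
    );
    -- ... and any number of text descriptions per document.
    CREATE TABLE doc_desc (
        pk INTEGER PRIMARY KEY,
        fk_doc INTEGER NOT NULL REFERENCES doc(pk),
        description TEXT NOT NULL
    );
""")


def import_description(doc_pk, text):
    """Attach a text version (e.g. OCR output) to an archived document."""
    conn.execute(
        "INSERT INTO doc_desc (fk_doc, description) VALUES (?, ?)",
        (doc_pk, text),
    )


def search(term, patient=None):
    """Find documents whose descriptions contain term.

    With patient=None the search runs across all patients.
    """
    sql = (
        "SELECT DISTINCT d.pk, d.filename FROM doc d "
        "JOIN doc_desc dd ON dd.fk_doc = d.pk "
        "WHERE dd.description LIKE ?"
    )
    args = ["%" + term + "%"]
    if patient is not None:
        sql += " AND d.fk_patient = ?"
        args.append(patient)
    sql += " ORDER BY d.pk"
    return conn.execute(sql, args).fetchall()


# Made-up sample data: two patients, one scanned document each.
conn.execute("INSERT INTO doc VALUES (1, 12, 'letter.pdf')")
conn.execute("INSERT INTO doc VALUES (2, 34, 'labs.pdf')")
import_description(1, "Referral letter from cardiology")
import_description(1, "OCR text: started metformin 500 mg")
import_description(2, "Metformin dose unchanged")

print(search("metformin"))              # across patients
print(search("metformin", patient=12))  # within one patient
```

The point of the separate table is that searching never has to export
and open the stored files; it only looks at text that was put into the
descriptions at import time.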

:-)

Karsten

