[Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
From: Jim Busser
Subject: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
Date: Mon, 25 Jan 2010 13:57:01 -0800
Some recent discussion of PDF indexing and PDF scraping prompts me to ask about
GNUmed's ability to search for text across a patient record:
1) When a PDF is generated from source text (for example from a word processor
via "print to PDF"), the text within the PDF remains recognizable to software,
although it is not stored in human-readable form. Is GNUmed presently able to
query only information that is stored as human-readable text?
2) There apparently exists a form of PDF called "searchable", in which a PDF can
be created (or appended to) so that it contains an image layer (such as a
scanned paper document) and *also* holds, in a separate layer within the same
document (file), ASCII or perhaps UTF-8 text. That text may have been generated
through OCR or, where the PDF already contained identifiable text (merely not
human-readable within the PDF format), copied into a layer of human-readable
text.
For GNUmed to be able to access such a layer in within-patient searches, would
it be necessary for such PDFs to have been imported twice, and/or to use some
additional tool to "split" the document into two parts (one an image part, and
one the text part)?
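
On the second question, a split may not be strictly necessary: tools such as
pdftotext (shipped with Xpdf) read whatever text a PDF carries, including an
OCR text layer, directly from the original file. Below is a minimal Python
sketch of a single-document search along those lines. It assumes pdftotext is
installed and on the PATH; the function names extract_text and find_matches are
illustrative only and are not part of GNUmed.

```python
import shutil
import subprocess


def extract_text(pdf_path):
    """Return the text content of a PDF using the pdftotext command-line tool."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # Passing "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_matches(text, term):
    """Return the lines of text that contain term, ignoring case."""
    needle = term.lower()
    return [line.strip() for line in text.splitlines() if needle in line.lower()]
```

With a hypothetical file scan.pdf, a search would then look like
find_matches(extract_text("scan.pdf"), "HbA1c"). An empty result from
extract_text would suggest an image-only PDF that still needs OCR.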
PS: The maintainer of Xpdf links to PdfSearch, a Python-based utility for
searching PDF files:
http://sourceforge.net/projects/pdfsearch/
Also "Some useful stuff for MacOS X"
http://users.phg-online.de/tk/MOSXS/
includes
UseXPDFforPrinting.dmg
xpdf-tools-3.dmg
although it isn't obvious that these solve the use case of passing a PDF to a
print control dialog and presenting that dialog to a user on Mac OS X.
Here are the two Oscar posts (which I merged) that reminded me to ask a
question I have had on my mind:
> From: Rob James
> Date: January 25, 2010 10:04:05 AM PST
> To: address@hidden
> Subjects: [Oscarmcmaster-bc-users] PDF indexing / PDF scraping
>
> On the topic of PDF indexing and textual retrieval. ...If
> the file is graphical in its origins, as you would expect if someone
> prints it, then faxes/scans it, then you are obliged to fall back to
> OCR, as no true textual data is in the file.
> If, however, the PDF is generated directly via
> PDFdriver, the text is actually in the file, surrounded by PDF stuff.
> The tool pdf2text ... strip PDFs for ASCII content
> (http://sourceforge.net/projects/pdf2textpilot/)
>
> For example, because most academic articles are now distributed as PDF
> files generated in the latter format, Zotero - the remarkable
> Firefox-based citation/bibliography manager - is able to fully index PDF
> articles as it acquires them. That trick would almost certainly have
> been based on extant open-source tools. I assume that what Zotero
> does is strip the PDF of all the non-text components... then index.
> It turns out that Zotero (www.zotero.org) uses the resources of Xpdf to
> scrape academic PDFs for textual content, and then to make the text
> available for subsequent indexing and retrieval. Given that several
> projects are using Xpdf in this way, it is probably a place to start.
>
> There are [info and] downloads for Linux and Windows available at:
> [ http://www.foolabs.com/xpdf/about.html ]
> http://www.foolabs.com/xpdf/download.html
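
The "strip, then index" idea in the quoted post can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not a
description of how Zotero or GNUmed actually work: the text of each PDF is
assumed to have been extracted already (for example with Xpdf's pdftotext), and
the names tokenize, build_index and search are hypothetical. Documents that
yield no text are set aside as candidates for OCR.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the ids of the documents containing it.

    documents is a dict of document id -> extracted text. Documents with
    no extractable words are returned separately as OCR candidates.
    """
    index = defaultdict(set)
    needs_ocr = []
    for doc_id, text in documents.items():
        words = tokenize(text)
        if not words:
            needs_ocr.append(doc_id)
            continue
        for word in words:
            index[word].add(doc_id)
    return index, needs_ocr


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)
```

Restricting the documents dict to one patient's files would give a
within-patient search of the kind asked about above.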