[bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract
From: Jelle Licht
Subject: [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
Date: Tue, 28 Feb 2023 01:31:40 +0100
Hi Simon,
Simon South <simon@simonsouth.net> writes:
> Jelle,
>
> Respectfully, and speaking only as an interested observer, I think this
> may not be the right fix.
Cunningham's law strikes again :) [1].
>
> Guix's Tesseract is indeed missing its config files, causing (among
> other things) the examples in the online documentation[0] to not work,
> e.g.:
>
> ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
> read_params_file: Can't open hocr
> The (quick) [brown] {fox} jumps!
> Over the $43,456.78 <lazy> #90 dog
> (...)
>
> But the root issue appears to be a misconfiguration of the
> TESSDATA_PREFIX search path in the tessdata-ocr package, which causes
> Tesseract's own config files to be installed in a folder other than the
> one it's configured to search.
>
> Fixing this places Tesseract's config files and the trained-data files
> together beneath /usr/share/tessdata, allowing Tesseract to work as
> expected:
>
> ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> (...)
I will believe you without any doubt, but there's this spooky comment
left in the tesseract-ocr 'adjust-TESSDATA_PREFIX-macro phase:
--8<---------------cut here---------------start------------->8---
;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
;; specific search-path than '/share' can be specified. The
;; build system uses CPPFLAGS for itself, so we can't simply set
;; a make flag.
--8<---------------cut here---------------end--------------->8---
This makes me believe the current situation was a deliberate choice, but
I personally don't understand what the original problem was/is.
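For anyone wanting to check which situation their own profile is in, here
is a rough sketch of a test. It assumes that TESSDATA_PREFIX names the
tessdata directory itself and that config files such as hocr live in a
"configs" subdirectory beneath it; the helper name have_tess_config is
made up for illustration.

```shell
# Sketch only: report whether a Tesseract config file (e.g. "hocr") is
# visible under the directory named by TESSDATA_PREFIX.
# Assumption: configs live in "$TESSDATA_PREFIX/configs".
have_tess_config() {
    name=$1
    prefix=${TESSDATA_PREFIX:-}
    # Succeeds only if the prefix is set and the config file exists.
    [ -n "$prefix" ] && [ -f "$prefix/configs/$name" ]
}

if have_tess_config hocr; then
    echo "hocr config found under $TESSDATA_PREFIX/configs"
else
    echo "hocr config not found under TESSDATA_PREFIX"
fi
```

If the second branch is taken, that would be consistent with the
"read_params_file: Can't open hocr" message quoted above.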
> This approach has the advantage of keeping the
> tesseract-ocr-tessdata-fast package "pure" and focused only on
> trained-data files, which will be important for the patch I'm working on
> that will split it into multiple packages, one for each language and
> script, to allow greater flexibility.
>
> I'll respond to this email with a draft (!) patch to tesseract-ocr that
> should achieve the same result as yours, making the config files
> available for use. Does this also fix the problem for you? If so,
> would you consider submitting this change instead?
It seems to work for my stuff! I'm bringing in Maxim to weigh in on this, as
they are the (un?)lucky expert according to my git-fu.
Thanks for paying attention!
- Jelle
[1] https://meta.wikimedia.org/wiki/Cunningham%27s_Law