Re: PDF outline not capturing Cyrillic text
From: Deri
Subject: Re: PDF outline not capturing Cyrillic text
Date: Tue, 06 Feb 2024 13:39:51 +0000
On Sunday, 4 February 2024 03:57:22 GMT Robin Haberkorn wrote:
> Regarding cyrillic characters in PDF outlines, I think I got a few
> insights today.
>
> It turns out that the pdfmarks in the postscript code are "text strings"
> according to the PDF specs, that is either a PDFDocEncoding or
> UTF-16BE with a leading byte-order marker (cf. PDF Reference 1.7).
> A PDFDocEncoding is basically latin1 it seems.
> This explains why the current code in MOM works with western European
> languages.
> Now, in order to include cyrillic, you will have to reencode whatever
> encoding Groff uses and passes to the postprocessor - which will
> subsequently end up in the postscript code - to UTF-16BE.
> Everything needs to be hex-encoded and enclosed in sharp
> brackets (<FEFF....>).
>
> In the most hacky case, this could be done by a script on the
> postscript code generated by `pdfroff --emit-ps`. As a proof of concept
> Here's an incomplete, but somewhat working version in SciTECO:
>
> sciteco -e "16,0ED @EB/document.ps/ <@S|/Title (|; -D @I|<FEFF| .(@S|)
> /OUT|6R).@EC{iconv -f KOI8-R -t UTF-16BE | hexdump -e '1/1 \"%02X\"'} @I/>/
> D> @EW//"
>
> This assumes that the Groff encoding is KOI8-R, which I chose as an
> intermediate format in order to enable Russian hyphenation
> (but that does not work unfortunately).
> It should be rewritten into a Python or Perl script using some
> iconv wrapper or ideally pdfroff itself could do it.
> The script could even interpret Groff Unicode escapes generated by preconv
> and convert them back to plain Unicode before writing out everything in
> UTF16.
>
> I will probably just use such a hack for my purposes.
>
> What's the status of pdfroff anyway? I read that it is more or less
> deprecated and we should all use `groff -Tpdf` instead.
> Actually, pdfmom should work with ms as well, actually uses
> gropdf and should perform the necessary multipass processing
> for pdfhref forward-references to work.
> Will try this next!
>
> Best regards,
> Robin
Hi Robin,
The current gropdf (in the master branch) does support UTF-16BE for PDF
outlines (see attached PDF), but Branden has not released the other parts
needed to make it work! If you can compile and install the current git, then
applying the attached patch should give you what you want.
To apply the patch, cd into the git groff directory and run
"patch -p1 < path-to-patch-file", then run make and install as usual.
I would be very interested in how you get on, and whether it gives you what
you need. Note that I am assuming you are feeding groff a UTF-8 file and
passing the -k flag. I can see some hyphenation happening, but I don't know if
it is correct.
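For reference, the text-string encoding described in the quoted message
(hex-encoded UTF-16BE with a leading FEFF byte-order mark, in angle brackets)
can be sketched in Python. This is only an illustration of the post-processing
hack, not what gropdf or the attached patch does. The names pdf_text_string
and reencode_titles are hypothetical, and the sketch assumes that titles
appear in the PostScript as "/Title (...)" in KOI8-R with no nested or
escaped parentheses.

```python
import re


def pdf_text_string(text: str) -> str:
    """Return text as a hex-encoded UTF-16BE PDF text string with a BOM."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


# Assumption: titles look like "/Title (...)" and contain no nested
# parentheses or backslash escapes. Real PostScript strings may have both.
TITLE_RE = re.compile(rb"/Title \(([^()\\]*)\)")


def reencode_titles(ps: bytes, source_encoding: str = "koi8-r") -> bytes:
    """Replace every literal /Title string with its UTF-16BE hex form."""

    def repl(match):
        # Decode the title from the assumed source encoding, then re-encode.
        title = match.group(1).decode(source_encoding)
        return b"/Title " + pdf_text_string(title).encode("ascii")

    return TITLE_RE.sub(repl, ps)
```

With something like this, the bytes of a file produced by
`pdfroff --emit-ps` could be passed through reencode_titles before being
converted to PDF. A title containing parentheses would need a proper
PostScript string parser instead of the regular expression.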
Cheers
Deri
Attachments:
master.patch (Text Data)
Rus2.pdf (Adobe PDF document)
Rus2.trf (Text document)