lynx-dev

Re: [Lynx-dev] Issues with -dont_wrap_pre and -nomargins


From: Thomas Dickey
Subject: Re: [Lynx-dev] Issues with -dont_wrap_pre and -nomargins
Date: Wed, 16 Sep 2009 04:59:17 -0400 (EDT)

On Wed, 16 Sep 2009, Claus Strommer wrote:

Hello all. I am using lynx to convert an archive of html files into plaintext for information retrieval. The command that I use is

lynx -nounderline -notitle -nocolor -nomargins -nolist -nobold -nonumbers -force_html -dump -dont_wrap_pre <file>

It works almost perfectly, except for one minor issue; I am not sure if it is a bug or something I am doing wrong. When I parse the attached a.html file, some of the words are printed without a whitespace separator:

I can reproduce this

"...However, in order to fully develop our vision of the next version of Twingle, we needed more control over the fine nuances of searching through email. And, asthe next phase of the Twingle development is to include a downloadable versionof the software, we needed it to make it easier for people to install - when the lead developer gave up after 6 hours of trying to get it all working on his own machine at home we knew we had a problem!..."

'asthe' should be 'as the', 'versionof' -> 'version of', and so on. AFAIK, this is not an input error - the words are separated when I skip either of the -dont_wrap_pre or -nomargins options. As these errors occur near the n*80th characters in a paragraph, I can only assume that some part of the parsing is going awry there. The errors occur in the 1.8.6-rel5 (macports), 1.8.6-rel4 (ubuntu) and the latest 1.8.8 builds.

s/1.8/2.8/

That sounds like a bug, for instance in how lynx is storing some hidden
characters for &nbsp;, etc.


So my question is: Is there anything I can do to work around this?  I would

...other than fixing the bug - perhaps not. (I'm working on xterm and mawk at the moment, intending to go back to lynx next...).

Just reading the code: It looks as if -nomargins goes to the no_margins variable, and _that_ is used in only a few places:

DefaultStyle.c:466:         if (no_margins) {
DefaultStyle.c:482:         if (no_margins) {
LYGlobalDefs.h:394:    extern BOOLEAN no_margins;
LYMain.c:393:BOOLEAN no_margins = FALSE;
LYMain.c:3622:      "nomargins",        4|SET_ARG,              no_margins,
LYOptions.c:35:#define MARGIN_STR (no_margins ? "" : "&nbsp;&nbsp;")
LYOptions.c:36:#define MARGIN_LEN (no_margins ?  0 : 2)
LYrcFile.h:159:#define RC_NO_MARGINS                   "no_margins"
LYReadCFG.c:1494:     PARSE_SET(RC_NO_MARGINS,           no_margins),

The uses in DefaultStyle.c and LYOptions.c are simple to change, to see whether the bug's behavior changes predictably. For instance, making -nomargins leave a single character of margin rather than none might make it usable for your script.
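
As a rough illustration of that experiment (a sketch only, not a patch against lynx): the two macros below are copied from the grep output for LYOptions.c, while the TRIAL_ names, the count_nbsp() helper and the plain bool standing in for lynx's BOOLEAN global are assumptions made up for this example. Whether a one-character margin actually avoids the joined words is untested here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for lynx's global flag, which -nomargins sets (assumption:
 * a plain bool is used here instead of lynx's BOOLEAN type). */
static bool no_margins = false;

/* The two macros as they appear in the grep output for LYOptions.c. */
#define MARGIN_STR (no_margins ? "" : "&nbsp;&nbsp;")
#define MARGIN_LEN (no_margins ?  0 : 2)

/* Hypothetical trial variants: keep one character of margin when
 * -nomargins is given.  These names do not exist in lynx. */
#define TRIAL_MARGIN_STR (no_margins ? "&nbsp;" : "&nbsp;&nbsp;")
#define TRIAL_MARGIN_LEN (no_margins ?  1 : 2)

/* Hypothetical helper: count the "&nbsp;" entities in a margin string,
 * so the string and its declared length can be checked against each other. */
static int count_nbsp(const char *s)
{
    int n = 0;

    assert(s != NULL);
    while ((s = strstr(s, "&nbsp;")) != NULL) {
        n++;
        s += strlen("&nbsp;");
    }
    return n;
}
```

The point of the helper is only that the string and the length have to change together; the same kind of one-character change would also be needed at the two no_margins tests in DefaultStyle.c, whose contents are not shown in this message.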

The derived variables are used in GridText.c's split_line() function, which is where the boundary check is most likely off.
That function is complicated, since there are long expressions such as

        spare = WRAP_COLS(text)
            - (int) style->rightIndent
            - indent
            + ctrl_chars_on_previous_line
            - LYstrExtent2(previous->data, previous->size);

But that's the area where the fix would probably be made - split_line.
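
To make the arithmetic in that expression easier to poke at, here is a minimal standalone sketch. It is not lynx code: spare_columns() is a hypothetical function, and every parameter is a plain int standing in for what lynx gets from WRAP_COLS(text), the style's rightIndent, the indent, and LYstrExtent2() on the previous line.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the quoted expression from split_line().
 * Returns how many columns are left over on the previous line.
 */
static int spare_columns(int wrap_cols,
                         int right_indent,
                         int indent,
                         int ctrl_chars_on_previous_line,
                         int previous_extent)
{
    assert(wrap_cols >= 0 && previous_extent >= 0);

    return wrap_cols
        - right_indent
        - indent
        + ctrl_chars_on_previous_line
        - previous_extent;
}
```

With both indents at zero (roughly what -nomargins asks for) and a previous line that exactly fills an 80-column width, spare_columns(80, 0, 0, 0, 80) is 0, whereas with margins in place the same text would leave the line earlier. That zero-spare boundary is one plausible place for a separator to get lost, but it is only a guess until someone steps through split_line itself.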

very much like to keep using these two options, as it is important to me to be able to distinguish between lines and paragraphs. I am even willing to use other tools, if you can suggest any - but as far as I've seen, lynx is the only one which gives the desired options. Also, I'd like to stay away from the -width option (it does not allow me to specify infinite width, AND it breaks with tables - the attached b.html, for example).



--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net



