Re: LYNX-DEV error recovery for form parsing


From: Klaus Weide
Subject: Re: LYNX-DEV error recovery for form parsing
Date: Sat, 5 Apr 1997 22:40:59 -0600 (CST)

On Sat, 5 Apr 1997, Foteos Macrides wrote:

> Hynek Med <address@hidden> wrote:
> >
> >Laura,
> >
> >this is funny. A while ago I sent to lynx-dev a similar patch (though it
> >in fact didn't work as I intended it to do, as Klaus has noted :-).. Our
> >idea is the same, just not to assume </FORM> and rather ignore the
> >offending ending tag. 
> >
> >I wonder what do others think about the idea behind our patches.. It
> >certainly helps for most of the pages with bad markup and it doesn't have
> >any side effects on pages with good HTML..
> 
>       None of the currently active developers has addressed this,

An indication of skepticism combined with don't-understand-enough-of-this
(speaking just for myself, of course).  Well, I am trying to understand
it better.

> except for a worrisome non sequitur that HTML element handling might
> be done homologously to the optional SGML comment parsing, and Laura
> is still hacking solely in SGML.c without understanding the consequences
> for HTML.c, GridText.c, and LYfoo.c modules, so (against my better
> judgment 8-) I'll address it from my "vacation spot".
> 
>       When I was an active developer, this was an FAQ which I
> frequently answered, and Subir has an explanation in the "Why
> does Lynx do this" pages at "lynx links".  Perhaps yet another
> explanation, but geared explicitly toward "code modifiers" rather
> than toward "general readers", would be helpful.

It was helpful.
 
>       The current Lynx API uses ***TWO** stack-based parsers, one in
> SGML.c, and another in HTML.c.  The one in HTML.c stacks "container"
> HTML elements (ones not declared SGML_EMPTY in HTMLDTD.c), and depends
> on the SGML.c parser to enforce valid (*strictly* embedded and *never*
> interdigitated) nesting of them.  That is why the SGML.c functions
> substitute the "expected" end tags for "container" HTML elements before
> invoking HTML.c functions.  If you break that, as in your patch, in
> Laura's original patch, and in her more recent "BETTER SOLUTION"
> patch, you throw the HTML.c stack out of whack.  

Although I think this was not the case with Hynek's patch - if it had
worked the way he intended.

An example he gave was
   <B><A HREF="something"></B>something</A>

Regular Lynx SGML.c processing would treat that as (== pass it down to HTML.c
as if it were)
   <B><A HREF="something"></A>something</B>
giving a link that cannot be selected.

With Hynek's patch instead:
   <B><A HREF="something">something</A>
The </B> is ignored (the SGML.c parser's stack is not changed when
</B> is encountered), and when the </A> is detected B is still on the
stack (possibly until the end of the document).  But at least this 
doesn't create out-of-order calls to HTML_start_element/HTML_end_element.

With Laura's "BETTER SOLUTION" patch (the first one was specific to FORM,
but I think the principle was the same):
   <B><A HREF="something"></B>something</A>
That is, the markup is passed down unchanged, generating calls to
HTML_start_element/HTML_end_element in invalid order.  (Changing the order
of stack elements, by using anything other than push and pop operations,
of course makes the whole idea of having a stack structure pointless.)
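
To make the three behaviours above concrete, here is a toy model in Python.
It is only a sketch and not Lynx code: the token format, the policy names
("substitute", "ignore", "passthrough") and the functions run() and
properly_nested() are made up for illustration, and the real SGML.c logic
is more involved.  The list of calls stands in for what HTML.c would see
through HTML_start_element/HTML_end_element.

```python
def run(tokens, policy):
    """Feed (kind, name) tokens through a toy end-tag policy.

    Returns the sequence of calls a downstream module would receive.
    """
    stack = []   # open container elements, innermost last
    calls = []   # stand-in for start/end calls passed downstream
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)
            calls.append(("start", name))
        elif kind == "text":
            calls.append(("text", name))
        elif kind == "end":
            if stack and stack[-1] == name:
                # The end tag matches the innermost open element.
                calls.append(("end", stack.pop()))
            elif policy == "substitute":
                # Mismatch: assume the expected end tag instead.
                if stack:
                    calls.append(("end", stack.pop()))
            elif policy == "ignore":
                # Mismatch: drop the end tag, leave the stack alone.
                pass
            elif policy == "passthrough":
                # Mismatch: close the named element even though it is
                # not on top, i.e. take it out of the middle of the stack.
                if name in stack:
                    innermost = len(stack) - 1 - stack[::-1].index(name)
                    del stack[innermost]
                    calls.append(("end", name))
    # Whatever is still open gets closed at the end of the document.
    while stack:
        calls.append(("end", stack.pop()))
    return calls


def properly_nested(calls):
    """True if every end call closes the most recently opened element."""
    open_elements = []
    for kind, name in calls:
        if kind == "start":
            open_elements.append(name)
        elif kind == "end":
            if not open_elements or open_elements.pop() != name:
                return False
    return not open_elements


# <B><A HREF="something"></B>something</A>
EXAMPLE = [("start", "B"), ("start", "A"), ("end", "B"),
           ("text", "something"), ("end", "A")]

for policy in ("substitute", "ignore", "passthrough"):
    calls = run(EXAMPLE, policy)
    print(policy, properly_nested(calls), calls)
```

In this model only the "passthrough" policy hands the downstream code an
end call for B while A is still open; the other two keep the calls
strictly nested, at the cost of an empty link ("substitute") or of B
staying open until the end of the document ("ignore").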

From Laura's description:
"Strategy of fix:  If an end tag </xxx> is found that doesn't match the top
 element of the stack, search down the stack until you find a match.  If
 there's no match, ignore the end tag;"...

Isn't this *first* part reasonable?  (just ignoring end tags that
cannot possibly be right.)  It doesn't mess up the stacks (or so it seems
to me).
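
As a sketch of what I mean by the *first* part, again in Python and again
with made-up names (classify_end_tag() and its three return values are not
anything in Lynx): an end tag is either the expected one, or it matches
something further down the stack, or it matches nothing at all.  Only the
last case is the one I am calling harmless to ignore; what to do in the
middle case is exactly the contested question, so the sketch only labels
it.

```python
def classify_end_tag(stack, name):
    """Classify an end tag against the stack of open elements.

    stack lists open container elements, innermost last.
    """
    if name not in stack:
        # No open element with this name anywhere: the end tag cannot
        # possibly be right, and dropping it leaves the stack untouched.
        return "ignore"
    if stack[-1] == name:
        # The expected end tag: an ordinary pop.
        return "pop"
    # Open, but not innermost: the case the patches disagree about.
    return "mismatch"


open_elements = ["FORM", "TABLE", "TR"]
for tag in ("TR", "FORM", "P"):
    print(tag, classify_end_tag(open_elements, tag))
```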

> In the course of
> the past three years, I've added lots of "hacks" to get around the
> constraints of stack-based parsing and try to cope with much of the
> bad HTML which the "anything that basically works and sells is fine"
> vendor(s) has(have) made so commonplace on the Web, so if you break
> the enforced valid nesting in SGML.c of HTML elements declared as
> "containers" in HTMLDTD.c and just test the result "empirically" with
> this or that URL that returns bad HTML, rather than understanding and
> considering the consequences for the "downstream" functions, you might
> think you've improved the situation.  But believe me, please, that's
> NOT a good thing to do.
[...]
>       When Rob started developing the configurable color/styles
> enhancements, and the potential for using external style sheets
> (very important, IMHO, for the long-term viability of Lynx) he
> also ran into the problem of stack-based parsing being heavily
> dependent on valid HTML, plus conflicts with my hacks to get around
> the constraints.  He then turned to a hash table design, with the
> prospects of eliminating stack-based parsing in Lynx altogether.
> That, rather than further "workaround" hacks to the present
> stack-based parsing, is a better long-term objective for Lynx
> development (sez I from my "vacation spot" 8-).

I'd like to see it...
 
>       Be that as it may, appended is a patch set for v2.7.1 which
> achieves what you and Laura are attempting, and without throwing
> the HTML.c stack out of whack.  It is also available (as a
> formhack.patch text file and in a formhack.zip) in:
> 
>         http://www.slcc.edu/lynx/fote/patches/
> or:      ftp://www.slcc.edu/pub/lynx/fote/patches
> 

Another step in making Lynx's parsing more like that of the above-mentioned
vendor's(s') products, unfortunately.

   Klaus
