lynx-dev

Re: LYNX-DEV problem when page has only one input area


From: Benjamin C. W. Sittler
Subject: Re: LYNX-DEV problem when page has only one input area
Date: Thu, 14 Nov 1996 10:49:44 -0700 (MST)

On Wed, 13 Nov 1996, Foteos Macrides wrote:

> Filip M Gieszczykiewicz <address@hidden> wrote:
> >Greetings. I'm not sure if this issue has been beaten to death here,
> >so let me know if it has.
> >
> >It seems that people get the notion that Netscape is more than a browser
> >and also a validator - "if my pages work in Netscrape, they must be ok".
> >Too many of these same people don't know and don't care that this is
> >absolute BS.
> >
> >As a result, a lot of broken HTML gets out there and really messes with
> >stricter browsers like, say, lynx. I just heard that Fote fixed the
> >unclosed <form> - YES! I run across these all the time... another
> >infamous is <a name="blah"> (no closing </a> or tag text), and various
> >combinations of markup in links. Example:
> >
> ><ul>
> ><li><a href="Howdie1"><b>Howdie1</a>
> ><li><a href="Howdie2"><i>Howdie2<i></a>
> ><li><a href="Howdie3">Howdie3</b></a>
> ><li><a href="Howdie4"><b>Howdie4</b></a>
> ><li><a href="Howdie5">Howdie5</a>
> ></ul>
> >
> >The first doesn't show up as the link, neither does the second one
> >(and the bullet is gone from now on, as well), the third one is OK,
> >as is the fourth. The fifth one hoses up but DOES select, and shows
> >some silly "* Howd" on the sixth line...
> 
>       Lynx has no realistic prospect of handling HTML that bad
> as intended.  When it's that bad, the objective is simply not to
> crash.  Lynx can't ignore any interdigitated container (SGML_MIXED)
> tags that it recognizes, which is functionally what you have there.
> It must substitute the end tag it's expecting.  It should unwind to
> what it's expecting, but I changed it not to do that a year or so
> ago, and that helped.  I also changed the worst offenders to
> SGML_EMPTY, and look for their end tags explicitly, so those can
> be interdigitated, but you can't do that for everything and still
> have reasonable performance.

That may be so, but what if we used a "tag stack" model and only
unwound as far as absolutely necessary when encountering illegal markup?
(Sorta equivalent to making all end-tags omissible.) So for the above
doc you might have the following virtual tag sequence: (not strictly HTML)

<UL>
<LI><A HREF="Howdie1"><B><#PCDATA "Howdie1"></#PCDATA><!--
  Parse Error: non-omissible /B omitted.
--></B></A>
</LI><LI><A HREF="Howdie2"><I><#PCDATA "Howdie2"></#PCDATA><I><!--
  Parse Error: non-omissible /I omitted.
--></I><!--
  Parse Error: non-omissible /I omitted.
--></I></A>
</LI><LI><A HREF="Howdie3"><#PCDATA "Howdie3"></#PCDATA><!--
  Parse Error: /B does not close any open element.
--></A>
</LI><LI><A HREF="Howdie4"><B><#PCDATA "Howdie4"></#PCDATA></B></A>
</LI><LI><A HREF="Howdie5"><#PCDATA "Howdie5"></#PCDATA></A>
</LI></UL>

(All inferred tags are shown at the point of inference, possibly
preceded by an error message in a comment)
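
A minimal sketch of the "unwind only as far as necessary" rule, written
in Python for brevity (Lynx itself is C). The names close_tag, stack and
log are hypothetical, and the messages are simplified; this is an
illustration of the idea, not Lynx code.

```python
# Hypothetical sketch: close_tag, stack and log are illustrative names.

def close_tag(stack, name, log):
    """Handle an end tag </name> against the stack of open elements."""
    if name not in stack:
        # Stray end tag: report it and leave the stack untouched.
        log.append(f"/{name} does not close any open element")
        return
    # Unwind only as far as the nearest matching open element,
    # inferring an end tag for everything popped on the way.
    while True:
        top = stack.pop()
        if top == name:
            return
        log.append(f"/{top} omitted")


# State after <UL><LI><A HREF="Howdie1"><B>Howdie1, then </A> arrives.
stack = ["UL", "LI", "A", "B"]
log = []
close_tag(stack, "A", log)
print(stack)  # ['UL', 'LI']
print(log)    # ['/B omitted']
```

The stray </B> in the Howdie3 item takes the first branch: the stack is
left alone and only the error is recorded.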

The reason for #PCDATA is that text implies the close of some HTML
tags, such as HR and IMG, and even DIV if HTML.Recommended is used,
although this would produce a parse error. One way to accomplish this
would be to have a two-dimensional data structure (perhaps an array)
indexed in both dimensions by tag name (including #PCDATA). Let's call
it OpenCloses[i][j]. OpenCloses[i][j] could take on one of two values,
depending on i and j:

    0. false
       <i> does not imply </j>
                (the case when i==j==I, and when i==B and j==A)


    1. true
       <i> implies </j>
                (the case when i==j==LI, and when i==#PCDATA and j==DIV)

Another data structure, call it OmitClose[i], would contain true (1)
when </i> is omissible (e.g. HTML, BODY), and false (0)
elsewhere. This would be used to generate the error messages above.
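
A rough sketch of how the two tables might be consulted when a start tag
arrives, in Python for brevity. A dict of sets stands in for the
two-dimensional array, the entries are a tiny illustrative subset, and
listing LI as omissible is my assumption from the HTML DTD; open_tag is a
hypothetical name, and this version only looks at the top of the stack.

```python
# Illustrative, incomplete tables; OPEN_CLOSES[i] is the set of j
# for which <i> implies </j>.
OPEN_CLOSES = {
    "LI": {"LI"},        # a new <LI> closes the previous one
    "#PCDATA": {"DIV"},  # text closes DIV (the HTML.Recommended case)
}
# Elements whose end tag may be omitted (LI is an assumption here).
OMIT_CLOSE = {"HTML", "BODY", "LI"}


def open_tag(stack, name, log):
    """Push <name>, first closing open elements that <name> implies closed."""
    implied = OPEN_CLOSES.get(name, set())
    while stack and stack[-1] in implied:
        closed = stack.pop()
        if closed not in OMIT_CLOSE:
            log.append(f"Parse Error: non-omissible /{closed} omitted.")
    stack.append(name)


stack, log = ["UL", "LI"], []
open_tag(stack, "LI", log)  # second <LI> silently closes the first
print(stack)  # ['UL', 'LI']
print(log)    # []
```

<I> inside <I> has no entry in the table, so the second <I> is simply
pushed on top of the first, which is the "false" case above.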

This, however, doesn't work for elements which may contain only a
certain sequence of contents (an HTML3 UL, for example, may have an LH
only at the beginning of the list). To handle this we need to build a
one-dimensional array of regular expressions (or an equivalent)
indexed by tag. For example, the HTML3 entry for UL and OL might look
like this:

LH?, LI+ (excerpt from HTML3 DTD... optional LH followed by one or more
          LIs.)
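
One way the per-tag expression could be checked, sketched in Python with
its standard re module. The encoding (each child tag name followed by a
single space) and the names CONTENT_MODEL and children_valid are
assumptions made for the illustration; only the LH?, LI+ model itself
comes from the DTD excerpt above.

```python
import re

# Hypothetical encoding: the children of an element are written as
# tag names, each followed by one space, and matched as a whole.
CONTENT_MODEL = {
    "UL": re.compile(r"(LH )?(LI )+"),  # LH?, LI+
    "OL": re.compile(r"(LH )?(LI )+"),
}


def children_valid(parent, children):
    """Return True if the child sequence fits the parent's content model."""
    model = CONTENT_MODEL.get(parent)
    if model is None:
        return True  # no model recorded: accept anything in this sketch
    encoded = "".join(child + " " for child in children)
    return model.fullmatch(encoded) is not None


print(children_valid("UL", ["LH", "LI", "LI"]))  # True
print(children_valid("UL", ["LI", "LH"]))        # False
```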

If this were implemented, Lynx could actually understand stuff like
</> and <> (SHORTTAG) that only full-featured SGML systems
understand. In fact, Lynx could be a real SGML system. According to
the HTML DTD, SHORTTAG *is* allowed in HTML docs.

>       It may behave somewhat differently with that bad markup
> now that it's unwinding on EOF, but it won't make it "right". 

EOF should cause a complete unwinding of the tag stack, producing
error messages for all non-omissible tags.
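
Sketched in Python, under the assumption that the open elements are kept
on a simple list and that only HTML and BODY have omissible end tags
(unwind_at_eof and OMIT_CLOSE are hypothetical names):

```python
# Illustrative subset of elements whose end tag may be omitted.
OMIT_CLOSE = {"HTML", "BODY"}


def unwind_at_eof(stack, log):
    """Close every element still open at EOF, innermost first."""
    while stack:
        name = stack.pop()
        if name not in OMIT_CLOSE:
            log.append(f"Parse Error: non-omissible /{name} omitted.")


stack = ["HTML", "BODY", "FORM", "B"]  # document ended with these open
log = []
unwind_at_eof(stack, log)
print(stack)  # []
print(log)
```

Here the unclosed FORM and B each produce a message, while BODY and HTML
are closed silently.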

Perhaps Lynx 3 should be an SGML system? If not, I'll probably use Lynx
as the fetching engine for a web browser built around a valid parser,
but it will take a while to develop.

Just a thought.

--
Ben



