lynx-dev

LYNX-DEV Parser stuff (was: Another fotemods.zip update)


From: Klaus Weide
Subject: LYNX-DEV Parser stuff (was: Another fotemods.zip update)
Date: Wed, 16 Apr 1997 15:03:57 -0500 (CDT)

On Wed, 16 Apr 1997, Foteos Macrides wrote:

> Klaus Weide <address@hidden> wrote:
> >On Tue, 15 Apr 1997, Foteos Macrides wrote:
> >> 
> >> 1997-04-15
> >> * Miscellaneous additional tweaks in HTML.c for more robust error recovery
> >>   from bad HTML involving emphasis or style elements (B, BLINK, CITE, EM,
> >>   FONT, I, STRONG, and U), or HREF-less NAME-ed Anchors without matching
> >>   end tags. - FM
> >> * Modified the declarations in HTMLDTD.c and code in SGML.c, HTML.c, and
> >>   GridText.c to handle A, B, BLINK, CITE, EM, FONT, I, STRONG, and U
> >>   container elements homologously to the modified handling of FORM (see
> >>   1997-04-05 mods) so that if they are invalidly interdigitated or have
> >>   spurious end tags in the markup, substitutions of the "expected" end
> >>   tags by the SGML.c stack-based parser will not be made, and without
> >>   messing up the HTML.c stack-based parser.  Appears to work reliably
> >>   for all of the elements, and to be reasonably crash safe (hopefully
> >>   as safe as the vanilla v2.7.1), but there are no guarantees. - FM
> >
> >If you go on at that pace, there won't be anything left to do for
> >the stack-based parsing !?
> 
>       I did it with the current API because it should still be
> compatible with what can and should be done when Rob's color/style
> stuff is worked into the next formal release (as you apparently
> intend to do).  However, v2.7.1+FOTEMODS still is treating all the
> "emphasis" elements as if they were synonyms, for underlining.  The
> color/style stuff should allow them to be treated individually, but
> with a hash table design still be able to cope equivalently with
> bad HTML (that's purely "theoretical" at this point, though 8-).

I keep hearing about this "hash table design", but I still have no idea
what the function of a "hash table" would be, i.e. what in the current
design it would replace or extend.  Would it be used to encode *more*
of the structure information of HTML elements?

  If you have a clearer idea about this than I (which must be the case :) )
I would appreciate some summary explanation.  Or can this be found in
Rob's preliminary code (from Jan. or so)?  I am trying to get some
understanding without having to analyze that code...
(But a simple yes or no to the question "can the answer be found there?"
would also help.)
  
> >                               I wonder whether there is anything which
> >cannot be subjected to the same treatment for reasons of principle...
> 
>       You should only do this kind of thing with the current API
> for elements which do not have styles registered in DefaultStyle.c,
> nor have ALIGN attributes.  They inherit the registered style of any
> element which contains them (and its, or a P's, CENTER's, or DIV's,
> current alignment setting), or the Normal style if they aren't nested.
> Thus, there is no need to put them in the HTML.c stack, because their
> entries simply will have the styles info for the preceding element in
> the stack, reiterated.  So you can just look at the "containing"
> element's (or Normal) style in the stack, and furthermore save the
> memory allocations associated with reiterating that for A, FORM, and
> the emphasis/style container elements and loading those as well onto
> the HTML.c stack.  

Ok, thanks for the explanation.  I think I understand the basic difference
now, between tags that can be subjected to "FORM-hack-like" treatment,
and those that cannot.  This means that if the number of tags which have
some "Style"-relevant (as handled by DefaultStyle.c) property increases,
those tags are then not amenable to that treatment any more.  Or, viewing
it the other way, treating tags FORM-hack-like will prevent introducing
"Style" parameters for them.

The way I conceptualize this is that, instead of having one (or two, in
different modules) global stack, you have lots of them which don't
know about each other... (most of them with a max depth of 1).
 
Hmm, the property of a tag to cause style-stack pushing does not 
necessarily have to be coupled to the decision "SGML_MIXED or not".
There could be an additional per-tag flag in the HTMLDTD, which
could tell HTML_{start,end}_element whether to push/pop a style
even for an SGML_MIXED element.  That way one could still avoid
the Style-related overhead for some elements.
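
A minimal sketch of that per-tag flag idea, in C.  All names here (TagInfo,
pushes_style, start_element, end_element, the table contents, the depth
limit) are made up for illustration and are not the actual HTMLDTD.c or
HTML.c declarations; only SGML_EMPTY and SGML_MIXED are taken from the
discussion above.

```c
#include <stddef.h>
#include <string.h>

/* Content models; only the two names mentioned in this thread. */
typedef enum { SGML_EMPTY, SGML_MIXED } ContentModel;

/* Hypothetical per-tag record: the content model and the style
   behaviour are stored as two independent properties. */
typedef struct {
    const char  *name;
    ContentModel contents;
    int          pushes_style;  /* hypothetical flag: 1 = push/pop a style */
} TagInfo;

/* Illustrative table, not the real HTMLDTD. */
static const TagInfo tag_table[] = {
    { "P",    SGML_MIXED, 1 },  /* has its own style */
    { "EM",   SGML_MIXED, 0 },  /* inherits the container's style */
    { "FORM", SGML_MIXED, 0 },
    { "BR",   SGML_EMPTY, 0 },
};

#define STYLE_STACK_MAX 64      /* arbitrary limit for the sketch */
static const TagInfo *style_stack[STYLE_STACK_MAX];
static int style_depth = 0;

/* Look up a tag by exact name; returns NULL if unknown. */
const TagInfo *find_tag(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof tag_table / sizeof tag_table[0]; i++)
        if (strcmp(tag_table[i].name, name) == 0)
            return &tag_table[i];
    return NULL;
}

/* Returns 1 if a style entry was pushed, 0 if the element was skipped. */
int start_element(const TagInfo *tag)
{
    if (tag == NULL || !tag->pushes_style)
        return 0;               /* no style overhead for this element */
    if (style_depth >= STYLE_STACK_MAX)
        return 0;               /* refuse to overflow the stack */
    style_stack[style_depth++] = tag;
    return 1;
}

/* Returns 1 if a style entry was popped, 0 otherwise. */
int end_element(const TagInfo *tag)
{
    if (tag == NULL || !tag->pushes_style || style_depth == 0)
        return 0;
    style_depth--;
    return 1;
}

/* Name of the element whose style is in effect, or "Normal". */
const char *current_style_owner(void)
{
    return style_depth > 0 ? style_stack[style_depth - 1]->name : "Normal";
}
```

With such a table, an EM inside a P would leave the P entry on top of the
style stack, even though both are SGML_MIXED.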

> I'm not certain yet that the mods are fully immune
> from inherited alignment glitches.  The alignment handling is too
> complicated to be sure one has thought it all through correctly, but
> empirically, so far, that seems to be OK too.

I haven't even tried to understand how the alignment mechanism works...

I am tempted to try going the opposite direction from what you are
doing: encode *more* information in the so-called HTMLDTD, not less,
and let more of the structure be handled by SGML.c (with appropriate
heuristics).  Essentially make the HTMLDTD resemble more an "official"
DTD, not making it more dissimilar by pretending all the problematic
elements are SGML_EMPTY...  I am thinking of some stack wind-down, and
automatic closing of some open elements, but based on per-element
information and a hierarchy of "strongness" - for example a </FORM>
would close an open <EM> (maybe only if a <FORM> is open) because
FORM is "stronger" (more "outer") than EM, but an </EM> would not
close an open <FORM>.  An <A> would auto-close an open <A>, since
the "HTMLDTD" knows that A cannot be nested in itself.
I am really curious whether this would improve handling of some
invalid stuff.  It might be worth an experiment.  Comments welcome...
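
To make the "strongness" idea a bit more concrete, here is a rough sketch
in C of one possible policy.  Everything in it is hypothetical: the ElemInfo
record, the numeric strength values, the self_nesting flag, and the choice
to simply ignore an end tag that would have to close a stronger element.
It is not code from SGML.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical per-element information. */
typedef struct {
    const char *name;
    int strength;      /* higher = "stronger" (more "outer") */
    int self_nesting;  /* 0 = element cannot be nested in itself */
} ElemInfo;

/* Illustrative values only. */
static const ElemInfo elems[] = {
    { "FORM", 3, 0 },
    { "A",    2, 0 },
    { "EM",   1, 1 },
};
#define N_ELEMS ((int)(sizeof elems / sizeof elems[0]))

#define MAX_DEPTH 32
static int open_stack[MAX_DEPTH];   /* indices into elems[] */
static int depth = 0;

void reset_stack(void) { depth = 0; }
int  open_depth(void)  { return depth; }

static int lookup(const char *name)
{
    int i;
    for (i = 0; i < N_ELEMS; i++)
        if (strcmp(elems[i].name, name) == 0)
            return i;
    return -1;
}

/* Topmost stack position holding element e, or -1 if it is not open. */
static int find_open(int e)
{
    int i;
    for (i = depth - 1; i >= 0; i--)
        if (open_stack[i] == e)
            return i;
    return -1;
}

/* End tag: wind the stack down to the matching open element, closing
   weaker elements on the way.  Returns the number of elements closed;
   0 means the end tag was ignored. */
int close_element(const char *name)
{
    int e = lookup(name), pos, i, closed;
    if (e < 0)
        return 0;
    pos = find_open(e);
    if (pos < 0)
        return 0;                   /* spurious end tag */
    for (i = pos + 1; i < depth; i++)
        if (elems[open_stack[i]].strength > elems[e].strength)
            return 0;               /* would close something stronger */
    closed = depth - pos;
    depth = pos;
    assert(depth >= 0);
    return closed;
}

/* Start tag: an element that cannot nest in itself first gets an
   implied end tag.  Returns the number of elements auto-closed. */
int open_element(const char *name)
{
    int e = lookup(name), closed = 0;
    if (e < 0)
        return 0;
    if (!elems[e].self_nesting)
        closed = close_element(name);
    if (depth < MAX_DEPTH)
        open_stack[depth++] = e;
    return closed;
}
```

Under these assumptions <FORM><EM></FORM> closes both elements, while the
</EM> in <EM><FORM></EM> is ignored, and a second <A> closes the first one
before opening.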


> >I would appreciate reports to lynx-dev from people who have tried Fote's
> >latest code, does it improve the handling of invalid pages?  Does it make
> >more sense of the typical invalid HTML than standard Lynx 2.7.1 (or  
> >devel)?
> 
>       I tested it with http://www.businesswire.com/headlines.shtml
> and it handles that perfectly now, despite all the horrible non-HTML
> in it.

That really is a significant achievement :)

Below is a patch for a simple fix of some inconsistencies in the
stack bounds checking.  With that applied, Lynx (without your
changes) at least doesn't get a memory violation.  (It still
complains a lot as it should about the number 800 being exceeded,
and gives an appropriately ugly display...)
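
(The following is not the patch itself, just a minimal sketch in C of what
consistent bounds checking on a fixed-size element stack can look like.
The names ElementStack, stack_push and stack_pop are made up, and treating
800 as the stack depth limit is an assumption based on the number
mentioned above.)

```c
#define MAX_NESTING 800     /* assumed to be the stack depth limit */

typedef struct {
    int tag_number;
} StackElement;

/* Hypothetical fixed-size stack with an overflow counter, so that
   pushes and pops stay balanced once the limit has been hit. */
typedef struct {
    StackElement stack[MAX_NESTING];
    int depth;      /* number of entries in use */
    int overflow;   /* number of pushes refused so far */
} ElementStack;

void stack_init(ElementStack *s)
{
    s->depth = 0;
    s->overflow = 0;
}

/* Returns 1 on success, 0 if the stack is full (nothing is written). */
int stack_push(ElementStack *s, int tag_number)
{
    if (s->depth >= MAX_NESTING) {
        s->overflow++;      /* remember the refused push */
        return 0;
    }
    s->stack[s->depth++].tag_number = tag_number;
    return 1;
}

/* Returns 1 and stores the tag on success, 0 if there is nothing to pop
   or if this pop only balances an earlier refused push. */
int stack_pop(ElementStack *s, int *tag_number)
{
    if (s->overflow > 0) {
        s->overflow--;
        return 0;
    }
    if (s->depth == 0)
        return 0;
    *tag_number = s->stack[--s->depth].tag_number;
    return 1;
}
```

The point is only that push and pop test against the same limit, so that
neither can index outside the array.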
  
>  That page has a number of link names which are 3 or 4 lines
> long, and Lynx highlights only their first two lines when making
> them the current link.  I looked again at what would be involved in
> modifying the code to highlight more of the current link lines, or
> ideally all of them no matter how many, but again said "Ugh!" and
> moved that to the bottom of my TODO list. :) :)

It would also make it more likely that the continuation lines would
fall on the beginning of a screen page with the anchor start on the
previous page, leading to confusing appearance.  Unless there will
be even more changes that would allow "selecting" the "tail" of an
anchor area.

