lynx-dev

Re: LYNX-DEV Parser stuff (was: Another fotemods.zip update)


From: Rob Partington
Subject: Re: LYNX-DEV Parser stuff (was: Another fotemods.zip update)
Date: Thu, 17 Apr 1997 22:05:03 +0100 (BST)

Klaus Weide wrote:
> 
> On Wed, 16 Apr 1997, Foteos Macrides wrote:
> 
> > Klaus Weide <address@hidden> wrote:
> > >On Tue, 15 Apr 1997, Foteos Macrides wrote:
> > >> 
> > >> 1997-04-15
> > >> * Miscellaneous additional tweaks in HTML.c for more robust error recovery
> > >>   from bad HTML involving emphasis or style elements (B, BLINK, CITE, EM,
> > >>   FONT, I, STRONG, and U), or HREF-less NAME-ed Anchors without matching
> > >>   end tags. - FM
> > >> * Modified the declarations in HTMLDTD.c and code in SGML.c, HTML.c, and
> > >>   GridText.c to handle A, B, BLINK, CITE, EM, FONT, I, STRONG, and U
> > >>   container elements homologously to the modified handling of FORM (see
> > >>   1997-04-05 mods) so that if they are invalidly interdigitated or have
> > >>   spurious end tags in the markup, substitutions of the "expected" end
> > >>   tags by the SGML.c stack-based parser will not be made, and without
> > >>   messing up the HTML.c stack-based parser.  Appears to work reliably
> > >>   for all of the elements, and to be reasonably crash safe (hopefully
> > >>   as safe as the vanilla v2.7.1), but there are no guarantees. - FM
> > >
> > >If you go on at that pace, there won't be anything left to do for
> > >the stack-based parsing !?
> > 
> >     I did it with the current API because it should still be
> > compatible with what can and should be done when Rob's color/style
> > stuff is worked into the next formal release (as you apparently

a small note about this: i'm having difficulty even compiling 2.7 at
the moment because i've installed the latest linux kernel (2.1.34).
the build fails with __inet_addr undefined in HTTCP.o, and i've yet
to figure out where that comes from.  lynx 2.7 builds fine on 2.0.27.

i've worked my style mods into the base 2.7 code, but there are a few
problems (like the styles don't actually get created so you end up
with no highlighting) which i'm working on right now.

> I hear this "hash table design" again, but am still clueless what
> the function of a "hash table" would be.  I.e. what in the current
> design would it replace/extend.  Would it be used to encode *more*
> of the structure information of HTML elements?

the hash table is used to store the style information for an element,
based on the element name and its class.  so the style information for
<em class=red> is stored in styles[hash("em.red")].  i'm about to add
support for start/end strings for the text representation so that you
can say things like "style:em:red:*:*" (i'm going to trim the config
format because you don't need three fields to specify the attributes)
and "style:strong:brightred:<!:!>" which would render "<strong>a</strong>"
as "<!a!>".  in the extreme, you could do "style:strong::<strong>:</strong>"
which would display the page as source (if you did it for all elements).
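
a minimal sketch of that lookup, assuming a fixed-size table with linear
probing.  every name here (style_set, style_render, struct style) and the
hash function are made up for illustration and are not taken from the
actual patch.  it also assumes a class-less element is keyed by its bare
name ("strong" rather than "strong.").

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NSTYLES 64          /* table size: arbitrary for this sketch */
#define KEYLEN  32
#define MARKLEN 16

struct style {
    char key[KEYLEN];       /* "element.class", e.g. "em.red"; "" = empty slot */
    char start[MARKLEN];    /* text emitted before the element's content */
    char end[MARKLEN];      /* text emitted after it */
};

static struct style styles[NSTYLES];

/* simple multiplicative string hash; an assumption, any hash would do */
static unsigned style_hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NSTYLES;
}

/* build "element.class", or just "element" when there is no class */
static void make_key(char *key, size_t len, const char *element,
                     const char *class_name)
{
    if (class_name != NULL && class_name[0] != '\0')
        snprintf(key, len, "%s.%s", element, class_name);
    else
        snprintf(key, len, "%s", element);
}

/* return the slot holding key, or the empty slot where it would go */
static struct style *style_slot(const char *key)
{
    unsigned i, h = style_hash(key);
    for (i = 0; i < NSTYLES; i++) {
        struct style *st = &styles[(h + i) % NSTYLES];
        if (st->key[0] == '\0' || strcmp(st->key, key) == 0)
            return st;
    }
    return NULL;            /* table full */
}

/* store start/end strings for element.class; 0 on success, -1 if full */
int style_set(const char *element, const char *class_name,
              const char *start, const char *end)
{
    char key[KEYLEN];
    struct style *st;
    assert(element != NULL && start != NULL && end != NULL);
    make_key(key, sizeof key, element, class_name);
    st = style_slot(key);
    if (st == NULL)
        return -1;
    snprintf(st->key, sizeof st->key, "%s", key);
    snprintf(st->start, sizeof st->start, "%s", start);
    snprintf(st->end, sizeof st->end, "%s", end);
    return 0;
}

/* wrap text in the element's start/end strings if a style is stored */
void style_render(const char *element, const char *class_name,
                  const char *text, char *out, size_t outlen)
{
    char key[KEYLEN];
    const struct style *st;
    make_key(key, sizeof key, element, class_name);
    st = style_slot(key);
    if (st != NULL && st->key[0] != '\0')
        snprintf(out, outlen, "%s%s%s", st->start, text, st->end);
    else
        snprintf(out, outlen, "%s", text);
}
```

with style_set("strong", "", "<!", "!>") in place, rendering "a" inside
strong gives "<!a!>", and an element with no stored style comes back
unchanged.  colour attributes are left out since the config format is
still changing.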

ObHack: has anyone fixed the "view source too wide for my screen"
problem?  if not, then how about this for a quick solution: get
html_start_element to add "<%s>", element_name (and similarly for
html_end_element) to the gridtext structure when a certain flag is set?
you'd need to work out a way to print the attributes, but you'd get nice
word-wrapped source (probably have to hack split_line to understand about
not splitting in certain places in tags as well).
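
a rough sketch of the idea, with a toy buffer standing in for the gridtext
structure.  struct page, emit, start_element, end_element and put_word are
all hypothetical names, not lynx functions, and the "</%s>" form for end
tags is my assumption.  each tag goes in as one unbreakable token, so
wrapping only ever happens between tokens and never inside a tag.
attributes are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WIDTH 20            /* pretend screen width, tiny on purpose */

struct page {
    char text[512];         /* rendered output, lines separated by '\n' */
    size_t len;             /* bytes used in text */
    size_t col;             /* current column on the last line */
    int show_tags;          /* the "certain flag": nonzero = echo tags */
};

/* append one unbreakable token; sep says whether whitespace preceded it.
   if the token does not fit on the current line, wrap before it. */
static void emit(struct page *p, const char *tok, int sep)
{
    size_t n;
    assert(p != NULL && tok != NULL);
    n = strlen(tok);
    if (p->len + n + 2 > sizeof p->text)
        return;             /* out of room: this sketch just drops it */
    if (p->col > 0 && p->col + (sep ? 1 : 0) + n > WIDTH) {
        p->text[p->len++] = '\n';
        p->col = 0;
    } else if (sep && p->col > 0) {
        p->text[p->len++] = ' ';
        p->col++;
    }
    memcpy(p->text + p->len, tok, n);
    p->len += n;
    p->col += n;
    p->text[p->len] = '\0';
}

/* called for every start tag; echoes "<name>" only in source mode */
void start_element(struct page *p, const char *name, int sep)
{
    char tag[64];
    if (p->show_tags) {
        snprintf(tag, sizeof tag, "<%s>", name);
        emit(p, tag, sep);
    }
}

/* called for every end tag; echoes "</name>" only in source mode */
void end_element(struct page *p, const char *name, int sep)
{
    char tag[64];
    if (p->show_tags) {
        snprintf(tag, sizeof tag, "</%s>", name);
        emit(p, tag, sep);
    }
}

/* ordinary text always goes through, one word at a time */
void put_word(struct page *p, const char *word, int sep)
{
    emit(p, word, sep);
}
```

note that a wrap can land between two tokens that had no whitespace
between them in the source, so the result is for reading, not for saving
back to disk.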

>   If you have a clearer idea about this than I (which must be the case :) )
> I would appreciate some summary explanation.  Or can this be found in
> Rob's preliminary code (from Jan. or so)?  I am trying to get some
> understanding without having to analyze that code...

it's not _that_ bad, is it?

> Hmm, the property of a tag to cause style-stack pushing does not 
> necessarily have to be coupled to the decision "SGML_MIXED or not".
> There could be an additional per-tag flag in the HTMLDTD, which
> could tell HTML_{start,end}_element whether to push/pop a style
> even for an SGML_MIXED element.  That way one could still avoid
> the Style-related overhead for some elements.

except you then run into problems if there's no flag in HTMLDTD, and 
someone does <TT style="something">.  if people want "real" style
handling, then you can't have whether an element has a style or not
hard-coded - it has to be generated as you parse the page (which is
a pain.)
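
to make the point concrete, here is a small sketch of a push/pop decision
that looks at the attributes actually seen on the tag as well as a
per-tag flag.  struct tag_info, struct attr and needs_style_push are
invented for this example; treating both "style" and "class" as triggers
is my assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct tag_info {
    const char *name;
    int dtd_style_flag;     /* hypothetical per-tag flag in the DTD table */
};

struct attr {
    const char *name;       /* NULL name terminates the array */
    const char *value;
};

/* case-insensitive comparison, since attribute names may be upper case */
static int same_name(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;        /* equal only if both strings ended */
}

/* nonzero if this particular tag instance needs a style pushed */
int needs_style_push(const struct tag_info *tag, const struct attr *attrs)
{
    size_t i;
    assert(tag != NULL);
    if (tag->dtd_style_flag)
        return 1;           /* the static flag says yes */
    for (i = 0; attrs != NULL && attrs[i].name != NULL; i++) {
        if (same_name(attrs[i].name, "style") ||
            same_name(attrs[i].name, "class"))
            return 1;       /* decided while parsing, not hard-coded */
    }
    return 0;
}
```

so a flag-less TT with a style attribute still gets a style pushed, while
a plain TT keeps avoiding the overhead.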

> elements are SGML_EMPTY...  I am thinking of some stack wind-down, and
> automatic closing of some open elements, but based on per-element
> information and a hierarchy of "strongness" - for example a </FORM>
> would close an open <EM> (maybe only if a <FORM> is open) because
> FORM is "stronger" (more "outer") than EM, but an </EM> would not
> close an open <FORM>.  An <A> would auto close an open <A> since
> the "HTMLDTD" knows that A cannot be nested in itself.
> I am really curious whether this would improve handling of some
> invalid stuff.  It might be worth an experiment.  Comments welcome...

i've got a perl program using a table of element pairs with values like
{CAN_BACKTRACK, CAN_ENCLOSE, DISALLOWED} which does a fairly good job
of impersonating a real SGML parser.  i'll dig it out tonight and post
it to the list tomorrow (if i remember.  if i don't, someone prod me)
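
in the meantime, here is a sketch of the "strongness" wind-down from the
quoted text.  it is not the perl program and does not use its
CAN_BACKTRACK / CAN_ENCLOSE / DISALLOWED table.  the strength numbers,
the self_nest flag and all function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXDEPTH 32

struct elem {
    const char *name;
    int strength;           /* higher = more "outer"; values are made up */
    int self_nest;          /* nonzero if the element may contain itself */
};

static const struct elem dtd[] = {
    { "FORM", 3, 0 },
    { "A",    2, 0 },
    { "EM",   1, 1 },
};

struct stack {
    const struct elem *open[MAXDEPTH];
    int depth;
};

static const struct elem *lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof dtd / sizeof dtd[0]; i++)
        if (strcmp(dtd[i].name, name) == 0)
            return &dtd[i];
    return NULL;
}

/* close the topmost open e plus everything above it, but only if all of
   those elements are strictly weaker than e.  returns how many elements
   were closed, 0 if e is not open or something as strong is in the way. */
static int wind_down(struct stack *s, const struct elem *e)
{
    int i, closed;
    for (i = s->depth - 1; i >= 0; i--) {
        if (s->open[i] == e)
            break;
        if (s->open[i]->strength >= e->strength)
            return 0;       /* blocked: the end tag is ignored */
    }
    if (i < 0)
        return 0;           /* no matching open element */
    closed = s->depth - i;
    s->depth = i;
    return closed;
}

/* handle an end tag; returns the number of elements closed */
int end_tag(struct stack *s, const char *name)
{
    const struct elem *e = lookup(name);
    assert(s != NULL);
    return e != NULL ? wind_down(s, e) : 0;
}

/* handle a start tag; returns the number of elements auto-closed first */
int start_tag(struct stack *s, const char *name)
{
    const struct elem *e = lookup(name);
    int closed = 0;
    assert(s != NULL);
    if (e == NULL)
        return 0;           /* unknown element: ignored in this sketch */
    if (!e->self_nest)
        closed = wind_down(s, e);   /* e.g. <A> closes an open <A> */
    if (s->depth < MAXDEPTH)
        s->open[s->depth++] = e;
    return closed;
}
```

with <FORM><EM> open, </FORM> closes both; with <EM><FORM> open, </EM> is
ignored because FORM is stronger.  equal-strength elements block each
other here, which is one possible policy among several.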

ObSGMLHack: does anyone have any particular thoughts about having lynx
construct a tree-based representation of the document before beginning
to display it?  it's essential for "real" style-sheets, but it needs a
good sgml parser (possibly better than the one lynx has now), and it
needs more memory, but if you can deal with those, it's generally a win.
comments?  flames?
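
for the sake of discussion, a bare-bones sketch of what such a tree might
look like.  struct node and the three functions are hypothetical and not
from any lynx code.  it only shows the shape of the data and where the
memory goes (one node per element and per text run); the parser that
would feed it is the hard part and is not shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct node {
    const char *name;       /* element name, or NULL for a text node */
    const char *text;       /* text for text nodes, else NULL */
    struct node *parent;
    struct node *first_child;
    struct node *last_child;
    struct node *next;      /* next sibling */
};

/* allocate a node and append it to parent's children (parent may be NULL) */
struct node *node_add(struct node *parent, const char *name, const char *text)
{
    struct node *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    n->name = name;
    n->text = text;
    n->parent = parent;
    if (parent != NULL) {
        if (parent->last_child != NULL)
            parent->last_child->next = n;
        else
            parent->first_child = n;
        parent->last_child = n;
    }
    return n;
}

/* number of nodes in the subtree rooted at n, a rough memory measure */
size_t node_count(const struct node *n)
{
    size_t total = 1;
    const struct node *c;
    assert(n != NULL);
    for (c = n->first_child; c != NULL; c = c->next)
        total += node_count(c);
    return total;
}

/* nonzero if some ancestor of n is an element called name; this is the
   kind of context question a style rule would ask */
int has_ancestor(const struct node *n, const char *name)
{
    const struct node *p;
    for (p = n->parent; p != NULL; p = p->parent)
        if (p->name != NULL && strcmp(p->name, name) == 0)
            return 1;
    return 0;
}

/* free a whole subtree */
void node_free(struct node *n)
{
    struct node *c, *next;
    if (n == NULL)
        return;
    for (c = n->first_child; c != NULL; c = next) {
        next = c->next;
        node_free(c);
    }
    free(n);
}
```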
-- 
rob partington / address@hidden / address@hidden