Re: LYNX-DEV error recovery for form parsing


From: Foteos Macrides
Subject: Re: LYNX-DEV error recovery for form parsing
Date: Sat, 05 Apr 1997 17:35:50 -0500 (EST)

Hynek Med <address@hidden> wrote:
>On Thu, 3 Apr 1997, Laura Eaves wrote:
>
>> Anyway, below is a relatively small change to 
>> WWW/Library/Implementation/SGML.c
>> from 
>>      Lynx Version 2.7ac-0.38 (1997)
>> It seems to fix the problem.
>
>Laura,
>
>this is funny. A while ago I sent to lynx-dev a similar patch (though it
>in fact didn't work as I intended it to do, as Klaus has noted :-).. Our
>idea is the same, just not to assume </FORM> and rather ignore the
>offending ending tag. 
>
>I wonder what do others think about the idea behind our patches.. It
>certainly helps for most of the pages with bad markup and it doesn't have
>any side effects on pages with good HTML..

        None of the currently active developers has addressed this,
except for a worrisome non sequitur that HTML element handling might
be done homologously to the optional SGML comment parsing, and Laura
is still hacking solely in SGML.c without understanding the consequences
for HTML.c, GridText.c, and LYfoo.c modules, so (against my better
judgment 8-) I'll address it from my "vacation spot".

        When I was an active developer, this was an FAQ which I
frequently answered, and Subir has an explanation in the "Why
does Lynx do this" pages at "lynx links".  Perhaps yet another
explanation, but geared explicitly toward "code modifiers" rather
than toward "general readers", would be helpful.

        The current Lynx API uses *TWO* stack-based parsers, one in
SGML.c, and another in HTML.c.  The one in HTML.c stacks "container"
HTML elements (ones not declared SGML_EMPTY in HTMLDTD.c), and depends
on the SGML.c parser to enforce valid (*strictly* embedded and *never*
interdigitated) nesting of them.  That is why the SGML.c functions
substitute the "expected" end tags for "container" HTML elements before
invoking HTML.c functions.  If you break that, as in your patch, in
Laura's original patch, and in her more recent "BETTER SOLUTION"
patch, you throw the HTML.c stack out of whack.  In the course of
the past three years, I've added lots of "hacks" to get around the
constraints of stack-based parsing and try to cope with much of the
bad HTML which the "anything that basically works and sells is fine"
vendor(s) has(have) made so commonplace on the Web, so if you break
the enforced valid nesting in SGML.c of HTML elements declared as
"containers" in HTMLDTD.c and just test the result "empirically" with
this or that URL that returns bad HTML, rather than understanding and
considering the consequences for the "downstream" functions, you might
think you've improved the situation.  But believe me, please, that's
NOT a good thing to do.

        In the case of SGML comments and declarations (DOCTYPE,
ENTITY, ELEMENT, ATTLIST, etc.), those are handled entirely by
the SGML.c parser, are not associated with declarations in HTMLDTD.c,
and thus can be handled on the basis of configuration options and
run-time toggles without any concern about throwing the HTML.c stack
out of whack.

        When Rob started developing the configurable color/styles
enhancements, and the potential for using external style sheets
(very important, IMHO, for the long-term viability of Lynx) he
also ran into the problem of stack-based parsing being heavily
dependent on valid HTML, plus conflicts with my hacks to get around
the constraints.  He then turned to a hash table design, with the
prospects of eliminating stack-based parsing in Lynx altogether.
That, rather than further "workaround" hacks to the present
stack-based parsing, is a better long-term objective for Lynx
development (sez I from my "vacation spot" 8-).

        Be that as it may, appended is a patch set for v2.7.1 which
achieves what you and Laura are attempting, and without throwing
the HTML.c stack out of whack.  It is also available (as a
formhack.patch text file and in a formhack.zip) in:

        http://www.slcc.edu/lynx/fote/patches/
or:      ftp://www.slcc.edu/pub/lynx/fote/patches

(Heather please note and remember that the terminal slash in http
URLs for directory listings or implied index files in actual paths,
as opposed to the http server's root, is NOT optional.  If you use:
        http://www.slcc.edu/lynx/fote
or:     http://www.slcc.edu/lynx/fote/patches
the http server will return redirection so that the browser must
make another request with the required terminal slash included.
The terminal slash is not required for the homologous directory
listings via ftp servers, but IS required for http servers.)

                                Fote

=========================================================================
 Foteos Macrides            Worcester Foundation for Biomedical Research
 address@hidden         222 Maple Avenue, Shrewsbury, MA 01545
=========================================================================

1997-04-05
* Patch for Lynx v2.7.1 to handle invalidly interdigitated container
  elements or spurious container end tags without substitutions of
  "expected" FORM end tags by the SGML.c stack-based parser, and
  without messing up the HTML.c stack-based parser.  Reliably succeeds
  in not closing FORMs before all of the FORM elements, including
  submit buttons, have been processed.  Should be reasonably crash
  safe (hopefully as safe as the vanilla v2.7.1), but there are no
  guarantees. - FM

diff -c lynx2-7-1/src/HTML.c_ori lynx2-7-1/src/HTML.c
*** lynx2-7-1/src/HTML.c_ori    Thu Apr  3 06:42:45 1997
--- lynx2-7-1/src/HTML.c        Sat Apr  5 10:02:08 1997
***************
*** 3402,3412 ****
            HTChildAnchor * source;
            HTAnchor *link_dest;
  
            /*
!            *  Set to know we are in a form.
             */
            me->inFORM = TRUE;
-           UPDATE_STYLE;
  
            if (present && present[HTML_FORM_ACTION] &&
                value[HTML_FORM_ACTION])  {
--- 3402,3428 ----
            HTChildAnchor * source;
            HTAnchor *link_dest;
  
+           UPDATE_STYLE;
            /*
!            *  FORM was declared SGML_EMPTY in HTMLDTD.c, and
!            *  SGML_character() in SGML.c checks for a FORM end
!            *  tag to call HTML_end_element() directly (with a
!            *  check in that to bypass decrementing of the HTML
!            *  parser's stack), so if we have an open FORM, close
!            *  that one now. - FM
             */
+           if (me->inFORM) {
+               if (TRACE) {
+                   fprintf(stderr,
+                           "HTML: Missing FORM end tag. Faking it!\n");
+               }
+               HTML_end_element(me, HTML_FORM, (char **)&include);
+           }
+ 
+           /*
+            *  Set to know we are in a new form.
+            */
            me->inFORM = TRUE;
  
            if (present && present[HTML_FORM_ACTION] &&
                value[HTML_FORM_ACTION])  {
***************
*** 3562,3568 ****
            /* Check for unclosed TEXTAREA */
            if (me->inTEXTAREA) {
                if (TRACE) {
!                   fprintf(stderr, "HTML: Missing TEXTAREA end tag\n");
                } else if (!me->inBadHTML) {
                    _statusline(BAD_HTML_USE_TRACE);
                    me->inBadHTML = TRUE;
--- 3578,3584 ----
            /* Check for unclosed TEXTAREA */
            if (me->inTEXTAREA) {
                if (TRACE) {
!                   fprintf(stderr, "HTML: Missing TEXTAREA end tag.\n");
                } else if (!me->inBadHTML) {
                    _statusline(BAD_HTML_USE_TRACE);
                    me->inBadHTML = TRUE;
***************
*** 4290,4307 ****
      }
  
      /*
!      *  Pop state off stack.
       */
!     if (me->sp < me->stack + MAX_NESTING+1) {
!         (me->sp)++;
!         if (TRACE)
!           fprintf(stderr,
!                   "HTML:end_element: Popped style off stack - %s\n",
!                   me->sp->style->name);
!     } else {
!       if (TRACE)
!           fprintf(stderr,
    "Stack underflow error!  Tried to pop off more styles than exist in stack\n");
      }
      
      /*
--- 4306,4325 ----
      }
  
      /*
!      *  Pop state off stack if it's not a FORM end tag. - FM
       */
!     if (element_number != HTML_FORM) {
!         if (me->sp < me->stack + MAX_NESTING+1) {
!           (me->sp)++;
!           if (TRACE)
!               fprintf(stderr,
!                       "HTML:end_element: Popped style off stack - %s\n",
!                       me->sp->style->name);
!       } else {
!           if (TRACE)
!               fprintf(stderr,
    "Stack underflow error!  Tried to pop off more styles than exist in stack\n");
+       }
      }
      
      /*
***************
*** 5058,5064 ****
        break;
  
      case HTML_FORM:
!       /* Make sure we had a form start tag. */
        if (!me->inFORM) {
            if (TRACE) {
                fprintf(stderr, "HTML: Unmatched FORM end tag\n");
--- 5076,5087 ----
        break;
  
      case HTML_FORM:
!       /*
!        *  Check if we had a FORM start tag, and issue a
!        *  message if not, but fall through to ensure that
!        *  the FORM-related globals in GridText.c are
!        *  initialized. - FM
!        */
        if (!me->inFORM) {
            if (TRACE) {
                fprintf(stderr, "HTML: Unmatched FORM end tag\n");
***************
*** 5067,5078 ****
                me->inBadHTML = TRUE;
                sleep(MessageSecs);
            }
-           /*
-            *  We probably did start a form, for which bad HTML
-            *  caused a substitution, so we'll try to end.
-            *
-           break;
-            */
        }
  
        /*
--- 5090,5095 ----
***************
*** 5366,5379 ****
--- 5383,5420 ----
  */
  PUBLIC void HTML_free ARGS1(HTStructured *, me)
  {
+     char *include = NULL;
+ 
      UPDATE_STYLE;             /* Creates empty document here! */
      if (me->comment_end)
        HTML_put_string(me, me->comment_end);
      if (me->text) {
+         /*
+        *  Emphasis containers should have been closed via
+        *  the SGML_free() wind-down, but let's play it
+        *  safe. - FM
+        */
        if (me->inUnderline) {
            HText_appendCharacter(me->text, LY_UNDERLINE_END_CHAR);
            me->inUnderline = FALSE;
        }
+ 
+       /*
+        *  FORM was declared SGML_EMPTY in HTMLDTD.c, and
+        *  SGML_character() in SGML.c checks for a FORM end
+        *  tag to call HTML_end_element() directly (with a
+        *  check in that to bypass decrementing of the HTML
+        *  parser's stack), so if we still have an open FORM,
+        *  close it now. - FM
+        */
+       if (me->inFORM) {
+           HTML_end_element(me, HTML_FORM, (char **)&include);
+           me->inFORM = FALSE;
+       }
+ 
+       /*
+        *  Now call the cleanup function. - FM
+        */
        HText_endAppend(me->text);
      }
  
***************
*** 5401,5409 ****
  
  PRIVATE void HTML_abort ARGS2(HTStructured *, me, HTError, e)
  {
      if (me->text) {
!       if (me->inUnderline)
            HText_appendCharacter(me->text, LY_UNDERLINE_END_CHAR);
        HText_endAppend(me->text);
      }
  
--- 5442,5469 ----
  
  PRIVATE void HTML_abort ARGS2(HTStructured *, me, HTError, e)
  {
+     char *include = NULL;
+ 
      if (me->text) {
!         /*
!        *  If we have an open emphasis container, close it now. - FM
!        */
!       if (me->inUnderline) {
            HText_appendCharacter(me->text, LY_UNDERLINE_END_CHAR);
+           me->inUnderline = FALSE;
+       }
+ 
+         /*
+        *  If we have an open FORM container, close it now. - FM
+        */
+       if (me->inFORM) {
+           HTML_end_element(me, HTML_FORM, (char **)&include);
+           me->inFORM = FALSE;
+       }
+ 
+       /*
+        *  Now call the cleanup function. - FM
+        */
        HText_endAppend(me->text);
      }
  
diff -c lynx2-7-1/WWW/Library/Implementation/HTMLDTD.c_ori lynx2-7-1/WWW/Library/Implementation/HTMLDTD.c
*** lynx2-7-1/WWW/Library/Implementation/HTMLDTD.c_ori  Mon Jan 27 08:18:22 1997
--- lynx2-7-1/WWW/Library/Implementation/HTMLDTD.c      Sat Apr  5 09:59:22 1997
***************
*** 989,995 ****
      { "FIG"   , fig_attr,     HTML_FIG_ATTRIBUTES,    SGML_MIXED },
      { "FN"    , fn_attr,      HTML_FN_ATTRIBUTES,     SGML_MIXED },
      { "FONT"  , font_attr,    HTML_FONT_ATTRIBUTES,   SGML_EMPTY },
!     { "FORM"  , form_attr,    HTML_FORM_ATTRIBUTES,   SGML_MIXED },
      { "FRAME" , frame_attr,   HTML_FRAME_ATTRIBUTES,  SGML_EMPTY },
      { "FRAMESET", frameset_attr,HTML_FRAMESET_ATTRIBUTES, SGML_MIXED },
      { "H1"    , h_attr,       HTML_H_ATTRIBUTES,      SGML_MIXED },
--- 989,995 ----
      { "FIG"   , fig_attr,     HTML_FIG_ATTRIBUTES,    SGML_MIXED },
      { "FN"    , fn_attr,      HTML_FN_ATTRIBUTES,     SGML_MIXED },
      { "FONT"  , font_attr,    HTML_FONT_ATTRIBUTES,   SGML_EMPTY },
!     { "FORM"  , form_attr,    HTML_FORM_ATTRIBUTES,   SGML_EMPTY },
      { "FRAME" , frame_attr,   HTML_FRAME_ATTRIBUTES,  SGML_EMPTY },
      { "FRAMESET", frameset_attr,HTML_FRAMESET_ATTRIBUTES, SGML_MIXED },
      { "H1"    , h_attr,       HTML_H_ATTRIBUTES,      SGML_MIXED },
diff -c lynx2-7-1/WWW/Library/Implementation/SGML.c_ori lynx2-7-1/WWW/Library/Implementation/SGML.c
*** lynx2-7-1/WWW/Library/Implementation/SGML.c_ori     Fri Mar 14 14:16:03 1997
--- lynx2-7-1/WWW/Library/Implementation/SGML.c Sat Apr  5 09:59:39 1997
***************
*** 1554,1559 ****
--- 1554,1582 ----
                        context->state = S_text;
                    }
                    break;
+               } else if (tag_OK &&
+                          !strcasecomp(string->data, "FORM")) {
+                   /*
+                   **  Handle a FORM end tag.  We declared FORM
+                   **  as SGML_EMPTY to prevent "expected tag
+                   **  substitution" and avoid throwing the
+                   **  HTML.c stack out of whack (Wow, what
+                   **  a hack! 8-). - FM
+                   */
+                   if (TRACE)
+                       fprintf(stderr, "SGML: End </%s>\n", string->data);
+                   (*context->actions->end_element)
+                       (context->target,
+                        (context->current_tag - context->dtd->tags),
+                        (char **)&context->include);
+                   string->size = 0;
+                   context->current_attribute_number = INVALID;
+                   if (c != '>') {
+                       context->state = S_junk_tag;
+                   } else {
+                       context->state = S_text;
+                   }
+                   break;
                } else {
                    /*
                    **  Handle all other end tags normally. - FM