chicken-janitors

#1805: `html->sxml` with escaped quotes breaks text into multiple nodes


From: Chicken Trac
Subject: #1805: `html->sxml` with escaped quotes breaks text into multiple nodes
Date: Fri, 10 Jun 2022 18:53:06 -0000

#1805: `html->sxml` with escaped quotes breaks text into multiple nodes
----------------------------+-----------------------------------
 Reporter:  Jeremy Steward  |                 Owner:  Alex Shinn
     Type:  defect          |                Status:  assigned
 Priority:  minor           |             Milestone:  someday
Component:  extensions      |               Version:  5.3.0
 Keywords:                  |  Estimated difficulty:
----------------------------+-----------------------------------
 `html->sxml` splits text at escaped quote entities (`&apos;`, `&quot;`),
 producing several text nodes where I would expect one. A short example
 should be sufficient to explain the problem I'm encountering:

 {{{
 (html->sxml "<p>foo&apos;bar&quot;baz</p>")
 ;=> (*TOP* (p "foo" "'" "bar" "\"" "baz"))
 }}}

 As a counter-example, I'll use the
 [https://wiki.call-cc.org/eggref/5/ssax ssax egg]:

 {{{
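 ; parser call assumed here: ssax:xml->sxml from the ssax egg
 (call-with-input-string "<p>foo&apos;bar&quot;baz</p>"
   (lambda (port) (ssax:xml->sxml port '())))
 ;=> (*TOP* (p "foo'bar\"baz"))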
 }}}

 I guess fundamentally it's a question of whether there should be one text
 node or not. I would argue that in this particular case, it should be a
 single node. I have been using html-parser to try and scrape some web
 pages, and this is extremely unexpected! It is especially so if one uses
 `txpath` / `sxpath` on the result, as `//p/text()` queries return several
 fragments instead of the full text. You would have to evaluate `(apply
 string-append ((txpath "//p/text()") sxml))` to get the full contents of
 the text.
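
 Another possible workaround is to normalise the tree once after parsing, so
 that later queries see a single text node. The following is only a sketch:
 `merge-adjacent-text` is a hypothetical helper, not part of html-parser or
 any sxml egg, and it assumes the input is a well-formed SXML tree made of
 proper lists.

```scheme
;; Hypothetical helper (not provided by html-parser or the sxml eggs):
;; walk an SXML tree and concatenate runs of adjacent string children.
(define (merge-adjacent-text node)
  (if (pair? node)
      (let loop ((items node) (acc '()))
        (cond ((null? items) (reverse acc))
              ;; Current item and previously kept item are both strings:
              ;; fold the current one into the previous one.
              ((and (string? (car items))
                    (pair? acc)
                    (string? (car acc)))
               (loop (cdr items)
                     (cons (string-append (car acc) (car items))
                           (cdr acc))))
              ;; Otherwise keep the item, recursing into child elements.
              (else
               (loop (cdr items)
                     (cons (merge-adjacent-text (car items)) acc)))))
      node))

(merge-adjacent-text '(*TOP* (p "foo" "'" "bar" "\"" "baz")))
;=> (*TOP* (p "foo'bar\"baz"))
```

 Strings separated by a child element are left as separate nodes, so only
 the fragments produced around entities are joined.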

 Is there a rationale for this, or is it a limitation of the parser? I know
 that elements may contain child elements in HTML, but I'm not sure a new
 text node should be created when an element's contents contain no HTML
 tags themselves.

-- 
Ticket URL: <https://bugs.call-cc.org/ticket/1805>
CHICKEN Scheme <https://www.call-cc.org/>
CHICKEN Scheme is a compiler for the Scheme programming language.
