[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: survey: string external representation
From: |
Thien-Thi Nguyen |
Subject: |
Re: survey: string external representation |
Date: |
Fri, 27 Jan 2012 11:27:30 +0100 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux) |
Thanks to everyone who responded. Based on the collected
information, i've cobbled together a runtime check for
‘sql-quote’. It and some tests are in the attached program.
To play:
guile -s normalize.scm
guile -s normalize.scm stupid
The code assumes Guile 2 DTRT, but if you have doubts, you can
sed -i 's/guile-2/&-not-really/' normalize.scm
to disable that assumption. In any case, the program should exit
successfully, indicating smooth ‘write’ / ‘read’ round-tripping.
This is so (both w/ and w/o "stupid") for Guile 1.4.1.124 and 1.8.7.
___________________________________________
;; -*- mode: scheme; coding: utf-8 -*-
(define EXIT-VALUE #t) ; optimism
(define STUPID? (false-if-exception (string=? "stupid" (cadr (command-line)))))
;; PostgreSQL groks ‘\xXX’ as an octet w/ hex value XX.
;; It also groks raw octets. This is all fine and good.
;; The problem arises when there is a mix of contiguous
;; raw and \x representations, intended to represent a
;; UTF-8 (say) encoded character.
;;
;; It seems Guile
;; - 1.4 DTRT by doing nothing;
;; - 1.6 ???;
;; - 1.8 fails by \x-escaping inconsistently;
;; - 2.0 doesn't have this problem.
(cond-expand
(guile-2
(define normalize identity))
(else
(use-modules
(srfi srfi-13)
(srfi srfi-14))
(define normalize
(or (let* ((ego (char-set
;; These are not strictly necessary for
;; PostgreSQL, but we include them for
;; (Scheme-only) round-trip testing.
;; Doubtlessly, what doubtful ego!
#\" #\\))
(ugh (ucs-range->char-set #o177 #o400 #t ego)))
(and (not (char-set-every
(lambda (ch)
;; Does the octet xrep unmolested?
(char=? ch (string-ref (object->string (string ch)) 1)))
(char-set-difference ugh ego)))
(or (not STUPID?)
(begin (set! ugh ego)
#t))
;; Lame.
(lambda (s)
(define backslash-x
(let ((v (make-vector 256)))
(char-set-for-each
(lambda (ch)
(let ((i (char->integer ch)))
(vector-set!
v i (string-append
"\\x" (number->string i 16)))))
ugh)
;; backslash-x
(lambda (ch)
(vector-ref v (char->integer ch)))))
(let loop ((start 0) (acc '()))
(cond ((string-index s ugh start)
=> (lambda (idx)
(loop (1+ idx)
(cons* (backslash-x (string-ref s idx))
(substring/shared s start idx)
acc))))
((zero? start)
s)
(else
(string-concatenate-reverse
acc (substring/shared s start))))))))
;; Cool.
identity))))
(define (try s)
(simple-format
#t "ORIG:\t~S~%NORM:\t~S~%=>\t~A~%~%"
s (normalize s)
(let ((round (with-input-from-string
(with-output-to-string
(lambda ()
(if (eq? identity normalize)
(write s)
(begin (display #\")
(display (normalize s))
(display #\")))))
read)))
(cond ((equal? s round) 'SAME)
(else
(set! EXIT-VALUE #f) ;-O
(string-append
"DIFF: [" (number->string (string-length round))
"]|" round "|"))))))
(simple-format #t "Guile ~A~% LANG: ~S~% normalize: ~S~A~%~%"
(version) (getenv "LANG") (procedure-name normalize)
(if (and STUPID? (not (eq? normalize identity)))
" (but we stupidly revert to degeneracy)"
""))
(try "")
(try (list->string (map integer->char (iota 256))))
(try "U+2002: | | (utf-8: E2 80 82)")
(try "U+232C: |⌬| (utf-8: E2 80 82)")
(try "U+1D7FF: |𝟿| (utf-8: F0 9D 9F BF)")
(try "U+2F9B2: |䕫| (utf-8: F0 AF A6 B2)")
(try "U+2F9BC: |蜨| (utf-8: F0 AF A6 BC)")
(exit EXIT-VALUE)