Core hackers,
The attached patch has been shown to speed up regex matching considerably
in a typical case. It should also do so across the board.
Background: I observed a 40KB POST request to Spiffy
(x-www-form-urlencoded) to require 1.8 seconds on my machine for a
response. Investigation showed the entire time was spent in
http:canonicalize-string. There was a suboptimal algorithm there
which generated a lot of garbage; I fixed that in the http egg.
However, a core patch will complete the fix. The problem is in
string-search-positions and ultimately in re-match. Any time
a scheme string is passed to pcre_exec(), it is copied.
For large strings, or a typical use case for string-search-positions
-- e.g. repeatedly calling it to get the next match -- the overhead
is very high.
The patch just changes the type from c-string to scheme-pointer,
as pcre_exec() takes a length argument and doesn't need a null
terminator.
With both changes made Spiffy response time drops from 1.8 s to 0.1 s.
For overkill purposes, here is some supporting data.
Prelude:
(use url http-utils)
(define (encode str)
(string-translate (url-encode str (list #\_ #\- #\* #\. #\@ #\space))
" " "+"))
(define raw ...) ; raw text, length 30029
(define cooked (encode raw)) ; length 36889
Timing:
,t (begin (http:canonicalize-string cooked) (void))
original w/regex patch
1.957 s elapsed 0.334 s elapsed
1.831 s in GC 0.23 s in GC
2465 major GCs 311 major GCs
http fixes w/regex patch
0.26 s elapsed 0.022 s elapsed
0.195 s in GC 0 s in GC
175 major GCs 0 major GCs