[Help-smalltalk] [PATCH] Fix regexes that can match the empty string


From: Paolo Bonzini
Subject: [Help-smalltalk] [PATCH] Fix regexes that can match the empty string
Date: Thu, 24 Jan 2008 10:05:50 +0100
User-agent: Thunderbird 2.0.0.9 (Macintosh/20071031)

I read about this on a blog (http://t-a-w.blogspot.com/2008/01/really-strange-quirk-of-ruby-and-perl.html if you care) and remembered that I once fixed the same problem in sed. Now I have done the same for gst. The behavior I implemented for tokenize is consistent with Ruby (I didn't check Perl or Python); the behavior I implemented for gsub is consistent with sed and Python, but not with Ruby and Perl.
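
For readers who do not have gst at hand, the substitution rule can be sketched in Python. This is only an illustration of the idea as it appears in the patch below, not the gst code itself: the name `gsub_sketch` is hypothetical, indices are 0-based with an exclusive end, and the replacement is a plain literal rather than the `str % regs` expansion used in Regex.st. The loop is written by hand rather than delegating to `re.sub`, whose handling of empty matches has changed across Python versions.

```python
import re

def gsub_sketch(s, pattern, repl):
    """Replace every match of pattern in s, rejecting an empty match
    that sits immediately after the previously accepted match."""
    regex = re.compile(pattern)
    out = []
    idx = 0          # where the next search starts
    empty_ok = True  # may an empty match at idx be accepted?
    while True:
        m = regex.search(s, idx)
        if m is None:
            break
        beg, end = m.start(), m.end()
        if beg < end or beg > idx or empty_ok:
            # Non-empty match, or an empty one that is not adjacent
            # to the previous match: copy the gap, then substitute.
            empty_ok = False
            out.append(s[idx:beg])
            out.append(repl)
            idx = end
        else:
            # Empty match right after the previous match: skip it.
            if beg >= len(s):
                return ''.join(out)
            empty_ok = True
            out.append(s[beg])  # copy one character and move on
            idx = beg + 1
    out.append(s[idx:])
    return ''.join(out)
```

With this rule, `gsub_sketch('baaa', 'a*', 'x')` gives `'xbx'`: the empty match that follows `aaa` is rejected instead of producing a second `x`.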

Paolo
2008-01-24  Paolo Bonzini  <address@hidden>

        * kernel/Regex.st: Fix global substitution and tokenization for
        regexes that can match the empty string.

 
diff --git a/kernel/Regex.st b/kernel/Regex.st
index dec5e6e..b074361 100644
--- a/kernel/Regex.st
+++ b/kernel/Regex.st
@@ -881,10 +881,11 @@ String extend [
         of the match (as in #%)."
 
        <category: 'regex'>
-       | res idx regex beg end regs |
+       | res idx regex beg end regs emptyOk |
        regex := pattern asRegex.
        res := WriteStream on: (String new: to - from + 1).
        idx := from.
+       emptyOk := true.
        
        [regs := self 
                    searchRegexInternal: regex
@@ -894,17 +895,20 @@ String extend [
                whileFalse: 
                    [beg := regs from.
                    end := regs to.
-                   res 
-                       next: beg - idx
-                       putAll: self
-                       startingAt: idx.
-                   res nextPutAll: str % regs.
-                   idx := end + 1.
-                   beg > end 
-                       ifTrue: 
-                           [res nextPut: (self at: idx).
-                           idx := idx + 1].
-                   idx > self size ifTrue: [^res contents]].
+                   (beg <= end or: [ beg > idx or: [ emptyOk ]])
+                       ifTrue: [
+                           emptyOk := false.
+                           res 
+                               next: beg - idx
+                               putAll: self
+                               startingAt: idx.
+                           res nextPutAll: str % regs.
+                           idx := end + 1]
+                       ifFalse: [
+                           beg <= to ifFalse: [^res contents].
+                           emptyOk := true.
+                           res nextPut: (self at: beg).
+                           idx := beg + 1]].
        res 
            next: to - idx + 1
            putAll: self
@@ -963,11 +967,11 @@ String extend [
         are separated and stored into an Array of Strings that is returned."
 
        <category: 'regex'>
-       | res idx regex regs tokStart |
+       | res idx tokStart regex regs beg end emptyOk |
        regex := pattern asRegex.
        res := WriteStream on: (Array new: 10).
-       idx := from.
-       tokStart := 1.
+       idx := tokStart := from.
+       emptyOk := false.
        
        [regs := self 
                    searchRegexInternal: regex
@@ -975,10 +979,27 @@ String extend [
                    to: to.
        regs notNil] 
                whileTrue: 
-                   [res nextPut: (self copyFrom: tokStart to: regs from - 1).
-                   tokStart := regs to + 1.
-                   idx := regs to + 1 max: regs from + 1].
-       res nextPut: (self copyFrom: tokStart to: to).
+                   [beg := regs from.
+                   end := regs to.
+                   (beg <= end or: [ beg > idx or: [ emptyOk ]])
+                       ifTrue: [
+                           emptyOk := false.
+                           res nextPut: (self copyFrom: tokStart to: beg - 1).
+                           idx := tokStart := end + 1 ]
+                       ifFalse: [
+                           "If we reach the end of the string exit
+                            without adding the token.  tokStart must have been
+                            set above to TO + 1 (it was set above just before
+                            setting emptyOk to false), so we'd add an empty
+                            token we don't want."
+                           beg <= to ifFalse: [^res contents].
+                           emptyOk := true.
+
+                           "By not updating tokStart we put the character in the
+                            next token."
+                           idx := beg + 1]].
+       (tokStart <= to or: [ emptyOk ])
+           ifTrue: [ res nextPut: (self copyFrom: tokStart to: to) ].
        ^res contents
     ]
 
diff --git a/tests/strings.ok b/tests/strings.ok
index f083526..2706df5 100644
--- a/tests/strings.ok
+++ b/tests/strings.ok
@@ -66,3 +66,48 @@ returned value is '   -   -    '
 
 Execution begins...
 returned value is ''
+
+Execution begins...
+returned value is 'xaxbxcx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is '('abc' 'def' )'
+
+Execution begins...
+returned value is '('' 'abc' 'def' )'
+
+Execution begins...
+returned value is '('a' 'b' 'c' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
diff --git a/tests/strings.st b/tests/strings.st
index be74137..ee2a16e 100644
--- a/tests/strings.st
+++ b/tests/strings.st
@@ -95,3 +95,20 @@ Eval [ '388350028456431097' formatAs: 'Card Number #### ###### #### Expires ##/#
 Eval [ '543' formatAs: '###-###-####' ]
 Eval [ '' formatAs: '###-###-####' ]
 Eval [ '1234' formatAs: '' ]
+
+"Have fun with regexes that can match the empty string."
+Eval [ 'abc' copyReplacingAllRegex: 'x*' with: 'x' ]           "xaxbxcx"
+Eval [ 'f' copyReplacingAllRegex: 'o*$' with: 'x' ]            "fx"
+Eval [ 'fo' copyReplacingAllRegex: 'o*$' with: 'x' ]           "fx"
+Eval [ 'foo' copyReplacingAllRegex: 'o*$' with: 'x' ]          "fx"
+Eval [ 'ba' copyReplacingAllRegex: 'a*' with: 'x' ]            "xbx"
+Eval [ 'baa' copyReplacingAllRegex: 'a*' with: 'x' ]           "xbx"
+Eval [ 'baaa' copyReplacingAllRegex: 'a*' with: 'x' ]          "xbx"
+Eval [ 'bc' copyReplacingAllRegex: 'a*' with: 'x' ]            "xbxcx"
+Eval [ 'bac' copyReplacingAllRegex: 'a*' with: 'x' ]           "xbxcx"
+Eval [ ('abc def ' tokenize: ' ') printString ]                        "(abc def)"
+Eval [ (' abc def ' tokenize: ' ') printString ]               "('' abc def)"
+Eval [ ('abc' tokenize: 'x*') printString ]                    "(a b c)"
+Eval [ ('axxx' tokenize: 'x*') printString ]                   "(a)"
+Eval [ ('ax' tokenize: 'x*') printString ]                     "(a)"
+Eval [ ('a' tokenize: 'x*') printString ]                      "(a)"
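
The tokenize cases above can be reproduced outside gst with a small Python sketch of the same loop. It is an approximation under stated assumptions rather than the gst implementation: `tokenize_sketch` is a hypothetical name, indices are 0-based with an exclusive end, and the whole string is tokenized (no `from:`/`to:` range).

```python
import re

def tokenize_sketch(s, pattern):
    """Split s on matches of pattern, treating an empty match as a
    separator only when it is not adjacent to the previous one."""
    regex = re.compile(pattern)
    tokens = []
    idx = tok_start = 0
    empty_ok = False  # unlike substitution, start by rejecting
    while True:
        m = regex.search(s, idx)
        if m is None:
            break
        beg, end = m.start(), m.end()
        if beg < end or beg > idx or empty_ok:
            # Accepted separator: emit the token that precedes it.
            empty_ok = False
            tokens.append(s[tok_start:beg])
            idx = tok_start = end
        else:
            # Rejected empty match. At end of string, stop without
            # emitting a trailing empty token.
            if beg >= len(s):
                return tokens
            empty_ok = True
            # tok_start is left alone, so the skipped character
            # becomes part of the next token.
            idx = beg + 1
    if tok_start < len(s) or empty_ok:
        tokens.append(s[tok_start:])
    return tokens
```

Starting with `empty_ok` set to false is what keeps `'abc'` tokenized on `x*` from beginning with an empty token, while a real leading separator, as in `' abc def '`, still yields one.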
