[Help-smalltalk] [PATCH] Fix regexes that can match the empty string
From: Paolo Bonzini
Subject: [Help-smalltalk] [PATCH] Fix regexes that can match the empty string
Date: Thu, 24 Jan 2008 10:05:50 +0100
User-agent: Thunderbird 2.0.0.9 (Macintosh/20071031)
I read about this on a blog
(http://t-a-w.blogspot.com/2008/01/really-strange-quirk-of-ruby-and-perl.html
if you care) and remembered that I once fixed the same problem in sed. Now,
the same fix for gst. The behavior I implemented for tokenize is consistent
with Ruby (I didn't check Perl and Python); the behavior I implemented for
gsub is consistent with sed and Python, but not with Ruby and Perl.
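For readers who don't read Smalltalk, the gsub rule in the patch below can be sketched in Python roughly as follows. This is only a transliteration for illustration, not part of the patch: the name gsub_sketch is made up, the replacement is treated as a literal string (no % expansion of match registers), and Python's 0-based half-open match spans stand in for Smalltalk's 1-based inclusive indices. The idea is that an empty match is rejected when it sits exactly where the previous match ended; in that case one character is copied through and empty matches become acceptable again.

```python
import re


def gsub_sketch(text, pattern, replacement):
    """Replace every match, mirroring the emptyOk logic of the patch.

    Hypothetical helper: `replacement` is inserted literally.
    """
    regex = re.compile(pattern)
    out = []
    idx = 0            # next position to search from / copy from
    empty_ok = True    # may an empty match at idx be accepted?
    while True:
        m = regex.search(text, idx)
        if m is None:
            break
        beg, end = m.start(), m.end()
        if beg < end or beg > idx or empty_ok:
            # Non-empty match, or an empty match that is not glued
            # to the end of the previous match: substitute it.
            empty_ok = False
            out.append(text[idx:beg])
            out.append(replacement)
            idx = end
        else:
            # Empty match right after the previous match: skip it.
            if beg >= len(text):
                return "".join(out)
            empty_ok = True
            out.append(text[beg])   # copy one character through
            idx = beg + 1
    out.append(text[idx:])
    return "".join(out)


print(gsub_sketch("abc", "x*", "x"))
print(gsub_sketch("foo", "o*$", "x"))
print(gsub_sketch("bac", "a*", "x"))
```

The sketch uses re only to find matches and does not call re.sub, since how re.sub treats empty matches next to a previous match depends on the Python version.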
Paolo
2008-01-24 Paolo Bonzini <address@hidden>
* kernel/Regex.st: Fix global substitution and tokenization for
regexes that can match the empty string.
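
The tokenization half of the change follows the same pattern and can be sketched in Python as below. Again this is an illustrative transliteration under assumptions, not the patch itself: tokenize_sketch is a made-up name, and indices are 0-based and half-open rather than Smalltalk's 1-based inclusive ones. Two details are worth noticing: empty matches start out disallowed, so a separator such as x* does not produce a leading empty token, and a skipped character stays in the pending token because the token start is not advanced.

```python
import re


def tokenize_sketch(text, pattern):
    """Split `text` on `pattern`, mirroring the patched tokenize loop.

    Hypothetical helper written for illustration only.
    """
    regex = re.compile(pattern)
    tokens = []
    idx = tok_start = 0
    empty_ok = False   # an empty separator at the very start is ignored
    while True:
        m = regex.search(text, idx)
        if m is None:
            break
        beg, end = m.start(), m.end()
        if beg < end or beg > idx or empty_ok:
            # Accept this separator: close the current token.
            empty_ok = False
            tokens.append(text[tok_start:beg])
            idx = tok_start = end
        else:
            # Empty separator right after the previous one. At the end
            # of the string, stop without adding an empty token.
            if beg >= len(text):
                return tokens
            empty_ok = True
            # tok_start is left alone, so text[beg] joins the next token.
            idx = beg + 1
    if tok_start < len(text) or empty_ok:
        tokens.append(text[tok_start:])
    return tokens


print(tokenize_sketch("abc def ", " "))
print(tokenize_sketch(" abc def ", " "))
print(tokenize_sketch("abc", "x*"))
```
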
diff --git a/kernel/Regex.st b/kernel/Regex.st
index dec5e6e..b074361 100644
--- a/kernel/Regex.st
+++ b/kernel/Regex.st
@@ -881,10 +881,11 @@ String extend [
of the match (as in #%)."
<category: 'regex'>
- | res idx regex beg end regs |
+ | res idx regex beg end regs emptyOk |
regex := pattern asRegex.
res := WriteStream on: (String new: to - from + 1).
idx := from.
+ emptyOk := true.
[regs := self
searchRegexInternal: regex
@@ -894,17 +895,20 @@ String extend [
whileFalse:
[beg := regs from.
end := regs to.
- res
- next: beg - idx
- putAll: self
- startingAt: idx.
- res nextPutAll: str % regs.
- idx := end + 1.
- beg > end
- ifTrue:
- [res nextPut: (self at: idx).
- idx := idx + 1].
- idx > self size ifTrue: [^res contents]].
+ (beg <= end or: [ beg > idx or: [ emptyOk ]])
+ ifTrue: [
+ emptyOk := false.
+ res
+ next: beg - idx
+ putAll: self
+ startingAt: idx.
+ res nextPutAll: str % regs.
+ idx := end + 1]
+ ifFalse: [
+ beg <= to ifFalse: [^res contents].
+ emptyOk := true.
+ res nextPut: (self at: beg).
+ idx := beg + 1]].
res
next: to - idx + 1
putAll: self
@@ -963,11 +967,11 @@ String extend [
are separated and stored into an Array of Strings that is returned."
<category: 'regex'>
- | res idx regex regs tokStart |
+ | res idx tokStart regex regs beg end emptyOk |
regex := pattern asRegex.
res := WriteStream on: (Array new: 10).
- idx := from.
- tokStart := 1.
+ idx := tokStart := from.
+ emptyOk := false.
[regs := self
searchRegexInternal: regex
@@ -975,10 +979,27 @@ String extend [
to: to.
regs notNil]
whileTrue:
- [res nextPut: (self copyFrom: tokStart to: regs from - 1).
- tokStart := regs to + 1.
- idx := regs to + 1 max: regs from + 1].
- res nextPut: (self copyFrom: tokStart to: to).
+ [beg := regs from.
+ end := regs to.
+ (beg <= end or: [ beg > idx or: [ emptyOk ]])
+ ifTrue: [
+ emptyOk := false.
+ res nextPut: (self copyFrom: tokStart to: beg - 1).
+ idx := tokStart := end + 1 ]
+ ifFalse: [
+ "If we reach the end of the string exit
+ without adding the token. tokStart must have been
+ set above to TO + 1 (it was set above just before
+ setting emptyOk to false), so we'd add an empty
+ token we don't want."
+ beg <= to ifFalse: [^res contents].
+ emptyOk := true.
+
+ "By not updating tokStart we put the character in the
+ next token."
+ idx := beg + 1]].
+ (tokStart <= to or: [ emptyOk ])
+ ifTrue: [ res nextPut: (self copyFrom: tokStart to: to) ].
^res contents
]
diff --git a/tests/strings.ok b/tests/strings.ok
index f083526..2706df5 100644
--- a/tests/strings.ok
+++ b/tests/strings.ok
@@ -66,3 +66,48 @@ returned value is ' - - '
Execution begins...
returned value is ''
+
+Execution begins...
+returned value is 'xaxbxcx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is '('abc' 'def' )'
+
+Execution begins...
+returned value is '('' 'abc' 'def' )'
+
+Execution begins...
+returned value is '('a' 'b' 'c' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
diff --git a/tests/strings.st b/tests/strings.st
index be74137..ee2a16e 100644
--- a/tests/strings.st
+++ b/tests/strings.st
@@ -95,3 +95,20 @@ Eval [ '388350028456431097' formatAs: 'Card Number #### ###### #### Expires ##/#
Eval [ '543' formatAs: '###-###-####' ]
Eval [ '' formatAs: '###-###-####' ]
Eval [ '1234' formatAs: '' ]
+
+"Have fun with regexes that can match the empty string."
+Eval [ 'abc' copyReplacingAllRegex: 'x*' with: 'x' ] "xaxbxcx"
+Eval [ 'f' copyReplacingAllRegex: 'o*$' with: 'x' ] "fx"
+Eval [ 'fo' copyReplacingAllRegex: 'o*$' with: 'x' ] "fx"
+Eval [ 'foo' copyReplacingAllRegex: 'o*$' with: 'x' ] "fx"
+Eval [ 'ba' copyReplacingAllRegex: 'a*' with: 'x' ] "xbx"
+Eval [ 'baa' copyReplacingAllRegex: 'a*' with: 'x' ] "xbx"
+Eval [ 'baaa' copyReplacingAllRegex: 'a*' with: 'x' ] "xbx"
+Eval [ 'bc' copyReplacingAllRegex: 'a*' with: 'x' ] "xbxcx"
+Eval [ 'bac' copyReplacingAllRegex: 'a*' with: 'x' ] "xbxcx"
+Eval [ ('abc def ' tokenize: ' ') printString ] "(abc def)"
+Eval [ (' abc def ' tokenize: ' ') printString ] "('' abc def)"
+Eval [ ('abc' tokenize: 'x*') printString ] "(a b c)"
+Eval [ ('axxx' tokenize: 'x*') printString ] "(a)"
+Eval [ ('ax' tokenize: 'x*') printString ] "(a)"
+Eval [ ('a' tokenize: 'x*') printString ] "(a)"