octave-maintainers

Re: regexp strangeness


From: Daniel J Sebald
Subject: Re: regexp strangeness
Date: Sat, 8 Feb 2020 04:12:03 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.0

On 2/8/20 3:32 AM, Kay Nick wrote:
Hey all,

the documentation to regexp says:

'\w'
           Match any word character

what exactly is a word character (maybe even more important what isn't)?
Am I right in assuming it's
[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]? What about
non-English characters like öäßłńŚ?

https://en.wikipedia.org/wiki/Regular_expression#Character_classes

lists \w as equivalent to [A-Za-z0-9_], so digits and the underscore count as word characters too.

\w probably won't match non-English characters, but you could try an explicit range such as [ä-Ś], or whatever makes sense for the alphabet of interest.
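
As a rough check of the [A-Za-z0-9_] reading, the sketch below tests a handful of single characters against \w. The variable names (chars, is_word) are only illustrative, and the patterns use single quotes so the backslash reaches regexp() unchanged. Non-ASCII letters such as ä are deliberately left out, since the result there may depend on the Octave version and on how its regex library was built.

```octave
% Test single characters against \w (assumed here to mean [A-Za-z0-9_]).
chars = {'a', 'Z', '7', '_', '.', '#', ' '};

% '^\w$' matches only if the whole one-character string is a word character.
is_word = cellfun (@(c) ~isempty (regexp (c, '^\w$', 'once')), chars);

for i = 1:numel (chars)
  printf ('"%s" -> %d\n', chars{i}, is_word(i));
end
```

Letters, the digit and the underscore should come out as 1; the dot, the hash and the space as 0.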


And here some other strange (to me) behavior:

regexp("#w#","#\w#")
ans =  1                        <- seems to work in general...

As you point out two examples later, the backslash needs to be escaped. That has nothing to do with regexp() itself: in Octave, double-quoted strings process backslash escape sequences, much like C printf format strings. Matlab doesn't process escapes in double-quoted strings. Single-quoted strings, on the other hand, are treated by Octave just the way Matlab treats them, with the backslash kept literally.

So, in the above, \w is treated as an escape sequence, but probably one that isn't defined, so \w ends up the same as w. What you've effectively done is regexp("#w#","#w#"), which matches.
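
A minimal sketch of the quoting difference, assuming a reasonably recent Octave; the variable name idx is only illustrative. It avoids undefined escapes in double quotes, so it should run without warnings.

```octave
% Double-quoted strings process backslash escapes; single-quoted ones do not.
assert (length ("\n") == 1);   % a single newline character
assert (length ('\n') == 2);   % a backslash followed by the letter n

% The same regexp pattern written both ways gives identical strings.
assert (strcmp ('#\w#', "#\\w#"));

% With single quotes there is no need to double the backslash.
idx = regexp ('#d#', '#\w#')
```

Writing patterns in single quotes keeps them readable and also keeps them portable to Matlab.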


regexp("#d#","#\w#")
ans = [](1x0)                    <- why?

Because, by the same logic as above, you've effectively done regexp("#d#","#w#"), which doesn't match.


regexp("#d#","#\\w#")       <- so we need to double escape these
special characters... no mention of that in the help... :-(
ans =  1
regexp("#j#","#\\w#")
ans =  1                        <- ok
regexp("#E#","#\\w#")
ans =  1                        <- ok
regexp("#E#","#\\w*#")
ans =  1                        <- ok
regexp("##","#\\w*#")
ans =  1                        <- ok
regexp("#.#","#\\w*#")
ans = [](1x0)                    <- why?

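Because . is not in [A-Za-z0-9_]. The pattern #\w*# needs the two # characters to be separated by nothing but word characters (zero or more of them), and the . in between is not one, so there is no match. That is expected behavior, not a bug.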
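
The cases above can be replayed with single-quoted patterns, as in the sketch below. The variable names (pat, s1, s2, s3, s4) are only illustrative; \W (capital W) is the complement class, i.e. any non-word character.

```octave
pat = '#\w*#';   % two # separated by zero or more word characters

s1 = regexp ('#E#', pat)   % one word character between the hashes
s2 = regexp ('##', pat)    % zero word characters also satisfies \w*
s3 = regexp ('#.#', pat)   % empty: the dot is not a word character

% The dot is a non-word character, so \W matches it.
s4 = regexp ('#.#', '#\W#')
```

Here s1, s2 and s4 should be 1 (the match starts at the first character) and s3 should be empty.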

Dan


Especially the last one >> regexp("#.#","#\\w*#") ans = [](1x0) looks
like a bug to me. Or am I getting something wrong here?

Thanks


Kay


