Re: regexp strangeness

octave-maintainers

From:	Daniel J Sebald
Subject:	Re: regexp strangeness
Date:	Sat, 8 Feb 2020 04:12:03 -0500
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.0

On 2/8/20 3:32 AM, Kay Nick wrote:

Hey all,

the documentation to regexp says:

'\w'
           Match any word character

what exactly is a word character (maybe even more important what isn't)?
Am I right in assuming it's
[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]? What about
non-English characters like öäßłńŚ?


https://en.wikipedia.org/wiki/Regular_expression#Character_classes

lists \w as the equivalent to [A-Za-z0-9_]

That class probably doesn't cover non-English letters, but maybe you could try [ä-Ś] or whatever makes sense for the alphabet of interest.
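A minimal sketch of the equivalence described above, restricted to ASCII input. The test string is made up for illustration, and non-ASCII letters are deliberately left out because their handling isn't established here.

```octave
% Compare \w against the explicit class [A-Za-z0-9_] on an ASCII string.
% Single quotes keep the backslash in the pattern literal.
str = 'a_9-';                        % hypothetical test string
idx_w     = regexp (str, '\w');      % start index of every \w match
idx_class = regexp (str, '[A-Za-z0-9_]');

disp (idx_w)                         % 1 2 3 (the '-' is not a word character)
disp (isequal (idx_w, idx_class))    % 1
```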


And here some other strange (to me) behavior:

regexp("#w#","#\w#")

ans =  1                        <- seems to work in general...

As you point out two examples later, the backslash needs an escape. That has nothing to do with how regexp() is programmed; in Octave generally, double-quoted strings follow the C printf syntax, i.e., backslash escape sequences are processed. Matlab doesn't interpret escape sequences in double quotes. Single quotes, on the other hand, Octave treats just the way Matlab does.

So, in the above, \w is an escape sequence, but probably one that isn't defined, so \w ends up the same as w. What you've actually done is regexp("#w#","#w#"), which matches.
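One way to check this is to look at what the string contains before regexp() ever sees it. A small sketch; the variable names are only for illustration:

```octave
% Inspect the characters that actually end up in each string.
dq = "#\w#";    % double quotes: the backslash escape is processed
sq = '#\w#';    % single quotes: the backslash is kept as-is

disp (length (dq))         % 3, i.e. # w #
disp (length (sq))         % 4, i.e. # \ w #
disp (strcmp (dq, "#w#"))  % 1, the pattern really is just #w#
```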

regexp("#d#","#\w#")

ans = [](1x0)                    <- why?

Because, by the same logic as above, you've done regexp("#d#","#w#"), which doesn't match.

regexp("#d#","#\\w#")

ans =  1                        <- so we need to double escape these special characters... no mention of that in the help... :-(
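Since single-quoted strings are not escape-processed, writing the pattern in single quotes should avoid the doubled backslash altogether. A short sketch of the two spellings side by side:

```octave
% The same pattern written two ways.
pat_dq = "#\\w#";   % double quotes: backslash must be doubled
pat_sq = '#\w#';    % single quotes: backslash stays literal

disp (strcmp (pat_dq, pat_sq))    % 1, both are the four characters # \ w #
disp (regexp ('#d#', pat_sq))     % 1, match starts at index 1
```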

regexp("#j#","#\\w#")

ans =  1                        <- ok

regexp("#E#","#\\w#")

ans =  1                        <- ok

regexp("#E#","#\\w*#")

ans =  1                        <- ok

regexp("##","#\\w*#")

ans =  1                        <- ok

regexp("#.#","#\\w*#")

ans = [](1x0)                    <- why?


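Because . is not in [A-Za-z0-9_], so \w* cannot match it, and nothing else in the pattern allows a character other than a word character between the two # signs.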
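To make the last few cases concrete, here is a sketch that runs the same pattern against the three inputs. The final pattern, which adds a literal dot to the class, is my own illustration of one way to accept "#.#" and is not something from the original question.

```octave
% \w* matches zero or more word characters, so only word characters
% (or nothing at all) may sit between the two # signs.
pat = '#\w*#';

disp (regexp ('#E#', pat))            % 1
disp (regexp ('##',  pat))            % 1, zero word characters is fine
disp (isempty (regexp ('#.#', pat)))  % 1, '.' is not a word character

% Hypothetical variant: allow a literal dot as well.
disp (regexp ('#.#', '#[\w.]*#'))     % 1
```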

Dan

Especially the last one >> regexp("#.#","#\\w*#") ans = [](1x0) looks
like a bug to me. Or am I getting something wrong here?

Thanks


Kay
