Re: regexp strangeness

octave-maintainers

From:	Andreas Weber
Subject:	Re: regexp strangeness
Date:	Sat, 8 Feb 2020 15:01:46 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1

On 08.02.20 at 12:47, Kay Nick wrote:
> the documentation to regexp says:
> 
> '\w'
>           Match any word character
> 
> what exactly is a word character (maybe even more important what isn't)?

It's always worth having a look at the underlying library, PCRE in this
case: https://www.pcre.org/original/doc/html/pcrepattern.html

...A "word" character is an underscore or any character that is a letter
or digit. By default, the definition of letters and digits is controlled
by PCRE's low-valued character tables, and may vary if locale-specific
matching is taking place (see "Locale support" in the pcreapi page). For
example, in a French locale such as "fr_FR" in Unix-like systems, or
"french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales
with Unicode is discouraged. ....
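
To make that definition concrete, here is a small sketch that tests a few
single characters against \w. The character list is only my own choice for
illustration, and the results assume plain ASCII input without any
locale-specific tables:

```octave
% Single-quoted patterns are passed to PCRE unchanged.
chars = {'d', '7', '_', '.', '#', ' '};
is_word = false (size (chars));
for k = 1:numel (chars)
  % '^\w$' matches only if the whole one-character string is a word character
  is_word(k) = ! isempty (regexp (chars{k}, '^\w$', 'once'));
  printf ('"%s" -> %d\n', chars{k}, is_word(k));
endfor
```

The letter, the digit and the underscore should report 1; ".", "#" and the
space should report 0.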

>>> regexp("#d#","#\w#")
> ans = [](1x0)                     <- why does this happen? I've provided
> a word character (letter)
>>> regexp("#d#","#\\w#")
> ans =  1                             <- Ahhh, so we need to double
> escape these special characters... no mention of that in the help...

Escape sequences are processed in every double-quoted string, not just in
those passed to regexp; see
https://octave.org/doc/v4.0.1/Escape-Sequences-in-String-Constants.html
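
A minimal sketch of the two usual ways to get a literal backslash to PCRE.
The variable names are mine. As far as I know, a double-quoted "#\w#" loses
its backslash because \w is not a recognized escape sequence, so the pattern
that reaches regexp is just "#w#":

```octave
str = "#d#";

p_single  = '#\w#';    % single quotes: no escape processing, backslash kept
p_escaped = "#\\w#";   % double quotes: "\\" becomes one literal backslash

% Both spellings produce the same four characters: # \ w #
same = strcmp (p_single, p_escaped)

r_single  = regexp (str, p_single)
r_escaped = regexp (str, p_escaped)
```

Both calls should return 1. Using single quotes for patterns avoids the
doubling altogether.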

I don't think it makes sense to document this separately in the help
text for regexp.

>>> regexp("#.#","#\\w*#")
> ans = [](1x0)                    <- why? Asterisk (*) is supposed to
> match zero or more times. Here there is zero times a letter, so it
> should match...

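No, it would match "##" but not "#.#": the "." is not a word character,
so \w* matches zero characters there and the pattern then requires a "#"
where the "." is.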
You can play around here: https://regex101.com/r/sYXfWy/1
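
The same thing as a short Octave sketch. The extra test strings and the
last pattern are only my illustration of how the "." could be allowed
between the two "#" characters:

```octave
pat = '#\w*#';

s1 = regexp ('##',    pat)   % zero word characters between the # signs
s2 = regexp ('#abc#', pat)   % three word characters
s3 = regexp ('#.#',   pat)   % "." is not a word character: no match

% Allowing "." explicitly inside a character class makes "#.#" match.
s4 = regexp ('#.#', '#[\w.]*#')
```

Here s1, s2 and s4 should be 1, while s3 should be empty, as in the
original report.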

> Especially the last one >> regexp("#.#","#\\w*#") ans = [](1x0) looks
> like a bug to me. Or am I getting something wrong here?

Yes, see above.

-- Andy
