Re: regexp strangeness

octave-maintainers

From:	Kay Nick
Subject:	Re: regexp strangeness
Date:	Sat, 8 Feb 2020 16:57:25 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.2

> I don't think it makes sense to document this especially or additionally
> in the help text for regexp.
I agree that we should avoid redundant explanatory paragraphs in the
documentation. But I think clear pointers to the sources of your
explanation, i.e. the links you provided (thanks for those, by the way),
would be very helpful in the help text for regexp. That way we would
improve the documentation without creating an additional burden to
keep it up to date.
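
As a rough sketch of the kind of short example such a hint could sit next
to (this is only my suggestion, not existing help text, and the variable
names are just for illustration), using the cases discussed in this thread:

```octave
% Single-quoted pattern: the backslash reaches the regexp engine unchanged.
a = regexp ('#d#', '#\w#');      % match at index 1

% Double-quoted pattern: Octave processes escape sequences first,
% so the backslash has to be doubled.
b = regexp ("#d#", "#\\w#");     % same pattern, match at index 1

% \w* may match zero characters, but "." is not a word character
% and still has to be matched by something in the pattern.
c = regexp ('##',  '#\w*#');     % match at index 1
d = regexp ('#.#', '#\w*#');     % no match, returns an empty matrix
```

Together with the two links, something this short would probably have
answered all three of my questions.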

Kay

On 08.02.20 15:01, Andreas Weber wrote:
> On 08.02.20 at 12:47, Kay Nick wrote:
>> the documentation for regexp says:
>>
>> '\w'
>>           Match any word character
>>
>> what exactly is a word character (maybe even more important what isn't)?
> It's always worth having a look at the underlying library, PCRE in this
> case: https://www.pcre.org/original/doc/html/pcrepattern.html
>
> ...A "word" character is an underscore or any character that is a letter
> or digit. By default, the definition of letters and digits is controlled
> by PCRE's low-valued character tables, and may vary if locale-specific
> matching is taking place (see "Locale support" in the pcreapi page). For
> example, in a French locale such as "fr_FR" in Unix-like systems, or
> "french" in Windows, some character codes greater than 127 are used for
> accented letters, and these are then matched by \w. The use of locales
> with Unicode is discouraged. ....
>
>>>> regexp("#d#","#\w#")
>> ans = [](1x0)                     <- why does this happen? I've provided
>> a word character (letter)
>>>> regexp("#d#","#\\w#")
>> ans =  1                             <- Ahhh, so we need to double
>> escape these special characters... no mention of that in the help...
> The handling of escape sequences applies to all strings, not just in regexp,
> see https://octave.org/doc/v4.0.1/Escape-Sequences-in-String-Constants.html
>
> I don't think it makes sense to document this especially or additionally
> in the help text for regexp.
>
>>>> regexp("#.#","#\\w*#")
>> ans = [](1x0)                    <- why? Asterisk (*) is supposed to
>> match zero or more times. Here there is zero times a letter, so it
>> should match...
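> No, it would match "##" but not "#.#": the "." is not a word character,
> so \w* matches zero characters and the pattern then expects "#", not ".".
> You can play around here: https://regex101.com/r/sYXfWy/1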
>
>> Especially the last one, regexp("#.#","#\\w*#") returning [](1x0), looks
>> like a bug to me. Or am I getting something wrong here?
> Yes, see above.
>
> -- Andy
