poke-devel
Re: [WIP][PATCH 2/2] pkl,pvm: add support for regular expression


From: Jose E. Marchesi
Subject: Re: [WIP][PATCH 2/2] pkl,pvm: add support for regular expression
Date: Sun, 23 Apr 2023 11:49:46 +0200
User-agent: Gnus/5.13 (Gnus v5.13)

> Hello Jose.
>
> Thanks for poke 3.1 :)
>
>
> On Mon, Feb 20, 2023 at 02:43:47PM +0100, Jose E. Marchesi wrote:
>> 
>> > On Fri, Feb 17, 2023 at 12:19:51PM +0100, Jose E. Marchesi wrote:
>> >> 
>> >> >
>> >> > What about having a new compile-time type for matched entities?
>> >> > It would be useful in regular expression matching for both strings
>> >> > and arrays of characters.
>> >> >
>> >> > Something like this:
>> >> >
>> >> > ```poke
>> >> > var m1 = "Hello pokers!" ~ /[hH]ello/,
>> >> >     m2 = [0x00UB, 0x11UB, 0x22UB] ~ /\x11\x22/;
>> >> >
>> >> > if (m1)
>> >> >   {
>> >> >     printf "matched at index %v and offset %v\n",
>> >> >            m1.index_begin, m1.offset_begin;
>> >> >     assert ("Hello pokers!"[m1.index_begin:m1.index_end] == "Hello");
>> >> >   }
>> >> > else
>> >> >   {
>> >> >     assert (m1.index_begin ?! E_elem);
>> >> >     assert (m1.offset_begin ?! E_elem);
>> >> >   }
>> >> > ```
>> >> >
>> >> > We can use other fields to give access to sub-groups.
>> >> >
>> >> > We can take an approach similar to the `Exception` struct, but for
>> >> > `Matched`.  The compiler can cast it to a boolean when necessary.
>> >> 
>> >> The idea is interesting.  But I don't like the part of changing the
>> >> semantics of `if' like this: it is not orthogonal.
>> >> 
>> >> Note that the syntactic construction that uses Exception only works with
>> >> exceptions:
>> >> 
>> >>   try STMT; catch if EXCEPTION { ... }
>> >> 
>> >> If we could come up with a syntactic construction for regular
>> >> expression matching, then it would be better IMO.
>> >> 
>> >> 
>> >
>> >
>> > What about this syntax:
>> >
>> > ```poke
>> > var matched_p = "Hello pokers!" ~? /[hH]ello/,
>> >     matchinfo = "Hello pokers!" ~ /[hH]ello/;
>> >
>> > assert (matched_p isa int<32>);
>> > assert (matchinfo isa Matched);
>> >
>> > if (matchinfo.matched_p) { ... }
>> > ```
>> 
>> Hmm... that has the disadvantage of having to match twice.
>> 
>> It seems to me, we could make use of the exceptions by having ~ return a
>> Match struct and raising an E_nomatch exception when there is no match.
>> 
>> Then we can use the normal operators ?! and try-until and try-catch to
>> check for when there is no match.
>> 
>
>
> As we discussed some time ago, to keep the matching functionality
> consistent, we can use closures (the `_pkl_regexp_matcher' function in
> the following patch).  `_pkl_regexp_matcher' returns a closure that
> returns true/false for an input string.  The closure also takes two
> optional parameters (one to specify the start index and the other to
> report the sub-matches to the user).
> I re-implemented `pk_regexp_match' and `pk_regexp_gmatch' using
> `_pkl_regexp_matcher'.
>
> If you like the approach, I can add the ~ operator to the language,
> which expects a string on the LHS and a function with the `_Pkl_Matcher'
> signature on the RHS.
> I can also add a `/.../' construct that evaluates to a `_Pkl_Matcher' for
> the specified regexp.
>
>
> ```poke
> var matched_p = "Hi" ~ /[hH]/;
>
> assert (matched_p);
> assert (/[hH]/ ("Hi"));
> assert (/[Bb]/ ("HiBye", 2));
> ```
>
>
> WDYT?
>
> - Maybe ~ is not necessary and /.../ is enough?

Both are nice to have, because ~ will be used to match other things,
including user-defined things.
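
To illustrate why a common matcher signature makes this possible, here is a minimal sketch written in Python rather than Poke, since the Poke constructs depend on the patch quoted below.  The names `regexp_matcher', `prefix_matcher' and `match' are hypothetical and exist only for this example; `match' stands in for the proposed ~ operator.  Unlike the patch, the sketch compiles the regexp once per closure, which is an assumption that does not carry over to the PVM.

```python
import re
from typing import Callable, Optional

# Hypothetical Python model of the `_Pkl_Matcher' signature:
# (string, start index, optional callback) -> matched or not.
Matcher = Callable[..., bool]


def regexp_matcher(regex: str) -> Matcher:
    """Return a closure that searches for REGEX in a string."""
    compiled = re.compile(regex)

    def matcher(s: str, start: int = 0,
                clbk: Optional[Callable[[object], None]] = None) -> bool:
        m = compiled.search(s, start)
        # Report match details only when there is a match.
        if m is not None and clbk is not None:
            clbk(m)
        return m is not None

    return matcher


def prefix_matcher(prefix: str) -> Matcher:
    """A user-defined matcher with the same shape; no regexp involved."""

    def matcher(s: str, start: int = 0,
                clbk: Optional[Callable[[object], None]] = None) -> bool:
        found = s.startswith(prefix, start)
        if found and clbk is not None:
            clbk((start, start + len(prefix)))
        return found

    return matcher


def match(lhs: str, matcher: Matcher) -> bool:
    """Stand-in for `LHS ~ MATCHER': it only needs a callable on the RHS."""
    return matcher(lhs)
```

Because `match' only calls its right-hand side, it treats the regexp matcher and the user-defined one the same way.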

> - I'm not sure reporting sub-matches using the `_Pkl_Regexp_Match' type is
>   the right way.  What about accepting `int<32>[2]` as the sub-index?  Then
>   we'll call `clbk` several times, each time with a pair of indices.

Just use the most convenient way ;)
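
For comparison, the pair-per-call alternative could look roughly like the following sketch, again in Python and under the assumption that the callback receives one (begin, end) pair per sub-match, with the whole match reported first.  The name `regexp_matcher_pairs' is hypothetical.

```python
import re
from typing import Callable


def regexp_matcher_pairs(regex: str) -> Callable[..., bool]:
    """Return a matcher that reports each sub-match as a pair of indices."""
    compiled = re.compile(regex)

    def matcher(s: str, start: int = 0,
                clbk: Callable[[int, int], None] = lambda begin, end: None
                ) -> bool:
        m = compiled.search(s, start)
        if m is None:
            return False
        # Group 0 is the whole match; groups 1..N are the sub-matches.
        for i in range(compiled.groups + 1):
            begin, end = m.span(i)
            clbk(begin, end)
        return True

    return matcher


# Usage: collect the pairs in a list owned by the caller.
pairs = []
found = regexp_matcher_pairs("([hH])(ello)")(
    "Hello pokers!", 0, lambda begin, end: pairs.append((begin, end)))
```

With this shape the caller never sees a match struct, at the cost of having to accumulate the pairs itself.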

>
> Regards,
> Mohammad-Reza
>
> ```
> diff --git a/libpoke/pkl-rt.pk b/libpoke/pkl-rt.pk
> index 896aeb44..13d6460b 100644
> --- a/libpoke/pkl-rt.pk
> +++ b/libpoke/pkl-rt.pk
> @@ -1819,6 +1819,63 @@ fun _pkl_re_gmatch = (string regex, string str,
>    return result;
>  }
>  
> +type _Pkl_MatcherCallback = (any)void;
> +type _Pkl_Matcher = (string, int<32>?, _Pkl_MatcherCallback?)int<32>;
> +
> +fun _pkl_regexp_matcher = (string regex) _Pkl_Matcher:
> +{
> +  return lambda (string str,
> +                 int<32> start = 0,
> +                 _Pkl_MatcherCallback clbk = lambda (any v) void: {}) int<32>:
> +    {
> +      var result = _Pkl_Regexp_Match {};
> +
> +      /* HACK This is equivalent to `push null'.
> +         Until we get a more powerful assembler, we have to use this
> +         trick.  */
> +      var opq = asm any: ("push 7");
> +
> +      /* Unfortunately we have to compile the regexp in every invocation.
> +         The reason is to not leak the opaque value (we have to invoke `refree'
> +         instruction to free the resources explicitly).  */
> +      {
> +        var err = asm any: ("push 7");
> +
> +        asm ("recomp" : opq, err : regex);
> +        if (asm int<32>: ("nn; nip" : err))
> +          raise Exception {code = EC_inval,
> +                           name = "invalid regular expression: " + err as string,
> +                           exit_status = 1};
> +      }
> +
> +      asm ("remtch; nip" : result.count : opq, str, start);
> +      {
> +        var subnum = 0UL;
> +
> +        asm ("resubnum; nip" : subnum : opq);
> +        result.submatches = int<32>[2][subnum] ();
> +        for (var i = 0UL; i != subnum; ++i)
> +          {
> +            asm ("resubref; rot; drop"
> +                 : result.submatches[i][0], result.submatches[i][1]
> +                 : opq, i);
> +          }
> +      }
> +      asm ("refree" :: opq);
> +
> +      if (result.count == -2)
> +        raise Exception {code = EC_inval,
> +                         name = "regular expression match function internal error",
> +                         exit_status = 1};
> +
> +      var found_p = result.count != -1;
> +
> +      if (found_p)
> +        clbk (result);
> +      return found_p;
> +    };
> +}
> +
>  /**** Set the default load path ****/
>  
>  immutable var load_path = "";
> diff --git a/libpoke/std.pk b/libpoke/std.pk
> index bcc0d1cc..01c30073 100644
> --- a/libpoke/std.pk
> +++ b/libpoke/std.pk
> @@ -866,7 +866,7 @@ fun pk_vercmp = (any _a, any _b) int<32>:
>  
>  fun pk_regexp_match = (string regex, string str, int<32> start = 0) int<32>:
>  {
> -  return _pkl_re_match (regex, str, start);
> +  return _pkl_regexp_matcher (regex) (str, start);
>  }
>  
>  type Pk_Regexp_Match =
> @@ -879,7 +879,15 @@ type Pk_Regexp_Match =
>  fun pk_regexp_gmatch = (string regex, string str,
>                          int<32> start = 0) Pk_Regexp_Match:
>  {
> -  var result = _pkl_re_gmatch (regex, str, start);
> +  var result = Pk_Regexp_Match {};
>  
> -  return Pk_Regexp_Match {count=result.count, submatches=result.submatches};
> +  _pkl_regexp_matcher (regex) (str, start, lambda (any v) void:
> +    {
> +      var m = v as _Pkl_Regexp_Match;
> +
> +      result.count = m.count;
> +      result.submatches = m.submatches;
> +    });
> +
> +  return result;
>  }
> ```


