Re: [WIP][PATCH 2/2] pkl,pvm: add support for regular expression
From: Jose E. Marchesi
Subject: Re: [WIP][PATCH 2/2] pkl,pvm: add support for regular expression
Date: Sun, 23 Apr 2023 11:49:46 +0200
User-agent: Gnus/5.13 (Gnus v5.13)

> Hello Jose.
>
> Thanks for poke 3.1 :)
>
>
> On Mon, Feb 20, 2023 at 02:43:47PM +0100, Jose E. Marchesi wrote:
>>
>> > On Fri, Feb 17, 2023 at 12:19:51PM +0100, Jose E. Marchesi wrote:
>> >>
>> >> >
>> >> > What about having a new compile-time type for matched entities?
>> >> > It would be useful in regular expression matching for both strings
>> >> > and arrays of characters.
>> >> >
>> >> > Something like this:
>> >> >
>> >> > ```poke
>> >> > var m1 = "Hello pokers!" ~ /[hH]ello/,
>> >> > m2 = [0x00UB, 0x11UB, 0x22UB] ~ /\x11\x22/;
>> >> >
>> >> > if (m1)
>> >> > {
>> >> > printf "matched at index %v and offset %v\n", m1.index_begin,
>> >> > m1.offset_begin;
>> >> > assert ("Hello pokers!"[m1.index_begin:m1.index_end] == "Hello");
>> >> > }
>> >> > else
>> >> > {
>> >> > assert (m1.index_begin ?! E_elem);
>> >> > assert (m1.offset_begin ?! E_elem);
>> >> > }
>> >> > ```
>> >> >
>> >> > We can use other fields to give access to sub-groups.
>> >> >
>> >> > We can take an approach similar to the `Exception` struct, but for
>> >> > `Matched`.
>> >> > The compiler can cast it to boolean when necessary.
>> >>
>> >> The idea is interesting. But I don't like the part of changing the
>> >> semantics of `if' like this: it is not orthogonal.
>> >>
>> >> Note that the syntactic construction that uses Exception only works with
>> >> exceptions:
>> >>
>> >> try STMT; catch if EXCEPTION { ... }
>> >>
>> >> If we could come up with a syntactic construction for regular expression
>> >> matching, then it would be better IMO.
>> >>
>> >>
>> >
>> >
>> > What about this syntax:
>> >
>> > ```poke
>> > var matched_p = "Hello pokers!" ~? /[hH]ello/,
>> > matchinfo = "Hello pokers!" ~ /[hH]ello/;
>> >
>> > assert (matched_p isa int<32>);
>> > assert (matchinfo isa Matched);
>> >
>> > if (matchinfo.matched_p) { ... }
>> > ```
>>
>> Hmm... that has the disadvantage of having to match twice.
>>
>> It seems to me, we could make use of the exceptions by having ~ return a
>> Match struct and raising an E_nomatch exception when there is no match.
>>
>> Then we can use the normal operators ?! and try-until and try-catch to
>> check for when there is no match.
>>
>
>
> As we discussed some time ago, to keep the matching functionality consistent,
> we can use closures (the `_pkl_regexp_matcher' function in the following
> patch). `_pkl_regexp_matcher' returns a closure which returns true/false
> for an input string. The closure also takes two optional parameters (one to
> specify the start index and the other to report the sub-matches to the user).
> I re-implemented `pk_regexp_match' and `pk_regexp_gmatch' using
> `_pkl_regexp_matcher'.
>
> If you like the approach, I can add the ~ operator to the language, which
> expects a string on the LHS and a function with the `_Pkl_Matcher' signature
> on the RHS.
> I can also add the `/.../' construct, which evaluates to a `_Pkl_Matcher' for
> the specified regexp.
>
>
> ```poke
> var matched_p = "Hi" ~ /[hH]/;
>
> assert (matched_p);
> assert (/[hH]/ ("Hi"));
> assert (/[Bb]/ ("HiBye", 2));
> ```
>
>
> WDYT?
>
> - Maybe ~ is not necessary and /.../ is enough?
Both are nice to have, because ~ will be used to match other things,
including user-defined things.
> - I'm not sure reporting sub-matches using the `_Pkl_Regexp_Match' type is the
> right way. What about accepting `int<32>[2]` as the sub-index? (Then we'll
> call the `clbk` several times, each time with a pair of indices.)
Just use the most convenient way ;)
>
> Regards,
> Mohammad-Reza
>
> ```
> diff --git a/libpoke/pkl-rt.pk b/libpoke/pkl-rt.pk
> index 896aeb44..13d6460b 100644
> --- a/libpoke/pkl-rt.pk
> +++ b/libpoke/pkl-rt.pk
> @@ -1819,6 +1819,63 @@ fun _pkl_re_gmatch = (string regex, string str,
> return result;
> }
>
> +type _Pkl_MatcherCallback = (any)void;
> +type _Pkl_Matcher = (string, int<32>?, _Pkl_MatcherCallback?)int<32>;
> +
> +fun _pkl_regexp_matcher = (string regex) _Pkl_Matcher:
> +{
> + return lambda (string str,
> + int<32> start = 0,
> + _Pkl_MatcherCallback clbk = lambda (any v) void: {}) int<32>:
> + {
> + var result = _Pkl_Regexp_Match {};
> +
> + /* HACK This is equivalent to `push null'.
> + Until we get a more powerful assembler, we have to use this
> + trick. */
> + var opq = asm any: ("push 7");
> +
> + /* Unfortunately we have to compile the regexp in every invocation.
> + The reason is to not leak the opaque value (we have to invoke `refree'
> + instruction to free the resources explicitly). */
> + {
> + var err = asm any: ("push 7");
> +
> + asm ("recomp" : opq, err : regex);
> + if (asm int<32>: ("nn; nip" : err))
> + raise Exception {code = EC_inval,
> + name = "invalid regular expression: " + err as string,
> + exit_status = 1};
> + }
> +
> + asm ("remtch; nip" : result.count : opq, str, start);
> + {
> + var subnum = 0UL;
> +
> + asm ("resubnum; nip" : subnum : opq);
> + result.submatches = int<32>[2][subnum] ();
> + for (var i = 0UL; i != subnum; ++i)
> + {
> + asm ("resubref; rot; drop"
> + : result.submatches[i][0], result.submatches[i][1]
> + : opq, i);
> + }
> + }
> + asm ("refree" :: opq);
> +
> + if (result.count == -2)
> + raise Exception {code = EC_inval,
> + name = "regular expression match function internal error",
> + exit_status = 1};
> +
> + var found_p = result.count != -1;
> +
> + if (found_p)
> + clbk (result);
> + return found_p;
> + };
> +}
> +
> /**** Set the default load path ****/
>
> immutable var load_path = "";
> diff --git a/libpoke/std.pk b/libpoke/std.pk
> index bcc0d1cc..01c30073 100644
> --- a/libpoke/std.pk
> +++ b/libpoke/std.pk
> @@ -866,7 +866,7 @@ fun pk_vercmp = (any _a, any _b) int<32>:
>
> fun pk_regexp_match = (string regex, string str, int<32> start = 0) int<32>:
> {
> - return _pkl_re_match (regex, str, start);
> + return _pkl_regexp_matcher (regex) (str, start);
> }
>
> type Pk_Regexp_Match =
> @@ -879,7 +879,15 @@ type Pk_Regexp_Match =
> fun pk_regexp_gmatch = (string regex, string str,
> int<32> start = 0) Pk_Regexp_Match:
> {
> - var result = _pkl_re_gmatch (regex, str, start);
> + var result = Pk_Regexp_Match {};
>
> - return Pk_Regexp_Match {count=result.count, submatches=result.submatches};
> + _pkl_regexp_matcher (regex) (str, start, lambda (any v) void:
> + {
> + var m = v as _Pkl_Regexp_Match;
> +
> + result.count = m.count;
> + result.submatches = m.submatches;
> + });
> +
> + return result;
> }
> ```
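
For readers who want to try the closure-based matcher design from the quoted message without applying the patch, here is a rough analogy in Python rather than Poke. It is only a sketch under several assumptions: `regexp_matcher`, `Submatches` and the callback shape are hypothetical names, not part of poke; Python's `re` engine stands in for the PVM regexp instructions; the match is assumed to be a search starting at `start`; and the regexp is compiled once in the factory, whereas the patch recompiles on every invocation so that it can free the opaque value.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for the submatches of _Pkl_Regexp_Match:
# a list of (begin, end) index pairs, the whole match first.
Submatches = List[Tuple[int, int]]
MatcherCallback = Callable[[Submatches], None]


def regexp_matcher(regex: str) -> Callable[..., bool]:
    """Return a closure that reports whether `regex` matches a string."""
    try:
        compiled = re.compile(regex)
    except re.error as exc:
        # Mirrors the "invalid regular expression" exception in the patch.
        raise ValueError(f"invalid regular expression: {exc}") from exc

    def matcher(s: str, start: int = 0,
                clbk: Optional[MatcherCallback] = None) -> bool:
        # Assumption: search from `start` instead of anchoring there.
        m = compiled.search(s, start)
        if m is None:
            return False
        if clbk is not None:
            # As in the patch, the callback runs only when a match is found.
            clbk([m.span(i) for i in range(compiled.groups + 1)])
        return True

    return matcher


# Usage, following the examples in the quoted message.
is_hello = regexp_matcher(r"[hH]ello")
assert is_hello("Hello pokers!")
assert regexp_matcher(r"[Bb]")("HiBye", 2)

found: Submatches = []
regexp_matcher(r"(po)(kers)")("Hello pokers!", 0, found.extend)
```

In this shape a plain boolean match and a match that reports sub-groups share one code path, which is how the patch reimplements both `pk_regexp_match` and `pk_regexp_gmatch` on top of the same closure.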