grammatica-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Grammatica-users] Fuzzy tokenizer.


From: Per Cederberg
Subject: Re: [Grammatica-users] Fuzzy tokenizer.
Date: Wed, 29 Jun 2005 22:56:22 +0200

Grammatica currently doesn't support productions that match
a null input. Hence it cannot handle grammars that match an
empty input. Normally this isn't such a big deal, but it
might of course be annoying. The reason for this was to make
the production look-ahead handling manageable.

Good point regarding the ordering of the productions. I
think it is written down somewhere, but obviously not in
the right place as I couldn't find it in the reference
manual just now... :-(

It would be great if you create a new version with the
features you mention. It is a much more ambitious goal than
the current Grammatica though, as you are then entering
into the domains of context sensitive grammars. I created
Grammatica as a pretty standard context *in*sensitive
LL(k) parser generator as I know the theory behind that.
Going context sensitive results in a whole new complexity
with regard to the ambiguity handling, as this example
illustrates:

%tokens%
TOKENX = "x"
LETTER = <<a-z>>

%productions%
A = B
  | C ;

B = "x" LETTER ;

C = LETTER "x" ;

Is this an ambiguity? For the input "xx", how do one
choose between B or C? I opted out of this and created a
normal LL(k) parser instead.

That said, extending the Tokenizer class to return all the
token matches from a specified position would of course be a
nice generalization and not too tricky to pull off. The
current class only returns the longest one (or the first one
if two have the same length). From that point one can then
start to experiment with error recovery or context
sensitivity in the parser without having to support full
context sensitivity everywhere.

Good luck with the project! And do let me know when you
make any progress. I'd love to merge back good ideas and
solid implementations into Grammatica.

Cheers,

/Per

On wed, 2005-06-29 at 22:13 +0300, Matti Katila wrote:
> Hello,
> 
> First let me introduce you a simple grammar:
> 
> %tokens%
> letter = <<[a-z]>>
> digit   = <<[0-9]>>
> b = "b"
> 
> %productions%
> S = letter ( letter | digit )*
> 
> 
> First question: How to define a production for empty file? It seems that
> there needs to be at least one character or error is thrown since
> non-empty production is illegal. Also, was it documented that first
> pattern is used as a startPattern?
> 
> Back to example grammar. With input "acd" all works fine, but with input
> "abc" an error is thrown since "b" is not expected token. You may wonder
> of course why to create such a "b" there but if we add another production
> like:
> 
> bXb = "b" letter "b",
> 
> we win in readability which I have said to be my high consern already.
> 
> To make grammatica even slower I will branch current version into darcs
> repository and will contribute a patch witch fixes this tokenizer design
> "flaw". Current design doesn't care what tokens the current production is
> looking for but tries it best to give a sequense of tokens what tokenizer
> thinks as best. Darcs supports distributed developing and it is easy for
> others to pick up my patches for grammatica if you are interested in those.
> 
> 
>    -Matti





reply via email to

[Prev in Thread] Current Thread [Next in Thread]