
RE: [Grammatica-users] novice question: when more than one tokenmatch?


From: malcolm macaulay
Subject: RE: [Grammatica-users] novice question: when more than one tokenmatch?
Date: Tue, 13 Apr 2004 00:03:51 +0100

Hi Per,

Thanks for your very fast reply. Apologies for asking a common dumb
question. 

Sounds like I need to do some more reading on parsers. I understand your
point about context-sensitive/-free parsers and I suspect for most of
the file types I need to parse, the grammar is (or can be) context-free,
as you demonstrate.

In fact, if I really understood the concepts, I think I could say for
sure that all my 'configuration' files are context-free. I think...

Thanks again for your help and for Grammatica. It may be some while
before I can send you a patch :-)

cheers

Malcolm

-----Original Message-----
From: address@hidden
[mailto:address@hidden] On Behalf Of Per Cederberg
Sent: Monday, 12 April 2004 11:21 p.m.
To: address@hidden
Subject: Re: [Grammatica-users] novice question: when more than one
tokenmatch?

Note to self: Find someone to write a nice tutorial or FAQ
for using Grammatica based on this question (and others).

..hint, hint...

On Mon, 2004-04-12 at 22:04, malcolm macaulay wrote:
> I have a grammar where more than one token can match (i.e. one or
> more regex tokens are a subset of another regex token). If I am
> reading the C# code correctly, Grammatica will return the *longest*
> token which can be matched, and if this does not match the production
> it will return an error. This does not make sense to me, as there may
> be a shorter token match which is the correct one according to the
> grammar.

What you are after here (without knowing it) is a context-
sensitive parser, i.e. a parser that adjusts the flow of
tokens according to the production currently being matched.

Grammatica, unfortunately, can only handle context-free
grammars, due to the complexity of the more advanced kind.
So in Grammatica the string of tokens parsed is totally
independent of the productions in the grammar (you can
inspect that token stream with the --tokenize command-line
option).
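The longest-match rule at the heart of this can be illustrated with a
small toy scanner. This is a hypothetical Python sketch, not
Grammatica's actual C# tokenizer; the token names are taken from the
grammar quoted below:

```python
import re

# Toy token table (names from the grammar below; the matching logic
# here only illustrates the longest-match rule).
TOKENS = [
    ("DEVICE_NAME", re.compile(r"DEVICE NAME")),
    ("EQUALS", re.compile(r"=")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("DOT_STAR", re.compile(r".*")),
]

def next_token(text, pos):
    """Return the longest match at pos, as (name, lexeme, end)."""
    best = None
    for name, rx in TOKENS:
        m = rx.match(text, pos)
        if m and m.end() > pos and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best

# DOT_STAR matches the whole line, so it beats DEVICE_NAME:
print(next_token("DEVICE NAME=750", 0))
# → ('DOT_STAR', 'DEVICE NAME=750', 15)
```

This is exactly the failure reported below: DEVICE_NAME does match, but
DOT_STAR's match is longer, so DOT_STAR wins.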

> The document I want to parse looks like this:
> 
> Document to parse (the first part of a configuration file for a
> digital power protection relay):
> 
> [DEVICE INFORMATION]
> DEVICE NAME=750
> COMMENT=some words
> VERSION=500
> 
> My grammar:
> 
> %tokens%
> 
> WHITESPACE =                  <<[\s\n\r]+>> %ignore%
> DEVICE_INFORMATION_HEADING =  <<\[DEVICE INFORMATION\]>>
> DEVICE_NAME =                 <<DEVICE NAME>>

If you use string tokens instead of regexps here, your parser
will be more efficient. I suspect, however, that you're using
regexps to allow more flexible whitespace, like this:

DEVICE_NAME = <<DEVICE[ \t]+NAME>>
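That pattern can be sanity-checked with plain Python `re` (an
illustration only; the details of Grammatica's own regex support may
differ):

```python
import re

# Check of the flexible-whitespace pattern suggested above.
rx = re.compile(r"DEVICE[ \t]+NAME")

assert rx.match("DEVICE NAME")       # single space
assert rx.match("DEVICE \t  NAME")   # mixed spaces and tabs
assert not rx.match("DEVICENAME")    # some whitespace is required
```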

> EQUALS =                      "="
> NUMBER =                      <<[0-9]+>>
> DOT_STAR =                    <<.*>>

The simple solution in your case would be to replace DOT_STAR
with the following (as the last token):

DISCARD = <<.>> %ignore%

This will match any character not included in the other tokens,
but since it never matches more than a single character, any
problems with match length are avoided. Also, the %ignore%
directive hides all occurrences of this token, so you don't have
to mention it in the grammar productions.
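To see why the single-character DISCARD token fixes things, here is a
hypothetical Python sketch of the whole scan (an illustration of the
longest-match rule, not Grammatica's own C# code):

```python
import re

# Revised toy token table: DISCARD matches exactly ONE character and is
# ignored, so it can never out-match a real token by length.
TOKENS = [
    ("WHITESPACE", re.compile(r"[\s\n\r]+"), True),   # %ignore%
    ("DEVICE_INFORMATION_HEADING",
     re.compile(r"\[DEVICE INFORMATION\]"), False),
    ("DEVICE_NAME", re.compile(r"DEVICE NAME"), False),
    ("EQUALS", re.compile(r"="), False),
    ("NUMBER", re.compile(r"[0-9]+"), False),
    ("DISCARD", re.compile(r"."), True),              # %ignore%
]

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        # Longest match wins; on a length tie this sketch prefers the
        # earlier definition in the table.
        name, m, ignore = max(
            ((n, rx.match(text, pos), ig) for n, rx, ig in TOKENS),
            key=lambda t: t[1].end() if t[1] else pos,
        )
        if m is None:
            raise ValueError(f"no token matches at position {pos}")
        if not ignore:
            out.append((name, m.group()))
        pos = m.end()
    return out

print(tokenize("[DEVICE INFORMATION]\nDEVICE NAME=750"))
# → heading, DEVICE_NAME, EQUALS, NUMBER — the stream the production expects
```

Because DISCARD is only one character long, DEVICE_NAME now wins the
length comparison at the start of line 2, and the ignored tokens never
reach the productions.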

> %productions%
> 
> Expression = DEVICE_INFORMATION_HEADING DEVICE_NAME EQUALS NUMBER
> DOT_STAR;
> 
> When I test this grammar against the document I get:
> 
> Expression(2001)
> DEVICE_INFORMATION_HEADING(1002): "[DEVICE INFORMATION]". Line: 1, col: 1
> Error: in test.txt line 2:
> Unexpected token "DEVICE NAME=750" <DOT_STAR>, expected <DEVICE_NAME>
> 
> I have read the C# code and I can see that it is matching both
> <DEVICE_NAME> and <DOT_STAR>, but returning the <DOT_STAR> match as
> this has the longer string of the two matches. This does not seem
> right; surely it should return all matches and then refer to the
> productions to determine which to use?

Well, in a context-sensitive world, yes. Tools such as Grammatica
(or flex, ANTLR...) can only handle context-free grammars, however,
so we return the "best" match, i.e. the longest one. By pushing that
complexity onto our users, the generated parsers can be much faster
and easier to implement. :-)

Ok, so what to do? Well, the simple solution in your case is to
start using the --tokenize option. That way you can test your
token definitions separately, without worrying about the
productions. Try it with various inputs and check that the list
of tokens is what you expect. Avoid using *, + and similar in
the regexps unless you are *absolutely sure* they will only match
things you want to match. Once that works, try the --parse option
and start playing with the productions.

Or if you're really annoyed by me being this incredibly lazy
guy who didn't implement this the *right way*, well...
I do accept patches, you know... ;-)

> Per, thanks for making this parser generator available.

Welcome to the wonderful world of free software! ;-)

Cheers,

/Per
-- 
Per Cederberg, Software Consultant
http://www.percederberg.net/software



_______________________________________________
Grammatica-users mailing list
address@hidden
http://mail.nongnu.org/mailman/listinfo/grammatica-users





