
Re: [Grammatica-users] novice question: when more than one token match?


From: Per Cederberg
Subject: Re: [Grammatica-users] novice question: when more than one token match?
Date: Tue, 13 Apr 2004 00:21:20 +0200

Note to self: Find someone to write a nice tutorial or FAQ
for using Grammatica based on this question (and others).

...hint, hint...

On Mon, 2004-04-12 at 22:04, malcolm macaulay wrote:
> I have a grammar where more than one token can match (i.e. one or
> more regex tokens are a subset of another regex token). If I am reading
> the C# code correctly, Grammatica will return the *longest* token which
> can be matched and if this does not match the production it will return
> an error. This does not make sense to me as there may be a shorter token
> match which is the correct one according to the grammar.

What you are after here (without knowing it) is a context-
sensitive parser, i.e. a parser that adjusts the flow of
tokens (for instance) according to the production currently
being matched.

Grammatica, unfortunately, can only handle context-free
grammars, due to the complexity of the more advanced kind.
So in Grammatica the stream of tokens is totally independent
of the productions in the grammar (which, by the way, can be
tested with the --tokenize command-line option).

> The document I want to parse looks like this:
> 
> Document to parse (the first part of a configuration file for a digital
> power protection relay):
> 
> [DEVICE INFORMATION]
> DEVICE NAME=750
> COMMENT=some words
> VERSION=500
> 
> My grammar:
> 
> %tokens%
> 
> WHITESPACE                 = <<[\s\n\r]+>> %ignore%
> DEVICE_INFORMATION_HEADING = <<\[DEVICE INFORMATION\]>>
> DEVICE_NAME                = <<DEVICE NAME>>

If you use string tokens instead of regexps here, your parser
will be more efficient. I suspect, however, that you're using
regexps to allow more flexible whitespace, like this:

DEVICE_NAME = <<DEVICE[ \t]+NAME>>

> EQUALS                     = "="
> NUMBER                     = <<[0-9]+>>
> DOT_STAR                   = <<.*>>

The simple solution in your case would be to replace DOT_STAR
with the following (as the last token):

DISCARD = <<.>> %ignore%

This will match any character not included in the other tokens,
but since it never matches more than a single character, the
problems with match length are avoided. Also, the %ignore%
directive hides every occurrence of this token, so you don't
have to mention it in the grammar (wherever it is encountered).
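Putting it together, the revised token section might look
something like this (a sketch only; the flexible-whitespace
DEVICE_NAME regexp from above is optional):

```
%tokens%

WHITESPACE                 = <<[\s\n\r]+>> %ignore%
DEVICE_INFORMATION_HEADING = <<\[DEVICE INFORMATION\]>>
DEVICE_NAME                = <<DEVICE[ \t]+NAME>>
EQUALS                     = "="
NUMBER                     = <<[0-9]+>>
DISCARD                    = <<.>> %ignore%
```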

> %productions%
> 
> Expression = DEVICE_INFORMATION_HEADING DEVICE_NAME EQUALS NUMBER
> DOT_STAR;
> 
> When I test this grammar against the document I get:
> 
> Expression(2001)
> DEVICE_INFORMATION_HEADING(1002): "[DEVICE INFORMATION]". Line: 1, col:
> 1
> Error: in test.txt line 2:
> Unexpected token "DEVICE NAME=750" <DOT_STAR>, expected <DEVICE_NAME>
> 
> I have read the C# code and I can see that it is matching both
> <DEVICE_NAME> and <DOT_STAR>, but returning the <DOT_STAR> match as this
> has the longer string of the two matches. This does not seem right;
> surely it should return all matches and then refer to the productions to
> determine which to use?

Well, in a context-sensitive world, yes. Tools such as Grammatica
(or flex, ANTLR...) can only handle context-free grammars, however.
So we return the "best" match, i.e. the longest one. By pushing
that bit of complexity onto our users, the generated parsers can
be much faster and easier to implement. :-)
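To see why your DOT_STAR token wins, here is a little sketch in
Python of a longest-match ("maximal munch") tokenizer. This is not
Grammatica's actual code, just an illustration of the rule: every
token regexp is tried at the current position and the longest match
is kept.

```python
# Sketch of longest-match tokenization (not Grammatica's real code).
import re

# The token definitions from the grammar above, in order.
TOKENS = [
    ("WHITESPACE", re.compile(r"\s+")),
    ("DEVICE_INFORMATION_HEADING", re.compile(r"\[DEVICE INFORMATION\]")),
    ("DEVICE_NAME", re.compile(r"DEVICE NAME")),
    ("EQUALS", re.compile(r"=")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("DOT_STAR", re.compile(r".*")),
]

def next_token(text, pos):
    """Return (name, lexeme) of the LONGEST token match at pos."""
    best = None
    for name, regex in TOKENS:
        m = regex.match(text, pos)  # anchored match at pos
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# On line 2 of the input, both DEVICE_NAME ("DEVICE NAME", 11 chars)
# and DOT_STAR ("DEVICE NAME=750", 15 chars) match, so DOT_STAR wins.
print(next_token("DEVICE NAME=750", 0))
# -> ('DOT_STAR', 'DEVICE NAME=750')
```

With DOT_STAR replaced by the single-character DISCARD token, no
token can ever "swallow" a longer DEVICE_NAME match this way.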

OK, so what to do? Well, the simple solution in your case is to
start using the --tokenize option. That way you can test your
token definitions separately, without worrying about the
productions. Try it with various inputs and check that the list
of tokens is what you expect. Avoid using *, + and similar
constructs in the regexps unless you are *absolutely sure* they
will only match things you want to match. Once that works, try
the --parse option and start playing with the productions.

Or, if you're really annoyed by me just being this incredibly
lazy guy who didn't implement this the *right way*, well...
I do accept patches, you know... ;-)

> Per, thanks for making this parser generator available.

Welcome to the wonderful world of free software! ;-)

Cheers,

/Per
-- 
Per Cederberg, Software Consultant
http://www.percederberg.net/software
