Re: [Grammatica-users] Parsing data out of an html file
From: Oliver Gramberg
Subject: Re: [Grammatica-users] Parsing data out of an html file
Date: Mon, 7 Feb 2011 14:27:03 +0100
Hi, Andrew,
your token definition,
SKIP_EVERYTHING = <<.*>> %ignore%
does exactly what its name says: it skips everything. The reason is that
the tokenizer works "greedily," i.e., once a valid match is found, it eats
as many characters as it can. This is called the "longest match principle."
The reason this principle is employed, in turn, is efficiency: The
tokenizer doesn't have to backtrack, and therefore effectively reads each
character only once.
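To illustrate the longest match principle, here is a minimal Python sketch. It is not Grammatica code and does not use Grammatica's API; the NUMBER and WORD tokens and the sample input are made up for the example, and only SKIP_EVERYTHING mirrors your definition. It simply tries every pattern at the current position and keeps the longest match:

```python
import re

# Hypothetical token set: NUMBER and WORD stand in for a real grammar;
# SKIP_EVERYTHING mirrors the <<.*>> pattern from the question.
TOKENS = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("SKIP_EVERYTHING", re.compile(r".*")),
]

def longest_match(text, pos):
    """Return (name, lexeme) of the longest match at pos; earlier tokens win ties."""
    best = None
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# WORD could match "England", but ".*" matches the whole line and is longer.
first_token = longest_match("England 42", 0)
print(first_token)
```

Because ".*" can always be extended to the end of the line, no other token ever gets a chance to win.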
Let's assume you are actually interested in the "England" bit of the line
you show to be your target. Grammatica's %ignore% is all-or-nothing, so it
is not of much help here: The line is identified by the markup at its
beginning, so you cannot just throw away *all* markup; also, you want to
throw away most of the content, but not *all*.
Fortunately, there's another way: To ignore something can also mean *not
to do anything with it*, or, in Grammatica's terms: to do nothing in the
method that is called when such a token is found.
So, the easiest solution to your problem might involve
(1) declaring a token that exactly matches the HTML markup before the
location where you want to extract data;
(2) declaring a token that matches all HTML markup, i.e., starts with "<";
(3) declaring a token that matches all HTML non-markup, i.e., starts with
"[^<]".
- Token (1) must come first in your grammar; this way, Grammatica chooses
it over (2) when your identifying markup appears in the input.
- When (1) is found, you set (in the appropriate callback method) a flag
that indicates that the next non-markup is the data you want to extract.
- Only when the flag is set is (3) used as output.
- Don't forget to reset the flag.
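The steps above can be sketched in Python as follows. This is only an illustration of the flag idea, not Grammatica's generated analyzer: the token names, the functions next_token and extract, and the sample HTML are all made up, and the marker string is an assumption based on the div that appears in the Perl line further down.

```python
import re

# Order matters: the specific marker (1) precedes generic markup (2), so it
# wins when both match the same text. The marker string is an assumption.
TOKENS = [
    ("TARGET_MARKUP", re.compile(r'<div class="endOfDayLeft">')),
    ("MARKUP", re.compile(r"<[^>]*>")),
    ("TEXT", re.compile(r"[^<]+")),
]

def next_token(text, pos):
    """Longest match at pos; on equal length the earlier token is kept."""
    best = None
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        if m and m.end() > pos and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best

def extract(html):
    results = []
    want_text = False  # set when the identifying markup has just been seen
    pos = 0
    while pos < len(html):
        token = next_token(html, pos)
        if token is None:  # e.g. a stray "<" without a closing ">"
            pos += 1
            continue
        name, lexeme, pos = token
        if name == "TARGET_MARKUP":
            want_text = True
        elif name == "TEXT" and want_text and lexeme.strip():
            results.append(lexeme.strip())
            want_text = False  # reset the flag
        # every other token is "ignored" simply by doing nothing with it
    return results

# Made-up sample input for illustration.
sample = '<p>intro</p><div class="endOfDayLeft"><a href="#">England</a></div><p>rest</p>'
print(extract(sample))
```

Whitespace-only text between the marker and the data is skipped here so that it does not use up the flag; whether you need that depends on your input.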
On the other hand, with such a small number of tokens, it might be even
easier to handle this with a small script:
perl -n extract.pl output.html > extract.txt
with this line as the contents of extract.pl:
print $1 if m|<div class="endOfDayLeft"><a href="">