Developed primarily by Noah Smith and Jason Eisner, with input from Markus Dreyer and David Smith
You do not have to write any code for this exercise!
Your system, which is in competition with those of the other teams, will consist of two sub-grammars: S1, your "real" English grammar, and S2, a grammar that can generate (and therefore parse) any string of words from the vocabulary.
One way to see what your grammar does is to generate random sentences from it. For our purposes, generation is just repeated symbol expansion. To expand a symbol such as NP, our sentence generator will randomly choose one of your grammar's NP -> ... rules, with probability proportional to the rule's weight.
Your task is to write the grammar rules and also choose the weight for each rule. The weights allow you to express your knowledge of which English phenomena are common and which ones are rare. By giving a low weight to a rule (or to S2 itself), you express a belief that it doesn't happen very often in English.
Another way to see what your grammar does is to parse sentences with it. You can use our parser to look at a sentence and figure out whether your grammar could have generated it -- and how, and with what probability. We say that your grammar predicts a sentence well if (according to the parser) your grammar has a comparatively high probability of generating exactly that sentence.
You will have to decide how much effort to expend on designing S1 and S2, and how much you will trust S1 relative to S2. Your goal is to describe English accurately. This means that your grammar (especially the S1 subgrammar) should be able to generate lots of syntactic constructions that show up in English. Moreover, it should have a higher probability of generating common sentences, and a low or zero probability of generating ungrammatical sentences.
The performance of your system will be measured in two ways: first, by the precision of your S1 grammar (the fraction of sentences generated from S1 that human judges find grammatical), and second, by the cross-entropy of your full grammar (S1+S2) on test sentences from the other teams. Both are described under Evaluation below.
A weighted context-free grammar consists of non-terminal symbols, terminal symbols, and weighted derivation rules. You can think of the non-terminal symbols as syntactic categories (one special non-terminal, START, is the start symbol), and the terminal symbols as the words.
A derivation rule tells one way to rewrite a non-terminal symbol into a sequence of non-terminal symbols and terminal symbols. For example, S -> NP VP says that an S (perhaps indicating a declarative sentence) can be rewritten as an NP (noun phrase) followed by a VP (verb phrase).
A word about weights: Today you heard a lot about probability models. In a probabilistic CFG, each rule would have some probability associated with it, and the probability of a derivation for a sentence would be the product of the probabilities of all the rules that went into the derivation. We don't want you to worry about making probabilities sum up to one, so you can use any positive number as a weight.
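For example (the rules and weights here are made up for illustration, not taken from the grammar we give you): suppose your grammar has exactly two rules that expand NP, namely NP -> Det Noun with weight 3 and NP -> Pronoun with weight 1. Then the generator expands NP using the first rule with probability 3/(3+1) = 3/4 and the second with probability 1/4, and the probability of a whole derivation is the product of such rule probabilities.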
In the file Vocab.gr, we are giving you the set of terminal symbols (words), embedded in rules of the form Tag -> word.
Note that the vocabulary is closed. There will
be no unknown words in the test sentences, and you are not allowed
to add any
words (terminal symbols) to the grammar.
We have given equal weights to all the Tag -> word rules, but you are free to change the weights if you like. You are also free to add, remove, or change these rules, as long as you don't add any new words.
We also chose the tags in these Tag -> word rules for you, but you might find that the tags aren't useful when trying to extend S1 into a bigger English grammar. So you are free to create new tags for word classes that you find convenient, or to use the words directly in the rules, if you find it advantageous. We tried to keep the vocabulary relatively small to make it easier for you to do this.
You will almost certainly want to change the tags in rules of the form Misc -> word. But be careful: you don't want to change Misc -> goes to VerbT -> goes, since goes doesn't behave like other VerbT's. In particular, you want your S1 to generate Guinevere has the chalice . but not Guinevere goes the chalice . , which is ungrammatical. This is why you may want to invent some new tags.
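For instance, here is a minimal sketch of how a new tag could help (the tag VerbI and the VP rules below are our own invented illustrations, not part of the grammar we give you, and we assume has is tagged VerbT):
VerbI -> goes
VP -> VerbT NP
VP -> VerbI
With these rules, only VerbT words ever take an object NP, so S1 can generate Guinevere has the chalice . but never Guinevere goes the chalice .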
Our method of enforcing this requirement (that S2 be able to generate any string of words at all) is to use a grammar that is effectively a bigram (or finite-state) model. Suppose we only have two tag types, A and B. The set of rules we would allow in S2 would be:
S2 -> _A
S2 -> _B
S2 ->
_A -> A
_A -> A _A
_A -> A _B
_B -> B
_B -> B _A
_B -> B _B
This grammar can generate any sequence of As and Bs, and there is no ambiguity in parsing with it: there is always a single, right-branching parse.
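For example, the string A B A has exactly one derivation under these rules: S2 -> _A, then _A -> A _B, then _B -> B _A, then _A -> A, giving the right-branching tree (S2 (_A A (_B B (_A A)))).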
Your grammar ties the two sub-grammars together with two rules: START -> S1 and START -> S2. The relative weight of these determines how likely it is that S1 (with start symbol S1) or S2 (with start symbol S2) would be selected in generating a sentence, and how costly it is to choose one or the other when parsing a sentence.
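For example (the numbers are only an illustration): if START -> S1 has weight 99 and START -> S2 has weight 1, then a randomly generated sentence starts from S1 with probability 99/100, while any sentence that can only be parsed through S2 pays a cost of -log(1/100), about 6.6 bits, for that first choice alone.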
Choosing the relative weight of these two rules is a gamble. If you are over-confident in your "real" English grammar (S1), and you weight it too highly, then you risk assigning very low probability to sentences generated by a team whose grammar is more extensive (since the parser will have to resort to your S2 to get a parse).
On the other hand, if you weight S2 too highly, then you will probably do a poor job of predicting the other teams' sentences, since S2 will not make any sentences very likely (it accepts everything, so probability mass is spread very thin across the space of word strings). Of course, you can invest some effort in trying to make S2 a better n-gram model, but that's a tedious task and a risky investment.
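If you do try, a minimal sketch in the two-tag notation above (with invented weights): if an A is usually followed by a B, you could give _A -> A _B weight 9 while leaving _A -> A _A and _A -> A at weight 1, so that S2 assigns higher probability to word strings whose bigrams are common.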
Evaluation
To evaluate the precision of your S1 grammar (the first of the two scores), we will first generate a test set of sentences from it using randsent. We will then make a grammaticality judgement about each sentence, and your score will be the fraction of sentences judged to be grammatical. To judge the grammaticality of the sentences, we will be using the best tool available: human speakers (yourselves!). Toward the end of the lab, we'll elicit grammaticality judgments from each of you on a set of sentences (which might have been generated by your grammar or someone else's -- so be honest!).
The second score evaluates your full grammar (S1+S2) and is much more important. Here we will take the other teams' test sets
(removing the sentences that human judges couldn't parse) and try to
parse them with your grammar (the combination of S1 and
S2). You will be able to parse all of the sentences (because S2
will accept anything, albeit with low probability). The score
will be the cross-entropy of your model with respect to the test
set. A low cross-entropy is good; it means that your model
is unsurprised by the test sentences generated by the
other teams. We will report your model's cross-entropy on each
of the other teams' test sets.
If the test set of sentences is {s1, s2, s3, ...}, then your model's cross-entropy is defined as
(-log(p(s1)) - log(p(s2)) - log(p(s3)) - ...) / (|s1| + |s2| + |s3| + ...)
where |si| is the number of words in the ith sentence. Note that
-log(1)=0, -log(1/2)=1, -log(1/4)=2, -log(1/8)=3, ..., -log(0)=infinity
so a high cross-entropy corresponds to a low probability and vice-versa. You will be severely penalized for probabilities close to zero.
Note: p(s) denotes the probability that a randomly generated sentence is exactly s. Often your grammar will have many different ways to generate s, corresponding to many different trees. In that case, p(s) is the total probability of all those trees. However, for convenience, we approximate this with the probability of the single most likely tree, because that is what our parser happens to find. So we are really only measuring an approximate cross-entropy.
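For example (invented numbers): if the test set were a single 5-word sentence s with p(s) = 1/1024, the cross-entropy would be -log(1/1024) / 5 = 10 / 5 = 2 bits per word.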
We want to draw attention to a blunder made by some previous participants. Remember the part about weighting the START -> S1 and START -> S2 rules? Few teams change those weights from their defaults, even though doing so helps their cross-entropy scores a lot.
Tools at Your Disposal
We have given you tools to parse sentences, to generate sentences
randomly according to your grammar, and to compute the cross-entropy
of your grammar with respect to a set of sentences. We've also given
you a tool to check for terminal symbols you may have accidentally
added to the grammar.
There are two file formats to know about. The first is for sentences generated by a grammar, or to be parsed by a grammar. We name files in this format with the suffix .sen. Simply put one sentence per line, with spaces between tokens. Make sure you put spaces before and after punctuation symbols, since, for example, Arthur, will look like one token to the parser, while Arthur , is what you want (two tokens).
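For example, a tiny .sen file containing just the sentence discussed above would consist of the single line:
Guinevere has the chalice .
Note the spaces around the final period.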
The file format for the grammar is also quite simple. We name these files with the suffix .gr. On a given line, anything following # is an ignored comment. Each rule goes on one line, in the format:
weight X Y1 Y2 ... Yn
signifying the rule X -> Y1 Y2 ... Yn (where X is a non-terminal and the Ys are any mix of terminals and non-terminals), and it gets weight weight. Be careful not to name any of your non-terminals the same as any terminals, since the parser will assume you intended to conflate the symbols. Blank lines will be ignored. If you accidentally include the same rule more than once, its weight will be the sum of the weights in the grammar file(s).
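For example (the symbols and weight are only an illustration), the line
3 NP Det Noun
adds the rule NP -> Det Noun with weight 3.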
From here on, .gr files refer to grammar files, and .sen files refer to sentence files.
Random sentence generator:
This program will randomly generate sentences according to the distribution defined by your weighted grammar.
cat grammar-file(s) | randsent [ -t ] [ -n num-sentences ] [ -s start-symbol ]
-t : produce tree output instead of flat sentences.
-n num-sentences : e.g., -n 10 will generate ten sentences instead of just one.
-s start-symbol : The generator's start symbol is START by default. You may want to specify another so you can test just part of the grammar (e.g., S1).
-g single-grammar-file : Instead of reading the grammar files from standard input, read them from a single file.
grammar-files : Typically *.gr. In general, a list of .gr files containing the parts of the grammar. Different parts may be written by different people.
So to generate 17 trees from your entire merged grammar (S1 and S2 combined), and to make the trees "pretty," you might run:
cat *.gr | randsent -t -n 17 -s START | prettyprint
If your grammar is in a single file fullgrammar.gr, you may equivalently run:
randsent -t -n 17 -s START -g fullgrammar.gr | prettyprint
You can also list more than one grammar file:
randsent -t -n 17 -s START -g fullgrammar.gr extrarules.gr | prettyprint
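Since the grammaticality judgments are made on sentences generated from S1 alone, you might preview them by generating from that start symbol (assuming your grammar is split across the *.gr files in the current directory):
cat *.gr | randsent -n 10 -s S1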
Earley Parser (and pre/post-processing scripts):
The parser was written in Dyna and C++.
To use the parser:
parse -g grammar-file(s) -i sentence-file(s)
You can also pipe grammar files or sentence files in through standard input (not both; if you don't specify -g or -i, the program won't know what's what):
cat grammar-file(s) | parse -i sentence-file(s)
cat sentence-file(s) | parse -g grammar-file(s)
Many other options are available:
sentence-file(s) : Names of files with one input sentence per line.
-g grammar-file : Name of the file containing the grammar.
-s start-symbol : The parser's default start symbol is START. You may want to specify another.
-h : Prints help, including details on many useful options.
It is expected that your grammar will be in multiple files. If you follow our convention and name all of these ending in .gr, you can parse with something like:
cat *.gr | parse -i sentence-file(s)
which will take sentences from sentence-file(s) and use all your grammar files.
If a sentence fails to parse, its output will be failure on a single line. If the parse is successful, you will see the single best-scoring (highest-weight) parse on one line. Additional options let you see the probability of the sentence, the probability of the best parse, the number of parses the sentence has under the grammar, and the cross-entropy (in bits per word) of your sentence file. You will get an error if your grammar includes additional vocabulary, and a warning will print if your sentences contain words not in the grammar.
If you want to make the parses a bit more readable, pipe the output of parse to prettyprint:
parse -i sentence-file -g grammar-file | prettyprint
If you want to see how you're doing in terms of cross-entropy:
parse -i sentence-file -g grammar-file -nC
(The -n suppresses the parses, and the -C option reports the cross-entropy.)
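For example, one way to sanity-check your own grammar (assuming it lives in a single file fullgrammar.gr; the file names here are only an illustration) is to generate some sentences from it and then measure how well it predicts them:
randsent -n 100 -s START -g fullgrammar.gr > mysentences.sen
parse -i mysentences.sen -g fullgrammar.gr -nC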
Checking for new vocabulary:
New vocabulary is strictly forbidden! Sometimes, though, you might
add a non-terminal on the right-hand side of a rule and forget to give
its expansion rules. Or you might forget exactly which words are in the
vocabulary. To check for new terminals:
cat grammar-files | check-for-new-terms
This will list any un-expandable symbols (terminals) in your grammar that
are not in the vocabulary we've provided.
Note that this check is called automatically whenever you call parse or randsent, so you will get constant warnings until you fix the illegal terminals.
Keeping Up: Training Data
Throughout the exercise, we
will make more and more "training data" available to all the
teams. This data will come from S1-sentences generated by each
team's grammar.
You will be able to find this data in your team's subdirectory generated_data.
Check back often to keep on top of what
kinds of sentences the others are generating!
In addition, we have generated examples of sentences you might want to try to design for. They are listed in the file called examples.sen in the start directory. These are just suggestions; they aren't part of the evaluation. If you want to get more ideas about how to stump the other teams, we suggest looking on the Web. It's not hard to find complicated sentences.
Questions for Discussion