Developed primarily by Noah Smith and Jason Eisner, with input from Markus Dreyer and David Smith
You do not have to write any code for this exercise!
Your system, which is in competition with those of the other teams, will consist of two sub-grammars:
One way to see what your grammar does is to generate random sentences from it. For our purposes, generation is just repeated symbol expansion. To expand a symbol such as NP, our sentence generator will randomly choose one of your grammar's NP -> ... rules, with probability proportional to the rule's weight.
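The expansion process can be sketched in a few lines of Python. The toy grammar below is purely illustrative (these are not the rules or words from Vocab.gr), and the rule weights are made up:

```python
import random

# Hypothetical toy grammar: each non-terminal maps to a list of
# (weight, right-hand side) pairs. Symbols with no rules are terminals.
GRAMMAR = {
    "S":      [(1.0, ["NP", "VP"])],
    "NP":     [(2.0, ["Det", "Noun"]), (1.0, ["Proper"])],
    "VP":     [(1.0, ["VerbT", "NP"])],
    "Det":    [(1.0, ["the"])],
    "Noun":   [(1.0, ["chalice"]), (1.0, ["castle"])],
    "Proper": [(1.0, ["Arthur"]), (1.0, ["Guinevere"])],
    "VerbT":  [(1.0, ["has"]), (1.0, ["covers"])],
}

def expand(symbol):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]                      # terminal: emit the word itself
    rules = GRAMMAR[symbol]
    weights = [w for w, _ in rules]
    # Pick one rule with probability proportional to its weight.
    _, rhs = random.choices(rules, weights=weights)[0]
    words = []
    for sym in rhs:
        words.extend(expand(sym))
    return words

print(" ".join(expand("S")))   # e.g. "Arthur has the chalice"
```

This is only a sketch of the idea; the real randsent tool reads .gr files and handles many details omitted here.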
Your task is to write the grammar rules and also choose the weight for each rule. The weights allow you to express your knowledge of which English phenomena are common and which ones are rare. By giving a low weight to a rule (or to S2 itself), you express a belief that it doesn't happen very often in English.
Another way to see what your grammar does is to parse sentences with it. You can use our parser to look at a sentence and figure out whether your grammar could have generated it -- and how, and with what probability. We say that your grammar predicts a sentence well if (according to the parser) your grammar has a comparatively high probability of generating exactly that sentence.
You will have to decide how much effort to expend on designing S1 and S2, and how much you will trust S1 relative to S2. Your goal is to describe English accurately. This means that your grammar (especially the S1 subgrammar) should be able to generate lots of syntactic constructions that show up in English. Moreover, it should have a higher probability of generating common sentences, and a low or zero probability of generating ungrammatical sentences.
The performance of your system will be measured in two ways:
A weighted context-free grammar consists of: a set of non-terminal symbols, one of which is the special start symbol (here, START), with the terminal symbols as the words; and a set of weighted derivation rules.
A derivation rule tells one way to rewrite a non-terminal symbol into a sequence of non-terminal symbols and terminal symbols. For example, S -> NP VP says that an S (perhaps indicating a declarative sentence) can be rewritten as an NP (noun phrase) followed by a VP (verb phrase).
A word about weights: Today you heard a lot about probability models. In a probabilistic CFG, each rule would have some probability associated with it, and the probability of a derivation for a sentence would be the product of all the rules that went into the derivation. We don't want you to worry about making probabilities sum up to one, so you can use any positive number as a weight.
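Since the generator picks rules "with probability proportional to the rule's weight," the effective probability of a rule is its weight divided by the total weight of all rules sharing its left-hand side. A small sketch of that normalization (the rules and weights here are hypothetical):

```python
from collections import defaultdict

# Rules as (weight, lhs, rhs) triples; weights are arbitrary positive numbers.
rules = [
    (3.0, "NP", ("Det", "Noun")),
    (1.0, "NP", ("Proper",)),
    (2.0, "S",  ("NP", "VP")),
]

# Total weight per left-hand side.
totals = defaultdict(float)
for w, lhs, _ in rules:
    totals[lhs] += w

# A rule's probability is weight / total weight of its left-hand side.
probs = {(lhs, rhs): w / totals[lhs] for w, lhs, rhs in rules}
print(probs[("NP", ("Det", "Noun"))])   # 0.75
print(probs[("NP", ("Proper",))])       # 0.25
```

So weighting one NP rule 3 and the other 1 is the same as weighting them 0.75 and 0.25; only the ratios matter.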
In the file Vocab.gr, we are giving you the set of terminal symbols (words), embedded in rules of the form Tag -> word. Note that the vocabulary is closed: there will be no unknown words in the test sentences, and you are not allowed to add any words (terminal symbols) to the grammar.
We have given equal weights to all the Tag -> word rules, but you are free to change the weights if you like. You are also free to add, remove, or change these rules, as long as you don't add any new words.
We have supplied tags in these Tag -> word rules, but you might find that the tags aren't useful when trying to extend S1 into a bigger English grammar. So you are free to create new tags for word classes that you find convenient, or to use the words directly in the rules, if you find it advantageous. We tried to keep the vocabulary relatively small to make it easier for you to do this.
You will almost certainly want to change the tags in rules of the form Misc -> word. But be careful: you don't want to change Misc -> goes to VerbT -> goes, since goes doesn't behave like other VerbT's. In particular, you want your S1 to generate "Guinevere has the chalice ." but not "Guinevere goes the chalice .", which is ungrammatical. This is why you may want to invent some new tags.
Our method of enforcing this requirement is to use a grammar that
is effectively a bigram (or finite-state) model. Suppose we only have
two tag types, A and B. The set of rules we would allow in S2 would be:

S2 -> _A
S2 -> _B
S2 ->
_A -> A
_A -> A _A
_A -> A _B
_B -> B
_B -> B _A
_B -> B _B

This grammar can generate any sequence of A's and B's, and there is no ambiguity in parsing with it: there is always a single, right-branching parse.
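With a realistic tag set this pattern needs one rule per tag pair, which is tedious to write by hand. A small hypothetical helper can emit the rules in the grammar-file format ("weight X Y1 ... Yn"); the function name and the assumption that an empty right-hand side is written as just "weight X" are ours:

```python
def make_s2_rules(tags, weight=1):
    """Emit the bigram-style S2 rule pattern for a given list of tags."""
    lines = []
    for t in tags:
        lines.append(f"{weight} S2 _{t}")          # S2 -> _T
    lines.append(f"{weight} S2")                   # S2 ->  (empty, assumed syntax)
    for t in tags:
        lines.append(f"{weight} _{t} {t}")         # _T -> T       (stop)
        for u in tags:
            lines.append(f"{weight} _{t} {t} _{u}")  # _T -> T _U  (continue)
    return lines

for line in make_s2_rules(["A", "B"]):
    print(line)
```

For n tags this produces n^2 + 2n + 1 rules; with two tags it reproduces the nine rules listed above.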
Your grammar will connect the two sub-grammars with the rules START -> S1 and START -> S2. The relative weight of these determines how likely it is that S1 (with start symbol S1) or S2 (with start symbol S2) would be selected in generating a sentence, and how costly it is to choose one or the other when parsing a sentence.
Choosing the relative weight of these two rules is a gamble. If you are over-confident in your "real" English grammar (S1), and you weight it too highly, then you risk assigning very low probability to sentences generated by a team whose grammar is more extensive (since the parser will have to resort to your S2 to get a parse).
On the other hand, if you weight S2 too highly, then you will probably do a poor job of predicting the other teams' sentences, since S2 will not make any sentences very likely (it accepts everything, so probability mass is spread very thin across the space of word strings). Of course, you can invest some effort in trying to make S2 a better n-gram model, but that's a tedious task and a risky investment.
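The gamble can be made concrete. Under the two START rules, the probability that the whole grammar generates a sentence s is a weighted mixture of S1's and S2's probabilities. A sketch with made-up weights (99 and 1 are purely illustrative):

```python
# Hypothetical weights on START -> S1 and START -> S2.
w1, w2 = 99, 1

def p_total(p1, p2):
    """Probability the whole grammar generates s, given
    p1 = P(S1 generates s) and p2 = P(S2 generates s)."""
    return (w1 * p1 + w2 * p2) / (w1 + w2)

# A sentence that S1 cannot generate at all still gets a sliver of
# probability mass through S2, so the parser never assigns zero:
print(p_total(0.0, 1e-9))   # approximately 1e-11
```

With these weights, a sentence S1 cannot parse costs you roughly a factor of 100 on top of S2's already-thin probability, which is exactly the risk described above.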
The first score evaluates your S1 grammar alone. We will generate a set of test sentences from it using randsent.
We will then make a grammaticality judgement about each sentence,
and your score will be the fraction of sentences judged to be
grammatical. To judge the grammaticality of the sentences,
we will be using the best tool available: human speakers (yourselves!).
Toward the end of the lab, we'll elicit grammaticality judgments from each of you
on a set of sentences
(which might have been generated by your grammar or someone else's -- so
be honest!).
The second score evaluates your full grammar (S1+S2) and is much more important. Here we will take the other teams' test sets (removing the sentences that human judges couldn't parse) and try to parse them with your grammar (the combination of S1 and S2). You will be able to parse all of the sentences (because S2 will accept anything, albeit with low probability). The score will be the cross-entropy of your model with respect to the test set. A low cross-entropy is good; it means that your model is unsurprised by the test sentences generated by the other teams. We will report your model's cross-entropy on each of the other teams' test sets.
If the test set of sentences is {s1, s2, s3, ...}, then your
model's cross-entropy is defined as
(-log(p(s1)) - log(p(s2)) - log(p(s3)) - ...) / (|s1| + |s2| + |s3| + ...)
where |si| is the number of words in the ith sentence.
Note that (all logs here are base 2):
-log(1)=0, -log(1/2)=1, -log(1/4)=2, -log(1/8)=3, ..., -log(0)=infinity
So a high cross-entropy corresponds to a low probability and
vice-versa. You will be severely penalized for probabilities
close to zero.
Note: p(s) denotes the probability that a randomly generated sentence is exactly s. Often your grammar will have many different ways to generate s, corresponding to many different trees. In that case, p(s) is the total probability of all those trees. However, for convenience, we approximate this with the probability of the single most likely tree, because that is what our parser happens to find. So we are really only measuring approximate cross-entropy.
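The cross-entropy formula above can be computed directly (base-2 logs, matching the parser's bits-per-word reporting):

```python
import math

def cross_entropy(probs, lengths):
    """Cross-entropy in bits per word over a test set.
    probs[i] is p(s_i) under your grammar; lengths[i] is |s_i| in words."""
    total_bits = sum(-math.log2(p) if p > 0 else math.inf for p in probs)
    return total_bits / sum(lengths)

# Hypothetical test set of two sentences, with probabilities 1/8 and 1/4
# and lengths 3 and 2 words:
print(cross_entropy([1/8, 1/4], [3, 2]))   # (3 + 2) / (3 + 2) = 1.0 bit/word
```

Note how a single zero-probability sentence drives the score to infinity, which is why S2's catch-all coverage matters so much.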
We want to draw attention to a blunder made by some previous participants.
Remember the part about weighting the START -> S1 and START -> S2 rules? Few teams change those weights from their defaults, even though doing so helps their cross-entropy scores a lot.
There are two file formats to know about. The first is for sentences generated by a grammar, or to be parsed by a grammar. We name files in this format with the suffix .sen. Simply put one sentence per line, with spaces between tokens. Make sure you put spaces before and after punctuation symbols: for example, Arthur, will look like one token to the parser, while Arthur , is what you want (two tokens).
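If you prepare test sentences from raw text, a tiny helper can insert the required spaces around punctuation (to_sen_line is a hypothetical name, not one of the provided scripts):

```python
import re

def to_sen_line(text):
    """Separate words and punctuation with spaces, .sen-style.
    Matches runs of word characters, or any single non-space symbol."""
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

print(to_sen_line("Arthur, king of the Britons."))
# Arthur , king of the Britons .
```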
The file format for the grammar is also quite simple. We name these files with the suffix .gr. On a given line, anything following # is an ignored comment. Each rule goes on one line, in the format:

weight X Y1 Y2 ... Yn

signifying the rule X -> Y1 Y2 ... Yn (where X is a non-terminal and the Ys are any mix of terminals and non-terminals), and it gets weight weight.
Be careful not to name any of your non-terminals the same as any terminals, since the parser will assume you intended to conflate the symbols. Blank lines will be ignored. If you accidentally include the same rule more than once, its weight will be the sum of the weights in the grammar file(s).
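The conventions just described (comments, blank lines, summed weights for duplicate rules) can be sketched as a small reader; read_grammar is a hypothetical helper for illustration, not one of the provided tools:

```python
from collections import defaultdict

def read_grammar(lines):
    """Read .gr-format lines: 'weight X Y1 ... Yn'.
    Strips # comments, skips blank lines, sums duplicate-rule weights."""
    rules = defaultdict(float)
    for line in lines:
        line = line.split("#", 1)[0].strip()    # drop comment, whitespace
        if not line:
            continue                            # ignore blank lines
        fields = line.split()
        weight, lhs, rhs = float(fields[0]), fields[1], tuple(fields[2:])
        rules[(lhs, rhs)] += weight             # duplicates accumulate
    return dict(rules)

g = read_grammar([
    "1 S NP VP   # a comment",
    "",
    "2 S NP VP",            # duplicate rule: weights sum to 3
    "1 NP Det Noun",
])
print(g[("S", ("NP", "VP"))])   # 3.0
```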
From here on, .gr files refer to grammar files, and .sen files refer to sentence files.
cat grammar-file(s) | randsent [ -t ] [ -n num-sentences ] [ -s start-symbol ]
- -t : produce tree output instead of flat sentences.
- -n num-sentences : e.g., -n 10 will generate ten sentences instead of just one.
- -s start-symbol : The generator's start symbol is START by default. You may want to specify another so you can test just part of the grammar (e.g., S1).
- -g single-grammar-file : Instead of reading the grammar files from standard input, read them from a single file.
- grammar-files : Typically *.gr. In general, a list of .gr files containing the parts of the grammar. Different parts may be written by different people.
So to generate 17 trees from your entire merged grammar (S1 and S2 combined),
and to make the trees "pretty," you might run:
cat *.gr | randsent -t -n 17 -s START | prettyprint
If your grammar is in a single file fullgrammar.gr
, you may equivalently run:
randsent -t -n 17 -s START -g fullgrammar.gr | prettyprint
You can also list more than one grammar file:
randsent -t -n 17 -s START -g fullgrammar.gr extrarules.gr | prettyprint
Earley Parser (and pre/post-processing scripts):
The parser was written in
Dyna and C++.
To use the parser:
parse -g grammar-file(s) -i sentence-file(s)
You can also pipe grammar files or sentence files in through standard input (not both; if you don't specify -g or -i, the program won't know what's what):
cat grammar-file(s) | parse -i sentence-file(s)
cat sentence-file(s) | parse -g grammar-file(s)
Many other options are available:
- sentence-file(s) : Names of files with one input sentence per line.
- -g grammar-file : Name of the file containing the grammar.
- -s start-symbol : The parser's default start symbol is START. You may want to specify another.
- -h : Prints help, including details on many useful options.
It is expected that your grammar will be in multiple files. If you follow our convention and name all of these with the suffix .gr, you can parse with something like:
cat *.gr | parse -i sentence-file(s)
which will take sentences from sentence-file(s) and use all your grammar
files.
If a sentence fails to parse, its output will be failure on a single line. If the parse is successful, you
will see the single best-scoring (highest-weight) parse on one line. Additional
options let you see the probability of the sentence, the probability
of the best parse, the number of parses the sentence has under the grammar,
and the cross-entropy (in bits per word) of your sentence file.
You will get an error if your grammar includes additional vocabulary, and a warning will print
if your sentences contain words not in the grammar.
If you want to make the parses a bit more readable, pipe the output of parse to prettyprint:
parse -i sentence-file -g grammar-file | prettyprint
If you want
to see how you're doing in terms of cross-entropy:
parse -i sentence-file -g grammar-file -nC
(The -n
suppresses the parses, and the -C
option indicates cross-entropy.)
Checking for new vocabulary:
New vocabulary is strictly forbidden! Sometimes, though, you might
add a non-terminal on the right-hand side of a rule and forget to give
its expansion rules. Or you might forget exactly which words are in the
vocabulary. To check for new terminals:
cat grammar-files | check-for-new-terms
This will list any un-expandable symbols (terminals) in your grammar that
are not in the vocabulary we've provided.
Note that this check is called automatically whenever you call parse or randsent, so you will get constant warnings until you fix the illegal terminals.
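The check itself is simple to describe: any symbol that appears on a right-hand side but never on a left-hand side is a terminal, and the legal terminals are exactly the provided vocabulary. A sketch of this logic (our assumed reading of what check-for-new-terms does, not its actual implementation):

```python
def unexpandable_symbols(rules, vocab):
    """Return symbols used on a right-hand side that have no expansion
    rules and are not in the supplied vocabulary."""
    lhs_syms = {lhs for lhs, _ in rules}
    rhs_syms = {sym for _, rhs in rules for sym in rhs}
    terminals = rhs_syms - lhs_syms          # no rule expands these
    return sorted(terminals - vocab)         # flag any not in the vocab

rules = [("S", ("NP", "VerbT")), ("NP", ("Arthur",)), ("VerbT", ("has",))]
print(unexpandable_symbols(rules, {"Arthur", "has"}))   # []
print(unexpandable_symbols(rules, {"Arthur"}))          # ['has']
```

This also catches the forgotten-expansion mistake: a non-terminal you mention on a right-hand side but never define looks exactly like an illegal terminal.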
Keeping Up: Training Data
Throughout the exercise, we
will make more and more "training data" available to all the
teams. This data will come from S1-sentences generated by each
team's grammar.
You will be able to find this data in your team's subdirectory generated_data.
Check back often to keep on top of what
kinds of sentences the others are generating!
In addition, we
have generated examples of sentences you might want to try to design for.
They
are listed in the file called examples.sen
in the start
directory.
These are just
suggestions; they aren't part of the evaluation. If you want to get more ideas
about how to stump the other teams, we suggest looking on the Web. It's not
hard to find complicated sentences.
Questions for Discussion
- What were some of the linguistic phenomena you handled,
and how?
- If s is a sentence, and p1(s) and p2(s) denote the
respective probabilities that S1 and S2 generate s,
then what is the probability that the whole grammar
generates s?
- How could you improve the probabilities assigned by S2,
without losing the guarantee that S2 can generate any
string of words?
- What would happen if you didn't have S2?
- Is there any other way to combine S1 and S2 besides the
one we recommended?
- Suppose you strategically decided to make your grammar devote
most of its probability to a really unusual sentence. Who would
that hurt more - your team or the other teams?
- Suppose you were leading a 6-week (or 6-year) team effort to
build a grammar for English. How would you
organize the effort?
How would you track your progress? What kinds of tools
or other resources would you want to develop?
- How does your answer to the previous question change if the
language is not English? Does it matter what family of language
it is? What if you and your teammates don't speak it?
- Your grammar scores well when
it finds a parse of high probability. But S1 might produce several
different parses of a sentence, dividing the probability among them.
Is the scoring method fair?