Developed primarily by Noah Smith and Jason Eisner, with input from Markus Dreyer and David Smith
You do not have to write any code for this exercise!
Your system, which is in competition with those of the other teams, will consist of two sub-grammars:
One way to see what your grammar does is to generate random sentences from it. For our purposes, generation is just repeated symbol expansion. To expand a symbol such as NP, our sentence generator will randomly choose one of your grammar's NP -> ... rules, with probability proportional to the rule's weight.
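The expansion process can be sketched in a few lines of Python. The toy grammar below is purely illustrative (these are not the rules or words from Vocab.gr), and the rule weights are made up:

```python
import random

# Hypothetical toy grammar: each non-terminal maps to a list of
# (weight, right-hand side) pairs. Symbols with no rules are terminals.
GRAMMAR = {
    "S":      [(1.0, ["NP", "VP"])],
    "NP":     [(2.0, ["Det", "Noun"]), (1.0, ["Proper"])],
    "VP":     [(1.0, ["VerbT", "NP"])],
    "Det":    [(1.0, ["the"])],
    "Noun":   [(1.0, ["chalice"]), (1.0, ["castle"])],
    "Proper": [(1.0, ["Arthur"]), (1.0, ["Guinevere"])],
    "VerbT":  [(1.0, ["has"]), (1.0, ["covers"])],
}

def expand(symbol):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]                      # terminal: emit the word itself
    rules = GRAMMAR[symbol]
    weights = [w for w, _ in rules]
    # Pick one rule with probability proportional to its weight.
    _, rhs = random.choices(rules, weights=weights)[0]
    words = []
    for sym in rhs:
        words.extend(expand(sym))
    return words

print(" ".join(expand("S")))   # e.g. "Arthur has the chalice"
```

This is only a sketch of the idea; the real randsent tool reads .gr files and handles many details omitted here.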
Your task is to write the grammar rules and also choose the weight for each rule. The weights allow you to express your knowledge of which English phenomena are common and which ones are rare. By giving a low weight to a rule (or to S2 itself), you express a belief that it doesn't happen very often in English.
Another way to see what your grammar does is to parse sentences with it. You can use our parser to look at a sentence and figure out whether your grammar could have generated it -- and how, and with what probability. We say that your grammar predicts a sentence well if (according to the parser) your grammar has a comparatively high probability of generating exactly that sentence.
You will have to decide how much effort to expend on designing S1 and S2, and how much you will trust S1 relative to S2. Your goal is to describe English accurately. This means that your grammar (especially the S1 subgrammar) should be able to generate lots of syntactic constructions that show up in English. Moreover, it should have a higher probability of generating common sentences, and a low or zero probability of generating ungrammatical sentences.
The performance of your system will be measured in two ways:
A weighted context-free grammar consists of: a set of non-terminal symbols, one of which is the special start symbol (here, START), with the terminal symbols as the words; and a set of weighted derivation rules.
A derivation rule tells one way to rewrite a non-terminal symbol into a sequence of non-terminal symbols and terminal symbols. For example, S -> NP VP says that an S (perhaps indicating a declarative sentence) can be rewritten as an NP (noun phrase) followed by a VP (verb phrase).
A word about weights: Today you heard a lot about probability models. In a probabilistic CFG, each rule would have some probability associated with it, and the probability of a derivation for a sentence would be the product of all the rules that went into the derivation. We don't want you to worry about making probabilities sum up to one, so you can use any positive number as a weight.
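Since the generator picks rules "with probability proportional to the rule's weight," the effective probability of a rule is its weight divided by the total weight of all rules sharing its left-hand side. A small sketch of that normalization (the rules and weights here are hypothetical):

```python
from collections import defaultdict

# Rules as (weight, lhs, rhs) triples; weights are arbitrary positive numbers.
rules = [
    (3.0, "NP", ("Det", "Noun")),
    (1.0, "NP", ("Proper",)),
    (2.0, "S",  ("NP", "VP")),
]

# Total weight per left-hand side.
totals = defaultdict(float)
for w, lhs, _ in rules:
    totals[lhs] += w

# A rule's probability is weight / total weight of its left-hand side.
probs = {(lhs, rhs): w / totals[lhs] for w, lhs, rhs in rules}
print(probs[("NP", ("Det", "Noun"))])   # 0.75
print(probs[("NP", ("Proper",))])       # 0.25
```

So weighting one NP rule 3 and the other 1 is the same as weighting them 0.75 and 0.25; only the ratios matter.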
In the file Vocab.gr, we are giving you the set of terminal symbols (words), embedded in rules of the form Tag -> word. Note that the vocabulary is closed: there will be no unknown words in the test sentences, and you are not allowed to add any words (terminal symbols) to the grammar.
We have given equal weights to all the Tag -> word rules, but you are free to change the weights if you like. You are also free to add, remove, or change these rules, as long as you don't add any new words.
We have supplied tags in these Tag -> word rules, but you might find that the tags aren't useful when trying to extend S1 into a bigger English grammar. So you are free to create new tags for word classes that you find convenient, or to use the words directly in the rules, if you find it advantageous. We tried to keep the vocabulary relatively small to make it easier for you to do this.
You will almost certainly want to change the tags in rules of the form Misc -> word. But be careful: you don't want to change Misc -> goes to VerbT -> goes, since goes doesn't behave like other VerbT's. In particular, you want your S1 to generate "Guinevere has the chalice ." but not "Guinevere goes the chalice .", which is ungrammatical. This is why you may want to invent some new tags.
Our method of enforcing this requirement is to use a grammar that
is effectively a bigram (or finite-state) model. Suppose we only have
two tag types, A and B. The set of rules we would allow in S2 would be:

S2 -> _A
S2 -> _B
S2 ->
_A -> A
_A -> A _A
_A -> A _B
_B -> B
_B -> B _A
_B -> B _B

This grammar can generate any sequence of A's and B's, and there is no ambiguity in parsing with it: there is always a single, right-branching parse.
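With a realistic tag set this pattern needs one rule per tag pair, which is tedious to write by hand. A small hypothetical helper can emit the rules in the grammar-file format ("weight X Y1 ... Yn"); the function name and the assumption that an empty right-hand side is written as just "weight X" are ours:

```python
def make_s2_rules(tags, weight=1):
    """Emit the bigram-style S2 rule pattern for a given list of tags."""
    lines = []
    for t in tags:
        lines.append(f"{weight} S2 _{t}")          # S2 -> _T
    lines.append(f"{weight} S2")                   # S2 ->  (empty, assumed syntax)
    for t in tags:
        lines.append(f"{weight} _{t} {t}")         # _T -> T       (stop)
        for u in tags:
            lines.append(f"{weight} _{t} {t} _{u}")  # _T -> T _U  (continue)
    return lines

for line in make_s2_rules(["A", "B"]):
    print(line)
```

For n tags this produces n^2 + 2n + 1 rules; with two tags it reproduces the nine rules listed above.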
Your grammar will connect the two sub-grammars with the rules START -> S1 and START -> S2. The relative weight of these determines how likely it is that S1 (with start symbol S1) or S2 (with start symbol S2) would be selected in generating a sentence, and how costly it is to choose one or the other when parsing a sentence.
Choosing the relative weight of these two rules is a gamble. If you are over-confident in your "real" English grammar (S1), and you weight it too highly, then you risk assigning very low probability to sentences generated by a team whose grammar is more extensive (since the parser will have to resort to your S2 to get a parse).
On the other hand, if you weight S2 too highly, then you will probably do a poor job of predicting the other teams' sentences, since S2 will not make any sentences very likely (it accepts everything, so probability mass is spread very thin across the space of word strings). Of course, you can invest some effort in trying to make S2 a better n-gram model, but that's a tedious task and a risky investment.
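The gamble can be made concrete. Under the two START rules, the probability that the whole grammar generates a sentence s is a weighted mixture of S1's and S2's probabilities. A sketch with made-up weights (99 and 1 are purely illustrative):

```python
# Hypothetical weights on START -> S1 and START -> S2.
w1, w2 = 99, 1

def p_total(p1, p2):
    """Probability the whole grammar generates s, given
    p1 = P(S1 generates s) and p2 = P(S2 generates s)."""
    return (w1 * p1 + w2 * p2) / (w1 + w2)

# A sentence that S1 cannot generate at all still gets a sliver of
# probability mass through S2, so the parser never assigns zero:
print(p_total(0.0, 1e-9))   # approximately 1e-11
```

With these weights, a sentence S1 cannot parse costs you roughly a factor of 100 on top of S2's already-thin probability, which is exactly the risk described above.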
The first score evaluates your S1 grammar alone. We will generate a set of test sentences from it using randsent.
We will then make a grammaticality judgement about each sentence,
and your score will be the fraction of sentences judged to be
grammatical. To judge the grammaticality of the sentences,
we will be using the best tool available: human speakers (yourselves!).
Toward the end of the lab, we'll elicit grammaticality judgments from each of you
on a set of sentences
(which might have been generated by your grammar or someone else's -- so
be honest!).
The second score evaluates your full grammar (S1+S2) and is much more important. Here we will take the other teams' test sets (removing the sentences that human judges couldn't parse) and try to parse them with your grammar (the combination of S1 and S2). You will be able to parse all of the sentences (because S2 will accept anything, albeit with low probability). The score will be the cross-entropy of your model with respect to the test set. A low cross-entropy is good; it means that your model is unsurprised by the test sentences generated by the other teams. We will report your model's cross-entropy on each of the other teams' test sets.
If the test set of sentences is {s1, s2, s3, ...}, then your
model's cross-entropy is defined as
(-log(p(s1)) - log(p(s2)) - log(p(s3)) - ...) / (|s1| + |s2| + |s3| + ...)
where |si| is the number of words in the ith sentence.
Note that (all logs here are base 2):
-log(1)=0, -log(1/2)=1, -log(1/4)=2, -log(1/8)=3, ..., -log(0)=infinity
So a high cross-entropy corresponds to a low probability and
vice-versa. You will be severely penalized for probabilities
close to zero.
Note: p(s) denotes the probability that a randomly generated sentence is exactly s. Often your grammar will have many different ways to generate s, corresponding to many different trees. In that case, p(s) is the total probability of all those trees. However, for convenience, we approximate this with the probability of the single most likely tree, because that is what our parser happens to find. So we are really only measuring approximate cross-entropy.
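The cross-entropy formula above can be computed directly (base-2 logs, matching the parser's bits-per-word reporting):

```python
import math

def cross_entropy(probs, lengths):
    """Cross-entropy in bits per word over a test set.
    probs[i] is p(s_i) under your grammar; lengths[i] is |s_i| in words."""
    total_bits = sum(-math.log2(p) if p > 0 else math.inf for p in probs)
    return total_bits / sum(lengths)

# Hypothetical test set of two sentences, with probabilities 1/8 and 1/4
# and lengths 3 and 2 words:
print(cross_entropy([1/8, 1/4], [3, 2]))   # (3 + 2) / (3 + 2) = 1.0 bit/word
```

Note how a single zero-probability sentence drives the score to infinity, which is why S2's catch-all coverage matters so much.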
We want to draw attention to a blunder made by some previous participants.
Remember the part about weighting the START -> S1 and START -> S2 rules? Few teams change those weights from their defaults, even though doing so helps their cross-entropy scores a lot.
There are two file formats to know about. The first is for sentences generated by a grammar, or to be parsed by a grammar. We name files in this format with the suffix .sen. Simply put one sentence per line, with spaces between tokens. Make sure you put spaces before and after punctuation symbols: for example, Arthur, will look like one token to the parser, while Arthur , is what you want (two tokens).
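If you prepare test sentences from raw text, a tiny helper can insert the required spaces around punctuation (to_sen_line is a hypothetical name, not one of the provided scripts):

```python
import re

def to_sen_line(text):
    """Separate words and punctuation with spaces, .sen-style.
    Matches runs of word characters, or any single non-space symbol."""
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

print(to_sen_line("Arthur, king of the Britons."))
# Arthur , king of the Britons .
```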
The file format for the grammar is also quite simple. We name these files with the suffix .gr. On a given line, anything following # is an ignored comment. Each rule goes on one line, in the format:

weight X Y1 Y2 ... Yn

signifying the rule X -> Y1 Y2 ... Yn (where X is a non-terminal and the Ys are any mix of terminals and non-terminals), and it gets weight weight.
Be careful not to name any of your non-terminals the same as any terminals, since the parser will assume you intended to conflate the symbols. Blank lines will be ignored. If you accidentally include the same rule more than once, its weight will be the sum of the weights in the grammar file(s).
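The conventions just described (comments, blank lines, summed weights for duplicate rules) can be sketched as a small reader; read_grammar is a hypothetical helper for illustration, not one of the provided tools:

```python
from collections import defaultdict

def read_grammar(lines):
    """Read .gr-format lines: 'weight X Y1 ... Yn'.
    Strips # comments, skips blank lines, sums duplicate-rule weights."""
    rules = defaultdict(float)
    for line in lines:
        line = line.split("#", 1)[0].strip()    # drop comment, whitespace
        if not line:
            continue                            # ignore blank lines
        fields = line.split()
        weight, lhs, rhs = float(fields[0]), fields[1], tuple(fields[2:])
        rules[(lhs, rhs)] += weight             # duplicates accumulate
    return dict(rules)

g = read_grammar([
    "1 S NP VP   # a comment",
    "",
    "2 S NP VP",            # duplicate rule: weights sum to 3
    "1 NP Det Noun",
])
print(g[("S", ("NP", "VP"))])   # 3.0
```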
From here on, .gr files refer to grammar files, and .sen files refer to sentence files.
cat grammar-file(s) | randsent [ -t ] [ -n num-sentences ] [ -s start-symbol ]
- -t : produce tree output instead of flat sentences.
- -n num-sentences : e.g., -n 10 will generate ten sentences instead of just one.
- -s start-symbol : The generator's start symbol is START by default. You may want to specify another so you can test just part of the grammar (e.g., S1).
- -g single-grammar-file : Instead of reading the grammar files from standard input, read them from a single file.
- grammar-files : Typically *.gr. In general, a list of .gr files containing the parts of the grammar. Different parts may be written by different people.
So to generate 17 trees from your entire merged grammar (S1 and S2 combined),
and to make the trees "pretty," you might run:
cat *.gr | randsent -t -n 17 -s START | prettyprint
If your grammar is in a single file fullgrammar.gr
, you may equivalently run:
randsent -t -n 17 -s START -g fullgrammar.gr | prettyprint
You can also list more than one grammar file:
randsent -t -n 17 -s START -g fullgrammar.gr extrarules.gr | prettyprint
Earley Parser (and pre/post-processing scripts):
The parser was written in
Dyna and C++.
To use the parser:
parse -g grammar-file(s) -i sentence-file(s)
You can also pipe grammar files or sentence files in through standard input (not both; if you don't specify -g or -i, the program won't know what's what):
cat grammar-file(s) | parse -i sentence-file(s)
cat sentence-file(s) | parse -g grammar-file(s)
Many other options are available:
- sentence-file(s) : Names of files with one input sentence per line.
- -g grammar-file : Name of the file containing the grammar.
- -s start-symbol : The parser's default start symbol is START. You may want to specify another.
- -h : Prints help, including details on many useful options.
It is expected that your grammar will be in multiple files. If you follow our convention and name all of these with the suffix .gr, you can parse with something like:
cat *.gr | parse -i sentence-file(s)
which will take sentences from sentence-file(s) and use all your grammar
files.
If a sentence fails to parse, its output will be failure on a single line. If the parse is successful, you
will see the single best-scoring (highest-weight) parse on one line. Additional
options let you see the probability of the sentence, the probability
of the best parse, the number of parses the sentence has under the grammar,
and the cross-entropy (in bits per word) of your sentence file.
You will get an error if your grammar includes additional vocabulary, and a warning will print
if your sentences contain words not in the grammar.
If you want to make the parses a bit more readable, pipe the output of parse to prettyprint:
parse -i sentence-file -g grammar-file | prettyprint
If you want
to see how you're doing in terms of cross-entropy:
parse -i sentence-file -g grammar-file -nC
(The -n
suppresses the parses, and the -C
option indicates cross-entropy.)
Checking for new vocabulary:
New vocabulary is strictly forbidden! Sometimes, though, you might
add a non-terminal on the right-hand side of a rule and forget to give
its expansion rules. Or you might forget exactly which words are in the
vocabulary. To check for new terminals:
cat grammar-files | check-for-new-terms
This will list any un-expandable symbols (terminals) in your grammar that
are not in the vocabulary we've provided.
Note that this check is called automatically whenever you call parse or randsent, so you will get constant warnings until you fix the illegal terminals.
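The check itself is simple to describe: any symbol that appears on a right-hand side but never on a left-hand side is a terminal, and the legal terminals are exactly the provided vocabulary. A sketch of this logic (our assumed reading of what check-for-new-terms does, not its actual implementation):

```python
def unexpandable_symbols(rules, vocab):
    """Return symbols used on a right-hand side that have no expansion
    rules and are not in the supplied vocabulary."""
    lhs_syms = {lhs for lhs, _ in rules}
    rhs_syms = {sym for _, rhs in rules for sym in rhs}
    terminals = rhs_syms - lhs_syms          # no rule expands these
    return sorted(terminals - vocab)         # flag any not in the vocab

rules = [("S", ("NP", "VerbT")), ("NP", ("Arthur",)), ("VerbT", ("has",))]
print(unexpandable_symbols(rules, {"Arthur", "has"}))   # []
print(unexpandable_symbols(rules, {"Arthur"}))          # ['has']
```

This also catches the forgotten-expansion mistake: a non-terminal you mention on a right-hand side but never define looks exactly like an illegal terminal.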
Keeping Up: Training Data
Throughout the exercise, we
will make more and more "training data" available to all the
teams. This data will come from S1-sentences generated by each
team's grammar.
You will be able to find this data in your team's subdirectory generated_data.
Check back often to keep on top of what
kinds of sentences the others are generating!
In addition, we
have generated examples of sentences you might want to try to design for.
They
are listed in the file called examples.sen
in the start
directory.
These are just
suggestions; they aren't part of the evaluation. If you want to get more ideas
about how to stump the other teams, we suggest looking on the Web. It's not
hard to find complicated sentences.
Questions for Discussion
- What were some of the linguistic phenomena you handled,
and how?
- If s is a sentence, and p1(s) and p2(s) denote the
respective probabilities that S1 and S2 generate s,
then what is the probability that the whole grammar
generates s?
- How could you improve the probabilities assigned by S2,
without losing the guarantee that S2 can generate any
string of words?
- What would happen if you didn't have S2?
- Is there any other way to combine S1 and S2 besides the
one we recommended?
- Suppose you strategically decided to make your grammar devote
most of its probability to a really unusual sentence. Who would
that hurt more - your team or the other teams?
- Suppose you were leading a 6-week (or 6-year) team effort to
build a grammar for English. How would you
organize the effort?
How would you track your progress? What kinds of tools
or other resources would you want to develop?
- How does your answer to the previous question change if the
language is not English? Does it matter what family of language
it is? What if you and your teammates don't speak it?
- Your grammar scores well when
it finds a parse of high probability. But S1 might produce several
different parses of a sentence, dividing the probability among them.
Is the scoring method fair?