Further Resources on Log-Linear Models

by Jason Eisner and Frank Ferraro (2013)

This page points to some resources on log-linear modeling. They accompany the interactive visualization described in Ferraro & Eisner (2013), A virtual manipulative for learning log-linear models. Suggested additions are welcome.

Log-Linear Software

Here is some recommended open-source software that you can use to build your own log-linear models.

Pencil-and-Paper Exercises

[We will place some practice problems here from Jason's NLP class. We would also be happy to link to exercises from other NLP classes.]

Homework Projects

[We will link here to an assignment from Jason's NLP class. We would also be happy to link to projects from other NLP classes.]

Further Reading

One good introduction is the handout that goes along with our visualization.

Noah Smith's tutorial offers a more mathematical description of log-linear models, including how maximizing conditional log-likelihood (as in our visualization) arises as the dual problem of maximizing entropy. His book, Linguistic Structure Prediction, discusses log-linear models for structure prediction (see especially sections 3.4 and 3.5).
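To make the training objective concrete, here is a minimal Python sketch of a conditional log-linear model fit by gradient ascent on conditional log-likelihood. The feature function, the toy data, the step size, and all function names are illustrative assumptions, not taken from the tutorial or from our visualization. The gradient is the observed feature counts minus the feature counts expected under the model; at the maximum the two agree, which is the constraint that appears in the maximum-entropy formulation.

```python
import math

def features(x, y):
    """Hypothetical feature function: one indicator per (input word, label) pair."""
    return {(word, y): 1.0 for word in x}

def cond_probs(theta, x, labels):
    """p(y | x) = exp(theta . f(x, y)) / Z(x), computed stably."""
    scores = {y: sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())
              for y in labels}
    m = max(scores.values())                      # subtract the max to avoid overflow
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())                        # normalizer Z(x), up to the factor exp(m)
    return {y: e / z for y, e in exps.items()}

def log_likelihood_and_gradient(theta, data, labels):
    """Conditional log-likelihood and its gradient (observed minus expected features)."""
    ll, grad = 0.0, {}
    for x, y_obs in data:
        p = cond_probs(theta, x, labels)
        ll += math.log(p[y_obs])
        for k, v in features(x, y_obs).items():   # observed feature counts
            grad[k] = grad.get(k, 0.0) + v
        for y in labels:                          # expected feature counts under the model
            for k, v in features(x, y).items():
                grad[k] = grad.get(k, 0.0) - p[y] * v
    return ll, grad

def train(data, labels, steps=200, rate=0.5):
    """Plain gradient ascent; the step count and rate are arbitrary choices for this toy data."""
    theta = {}
    for _ in range(steps):
        _, grad = log_likelihood_and_gradient(theta, data, labels)
        for k, g in grad.items():
            theta[k] = theta.get(k, 0.0) + rate * g
    return theta

# Toy training set (invented for illustration).
data = [(("great", "movie"), "pos"),
        (("dull", "movie"), "neg"),
        (("great", "plot"), "pos")]
labels = ["pos", "neg"]

theta = train(data, labels)
probs = cond_probs(theta, ("great", "film"), labels)
```

Here `probs` is the model's distribution over labels for an input containing one unseen word ("film"), which fires no learned weights, so the prediction rests on the weights learned for "great". A practical implementation would normally add a regularizer and use a better optimizer.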

Charles Elkan has very readable notes on log-linear models and related concepts, with a bibliography. His CIKM 2008 video tutorial comes with notes. Computational and optimization aspects are covered, and grounded in logistic regression examples and conditional random field (CRF) tagging. Hanna Wallach also offers an introduction to CRFs and efficient computation for linear-chain CRFs.

Jason Eisner has teaching slides (pdf) on using conditional log-linear models for structured prediction problems like sequence tagging and parsing, where the number of output categories y is very large. These slides also introduce the structured perceptron, a related technique. They assume familiarity with the simpler cases covered in our visualization, as well as with dynamic programming algorithms for tagging and parsing.
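As a rough illustration of the structured perceptron (a sketch under stated assumptions, not the presentation in the slides), the Python code below trains a tiny tagger. The tag set, the emission and transition features, and the three training sentences are invented for the example, and decoding enumerates every tag sequence by brute force, where a real tagger would use dynamic programming. The update resembles the log-linear gradient, except that the expected feature vector is replaced by the feature vector of the single best-scoring output.

```python
from collections import defaultdict
from itertools import product

TAGS = ["D", "N", "V"]  # hypothetical tag set: determiner, noun, verb

def feats(words, tags):
    """Feature counts f(x, y): emission and tag-bigram transition indicators."""
    f = defaultdict(float)
    prev = "<s>"
    for word, tag in zip(words, tags):
        f[("emit", word, tag)] += 1.0
        f[("trans", prev, tag)] += 1.0
        prev = tag
    return f

def score(theta, words, tags):
    """Linear score theta . f(x, y)."""
    return sum(theta[k] * v for k, v in feats(words, tags).items())

def decode(theta, words):
    """Best tag sequence by brute-force enumeration (exponential; for illustration only)."""
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: score(theta, words, tags))

def train_perceptron(data, max_epochs=100):
    """On each mistake, add the gold features and subtract the guessed features."""
    theta = defaultdict(float)
    for _ in range(max_epochs):
        mistakes = 0
        for words, gold in data:
            guess = decode(theta, words)
            if guess != gold:
                mistakes += 1
                for k, v in feats(words, gold).items():
                    theta[k] += v
                for k, v in feats(words, guess).items():
                    theta[k] -= v
        if mistakes == 0:  # a full pass with no errors: stop early
            break
    return theta

# Toy training data (invented for illustration).
data = [(["the", "dog", "barks"], ("D", "N", "V")),
        (["the", "cat", "sleeps"], ("D", "N", "V")),
        (["dogs", "bark"], ("N", "V"))]

theta = train_perceptron(data)
predictions = [decode(theta, words) for words, _ in data]
```

Because this toy data can be separated using the emission features alone, training stops once a full pass makes no mistakes, and `predictions` then matches the gold tag sequences.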

For links into the research literature, we quote from section 8 of our paper (Ferraro & Eisner, 2013):

At the time of writing, 3266 papers in the ACL Anthology mention log-linear models, with 137 using “log-linear,” “maximum entropy” or “maxent” in the paper title. These cover a wide range of applications that can be considered in lectures or homework projects.

Early papers may cover the most fundamental applications and the clearest motivation. Conditional log-linear models were first popularized in computational linguistics by a group of researchers associated with the IBM speech and language group, who called them “maximum entropy models,” after a principle that can be used to motivate their form (Jaynes, 1957). They applied the method to various binary or multiclass classification problems in NLP, such as prepositional phrase attachment (Ratnaparkhi et al., 1994), text categorization (Nigam et al., 1999), and boundary prediction (Beeferman et al., 1999).

Log-linear models can also be used for structured prediction problems in NLP such as tagging, parsing, chunking, segmentation, and language modeling. A simple strategy is to reduce structured prediction to a sequence of multiclass predictions, which can be individually made with a conditional log-linear model (Ratnaparkhi, 1998). A more fully probabilistic approach, used in the original “maximum entropy” papers, is to use equation (1) to define the conditional probabilities of the steps in a generative process that gradually produces the structure (Rosenfeld, 1994; Berger et al., 1996). (Even predicting the single next word in a sentence can be broken down into a sequence of binary decisions in this way. This avoids normalizing over the large vocabulary (Mnih & Hinton, 2008).) This idea remains popular today and can be used to embed rich distributions into a variety of generative models (Berg-Kirkpatrick et al., 2010). For example, a PCFG that uses richly annotated nonterminals involves a large number of context-free rules. Rather than estimating their probabilities separately, or with traditional backoff smoothing, a better approach is to use equation (1) to model the probability of all rules given their left-hand sides, based on features that consider attributes of the nonterminals. (E.g., case, number, gender, tense, aspect, mood, lexical head. In the case of a terminal rule, the spelling or morphology of the terminal symbol can be considered.)
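The PCFG idea in the passage above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the grammar fragment, the bracketed number annotations, the rightmost-child head convention, and the feature names are all assumptions made for the example. Each rule's probability is a log-linear distribution normalized over the rules that share its left-hand side, so a single weight on a "number agrees" feature is shared by every annotated nonterminal rather than estimated separately per rule.

```python
import math
from collections import defaultdict

# Hypothetical grammar fragment with number-annotated nonterminals.
RULES = [
    ("NP[sg]", ("Det", "N[sg]")),
    ("NP[sg]", ("Det", "N[pl]")),
    ("NP[pl]", ("Det", "N[pl]")),
    ("NP[pl]", ("Det", "N[sg]")),
]

def number_of(symbol):
    """Return 'sg' or 'pl' if the symbol carries a number annotation, else None."""
    if symbol.endswith("[sg]"):
        return "sg"
    if symbol.endswith("[pl]"):
        return "pl"
    return None

def rule_features(lhs, rhs):
    """Features look at attributes of the nonterminals, not at the rule's identity."""
    head = rhs[-1]  # assumption: the head is the rightmost child
    agrees = number_of(lhs) == number_of(head)
    return {"number_agrees": 1.0 if agrees else 0.0,
            ("head_number", number_of(head)): 1.0}

def rule_probs(theta, rules):
    """p(rhs | lhs) = exp(theta . f(lhs, rhs)) / Z(lhs), normalized per left-hand side."""
    by_lhs = defaultdict(list)
    for lhs, rhs in rules:
        s = sum(theta.get(k, 0.0) * v for k, v in rule_features(lhs, rhs).items())
        by_lhs[lhs].append((rhs, math.exp(s)))
    probs = {}
    for lhs, scored in by_lhs.items():
        z = sum(u for _, u in scored)  # Z(lhs): sum over rules with this left-hand side
        for rhs, u in scored:
            probs[(lhs, rhs)] = u / z
    return probs

theta = {"number_agrees": 2.0}  # one shared weight; its value is chosen arbitrarily
probs = rule_probs(theta, RULES)
```

With this single weight, the agreeing expansion of NP[sg] and the agreeing expansion of NP[pl] both receive probability e^2 / (e^2 + 1), about 0.88, even though no parameter mentions either rule individually.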

The most direct approach to structured prediction is to simply predict the structured output all at once, so that y is a large structured object with many features. This is conceptually natural but means that the normalizer Z(x) involves summing over a large space 𝒴(x). One can restrict 𝒴(x) before training (Johnson et al., 1999). More common is to sum efficiently by dynamic programming or sampling, as is typical in linear-chain conditional random fields (Lafferty et al., 2001), whole-sentence language modeling (Rosenfeld et al., 2001), and CRF CFGs (Finkel et al., 2008).
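To show why dynamic programming helps with the normalizer Z(x), here is a small Python sketch of the forward algorithm for a linear-chain model, checked against brute-force enumeration. The tag set, the emission and transition features, and the weights are invented for the example, and the code assumes a non-empty sentence. The forward recursion takes time proportional to the sentence length times the square of the number of tags, whereas enumeration grows exponentially with sentence length.

```python
import math
from itertools import product

TAGS = ["D", "N", "V"]  # hypothetical tag set

def local_score(theta, prev_tag, tag, word):
    """Score contributed by one position: a transition weight plus an emission weight."""
    return (theta.get(("trans", prev_tag, tag), 0.0)
            + theta.get(("emit", word, tag), 0.0))

def logsumexp(values):
    """log(sum(exp(v))) computed stably."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_z_forward(theta, words):
    """log Z(x) by the forward algorithm (assumes words is non-empty)."""
    # alpha[t] = log of the summed exp-scores of all tag prefixes ending in tag t
    alpha = {t: local_score(theta, "<s>", t, words[0]) for t in TAGS}
    for word in words[1:]:
        alpha = {t: logsumexp([alpha[p] + local_score(theta, p, t, word) for p in TAGS])
                 for t in TAGS}
    return logsumexp(list(alpha.values()))

def log_z_brute_force(theta, words):
    """log Z(x) by summing over every tag sequence (exponential in sentence length)."""
    totals = []
    for tags in product(TAGS, repeat=len(words)):
        prev, s = "<s>", 0.0
        for word, tag in zip(words, tags):
            s += local_score(theta, prev, tag, word)
            prev = tag
        totals.append(s)
    return logsumexp(totals)

# Arbitrary weights, chosen only for illustration.
theta = {("emit", "the", "D"): 2.0, ("emit", "dog", "N"): 1.5,
         ("emit", "barks", "V"): 1.0,
         ("trans", "D", "N"): 1.0, ("trans", "N", "V"): 0.5}
words = ["the", "dog", "barks"]

fast = log_z_forward(theta, words)
slow = log_z_brute_force(theta, words)
```

The two values agree up to floating-point error. With all weights zero, every one of the 27 tag sequences has score 0, so both functions return log 27.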

This page online: http://cs.jhu.edu/~jason/tutorials/loglin/further
Jason Eisner - jason@cs.jhu.edu (suggestions welcome)