Publications listed here in reverse chronological order.
Also available:
4 accepted papers at NAACL-HLT 2012, and one accepted to the PAMI journal. Also, a few papers are currently in review (and of course there's lots of work in progress!).
Invited talk at ILP+MLG+SRL, July 2009.
In this talk, I'll propose to hide many common implementation details behind a new level of abstraction that we are developing. Dyna is a declarative programming language that combines logic programming with functional programming. It also supports modularity. It may be regarded as a kind of deductive database, theorem prover, truth maintenance system, or equation solver.
I will illustrate how Dyna makes it easy to specify the combinatorial structure of typical computations needed in natural language processing, machine learning, and elsewhere in AI. Then I will sketch implementation strategies and program transformations that can help to make these computations fast and memory-efficient. Finally, I will suggest that machine learning should be used to search for the right strategies for a program on a particular workload.
Keynote talk at the NAACL HLT Workshop on Semi-supervised Learning for Natural Language Processing, June 2009. To all of you who asked in person for the slides, please email me (jason at cs dot jhu dot edu). Note that the talk referred to this work, this work, this work, and this work, among others.
In monolingual dependency parser adaptation, we achieve high accuracy in translating among multiple annotation styles for the same sentence. On the more difficult problem of cross-lingual parser projection, we learn a dependency parser for a target language by using bilingual text, an English parser, and automatic word alignments. Our experiments show that unsupervised QG projection improves on parses trained using only high-precision projected annotations and far outperforms, by more than 35% absolute dependency accuracy, learning an unsupervised parser from raw target-language text alone. When a few target-language parse trees are available, projection gives a boost equivalent to doubling the number of target-language trees.
Our models are trained on automatically aligned bitext. Their form is simple but novel. They assess, based on features of the input sentence, how strongly each pair of input word tokens wi;wj would like to reverse their relative order. Combining all these pairwise preferences to find the best global reordering is NP-hard. However, we present a non-trivial O(n3) algorithm, based on chart parsing, that at least finds the best reordering within a certain exponentially large neighborhood. We show how to iterate this reordering process within a local search algorithm, which we use in training.
Dynamic programming algorithms in statistical natural language processing can be easily described as weighted logic programs. We give a notation and semantics for such programs. We then describe several source-to-source transformations that affect a program's efficiency, primarily by rearranging computations for better reuse or by changing the search strategy. We present practical examples of using these transformations, mainly to optimize context-free parsing algorithms, and we formalize them for use with new weighted logic programs.
Specifically, we define weighted versions of the folding and unfolding transformations, whose unweighted versions are used in the logic programming and deductive database communities. We then present a novel transformation called speculation -- a powerful generalization of folding that is motivated by gap-passing in categorial grammar. Finally, we give a simpler and more powerful formulation of the magic templates transformation.
This supersedes the pre-proceedings version (BibTeX): Eisner, Jason and John Blatz (2006). Program transformations for optimization of parsing algorithms and other weighted logic programs. Pre-proceedings of the 11th Conference on Formal Grammar (FG-2006), pp. 39-59, Malaga, July.
NP without supervision into
NP[0] and NP[1], which behave differently.
We first propose to learn a PCFG that adds
such features to nonterminals in such a
way that they respect patterns of linguistic
feature passing: each node's nonterminal
features are either identical to, or independent
of, those of its parent. This linguistic
constraint reduces runtime and the
number of parameters to be learned. However,
it did not yield improvements when
training on the Penn Treebank. An orthogonal
strategy was more successful: to improve
the performance of the EM learner
by treebank preprocessing and by annealing
methods that split nonterminals selectively.
Using these methods, we can maintain
high parsing accuracy while dramatically
reducing the model size.
To model a collection of documents, suppose that each document was generated by a different hidden Markov model or probabilistic finite-state automaton (PFSA). Further suppose that all these PFSAs are similar because they are drawn from a single (but unknown) prior distribution over PFSAs. We wish to infer the prior, obtain smoothed estimates of the individual PFSAs, and reconstruct the hidden paths by which the unknown PFSAs generated their documents.
As an initial application, particularly hard for our model because of its sparse data, we derive an FSA topology from WordNet. For each verb, we construct the “document” of all nouns that have appeared as its object. Our method then learns a better estimate of p(object | verb), as well as which paths in WordNet, and hence which senses of ambiguous objects, tend to be favored. Our method improves 14.6% over Witten-Bell smoothing on the conditional perplexity of objects given the verb, and 27.5% over random on detecting the most common senses of nouns in the SemCor corpus.
A finite-state machine with n tapes describes a rational (or regular) relation on n strings. It is more expressive than a relational database table with n columns, which can only describe a finite relation.
We describe some basic operations on n-ary rational relations and propose notation for them. (For generality we give the semiring-weighted case in which each tuple has a weight.) Unfortunately, the join operation is problematic: if two rational relations are joined on more than one tape, it can lead to non-rational relations with undecidable properties. We recast join in terms of "auto-intersection" and illustrate some cases in which difficulties arise. We close with the hope that partial or restricted algorithms may be found that are still powerful enough to have practical use.
Radiology reports have especially low entropy, which allows prediction of multi-word phrases. Our system therefore chooses an optimal phrase length for each prediction, using Bellman-style dynamic programming to minimize the expected cost of typing the rest of the document. This computation considers what the user is likely to type in the future, and how many keystrokes it will take, considering the future effect of phrase completion as well.
Previous work on minimizing weighted finite-state automata (including transducers) is limited to particular types of weights. We present efficient new minimization algorithms that apply much more generally, while being simpler and about as fast.
We also point out theoretical limits on minimization algorithms. We characterize the kind of ``well-behaved'' weight semirings where our methods work. Outside these semirings, minimization is not well-defined (in the sense of producing a unique minimal automaton), and even finding the minimum number of states is in general NP-complete and inapproximable.
Probabilistic parsing requires a lexicon that specifies each word's syntactic preferences in terms of probabilities. To estimate these probabilities for words that were poorly observed during training, this thesis assumes the existence of arbitrarily powerful transformations (also known to linguists as lexical redundancy rules or metarules) that can add, delete, retype or reorder the argument and adjunct positions specified by a lexical entry.
In a given language, some transformations apply frequently and others rarely. We describe how to estimate the rates of the transformations from a sample of lexical entries. More deeply, we learn which properties of a transformation increase or decrease its rate in the language. As a result, we can smooth the probabilities of lexical entries. Given enough direct evidence about a lexical entry's probability, our Bayesian approach trusts the evidence; but when less evidence or no evidence is available, it relies more on the transformations' rates to guess how often the entry will be derived from related entries.
Abstractly, the proposed ``transformation models'' are probability distributions that arise from graph random walks with a log-linear parameterization. A domain expert constructs the parameterized graph, and a vertex is likely according to whether random walks tend to halt at it. Transformation models are suited to any domain where ``related'' events (as defined by the graph) may have positively covarying probabilities. Such models admit a natural prior that favors simple regular relationships over stipulative exceptions. The model parameters can be locally optimized by gradient-based methods or by Expectation-Maximization. Exact algorithms (matrix inversion) and approximate ones (relaxation) are provided, with optimizations. Variations on the idea are also discussed.
We compare the new technique empirically to previous techniques from the probabilistic parsing literature, using comparable features, and obtain a 20% perplexity reduction (similar to doubling the amount of training data). Some of this reduction is shown to stem from the transformation model's ability to match observed probabilities, and some from its ability to generalize. Model averaging yields a final 24% perplexity reduction.
We speed up Tesar and Smolensky's RCD algorithm to be linear on the number of constraints. This finds a ranking so each attested form xi beats or ties a particular competitor yi.
We also generalize RCD so each xi beats or ties all possible competitors. Alas, neither this more realistic version of learning, nor even generation, has any polynomial algorithm unless P=NP! That is, one cannot improve qualitatively upon brute force:
Merely checking that a single (given) ranking is consistent with given forms is coNP-complete if the surface forms are fully observed and Delta2p-complete if not. Indeed, OT generation is OptP-complete. As for ranking, determining whether any consistent ranking exists is coNP-hard (but in Delta2p) if the forms are fully observed, and Sigma2p-complete if not.
Finally, we show that generation and ranking are easier in derivational theories: in P, and NP-complete.
The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n5). A more efficient O(n3) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.
This is an improved and extended version of an earlier paper (BibTeX): Eisner, Jason (1997). Bilexical grammars and a cubic-time probabilistic parser. Proceedings of the International Workshop on Parsing Technologies, pages 54-65, MIT, September.
In this talk I'll motivate a restrictive formalization of OT that allows just two types of simple, local constraint. Gen freely proposes gestures and prosodic constituents; the constraints try to force these to coincide or not coincide temporally. An efficient algorithm exists to find the optimal candidate.
I will argue that despite its simplicity, primitive OT is expressive enough to describe and unify nearly all current work in OT phonology. However, it is provably more constrained: because it is unable to mimic deeply non-local mechanisms like Generalized Alignment, it forces a new and arguably better account of metrical stress typology. I will even discuss a proposal for constraining it further.
http://cs.jhu.edu/~jason/papers/#ucla99. Extended
version of talk at the 1997 LSA.
Some results on generation are presented. Unlike less restricted theories using Generalized Alignment, OTP grammars can derive optimal surface forms with finite-state methods adapted from Ellison (1994). Unfortunately these methods take time exponential on the size of the grammar. Indeed the generation problem is shown NP-complete in this sense. However, techniques are discussed for making Ellison's approach fast and practical in the typical case, including a simple trick that alone provides a 100-fold speedup on a grammar fragment of moderate size. One avenue for future improvements is a new finite-state notion, ``factored automata,'' where regular languages are represented compactly via formal intersections of FSAs.
Specifically, the paper attempts to shed light on the classical algorithms of Kruskal, Prim, and Boruvka; the improved approach of Gabow, Galil, and Spencer, which takes time only O(m log (lg* n - lg* m/n)); and the randomized O(m) algorithm of Karger, Klein, and Tarjan, which relies on an O(m) MST verification algorithm by King. It also considers Frederickson's method for maintaining an MST in time O(sqrt(m)) per change to the graph. An appendix explains Fibonacci heaps.
This paper presents a rather different and in some ways more explanatory typology of stress, couched in the framework of primitive Optimality Theory (OTP), which allows only primitive, radically local constraints. For example, Generalized Alignment is not allowed. The paper presents a single, coherent system of rerankable constraints that yields the basic facts about iambic and trochaic foot form, iambic lengthening, quantity sensitivity, unbounded feet, simple word-initial and word-final stress, directionality of footing, syllable (and foot) extrametricality, degenerate feet, and word-level stress.
The metrical part of the account rests on the following intuitions:
This talk proposes an approach to constraining OT, called "primitive Optimality Theory" (OTP). Most constraints given in the literature can be reformulated (not always obviously) as coming from one of two simple, local families of ``primitive constraints'':
We will formalize these families and the representations that they constrain. As in Optimal Domains Theory, neither the constraints nor the representations use association lines. The constraints control only the relative timing of articulatory gestures, and other phonological or morphological constituents, along a continuous timeline.
A list of hundreds of constraints drawn from the literature is presented, showing how every degree of freedom of OTP is exploited in each of several areas: features, prosody, feature-prosody interaction, input-output relationships, and morphophonology. To show that the primitive constraints are not merely necessary, but also close to sufficient, we also discuss how to handle a few apparently difficult cases of non-local phenomena.
http://ruccs.rutgers.edu/roa.html) and at
http://cs.jhu.edu/~jason/papers/#lsa97.
With some additions and corrections:
Eisner, Jason (1997).
Constraining OT: Primitive Optimality Theory. Talk handout,
MIT, September. http://www.cs.jhu.edu/~jason/papers/#mit97.
The present paper describes some details of the experiments and repeats them with a larger training set of 25,000 sentences. As reported at the talk, the more extensive training yields greatly improved performance. Nearly half the sentences are parsed with no misattachments; two-thirds are parsed with at most one misattachment.
Of the models described in the original written paper, the best score is still obtained with the generative (top-down) ``model C.'' However, slightly better models are also explored, in particular, two variants on the comprehension (bottom-up) ``model B.'' The better of these has an attachment accuracy of 90%, and (unlike model C) tags words more accurately than the comparable trigram tagger. Differences are statistically significant.
If tags are roughly known in advance, search error is all but eliminated and the new model attains an attachment accuracy of 93%. We find that the parser of Collins (1996), when combined with a highly-trained tagger, also achieves 93% when trained and tested on the same sentences. Similarities and differences are discussed.
Many modern students of democracy favor proportional representation through the Single Transferable Vote (STV). In countries with high illiteracy, however, this system may be unworkable.
This paper proposes a practical modification of STV. In the modified system, each citizen votes for only one candidate. Voters need not specify their second, third, and fourth choices. Instead, each candidate specifies his or her second, third, and fourth choices. The modified system is no more difficult for voters than current proposals -- and it provides virtually all the benefits of STV, together with some new ones.
- Abstract:
- An adaptive compression technique which is an improvement to Lempel-Ziv (LZ) compression techniques, both as applied for purposes of reducing required storage space and for reducing the transmission time associated with transferring data from point to point. Pre-filled compression dictionaries are utilized to address the problem with prior Lempel-Ziv techniques in which the compression software starts with an empty compression dictionary, whereby little compression is achieved until the dictionary has been filled with sequences common in the data being compressed. In accordance with the invention, the compression dictionary is pre-filled, prior to the beginning of the data compression, with letter sequences, words and/or phrases frequent in the domain from which the data being compressed is drawn. The letter sequences, words, and/or phrases used in the pre-filled compression dictionary may be determined by statistically sampling text data from the same genre of text. Multiple pre-filled dictionaries may be utilized by the compression software at the beginning of the compression process, where the most appropriate dictionary for maximum compression is identified and used to compress the current data. These modifications are made to any of the known Lempel-Ziv compression techniques based on the variants detailed in 1977 and 1978 articles by Ziv and Lempel.
- Citation (BibTeX):
- Jeffrey C. Reynar, Fred Herz, Jason Eisner, and Lyle Ungar. A Lempel-Ziv data compression technique utilizing a dictionary pre-filled with frequent letter combinations, words and/or phrases. U.S. Patent #5,951,623 issued 9/14/1999, filed 1996.
- On-line document:
- #5,951,623 (at Patent Storm)
- Relationship to other papers:
- jump to this paper in my research summary
- Citation (BibTeX):
- Frederick Herz, Lyle Ungar, and Jason M. Eisner. System for the automatic determination of customized prices and promotions. Patent pending, filed 1996.
- Relationship to other papers:
- jump to this paper in my research summary
- Abstract:
- This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a "target profile" for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically-based pseudonym proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.
- Citations (BibTeX):
- Frederick S. M. Herz, Jason M. Eisner, and Lyle H. Ungar. System for generation of object profiles for a system for customized electronic identification of desirable objects. U.S. Patent #5,835,087 issued 11/10/1998, filed 1995.
- Frederick S. M. Herz, Jason M. Eisner, Lyle H. Ungar, and Mitchell P. Marcus. System for generation of user profiles for a system for customized electronic identification of desirable objects. U.S. Patent #5,754,939 issued 5/19/1998, filed 1995.
- Frederick S. M. Herz, Jason Eisner, and Marcos Salganicoff. Pseudonymous server for system for customized electronic identification of desirable objects. U.S. Patent #5,754,938 issued 5/19/1998, filed 1995.
- On-line documents:
- #5,835,087; #5,754,939; #5,754,983 (at Patent Storm)
- Relationship to other papers:
- jump to this paper in my research summary
http://cs.jhu.edu/~jason/papers
| Jason Eisner - jason@cs.jhu.edu (tech correspondence welcome) | Last Mod $Date: 2012/04/02 01:34:55 $ |