Finite-State Methods in Natural Language Processing
Offered as a pair of short courses: 600.405 (Fall 2000) and its
sequel 600.406 (Spring 2001).
Catalog description: Finite-state transducers, a
generalization of finite-state automata, can efficiently compute many
useful functions and weighted (probabilistic) relations on
strings. We'll cover theory and practice, including algebraic
operations for building transducers, software tools, and a range of
applications to natural language.
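As a toy illustration of what such a machine computes, here is a minimal Python sketch (hand-rolled for this page, not taken from any toolkit; the states, symbols, and example transducer are made up): a one-state nondeterministic transducer, stored as a transition table, is applied to an input string to produce the set of related output strings.

    # Toy sketch: apply a small nondeterministic finite-state transducer
    # to an input string.  Transitions map (state, input symbol) to a
    # list of (next state, output symbol) pairs.
    def apply_fst(transitions, start, finals, s):
        """Return the set of output strings the transducer relates to s."""
        configs = {(start, "")}                      # (current state, output so far)
        for ch in s:
            next_configs = set()
            for state, out in configs:
                for dest, out_sym in transitions.get((state, ch), []):
                    next_configs.add((dest, out + out_sym))
            configs = next_configs
        return {out for state, out in configs if state in finals}

    # One-state transducer over {a, b} that rewrites 'a' as 'b' and copies 'b'.
    T = {(0, "a"): [(0, "b")], (0, "b"): [(0, "b")]}
    print(apply_fst(T, start=0, finals={0}, s="abba"))   # prints {'bbbb'}

Real transducers add weights, epsilon transitions, and much better algorithms; that machinery is what the course covers.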
Announcements
Assignments 4 and 5 are due by May 12, please. I'm not on campus, so
you'll need to use email/fax/snail mail.
Where to find things for spring semester
Policies
- Format: Lecture, discussion, weekly homework exercises (due
at the following class meeting), readings.
- Grades: Based on homework and participation. Monte Carlo grading may be used.
- Collaboration: You may work in pairs on the homework,
provided that each of you makes a real effort on each problem;
that you indicate who
you worked with; and that you write up your work separately.
- Communication: Assignments, notes, supplements, etc., will
be posted on this web page. I'll send you email from time to
time. Email is also the best way to reach me.
- Prerequisites: 600.271 (Automata and Computation Theory), 600.405 (Part I of this course).
Syllabus
Finite-state methods are enjoying a lot of new attention these
days. The NLP community has been inventing new algorithms and
applications, as well as unearthing old results. A sampling of topics
is below. We will not be able to cover
more than a fraction of them, so rather than setting a fixed schedule
in advance, I'll try to take your interests and background into account as the course
progresses.
Really, this class is about constructing useful functions.
We'll focus on what can be done theoretically; how to do it elegantly
and efficiently; and how to do it in practice with existing software toolkits.
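As a concrete (and deliberately simplified) example of building a new function algebraically, here is a hand-rolled sketch of composition for epsilon-free transducers; everything in it (the representation and the two toy rewrite machines) is invented for illustration, not drawn from any toolkit.

    # Sketch: compose two epsilon-free transducers.  A transition dict maps
    # (state, input symbol) -> list of (next state, output symbol).
    def compose(t1, t2):
        """An arc exists in the result wherever the output symbol of a
        t1 arc matches the input symbol of a t2 arc; states are pairs."""
        result = {}
        for (q1, a), arcs1 in t1.items():
            for (q2, b), arcs2 in t2.items():
                for r1, x in arcs1:
                    if x != b:
                        continue                     # middle symbols must match
                    for r2, y in arcs2:
                        result.setdefault(((q1, q2), a), []).append(((r1, r2), y))
        return result

    T1 = {(0, "a"): [(0, "b")], (0, "b"): [(0, "b")]}   # rewrites a -> b
    T2 = {(0, "b"): [(0, "c")], (0, "c"): [(0, "c")]}   # rewrites b -> c
    print(compose(T1, T2))
    # {((0, 0), 'a'): [((0, 0), 'c')], ((0, 0), 'b'): [((0, 0), 'c')]}

The composed machine rewrites both 'a' and 'b' as 'c', i.e., it computes the composition of the two relations; a real implementation would also handle epsilons and weights and would trim useless pair states.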
Resources
Overview readings that will reinforce this material:
- Background: Review your notes from 600.271 (Kosaraju). Or
look at your old copy of Hopcroft & Ullman (1979), Introduction to
Automata Theory, Languages, and Computation [note: an all-new
version is to appear by the end of 2000].
- Unweighted transducers: Kenneth Beesley & Lauri Karttunen
(forthcoming), Finite-State
Morphology: Xerox Tools and Techniques, Chapters 1-2. [This
draft copy, which you may not redistribute, is only accessible if you
are running your browser on the CS dept. research or undergrad network.]
- Weighted transducers: Mehryar Mohri (1997), Finite-state
transducers in language and speech processing. Computational
Linguistics 23:2.
- Semirings & power series: J. Berstel & C. Reutenauer
(1988), Rational Series and Their Languages, Chapter 1.
[Handed out in class.] A bit less dense is Arto Salomaa & Matti
Soittola (1978), Automata-Theoretic Aspects of Formal Power Series.
The Aho, Hopcroft & Ullman algorithms textbook describes
algorithms on semiring-weighted graphs.
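To make the semiring idea in these readings concrete, the sketch below (a toy example of my own; the graph and weights are invented) runs one generic single-source distance computation over an acyclic weighted graph twice: once in the probability semiring (+, *), giving the total probability of all paths, and once in the tropical semiring (min, +), giving the cost of the best path.

    import math

    # Generic single-source "shortest distance" over a semiring, for an
    # acyclic graph whose states are listed in topological order.
    def shortest_distance(arcs, order, source, plus, times, zero, one):
        """arcs: dict state -> list of (next state, weight)."""
        d = {q: zero for q in order}
        d[source] = one
        for q in order:                              # relax in topological order
            for r, w in arcs.get(q, []):
                d[r] = plus(d[r], times(d[q], w))
        return d

    arcs = {0: [(1, 0.6), (2, 0.4)], 1: [(3, 0.5)], 2: [(3, 0.9)]}
    order = [0, 1, 2, 3]

    # Probability semiring (+, *): state 3 gets 0.6*0.5 + 0.4*0.9 = 0.66,
    # the total probability of all paths from 0 to 3.
    print(shortest_distance(arcs, order, 0, lambda a, b: a + b,
                            lambda a, b: a * b, zero=0.0, one=1.0))

    # Tropical semiring (min, +) on costs = -log(probability): state 3 gets
    # the cost of the single best path (here 0 -> 2 -> 3).
    costs = {q: [(r, -math.log(w)) for r, w in a] for q, a in arcs.items()}
    print(shortest_distance(costs, order, 0, min, lambda a, b: a + b,
                            zero=math.inf, one=0.0))

Changing the semiring changes the question being asked without changing the algorithm, which is the point of the power-series view.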
Some possible topics for future
lectures in 600.405 and 600.406
Send me mail if you see
something you're particularly interested in, or if you'd like to
propose another topic.
- Formal stuff:
- Definitions of n-tape machines
- Undecidability of equivalence for general transducers
- Sequentiality:
- Deterministic, (p-)(sub)sequential, and unambiguous machines
- Decidability of equivalence for sequential transducers
- Factoring rational functions into sequential functions
- Characterizations of sequentiality
- Testing sequentiality
- Extensions
- 2-dimensional transducers
- Extensions to tree transducers
- Cute theorems
- Operations:
- Closure under sum, product, Kleene star
- Closure under inversion, projection, composition (for transducers)
- Closure under homomorphism, Hadamard product (for commutative semirings)
- Trimming (removing non-accessible and non-co-accessible states; a small sketch appears after this topic list)
- Best-paths
- Epsilon removal; epsilon normalization
- Determinization and minimization
- Approximate minimization
- Constructing new operations
- Cross product
- Directed replace and its variants
- Longest match, shortest match
- Lookahead assertions
- Ignore
- Lenient composition
- Directed best paths
- Efficient implementation:
- Storage choices
- Lazy ("on-the-fly") operations
- Local determinization
- Choice of epsilon removal methods
- Pruning
- Ordering operations
- Composition with CFGs
- Applications:
- Speech recognition
- Speech synthesis
- Dictionary compression and lookup (perfect hash)
- Text indexing
- Text markup for information extraction, text-to-speech, etc.
- Weighted edit distance (with applications to speech,
translation, lexicon induction, ...)
- Back-transliteration
- Statistical machine translation
- Spelling correction, accent restoration, word segmentation, etc.
- Text normalization (e.g., prior to parsing)
- Part-of-speech tagging (CG, TBL)
- Chunking and light parsing
- Phonology: Derivational, declarative, Optimality Theory
- Morphology: Two-level, templatic, ...
- Cryptography (!)
- Compilation of other devices into finite-state machines:
- Transformational rules
- Transformation-based learning
- Decision trees/lists
- Context-free grammars (exactly or approximately)
- Reduplicative grammars (approximately)
- Learning transducers
- Learning transducer weights via EM
- Weight parameterization
- Adaptive mixtures of transducers
- Learning transducer topology
- Local greedy search (state merging/splitting)
- Stochastic model search (e.g., MCMC)
- Transformation-based learning
- Abbadingo
- The case of reversible languages
- The case of acyclic automata
- Learning from random walks
- Query learning
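To give one of the operations listed above some flesh, here is a small sketch of trimming (again a toy illustration with made-up states and labels, not course code): keep only the useful states, i.e., those that are both accessible from the start state and co-accessible from some final state.

    def trim(arcs, start, finals):
        """arcs: dict state -> list of (next state, label).  Returns the
        useful states and the arcs restricted to them."""
        # Forward reachability from the start state.
        accessible, stack = {start}, [start]
        while stack:
            q = stack.pop()
            for r, _ in arcs.get(q, []):
                if r not in accessible:
                    accessible.add(r)
                    stack.append(r)
        # Backward reachability from the final states, over reversed arcs.
        rev = {}
        for q, out in arcs.items():
            for r, _ in out:
                rev.setdefault(r, []).append(q)
        coaccessible, stack = set(finals), list(finals)
        while stack:
            q = stack.pop()
            for p in rev.get(q, []):
                if p not in coaccessible:
                    coaccessible.add(p)
                    stack.append(p)
        useful = accessible & coaccessible
        trimmed = {q: [(r, lab) for r, lab in out if r in useful]
                   for q, out in arcs.items() if q in useful}
        return useful, trimmed

    # State 3 is a dead end and state 4 is unreachable; both disappear.
    arcs = {0: [(1, "a"), (3, "b")], 1: [(2, "c")], 3: [], 4: [(2, "d")]}
    print(trim(arcs, start=0, finals={2}))
    # ({0, 1, 2}, {0: [(1, 'a')], 1: [(2, 'c')]})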
Jason Eisner - jason@cs.jhu.edu - Last modified: 2002/12/22