Finite-State Software at JHU CS: Getting Started (HOWTO)

This tutorial page should help you get started with the 3 different finite-state packages we have installed:

FSA Utilities
XFST: Xerox Finite-State Tool
FSM: AT&T finite-state library

If you have trouble getting the programs to run, please contact Radu Florian (rflorian@cs.jhu.edu). If you have corrections or suggestions for this page, email Jason Eisner (jason@cs.jhu.edu).

Note on other finite-state packages

Note that you can find other finite-state packages on the web (e.g., this Mathematica library and Java applet), but most of them only deal with finite-state acceptors, not with weights or transducers.

As you probably know, some useful unambiguous transductions are straightforward to carry out in Perl or Python or with regular expression libraries (the GNU regex library for C, C++, Java; other Java packages; and the extended Regex++ library for C/C++). The idea is to simply match a generalized regular expression against different pieces of a text, recording the locations of subexpression matches so that they can then be manipulated by arbitrary code. This hybrid approach is useful if one really needs more power than a finite-state approach - thanks to the possibility of backreferences (which require backtracking) and arbitrary replacement code (most often used to swap substrings of unbounded length). But it sacrifices efficiency, economy, elegance, intermediate nondeterminism, and specification via best weighted path.

Changing your PATH and MANPATH variables

This section assumes you use csh or a csh derivative as your shell. If not, you may have to modify these instructions slightly.

If you are on the CS research network: check that /usr/local/xerox/bin is on your PATH. (It is by default.) Also make the following additions to PATH and MANPATH variables in your .cshrc file:

setenv PATH /users/rtfm/rflorian/software/bin:$PATH
setenv MANPATH /users/rtfm/rflorian/software/man:/software/local/uman:$MANPATH
setenv LEFTYPATH /users/rtfm/rflorian/software/lib/graphviz/lefty

If you are on the CS undergrad network: Note that the software is installed on barley and other machine, but not on hops. Check that /usr/local/xerox/bin is on your PATH. (It is by default.) Also make the following additions to PATH and MANPATH variables in your .cshrc file:
```
setenv PATH /home/0/rflorian/software/bin:$PATH
setenv MANPATH /home/0/rflorian/software/man:/software/local/uman:$MANPATH
setenv LEFTYPATH /home/0/rflorian/software/lib/graphviz/lefty
```

Comparison Chart

	FSA Utilities	Xerox FST	AT&T FSM + lextools
interactive startup	fsa -tk	xfst	N/A
manual	Holland, local (local is accessible from CS research network only)	book draft (only accessible from CS research and undergrad networks; don't distribute)	man fsmintro man lextools
complete regexp guide	Holland, local	sects 2.3-2.4 (p. 42ff), XRCE	man -s5 lextools
automaton format guide	Holland, local		man -s5 fsm man -s3 fsmclass man -s5 lextools
interface guide	chaps 11, 12, 13 Prolog manual Prolog-Emacs interface	overview + ref chap 3 (p. 73ff) help, apropos	N/A
parentheses	(E)	[E]	(E)
comments	% comment	# comment	# comment
atomic expressions	a, a:b, a::3, a:b:3	a, a:b	a, a:b, <3> (<3> = epsilon with weight 3)
literal symbol	'' (or escape())	"" (or %)	\*
symbol escape codes	as in Prolog	as in C
long symbol names	foo bar	foobar (greedy)	[foo][bar]
complex symbol	predicates		[noun num=pl gender=fem]
symbol class	a..z (or predicates)		superclass defn in .sym file (`lexmakelab` compiles to `.scl` file for `-S` option)
any symbol	? (surround with spaces)	?	[<sigma>] (if defined as superclass)
symbol complement (? - E)	`E	\E
edge of string		.#.	[<bos>], [<eos>]
concat	[E1,E2,E3]	E1 E2	E1 E2 (fsmconcat)
character concat		{abcd}	abcd
union	{E1,E2,E3}	E1 \| E2	E1 \| E2 (fsmunion)
empty string (epsilon)	[]	0 []	0 [<epsilon>] (if defined)
empty language	{}	\?
optionality	E^	(E)	E?
Kleene closure	E*	E*	E* (fsmclosure)
Kleene plus	E+	E+	E+
repetition		E^n, E^<n, E^>n, E^{n,m}	E^n
contains	$E	$E
reverse	reverse(E)	E.r	fsmreverse
intersect	E1 & E2	E1 & E2	& (fsmintersect)
difference	E1 - E2	E1 - E2	E1 - E2 (fsmdifference)
complement (?* - E)	~E	~E	!E (fsmcomplement)
cross-product	E1 x E2 E1:E2 (high-precedence)	E1 .x. E2, also {abcd}:{aceg}	E1:E2
same-length cross-product	E1 xx E2
project	domain(E), range(E)	E.u, E.l	fsmproject
epsilon-remove	efree(E), et al.		fsmrmepsilon
determinize	E! {t,w,wt}_determinize(E)		fsmdeterminize
minimize	E# {t,w,wt}_minimize(E)		fsmminimize
compose	E1 o E2	E1 .o. E2	E1 @ E2 (fsmcompose)
invert	invert(E)	E.i	fsminvert
ignore (insert freely)		A / B A ./. B (blocked at edges)
restriction		[A => L1 _ R1, L2 _ R2]
replacement	Relevant macros in ~rflorian/software/lib/fsa	A -> B (exhaustive nondeterm.) A (->) B (optional) A @-> B (LR longest) A @> B (LR shortest) A ->@ B (RL longest) A >@ B (RL shortest) Reverse arrow swaps lower, upper A may have form [. E .] (blocks repeat matches of epsilon in same place) B may have form L ... R (inserts L, R around A) Contextual restrictions: \|\| L _ R (upper upper) // L _ R (lower upper) \\ L _ R (upper lower) \/ L _ R (lower lower) Parallel replacements joined by , or by ,, if restricted

Walkthrough: A Basic Example

This tutorial section tells you what to type in order to get the same example working in all 3 programs. It is written so that it will be easiest to follow if you learn the 3 programs in the order given. The example involves roughly the following steps:

A regular expression over an alphabet of part-of-speech tags. The regexp is basically (art | quant)? adj* noun+ although the notation varies from package to package. It is intended to match simple noun phrases: an optional article or quantifier, followed by 0 or more adjectives, followed by 1 or more nouns.
A transducer that matches exactly the same input as the previous regular expression, and outputs a transformed version where non-final noun tags are replaced by nmod ("nominal modifier") tags. For example, it would map the input adj noun noun noun deterministically to adj nmod nmod noun. It would map the input adj to no outputs at all, since that input is not a noun phrase and therefore does not allow even one accepting path.
A transducer that reads an arbitrary input string and nondeterministically brackets and transforms, as above, some arbitrary set of non-overlapping noun phrases.
A transducer that reads an arbitrary input string and outputs a single version where all the maximal noun phrases (chosen greedily from left to right) have been bracketed and transformed as above.
The output of this last transducer on the input string verb art adj noun noun noun prep art adj noun and on the class of input strings quant noun+ verb.

FSA Utilities

To start fsa with the graphical interface, type fsa -tk. This opens an interactive Prolog window (where any errors and outputs will appear) and an interactive Tk/Tcl window.
In the "Regex:" field, type [{art,quant}^, adj*, noun+]. Look at the above to see what these symbols mean. (Note: Emacs editing keys work in the "Regex:" field.)
Notice that the automaton that appears has been automatically determinized and minimized (this often happens but not always). Convince yourself that it's correct. The initial state is green; the final state is red.
To get a better view of the automaton, drag the states around and point to the arcs. Clicking on an arc will print its description in the Prolog window.
Request a slower but excellent view made by the dot program for automatic directed graph layout. Select dot_ghostview or dot_xv from the Visualization menu. (This calls dot to lay out a postscript or gif file, and then invokes the appropriate viewer on that file.)
In your regexp, art, noun, etc. are elements of the formal alphabet. Try changing noun to Noun and hit return; you will get an error message in the Prolog window. The problem is that capitalized symbols have a special meaning to the Prolog back-end. So try changing Noun to the quoted 'Noun': this should work. Then change it back.

(You also have to quote symbols like '*' to block their special meaning. If you want numbers in your formal alphabet, be warned that the digit character '5' (which you can enter into the "String:" field) is distinct in Prolog from the number 5, but both can be in the alphabet, as can 23 and '23'.)
Type art adj adj noun into the "String:" field. The Prolog window will tell you "no" (i.e., it doesn't match). The problem is that the string was interpreted as a sequence of characters 'a','r','t',' ', ... To fix this, on the Settings menu, change symbol_separator to 32, which is the ASCII code for space. (You can also arrange this at startup by including symbol_separator=32 on the command line.) Now when you submit the string, the Prolog window should say "yes".
Another way to test a string is to intersect it with the automaton. If you get the empty language (easy to see after minimization), it didn't match. Thus, type ([art,adj,adj,noun] & [{art,quant}^, adj*, noun+])# into the Regexp box: what happens? What if you delete the first noun from the string? (Notice that we are not really using the string itself, but rather the regular language that consists of only that string.)
Now we'll compose our automaton with a transducer that replaces each noun that immediately precedes another noun with nmod. Such a transducer can be constructed by one of the directed replacement operators of Karttunen (1997). We need to load a macro file that defines this operator. On the File menu, choose LoadAux... and load the file ~rflorian/software/lib/fsa/Karttunen97/Karttunen97_2.pl. (You can also load a file at startup time by adding -aux file to the command line; you may use the -aux flag multiple times on the same line to load multiple files.)
Having loaded the file, we can now write replace(Old,New,Left,Right), to get a directed replacement transducer that scans its input from left to right. At every position i, it replaces the longest substring (if any) that starts at i, matches the regexp Old, and falls directly between substrings matching Left and Right. The match for Old is replaced nondeterministically by every string matching New. The transducer then resumes scanning immediately after the end of the match text.

(This is actually an abbreviation for Old => New lu Left -- Right. =>chooses left-to-right scanning, and lu says that the left context is matched against the "lower" (output) version of the string -- i.e., it can see the previous changes -- whereas the right context is matched against the "upper" (input) version that has not been changed yet. <=, ul, uu, and ll are also available.)
Specifically, type replace(noun,nmod,[],noun) into the "Regex:" field. (The resulting transducer will mention a few harmless odd characters that were introduced during its construction.) Try typing foo bar noun baz noun noun bing noun noun noun noun into the "String:" field. You should get the expected output, but you get 16 identical copies of it.
The multiple copies arise from multiple equivalent paths in the transducer. Change the regexp to t_determinize(replace(noun,nmod,[],noun)). (This does transducer (t_) determinization, which is not automatically requested by the graphical interface because it may not terminate.) Now try again; you should get only one copy.
Let's compose our original regexp with the transducer, using the composition operator o. If R and S are relations (i.e., sets of ordered pairs), then R o S = {(x,z): for some y, (x,y) in R and (y,z) in S}. This is a generalization of function composition.

The original regexp matches a particular language L. It will automatically be coerced to an "identity" transduction (namely [{art:art,quant:quant}^, adj:adj*, noun:noun+]) that maps every string in L to itself, and rejects all other strings. In other words, it describes the relation {(x,x): x in L}.

So try the composition: [{art,quant}^, adj*, noun+] o replace(noun,nmod,[],noun). Determinize it using t_determinize (again introducing some odd characters that you can ignore). How does this machine work? How does it manage to be deterministic, given that it must look ahead to the next symbol in order to decide whether to change noun to nmod?
Modify the transducer so that it also introduces angle brackets around the noun phrase in the output: [ []:'<', [{art,quant}^, adj*, noun+] o replace(noun,nmod,[],noun), []:'>' ]. Try it on the inputs art adj noun noun noun and art adj. Again, determinize it and try again.
The symbol ? matches any character, so ? * matches any string. (Warning: the space after ? is crucial!) If ? * is used as a transduction, the usual rules mean it will be coerced to the transduction that maps any string to itself - i.e., leaves the input unchanged in the output.

Therefore, given a transducer T, we can write [? *,[T, ? *]*] for a transducer that applies T to some arbitrary set of disjoint substrings of the input that it matches, and leaves the rest of the input unchanged.

Take T to be the deterministic transducer we have just constructed, and apply this transformation to it. Try it on the strings verb art adj noun noun noun and art adj. How would you characterize the output?
Use the LoadAux command as before to load ~rflorian/software/lib/fsa/GerdemannVannoord99/replace.pl. This defines 1-argument and 3-argument versions of the replace macro; in these versions the first argument is a transducer. If T is our noun phrase transducer as above, then replace(T) is a transducer that scans the input from left to right, applying T to the maximal substring If T is deterministic, then so is replace(T).

Try applying replace(T) to strings like verb art adj noun noun noun prep art adj noun. Fun, eh? (Note: replace(T,Left,Right) will only replace matches that fall directly between substrings matching Left and right.)
Try doing the above application by using composition. That is, build the automaton [verb, art, adj, ... ] o replace(T). This yields the automaton that maps a particular string to its output, and nothing else. Try applying the function range to this automaton, to get only the output.
Finally, try applying the transducer to a class of strings in parallel: in the "range" expression above, replace the specific string by the regexp [quant, noun+, verb]. The resulting automaton describes the set of outputs for these inputs. Add # to the end of the expression to determinize and minimize it. Have a look.
Quit FSA Utilities, either by typing halt. in the Prolog window, or by existing through the File menu.

Some other nice features of FSA Utilities that we didn't explore:

Weighted automata over more-or-less general semirings.
Minimal paths ("pruning").
A generalization of character ranges like a..z: arc labels may be predicates that pick out subsets of a potentially infinite alphabet.
Load/save with import/export to various file formats, including FSM.
Generate code that implements your specific deterministic acceptor or transducer in C, Java, etc.
Many more useful and interesting macros in the Examples directory, ~rflorian/software/lib/fsa, including many constructions from the recent literature.
Numerous flags that affect the behavior of the system.
Write your own powerful macros using Prolog. In fact, Prolog allows definition of new infix operators, so you can extend the regexp language quite nicely. Here's a brief introduction to Sicstus Prolog.
Open source. (It can even be compiled GUIlessly with YAP, an open-source Prolog which unfortunately won't link yet to tk/tcl.)

XFST: Xerox Finite-State Tool

Note: The tutorial material in chapters 2-3 of the xfst book will be very helpful if you want to do real work with xfst.

Note: A somewhat nicer and more standalone version of the following instructions is found in problem 1 of this assignment (postscript version).

To start xfst, type xfst. This gives you a command line. Useful commands are help, help command, and apropos topic.

Because xfst doesn't have good command-line editing and recall facilities, you may want to start a shell in emacs and run xfst in that shell. (M-x shell starts the shell and C-h m tells you how it works.)
Define a regular expression over an alphabet of part-of-speech tags: define nounphrase (Art|Quant) Adj* Noun+ ;

Note: When we type in strings to match against this regexp, xfst will manage to interpret ArtAdjAdjNoun as a length-4 string. It tokenizes by greedy left-to-right longest match, but it is safest not to rely on that. Instead, the convention of starting all multicharacter symbols with a distinguished character or character class prevents any accidental ambiguities. Here we use capital letters for this. In practice, the authors of xfst recommend using + (typed in a regexp as %+ or "+" since + has a special meaning in regexps).
Get information about the nounphrase machine: print words nounphrase and print net nounphrase.

The former command appears to list all words that can be accepted along acyclic paths (a finite number).

The latter command lists the transitions from each state. States are named sn or fsn depending on whether they are final; s0 or fs0 is the start state. Notice that as in FSA Utilities, the machine has been automatically determinized and minimized for us, since it is an acceptor rather than a transducer.
Let's see whether Art Adj Adj Noun is a noun phrase. We'll test that by intersection:
- push nounphrase puts a copy of our automaton on the stack.
- Then read regex Art Adj Adj Noun ; creates an unnamed automaton reading just the test string, and pushes it on the stack too. (Note that the xsft prompt shows the current stack depth; use print stack to see the contents.)
- intersect net replaces the two machines on top of the stack with their intersection.
- test non-null confirms that the intersection is not null, so Art Adj Adj Noun did indeed match the automaton.
- Type pop stack to pop the intersection automaton off the stack and discard it.
There is an easier way to do the test. Just as in FSA Utilities, an automaton is also regarded as as an identity transducer on the language it accepts. xfst makes it easy to apply transducers to raw strings, as follows:

push nounphrase pushes the nounphrase machine onto the stack, ready to apply; and then apply down enters a mode where you can type in test strings and discover what nounphrase transduces them to, in the usual ("down") direction of transduction. If the input string is copied to the output, then it's part of the language; if there are no output strings, then the input was not part of the language. Try typing in ArtAdjAdjNoun and ArtAdjAdj.

Press Ctrl-D to exit this mode, and do clear stack.
Type read regex Noun -> Nmod || _ Noun ; , which pushes an unnamed transducer that replaces Noun with Nmod before any Noun. Test it out: apply down FooBarNounBazNounNounBingNounNounNounNoun.

Note: The "down" direction of application (and in the stack) is identified with "right" in regular expressions (and in strings). Applying down transforms along the direction of the rightward arrow in Noun -> Nmod, and from a to b in a:b or a .x. b.
We now want to compose nounphrase with this transducer. Type push nounphrase and then compose net.

Did this compose the transducers in the right order? We want the lower language (the usual output) of nounphrase to feed into the upper language (input) of the new transducer. Since nounphrase is sitting on top of the stack, in the usual metaphor, it is above the new transducer, so its lower language was "touching" the new transducer's upper language, as desired.

Now that we have composed them, we can apply down, which is as if we send a string downward through nounphrase first and then through the noun-to-nmod transducer. Try some inputs like ArtAdjNounNounNoun and VerbAdjNounNounNoun.
We want to modify the composed machine so it inserts angle brackets around the noun phrase. First let's pop it into a named variable: define noun2nmod. (If the stack isn't clear now, type clear.) Now let's push the three things we have to concatenate:
- read regex []:%< ; (replaces the empty string with < - recall that % says to treat the next character literally)
- push noun2nmod
- read regex []:%> ;
However, the stack is now in reverse order: since down in the stack corresponds to right in the string, we want the topmost thing on the stack to be leftmost in the string. So let's reverse the order of stack elements (without exchanging up and down for any element): turn stack. (Type print stack to see the stack.) Now we can finally concatenate together all three transducers on the stack: concatenate net. Try using apply down to test it on the same strings as before.
Give a name to our new transducer: define npbracket. We can use this name directly in a regexp: read regex [?* [npbracket ?*]*];

As we did with the FSA utilities, try this on the strings VerbArtAdjNounNounNoun and ArtAdj.
For our last major trick, we again want to "correct" the previous transducer so that it behaves like replace(T) in the FSA Utilities. This means applying npbracket to maximal noun phrases from left to right. This will be a little tricky, since xfst has no operator that uses a transducer to do directed replacement. However, the solution is completely general.
We will do it as a composition of two transducers:
- The first one puts special markers around the strings that npbracket ought to transform: define r1 [npbracket.u @-> "{{{" ... "}}}"] ; (Here npbracket.u is the upper language of npbracket, i.e., the strings npbracket it will accept on the input side. @-> calls for left-to-right maximal replacement of this language, and the ... notation stands for an output copy of whatever string was actually matched on the input side.)
- The second one applies applies the npbracket transducer within curly braces, deleting the braces and leaving everything else alone: define r2 [\"{{{"* [ "{{{":[] npbracket "}}}":[] \"{{{"* ]*] ;
- So the full machine is obtained on the stack by read regex r1 .o. r2 ;
Check that the machine we just constructed does the right thing: apply down VerbArtAdjNounNounNounPrepArtAdjNoun. And as before, try applying it to a class of strings:
- read regex Quant Noun+ Verb ;
- compose
- print net
Questions for you to try yourself:
- Given an acceptor like nounphrase, how would you construct a transducer that maps accepted strings to yes and rejected strings to no? (Hint: use the cross product .x. operator.)
- Given a transducer like npbracket, how would you construct a version that is willing to take input in the form Art Adj Adj Noun (instead of ArtAdjAdjNoun) and also produces output in this form?
Type print defined and print storage to see what you've done. Then quit xfst by typing exit or quit.

Some other nice features of xfst that we didn't explore:

Most commands may be abbreviated to one word or a part of a word.
Load/save.
Scripts and aliases.
I/O redirection for individual commands.
Stack commands that correspond to most operators in the extended regexp language.
Heuristic optimize net command that tries to reduce number of arcs.
Convenient tests for various automaton properties, as well as equivalence, etc.
Some very ad hoc context-free approximation (see recursive-define variable).
Numerous flags that affect the behavior of the system.
Additional tools built on top of xfst. We have some of them installed here (and they are described in the book): twolc for compiling morphological grammars in Koskenniemi's "Two-Level Rule" formalism, and the accompanying tool lexc for compiling morphological lexicons.

Warning: In the case of transducers, xfst appears to use only the automaton algorithms for determinization and minimization. That is, it temporarily treats an arc label like a:b as a single symbol. So the result is not necessarily deterministic with respect to the input (there may be a:b and a:c arcs from the same state); nor is it necessarily the minimal transducer implementing this relation, since xfst will not change the alignment between input and output (will not split up the two halves of a:b and put them on different arcs).

FSM and lextools: AT&T finite-state library

This software comes in the form of C libraries (type man -s3 fsm) and command-line tools (type man -s1 fsm and man lextools). There is no interactive interface other than the shell, makefiles, etc. So you will need to create some files representing the machines. Make a directory to hold the files for this exercise.
The FSM package denotes states and symbols internally by numbers. Let's start by giving them symbolic labels for our convenience. Create a file called partofspeech.lab with these contents:
```
eps     0
Noun	1
Verb	2
Quant	3
Art	4
Adj	5
Nmod	6
Prep	7
Adj	8
         
```
0 is reserved for the empty string. (It should be assigned a name like epsilon or #.) Otherwise the numbers here are arbitrary; they need not be consecutive or even small.
Now let's create a textual description of a FSA (finite-state acceptor) for the language (Art | Quant)? Adj* Noun+. Put this in a file nounphrase.txt:
```
0	1	Art
0	1	Quant
1	1	Adj
1	2	Noun
2	2	Noun
2
         
```
The first line means, for example, that there is an arc from state 0 to state 1 labeled with the symbol Art. The last line means that state 2 is a final state.

The first state mentioned in the file (state 0 in this case) is the start state. The state numbering is arbitrary (in fact, the FSM programs may renumber the states, so the -s option for assigning names to states usually won't do what you want when printing or drawing a machine, although it can be helpful when compiling one from input you create).

The order of lines in the file is irrelevant, and any amount of whitespace may be used to separate fields on a line. Any line can take an additional field at the end giving the cost of the arc or the final state.
Now compile this textual representation into a binary representation:
fsmcompile -ipartofspeech.lab nounphrase.txt > nounphrase.fsa
The -i flag gives our file that turns input symbol labels into numbers. (Use -o for output symbols and -s for state labels.) Compilation or other operations may renumber the states.
Let's look at our compiled FSA:
- fsmprint nounphrase.fsa
- fsmprint -ipartofspeech.lab nounphrase.fsa
- fsmdraw -ipartofspeech.lab nounphrase.fsa | dot -Tps | ghostview -landscape -
- lexfsmstrings -l partofspeech.lab nounphrase.fsa (refuses to list strings because infinitely many)
- fsminfo nounphrase.fsa
- fsminfo -n nounphrase.fsa
- fsminfo -v nounphrase.fsa (verbose; other options get subsets of this info)
You may be interested to know that fsmdraw produces a directed graph specification in the dot language (documentation: man dot, user's guide). While there are several options to fsmdraw, it is sometimes convenient to edit this specification it produces, or filter it with a Perl script - for example, to get color. You can also use dotty nounphrase.dot to make minor changes to the layout, e.g., drag states around with the mouse or set colors and styles.

dot -Tps then lays out the graph and produces postscript output, which can be saved in a file or piped directly to ghostview - as above.
There are some easier ways to produce these machines than by hand-coding them! The lextools package is designed for that. In particular, lexcompre compiles a regular expression specified with -s:
```
lexcompre -l partofspeech.lab -s '([Art]|[Quant])?[Adj]*[Noun]+' > nounphrase2.fsa
```
View the output and make sure it's right!
Other options:
- Create the .txt textual description automatically with a program.
- Directly create an internal representation of the machine with a C program. (See man pages for fsmclass, fsmobject, fsmaccess, fsmcost).
- Build up the machine from simpler machines by using the FSM tools. For example, regexp * is implemented by the fsmclosure utility or the FSMClosure() function. Note the possibility of defining higher-level operators (like directed replacement) by using scripts or C functions that call the FSM tools.
- Write a regexp in FSA Utilities, and save it in FSM format! This allows access to the nice features of the FSA Utilities' regexp language.
Just for fun, try reversing the automaton:
```
fsmreverse nounphrase2.fsa | fsmdraw -ipartofspeech.lab | dot -Tps | ghostview -landscape -
```
The result contains an epsilon arc and other instances of nondeterminism. Thus, try
```
fsmreverse nounphrase2.fsa | fsmrmepsilon | fsmdeterminize | fsmminimize | fsmdraw -ipartofspeech.lab | dot -Tps | ghostview -landscape -
```
Notice that we needed a sequence of 3 commands: we must remove epsilons before determinization (which would otherwise treat them as ordinary symbols), and determinization is required for minimization to run.

Note also that some of the FSM commands are being used as pipes here. As for many Unix commands, the one-argument commands will conveniently read from standard input if their filename argument is omitted, and all the commands take the special filename - to refer to standard input.

Now let's create our transducer that maps Noun to Nmod before another noun. First by hand: create a file noun2nmod.txt with the following contents (but delete the comments).

 
### state 0: neither inside or immediately after a noun sequence
0   # final      
0	1	Noun	Nmod   # nondeterministically start multi-word noun sequence:
0	2	Noun	Noun   # nondeterministically start single-word noun sequence:
0	0	Verb	Verb
0	0	Quant	Quant
0	0	Art	Art
0	0	Adj	Adj
0	0	Nmod	Nmod
0	0	Prep	Prep
0	0	Adv	Adv
### state 1: strictly inside a maximal input noun sequence
###          must continue it
1	1	Noun	Nmod
1	2	Noun	Noun
### state 2: right at the end of a maximal input noun sequence
###          no way to leave here on a Noun input
2    # final
2	0	Verb	Verb
2	0	Quant	Quant
2	0	Art	Art
2	0	Adj	Adj
2	0	Nmod	Nmod
2	0	Prep	Prep
2	0	Adv	Adv

Compile this via

fsmcompile -t -ipartofspeech.lab
      -opartofspeech.lab noun2nmod.txt > noun2nmod.fst

. (The -t option indicates that the input describes a transducer.) View the result with fsmdraw.

(coming soon! But you do know enough now to use these tools in assignment 1. You could check out this tutorial if you want more info on FSM.)

CS 405 home - Jason Eisner - jason@cs.jhu.edu - $Date: 2001/11/19 16:08:31 $