Bits of Code You Could Ask Me For
Code implementations are available for most of
my papers. Please ask me or the first author
for the code rather than wasting time on a reimplementation. My
students and I have started doing our work on github, bitbucket, etc.
My big software projects are the Dyna programming
language, the Dopp programming language parser, the Dynasty
hypergraph browser, and the ERMA
toolkit for robust training of graphical models under approximations.
Also, in a more community-service vein, the aclpub
package for producing conference proceedings volumes, and the
gradrev system
for managing the department's annual review letters to graduate
students.
Over the years I've also written many small-to-medium bits of code
that someone might find useful. Some of them are even packaged or
documented nicely. I've meant to clean some of them up and publish
them to the relevant archives, but have never gotten around to it.
Since I hate duplication of effort, and like to give back to the
open source code pool, just drop me
a note if you want to download one of these. I'll try to make it
freely available to you in some form (especially if it is small enough
that I can easily look it over first). If I have already put it on
the web for someone else, I may or may not have linked to it from this
page.
The list below was already incomplete at the time I made it, and I
haven't kept it up to date well. I have written other possibly useful
software not listed here.
Computational Linguistics
Some of these materials are support software for problem sets I use in
teaching. The problem sets could serve as additional documentation.
- Semantics:
- simplify: Evaluate lambda terms typed at the command
line. The work is done by a Perl(!) module,
LambdaTerm.pm, which correctly
implements the untyped lambda-calculus (Turing complete). To
make it natural to encode natural-language semantics, I allow the
terms to contain all the sugar you might want - infix operators,
multi-argument functions, quantifiers that bind variables, and spaces.
This sugar is preserved when possible during reduction, so that
the results remain readable.
- buildfeats: Assign semantics to a parse tree, given
a grammar with semantic attachments. Only a weak ("bottom-up") form
of feature unification is currently implemented. But lambda-term
reductions are fully implemented via LambdaTerm.pm (see above).
In Perl.
- Syntax:
- head-treebank: Suite of Perl scripts for automatically manipulating the Penn
Treebank.
Difficult scripts:
- guess lexical heads using rules and exception tables (plus
pshead-mode annotation software for developing those)
- convert traces to slashed categories
- fix common annotation errors
- add additional subconstituent structure
More straightforward scripts:
- flatten lexicalized trees into dependency format
- binarize trees (convert to Chomsky Normal Form)
- list rules or lexicalized rules; discard singleton rules; summarize rules
- get the fringe of a parse
- distinguish adjuncts from arguments following Collins 1997
- simplify case, punctuation, nonterminals
- select Treebank sections
- convert between one-parse-per-line and prettyprinted formats
- randsent: Script to generate
text or trees from a probabilistic context-free grammar (PCFG). You
should probably also download prettyprint.
- earley: A weighted Earley's algorithm parser with a
left-corner speedup. In Lisp.
- probabilistic dependency parsing: As in this paper
and the slides showing experiments for
this paper.
In Lisp.
- smoothing and backoff: Very general, rather
slow code for counting events and getting smoothed conditional
probabilities. You can write arbitrary backoff functions and
have them called at runtime. In Lisp.
- more lexicalized probabilistic parsing: As in this early paper.
In Lisp.
- prettyprint: A Perl filter
that pretty-prints Lisp-like terms, such as parses. The code is
carefully written to pass input through immediately, rather than
waiting for the end of the expression or file.
- dependency-mode: Annotation tool for viewing and
editing text representations of dependency trees. Built in Emacs.
- Finite-state:
- finite state code: A full implementation of this paper on
primitive OT. Allows FSAs whose arcs are labeled with predicates
over the astronomically large alphabet: supports weighted intersection
and best-paths, unweighted determinization and minimization,
compilation of OTP constraints, factored FSAs. In C.
- vtag: A simple smoothed bigram
HMM tagger, with EM training (= forward-backward = Baum-Welch) and
Viterbi decoding. Also a baseline tagger btag. In
Perl.
- HMM spreadsheet: A spreadsheet that implements the
forward-backward algorithm, so you can play what-if and see the effect
on graphs. In Excel.
- Clustering spreadsheet: A spreadsheet that implements the EM
clustering algorithm, and animates graphs showing how the cluster
centers move. In Excel.
- counts: Library that computes the counts and
count-counts needed for smoothing methods such as Katz backoff. In C,
but not very efficient.
- shade: Generate an Optimality
Theory tableau in LaTeX. Automatically fill in the shading, ! marks,
and pointy hand. In Perl.
- stamp: In Perl, print some header comments at the
start of the output. These comments are useful in research: they will
later help you determine exactly how the output file was created.
They give the time the file is being created, the command line used,
the working directory, the program's directory, and the program's
modtime. They also quote the header comments for any files used as
arguments.
C Data Structures
If for some reason you prefer writing in C to using a language with
real libraries, here's old but efficient code to support some common
data structures.
- stack: Self-resizing array (vector).
- hash: Hash table that
maps strings (e.g., hash keys or English words) to consecutive
integers. These integers are smaller than the strings, faster to
compare, and can be used to index an array of values associated with
the respective strings.
- pqueue: Priority queue (heap).
- allocate: Fast memory allocator for fixed-size
blocks (as small as 1 byte). Can allocate single blocks or arrays.
Can iterate over all allocated blocks. Can free all blocks controlled
by a given allocator, but not individual blocks. This is similar to
the Vmlast method of vmalloc, which
you should probably use instead.
- prime: Miller-Rabin primality
testing. Useful to pick a prime hash table size, for example.
Perl Data Structures
- summset: Generalization of a heap. Maintains a
associative summary statistic on a set of elements: e.g., min, argmin,
sum-of-squares, sorted list. Insertion and deletion both take O(log
n) time, if the summary statistic on two disjoint sets can be combined
in O(1) time into the summary statistic of their union.
- jigsaw: The union-find algorithm.
- LambdaTerm.pm: See above.
- OptimizeParams.pm: Minimize a function of 1 variable
with or without derivative information (Brent's method); minimize a
function of n variables without derivative information (Powell's
method). Adapted from Numerical Recipes in C. Obviously, it would
be smarter to use a Perl-to-C interface.
- aliasdag: Easily build up a big directed acyclic
graph from smaller ones, using operations like rooted-union and
cross-product. Even makes a diagram of the result. I use this to
build a multiple inheritance hierarchy of email aliases.
R or S-PLUS code
- samplerepl: MUCH faster replacement for the built-in
sample function (sample with replacement).
- repl: Multiple evaluations of a function, e.g., a
function with a random output.
- warnifnot: Weaker version of "stopifnot".
- map: Map a function along a list.
- zipf: Processes to generate Zipf-like distributions.
Emacs Lisp
This section especially needs to be finished (it is missing some items
and some descriptions). I've spent a lot of time hacking my Emacs
environment; these are some of the bits that might be most useful to
others.
- previous-buffer, next-buffer: These page through
your recently visited buffers just like the back and forward buttons
in a browser. Much handier than the standard alternatives. I
assign them to my laptop's special back/forward keys. Holding down
shift skips over buffers that aren't in the current major mode.
- scroll-right-slightly, scroll-left-slightly:
Horizontally scroll the window to look at long lines. To turn
wrapping on, scroll leftward past the left margin.
- defun-pop-to-specific-file: Macro that constructs
a function to jump to a specific file, like ~/.emacs or ~/calendar.
This function can be bound to a key. Related macros include
defun-switch-to-specific-buffer and
defun-find-major-mode-buffer. Fallback behavior
can be specified for when the function fails.
- iconify-temp-frames: Clean up temporary frames every
so often.
- bibtex-jump-key: Jump to the right place for a new
key in a sorted bibtex file.
- shell-command-at-point(-on-region): Consider line
under point to be a shell command, and execute it.
- vc-create-snapshot-synchronized: Checks in a bunch
of files with the same comment and version number, whether they've
changed or not.
- mail-current-buffer: Start a mail message containing
the current buffer or region.
- nextfont: Cycle this frame through a list of fonts.
- buffer-menu-flag-buffers-regexp: In
buffer-menu-mode, mark for deletion all buffers matching a
user-provided regexp. (Bind this to the D key.)
- pp-sexp-after-point, pp-last-value: Prettyprint.
- flexdate, cal-jump:
- sorted-fancy-diary-display: Improved version of
fancy-diary-display that sorts each day's entries by start time.
- dired-nonsubdir: Modify dired to allow insertion of
arbitrary directories into the dired buffer, even if they're not
subdirectories of the main dired directory.
- feedmail, feedmail-show:
- grep-specific:
- hippie-exp-pathname:
- hscroll-patch:
- launch:
- load-safely:
- mail-fix-headers:
- mail-guess-fcc:
- mail-hist-improvements:
- mark-motion:
- misc: (break this into several entries)
- my-undigest:
- query-replace-rotating:
- rmail-dup:
- rmail-later:
- rmail-reply-self:
- rsi:
- websearch:
- Split a big RMAIL file into pieces.
- cat-rmail: Schell script to concatenate RMAIL files.
- diary: Shell script to run M-x diary and email you
the result. I run this every night as a cron job.
LaTeX
- acl-squish.sty: Versions of the ACL style files that
let you squish a little more into your paper.
- myshading.sty: Hack to allow shaded cells in an OT
tableau. See entry for "shade" above.
- numpara.sty: Allow inline-numbered sections (with
optional marginal notes) - the mathematical style of this paper
- prog.sty: Environment for displaying pseudocode.
- sectnums.sty: Use in conjunction with apa.cls to
allow section numbers (e.g., for the journal Cognitive
Science).
Shell Scripts and Aliases
Things you might want to type at a Unix command line:
- vote: Given some ballots that rank candidates,
obtain a consensus ranking under Single Transferable Vote. (Or an
infinite variety of related procedures: you can specify a command-line
formula that gives some initial weight to later-ranked choices.)
Reports (and even graphs) how the candidates' status changes over
successive rounds of the algorithm. Performs sensitivity analysis of
the results via bootstrap resampling (i.e., would the winners have
changed if slightly different voters had been in the room?).
- bibgrep: Search your bibtex files for entries that
match a regexp. Print each unique entry together with a list of all
bibfiles that contain it and all texfiles that cite it. Especially
useful for finding a bib entry that you've used before.
- texgrep, texgrepcmd: Search your tex files for a
regexp or to find the definition of a specific macro.
- pdfcat: Concatenate one or more PDF files
(or selected pages of them) into a new PDF file. Optionally, you
can force each new file to start at an odd-numbered page,
which is sometimes desirable for printed copies.
- pdfify: Converts files into PDF, usually by asking
OpenOffice to print them. Will accept .doc, .txt, .xls, possibly
.ppt files.
- mailmerge: A powerful and handy command-line mailmerge program.
It is based on Perl's
Text::Merge
module, so it can handle field insertions and conditional text.
Unlike Text::Merge, it can also handle nested conditions,
create MIME messages with optional attachments, preview the
messages, and -- if you are satisfied -- call sendmail
to send the messages.
- webify: This is a good alternative to emailing
a large file to yourself or someone else. It publishes the document
or directory at an automatically generated secret URL. Handles
permissions and file updates.
- pdfify: Convert various kinds of files to PDF,
if necessary by asking Open Office to print them.
- bigfiles: Find your biggest files.
- comix: Email a date-specific URL to someone, with a
randomly chosen cheery greeting. I have a daily cron job that emails
my family a URL to that day's Edge City syndicated comic (drawn
& written by my cousins Terry and Patty LaBan).
- bground: Fork to run a program with arguments in the background, first
making copies of any files in case the caller deletes them. This is
useful to put into your .mailcap if your mailer insists on viewing
messages in the foreground.
- xlaunch: Start a command in a new, appropriately titled window.
- keepalive: Keep a remote login connection from going
down during idle time, by writing ASCII 0 to the terminal every 10
minutes. This one isn't by me, but it's so useful that I'll list it
anyway.
- datedir: Change the dates of directories to the
dates of their most recent files. Recursive if desired.
- lls: Version of ls that lists absolute filenames.
That is, each filename is preceded by the full name of the directory
it's in. Allows the same options as ls does. Useful as an ingredient
in other scripts.
- getmail: Safely move a mail spool file from one machine to another, e.g.,
from your mail server to your personal machine.
- mailwho: Scan mail folders to produce a list of who you've been
corresponding with, how often, and on what subjects/dates. Can
deal with compressed files and rmail folders. Uses heuristics
to guess when multiple addresses refer to the same person.
- makealiases: Automatically produce a file of mail aliases by scanning your
mail folders.
- texclean: Cleans intermediate files created by tex and its relatives.
- ps2eps: Turn postscript (.ps) into encapsulated
postscript (.eps); may not always work.
- qcd: Change to directory by typing just a fragment of its pathname.
- rcsfiles, rcschanged: Find RCS-controlled files in the current directory. Or, find
just the ones that need to be checked in.
- Wrappers around standard programs:
- General-purpose wrapper that allows other programs to take their
input from stdin. (Saves stdin to a file if necessary, then runs the
program on that file.)
- subclass: Find the next program on the search $PATH that has the same name as
a given program. Very useful in wrapper scripts that have the same
name as the program that they wrap, and need to call it.
- phone: Version of grep that extracts and nicely
formats lines from a fixed file. For example, I type "phone laura" to
get all the lines in my Rolodex file that mention anyone named Laura.
- Versions of grep that search particular kinds of files, such as
txtgrep to find short text files under a given directory
and grep them, or texgrep to grep my tex files.
- Version of which that takes regular expressions - looks in $PATH
for anything matching the regexp.
- Version of enscript that respects the $PRINTER variable.
- Version of lpr that defaults to no header page, and handles .dvi files
correctly.
- Version of finger that can translate your mail aliases into
user names. So "finger joe" works like finger "joseph_shmoe@nebish.com."
- xgraph-int: A wrapper around "xgraph" that makes it
easier to display multiple line graphs with the same x axis.
Optionally transforms the y axis of each dataset so that each
dataset's curve has decent dynamic range. Generates a decent (and
sorted) legend to explain this. For the caller's convenience, it
allows the input stream to intersperse points from different datasets.
- Command-line filters:
- wrap: Wrap long lines *without* removing existing
linebreaks (as those are often significant). I'm always calling this from
emacs to clean up other people's emails, or
cut-and-pasted text, or code. The script preserves indentation
prefixes, and can use blanks to separate wrapped lines from
surrounding text.
- html2tabs: Convert browser table to tab-delimited flat file. (When you cut and paste a
table from your browser, it's pretty ugly.)
- flattabs: Convert flat file (tab-delimited) into a human-readable version.
The tabs are replaced with spaces to make the columns line up.
- sum: Sums all the numbers in the input text.
- fixexp: Replace numbers in scientific notation with numbers in fixpoint
notation.
- Perl snippets: ...
- Shell aliases: ...
Fun math code
- primes: An improved
linear-time version of the sieve of Eratosthenes (a rediscovery
of Gries & Misra 1978); randomized primality testing; Legendre's and Meissel's formulas
for pi(n), the number of primes less than n; approximations to
pi(n).
- partition: given a sequence of points in n-space,
split the sequence into two clusters so as to minimize total
intracluster variance.
- graph, chrompoly, colouring, hamilton: Some code for
querying and manipulating graphs. Edge contraction, connected
components, coloring. Chromatic polynomial or chromatic number of a
graph. Hamilton cycles, exactly or approximately. In Lisp.
- fractions: Find continued fractions; solve Pell's
equation. In Lisp.
- huffman: Huffman coding/decoding. In Lisp.
- bignum: Bignums and rationals. In ML.
- scheme: An implementation of Scheme in ML.
- church-turing: Shows how to use the pure untyped
lambda-calculus to implement everything you need to build a Turing
machine. For example, computes Ackermann's function on Church
numerals. In Scheme.
- svc: A fast C implementation of the membership
function for the Smith-Volterra-Cantor
set (also known as the "fat Cantor set"), a nowhere-dense
fractal subset of [0,1] that has Lebesgue measure 1/2. Of course,
we can't really work with real numbers, but this code provides a
discrete equivalent, e.g., a subset of the same shape that contains
half of the integers in [0,2n). (I wrote this
because I had a possible construction that involved SVC membership tests
on random real numbers, and I wanted to do a sanity check on its
desired properties before trying to prove anything.)
Jason Eisner - jason@cs.jhu.edu - Last Modification $Date: 2017/03/31 13:07:04 $ (GMT)