Joshua
open source statistical hierarchical phrase-based machine translation system
joshua.decoder.io.DeNormalize Class Reference

Static Public Member Functions

static String processSingleLine (String normalized)
static String capitalizeLineFirstLetter (String line)
static String joinPunctuationMarks (String line)
static String joinHyphen (String line)
static String joinContractions (String line)
static String capitalizeNameTitleAbbrvs (String line)
static String capitalizeI (String line)
static String replaceBracketTokens (String line)

Detailed Description

Denormalizes an (English) string in the ways listed below.

  • Capitalize the first character in the string
  • Detokenize
    • Delete whitespace in front of periods and commas
    • Join contractions
    • Capitalize name titles (Mr Ms Miss Dr etc.)
    • TODO: Handle surrounding characters ([{<"''">}])
    • TODO: Join multi-period abbreviations (e.g. M.Phil. i.e.)
    • TODO: Handle ambiguities like "st.", which can be an abbreviation for both "Saint" and "street"
    • TODO: Capitalize both the title and the name of a person, e.g. Mr. Morton (named entities should be demarcated).

N.B. These methods all assume that every translation result to be denormalized has the following format:

  • There is only one space between every pair of tokens
  • There is no whitespace before the first token
  • There is no whitespace after the final token
  • Standard spaces are the only type of whitespace
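
The format assumptions above can be checked before denormalizing. The following is a minimal sketch, not part of Joshua: the class name `NormalizedFormatCheck` and the method `isWellFormed` are hypothetical, and "whitespace" is taken to mean whatever `Character.isWhitespace` reports.

```java
// Hypothetical helper, not part of the Joshua API.
final class NormalizedFormatCheck {
    private NormalizedFormatCheck() {}

    /** Returns true if the line meets the four format assumptions listed above. */
    static boolean isWellFormed(String line) {
        if (line.isEmpty()) {
            return true;
        }
        // No whitespace before the first token or after the final token.
        if (line.charAt(0) == ' ' || line.charAt(line.length() - 1) == ' ') {
            return false;
        }
        // Exactly one space between every pair of tokens.
        if (line.contains("  ")) {
            return false;
        }
        // Standard spaces are the only whitespace allowed (tabs, newlines, etc. fail).
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != ' ' && Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, `isWellFormed("he said , hello .")` holds, while a line with a doubled space, a tab, or a trailing space is rejected.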

Member Function Documentation

static String joshua.decoder.io.DeNormalize.capitalizeI ( String  line) [static]
static String joshua.decoder.io.DeNormalize.capitalizeLineFirstLetter ( String  line) [static]

Capitalize the first letter of a line. This should be the last denormalization step applied to a line.

Parameters:
line  The single-line input string
Returns:
The input string modified as described above
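
As an illustration only, first-letter capitalization could look like the sketch below. It is not the Joshua source: the class and method names are hypothetical, and it assumes that the first character of the line is the one to capitalize.

```java
// Hypothetical sketch, not the Joshua implementation.
final class FirstLetterSketch {
    private FirstLetterSketch() {}

    /** Upper-cases the first character of the line; null and empty input are returned as-is. */
    static String capitalizeFirst(String line) {
        if (line == null || line.isEmpty()) {
            return line;
        }
        return Character.toUpperCase(line.charAt(0)) + line.substring(1);
    }
}
```

Running this step last means that earlier steps, which may join or rewrite tokens, cannot undo or shift the capitalized character.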

static String joshua.decoder.io.DeNormalize.capitalizeNameTitleAbbrvs ( String  line) [static]

Capitalize the first character of name titles: Mr, Mrs, Ms, Miss, Dr, Prof.

Parameters:
line  The single-line input string
Returns:
The input string modified as described above
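
One possible way to capitalize the listed titles is a whole-word regular expression. The sketch below is an assumption about how this could be done, not the Joshua source: the names `NameTitleSketch` and `capitalizeTitles` are hypothetical, and it assumes the titles arrive in lower case as standalone tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Joshua implementation.
final class NameTitleSketch {
    // Whole-word match on the titles listed in the description above.
    private static final Pattern TITLE =
            Pattern.compile("\\b(mr|mrs|ms|miss|dr|prof)\\b");

    private NameTitleSketch() {}

    /** Upper-cases the first character of each matched title token. */
    static String capitalizeTitles(String line) {
        Matcher m = TITLE.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String title = m.group(1);
            String capitalized = Character.toUpperCase(title.charAt(0)) + title.substring(1);
            m.appendReplacement(out, Matcher.quoteReplacement(capitalized));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The word boundaries keep words such as "drive" or "professor" from being treated as titles.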

static String joshua.decoder.io.DeNormalize.joinContractions ( String  line) [static]

Scanning the line from left-to-right, a contraction suffix preceded by a space will become just the contraction suffix.

I.e., the preceding space will be deleted, joining the prefix to the suffix.

E.g., "wo n't" becomes "won't".

Parameters:
line  The single-line input string
Returns:
The input string modified as described above
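
The joining rule described above can be sketched with a single regular-expression replacement. This is an illustration under stated assumptions, not the Joshua source: the class and method names are hypothetical, and the suffix list (n't, 's, 'm, 're, 've, 'll, 'd) is assumed, since the documentation does not enumerate the suffixes that are handled.

```java
// Hypothetical sketch, not the Joshua implementation.
final class ContractionSketch {
    // Assumed suffix list; a space before any of these (as a whole token) is removed.
    private static final String SPACE_BEFORE_SUFFIX =
            " (n't|'(?:s|m|re|ve|ll|d))(?= |$)";

    private ContractionSketch() {}

    /** Deletes the space that separates a contraction suffix from its prefix. */
    static String joinContractions(String line) {
        return line.replaceAll(SPACE_BEFORE_SUFFIX, "$1");
    }
}
```

With this sketch, `joinContractions("wo n't")` returns `won't`, and `"i 'm sure she 's here"` becomes `"i'm sure she's here"`.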

static String joshua.decoder.io.DeNormalize.joinHyphen ( String  line) [static]

Scanning from left-to-right, a hyphen surrounded by a space before and after it will become just the hyphen.

Parameters:
line  The single-line input string
Returns:
The input string modified as described above
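
A minimal sketch of this rule, assuming a plain literal replacement is enough (the class and method names are hypothetical and this is not the Joshua source):

```java
// Hypothetical sketch, not the Joshua implementation.
final class HyphenSketch {
    private HyphenSketch() {}

    /** Replaces each " - " (space, hyphen, space) with a bare hyphen, left to right. */
    static String joinHyphen(String line) {
        // String.replace does literal, non-overlapping replacement.
        return line.replace(" - ", "-");
    }
}
```

For example, `joinHyphen("state - of - the - art")` returns `state-of-the-art`. A hyphen at the very start or end of the line has a space on only one side and is left untouched by this sketch.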

static String joshua.decoder.io.DeNormalize.joinPunctuationMarks ( String  line) [static]

Scanning from left-to-right, a comma or period preceded by a space will become just the comma/period.

Parameters:
line  The single-line input string
Returns:
The input string modified as described above
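
The rule above amounts to deleting the space in front of each comma or period. A minimal sketch, with hypothetical class and method names and no claim to match the Joshua source:

```java
// Hypothetical sketch, not the Joshua implementation.
final class PunctuationSketch {
    private PunctuationSketch() {}

    /** Removes the single space that precedes a comma or a period. */
    static String joinPunctuation(String line) {
        return line.replace(" ,", ",").replace(" .", ".");
    }
}
```

For example, `joinPunctuation("hello , world .")` returns `hello, world.`.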

static String joshua.decoder.io.DeNormalize.processSingleLine ( String  normalized) [static]

Apply all the denormalization methods to the normalized input line.

Parameters:
normalized  The normalized single-line input string
Returns:
The denormalized string
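
Applying every step in sequence can be pictured as a small pipeline of string transformations. The sketch below is illustrative only: the class name, the step list, and the step order are assumptions, and the lambdas are simplified stand-ins rather than the real `DeNormalize` methods. The only ordering constraint taken from this page is that first-letter capitalization runs last.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not the Joshua implementation.
final class PipelineSketch {
    // Simplified stand-ins for three of the steps; the real class has more.
    private static final List<UnaryOperator<String>> STEPS = Arrays.asList(
            s -> s.replace(" ,", ",").replace(" .", "."),   // join punctuation marks
            s -> s.replace(" - ", "-"),                      // join hyphens
            s -> s.isEmpty() ? s                             // capitalize first letter (last step)
                    : Character.toUpperCase(s.charAt(0)) + s.substring(1));

    private PipelineSketch() {}

    /** Feeds the output of each step into the next and returns the final string. */
    static String process(String normalized) {
        String result = normalized;
        for (UnaryOperator<String> step : STEPS) {
            result = step.apply(result);
        }
        return result;
    }
}
```

With these stand-in steps, `process("the well - known author , mr smith , left .")` returns `The well-known author, mr smith, left.`.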

static String joshua.decoder.io.DeNormalize.replaceBracketTokens ( String  line) [static]

Case-insensitively replace all of the character sequences that represent a bracket character.

Keys are token representations of abbreviations of titles for names that capitalize more than just the first letter.
Bracket token sequences: -lrb- -rrb- -lsb- -rsb- -lcb- -rcb-

See http://www.cis.upenn.edu/~treebank/tokenization.html

Parameters:
line  The single-line input string
Returns:
The input string modified as described above
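
A case-insensitive replacement of the six bracket tokens could be sketched as below. The token-to-character mapping follows the Penn Treebank convention (-lrb-/-rrb- for parentheses, -lsb-/-rsb- for square brackets, -lcb-/-rcb- for curly braces). The class and method names are hypothetical and this is not the Joshua source.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Joshua implementation.
final class BracketTokenSketch {
    private static final Map<String, String> BRACKETS = new HashMap<>();
    static {
        BRACKETS.put("lrb", "(");
        BRACKETS.put("rrb", ")");
        BRACKETS.put("lsb", "[");
        BRACKETS.put("rsb", "]");
        BRACKETS.put("lcb", "{");
        BRACKETS.put("rcb", "}");
    }

    // Matches any of the six tokens regardless of letter case.
    private static final Pattern TOKEN =
            Pattern.compile("-(lrb|rrb|lsb|rsb|lcb|rcb)-", Pattern.CASE_INSENSITIVE);

    private BracketTokenSketch() {}

    /** Replaces each bracket token with the bracket character it stands for. */
    static String replaceBrackets(String line) {
        Matcher m = TOKEN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String bracket = BRACKETS.get(m.group(1).toLowerCase(Locale.ROOT));
            m.appendReplacement(out, Matcher.quoteReplacement(bracket));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `replaceBrackets("see -LRB- figure 1 -rrb-")` returns `see ( figure 1 )`. The spaces around the brackets remain, since this sketch only swaps tokens for characters.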
