Joshua
open source statistical hierarchical phrase-based machine translation system
joshua.corpus.Corpus Interface Reference

Public Member Functions

int getWordID (int position)
int getSentenceIndex (int position)
int[] getSentenceIndices (int[] positions)
int getSentencePosition (int sentenceID)
int getSentenceEndPosition (int sentenceID)
Phrase getSentence (int sentenceIndex)
int size ()
int getNumSentences ()
int comparePhrase (int corpusStart, Phrase phrase, int phraseStart, int phraseEnd)
int comparePhrase (int corpusStart, Phrase phrase)
int compareSuffixes (int position1, int position2, int maxComparisonLength)
ContiguousPhrase getPhrase (int startPosition, int endPosition)
Iterable< Integer > corpusPositions ()

Detailed Description

Corpus is an interface that contains methods for accessing the information within a monolingual corpus.

Author:
Chris Callison-Burch
Since:
7 February 2005
Version:
LastChangedDate:
2008-07-30 17:15:52 -0400 (Wed, 30 Jul 2008)

Member Function Documentation

int joshua.corpus.Corpus.comparePhrase ( int  corpusStart,
Phrase  phrase,
int  phraseStart,
int  phraseEnd 
)

Compares the corpus text that starts at position corpusStart with the subphrase of phrase delimited by phraseStart and phraseEnd.

Parameters:
corpusStart  the point in the corpus where the comparison begins
phrase  the superphrase that the comparison phrase is drawn from
phraseStart  the point in the phrase where the comparison begins (inclusive)
phraseEnd  the point in the phrase where the comparison ends (exclusive)
Returns:
an int that follows the conventions of java.util.Comparator.compare()
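
The comparison described above can be sketched as follows. This is only an illustration, not the Joshua implementation: plain int[] arrays of word IDs stand in for the Corpus and Phrase types, and the class name ComparePhraseSketch is hypothetical. The reference does not say how a corpus that ends before the subphrase is exhausted should be ordered, so the sketch assumes the shorter side sorts first. Only the sign of the result is meaningful.

```java
final class ComparePhraseSketch {
    /**
     * Compares corpus[corpusStart...] with phrase[phraseStart, phraseEnd).
     * Both arrays hold word IDs (an assumption made for this sketch).
     */
    static int comparePhrase(int[] corpus, int corpusStart,
                             int[] phrase, int phraseStart, int phraseEnd) {
        for (int i = 0; i < phraseEnd - phraseStart; i++) {
            int corpusPos = corpusStart + i;
            // Assumption: running off the end of the corpus sorts first.
            if (corpusPos >= corpus.length) {
                return -1;
            }
            int diff = Integer.compare(corpus[corpusPos], phrase[phraseStart + i]);
            if (diff != 0) {
                return diff;
            }
        }
        return 0; // every word of the subphrase matched
    }

    public static void main(String[] args) {
        int[] corpus = {4, 7, 9, 2, 7, 9, 5};
        int[] phrase = {1, 7, 9, 5};
        // Subphrase [1, 3) is {7, 9}; the corpus holds {7, 9} at position 1.
        System.out.println(comparePhrase(corpus, 1, phrase, 1, 3));
        // Subphrase [1, 4) is {7, 9, 5}; the corpus holds {7, 9, 2} at position 1.
        System.out.println(comparePhrase(corpus, 1, phrase, 1, 4));
    }
}
```

The two-argument overload documented next would correspond to calling this sketch with phraseStart = 0 and phraseEnd = phrase.length.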
int joshua.corpus.Corpus.comparePhrase ( int  corpusStart,
Phrase  phrase 
)

Compares the corpus text that starts at position corpusStart with the entire phrase passed in.

Parameters:
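corpusStart  the point in the corpus where the comparison begins
phrase  the phrase to compare against the corpus text
Returns:
an int that follows the conventions of java.util.Comparator.compare()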
int joshua.corpus.Corpus.compareSuffixes ( int  position1,
int  position2,
int  maxComparisonLength 
)

Compares the suffixes starting at positions position1 and position2.

Parameters:
position1  the position in the corpus where the first suffix begins
position2  the position in the corpus where the second suffix begins
maxComparisonLength  a cutoff point to stop the comparison
Returns:
an int that follows the conventions of java.util.Comparator.compare()
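
A minimal sketch of a bounded suffix comparison, assuming the corpus is an int[] of word IDs. The class name CompareSuffixesSketch is hypothetical, and the rule that a suffix reaching the end of the corpus sorts before a longer one is an assumption; this reference does not specify it.

```java
final class CompareSuffixesSketch {
    /**
     * Compares the suffixes of corpus that start at position1 and position2,
     * looking at no more than maxComparisonLength words of each.
     */
    static int compareSuffixes(int[] corpus, int position1, int position2,
                               int maxComparisonLength) {
        for (int i = 0; i < maxComparisonLength; i++) {
            int p1 = position1 + i;
            int p2 = position2 + i;
            boolean ended1 = p1 >= corpus.length;
            boolean ended2 = p2 >= corpus.length;
            if (ended1 || ended2) {
                // Assumption: the suffix that ends first sorts first.
                if (ended1 && ended2) {
                    return 0;
                }
                return ended1 ? -1 : 1;
            }
            int diff = Integer.compare(corpus[p1], corpus[p2]);
            if (diff != 0) {
                return diff;
            }
        }
        return 0; // equal within the cutoff
    }

    public static void main(String[] args) {
        int[] corpus = {3, 1, 2, 3, 1, 5};
        // The suffixes at 0 and 3 share their first two words (3, 1).
        System.out.println(compareSuffixes(corpus, 0, 3, 2));
        // With a cutoff of 3 the third words (2 and 5) are compared too.
        System.out.println(compareSuffixes(corpus, 0, 3, 3));
    }
}
```

The cutoff bounds the cost of each comparison, which matters when the comparison is used to sort many corpus positions.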
Iterable<Integer> joshua.corpus.Corpus.corpusPositions ( )

Gets an object capable of iterating over all positions in the corpus, in order.

Returns:
An object capable of iterating over all positions in the corpus, in order.
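
One way such an Iterable could be built is sketched below. It assumes positions are simply the integers from 0 up to the corpus size minus one; the class name CorpusPositionsSketch and the size parameter are hypothetical and stand in for a real Corpus implementation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class CorpusPositionsSketch {
    /** Returns an Iterable over the positions 0, 1, ..., size - 1, in order. */
    static Iterable<Integer> corpusPositions(final int size) {
        // Each call to iterator() starts a fresh pass over the positions.
        return () -> new Iterator<Integer>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < size;
            }

            @Override
            public Integer next() {
                if (position >= size) {
                    throw new NoSuchElementException();
                }
                return position++;
            }
        };
    }

    public static void main(String[] args) {
        // Usable directly in a for-each loop.
        for (int position : corpusPositions(3)) {
            System.out.println(position);
        }
    }
}
```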

int joshua.corpus.Corpus.getNumSentences ( )

Gets the number of sentences in the corpus.

Returns:
the number of sentences in the corpus.
ContiguousPhrase joshua.corpus.Corpus.getPhrase ( int  startPosition,
int  endPosition 
)
Parameters:
startPosition
endPosition
Returns:
Phrase joshua.corpus.Corpus.getSentence ( int  sentenceIndex)

Gets the specified sentence as a phrase.

Parameters:
sentenceIndex  Zero-based sentence index
Returns:
the sentence, or null if the specified sentence number doesn't exist

int joshua.corpus.Corpus.getSentenceEndPosition ( int  sentenceID)

Gets the exclusive end position of a sentence in the corpus.

Returns:
the position in the corpus one past the last word of the specified sentence. If the sentenceID is outside of the bounds of the sentences, then it returns one past the last position in the corpus.

int joshua.corpus.Corpus.getSentenceIndex ( int  position)

Gets the sentence index associated with the specified position in the corpus.

Parameters:
position  Index into the corpus
Returns:
the sentence index associated with the specified position in the corpus.
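
A position can be mapped to its sentence with a binary search over the sentence start positions. The sketch below is one possible approach, not necessarily the one Joshua uses. It assumes a hypothetical sorted array sentenceStarts that holds the corpus position of the first word of each sentence, with no empty sentences, and it does not validate that the position lies inside the corpus.

```java
import java.util.Arrays;

final class SentenceIndexSketch {
    /**
     * Returns the index of the sentence containing position.
     * sentenceStarts must be strictly increasing (assumption of this sketch).
     */
    static int getSentenceIndex(int[] sentenceStarts, int position) {
        int found = Arrays.binarySearch(sentenceStarts, position);
        if (found >= 0) {
            return found; // position is the first word of sentence `found`
        }
        // Not found: binarySearch returned -(insertionPoint) - 1, and the
        // containing sentence is the one just before the insertion point.
        return -found - 2;
    }

    public static void main(String[] args) {
        int[] sentenceStarts = {0, 3, 5}; // three sentences
        for (int position = 0; position < 8; position++) {
            System.out.println(position + " -> "
                    + getSentenceIndex(sentenceStarts, position));
        }
    }
}
```

Applying the same lookup to every element of an array of positions would give the behaviour described for getSentenceIndices.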
int [] joshua.corpus.Corpus.getSentenceIndices ( int[]  positions)

Gets the sentence index of each specified position.

Parameters:
positions  Indices into the corpus
Returns:
array of the sentence indices associated with the specified positions in the corpus.

int joshua.corpus.Corpus.getSentencePosition ( int  sentenceID)

Gets the position in the corpus of the first word of the specified sentence. If the sentenceID is outside of the bounds of the sentences, then it returns the last position in the corpus + 1.

Returns:
the position in the corpus of the first word of the specified sentence. If the sentenceID is outside of the bounds of the sentences, then it returns the last position in the corpus + 1.
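
The start and exclusive end positions of a sentence, including the out-of-bounds behaviour described in this reference, can be sketched as below. The sentenceStarts array, the corpusSize parameter, and the class name SentenceBoundsSketch are assumptions made for illustration; they are not part of the Corpus interface.

```java
final class SentenceBoundsSketch {
    /** Position of the first word of the sentence, or corpusSize if out of bounds. */
    static int getSentencePosition(int[] sentenceStarts, int corpusSize, int sentenceID) {
        if (sentenceID < 0 || sentenceID >= sentenceStarts.length) {
            return corpusSize; // last position in the corpus + 1
        }
        return sentenceStarts[sentenceID];
    }

    /** One past the last word of the sentence, or corpusSize if out of bounds. */
    static int getSentenceEndPosition(int[] sentenceStarts, int corpusSize, int sentenceID) {
        // The last sentence ends where the corpus ends, so it shares this branch.
        if (sentenceID < 0 || sentenceID >= sentenceStarts.length - 1) {
            return corpusSize;
        }
        // Any other sentence ends where the next one starts.
        return sentenceStarts[sentenceID + 1];
    }

    public static void main(String[] args) {
        int[] sentenceStarts = {0, 3, 5}; // three sentences
        int corpusSize = 8;               // eight words in total
        for (int id = 0; id <= 3; id++) {
            System.out.println("sentence " + id + ": ["
                    + getSentencePosition(sentenceStarts, corpusSize, id) + ", "
                    + getSentenceEndPosition(sentenceStarts, corpusSize, id) + ")");
        }
    }
}
```

With half-open ranges like these, the length of a sentence is its end position minus its start position.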
int joshua.corpus.Corpus.getWordID ( int  position)
Returns:
the integer representation of the Word at the specified position in the corpus.

int joshua.corpus.Corpus.size ( )

Gets the number of words in the corpus.

Returns:
the number of words in the corpus.