Joshua
an open-source statistical hierarchical phrase-based machine translation system
Classes
class | SentenceFilteredTrie |
Public Member Functions
Trie | getTrieRoot () |
boolean | hasRuleForSpan (int startIndex, int endIndex, int pathLength) |
int | getNumRules () |
int | getNumRules (Trie node) |
Rule | constructManualRule (int lhs, int[] sourceWords, int[] targetWords, float[] scores, int arity)
boolean | isRegexpGrammar () |
Package Functions
SentenceFilteredGrammar (AbstractGrammar baseGrammar, Sentence sentence)
Private Member Functions
SentenceFilteredTrie | filter (Trie unfilteredTrieRoot) |
void | filter (int i, SentenceFilteredTrie trieNode, boolean lastWasNT) |
SentenceFilteredTrie | filter_regexp (Trie unfilteredTrie) |
boolean | matchesSentence (Trie childTrie) |
Private Attributes
AbstractGrammar | baseGrammar |
SentenceFilteredTrie | filteredTrie |
int[] | tokens |
Sentence | sentence |
This class implements dynamic sentence-level filtering. This is accomplished with a parallel trie, a subset of the original trie, that only contains trie paths that are reachable from traversals of the current sentence.
joshua.decoder.ff.tm.SentenceFilteredGrammar.SentenceFilteredGrammar (AbstractGrammar baseGrammar, Sentence sentence) [package]
Construct a new sentence-filtered grammar. The main work is done in the enclosed trie (obtained from the base grammar, which contains the complete grammar).
Parameters:
baseGrammar | the base grammar containing the complete (unfiltered) rule set
sentence | the sentence to filter against
Rule joshua.decoder.ff.tm.SentenceFilteredGrammar.constructManualRule (int lhs, int[] sourceWords, int[] targetWords, float[] scores, int arity)
This is used to construct a manual rule supplied from outside the grammar, but the owner should be the same as the grammar's. The rule ID will be the same as OOVRuleId, and there is no lattice cost.
Reimplemented from joshua.decoder.ff.tm.hash_based.MemoryBasedBatchGrammar.
SentenceFilteredTrie joshua.decoder.ff.tm.SentenceFilteredGrammar.filter (Trie unfilteredTrieRoot) [private]
What is the algorithm?
Take the first word of the sentence, and start at the root of the trie. There are two things to consider: (a) word matches and (b) nonterminal matches.
For a word match, simply follow that arc along the trie. We create a parallel arc in our filtered grammar to represent it. Each arc in the filtered trie knows about its corresponding/underlying node in the unfiltered grammar trie.
A nonterminal is always permitted to match. The question then is how much of the input sentence we imagine it consumed. The answer is that it could have been any amount. So the recursive call has to be a set of calls, one each to the next trie node with different lengths of the sentence remaining.
A problem occurs when we have multiple sequential nonterminals. For scope-3 grammars, there can be four sequential nonterminals (in the case when they are grounded by terminals on both ends of the nonterminal chain). We'd like to avoid looking at all possible ways to split up the subsequence, because with respect to filtering rules, they are all the same.
We accomplish this with the following restriction: for purposes of grammar filtering, only the first in a sequence of nonterminal traversals can consume more than one word. Each of the subsequent ones would have to consume just one word. We then just have to record in the recursive call whether the last traversal was a nonterminal or not.
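The recursion described above can be sketched on a toy trie. Everything here is illustrative, not Joshua's actual API: `ToyTrie`, the single `[X]` nonterminal label, and this `filter` signature are hypothetical stand-ins that show the word-match arc, the variable-length nonterminal match, and the `lastWasNT` restriction that caps every nonterminal after the first in a chain at one word.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FilterSketch {
    static final String NT = "[X]"; // single generic nonterminal label

    // A toy trie: each node maps an arc label (a word or the NT) to a child.
    static class ToyTrie {
        final Map<String, ToyTrie> children = new HashMap<>();
        boolean hasRules = false; // true if some rule's source side ends here

        void add(String... path) {
            ToyTrie n = this;
            for (String label : path)
                n = n.children.computeIfAbsent(label, k -> new ToyTrie());
            n.hasRules = true;
        }
    }

    // (a) a word arc consumes exactly the matching word; (b) an NT arc may
    // consume any number of words, except that only the first NT in a chain
    // may consume more than one (tracked by lastWasNT).
    static void filter(String[] sentence, int i, ToyTrie node,
                       boolean lastWasNT, Deque<String> path, Set<String> reachable) {
        if (node.hasRules)
            reachable.add(String.join(" ", path)); // this source side is reachable
        if (i >= sentence.length)
            return;
        ToyTrie wordChild = node.children.get(sentence[i]); // (a) word match
        if (wordChild != null) {
            path.addLast(sentence[i]);
            filter(sentence, i + 1, wordChild, false, path, reachable);
            path.removeLast();
        }
        ToyTrie ntChild = node.children.get(NT); // (b) nonterminal match
        if (ntChild != null) {
            path.addLast(NT);
            int maxLen = lastWasNT ? 1 : sentence.length - i;
            for (int len = 1; len <= maxLen; len++)
                filter(sentence, i + len, ntChild, true, path, reachable);
            path.removeLast();
        }
    }

    public static void main(String[] args) {
        ToyTrie root = new ToyTrie();
        root.add("the", NT, "house"); // source side "the [X] house"
        root.add("green", NT);        // source side "green [X]" (unreachable)
        String[] sentence = {"the", "big", "red", "house"};
        Set<String> reachable = new TreeSet<>();
        for (int start = 0; start < sentence.length; start++)
            filter(sentence, start, root, false, new ArrayDeque<>(), reachable);
        System.out.println(reachable); // [the [X] house]
    }
}
```

Note how the `maxLen` cap implements the restriction: without it, a chain of k nonterminals would trigger one recursive call per way of splitting the remaining words, even though all splits filter identically.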
void joshua.decoder.ff.tm.SentenceFilteredGrammar.filter (int i, SentenceFilteredTrie trieNode, boolean lastWasNT) [private]
Matches rules against the sentence. Intelligently handles chains of sequential nonterminals. Marks arcs that are traversable for this sentence.
Parameters:
i | the position in the sentence to start matching
trieNode | the trie node to match against
lastWasNT | true if the match that brought us here was against a nonterminal
SentenceFilteredTrie joshua.decoder.ff.tm.SentenceFilteredGrammar.filter_regexp (Trie unfilteredTrie) [private]
Alternate filter that uses regular expressions, walking the grammar trie and matching the source side of each rule collection against the input sentence. Failed matches are discarded, and trie nodes extending from that position need not be explored.
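A minimal sketch of the idea, with one simplifying assumption: instead of walking the trie node by node, this collapses the check to matching a whole source side against the sentence. The `matchesSentence` helper here is a hypothetical stand-in (it is not the private method of the same name documented further down); nonterminal tokens such as `[X,1]` are replaced by a wildcard, and terminal tokens are treated as regular expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexpFilterSketch {
    // Replace nonterminal tokens such as [X,1] with a wildcard, treat the
    // remaining tokens as regular expressions, and test whether the result
    // occurs anywhere in the sentence. Word-boundary handling is omitted
    // for brevity, so "b.g" can also match inside a longer word.
    static boolean matchesSentence(String sourceSide, String sentence) {
        List<String> parts = new ArrayList<>();
        for (String token : sourceSide.split(" "))
            parts.add(token.matches("\\[.*\\]") ? ".+" : token);
        String pattern = ".*" + String.join(" ", parts) + ".*";
        return Pattern.compile(pattern).matcher(sentence).matches();
    }

    public static void main(String[] args) {
        String sentence = "the big red house";
        System.out.println(matchesSentence("b.g [X,1]", sentence));   // true
        System.out.println(matchesSentence("green [X,1]", sentence)); // false
    }
}
```

A failed match here corresponds to discarding a rule collection; in the trie walk, it additionally means the nodes extending from that point need not be explored.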
int joshua.decoder.ff.tm.SentenceFilteredGrammar.getNumRules ()
Gets the number of rules stored in the grammar.
Reimplemented from joshua.decoder.ff.tm.hash_based.MemoryBasedBatchGrammar.
int joshua.decoder.ff.tm.SentenceFilteredGrammar.getNumRules (Trie node)
A convenience function that counts the number of rules in a grammar's trie.
Parameters:
node | the trie node whose subtree is counted
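Such a count is a straightforward depth-first traversal. A sketch on a toy node type (the `Node` class and its fields are hypothetical, not Joshua's `Trie` interface):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RuleCountSketch {
    static class Node {
        final Map<String, Node> children = new LinkedHashMap<>();
        final List<String> rules = new ArrayList<>(); // rules stored at this node
    }

    // Depth-first count of every rule stored at or beneath `node`.
    static int getNumRules(Node node) {
        int count = node.rules.size();
        for (Node child : node.children.values())
            count += getNumRules(child);
        return count;
    }

    public static void main(String[] args) {
        Node root = new Node();
        Node the = new Node();
        Node house = new Node();
        root.children.put("the", the);
        the.children.put("house", house);
        the.rules.add("the [X] -> le [X]");
        house.rules.add("the house -> la maison");
        house.rules.add("the house -> la casa");
        System.out.println(getNumRules(root)); // 3
    }
}
```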
Trie joshua.decoder.ff.tm.SentenceFilteredGrammar.getTrieRoot ()
Gets the root of the Trie backing this grammar.
Note: This method should run as a small constant-time function.
Returns:
the Trie backing this grammar
Reimplemented from joshua.decoder.ff.tm.hash_based.MemoryBasedBatchGrammar.
boolean joshua.decoder.ff.tm.SentenceFilteredGrammar.hasRuleForSpan (int startIndex, int endIndex, int pathLength)
This function is poorly named: it does not return whether a rule exists in the grammar for the current span, but whether the grammar is permitted to apply rules to the current span (a grammar-level parameter). As such, we can simply chain to the underlying grammar.
Reimplemented from joshua.decoder.ff.tm.hash_based.MemoryBasedBatchGrammar.
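The chaining pattern can be sketched as follows. All names here (`Grammar`, `BaseGrammar`, `FilteredGrammar`, the span-width rule inside `BaseGrammar`) are illustrative stand-ins for the real classes, chosen only to show the delegation:

```java
public class SpanCheckSketch {
    interface Grammar {
        boolean hasRuleForSpan(int startIndex, int endIndex, int pathLength);
    }

    // Stand-in for the base grammar: assume spans are permitted up to a
    // fixed width limit (a grammar-level parameter).
    static class BaseGrammar implements Grammar {
        final int maxSpan;
        BaseGrammar(int maxSpan) { this.maxSpan = maxSpan; }
        public boolean hasRuleForSpan(int startIndex, int endIndex, int pathLength) {
            return endIndex - startIndex <= maxSpan;
        }
    }

    // The filtered grammar chains straight to the underlying grammar, since
    // the answer does not depend on the filtered trie at all.
    static class FilteredGrammar implements Grammar {
        final Grammar baseGrammar;
        FilteredGrammar(Grammar baseGrammar) { this.baseGrammar = baseGrammar; }
        public boolean hasRuleForSpan(int startIndex, int endIndex, int pathLength) {
            return baseGrammar.hasRuleForSpan(startIndex, endIndex, pathLength);
        }
    }

    public static void main(String[] args) {
        Grammar g = new FilteredGrammar(new BaseGrammar(10));
        System.out.println(g.hasRuleForSpan(0, 5, 0));  // true
        System.out.println(g.hasRuleForSpan(3, 20, 0)); // false
    }
}
```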
boolean joshua.decoder.ff.tm.SentenceFilteredGrammar.isRegexpGrammar ()
This returns true if the grammar contains rules that are regular expressions, possibly matching many different inputs.
Reimplemented from joshua.decoder.ff.tm.hash_based.MemoryBasedBatchGrammar.
boolean joshua.decoder.ff.tm.SentenceFilteredGrammar.matchesSentence (Trie childTrie) [private]
int[] joshua.decoder.ff.tm.SentenceFilteredGrammar.tokens [private]