Training Tree Transducers

The paper by Graehl, Knight, and May was published in the journal Computational Linguistics.

4. Extended LHS Tree Transducers
A weighted extended-lhs top-down tree transducer M is a quintuple (sigma, delta, Q, Qi, R): sigma is the input alphabet, delta the output alphabet, Q the set of states, Qi the initial state, and R the set of weighted transformation rules. One component of each rule in R is a finite tree pattern (an element of xTPATsigma), which is a function from an input tree to a boolean (0/1). This pattern is usually abbreviated as lambda in the text. For the rule:
q A(x0:B, C) -w-> q' x0, lambda is (labelandrankt(()) = (A,2) and labelt((1)) = B (rank is excluded, as B is assigned to be a variable?) and labelandrankt((2)) = (C,0)).
The binary derivation relation on a transducer M relating two trees a and b, =>M, is:
{((a, h), (b, h.(i, (q, lambda, rhs, w)))) such that (q, lambda, rhs, w) is an element of R, i is an element of paths_a, q = label_a(i), lambda(a sub i) = 1, and b is a with its ith subtree replaced by the right-hand side of the rule, where each path p in the rhs with a label (q', i') is assigned q'(a sub i.i'), which is the subtree at path i' relative to a sub i}. The replaced subtree will have the label q', which will be the input state for subsequent rule applications. Note that h is a variable which holds a list of the derivation history. The text uses the expression lambda(a sub i.(1)), which seems erroneous to me, as this would be referring to the first child of the ith subtree, when in fact we want to apply the rule to the ith subtree itself.
Sources(r) is another term defined in this section. Sources are the input paths referenced by the rhs of a rule r: the set of all paths in the right-hand side of a rule labelled with a state q (one would assume all of these paths are leaves, but this may not be the case; A(x0(B(x1))) seems possible). If these relative paths do not exist in the input subtree, then the rule does not apply. This term is used to define a linear transducer, one in which, for all rules, the sources are a frontier and each is used at most once. A concrete transducer is one in which the finite tree patterns (xTPATsigma) specify the labels and ranks of all nodes in an input subtree.
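
To make the pattern and path notation concrete, here is a minimal sketch of how the lambda for the rule q A(x0:B, C) -w-> q' x0 could be written, under my own assumptions (trees encoded as (label, children) pairs, 0-indexed child positions rather than the 1-indexed paths used in the text); it is an illustration, not the paper's code.

def subtree(t, path):
    """Return the subtree of t at the given path (a tuple of child indices)."""
    for i in path:
        t = t[1][i]
    return t

def label_and_rank(t, path):
    node = subtree(t, path)
    return node[0], len(node[1])

# Pattern for  q A(x0:B, C) -w-> q' x0 :
# the root is (A, 2), child 1 has label B (rank unconstrained, since it is bound
# to the variable x0), and child 2 is (C, 0).
def example_pattern(t):
    try:
        return (label_and_rank(t, ()) == ("A", 2)
                and subtree(t, (0,))[0] == "B"
                and label_and_rank(t, (1,)) == ("C", 0))
    except IndexError:
        return False

tree = ("A", [("B", [("x", [])]), ("C", [])])
print(example_pattern(tree))  # True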

5. Parsing an xT Tree Relation
After defining a tree-to-tree transducer M=(sigma, delta, Q, Qi, R), the text proceeds to outline how to produce the derivation trees for an input/output tree pair. It does this via an algorithm which creates a weighted regular tree grammar (wRTG) to produce the derivation trees.
If the algorithm below returns false (no derivation trees exist), the transducer cannot translate the given input to the given output. By adjusting the transducer's rules, a user might determine how translation can occur.
A derivation tree has transducer rules as its nodes, so the tree indicates the order of rule application which generates the output from the input. The grammar consists of productions of the form:
state.inputpos.outputpos -----> rule(q.node1inputpos.node1outputpos, q.node2inputpos.node2outputpos, ...)
Here nodeinputpos is the position of the node in the input tree (e.g. first child of first child of root = 11) and nodeoutputpos is its position in the output tree. q is the name of a state (in the example, all states are named "q"). Separate derivation trees will branch from a single tree when there is a single state.inputpos.outputpos nonterminal that:
1. is the output of a rule already applied via the production's right hand side
2. is on the left side of multiple productions with an output which is the child of the current output node or that same node
The rules mentioned above with regard to the right-hand sides of productions are the rules of the transducer M. The nodes accounted for by the right-hand side of a production (in the tree beneath the rule) are the nonterminals of the right-hand side of a transducer rule. A derivation tree corresponds to an M transduction
(I, ()) =>*M (O, h), where h is the history (sequence) of rule applications.
The algorithm for the method Deriv invokes the method Produce. If Produce returns true, then some derivations of output tree O from input tree I via transducer M exist. In this case, the set N holds all nonterminals in the productions, i.e. elements of the form (Q x pathsI x pathsO). If Produce returns false, then Deriv returns false. The arguments Deriv passes to Produce are (S, (), ()): the start state and the root paths of the input and output trees.
Produce takes arguments in the triple (q, i, o) for state, input tree node, and output tree node. The method attempts to find all rules of the transducer which map i to o, and then calls itself recursively on any nonterminals that appear in those rules' output. The first instruction checks whether the derivation trees have already been found (or found not to exist): if this argument triple has been seen before, the method returns the cached result (true or false). If nothing is found, the function adds an item (alpha, true) to its memo cache, where alpha represents the input (q, i, o). This is in case the productions form a cycle which returns to this input via a recursive call from the current stack frame. It then sets the variable anyrule to false; this value will be returned if no rule yields a valid production.
The for loop which ensues iterates over every rule of the transducer. For each rule whose left-hand side matches the input node i (written as a path of the input tree I) and whose right-hand side contains only nodes that have relative positions in accord with their absolute positions in the output tree O beneath the node o:
1. stores all nonterminals in form qx0 in variables o1, o2, ... , on
2. creates a new tree derivrhs to store the results of applying this rule (e.g. moving node at 21 (1st child of second child of root) in input to 12 in output). The label and rank of the root will be (r, n) where r is the rule and n the number of output nonterminals.
Then, the algorithm iterates over all output nonterminals for that transducer rule (all nodes which are labelled with a state q). The next (q, i, o) triple is set to be (q', i.i', o.o'), where q' is the state of the output nonterminal, i.i' is the absolute path in the input tree I of the left-hand side's path to the nonterminal, and o.o' is the absolute path in the output tree of the right-hand side's path to the nonterminal.
Produce is called recursively on this triple. If the recursive invocation returns false, then the rule is not used in the derivation (as it outputs nonterminals which cannot be expanded). Otherwise, the jth child of derivrhs is set to (q', i.i', o.o'). If the for loop over all output nonterminals completes for this rule r, anyrule is set to true, and a new production is added, with (q, i, o) on the left, derivrhs on the right, and weight w (the weight of the rule in the transducer).
After all rules have been checked, the memo entry for (q, i, o) is updated with the value of anyrule, overwriting the temporary true value saved at the beginning of the function. Produce returns anyrule.
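
To tie the walkthrough together, here is a schematic Python sketch of the memoized recursion (my own rendering, not the paper's pseudocode): rule matching is abstracted behind a hypothetical applies(I, i, O, o) method, which returns None on failure or the list of output nonterminals (q', relative input path, relative output path) on success, and the rule's weight is assumed to be rule.w. Paths are tuples, so i + i_rel is path concatenation.

def deriv(rules, start_state, I, O):
    memo = {}                # (q, i, o) -> True/False
    productions = []         # ((q, i, o), rule, [child nonterminals], weight)

    def produce(q, i, o):
        key = (q, i, o)
        if key in memo:
            return memo[key]
        memo[key] = True     # provisional entry, to cut off cycles back to this triple
        anyrule = False
        for rule in rules:
            nts = rule.applies(I, i, O, o)
            if nts is None:
                continue
            children, ok = [], True
            for (q2, i_rel, o_rel) in nts:
                child = (q2, i + i_rel, o + o_rel)   # absolute paths i.i', o.o'
                if not produce(*child):
                    ok = False
                    break
                children.append(child)
            if ok:
                anyrule = True
                productions.append((key, rule, children, rule.w))
        memo[key] = anyrule  # overwrite the provisional entry
        return anyrule

    found = produce(start_state, (), ())
    return productions if found else False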
Notes:
1. The text says that each derivation tree (tree of rules) corresponds 1-1 (isomorphic) to a leftmost derivation.
2. The diagrams of both derivation trees do not match the grammar in the left-right order of rules. For instance, rule3 precedes rule 2 in the first tree (as it would in a leftmost derivation), but the state/input to which rule 3 applies follows rule2's state input in the grammar.
3. Time and space complexity is O(|Q|·|I|·|O|·|R|), where Q is the set of states defined by the transducer, I the set of nodes in the input tree, O the set of nodes in the output tree, and R the set of rules. An equivalent expression is O(Gn^2), where n is the total size of the input and output trees and G is the grammar constant accounting for states and rules (including their size). It seems that there might need to be an additional |O| factor accounting for the for loop over all output nonterminals produced by a rule; however, this is apparently already accounted for by |Q||I||O|. Any output of a rule, if the rule matched, represents a path in the input tree and a path in the output tree, and any of the |Q| possible states can be appended to these paths in the output of one production and the input of another. Produce can thus be invoked at most |Q||I||O| times, and each time all |R| rules are traversed. As Deriv does nothing other than call Produce once and then cull the yields from all productions (O(|R||Q||I||O|)), the total complexity is O(|R||Q||I||O|).

After the grammar is generated by Deriv, it can be post-processed by RTGPrune, which removes all productions which contain nonterminals on the left which are never derived from the start state or which contain nonterminals on the right which do not expand to trees composed entirely of terminals.
The complexity of this algorithm is O(|N| + sum_{i=1..m} (1 + |paths_ti|)).
The algorithm maintains two arrays:
1. A[n] indicates that nonterminal n can be reached from S.
2. B[n] indicates that nonterminal n can derive a tree containing no nonterminals.
The initialization of B[n] (line 1) and of A[n] (the second line before the end of RTGPrune) each runs in O(|N|).
After initialization of B[n] and Adj[n] (which indicates the productions in whose right-hand sides each nonterminal appears, and is initialized to null), the algorithm loops over all productions (p1, p2, ..., pm) of the form pi = (qi, ti, wi): qi is the state on the left, ti the tree on the right, and wi the weight. The labels of all nonterminals on the right side are stored in the set Y. Each nonterminal in Y then adds production i to its adjacency list, which is the list of all productions in which that nonterminal appears on the right-hand side. If there are no nonterminals on the right-hand side of the production, then its left-hand nonterminal, qi, is added to the list M (upon which we call Reach later, since we know that B[qi] can be set to true). The cardinality of the set Y is stored in r[i] for that production. This outer for loop executes once for each production (hence the 1+ term, even if there are no nonterminals in ti), and at most the code in the inner for loop executes |paths_ti| times, thus the sum_{i=1..m}(1 + |paths_ti|) figure.
Reach is invoked on each left nonterminal which generated a right-hand tree with no nonterminals. Thus, it can be invoked at most sum_{i=1..m} |paths_ti| times (if every node in the right-side trees derives no nonterminals).
Reach is called to:
1. set B[n] to true for the nonterminal argument n, ie that the nonterminal derives a tree entirely of terminals
2. for each production i that has this nonterminal on its right-hand side, decrement r[i] (the count of right-hand-side nonterminals of production i not yet known to expand to a terminal tree) by 1. If r[i] reaches 0, call Reach on this production's left-hand-side nonterminal.
Reach cannot be called more than one time for a non-terminal output of a production pi, as otherwise r[i] would reach 0 multiple times.
After Reach calls complete, A[n] is initialized to false. This requires O(|N|) time. An A[n] value of true means that a tree t' containing only one nonterminal n can be derived from S, and t' derives an entirely terminal tree t.
Use is then invoked on the start node S. Use sets A[n] to true for its argument n, in the first case S. Although B[S] has not been demonstrated to be true, if there are any productions with S on the lhs for which all nonterminals on the right-hand side derive trees of terminals, then B[S] is true. If this is not the case, then there will be no productions with both B[n] true for all nonterminals on the rhs and A[n] true for the nonterminal on the lhs, and all productions can be dropped. Then, for each nonterminal n' in the right-hand-side tree of a production of which the argument n is the lhs (there may be multiple productions and multiple nonterminals per production), Use is called recursively on n' if A[n'] is false and B[n'] is true, i.e. on those nonterminals which are derived from n (which we know to be reachable from S, since it is an argument in a chain of recursive calls that began with S) and which are expandable to entirely terminal trees. Like Reach, Use can be called at most sum_{i=1..m}(|paths_ti| + 1) times, as each right-hand-side nonterminal will not have A[n'] == false after Use is invoked on it.
Productions with n on the left-hand side for which A[n] is false or with n on the right-hand side for which B[n] is false may be pruned.
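
The two passes can be summarized in a short sketch (my own rendering of the bookkeeping, following the traversal as described in these notes rather than the paper's exact pseudocode; productions are assumed to be given as (lhs, rhs-nonterminal-list, weight) triples, and nonterminals covers every symbol that occurs on either side).

def rtg_prune(productions, nonterminals, start):
    B = {n: False for n in nonterminals}     # n derives an all-terminal tree
    adj = {n: [] for n in nonterminals}      # productions with n on the RHS
    r = [0] * len(productions)               # RHS nonterminals not yet known good
    worklist = []

    for i, (lhs, rhs_nts, _) in enumerate(productions):
        r[i] = len(set(rhs_nts))
        for n in set(rhs_nts):
            adj[n].append(i)
        if r[i] == 0:
            worklist.append(lhs)              # Reach(lhs) can fire immediately

    def reach(n):                             # mark n and propagate to producers
        if B[n]:
            return
        B[n] = True
        for i in adj[n]:
            r[i] -= 1
            if r[i] == 0:
                reach(productions[i][0])

    for n in worklist:
        reach(n)

    A = {n: False for n in nonterminals}      # n reachable from start
    by_lhs = {}
    for lhs, rhs_nts, w in productions:
        by_lhs.setdefault(lhs, []).append(rhs_nts)

    def use(n):
        A[n] = True
        for rhs_nts in by_lhs.get(n, []):
            for n2 in rhs_nts:
                if not A[n2] and B[n2]:
                    use(n2)

    use(start)
    return [p for p in productions
            if A[p[0]] and all(B[n] for n in p[1])]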

7. EM Training
EM stands for expectation maximization.
1. The expectation seems to be the product of the probabilities of derivation for a given set of output trees using a tree transducer M. (Whether the input trees are fixed or not is unclear).
2. This value is maximized by adjusting the probabilities used to weight the rules in the transducer. For example, if rule 1 is A->BC, the probability with which this rule is executed (when the state and input match) is set based on the number of times the rule occurs in the derivations of the relevant trees, with an average count per tree taken in the case of multiple derivations for a tree. For example, if rule 1 is used once in derivation 1 of tree1 and once in derivation 2 of tree1, and each derivation has probability 1/2, then the count of rule 1 in tree1 is 1. Note that the count is also multiplied by the probability of the tree's derivation when there is only one derivation. In the algorithm, the count is found by adding, over all (in, out) derivation forests, gamma divided by beta(S), times the weight of the example (the weight of the training pair). Here gamma is the product of the inside and outside weights for the rule and beta(S) is the inside weight of the start nonterminal S, i.e. the total weight of the derivation forest.
3. This per-tree value is summed over all trees to obtain c_tau, the expected count for decision (rule) tau. This value is then divided by a normalization constant to obtain the probability to be assigned to the rule tau. The normalization constant seems to be the sum of the counts of all rules with the same left side as tau over derivations of all trees in the sample set, each count weighted by the probability of its derivation (i.e. the same sum used to compute c_tau, but taken over all rules with the same left-hand side); see the formula sketch after this list.
4. According to the article, this procedure continues over multiple iterations. However, I do not understand at first how this occurs, i.e. how c_tau would change upon recalculation. I suppose that the p_parameters(d) value, i.e. the probability of a derivation, would change along with the weight of tau; thus, the recalculation of c_tau would change.
5. I suppose that the iteration could continue until the total weights of the derivations converge to within a certain precision. This is done by computing delta, the change in L from the previous iteration as a fraction of its magnitude. Here L is the sum, over all examples, of the log of the derivation weight, weighted by the example weight.
6. Note that, since the probability of the tree's derivation is multiplied by the count, trees which are less likely to be derived have less of an influence on the rule's probability than those which can be derived with greater frequency.
7. Calculation of EM takes O(NGn^2), where G is the grammar constant (based on number of rules and states). N is the number of examples in the training corpus. A previous algorithm Deriv requires O(Gn^2) to find the derivation tree for a given input/output pair. Thus, running this algorithm over all elements in the training set (assuming these are input/output pairs) should be sufficient to find the weights of all derivations needed to compute one iteration of EM.
8. Joint probability distributions over input/output tree pairs are mentioned, but I am not sure what the variables are which are considered jointly. I assume that the two variables are (input tree, output tree), as, given trained weights in the productions of the grammar, a probability for the training corpus's derivation can be computed for a transducer using these productions on a given input tree ti and an output tree to.
9. Next, the article discusses conditional probability. In particular, it mentions conditional normalization, whose meaning I am not certain of. Conditional probability differs from joint probability in that it finds P(X|Y) instead of P(X,Y). So instead of finding W(I,O), perhaps the algorithm finds W(O|I). I am not certain of:
a. How the original joint probability is found without using conditionals (perhaps by counting only the derivations which originate with the input and conclude with the output?).
b. How the conditional probability is determined (perhaps by ruling out productions for input trees not matching the input). The authors indicate that the patterns for all the rules of a given state should be partitioned into sets so that patterns from at most one set match a given input. I am not sure what this passage intends. I assume that "patterns" means the input which the rule transforms from a certain state. If patterns from at most one set match a given input, then the input must decide the set which is to be used. It is unclear what this partitioning does to change the calculation of c_tau or the probabilities of the tree derivations. Perhaps only rules in the same set as tau are used to normalize c_tau in computing tau's weight. This makes some sense, as rules which take a different input would not be under consideration to replace tau, since they would not match the input tree. However, the passage does not indicate what would happen if the rules were all placed in a single set. In this case, any input matches rules from only one set, but the set includes many rules the input does not match (I assume the authors imply that this should not occur). The example given states that input-epsilon rules would be problematic for conditional probability calculation. These rules match any input; thus, they could be included in every set of a partition. If rule1 were such a rule and we kept count of only the number of times it occurs in output tree derivations (irrespective of the input it matched), we would be inflating its probability in the case where we were using it to compute W(O|I). We could divide the count by the counts of all rules in all partitions, but this does not reflect the probability that this rule was used in the case of a single input, and its count would still dilute the probabilities of the other rules in its set.
c. The article then asserts that exact conditional probabilities can never be discovered by rule weights. Something called remarginalization is discussed (cf definition online). Remarginalization would seem to be the process of deducing the unconditional probability (marginal) of a single variable. The authors claim that the counts can be divided by the inside weight of the root derivation forest. It seems that this inside weight represents the probability that a derivation from the given input tree occurs, or P(I). Then, the joint P(I,O) (actually W(I,O)) can be divided by the marginal to obtain P(O|I). This may be a conceptually simpler approach to finding the conditional than the set-based approach above.
d. The next point in this section is that each q lhs could be placed in its own normalization group. This would cause the probability for each rule involved to be one, it seems, as each count would be normalized against itself. Each tree represents the substitution of a single rule at some point in a previous tree in the forest. If the partitioning of one rule per group is done for only a single set of rules with the same state/left-hand side, then the ratio of the sum of the weights to one would be equal to the number of rules in this set. The easiest way to discern this is to use a derivation forest of single nodes (rules). The probability of each is set to one, so the sum of the weights in the forest is the number of trees (nodes). To obtain the correct sum of 1, one must divide the probability of each rule by the number of trees. Each derivation tree has a weight equal to the product of all rules used therein. Thus, changing one rule by a factor of x changes the entire tree's weight by x. To correct this, it seems the probabilities of those in the set need to be divided by the size of the set. However, the article indicates that dividing each rule's probability by the sum of the weights of the derivation trees is adequate. This seems invalid, as dividing every single rule's probability would divide the weight of each derivation tree by s^n, where n is the number of rules used in that tree. Dividing only the rules in the affected set by s also seems inadequate, as there may be trees where rules of that set are not used. The weights of these trees would cause s to be less than the size of the set of "q lhs" rules, thus dividing the weights of these rules by s would make the probability of the forest be >1 (although less so than originally). Dividing the weight of each tree in the forest by s would make the weights of the derivations in the forest sum to one, as (2x+2y)=2 => x+y=1.
e. The algorithm for calculating EM assigns to each rule an initial count specified by "prior". It then increments the count by the quotient gamma(p)/beta(S) for each production p in a derivation forest D. Here gamma(p) is the product of the inside and outside weights for the production (as above) and beta(S) is the weight of the forest from the start node. It seems that prior could be set to zero, and that it is used to adjust probabilities for any deviation from the expected outcomes which might be caused by insufficiency of the training data.
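
As a compact restatement of the counting in items 2, 3, and (e) above, in my own notation (the paper's exact symbols may differ): for each rule tau,

c_\tau \;=\; \mathrm{prior}_\tau \;+\; \sum_{(I,O)\,\in\,\text{examples}} w_{(I,O)} \cdot \frac{\gamma_\tau}{\beta(S)}

p_\tau \;=\; \frac{c_\tau}{\sum_{\tau'\,:\,\mathrm{lhs}(\tau') = \mathrm{lhs}(\tau)} c_{\tau'}}

where gamma_tau is the inside-times-outside weight summed over occurrences of tau in the example's derivation forest, beta(S) is the inside weight of the forest's start nonterminal, and the denominator of p_tau runs over tau's normalization group.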

10. Parsing an xTs Tree-to-String Relation
This is analogous to the xT transducer, except that the outputs are strings with nonterminals interleaved between characters, instead of the trees used before. Just as a wRTG producing derivation trees could be built by the Deriv algorithm given an input tree and an output tree, so a derivation tree can be output by SDeriv given an input tree and an output string. The output of SDeriv is a grammar consisting of rules R, nonterminal set N', start state S, and productions P. The set N', interestingly, is (Q x pathsin x spansout) U (pathsin x spansout x (Q x paths)*). This is because the first element of the union is the input to the Produce function, and the second element is set by VarsToSpan, it seems. The initial state S is set to be (Q, (), (1, n)), but it seems that the span should be (1, n+1), as the second index of a span is not included in the span, while in this case the nth character should be included (all characters of the output form the initial span).
There are four functions which comprise the SDeriv algorithm (besides SDeriv itself): Produce, VarsToSpan, Canonical, and Feasible. The complexity of the algorithm is O(Gnm^3). SDeriv invokes Produce for input tree I and output string O with the argument S mentioned above. In Produce, the outermost loop iterates over all rules of the transducer with a left-hand side which will accept the input tree and a right-hand side for which Feasible returns true. Feasible is true if every terminal in the RHS occurs within the span of the output string. The next loop iterates over all sets of positions (p1, ..., pk) which align with the k terminals on the right-hand side of the rule (r1...rk). If ri+1 == r(i+1), the two terminals RHS(ri) and RHS(r(i+1)) are adjacent. If they are not adjacent, there are nonterminals between them, in the span (ri+1, r(i+1)). Note that the text here contains an error, looping over indices pk instead of pi, as pk represents only the last term in the sequence. If k == 0, then the right-hand side of the rule consists of nonterminals only. The variables in the (ri+1, r(i+1)) span must generate the output span (pi+1, p(i+1)). For each set (p1...pk) mentioned above, a new tree derivrhs is created. Note that multiple derivrhs trees might be created for the same set of terminals for the same rule. For example, a rule might be:
A=>dubious
and the output string: "dubious hobbits and Gandalf are dubious".
Here the right-hand side "dubious" matches both the first and last words of O. In this case, VarsToSpan would return false, since the (p7+1, p8) span would be " hobbits and Gandalf are dubious", and there is no nonterminal to cover this sequence.
Why O(Gnm^3)? Every possible substring in O is given by m start points and m end points, so the total number of substrings is O(m^2). There are m positions in each substring where variables might "cut" the terminals. n is the size of the input tree, so nm^3 seems to mean that there are at most n possible nonterminals to be checked for each cut of each possible output substring, for each rule. Each of these nonterminals would be expanded to a different substring, but this does not occur until the next (recursive) VarsToSpan call.
VarsToSpan loops over split points s from b down to a (where (a, b) is the output span to be checked against the nonterminals of the RHS). For each subspan (a, s), it checks whether the first input nonterminal can be expanded to this span by calling Produce with the nonterminal as input and the above span as output. If Produce returns true, a new tree is created with label "b" (for binarization).
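
The recursion can be pictured with a small sketch (a simplification under my own assumptions: half-open, 0-indexed spans and a produce(nt, span) callback standing in for the memoized Produce call; the real algorithm also records a binarized production for each successful split):

def vars_to_span(nonterminals, a, b, produce):
    if not nonterminals:
        return a == b                    # an empty variable list must cover nothing
    first, rest = nonterminals[0], nonterminals[1:]
    ok = False
    for s in range(b, a - 1, -1):        # loop from b down to a, as in the text
        if produce(first, (a, s)) and vars_to_span(rest, s, b, produce):
            ok = True                    # here the real algorithm adds a "b" production
    return ok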
A data structure used in this algorithm is a tree variant known as a trie, or prefix tree. Tries store associative arrays (i.e. arrays indexed by characters or some other non-integer data type). The key of a node is given by its position in the trie, and all descendants of a node share a common prefix.

        n1
      /    \
     t      i
     |      |
     e      l
     |      |
     a      k
The leaf nodes above have keys "tea" and "ilk". For a trie of terminals of the output string, the lookup time for any node is O(m). A common application of tries is dictionaries, where words are the keys and definitions the values at the leaf nodes. In this case, the trie is used to store nonterminals. An artificial node "b" divides the initial nonterminal's path from the path used for the remaining nonterminals in the interval (ri+1, r(i+1)). The value at each node is the triple of the state q', the path i', and the span of terminals (a,s). Note that there will be a separate trie for each (a, s) span found.
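
For concreteness, a minimal dict-based trie looks like the following (a generic illustration; the algorithm's trie stores (state, path, span) triples at its nodes rather than dictionary definitions):

class Trie:
    def __init__(self):
        self.children = {}        # character -> Trie
        self.value = None         # payload stored at the end of a key

    def insert(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def lookup(self, key):        # O(len(key)) == O(m) for output-string terminals
        node = self
        for ch in key:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.value

t = Trie()
t.insert("tea", 1)
t.insert("ilk", 2)
print(t.lookup("tea"), t.lookup("ilk"), t.lookup("teal"))  # 1 2 None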
Another theoretical construct discussed in connection with SDeriv is a finite state automaton (FSA). The FSA apparently uses subsequences of the output string for states, skipping to substrings in constant time with an index on the outgoing transitions. I am uncertain whether the FSA embodies all possible (p1...pk), or whether there is a separate FSA for each (p1...pk) which looks at all possible cuts in a specific substring. It seems that the FSA is used to determine each valid subsequence of the output string (p1...pk), as this is not explicitly written in code. Once a subsequence is found, the cuts can be found by a for loop which checks for (ri+1 != r(i+1)). However, this determination of the nonterminals' locations could also be made while deriving the substring in the FSA, so that the calls to VarsToSpan could be made in the process of finding (p1...pk) (perhaps this is what the authors meant by interleaving, although their reference seems to be to finding rules which are appropriate matches for the current input/output; this step could also be part of the above FSA, perhaps).
Sample FSA (note that there is another set of states preceding this one, which determines a valid substring p1...pk, i.e. tests all possible start/end points in the output string):
state 1: select a substring --> state 2: get r(i+1) --> state 3: get ri+1
state 3 --> state 4 (if r(i+1) == ri+1, i.e. the terminals are adjacent): increment i and go back to state 2
state 3 --> state 5 (otherwise): set up spangen --> state 6: call VarsToSpan
state 6, if true: set the label and rank of spanlist, set P, and go back to state 2
state 6, if false: move on to the next p
The fourth method used by SDeriv (after Produce, VarsToSpan, and Feasible) is Canonical. Canonical is used for memoization: it keeps the indices of the leftmost instance of a given substring. Thus, only one copy of each string that might be derived from a given set of nonterminals is stored. This cache is used by VarsToSpan to eliminate redundant searching for derivations.
Is it possible that the span passed to the Produce call in VarsToSpan might be checked more than once for a given rule? eg. for A-->caBt
Produce will check the output substring between ca and t in O. It seems like this would be the case (for the same nonterminal) only if there were a recursive rule like A--->caAt.
Another point of the authors is that exhaustive backtracking is used to determine expansion sites against an input tree. I assume this means that every possible rule based expansion is attempted until some right hand side terminals do not match the output or some right-hand side nonterminals cannot be expanded.
Yet another assertion is that a leftmost trie is used to index tree patterns against the input tree. Apparently, given such a data structure, the Feasible method is not needed. I am not sure exactly what is being indexed in this case. Perhaps, given an input tree and an output substring, all rules which apply are stored.
A sample of the trie used to store the derivations of variables on the right-hand side of the rule is:
"b"
/ \
A "b"
| \ \
cat B C
Note that the weight set for a production in VarsToSpan is 1, as there is no rule used to create the binarization. In the derivation tree, a rule is applied to an in/out pair in a production, and then another production (child node) is used to indicate how the nonterminals align to the output string.
This binarized trie is used to allow sharing of the ways in which a set of nonterminals can derive a span of the output string.
"b" "b"
/ \ / \
A "b" A "b"
| / \ | / \
cat B C ca B C
Above are two possible tries, one in which A aligns with "cat" and one in which A aligns with "ca". In the derivation tree, the node below the rule when VarsToSpan is called will simply be "wholespan", (input tree, output string span, right hand side nonterminals). However, in another production, there will be a "b" binarized rule indicating how the nonterminals were expanded. It seems that this production would be a node beneath the rule's child in the derivation tree. The "Train" algorithm does not consider such nodes in counting the number of times a rule appears when calculating the inside weight (only "rule" labelled nodes are considered), so there is no effect on the weighting obtained. This effect is obtained by VarsToSpan returning "wholespan" (instead of false, if there is no derivation) and adding a new production to P.
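
The sharing idea can also be illustrated by memoizing the suffix splits in the earlier vars_to_span sketch; this is my own stand-in for what the binarized trie accomplishes, not the paper's implementation.

import functools

def vars_to_span_shared(nonterminals, a, b, produce):
    # Memoize "can nonterminals[idx:] cover the span (lo, hi)?" so that, e.g., the
    # ways (B, C) can cover a suffix span are computed once and reused for every
    # alignment chosen for A.
    @functools.lru_cache(maxsize=None)
    def cover(idx, lo, hi):
        if idx == len(nonterminals):
            return lo == hi
        return any(produce(nonterminals[idx], (lo, s)) and cover(idx + 1, s, hi)
                   for s in range(hi, lo - 1, -1))
    return cover(0, a, b)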
The text makes a point about cyclical transitions of the form qx0->q'x0. It indicates that an abundance of these will cause the running time to approach the worst case. One point to note here is that, when VarsToSpan is called with the nonterminal on the right equivalent to that on the left, I assume that the path i concat i' will be equivalent to i, so i' seems like it must be the path of the nonterminal relative to the input of the left-hand side. So if there are many states such that q1 x0 ->q2 x0, q2 x0 ->q3x0.. . q(n-1) x0->qn x0, then all of these states will need to be visited for the single alignment x0. Thus, many rules are checked for what essentially has no effect on the derivation of the output string.
A second commentary the authors make involves binarized rules. The reader is uncertain as to what these denote. i.e. I am not sure whether the "b" will be on the left or right of the rules.
b->A(cat) BC
ABC-> b(A(cat), BC)
In any case, wherever "b" is placed in the derivation tree (below a rule) we could substitute a new rule id and add this rule to the list of transducer rules R. However, then calculating the counts and probabilities for only the rules that were originally included in R becomes non-trivial.

11. Translation Modeling Experiment
A replication of work done by Yamada and Knight is undertaken.
The authors create a tree transducer with rules designed to translate English sentences into Japanese. First, English sentence trees are flattened so that permutations of nodes at different subtree levels need not occur (any reordered nodes have the same parent). The translation process has four steps:
1. reordering, 2. insertion of Japanese function words (wa, ga, no, etc.), 3. translation, 4. deletion of internal words. Note that there are four states in the simple version of this transducer: q=start, r=reorder, i=insert, t=translate.
The algorithm finds the Viterbi derivation: the sequence of rules deriving the output string from the input tree. The observation, in terms of the Viterbi algorithm parlance, is the output Japanese string. The Viterbi alignment is the correlation of English and Japanese words. Metrics used to evaluate the performance of the algorithm are link match accuracy (assume this means word correlation) and sentence match accuracy (assume this means that all correlations in a sentence match). A more advanced version of the transducer was created with additional i-rules, making insertions conditional upon the parent nodes, and additional q-states, specifying both the parent node and the current node (which are passed when q is on the right side of a reorder rule). However, even in the more exact transducer, the probability distributions in the rules differ from those of Yamada and Knight. This is because Yamada and Knight barred foreign word insertion prior to a null translation of an English word. The example tree shows that the rule for parsing the root of the tree, VB, derives three copies of the rule "r2", which generates a reorder followed by an insertion. The authors remedy this discrepancy via "rule tying" (which is also used in reordering, because the paper's model is more specific than Yamada/Knight's). Thus, the insert and translate steps are tied to an "s" transition rule and an "s" translation rule. In the case of a NULL-translated English string, no "s" rule will be present, so no insertion will occur (I assume this means that the first tied rules which led to the null derivation (e.g. a.VB x0:NN->r x0, ix0@24 VB a0:NN->s x0, i x0@24) are discounted.) Interestingly, the transition rules r NN(x0:"car")->t x0 and s NN(x0:"car")->t x0 are not tied. Another question is why the original rules simply couldn't be replaced. Perhaps the authors wanted to emphasize that this modification to the transducer was made to emphasize a special case of the model they were attempting to mimic.
The transducer functionality described in the Deriv, SDeriv, and Train algorithms appears to have been implemented in Java via the Tiburon toolkit.

12. Training Probabilistic Context Free Grammars on Strings
In this section, the xTs transducer training algorithm is used on a grammar (the transducer rules) and a set of output sentences (transducer output). The objective is to assign probabilities to the grammatical rules such that the probability that the three given output sentences are output is maximized. Examples of the grammar rules are:
S->NP VP
NP->DT N
DT->the
DT->sees
Note that every nonterminal which derives a terminal string derives every possible string. There is no input tree in this case, so the transducer training algorithm assumes that the input subtree is "null" in each case. Thus, each production in the output WCFG grammar will be of the form:
qv null->(0.11) of
where the parenthetical quantity is the probability that the production is used for that state. The tree/string pairs used to input into the SDeriv algorithm are of the form:
null->the ambassador observed the window
The derivation forests output by SDeriv allow one to determine Viterbi parses for the output strings. This parse would show the order in which grammar rules were applied to produce the string, it seems.
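
One way to read a Viterbi parse off such a forest is a simple max-weight recursion (a sketch under my own assumptions: the forest has been pruned, contains no cycles, and is stored as a map from each nonterminal, a hashable tuple, to its list of (rule name, child nonterminals, weight) productions):

import functools

def viterbi(forest, start):
    @functools.lru_cache(maxsize=None)
    def best(nt):
        # Return (weight, tree) for the best derivation of nonterminal nt;
        # every nonterminal in a pruned forest has at least one production.
        candidates = []
        for rule_name, children, w in forest[nt]:
            weight, subtrees = w, []
            for child in children:
                cw, ct = best(child)
                weight *= cw
                subtrees.append(ct)
            candidates.append((weight, (rule_name, subtrees)))
        return max(candidates, key=lambda c: c[0])
    return best(start)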

13. Related and Future Work
An xLNT transducer is mentioned as a related model. This is an extended linear nondeleting top-down tree transducer. I am uncertain about what is meant by "linear". These are similar to TSGs (tree substitution grammars). In a TSG, states and tree labels are conflated, so a relabeling step in a TSG is required to make it equivalent to an xLNT. Note that a TSG is also related to a TAG (tree adjoining grammar).
Regular lookahead for deleted inputs is mentioned. This seems to mean that transducer rules:
x0->B
B->C x0
would be permitted (note that x0 is deleted in the first rule and recovered in the second).
Note that the Expectation Maximization algorithm in this paper is a forward-backward training algorithm (cf feed-forward neural networks or back-propagation). I am not certain as to what exactly is meant by "forward" and "backward". "Forward" derivations seem to be correlated to inside weight calculation, i.e. to finding the probability of a derivation moving forward from a given state. "Backward" derivations seem to be correlated to outside weight calculation, i.e. to finding the probability of a derivation that leads to a given state from the start state. The algorithm in this paper is a generalization of forward-backward EM training for finite-state transducers, which is a generalization of a forward-backward model for Hidden Markov Models. A related piece involves string to string training based on an FST, finite state transducer. Another related work involves tree-to-tree mapping using cloning (copying an arbitrary subtree elsewhere in the output). I am not sure whether this subtree could be copied to a position not shown in the right-hand-side, but it seems unlikely. Thus, this cloning would seem to be covered by the xT transducers introduced in the paper.
Future work involves building the wRTG (weighted regular tree grammar) for xT derivation trees given only the output tree or given only the input tree. The paper claims that there is a one-to-one correspondence between derivation trees and input trees for backward application with concrete transducers, but I am not sure why this need be the case (i.e. why the grammar could not contain two nonterminals which derive to exactly the same right hand side and are derived from the same parent node). Actually, this means that there is one derivation per input tree for a given output, so what I wrote above is not a contradiction. A concrete transducer is one in which all paths and labels in an input subtree are fully specified. Sources are represented by the same variables on the lhs and rhs.
Copying transducers require wRTG intersection in the case of backward application. I assume that the following scenario is an example of this case. Say the input tree is A(B) and the output string is "cat twc". The first rule is A x0->C(x0 x0), which copies B. The left child B must derive "cat", while the right child must derive "twc". Thus, two separate rules, B->cat and B->twc are needed. The grammar would need to contain both derivations starting from B. If B were copied to non-sibling nodes, it seems that the intersection would be more involved, as the grammar would need derivations from B at different functional positions in the output string. If the xTS transducer is non-copying, backward application entails CFG parsing. Output lattices can be parsed via backward application as well. Apparently, spans would be pairs of lattice states, although I am not sure what the lattice would be other than a set of sentences arranged in an array.


14. Conclusion
A model and training algorithm for machine translation were presented. Performance is O(Gn^2) for the xT derivation algorithm and O(Gnm^3) for the xTs algorithm. Training occurs by building and pruning the derivation forest, then counting the usage of rules in these derivations and normalizing. The implementation and experimental results were also described.
