This is an online appendix to Eisner, Goldlust, and Smith (2005), Compiling Comp Ling: Weighted Dynamic Programming and the Dyna Language.
Section 4.2 and Figure 3 give a forward-chaining algorithm that computes the values of all items in a semiring-weighted logic program.
Then section 5.2 and Figure 4 give a gradient algorithm that
differentiates the previous computation (when the semiring is
(R,+,*)). It works backwards to rapidly compute the partial
derivatives of the goal
item with respect to each of the
items. This gradient can be used to adjust the values of items that
correspond to free parameters, within an algorithm like gradient
descent or L-BFGS.
Below is the derivation of this gradient algorithm, which did not fit in the paper (as noted in the Figure 4 caption). It is a reformatted version of an August 12, 2004 email among the authors (from Jason Eisner to Eric Goldlust).
Note that the Dyna compiler produces specialized code for both the forward-chaining algorithm and the gradient algorithm, for any logic program in this semiring.
Background reading: A special case of these forward-chaining and back-propagation algorithms (for a particular finite-state model) was explained in detail in Eisner (2001), sections 4.2 and 4.3 respectively. Our derivation below uses the same notation.
You should already be familiar with the general idea of automatic differentiation (website) in the reverse mode (tutorial, history). Most people first see this in the case of back-propagation in neural networks (slides, video, explanation, book chapter, formulas and C++ code, another tutorial with code). Indeed, the algorithm below follows the same pattern as back-propagation through time for recurrent neural networks (Werbos 1989, Williams and Zipser 1989).
We need to start by writing out the timestamped forward pass, then derive a matching backward pass. I'll use (t) for a timestamp superscript.
First, here is a picture of a bit of the formula graph given a rule
c += a1 * a2 * a3
where a2
is the only driver. You can see that
agenda(t-1)[a2] is used to update the chart(t-1) values into
chart(t) values and to define a new value for agenda(t)[c]. Note
that for a given time t, the chart values feed forward into the
agenda values.
            ,------------------------- agenda(t)[c]
           /  *                  *    |           *  \
          /                           |               \
    chart(t)[a1]     chart(t)[a2]     |                \     chart(t)[a3]
         |            /   +   \       |                 \         |
         |           /         \      |                  \        |
    chart(t-1)[a1]  chart(t-1)[a2]  agenda(t-1)[a2]   chart(t-1)[a3]
Here is the forward pass pseudocode.
[Note that we explicitly push current values onto a "tape" (stack) before we update them. Later, during the backward pass, this tape will be read backwards (popped) to help us unwind the computation. From section 5.2 of our paper: "At line 3, a stack is needed to remember the sequence of (a, old, Δ) triples from the original computation. It is a more efficient version of the 'tape' usually used in automatic differentiation."]
initialize the agenda
t := 0
while agenda is not empty
    t := t+1

    // choose an update from agenda
    choose a such that agenda(t-1)[a] != 0    // this a could permanently be called a(t-1)
    push (a, chart(t-1)[a], agenda(t-1)[a]) onto tape

    // compute chart(t)
    chart(t) := chart(t-1)
    chart(t)[a] += agenda(t-1)[a]             // overrides (adds to) previous line, for [a]

    // compute agenda(t)
    agenda(t) := agenda(t-1)                  // plus other terms to be added below
    agenda(t)[a] := 0                         // overrides previous line, for [a]
    if (chart(t)[a] != chart(t-1)[a])         // only propagate actual changes
        // the following updates may be done in any convenient order
        // (e.g., by pattern matching on a)
        for each inference rule c += a1*a2*...*ak
            for i from 1 to k
                for each way (if any) of instantiating the rule's variables
                        such that ai=a and (all j) chart(t)[aj] != 0 or chart(t-1)[aj] != 0
                    agenda(t)[c] += prod_{j=1}^{i-1} chart(t)[aj]
                                    * agenda(t-1)[a]
                                    * prod_{j=i+1}^{k} chart(t-1)[aj]
                    // has the same effect as updating chart[a1], chart[a2], ...
                    // one at a time in that order; the only ones that
                    // actually change are the ones that equal a.
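To make this loop concrete, here is a sketch of the forward pass in plain Python, with the timestamps collapsed into in-place updates as in Figure 3 of the paper. The (head, antecedents) rule encoding, the item names, and the little weighted-reachability program at the end are hypothetical illustrations, not the representation the Dyna compiler actually generates:

```python
from collections import defaultdict

# Each ground rule is (head, antecedents): head += product of antecedent values.
# Items are plain strings here; a real implementation would instantiate rules
# by pattern matching on the popped item a.
def forward(rules, axioms):
    chart = defaultdict(float)
    agenda = defaultdict(float, axioms)     # axioms seed the agenda
    tape = []                               # (a, chart(t-1)[a], agenda(t-1)[a]) triples
    while True:
        a = next((k for k, v in agenda.items() if v != 0), None)
        if a is None:                       # agenda is empty (all zero)
            break
        delta = agenda.pop(a)               # agenda(t)[a] := 0
        tape.append((a, chart[a], delta))   # push onto the tape
        old = chart[a]
        chart[a] = old + delta              # chart(t)[a]
        for head, ants in rules:
            for i, ai in enumerate(ants):
                if ai != a:
                    continue
                # antecedents left of the driver use chart(t) values; those
                # right of it use chart(t-1) values (i.e., old if they equal a)
                prod = delta
                for j in range(i):
                    prod *= chart[ants[j]]
                for j in range(i + 1, len(ants)):
                    prod *= old if ants[j] == a else chart[ants[j]]
                agenda[head] += prod
    return chart, tape

# Hypothetical example: weighted reachability on a small acyclic graph.
rules = [("p2", ["e12", "p1"]),
         ("p3", ["e13", "p1"]),
         ("p3", ["e23", "p2"])]
chart, tape = forward(rules, {"e12": 0.5, "e13": 0.1, "e23": 0.5, "p1": 1.0})
# chart["p3"] accumulates e13*p1 + e23*e12*p1 = 0.35
```

For a cyclic program (like the a += a*a example later in this appendix), the loop may never empty the agenda, so one would bound the number of pops (early stopping).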
Okay, now let's see what the backward pass must do. We'll just differentiate the defining computations above, basically in reverse order.
// t and tape are as they were left at the end of the forward pass
initialize the gradient computation
    (e.g., set gagenda(t):=0, gchart(t):=0, then gchart(t)[goal]:=1)
while tape is not empty
    pop (a, chart(t-1)[a], agenda(t-1)[a])
        // In the timestamped version, popping the old
        // chart and agenda values isn't actually necessary,
        // since we remember those values from the forward pass.
    gagendaold := 0      // holds accumulands into gagenda(t-1)[a]
    gchart(t-1) := 0     // to be accumulated below
    if (chart(t)[a] != chart(t-1)[a])
        // the following updates may be done in any convenient order
        // (e.g., by pattern matching on a)
        for each inference rule c += a1*a2*...*ak
            for i from 1 to k
                for each way (if any) of instantiating the rule's variables
                        such that ai=a and (all j) chart(t)[aj] != 0 or chart(t-1)[aj] != 0
                    for j' := 1 to i-1
                        gchart(t)[aj'] += gagenda(t)[c]
                                          * prod_{j=1 to i-1 except j'} chart(t)[aj]
                                          * agenda(t-1)[a]
                                          * prod_{j=i+1}^{k} chart(t-1)[aj]
                    gagendaold += prod_{j=1}^{i-1} chart(t)[aj]
                                  * gagenda(t)[c]
                                  * prod_{j=i+1}^{k} chart(t-1)[aj]
                    for j' := i+1 to k
                        gchart(t-1)[aj'] += prod_{j=1}^{i-1} chart(t)[aj]
                                            * agenda(t-1)[a]
                                            * prod_{j=i+1 to k except j'} chart(t-1)[aj]
                                            * gagenda(t)[c]
    // now gchart(t) is defined
    gagenda(t-1) := gagenda(t)
    gagenda(t-1)[a] := gagendaold + gchart(t)[a]
        // overrides above, since gagenda(t-1)[a] does not feed forward
        // into gagenda(t)[a], but into other things
    gchart(t-1) += gchart(t)
        // note that in forward pass, chart(t-1) influences agenda(t)
        // both directly and via chart(t)
    t := t-1
As a check, the loop over inference rules simplifies as follows in the binary-rule case:
for each inference rule c += a1*a2
    for each way (if any) of instantiating the rule's variables
            such that a1=a and chart(t)[a2] != 0 or chart(t-1)[a2] != 0
        gagenda(t-1)[a] += gagenda(t)[c] * chart(t-1)[a2]
        gchart(t-1)[a2] += agenda(t-1)[a] * gagenda(t)[c]
        // can get these by specializing the general version above,
        // or by directly differentiating
        //     agenda(t)[c] += agenda(t-1)[a] * chart(t-1)[a2]
    for each way of instantiating the rule's variables
            such that a2=a and chart(t)[a1] != 0 or chart(t-1)[a1] != 0
        gchart(t)[a1] += gagenda(t)[c] * agenda(t-1)[a]
        gagenda(t-1)[a] += chart(t)[a1] * gagenda(t)[c]
        // can get these by specializing the general version above,
        // or by directly differentiating
        //     agenda(t)[c] += chart(t)[a1] * agenda(t-1)[a]
Now our next job is to simplify these routines by eliminating the timestamps [and using the tape instead]. The forward pass simplifies as already shown in the paper [Figure 3]. Here's how the backward pass simplifies [Figure 4 in paper]:
initialize the gradient computation (e.g., set gagenda:=0, gchart:=0, then gchart[goal]:=1)
while tape is not empty
    // chart currently holds chart(t)
    // gchart and gagenda currently hold gchart(t) and gagenda(t)
    pop (a, oldchart, oldagenda)    // "old" refers to the (t-1) values
    // During the main loop below, chart and gagenda still refer to (t),
    // but gchart now refers to (t-1), as if we had done
    // gchart(t-1):=0 as in the timestamped version but then had done
    // the gchart(t-1)+=gchart(t) step early.
    // Because the latter step was done early, before gchart(t) was modified,
    // we'll need to ensure that any modifications to gchart(t) (which is
    // no longer stored) are passed through so that they appropriately affect
    // gchart(t-1) and gagenda(t-1)[a].
    //
    // As in the paper, we use chart(aj,oldchart) to refer to
    //     aj==a ? oldchart : chart[aj],
    // that is, chart(t-1)[aj].
    gagendaold := 0          // holds accumulands into gagenda(t-1)[a] (while gagenda refers to (t))
    gchartnew := gchart[a]   // holds gchart(t)[a] (while gchart refers to (t-1))
    if (chart[a] != oldchart)    // in other words, if update actually accomplished something
        // the following updates may be done in any convenient order
        // (e.g., by pattern matching on a)
        for each inference rule c += a1*a2*...*ak
            for i from 1 to k
                for each way (if any) of instantiating the rule's variables
                        such that ai=a and (all j) chart[aj] != 0 or aj=a
                    for j' := 1 to i-1
                        // modifying gchart(t), must pass through
                        gchart[aj'] += gagenda[c]
                                       * prod_{j=1 to i-1 except j'} chart[aj]
                                       * oldagenda
                                       * prod_{j=i+1}^{k} chart(aj,oldchart)
                        if (aj'=a) then gchartnew += (same thing)
                    gagendaold += prod_{j=1}^{i-1} chart[aj]
                                  * gagenda[c]
                                  * prod_{j=i+1}^{k} chart(aj,oldchart)
                    for j' := i+1 to k
                        gchart[aj'] += prod_{j=1}^{i-1} chart[aj]
                                       * oldagenda
                                       * prod_{j=i+1 to k except j'} chart(aj,oldchart)
                                       * gagenda[c]
    // now we roll gagenda and chart back to refer to (t-1)
    gagenda[a] := gagendaold + gchartnew
    chart[a] := oldchart
    t := t-1
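Here is a matching sketch of this simplified backward pass in plain Python, for a hypothetical (head, antecedents) encoding of ground rules (head += product of antecedent values); the forward pass is included only so that the sketch is self-contained and can produce a tape. The gradients it returns can also be sanity-checked against finite differences by perturbing an axiom and rerunning the forward pass:

```python
from collections import defaultdict

def forward(rules, axioms):
    # Figure 3 style; pushes (a, oldchart, oldagenda) onto the tape at each pop.
    chart, agenda, tape = defaultdict(float), defaultdict(float, axioms), []
    while True:
        a = next((k for k, v in agenda.items() if v != 0), None)
        if a is None:
            return chart, tape
        delta = agenda.pop(a)
        tape.append((a, chart[a], delta))
        old, chart[a] = chart[a], chart[a] + delta
        for head, ants in rules:
            for i, ai in enumerate(ants):
                if ai != a:
                    continue
                prod = delta
                for j in range(i):
                    prod *= chart[ants[j]]
                for j in range(i + 1, len(ants)):
                    prod *= old if ants[j] == a else chart[ants[j]]
                agenda[head] += prod

def backward(rules, chart, tape, goal):
    # Figure 4 style: pop the tape, rolling chart and gagenda back as we go.
    gchart, gagenda = defaultdict(float), defaultdict(float)
    gchart[goal] = 1.0
    for a, oldchart, oldagenda in reversed(tape):
        gagendaold = 0.0                    # accumulands into gagenda(t-1)[a]
        gchartnew = gchart[a]               # snapshot of gchart(t)[a]
        chart_old = lambda x: oldchart if x == a else chart[x]   # chart(aj,oldchart)
        for head, ants in rules:
            for i, ai in enumerate(ants):
                if ai != a:
                    continue
                gc = gagenda[head]
                for jp in range(i):         # gchart(t)[aj']: pass through if aj' = a
                    prod = gc * oldagenda
                    for j in range(i):
                        if j != jp:
                            prod *= chart[ants[j]]
                    for j in range(i + 1, len(ants)):
                        prod *= chart_old(ants[j])
                    gchart[ants[jp]] += prod
                    if ants[jp] == a:
                        gchartnew += prod
                prod = gc                   # the driver's own gradient contribution
                for j in range(i):
                    prod *= chart[ants[j]]
                for j in range(i + 1, len(ants)):
                    prod *= chart_old(ants[j])
                gagendaold += prod
                for jp in range(i + 1, len(ants)):   # gchart(t-1)[aj']: direct
                    prod = gc * oldagenda
                    for j in range(i):
                        prod *= chart[ants[j]]
                    for j in range(i + 1, len(ants)):
                        if j != jp:
                            prod *= chart_old(ants[j])
                    gchart[ants[jp]] += prod
        gagenda[a] = gagendaold + gchartnew  # roll gagenda and chart back to (t-1)
        chart[a] = oldchart
    # at the end, gagenda[x] is the derivative of chart[goal] w.r.t. the initial
    # agenda seed of x; for an axiom x, that is the gradient w.r.t. its value
    return gagenda

# Hypothetical example: weighted reachability, goal p3 = e13*p1 + e23*e12*p1.
rules = [("p2", ["e12", "p1"]), ("p3", ["e13", "p1"]), ("p3", ["e23", "p2"])]
chart, tape = forward(rules, {"e12": 0.5, "e13": 0.1, "e23": 0.5, "p1": 1.0})
grad = backward(rules, chart, tape, "p3")
# grad["e13"] = p1 = 1.0; grad["e12"] = e23*p1 = 0.5; grad["p1"] = e13 + e23*e12 = 0.35
```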
Again, let's consider what the main loop looks like in the binary-rule case:
for each inference rule c += a1*a2
    for each way (if any) of instantiating the rule's variables
            such that a1=a and chart[a2] != 0 or a2=a
        gagendaold += gagenda[c] * chart(a2,oldchart)
        gchart[a2] += oldagenda * gagenda[c]
        if (a2=a) then gchartnew += oldagenda * gagenda[c]
    for each way (if any) of instantiating the rule's variables
            such that a2=a and chart[a1] != 0 or a1=a
        gchart[a1] += gagenda[c] * oldagenda
        gagendaold += chart[a1] * gagenda[c]
I suggest you check this with a simple example. Suppose we have the program
a += a*a.   % note: equivalent to a = a*a since only one summand
a += r.     % where r is an axiom

Then successive updates will look like this, I think:
chart[a]=0,                   agenda[a]=r
chart[a]=r,                   agenda[a]=r^2-0^2
chart[a]=r+r^2,               agenda[a]=(r+r^2)^2-r^2
chart[a]=r+(r+r^2)^2,         agenda[a]=(r+(r+r^2)^2)^2-(r+r^2)^2
chart[a]=r+(r+(r+r^2)^2)^2,   agenda[a]=(new chart[a])^2-(old chart[a])^2

The convergence condition is chart[a]=r+chart[a]^2, which is what we want. For example, if r is 0.25, I think chart[a] will converge to 0.5.
I suggest you run the forward algorithm by hand for 2 steps (early stopping!), and see what formula you get for chart[a] in terms of r, presumably r+r^2. Then run the backward algorithm by hand, and make sure that gradient[r] ends up holding the correct derivative of that. In both cases, exactly follow the pseudocode in the EMNLP paper, with the modifications listed above. Let me know if it works!
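Here is that check carried out in Python, as a sketch specialized to this two-rule program: the axiom r is delivered as item a's initial agenda value, the forward pass stops early after 2 pops, and the backward pass follows the binary-rule case above (with a1=a2=a, so both branches fire and the gchartnew pass-through is needed):

```python
def forward(r, num_pops):
    # the rule "a += r." delivers the axiom r as a's initial agenda value
    chart_a, agenda_a, tape = 0.0, r, []
    for _ in range(num_pops):               # early stopping after num_pops pops
        delta = agenda_a
        tape.append((chart_a, delta))       # (oldchart, oldagenda) for item a
        old = chart_a
        chart_a = old + delta
        # rule a += a*a, with a as the driver in both positions:
        # i=1 contributes delta*old, i=2 contributes chart_a*delta,
        # so agenda[a] = chart_a^2 - old^2, matching the table above
        agenda_a = delta * old + chart_a * delta
    return chart_a, tape

def backward(tape, chart_a):
    gchart_a, gagenda_a = 1.0, 0.0          # objective is the final chart[a]
    for oldchart, oldagenda in reversed(tape):
        gagendaold = 0.0
        gchartnew = gchart_a                # snapshot of gchart(t)[a]
        g = gagenda_a                       # gagenda(t)[a], read-only below
        # differentiate agenda[a] += oldagenda * chart(t-1)[a]   (a1=a is the driver)
        gagendaold += g * oldchart
        gchart_a += oldagenda * g
        # differentiate agenda[a] += chart(t)[a] * oldagenda     (a2=a is the driver)
        gchart_a += g * oldagenda
        gchartnew += g * oldagenda          # that update hit gchart(t)[a]: pass through
        gagendaold += chart_a * g
        gagenda_a = gagendaold + gchartnew  # roll back to (t-1)
        chart_a = oldchart
    return gagenda_a                        # = gradient w.r.t. r

r = 0.25
chart_a, tape = forward(r, 2)
grad_r = backward(tape, chart_a)
# chart_a = r + r^2 = 0.3125, and grad_r = 1 + 2r = 1.5, the derivative of r + r^2
```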
As [another] overly simple check, the back-relaxation code in my thesis (pp. 123-124) should fall out by trying the [backward algorithm] on the following program:
stop(I) += edgezero(I)*pathto(I).
pathto(J) += edge(I,J)*pathto(I).

when only pathto items are popped. Note that a1=a2 is impossible for this program. When we pop pathto(I), we get
gagendaold := 0                  // in thesis, z := 0
gchartnew := gchart[pathto(I)]   // in thesis, doesn't appear; is effectively 0 since
                                 // chart[pathto(...)] is never used: pathto items
                                 // are never passengers

// from stop(I) += edgezero(I)*pathto(I).
gchart[edgezero(I)] += gagenda[stop(I)] * oldagenda    // in thesis, gPi0 += gpi * Ji
gagendaold += chart[edgezero(I)] * gagenda[stop(I)]    // in thesis, z += Pi0 * gpi

// from pathto(J) += edge(I,J)*pathto(I).
gchart[edge(I,J)] += gagenda[pathto(J)] * oldagenda    // in thesis, gPij += gJ_j * Ji
gagendaold += chart[edge(I,J)] * gagenda[pathto(J)]    // in thesis, z += Pij * gJ_j

// after loop
gagenda[pathto(I)] := gagendaold + gchartnew    // in thesis, gJi := z
                                                // (note gchartnew is effectively 0)
chart[pathto(I)] := oldchart    // in thesis, doesn't appear since chart[pathto(...)]
                                // is never stored: pathto items are never passengers
                                // (hence, no need to store oldchart on tape, either)

which seems to match, as shown in the comments.
Some further optimizations seem possible though not urgent:
Notice that we are now maintaining two gradients per item, gchart and gagenda. agenda[a] is only used on the forward pass, and gagenda[a] is only used on the backward pass, so the implementation could probably store them in the same place. Indeed, I suspect that agenda and gagenda are perfect complements, in the sense that gagenda(t)[a] must be stored on the backward pass (for a given t, a) if and only if agenda(t)[a] was stored on the forward pass.
I think maintaining two gradients per item is probably necessary in the general case. Why? Because these gradients may be arbitrarily different. In the case of early stopping, chart[a] may have more influence on the objective function than agenda[a] does (it may participate as a passenger before a is popped).
We should try to figure out whether we can get by with one gradient, but I'm skeptical. However, in specific cases, it may be possible to get away with just one:
If a will never participate as a passenger, then we don't have to store either chart[a] or gchart[a], since the former will never be used and the latter is therefore zero. That's the case above.
(Probably the safest way to handle this is to treat pathto as a kind of transient item that is nonetheless processed on the agenda (rather than immediately). That way, we'll be able to cope if a does participate as a passenger, e.g., because the user asserts a new edge axiom in the middle of the computation.)
If we know that we're not doing early stopping, then we can probably combine the gradients into a single field, since whatever happens to chart[a] in future will ultimately also happen to agenda[a] -- incrementing either one will ultimately have the same effect.
(Should work out the improved pseudocode for this case. Confirm that the idea really works, even though chart[a] is used for passenger a and agenda[a] for driver a. Note: We do still want to use the tape even without early stopping, I think, to avoid the need for backward chaining and managing a backward-pass priority queue. Maybe there are also non-tape approaches that save memory.)
By static analysis, we may be able to determine at compile time that a1=a2 is impossible in some rule. That lets us skip two tests in the backward-pass code (one is explicit; the other is hidden inside the construct "chart(aj,oldchart)"). We can also avoid using the temp variable gchartnew in this case, since it will not be modified.
The pseudocode suggests that the tape has to contain triples of the form (a, chart(t-1)[a], agenda(t-1)[a]). However, we don't really have to store both those (t-1) values. Given one, we can get the other upon popping by subtracting from chart(t)[a]. (Since we're running a backward pass, we're presumably in the real or log-real semiring, so we can assume subtraction exists.)
Which is better to store? Requires a little thought. It depends on which way will give us better numerical precision, and how bad it is to have precision errors in agenda[a] vs. gradient[a]. (Storing agenda[a] means that the immediately needed gradient computation will always be accurate, using a as a driver, but we may accumulate error in chart[a] that will hurt us when and if a is used as a passenger as we continue on the backward pass. Storing chart[a] means that agenda[a] might be fuzzy but we will always have chart[a] correct if we need to use a as a passenger.)
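For instance, here is a sketch of a tape that stores only agenda(t-1)[a] and recovers chart(t-1)[a] by subtraction upon popping (assuming the real semiring, and assuming chart[a] has not yet been rolled back when we pop):

```python
def push(tape, a, oldagenda):
    # record only the driver a and its popped agenda value agenda(t-1)[a]
    tape.append((a, oldagenda))

def pop(tape, chart):
    # chart still holds chart(t)[a] = oldchart + oldagenda, so the un-stored
    # chart(t-1)[a] is recovered by subtraction
    a, oldagenda = tape.pop()
    return a, chart[a] - oldagenda, oldagenda

tape = []
chart = {"x": 0.5}       # chart(t-1)[x]
push(tape, "x", 0.25)    # forward pass pops x with agenda value 0.25
chart["x"] += 0.25       # the forward update the push recorded: chart(t)[x] = 0.75
a, oldchart, oldagenda = pop(tape, chart)
# recovers ("x", 0.5, 0.25); these values are all exact in binary floating point,
# though in general the subtraction can lose precision, as discussed above
```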
In the forward pass, we can skip pushing (a, ...) onto the tape if chart[a] didn't change. This saves space. It also saves time: the backward pass never has to see (and skip over) agenda changes that didn't make any difference. The only reason I didn't make these improvements in the pseudocode above was that it would have made the timestamped code uglier. (It's nice just to do ++t every time we pop something.)