LU Factorization on a Hardware Array

The article I tried to explain six posts ago seems to be related to the previous algorithm.
The recursion given by Kung was:
a(ij)**(1) = a(ij)
a(ij)**(k+1) = a(ij)**(k) + l(ik)*(-u(kj))
Here l and u are elements of the lower triangular matrix L and the upper triangular matrix U that the systolic array ultimately outputs.
l(ik) = 0 if i<k, 1 if i=k, and a(ik)**(k)*u(kk)**(-1) if i>k
u(kj) = 0 if k>j, and a(kj)**(k) if k<=j
The u(kk)**(-1) factor explains why a "reciprocal processor" is needed: such a processor outputs the reciprocal of its input u(kk).
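To make the recursion concrete, here is a minimal sketch that evaluates it sequentially in plain Python. It is not a model of the systolic array or its timing, only of the arithmetic. It assumes l(kk) = 1 and l(ik) = a(ik)**(k)/u(kk) for i>k, does no pivoting, and uses 0-based indices, so step k in the code is step k+1 in the text. The names lu_by_recurrence and matmul and the test matrix are made up for illustration.

```python
def lu_by_recurrence(a):
    """Evaluate a(ij)**(k+1) = a(ij)**(k) + l(ik)*(-u(kj)) for k = 0..n-1.

    `a` is a square matrix given as a list of lists. No pivoting is done,
    so a zero u(kk) raises ZeroDivisionError.
    """
    n = len(a)
    cur = [row[:] for row in a]              # cur holds A**(k)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # u(kj) = a(kj)**(k) for k <= j, and 0 otherwise
        for j in range(k, n):
            U[k][j] = cur[k][j]
        recip = 1.0 / U[k][k]                # the job of the "reciprocal processor"
        # l(ik) = 0 for i < k, 1 for i = k, a(ik)**(k) * u(kk)**(-1) for i > k
        L[k][k] = 1.0
        for i in range(k + 1, n):
            L[i][k] = cur[i][k] * recip
        # The update itself, applied to every entry
        cur = [[cur[i][j] + L[i][k] * (-U[k][j]) for j in range(n)]
               for i in range(n)]
    return L, U


def matmul(x, y):
    """Plain product of two square matrices, used only to check L*U."""
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


A = [[4.0, 3.0, 2.0],
     [8.0, 7.0, 9.0],
     [4.0, 6.0, 5.0]]
L, U = lu_by_recurrence(A)
print(L)             # unit lower triangular factor
print(U)             # upper triangular factor
print(matmul(L, U))  # reproduces A for this matrix
```

For this particular matrix every intermediate value is exactly representable, so the product L*U matches A exactly; in general a tolerance would be needed for the comparison.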
Consider the first column of the output, i.e. k=1.
a(21)**(2) = a(21)**(1) + l(21)*(-u(11))
l(21) = a(21)/u(11)
a(21)**(2) = a(21) - a(21) = 0
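The same cancellation can be checked numerically for every row below the first. The snippet below is a small sketch on a made-up 3x3 matrix (A1 and the variable names are illustrative only), using u(11) = a(11) and l(i1) = a(i1)/u(11); indices are 0-based in the code.

```python
# Hypothetical example matrix, playing the role of A**(1)
A1 = [[2.0, 1.0, 1.0],
      [4.0, 3.0, 3.0],
      [8.0, 7.0, 9.0]]

u11 = A1[0][0]                      # u(11) = a(11)**(1)
first_col = []
for i in (1, 2):                    # rows 2 and 3 in the text's numbering
    l_i1 = A1[i][0] / u11           # l(i1) = a(i1)**(1) / u(11)
    # a(i1)**(2) = a(i1)**(1) + l(i1)*(-u(11))
    first_col.append(A1[i][0] + l_i1 * (-u11))

print(first_col)  # [0.0, 0.0]
```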
So row 2, column 1 of A**(2) has the correct value, namely zero. Note that A**(2) is affected only by the first column of L. This is the same effect you would obtain by multiplying by an atomic lower triangular matrix, in which only the first column contains non-zero off-diagonal elements and the diagonal is all ones. In fact, the approach sketched here appears to be a version of the Doolittle algorithm that distributes the matrix multiplication: we simply replace the "n" of the Doolittle algorithm with the "k" above.
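The comparison with an atomic lower triangular matrix can be sketched in code. The example below applies one step of the recursion to a made-up matrix and, separately, multiplies the same matrix by the atomic matrix built from the first column of L. It assumes l(kk) = 1 and l(ik) = a(ik)**(k)/u(kk) for i>k; the helper names are hypothetical. One detail worth noting: with l(kk) = 1 the recursion zeroes row k itself (that row has been handed over to U), whereas the matrix product leaves it in place, so the two are compared only on the rows below the pivot.

```python
def recurrence_step(cur, k):
    """One step a(ij)**(k+1) = a(ij)**(k) + l(ik)*(-u(kj)), with 0-based k."""
    n = len(cur)
    u = [cur[k][j] if j >= k else 0.0 for j in range(n)]
    l = [0.0 if i < k else 1.0 if i == k else cur[i][k] / u[k]
         for i in range(n)]
    nxt = [[cur[i][j] + l[i] * (-u[j]) for j in range(n)] for i in range(n)]
    return nxt, l


def atomic_matrix(l, k):
    """Identity matrix with -l(ik) placed below the diagonal in column k."""
    n = len(l)
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(k + 1, n):
        m[i][k] = -l[i]
    return m


def matmul(x, y):
    """Plain product of two square matrices."""
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


A1 = [[4.0, 3.0, 2.0],
      [8.0, 7.0, 9.0],
      [4.0, 6.0, 5.0]]

A2, l1 = recurrence_step(A1, 0)          # the recursion with k = 1 in the text
product = matmul(atomic_matrix(l1, 0), A1)

print(A2[1:])       # rows below the pivot, from the recursion
print(product[1:])  # the same rows, from the atomic matrix product
print(A2[0], product[0])  # the pivot row differs: zeroed vs. left unchanged
```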
Next, I would like to consider the geometry of the arrays and the timing of the input from A.