Systolic Arrays and Parallelization: Factoring Arrays in Parallel

I recently perused a patent application by Professor Kung of Carnegie Mellon for a systolic array apparatus to be used in matrix computations. The arrays consist of multiple inner product step processors connected in a grid formation. Each processor computes C = C + AB: it takes three inputs, A, B, and C, and produces three outputs, the updated value of C along with the original values of A and B. The paper progresses as follows:
1. Matrix-vector multiplication: linear configuration of rectangular processors.
2. Matrix-matrix multiplication: diamond configuration of hexagonal processors.
3. LU decomposition: same configuration as above.
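The inner product step is simple enough to state in code. The sketch below is my own (not from the patent): it defines the cell's function and then uses it sequentially to form a matrix-vector product. The actual apparatus evaluates the same steps in a pipelined, skewed order across a linear array of such cells, but the arithmetic per cell is identical.

```python
def ips(a, b, c):
    """Inner product step processor: pass a and b through unchanged,
    and emit the updated accumulator c + a*b."""
    return a, b, c + a * b

def matvec(A, x):
    """Matrix-vector product y = A x expressed purely in terms of the
    inner product step. Sequential sketch: the systolic timing is ignored."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):
        for j in range(len(x)):
            # Each y_i visits one cell per column, accumulating A[i][j] * x[j].
            _, _, y[i] = ips(A[i][j], x[j], y[i])
    return y

print(matvec([[1, 2], [3, 4]], [5, 6]))  # [17.0, 39.0]
```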

I am unclear about matrix-matrix multiplication for two reasons. First, the delays between writing elements of the final output C to the boundary processors (from which they are sent through the array, accumulating terms from the elements of the input matrices A and B) seem inconsistent: in some cases the next term down in an input queue is written on the next clock cycle, while in other cases it is written two cycles later. Second, I am not sure how the receiving memory connected to the boundary elements is to know which indices (row, column) of C are being received on a given cycle. I suppose that this is left to the programming of the interfacing application.
I am particularly confused by the LU decomposition exposition. LU decomposition factors a matrix into the product of a lower triangular matrix and an upper triangular matrix, and it is not obvious how a set of processors that compute only products and sums can perform a factorization. Notably, the top processor in the diamond configuration computes the reciprocal of its input rather than an inner product step.
a_ij^(1) = a_ij
a_ij^(k+1) = a_ij^(k) + l_ik * (-u_kj)

l_ik = 1 if i = k; a_ik^(k) * u_kk^(-1) if i > k; 0 if i < k
u_kj = a_kj^(k) if k <= j; 0 if k > j

Here i, j, k are subscripts, and the parenthesized superscript is not exponentiation: it indicates which iteration of a_ij we are on.
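Writing the recurrences out sequentially (a sketch of my own, collapsing the array's parallelism into ordinary loops) makes it easier to check that they really do produce L and U:

```python
def lu_from_recurrences(A):
    """Run the paper's recurrences a_ij^(k+1) = a_ij^(k) + l_ik * (-u_kj)
    sequentially, returning L (unit lower triangular) and U (upper triangular)."""
    n = len(A)
    a = [row[:] for row in A]          # a^(k), updated in place each iteration
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(k, n):
            U[k][j] = a[k][j]          # u_kj = a_kj^(k) when k <= j
        L[k][k] = 1.0                  # l_kk = 1
        inv = 1.0 / U[k][k]            # the reciprocal supplied by the top processor
        for i in range(k + 1, n):
            L[i][k] = a[i][k] * inv    # l_ik = a_ik^(k) * u_kk^(-1)
            for j in range(k + 1, n):
                a[i][j] += L[i][k] * (-U[k][j])
    return L, U

L, U = lu_from_recurrences([[4.0, 3.0], [6.0, 3.0]])
# L = [[1.0, 0.0], [1.5, 1.0]], U = [[4.0, 3.0], [0.0, -1.5]]
```

Seen this way, the recurrence is just ordinary Gaussian elimination: the update a_ij^(k+1) = a_ij^(k) - l_ik * u_kj subtracts l_ik times the pivot row from row i, which is why a stream of products and sums (plus one reciprocal) suffices for the factorization.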
I am uncertain why these recurrences are valid. The paper claims that the factorization is done via "Gaussian elimination" without pivoting, and that this is possible when the matrix is "symmetric positive definite" (presumably because positive definiteness guarantees that every pivot u_kk is nonzero, so no row exchanges are ever required).
It seems that the u_kk^(-1) term is obtained by piping A values through the topmost processor.
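On the symmetric-positive-definite condition: a quick sanity check of my own (not from the paper) is that elimination on an SPD matrix never encounters a zero pivot, which is exactly what makes Gaussian elimination without pivoting well defined.

```python
def pivots_without_pivoting(A):
    """Plain Gaussian elimination with no row exchanges; returns the
    sequence of pivots a_kk^(k) encountered along the way."""
    n = len(A)
    a = [row[:] for row in A]
    pivots = []
    for k in range(n):
        pivots.append(a[k][k])
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
    return pivots

# B^T B + I is symmetric positive definite for any real B
B = [[1.0, 2.0], [3.0, 4.0]]
n = len(B)
A = [[sum(B[k][i] * B[k][j] for k in range(n)) + (1.0 if i == j else 0.0)
     for j in range(n)] for i in range(n)]
print(pivots_without_pivoting(A))  # every pivot is strictly positive
```

For an SPD matrix all the pivots come out positive (they are ratios of the positive leading principal minors), so the reciprocal u_kk^(-1) computed by the top processor always exists.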