Using Recurrent Neural Networks

I have recently been reading papers on natural language processing (NLP).
I encountered some models I had not seen before, namely Maximum Entropy Markov Models and Conditional Random Fields. In a Markov process, the next state depends only on the current state.
The paper describes an architecture in which the output of the RNN becomes the input to a CRF (Conditional Random Field). It is unclear whether the Recurrent Neural Network and the Conditional Random Field are trained as one integrated model or merely chained through the RNN's output. A CRF should be similar to a Hidden Markov Model. The goal is the label sequence with the highest probability: the CRF seeks to produce the sequence of labels most likely given the sequence of words. The CRF maintains, for each word position in the sequence, a set of joint probabilities of labels for the words seen up to that point. This is the upper row of the R-CRF in the paper.

The Viterbi algorithm is used to decide the most likely sequence of states given the word observations, the transition probabilities between labels, and the marginal label probabilities. These marginal label probabilities are the initial probabilities, i.e., the transition probabilities when the previous state is null. In the natural language processing example of this paper, a short-term memory seems to be implemented by connecting the output label for the t-th word to the hidden layer for the (t+1)-st word. Long-term memory seems to be in place through the retention of the sentence state in the hidden layer's s(t-1) input to the s(t) node, where s(t) represents the first t-1 words seen.
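The Viterbi decoding described above can be sketched as follows. This is a minimal generic implementation, not the paper's code; the names `emit`, `trans`, and `init` are my own, with `init` playing the role of the marginal (null-previous-state) label probabilities:

```python
import numpy as np

def viterbi(emit, trans, init):
    """Most likely label sequence given per-word label scores.

    emit:  (T, K) log-probabilities of each of K labels for each of T words
    trans: (K, K) log transition probabilities, trans[i, j] = log P(j | i)
    init:  (K,)   log marginal (initial) label probabilities
    """
    T, K = emit.shape
    score = init + emit[0]                 # best log-score of a path ending in each label
    back = np.zeros((T, K), dtype=int)     # backpointers to the best previous label
    for t in range(1, T):
        cand = score[:, None] + trans      # cand[i, j]: extend best path ending at i with j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    # Trace the backpointers from the best final label
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Working in log space turns the products of probabilities along a path into sums, which avoids underflow for long sentences.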
The gradients of the objective function Q, which seem to be defined in terms of the Dirac delta function, made little sense at first. Upon further review, if the delta is 1 when y* = y and 0 otherwise (a Kronecker delta rather than a Dirac delta), it does make sense. The derivatives with respect to z(y_t = k) and a(i, j) both seem to be derivatives of the logarithmic objective function. While the simplifications in terms of alpha and beta (the forward- and backward-pass scores) are not so straightforward, they make sense qualitatively.
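If the delta is read that way, the unary-score gradient takes the familiar "observed minus expected" form: dQ/dz(t, k) = delta(y*_t = k) - P(y_t = k). A numeric sketch under that assumption (the variables z, a, and eta follow the note's notation; the gold sequence `y_star` is arbitrary, and the marginals are computed by brute-force enumeration, which only works for tiny examples):

```python
import numpy as np
from itertools import product

# Tiny CRF: T=3 positions, K=2 labels, random unary and transition scores.
T, K = 3, 2
rng = np.random.default_rng(0)
z = rng.normal(size=(T, K))   # unary (emission) scores, as from the RNN
a = rng.normal(size=(K, K))   # label transition scores
eta = 1.0                     # transition weight, as in the note's eta*a terms

def path_score(y):
    s = sum(z[t, y[t]] for t in range(T))
    s += sum(eta * a[y[t - 1], y[t]] for t in range(1, T))
    return np.exp(s)

Z = sum(path_score(y) for y in product(range(K), repeat=T))  # partition sum
y_star = (0, 1, 1)  # an arbitrary gold label sequence for illustration

# Marginal P(y_t = k) by summing normalized scores of all paths through (t, k)
marg = np.zeros((T, K))
for y in product(range(K), repeat=T):
    p = path_score(y) / Z
    for t in range(T):
        marg[t, y[t]] += p

# dQ/dz[t, k] = delta(y*_t = k) - P(y_t = k): observed minus expected counts
grad_z = np.zeros((T, K))
for t in range(T):
    grad_z[t, y_star[t]] += 1.0
grad_z -= marg
```

Because each row of the delta term and each row of the marginals both sum to 1, the gradient sums to zero over the labels at every position.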
For example, the second term of the dQ/da(j,i) formula is -eta * SIGMA(t)( alpha(t-1, j) * exp(eta*a(j,i) + z(t,i)) * beta(t, i) ) / SIGMA(k)( alpha(t, k) * beta(t, k) )
This essentially means the total score of all paths with a j->i state connection, divided by the total score of all possible paths. The direct derivative of the second term of the log objective is -SIGMA(y: y(t-1)=j, y(t)=i)( exp(SIGMA(t)(eta*a + z)) ) / SIGMA(y)( exp(SIGMA(t)(eta*a + z)) ), which essentially has the same meaning. alpha(t, i) and beta(t-1, q) are both recursively defined, in terms of alpha(t-1, ...) and beta(t, ...) respectively. This is essentially a Markov model, where the score of a path through state k is the score of the segment ending at k times the score of the segment beginning from k.
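The claim that alpha(t-1, j) * exp(eta*a(j,i) + z(t,i)) * beta(t, i) sums the scores of all paths using a j->i transition at step t can be checked numerically against brute-force enumeration. A sketch under the same assumed notation as above (variable names are mine, not the paper's):

```python
import numpy as np
from itertools import product

T, K = 4, 3
rng = np.random.default_rng(1)
z = rng.normal(size=(T, K))   # unary scores
a = rng.normal(size=(K, K))   # transition scores
eta = 1.0

# Forward: alpha[t, i] = total score of all label prefixes ending in label i at step t
alpha = np.zeros((T, K))
alpha[0] = np.exp(z[0])
for t in range(1, T):
    alpha[t] = np.exp(z[t]) * (alpha[t - 1] @ np.exp(eta * a))

# Backward: beta[t, i] = total score of all label suffixes starting from label i at step t
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = np.exp(eta * a) @ (np.exp(z[t + 1]) * beta[t + 1])

def path_score(y):
    s = sum(z[t, y[t]] for t in range(T))
    s += sum(eta * a[y[t - 1], y[t]] for t in range(1, T))
    return np.exp(s)

# The forward pass at the last step recovers the total score of all paths
Z = sum(path_score(y) for y in product(range(K), repeat=T))
assert np.isclose(Z, alpha[-1].sum())

# alpha/exp/beta product recovers the total score of paths with j -> i at step t
t, j, i = 2, 0, 1
with_ji = sum(path_score(y) for y in product(range(K), repeat=T)
              if y[t - 1] == j and y[t] == i)
assert np.isclose(with_ji, alpha[t - 1, j] * np.exp(eta * a[j, i] + z[t, i]) * beta[t, i])
```

The ratio `with_ji / Z` is exactly the per-timestep term inside the gradient formula above: the score mass of paths using that transition, normalized by all paths.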