# Latent Dirichlet allocation

## Model overview

![](https://727008675-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LR7JX40daBlT_8LaK-D%2F-LVN3DCA7qrstWb3fFxW%2F-LVN3Df_gP9zwTwFcKAO%2FSmoothed_LDA.png?generation=1546591989408408\&alt=media)

* **Notation**

$$\alpha$$ is the parameter of the Dirichlet prior on the per-document topic distributions, $$\beta$$ is the parameter of the Dirichlet prior on the per-topic word distribution, $$\theta\_{m}$$ is the topic distribution for document $$m$$, $$\varphi\_{k}$$ is the word distribution for topic $$k$$, $$z\_{mn}$$ is the topic for the n-th word in document $$m$$, and $$w\_{mn}$$ is the specific word.

### **Generative process**

1. Choose $${\displaystyle \theta \_{i}\sim \operatorname {Dir} (\alpha )}$$, where $$i\in \{1,\dots ,M\}$$ and $$\mathrm {Dir} (\alpha )$$ is a Dirichlet distribution with symmetric parameter $$\alpha$$, which is typically sparse ($$\alpha < 1$$)
2. Choose $${\displaystyle \varphi \_{k}\sim \operatorname {Dir} (\beta )}$$, where $$k\in \{1,\dots ,K\}$$ and $$\beta$$ is typically sparse
3. For each of the word positions $$i,j$$, where $$i\in \{1,\dots ,M\}$$ and $$j\in \{1,\dots ,N\_{i}\}$$

   (a) Choose a topic $${\displaystyle z\_{i,j}\sim \operatorname {Multinomial} (\theta \_{i}).}$$

   (b) Choose a word $${\displaystyle w\_{i,j}\sim \operatorname {Multinomial} (\varphi \_{z\_{i,j}}).}$$

We can then mathematically describe the random variables as follows:

$$
{\displaystyle {\begin{aligned}{\boldsymbol {\varphi }}\_{k=1\dots K}&\sim \operatorname {Dirichlet} \_{V}({\boldsymbol {\beta }})\\{\boldsymbol {\theta }}\_{d=1\dots M}&\sim \operatorname {Dirichlet} \_{K}({\boldsymbol {\alpha }})\\z\_{d=1\dots M,w=1\dots N\_{d}}&\sim \operatorname {Categorical} \_{K}({\boldsymbol {\theta }}\_{d})\\w\_{d=1\dots M,w=1\dots N\_{d}}&\sim \operatorname {Categorical} \_{V}({\boldsymbol {\varphi }}\_{z\_{dw}})\end{aligned}}}
$$
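The generative process above can be sketched in a few lines of NumPy. This is an illustrative toy sampler, not code from the paper; the function and variable names are ours, and for simplicity every document has the same length.

```python
import numpy as np

def generate_corpus(M, K, V, alpha, beta, doc_len, rng=None):
    """Sample a toy corpus from the LDA generative process.

    M documents, K topics, V vocabulary words; alpha and beta are the
    symmetric Dirichlet hyperparameters; doc_len is each document's length.
    """
    rng = np.random.default_rng(rng)
    # Step 2: per-topic word distributions, phi_k ~ Dir(beta)
    phi = rng.dirichlet(np.full(V, beta), size=K)
    # Step 1: per-document topic distributions, theta_i ~ Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha), size=M)
    docs, topics = [], []
    for i in range(M):
        # Step 3a: a topic for each word position, z_ij ~ Multinomial(theta_i)
        z = rng.choice(K, size=doc_len, p=theta[i])
        # Step 3b: a word for each position, w_ij ~ Multinomial(phi_{z_ij})
        w = np.array([rng.choice(V, p=phi[k]) for k in z])
        topics.append(z)
        docs.append(w)
    return docs, topics, theta, phi
```

With sparse hyperparameters (e.g. `alpha=0.1`, `beta=0.01`) each sampled document concentrates on a few topics and each topic on a few words, which is the usual regime for LDA.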

## Parameter estimation

The model parameters are $$\theta, \varphi, Z$$. Below we estimate them with two methods: Variational Inference and collapsed Gibbs sampling.

### Variational Inference

The total probability of the model is:

$$
{\displaystyle P({\boldsymbol {W}},{\boldsymbol {Z}},{\boldsymbol {\theta }},{\boldsymbol {\varphi }};\alpha ,\beta )=\prod \_{i=1}^{K}P(\varphi \_{i};\beta )\prod \_{j=1}^{M}P(\theta \_{j};\alpha )\prod \_{t=1}^{N\_{j}}P(Z\_{j,t}\mid \theta \_{j})P(W\_{j,t}\mid \varphi \_{Z\_{j,t}}),}
$$

Consider approximating $$P({\boldsymbol {Z}},{\boldsymbol {\theta }},{\boldsymbol {\varphi }} \mid {\boldsymbol {W}})$$ with the following factorized variational distribution: $$q({\boldsymbol {Z}},{\boldsymbol {\theta }},{\boldsymbol {\varphi }}) = \prod \limits\_{k=1}^{K}q(\varphi\_k) \prod \limits\_{i=1}^{M} {q(\theta\_i) \prod \limits\_{j=1}^{N\_i} q(z\_{i,j}) }$$.

Using mean-field theory, we obtain the following coordinate update equations.

* **Solving for** $$q(\theta\_{i})$$

$$
\begin{aligned}
q(\theta\_{i}) & = Dirichlet(\alpha^*\_1, \alpha^*\_2, \dots, \alpha^*\_K) \\
\text{where} \quad \alpha^*\_k &= \alpha + \sum\_{j} q(z\_{i,j}=k)
\end{aligned}
$$

* **Solving for** $$q(\varphi\_{k})$$

$$
\begin{aligned}
q(\varphi\_{k}) & = Dirichlet(\beta^*\_1, \beta^*\_2, \dots, \beta^*\_V) \\
\text{where} \quad \beta^*\_v & = \beta + \sum\_{i,j} 1(w\_{i,j}=v)q(z\_{i,j}=k)
\end{aligned}
$$

* **Solving for** $$q(z\_{i, j})$$

$$
q(z\_{i, j}=k) \propto \exp(\psi(\alpha^*\_k) - \psi(\sum\_k \alpha^*\_k) + \psi(\beta^*\_{k, w\_{i,j}}) - \psi(\sum\_{v} \beta^*\_{k, v}))
$$
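The three updates can be iterated as a coordinate-ascent loop. A minimal NumPy/SciPy sketch follows; the function name, data layout (one word-id array per document), and the fixed iteration count in place of a convergence check are all our own choices.

```python
import numpy as np
from scipy.special import digamma

def mean_field_lda(docs, K, V, alpha, beta, n_iter=100, rng=None):
    """Coordinate-ascent mean-field updates for q(theta), q(phi), q(z).

    docs: list of word-id arrays. Returns (alpha_star, beta_star, q_z),
    where q_z[i] is an (N_i, K) matrix of responsibilities q(z_ij = k).
    """
    rng = np.random.default_rng(rng)
    # initialise responsibilities randomly, one row per word position
    q_z = [rng.dirichlet(np.ones(K), size=len(d)) for d in docs]
    for _ in range(n_iter):
        # q(theta_i) update: alpha*_k = alpha + sum_j q(z_ij = k)
        alpha_star = np.array([alpha + qz.sum(axis=0) for qz in q_z])
        # q(phi_k) update: beta*_{k,v} = beta + sum_{i,j} 1(w_ij = v) q(z_ij = k)
        beta_star = np.full((K, V), beta)
        for d, qz in zip(docs, q_z):
            np.add.at(beta_star.T, d, qz)  # beta_star[:, w_ij] += q(z_ij)
        # q(z_ij) update from the digamma expectations
        E_log_theta = digamma(alpha_star) - digamma(alpha_star.sum(1, keepdims=True))
        E_log_phi = digamma(beta_star) - digamma(beta_star.sum(1, keepdims=True))
        for i, d in enumerate(docs):
            log_q = E_log_theta[i] + E_log_phi[:, d].T  # shape (N_i, K)
            log_q -= log_q.max(axis=1, keepdims=True)   # stabilise the exp
            q = np.exp(log_q)
            q_z[i] = q / q.sum(axis=1, keepdims=True)
    return alpha_star, beta_star, q_z
```

The max-subtraction before exponentiating does not change the normalized responsibilities; it only prevents overflow.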

### collapsed Gibbs sampling

The following derives the update equation for collapsed Gibbs sampling, in which the $$\varphi$$'s and $$\theta$$'s are integrated out.

$$
p(z\_{m,n}=k | Z\_{-m,n}, W; \alpha, \beta) \propto (\alpha\_k+n\_{m,(\cdot )}^{k, -(m,n)}) \frac {\beta\_{w\_{m,n}}+n\_{(\cdot ),w\_{m,n}}^{k, -(m,n)}}{\sum\_v \beta\_v+n\_{(\cdot ),v}^{k, -(m,n)}}
$$

Here $$n\_{m,(\cdot )}^{k}$$ denotes the number of words in document $$m$$ assigned to topic $$k$$, and $$n\_{(\cdot ),v}^{k}$$ denotes the number of times word $$v$$ is assigned to topic $$k$$. The superscript $$-(m,n)$$ means the word at position $$(m, n)$$ is excluded from the counts.
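The sampling formula and the two count arrays above translate directly into code. This is a plain, unoptimized sketch under our own naming; `n_mk` and `n_kv` are the counts $$n\_{m,(\cdot )}^{k}$$ and $$n\_{(\cdot ),v}^{k}$$, and removing the current assignment before sampling implements the $$-(m,n)$$ superscript.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha, beta, n_sweeps=200, rng=None):
    """Collapsed Gibbs sampling for LDA with symmetric priors alpha, beta."""
    rng = np.random.default_rng(rng)
    z = [rng.integers(K, size=len(d)) for d in docs]  # random initial topics
    n_mk = np.zeros((len(docs), K))  # words in document m with topic k
    n_kv = np.zeros((K, V))          # times word v is assigned topic k
    for m, (d, zd) in enumerate(zip(docs, z)):
        for w, k in zip(d, zd):
            n_mk[m, k] += 1
            n_kv[k, w] += 1
    for _ in range(n_sweeps):
        for m, (d, zd) in enumerate(zip(docs, z)):
            for n, w in enumerate(d):
                k = zd[n]
                # remove the current assignment: the -(m,n) superscript
                n_mk[m, k] -= 1
                n_kv[k, w] -= 1
                # p(z_mn = k | ...) ∝ (alpha + n_mk) (beta + n_kw) / (V beta + n_k)
                p = (alpha + n_mk[m]) * (beta + n_kv[:, w]) \
                    / (V * beta + n_kv.sum(axis=1))
                k = rng.choice(K, p=p / p.sum())
                zd[n] = k
                n_mk[m, k] += 1
                n_kv[k, w] += 1
    return z, n_mk, n_kv
```

After burn-in, point estimates follow from the counts, e.g. $$\hat{\theta}\_{m,k} \propto \alpha + n\_{m,(\cdot )}^{k}$$.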

* **Derivation details**

Define $$\Delta(\boldsymbol{\alpha}) = \frac{\prod \Gamma(\alpha\_i)}{\Gamma(\sum \alpha\_i)}$$.

$$
p(z\_{m,n} | z\_{-(m,n)}, W) = \frac{p(W, z)} {p(W, z\_{-(m,n)})} = \frac{p(W|z)p(z)} {p(W\_{-(m,n)}|z\_{-(m,n)}) p(w\_{m,n}) p(z\_{-(m,n)})} \propto \frac{p(W|z)}{p(W\_{-(m,n)}|z\_{-(m,n)})} \frac{p(z)}{p(z\_{-(m,n)})}
$$

We now derive expressions for $$p(W|z)$$ and $$p(z)$$.

$$
\begin{aligned}
p(w | z; \beta) & = \int p(w|z, \varphi) p(\varphi | \boldsymbol{\beta}) d\varphi \\
& = \int \prod\_{i=1}^{M} \prod\_{j=1}^{N\_i} p(w\_{i,j}|z\_{i,j}, \varphi) \prod\_{k=1}^{K} p(\varphi\_k | \boldsymbol{\beta}) d\varphi \\
& = \int \prod\_{i=1}^{M} \prod\_{j=1}^{N\_i} \varphi\_{z\_{i,j}, w\_{i,j}} \prod\_{k=1}^{K} \frac{1}{\Delta(\boldsymbol{\beta})} \prod\_{v} \varphi\_{k,v}^{\beta\_v-1} d \varphi \\
& = \int \prod\_{k=1}^{K} \frac{1}{\Delta(\boldsymbol{\beta})} \prod\_v \varphi\_{k,v}^{\beta\_v+n\_{(\cdot ),v}^{k}-1} d \varphi \\
& = \prod\_{k=1}^{K} \frac{\Delta(\beta+n\_{(\cdot )}^{k})}{\Delta(\beta)} \quad \text{where} \quad n\_{(\cdot )}^{k} = (n\_{(\cdot ),1}^{k}, \dots, n\_{(\cdot ),V}^{k})
\end{aligned}
$$

$$
\begin{aligned}
p(z | \alpha) & = \int p(z | \theta) p(\theta | \alpha) d\theta \\
& = \int \prod\_{m=1}^{M} \prod\_{j=1}^{N\_m} p(z\_{m,j}|\theta\_m) \prod\_{m=1}^{M} p(\theta\_m | \alpha) d \theta \\
& = \int \prod\_{m=1}^{M} \prod\_{j=1}^{N\_m} \theta\_{m, z\_{m,j}} \prod\_{m=1}^{M} \frac{1}{\Delta(\alpha)} \prod\_{k} \theta\_{m, k}^{\alpha\_k-1} d \theta \\
& = \int \prod\_{m=1}^{M} \frac{1}{\Delta(\alpha)} \prod\_{k} \theta\_{m, k}^{\alpha\_k+n\_{m,(\cdot )}^{k}-1} d \theta \\
& = \prod\_{m=1}^{M} \frac{\Delta(\alpha+n\_{m,(\cdot )})}{\Delta(\alpha)} \quad \text{where} \quad n\_{m,(\cdot )} = (n\_{m,(\cdot )}^{1}, \dots, n\_{m,(\cdot )}^{K})
\end{aligned}
$$

$$
p(Z, W | \alpha, \beta) = p(Z | \alpha) p(W | Z, \beta) = \prod\_{m=1}^{M} \frac{\Delta(\alpha+n\_{m,(\cdot )})}{\Delta(\alpha)} \prod\_{k=1}^{K} \frac{\Delta(\beta+n\_{(\cdot )}^{k})}{\Delta(\beta)}
$$

Substituting this expression into $$p(z\_{m,n} | z\_{-(m,n)}, W)$$ gives:

$$
\begin{aligned}
p(z\_{m,n} = k | z\_{-(m,n)}, W) & \propto \frac{p(W, Z | \alpha, \beta)}{p(W\_{-(m,n)}, Z\_{-(m,n)}|\alpha, \beta )} \\
&\propto \frac{\Delta(\beta+n\_{(\cdot )}^{k})}{\Delta(\beta + n\_{(\cdot )}^{k,-(m,n)})} \frac{\Delta(\alpha + n\_{m,(\cdot )})}{\Delta(\alpha + n\_{m,(\cdot )}^{-(m,n)})} \\
&\propto \frac{\prod\_v \Gamma(\beta+n\_{(\cdot ),v}^{k})}{\prod\_v \Gamma(\beta+n\_{(\cdot ),v}^{k, -(m,n)})} \frac {\prod\_k \Gamma(\alpha+n\_{m,(\cdot )}^{k})}{\prod\_k \Gamma(\alpha+n\_{m,(\cdot )}^{k, -(m,n)})} \frac{\Gamma(\sum\_v (\beta+n\_{(\cdot ),v}^{k, -(m,n)}))}{\Gamma(\sum\_v (\beta+n\_{(\cdot ),v}^{k}))} \frac {\Gamma(\sum\_k (\alpha+n\_{m,(\cdot )}^{k, -(m, n)}))}{\Gamma(\sum\_k (\alpha+n\_{m,(\cdot )}^{k}))} \\
& \propto \frac {(\beta\_{w\_{m,n}}+n\_{(\cdot ),w\_{m,n}}^{k, -(m,n)}) (\alpha\_k+n\_{m,(\cdot )}^{k, -(m,n)})}{(\sum\_v \beta\_v+n\_{(\cdot ),v}^{k, -(m,n)}) (\sum\_k \alpha\_k+n\_{m,(\cdot )}^{k, -(m, n)})} \\
& \propto (\alpha\_k+n\_{m,(\cdot )}^{k, -(m,n)}) \frac {\beta\_{w\_{m,n}}+n\_{(\cdot ),w\_{m,n}}^{k, -(m,n)}}{\sum\_v \beta\_v+n\_{(\cdot ),v}^{k, -(m,n)}}
\end{aligned}
$$

* **Applied to new documents**

Once the LDA model has been trained, we can compute the topic distribution of a new, unseen document; the procedure closely mirrors the training process.

For a new document, we treat the $$\hat{\varphi}\_{k,t}$$ term in the Gibbs sampling formula as fixed, since it is provided by the model learned from the training corpus (i.e., the topic-word counts are held constant). During sampling we therefore only need to estimate the document's topic distribution $$\theta\_{new}$$.
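This held-out inference can be sketched as a small variant of the training sampler. The function below is our own illustration: it takes the trained topic-word counts (here called `n_kv`), freezes $$\hat{\varphi}$$, and samples only the new document's topic assignments.

```python
import numpy as np

def infer_theta_new(doc, n_kv, alpha, beta, n_sweeps=100, rng=None):
    """Estimate theta_new for an unseen document, keeping the trained
    topic-word counts n_kv (and hence phi_hat) fixed."""
    rng = np.random.default_rng(rng)
    K, V = n_kv.shape
    # point estimate of the topic-word distributions from training counts
    phi_hat = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
    z = rng.integers(K, size=len(doc))              # random initial topics
    n_k = np.bincount(z, minlength=K).astype(float)  # this document's counts
    for _ in range(n_sweeps):
        for n, w in enumerate(doc):
            n_k[z[n]] -= 1
            # only the document-level counts change; phi_hat stays fixed
            p = (alpha + n_k) * phi_hat[:, w]
            z[n] = rng.choice(K, p=p / p.sum())
            n_k[z[n]] += 1
    return (n_k + alpha) / (n_k + alpha).sum()
```

Averaging $$\theta\_{new}$$ over several post-burn-in sweeps, rather than reading it off the final state, gives a more stable estimate.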

## References

* [Latent Dirichlet Allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) by David M. Blei, Andrew Y. Ng, Michael I. Jordan
* [Parameter estimation for text analysis](http://www.arbylon.net/publications/text-est.pdf) by Gregor Heinrich
