Model Overview
$\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions, $\beta$ is the parameter of the Dirichlet prior on the per-topic word distribution, $\theta_m$ is the topic distribution for document $m$, $\varphi_k$ is the word distribution for topic $k$, $z_{m,n}$ is the topic for the $n$-th word in document $m$, and $w_{m,n}$ is the specific word.
Generative process
1. Choose $\theta_i \sim \mathrm{Dir}(\alpha)$, where $i \in \{1,\dots,M\}$ and $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, which is typically sparse ($\alpha < 1$).
2. Choose $\varphi_k \sim \mathrm{Dir}(\beta)$, where $k \in \{1,\dots,K\}$ and $\beta$ is typically sparse as well.
3. For each word position $(i,j)$, where $i \in \{1,\dots,M\}$ and $j \in \{1,\dots,N_i\}$:
   (a) Choose a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$.
   (b) Choose a word $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$.
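To make this concrete, here is a minimal NumPy sketch of the generative process. All sizes (`M`, `K`, `V`, the document lengths) and hyperparameter values are hypothetical toy choices, not part of the model:

```python
import numpy as np

M, K, V = 5, 3, 20          # documents, topics, vocabulary size (toy values)
alpha, beta = 0.1, 0.01     # sparse symmetric Dirichlet hyperparameters
N = [10] * M                # N_i: length of each document
rng = np.random.default_rng(0)

phi = rng.dirichlet(np.full(V, beta), size=K)      # phi_k ~ Dir(beta), one row per topic
theta = rng.dirichlet(np.full(K, alpha), size=M)   # theta_i ~ Dir(alpha), one row per document

docs = []
for i in range(M):
    words = []
    for j in range(N[i]):
        z = rng.choice(K, p=theta[i])              # z_{i,j} ~ Multinomial(theta_i)
        w = rng.choice(V, p=phi[z])                # w_{i,j} ~ Multinomial(phi_{z_{i,j}})
        words.append(w)
    docs.append(words)
```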
We can then mathematically describe the random variables as follows:
$$
\begin{aligned}
\varphi_{k} &\sim \mathrm{Dirichlet}_V(\beta), & k &\in \{1,\dots,K\} \\
\theta_{d} &\sim \mathrm{Dirichlet}_K(\alpha), & d &\in \{1,\dots,M\} \\
z_{d,w} &\sim \mathrm{Categorical}_K(\theta_d), & d &\in \{1,\dots,M\},\ w \in \{1,\dots,N_d\} \\
w_{d,w} &\sim \mathrm{Categorical}_V(\varphi_{z_{d,w}}), & d &\in \{1,\dots,M\},\ w \in \{1,\dots,N_d\}
\end{aligned}
$$

Model Inference
The parameters to be estimated are $\theta$, $\varphi$, and $Z$. Below we solve for them using two methods: Variational Inference and collapsed Gibbs sampling.
Variational Inference
The total probability of the model is:

$$
P(W, Z, \theta, \varphi; \alpha, \beta) = \prod_{i=1}^{K} P(\varphi_i; \beta) \prod_{j=1}^{M} P(\theta_j; \alpha) \prod_{t=1}^{N_j} P(Z_{j,t} \mid \theta_j)\, P(W_{j,t} \mid \varphi_{Z_{j,t}})
$$

Now consider approximating $P(Z, \theta, \varphi \mid W)$ with the following factorized variational distribution:

$$
q(Z, \theta, \varphi) = \prod_{k=1}^{K} q(\varphi_k) \prod_{i=1}^{M} \Big\{ q(\theta_i) \prod_{j=1}^{N_i} q(z_{i,j}) \Big\}
$$
Using mean-field theory, we can derive the following iterative update equations.
Solving for $q(\theta_i)$

$$
q(\theta_i) = \mathrm{Dirichlet}(\alpha_1^*, \alpha_2^*, \dots, \alpha_K^*), \quad \text{where } \alpha_k^* = \alpha + \sum_j q(z_{i,j} = k)
$$

Solving for $q(\varphi_k)$
$$
q(\varphi_k) = \mathrm{Dirichlet}(\beta_1^*, \beta_2^*, \dots, \beta_V^*), \quad \text{where } \beta_v^* = \beta + \sum_{i,j} \mathbb{1}(w_{i,j} = v)\, q(z_{i,j} = k)
$$

Solving for $q(z_{i,j})$
$$
q(z_{i,j} = k) \propto \exp\Big( \psi(\alpha_k^*) - \psi\big(\textstyle\sum_{k'} \alpha_{k'}^*\big) + \psi(\beta_{k, w_{i,j}}^*) - \psi\big(\textstyle\sum_{v} \beta_{k,v}^*\big) \Big)
$$

where $\psi$ is the digamma function.
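A minimal sketch of these three coordinate-ascent updates, assuming documents are given as lists of integer word ids as in the earlier snippet (the function name `mean_field_lda` and all defaults are hypothetical):

```python
import numpy as np
from scipy.special import digamma

def mean_field_lda(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # q(z_{i,j}): one K-vector of topic responsibilities per word position
    q_z = [rng.dirichlet(np.ones(K), size=len(d)) for d in docs]
    for _ in range(iters):
        # alpha*_{i,k} = alpha + sum_j q(z_{i,j} = k)
        alpha_star = np.stack([alpha + q.sum(axis=0) for q in q_z])
        # beta*_{k,v} = beta + sum_{i,j} 1(w_{i,j} = v) q(z_{i,j} = k)
        beta_star = np.full((K, V), beta)
        for d, q in zip(docs, q_z):
            for w, qk in zip(d, q):
                beta_star[:, w] += qk
        # q(z_{i,j}=k) ∝ exp(ψ(α*_k) − ψ(Σ α*) + ψ(β*_{k,w}) − ψ(Σ_v β*_{k,v}))
        e_theta = digamma(alpha_star) - digamma(alpha_star.sum(axis=1, keepdims=True))
        e_phi = digamma(beta_star) - digamma(beta_star.sum(axis=1, keepdims=True))
        for i, d in enumerate(docs):
            logits = e_theta[i][None, :] + e_phi[:, d].T   # shape (N_i, K)
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            q_z[i] = p / p.sum(axis=1, keepdims=True)
    return alpha_star, beta_star, q_z
```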
Collapsed Gibbs sampling

The following derives the equations for collapsed Gibbs sampling, in which the $\varphi$'s and $\theta$'s are integrated out.
$$
p(z_{m,n} = k \mid Z_{-(m,n)}, W; \alpha, \beta) \propto \big(\alpha_k + n_{m,(\cdot)}^{k, -(m,n)}\big) \frac{\beta_{w_{m,n}} + n_{(\cdot), w_{m,n}}^{k, -(m,n)}}{\sum_v \beta_v + n_{(\cdot), v}^{k, -(m,n)}}
$$

Here $n_{m,(\cdot)}^{k}$ is the number of words in document $m$ assigned to topic $k$, and $n_{(\cdot),v}^{k}$ is the number of times word $v$ is assigned to topic $k$. The superscript $-(m,n)$ means the word indexed $(m,n)$ is excluded from the counts.
Define $\Delta(\alpha) = \dfrac{\prod_i \Gamma(\alpha_i)}{\Gamma\big(\sum_i \alpha_i\big)}$.
$$
p(z_{m,n} \mid z_{-(m,n)}, W) = \frac{p(W, z)}{p(W, z_{-(m,n)})} = \frac{p(W \mid z)\, p(z)}{p(W_{-(m,n)} \mid z_{-(m,n)})\, p(w_{m,n})\, p(z_{-(m,n)})} \propto \frac{p(W \mid z)\, p(z)}{p(W_{-(m,n)} \mid z_{-(m,n)})\, p(z_{-(m,n)})}
$$

Next we derive expressions for $p(W \mid z)$ and $p(z)$ separately.
$$
\begin{aligned}
p(W \mid z; \beta) &= \int p(W \mid z, \varphi)\, p(\varphi \mid \beta)\, d\varphi \\
&= \int \prod_{i=1}^{M} \prod_{j=1}^{N_i} p(w_{i,j} \mid z_{i,j}, \varphi) \prod_{k=1}^{K} p(\varphi_k \mid \beta)\, d\varphi \\
&= \int \prod_{i=1}^{M} \prod_{j=1}^{N_i} \varphi_{z_{i,j}, w_{i,j}} \prod_{k=1}^{K} \frac{1}{\Delta(\beta)} \prod_{v} \varphi_{k,v}^{\beta_v - 1}\, d\varphi \\
&= \int \prod_{k=1}^{K} \frac{1}{\Delta(\beta)} \prod_{v} \varphi_{k,v}^{\beta_v + n_{(\cdot),v}^{k} - 1}\, d\varphi \\
&= \prod_{k=1}^{K} \frac{\Delta(\beta + n_{(\cdot)}^{k})}{\Delta(\beta)}, \quad \text{where } n_{(\cdot)}^{k} = (n_{(\cdot),1}^{k}, \dots, n_{(\cdot),V}^{k})
\end{aligned}
$$

$$
\begin{aligned}
p(z \mid \alpha) &= \int p(z \mid \theta)\, p(\theta \mid \alpha)\, d\theta \\
&= \int \prod_{i=1}^{M} \prod_{j=1}^{N_i} p(z_{i,j} \mid \theta_i) \prod_{m=1}^{M} p(\theta_m \mid \alpha)\, d\theta \\
&= \int \prod_{i=1}^{M} \prod_{j=1}^{N_i} \theta_{i, z_{i,j}} \prod_{m=1}^{M} \frac{1}{\Delta(\alpha)} \prod_{k} \theta_{m,k}^{\alpha_k - 1}\, d\theta \\
&= \int \prod_{m=1}^{M} \frac{1}{\Delta(\alpha)} \prod_{k} \theta_{m,k}^{\alpha_k + n_{m,(\cdot)}^{k} - 1}\, d\theta \\
&= \prod_{m=1}^{M} \frac{\Delta(\alpha + n_{m,(\cdot)})}{\Delta(\alpha)}, \quad \text{where } n_{m,(\cdot)} = (n_{m,(\cdot)}^{1}, \dots, n_{m,(\cdot)}^{K})
\end{aligned}
$$

Therefore,

$$
p(Z, W \mid \alpha, \beta) = p(Z \mid \alpha)\, p(W \mid Z, \beta) = \prod_{k=1}^{K} \frac{\Delta(\beta + n_{(\cdot)}^{k})}{\Delta(\beta)} \prod_{m=1}^{M} \frac{\Delta(\alpha + n_{m,(\cdot)})}{\Delta(\alpha)}
$$
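Since the $\Delta$ ratios involve products of Gamma functions, they are best evaluated in log space. A small sketch, assuming the count matrices $n_{(\cdot),v}^{k}$ and $n_{m,(\cdot)}^{k}$ have already been tallied from an assignment $Z$ (the function names are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def log_delta(x):
    # log Δ(x) = Σ_i log Γ(x_i) − log Γ(Σ_i x_i)
    return gammaln(x).sum() - gammaln(x.sum())

def log_joint(n_kv, n_mk, alpha, beta):
    # log p(Z, W | α, β) from topic-word counts n_kv (K x V) and
    # document-topic counts n_mk (M x K), symmetric hyperparameters
    K, V = n_kv.shape
    M = n_mk.shape[0]
    lp = sum(log_delta(beta + n_kv[k]) - log_delta(np.full(V, beta))
             for k in range(K))
    lp += sum(log_delta(alpha + n_mk[m]) - log_delta(np.full(K, alpha))
              for m in range(M))
    return lp
```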
Substituting the above expressions into $p(z_{m,n} \mid z_{-(m,n)}, W)$, we have:
$$
\begin{aligned}
p(z_{m,n} = k \mid z_{-(m,n)}, W) &\propto \frac{p(W, Z \mid \alpha, \beta)}{p(W_{-(m,n)}, Z_{-(m,n)} \mid \alpha, \beta)} \\
&\propto \frac{\Delta(\beta + n_{(\cdot)}^{k})}{\Delta(\beta + n_{(\cdot)}^{k, -(m,n)})} \cdot \frac{\Delta(\alpha + n_{m,(\cdot)})}{\Delta(\alpha + n_{m,(\cdot)}^{-(m,n)})} \\
&\propto \frac{\prod_v \Gamma(\beta_v + n_{(\cdot),v}^{k})}{\prod_v \Gamma(\beta_v + n_{(\cdot),v}^{k, -(m,n)})} \cdot \frac{\prod_k \Gamma(\alpha_k + n_{m,(\cdot)}^{k})}{\prod_k \Gamma(\alpha_k + n_{m,(\cdot)}^{k, -(m,n)})} \cdot \frac{\Gamma\big(\sum_v (\beta_v + n_{(\cdot),v}^{k, -(m,n)})\big)}{\Gamma\big(\sum_v (\beta_v + n_{(\cdot),v}^{k})\big)} \cdot \frac{\Gamma\big(\sum_k (\alpha_k + n_{m,(\cdot)}^{k, -(m,n)})\big)}{\Gamma\big(\sum_k (\alpha_k + n_{m,(\cdot)}^{k})\big)} \\
&\propto \frac{\big(\beta_{w_{m,n}} + n_{(\cdot),w_{m,n}}^{k, -(m,n)}\big)\big(\alpha_k + n_{m,(\cdot)}^{k, -(m,n)}\big)}{\big(\sum_v (\beta_v + n_{(\cdot),v}^{k, -(m,n)})\big)\big(\sum_k (\alpha_k + n_{m,(\cdot)}^{k, -(m,n)})\big)} \\
&\propto \big(\alpha_k + n_{m,(\cdot)}^{k, -(m,n)}\big) \frac{\beta_{w_{m,n}} + n_{(\cdot),w_{m,n}}^{k, -(m,n)}}{\sum_v (\beta_v + n_{(\cdot),v}^{k, -(m,n)})}
\end{aligned}
$$

The second-to-last step uses $\Gamma(x+1) = x\,\Gamma(x)$: the counts with and without word $(m,n)$ differ by exactly one, so each ratio of Gamma functions collapses to a single factor. In the last line, the $\sum_k$ term is dropped because it does not depend on the particular topic $k$.
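A minimal sketch of one possible implementation of this sampler, reusing the toy `docs` representation from the earlier snippets (the name `gibbs_lda` and the defaults are hypothetical):

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(d)) for d in docs]   # random init of z_{m,n}
    n_mk = np.zeros((len(docs), K))                    # n_{m,(.)}^k: doc-topic counts
    n_kv = np.zeros((K, V))                            # n_{(.),v}^k: topic-word counts
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            n_mk[m, z[m][n]] += 1
            n_kv[z[m][n], w] += 1
    for _ in range(sweeps):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                k = z[m][n]
                n_mk[m, k] -= 1; n_kv[k, w] -= 1       # remove word (m,n) from counts
                # p(z_{m,n}=k | ...) ∝ (α + n_mk) * (β + n_kv[:,w]) / (Vβ + Σ_v n_kv)
                p = (alpha + n_mk[m]) * (beta + n_kv[:, w]) / (V * beta + n_kv.sum(axis=1))
                k = rng.choice(K, p=p / p.sum())
                z[m][n] = k
                n_mk[m, k] += 1; n_kv[k, w] += 1       # add it back under the new topic
    return z, n_mk, n_kv
```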
Once the LDA model has been trained, we can compute the topic distribution of a new, unseen document; the procedure closely mirrors training.
For a new document, we simply treat the $\hat{\varphi}_{k,t}$ term in the Gibbs sampling formula as fixed, supplied by the model learned from the training corpus (i.e., the per-topic word counts are held constant). During sampling we therefore only need to estimate the document's topic distribution $\theta_{\text{new}}$, as in the sketch below.
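A sketch of this fold-in procedure, assuming `n_kv` holds the trained topic-word counts from `gibbs_lda` above, so $\hat{\varphi}$ stays fixed while only the new document's topic counts are resampled:

```python
import numpy as np

def infer_theta_new(doc, n_kv, alpha=0.1, beta=0.01, sweeps=100, seed=0):
    K, V = n_kv.shape
    rng = np.random.default_rng(seed)
    # phi^_k estimated once from training counts and held fixed throughout
    phi_hat = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
    z = rng.integers(K, size=len(doc))
    n_k = np.bincount(z, minlength=K).astype(float)    # topic counts of the new doc only
    for _ in range(sweeps):
        for n, w in enumerate(doc):
            n_k[z[n]] -= 1
            p = (alpha + n_k) * phi_hat[:, w]          # the phi part is constant
            z[n] = rng.choice(K, p=p / p.sum())
            n_k[z[n]] += 1
    return (alpha + n_k) / (K * alpha + len(doc))      # estimate of theta_new
```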