2.word2vec的前世今生.md · 机器学习笔记

Word2vec的前世今生

统计语言模型

目前每天都在产生大量的文本、图片、语音、视频数据，对数据进行处理并从中挖掘出有价值的信息，离不开自然语言处理技术，其中统计语言模型是其中重要的一环，它是NLP的基础，被广泛应用于语音识别、机器翻译、分词、词性标注和信息检索等任务。

统计语言模型是用来计算一个句子概率的概率模型，通常基于一个语料库来创建，那什么叫做一个句子的概率呢? 假设 $W=w_1^T:=\left(w_1, w_2, \cdots, w_T\right)$ 表示由 $T$ 个词 $w_1, w_2, \cdots, w_T$ 按顺序构成的一个句子, 则 $w_1, w_2, \cdots, w_T$ 的联合概率 $p(s)=p\left(w_1^T\right)=p\left(w_1, w_2, \ldots, w_T\right)=\prod_{t=1}^T p\left(w_t \mid \text { Context }\right)$ 就是这个句子的概率. 利用 Bayes 公式, 上式可以被链式地分解为 $p\left(w_1^T\right)=p\left(w_1\right) \cdot p\left(w_2 \mid w_1\right) \cdot p\left(w_3 \mid w_1^2\right) \cdots p\left(w_T \mid w_1^{T-1}\right),$ 其中的 (条件) 概率 $p\left(w_1\right), p\left(w_2 \mid w_1\right), p\left(w_3 \mid w_1^2\right), \cdots, p\left(w_T \mid w_1^{T-1}\right)$ 就是语言模型的参数, 若这些参数已经全部算得, 那么给定一个句子 $w_1^T$, 就可以很快地算出相应的 $p\left(w_1^T\right)$ 了。

在公式(1)中, Context即为上下文，根据对 Context不同的划分方法，可以分为几大类：

上下文无关模型（Context=NULL）

该模型仅仅考虑当前词本身的概率，不考虑该词所对应的上下文环境。这是一种最简单，易于实现，但没有多大实际应用价值的统计语言模型。

\[p\left(w_t \mid \text { Context }\right)=p\left(w_t\right)=\frac{N_{w_t}}{N}\]

不考虑任何上下文信息，仅仅依赖于训练文本中的词频统计。它是n-gram模型中当 n=1的特殊情形，所以有时也称作 Unigram Model（一元文法统计模型）。实际应用中，常被应用到一些商用语音识别系统中。

### n-gram模型

n=1时，就是上面所说的上下文无关模型，这里 n-gram 一般认为是 N>=2是的上下文相关模型。当 n=2时，也称为 Bigram语言模型，直观的想，在自然语言中 "白色汽车"的概率比"白色飞翔"的概率要大很多，也就是 p(汽车

白色)> p(飞翔

白色)。n>2也类似，只是往前看 n-1个词而不是一个词。

一般 n-gram模型优化的目标是最大 log似然，即

\[\prod_{t=1}^T p_t\left(w_t \mid w_{t-\mathrm{n}+1}, w_{t-\mathrm{n}+2}, \ldots, w_{t-1}\right) \log p_m\left(w_t \mid w_{t-\mathrm{n}+1}, w_{t-\mathrm{n}+2}, \ldots, w_{t-1}\right)\]

n-gram模型的优点包含了前 N-1个词所能提供的全部信息，这些信息对当前词出现具有很强的约束力。同时因为只看 N-1个词而不是所有词也使得模型的效率较高。

基本思想：假定一个词出现的概率只与它前面固定数目的词相关。

主要工作：在语料中统计各种词串出现的次数及平滑化处理，概率值计算好之后就存储起来，下次需要计算一个句子的概率时，只需要找到相关的概率参数，将它们进行连乘。

然而, 在机器学习领域有一种通用的招数是这样的: 对所考虑的问题建模后先为其构造一个目标函数, 然后对这个目标函数进行优化, 从而求得一组最优的参数, 最后利用这组最优参数对应的模型来进行预测. 对于统计语言模型而言, 利用最大似然, 可把目标函数设为 $\prod_{w \in \mathcal{C}} p(w \mid \text { Context }(w)) .$ 其中 $\mathcal{C}$ 表示语料 (Corpus), Context $(w)$ 表示词 $w$ 的上下文 (Context), 即 $w$ 周边的词的集合. 当 Context $(w)$ 为空时, 就取 $p(w \mid \operatorname{Context}(w))=p(w)$. 特别地, 对于前面介绍的 $\mathrm{n}$-gram 模型, 就有 $\operatorname{Context}\left(w_i\right)=w_{i-n+1}^{i-1}$.

当然, 实际应用中常采用最大对数似然, 即把目标函数设为 $\mathcal{L}=\sum_{w \in \mathcal{C}} \log p(w \mid \operatorname{Context}(w)),$ 然后对这个函数进行最大化. 从上式可见, 概率 $p(w \mid \operatorname{Context}(w))$ 已被视为关于 $w$ 和 $\operatorname{Context}(w)$ 的函数, 即 $p(w \mid \operatorname{Context}(w))=F(w, \operatorname{Context}(w), \theta)$

其中 $\theta$ 为待定参数集. 这样一来, 一旦对 (3.4) 进行优化得到最优参数集 $\theta^$ 后, $F$ 也就唯一被确定了, 以后任何概率 $p(w \mid \operatorname{Context}(w))$ 就可以通过函数 $F\left(w, \operatorname{Context}(w), \theta^\right)$ 来计算了. 与 n-gram 相比, 这种方法不需要 (事先计算并) 保存所有的概率值, 而是通过直接计算来获取, 且通过选取合适的模型可使得 $\theta$ 中参数的个数远小于 n-gram 中模型参数的个数. 很显然, 对于这样一种方法, 最关键的地方就在于函数 $F$ 的构造了.下一小节将介绍一种通过神经网络来构造 $F$ 的方法. 之所以特意介绍这个方法, 是因为它可以视为 word2vec 中算法框架的前身或者说基础.

n-gram语言模型也存在一些问题：

n-gram语言模型无法建模更远的关系，语料的不足使得无法训练更高阶的语言模型。大部分研究或工作都是使用 Trigram，就算使用高阶的模型，其统计到的概率可信度就大打折扣，还有一些比较小的问题采用 Bigram。
这种模型无法建模出词之间的相似度，有时候两个具有某种相似性的词，如果一个词经常出现在某段词之后，那么也许另一个词出现在这段词后面的概率也比较大。比如"白色的汽车"经常出现，那完全可以认为"白色的轿车"也可能经常出现。
训练语料里面有些 n元组没有出现过，其对应的条件概率就是 0，导致计算一整句话的概率为 0。解决这个问题有两种常用方法：
- 方法一为平滑法。最简单的方法是把每个 n元组的出现次数加 1，那么原来出现 k次的某个 n元组就会记为 k+1次，原来出现 0次的 n元组就会记为出现 1次。这种也称为 Laplace平滑。当然还有很多更复杂的其他平滑方法，其本质都是将模型变为贝叶斯模型，通过引入先验分布打破似然一统天下的局面。而引入先验方法的不同也就产生了很多不同的平滑方法。
- 方法二是回退法。有点像决策树中的后剪枝方法，即如果 n元的概率不到，那就往上回退一步，用 n-1元的概率乘上一个权重来模拟

### n-pos模型

n-gram的一种衍生模型。n-gram模型假定第 t个词出现概率条件依赖它前 N-1个词，而现实中很多词出现的概率是条件依赖于它前面词的语法功能的。n-pos模型就是基于这种假设的模型，它将词按照其语法功能进行分类，由这些词类决定下一个词出现的概率。这样的词类称为词性（Part-of-Speech，简称为 POS）。n-pos模型中的每个词的条件概率表示为

\[p(s)=p\left(w_1^T\right)=p\left(w_1, w_2, \ldots, w_T\right)=\prod_{t=1}^T p\left(w_t \mid c\left(w_{t-\mathrm{n}+1)}, c\left(w_{t-\mathrm{n}+2}\right), \ldots, c\left(w_{t-1}\right)\right)\right.\]

$c$ 为类别映射函数, 即把 $T$ 个词映射到 $k$ 个类别 $(1=<K<=T)$ 。实际上 $n-P o s$ 使用了一种聚类的思想, 使得原来 n-gram 中 $w_{t-n+1}, w_{t-n+2}, \ldots, w_{t-1}$ 中的可能为 $T^{N-1}$减少到 $c\left(w_{t-\mathrm{n}+1)}, c\left(w_{t-\mathrm{n}+2}\right), \ldots, c\left(w_{t-1}\right)\right.$ 的 $K^{N-1}$, 同时这种减少还采用了语义有意义的类别。

### 基于决策树的语言模型

上面提到的上下文无关语言模型、n-gram语言模型、n-pos语言模型等等.都可以以统计决策树的形式表示出来。而统计决策树中每个结点的决策规则是一个上下文相关的问题。

基于决策树的语言模型优点是：分布数不是预先固定好的，而是根据训练预料库中的实际情况确定，更为灵活。缺点是：构造统计决策树的问题很困难，且时空开销很大。

### 最大熵模型

基本思想是：对一个随机事件的概率分布进行预测时，在满足全部已知的条件下对未知的情况不做任何主观假设。从信息论的角度来说就是：在只掌握关于未知分布的部分知识时，应当选取符合这些知识但又能使得熵最大的概率分布。

### 自适应语言模型前面的模型概率分布都是预先从训练语料库中估算好的，属于静态语言模型。而自适应语言模型类似是 Online Learning的过程，即根据少量新数据动态调整模型，属于动态模型。在自然语言中，经常出现这样现象：某些在文本中通常很少出现的词，在某一局部文本中突然大量地出现。能够根据词在局部文本中出现的情况动态地调整语言模型中的概率分布数据的语言模型成为动态、自适应或者基于缓存的语言模型。通常的做法是将静态模型与动态模型通过参数融合到一起，这种混合模型可以有效地避免数据稀疏的问题。

### 神经概率语言模型

一种简单的神经概率语言模型结构如下图所示。主要包含四个层。分别是输入层、投影层、隐藏层和输出层，其中W、U分别为投影层与隐藏层以及隐藏层和输出层之间的权值矩阵，p和q分别为隐藏层和输出层上的偏置向量。

20220913145742-2022-09-13-14-57-46

对于语料 $\mathcal{C}$ 中的任意一个词 $w$, 将 Context $(w)$ 取为其前面的 $n-1$ 个词 (类似于 n-gram), 这样二元对 $(\operatorname{Context}(w), w)$ 就是一个训练样本了. 接下来, 讨论样本 $(\operatorname{Context}(w), w)$ 经过如图所示的神经网络时是如何参与运算的. 注意, 一旦语料 $\mathcal{C}$ 和词向量长度 $m$ 给定后, 投影层和输出层的规模就确定了, 前者为 $(n-1) m$, 后者为 $N=

\mathcal{D}

$ 即语料 $\mathcal{C}$ 的词汇量大小. 而隐藏层的规模 $n_h$ 是可调参数由用户指定。

为什么投影层的规模是 $(n-1) m$ 呢? 因为输入层包含 $\operatorname{Context}(w)$ 中 $n-1$ 个词的词向量, 而投影层的向量 $\mathrm{x}_w$ 是这样构造的: 将输入层的 $n-1$ 个词向量按顺序首尾相接地拼起来形成一个长向量, 其长度当然就是 $(n-1) m$ 了. 有了向量 $\mathbf{x}_w$, 接下来的计算过程就很平凡了, 具体为 $\left\{\begin{array}{l} \mathbf{z}_w=\tanh \left(W \mathbf{x}_w+\mathbf{p}\right), \\ \mathbf{y}_w=U \mathbf{z}_w+\mathbf{q} \end{array}\right.$ 其中 $\tanh$ 为双曲正切函数, 用来做隐藏层的激活函数, 上式中, $\tanh$ 作用在向量上表示它作用在向量的每一个分量上.

经过上述两步计算得到的 $\mathbf{y}w=\left(y{w, 1}, y_{w, 2}, \cdots, y_{w, N}\right)^{\top}$ 只是一个长度为 $N$ 的向量, 其分量不能表示概率. 如果想要 $\mathbf{y}w$ 的分量 $y{w, i}$ 表示当上下文为 Context $(w)$ 时下一个词恰为词典 $\mathcal{D}$ 中第 $i$ 个词的概率, 则还需要做一个 softmax 归一化, 归一化后, $p(w \mid \operatorname{Context}(w))$ 就可以表示为 $p(w \mid \text { Context }(w))=\frac{e^{y_{w, i_w}}}{\sum_{i=1}^N e^{y_{w, i}}},$

其中 $i_w$ 表示词 $w$ 在词典 $\mathcal{D}$ 中的索引. 公式(10)给出了概率 $p(w \mid \operatorname{Context}(w))$ 的函数表示, 即找到了上一小节中提到的函数 $F(w$, Context $(w), \theta)$, 那么其中待确定的参数 $\theta$ 有哪些呢? 总结起来, 包括两部分

词向量: $\mathbf{v}(w) \in \mathbb{R}^m, w \in \mathcal{D}$ 以及填充向量.
神经网络参数: $W \in \mathbb{R}^{n_h \times(n-1) m}, \mathrm{p} \in \mathbb{R}^{n_h} ; U \in \mathbb{R}^{N \times n_h}, \mathrm{q} \in \mathbb{R}^N$, 这些参数均通过训练算法得到. 值得一提的是, 通常的机器学习算法中, 输入都是已知的, 而在上述神经概率语言模型中, 输入 $\mathbf{v}(w)$ 也需要通过训练才能得到.

与n-gram相比，神经概率语言模型的优势在于：

词语之间的相似性可以通过词向量来体现
基于词向量的模型自带平滑化功能（由公式(10)）$p(w \mid$ Context $(w)) \in(0,1)$ 不会为零), 不再需要像 n-gram 那样进行额外处理了.
最后, 我们回过头来想想, 词向量在整个神经概率语言模型中扮演了什么角色呢? 训练时, 它是用来帮助构造目标函数的辅助参数, 训练完成后, 它也好像只是语言模型的一个副产品。

## 什么是word embeddings？

NLP相关任务中最常见的第一步是创建一个词表库并把每个词顺序编号。这实际就是词表示方法中的 One-hot Representation，这种方法把每个词顺序编号，每个词就是一个很长的向量，向量的维度等于词表大小，只有对应位置上的数字为 1，其他都为 0。当然在实际应用中，一般采用稀疏编码存储，主要采用词的编号。

这种表示方法一个最大的问题是无法捕捉词与词之间的相似度，就算是近义词也无法从词向量中看出任何关系。此外这种表示方法还容易发生维数灾难。

人类在读到一个词语时，很容易联想到相近的词，但是这些语言关联是经过数百万年进化磨练的相当复杂的神经学计算的结果，而我们的机器学习模型必须从头开始，没有预先建立对词义的理解。

由于我们的计算机、脚本和机器学习模型无法阅读和理解任何人类意义上的文本，所以处理文本数据是有问题的，词嵌入是解决这个问题的主要方法，并且如此普遍，以至于实际上在任何 NLP 项目中都假设它们的使用。

一种简单的词向量是one-hot repesentation。one-hot向量是只有一个1，其余均为0的稀疏向量，是一种将单词表示为实数值向量的快速简便的方法。但该方法具有一些问题：

由于两个向量之间正交均为0，所以无法计算向量之间的相似性问题；
向量的维数大小等于词汇量，参数量随着数据量的增加而增大。
每个词的嵌入/特征向量大多为零，许多机器学习模型无法很好地处理非常高维和稀疏的特征。

onehot编码的适用场景：

特征输入量相对较小
不希望输入有意义地相关
不希望输入共享模型参数
有很多可以学习的数据

onehot编码在分类数据预处理的情况下总是相关的，因为许多机器学习模型不能直接处理分类数据（如文本标签）。您仍然可以使用它们将多类标签向量转换为多个二进制类向量，或将少数分类特征向量转换为其二进制版本。

每个语义特征都可以看作是更广泛、更高维的语义空间中的单个维度。

嵌入解决的核心问题是泛化。

当网络最终看到相似的词时，那么它将采用与其相似的路径，而不是网络必须从头开始学习如何完全处理它。这意味着嵌入允许我们构建更通用的模型——而不是网络需要争先恐后地学习许多不同的方法来处理断开连接的输入，而是让相似的词"共享"参数和计算路径。

尤其是在没有大量训练数据的情况下，嵌入可以提高几乎所有 NLP 问题的泛化能力和性能。

另一种词向量是Distributed Representation,基本思想是：通过训练将某种语言中的每个词映射成一个固定长度的短向量，所有这些向量构成一个词向量空间，二每一个向量则可视为该空间中的一个点，在这个空间中引入距离，就可以根据词之间的距离来判断他们之间的（词法、语义上）的相似性，word2vec中采用的就是这种Distributed Representation的词向量。

## 什么是word2vec?

word2vec是创建word embeddings的一种方法
word embeddings是一个word的数字矩阵表示
除了word2vec，还有其他创建word embeddings的方法，比如fastText，GloVe,ELMO,BERET,GPT-2等

word2vec给出了两套框架，分别基于Hierarchical Softmax 和Negative Sampling来进行设计。

## 基于Hierarchical Softmax的模型 word2vec中有两个重要的模型，分别为CBOW模型和Skip-gram模型。两个模型都包含三层：输入层、投影层和输出层。前者是在已知当前词的上下文的前提下预测当前词，而后者则恰恰相反，是在已知当前词的前提下，预测其上下文。

### CBOW

网络结构

输入层: 包含 $\operatorname{Context}(w)$ 中 $2 c$ 个词的词向量 $\mathbf{v}\left(\right.$ Context $\left.(w)1\right), \mathbf{v}\left(\operatorname{Context}(w)_2\right), \cdots$, $\mathbf{v}\left(\right.$ Context $\left.(w){2 c}\right) \in \mathbb{R}^m$. 这里, $m$ 的含义同上表示词向量的长度.
投影层: 将输入层的 $2 c$ 个向量做求和累加, 即 $\mathbf{x}w=\sum{i=1}^{2 c} \mathbf{v}\left(\right.$ Context $\left.(w)_i\right) \in \mathbb{R}^m$.

输出层: 输出层对应一棵二叉树, 它是以语料中出现过的词当叶子结点, 以各词在语料中出现的次数当权值构造出来的 Huffman 树. 在这棵 Huffman 树中, 叶子结点共 $N(=$ $

\mathcal{D}

)$ 个, 分别对应词典 $\mathcal{D}$ 中的词, 非叶子结点 $N-1$ 个 (图中标成黄色的那些结点).

20220913164131-2022-09-13-16-41-34

与神经概率语言模型对比，CBOW模型的区别是：

从输入层到投影层，前者是通过拼接，后者则是累加求和
隐藏层，前者有隐藏层，后者无隐藏层
输出层，前者是线性结构，后者是树形结构

梯度计算

Hierarchical Softmax 是 wor2vec 中用于提高性能的一项关键技术. 为描述方便起见, 在具体介绍这个技术之前, 先引入若干相关记号. 考虑 Huffman 树中的某个叶子结点, 假设它对应词典 $\mathcal{D}$ 中的词 $w$, 记

$p^w$ : 从根结点出发到达 $w$ 对应叶子结点的路径.
$l^w$ : 路径 $p^w$ 中包含结点的个数.
$p_1^w, p_2^w, \cdots, p_{l^w}^w$ : 路径 $p^w$ 中的 $l^w$ 个结点, 其中 $p_1^w$ 表示根结点, $p_{l^w}^w$ 表示词 $w$ 对应的结点.
$d_2^w, d_3^w, \cdots, d_{l^w}^w \in{0,1}$ : 词 $w$ 的 Huffman 编码, 它由 $l^w-1$ 位编码构成, $d_j^w$ 表示路径 $p^w$ 中第 $j$ 个结点对应的编码 (根结点不对应编码).
$\theta_1^w, \theta_2^w, \cdots, \theta_{l^w-1}^w \in \mathbb{R}^m$ : 路径 $p^w$ 中非叶子结点对应的向量, $\theta_j^w$ 表示路径 $p^w$ 中第 $j$ 个非叶子结点对应的向量.

按理说, 我们要用的是词典 $\mathcal{D}$ 中每个词 ( 即 Huffman 树中所有叶子节点) 的向量, 为什么这里还要为 Huffman 树中每一个非叶子结点也定义一个同长的向量呢? 事实上, 它们只是算法中的辅助向量, 具体用途在下文中将会为大家解释清楚.

好了, 引入了这么一大堆抽象的记号, 接下来, 我们还是通过一个简单的例子把它们落到实处吧, 看下图,, 考虑词 $w=$ "足球" 的情形.

图中由 4 条红色边串起来的 5 个节点就构成路径 $p^w$, 其长度 $l^w=5 \cdot p_1^w, p_2^w, p_3^w, p_4^w$, $p_5^w$ 为路径 $p^w$ 上的 5 个结点, 其中 $p_1^w$ 对应根结点. $d_2^w, d_3^w, d_4^w, d_5^w$ 分别为 $1,0,0,1$, 即 "足球" 的 Huffman 编码为 1001. 此外, $\theta_1^w, \theta_2^w, \theta_3^w, \theta_4^w$ 分别表示路径 $p^w$ 上 4 个非叶子结点对应的向量.

20220913170306-2022-09-13-17-03-09

如何定义条件概率函数 $p(w \mid \operatorname{Contex}(w))$ 呢? 更具体地说, 就是如何利用向量 $\mathbf{x}_w \in \mathbb{R}^m$ 以及 Huffman 树来定义函数 $p(w \mid$ Contex $(w))$ 呢?

以图中词 $w=$ "足球" 为例. 从根结点出发到达 "足球" 这个叶子节点, 中间共经历了 4 次分支 (每条红色的边对应一次分支), 而每一次分支都可视为进行了一次二分类.

既然是从二分类的角度来考虑问题, 那么对于每一个非叶子结点, 就需要为其左右孩子结点指定一个类别, 即哪个是正类 (标签为 1 ), 哪个是负类 (标签为 0). 碰巧, 除根结点以外, 树中每个结点都对应了一个取值为 0 或 1 的 Huffman 编码.word2vec将编码为 1 的结点定义为负类, 而将编码为 0 的结点定义为正类. 即 $\operatorname{Label}\left(p_i^w\right)=1-d_i^w, i=2,3, \cdots, l^w .$

简言之就是, 将一个结点进行分类时, 分到左边就是负类, 分到右边就是正类. 易知, 一个结点被分为正类的概率是 $\sigma\left(\mathbf{x}_w^{\top} \theta\right)=\frac{1}{1+e^{-\mathbf{x}_w^{\top} \theta}},$ 被分为负类的概率当然就等于 $1-\sigma\left(\mathbf{x}_w^{\top} \theta\right)$

对于从根结点出发到达 "足球" 这个叶子节点所经历的 4 次二分类, 将每次分类结果的概率写出来就是

第 1 次: $p\left(d_2^w \mid \mathbf{x}_w, \theta_1^w\right)=1-\sigma\left(\mathbf{x}_w^{\top} \theta_1^w\right)$;
第 2 次: $p\left(d_3^w \mid \mathbf{x}_w, \theta_2^w\right)=\sigma\left(\mathbf{x}_w^{\top} \theta_2^w\right)$;
第 3 次: $\left.p\left(d_4^w \mid \mathbf{x}_w, \theta_3^w\right)=\sigma\left(\mathbf{x}_w^{\top} \theta_3^w\right)\right)$;
第 4 次: $p\left(d_5^w \mid \mathbf{x}_w, \theta_4^w\right)=1-\sigma\left(\mathbf{x}_w^{\top} \theta_4^w\right)$, 但是, 我们要求的是 $p$ (足球 $\mid$ Contex (足球)), 它跟这 4 个概率值有什么关系呢? 关系就是 $p(\text { 足球 } \mid \text { Contex }(\text { 足球 }))=\prod_{j=2}^5 p\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right) .$ 至此, 通过 $w=$ "足球" 的小例子, Hierarchical Softmax 的基本思想其实就已经介绍完了. 小结一下: 对于词典 $\mathcal{D}$ 中的任意词 $w$, Huffman 树中必存在一条从根结点到词 $w$ 对应结点的路径 $p^w$ (且这条路径是唯一的). 路径 $p^w$ 上存在 $l^w-1$ 个分支, 将每个分支看做一次二分类, 每一次分类就产生一个概率, 将这些概率乘起来, 就是所需的 $p(w \mid \operatorname{Context}(w))$.

条件概率 $p(w \mid \operatorname{Context}(w))$ 的一般公式可写为 $p(w \mid \operatorname{Context}(w))=\prod_{j=2}^{l^w} p\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right),$ 其中 $p\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right)= \begin{cases}\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right), & d_j^w=0 ; \\ 1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right), & d_j^w=1,\end{cases}$ 或者写成整体表达式 $p\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right)=\left[\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]^{1-d_j^b} \cdot\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]^d .$

代入似然函数可得：

$\begin{aligned} \mathcal{L} &=\sum_{w \in \mathcal{C}} \log \prod_{j=2}^{l^w}\left\{\left[\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]^{1-d_j^w} \cdot\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]^{d_j^w}\right\} \\ &=\sum_{w \in \mathcal{C}} \sum_{j=2}^{l^w}\left\{\left(1-d_j^w\right) \cdot \log \left[\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]+d_j^w \cdot \log \left[1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]\right\}, \end{aligned}$ 为下面梯度推导方便起见, 将上式中双重求和符号下花括号里的内容简记为 $\mathcal{L}(w, j)$, 即 $\mathcal{L}(w, j)=\left(1-d_j^w\right) \cdot \log \left[\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]+d_j^w \cdot \log \left[1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right] .$ 至此, 已经推导出对数似然函数 , 这就是CBOW 模型的目标函数, 接下来讨论它的优化, 即如何将这个函数最大化. word2vec 里面采用的是随机梯度上升法.

随机梯度上升法的做法是: 每取一个样本 $(\operatorname{Context}(w), w)$, 就对目标函数中的所有 (相关) 参数做一次刷新. 观察目标函数 $\mathcal{L}$ 易知, 该函数中的参数包括向量 $\mathrm{x}w, \theta{j-1}^w, w \in \mathcal{C}$, $j=2, \cdots, l^w$. 为此, 先给出函数 $\mathcal{L}(w, j)$ 关于这些向量的梯度.

首先考虑 $\mathcal{L}(w, j)$ 关于 $\theta_{j-1}^w$ 的梯度计算. $\begin{aligned} \frac{\partial \mathcal{L}(w, j)}{\partial \theta_{j-1}^w} &=\frac{\partial}{\partial \theta_{j-1}^w}\left\{\left(1-d_j^w\right) \cdot \log \left[\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]+d_j^w \cdot \log \left[1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]\right\} \\ &=\left(1-d_j^w\right)\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right] \mathbf{x}_w-d_j^w \sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right) \mathbf{x}_w \quad \\ &\left.=\left\{\left(1-d_j^w\right)\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right]-d_j^w \sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right\} \mathbf{x}_w \quad \right) \\ &=\left[1-d_j^w-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right] \mathbf{x}_w . \end{aligned}$ 于是, $\theta_{j-1}^w$ 的更新公式可写为 $\theta_{j-1}^w:=\theta_{j-1}^w+\eta\left[1-d_j^w-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right] \mathbf{x}_w,$ 其中 $\eta$ 表示学习率, 下同. 接下来考虑 $\mathcal{L}(w, j)$ 关于 $\mathbf{x}w$ 的梯度. 观察可发现, $\mathcal{L}(w, j)$ 中关于变量 $\mathbf{x}_w$ 和 $\theta{j-1}^w$ 是对称的 (即两者可交换位置), 因此, 相应的梯度 $\frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}w}$ 也只需在 $\frac{\partial \mathcal{L}(w, j)}{\partial \theta{j-1}^o}$ 的基础上对这两个向量交换位置就可以了, 即 $\frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w}=\left[1-d_j^w-\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right)\right] \theta_{j-1}^w .$ 到这里, 细心的读者可能已经看出问题来了: 我们的最终目的是要求词典 $\mathcal{D}$ 中每个词的词向量, 而这里的 $\mathbf{x}_w$ 表示的是 Context $(w)$ 中各词词向量的累加. 那么, 如何利用 $\frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w}$ 来对 $\mathbf{v}(\widetilde{w}), \widetilde{w} \in \operatorname{Context}(w)$ 进行更新呢? word2vec 中的做法很简单, 直接取 $\mathbf{v}(\widetilde{w}):=\mathbf{v}(\widetilde{w})+\eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w}, \quad \widetilde{w} \in \operatorname{Context}(w)$

即把 $\sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w}$ 贡献到 Context $(w)$ 中每一个词的词向量上. 这个应该很好理解, 既然 $\mathbf{x}_w$ 本身就是 Context $(w)$ 中各词词向量的累加, 求完梯度后当然也应该将其贡献到每个分量上去.

下面以样本 $(\operatorname{Context}(w), w)$ 为例, 给出 $\mathrm{CBOW}$ 模型中采用随机梯度上升法更新各参数的伪代码.

$\mathrm{e}=0$.
$\mathbf{x}w=\sum{u \in \operatorname{Context}(w)} \mathrm{v}(u)$.
$\mathrm{FOR} j=2: l^w \quad \mathrm{DO}$ { $\begin{aligned} &3.1 \quad q=\sigma\left(\mathbf{x}_w^{\top} \theta_{j-1}^w\right) \\ &3.2 g=\eta\left(1-d_j^w-q\right) \\ &3.3 \quad \mathrm{e}:=\mathbf{e}+g \theta_{j-1}^w \\ &3.4 \quad \theta_{j-1}^w:=\theta_{j-1}^w+g \mathbf{x}_w \end{aligned}$ }
$\operatorname{FOR} u \in \operatorname{Context}(w) \mathrm{DO}$ {

\[\mathbf{v}(u):=\mathbf{v}(u)+\mathbf{e}\]

注意, 步 $3.3$ 和步 $3.4$ 不能交换次序, 即 $\theta_{j-1}^w$ 应等贡献到 $\mathbf{e}$ 后再做更新.

### Skip-gram

网络结构

输入层: 只含当前样本的中心词 $w$ 的词向量 $\mathbf{v}(w) \in \mathbb{R}^m$.
投影层: 这是个恒等投影, 把 $\mathbf{v}(w)$ 投影到 $\mathbf{v}(w)$. 因此, 这个投影层其实是多余的, 这里之所以保留投影层主要是方便和 CBOW 模型的网络结构做对比.
输出层：与CBOW模型一样，也是一棵Huffman树。

20220913172724-2022-09-13-17-27-28

模型中将其定义为 $p(\text { Context }(w) \mid w)=\prod_{u \in \operatorname{Context}(w)} p(u \mid w),$ 上式中的 $p(u \mid w)$ 可按照上小节介绍的 Hierarchical Softmax 思想, 类似地写为 $p(u \mid w)=\prod_{j=2}^{l^u} p\left(d_j^u \mid \mathbf{v}(w), \theta_{j-1}^u\right),$ 其中 $p\left(d_j^u \mid \mathbf{v}(w), \theta_{j-1}^u\right)=\left[\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]^{1-d_j^u} \cdot\left[1-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]^{d_j^u} .$ 依次代回, 可得对数似然函数的具体表达式 $\begin{aligned} \mathcal{L} &=\sum_{w \in \mathcal{C}} \log \prod_{u \in \operatorname{Context}(w)} \prod_{j=2}^{l^u}\left\{\left[\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]^{1-d_j^u} \cdot\left[1-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]^{d_j^u}\right\} \\ &=\sum_{w \in \mathcal{C}} \sum_{u \in \operatorname{Context}(w)} \sum_{j=2}^{l^u}\left\{\left(1-d_j^u\right) \cdot \log \left[\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]+d_j^u \cdot \log \left[1-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]\right\} . \end{aligned}$ 同样, 为下面梯度推导方便起见, 将三重求和符号下花括号里的内容简记为 $\mathcal{L}(w, u, j)$, 即 $\mathcal{L}(w, u, j)=\left(1-d_j^u\right) \cdot \log \left[\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]+d_j^u \cdot \log \left[1-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right] .$ 至此, 已经推导出对数似然函数的表达式 , 这就是 Skip-gram 模型的目标函数. 接下来同样利用随机梯度上升法对其进行优化, 关键是要给出两类梯度.

首先考虑 $\mathcal{L}(w, u, j)$ 关于 $\theta_{j-1}^u$ 的梯度计算 (与 $\mathrm{CBOW}$ 模型对应部分的推导完全类似). $\begin{aligned} \frac{\partial \mathcal{L}(w, u, j)}{\partial \theta_{j-1}^u} &=\frac{\partial}{\partial \theta_{j-1}^u}\left\{\left(1-d_j^u\right) \cdot \log \left[\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]+d_j^u \cdot \log \left[1-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right]\right\} \\ &=\left(1-d_j^u\right)\left[1-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right] \mathbf{v}(w)-d_j^u \sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right) \mathbf{v}(w) \quad \text { } \\ &=\left\{\left(1-d_j^u\right)\left[1-\sigma\left(\mathbf{v}(w)^{\top} \mathbf{v}_{j-1}^u\right)\right]-d_j^u \sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right\} \mathbf{v}(w) \text { } \\ &=\left[1-d_j^u-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right] \mathbf{v}(w) . \end{aligned}$ 于是, $\theta_{j-1}^u$ 的更新公式可写为 $\theta_{j-1}^u:=\theta_{j-1}^u+\eta\left[1-d_j^u-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right] \mathbf{v}(w) .$ 接下来考虑 $\mathcal{L}(w, u, j)$ 关于 $\mathbf{v}(w)$ 的梯度. 同样利用 $\mathcal{L}(w, u, j)$ 中 $\mathbf{v}(w)$ 和 $\theta_{j-1}^w$ 的对称性, 有 $\frac{\partial \mathcal{L}(w, u, j)}{\partial \mathbf{v}(w)}=\left[1-d_j^u-\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)\right] \theta_{j-1}^u .$

于是, $\mathbf{v}(w)$ 的更新公式可写为 $\mathbf{v}(w):=\mathbf{v}(w)+\eta \sum_{u \in \operatorname{Context}(w)} \sum_{j=2}^{l^u} \frac{\partial \mathcal{L}(w, u, j)}{\partial \mathbf{v}(w)} .$ 下面以样本 $(w, \operatorname{Context}(w))$ 为例, 给出 Skip-gram 模型中采用随机梯度上升法更新各参数的伪代码.

FOR $u \in \operatorname{Context}(w) \quad \mathrm{DO}$ { $\mathrm{e}=0$ {

$q=\sigma\left(\mathbf{v}(w)^{\top} \theta_{j-1}^u\right)$
$g=\eta\left(1-d_j^u-q\right)$
$\mathbf{e}:=\mathbf{e}+g \theta_{j-1}^u$
$\theta_{j-1}^u:=\theta_{j-1}^u+g \mathbf{v}(w)$ } $\mathbf{v}(w):=\mathbf{v}(w)+\mathbf{e}$ }

同样，循环体内的3和4不能交换次序。

## 基于Negative Sampling的模型

Negative Sampling是NCE（Noise Contrastive Estimation）的一个简化版本，目的是用来提高训练速度并改善所得词向量的质量。与Hierarchical Softmax相比，NEG不再使用复杂的Huffman树，而是利用相对简单的随机负采样，大幅提高性能。

NCE的本质是利用已知的概率密度函数来估计末知的概率密度函数. 简帝来说, 假设夫知的概率密度函数为 $X$, 已知的概率密度为 $Y$, 如果得到了 $X$ 和 $Y$ 的关系, 那么 $X$ 也就可以求出来了

### CBOW

在 CBOW 模型中, 已知词 $w$ 的上下文 $\operatorname{Context}(w)$, 需要预测 $w$, 因此, 对于给定的 $\operatorname{Context}(w)$, 诃 $w$ 就是一个正样本, 其它词就是负样本了. 负样本那么多, 该如何选取呢?

假定现在已经选好了一个关于 $\operatorname{Context}(w)$ 的负样本子集 $N E G(w) \neq \emptyset$. 且对 $\forall \widetilde{w} \in \mathcal{D}$, 定义

$L^w(\widetilde{w})= \begin{cases}1, & \widetilde{w}=w \\ 0, & \widetilde{w} \neq w\end{cases}$ 表示词 $\widetilde{w}$ 的标签, 即正样本的标签为 1 , 负样本的标签为 0 .

对于一个给定的正样本 $(\operatorname{Context}(w), w)$, 我们希望最大化 $g(w)=\prod_{u \in\{w\} \cup N E G(w)} p(u \mid \operatorname{Context}(w))$ 其中 $p(u \mid \text { Context }(w))= \begin{cases}\sigma\left(\mathbf{x}_w^{\top} \theta^u\right), \\ 1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right), & L^w(u)=1\end{cases}$ 或者写成整体表达式 $p(u \mid \operatorname{Context}(w))=\left[\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]^{L^w(u)} \cdot\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]^{1-L^w(u)}$

这里 $\mathbf{x}_w$ 仍表示 Context $(w)$ 中各个的词向量之和,而 $\theta^u \in \mathbb{R}^m$ 表示词 $u$ 对应的一个 (辅助) 向量, 为待训练参数. 为什么要最大化 $g(w)$ 呢? 让我们先来看看 $g(w)$ 的表达式,代入后有

$g(w)=\sigma\left(\mathbf{x}_w^{\top} \theta^w\right) \prod_{u \in N E G(w)}\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]$ 其中 $\sigma\left(\mathbf{x}_w^{\top} \theta^w\right)$ 表示当上下文为 $\operatorname{Context}(w)$ 时, 预测中心词为 $w$ 的概率, 而 $\sigma\left(\mathbf{x}_w^{\top} \theta^u\right), u \in$ $N E G(w)$ 则表示当上下文为 $\operatorname{Context}(w)$ 时, 预测中心词为 $u$ 的概率 (这里可看成一个二分类问题, 具体可参见预备知识中的逻辑回归). 从形式上看, 最大化 $g(w)$, 相当于最大化 $\sigma\left(\mathrm{x}_w^{\top} \theta^w\right)$, 同时最小化所有的 $\sigma\left(\mathrm{x}_w^{\top} \theta^u\right), u \in N E G(w)$. 这不正是我们希望的田? 增大正样本的概率同时降低负样本的概率. 于是, 对于一个给定的语料库 $\mathcal{C}$, 函数 $G=\prod_{w \in \mathcal{C}} g(w)$ 就可以作为整体优化的目标. 当然, 为计算方便, 对 $G$ 取对数, 最终的目标函数 (为和前面章节统一起见, 这里仍将其记为 $\mathcal{L}$ ) 就是 $\begin{aligned} \mathcal{L} &=\log G=\log \prod_{w \in \mathcal{C}} g(w)=\sum_{w \in \mathcal{C}} \log g(w) \\ &=\sum_{w \in \mathcal{C}} \log \prod_{u \in\{w\} \cup N E G(w)}\left\{\left[\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]^{L^w(u)} \cdot\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]^{1-L^w(u)}\right\} \\ &=\sum_{w \in \mathcal{C}} \sum_{u \in\{w\} \cup N E G(w)}\left\{L^w(u) \cdot \log \left[\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]+\left[1-L^w(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]\right\} \end{aligned}$ 为下面梯度推导方便起见, 将上式中双重求和符号下花括号里的内容简记为 $\mathcal{L}(w, u)$, 即 $\mathcal{L}(w, u)=L^w(u) \cdot \log \left[\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]+\left[1-L^w(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right] .$ 接下来利用随机梯度上升法对公式进行优化, 关键是要给出 $\mathcal{L}$ 的两类梯度. 首先考虑 $\mathcal{L}(w, u)$ 关于 $\theta^u$ 的梯度计算. $\begin{aligned} \frac{\partial \mathcal{L}(w, u)}{\partial \theta^u} &=\frac{\partial}{\partial \theta^u}\left\{L^w(u) \cdot \log \left[\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]+\left[1-L^w(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]\right\} \\ &=L^w(u)\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right] \mathbf{x}_w-\left[1-L^w(u)\right] \sigma\left(\mathbf{x}_w^{\top} \theta^u\right) \mathbf{x}_w \quad \text { } \\ &=\left\{L^w(u)\left[1-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right]-\left[1-L^w(u)\right] \sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right\} \mathbf{x}_w \quad \text { } \\ &=\left[L^w(u)-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right] \mathbf{x}_w . \end{aligned}$ 于是, $\theta^u$ 的更新公式可写为 $\theta^u:=\theta^u+\eta\left[L^w(u)-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right] \mathbf{x}_w .$ 接下来考虑 $\mathcal{L}(w, u)$ 关于 $\mathrm{x}_w$ 的梯度. 同样利用 $\mathcal{L}(w, u)$ 中 $\mathrm{x}_w$ 和 $\theta^u$ 的对称性, 有 $\frac{\partial \mathcal{L}(w, u)}{\partial \mathbf{x}_w}=\left[L^w(u)-\sigma\left(\mathbf{x}_w^{\top} \theta^u\right)\right] \theta^u .$ 于是, 利用 $\frac{\partial \mathcal{L}(w, u)}{\partial x_w}$, 可得 $\mathbf{v}(\widetilde{w}), \widetilde{w} \in \operatorname{Context}(w)$ 的更新公式为 (至于为什么可以这么做请参考上一节基丁 Hierarchical Softmax 的 CBOW 模型对应部分的解释) $\mathbf{v}(\widetilde{w}):=\mathbf{v}(\widetilde{w})+\eta \sum_{u \in\{w\} \cup N E G(w)} \frac{\partial \mathcal{L}(w, u)}{\partial \mathbf{x}_w}, \widetilde{w} \in \operatorname{Context}(w) .$ 下面以样本 $(\operatorname{Context}(w), w)$ 为例, 给出基于 Negative Sampling 的 CBOW 模型中采用随机梯度上升法更新各参数的伪代码.

$1 . \mathrm{e}=0$

$\mathbf{x}w=\sum{u \in \operatorname{Context}(w)} \mathrm{v}(u)$.
FOR $u={w} \cup N E G(w)$ DO { $\begin{array}{ll} 3.1 & q=\sigma\left(\mathbf{x}_w^{\top} \theta^u\right) \\ 3.2 & g=\eta\left(L^w(u)-q\right) \\ 3.3 & \mathrm{e}:=\mathbf{e}+g \theta^u \\ 3.4 & \theta^u:=\theta^u+g \mathbf{x}_w \end{array}$ }
FOR $u \in \operatorname{Context}(w)$ DO {

\[\mathbf{v}(u):=\mathbf{v}(u)+\mathbf{e}\]

FOR $\widetilde{w}=\operatorname{Context}(w) \quad \mathrm{DO}$ { $e=0 .$ FOR $u={w} \cup N E G^{\tilde{w}}(w)$ DO { $\begin{aligned} q &=\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right) \\ g &=\eta\left(L^w(u)-q\right) \\ \mathbf{e} &:=\mathbf{e}+g \theta^u \\ \theta^u &:=\theta^u+g \mathbf{v}(\widetilde{w}) \\ \} & \\ \mathbf{v}(\widetilde{w}) &:=\mathbf{v}(\widetilde{w})+\mathbf{e} \end{aligned}$

注意, 步 $3.3$ 和步 $3.4$ 不能交换次序, 即 $\theta^u$ 要等贡献到 $\mathrm{e}$ 后才更新.

### Skip-gram

本小节介绍基于 Negative Sampling 的 Skip-gram 模型. 它和上小节介绍的 CBOW 模型所用的思想是一样的, 因此, 这里我们直接从目标函数出发, 且沿用之前的记号. 对于一个给定的样本 $(w, \operatorname{Context}(w))$, 我们希望最大化 $g(w)=\prod_{\tilde{w} \in \operatorname{Context}(w)} \prod_{u \in\{w\} \cup N E G^{\bar{w}}(w)} p(u \mid \widetilde{w}),$ 其中 $p(u \mid \widetilde{w})= \begin{cases}\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right), & L^w(u)=1 \\ 1-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right), & L^w(u)=0\end{cases}$ 或者写成整体表达式 $p(u \mid \widetilde{w})=\left[\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right]^{L^w(u)} \cdot\left[1-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right]^{1-L^w(u)},$ 这里 $N E G^{\tilde{w}}(w)$ 表示处理词 $\widetilde{w}$ 时生成的负样本子集. 于是, 对于一个给定的语料库 $\mathcal{C}$, 函数 $G=\prod_{w \in \mathcal{C}} g(w)$ 就可以作为整体优化的目标. 同样, 我们取 $G$ 的对数, 最终的目标函数就昆 $\mathcal{L}=\log G=\log \prod_{w \in \mathcal{C}} g(w)=\sum_{w \in \mathcal{C}} \log g(w)$ $=\sum_{w \in C} \sum_{\tilde{w} \in \operatorname{Context}(w)} \sum_{u \in{w} \cup N E G^*(w)}$ $\left{L^w(u) \cdot \log \left[\sigma\left(\mathbf{v}(\tilde{w})^{\top} \theta^u\right)\right]+\left[1-L^w(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right]\right} .$ 为下面棁度推导方㑔起见. 将三重求和符号下花括号里的内容简记为 $\mathcal{L}(w, \widetilde{w}, u)$, 即 $\mathcal{L}(w, \widetilde{w}, u)=L^w(u) \cdot \log \left[\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right]+\left[1-L^w(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right] .$ 接下来利用随机梯度上升法对公式进行优化, 关铤是要给出 $\mathcal{L}$ 的两类梯度. 首先考慮 $\mathcal{L}(w, \widetilde{w}, u)$ 关于 $\theta^u$ 的梯度计筫. $\begin{aligned} & \frac{\partial \mathcal{L}(w, \tilde{w}, u)}{\partial \theta^u} \\ =& \frac{\partial}{\partial \theta^u}\left\{L^w(u) \cdot \log \left[\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right]+\left[1-L^w(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right]\right\} \\ =& L^w(u)\left[1-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right] \mathbf{v}(\widetilde{w})-\left[1-L^w(u)\right] \sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right) \mathbf{v}(\widetilde{w}) \quad(\text { }\\ =&\left\{L^w(u)\left[1-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right]-\left[1-L^w(u)\right] \sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right\} \mathbf{v}(\widetilde{w}) \quad(\text { ) }\\ =& {\left[L^w(u)-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right] \mathbf{v}(\widetilde{w}) . } \end{aligned}$ 于是, $\mathbf{v}^{\mathrm{u}}$ 更新公式可写为 $\mathbf{v}^u:=\mathbf{v}^u+\eta\left[L^w(u)-\sigma\left(\mathbf{v}(\widetilde{w})^{\top} \theta^u\right)\right] \mathbf{v}(\widetilde{w}) .$ $\frac{\partial \mathcal{L}(w, \tilde{w}, u)}{\partial \mathbf{v}(\widetilde{w})}=\left[L^w(u)-\sigma\left(\mathbf{v}(\tilde{w})^{\top} \theta^u\right)\right] \theta^u,$ 于是, $\mathbf{v}(\widetilde{w})$ 的更新公式可写为 $\mathbf{v}(\tilde{w}):=\mathbf{v}(\widetilde{w})+\eta \sum_{u \in\{w) \cup N E G^{\star}(w)} \frac{\partial \mathcal{L}(w, \tilde{w}, u)}{\partial \mathbf{v}(\tilde{w})} .$ 下面以样本 $(w, \operatorname{Context}(w))$ 为例, 给出基于 Negative Sampling 的 Skip-gram 模型中采用随机梯度上升法荲新务参数的伪代码.

步骤3.3和3.4不能交换次序。

## 负采样算法

顾名思义, 在基于 Negative Sampling 的 CBOW 和 Skip-gram 模型中, 负采样是个很重要的环节, 对于一个给定的词 $w$, 如何生成 $N E G(w)$ 呢? 词典 $\mathcal{D}$ 中的词在语料 $\mathcal{C}$ 中出地的次数有高有低, 对子那些高频词, 被选为负样本的概率就应该比较大, 反之, 对于那些低频词, 其被选中的概率就应该比较小. 这就是我们对采样过程的一个大致要求, 本质上就是一个带权采样问题。下面先用一段通俗的描述来帮助读者理解带权采样的机理. 设词典 $\mathcal{D}$ 中的每一个词 $w$ 对应一个线段 $l(w)$, 长度为 $\operatorname{len}(w)=\frac{\operatorname{counter}(w)}{\sum_{u \in \mathcal{D}} \operatorname{counter}(u)},$ 这里 counter( .) 表示一个词在语料 $\mathcal{C}$ 中出现的次数 (分母中的求和项用来做归一化). 现在将这些线段首尾相连地拼接在一起, 形成一个长度为 1 的单位线段. 如果随机地往这个单位线段上打点, 则其中长度越长的线段 (对应高频词) 被打中的概率就越大. 接下来再谈谈 word2vec 中的具体做法. 记 $l_0=0, l_k=\sum_{j=1}^k \operatorname{len}\left(w_j\right), k=1,2, \cdots, N$, 这里 $w_j$ 表示词典 $\mathcal{D}$ 中第 $j$ 个词, 则以 $\left{l_j\right}{j=0}^N$ 为剖分节点可得到区间 $[0,1]$ 上的一个非等距剖分, $I_i=\left(l{i-1}, l_i\right], i=1,2, \cdots, N$ 为其 $N$ 个剖分区间. 进一步引入区间 $[0,1]$ 上的一个等距离剖分, 剖分节点为 $\left{m_j\right}_{j=0}^M$, 其中 $M»N$, 具体见给出的示意图.

20220913204601-2022-09-13-20-46-05

将内部剖分节点 $\left{m_j\right}{j=1}^{M-1}$ 投影到非等距剖分上, 如图 13 中的红色虚线所示, 则可建立 $\left{m_j\right}{j=1}^{M-1}$ 与区间 $\left{I_j\right}{j=1}^N$ (或者说 $\left{w_j\right}{j=1}^N$ ) 的映射关系 Table $(i)=w_k$, where $m_i \in I_k, \quad i=1,2, \cdots, M-1$. 有了这个映射, 采样就简单了: 每次生成一个 $[1, M-1]$ 间的随机整数 $r, \operatorname{Table}(r)$ 就是一个样本. 当然, 这里还有一个细节, 当对 $w_i$ 进行负采样时, 如果碰巧选到 $w_i$ 自己怎么办? 那就跳过去呗:-), 代码中也是这么处理的. 值得一提的是, word2vec 源码中为词典 $\mathcal{D}$ 中的词设置权值时, 不是直接用 counter $(w)$, 而是对其作了 $\alpha$ 次幂, 其中 $\alpha=0.75$, 即变成了 $\operatorname{len}(w)=\frac{[\operatorname{counter}(w)]^{0.75}}{\sum_{u \in \mathcal{D}}[\operatorname{counter}(u)]^{0.75}}.$ 此外, 代码中取 $M=10^8$ (对应源码中变量 table_size), 而映射 (5.13) 则是通过一个名为 InitUnigramTable 的函数来完成.

## word2vec代码

word2vec-pytorch word2vec-tensorflow gensim-word2vec

## 拓展：

### 词向量三部曲：

word2vec
doc2vec
fastText

### word enbedding的方法

oneHot
word2vec
fastText
GloVe
ELMO
BERET
GPT-2

## 参考链接

Word2vec的前世今生

统计语言模型

上下文无关模型（Context=NULL）