Notes on the Concept of Matrix Derivative

Paper information

Title: On the concept of matrix derivative
Author: Jan R. Magnus
Journal: Journal of Multivariate Analysis, 2010, 101(9): 2200-2206
DOI: https://doi.org/10.1016/j.jmva.2010.05.005

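Core question

The paper addresses one focused question: how should the concept of a vector derivative be generalized to a matrix derivative?

The author presents two definitions:

Broad definition
Narrow definition

He then compares the two and comes down clearly in favor of the narrow definition.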

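Brief summary

The author argues that a matrix derivative should be more than "all the partial derivatives arranged into some big matrix". What matters is that the derivative object keeps the structural meaning of an ordinary derivative, namely:

Each row should correspond to one function
Each column should correspond to one variable

If the partial derivatives are merely rearranged at will, no information is lost, but what the derivative actually is becomes blurred. The author therefore regards the narrow definition as more reasonable than the broad definition, and also more convenient to use.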

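Main thesis

Broad definition: only requires that all partial derivatives be collected; it does not prescribe how they are arranged.
Narrow definition: requires the arrangement to preserve the structure of a derivative. Each row holds the partial derivatives of one function with respect to all variables, and each column holds the partial derivatives of all functions with respect to one variable.
Author's view: the narrow definition is closer to the standard understanding of the Jacobian and the Hessian, and is therefore more natural.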
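
To make the contrast concrete, here is a minimal NumPy sketch for the identity function $\mathbf F(\mathbf X)=\mathbf X$. The block layout called `broad` below is only one possible arrangement that satisfies the broad definition; it is an assumption chosen for illustration and is not claimed to be the exact layout analyzed in the paper. The helper `unit` and the sizes `m`, `n` are likewise hypothetical.

```python
import numpy as np

m, n = 2, 3  # X and F(X) = X are both m x n (sizes chosen for illustration)

def unit(i, j):
    """E_ij: the m x n matrix with a single 1 in position (i, j)."""
    E = np.zeros((m, n))
    E[i, j] = 1.0
    return E

# Narrow-style layout: Jacobian of vec(F) with respect to vec(X).
# Column k holds vec(dF/dx_k), with x_k running through vec(X) column by column.
narrow = np.column_stack(
    [unit(i, j).reshape(-1, order="F") for j in range(n) for i in range(m)]
)

# One broad-style layout (assumed): an m x n grid of blocks,
# where block (i, j) is the m x n matrix dF/dx_ij = E_ij.
broad = np.block([[unit(i, j) for j in range(n)] for i in range(m)])

print(narrow.shape, np.linalg.matrix_rank(narrow))  # (6, 6) 6
print(broad.shape, np.linalg.matrix_rank(broad))    # (4, 9) 1
```

Both arrays contain the same partial derivatives, but only the narrow layout gives the identity matrix for the identity function. The block layout has rank 1, so it no longer behaves like the derivative of an identity map.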

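General form of the matrix-to-matrix derivative

Suppose we have a matrix $\mathbf X \in \mathbb R^{p \times q}$ and a matrix function $\mathbf A(\mathbf X) \in \mathbb R^{m \times n}$. Under the narrow definition, the most natural way to write the derivative of $\mathbf A$ with respect to $\mathbf X$ is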

\frac{\partial \operatorname{vec}(\mathbf A)}{\partial \operatorname{vec}(\mathbf X)^T}.

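The advantage is that this object is simply the Jacobian of a vector-to-vector derivative, of size $mn \times pq$.

Remark. Writing the denominator as $\operatorname{vec}(\mathbf X)^T$ means that the entries of $\mathbf X$ are first stacked into $\operatorname{vec}(\mathbf X)$ and then laid out along the columns, so that the partial derivatives are organized exactly as in a standard Jacobian.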
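
The layout can be checked numerically. The sketch below builds the Jacobian of $\operatorname{vec}(\mathbf A)$ with respect to $\operatorname{vec}(\mathbf X)$ column by column using central differences. The helper names `vec` and `numerical_jacobian`, the step size `h`, and the test function $\mathbf A(\mathbf X)=\mathbf X\mathbf X^T$ are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return np.asarray(M).reshape(-1, order="F")

def numerical_jacobian(func, X, h=1e-6):
    """Central-difference estimate of d vec(func(X)) / d vec(X)^T."""
    p, q = X.shape
    cols = []
    for k in range(p * q):
        step = np.zeros(p * q)
        step[k] = h
        dX = step.reshape(p, q, order="F")  # perturb only the k-th entry of vec(X)
        cols.append((vec(func(X + dX)) - vec(func(X - dX))) / (2 * h))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2))                # p = 3, q = 2
J = numerical_jacobian(lambda M: M @ M.T, X)   # A(X) = X X^T is 3 x 3

# One row per entry of vec(A), one column per entry of vec(X).
print(J.shape)  # (9, 6)
```

Row $i$ of `J` collects the partial derivatives of the $i$-th entry of $\operatorname{vec}(\mathbf A)$, and column $k$ collects the derivatives with respect to the $k$-th entry of $\operatorname{vec}(\mathbf X)$, which is the row/column reading required by the narrow definition.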

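Vectorization (vec) formulas

Under the narrow definition, matrices must be vectorized before differentiating. Throughout these notes, vec denotes column-wise stacking. Two general vectorization identities are used below.

If $\mathbf A \in \mathbb{R}^{m\times n}$ and $\mathbf B \in \mathbb{R}^{n\times p}$, then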

\operatorname{vec}(\mathbf A\mathbf B) = (\mathbf I_p \otimes \mathbf A)\operatorname{vec}(\mathbf B) = (\mathbf B^T \otimes \mathbf I_m)\operatorname{vec}(\mathbf A).

If $\mathbf A \in \mathbb{R}^{m\times n}$, $\mathbf B \in \mathbb{R}^{n\times r}$, and $\mathbf C \in \mathbb{R}^{r\times p}$, then

\operatorname{vec}(\mathbf A(\mathbf B\mathbf C)) = (\mathbf I_p\otimes \mathbf A)\operatorname{vec}(\mathbf B\mathbf C) = (\mathbf I_p\otimes \mathbf A)(\mathbf C^T\otimes \mathbf I_n)\operatorname{vec}(\mathbf B) = (\mathbf C^T\otimes \mathbf A)\operatorname{vec}(\mathbf B).

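Remark. The last step uses the mixed-product rule for the Kronecker product, $(\mathbf M\otimes \mathbf N)(\mathbf P\otimes \mathbf Q)=(\mathbf M\mathbf P)\otimes(\mathbf N\mathbf Q)$, which holds whenever the products $\mathbf M\mathbf P$ and $\mathbf N\mathbf Q$ are defined.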
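
Both identities are easy to confirm numerically. The following is a small sketch using NumPy, where `vec` is implemented as a column-major reshape and `np.kron` is the Kronecker product. The matrix `B2` stands in for the middle factor of the triple product, and all sizes are arbitrary choices for illustration.

```python
import numpy as np

def vec(M):
    """Column-wise stacking, matching the convention used in these notes."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
m, n, r, p = 2, 3, 4, 5

# Two-factor identity: vec(AB) = (I_p kron A) vec(B) = (B^T kron I_m) vec(A)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
lhs = vec(A @ B)
via_B = np.kron(np.eye(p), A) @ vec(B)
via_A = np.kron(B.T, np.eye(m)) @ vec(A)

# Three-factor identity: vec(A B2 C) = (C^T kron A) vec(B2)
B2 = rng.standard_normal((n, r))
C = rng.standard_normal((r, p))
triple = np.kron(C.T, A) @ vec(B2)

print(np.allclose(lhs, via_B), np.allclose(lhs, via_A))
print(np.allclose(vec(A @ B2 @ C), triple))
```

Note that `order="F"` is essential here: NumPy's default reshape is row-major, which corresponds to stacking rows rather than columns and would break the identities as written.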

Product rule

Let $\mathbf A=\mathbf A(\mathbf X)\in\mathbb{R}^{m\times r}$ and $\mathbf B=\mathbf B(\mathbf X)\in\mathbb{R}^{r\times p}$. Then

\frac{\partial \operatorname{vec}(\mathbf A\mathbf B)}{\partial \operatorname{vec}(\mathbf X)^T} = (\mathbf B^T\otimes \mathbf I_m)\frac{\partial \operatorname{vec}(\mathbf A)}{\partial \operatorname{vec}(\mathbf X)^T} + (\mathbf I_p\otimes \mathbf A)\frac{\partial \operatorname{vec}(\mathbf B)}{\partial \operatorname{vec}(\mathbf X)^T}.

A short derivation follows. Start from the ordinary differential formula:

d(\mathbf A\mathbf B)=d\mathbf A\cdot \mathbf B + \mathbf A\cdot d\mathbf B.

Vectorizing both sides gives

d\operatorname{vec}(\mathbf A\mathbf B) = \operatorname{vec}(d\mathbf A\cdot \mathbf B)+\operatorname{vec}(\mathbf A\cdot d\mathbf B) = (\mathbf B^T\otimes \mathbf I_m)d\operatorname{vec}(\mathbf A) + (\mathbf I_p\otimes \mathbf A)d\operatorname{vec}(\mathbf B).

Next, write

d\operatorname{vec}(\mathbf A) = \frac{\partial \operatorname{vec}(\mathbf A)}{\partial \operatorname{vec}(\mathbf X)^T}d\operatorname{vec}(\mathbf X), \qquad d\operatorname{vec}(\mathbf B) = \frac{\partial \operatorname{vec}(\mathbf B)}{\partial \operatorname{vec}(\mathbf X)^T}d\operatorname{vec}(\mathbf X),

substitute into the previous equation, and compare the coefficients of $d\operatorname{vec}(\mathbf X)$. This yields the product rule stated above.
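
As a sanity check, the product rule can be compared against a finite-difference Jacobian. The sketch below assumes the specific factors $\mathbf A(\mathbf X)=\mathbf X\mathbf C$ and $\mathbf B(\mathbf X)=\mathbf D\mathbf X$ with square $K\times K$ matrices, so that both factor Jacobians follow directly from the vec formulas. These choices, and the helper `numerical_jacobian`, are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def vec(M):
    """Column-wise stacking."""
    return np.asarray(M).reshape(-1, order="F")

def numerical_jacobian(func, X, h=1e-6):
    """Central-difference estimate of d vec(func(X)) / d vec(X)^T."""
    p, q = X.shape
    cols = []
    for k in range(p * q):
        step = np.zeros(p * q)
        step[k] = h
        dX = step.reshape(p, q, order="F")
        cols.append((vec(func(X + dX)) - vec(func(X - dX))) / (2 * h))
    return np.column_stack(cols)

rng = np.random.default_rng(1)
K = 3
X = rng.standard_normal((K, K))
C = rng.standard_normal((K, K))
D = rng.standard_normal((K, K))
I = np.eye(K)

A = X @ C                 # A(X) = X C
B = D @ X                 # B(X) = D X
J_A = np.kron(C.T, I)     # d vec(XC) / d vec(X)^T = C^T kron I
J_B = np.kron(I, D)       # d vec(DX) / d vec(X)^T = I kron D

# Product rule: (B^T kron I_m) J_A + (I_p kron A) J_B, here with m = p = K
product_rule = np.kron(B.T, I) @ J_A + np.kron(I, A) @ J_B
numeric = numerical_jacobian(lambda M: (M @ C) @ (D @ M), X)

print(np.allclose(product_rule, numeric, atol=1e-5))
```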

Chain rule

The matrix-to-matrix derivative also satisfies a chain rule. If $\mathbf B=\mathbf B(\mathbf X)$ and $\mathbf A=\mathbf A(\mathbf B)$, then

\frac{\partial \operatorname{vec}(\mathbf A)}{\partial \operatorname{vec}(\mathbf X)^T} = \frac{\partial \operatorname{vec}(\mathbf A)}{\partial \operatorname{vec}(\mathbf B)^T} \frac{\partial \operatorname{vec}(\mathbf B)}{\partial \operatorname{vec}(\mathbf X)^T}.

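For example, with constant matrices $\mathbf C,\mathbf D,\mathbf E,\mathbf F$ of conformable sizes, let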

\mathbf B(\mathbf X)=\mathbf C\mathbf X\mathbf D,\qquad \mathbf A(\mathbf B)=\mathbf E\mathbf B\mathbf F,

Then

\frac{\partial \operatorname{vec}(\mathbf B)}{\partial \operatorname{vec}(\mathbf X)^T} = \mathbf D^T\otimes \mathbf C, \qquad \frac{\partial \operatorname{vec}(\mathbf A)}{\partial \operatorname{vec}(\mathbf B)^T} = \mathbf F^T\otimes \mathbf E,

and therefore

\frac{\partial \operatorname{vec}(\mathbf A)}{\partial \operatorname{vec}(\mathbf X)^T} = (\mathbf F^T\otimes \mathbf E)(\mathbf D^T\otimes \mathbf C).

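Remark. This is entirely analogous to the ordinary vector chain rule. The only difference is that the intermediate variable is a matrix rather than a vector, and vec turns the rule into a product of Jacobians. By the mixed-product rule, the result equals $(\mathbf D\mathbf F)^T\otimes(\mathbf E\mathbf C)$, which agrees with vectorizing $\mathbf A=(\mathbf E\mathbf C)\mathbf X(\mathbf D\mathbf F)$ directly.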
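
The chain-rule example can be verified with a few lines of NumPy. In this sketch the matrix sizes are arbitrary assumptions; since every map involved is linear in $\mathbf X$, the Jacobian applied to $\operatorname{vec}(\mathbf X)$ must reproduce $\operatorname{vec}(\mathbf A)$ exactly.

```python
import numpy as np

def vec(M):
    """Column-wise stacking."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4))
C = rng.standard_normal((2, 3))   # B = C X D is 2 x 5
D = rng.standard_normal((4, 5))
E = rng.standard_normal((6, 2))   # A = E B F is 6 x 2
F = rng.standard_normal((5, 2))

J_B_wrt_X = np.kron(D.T, C)       # d vec(B) / d vec(X)^T
J_A_wrt_B = np.kron(F.T, E)       # d vec(A) / d vec(B)^T
J_chain = J_A_wrt_B @ J_B_wrt_X   # chain rule: product of the two Jacobians

# Collapsing the composition first gives A = (E C) X (D F).
J_direct = np.kron((D @ F).T, E @ C)

A = E @ (C @ X @ D) @ F
print(np.allclose(J_chain, J_direct))
print(np.allclose(J_chain @ vec(X), vec(A)))
```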

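Examples

Here are two examples that came up in proofs in a paper of mine. Throughout, let $\mathbf F\in\mathbb R^{K\times K}$, $\mathbf b\in\mathbb R^K$, and $\mathbf u_i,\mathbf w_j\in\mathbb R^K$, with $\mathbf F$ invertible. Here $\mathbf F$ and $\mathbf b$ are the independent variables.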

1. $\boldsymbol{\theta}_i(\boldsymbol{\gamma};\mathbf Z)=\sqrt N\,\mathbf F^T\mathbf u_i$

Since $\mathbf F^T\mathbf u_i=\operatorname{vec}(\mathbf u_i^T\mathbf F)$, the vectorization formula gives

\boldsymbol{\theta}_i = \sqrt N\,(\mathbf I_K\otimes \mathbf u_i^T)\operatorname{vec}(\mathbf F),

Therefore

\frac{\partial \boldsymbol{\theta}_i}{\partial \operatorname{vec}(\mathbf F)^T} = \sqrt N\,(\mathbf I_K\otimes \mathbf u_i^T), \qquad \frac{\partial \boldsymbol{\theta}_i}{\partial \mathbf b^T} = \mathbf 0_{K\times K}.

Remark. The derivative with respect to $\mathbf b$ is zero because $\mathbf b$ does not appear in the expression at all.
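
A quick numerical confirmation of Example 1 is sketched below. The values of `K` and `N` and the random inputs are assumptions for illustration only; the source does not specify them. Because $\boldsymbol\theta_i$ is linear in $\operatorname{vec}(\mathbf F)$, checking that the claimed Jacobian times $\operatorname{vec}(\mathbf F)$ reproduces $\boldsymbol\theta_i$ is enough.

```python
import numpy as np

def vec(M):
    """Column-wise stacking."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
K, N = 4, 100                       # illustrative sizes (assumed)
F = rng.standard_normal((K, K))
u = rng.standard_normal(K)

theta = np.sqrt(N) * F.T @ u

# Claimed Jacobian: sqrt(N) * (I_K kron u^T), a K x K^2 matrix
J_theta = np.sqrt(N) * np.kron(np.eye(K), u.reshape(1, K))

print(J_theta.shape)                # (4, 16)
print(np.allclose(J_theta @ vec(F), theta))
```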

2. $\mathbf a_j(\boldsymbol{\gamma};\mathbf Z)=\sqrt J\,\mathbf F^{-1}\mathbf w_j+\mathbf b$

Method 1

Again by the vectorization formula,

\mathbf a_j = \sqrt J\,(\mathbf w_j^T\otimes \mathbf I_K)\operatorname{vec}(\mathbf F^{-1})+\mathbf b.

The key step is to work out where

\frac{\partial \operatorname{vec}(\mathbf F^{-1})}{\partial \operatorname{vec}(\mathbf F)^T}

comes from. Note that

\mathbf F^{-1} \mathbf F = \mathbf I_K.

Taking differentials of both sides,

d(\mathbf F^{-1})\mathbf F + \mathbf F^{-1}d(\mathbf F) = 0.

Right-multiplying by $\mathbf F^{-1}$ and rearranging, we have

d(\mathbf F^{-1})=-\mathbf F^{-1}(d\mathbf F)\mathbf F^{-1}

Vectorizing both sides gives

d\operatorname{vec}(\mathbf F^{-1}) = -\,\operatorname{vec}\!\big(\mathbf F^{-1}(d\mathbf F)\mathbf F^{-1}\big) = -\,(\mathbf F^{-T}\otimes \mathbf F^{-1})\,d\operatorname{vec}(\mathbf F).

Finally, we obtain

\frac{\partial \operatorname{vec}(\mathbf F^{-1})}{\partial \operatorname{vec}(\mathbf F)^T} = -\,(\mathbf F^{-T}\otimes \mathbf F^{-1}).

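Substituting into the original expression, and using the mixed-product rule in the last step, gives the final result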

\frac{\partial \mathbf a_j}{\partial \operatorname{vec}(\mathbf F)^T} = -\sqrt J\,(\mathbf w_j^T\otimes \mathbf I_K)(\mathbf F^{-T}\otimes \mathbf F^{-1}) = -\sqrt J\,((\mathbf F^{-1}\mathbf w_j)^T\otimes \mathbf F^{-1}),

and

\frac{\partial \mathbf a_j}{\partial \mathbf b^T}=\mathbf I_K.
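
The three Jacobians in Example 2 can be compared with finite differences. This is a minimal sketch: the helper `numerical_jacobian`, the value `J_size` (standing in for $J$), and the way the test matrix `F` is generated are all assumptions made for illustration.

```python
import numpy as np

def vec(M):
    """Column-wise stacking; a 1-D array is returned unchanged."""
    return np.asarray(M).reshape(-1, order="F")

def numerical_jacobian(func, X, h=1e-6):
    """Central-difference estimate of d vec(func(X)) / d vec(X)^T for a 2-D X."""
    p, q = X.shape
    cols = []
    for k in range(p * q):
        step = np.zeros(p * q)
        step[k] = h
        dX = step.reshape(p, q, order="F")
        cols.append((vec(func(X + dX)) - vec(func(X - dX))) / (2 * h))
    return np.column_stack(cols)

rng = np.random.default_rng(3)
K, J_size = 3, 50   # J_size plays the role of J (assumed value)
# Diagonal shift, intended to keep F well away from singular.
F = 3 * np.eye(K) + 0.5 * rng.standard_normal((K, K))
w = rng.standard_normal(K)
b = rng.standard_normal(K)
F_inv = np.linalg.inv(F)

# d vec(F^{-1}) / d vec(F)^T = -(F^{-T} kron F^{-1})
inv_analytic = -np.kron(F_inv.T, F_inv)
inv_numeric = numerical_jacobian(np.linalg.inv, F)

# d a_j / d vec(F)^T = -sqrt(J) ((F^{-1} w)^T kron F^{-1})
a_analytic = -np.sqrt(J_size) * np.kron((F_inv @ w).reshape(1, K), F_inv)
a_numeric = numerical_jacobian(
    lambda M: np.sqrt(J_size) * np.linalg.inv(M) @ w + b, F
)

# d a_j / d b^T = I_K, treating b as a K x 1 matrix for the helper
b_numeric = numerical_jacobian(
    lambda c: np.sqrt(J_size) * F_inv @ w + c[:, 0], b.reshape(K, 1)
)

print(np.allclose(inv_analytic, inv_numeric, atol=1e-6))
print(np.allclose(a_analytic, a_numeric, atol=1e-6))
print(np.allclose(b_numeric, np.eye(K), atol=1e-6))
```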

An example with a permutation matrix

Note that, for $K>1$, the following two derivatives are different:

\frac{\partial \operatorname{vec}(\mathbf F^T)}{\partial \operatorname{vec}(\mathbf F)^T} \neq \frac{\partial \operatorname{vec}(\mathbf F)}{\partial \operatorname{vec}(\mathbf F)^T} = \mathbf I_{K^2}.

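The reason is that the matrix in the numerator has been transposed, so the numerator and the denominator are no longer stacked in the same order. The left-hand side is then not the identity matrix but a permutation matrix. The most typical use of a permutation matrix is exactly this: realigning the numerator and the denominator when their stacking conventions differ.

The most common case here is the commutation matrix $\mathbf K_{KK}$. If $\mathbf F\in\mathbb{R}^{K\times K}$, then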

\operatorname{vec}(\mathbf F^T)=\mathbf K_{KK}\operatorname{vec}(\mathbf F), \qquad \frac{\partial \operatorname{vec}(\mathbf F^T)}{\partial \operatorname{vec}(\mathbf F)^T}=\mathbf K_{KK}.

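Remark. A permutation matrix does not introduce new information; it only reorders existing coordinates. When the vec orderings of the numerator and the denominator differ, multiplying by $\mathbf K_{KK}$ first brings them into the same coordinate system. In addition, the commutation matrix swaps the factors of a Kronecker product. If $\mathbf A,\mathbf B\in\mathbb{R}^{K\times K}$, then
$\mathbf K_{KK}(\mathbf A\otimes \mathbf B)=(\mathbf B\otimes \mathbf A)\mathbf K_{KK},$
or equivalently, since $\mathbf K_{KK}\mathbf K_{KK}=\mathbf I_{K^2}$,
$\mathbf K_{KK}(\mathbf A\otimes \mathbf B)\mathbf K_{KK}=\mathbf B\otimes \mathbf A.$
This shows that the commutation matrix can exchange the two factors of a Kronecker product, at the cost of adjusting the stacking order on both the left and the right.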
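
For concreteness, the commutation matrix can be built entry by entry and the properties above checked numerically. The function name `commutation_matrix` is a hypothetical helper written for this sketch, and the size `K = 3` is arbitrary.

```python
import numpy as np

def vec(M):
    """Column-wise stacking."""
    return M.reshape(-1, order="F")

def commutation_matrix(m, n):
    """Permutation matrix K_mn with K_mn vec(M) = vec(M^T) for any m x n matrix M."""
    Kmn = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # M[i, j] sits at index i + j*m in vec(M) and at j + i*n in vec(M^T).
            Kmn[j + i * n, i + j * m] = 1.0
    return Kmn

rng = np.random.default_rng(4)
K = 3
F = rng.standard_normal((K, K))
A = rng.standard_normal((K, K))
B = rng.standard_normal((K, K))
K_KK = commutation_matrix(K, K)

print(np.allclose(K_KK @ vec(F), vec(F.T)))                  # vec(F^T) = K vec(F)
print(np.allclose(K_KK @ K_KK, np.eye(K * K)))               # K is its own inverse
print(np.allclose(K_KK @ np.kron(A, B) @ K_KK, np.kron(B, A)))
print(np.array_equal(K_KK, np.eye(K * K)))                   # False: not the identity
```

Since $\operatorname{vec}(\mathbf F^T)$ is linear in $\operatorname{vec}(\mathbf F)$, the first check also confirms that the Jacobian of $\operatorname{vec}(\mathbf F^T)$ with respect to $\operatorname{vec}(\mathbf F)$ is `K_KK` itself.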