What are some strategies for computing derivatives in matrix calculus?

by Iamanon   Last Updated May 23, 2020 01:20 AM

In single variable calculus, the chain, product, and power rules are very straightforward. In vector calculus, it's a bit more involved.

I frequently find myself having to do a lot of work to arrive at the answer. For example, taking the derivative of

$$ \frac{\partial}{\partial x} Ax $$

where $x$ is a column vector and $A$ is a matrix, I usually end up doing the following: I find the $i$-th entry of the expression, take its derivative, and generalize that to the full case.

$$ (Ax)_i = \sum_{j}A_{ij}x_j \\ \frac{\partial}{\partial x_k}\Big(\sum_j A_{ij}x_j\Big) = A_{ik} \\ \therefore \frac{\partial}{\partial x}(Ax) = A^T $$

But in this case, we could have just used the analogue of the power rule and taken the constant coefficient of $Ax$ (which gives $A^T$ when using a denominator layout) to be the derivative.

This is an easy case. Now what if we had $$ \frac{\partial}{\partial x}x^TA $$

In this case, it's not easy for me to identify what the derivative should be, so I write out the terms individually

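$$ x^TA = \left[\sum_{i}x_iA_{i1} \ \ \sum_{i}x_iA_{i2} \ldots \right] \\ \frac{\partial}{\partial x_k} x^TA = A_k \\ \therefore \frac{\partial}{\partial x} x^TA = A $$ where $A_k$ denotes the $k$-th row of $A$. This is the transpose of the derivative of $Ax$, which is consistent because $x^TA = (A^Tx)^T$.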

Now we can get more complicated, by having expressions like $x^TAx$, or taking higher order derivatives, where it can become tedious to write out the terms as I've done. To avoid making the post too long, I'll just briefly show how I would compute the derivative of $x^TAx$.

\begin{align} x^TAx = \langle x, Ax \rangle = \langle Ax, x \rangle \\ = \sum_i x_i \sum_j A_{ij}x_j \\ \frac{\partial}{\partial x_k}\sum_i x_i \sum_j A_{ij}x_j = \sum_j A_{kj}x_j + \sum_i x_i A_{ik} \ \ \ \ \ \text{where I applied the product rule} \\ \therefore \frac{\partial }{\partial x}(x^TAx) = Ax + A^Tx \end{align}
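A result like this is easy to sanity-check numerically. The following is a minimal sketch, assuming NumPy is available; `numerical_gradient` is a hypothetical helper written for this illustration, and the matrix and vector are arbitrary random values. It compares $Ax + A^Tx$ against a central finite-difference gradient of $x^TAx$.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))  # arbitrary matrix, deliberately not symmetric
x = rng.standard_normal(4)

def quadratic_form(z):
    """The scalar z^T A z."""
    return z @ A @ z

closed_form = A @ x + A.T @ x                         # result derived above
finite_diff = numerical_gradient(quadratic_form, x)   # numerical estimate

print(np.allclose(closed_form, finite_diff, atol=1e-6))
```

Using a non-symmetric $A$ matters here: for symmetric $A$ the two terms coincide, and a mistake such as writing $2Ax$ would go unnoticed.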

So basically my question boils down to this: is there a quick way or some trick to find the derivative without going through summations (i.e., writing out the terms individually)? The power rule seems identical to the single variable case, where $\frac{d}{dx} cx = c$ and analogously $\frac{d}{dx} Cx = C$. The product rule seems more involved, and I can't readily identify a generalization to the multivariable case.

Wikipedia has some tables for common derivatives, and those are more or less the ones that I come across, but I don't want to commit those to memory, and would rather know a good strategy to compute these fast.

Answers 1

You can view a matrix/tensor product as a tensor network. Taking the derivative of the product with respect to one of its tensors then amounts to deleting that tensor's node from the network: the tensor network that remains represents the derivative.

So let's stick with matrices and vectors for now. A matrix can be represented as a node with two edges in a tensor network. The two edges represent the rows and columns of the matrix. Example:

The matrix $A=\begin{pmatrix}a & b\\ c & d \\ f & g\end{pmatrix}$ can be represented as the tensor network:

[Image 1: the matrix $A$ drawn as a node with two edges]

Here, the left edge of node $A$ represents the columns of matrix $A$ and the right edge represents the rows of matrix $A$. The numbers on top of the edges denote the dimensions of the rows and columns, respectively.

If we have vectors $u \in \mathbf{R}^2, v \in \mathbf{R}^3$ and want to represent the product $v^TAu$ in a tensor network, we can first draw our vectors in tensor network format:

[Image 2: the vectors $u$ and $v$ drawn as separate nodes with one edge each]

Notice that the nodes $u,v$ have only one edge coming out of them; that is because they are one-dimensional arrays, with a single index each.

To create our tensor network for $v^TAu$, we just have to attach the nodes according to how they are multiplied. For example, the vector $v^T$ multiplies the columns of $A$, and their dimensions match, so the edge of $v$ is joined to the corresponding edge of $A$.

[Image 3: the nodes $v$, $A$, and $u$ joined into the network for $v^TAu$]

Notice that the resulting tensor network has no edges that stick out. This is because the result of the product is a scalar. Now, if we want to take the derivative with respect to any of $u,v,A$, we can just delete the corresponding tensor in the tensor network and we will get the solution.

Let's see an example. If we take the partial derivatives of $v^TAu$ with respect to each entry of $u$, we get a vector of dimension $2$. That vector is exactly $v^TA$, and you represent it in a tensor network as:

[Image 4: the network for $v^TA$, with the node $u$ removed]

As you can see, the result is a tensor network with a free edge of dimension $2$, which represents exactly the product $v^TA$.
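The same picture can be written down with `numpy.einsum`, where each index letter plays the role of an edge. This is only a sketch, assuming NumPy and arbitrary random values for $A$, $u$, $v$; the index names `i` and `j` are a choice made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))
u = rng.standard_normal(2)
v = rng.standard_normal(3)

# Full network v_i A_ij u_j: every index is contracted, so no edge
# sticks out and the result is a scalar.
scalar = np.einsum("i,ij,j->", v, A, u)

# Delete the node u: its index j is no longer contracted and becomes
# the free edge of dimension 2.
d_du = np.einsum("i,ij->j", v, A)

print(np.isclose(scalar, v @ A @ u))
print(d_du.shape)
print(np.allclose(d_du, v @ A))
```

Deleting `v` instead would leave the index `i` free, i.e. `np.einsum("ij,j->i", A, u)`, which is $Au$.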

Let's see how we can take the partial derivative of $v^TAu$ with respect to each entry of the matrix $A$. We should therefore expect the result to be a matrix in $\mathbf{R}^{3 \times 2}$. The resulting matrix is the outer product $vu^T$ of $v$ and $u$, with entries $v_iu_j$. In tensor network format we get:

[Image 5: the nodes $v$ and $u$ with the node $A$ removed]

Notice that this tensor network is different from image 2: there, we think of each vector as its own tensor network, but here we think of them together, and as you can see, there are two free edges, representing a two-dimensional array (a matrix). Notice that when you have disconnected components in a tensor network, it means an outer product.
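As a rough check of the outer-product claim, here is a sketch assuming NumPy and random values; the helper `f` is hypothetical and only exists for this illustration. Since $v^TAu$ is linear in $A$, bumping one entry of $A$ by $1$ changes the scalar by exactly the corresponding entry of the derivative, up to rounding.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))
u = rng.standard_normal(2)
v = rng.standard_normal(3)

def f(M):
    """The scalar v^T M u, written as a fully contracted network."""
    return np.einsum("i,ij,j->", v, M, u)

# Delete the node A: indices i and j are both left free, and the two
# remaining nodes are disconnected, so the result is an outer product.
d_dA = np.einsum("i,j->ij", v, u)

# Compare entry by entry against the change in f when A[i, j] grows by 1.
check = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        bump = np.zeros_like(A)
        bump[i, j] = 1.0
        check[i, j] = f(A + bump) - f(A)

print(d_dA.shape)
print(np.allclose(d_dA, check))
```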

There is one more case that we need to consider: what happens if a tensor appears multiple times in the tensor network and you are taking the partial derivatives of the network with respect to that tensor. In that case, remove the occurrences of that tensor one at a time and then add up the resulting tensor networks to get the derivative.
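Applied to $x^TAx$ from the question, where $x$ appears twice, this rule can be sketched as follows, again assuming NumPy and arbitrary random values. The variable names `delete_left` and `delete_right` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))  # arbitrary matrix, not symmetric
x = rng.standard_normal(4)

# The network x_i A_ij x_j contains the node x twice.
value = np.einsum("i,ij,j->", x, A, x)

# Delete the left occurrence of x: index i becomes free, leaving A x.
delete_left = np.einsum("ij,j->i", A, x)

# Delete the right occurrence of x: index j becomes free, leaving A^T x.
delete_right = np.einsum("i,ij->j", x, A)

# The derivative is the sum of the two remaining networks.
gradient = delete_left + delete_right

print(np.allclose(gradient, A @ x + A.T @ x))
```

This is the product rule in disguise: each deletion corresponds to differentiating one factor while holding the other fixed.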

May 23, 2020 01:16 AM
