\[ % Defines an openable reference to the given tag \newcommand\fref[2][]{ \hspace{1em} \cssId{ref#1:#2}{ \class{start-socket}{ \phantom{\Box} } } } % Defines an inline reference to the given tag \newcommand\ifref[2][]{ \cssId{ref#1:#2}{ \class{start-socket}{ \phantom{} } } } % Defines a tag to reference later \newcommand\ftag[1]{ \cssId{tag:#1}{ \class{end-socket}{ \phantom{\large #1\Box} } } } \]

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \require{color} \definecolor{b}{RGB}{128, 177, 211} \definecolor{o}{RGB}{253, 180, 98} \]

Welcome!

This page is an in-the-works solution manual for The Elements of Statistical Learning. I’m writing the page using RMarkdown as I work through the book. This setup allows for intermixing R code and LaTeX mathematics, both vital components when working through the textbook exercises.

I’ll be adding more References and Problems as time goes on.

library(knitr)

References

Introduction

This is the reference section.
We use this section as a notebook for topics not directly related to solving the exercises, but that are useful nonetheless.

Matrix Calculus

This section introduces and proves a few basic facts about matrix calculus. Specifically, we look at what happens when we take the derivatives of:

  • A scalar with respect to a vector
  • A vector with respect to a scalar
  • A vector with respect to another vector

For all of the material in this section, I’ll use \(x\) and \(y\) as scalar values. \(X\) will be an \(n\)-dimensional column vector and \(Y\) will be an \(m\)-dimensional column vector.

\[ X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \text{ and } Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \]

All of this material (except the proofs) can be found on the Matrix Calculus Wikipedia page.
For even more fun on multi-dimensional calculus, consider looking through Ricci Calculus, a simplified version of calculus on tensors.

Definitions

Let’s start by taking the derivative of a scalar \(y\) with respect to a column vector \(X\). The result is pretty intuitive: just take the derivative of \(y\) with respect to each component in \(X\): \[ \dfrac{\partial y}{\partial X} = \partial y / \partial \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix} \ftag{MC.1} \] If \(X\) had been a row vector, then the result would have been a row vector as well. It’s interesting to notice that this quantity is actually just the gradient of \(y\): \[ \nabla y \leftrightarrow \dfrac{\partial y}{\partial X} \]
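As a quick numerical sketch of \(\ifref{MC.1}\), the R snippet below compares a hand-computed gradient against a central finite-difference approximation. The function f and the test point x0 are made up purely for illustration.

f <- function(x) sum(x^2) + 3 * x[1]                  # a scalar-valued function of a vector
x0 <- c(1, 2, 3); h <- 1e-6
numeric_grad <- sapply(seq_along(x0), function(i) {
  e <- replace(rep(0, length(x0)), i, h)
  (f(x0 + e) - f(x0 - e)) / (2 * h)                   # central difference for each x_i
})
round(cbind(analytic = 2 * x0 + c(3, 0, 0), numeric = numeric_grad), 6)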

Next, we take the derivative of a column vector \(Y\) with respect to a scalar \(x\). The result turns out similar to \(\ifref{MC.1}\) above, but transposed. \(Y\) is a column vector, but \(\partial Y / \partial x\) is a row vector: \[ \dfrac{\partial Y}{\partial x} = \dfrac{\partial}{\partial x} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_m}{\partial x} \end{bmatrix} \ftag{MC.2} \] If \(Y\) had been a row vector, then the result would have been a column vector instead.

Why do we transpose the output in this case? The transposition stems from a choice of notation. Currently, there exist two standards for doing matrix calculus called the numerator notation and the denominator notation.

  • In numerator notation, the output is always laid out according to the numerator and/or the transpose of the denominator.
  • In denominator notation, the output is always laid out according to the denominator and/or the transpose of the numerator. This is the notation that the authors of Elements of Statistical Learning seem to use and the notation I used in \(\ifref{MC.1}\) and \(\ifref{MC.2}\).

The following table summarizes both notations:

\[ \begin{array}{c|c|c} & \text{Numerator Notation} & \text{Denominator Notation} \\ \hline \dfrac{\partial y}{\partial X} & \begin{bmatrix} \Box & \Box & \cdots & \Box \end{bmatrix} & \begin{bmatrix} \Box \\ \Box \\ \vdots \\ \Box \end{bmatrix} \\ \dfrac{\partial Y}{\partial x} & \begin{bmatrix} \Box \\ \Box \\ \vdots \\ \Box \end{bmatrix} & \begin{bmatrix} \Box & \Box & \cdots & \Box \end{bmatrix} \end{array} \]

Unless specified otherwise, I’ll be using denominator notation throughout the rest of these solutions.

At this point, you might be wondering why we need either of these notations at all. Why can we not simply make the output a column vector in either case? As you’ll see below, this is not always possible.

We close this section by defining the derivative of an \(m\)-dimensional column vector \(Y\) with respect to an \(n\)-dimensional column vector \(X\). If both \(X\) and \(Y\) are column vectors, which will dictate the columns of the matrix and which will dictate the rows? Well, since we’re using denominator notation, the output must be an \(n \times m\) matrix. This guarantees that \(X\) (an \(n \times 1\) matrix) stays in column form and \(Y\) (an \(m \times 1\) matrix) changes into row form. If we were using numerator notation, the output would have been an \(m \times n\) matrix instead. \[ \begin{array}{rccl} \dfrac{\partial Y}{\partial X} & = & \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix} & \ftag{MC.3} \\ & & n \times m & \end{array} \]

Below is an easy way to remember how this notation works. In denominator notation, the denominator stays the same while the numerator gets transposed: \[ \dfrac{\partial Y}{\partial X} \leftrightarrow \dfrac{[m \times 1]}{[n \times 1]} = [n \times m] \] One immediate success of this notation comes from considering the derivative of \(X\) with respect to itself. Using the above definition, we arrive at: \[ \begin{array}{rccl} \dfrac{\partial X}{\partial X} & = & \begin{bmatrix} \frac{\partial x_1}{\partial x_1} & \frac{\partial x_2}{\partial x_1} & \cdots & \frac{\partial x_n}{\partial x_1} \\ \frac{\partial x_1}{\partial x_2} & \frac{\partial x_2}{\partial x_2} & \cdots & \frac{\partial x_n}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial x_1}{\partial x_n} & \frac{\partial x_2}{\partial x_n} & \cdots & \frac{\partial x_n}{\partial x_n} \end{bmatrix} & \fref{MC.3} \\ & = & \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \\ \dfrac{\partial X}{\partial X} & = & I_n \ftag{MC.4} \end{array} \] This should line up with expectations. We knew that \(\partial X / \partial X\) must be an \(n \times n\) matrix, and the only matrix that makes sense here is the \(n \times n\) identity matrix \(I_n\).
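Before moving on, here is a small R sketch (using a made-up map \(g\) and a made-up point) of what \(\ifref{MC.3}\) looks like numerically in denominator notation: entry \((i, j)\) of the output holds \(\partial y_j / \partial x_i\), and plugging in the identity map recovers \(\ifref{MC.4}\).

denom_jacobian <- function(g, x, h = 1e-6) {
  m <- length(g(x)); n <- length(x)
  J <- matrix(0, n, m)                                # denominator layout: n x m
  for (i in seq_len(n)) {
    e <- rep(0, n); e[i] <- h
    J[i, ] <- (g(x + e) - g(x - e)) / (2 * h)         # row i holds the partials with respect to x_i
  }
  J
}
x0 <- c(1, 2, 3)
round(denom_jacobian(function(x) c(sum(x^2), x[1] * x[2]), x0), 6)   # a 3 x 2 matrix
round(denom_jacobian(function(x) x, x0), 6)                          # dX/dX: the 3 x 3 identity, as in (MC.4)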

Basic Properties

When you first started working with single variable calculus, you most likely learned rules like: \[ \dfrac{d}{dx} a x = a \text{, } \dfrac{d}{dx} x^2 = 2x \]

Not to mention more general rules like the product rule: \[ \dfrac{d}{dx} uv =\dfrac{dv}{dx} u + \dfrac{du}{dx} v \]

In this section, we’ll learn some elementary rules regarding matrix calculus. These rules will look similar to their univariate cousins above.

First up, consider a constant \(n\)-dimensional column vector \(A\) and its inner product with \(X\), \(A^{T}X\). Since this inner product is scalar, we can try to take its derivative with respect to \(X\), like we did in \(\ifref{MC.1}\). We get: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} A^{T}X & = & \dfrac{\partial}{\partial X} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) & \\ & = & \begin{bmatrix} \frac{\partial}{\partial x_1} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) \\ \frac{\partial}{\partial x_2} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) \\ \vdots \\ \frac{\partial}{\partial x_n} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) \end{bmatrix} & \fref{MC.1} \\ & & n \times 1 & \\ \\ & = & \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} & \\ & & n \times 1 & \\ \dfrac{\partial}{\partial X} A^{T}X & = & A \ftag{MC.5} \end{array} \] This should remind you of the classic result of taking a derivative of a linear function: \[ \dfrac{\partial}{\partial X} A^{T}X = A \leftrightarrow \dfrac{d}{dx} ax = a \]

Next, we take a look at the matrix calculus version of the product rule. Let \(U\) and \(V\) be \(m\)-dimensional vectors that depend on \(X\). We will attempt to take the derivative of their inner product \(U^{T}V\). Just like in \(\ifref{MC.5}\), we know \(U^{T}V\) is a scalar value so we can take its derivative with respect to \(X\). The derivation below gets tedious because of all the indices, but there’s no new information here: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} U^{T} V & = & \dfrac{\partial}{\partial X} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) & \\ & = & \begin{bmatrix} \frac{\partial}{\partial x_1} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) \\ \frac{\partial}{\partial x_2} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) \\ \vdots \\ \frac{\partial}{\partial x_n} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) \end{bmatrix} & \fref{MC.1} \\ & & n \times 1 & \\ \\ & = & \begin{bmatrix} (u_1 \frac{\partial v_1}{\partial x_1} + \frac{\partial u_1}{\partial x_1} v_1) + (u_2 \frac{\partial v_2}{\partial x_1} + \frac{\partial u_2}{\partial x_1} v_2) + \cdots + (u_m \frac{\partial v_m}{\partial x_1} + \frac{\partial u_m}{\partial x_1} v_m) \\ (u_1 \frac{\partial v_1}{\partial x_2} + \frac{\partial u_1}{\partial x_2} v_1) + (u_2 \frac{\partial v_2}{\partial x_2} + \frac{\partial u_2}{\partial x_2} v_2) + \cdots + (u_m \frac{\partial v_m}{\partial x_2} + \frac{\partial u_m}{\partial x_2} v_m) \\ \vdots \\ (u_1 \frac{\partial v_1}{\partial x_n} + \frac{\partial u_1}{\partial x_n} v_1) + (u_2 \frac{\partial v_2}{\partial x_n} + \frac{\partial u_2}{\partial x_n} v_2) + \cdots + (u_m \frac{\partial v_m}{\partial x_n} + \frac{\partial u_m}{\partial x_n} v_m) \end{bmatrix} & \fref{Product.Rule} \\ & & n \times 1 & \\ \\ & = & \begin{bmatrix} u_1 \frac{\partial v_1}{\partial x_1} + u_2 \frac{\partial v_2}{\partial x_1} + \cdots + u_m \frac{\partial v_m}{\partial x_1} \\ u_1 \frac{\partial v_1}{\partial x_2} + u_2 \frac{\partial v_2}{\partial x_2} + \cdots + u_m \frac{\partial v_m}{\partial x_2} \\ \vdots \\ u_1 \frac{\partial v_1}{\partial x_n} + u_2 \frac{\partial v_2}{\partial x_n} + \cdots + u_m \frac{\partial v_m}{\partial x_n} \\ \end{bmatrix} + \begin{bmatrix} \frac{\partial u_1}{\partial x_1} v_1 + \frac{\partial u_2}{\partial x_1} v_2 + \cdots + \frac{\partial u_m}{\partial x_1} v_m \\ \frac{\partial u_1}{\partial x_2} v_1 + \frac{\partial u_2}{\partial x_2} v_2 + \cdots + \frac{\partial u_m}{\partial x_2} v_m \\ \vdots \\ \frac{\partial u_1}{\partial x_n} v_1 + \frac{\partial u_2}{\partial x_n} v_2 + \cdots + \frac{\partial u_m}{\partial x_n} v_m \\ \end{bmatrix} \\ & & n \times 1 \hspace{13em} n \times 1 & \\ \\ & = & \begin{bmatrix} \frac{\partial v_1}{\partial x_1} & \frac{\partial v_2}{\partial x_1} & \cdots & \frac{\partial v_m}{\partial x_1} \\ \frac{\partial v_1}{\partial x_2} & \frac{\partial v_2}{\partial x_2} & \cdots & \frac{\partial v_m}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial v_1}{\partial x_n} & \frac{\partial v_2}{\partial x_n} & \cdots & \frac{\partial v_m}{\partial x_n} \end{bmatrix} \cdot \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} + \begin{bmatrix} \frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} & \cdots & \frac{\partial u_m}{\partial x_1} \\ \frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} & \cdots & \frac{\partial u_m}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial u_1}{\partial x_n} & \frac{\partial u_2}{\partial x_n} & \cdots & \frac{\partial u_m}{\partial x_n} \end{bmatrix} \cdot 
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix} \\ & & \hspace{4em} n \times m \hspace{5em} m \times 1 \hspace{5em} n \times m \hspace{5em} m \times 1 & \\ \\ \dfrac{\partial}{\partial X} U^{T} V & = & \dfrac{\partial V}{\partial X} U + \dfrac{\partial U}{\partial X} V \ftag{MC.6} & \fref{MC.3} \end{array} \] Phew! The result in \(\ifref{MC.6}\) shows exactly what you would expect. It demonstrates that the product rule generalizes from univariate calculus to matrix calculus. \[ \dfrac{\partial}{\partial X} U^{T}V = \dfrac{\partial V}{\partial X} U + \dfrac{\partial U}{\partial X} V \leftrightarrow \dfrac{d}{dx} uv =\dfrac{dv}{dx} u + \dfrac{du}{dx} v \]

Just like with the univariate product rule, \(\ifref{MC.6}\) allows us to derive other, more specific results. For example, consider the case when \(U = V = Y\). Then, by the product rule, we have: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} Y^{T} Y & = & \dfrac{\partial Y}{\partial X} Y + \dfrac{\partial Y}{\partial X} Y & \fref{MC.6} \\ & = & 2 \dfrac{\partial Y}{\partial X} Y \ftag{MC.7} & \end{array} \] Furthermore, if we let \(Y = X\), then we arrive at: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} X^{T} X & = & 2 \dfrac{\partial X}{\partial X} X & \fref{MC.7} \\ & = & 2 I_n X & \fref{MC.4} \\ & = & 2 X \end{array} \] This result should be very reminiscent of the classical derivative of a quadratic function: \[ \dfrac{\partial}{\partial X} X^{T}X = 2 X \leftrightarrow \dfrac{d}{dx} x^2 = 2x \]
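As a quick sanity check of \(\ifref{MC.5}\) and the quadratic result above, the R snippet below compares both analytic derivatives against finite differences. The vectors A and x0 are arbitrary test values.

num_grad <- function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(i) {
    e <- rep(0, length(x)); e[i] <- h
    (f(x + e) - f(x - e)) / (2 * h)                   # central difference for each x_i
  })
}
A  <- c(2, -1, 4)
x0 <- c(1, 0.5, -2)
round(cbind(A = A, numeric = num_grad(function(x) sum(A * x), x0)), 6)         # d/dX A^T X = A, as in (MC.5)
round(cbind(twoX = 2 * x0, numeric = num_grad(function(x) sum(x * x), x0)), 6) # d/dX X^T X = 2X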

Matrix Operations

Since statistical learning heavily utilizes linear algebra and matrix operations like the trace, transpose, and inverse, this section reviews some useful definitions, properties, and interactions.

Of course, these notes are meant to serve as review only. They are no substitute for a proper text/class on linear algebra.

Trace

We define the trace of a matrix as the sum of all its diagonal elements. In particular, if we have an \(n \times n\) matrix \(A\) with elements \(a_{i j}\), then its trace is: \[ tr(A) = \sum\limits_{i = 1}^{n} a_{i i} = a_{1 1} + a_{2 2} + \cdots + a_{n n} \ftag{MO.1} \]

Visually, this looks like: \[ \require{enclose} tr(A) = tr\left( \begin{bmatrix} \enclose{circle}{a_{1 1}} & a_{1 2} & \cdots & a_{1 n} \\ a_{2 1} & \enclose{circle}{a_{2 2}} & \cdots & a_{2 n} \\ \vdots & \vdots & & \vdots \\ a_{n 1} & a_{n 2} & \cdots & \enclose{circle}{a_{n n}} \end{bmatrix} \right) = \sum\limits_{i = 1}^{n} a_{i i} \]

Note that, since rectangular matrices do not have proper diagonals, we can only define the \(tr(\Box)\) function on square matrices. This point will prove important when we consider the trace of products below.

Trace of a Sum:
If we have a sequence of \(K\) \(n \times n\) matrices \(A_k\), then the trace of the sum of this sequence is just the sum of the traces. This result comes from the fact that matrix addition happens element-wise. Thus: \[ tr\left( \sum\limits_{k = 1}^{K} A_k \right) = \sum\limits_{k = 1}^{K} tr(A_k) \ftag{MO.2} \]
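In R, the trace in \(\ifref{MO.1}\) is just sum(diag(A)), and \(\ifref{MO.2}\) is easy to check numerically. The matrices below are random examples.

set.seed(1)
A <- matrix(rnorm(9), 3, 3)
B <- matrix(rnorm(9), 3, 3)
tr <- function(M) sum(diag(M))                        # (MO.1): sum of the diagonal elements
c(tr_of_sum = tr(A + B), sum_of_tr = tr(A) + tr(B))   # (MO.2): the two values agree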

Trace of a Product:
The trace of a product has a much more complicated structure than the trace of a sum (\(\ifref{MO.2}\)). Consider two matrices \(X\) and \(Y\), where \(X\) has dimensions \(n \times m\) and \(Y\) has dimensions \(m \times n\), and their product \(X Y = Z\): \[ \begin{array}{cccc} X & Y & = & Z \\ n \times m & m \times n & & n \times n \end{array} \]

Since \(Z\) is an \(n \times n\) matrix, we can ask about its trace \(tr(Z) = tr(X Y)\). How do we calculate this quantity? Note that we can’t take the traces of \(X\) and \(Y\) separately because neither of them are square. Instead, we have to go straight to the definition of matrix multiplication to arrive at an answer. Recall that, when we multiply two matrices \(X\) and \(Y\), the output is another matrix: \[ \begin{array}{cccc} X & Y & & Z \\ \begin{bmatrix} x_{1 1} & x_{1 2} & \cdots & x_{1 m} \\ x_{2 1} & x_{2 2} & \cdots & x_{2 m} \\ \vdots & \vdots & & \vdots \\ x_{n 1} & x_{n 2} & \cdots & x_{n m} \end{bmatrix} & \begin{bmatrix} y_{1 1} & y_{1 2} & \cdots & y_{1 n} \\ y_{2 1} & y_{2 2} & \cdots & y_{2 n} \\ \vdots & \vdots & & \vdots \\ y_{m 1} & y_{m 2} & \cdots & y_{m n} \end{bmatrix} & = & \begin{bmatrix} z_{1 1} & z_{1 2} & \cdots & z_{1 n} \\ z_{2 1} & z_{2 2} & \cdots & z_{2 n} \\ \vdots & \vdots & & \vdots \\ z_{n 1} & z_{n 2} & \cdots & z_{n n} \end{bmatrix} \\ n \times m & m \times n & & n \times n \end{array} \]

Here, each \(z_{i j}\) is defined as: \[ \require{color} z_{i j} = \sum\limits_{\color{red}k = 1}^{\color{red}m} x_{i {\color{red}k}} \cdot y_{{\color{red}k} j} \]

In effect, to get \(Z\), we sum over the middle dimension of size \(\color{red}m\), indexed by \(\color{red}k\). Here, we highlighted \(\color{red}m\) and \(\color{red}k\), to show how this summation occurs and how \(Z\) ends up with \(n \times n\) elements. If we take the above result and combine it with \(\ifref{MO.1}\), we arrive at: \[ \begin{array}{rcll} tr(X Y) & = & \displaystyle\sum\limits_{i = 1}^{n} z_{i i} & \fref{MO.1} \\ tr(X Y) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{m} x_{i k} \cdot y_{k i} \ftag{MO.3} & \end{array} \]

This expression allows us to calculate the trace of a product directly. However, the real power of \(\ifref{MO.3}\) comes from noticing what occurs when we exchange summations: \[ \begin{array}{rcll} tr(X Y) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{m} x_{i k} \cdot y_{k i} & \fref{MO.3} \\ & = & \displaystyle\sum\limits_{k = 1}^{m} \sum\limits_{i = 1}^{n} x_{i k} \cdot y_{k i} & \\ & = & \displaystyle\sum\limits_{k = 1}^{m} \sum\limits_{i = 1}^{n} y_{k i} \cdot x_{i k} & \\ tr(X Y) & = & tr(Y X) \ftag{MO.4} \end{array} \]

This shows that, when calculating the trace of a product, we are allowed to interchange the two factors. When calculating \(tr(X Y)\), the inner summation over \(k\) represents the matrix multiplication and the outer summation over \(i\) represents the trace function. After interchanging the summations, the inner summation represents the matrix multiplication \(Y X\) and the outer summation the trace \(tr(Y X)\). Notice that this result is not at all obvious, since \(X Y\) is an \(n \times n\) matrix while \(Y X\) is an \(m \times m\) matrix.
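A quick numerical illustration of \(\ifref{MO.4}\) in R, with random rectangular matrices: \(X Y\) is \(3 \times 3\) while \(Y X\) is \(5 \times 5\), yet the two traces agree.

set.seed(2)
X <- matrix(rnorm(15), nrow = 3, ncol = 5)            # n x m with n = 3, m = 5
Y <- matrix(rnorm(15), nrow = 5, ncol = 3)            # m x n
tr <- function(M) sum(diag(M))
c(tr_XY = tr(X %*% Y), tr_YX = tr(Y %*% X))           # equal, as (MO.4) claims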

Does \(\ifref{MO.4}\) generalize to a product of more than two matrices? Consider, for example, the following three matrices and their product:

  • \(A\), an \(n \times p\) matrix with elements \(a_{i j}\)
  • \(B\), a \(p \times q\) matrix with elements \(b_{i j}\)
  • \(C\), a \(q \times n\) matrix with elements \(c_{i j}\)
  • \(D = A B C\) with matrix elements \(d_{i j}\)

As before, we defined the three matrices above so that their product \(A B C = D\) will be a square \(n \times n\) matrix: \[ \begin{array}{ccccc} A & B & C & = & D \\ n \times p & p \times q & q \times n & & n \times n \end{array} \]

Notice that we immediately arrive at the conclusion that \(\ifref{MO.4}\) does not generalize in a pairwise sense. For example, if we interchange \(A\) and \(B\) in our product then it is not true that \(tr(A B C) = tr(B A C)\) for the simple reason that \(B A C\) is undefined: \[ \begin{array}{ccccc} B & A & C & = & ??? \\ p \times q & n \times p & q \times n & & ??? \end{array} \]

Once again, we go back to the definition of matrix multiplication to get the elements of \(A B C = D\): \[ d_{i j} = \sum\limits_{k = 1}^{q} \left( \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \right) \cdot c_{k j} = \sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \cdot c_{k j} \]

Here, notice how we sum over the inside dimensions \(q \leftrightarrow k\) and \(p \leftrightarrow l\) to leave just the final \(n \times n\) terms. To get \(tr(D)\) we pile on another summation, this time over \(n\): \[ \begin{array}{rcll} tr(D) & = & \displaystyle\sum\limits_{i = 1}^{n} d_{i i} & \fref{MO.1} \\ tr(D) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \cdot c_{k i} \end{array} \]

Can we start interchanging summations here as well? Yes, but we have to be careful. For example, if we interchange just the summations over \(n \leftrightarrow i\) and \(q \leftrightarrow k\), the resulting expression no longer reads as the trace of a valid matrix product. This is equivalent to the failed attempt to switch \(ABC\) to \(BAC\) above. Instead, we must cycle the summations so that adjacent dimensions always match up: \[ \begin{array}{rcccl} tr(D) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \cdot c_{k i} & = & tr(A B C) \\ tr(D) & = & \displaystyle\sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} \sum\limits_{i = 1}^{n} b_{l k} \cdot c_{k i} \cdot a_{i l} & = & tr(B C A) \\ tr(D) & = & \displaystyle\sum\limits_{l = 1}^{p} \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{q} c_{k i} \cdot a_{i l} \cdot b_{l k} & = & tr(C A B) \end{array} \]

The above example with matrices \(A\), \(B\), and \(C\) shows that the trace operator is invariant only under cyclic permutations. In general, if we have a sequence of \(K\) matrices \(A_k\) such that \(A_1 A_2 \cdots A_K\) is defined and square, we can write: \[ tr(A_1 A_2 \cdots A_K) = tr(A_2 A_3 \cdots A_K A_1) = tr(A_3 A_4 \cdots A_K A_1 A_2) = \dots = tr(A_K A_1 \cdots A_{K-1}) \ftag{MO.5} \]
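The cyclic property \(\ifref{MO.5}\) is also easy to check numerically in R. The dimensions below mirror the \(A\), \(B\), \(C\) example above; note that a non-cyclic swap such as \(B A C\) would not even be a valid product here.

set.seed(3)
A <- matrix(rnorm(3 * 4), 3, 4)                       # n x p
B <- matrix(rnorm(4 * 2), 4, 2)                       # p x q
C <- matrix(rnorm(2 * 3), 2, 3)                       # q x n
tr <- function(M) sum(diag(M))
c(ABC = tr(A %*% B %*% C), BCA = tr(B %*% C %*% A), CAB = tr(C %*% A %*% B))   # all three agree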

Problems

Introduction

This is the problems section.
Click on the tabs above to view the solution to a problem.

Ex 2.1

Suppose each of the \(K\) classes has an associated target \(\mathbf{t}_k\), which is a vector of all zeros, except a one in the \(k_{th}\) position. Show that classifying to the largest element of \(\mathbf{\hat{y}}\) amounts to choosing the closest target \(\argmin_k ||\mathbf{t}_k - \mathbf{\hat{y}}||\), if the elements of \(\mathbf{\hat{y}}\) sum to one.

Note:
The original problem, as written in the book, used \(t_k\) and \(\hat{y}\), not \(\mathbf{t}_k\) and \(\mathbf{\hat{y}}\). Throughout this problem, we will use \(\mathbf{boldface}\) to represent vectors, so I changed this for consistency.

Solution:
The authors seem to have left out some context here, making the question unnecessarily vague. We can still technically answer the question with only the information provided, but that would be uninteresting. We’ll provide our own context for the question, so that we can better understand why we would want to answer it in the first place.

We imagine the following situation:
An experiment or simulation yielded \(n\) datapoints of the form \((\mathbf{x}_i, y_i)\), where each \(\mathbf{x}_i\) is a \(p\)-dimensional column vector of input features (we have \(p\) features), and each \(y_i \in \{\;1, 2, \dots, K\;\}\) is a class label. Putting all of our data together into a compact form, we have: \[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \text{ and } X = \begin{bmatrix} \mathbf{x}_1^{T} \\ \mathbf{x}_2^{T} \\ \vdots \\ \mathbf{x}_n^{T} \end{bmatrix} \]

We want to use a supervised learning algorithm on this data so that we can predict new class labels \(\hat{y}\) given new input data. To make sure that our chosen algorithm runs effectively and is not biased for/against any class label, we choose to encode our data using a symmetric scheme called one-hot encoding. This method takes a class label \(k\) (a number between \(1\) and \(K\)) and turns it into a \(K\)-dimensional vector with all \(0\)s except for the \(k_{th}\) location, which has a \(1\). For example, if we have a datapoint where \(y_i = 2\), we can encode it as: \[ y_i = 2 \;\leftrightarrow\; \mathbf{y}_i = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]

Thus, our class label data is now an \(n \times K\) matrix and looks like: \[ Y = \begin{bmatrix} \mathbf{y}_1^{T} \\ \mathbf{y}_2^{T} \\ \vdots \\ \mathbf{y}_n^{T} \end{bmatrix} \]
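As an aside, one-hot encoding is a one-liner in R. The label vector below is a made-up example with \(K = 3\) classes.

y <- c(2, 1, 3, 2)                                    # example class labels
K <- 3
Y <- diag(K)[y, , drop = FALSE]                       # row i is the one-hot encoding of y_i
Y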

After we train our algorithm on this data, any new predictions made by the trained model will have the form of probability vectors: \[ \mathbf{\hat{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_K \end{bmatrix} \]

Here, the \(k_{th}\) component of \(\mathbf{\hat{y}}\) represents the probability/belief that the correct class label is \(k\). For example, if we have \(K = 3\) total class labels, then a possible prediction vector is: \[ \mathbf{\hat{y}} = \begin{bmatrix} 0.80 \\ 0.05 \\ 0.15 \end{bmatrix} \]

This output states that the probability that \(k = 1\) is the correct class label is \(80\%\), the probability that \(k = 2\) is the correct class label is \(5\%\), and the probability that the correct class label is \(k = 3\) is \(15\%\). Notice that the sum of all elements in \(\mathbf{\hat{y}}\) is \(1\) (since they’re probabilities), just as the authors mentioned in the problem statement.

When we choose a class label from \(\mathbf{\hat{y}}\), we want to choose \(k\) such that \(\hat{y}_k\) is as large as possible. In the example above we would choose \(k = 1\) because \(0.80 = \hat{y}_1 > \hat{y}_3 > \hat{y}_2\).

Now, onto the main question. The problem statement says that choosing the largest element of \(\mathbf{\hat{y}}\) is equivalent to choosing the \(k\) that minimizes the distance between \(\mathbf{\hat{y}}\) and \(\mathbf{t}_k\). We can see how this works by following our example with \(K = 3\) class labels. In this case, we have three \(\mathbf{t}_k\) vectors and three squared distances to calculate (minimizing the distance is the same as minimizing its square): \[ \begin{array}{rcccc} k = 1 & : & \mathbf{t}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} & \implies & ||\;\mathbf{t}_1 - \mathbf{\hat{y}}\;||^2 = 0.065 \\ k = 2 & : & \mathbf{t}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} & \implies & ||\;\mathbf{t}_2 - \mathbf{\hat{y}}\;||^2 = 1.565 \\ k = 3 & : & \mathbf{t}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} & \implies & ||\;\mathbf{t}_3 - \mathbf{\hat{y}}\;||^2 = 1.365 \end{array} \]

Hence, once again we would choose \(k = 1\) because \(\mathbf{t}_1\) is closest to our prediction vector \(\mathbf{\hat{y}}\).

Proving this equivalence in general amounts to expanding both vectors and working with squared distances via the dot product: \[ \begin{array}{rcl} \displaystyle\argmin_k ||\;\mathbf{t}_k - \mathbf{\hat{y}}\;|| & = & \displaystyle\argmin_k \; (\mathbf{t}_k - \mathbf{\hat{y}})^{T}(\mathbf{t}_k - \mathbf{\hat{y}}) \\ & = & \displaystyle\argmin_k \; (\mathbf{t}_k^{T} - \mathbf{\hat{y}}^{T})(\mathbf{t}_k - \mathbf{\hat{y}}) \\ & = & \displaystyle\argmin_k \; (\mathbf{t}_k^{T}\mathbf{t}_k - \mathbf{t}_k^{T}\mathbf{\hat{y}} - \mathbf{\hat{y}}^{T}\mathbf{t}_k + \mathbf{\hat{y}}^{T}\mathbf{\hat{y}}) \\ & = & \displaystyle\argmin_k \; (1 - \hat{y}_k - \hat{y}_k + C) \\ & = & \displaystyle\argmin_k \; (-2\hat{y}_k) \\ \displaystyle\argmin_k ||\;\mathbf{t}_k - \mathbf{\hat{y}}\;|| & = & \displaystyle\argmax_k \; \hat{y}_k \end{array} \]

In the above derivation, we used the following identities to simplify our expression:

  • The squared length of \(\mathbf{t}_k\), calculated as \(\mathbf{t}_k^{T}\mathbf{t}_k\), must be \(1\) since this vector only has one non-zero element (which is \(1\)). Since this is a constant, it does not contribute to the minimization over \(k\), so we get rid of it in a later step.
  • The two quantities \(\mathbf{t}_k^{T}\mathbf{\hat{y}}\) and \(\mathbf{\hat{y}}^{T}\mathbf{t}_k\) are equal and pick out the \(k_{th}\) element of \(\mathbf{\hat{y}}\): \[ \begin{array}{rccccl} \mathbf{t}_k^{T}\mathbf{\hat{y}} & = & \begin{bmatrix} 0 & \cdots & 1 & \cdots & 0 \end{bmatrix} & \begin{bmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_k \\ \vdots \\ \hat{y}_K \end{bmatrix} & & = & \hat{y}_k \\ \mathbf{\hat{y}}^{T}\mathbf{t}_k & = & \begin{bmatrix} \hat{y}_1 & \cdots & \hat{y}_k & \cdots & \hat{y}_K \end{bmatrix} & \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} & & = & \hat{y}_k \end{array} \]
  • The squared length of \(\mathbf{\hat{y}}\), calculated as \(\mathbf{\hat{y}}^{T}\mathbf{\hat{y}}\), is some unknown constant \(C\). The important aspect here is that it does not depend on \(k\), so it does not contribute to the minimization.

Thus, we’ve shown that choosing \(k\) so that we pick the largest element of \(\mathbf{\hat{y}}\) is the same as choosing \(k\) so that we minimize the distance between \(\mathbf{t}_k\) and \(\mathbf{\hat{y}}\).
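We can also check this equivalence numerically in R, using the \(K = 3\) example prediction vector from above.

y_hat   <- c(0.80, 0.05, 0.15)
targets <- diag(3)                                    # column k is the target vector t_k
sq_dist <- colSums((targets - y_hat)^2)               # ||t_k - y_hat||^2 for each k
round(sq_dist, 3)                                     # 0.065 1.565 1.365, matching the example above
c(argmax_yhat = which.max(y_hat), argmin_dist = which.min(sq_dist))   # both pick k = 1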

Ex 2.2

Show how to compute the Bayes decision boundary for the simulation example of Figure 2.5.

Solution:
According to the authors, the “oracle” which generates the data in Figure 2.5 works as follows:

First we generated \(10\) means \(m_k\) from a bivariate Gaussian distribution \(\mathcal{N}([1,0]^T,I_2)\) and labeled this class \(\color{b}{\mathrm{BLUE}}\). Similarly, \(10\) more were drawn from \(\mathcal{N}([0,1]^T,I_2)\) and labeled class \(\color{o}{\mathrm{ORANGE}}\). Then for each class we generated \(100\) observations as follows: for each observation, we picked an \(m_k\) at random with probability \(1/10\), and then generated a \(\mathcal{N}(m_k,I_2/5)\), thus leading to a mixture of Gaussian clusters for each class.

Before starting, let’s do some housekeeping.

We’ll define the \(10\) means from the top Gaussian distribution \(\mathcal{N}([1,0]^T,I_2)\) as the sample \({\color{b}\mathcal{P}} = \{\;{\color{b}p_1}, {\color{b}p_2}, \dots, {\color{b}p_{10}}\;\}\) and the \(10\) means from the bottom Gaussian distribution \(\mathcal{N}([0, 1]^T,I_2)\) as the sample \({\color{o}\mathcal{Q}} = \{\;{\color{o}q_1}, {\color{o}q_2}, \dots, {\color{o}q_{10}}\;\}\).

We can now define the conditional probability distributions for the \(G = \color{b}\mathrm{BLUE}\) and \(G = \color{o}\mathrm{ORANGE}\) class labels. At each value of \(X = \mathbf{x}\), the probability of each class label is proportional to a simple sum of \(10\) Gaussians, so, using the density of a multivariate normal distribution with covariance \(I_2/5\), we have: \[ \begin{array}{rcl} \mathbf{P}(G = {\color{b}\mathrm{BLUE}} \mid X = \mathbf{x}) & = & K \displaystyle\sum_{{\color{b}p_i} \in {\color{b}\mathcal{P}}}\exp\left( -\dfrac{5}{2} (\mathbf{x} - {\color{b}p_i})^{T} (\mathbf{x} - {\color{b}p_i}) \right) \\ \mathbf{P}(G = {\color{o}\mathrm{ORANGE}} \mid X = \mathbf{x}) & = & K \displaystyle\sum_{{\color{o}q_i} \in {\color{o}\mathcal{Q}}}\exp\left( -\dfrac{5}{2} (\mathbf{x} - {\color{o}q_i})^{T} (\mathbf{x} - {\color{o}q_i}) \right) \\ \end{array} \]

The above statements assert that, at any given \(X = \mathbf{x}\), the probability of each label (\(G\)) is proportional to the sum of the Gaussians centered at either \(\color{b}\mathcal{P}\) or \(\color{o}\mathcal{Q}\), with the same proportionality constant \(K\) for both classes (it may depend on \(\mathbf{x}\), but not on the class). We could derive \(K\) by requiring that \(\mathbf{P}(G = {\color{b}\mathrm{BLUE}} \mid X = \mathbf{x}) + \mathbf{P}(G = {\color{o}\mathrm{ORANGE}} \mid X = \mathbf{x}) = 1\), but this is unnecessary since \(K\) cancels in the following steps.

The Bayes classifier always chooses the most likely value of \(G\) when making new predictions. Thus: \[ \hat{g} = \argmax_{g \in \{{\color{b}\mathrm{BLUE}}, {\color{o}\mathrm{ORANGE}}\}} \mathbf{P}(G = g \mid X = \mathbf{x}) \]

From this, it’s easy to see that the boundary occurs exactly at the values of \(X = \mathbf{x}\) where: \[ \mathbf{P}(G = {\color{b}\mathrm{BLUE}} \mid X = \mathbf{x}) = \mathbf{P}(G = {\color{o}\mathrm{ORANGE}} \mid X = \mathbf{x}) = \dfrac{1}{2} \]

Putting everything together, we calculate the Bayes boundary as the solution for \(X = \mathbf{x} \in \mathbb{R}^2\) to the equation: \[ \sum_{{\color{b}p_i} \in {\color{b}\mathcal{P}}}\exp\left( -\dfrac{5}{2} (\mathbf{x} - {\color{b}p_i})^{T} (\mathbf{x} - {\color{b}p_i}) \right) = \sum_{{\color{o}q_i} \in {\color{o}\mathcal{Q}}}\exp\left( -\dfrac{5}{2} (\mathbf{x} - {\color{o}q_i})^{T} (\mathbf{x} - {\color{o}q_i}) \right) \]

Once all the means in \(\color{b}\mathcal{P}\) and \(\color{o}\mathcal{Q}\) are known, we can compute the above with commonly available numerical packages.
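For concreteness, here is one way such a computation might look in R. Since the actual means behind Figure 2.5 are not reproduced here, the snippet simply draws its own means with the oracle’s recipe, so the resulting curve is illustrative rather than a reproduction of the book’s figure.

set.seed(42)
blue_means   <- cbind(rnorm(10, mean = 1), rnorm(10, mean = 0))   # 10 draws from N([1,0]^T, I_2)
orange_means <- cbind(rnorm(10, mean = 0), rnorm(10, mean = 1))   # 10 draws from N([0,1]^T, I_2)
mixture <- function(x1, x2, means) {
  # sum of the unnormalized Gaussian bumps exp(-5/2 * ||x - m_k||^2)
  rowSums(sapply(seq_len(nrow(means)), function(k) {
    exp(-2.5 * ((x1 - means[k, 1])^2 + (x2 - means[k, 2])^2))
  }))
}
x1_seq <- seq(-3, 4, length.out = 200)
x2_seq <- seq(-3, 4, length.out = 200)
grid <- expand.grid(x1 = x1_seq, x2 = x2_seq)
diff_surface <- mixture(grid$x1, grid$x2, blue_means) - mixture(grid$x1, grid$x2, orange_means)
# The Bayes boundary is the zero contour of the difference between the two sums.
contour(x1_seq, x2_seq, matrix(diff_surface, nrow = length(x1_seq)), levels = 0, drawlabels = FALSE)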

Ex 2.3

Derive equation (2.24).

Note:
As stated in the text, equation \(\ifref{2.24}\) looks like: \[ d(p, N) = \left( 1 - \dfrac{1}{2}^{1/N} \right)^{1/p} \ftag{2.24} \]

The function given in \(\ifref{2.24}\) considers \(N\) datapoints uniformly distributed in a \(p\)-dimensional unit ball centered at the origin and outputs the median distance from the origin to the closest datapoint.

Solution:
To help us derive the above equation, we’ll make a few simple definitions. This will organize what we know and allow us to be more precise in our derivation.

  • First, we’ll formally define the hypersphere \(B\) we’ll be working with. \[ B = \{ \mathbf{x} \in \mathbb{R}^p \mid \lVert \mathbf{x} \rVert < 1 \} \]
  • Second, we’ll note that the volume of a hypersphere in \(p\)-dimensional space is always proportional to \(r^p\), where \(r\) is the radius of the hypersphere. Specifically, we note that we can write the volume \(V_p\) of a \(p\)-dimensional ball as: \[ V_p = K_p \cdot r^{p} \] The constant \(K_p\) only depends on the dimension \(p\). This, for our purposes, remains fixed throughout the discussion. Familiar examples of \(K_p\) are \(K_2 = \pi\) for a circle and \(K_3 = 4\pi/3\) for a sphere. Since \(B\) has radius \(r = 1\), its volume is just \(V_p = K_p \cdot r^p = K_p\).
  • Next, we define the random vector \(X_i\), which represents the position of a single datapoint we might sample out of \(B\). Since this distribution is uniform we can write the probability density function of each \(X_i\) as: \[ f_X(\mathbf{x}) = \begin{cases} 1/K_p & , & \mathbf{x} \in B \\ 0 & , & \text{otherwise} \end{cases} \]
  • Next, we define the distance variables \(D_i\). Let \(D_i = \lVert X_i \rVert\) be a random variable that represents the distance of a single datapoint from the origin. This notation allows us to represent the distances of all of our \(N\) datapoints as the set \(\mathcal{D} = \{ D_1, D_2, \dots , D_N \}\).
  • Finally, we define our variable of interest. Let \(M = \min(\mathcal{D})\) be a random variable which represents the distance of the closest point to the origin in our sample \(\mathcal{D}\).

With the above definitions in place, we now have a clear goal: we want to calculate the median value of \(M\) and show that it is equivalent to \(\ifref{2.24}\). We’ll start by calculating the probability distribution for a single \(D_i\). \[ \begin{array}{rcl} \mathbf{P}(D_i < d) & = & \mathbf{P}(\lVert X_i \rVert < d) \\ & = & \displaystyle\int_{\lVert \mathbf{x} \rVert < d} f_X({\mathbf{x}}) d\mathbf{x} \\ & = & \displaystyle\int_{\lVert \mathbf{x} \rVert < d} \dfrac{1}{K_p} d\mathbf{x} \\ \mathbf{P}(D_i < d) & = & \dfrac{1}{K_p} \displaystyle\int_{\lVert \mathbf{x} \rVert < d} d\mathbf{x} \\ \end{array} \]

Notice that the final integral is just the volume of a hypersphere with radius \(d\), so that we can continue as: \[ \begin{array}{rcl} \mathbf{P}(D_i < d) & = & \dfrac{1}{K_p} \displaystyle\int_{\lVert \mathbf{x} \rVert < d} d\mathbf{x} \\ & = & \dfrac{1}{K_p} \cdot K_p d^p \\ \mathbf{P}(D_i < d) & = & d^p \end{array} \]

Now, we turn our attention towards finding the distribution of \(M\). To do this, we note that the statement \(\min(\mathcal{D}) \geq d\) is equivalent to saying \(D_i \geq d\) for all \(i \leq N\). In other words, the closest datapoint lies at least \(d\) away from the origin exactly when every datapoint lies at least \(d\) away. Using this, together with the independence of the datapoints, we write: \[ \begin{array}{rcl} \mathbf{P}(M \geq d) & = & \mathbf{P}(\min(\mathcal{D}) \geq d) \\ & = & \mathbf{P}(D_i \geq d \text{ for all } i \leq N) \\ & = & \displaystyle\prod_{i = 1}^{N} \mathbf{P}(D_i \geq d) \\ & = & \displaystyle\prod_{i = 1}^{N} \left( 1 - \mathbf{P}(D_i < d) \right) \\ \mathbf{P}(M \geq d) & = & \left( 1 - d^p \right)^N \end{array} \]

Now that we have the distribution of \(M\), we just need to find the median \(d\), which satisfies \(\mathbf{P}(M \geq d) = 1/2\) (equivalently, \(\mathbf{P}(M < d) = 1/2\)). We do this below: \[ \begin{array}{rcl} \mathbf{P}(M \geq d) & = & \dfrac{1}{2} \\ \left( 1 - d^p \right)^N & = & \dfrac{1}{2} \\ 1 - d^p & = & \dfrac{1}{2}^{1/N} \\ d^p & = & 1 - \dfrac{1}{2}^{1/N} \\ d & = & \left( 1 - \dfrac{1}{2}^{1/N} \right)^{1/p} \end{array} \]

This completes our proof.
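As a final sanity check (not required for the proof), we can simulate the setup in R and compare the empirical median of the closest distance with \(\ifref{2.24}\). The values of \(p\), \(N\), and the number of repetitions below are arbitrary choices.

set.seed(7)
p <- 10; N <- 500; n_rep <- 2000
closest <- replicate(n_rep, {
  z <- matrix(rnorm(N * p), N, p)
  u <- z / sqrt(rowSums(z^2))                         # uniform directions on the unit sphere
  x <- u * runif(N)^(1 / p)                           # scaling by U^(1/p) makes the points uniform in the ball
  min(sqrt(rowSums(x^2)))                             # distance of the closest point to the origin
})
c(simulated_median = median(closest), formula = (1 - 0.5^(1 / N))^(1 / p))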

Ex 2.4

The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. Consider inputs drawn from a spherical multinormal distribution \(X \sim \mathcal{N}(0,I_p)\). The squared distance from any sample point to the origin has a \(\chi^2_p\) distribution with mean \(p\). Consider a prediction point \(x_0\) drawn from this distribution, and let \(a = x_0 / \lVert x_0 \rVert\) be an associated unit vector. Let \(z_i = a^{T}x_i\) be the projection of each of the training points on this direction.
Show that the \(z_i\) are distributed \(\mathcal{N}(0,1)\) with expected squared distance from the origin \(1\), while the target point has expected squared distance \(p\) from the origin.
Hence for \(p = 10\), a randomly drawn test point is about \(3.1\) standard deviations from the origin, while all the training points are on average one standard deviation along direction \(a\). So most prediction points see themselves as lying on the edge of the training set.

Solution:
Placeholder