Linear Algebra

Matrix

Symmetric Matrices

A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^T$. It is anti-symmetric if $A = -A^T$. It is easy to show that for any matrix $A \in \mathbb{R}^{n \times n}$, the matrix $A + A^T$ is symmetric and the matrix $A - A^T$ is anti-symmetric. From this it follows that any square matrix $A \in \mathbb{R}^{n \times n}$ can be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since

$$A = \tfrac{1}{2}(A + A^T) + \tfrac{1}{2}(A - A^T)$$

and the first matrix on the right is symmetric, while the second is anti-symmetric. It is common to denote the set of all symmetric matrices of size $n$ as $\mathbb{S}^n$, so that $A \in \mathbb{S}^n$ means that $A$ is a symmetric $n \times n$ matrix.
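
As a quick numerical illustration (a small NumPy sketch, using a random matrix for concreteness), the decomposition can be checked directly:

```python
import numpy as np

# Decompose a random square matrix into symmetric and anti-symmetric parts.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

sym = 0.5 * (A + A.T)        # symmetric part
antisym = 0.5 * (A - A.T)    # anti-symmetric part

assert np.allclose(sym, sym.T)           # symmetric: S = S^T
assert np.allclose(antisym, -antisym.T)  # anti-symmetric: T = -T^T
assert np.allclose(A, sym + antisym)     # A is recovered as the sum
```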

The Trace

The trace of a square matrix $A \in \mathbb{R}^{n \times n}$, denoted $\mathrm{tr}(A)$ (or just $\mathrm{tr}A$), is the sum of diagonal elements in the matrix:

$$\mathrm{tr}A = \sum_{i=1}^{n} A_{ii}.$$
The trace has the following properties:

  • For $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}A = \mathrm{tr}A^T$.
  • For $A, B \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}A + \mathrm{tr}B$.
  • For $A \in \mathbb{R}^{n \times n}$, $t \in \mathbb{R}$, $\mathrm{tr}(tA) = t\,\mathrm{tr}A$.
  • For $A, B$ such that $AB$ is square, $\mathrm{tr}AB = \mathrm{tr}BA$.
  • For $A, B, C$ such that $ABC$ is square, $\mathrm{tr}ABC = \mathrm{tr}BCA = \mathrm{tr}CAB$, and so on for the product of more matrices.
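
These identities are easy to spot-check numerically; the following NumPy sketch (illustrative only) verifies each one on random matrices:

```python
import numpy as np

# Numerical spot-check of the trace properties listed above.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))
t = 2.5

assert np.isclose(np.trace(A), np.trace(A.T))                  # tr A = tr A^T
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))  # tr(A+B) = tr A + tr B
assert np.isclose(np.trace(t * A), t * np.trace(A))            # tr(tA) = t tr A
assert np.isclose(np.trace(A @ B), np.trace(B @ A))            # tr AB = tr BA
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))    # cyclic invariance
```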

Norms

A norm of a vector $\|x\|$ is informally a measure of the "length" of the vector. For example, we have the commonly used Euclidean or $\ell_2$ norm,

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$$

Note that $\|x\|_2^2 = x^T x$.
More formally, a norm is any function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies 4 properties:

  • non-negativity: For all $x \in \mathbb{R}^n$, $f(x) \ge 0$.
  • definiteness: $f(x) = 0$ if and only if $x = 0$.
  • homogeneity: For all $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = |t| f(x)$.
  • triangle inequality: For all $x, y \in \mathbb{R}^n$, $f(x + y) \le f(x) + f(y)$.

Other examples of norms are the $\ell_1$ norm,

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

and the $\ell_\infty$ norm,

$$\|x\|_\infty = \max_i |x_i|.$$

In fact, all three norms presented so far are examples of the family of $\ell_p$ norms, which are parameterized by a real number $p \ge 1$, and defined as

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

Norms can also be defined for matrices, such as the Frobenius norm,

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}.$$
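
The following NumPy sketch (illustrative only) computes these vector norms and the Frobenius norm directly from the definitions, and checks them against `numpy.linalg.norm`:

```python
import numpy as np

# l1, l2, l-infinity and general l-p norms of a vector, and the Frobenius
# norm of a matrix, computed from the definitions above.
x = np.array([3.0, -4.0, 1.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l1   = np.sum(np.abs(x))                   # ||x||_1
l2   = np.sqrt(np.sum(x ** 2))             # ||x||_2
linf = np.max(np.abs(x))                   # ||x||_inf
p = 3
lp = np.sum(np.abs(x) ** p) ** (1.0 / p)   # ||x||_p

fro = np.sqrt(np.trace(A.T @ A))           # Frobenius norm via sqrt(tr(A^T A))

# The same values via numpy.linalg.norm:
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
assert np.isclose(fro, np.linalg.norm(A, 'fro'))
```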

Linear Independence and Rank

A set of vectors $\{x_1, x_2, \ldots, x_n\}$ is said to be linearly independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, a vector which can be represented as a linear combination of the remaining vectors is said to be linearly dependent. For example, if

$$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$$

for some $\alpha_1, \ldots, \alpha_{n-1} \in \mathbb{R}$, then $x_n$ is dependent on $\{x_1, \ldots, x_{n-1}\}$; otherwise, it is independent of $\{x_1, \ldots, x_{n-1}\}$.

The column rank of a matrix $A \in \mathbb{R}^{m \times n}$ is the largest number of columns of $A$ that constitute a linearly independent set. In the same way, the row rank is the largest number of rows of $A$ that constitute a linearly independent set.

It is a basic fact of linear algebra that for any matrix $A \in \mathbb{R}^{m \times n}$ the column rank equals the row rank, and so this quantity is simply referred to as the rank of $A$, denoted as $\mathrm{rank}(A)$. The following are some basic properties of the rank:

  • For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \le \min(m, n)$. If $\mathrm{rank}(A) = \min(m, n)$, then $A$ is said to be full rank.
  • For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) = \mathrm{rank}(A^T)$.
  • For $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$.
  • For $A, B \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B)$.
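
A small NumPy sketch checking these rank properties on random matrices (using `numpy.linalg.matrix_rank`):

```python
import numpy as np

# Illustration of the rank properties above on random matrices.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 3))

rA, rB = np.linalg.matrix_rank(A), np.linalg.matrix_rank(B)

assert rA <= min(A.shape)                           # rank(A) <= min(m, n)
assert rA == np.linalg.matrix_rank(A.T)             # rank(A) = rank(A^T)
assert np.linalg.matrix_rank(A @ B) <= min(rA, rB)  # rank(AB) <= min(rank A, rank B)
assert np.linalg.matrix_rank(A + C) <= rA + np.linalg.matrix_rank(C)  # subadditivity
```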

The Inverse

The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is denoted $A^{-1}$, and is the unique matrix such that

$$A^{-1} A = I = A A^{-1}.$$

It turns out that $A^{-1}$ may not exist for some matrices $A$; we say $A$ is invertible or non-singular if $A^{-1}$ exists and non-invertible or singular otherwise. One condition for invertibility we already know is that $A^{-1}$ exists if and only if $A$ is full rank.

The following are properties of the inverse; all assume that $A, B \in \mathbb{R}^{n \times n}$ are non-singular:

  • If $Ax = b$, we can multiply by $A^{-1}$ on both sides to obtain $x = A^{-1}b$.
  • $(A^{-1})^T = (A^T)^{-1}$. For this reason this matrix is often denoted $A^{-T}$.

Orthonormal Matrices

Two vectors $x, y \in \mathbb{R}^n$ are orthogonal if $x^T y = 0$. A vector $x \in \mathbb{R}^n$ is normalized if $\|x\|_2 = 1$. A square matrix $U \in \mathbb{R}^{n \times n}$ is orthonormal if all its columns are orthogonal to each other and are normalized.

It follows immediately from the definition of orthogonality and normality that

$$U^T U = I = U U^T.$$

In other words, the inverse of an orthonormal matrix is its transpose.

Another nice property of orthonormal matrices is that operating on a vector with an orthonormal matrix will not change its Euclidean norm, i.e.,

$$\|U x\|_2 = \|x\|_2$$

for any $x \in \mathbb{R}^n$ and orthonormal $U \in \mathbb{R}^{n \times n}$.
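
As an illustration, the following NumPy sketch builds an orthonormal matrix from the QR decomposition of a random matrix and checks both properties:

```python
import numpy as np

# The Q factor of a QR decomposition is orthonormal; check U^T U = I
# and that multiplication by U preserves the Euclidean norm.
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
x = rng.standard_normal(4)

assert np.allclose(U.T @ U, np.eye(4))                        # U^T U = I
assert np.allclose(U @ U.T, np.eye(4))                        # U U^T = I (square case)
assert np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x))   # ||Ux||_2 = ||x||_2
```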

Columnspace and Nullspace of a Matrix

The span of a set of vectors $\{x_1, x_2, \ldots, x_n\}$ is the set of all vectors that can be expressed as a linear combination of $\{x_1, \ldots, x_n\}$. That is,

$$\mathrm{span}(\{x_1, \ldots, x_n\}) = \left\{ v : v = \sum_{i=1}^{n} \alpha_i x_i, \ \alpha_i \in \mathbb{R} \right\}.$$

It can be shown that if $\{x_1, \ldots, x_n\}$ is a set of $n$ linearly independent vectors, where each $x_i \in \mathbb{R}^n$, then $\mathrm{span}(\{x_1, \ldots, x_n\}) = \mathbb{R}^n$. In other words, any vector $v \in \mathbb{R}^n$ can be written as a linear combination of $x_1$ through $x_n$.

The projection of a vector $y \in \mathbb{R}^m$ onto the span of $\{x_1, \ldots, x_n\}$ (here we assume $x_i \in \mathbb{R}^m$) is the vector $v \in \mathrm{span}(\{x_1, \ldots, x_n\})$ such that $v$ is as close as possible to $y$, as measured by the Euclidean norm $\|v - y\|_2$. We denote the projection as $\mathrm{Proj}(y; \{x_1, \ldots, x_n\})$ and can define it formally as

$$\mathrm{Proj}(y; \{x_1, \ldots, x_n\}) = \operatorname*{argmin}_{v \in \mathrm{span}(\{x_1, \ldots, x_n\})} \|y - v\|_2.$$
The columnspace of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$, is the span of the columns of $A$. In other words,

$$\mathcal{R}(A) = \{ v \in \mathbb{R}^m : v = Ax, \ x \in \mathbb{R}^n \}.$$
Making a few technical assumptions (namely that $A$ is full rank and that $n < m$), the projection of a vector $y \in \mathbb{R}^m$ onto the columnspace of $A$ is given by

$$\mathrm{Proj}(y; A) = \operatorname*{argmin}_{v \in \mathcal{R}(A)} \|v - y\|_2 = A (A^T A)^{-1} A^T y.$$

When $A$ contains only a single column, $a \in \mathbb{R}^m$, this gives the special case for a projection of a vector onto a line:

$$\mathrm{Proj}(y; a) = \frac{a a^T}{a^T a} y.$$
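
The projection formulas above are straightforward to verify numerically; in the following NumPy sketch the residual of each projection is checked to be orthogonal to the subspace projected onto:

```python
import numpy as np

# Projection of y onto the columnspace of a full-rank A (m > n), and the
# special single-column case, following the formulas above.
rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))    # full rank with probability 1
y = rng.standard_normal(6)
a = rng.standard_normal(6)

proj_col = A @ np.linalg.inv(A.T @ A) @ A.T @ y   # A (A^T A)^{-1} A^T y
proj_line = (a @ y) / (a @ a) * a                 # (a a^T / a^T a) y

# The residual of a projection is orthogonal to the subspace projected onto.
assert np.allclose(A.T @ (y - proj_col), 0)
assert np.isclose(a @ (y - proj_line), 0)
```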

The nullspace of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{N}(A)$, is the set of all vectors that equal $0$ when multiplied by $A$, i.e.,

$$\mathcal{N}(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}.$$

Note that vectors in $\mathcal{N}(A)$ are of size $n$, while vectors in $\mathcal{R}(A)$ are of size $m$, so vectors in $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can say much more. It turns out that

$$\left\{ w : w = u + v, \ u \in \mathcal{R}(A^T), \ v \in \mathcal{N}(A) \right\} = \mathbb{R}^n \quad \text{and} \quad \mathcal{R}(A^T) \cap \mathcal{N}(A) = \{0\}.$$

In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are disjoint subsets that together span the entire space of $\mathbb{R}^n$. Sets of this type are called orthogonal complements, and we denote this $\mathcal{R}(A^T) = \mathcal{N}(A)^{\perp}$.

The Determinant

The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $\det : \mathbb{R}^{n \times n} \to \mathbb{R}$, and is denoted $|A|$ or $\det A$.

The following are properties of the determinant:

  • The determinant of the identity is 1, $|I| = 1$.
  • Given a matrix $A \in \mathbb{R}^{n \times n}$, if we multiply a single row in $A$ by a scalar $t \in \mathbb{R}$, then the determinant of the new matrix is $t|A|$.
  • If we exchange any two rows $a_i^T$ and $a_j^T$ of $A$, then the determinant of the new matrix is $-|A|$.
  • For $A \in \mathbb{R}^{n \times n}$, $|A| = |A^T|$.
  • For $A, B \in \mathbb{R}^{n \times n}$, $|AB| = |A||B|$.
  • For $A \in \mathbb{R}^{n \times n}$, $|A| = 0$ if and only if $A$ is singular (i.e., non-invertible).
  • For $A \in \mathbb{R}^{n \times n}$ and $A$ non-singular, $|A^{-1}| = 1/|A|$.

Before giving the general definition for the determinant, we define, for $A \in \mathbb{R}^{n \times n}$, the matrix $A_{\setminus i, \setminus j} \in \mathbb{R}^{(n-1) \times (n-1)}$ to be the matrix that results from deleting the $i$th row and $j$th column from $A$. The general recursive formula for the determinant is

$$|A| = \sum_{i=1}^{n} (-1)^{i+j} A_{ij} |A_{\setminus i, \setminus j}| \quad (\text{for any } j \in 1, \ldots, n) = \sum_{j=1}^{n} (-1)^{i+j} A_{ij} |A_{\setminus i, \setminus j}| \quad (\text{for any } i \in 1, \ldots, n).$$
The classical adjoint of a matrix $A \in \mathbb{R}^{n \times n}$ is denoted $\mathrm{adj}(A)$, and defined as

$$\mathrm{adj}(A) \in \mathbb{R}^{n \times n}, \qquad (\mathrm{adj}(A))_{ij} = (-1)^{i+j} |A_{\setminus j, \setminus i}|$$

(note the switch in the indices $A_{\setminus j, \setminus i}$). It can be shown that for any nonsingular $A \in \mathbb{R}^{n \times n}$,

$$A^{-1} = \frac{1}{|A|} \mathrm{adj}(A).$$
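
For illustration, here is a direct (and deliberately inefficient, $O(n!)$) sketch of the recursive cofactor expansion and the adjoint formula for the inverse; in practice one would use `numpy.linalg.det` and `numpy.linalg.inv`:

```python
import numpy as np

def det(A: np.ndarray) -> float:
    """Determinant via cofactor expansion along the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor: A with row 0 and column j removed.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * det(minor)
    return total

def adjoint(A: np.ndarray) -> np.ndarray:
    """Classical adjoint: (adj A)_{ij} = (-1)^{i+j} |A with row j, col i deleted|."""
    n = A.shape[0]
    adj = np.empty_like(A, dtype=float)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, j, axis=0), i, axis=1)  # note the index switch
            adj[i, j] = (-1) ** (i + j) * det(minor)
    return adj

A = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 1.0, 2.0]])
assert np.isclose(det(A), np.linalg.det(A))
assert np.allclose(adjoint(A) / det(A), np.linalg.inv(A))   # A^{-1} = adj(A)/|A|
```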

Quadratic Forms and Positive Semidefinite Matrices

Given a square matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x \in \mathbb{R}^n$, the scalar value $x^T A x$ is called a quadratic form. Written explicitly, we see that

$$x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j.$$

Note that

$$x^T A x = (x^T A x)^T = x^T A^T x = x^T \left( \tfrac{1}{2} A + \tfrac{1}{2} A^T \right) x,$$

where the first equality follows because the transpose of a scalar is equal to itself; i.e., only the symmetric part of $A$ contributes to the quadratic form. For this reason, we often implicitly assume that the matrices appearing in a quadratic form are symmetric.

We give the following definitions:

  • A symmetric matrix $A \in \mathbb{S}^n$ is positive definite (PD) if for all non-zero vectors $x \in \mathbb{R}^n$, $x^T A x > 0$. This is usually denoted $A \succ 0$ (or just $A > 0$), and often times the set of all positive definite matrices is denoted $\mathbb{S}^n_{++}$.

  • A symmetric matrix $A \in \mathbb{S}^n$ is positive semidefinite (PSD) if for all vectors $x \in \mathbb{R}^n$, $x^T A x \ge 0$. This is written $A \succeq 0$ (or just $A \ge 0$), and the set of all positive semidefinite matrices is often denoted $\mathbb{S}^n_+$.

  • Likewise, a symmetric matrix $A \in \mathbb{S}^n$ is negative definite (ND), denoted $A \prec 0$ (or just $A < 0$), if for all non-zero $x \in \mathbb{R}^n$, $x^T A x < 0$.

  • Similarly, a symmetric matrix $A \in \mathbb{S}^n$ is negative semidefinite (NSD), denoted $A \preceq 0$ (or just $A \le 0$), if for all $x \in \mathbb{R}^n$, $x^T A x \le 0$.

  • Finally, a symmetric matrix $A \in \mathbb{S}^n$ is indefinite if it is neither positive semidefinite nor negative semidefinite - i.e., if there exist $x_1, x_2 \in \mathbb{R}^n$ such that $x_1^T A x_1 > 0$ and $x_2^T A x_2 < 0$.

It should be obvious that if $A$ is positive definite, then $-A$ is negative definite and vice versa. Likewise, if $A$ is positive semidefinite then $-A$ is negative semidefinite and vice versa. If $A$ is indefinite, then so is $-A$. It can also be shown that positive definite and negative definite matrices are always invertible.

Finally, there is one type of positive definite matrix that comes up frequently, and so deserves some special mention. Given any matrix $A \in \mathbb{R}^{m \times n}$ (not necessarily symmetric or even square), the matrix $G = A^T A$ (sometimes called a Gram matrix) is always positive semidefinite. Further, if $m \ge n$ (and we assume for convenience that $A$ is full rank), then $G = A^T A$ is positive definite.
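
A quick NumPy check of this fact, using the eigenvalues of $A^T A$ and the identity $x^T A^T A x = \|Ax\|_2^2$:

```python
import numpy as np

# The Gram matrix A^T A is positive semidefinite: all eigenvalues are >= 0
# and x^T (A^T A) x = ||Ax||^2 >= 0 for every x.
rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))    # m >= n and (almost surely) full rank
G = A.T @ A

eigvals = np.linalg.eigvalsh(G)    # eigvalsh: eigenvalues of a symmetric matrix
assert np.all(eigvals >= -1e-10)   # PSD (here in fact PD since A is full rank)

x = rng.standard_normal(4)
assert x @ G @ x >= 0              # the quadratic form is non-negative
```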

Eigenvalues and Eigenvectors

Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and $x \in \mathbb{C}^n$ is the corresponding eigenvector if

$$A x = \lambda x, \qquad x \ne 0.$$

Intuitively, this definition means that multiplying $A$ by the vector $x$ results in a new vector that points in the same direction as $x$, but scaled by a factor $\lambda$. Also note that for any eigenvector $x \in \mathbb{C}^n$ and scalar $c \in \mathbb{C}$, $A(cx) = cAx = c\lambda x = \lambda(cx)$, so $cx$ is also an eigenvector. For this reason when we talk about the eigenvector associated with $\lambda$, we usually assume that the eigenvector is normalized to have length 1.

We can rewrite the equation above to state that $(\lambda, x)$ is an eigenvalue-eigenvector pair of $A$ if

$$(\lambda I - A) x = 0, \qquad x \ne 0.$$

But $(\lambda I - A) x = 0$ has a non-zero solution $x$ if and only if $(\lambda I - A)$ has a non-empty nullspace, which is only the case if $(\lambda I - A)$ is singular, i.e.,

$$|(\lambda I - A)| = 0.$$

We can now use the definition of the determinant to expand this expression into a very large polynomial in $\lambda$, where $\lambda$ will have maximum degree $n$. We then find the $n$ (possibly complex) roots of this polynomial to find the $n$ eigenvalues $\lambda_1, \ldots, \lambda_n$. To find the eigenvector corresponding to the eigenvalue $\lambda_i$, we simply solve the linear equation $(\lambda_i I - A) x = 0$.
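
In practice the eigenvalues and eigenvectors are computed numerically, e.g., with `numpy.linalg.eig`; the sketch below checks the defining relation and the singularity of $\lambda I - A$:

```python
import numpy as np

# Eigenvalues/eigenvectors of a random (possibly non-symmetric) matrix.
rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))

lam, X = np.linalg.eig(A)    # columns of X are the (unit-length) eigenvectors

for i in range(4):
    assert np.allclose(A @ X[:, i], lam[i] * X[:, i])            # A x_i = lambda_i x_i
    assert np.isclose(np.linalg.det(lam[i] * np.eye(4) - A), 0)  # |lambda_i I - A| = 0
```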

The following are properties of eigenvalues and eigenvectors (in all cases assume $A \in \mathbb{R}^{n \times n}$ has eigenvalues $\lambda_1, \ldots, \lambda_n$ and associated eigenvectors $x_1, \ldots, x_n$):

  • The trace of $A$ is equal to the sum of its eigenvalues, $\mathrm{tr}A = \sum_{i=1}^{n} \lambda_i$.
  • The determinant of $A$ is equal to the product of its eigenvalues, $|A| = \prod_{i=1}^{n} \lambda_i$.
  • The rank of $A$ is equal to the number of non-zero eigenvalues of $A$.
  • If $A$ is non-singular then $1/\lambda_i$ is an eigenvalue of $A^{-1}$ with associated eigenvector $x_i$, i.e., $A^{-1} x_i = (1/\lambda_i) x_i$.
  • The eigenvalues of a diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ are just the diagonal entries $d_1, \ldots, d_n$.

We can write all the eigenvector equations simultaneously as

$$A X = X \Lambda$$

where the columns of $X \in \mathbb{R}^{n \times n}$ are the eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $A$, i.e.,

$$X = \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n).$$

If the eigenvectors of $A$ are linearly independent, then the matrix $X$ will be invertible, so $A = X \Lambda X^{-1}$. A matrix that can be written in this form is called diagonalizable.

Eigenvalues and Eigenvectors of Symmetric Matrices

Two remarkable properties come about when we look at the eigenvalues and eigenvectors of a symmetric matrix $A \in \mathbb{S}^n$. First, it can be shown that all the eigenvalues of $A$ are real. Secondly, the eigenvectors of $A$ are orthonormal, i.e., the matrix $X$ defined above is an orthonormal matrix (for this reason, we denote the matrix of eigenvectors as $U$ in this case). We can therefore represent $A$ as $A = U \Lambda U^T$, remembering from above that the inverse of an orthonormal matrix is just its transpose.

Using this, we can show that the definiteness of a matrix depends entirely on the sign of its eigenvalues. Suppose $A \in \mathbb{S}^n$ with $A = U \Lambda U^T$. Then

$$x^T A x = x^T U \Lambda U^T x = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2,$$

where $y = U^T x$ (and since $U$ is full rank, any vector $y \in \mathbb{R}^n$ can be represented in this form). Because $y_i^2$ is always nonnegative, the sign of this expression depends entirely on the $\lambda_i$'s. If all $\lambda_i > 0$, then the matrix is positive definite; if all $\lambda_i \ge 0$, it is positive semidefinite. Likewise, if all $\lambda_i < 0$ or $\lambda_i \le 0$, then $A$ is negative definite or negative semidefinite respectively. Finally, if $A$ has both positive and negative eigenvalues, it is indefinite.

An application where eigenvalues and eigenvectors come up frequently is in maximizing some function of a matrix. In particular, for a matrix $A \in \mathbb{S}^n$, consider the following maximization problem,

$$\max_{x \in \mathbb{R}^n} \ x^T A x \qquad \text{subject to } \|x\|_2^2 = 1,$$

i.e., we want to find the vector (of norm 1) which maximizes the quadratic form. Assuming the eigenvalues are ordered as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, the optimal $x$ for this optimization problem is $x_1$, the eigenvector corresponding to $\lambda_1$. In this case the maximal value of the quadratic form is $\lambda_1$. Similarly, the optimal solution to the minimization problem,

$$\min_{x \in \mathbb{R}^n} \ x^T A x \qquad \text{subject to } \|x\|_2^2 = 1,$$

is $x_n$, the eigenvector corresponding to $\lambda_n$, and the minimal value is $\lambda_n$.
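
The following NumPy sketch illustrates this on a random symmetric matrix, using `numpy.linalg.eigh` (which returns eigenvalues in ascending order):

```python
import numpy as np

# For symmetric A, the extrema of x^T A x over unit-norm x are the extreme
# eigenvalues, attained at the corresponding eigenvectors.
rng = np.random.default_rng(7)
B = rng.standard_normal((5, 5))
A = 0.5 * (B + B.T)                # build a symmetric matrix

lam, U = np.linalg.eigh(A)         # eigenvalues in ascending order
x_min, x_max = U[:, 0], U[:, -1]   # eigenvectors for smallest/largest eigenvalue

assert np.isclose(x_max @ A @ x_max, lam[-1])   # maximal value of the quadratic form
assert np.isclose(x_min @ A @ x_min, lam[0])    # minimal value

# Random unit vectors never exceed these bounds.
for _ in range(1000):
    x = rng.standard_normal(5)
    x /= np.linalg.norm(x)
    assert lam[0] - 1e-10 <= x @ A @ x <= lam[-1] + 1e-10
```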

Matrix Calculus

The Gradient

Suppose that $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ is a function that takes as input a matrix $A$ of size $m \times n$ and returns a real value. Then the gradient of $f$ (with respect to $A \in \mathbb{R}^{m \times n}$) is the matrix of partial derivatives, defined as:

$$\nabla_A f(A) = \begin{bmatrix}
\frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\
\frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}}
\end{bmatrix},$$

i.e., an $m \times n$ matrix with

$$(\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}.$$

Note that the size of $\nabla_A f(A)$ is always the same as the size of $A$. So if, in particular, $A$ is just a vector $x \in \mathbb{R}^n$,

$$\nabla_x f(x) = \begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_n}
\end{bmatrix}.$$
It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value. We can not, for example, take the gradient of $Ax$, $A \in \mathbb{R}^{n \times n}$, with respect to $x$, since this quantity is vector-valued.

It follows directly from the equivalent properties of partial derivatives that:

  • $\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)$.
  • For $t \in \mathbb{R}$, $\nabla_x (t f(x)) = t \nabla_x f(x)$.

The Hessian

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a function that takes a vector in $\mathbb{R}^n$ and returns a real number. Then the Hessian matrix with respect to $x$, written $\nabla_x^2 f(x)$ or simply as $H$, is the $n \times n$ matrix of partial derivatives,

$$\nabla_x^2 f(x) = \begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2}
\end{bmatrix}.$$

In other words, $\nabla_x^2 f(x) \in \mathbb{R}^{n \times n}$, with

$$(\nabla_x^2 f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.$$

Note that the Hessian is always symmetric, since

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}.$$

Similar to the gradient, the Hessian is defined only when $f(x)$ is real-valued.

It is natural to think of the gradient as the analogue of the first derivative for functions of vectors, and the Hessian as the analogue of the second derivative (and the symbols we use also suggest this relation). This intuition is generally correct, but there are a few caveats to keep in mind.

First, for real-valued functions of one variable $f : \mathbb{R} \to \mathbb{R}$, it is a basic definition that the second derivative is the derivative of the first derivative, i.e.,

$$\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x} \frac{\partial}{\partial x} f(x).$$

However, for functions of a vector, the gradient of the function is a vector, and we cannot take the gradient of a vector - i.e.,

$$\nabla_x \nabla_x f(x)$$

and this expression is not defined. Therefore, it is not the case that the Hessian is the gradient of the gradient. However, this is almost true, in the following sense: If we look at the $i$th entry of the gradient $(\nabla_x f(x))_i = \partial f(x)/\partial x_i$, and take the gradient with respect to $x$, we get

$$\nabla_x \frac{\partial f(x)}{\partial x_i} = \begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_i \partial x_1} \\
\frac{\partial^2 f(x)}{\partial x_i \partial x_2} \\
\vdots \\
\frac{\partial^2 f(x)}{\partial x_i \partial x_n}
\end{bmatrix},$$

which is the $i$th column (or row) of the Hessian. Therefore,

$$\nabla_x^2 f(x) = \begin{bmatrix} \nabla_x (\nabla_x f(x))_1 & \nabla_x (\nabla_x f(x))_2 & \cdots & \nabla_x (\nabla_x f(x))_n \end{bmatrix}.$$

If we don't mind being a little bit sloppy we can say that (essentially) $\nabla_x^2 f(x) = \nabla_x (\nabla_x f(x))^T$, so long as we understand that this really means taking the gradient of each entry of $(\nabla_x f(x))^T$, not the gradient of the whole vector.

Gradients and Hessians of Quadratic and Linear Functions

Now let’s try to determine the gradient and Hessian matrices for a few simple functions.

For $x \in \mathbb{R}^n$, let $f(x) = b^T x$ for some known vector $b \in \mathbb{R}^n$. Then

$$f(x) = \sum_{i=1}^{n} b_i x_i,$$

so

$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k.$$

From this we can easily see that $\nabla_x b^T x = b$. This should be compared to the analogous situation in single variable calculus, where $\partial / (\partial x)\, ax = a$.

Now consider the quadratic function $f(x) = x^T A x$ for $A \in \mathbb{S}^n$. Remember that

$$f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j,$$

so

$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j = \sum_{i \ne k} A_{ik} x_i + \sum_{j \ne k} A_{kj} x_j + 2 A_{kk} x_k = \sum_{i=1}^{n} A_{ik} x_i + \sum_{j=1}^{n} A_{kj} x_j = 2 \sum_{i=1}^{n} A_{ki} x_i,$$

where the last equality follows since $A$ is symmetric (which we can safely assume, since it is appearing in a quadratic form). Note that the $k$th entry of $\nabla_x f(x)$ is just the inner product of the $k$th row of $A$ and $x$. Therefore, $\nabla_x x^T A x = 2Ax$. Again, this should remind you of the analogous fact in single-variable calculus, that $\partial / (\partial x)\, ax^2 = 2ax$.

Finally, let's look at the Hessian of the quadratic function $f(x) = x^T A x$ (it should be obvious that the Hessian of a linear function $b^T x$ is zero). This is even easier than determining the gradient of the function, since

$$\frac{\partial^2 f(x)}{\partial x_k \partial x_\ell} = \frac{\partial}{\partial x_k} \left[ \frac{\partial f(x)}{\partial x_\ell} \right] = \frac{\partial}{\partial x_k} \left[ 2 \sum_{i=1}^{n} A_{\ell i} x_i \right] = 2 A_{\ell k} = 2 A_{k \ell}.$$

Therefore, it should be clear that $\nabla_x^2 x^T A x = 2A$, which should be entirely expected (and again analogous to the single-variable fact that $\partial^2 / (\partial x^2)\, ax^2 = 2a$).

To recap, if $A$ is symmetric:

  • $\nabla_x b^T x = b$
  • $\nabla_x x^T A x = 2Ax$
  • $\nabla_x^2 x^T A x = 2A$
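
These identities can be sanity-checked with finite differences; the sketch below compares numerical gradients and Hessians of $f(x) = b^T x + x^T A x$ against $b + 2Ax$ and $2A$:

```python
import numpy as np

# Finite-difference check of the gradient/Hessian identities above.
rng = np.random.default_rng(8)
n = 4
B = rng.standard_normal((n, n))
A = 0.5 * (B + B.T)                # symmetric A
b = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda z: b @ z + z @ A @ z    # f(z) = b^T z + z^T A z
eps = 1e-3

# Central-difference gradient along each standard basis vector.
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(grad_fd, b + 2 * A @ x, atol=1e-6)   # gradient: b + 2 A x

# Central-difference Hessian (exact for a quadratic, up to roundoff).
hess_fd = np.array([[(f(x + eps*ei + eps*ej) - f(x + eps*ei - eps*ej)
                      - f(x - eps*ei + eps*ej) + f(x - eps*ei - eps*ej)) / (4 * eps**2)
                     for ej in np.eye(n)] for ei in np.eye(n)])
assert np.allclose(hess_fd, 2 * A, atol=1e-6)           # Hessian: 2 A
```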

Least Squares

Suppose we are given a matrix $A \in \mathbb{R}^{m \times n}$ (for simplicity we assume $A$ is full rank) and a vector $b \in \mathbb{R}^m$ such that $b \notin \mathcal{R}(A)$. In this situation we will not be able to find a vector $x \in \mathbb{R}^n$ such that $Ax = b$, so instead we want to find a vector $x$ such that $Ax$ is as close as possible to $b$, as measured by the square of the Euclidean norm $\|Ax - b\|_2^2$.

Using the fact that $\|x\|_2^2 = x^T x$, we have

$$\|Ax - b\|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b.$$

Taking the gradient with respect to $x$ we have

$$\nabla_x (x^T A^T A x - 2 b^T A x + b^T b) = \nabla_x x^T A^T A x - \nabla_x 2 b^T A x + \nabla_x b^T b = 2 A^T A x - 2 A^T b.$$

Setting this last expression equal to zero and solving for $x$ gives the normal equations

$$x = (A^T A)^{-1} A^T b.$$
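
A short NumPy sketch solving a least-squares problem via the normal equations and comparing against `numpy.linalg.lstsq`:

```python
import numpy as np

# Least squares: min_x ||Ax - b||_2^2 via the normal equations.
rng = np.random.default_rng(9)
A = rng.standard_normal((8, 3))    # tall, full-rank A
b = rng.standard_normal(8)         # b is (almost surely) not in the columnspace of A

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # x = (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_normal, x_lstsq)

# The gradient 2 A^T A x - 2 A^T b vanishes at the solution.
assert np.allclose(2 * A.T @ A @ x_normal - 2 * A.T @ b, 0)
```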

Gradients of the Determinant

Now let's consider the gradient of the determinant with respect to a matrix, namely for $A \in \mathbb{R}^{n \times n}$, we want to find $\nabla_A |A|$. Recall from the definition of the determinant that

$$|A| = \sum_{i=1}^{n} (-1)^{i+j} A_{ij} |A_{\setminus i, \setminus j}| \qquad (\text{for any } j \in 1, \ldots, n),$$

so

$$\frac{\partial}{\partial A_{k\ell}} |A| = \frac{\partial}{\partial A_{k\ell}} \sum_{i=1}^{n} (-1)^{i+j} A_{ij} |A_{\setminus i, \setminus j}| = (-1)^{k+\ell} |A_{\setminus k, \setminus \ell}| = (\mathrm{adj}(A))_{\ell k}.$$

From this it immediately follows from the properties of the adjoint that

$$\nabla_A |A| = (\mathrm{adj}(A))^T = |A| A^{-T}.$$
Now let's consider the function $f : \mathbb{S}^n_{++} \to \mathbb{R}$, $f(A) = \log |A|$. Note that we have to restrict the domain of $f$ to be the positive definite matrices, since this ensures that $|A| > 0$, so that the log of $|A|$ is a real number. In this case we can use the chain rule (nothing fancy, just the ordinary chain rule from single-variable calculus) to see that

$$\frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{ij}}.$$

From this it should be obvious that

$$\nabla_A \log |A| = \frac{1}{|A|} \nabla_A |A| = A^{-1},$$

where we can drop the transpose in the last expression because $A$ is symmetric. Note the similarity to the single-valued case, where $\partial / (\partial x) \log x = 1/x$.
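
As a numerical check, the sketch below compares an entry-wise finite-difference gradient of $\log|A|$ at a symmetric positive definite $A$ with $A^{-1}$:

```python
import numpy as np

# Finite-difference check that grad_A log|A| = A^{-1} for symmetric PD A.
rng = np.random.default_rng(10)
n = 3
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)        # symmetric positive definite

eps = 1e-6
grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps              # perturb a single entry A_{ij}
        grad_fd[i, j] = (np.log(np.linalg.det(A + E))
                         - np.log(np.linalg.det(A - E))) / (2 * eps)

assert np.allclose(grad_fd, np.linalg.inv(A), atol=1e-5)
```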

Eigenvalues as Optimization

Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following equality constrained optimization problem:

$$\max_{x \in \mathbb{R}^n} \ x^T A x \qquad \text{subject to } \|x\|_2^2 = 1,$$

for a symmetric matrix $A \in \mathbb{S}^n$. A standard way of solving optimization problems with equality constraints is by forming the Lagrangian. The Lagrangian in this case can be given by

$$\mathcal{L}(x, \lambda) = x^T A x - \lambda x^T x,$$

where $\lambda$ is called the Lagrange multiplier. It can be established that for $x^*$ to be an optimal point of the problem, the gradient of the Lagrangian has to be zero at $x^*$. That is,

$$\nabla_x \mathcal{L}(x, \lambda) = \nabla_x (x^T A x - \lambda x^T x) = 2 A x - 2 \lambda x = 0.$$

Notice that this is just the linear equation $A x = \lambda x$. This shows that the only points which can possibly maximize (or minimize) $x^T A x$ assuming $x^T x = 1$ are the eigenvectors of $A$.