Foundations of Linear Algebra



The purpose of this post is to encapsulate the foundational linear algebra knowledge that I gained from working through the 18.06 course on MIT OCW (located here). Gilbert Strang is an amazing instructor and I highly recommend this course for anyone wanting to explore the depths of linear algebra. This post was written to serve as a reference for the fundamental concepts of linear algebra.



Introducing Vectors and Matrices


In this section, we will discuss definitions for vectors and matrices which are the foundational mathematical objects used in linear algebra.



What is a Vector?


A source of confusion when learning about vectors is the abundance of differing definitions. For example, in high school physics, a vector is introduced as a quantity that has length and direction, while in computer science, a vector is defined as an ordered list of numbers.

So… what really is a vector? For the purpose of this post and our development of linear algebra, it is useful to define a vector in its most general form, namely: a vector is simply an element of a vector space.

A vector space is defined as a set of objects V (elements of V are called vectors) and a field F (elements of F are called scalars). In general, there are several axioms that must be satisfied for V and F to be considered a vector space. Intuitively, these axioms largely boil down to the requirement that any linear combination of vectors in V (using scalar coefficients from F) must also be in V (i.e. if u, v ∈ V, then for a, b ∈ F, we have au + bv ∈ V).

The advantage of a vector space is that we can now generalize the concept of a vector beyond the specific definitions given in high school physics and computer science. For illustration, the following are examples of vector spaces:

  1. R^n: the set of all n-tuples of real numbers (e.g. coordinate vectors in R^3), with real scalars
  2. The set of all polynomials of degree n or smaller with real coefficients (e.g. P_2, the polynomials of degree 2 or smaller)
  3. The set of all m by n matrices with real entries
  4. The set of all real-valued functions defined on a common domain

To summarize, a vector space is defined by a set of vectors V and a field F. For V to be a valid vector space, it must be closed under linear combination. This means that if u, v ∈ V, then for any scalars a, b ∈ F, we have au + bv ∈ V. You can check that this result holds for the example vector spaces shown above. Thus, our answer to “what is a vector” is simply this: a vector is an element of a vector space.

It is important to note that the most general vector space does not include a notion of distance. To include a metric of distance, we must upgrade our general vector space into a normed vector space. If we go further still, and define an inner product on our space to measure how orthogonal two vectors are (e.g. the inner product for R^n is the dot product), then we upgrade our normed vector space into an inner product space. A great resource to see the relationships between mathematical spaces is here. For this post, we will be using the term vector space quite loosely and in most cases will assume that the vector space is an inner product space (i.e. it has a norm and an inner product defined on the space).

The advantage of working with the generalized concept of a vector space is that any results proven for the general case apply equally well to any vector space, regardless of whether it is composed of vectors that are n-tuples, matrices, polynomials, or functions. If we can show that our data forms a valid vector space, then we can leverage the mathematics developed on generalized vector spaces to analyze our data. For example, consider the natural language processing application of determining word embedding vectors. For this task, it is assumed that words in a language form a vector space such that more similar words are closer in space, and the relationships between words can be represented by the additive properties of the vector space (e.g. king - man + woman = queen). The goal of word embedding models such as word2vec is to learn a representation of words as coordinate vectors in R^n. Once we have a coordinate vector for each word, we can apply the properties of the vector space to geometrically visualize words as coordinates in space so that we can calculate distances between words, linear combinations of words, and the inner product between words.

The real power of linear algebra comes from being able to use it to computationally solve equations and process data on a computer. Since computers work with discrete numbers in memory, this raises a challenge when working with vector spaces whose vectors are objects such as continuous functions. For example, suppose that we are trying to solve an equation of the form c_1 p_1 + c_2 p_2 + c_3 p_3 = q, where p_1, p_2, p_3, and q are polynomials of degree 2 or smaller (i.e. elements of P_2). In this case, we are trying to find whether a vector in P_2 can be written as a linear combination of three other vectors from P_2. To solve this problem on a computer, we must first answer an important question: how can we represent a polynomial vector from P_2 (e.g. q(x) = a_0 + a_1 x + a_2 x^2) on a computer? This brings us to one of the most powerful attributes of vector spaces. Recall that any linear combination of vectors in a vector space is also contained in the vector space. Intuitively it makes sense that we should be able to find a minimal set of vectors whose linear combinations can describe every other vector in the space. As an informal proof of this, consider a set containing all the vectors in the vector space. Now follow an iterative pattern of: 1) taking each vector and checking whether it can be written as a linear combination of the other vectors in the set, 2) if a linear combination exists, then removing the vector from the set. If we continue this process, eventually we will get down to a minimal set of vectors whose linear combinations describe all vectors in the space. This minimal set is called a basis for the space, each vector in the basis is called a basis vector, and the number of vectors in the basis is called the dimension of the vector space. For example, a basis for P_2 could be the set {1, x, x^2}, which has dimension 3 and whose linear combinations can describe any polynomial of degree 2 and smaller. Suppose that we want to represent the polynomial q(x) = a_0 + a_1 x + a_2 x^2 using this basis. We can write down the corresponding coefficients as an ordered list of numbers (a_0, a_1, a_2), such that we can express the polynomial as a linear combination of the basis vectors: q(x) = a_0 · 1 + a_1 · x + a_2 · x^2. Hence, to represent the polynomial vector q, we only need to know the corresponding coefficients of the basis vectors. This is critical because we can represent this list of coefficients easily on a computer. Therefore, instead of working directly with the polynomial vector, we can work with the coefficients defined for the basis we choose. Furthermore, notice that this ordered list of coefficients is itself a vector in R^3, since (a_0, a_1, a_2) ∈ R^3.

In general, we can describe any vector in a vector space as a linear combination of basis vectors. Therefore, for a given basis, each vector in the space is uniquely identified by a coordinate vector whose elements are the basis vector coefficients in the linear combination. Hence, we can use coordinate vectors to represent any general vector (e.g. function, polynomial, etc.) on a computer and apply linear algebra operations to these coordinate vectors.
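
To make this concrete, here is a minimal sketch in Python with NumPy (the specific polynomials and the basis {1, x, x^2} are hypothetical examples of my own, not from the original post) showing how a question about polynomials in P_2 becomes a small linear system once we work with coordinate vectors:

```python
import numpy as np

# Represent polynomials in P_2 by coordinate vectors in the basis {1, x, x^2}
p1 = np.array([1., 0., 1.])   # 1 + x^2
p2 = np.array([0., 1., 0.])   # x
p3 = np.array([1., 1., 0.])   # 1 + x
q  = np.array([3., 2., 2.])   # 3 + 2x + 2x^2

# Solving c1*p1 + c2*p2 + c3*p3 = q reduces to a linear system on the coordinates
B = np.column_stack([p1, p2, p3])
c = np.linalg.solve(B, q)
print(c)   # combination coefficients, here [2., 1., 1.]
```

Solving for c tells us whether (and how) q can be written as a linear combination of p_1, p_2, and p_3, using nothing but arrays of basis coefficients.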

In summary, to use vectors for computational linear algebra we can think about the following steps:

  1. Define a vector space that represents the objects we are analyzing (e.g. tabular data as coordinates, polynomials, differentiable functions, language words, matrices, etc.)
  2. Determine an appropriate basis, such that all basis vectors are independent (i.e. a minimal set), and every vector in the space can be written as a linear combination of the basis vectors
  3. Perform a “coordinate transform” where we represent vectors in the space using coordinate vectors whose elements are the linear combination coefficients of the basis vectors (typically the coordinate vectors are in R^n)
  4. Perform computational linear algebra procedures using the coordinate vector representation such as solving linear equations, change of basis (e.g. Fourier transform), or linear transformations



What is a Matrix?


In its most general form, a matrix is simply a rectangular array of mathematical objects such as numbers or symbols. Depending on the context, matrices can represent different things, just like how the number “3” can mean different things in different contexts. The strength of matrices is that we can use results from matrix theory to exploit the structure of the matrix to help solve the specific problem at hand (e.g. solving a system of equations). Such techniques from matrix theory include eigenvectors/eigenvalues, different factorizations (QR, SVD, LU, etc.), positive definiteness, the four fundamental subspaces, etc. Once we pose our problem in matrix form, we gain access to these powerful analytical tools from matrix theory to help understand and manipulate the structure of our data. Some examples of the data that matrices can represent include the following:

Now that we have seen some of the contexts in which matrices are used, an important question arises: how should we visualize the action of a matrix? Here are some principles to guide our intuition:



Important Definitions


In this section, we will examine the definitions of several important concepts in linear algebra.



Important Theorems


Gilbert Strang proposes that there are six great theorems of linear algebra (as seen here). Some of these theorems are difficult to prove, so I will list the theorems without proof:



Rank


The rank of a matrix A is defined as the number of linearly independent columns of A. In other words, the rank is the dimension of the column space of A. To compute the rank of a matrix A, an effective strategy is to first transform A into its reduced row echelon form and determine the number of pivot columns, which corresponds to the rank. An interesting and nonobvious fact is that the column rank (number of independent columns) equals the row rank (number of independent rows) for any matrix A.
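
As a quick illustration (assuming NumPy and SymPy are available; the matrix is a made-up example), we can check numerically that the column rank and row rank agree, and that the pivot count of the reduced row echelon form gives the same number:

```python
import numpy as np
import sympy as sp

A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 1., 1.]])   # second row is twice the first, so the rank is 2

print(np.linalg.matrix_rank(A))    # 2 (number of independent columns)
print(np.linalg.matrix_rank(A.T))  # 2 (number of independent rows): row rank equals column rank

# The number of pivot columns in the reduced row echelon form is also the rank
_, pivots = sp.Matrix(A).rref()
print(len(pivots))                 # 2
```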

How can we show that the column rank equals the row rank for any matrix? Here are three different ways to see this:

  1. Simplest Proof: The simplest proof that the row rank equals the column rank is the following. Suppose that we have any m by n matrix A, with column rank c and row rank r. The fact that the column rank equals c tells us that we can describe the columns of A using linear combinations of c vectors. If we put these c vectors in an m by c matrix C, and put the corresponding combination coefficients in a c by n matrix R, then we can write A as A = CR. The key insight for this proof is the fact that we can interpret the matrix multiplication CR as taking linear combinations of the columns of C, OR we can interpret the matrix multiplication CR as taking linear combinations of the rows of R. Therefore, since R is a c by n matrix, it follows that the rows of A can be written as linear combinations of the c rows of R. Hence, the row rank of A must be less than or equal to the number of rows in R, such that r ≤ c. We can repeat this process in the reverse order: the fact that the row rank equals r tells us that we can describe the rows of A using linear combinations of r vectors. We can put these r vectors in the rows of an r by n matrix and repeat the proof process described above to show that c ≤ r. Therefore, we can combine these inequalities to show that r = c.
  2. Information Theory Interpretation: Another intuitive, “hand wavy” interpretation of why the dimensionality of the row space equals that of the column space can be framed in terms of information capacity. In this view, the rank tells you the information capacity of the channels (i.e. row space or column space) provided by the matrix. Simply put, if I give you an ordered list of the rows in a matrix, you have all the information necessary to tell me what the columns are (here we are implicitly using the relationship between the rows and columns imposed by the definition of a matrix). Therefore, the columns are completely described by the information contained in the rows, and the column rank should be less than or equal to the row rank. Similarly, if I give you an ordered list of the columns in a matrix, you have all the information necessary to tell me what the rows are. Therefore, the rows are completely described by the information contained in the columns, and the row rank should be less than or equal to the column rank. Hence, the column rank should equal the row rank. Continuing with this information theory interpretation, suppose that we have an m by n matrix A, where m > n. In this case, the vectors in the column space are in R^m, which have higher dimensionality than the vectors in the row space, which are in R^n. How is it possible for the row space and the column space to have the same information capacity if the vectors in their respective spaces have different dimensions? The key answer to this question is that the information capacity of the row space and column space depends on the dimensionality of the subspace, not the parent space. Hence, even though m > n, the subspace dimensionalities for the row space and column space are equal. This leads to a key insight into vector spaces: the information capacity of a space does not depend on whether the actual vectors in the space are in R^2, in R^100, or are some other abstract vector quantity. The information capacity depends on the subspace dimensionality. However, with this said, the subspace dimensionality has an upper bound described by the dimensionality of the parent space. Hence, a subspace of R^2 can have a maximum dimension of 2, while a subspace of R^100 can have a maximum dimension of 100.
  3. Specific 2 by 2 Example: Suppose we ask the question: is it possible for a matrix to have more linearly independent columns than rows? Let us take the example of a 2 by 2 matrix A = [a b; c d]. Suppose that the rows are linearly dependent, i.e. there exist scalars u and v, not both zero, such that u(a, b) + v(c, d) = (0, 0). Solving for these scalars gives (assuming nonzero entries; degenerate cases with zero entries can be checked directly) u/v = -c/a = -d/b, which implies that ad - bc = 0. Now turning our attention to the columns, the question we are asking is whether we can find scalars x and y, not both zero, such that x(a, c) + y(b, d) = (0, 0). We find that setting x = b and y = -a will solve this equation, since (ba - ab, bc - ad) = (0, 0) using ad = bc. Hence, if the rows are not linearly independent, then the columns cannot be linearly independent, so a 2 by 2 matrix cannot have more independent columns than rows (and the symmetric argument shows it cannot have more independent rows than columns).



Reduced Row Echelon Form


For any m by n matrix A with rank r, we can transform A into what is called the reduced row echelon form R such that R = EAP, where E represents a series of invertible elementary row operations (i.e. swap rows, scale rows, add multiples of rows), and P represents a series of column permutations. The reduced row echelon matrix takes the block form R = [I F; 0 0], where I is an r by r identity matrix corresponding to the pivot columns, and F is an r by (n - r) matrix containing the free variable columns. The reduced row echelon matrix plays an important role in understanding the four fundamental subspaces of A.
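
Here is a small sketch (using SymPy's rref on a hypothetical 3 by 4 matrix) of computing the reduced row echelon form; note that SymPy does not permute columns, so the identity columns sit at the pivot positions rather than being clustered on the left:

```python
import sympy as sp

A = sp.Matrix([[1, 2, 2, 2],
               [2, 4, 6, 8],
               [3, 6, 8, 10]])   # rank 2

R, pivots = A.rref()
print(R)        # Matrix([[1, 2, 0, -2], [0, 0, 1, 2], [0, 0, 0, 0]])
print(pivots)   # (0, 2): pivot columns; columns 1 and 3 are the free columns
```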



Linear Independence and Basis


What is linear independence? Suppose that we have vectors v_1, v_2, ..., v_n. We say that these vectors are independent if no nontrivial linear combination gives the zero vector. In other words, it is impossible to pick coefficients c_1, c_2, ..., c_n that are not all zero such that the linear combination c_1 v_1 + c_2 v_2 + ... + c_n v_n equals the zero vector.

How can we determine whether a set of vectors is independent? One way is to construct a matrix A with v_1, v_2, ..., v_n as the columns, and then determine whether the nullspace of A contains any nonzero vectors. We can achieve this by computing the reduced row echelon form of A and determining whether we have any free variables.
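
A minimal sketch of this test (SymPy, with made-up vectors where the third is the sum of the first two):

```python
import sympy as sp

v1 = sp.Matrix([1, 0, 2])
v2 = sp.Matrix([2, 1, 0])
v3 = sp.Matrix([3, 1, 2])   # v3 = v1 + v2, so the set is dependent

A = sp.Matrix.hstack(v1, v2, v3)
_, pivots = A.rref()
print(pivots)          # (0, 1): column 2 is a free column, so the vectors are dependent
print(A.nullspace())   # a nonzero nullspace vector, here (-1, -1, 1)
```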

We say that vectors v_1, ..., v_n span a space if the space consists of all linear combinations of those vectors. In other words, if v_1, ..., v_n span a space S, then S is the smallest space containing all linear combinations of those vectors. A basis for a vector space is a sequence of vectors v_1, v_2, ..., v_d with two properties: 1) the vectors are independent, and 2) the vectors span the vector space. The basis of a space provides everything we need to know about that space.

How can we test whether a set of n vectors is a basis for R^n? We can construct a matrix A with v_1, ..., v_n as the columns and compute the reduced row echelon matrix. If all columns are independent, then we will have no free variables, so the column rank is r = n. To see this, remember that any matrix can be written in reduced row echelon form as R = [I F; 0 0]. We clearly see that any column of R can be written as a linear combination of the r pivot columns, since we have the identity matrix up top. Hence, we only have r linearly independent columns. Note also that we only have r linearly independent rows, since rows filled with zeros are not independent.

Now, to ensure that the vectors span the complete space, we can look at whether Ax = b is solvable for every b in R^n. To be solvable for every b, we cannot have a row of zeros at the bottom of the reduced row echelon matrix. Therefore, for the vectors to be a basis of R^n, the reduced row echelon form must be the identity matrix.

Suppose we now want to test whether a set of vectors is a basis for a subspace S. We can repeat each of the steps described above; the only difference is that when examining whether these vectors span the subspace, we are only interested in determining whether Ax = b is solvable for every b in the subspace. For a given subspace, every basis for that space has the same number of vectors. We call this number the dimension of the subspace.



Pseudoinverse


If we have a matrix A that has full column rank but not full row rank, then we know that A^T A is invertible since the nullspace of A contains only the zero vector. To see why, recall that full column rank implies that all columns are linearly independent. Therefore, by the definition of linear independence, Ax = 0 only has the trivial solution and the nullspace only contains the zero vector. Furthermore, we see that the nullspace of A^T A is equal to the nullspace of A, since A^T A x = 0 implies x^T A^T A x = ||Ax||^2 = 0 and hence Ax = 0. Hence it follows that the nullspace of A^T A contains only the zero vector, and A^T A is invertible since it is square and must have full rank for its nullspace to contain only the zero vector. Therefore, we see that (A^T A)^{-1} A^T is a left inverse of A, since (A^T A)^{-1} A^T A = I.

Similarly, if we have a matrix A that has full row rank but not full column rank, then we know that AA^T is invertible. Therefore, we see that A^T (AA^T)^{-1} is a right inverse of A, since AA^T (AA^T)^{-1} = I.

Now suppose that a matrix A has neither full row rank nor full column rank. Since the column space and the row space have the same dimension (column rank equals row rank), there should be a mapping between the two. The matrix A maps its row space invertibly onto its column space, and the pseudoinverse A^+ is the inverse of this mapping, taking the column space back to the row space.

If x and y are in the row space and x ≠ y, then Ax ≠ Ay (so the mapping from the row space to the column space is one-to-one). To see this, suppose that Ax = Ay; then it follows that A(x - y) = 0, which implies that x - y is in the nullspace. However, vectors in the nullspace are orthogonal to vectors in the row space, and x - y is itself in the row space, so x - y must be orthogonal to itself. The only vector orthogonal to itself is the zero vector; thus x - y = 0 and x = y.

How can we find the pseudoinverse A^+? We can compute the SVD, which gives A = U Σ V^T. By taking the reciprocals of the nonzero elements of Σ (and transposing the resulting matrix), we get Σ^+, which enables the computation of the pseudoinverse as A^+ = V Σ^+ U^T.
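
As a numerical sanity check, here is a short NumPy sketch (with a hypothetical rank-deficient matrix and a hand-picked tolerance) that builds the pseudoinverse from the SVD exactly as described and compares it to np.linalg.pinv:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.]])   # 3x3 with rank 2 (second row is twice the first)

U, s, Vt = np.linalg.svd(A)
tol = 1e-10 * s.max()
s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])  # reciprocals of nonzero singular values
A_plus = Vt.T @ np.diag(s_plus) @ U.T                        # A^+ = V Sigma^+ U^T

print(np.allclose(A_plus, np.linalg.pinv(A)))  # should print True
```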

Let’s dig into why the formula for the pseudoinverse makes sense. Suppose that the matrix A has rank r. We can write the pseudoinverse as a sum of rank 1 matrices as A^+ = (1/σ_1) v_1 u_1^T + ... + (1/σ_r) v_r u_r^T, where the σ_i are the nonzero singular values, the v_i are the columns of V, and the u_i are the columns of U. Suppose that we have a vector b in R^m, which we can write as b = c_1 u_1 + ... + c_m u_m (since the columns of U provide an orthonormal basis for R^m). Multiplying by the pseudoinverse and using orthonormality to simplify, we get A^+ b = (c_1/σ_1) v_1 + ... + (c_r/σ_r) v_r.

Since v_1, ..., v_r provides an orthonormal basis for the row space, we can interpret the action of the pseudoinverse as a two-step procedure: 1) computing the projection of b onto the column space of A, and 2) mapping this column space projection to a vector in the row space. Hence, the pseudoinverse finds the vector A^+ b in the row space which maps to the vector in the column space that is closest to b. To see why, consider computing A A^+ b. If we expand A using its rank-1 SVD representation A = σ_1 u_1 v_1^T + ... + σ_r u_r v_r^T, we can write A A^+ b = c_1 u_1 + ... + c_r u_r.

Notice that the vector c_1 u_1 + ... + c_r u_r is the projection of b onto the column space of A. To summarize: the pseudoinverse is enabled by the SVD theorem, which tells us that there exist orthonormal basis vectors v_i for the row space and u_i for the column space of A such that A v_i = σ_i u_i. The pseudoinverse provides the vector in the row space of A (expressed relative to the basis v_1, ..., v_r) that maps to the projection of b onto the column space (expressed relative to the basis u_1, ..., u_r).



Projections


We call the closest point to b along the line through a the projection of b onto a. To find the projection of a vector b onto a vector a, we can compute the error term e as the difference between b and the projected point p = xa, written as e = b - xa. At the closest point, the error term will be orthogonal to a. To see why, think about the error term as being composed of two components: 1) a component orthogonal to a, and 2) a component parallel to a. To minimize the error term, we need to minimize the component parallel to a. When the parallel component is 0, what we have left is the orthogonal component. Therefore, at the closest point, the error term must be orthogonal to a, which gives us the equation a^T(b - xa) = 0. Solving for x we get x = a^T b / a^T a. Hence the projection of b onto a is p = a (a^T b / a^T a). Therefore, we can think about the projection as being performed by a matrix P, where P = a a^T / a^T a. We can multiply the matrix P by b to get the projection of b onto a.

Examining the projection matrix, we see that the column space of P is the set of combinations of the columns of P. Notice that P is rank 1 since a a^T is the outer product of a vector with itself. Therefore, the columns of P are multiples of a, so the column space of P can be visualized as the line of multiples of the vector a. We also see that P^T = P, so the matrix is symmetric. Furthermore, since the projection matrix will project onto the line through a, it follows that if we project again onto that line then we will get the same point, so P^2 = P.

For many equations Ax = b, there does not exist a solution. Therefore, our goal is to find the solution to Ax̂ = p, where p is the projection of b onto the column space of A. We can compute the projection of a point b onto the column space of A by first computing the error e = b - p, measured as the distance between b and the projected point p. We can compute the projected point as p = Ax̂. Now for the error vector to be orthogonal to the columns of A, the error vector must be in the left nullspace of A, such that A^T(b - Ax̂) = 0. We can expand this equation as A^T A x̂ = A^T b. Therefore, the solution is x̂ = (A^T A)^{-1} A^T b, and we can write the projection vector as p = Ax̂ = A(A^T A)^{-1} A^T b. Therefore, the projection matrix is P = A(A^T A)^{-1} A^T. Notice that if A is a square invertible matrix, then the projection matrix becomes P = A A^{-1} (A^T)^{-1} A^T = I. We see that P^T = P and P^2 = P. Note that if b is in the column space then Pb = b, while if b is orthogonal to the column space then Pb = 0.
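
The two projection formulas can be checked directly; a minimal NumPy sketch follows (the vector a, matrix A, and point b are made-up examples):

```python
import numpy as np

# Projection onto a single vector a: P = a a^T / (a^T a)
a = np.array([1., 2., 2.])
P_line = np.outer(a, a) / (a @ a)

# Projection onto the column space of A: P = A (A^T A)^{-1} A^T
A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
P_col = A @ np.linalg.inv(A.T @ A) @ A.T

b = np.array([1., 2., 4.])
print(P_line @ b)                          # projection of b onto the line through a
print(P_col @ b)                           # projection of b onto the plane C(A)
print(np.allclose(P_col @ P_col, P_col),   # P^2 = P
      np.allclose(P_col.T, P_col))         # P is symmetric
```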

If P is a projection matrix onto the column space, then I - P is a projection matrix onto the left nullspace. This is because we can view any vector b as being a combination of a component in the column space and a component in the left nullspace. Therefore, since Pb gives us the column space component of b, (I - P)b gives us the projection onto the left nullspace.

In least squares the goal is to minimize ||Ax - b||^2. We can look at the least squares problem in two ways. The first way is to consider the errors produced by each equation, which are contained in the individual elements of e = b - Ax. For example, the total squared error is e_1^2 + e_2^2 + ... + e_m^2. These errors correspond to the squared differences between the prediction line and the training data. We can use calculus to minimize these errors by setting the partial derivatives to 0.

The second way is to consider the column space of A as representing the space of possible predictions. Each point in the column space represents a set of output predictions given the input data. Therefore, we can take our training data and project the desired outputs b onto the column space to find the set of output predictions which is closest. Then we can determine the set of parameters x̂ which produces these predictions in the column space. The set of equations to solve to determine these parameters is called the normal equations, written as A^T A x̂ = A^T b.

When is A^T A invertible? If A has independent columns, then A^T A is invertible. To show this, suppose that A^T A x = 0. This implies that Ax must be orthogonal to the columns of A. But by definition, Ax is a combination of the columns of A. Therefore, we must have Ax = 0. Since we postulate that the columns of A are independent, we must have x = 0, which implies that A^T A is invertible (its nullspace contains only the zero vector). Another way to see this is to take the dot product of A^T A x = 0 with x, such that x^T A^T A x = ||Ax||^2 = 0, which again forces Ax = 0.
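
A small NumPy sketch of the normal equations, using a three-point line fit as a made-up example (the data values are assumptions of mine, not from the original post):

```python
import numpy as np

# Fit y = C + D*t to the points (0, 6), (1, 0), (2, 0)
A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])   # independent columns, so A^T A is invertible
b = np.array([6., 0., 0.])

# Normal equations: A^T A x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(x_hat)                                                     # [ 5. -3.]
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```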



Four Fundamental Subspaces


A subspace is a vector space that is contained inside of another vector space. For example, a line in R^2 that passes through the origin is a vector space since any linear combination of vectors along this line will also be on the line. Furthermore, a line in R^2 is a subspace of the vector space R^2 since every vector on the line is also in R^2. The total possible subspaces of R^2 are: 1) all of R^2, 2) any line in R^2 that passes through the origin, and 3) the zero vector. Two subspaces S and T are said to be orthogonal if every vector in S is orthogonal to every vector in T.

We have four fundamental subspaces associated with any m by n matrix A of rank r:

  1. The column space C(A): is a subspace of R^m and contains all linear combinations of the columns of A. The columns of A corresponding to the pivot columns of the reduced row echelon matrix form a basis for C(A). Therefore, the dimension of the subspace is r. All vectors in C(A) are orthogonal to vectors in the left nullspace N(A^T). As an example, suppose we have a 3 by 3 matrix A where each of the column vectors is in R^3. We can interpret each column vector as describing a line in R^3 that passes through the origin. Therefore, the span of each column vector is itself a subspace of R^3. If we consider all linear combinations of the columns of A, then we get a subspace of R^3 that we call the column space of A, written as C(A).
  2. The nullspace N(A): is a subspace of R^n and contains the vectors x that are solutions to the equation Ax = 0. From the reduced row echelon form of A, written as R = [I F; 0 0], we can solve for a basis of the nullspace as the columns of N = [-F; I], where F contains the free variable coefficients in the reduced row echelon form of A. Hence, we see that the dimension of the nullspace is equal to the number of free variables, n - r. To illustrate why the dimension of the nullspace of A must be n - r, consider the following argument. If the rank of A is r, then there are r linearly independent rows of A. Hence, to solve Ax = 0, we have r equations (one for each linearly independent row) and n variables in x. Now consider the thought experiment of assigning arbitrary values to n - r of the variables in x (the free variables). Under these conditions, we have r variables remaining in x that can be used to solve the r equations uniquely. Hence, we have n - r degrees of freedom in the nullspace, which amounts to having dimensionality n - r.
  3. The row space C(A^T): is a subspace of R^n and contains all linear combinations of the rows of A. The reduced row echelon form of A can be written as R = [I F; 0 0]. Therefore, we see that there are r nonzero rows. Recall that we computed R by performing invertible row and column operations on A such that R = EAP. Hence we see that RP^{-1} = EA, where P^{-1} is also a column permutation matrix (i.e. the inverse of swapping two columns is to apply the same operation again and swap them back) and only affects the order of the elements within the rows of R. Hence, we see that by performing only elementary row operations we produced the matrix RP^{-1}, which has m - r rows filled with zeros. Therefore, we see that RP^{-1} has r independent rows, and thus A has r independent rows. Hence, the first r rows of RP^{-1} (the nonzero rows of R with their entries permuted back into the original column order) form a basis for the row space of A.
  4. The left nullspace N(A^T): is a subspace of R^m and contains the vectors y that are solutions to the equation A^T y = 0 (equivalently, y^T A = 0). Since the rank of A is r, the number of free columns of A^T must be m - r. Therefore, the dimension of the left nullspace must be m - r. We can write the reduced row echelon matrix as R = EAP. Rearranging, we see that RP^{-1} = EA. Since the reduced row echelon matrix has the form [I F; 0 0] and P^{-1} is a column permutation matrix, P^{-1} will have no effect on the bottom m - r rows filled with zeros. We can interpret EA as forming linear combinations of the rows of A, where the combination coefficients are the values in the rows of E. Hence, the coefficients that produce the rows filled with zeros are the last m - r rows of E. Additionally, we know the rows of E are independent because the elementary matrices are invertible. Therefore, the last m - r rows of E form a basis for the left nullspace.



Finding Basis Vectors for the Four Fundamental Subspaces


Suppose we have an m by n matrix A with rank r. Our goal in this section is to demonstrate how we can find a set of basis vectors for each of the four fundamental subspaces.

Column Space: To determine a basis for the column space of A, we can first calculate the reduced row echelon form R = EA, where E represents a series of invertible elementary row operations. Note that for simplicity we are assuming that no column permutations are needed. We see that the first r columns of R form a basis for the column space of R (due to the identity matrix in the top left). We can write each vector in the column space of A as a linear combination of the columns of A, written as b = c_1 a_1 + c_2 a_2 + ... + c_n a_n, where the c_i are scalars and the a_i are the columns of A. Now using the relationship with the reduced row echelon matrix, A = E^{-1}R, we can write b = c_1 E^{-1}r_1 + c_2 E^{-1}r_2 + ... + c_n E^{-1}r_n = E^{-1}(c_1 r_1 + c_2 r_2 + ... + c_n r_n)

where the r_i are the columns of R. Notice that the combination c_1 r_1 + ... + c_n r_n is in the column space of R and therefore can be written using the first r columns as basis vectors as d_1 r_1 + ... + d_r r_r, where the d_i are scalars. Hence, we can pop this back into the equation above to get b = E^{-1}(d_1 r_1 + ... + d_r r_r) = d_1 E^{-1}r_1 + ... + d_r E^{-1}r_r = d_1 a_1 + ... + d_r a_r

Hence, we have shown that the columns of A corresponding to the pivot columns of R can be used to describe any vector in the column space of A, and hence they span the column space.

To show that this is a minimal spanning set, we must show that all of the pivot columns a_1, ..., a_r are independent. We can show this by contradiction. Suppose that there did exist scalar coefficients, not all zero, such that c_1 a_1 + ... + c_r a_r = 0. Multiplying by E, this implies that c_1 r_1 + ... + c_r r_r = 0, which is impossible since r_1, ..., r_r are all independent (they come from the identity matrix in R).

Therefore, the columns of A corresponding to the pivot columns of R span the column space of A, and the set is minimal since all these vectors are independent. Hence, these vectors form a basis for the column space of A.

Row Space: To determine a basis for the row space of A, we can first calculate the reduced row echelon form R = EA, where E represents a series of invertible elementary row operations. Thus, we can write A = E^{-1}R. We can interpret this matrix product as expressing the rows of A as linear combinations of the rows of R (and, since E is invertible, the rows of R as linear combinations of the rows of A). Since rows filled with zeros do not contribute to a basis, and the first r rows of R are all independent (due to the identity matrix in the top left), we see that the first r rows of R form a basis for the row space of A.

Nullspace: To determine a basis for the nullspace of A, we can first calculate the reduced row echelon form R = EA, where E represents a series of invertible elementary row operations. We can show that any solution to Ax = 0 will also be a solution to Rx = 0, since Rx = EAx = E0 = 0.

Therefore, every element in the nullspace of A is also in the nullspace of R, such that N(A) ⊆ N(R). We can also apply this logic in reverse: if Rx = 0, then Ax = E^{-1}Rx = E^{-1}0 = 0.

Therefore, every element in the nullspace of R is also in the nullspace of A, such that N(R) ⊆ N(A). Therefore, the nullspace of A equals the nullspace of R, N(A) = N(R).

Hence, to find a basis for the nullspace of A, we can instead find a basis for the nullspace of R. We can view the pivot variables as being constrained by each equation in order to make the equation equal 0. Therefore, the dimension of the nullspace is equal to the number of free variables, which is n - r. We can find n - r independent solutions to the equation Rx = 0 by noticing that [I F; 0 0][-F; I] = 0. Therefore, N = [-F; I] forms a basis for the nullspace because it contains n - r independent columns (due to the identity matrix in its bottom block).

Left Nullspace: To determine a basis for the left nullspace of A, we can first calculate the reduced row echelon form R = EA, where E represents a series of invertible elementary row operations. We know that the row rank is r; therefore, the number of free columns of A^T must be m - r and the dimension of the left nullspace must be m - r. The left nullspace is defined by the solutions to A^T y = 0, which is equivalent (by taking the transpose) to y^T A = 0. Notice that in the result of the multiplication EA = R, the bottom m - r rows of R are filled with zeros. This means that the bottom m - r rows of E must be solutions to y^T A = 0. Since E is invertible, all of its rows must be independent; otherwise there would be a linear combination of the rows such that c^T E = 0 for a nonzero c, which is impossible since E is invertible (right multiplying by E^{-1} would give c^T = 0). Hence the bottom m - r rows of E form a basis for the left nullspace of A.
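
Putting the four recipes together, here is a sketch (SymPy, with a hypothetical 3 by 4 matrix of rank 2) that extracts a basis for each of the four fundamental subspaces:

```python
import sympy as sp

A = sp.Matrix([[1, 2, 2, 2],
               [2, 4, 6, 8],
               [3, 6, 8, 10]])   # 3 by 4 with rank 2

R, pivots = A.rref()
r = len(pivots)

col_basis  = [A[:, j] for j in pivots]     # pivot columns of A span C(A)
row_basis  = [R[i, :] for i in range(r)]   # nonzero rows of R span C(A^T)
null_basis = A.nullspace()                 # n - r vectors spanning N(A)
left_null  = A.T.nullspace()               # m - r vectors spanning N(A^T)

print(r, len(null_basis), len(left_null))  # 2 2 1
```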



Solving the Equation Ax = b


One of the fundamental problems in linear algebra is how to find solutions to the equation Ax = b. In this section we will work through how to effectively visualize this problem and how to solve it.



Interpreting the Product Ax


Suppose we have a 3 x 3 matrix A, with rows a_1^T, a_2^T, a_3^T and columns c_1, c_2, c_3, and a 3 x 1 vector x with elements x_1, x_2, x_3. How should we interpret the matrix product Ax? Let us consider a few powerful interpretations. Note that although our example is in R^3, these interpretations apply for any n-dimensional coordinate space.

  1. Row Picture - Hyperplanes: The row picture originates from expanding the matrix product using the rows of A, so that Ax has elements a_1^T x, a_2^T x, a_3^T x. We see that the output elements are computed as the dot products between x and the rows of A. The key idea for interpretability is that we can visualize each row of A as describing a normal vector for a hyperplane. Hence, when we take the dot product of x with a row of A, for example a_1^T x, we are effectively computing the closest distance from the origin to the hyperplane with normal vector a_1 that contains x (note that these distances will be scaled if the rows of A are not unit length). Therefore, for a given x, the matrix product Ax can be interpreted as computing the distances to hyperplanes containing x and with normal vectors given by the rows of A.
  2. Row Picture - Change of Coordinate System: The row picture originates from expanding the matrix product using the rows of A, so that Ax has elements a_1^T x, a_2^T x, a_3^T x. We see that the output elements are computed as the dot products between x and the rows of A. The key idea for interpretability is that we can visualize each row of A as describing a basis vector for a new coordinate system. Hence, when we take the dot product of x with a row of A, for example a_1^T x, we are effectively computing the projection of x onto the new basis vector a_1 (again scaled by the length of a_1). Therefore, for a given x, the matrix product Ax can be interpreted as computing the projection of x onto a new coordinate system described by the rows of A.
  3. Column Picture: The column picture originates from expanding the matrix product using the columns of A as Ax = x_1 c_1 + x_2 c_2 + x_3 c_3. We see that the output vector can be interpreted as a linear combination of the columns of A, where the combination coefficients are the elements of x. Therefore, for a given x, the matrix product Ax is a linear combination of the columns of A, as the sketch below verifies numerically.
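
A minimal NumPy sketch (with a made-up matrix and vector) verifying that the row picture and column picture both reproduce Ax:

```python
import numpy as np

A = np.array([[1., 2., 0.],
              [0., 1., 3.],
              [2., 0., 1.]])
x = np.array([1., 2., 3.])

# Row picture: each output element is a dot product with a row of A
row_picture = np.array([A[i, :] @ x for i in range(3)])

# Column picture: Ax is a linear combination of the columns of A
col_picture = sum(x[j] * A[:, j] for j in range(3))

print(np.allclose(A @ x, row_picture))  # True
print(np.allclose(A @ x, col_picture))  # True
```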



Interpreting the Equation Ax = b


Suppose that we are trying to solve the equation Ax = b, where A is an m by n matrix, and we have substantially more equations than unknowns, so m > n. There will typically not exist a unique solution to this problem. What does it mean for there not to exist a unique solution? We can interpret this in two ways:

  1. Row-centric: we can write the matrix product Ax = b as a system of equations, each of the form a_i^T x = b_i, where a_i^T is the i-th row of A. Each of these equations can be viewed as a hyperplane in R^n. If an x satisfies the equation a_i^T x = b_i, then it implies that x is located on that hyperplane. To pick an x which satisfies the full equation Ax = b, it follows that x must be on each of the hyperplanes, which is equivalent to x being an intersection point of all the hyperplanes. In the situation where we have more equations than unknowns, such that m > n, we have more hyperplanes than the dimension of the space holding the hyperplanes, and there will generally not exist an intersection point between these hyperplanes. To see why, consider the following example. Suppose that x is a vector in R^2, such that we have 2 unknowns. If we have 1 equation, then any point on the hyperplane (in the case of R^2, the generalized hyperplane is a line) described by that equation will be a solution to the system. If we have 2 linearly independent equations, then there will be a unique intersection point of the hyperplanes and a unique solution to the system. Suppose we now take this 2-equation system with a unique solution and want to add another equation to the system. For this new system to remain consistent, we must ensure that the new hyperplane contains the unique solution from the 2-equation system. The two normal vectors for the hyperplanes in the 2-equation system describe a complete basis for R^2; therefore, any new hyperplane will have a normal vector that can be written as a linear combination of these basis vectors. Hence, any new equation will need to be linearly dependent on the existing equations from the 2-equation system to preserve consistency. Hence, if we have a new hyperplane that is not linearly dependent on the existing equations, we lose consistency and no longer have a solution. Another way to think about this situation is to consider that to add a new equation to the 2-equation system and preserve consistency, the new hyperplane must contain the unique solution to the 2-equation system but is free to rotate around that point in space. A linear combination of the prior equations from the 2-equation system can be viewed as constructing a new hyperplane with a normal vector that is effectively a rotation along the axis between the two normal vectors from the 2-equation system.
  2. Column-centric: we can write the matrix product Ax as x_1 a_1 + x_2 a_2 + ... + x_n a_n, where the a_i are the columns of A. Therefore, we can interpret the matrix product as a linear combination of the columns of A. Effectively, the set of all such products defines the column space of A. A solution can be found if b is in the column space of A. When m > n, the column space is a lower dimensional linear manifold in the space R^m. As such, it becomes likely that there is a component of b which is not in the column space.

Now suppose that we are given a 3 x 1 vector b, and we are asked to find x such that Ax = b. How should we interpret the matrix equation Ax = b? Let us consider a few powerful interpretations.

  1. Row Picture - Hyperplanes: If we expand the matrix equation using the rows of A, we can write an equation for each row, for example a_1^T x = b_1. To satisfy this equation, x must lie on a hyperplane with normal vector a_1 that is a distance of b_1 from the origin (scaled by the length of a_1). For x to satisfy the matrix equation Ax = b, x must simultaneously lie on each of the hyperplanes defined by the row equations. Therefore, we can visualize the solution space for x as the set of points in the intersection of all these hyperplanes. We can visualize the effect of changing the components of b as the process of moving the individual hyperplanes either closer or further from the origin along their respective normal vector axes defined by the rows of A.
  2. Column Picture: The column picture originates from expanding the matrix product using the columns of A as x_1 a_1 + x_2 a_2 + x_3 a_3 = b. To satisfy this equation, we must be able to express b as a linear combination of the columns of A. Any set of coefficients x that expresses b as a linear combination of the columns of A is a valid solution. Any b that exists in the column space of A (i.e. the space defined by linear combinations of the columns) will have a solution x.



Interpreting the Matrix Product AB


Now suppose that we have two matrices A and B, and we want to calculate the product C = AB. How can we interpret this multiplication? Here are a few different interpretation methods:

  1. Individual Elements: We can calculate the value of an individual element of C by taking the dot product of a row of A with a column of B. For example, to calculate c_12 (first row, second column), we take the dot product of the first row of A and the second column of B, calculated as c_12 = a_11 b_12 + a_12 b_22 + ... + a_1n b_n2.
  2. Combination of Columns: We can calculate the value of a column of C by taking a linear combination of the columns of A, where the combination coefficients come from a column of B. For example, we can calculate the second column of C as A times the second column of B: b_12 a_1 + b_22 a_2 + ... + b_n2 a_n, where the a_k are the columns of A.
  3. Combination of Rows: We can calculate the value of a row of C by taking a linear combination of the rows of B, where the combination coefficients come from a row of A. For example, we can calculate the second row of C as the second row of A times B: a_21 b_1^T + a_22 b_2^T + ... + a_2n b_n^T, where the b_k^T are the rows of B.
  4. Sum of Rank-1 Matrices: The matrix C can be expressed as a sum of the outer products of the columns of A and the rows of B. Therefore, we can write C = a_1 b_1^T + a_2 b_2^T + ... + a_n b_n^T.
  5. Decomposition into Blocks: If we partition the matrix C into blocks and cut the matrices A and B into appropriately sized blocks, then we can view the multiplication in terms of the multiplication of individual block components. For example, suppose that A and B are large matrices. If we partition the matrices into compatible blocks A_ik and B_kj, then we can write the equation for an individual block of C as C_ij = A_i1 B_1j + A_i2 B_2j + ... (summing over the block index).
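
The first four interpretations are easy to verify numerically; a short NumPy sketch with random matrices follows:

```python
import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 2)
C = A @ B

# Individual element: dot product of a row of A with a column of B
c12 = A[0, :] @ B[:, 1]

# Column of C: linear combination of the columns of A
col2 = sum(B[k, 1] * A[:, k] for k in range(4))

# Row of C: linear combination of the rows of B
row2 = sum(A[1, k] * B[k, :] for k in range(4))

# Sum of rank-1 outer products
rank1_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(4))

print(np.isclose(C[0, 1], c12),
      np.allclose(C[:, 1], col2),
      np.allclose(C[1, :], row2),
      np.allclose(C, rank1_sum))   # all True
```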



Interpreting the Matrix Inverse


What is a matrix inverse? We can view the matrix product Ax = b as transforming the vector x into b. Hence, the inverse of this transformation, called A^{-1}, should be able to turn b back into x, such that A^{-1}b = x. Combining the equations, we see that A^{-1}Ax = x. Since this equation must hold for any x, it follows that A^{-1}A must be the identity matrix, such that A^{-1}A = I and AA^{-1} = I.

Suppose that we are trying to find the inverse of a square matrix A. From the definition of the inverse, we must solve for the matrix A^{-1} such that AA^{-1} = I. We can interpret this matrix multiplication using the interpretability methods described in the section above. Specifically, we can consider how the columns of I can be written as linear combinations of the columns of A. As the columns of I are all independent, they span the entire space. As such, it follows that the columns of A must also be independent and span the entire space. Another way to interpret this is as follows: for A to have an inverse, it must have the columns of I in its column space. Similarly, we can write the matrix product A^{-1}A = I as linear combinations of the rows of A. Thus, in a likewise manner, for A to have an inverse, its row space must contain the rows of I.

Another way to see when an inverse does not exist is to consider whether there exists a nonzero x such that Ax = 0. If such an x exists, then the matrix is not invertible. To see why, suppose that an inverse did exist. Then it would be possible to write A^{-1}Ax = x, but we know that Ax = 0, which implies that x = A^{-1}0 = 0, which is impossible for nonzero x; therefore, A^{-1} cannot exist. To visualize this intuitively, when Ax = 0, we can think of A as mapping the point x onto the zero vector. The zero vector is like a black hole: once a vector enters it, no transformation will be able to get the original vector back.

Given an n by n square matrix A, how can we solve for the inverse? From the definition, our inverse is defined by the equation AA^{-1} = I. Hence, we have n^2 equations with n^2 unknowns (i.e. the unknowns are all the elements in A^{-1}). We can solve these equations using Gauss-Jordan elimination. We can view each elimination step as an elementary operation applied to the left side of A such that we eventually convert A into I, with steps given as E_k ... E_2 E_1 A = I. Therefore, we see that A^{-1} = E_k ... E_2 E_1. Notice that A^{-1} = E_k ... E_2 E_1 I, so we can set up an augmented matrix [A | I] where on the left side we determine the row operations to transform A into I, and on the right side we apply the same operations to I to determine the inverse matrix A^{-1}.
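
Here is a compact sketch of Gauss-Jordan elimination on the augmented matrix [A | I] (plain NumPy, with partial pivoting added for numerical stability; the helper function and example matrix are my own, not from the original post):

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert A by row-reducing the augmented matrix [A | I]."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])   # augmented matrix [A | I]
    for i in range(n):
        # Partial pivoting: bring the largest available pivot into position
        p = i + np.argmax(np.abs(M[i:, i]))
        if np.isclose(M[p, i], 0.0):
            raise ValueError("Matrix is singular")
        M[[i, p]] = M[[p, i]]
        M[i] /= M[i, i]                   # scale the pivot row so the pivot is 1
        for j in range(n):
            if j != i:
                M[j] -= M[j, i] * M[i]    # zero out the rest of the pivot column
    return M[:, n:]                       # the right half is now A^{-1}

A = np.array([[2., 1.], [5., 3.]])
print(np.allclose(gauss_jordan_inverse(A), np.linalg.inv(A)))  # True
```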

Suppose that A is a square matrix that has left inverse B (so BA = I) and right inverse C (so AC = I). To show that B = C, consider the following proof. Using a clever rearrangement of parentheses, we can write B = BI = B(AC) = (BA)C = IC = C. Notice that this equality only holds when A is square; otherwise the matrix multiplications are not possible due to the differing matrix sizes.



When can we solve Ax = b?


An important question in linear algebra is to determine whether Ax = b has a solution, and how we can describe these solutions. A critical tool to understand the solution space is to write the matrix A in reduced row echelon form. We can write the matrix in reduced row echelon form by left multiplying A by a series of invertible elementary row operations E such that EA = R, where R is the reduced row echelon form of A.

To assess whether the equation Ax = b can be solved for a given b, we can examine the rows of the reduced system Rx = Eb. If there exists a row of R that is all zeros, and the corresponding element of Eb is nonzero, then the equation has no solutions. We can interpret this using our intuition for the row picture: every row equation in Ax = b describes a hyperplane. When we apply elementary row operations to this equation, we transform the hyperplanes but preserve the set of points that lie on the intersection of the hyperplanes used in the transformation. For example, when adding two rows together, the resulting hyperplane will contain the set of points that lie on the intersection of the two hyperplanes being added. Hence, to produce a row of zeros in R, rows must be combined that correspond to the same hyperplane normal vector (up to scaling). For there to be solutions, the distance to these hyperplanes from the origin (found in b) must also match; otherwise the hyperplanes are parallel, they do not intersect, and no solutions exist.
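
In code, this solvability test is equivalent to checking that appending b to A does not increase the rank; a minimal NumPy sketch follows (the helper name is_solvable and the example matrices are hypothetical):

```python
import numpy as np

def is_solvable(A, b, tol=1e-10):
    """Ax = b has a solution iff b adds no new direction to the column space,
    i.e. rank([A | b]) == rank(A)."""
    augmented = np.column_stack([A, b])
    return np.linalg.matrix_rank(augmented, tol) == np.linalg.matrix_rank(A, tol)

A = np.array([[1., 2.], [2., 4.], [3., 6.]])   # rank 1: the columns are dependent
print(is_solvable(A, np.array([1., 2., 3.])))  # True: b is a multiple of the first column
print(is_solvable(A, np.array([1., 2., 4.])))  # False: b lies outside the column space
```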

If the equation is solvable, then we can write the general set of solutions as x = x_p + x_n, where x_p is a particular solution such that Ax_p = b, and x_n is a generic vector in the nullspace of A such that Ax_n = 0. To show that x = x_p + x_n contains all solutions to Ax = b, let us consider any arbitrary solution x*. If we calculate the matrix product of A with the difference between x* and x_p, we get A(x* - x_p) = b - b = 0. This implies that x* - x_p is in the nullspace of A, such that x* - x_p = x_n for some nullspace vector x_n. Since x* - x_p = x_n, we can write x* = x_p + x_n. Therefore, x* has the form x_p + x_n, and x = x_p + x_n contains all solutions to Ax = b. To gain intuition behind this, we can visualize the solution space as being the intersection of hyperplanes, which is a linear manifold. The points on this linear manifold can be described by the combination of a point on the linear manifold (x_p) and the direction vectors which are parallel to the linear manifold (the nullspace vectors x_n).

Hence, to find all solutions to Ax = b, we need to find: 1) a particular solution x_p, and 2) a basis for the nullspace N(A). Since any particular solution will do, one way to find a particular solution is to take the reduced row echelon system Rx = Eb, set the free variables in x to 0, and solve for the pivot variables to satisfy the equation.

To find the nullspace, we need to find the solutions to the equation Ax = 0. One way to do this is to capitalize upon the structure of the reduced row echelon matrix. We start by exchanging some columns in the reduced row echelon form so that the pivot columns are clustered together on the left side of the matrix. We can write the result as a block matrix R = [I F; 0 0], where F is the set of coefficients associated with the free variables and I is an identity matrix of size r by r, where r is the rank of A. To find vectors for the nullspace, we need to solve the equation Rx = 0. Using the block representation for R, we can solve for a basis of the nullspace by recognizing that [I F; 0 0][-F; I] = 0, where the lower identity block is of size (n - r) by (n - r). Therefore, the columns of the nullspace matrix N = [-F; I] provide a basis for the nullspace of R. To summarize the operations done: we are trying to solve for the solutions to Ax = 0. We can apply elementary row operations by left multiplying both sides of the equation with the elementary matrices E such that EAx = 0. Therefore, we see that applying row operations (that are invertible) does not change the nullspace, since EAx = 0 if and only if Ax = 0. Finally, we can right multiply the matrix EA with a permutation matrix P to permute the columns, written as (EAP)(P^{-1}x) = 0, where the final reduced row echelon matrix is R = EAP and the permuted vector is written as x' = P^{-1}x. The permutation ensures that the pivot variable columns are on the left side of R and the free variable columns are on the right side of R.

We can predict the type of solutions possible for the equation Ax = b by looking at the rank of A, which is an m by n matrix with rank r. Here we define R to be the reduced row echelon form of A with the pivot columns located on the left of the matrix. We have the following four possibilities for the rank of A:

  1. r = m = n: In this case, R = I since each row and column has a pivot. A is invertible, and we have a unique solution to Ax = b. To see why A is invertible, recall that to construct R we apply a series of invertible elementary row operations to A such that EA = R = I. Therefore, we see that A^{-1} = E, so the inverse exists.
  2. r = n < m: In this case, R = [I; 0]. Since we have no free variables, the nullspace only contains the zero vector. Therefore, depending on the value of b, we will have either 0 or 1 solutions to the equation Ax = b.
  3. r = m < n: In this case, R = [I F]. Each row has a pivot, so we will always have a solution. Furthermore, we have n - r free variables; therefore, the nullspace is a linear manifold and we will have infinitely many solutions.
  4. r < m and r < n: In this case, R = [I F; 0 0] and we have n - r free variables. Therefore, the nullspace is a linear manifold. Depending on the value of b, we will have either 0 or infinitely many solutions to the equation Ax = b.

For Ax = b to be solvable, we have the following interpretations:

  1. The vector b must be in the column space of A.
  2. We can write the reduced row echelon form of A, denoted by R, by left multiplying both sides of the equation Ax = b with elementary row operations E such that EAx = Eb. Hence, we have the transformed matrix equation Rx = Eb. If R contains a row of zeros, then the matching entry in the transformed vector Eb must also be 0. If not, then no solution exists. We can interpret this using our intuition from the row picture of matrix multiplication. Each row equation of Ax = b represents a hyperplane. When we perform an elementary row operation that adds a multiple of one row to another, we are transforming one of the hyperplanes into a new hyperplane while preserving the points that lie in the intersection of the hyperplanes represented by the two original rows. In other words, adding two rows together effectively constructs a new hyperplane that is rotated around the intersection set (in R^2 the intersection set would be a point, and in R^3 the intersection set would be a line). Hence, applying an elementary row operation modifies the hyperplane geometry but does not change the solution space (i.e. the set of intersections between all hyperplanes). Therefore, if we apply an elementary row operation that “adds a multiple of one row to another”, and this operation produces a row of zeros, then for a solution to exist, it must be that these two rows represent the same hyperplane, and therefore the distance to the origin must be the same, or else there are no intersections and thereby no solutions. Hence, when we subtract the distances to the origin for both rows (i.e. the corresponding values of the b vector), the value must be 0. We can view each invertible matrix product as an operation that transforms the hyperplanes in our equation while preserving the set of intersection points between the hyperplanes that are combined.



How can we solve Ax = b?


Suppose we have an m by n matrix A where m < n. How do we quickly show that for any solvable equation Ax = b, we must have an infinite number of solutions? One way to recognize this result is to use the structure of the reduced row echelon form. If we put the matrix in reduced row echelon form with the pivot columns on the left side, then it follows that the number of pivots must be less than or equal to the smaller dimension m, so there is always at least one free variable. To understand why, remember that to compute the reduced row echelon form, we use the following algorithm:

  1. Start with the element in the first row and first column. If this element is zero, then try to swap rows to make it nonzero. If the entire first column is zeros, then swap columns to bring a nonzero entry into this position.
  2. Scale the first row to make the element in the first row, first column equal to 1
  3. Add a multiple of the first row to all other rows to make the other elements in the first column equal 0. At this point, the only nonzero element in the first column should be the element in the first row, first column
  4. Now proceed to complete steps 1-3 for the second row, second column, etc. Continue until there are no more nonzero entries in the remaining rows
  5. The total computation of the reduced row echelon form for A can be written as R = E_k ... E_2 E_1 A P_1 P_2 ... P_j, where the E_i are the elementary row operations (i.e. scaling rows, swapping rows, adding a multiple of one row to another), and the P_i represent column permutation matrices used to swap the order of the columns. Recall that the inverse of a permutation matrix that swaps two columns is the matrix itself. To understand this, consider that the inverse of swapping two columns is to simply apply the same operation again to swap the columns back. Hence, when working with Ax, swapping two columns of A is harmless as long as we swap the same two entries of x, since (AP)(P^{-1}x) = Ax and the result of the matrix product is unaffected. Looking back at steps 1-3 described above, we see that during the computation of the reduced row echelon form for A, we will apply a sequence of row and column operations. As a hypothetical example, we may swap two rows (represented by E_1), then add a multiple of one row to another (represented by E_2), then swap two columns (represented by P_1), then add a multiple of one row to another (represented by E_3), etc. The key point is that it is intuitive to think of the row operations and column swaps as being intermingled (i.e. the order from the described example was E_1, E_2, P_1, E_3). Due to the associativity of matrix multiplication, we can achieve this interpretation because we are “free” to compute the multiplications in the order that makes intuitive sense. To make this concrete, when computing E_3 E_2 E_1 A P_1, associativity tells us that we are free to put parentheses wherever we want. Hence, to represent the sequence described in the example above we have E_3 ((E_2 (E_1 A)) P_1). In this case we see that the order of multiplications is E_1, E_2, P_1, E_3, which matches the example.
  6. The final form of the reduced row echelon matrix produced by the above steps is a matrix with the general form R = [I_(r x r)  F_(r x (n-r)); 0_((m-r) x r)  0_((m-r) x (n-r))], where r is the rank and is defined as the number of pivots. The subscripts of the block matrices in R denote the row by column size of each of the blocks. It is important to note that depending on the shape and rank of the matrix, some of these blocks may be empty. For example, if r = m = n then the matrix reduces down to R = I. The block denoted by F refers to the coefficients for the non-pivot variables, called the “free variables”. Note that the partitioning of the variables into “free” vs. “pivot” is somewhat arbitrary because we could simply apply a column swap to turn a “free” variable into a “pivot” variable. Therefore, what is most important is the final block partitioned form of the matrix and the fact that we can write any matrix in the reduced row echelon form using the steps described above.

Returning to the central question: how can we solve Ax = b? One method is as follows:

  1. Convert the system into reduced row echelon form
  2. We can find a particular solution by setting the free variables to 0 and solving for the pivot variables
  3. Next, we solve for the nullspace basis vectors by setting the free variables to the standard basis and solving for the pivot variables
  4. Combine the particular solution and the linear combinations of the nullspace basis vectors to form the complete solution
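
A sketch of this recipe using SymPy (the matrix and right-hand side are made-up examples): gauss_jordan_solve returns the general solution in terms of free parameters, so setting the parameters to zero recovers a particular solution, and nullspace() returns the nullspace basis:

```python
import sympy as sp

A = sp.Matrix([[1, 2, 2, 2],
               [2, 4, 6, 8],
               [3, 6, 8, 10]])
b = sp.Matrix([1, 5, 6])

# General solution in terms of free parameters; zeroing them gives a particular solution
x_general, params = A.gauss_jordan_solve(b)
x_particular = x_general.subs({p: 0 for p in params})
null_basis = A.nullspace()

print(x_particular.T)             # free variables set to 0, here (-2, 0, 3/2, 0)
print([v.T for v in null_basis])  # every x_particular + combination of these solves Ax = b
```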

Let us discuss the intuition for the method described above. We can visualize the solution space to Ax = b as being a linear manifold, where the particular solution is a point on the linear manifold, and the nullspace basis describes a set of direction vectors that are parallel to the linear manifold. Hence, the particular solution plus a linear combination of the nullspace basis vectors defines all vectors on the linear manifold and ensures that Ax = b.

To show that any solution to Ax = b can be written in terms of a particular solution and vectors in the nullspace, let us postulate another solution x* that supposedly cannot be written in terms of x_p and the vectors in the nullspace. If we take the difference between x* and x_p, we can write A(x* - x_p) = b - b = 0. Therefore, we see that x* - x_p is in the nullspace of A, and we can write the solution in terms of x_p and the vectors in the nullspace as x* = x_p + (x* - x_p). Hence any solution of Ax = b can be written in terms of a particular solution and vectors in the nullspace.



Least Squares


In many practical situations we have noisy measurements, with many measurements and few parameters. In this case, the chance of having b in the column space is very unlikely, hence the equation Ax = b will not have a solution. If we cannot find a solution to the equation Ax = b, then an alternative approach is to find the solution x̂ that minimizes the error ||Ax - b||^2. This can be interpreted as finding the point in the column space of A that is closest to b. Recall that we can visualize the column space as a linear manifold. The displacement from a linear manifold to a point in space can be written as a combination of a component parallel to the manifold and a component perpendicular to the manifold. Hence, the distance from a linear manifold to a point in space can be minimized by finding the point on the linear manifold such that the displacement only consists of the perpendicular component. For linear manifolds, this will be a unique point in space. We can find this point by solving for the x̂ which produces a displacement vector b - Ax̂ that is orthogonal to the column space. For a vector to be orthogonal to the column space, it must be orthogonal to all columns of A. Therefore, it follows that A^T(b - Ax̂) = 0. Using the QR factorization A = QR, we can write the equation as R^T Q^T (b - QRx̂) = 0, which simplifies (using Q^T Q = I) to R^T R x̂ = R^T Q^T b.

To simplify this further, we would like to multiply both sides by (R^T)^{-1}. However, we must first show that this inverse exists. The nullspace of A is defined by solutions to Ax = 0. Substituting in the QR factorization we get QRx = 0. Since the columns of Q are orthonormal, it follows that the nullspace of Q contains only the zero vector. Hence, the nullspace of A must be equal to the nullspace of QR, which in turn is equal to the nullspace of R. If A has full column rank (i.e. all its columns are independent), which is typically the case for tall skinny data matrices, then the nullspace of A will only be the zero vector, which implies that the nullspace of R will only be the zero vector. Hence, when A has full column rank, R is invertible, (R^T)^{-1} exists, and we can write the equation from above as Rx̂ = Q^T b. Solving this equation (by back substitution, since R is upper triangular) will give us the least squares solution to the equation Ax = b.
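
A minimal NumPy sketch of QR-based least squares, fitting a line to made-up data points (np.linalg.solve is used in place of a dedicated triangular solver for simplicity):

```python
import numpy as np

# Fit y = c0 + c1*t to noisy data: more equations (5) than unknowns (2)
t = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
A = np.column_stack([np.ones_like(t), t])

Q, R = np.linalg.qr(A)               # reduced QR: A = QR with orthonormal columns in Q
x_hat = np.linalg.solve(R, Q.T @ y)  # solve R x = Q^T y (R is upper triangular)

print(x_hat)                         # least squares coefficients [c0, c1]
print(np.allclose(x_hat, np.linalg.lstsq(A, y, rcond=None)[0]))  # True
```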



Determinants


Every square matrix has a specific number called the determinant which encodes information about the matrix. One useful interpretation of the determinant is as the volume of the parallelepiped described by the vectors of the matrix (either the rows or columns). The determinant can be used to find a closed form expression for the inverse of a matrix. Cramer’s Rule uses this expression for the inverse to produce a closed form expression for a solution to a system of linear equations.

Instead of simply writing the formula for the determinant, it is useful to describe the determinant in terms of its properties, and then use these properties to define the formula. Four axiomatic properties which form the basis for the determinant are the following:

  1. The determinant of the identity matrix is 1
  2. If you exchange two rows of a matrix, you reverse the sign of its determinant (i.e. positive to negative or vice versa)
  3. The determinant behaves like a linear function on the rows of the matrix
  4. If we multiply one row of a matrix by a scalar $t$, then the determinant is multiplied by $t$

From these four properties we can deduce all other properties of the determinant. Some of these properties include the following

To understand how we can compute the determinant of a matrix, let us take the example of a 2 by 2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$.

By applying property 3 we can write

$$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = \begin{vmatrix} a & 0 \\ c & 0 \end{vmatrix} + \begin{vmatrix} a & 0 \\ 0 & d \end{vmatrix} + \begin{vmatrix} 0 & b \\ c & 0 \end{vmatrix} + \begin{vmatrix} 0 & b \\ 0 & d \end{vmatrix}$$

Notice that for the determinants with a column filled with zeros, we can add a multiple of one row to another to create a row of zeros. From the Zero Row property, we know that the value of these determinants is 0. Hence, we can write

$$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = \begin{vmatrix} a & 0 \\ 0 & d \end{vmatrix} + \begin{vmatrix} 0 & b \\ c & 0 \end{vmatrix}$$

From property 4 and property 1 we see that $\begin{vmatrix} a & 0 \\ 0 & d \end{vmatrix} = ad$, and from properties 1, 2, and 4 we see that $\begin{vmatrix} 0 & b \\ c & 0 \end{vmatrix} = -bc$. Finally, we can combine these to show that $\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$.

To calculate the determinants of higher dimensional matrices, we can extend the principle used in the 2 by 2 case, where we used property 3 to expand along each row so that there is only one nonzero entry in each row. For the 2 by 2 case shown above, this was the expansion into four “sub” determinants, each with a single nonzero entry per row.

Notice that if we expand the determinant of an $n$ by $n$ matrix in this way, then the number of “sub” determinants will be $n^n$. Thankfully, we can simplify this calculation, since any “sub” determinant where two rows have their nonzero entry in the same column will be zero (to see why: if two such rows exist, then we can add a multiple of one row to the other to create a row of zeros, which means the determinant is zero). In this case the number of “sub” determinants reduces from $n^n$ to $n!$. To see why it is $n!$, we can ask how each surviving “sub” determinant is constructed. We construct one by choosing which elements will be nonzero: pick any one of the $n$ elements in the first row, then choose one of the remaining $n - 1$ columns in the second row, and so on. Therefore, we have $n!$ possible “sub” determinants of this form.

From this factorization, we see that if the chosen element sits in row $i$ and column $j$, then every other chosen element must come from a different row and a different column for the term to be nonzero. Hence, we can pose the determinant calculation as a recursive algorithm, where we compute the determinant as a sum of cofactors, which are signed determinants of submatrices. To see more detail on the explicit calculation of the determinant, see here.
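
A minimal sketch of the cofactor idea (the example matrix is my own, not one from the text): a recursive expansion along the first row, checked against numpy’s determinant. It runs in $O(n!)$ time, so it is purely illustrative.

```python
import numpy as np

def det_cofactor(A: np.ndarray) -> float:
    """Recursive cofactor expansion along the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor: delete row 0 and column j, then weight by the cofactor sign (-1)^j.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * det_cofactor(minor)
    return total

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])
print(det_cofactor(A), np.linalg.det(A))   # both -3.0 (up to rounding)
```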



Eigenvalues and Eigenvectors


We can think about a square matrix $A$ as a function that transforms a vector $x$ into a vector $Ax$. The vectors which preserve their direction after the transformation (i.e. $Ax$ is parallel to $x$) are called the eigenvectors of $A$. Therefore, the eigenvectors satisfy $Ax = \lambda x$, where the scalar $\lambda$ is the eigenvalue. If $A$ is singular, then $\lambda = 0$ is an eigenvalue.

What is the importance of eigenvectors/eigenvalues? Here are some key insights:

We can find the eigenvalues and eigenvectors of a matrix by solving the equation $Ax = \lambda x$. We can rewrite this equation as $(A - \lambda I)x = 0$. For a nonzero eigenvector to exist, the matrix $A - \lambda I$ must be singular; otherwise the only solution is the trivial solution $x = 0$. Therefore, since $A - \lambda I$ is singular, we know that $\det(A - \lambda I) = 0$. Once we have found the eigenvalues from this equation, we can use elimination to find the nullspace of $A - \lambda I$. The vectors in that nullspace are the eigenvectors of $A$ with eigenvalue $\lambda$.

As an example, consider the projection matrix $P = A(A^TA)^{-1}A^T$, which projects any vector onto the column space of $A$. Any vector in the column space of $A$ is an eigenvector of $P$: if $b = Ax$ is in the column space, then $Pb = A(A^TA)^{-1}A^TAx = Ax = b$. Therefore, vectors in the column space (in particular, the columns of $A$) are eigenvectors of $P$ with eigenvalue 1. Also, any vector orthogonal to the column space is an eigenvector of $P$ with eigenvalue 0, since if $b$ is orthogonal to the columns of $A$, then $A^Tb = 0$ and $Pb = 0$.

Suppose that we have $n$ linearly independent eigenvectors of $A$ and we put these vectors as the columns of a matrix $X$. Then we can write $AX = X\Lambda$, where $\Lambda$ is a diagonal matrix with the eigenvalues on the diagonal (each column satisfies $Ax_i = \lambda_ix_i$). Multiplying by the inverse $X^{-1}$ we get $A = X\Lambda X^{-1}$.

What can we say about the eigenvalues and eigenvectors of $A^2$? If $Ax = \lambda x$, then we can multiply both sides by $A$ to get $A^2x = \lambda Ax = \lambda^2x$. Therefore, $A^2$ has the same eigenvectors as $A$ and its eigenvalues are squared. We can also see this from $A^2 = X\Lambda X^{-1}X\Lambda X^{-1} = X\Lambda^2X^{-1}$. This is an incredibly powerful property of the diagonalization $A = X\Lambda X^{-1}$.
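
A quick numerical sketch of this property (the 2 by 2 matrix is my own made-up example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, X = np.linalg.eig(A)          # columns of X are eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A, X @ Lam @ np.linalg.inv(X)))            # A = X Lambda X^{-1}
print(np.allclose(A @ A, X @ Lam**2 @ np.linalg.inv(X)))     # A^2 has squared eigenvalues
```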

In order to diagonalize an $n$ by $n$ matrix, we need to have $n$ independent eigenvectors. When can we be certain that a matrix is diagonalizable? Two conditions that each guarantee the matrix is diagonalizable are: 1) the matrix is symmetric (this comes from the spectral theorem), and 2) all eigenvalues of the matrix are different.

To see why $n$ independent eigenvectors exist if all the eigenvalues are different, consider the following argument. Suppose that we have distinct eigenvalues $\lambda_1, \dots, \lambda_n$, but a dependent eigenvector, say $x_n$. Then there exists a linear combination of $x_n$ in terms of the other (independent) eigenvectors such that $x_n = c_1x_1 + \dots + c_{n-1}x_{n-1}$, with not all $c_i$ equal to zero. From the definition of an eigenvector, we know that $Ax_n = \lambda_nx_n$. Substituting the linear combination on both sides gives:

$$c_1\lambda_1x_1 + \dots + c_{n-1}\lambda_{n-1}x_{n-1} = \lambda_n(c_1x_1 + \dots + c_{n-1}x_{n-1})$$

Since all the eigenvalues are different, $\lambda_i - \lambda_n \neq 0$, so collecting terms gives $\sum_i c_i(\lambda_i - \lambda_n)x_i = 0$ with not all coefficients zero; hence the collection of vectors $x_1, \dots, x_{n-1}$ must be dependent, which is a contradiction. Therefore, all the eigenvectors must be independent. If we have repeated eigenvalues, we may or may not have $n$ independent eigenvectors. For example, the identity matrix has eigenvalues all equal to 1, but still has $n$ independent eigenvectors.

Some useful properties of eigenvalues are the following:



Solving Difference Equations


Suppose that we have a difference equation $u_{k+1} = Au_k$; then for some initial condition $u_0$ we can write the state after $k$ steps as $u_k = A^ku_0$. We can write the vector $u_0$ as a linear combination of the eigenvectors of $A$, such that $u_0 = c_1x_1 + \dots + c_nx_n$. Then when we multiply by $A^k$ we get $A^ku_0 = c_1\lambda_1^kx_1 + \dots + c_n\lambda_n^kx_n$. Therefore, the state is $u_k = c_1\lambda_1^kx_1 + \dots + c_n\lambda_n^kx_n$.

As an application of difference equations, we can examine the Fibonacci numbers. We can write the Fibonacci recurrence $F_{k+2} = F_{k+1} + F_k$ in matrix form by setting $u_k = (F_{k+1}, F_k)$, so that $u_{k+1} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}u_k$. Finding the eigenvalues and eigenvectors of this matrix allows us to calculate the Fibonacci numbers directly.
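
A minimal sketch of this idea (variable names are my own):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0]])
eigvals, X = np.linalg.eig(A)
u0 = np.array([1.0, 0.0])              # u_0 = (F_1, F_0)

def fib(k: int) -> int:
    # u_k = X Lambda^k X^{-1} u_0; F_k is the second component.
    uk = X @ np.diag(eigvals**k) @ np.linalg.solve(X, u0)
    return round(uk[1])

print([fib(k) for k in range(10)])     # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```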



Solving Differential Equations


We can visualize a matrix as a linear operator that acts on vectors, and we can visualize the derivative as a linear operator that acts on functions. An interesting connection between the two is the connection between the eigenvalues of a matrix and the solutions to a linear differential equation. For example, note that the solutions to the equation $Ax = \lambda x$ are the eigenvectors (assuming that $A$ is a square matrix with independent eigenvectors). Likewise, for differential equations, the solutions to the equation $\frac{df}{dt} = \lambda f$ are functions called eigenfunctions. For first order linear differential equations, the solutions are $f(t) = Ce^{\lambda t}$. Now suppose that we have a system of differential equations such that $\frac{du}{dt} = Au$. In this case, we are combining the differentiation operator with a vector of functions that is transformed by a matrix. For a 2 by 2 system, we can write $u(t)$ as a linear combination of the eigenvectors such that $u(t) = c_1(t)x_1 + c_2(t)x_2$ for some functions $c_1(t)$ and $c_2(t)$. Substituting this into the equation above we get

$$\frac{dc_1}{dt}x_1 + \frac{dc_2}{dt}x_2 = c_1(t)Ax_1 + c_2(t)Ax_2 = c_1(t)\lambda_1x_1 + c_2(t)\lambda_2x_2$$

This implies that $\frac{dc_1}{dt} = \lambda_1c_1$ and $\frac{dc_2}{dt} = \lambda_2c_2$. The general solutions to these equations are $c_i(t) = c_i(0)e^{\lambda_it}$. Therefore, we see that the general solution is

$$u(t) = c_1(0)e^{\lambda_1t}x_1 + c_2(0)e^{\lambda_2t}x_2$$

We can solve for the constants in this solution by using the initial conditions for the system. To generalize this idea to any system, suppose that we have the equation $\frac{du}{dt} = Au$. We can write the value of $u$ at any given time as a linear combination of the eigenvectors of $A$ (assuming the eigenvectors are independent and span the complete space), such that $u(t) = Xc(t)$. We can then write the equation as follows:

$$X\frac{dc}{dt} = AXc = X\Lambda c \quad\Longrightarrow\quad \frac{dc}{dt} = \Lambda c$$

The general solution to this system is given as $u(t) = Xe^{\Lambda t}c(0) = Xe^{\Lambda t}X^{-1}u(0) = e^{At}u(0)$.

What is a matrix exponential? We can write out the Taylor series for the exponential as

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$$

We can define the matrix exponential by extending this formula such that

$$e^{At} = I + At + \frac{(At)^2}{2!} + \frac{(At)^3}{3!} + \dots$$

Using the eigenvalue decomposition $A = X\Lambda X^{-1}$, every power satisfies $A^k = X\Lambda^kX^{-1}$, so we can write

$$e^{At} = X\left(I + \Lambda t + \frac{(\Lambda t)^2}{2!} + \dots\right)X^{-1} = Xe^{\Lambda t}X^{-1}$$

To calculate $e^{\Lambda t}$, recall that $\Lambda$ is the diagonal matrix with the eigenvalues on the diagonal. When we substitute this into the Taylor series expansion, the result of the sum is a diagonal matrix where the $i$-th diagonal entry is itself the Taylor series for $e^{\lambda_it}$. Therefore, we see that $e^{\Lambda t}$ is simply the diagonal matrix with entries $e^{\lambda_1t}, \dots, e^{\lambda_nt}$.
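
A small numerical sketch of this (the system matrix is made up, chosen to have eigenvalues $-1$ and $-2$): the eigendecomposition route agrees with scipy’s general-purpose `expm`.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])           # eigenvalues -1 and -2 (a stable system)
t = 0.7

eigvals, X = np.linalg.eig(A)
eAt_eig = X @ np.diag(np.exp(eigvals * t)) @ np.linalg.inv(X)   # X e^{Lambda t} X^{-1}

print(np.allclose(eAt_eig, expm(A * t)))   # True
```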

We can understand the stability of a system by examining the eigenvalues of the system. We have three possibilities:

  1. Transient: The solution will go to zero if $\text{Re}(\lambda) < 0$ for all eigenvalues $\lambda$. To see why, remember that if $\lambda = a + bi$ we can write $e^{\lambda t} = e^{at}e^{ibt}$. The imaginary part causes a rotational component of unit magnitude; what controls the scale is the real part through $e^{at}$.
  2. Steady State: If at least one eigenvalue is 0 and all other eigenvalues have negative real part then the system will converge to a steady state response
  3. Explode: If $\text{Re}(\lambda) > 0$ for any eigenvalue, then the system will blow up



Markov Chains


A Markov matrix is defined such that all entries are nonnegative and the entries in each column sum to 1. A Markov matrix will have one eigenvalue equal to 1, and the magnitude of every other eigenvalue will be at most 1. To see that 1 is an eigenvalue of a Markov matrix $A$, recognize that since all the columns of $A$ add to 1, the columns of $A - I$ sum to 0. Therefore, the sum of the row vectors of $A - I$ is the zero vector, which means the rows of $A - I$ are dependent. Hence, $A - I$ is singular and has eigenvalue 0, so $A$ must have eigenvalue 1. If we have the difference equation $u_{k+1} = Au_k$ and the matrix $A$ is Markov, then we can solve for the long-term state probabilities by using the eigenvalues and eigenvectors of $A$.
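
A minimal sketch with a made-up 2-state Markov matrix: the steady state is the eigenvector for eigenvalue 1 (normalized to sum to 1), and iterating the difference equation converges to the same vector.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])             # columns sum to 1

eigvals, X = np.linalg.eig(A)
k = int(np.argmin(np.abs(eigvals - 1.0)))   # index of the eigenvalue equal to 1
steady = X[:, k] / X[:, k].sum()            # normalize so the entries sum to 1
print(steady)                               # [2/3, 1/3]

u = np.array([1.0, 0.0])                    # any starting distribution
for _ in range(100):
    u = A @ u                               # u_{k+1} = A u_k
print(u)                                    # converges to the same steady state
```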



Symmetric Matrices


Symmetric matrices are the most important class of matrices. A symmetric matrix satisfies $A = A^T$. Symmetric matrices have the property that their eigenvalues are real and their eigenvectors can be chosen orthogonal, as given by the spectral theorem.

Usually, we write the eigendecomposition of a matrix as $A = X\Lambda X^{-1}$. However, when the matrix is symmetric, we can choose orthonormal eigenvectors and write $A = Q\Lambda Q^T$. This is called the spectral theorem or principal axis theorem.

To show that the eigenvalues of a symmetric matrix must be real, we can start from the expression $Ax = \lambda x$. Regardless of the eigenvalues, we can write the complex conjugate of this equation as

$$A\bar{x} = \bar{\lambda}\bar{x}$$

which follows because the matrix has real components. If we transpose this equation and use $A = A^T$, we get $\bar{x}^TA = \bar{\lambda}\bar{x}^T$. Multiplying the original equation on the left by $\bar{x}^T$ gives $\bar{x}^TAx = \lambda\bar{x}^Tx$, while multiplying the transposed equation on the right by $x$ gives $\bar{x}^TAx = \bar{\lambda}\bar{x}^Tx$.

This implies that $\lambda = \bar{\lambda}$, since the product $\bar{x}^Tx$ represents the squared length of the eigenvector and is nonzero. Therefore, the eigenvalues must be real. The equivalent of symmetric matrices for complex entries are called Hermitian matrices, which satisfy $A = \bar{A}^T$.

We can expand the eigenvalue factorization of a symmetric matrix into rank 1 matrices such that

$$A = Q\Lambda Q^T = \lambda_1q_1q_1^T + \lambda_2q_2q_2^T + \dots + \lambda_nq_nq_n^T$$

Recall that if we want to project a vector $b$ onto the vector $a$, then the error $b - \hat{x}a$ must be orthogonal to $a$, giving the normal equation $a^T(b - \hat{x}a) = 0$. Solving for the coefficient we get $\hat{x} = \frac{a^Tb}{a^Ta}$. Computing the projection, we get $p = a\hat{x} = \frac{aa^T}{a^Ta}b$. Therefore, we see that the matrix $\frac{aa^T}{a^Ta}$ is the projection matrix that projects onto the vector $a$. Furthermore, if $a$ is unit length, then the projection matrix simply becomes $aa^T$.

Another way to interpret this projection is to consider what the coefficient $\frac{a^Tb}{a^Ta}$ means. Rearranging and expanding using the definition of the dot product, we see:

$$\frac{a^Tb}{a^Ta} = \frac{\|a\|\|b\|\cos\theta}{\|a\|^2} = \frac{\|b\|\cos\theta}{\|a\|}$$

Therefore, the projection computes a scaling factor: drawing a triangle with $b$ as the hypotenuse and a ray along the vector $a$, the quantity $\|b\|\cos\theta$ represents the distance along $a$ at which the perpendicular projection of $b$ lands. Multiplying this scaling factor by the unit vector $a/\|a\|$ gives the vector representing the projection of $b$ onto $a$.

Returning to the eigenvector decomposition written as $A = \sum_i\lambda_iq_iq_i^T$, we see that each component $q_iq_i^T$ represents a perpendicular projection onto the unit eigenvector $q_i$, where the eigenvalue $\lambda_i$ describes how much that projection contributes to the action of the matrix. Hence, every symmetric matrix can be interpreted as a combination of perpendicular projection matrices.
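
A short sketch of this decomposition on a made-up symmetric matrix:

```python
import numpy as np

S = np.array([[4.0, 1.0],
              [1.0, 3.0]])             # symmetric example

eigvals, Q = np.linalg.eigh(S)         # eigh returns orthonormal eigenvectors for symmetric input
S_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigvals, Q.T))
print(np.allclose(S, S_rebuilt))       # True: S = sum_i lambda_i q_i q_i^T

P0 = np.outer(Q[:, 0], Q[:, 0])        # projection onto the line through q_1
print(np.allclose(P0 @ P0, P0))        # projections are idempotent
```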



Special Matrices


In this section we will discuss some classes of matrices which have special properties.



Positive Definite Matrices


How do we know if a symmetric matrix is positive definite? There are a few equivalent tests for positive definiteness: 1) are the eigenvalues all positive, 2) are the upper-left sub-determinants (leading principal minors) all positive, 3) are the pivots all positive, 4) is $x^TAx > 0$ at all nonzero points $x$.

Given that the eigenvalues of a positive definite matrix are positive, we know that the inverse of the matrix must also be positive definite, since its eigenvalues are the reciprocals, which preserves the sign. If $A$ and $B$ are positive definite, then we can easily show that $x^T(A + B)x = x^TAx + x^TBx > 0$ for all nonzero $x$, so $A + B$ is positive definite.
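
Here is a small sketch that spot-checks three of these tests on a made-up symmetric matrix (the second-difference matrix), which is known to be positive definite:

```python
import numpy as np

S = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])

# Test 1: all eigenvalues positive.
print(np.all(np.linalg.eigvalsh(S) > 0))

# Test 2: all upper-left sub-determinants (leading principal minors) positive.
print(all(np.linalg.det(S[:k, :k]) > 0 for k in range(1, 4)))

# Test 4: x^T S x > 0, spot-checked at random nonzero points.
xs = np.random.default_rng(1).normal(size=(1000, 3))
print(np.all(np.einsum('ij,jk,ik->i', xs, S, xs) > 0))
```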



Gram Matrices


Suppose we have an $m$ by $n$ matrix $A$. We know that $A^TA$ is square and symmetric; is it also positive definite? Looking at the expression $x^TA^TAx = (Ax)^T(Ax) = \|Ax\|^2 \geq 0$, we see that $A^TA$ is at least positive semidefinite. For $A^TA$ to be positive definite, we must ensure that $Ax \neq 0$ for every nonzero $x$. This means that the nullspace of $A$ must contain only the zero vector, i.e. $A$ must have full column rank. Therefore, $A^TA$ is positive definite if the rank of $A$ is $n$. The matrix $A^TA$ is called the Gram matrix and is always positive semidefinite; when $A$ has full column rank, the Gram matrix is invertible and positive definite.



Similar Matrices


Suppose that we have two square matrices $A$ and $B$. These matrices are called similar if there exists an invertible matrix $M$ such that $B = M^{-1}AM$. From the eigenvalue decomposition, we can write $\Lambda = X^{-1}AX$, which implies that $A$ is similar to its eigenvalue matrix $\Lambda$. From here we can define a family of similar matrices: for any invertible matrix $M$ we can write $B = M^{-1}AM = M^{-1}X\Lambda X^{-1}M = (M^{-1}X)\Lambda(M^{-1}X)^{-1}$. Notice that this is another eigenvalue decomposition with the same eigenvalues but different eigenvectors. Hence $A$ is similar to all other (diagonalizable) matrices with the same eigenvalues.

Therefore, similar matrices must have the same eigenvalues. To show this, we can write the eigenvalue decomposition as $A = X\Lambda X^{-1}$. Now suppose that we have a similar matrix $B = M^{-1}AM$. We can write out its eigenvalue decomposition as

$$B = M^{-1}X\Lambda X^{-1}M = (M^{-1}X)\Lambda(M^{-1}X)^{-1}$$

Therefore, we see that the eigenvalues are preserved, but the eigenvectors have been transformed to $M^{-1}X$.

Suppose that we now consider matrices with the same eigenvalues which may not be diagonalizable. Consider a matrix which is a multiple of the identity, say $4I$, where all the eigenvalues equal 4. We see that conjugating by any invertible matrix does not change the matrix, since $M^{-1}(4I)M = 4M^{-1}M = 4I$. Therefore, the matrix $4I$ is the only matrix in its family of similar matrices.

If we now look at a matrix such as $\begin{bmatrix} 4 & 1 \\ 0 & 4 \end{bmatrix}$, clearly the eigenvalues are both 4. However, this matrix has a much larger family of similar matrices than $4I$, because every matrix with a repeated eigenvalue of 4 and only a single independent eigenvector will be similar to it. The most diagonal representative from each family of similar matrices is called the Jordan normal form. This enables us to effectively “complete” the diagonalization process for matrices which cannot be diagonalized.

A Jordan block is a block of the matrix that has exactly one associated eigenvector. Every square matrix $A$ is similar to a Jordan matrix $J$, where $J$ is a block-diagonal matrix in which each block has one associated eigenvector. To understand this, start with any square matrix $A$. If the eigenvalues are distinct, then the matrix is diagonalizable and it is similar to its diagonal eigenvalue matrix; in this case, the Jordan form is $J = \Lambda$. If the eigenvalues are repeated and there are “missing” eigenvectors, then the Jordan matrix will have blocks that each have one associated eigenvector.



Orthogonal Matrices


Consider that we have a projection (or expansion) of a vector $v$ onto an orthonormal basis $q_1, \dots, q_n$. We can write this expansion as $v = x_1q_1 + x_2q_2 + \dots + x_nq_n$. An important question is how we find the corresponding coefficients. For an orthonormal basis, this process is very easy: we can just take the dot product of both sides of this equation with the corresponding basis vector. For example, for the $i$-th coefficient we get $x_i = q_i^Tv$, since all the other terms vanish by orthogonality. We can write this in matrix notation as $v = Qx$. Given that $Q^TQ = I$, we can write $x = Q^Tv$.

An orthogonal matrix is defined as a square matrix where each column and row is perpendicular to every other column and row respectively. Specifically, a square matrix $Q$ with orthonormal columns is called orthogonal if $Q^TQ = QQ^T = I$, where $I$ is the identity matrix. We can expand this product as a sum of rank-1 matrices such that $QQ^T = q_1q_1^T + \dots + q_nq_n^T = I$.

We can show that if a square matrix $Q$ has orthonormal columns then it must have orthonormal rows as well. If $Q$ has orthonormal columns, then it follows that $Q^TQ = I$. By left multiplying this expression by $Q$, we can rearrange the expression as follows

$$QQ^TQ = Q \quad\Longrightarrow\quad (QQ^T - I)Q = 0$$

Given that $Q$ has full rank (since all its columns are mutually orthogonal and nonzero), the product of any nonzero row vector with $Q$ must be a nonzero vector. Therefore, since the product of every row of $QQ^T - I$ with $Q$ results in the zero vector (as seen from the equation above), it follows that every row of $QQ^T - I$ is zero, which implies that $QQ^T = I$. From this result we see that the rows of $Q$ must also be orthonormal.
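
A minimal numerical sketch (a random orthogonal matrix from a QR factorization, and a made-up vector $v$):

```python
import numpy as np

Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(4, 4)))   # random orthogonal matrix

print(np.allclose(Q.T @ Q, np.eye(4)), np.allclose(Q @ Q.T, np.eye(4)))   # columns and rows orthonormal

v = np.array([1.0, 2.0, 3.0, 4.0])
x = Q.T @ v                            # coefficients of v in the basis {q_1, ..., q_4}
print(np.allclose(Q @ x, v))           # v = x_1 q_1 + ... + x_4 q_4
```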



Graph Matrices


A graph is a set of nodes and edges connecting the nodes. In the incidence matrix $A$, every row corresponds to an edge. Loops in the graph correspond to linearly dependent rows. If we have an incidence matrix $A$, how can we interpret the $x$ in the product $Ax$? We can think about the element $x_i$ as representing the potential of node $i$. Therefore, the product $Ax$ tells us the potential differences across the edges. In this way, the nullspace of $A$ can be visualized as the settings of the potentials such that all the potential differences are 0. Therefore, we can write the vectors in the nullspace as $x = c(1, 1, \dots, 1)$, where $c$ is a constant. In terms of an electrical network, the potential difference is zero on each edge if each node has the same potential. We can’t tell what that common potential is by observing the flow of electricity through the network, since only potential differences correspond to flow. Therefore, only if one of the nodes is grounded to potential 0 can we determine the absolute potential of all other nodes in the graph. If we ground a node in the network, we are setting one of the entries of $x$ to 0, which effectively removes one of the columns from $A$.

For an incidence matrix $A$, what does the nullspace of $A^T$ correspond to? We can think about the element $y_i$ as representing the current flowing along edge $i$. The sum of the currents into and out of every node must be 0, and the matrix product $A^Ty$ tells us the net current at each node. Hence the solutions to $A^Ty = 0$ are the currents that satisfy Kirchhoff’s current law. The basis vectors for this nullspace are the currents that circulate around each of the loops of the graph. We can add current sources by solving $A^Ty = f$, where $f$ are the current sources.
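
Here is a small sketch with a made-up four-node graph: constant potentials lie in the nullspace of the incidence matrix $A$, and a current circulating around a loop lies in the nullspace of $A^T$.

```python
import numpy as np

# Triangle on nodes 0, 1, 2 plus node 3 hanging off node 2.
# Each row is an edge, with -1 at the "from" node and +1 at the "to" node.
A = np.array([[-1.0,  1.0,  0.0,  0.0],   # edge 0: node 0 -> node 1
              [ 0.0, -1.0,  1.0,  0.0],   # edge 1: node 1 -> node 2
              [-1.0,  0.0,  1.0,  0.0],   # edge 2: node 0 -> node 2
              [ 0.0,  0.0, -1.0,  1.0]])  # edge 3: node 2 -> node 3

x_const = np.ones(4)                       # equal potentials at every node
print(A @ x_const)                         # all zeros: constants span the nullspace of A

y_loop = np.array([1.0, 1.0, -1.0, 0.0])   # current around the triangle (edge 2 traversed backwards)
print(A.T @ y_loop)                        # all zeros: Kirchhoff's current law at every node
```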



Change of Basis


A central question in linear algebra is the following: given a vector space, what is the best basis to choose to represent its vectors? For example, the vector space could be the space of all grayscale intensity images, where each image is a vector in this space and is, by definition, a matrix of pixel intensity values. The most straightforward basis to choose is simply the standard basis, where each element of the coordinate vector describes the intensity of one pixel of the image. However, for specific applications such as image compression, this is not a very helpful basis, as we would prefer a basis where each basis vector describes global characteristics of the image. Several different basis representations have been researched, including the following



Fourier Basis


The goal of Fourier analysis is to represent a signal using a set of complex exponential basis functions. The primary reasons why complex exponentials are used as basis functions are the following properties:

  1. Periodic: A complex exponential basis function has a single associated frequency. Therefore, by using complex exponentials as basis functions, we can represent a signal in terms of composite frequency components. By computing the projection of the signal onto these basis functions, we can effectively transform a signal from the time/spatial domain into the frequency domain.
  2. Orthogonal: Having orthogonal basis functions leads to the derivation of elegant and simple formulas to calculate the transform coefficients when projecting the signal onto the basis functions.

To begin our adventure into Fourier analysis, let us start with a question. Suppose we have a periodic real valued function $f(t)$ with period $T$, such that $f(t + T) = f(t)$. Is it possible to find a set of coefficients $c_k$ such that we can represent the signal as:

$$f(t) = \sum_{k=-\infty}^{\infty}c_ke^{i2\pi kt/T}$$

To unpack this question, let us first examine the contents of the above equation. The complex exponential can be expanded as $e^{i2\pi kt/T} = \cos(2\pi kt/T) + i\sin(2\pi kt/T)$. Therefore, we see that the period of the $k$-th complex exponential is $T/k$. A couple of interesting things to note: 1) when $k = 0$ the complex exponential reduces down to 1, which is the constant function. As such, we can interpret the coefficient $c_0$ as specifying the average value of the function. 2) The smallest nonzero frequency in the sum occurs when $k = \pm 1$, and that frequency is $1/T$. The reason why we don’t use a frequency smaller than $1/T$ is that our signal is known to have a base frequency of $1/T$; therefore, we don’t need any smaller frequencies to describe the signal. 3) The reason why we have negative values of $k$ is so that the coefficients can come in complex conjugate pairs ($c_{-k} = \bar{c}_k$), which makes the sum produce a real (potentially phase shifted) signal.

With this said, the central question becomes: how can we find the values of the coefficients $c_k$? This is where we can exploit the orthogonality of the complex exponentials. We can compute the projection of the signal onto the $k$-th basis function by multiplying by (the conjugate of) that basis function and integrating over the period in the following way:

$$c_k = \frac{1}{T}\int_0^Tf(t)e^{-i2\pi kt/T}dt$$

We call this equation for the coefficient $c_k$ the “analysis” equation. The key to this result is the orthogonality of the basis functions, which tells us that

$$\int_0^Te^{i2\pi kt/T}e^{-i2\pi mt/T}dt = \begin{cases}T & \text{if } k = m\\ 0 & \text{if } k \neq m\end{cases}$$

Hence, we now have a formula for the coefficients of this series. We call this series the Fourier series of $f(t)$.

We can write the series representation for $f(t)$ as an equation known as the “synthesis” equation: $f(t) = \sum_{k=-\infty}^{\infty}c_ke^{i2\pi kt/T}$.

A limitation of the Fourier series is that it assumes that $f(t)$ is periodic with period $T$. How could we extend the concept of the Fourier series to handle nonperiodic functions? One approach is to continue assuming that $f(t)$ is periodic, but let the period get very large, $T \to \infty$. If we take the derivation that we developed for the coefficients of the Fourier series (written with the integration bounds shifted to $[-T/2, T/2]$) and substitute it into the synthesis equation for the Fourier series, we get

$$f(t) = \sum_{k=-\infty}^{\infty}\left(\frac{1}{T}\int_{-T/2}^{T/2}f(\tau)e^{-i2\pi k\tau/T}d\tau\right)e^{i2\pi kt/T}$$

We can define the frequency of the $k$-th complex exponential as $\nu_k = k/T$. We see that the difference between any two successive frequencies is $\Delta\nu = 1/T$. Substituting into the expression, we see

$$f(t) = \sum_{k=-\infty}^{\infty}\left(\int_{-T/2}^{T/2}f(\tau)e^{-i2\pi\nu_k\tau}d\tau\right)e^{i2\pi\nu_kt}\,\Delta\nu$$

Now, if we: 1) let $T \to \infty$, so $\Delta\nu \to d\nu$, 2) let $\nu_k \to \nu$ become a continuous variable, and 3) swap the sum for an integral over $\nu$, then we get the following

$$f(t) = \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty}f(\tau)e^{-i2\pi\nu\tau}d\tau\right)e^{i2\pi\nu t}\,d\nu$$

The inner integral produces a function of the frequency $\nu$, and this function is the Fourier transform of $f(t)$. The outer integral can be interpreted as the inverse Fourier transform.

An important question to understand is: are these coefficients bounded? We can safely assume that every real-life signal will have finite energy, which means that $\int|f(t)|^2dt < \infty$. Now using the fact that the complex exponential has magnitude 1, we see that the coefficients remain bounded for such signals.

Since any periodic signal can be written in terms of frequency components that are integer multiples of the base frequency of the signal, we only need a countably infinite number of complex exponentials to represent the periodic signal. This is the basis of the Fourier series. On the other hand, for infinite time signals that are aperiodic, we need to use all frequencies on the real number line to represent the signal. This leads to the Fourier integral transform, where the integral produces a coefficient for each real frequency $\nu$.

To develop intuition for the Fourier transform, we can think about the difference between the Fourier series and the Fourier transform, and the analogous relation between PMFs and PDFs from probability theory. For the Fourier series, we had a coefficient for each discrete frequency $k/T$, with some of these frequencies having nonzero energy. This is analogous to the probability being located at discrete positions in a PMF.

Disclaimer: my knowledge of measure theory is minimal. Therefore, I have constructed this section mostly based on handwaving intuition. Please take it with a grain of salt.

Suppose that we want to represent the probability of landing anywhere on the real number line between 0 and 1. Let us try to construct a random variable that can represent this distribution. Suppose that we have a random variable associated with a uniform discrete distribution over a domain of $N$ evenly spaced points in $[0, 1]$. The PMF associated with this variable is $1/N$ for each point in the domain. Consider what happens as we let $N \to \infty$: the PMF goes to zero for every point. This poses a problem if we want the PMF to describe the entire interval, because all the probabilities go to zero.

However, we have an even bigger problem. The fundamental problem is that we are trying to describe the interval of the real number line between 0 and 1 using a countably infinite set of points. As a result of measure theory, we know that the total measure of a countable set of points is smaller than that of any interval of the real number line. Therefore, even if we let $N \to \infty$, the domain cannot describe the complete interval $[0, 1]$.

Therefore, we need a new approach to represent probabilities on the real number line. One approach is to define a primitive measure, such that we can divide the real number line into a countable collection of these primitive measures. Then we can assign probability values to these primitive measures such that they sum to 1. A suitable primitive measure to represent an interval on the real number line is an infinitesimal interval since we can clearly represent any interval as a countable collection of smaller intervals.

This pattern of defining a primitive measure that we can use to represent the domain of our random variable as a countable set is key, as it enables us to assign a nonzero probability to some of the primitive measures. This process can be generalized to spaces of different dimension. For example: for a domain of discrete points, a primitive measure is a single point; for a domain of intervals of the real line, a primitive measure is an infinitesimal interval; for a domain of areas, a primitive measure is an infinitesimal area; for a domain of volumes, a primitive measure is an infinitesimal volume; and so on.

Suppose that we represent the interval $[0, 1]$ as an ordered collection of $N$ intervals, each of length $\Delta x = 1/N$. Now suppose that each of these intervals has an associated probability value such that $p_i$ is the probability associated with interval $i$. Then we can write the sum of these probabilities as $\sum_ip_i = 1$. For a uniform distribution, the associated probability is $p_i = 1/N$. However, notice that as $N \to \infty$, we have $p_i \to 0$. Is there a nicer form that we can use to represent the probability that does not converge to zero? Here comes the key piece of innovation: suppose that we define another function that describes the probability density of the interval, such that $f_i = p_i/\Delta x$, where $\Delta x$ is the length of the interval. In the case of the uniform distribution, we see that $f_i = (1/N)/(1/N) = 1$ for every $N$. Hence, by dealing with the probability density, the quantity we track does not converge to zero as $N \to \infty$.

Suppose in the general case that our associated probability is not constrained to be uniform. In this case, $p_i$ can be any positive value such that the probabilities sum to 1, given by $\sum_ip_i = 1$. Now substitute in our expression for the density to get $\sum_if_i\Delta x = 1$. Taking the limit $N \to \infty$, we see that $\int_0^1f(x)dx = 1$. Here we have replaced the limiting sum with a Riemann integral.

Therefore, an effective way to represent an arbitrary probability distribution is using the probability density, which describes the probability per unit length. An intuitive way to think about the integral $\int_0^1f(x)dx$ is as follows: 1) the integral represents a countably infinite sum of primitive measures $dx$ that have been scaled by $f(x)$; the interval from 0 to 1 is completely described by all the $dx$ in the sum. 2) The function $f(x)$ can be interpreted as the density of some quantity such as probability, and the product $f(x)dx$ computes the probability over an infinitesimal interval.
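
A tiny numerical sketch of this limiting argument for the uniform distribution on $[0, 1]$:

```python
import numpy as np

# As N grows, each per-interval probability 1/N goes to zero, but the density p_i / dx
# stays at 1 and the Riemann sum of density * dx stays equal to 1.
for N in (10, 1000, 100000):
    dx = 1.0 / N
    p = np.full(N, 1.0 / N)            # uniform probability per small interval
    density = p / dx                   # probability per unit length
    print(N, p[0], density[0], float(np.sum(density * dx)))
```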

Returning to the Fourier transform: since we need all real frequencies to describe an infinite, aperiodic signal, we need to consider the signal synthesis equation as an integral over the domain of frequencies, where the primitive measure is an infinitesimal frequency interval of length $d\nu$.

Now that we have seen the Fourier series for signals that are periodic, and the Fourier transform for signals that are aperiodic, what about signals that are discretely sampled? One way of modeling the process of sampling is to multiply a signal by an impulse train. An impulse train is defined as $\sum_{n=-\infty}^{\infty}\delta(t - nT_s)$, where $T_s$ is the sampling interval.

To be able to use the Fourier series, we need to have a periodic signal. We can force our signal to be periodic by simply repeating it. Let us assume that our signal has length $T$, and hence, by repeating the signal, the repeated signal has period $T$. We will assume that the period is an integer multiple of the sampling interval, such that $T = NT_s$. We can write the Fourier series coefficients of the sampled signal as follows

$$c_k = \frac{1}{T}\int_0^Tf(t)\sum_n\delta(t - nT_s)e^{-i2\pi kt/T}dt = \frac{1}{T}\sum_{n=0}^{N-1}f(nT_s)e^{-i2\pi knT_s/T}$$

Since $T = NT_s$, we can rewrite the coefficient equation as $c_k = \frac{1}{T}\sum_{n=0}^{N-1}f(nT_s)e^{-i2\pi kn/N}$.

If we define $w$ such that $w = e^{-i2\pi/N}$, then we can write the equations for all coefficients as $c_k = \frac{1}{T}\sum_{n=0}^{N-1}f(nT_s)w^{kn}$ for $k = 0, \dots, N-1$.

If we let $F$ be the $N \times N$ matrix with entries $F_{kn} = w^{kn}$, then we can write this system of equations in matrix form, with the vector of coefficients obtained by applying $F$ to the vector of samples.

The matrix $F$ is called the N-point DFT matrix and is unitary when scaled by $1/\sqrt{N}$, such that $\left(\tfrac{1}{\sqrt{N}}F\right)^H\left(\tfrac{1}{\sqrt{N}}F\right) = I$. We can interpret this matrix as performing a change of basis from the “time” domain to the “frequency” domain.
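
A small sketch that builds this matrix explicitly (using the common convention $w = e^{-i2\pi/N}$, which matches numpy’s FFT; other sign conventions are equally valid):

```python
import numpy as np

N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
F = np.exp(-2j * np.pi * k * n / N)            # F[k, n] = w^{kn}

# Scaled by 1/sqrt(N), the DFT matrix is unitary.
print(np.allclose((F / np.sqrt(N)).conj().T @ (F / np.sqrt(N)), np.eye(N)))

# Applying F performs the change of basis from the time domain to the frequency domain.
x = np.random.default_rng(0).normal(size=N)
print(np.allclose(F @ x, np.fft.fft(x)))       # matches numpy's FFT
```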



Matrix Factorizations


In this section we will discuss some of the important matrix factorizations.



CR


Suppose we have a 3 x 3 matrix $A$.

We can write $A$ as the product of two matrices $C$ and $R$, such that $A = CR$, where $C$ is constructed to contain the linearly independent columns of $A$, and $R$ contains the coefficients for the linear combinations of these columns needed to reproduce $A$ (a small numerical sketch follows the rank argument below). This decomposition provides two fundamental and important interpretations of matrix multiplication:

  1. Each column of $A$ can be viewed as a linear combination of the columns of $C$, where the combination coefficients are given by the corresponding column of $R$
  2. Each row of $A$ can be viewed as a linear combination of the rows of $R$, where the combination coefficients are given by the corresponding row of $C$

We can define the column rank and row rank of $A$ to be the number of linearly independent columns and rows respectively. From the decomposition $A = CR$ we can show that the column rank must equal the row rank of any matrix, as follows:

  1. Suppose that $A$ has $r$ linearly independent columns; then $C$ has $r$ columns, and by construction they are independent
  2. Every column of $A$ can be written as a combination of the columns of $C$, since $A = CR$
  3. The rows of $R$ must be independent, since $R$ contains an $r$ by $r$ identity submatrix (the columns of $C$ are also columns of $A$ by construction, and each of those columns of $R$ has value 1 in one row and 0 in all other rows)
  4. Every row of $A$ is a combination of the rows of $R$, since $A = CR$; therefore the row rank of $A$ must be less than or equal to $r$, the column rank of $A$
  5. Now repeat steps 1-4 with $A^T$; we get the result that the column rank of $A$ is less than or equal to the row rank of $A$
  6. By combining the results from step 4 and step 5, we see that the row rank must equal the column rank
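
Here is the promised sketch, on a made-up rank-2 example, using sympy’s reduced row echelon form to pick out the independent columns:

```python
import sympy

# Hypothetical rank-2 matrix (the third column is the sum of the first two).
A = sympy.Matrix([[1, 2, 3],
                  [2, 4, 6],
                  [1, 1, 2]])

rref, pivot_cols = A.rref()
C = A.extract(list(range(A.rows)), list(pivot_cols))   # independent (pivot) columns of A
R = rref[:len(pivot_cols), :]                          # nonzero rows of the reduced row echelon form
print(C * R == A)                                      # True: A = CR
```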



LU


What is Lower Triangular – Upper Triangular (LU) factorization? The LU factorization states that we can factor any matrix, with proper row and/or column permutations, as $PAQ = LU$, where $L$ is a lower triangular matrix and $U$ is an upper triangular matrix. To develop intuition for this factorization, recall that we can transform a matrix into an upper triangular matrix by applying elementary row operations that consist of: 1) permuting two rows or columns, and 2) adding a multiple of one row to another row. If the matrix requires permutations, we can factor all the permutations into a matrix $P$ for row permutations and a matrix $Q$ for column permutations. By using these permutation matrices, we can ensure that the “add a multiple of one row to another row” operation only needs to add a multiple of an upper row to a lower row. This ensures that the matrix required to represent this operation is lower triangular. Hence, we can write the transformation of $PAQ$ into an upper triangular matrix as $E(PAQ) = U$, where $E$ is the product of lower triangular matrices, each describing an “add a multiple of one row to another row” operation. Therefore, to show that $PAQ = LU$, we must show that $E^{-1}$ exists and is lower triangular. Consider a single elementary matrix $E_{ij}$, which adds a multiple of row $j$ to row $i$. The inverse of this operation is simply to subtract the same multiple of row $j$ from row $i$. Therefore, the inverse of $E_{ij}$ exists, and furthermore it is lower triangular, since to compute the inverse of $E_{ij}$ we simply take the off-diagonal entry (which is below the diagonal) and flip its sign (i.e. change from add to subtract). Hence, the inverse $E^{-1}$ exists and is a product of lower triangular matrices. It can be shown that the product of two lower triangular matrices is also lower triangular, by examining the entries of the resulting matrix and showing that the entries above the diagonal must all be 0. Therefore, the matrix $L = E^{-1}$ exists and is lower triangular. Hence, we have the factorization $PAQ = LU$. In the special case that $A$ already has proper row and/or column permutations, then $P = Q = I$ and $A = LU$.

How many operations are required to transform an $n$ by $n$ matrix into an upper triangular matrix $U$? In the worst case, for each entry below a pivot we must perform a multiplication and a subtraction. In the first step, we multiply the first row and subtract a copy of it from each of the other rows, which takes approximately $n^2$ operations. As we repeat this process on smaller and smaller submatrices, we see that the total number of operations is roughly $n^3/3$.

Suppose we are trying to solve the system $Ax = b$. One way to approach this problem is to factor the matrix as $A = LU$, where $L$ is a lower triangular matrix and $U$ is an upper triangular matrix. The basic idea behind this factorization is the same as Gaussian elimination: we apply a series of lower triangular elementary operations to the matrix $A$ to form the upper triangular matrix $U$. Upon determining the LU factorization, we can solve the system by first solving the lower triangular system $Lc = b$ and then solving the upper triangular system $Ux = c$.
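
A small sketch using scipy’s LU routines (the matrix and right-hand side are made up): factor once, then solve with forward and back substitution.

```python
import numpy as np
from scipy.linalg import lu, lu_factor, lu_solve

A = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])
b = np.array([5.0, -2.0, 9.0])

P, L, U = lu(A)                         # scipy returns A = P L U (row permutations in P)
print(np.allclose(A, P @ L @ U))

lu_piv = lu_factor(A)                   # factor once...
x = lu_solve(lu_piv, b)                 # ...then triangular solves for each right-hand side
print(np.allclose(A @ x, b))            # True (x = [1, 1, 2] for this example)
```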



QR


Another important factorization is the QR factorization, which states that any matrix can be decomposed as $A = QR$, where $Q$ has orthonormal columns and $R$ is an upper triangular matrix. The Gram-Schmidt process is one way of computing the QR factorization; it proceeds by iteratively computing the component of the current column that is orthogonal to each of the previously processed columns. Using this process, an orthonormal basis can be built, and the resulting steps can be captured in the QR factorization.

Let $Q$ denote a matrix with orthonormal columns, so that $Q^TQ = I$. If $Q$ is square, then $Q^TQ = I$ implies that $Q^T = Q^{-1}$. If we want to project a vector $b$ onto the column space of $Q$, then the projection formula simplifies: $P = Q(Q^TQ)^{-1}Q^T = QQ^T$. If $Q$ is square, then $P = QQ^T = I$ (the projection is onto the whole space). For least squares, the normal equations for a matrix with orthonormal columns reduce to $\hat{x} = Q^Tb$.

Given an $m$ by $n$ matrix $A$ with rank $r$, how do we find the factorization $A = QR$? One method is to use Gram-Schmidt. The goal of Gram-Schmidt is to produce an orthonormal basis. The algorithm works as follows; note that for simplicity we are assuming that the first $r$ columns of $A$ are independent. We denote the $i$-th column of $A$ as $a_i$.

  1. Pick the first column $a_1$ and compute the normalized basis vector $q_1 = a_1/\|a_1\|$.
  2. Select the next column $a_2$ and compute its projection onto $q_1$, written as $(q_1^Ta_2)q_1$. Subtract the projection from $a_2$ and normalize the remainder to get the next orthonormal basis vector $q_2$.
  3. Repeat this process for all independent columns (at each step subtracting the projections onto all previously computed $q_i$) to establish an orthonormal basis for the column space of $A$ (see the code sketch at the end of this subsection).

Therefore, for a matrix $A$ with rank $r$, the result of the Gram-Schmidt process will be an orthonormal basis $q_1, \dots, q_r$ for the column space of $A$. Since each column of $A$ is trivially in the column space, we can write each column as a linear combination $a_j = \sum_{i=1}^rc_{ij}q_i$. Using orthonormality, we can determine each of the coefficients as $c_{ij} = q_i^Ta_j$. Substituting these coefficients back into the combination we get $a_j = \sum_{i=1}^r(q_i^Ta_j)q_i$.

Therefore, we can represent the complete matrix as $A = QR$, where the columns of $Q$ are $q_1, \dots, q_r$ and the entries of the coefficient matrix are $R_{ij} = q_i^Ta_j$.

As a result of using Gram-Schmidt to construct the orthonormal basis, we have $q_i^Ta_j = 0$ when $i > j$, since $a_j$ lies in the span of $q_1, \dots, q_j$. Therefore, every entry of the coefficient matrix below the diagonal is zero.

This matrix is upper triangular. Note that it will only be invertible if $A$ has full column rank, i.e. $r = n$. Hence, we see that any matrix can be factored as $A = QR$, where $Q$ has orthonormal columns and is of size $m$ by $r$, and $R$ is an upper triangular matrix of size $r$ by $n$.
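
A minimal sketch of classical Gram-Schmidt on a made-up full-column-rank matrix (in practice, modified Gram-Schmidt or Householder reflections are preferred for numerical stability):

```python
import numpy as np

def gram_schmidt_qr(A: np.ndarray):
    """Classical Gram-Schmidt producing A = QR for a full-column-rank A."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]    # coefficient of column j along q_i
            v -= R[i, j] * Q[:, i]         # subtract the projection onto q_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]              # normalize what is left
    return Q, R

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q.T @ Q, np.eye(3)), np.allclose(A, Q @ R), np.allclose(R, np.triu(R)))
```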

Some of the applications of QR factorization include:



SVD


The singular value decomposition tells us that any $m$ by $n$ matrix $A$ with rank $r$ can be factored as $A = U\Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix.

By definition, a rank $r$ matrix has a column space of dimension $r$, which means that we need $r$ vectors to describe a basis for the column space. Since the column rank equals the row rank, it follows that the row space also has dimension $r$. Since the dimensionality of these subspaces is the same, it makes intuitive sense that there should be a mapping between vectors in these subspaces. The process of finding such a mapping is the goal of the SVD.

The fundamental question that is answered by the SVD is the following: can we find an orthonormal basis $v_1, \dots, v_r$ for the row space such that mapping these vectors through $A$ produces scaled vectors $Av_i = \sigma_iu_i$, where $u_1, \dots, u_r$ form an orthonormal basis for the column space?

Assuming that this factorization exists, then using matrices we can write the mapping described above as

$$A\begin{bmatrix}v_1 & \cdots & v_r\end{bmatrix} = \begin{bmatrix}u_1 & \cdots & u_r\end{bmatrix}\begin{bmatrix}\sigma_1 & & \\ & \ddots & \\ & & \sigma_r\end{bmatrix}$$

We can go further than this by including the nullspace and the left nullspace in this equation. Given that the rank is $r$, the dimension of the nullspace will be $n - r$, which means that we can describe a basis for the nullspace using $n - r$ vectors. We can easily find orthonormal vectors which are orthogonal to $v_1, \dots, v_r$ by using a process such as Gram-Schmidt. Let us call these orthonormal basis vectors for the nullspace $v_{r+1}, \dots, v_n$. Then we can write $Av_i = 0$ for each of these vectors and append them as additional columns in the equation above, with corresponding zero columns on the right-hand side.

Likewise, we can add vectors for the left nullspace. Given that the rank is $r$, the dimension of the left nullspace will be $m - r$. In a similar process to that described above, we can find an orthonormal basis for the left nullspace, $u_{r+1}, \dots, u_m$. Then we can write

$$A\begin{bmatrix}v_1 & \cdots & v_n\end{bmatrix} = \begin{bmatrix}u_1 & \cdots & u_m\end{bmatrix}\Sigma$$

where $\Sigma$ is the $m$ by $n$ matrix with $\sigma_1, \dots, \sigma_r$ on its diagonal and zeros everywhere else.

We can now define $V = \begin{bmatrix}v_1 & \cdots & v_n\end{bmatrix}$, $U = \begin{bmatrix}u_1 & \cdots & u_m\end{bmatrix}$, and $\Sigma$ as above.

Then our expression becomes $AV = U\Sigma$. Given that $V$ is square and composed of orthonormal columns, we know that $V^TV = VV^T = I$. Multiplying both sides of this equation on the right by $V^T$ gives $AVV^T = U\Sigma V^T$.

Therefore, we see that $AVV^T = A$. Hence, we can write the above factorization as $A = U\Sigma V^T$. Before diving into how to find these matrices, let us first describe what each of them represents:

At this point, the question becomes: how do we find the SVD factorization for a matrix $A$? Well, we can start by assuming that the factorization exists. Then multiplying $A$ on the left by $A^T$ we get:

$$A^TA = V\Sigma^TU^TU\Sigma V^T = V(\Sigma^T\Sigma)V^T$$

Therefore, we see that $V$ must be the eigenvector matrix for $A^TA$, and $\Sigma^T\Sigma$ must be the eigenvalue matrix of $A^TA$. Given that $A^TA$ is a symmetric matrix, the spectral theorem tells us that there exists an orthonormal basis of its eigenvectors. Hence deriving $V$ from the eigenvectors of $A^TA$ ensures that it is orthonormal.

To get $\Sigma$, we notice that $\Sigma^T\Sigma$ is the product of two diagonal matrices, which is also diagonal. Therefore, the entries along the diagonal of $\Sigma$ are simply the square roots of the corresponding diagonal entries of $\Sigma^T\Sigma$, which are the eigenvalues of $A^TA$.

Finally, to get the matrix $U$ we can apply the same technique, but in the opposite order. Multiplying $A$ on the right by $A^T$ gives:

$$AA^T = U\Sigma V^TV\Sigma^TU^T = U(\Sigma\Sigma^T)U^T$$

Therefore, we see that $U$ must be the eigenvector matrix for $AA^T$, and $\Sigma\Sigma^T$ must be the eigenvalue matrix of $AA^T$. Similar to above, as a consequence of the spectral theorem, deriving $U$ from the eigenvectors of $AA^T$ ensures that it is orthonormal.

Therefore, we have shown how to derive the SVD for a general matrix $A$. The beauty of the SVD is that it selects the best (orthonormal) bases for the four subspaces associated with $A$, and describes a scaling mapping between the row space and column space bases. The SVD enables us to describe the action of any matrix as three steps: 1) rotation, 2) scaling, 3) rotation.
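
To close, a short sketch of the SVD on the same made-up rank-2 matrix used earlier, checking the subspace bases and the mapping $Av_i = \sigma_iu_i$:

```python
import numpy as np

A = np.array([[1.0, 2.0, 2.0, 2.0],
              [2.0, 4.0, 6.0, 8.0],
              [3.0, 6.0, 8.0, 10.0]])

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))                 # numerical rank (2 here)
print(r)

# First r columns of U span the column space; the rest span the left nullspace.
# First r rows of Vt span the row space; the rest span the nullspace.
print(np.allclose(A @ Vt[r:].T, 0))        # nullspace vectors are mapped to zero
print(np.allclose(A.T @ U[:, r:], 0))      # left-nullspace vectors are killed by A^T

# A maps each right singular vector v_i to sigma_i * u_i.
print(np.allclose(A @ Vt[:r].T, U[:, :r] * s[:r]))
```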