Essence of linear algebra

February 2020

Notes on the excellent YouTube series by 3Blue1Brown. This series taught me more about linear algebra than any class in college.

(I wrote these to help reinforce the concepts from the series. It may be useful as a supplement, but it’s no substitute for the animations themselves.)

# What are vectors?

A vector can describe anything for which there is a sensible notion of adding two of them together and multiplying one by a number.

Vectors are scaled by scalars, which are numbers, or constants.

In the physics world, vectors are often thought of as arrows in space, with direction and magnitude.

In the computer science world, vectors are often lists of numbers (scalars).

Both of these perspectives can provide a useful framework for understanding vectors. This series will focus on deepening one’s intuition via geometric interpretations.

# Linear combinations, span, and basis vectors

The coordinates of a vector can be thought of as scalars applied to some unit vectors. Those unit vectors are the basis vectors for the given vector space.

A linear combination of vectors is the result of a scalar multiplication and element-wise vector addition. For vectors $\vec{v}$ and $\vec{w}$:

\[a \vec{v} + b \vec{w} = \begin{bmatrix} a v_1 + b w_1 \\ a v_2 + b w_2 \\ \vdots \\ a v_n + b w_n \end{bmatrix}\]

where $a$ and $b$ vary over the real numbers $\mathbb{R}$.
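Here's a small NumPy sketch (example vectors of my own choosing, not from the series) computing a linear combination element-wise:

```python
import numpy as np

v = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])
a, b = 2.0, -0.5

# a*v + b*w, computed element-wise
combo = a * v + b * w
print(combo)  # [0.5 4.5]
```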

The span of a set of vectors is the set formed by all of their linear combinations. Geometrically, it can be easier to think about a vector as an arrow, and the span of that vector as the line drawn by that arrow in either direction, i.e. scaled up and down by any scalar.

A vector is linearly dependent on a set of vectors if it is a linear combination of those vectors. In other words, it is linearly dependent if its removal from the set does not affect the span of that set (it already lies somewhere in the span).

A basis of a vector space is a set of linearly independent vectors that spans the full space.

# Linear transformations and matrices

A linear transformation is a function that accepts a vector and returns a vector. To respect linearity, a transformation $T$ must preserve vector addition and scalar multiplication: $T(\vec{v} + \vec{w}) = T(\vec{v}) + T(\vec{w})$ and $T(c\vec{v}) = c\,T(\vec{v})$. Geometrically, all grid lines remain parallel and evenly spaced, and the origin stays fixed.

A transformation from one coordinate system to another can be described with an $(n, n)$ transformation matrix, whose columns record where each basis vector lands. Conversely, all matrices can also be thought of as transformations.

For example, consider a vector $\vec{v} = [x, y]$, defined in the standard 2D coordinate-plane. Think of $\vec{v}$ as a linear combination of the “implicit” standard basis vectors $\hat{i} = [1, 0]$ and $\hat{j} = [0, 1]$, i.e. $\vec{v} = x \hat{i} + y \hat{j}$.

Now, apply a linear transformation to the entire coordinate system, moving $\hat{i}$ and $\hat{j}$. This transformation can be described by a $(2, 2)$ matrix $A$, whose first column holds the coordinates where $\hat{i}$ lands, $\vec{i} = [a, b]$, and whose second column holds the coordinates where $\hat{j}$ lands, $\vec{j} = [c, d]$.

If you apply that transformation to vector $\vec{v}$, by performing the multiplication $A\vec{v}$, the result is the same linear combination $x \vec{i} + y \vec{j}$!

\[\begin{bmatrix} a & c \\ b & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \underbrace{ x \begin{bmatrix} a \\ b \end{bmatrix} + y \begin{bmatrix} c \\ d \end{bmatrix} }_{\text{where all the intuition is}} = \begin{bmatrix} ax + cy \\ bx + dy \end{bmatrix}\]
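A quick NumPy check of this identity, using an arbitrary matrix and vector of my own choosing:

```python
import numpy as np

# Columns are where i-hat and j-hat land after the transformation.
A = np.array([[1.0, 3.0],
              [2.0, 4.0]])
v = np.array([5.0, 6.0])    # v = 5*i-hat + 6*j-hat

by_multiplication = A @ v
by_columns = 5.0 * A[:, 0] + 6.0 * A[:, 1]
print(np.allclose(by_multiplication, by_columns))  # True
```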

# Matrix multiplication as composition

Matrix multiplication can be interpreted geometrically as the composition of linear transformations.

For example, to apply a rotation and a shear to a 2D vector, we could apply each one successively. But since both of these operations are linear transformations, we know that each basis vector $\hat{i}$ and $\hat{j}$ will land at some new coordinates $\vec{i}$ and $\vec{j}$ in the 2D plane. This means we can describe both operations with a single, composed $(2, 2)$ transformation matrix.

Note that matrix multiplication reads right-to-left, as if each matrix were a function applied to the result of an inner function, e.g. $f(g(h(x)))$. Like function composition, matrix multiplication is associative, but not commutative.
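A small NumPy sketch (rotation and shear matrices of my own choosing) showing composition and the failure of commutativity:

```python
import numpy as np

rotation = np.array([[0.0, -1.0],   # 90-degree counter-clockwise rotation
                     [1.0,  0.0]])
shear = np.array([[1.0, 1.0],       # shear: j-hat leans over to [1, 1]
                  [0.0, 1.0]])

# Right-to-left: rotate first, then shear.
composed = shear @ rotation

v = np.array([1.0, 2.0])
print(np.allclose(composed @ v, shear @ (rotation @ v)))  # True
print(np.allclose(shear @ rotation, rotation @ shear))    # False: not commutative
```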

# The determinant

Given a $(2, 2)$ matrix $A$, the determinant represents the factor by which area is scaled by the transformation specified by $A$. A negative determinant indicates that the transformation inverts the orientation of space, though the absolute value is still the scale factor for area.

To figure out if the determinant will be negative, recall that the identity matrix has a determinant of $1$, and that in the standard orientation $\hat{i}$ sits clockwise from $\hat{j}$. If the transformed $\hat{i}$ still sits clockwise from the transformed $\hat{j}$, the orientation is preserved and the determinant is positive; if they flip, the determinant is negative.

Similarly, in 3D, the determinant of a matrix represents the transformation’s scale factor for volume, and the sign of the orientation is indicated by the right-hand rule (pointer finger is $\hat{i}$, middle finger is $\hat{j}$, thumb is $\hat{k}$).

A matrix with a determinant of $0$ indicates that the transformation squishes space into a lower dimension.

Due to the compositionality of matrix multiplication, if you multiply two matrices together, the determinant of the resulting matrix is equal to the product of the determinants of the original two matrices, i.e. $\det(A_1 A_2) = \det(A_1) \det(A_2)$.
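A NumPy check of these properties, with matrices I made up for illustration:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])   # scales area by det(A) = 6
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # reflection across y = x: flips orientation

print(np.linalg.det(A))      # 6.0
print(np.linalg.det(B))      # -1.0
print(np.isclose(np.linalg.det(A @ B),
                 np.linalg.det(A) * np.linalg.det(B)))  # True
```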

# Inverse matrices, column space and null space

Matrices can be used to solve linear systems of equations. These must be expressible in the form of linear combinations of basis vectors, i.e. $A\vec{x} = \vec{v}$.

The inverse matrix $A^{-1}$ of $A$ is the unique matrix with the property that $A^{-1}A = I$, where $I$ is the identity matrix ($1$s along the diagonal, $0$s everywhere else). That is, $A^{-1}$ is the transformation that will “undo” the transformation $A$.

If a matrix $A$ has a determinant of $0$, then its inverse does not exist: there is no way to undo a transformation that squishes space into a lower dimension, because many inputs land on the same output.

The rank of a matrix is the number of dimensions in the output of its transformation. For example, if a linear transformation results in a line, its matrix has a rank of $1$. If the output is a plane, it has a rank of $2$.

A full rank matrix preserves the number of dimensions of its input.

The column space of a matrix $A$ is the set of all possible outputs $A \vec{v}$ for $\vec{v} \in \mathbb{R}^n$. That is, the column space of $A$ is the span of the columns of $A$.

Therefore, the rank of a matrix can be thought of as the dimensionality of its column space.

Any matrix which is not full rank has a nontrivial null space, or kernel. The null space of a matrix is the set of all vectors that the transformation sends to the origin.

In the context of a linear system of equations $A\vec{x} = \vec{v}$, if $\vec{v} = \vec{0}$, the null space of $A$ gives you all possible solutions to the equation.
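A small NumPy sketch (my own matrices) of solving a system, inverting it, and spotting a rank-deficient matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([3.0, 5.0])

x = np.linalg.solve(A, v)                             # solve A x = v
print(np.allclose(A @ x, v))                          # True
print(np.allclose(np.linalg.inv(A) @ A, np.eye(2)))   # True: A^{-1} A = I

singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])        # det = 0: the columns are dependent
print(np.linalg.matrix_rank(singular))   # 1: the plane collapses onto a line
```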

# Nonsquare matrices as transformations between dimensions

Nonsquare matrices represent transformations between dimensions. For example, the matrix

\[A = \begin{bmatrix} 2 & 0 \\ -1 & 1 \\ 2 & 1 \end{bmatrix}\]

represents a transformation from 2D to 3D space.

The columns still represent the transformed coordinates of the basis vectors—observe that there are two basis vectors, and they each land on coordinates which have three dimensions.

This is a full rank matrix, since its column space (a plane in 3D space) has the same number of dimensions as its 2D input space.
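Checking the same matrix with NumPy:

```python
import numpy as np

A = np.array([[ 2.0, 0.0],
              [-1.0, 1.0],
              [ 2.0, 1.0]])        # columns: where the two 2D basis vectors land in 3D

v = np.array([1.0, 1.0])           # a 2D input...
print(A @ v)                       # [2. 0. 3.] ...lands at a 3D point
print(np.linalg.matrix_rank(A))    # 2: the column space is a plane
```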

# Dot products and duality

The dot product of two vectors $\vec{v}$ and $\vec{w}$ is the result of multiplying the length of $\vec{v}$ by the signed length of the projection of $\vec{w}$ onto $\vec{v}$.

This computation is symmetric: $\vec{v} \cdot \vec{w} = \vec{w} \cdot \vec{v}$. Scaling either vector results in a proportional scaling in the projected space of the other vector, i.e. $(2 \vec{v}) \cdot \vec{w} = 2 (\vec{v} \cdot \vec{w})$.

The dot product is often used as a measure of similarity between vectors. If we define $\theta$ to be the angle between two vectors $\vec{v}$ and $\vec{w}$, then the dot product is positive when $\theta < 90^\circ$ (the vectors point in broadly the same direction), zero when $\theta = 90^\circ$ (they are perpendicular), and negative when $\theta > 90^\circ$ (they point in broadly opposite directions).

Operationally, the dot product can be computed by element-wise multiplication and then addition, i.e.

\[\vec{v} \cdot \vec{w} = v_1 w_1 + v_2 w_2 + \cdots + v_n w_n\]

or via the geometric definition:

\[\vec{v} \cdot \vec{w} = \lVert \vec{v} \rVert \lVert \vec{w} \rVert \cos \theta\]
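A NumPy sketch (vectors of my own choosing) confirming the two definitions agree, computing the angle with arctan2 rather than from the dot product itself:

```python
import numpy as np

v = np.array([3.0, 4.0])
w = np.array([2.0, 1.0])

elementwise = np.dot(v, w)   # 3*2 + 4*1 = 10

# Angle between the vectors, via each vector's angle from the x-axis.
theta = np.arctan2(v[1], v[0]) - np.arctan2(w[1], w[0])
geometric = np.linalg.norm(v) * np.linalg.norm(w) * np.cos(theta)
print(np.isclose(elementwise, geometric))  # True
```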

Duality: an $n$-dimensional vector can be considered a $(1, n)$ matrix. Any vector therefore inherently represents a linear transformation onto the number line lying along its span.

This is easiest to understand with a 2D vector $\vec{u}$, which can be written as a $(1, 2)$ matrix $ [u_x, u_y] $. The columns of the matrix are where the basis vectors $\hat{i}, \hat{j}$ land in this transformation.

Observe that a dot product with a vector $\vec{u}$:

\[\begin{bmatrix} u_x \\ u_y \end{bmatrix} \cdot \begin{bmatrix} x \\ y \end{bmatrix} = u_x \cdot x + u_y \cdot y\]

is equivalent to the matrix-vector product, or the length of the vector’s projection onto $\vec{u}$, scaled by the length of $\vec{u}$:

\[\begin{bmatrix} u_x & u_y \\ \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = u_x \cdot x + u_y \cdot y\]
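In NumPy (example numbers mine), the dot product and the $(1, 2)$ matrix-vector product give the same result:

```python
import numpy as np

u = np.array([2.0, 3.0])
x = np.array([4.0, 5.0])

as_dot_product = np.dot(u, x)              # 2*4 + 3*5 = 23
as_matrix_product = u.reshape(1, 2) @ x    # u viewed as a (1, 2) transformation
print(as_dot_product, as_matrix_product)   # 23.0 [23.]
```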

In different applications, it may be informative to think about vectors as the embodiment of a linear transformation.

# Cross products

The cross product is a binary operation on two 3D vectors, $\vec{v} \times \vec{w}$, producing a vector perpendicular to both $\vec{v}$ and $\vec{w}$. The output vector has a length equal to the area of the parallelogram enclosed by lining up copies of $\vec{v}$ and $\vec{w}$ tail to tip.

As with the determinant, order matters, and the sign of the cross product depends on the orientation. We can use the right-hand rule to determine if the orientation is positive or negative.

The numeric computation can be formulated as follows: set the standard basis vectors $[\hat{i}, \hat{j}, \hat{k}]$ to the first column of a $(3, 3)$ matrix, and set $\vec{v}$ and $\vec{w}$ to be the second and third columns. The cross product of $\vec{v}$ and $\vec{w}$ is the determinant of this matrix:

\[\vec{v} \times \vec{w} = \det \left( \begin{bmatrix} \hat{i} & v_1 & w_1 \\ \hat{j} & v_2 & w_2 \\ \hat{k} & v_3 & w_3 \end{bmatrix} \right) = \hat{i}(v_2 w_3 - v_3 w_2) + \hat{j}(v_3 w_1 - v_1 w_3) + \hat{k}(v_1 w_2 - v_2 w_1)\]

Or in simpler terms:

\[\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} \times \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} v_2 w_3 - v_3 w_2 \\ v_3 w_1 - v_1 w_3 \\ v_1 w_2 - v_2 w_1 \end{bmatrix}\]
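A quick NumPy check with the standard basis vectors:

```python
import numpy as np

v = np.array([1.0, 0.0, 0.0])   # i-hat
w = np.array([0.0, 1.0, 0.0])   # j-hat

p = np.cross(v, w)
print(p)                           # [0. 0. 1.]: k-hat, perpendicular to both
print(np.dot(p, v), np.dot(p, w))  # 0.0 0.0
print(np.linalg.norm(p))           # 1.0: the area of the unit square they span
```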

An alternate name for the cross product is the “vector product”, which differentiates it from the dot product or “scalar product”.

The closer two vectors are in orientation, the shorter their cross product; the closer they are to perpendicular, the longer it is. This follows from the fact that a skinny parallelogram has a much smaller area than one whose sides are perpendicular.

# Cross products in the light of linear transformations

Reframing the cross product’s formula as a linear transformation can connect the dots to its geometric interpretation.

Fix input vectors $\vec{v}$ and $\vec{w}$ in 3D space. Place a third variable vector $\vec{u}$ anchored at the origin. Consider the parallelepiped that $\vec{u}$ forms along with $\vec{v}$ and $\vec{w}$.

Define $f(\vec{u})$ to be a function which returns the volume of this parallelepiped as a function of $\vec{u}$. This function maps 3D space to 1D space. It is also linear, as the volume of the parallelepiped is determined by the formula

\[V = (\text{area of the base}) \cdot (\text{height})\]

and only the dimension of height varies as a function of $\vec{u}$.

Thus, there exists a $(1, 3)$ matrix that represents this function as a transformation of $\vec{u}$. This matrix has a dual vector, which we can call $\vec{p}$. So we can redefine the function $f(\vec{u})$, from a matrix-vector product to the dot product $\vec{p} \cdot \vec{u}$.

This vector $\vec{p}$ is the cross product. $\vec{p}$ must have the property that for any vector $\vec{u}$, the dot product $\vec{p} \cdot \vec{u}$ is the volume of the resulting parallelepiped between $\vec{u}$, $\vec{v}$, and $\vec{w}$.

The geometric interpretation: the dot product $\vec{p} \cdot \vec{u}$ is the projection of $\vec{u}$ onto $\vec{p}$, multiplied by the length of $\vec{p}$. If the length of $\vec{p}$ is set to the area of the base (the parallelogram formed by $\vec{v}$ and $\vec{w}$) and $\vec{p}$ is oriented to be perpendicular to $\vec{v}$ and $\vec{w}$ such that the right-hand-rule is satisfied, we can see that the dot product formula matches the previous formula for volume:

\[(\text{area of the base}) \cdot (\text{height}) = (\lVert \vec{p} \rVert) \cdot (\lVert \vec{u} \rVert \cos \theta)\]

Beautiful!
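A NumPy sketch (vectors mine) verifying that $\vec{p} \cdot \vec{u}$ matches the determinant formula for the volume of the parallelepiped:

```python
import numpy as np

v = np.array([1.0, 0.0, 2.0])
w = np.array([0.0, 3.0, 1.0])
u = np.array([2.0, 1.0, 1.0])

p = np.cross(v, w)                         # the dual vector from above
volume_via_dot = np.dot(p, u)
volume_via_det = np.linalg.det(np.column_stack([u, v, w]))
print(np.isclose(volume_via_dot, volume_via_det))  # True
```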

# Cramer’s rule

Cramer’s rule is a method to solve for just one of the variables of a linear system of equations $A\vec{x} = \vec{v}$, which doesn’t require the inverse matrix $A^{-1}$.

A linear transformation that preserves the output of dot products is orthonormal:

\[T(\vec{v}) \cdot T(\vec{w}) = \vec{v} \cdot \vec{w}, \; \forall \vec{v}, \vec{w}\]

Geometrically, the intuition is that a signed area (parallelogram) or volume (parallelepiped) built from the unknown vector $\vec{x}$ and the basis vectors is scaled by $\det(A)$ under the transformation. Comparing the transformed area to the untransformed one recovers each coordinate: $x_i = \det(A_i) / \det(A)$, where $A_i$ is $A$ with its $i$-th column replaced by $\vec{v}$.
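A minimal sketch of the 2D case in NumPy (the helper name and example system are mine, and it assumes $\det(A) \neq 0$):

```python
import numpy as np

def cramer_2d(A, v):
    """Solve A x = v for a 2x2 system via Cramer's rule."""
    det_A = np.linalg.det(A)     # assumed nonzero
    A_x = A.copy()
    A_x[:, 0] = v                # first column replaced by v
    A_y = A.copy()
    A_y[:, 1] = v                # second column replaced by v
    return np.array([np.linalg.det(A_x), np.linalg.det(A_y)]) / det_A

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([3.0, 5.0])
print(np.allclose(cramer_2d(A, v), np.linalg.solve(A, v)))  # True
```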

# Change of basis

A change of basis is a transformation from one coordinate system to another.

Consider the matrix $A$ defined in our standard coordinate system, with columns equal to a 2nd coordinate system’s basis vectors. For example, with our basis vectors $\hat{i} = [1, 0]$ and $\hat{j} = [0, 1]$, and another system’s basis vectors $\hat{i}_2 = [2, 1]$ and $\hat{j}_2 = [-1, 1]$, then

\[A = \begin{bmatrix} 2 & -1 \\ 1 & 1 \\ \end{bmatrix}\]

where $A$ will transform any vector defined in the 2nd coordinate system into the same vector defined in our standard coordinate system.

Conversely, to transform a vector from our coordinate system into the 2nd coordinate system, we apply the transformation corresponding to $A^{-1}$.

To transform a matrix defined in our coordinate system to another system, we can perform a similar operation. Consider a $90^\circ$ counter-clockwise rotation defined in our language:

\[M = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ \end{bmatrix}\]

To apply the transformation in the 2nd coordinate system, we can first change the basis to our coordinate system, apply the transformation in our language, and then “undo” the change of basis: $A^{-1} M A = M_2$.
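Checking this with NumPy, using the basis and rotation matrices above:

```python
import numpy as np

A = np.array([[2.0, -1.0],     # columns: the 2nd system's basis vectors, in our coordinates
              [1.0,  1.0]])
M = np.array([[0.0, -1.0],     # 90-degree counter-clockwise rotation, in our language
              [1.0,  0.0]])

M2 = np.linalg.inv(A) @ M @ A  # the same rotation, in the 2nd system's language

v2 = np.array([1.0, 2.0])      # an arbitrary vector written in the 2nd system's coordinates
# Rotating in their language and then converting to our coordinates matches
# converting to our coordinates first and rotating in our language.
print(np.allclose(A @ (M2 @ v2), M @ (A @ v2)))  # True
```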

# Eigenvectors and eigenvalues

An eigenvector of a matrix is a nonzero vector that changes by a scalar factor when the transformation represented by the matrix is applied. In other words, it remains on its original span after the transformation.

The eigenvalue corresponding to an eigenvector is the scalar factor which the eigenvector is scaled by when the linear transformation is applied.

Symbolically, eigenvectors are often represented with the following equation:

\[A\vec{v} = \lambda\vec{v}\]

where $A$ is the matrix representing some transformation, $\vec{v}$ is the eigenvector, and $\lambda$ is the scalar eigenvalue for that eigenvector. To find the eigenvectors, solve for $\vec{v}$ and $\lambda$. This is done by rearranging the equation to be:

\[(A - \lambda I) \vec{v} = \vec{0}\]

and searching for a nonzero solution for $\vec{v}$.

The only way a matrix-vector product can result in the zero vector with a nonzero $\vec{v}$ is if the determinant of that matrix is $0$ (i.e. it squishes space into a lower dimension). So the eigenvalues are the values of $\lambda$ for which $\det(A - \lambda I) = 0$.
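NumPy can find eigenvalues and eigenvectors directly; here's a quick check (matrix mine) that $A\vec{v} = \lambda\vec{v}$ holds for each pair:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors come back as columns
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))         # True, True
```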

One useful application of eigenvectors is 3D rotations. An eigenvector of a 3D rotation is the axis of rotation (its eigenvalue is $1$, since a rotation doesn't stretch anything). This provides a much more interpretable representation of the rotation: an axis, plus an angle of rotation around it.

A rigid transformation, or isometry, is a transformation that preserves length. Reflections, translations, rotations, and their combinations are all rigid transformations.

Another useful application is matrix exponentiation, i.e. applying a linear transformation $n$ times.

A diagonal matrix is a matrix with zeros everywhere other than the diagonal. For such a matrix, the values along the diagonal are the eigenvalues.

Raising a diagonal matrix to the $n$th power corresponds to applying a single transformation with the eigenvalues each raised to the $n$th power. This is much more computationally efficient than computing the sum-of-products matrix multiplication $n$ times.

For a non-diagonal matrix $A$, you still might be able to take advantage of this property by changing the basis of $A$ to its eigenbasis: the basis whose vectors are the eigenvectors of $A$ (this works whenever the eigenvectors span the full space). In the eigenbasis the transformation is diagonal, so $A^n = C D^n C^{-1}$, where the columns of $C$ are the eigenvectors and $D$ is the diagonal matrix of eigenvalues.
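A NumPy sketch of this diagonalization trick (matrix mine), checked against direct matrix exponentiation:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigenvalues, C = np.linalg.eig(A)   # columns of C form the eigenbasis
n = 5

# In the eigenbasis, A is diagonal, so the nth power just raises each
# eigenvalue to the nth power; then change basis back.
A_to_the_n = C @ np.diag(eigenvalues ** n) @ np.linalg.inv(C)
print(np.allclose(A_to_the_n, np.linalg.matrix_power(A, n)))  # True
```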

# Abstract vector spaces

A more formal definition of linearity: linear transformations preserve additivity and scaling.

In modern linear algebra, there are 8 axioms that any set must satisfy to qualify as a vector space. These ensure that the operations of vector addition and scalar multiplication behave as expected.

All of the core concepts in linear algebra, such as determinants and eigenvectors, freely generalize from the typical notion of coordinate systems. They also apply to many other areas of math that share properties of linearity.

Functions, for example, are a more abstract building block that maintains the core properties: two functions can be added together, $(f + g)(x) = f(x) + g(x)$, and a function can be scaled by a number, $(c f)(x) = c \, f(x)$.

And the concepts from this series correspond to analogous ideas as applied to functions: linear transformations become linear operators (such as the derivative), dot products become inner products, and eigenvectors become eigenfunctions.