Contents
  1. Eigendecomposition
  2. Determinants
  3. Transformation Matrices
  4. Set Theory and Probability Foundations
  5. Expectation and Variance
  6. Covariance and Correlation
  7. Joint, Marginal, and Conditional

Linear Algebra: Eigendecomposition, Transforms, and Probability Foundations

Eigendecomposition diagonalises a matrix by expressing it in terms of its eigenvectors and eigenvalues. Combined with probability foundations (expectation, variance, covariance), these form the mathematical core of most ML algorithms.

Eigendecomposition

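For a diagonalisable square matrix $A \in \mathbb{R}^{n \times n}$, the eigendecomposition is:

$$A = V D V^{-1}$$

where $D$ is a diagonal matrix of eigenvalues and $V$ is the matrix with the corresponding eigenvectors as columns. If $A$ is also symmetric, $V$ can be chosen orthogonal, so that

$$A = V D V^T, \quad V^T = V^{-1}, \quad D = V^T A V.$$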
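As a minimal sketch, the decomposition can be checked by hand for a symmetric 2×2 matrix, where a single rotation angle is enough to diagonalise it. The function names `eig_sym_2x2` and `reconstruct` and the example matrix are illustrative assumptions, not part of the original post; only the standard library is used.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigendecomposition of the symmetric matrix [[a, b], [b, d]].

    Returns (eigenvalues, V), where the columns of V are orthonormal
    eigenvectors, so that A = V D V^T with D = diag(eigenvalues).
    """
    # Rotation angle that zeroes the off-diagonal entry of V^T A V.
    theta = 0.5 * math.atan2(2 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    V = [[c, -s], [s, c]]
    # Diagonal entries of D = V^T A V.
    lam1 = a * c * c + 2 * b * c * s + d * s * s
    lam2 = a * s * s - 2 * b * c * s + d * c * c
    return (lam1, lam2), V

def reconstruct(eigvals, V):
    """Return V D V^T as a nested list."""
    n = len(eigvals)
    return [[sum(V[i][k] * eigvals[k] * V[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]

eigvals, V = eig_sym_2x2(2.0, 1.0, 2.0)   # A = [[2, 1], [1, 2]]
A_rebuilt = reconstruct(eigvals, V)
print(eigvals)    # approximately (3.0, 1.0)
print(A_rebuilt)  # approximately [[2.0, 1.0], [1.0, 2.0]]
```

For larger matrices a numerical library routine would normally be used instead; the closed form here only serves to show that $V^T A V$ is diagonal and that $V D V^T$ recovers $A$.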

$A$ is then similar to $D$. In general, two matrices $A$ and $B$ are similar if $A = M^{-1} B M$ for some invertible $M$ (a similarity transform); similar matrices represent the same linear transformation in different bases.

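Equivalent conditions for diagonalisability of an $n \times n$ matrix $A$ (over a field containing all its eigenvalues):

  1. $A$ is similar to a diagonal matrix $D$.
  2. The algebraic multiplicity (AM) of each eigenvalue $\lambda_i$ equals its geometric multiplicity (GM).
  3. $\sum_i \text{GM}(\lambda_i) = n$ (the geometric multiplicities of the distinct eigenvalues sum to the matrix dimension).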
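
A worked example may make conditions 2 and 3 concrete. The matrix below is an illustrative choice, not taken from the original post:

$$A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad \det(A - \lambda I) = (1 - \lambda)^2$$

The only eigenvalue is $\lambda = 1$, with algebraic multiplicity 2. Its eigenspace is the null space of

$$A - I = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},$$

which has rank 1 and therefore dimension $2 - 1 = 1$. So $\text{GM}(1) = 1 < 2 = \text{AM}(1)$, the geometric multiplicities sum to 1 rather than $n = 2$, and this $A$ is not diagonalisable.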

Positive semidefinite (symmetric $A$): $\langle Ax, x \rangle \geq 0$ for all $x$ $\Leftrightarrow$ $\lambda_i \geq 0$ for all eigenvalues.

Positive definite (symmetric $A$): $\langle Ax, x \rangle > 0$ for all $x \neq 0$ $\Leftrightarrow$ $\lambda_i > 0$ for all eigenvalues.

For symmetric matrices $A \in \mathbb{R}^{n \times n}$: eigenvalues are real-valued and the eigenvectors can be chosen to form an orthonormal basis. For any real matrix $B \in \mathbb{R}^{m \times n}$, $B^T B$ is symmetric and positive semidefinite, and it is positive definite if $\text{rank}(B) = n$.
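
The quadratic form $\langle Ax, x \rangle$ used in the definiteness conditions can be probed numerically. The sketch below is only an illustration under stated assumptions: the matrix `B`, the helper names `quadratic_form` and `gram`, and the random sampling are all hypothetical choices. It uses $G = B^T B$ because that matrix is symmetric and its quadratic form equals $\|Bx\|^2$, which cannot be negative.

```python
import random

def quadratic_form(A, x):
    """Return <Ax, x> for a square matrix A (list of rows) and a vector x."""
    n = len(x)
    return sum(A[i][j] * x[j] * x[i] for i in range(n) for j in range(n))

def gram(B):
    """Return B^T B for a matrix B given as a list of rows."""
    m, n = len(B), len(B[0])
    return [[sum(B[k][i] * B[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]

B = [[1, 2], [0, 1], [1, 0]]   # 3x2 matrix with rank 2
G = gram(B)                    # symmetric 2x2 matrix

# Sample random vectors and record the smallest value of <Gx, x>.
rng = random.Random(0)
samples = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(1000)]
min_value = min(quadratic_form(G, x) for x in samples)
print(G, min_value > 0)
```

Sampling cannot prove definiteness; it can only fail to find a counterexample. Checking the signs of the eigenvalues is the conclusive test.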

Determinants

Key properties:

$$\det(AB) = \det(A) \cdot \det(B)$$

$$\det(A^T) = \det(A)$$

$$\det(A^{-1}) = \frac{1}{\det(A)} = \det(A)^{-1}$$

Adding a multiple of one row/column to another does not change the determinant. For $A \in \mathbb{R}^{n \times n}$, $\det(\lambda A) = \lambda^n \det(A)$. Exchanging two rows/columns changes the sign.
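
These properties can be checked on small examples. The following is a sketch rather than a production routine: `det`, `matmul`, `transpose`, and the two example matrices are assumptions introduced for illustration. It computes the determinant by Gaussian elimination with exact fractions, which relies on the row-operation rules above.

```python
from fractions import Fraction

def det(M):
    """Determinant via Gaussian elimination with exact rational arithmetic."""
    A = [[Fraction(v) for v in row] for row in M]
    n = len(A)
    sign = 1
    result = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            sign = -sign  # exchanging two rows flips the sign
        result *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            # Subtracting a multiple of the pivot row leaves det unchanged.
            A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return sign * result

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
B = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]

print(det(A), det(B))                              # 18 -5
print(det(matmul(A, B)) == det(A) * det(B))        # True
print(det(transpose(A)) == det(A))                 # True
print(det([[3 * v for v in row] for row in A]) == 3 ** 3 * det(A))  # True
print(det([A[1], A[0], A[2]]) == -det(A))          # True
```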

Transformation Matrices

For a linear mapping $\hat{\phi}: V \to W$ with bases $B = (b_1, \ldots, b_n)$ for $V$ and $C = (c_1, \ldots, c_m)$ for $W$:

$$\hat{\phi}(b_j) = a_{1j} c_1 + \ldots + a_{mj} c_m$$

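The transformation matrix $A \in \mathbb{R}^{m \times n}$ has entries $a_{ij}$: its $j$-th column holds the coordinates of $\hat{\phi}(b_j)$ with respect to the target basis $C$.

The change-of-basis formula: $\hat{A}_0 = T^{-1} A_0 S$, where $A_0$ is the transformation matrix with respect to $B$ and $C$, $\hat{A}_0$ is the one with respect to the new bases, and $S$ and $T$ are the change-of-basis matrices. $S$ converts coordinates in the new basis of $V$ into coordinates in $B$, and $T$ does the same for the new basis of $W$ and $C$.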
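
A small numeric sketch of the change-of-basis formula follows. Everything concrete in it is an assumption made for illustration: the matrices `A0`, `S`, and `T`, the choice $W = V = \mathbb{R}^2$, and the helpers `matmul` and `inv2`. The new basis is chosen so that the transformed matrix comes out diagonal.

```python
from fractions import Fraction

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inv2(M):
    """Inverse of a 2x2 matrix using exact rational arithmetic."""
    (a, b), (c, d) = [[Fraction(v) for v in row] for row in M]
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

A0 = [[2, 1], [1, 2]]    # mapping expressed in the original bases
S = [[1, 1], [1, -1]]    # columns: new basis vectors of V in old coordinates
T = S                    # same change of basis in W (here W = V)

# Transformation matrix with respect to the new bases: T^{-1} A0 S.
A_new = matmul(inv2(T), matmul(A0, S))
print(A_new == [[3, 0], [0, 1]])  # True
```

Because the columns of `S` happen to be eigenvectors of `A0`, the result is the diagonal matrix of eigenvalues, which connects the change-of-basis formula back to the similarity transform in the eigendecomposition section.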

Set Theory and Probability Foundations

De Morgan’s laws: $\overline{A \cup B} = \bar{A} \cap \bar{B}, \quad \overline{A \cap B} = \bar{A} \cup \bar{B}$
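
Both laws can be confirmed on finite sets. The universe and the sets `A` and `B` below are arbitrary illustrative choices, and `complement` is a hypothetical helper defined relative to that universe.

```python
universe = set(range(10))
A = {1, 2, 3}
B = {3, 4, 5}

def complement(s):
    """Complement relative to the finite universe defined above."""
    return universe - s

law_union = complement(A | B) == complement(A) & complement(B)
law_intersection = complement(A & B) == complement(A) | complement(B)
print(law_union, law_intersection)  # True True
```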

Mutually exclusive events: $P(A_1) + P(A_2) = P(A_1 \cup A_2)$.

Total probability: $P(A) = \sum_{i=1}^{n} P(B = b_i) \cdot P(A \mid B = b_i)$, where $b_1, \ldots, b_n$ are the mutually exclusive, exhaustive values of $B$.
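
As a hedged illustration of the total probability formula, suppose $B$ is the machine that produced an item and $A$ is the event that the item is defective. The machine names and all probabilities are made up for this sketch, and `total_probability` is a hypothetical helper.

```python
from fractions import Fraction

# Hypothetical numbers: share of production per machine, and defect rate per machine.
p_b = {"m1": Fraction(1, 2), "m2": Fraction(3, 10), "m3": Fraction(1, 5)}
p_a_given_b = {"m1": Fraction(1, 100), "m2": Fraction(2, 100), "m3": Fraction(3, 100)}

def total_probability(p_b, p_a_given_b):
    """P(A) = sum_i P(B = b_i) * P(A | B = b_i) over a partition b_1..b_n."""
    if sum(p_b.values()) != 1:
        raise ValueError("P(B = b_i) must sum to 1 over the partition")
    return sum(p_b[b] * p_a_given_b[b] for b in p_b)

p_a = total_probability(p_b, p_a_given_b)
print(p_a)  # 17/1000
```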

Expectation and Variance

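Expectation (discrete): $E(X) = \sum_x x \, p(x)$. Expectation is linear, so $E\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(x_i)$ always holds; if the $x_i$ are independent, also $E\left(\prod_{i=1}^{n} x_i\right) = \prod_{i=1}^{n} E(x_i)$.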

Variance:

$$\sigma^2 = E(X^2) - E(X)^2$$

$$\text{Var}\left(\sum x_i\right) = \sum \text{Var}(x_i) \quad \text{(if independent)}$$

$$\text{Var}(X + S) = \text{Var}(X) + \text{Var}(S) + 2\,\text{Cov}(X, S)$$
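
The variance identities can be verified exactly on a small data set. The sketch below treats four equally weighted observations as the whole population; the values of `X` and `S` and the helper names `mean`, `cov`, and `var` are illustrative assumptions.

```python
from fractions import Fraction

def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Population covariance of two equally weighted samples."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

def var(x):
    return cov(x, x)

X = [Fraction(v) for v in (1, 2, 3, 4)]
S = [Fraction(v) for v in (2, 1, 4, 3)]
total = [a + b for a, b in zip(X, S)]

lhs = var(total)                               # Var(X + S)
rhs = var(X) + var(S) + 2 * cov(X, S)          # right-hand side of the identity
shortcut = mean([v * v for v in X]) - mean(X) ** 2   # E(X^2) - E(X)^2

print(lhs, rhs)          # 4 4
print(var(X), shortcut)  # 5/4 5/4
```

Here `X` and `S` are positively correlated, so dropping the covariance term would understate the variance of the sum.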

Covariance and Correlation

$$\text{Cov}(X, X) = \text{Var}(X)$$

$$\text{Cov}(X, S) = \text{Cov}(S, X) \quad \text{(symmetric)}$$

$$|\text{Cov}(X, Y)| \leq \sqrt{\text{Var}(X) \cdot \text{Var}(Y)} = \text{sd}(X)\,\text{sd}(Y)$$

For linear transforms: $Y = aX + b \Rightarrow \text{Cov}(X, Y) = a \cdot \text{Var}(X)$.

Correlation: $\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \cdot \text{Var}(Y)}}$
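
To see the linear-transform rule and the correlation formula together, the following sketch builds $Y = aX + b$ from a small sample and computes both quantities. The data, the coefficients `a` and `b`, and the helper names are assumptions chosen for illustration; population (not sample) formulas are used.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Population covariance of two equally weighted samples."""
    mx, my = mean(x), mean(y)
    return mean([(p - mx) * (q - my) for p, q in zip(x, y)])

def corr(x, y):
    """Correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

x = [1.0, 2.0, 3.0, 4.0]
a, b = -2.0, 5.0
y = [a * v + b for v in x]

print(cov(x, y), a * cov(x, x))  # both are a * Var(X)
print(corr(x, y))                # -1 up to rounding, since a < 0
```

An exact linear relationship gives a correlation of $+1$ or $-1$ depending on the sign of $a$, which is the equality case of the bound $|\text{Cov}(X, Y)| \leq \text{sd}(X)\,\text{sd}(Y)$.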

Joint, Marginal, and Conditional

Joint distribution: $p(X = x, Y = y) = p(x, y)$.

Marginal: $p(X = x) = \sum_y p(X = x, Y = y)$.

Conditional: $p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$, where $p(Y) > 0$ is the marginal probability of $Y$.

Chain rule: $p(x_1, \ldots, x_n) = p(x_n \mid x_1, \ldots, x_{n-1}) \cdot p(x_1, \ldots, x_{n-1}) = \ldots = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$

Independence: if $x_1, \ldots, x_n$ are mutually independent, $p(x_1, \ldots, x_n) = p(x_1) \cdot p(x_2) \cdots p(x_n)$.
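
The joint, marginal, and conditional relationships can be traced on a small table. This is a sketch under stated assumptions: the joint probabilities are invented, and `marginal`, `conditional_x_given_y`, and `is_independent` are hypothetical helper names. The conditional is computed by dividing the joint by the marginal of the conditioning variable.

```python
from fractions import Fraction

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {
    (0, 0): Fraction(3, 10), (0, 1): Fraction(1, 5),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

def marginal(joint, axis):
    """Sum the joint table over the other variable (axis 0 -> X, axis 1 -> Y)."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0) + p
    return out

def conditional_x_given_y(joint, y):
    """p(X = x | Y = y) = p(x, y) / p(y), assuming p(y) > 0."""
    p_y = marginal(joint, 1)[y]
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

def is_independent(joint):
    """True if p(x, y) = p(x) * p(y) for every entry of the table."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return all(p == p_x[x] * p_y[y] for (x, y), p in joint.items())

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
p_x_given_y0 = conditional_x_given_y(joint, 0)

print(p_x[0], p_y[0])          # 1/2 2/5
print(p_x_given_y0[0])         # 3/4
print(is_independent(joint))   # False
```

Multiplying `p_x_given_y0[x]` by `p_y[0]` gives back `joint[(x, 0)]`, which is the two-variable case of the chain rule.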

These relationships appear throughout ML: in Bayesian inference (Bayes’ theorem follows directly from the conditional formula), in graphical models (chain rule and independence), and in dimensionality reduction (covariance structure).
