Regression
graph BT;
graph BT;
GENRLS[Generalized Elasticnetregularized least squares];
ENRLS[Elasticnetregularized least squares]>GENRLS;
LASSO[Lasso regression]>ENRLS;
GRLS[Generalized Tikhonovregularized least squares]>GENRLS;
LR[Lavrentyevregularized regularization]>GRLS;
RLS[Tikhonovregularized least squares]>ENRLS;
RLS>GRLS;
GLS[Generalized Least Squares]>RLS;
WLS[Weighted Least Squares]>GLS;
OLS[Ordinary Least Squares]>WLS;
Fahrplan:
 Present regressions as unintepreted mathematical constructions.
 Present parameter estimation models and show how certain regressions solve them (GaussMarkovAitken theorem).
 Present forecasting models and show how to solve them.
https://ryxcommar.com/2019/07/14/onmovingfromstatisticstomachinelearningthefinalstageofgrief/
Linear regression
Definition. Given an input set $\mathcal X$ and an input vector $\vec x ∈ {\mathcal X}^n$ and a set of $m$ basis functions $f_i : \mathcal X → \R$, a linear regression model is the relation
$$ \vec y = X(\vec x) ⋅ \vec b + \vec e $$
where
$$ X(\vec x) ≜ \begin{bmatrix} f_0(x_0) & f_1(x_0) & f_2(x_0) & ⋯ & f_m(x_0) \\ f_0(x_1) & f_1(x_1) & f_2(x_1) & ⋯ & f_m(x_1) \\ f_0(x_2) & f_1(x_2) & f_2(x_2) & ⋯ & f_m(x_2) \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ f_0(x_n) & f_1(x_n) & f_2(x_n) & ⋯ & f_m(x_n) \\ \end{bmatrix} $$
Note. The generalization where the output space is higher dimensional $\vec y ∈ \R^{n \times d}$, $f_i : \mathcal X → \R^d$ is really just a special case where we treat each component as a separate.
Note. The matrix $X$ is called the design matrix, model matrix or regressor matrix. The vector $\vec y$ is called the response vector or dependent variable, $\vec b$ is called the parameter vector, $\vec e$ is called the error vector and $\vec x$ is called the independent variable, predictor variable, regressor, covariate, explanatory variable, control variable and in the context exposure variable, risk factor, feature or input variable. (See this overview).
Question. Can this be generalized from $\R$ to $\C$ or arbitrary vector spaces? Are there even normed vector spaces not $\R$ or $\C$?
Simple least squares is the special case with $\mathcal X = \R$, $m = 2$, $f_0(x) = 1$ and $f_1(x) = x$.
Ordinary least squares is the special case with $\mathcal X = \R^p$, $m = p + 1$, $f_i(\vec x) = x_i$ and $f_p(\vec x) = 1$. In rare cases the $f_p$ can be left out.
Polynomial regression is the special case with $\mathcal X = \R$ and $f_i(x) = x^i$. In this case the design matrix is a Vandermonde matrix.
Note. In ordinary least squares it is common to include a constant intercept by setting $x_{0i} = 1$. Equivalently, a basis function $f_{p}(\vec x) = 1$ can be used. In either case, the coefficient associated with the constant function is called the intercept.
To do. More special cases of the design matrix: ANOVA, ANCOVA, MANCOVA, Linear regression, polynomial regression,
To do. Goodness of fit metrics such as $R^2$, $\bar{R}^2$, Loglikelihood, DurbinWatson statistic, Akaike criterion, Schwarz criterion, Ftest, Wilks's Lambda. (See here).
To do. Discuss other methods like total least squares, Errorsinvariables models, modelling uncertainty in $\vec x$.
To do. Discuss other goals like least absolute deviations, median absolute deviation, relative mean absolute difference, mean absolute deviation, minimum maximal deviation, Lasso regression, Basis pursuit denoising. See also here. Iteratively reweighted least squares can solve goals that are $p$norms.
Generalized Tikhonovregularized least squares
Definition. Given a linear regression model, parameter mean ${\vec b}_0 ∈ \R^m$, and a covariance matrices $K_e ∈ \R^{n × n}$ and $K_b ∈ \R^{m × m}$, the Generalized Tikhonovregularized least squares estimator $\hat{\vec b}$ is the value of $\vec b$ that minimizes the combined Mahalanobis norm of the residual vector $\vec e$ and the parameter vector $\vec b$:
$$ \hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec y  X ⋅ \vec b}_{K_e}^2 + \norm{\vec b  {\vec b}_0}_{K_b}^2 $$
Note. The special case where $K_b = \lambda I$ is the (nongeneralized) Tikhonov regression and also know as Ridge regression.
In statistics, the method is known as ridge regression, in machine learning it is known as weight decay, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for nonlinear leastsquares problems.
Note. The special case where $K_e = X$ is the Lavrentyev regularization.
Note. The special case where $K_b = 0$ is the generalized least squares. The generalized least squares is unbiased in $b$, but can illconditioned.
Note. The special case where $K_b = 0$ and $K_e = \operatorname{diag}(\vec w)$ is weighted least squares.
Note. The special case where $K_b = 0$ and $K_e = I$ is ordinary least squares.
Note. The matrices $X^T ⋅ \vec y$ and $X^T ⋅ X$ are moment matrix.
Theorem. The generalized Tikhonovregularized least squares estimator has a closed form solution
$$ \hat{\vec b} = \p{X^T ⋅ {K_e}^{1} ⋅ X + K_{b}}^{1} ⋅ \p{X^T ⋅ {K_e}^{1} ⋅ \vec y + K_b ⋅ {\vec b}_0} $$
To do. Proof by derivation.
To do. Derive expected value and variance of $\hat{\vec b}$ like here.
Algorithm. While the closed form solution can be used directly, it is numerically unstable. A more stable algorithm is the following: First, the problem is reduced to a generalized least squares regression
$$ \begin{aligned} \hat{\vec b} &= \arg\min_{\vec b} \norm{ \begin{bmatrix} \vec y \\ {\vec b}_0 \end{bmatrix}  \begin{bmatrix} X\p{\vec x} \\ I \end{bmatrix} ⋅ \vec b }_K & K &≜ \begin{bmatrix} K_e & 0 \\ 0 & K_b \end{bmatrix} \end{aligned} $$
which is then reduced to an ordinary least squares using a whitening transform with $W^*W = K^{1}$
$$ \hat{\vec b} ≜ \arg\min_{\vec b} \norm{ W ⋅ \begin{bmatrix} \vec y \\ {\vec b}_0 \end{bmatrix}  W ⋅ \begin{bmatrix} X\p{\vec x} \\ I \end{bmatrix} ⋅ \vec b } $$
Finally, this is solved using the MoorePenrose inverse computed using SVD.
def least_squares(y, X, K_e=None, K_b=None, b_0=None):
n, m = X.shape
assert x.shape == (n,)
assert y.shape == (n,)
if K_e is None: # Ordinary least squares
K_e = np.eye(n)
if K_e.shape == (n,): # Weighted least squares
K_e = np.diag(K_e)
assert K_e.shape == (n,n)
if not K_b is None:
if np.isscalar(K_b):
K_b = K_b * np.ones(m)
if K_b.shape == (m,):
K_b = np.diag(K_b)
assert K_b.shape == (m,m)
if b_0 is None:
b_0 = np.zeros(m)
assert b_0.shape == (m,)
# Reduce Tikhonovregularization to generalized least squares
y = np.concatenate([y, b_0])
X = np.block([[X],[np.eye(m)]])
K_e = np.block([
[K_e, np.zeros((n, m))],
[np.zeros((m, n)), K_b]
])
# Reduce generalized to ordinary least squares
W = whitening_transform(K_e)
y = np.matmul(W, y)
X = np.matmul(W, X)
# Solve OLS using MoorePenrose inverse
b = np.matmul(np.linalg.pinv(X), y)
return b
Least Angle Regression
Definition. Given a linear regression model, parameter mean ${\vec b}_0 ∈ \R^m$, and a covariance matrices $K_e ∈ \R^{n × n}$ and $K_b ∈ \R^{m × m}$, the Generalized ThikononvLasso estimator $\hat{\vec b}$ is the value of $\vec b$ that minimizes the combined Mahalanobis norm of the residual vector $\vec e$ and the parameter vector $\vec b$:
$$ \hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec y  X ⋅ \vec b}_{K_e}^2 + \norm{\vec b  {\vec b}_0}_{K_b}^2 + \norm{\vec b  {\vec b}_0}_{1,K_b'} $$
https://en.wikipedia.org/wiki/Elastic_net_regularization
To do. The generalized Lasso problem and the LARS(Lasso) solution from Algorithm 3.2a in HTF09.
https://en.wikipedia.org/wiki/Leastangle_regression
To do. Equivalence to support vector machines and the efficient solving algorithms employed there.
GaussMarkovAitken Theorem
To do. This does not generalize to higher dimensions without the additional constraint that the estimator is unbiased. See https://en.wikipedia.org/wiki/Stein%27s_example https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator.
https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem#Generalized_least_squares_estimator
https://en.wikipedia.org/wiki/Best_linear_unbiased_prediction
https://en.wikipedia.org/wiki/Minimum_mean_square_error
To do. https://en.wikipedia.org/wiki/Bayesian_interpretation_of_kernel_regularization https://stats.stackexchange.com/questions/163388/whyisthel2regularizationequivalenttogaussianprior
To do. GeneralizedTikhonov has multinormal prior on $\hat{\vec b}$. Lasso has Laplacian prior.
General linear model
https://en.wikipedia.org/wiki/General_linear_model
$$ Y = X ⋅ B + U $$
Mixed model
https://en.wikipedia.org/wiki/Mixed_model
Nonlinear regression model
Definition. Given an input set $\mathcal X$, parameter set $\mathcal B$, input vector $\vec x ∈ {\mathcal X}^n$ and model function $f : \mathcal X \times \mathcal B → \R$, a nonlinear regression model is the relation
$$ y_i = f(x_i, b) + e_i $$
for some $b ∈ \mathcal B$.
Nonlinear least squares
Definition. Nonlinear least squares https://en.wikipedia.org/wiki/Nonlinear_least_squares
$$ \hat{\vec b} ≜ \arg\min_{b} \norm{\vec e} $$
No closed form solution. Can be solved using global optimizers. Depending on $f$ a gradient may be available.
Algorithm. One method for solving is the Levenberg–Marquardt algorithm. Special cases are the GaussNewton algorithm and gradient descent
To do. Extend to norms other than Euclidean.
Partial least squares
https://en.wikipedia.org/wiki/Partial_least_squares_regression
Principal component regression
https://en.wikipedia.org/wiki/Principal_component_analysis
https://en.wikipedia.org/wiki/Principal_component_regression
Rational function regression
Initiate the optimization problem using a good seed value provided by solving the linear model $p(x)  q(x) y$.
$$ \vec y = \p{P(\vec x) ⋅ \vec b_p} ⊘ \p{Q(\vec x)⋅ \vec b_q} + \vec e $$
where $⊘$ is Hadamard division.
$$ \p{Q ⋅ \vec b_q} ⊙ \vec y = P ⋅ \vec b_p + \p{Q ⋅ \vec b_q} ⊙ \vec e $$
$$ \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y = P ⋅ \vec b_p + \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec e $$
$$ \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec e = \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y  P ⋅ \vec b_p $$
$$ \vec e = \operatorname{diag} \p{Q ⋅ \vec b_q}^{1} ⋅ \p {\operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y  P ⋅ \vec b_p } $$
$$ \vec e = \vec y  \operatorname{diag} \p{Q ⋅ \vec b_q}^{1} ⋅ P ⋅ \vec b_p $$
$$ \hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec e}_Ω $$
Universal Kriging
http://www.kgs.ku.edu/Conferences/IAMG//Sessions/D/Papers/boogaart.pdf
https://www.nersc.no/sites/www.nersc.no/files/Basics2kriging.pdf
https://en.wikipedia.org/wiki/Regressionkriging
Special case: GaussianProcessRegression
https://people.cs.umass.edu/~wallach/talks/gp_intro.pdf
http://www.gaussianprocess.org/gpml/chapters/RW2.pdf
Rational Regression Kriging
Goal. Find the best fit RationalKriging model for a $\R^m → \R^n$ function given samples from $\R^m × \R^n$ .
$$ \vec F(\vec x) = \vec m(\vec x) + \vec \epsilon'(\vec x) + \vec \epsilon'' $$
Where $\vec m$ is rational function and $\epsilon'$ is a Gaussian process and $\vec \epsilon''$ are normally distributed residual errors.
The hyperparameters of the model are the numerator and denominator degree of the rational function, the covariance kernel of the Gaussian process and the variance of the residual errors.
https://en.wikipedia.org/wiki/Regressionkriging
https://en.wikipedia.org/wiki/Generalized_least_squares
Rational trend model
https://en.wikipedia.org/wiki/Polynomial_regression
https://en.wikipedia.org/wiki/Polynomial_and_rational_function_modeling
To do. What if we have uncertainty in $\vec x$ like we have in $\vec y$ through $\vec \epsilon''$?
To do. Can we add gradients to the input values like in Gradientenhanced kriging ?
https://halemse.ccsd.cnrs.fr/emse01525674/file/paperHAL.pdf
To do. The polynomial and rational Equioscillation Theorem/.
Gradient Enhanced Krigin
https://en.wikipedia.org/wiki/Gradientenhanced_kriging
Support Vector Machines
https://en.wikipedia.org/wiki/Support_vector_machine
Bayesian Optimization
https://www.youtube.com/watch?v=vz3D36VXefI
https://arxiv.org/abs/1807.02811
https://arxiv.org/abs/1009.5419
To do. https://en.wikipedia.org/wiki/Markov_decision_process
https://math.stackexchange.com/questions/924482/leastsquaresregressionmatrixforrationalfunctions
References
 Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009). “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”. Available online.
 Carl Edward Rasmussen & Christopher K. I. Williams (2006). “Gaussian Processes for Machine Learning”. Available online.
https://stats.stackexchange.com/questions/396914/whyiscomputingridgeregressionwithacholeskydecompositionmuchquickertha
https://www.analyticsvidhya.com/blog/2016/01/ridgelassoregressionpythoncompletetutorial/
https://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf

Philippe Rigollet (2016). “18.650 Statistics for Applications”. MIT OCW.