Regression
graph BT;
GENRLS[Generalized Elastic-net-regularized least squares];
ENRLS[Elastic-net-regularized least squares]-->GENRLS;
LASSO[Lasso regression]-->ENRLS;
GRLS[Generalized Tikhonov-regularized least squares]-->GENRLS;
LR[Lavrentyev regularization]-->GRLS;
RLS[Tikhonov-regularized least squares]-->ENRLS;
RLS-->GRLS;
GLS[Generalized Least Squares]-->RLS;
WLS[Weighted Least Squares]-->GLS;
OLS[Ordinary Least Squares]-->WLS;
Roadmap:
- Present regressions as uninterpreted mathematical constructions.
- Present parameter estimation models and show how certain regressions solve them (Gauss-Markov-Aitken theorem).
- Present forecasting models and show how to solve them.
https://ryxcommar.com/2019/07/14/on-moving-from-statistics-to-machine-learning-the-final-stage-of-grief/
Linear regression
Definition. Given an input set \mathcal X, an input vector \vec x ∈ {\mathcal X}^n and m basis functions f_i : \mathcal X → \R, a linear regression model is the relation
\vec y = X(\vec x) ⋅ \vec b + \vec e
where
X(\vec x) ≜ \begin{bmatrix} f_0(x_0) & f_1(x_0) & f_2(x_0) & ⋯ & f_{m-1}(x_0) \\ f_0(x_1) & f_1(x_1) & f_2(x_1) & ⋯ & f_{m-1}(x_1) \\ f_0(x_2) & f_1(x_2) & f_2(x_2) & ⋯ & f_{m-1}(x_2) \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ f_0(x_{n-1}) & f_1(x_{n-1}) & f_2(x_{n-1}) & ⋯ & f_{m-1}(x_{n-1}) \\ \end{bmatrix}
Note. The generalization where the output space is higher dimensional, Y ∈ \R^{n \times d} and f_i : \mathcal X → \R^d, reduces to the single-output case by treating each output component as a separate regression.
Note. The matrix X is called the design matrix, model matrix or regressor matrix. The vector \vec y is called the response vector or dependent variable, \vec b is called the parameter vector, \vec e is called the error vector and \vec x is called the independent variable, predictor variable, regressor, covariate, explanatory variable, control variable and, depending on context, exposure variable, risk factor, feature or input variable. (See this overview).
Question. Can this be generalized from \R to \C or arbitrary vector spaces? Are there even normed vector spaces over fields other than \R or \C?
Simple least squares is the special case with \mathcal X = \R, m = 2, f_0(x) = 1 and f_1(x) = x.
Ordinary least squares is the special case with \mathcal X = \R^p, m = p + 1, f_i(\vec x) = x_i and f_p(\vec x) = 1. In rare cases the f_p can be left out.
Polynomial regression is the special case with \mathcal X = \R and f_i(x) = x^i. In this case the design matrix is a Vandermonde matrix.
Note. In ordinary least squares it is common to include a constant intercept by setting x_{0i} = 1. Equivalently, a basis function f_{p}(\vec x) = 1 can be used. In either case, the coefficient associated with the constant function is called the intercept.
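As a small illustration of these special cases, the sketch below builds the ordinary least squares and polynomial design matrices with NumPy; the helper names are only for illustration.

import numpy as np

def ols_design(X_raw):
    # Ordinary least squares: f_i(x) = x_i plus a constant basis function (the intercept column).
    n = X_raw.shape[0]
    return np.hstack([X_raw, np.ones((n, 1))])

def polynomial_design(x, degree):
    # Polynomial regression: f_i(x) = x**i, i.e. a Vandermonde matrix.
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(0.0, 1.0, 5)
X = polynomial_design(x, 2)                          # columns: 1, x, x**2
b = np.linalg.lstsq(X, 1 + 2 * x, rcond=None)[0]     # recovers approximately [1, 2, 0]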
To do. More special cases of the design matrix: ANOVA, ANCOVA, MANCOVA, linear regression, polynomial regression.
To do. Goodness of fit metrics such as R^2, \bar{R}^2, Log-likelihood, Durbin-Watson statistic, Akaike criterion, Schwarz criterion, F-test, Wilks's Lambda. (See here).
To do. Discuss other methods like total least squares, Errors-in-variables models, modelling uncertainty in \vec x.
To do. Discuss other goals like least absolute deviations, median absolute deviation, relative mean absolute difference, mean absolute deviation, minimum maximal deviation, Lasso regression, Basis pursuit denoising. See also here. Iteratively reweighted least squares can solve goals that are p-norms.
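Regarding the last point, a minimal IRLS sketch for p-norm objectives; the helper name is illustrative and the residuals are clipped away from zero for stability, which is an implementation choice rather than part of the method.

import numpy as np

def irls(X, y, p=1.0, iterations=50, eps=1e-8):
    # Iteratively reweighted least squares for min_b ||y - X b||_p:
    # each step solves a weighted least squares problem with weights |r_i|^(p-2).
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iterations):
        r = np.abs(y - X @ b)
        w = np.maximum(r, eps) ** (p - 2)
        sw = np.sqrt(w)
        b = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return b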
Generalized Tikhonov-regularized least squares
Definition. Given a linear regression model, a parameter mean {\vec b}_0 ∈ \R^m, and covariance matrices K_e ∈ \R^{n × n} and K_b ∈ \R^{m × m}, the Generalized Tikhonov-regularized least squares estimator \hat{\vec b} is the value of \vec b that minimizes the combined Mahalanobis norm of the residual vector \vec e and the parameter vector \vec b:
\hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec y - X ⋅ \vec b}_{K_e}^2 + \norm{\vec b - {\vec b}_0}_{K_b}^2
Note. The special case where K_b = \lambda I is the (non-generalized) Tikhonov-regularized least squares, also known as ridge regression.
In statistics, the method is known as ridge regression, in machine learning it is known as weight decay, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.
Note. The special case where K_e = X is the Lavrentyev regularization.
Note. The special case where K_b = 0 is generalized least squares. Generalized least squares is unbiased in \vec b, but can be ill-conditioned.
Note. The special case where K_b = 0 and K_e = \operatorname{diag}(\vec w) is weighted least squares.
Note. The special case where K_b = 0 and K_e = I is ordinary least squares.
Note. X^T ⋅ X and X^T ⋅ \vec y are called moment matrices.
Theorem. The generalized Tikhonov-regularized least squares estimator has a closed form solution
\hat{\vec b} = \p{X^T ⋅ {K_e}^{-1} ⋅ X + K_{b}}^{-1} ⋅ \p{X^T ⋅ {K_e}^{-1} ⋅ \vec y + K_b ⋅ {\vec b}_0}
To do. Proof by derivation.
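A sketch of such a derivation, under the conventions that make the stated formula come out, namely \norm{\vec e}_{K_e}^2 = {\vec e}^T ⋅ {K_e}^{-1} ⋅ \vec e for the residual term and \norm{\vec b - {\vec b}_0}_{K_b}^2 = \p{\vec b - {\vec b}_0}^T ⋅ K_b ⋅ \p{\vec b - {\vec b}_0} for the penalty term: setting the gradient of the objective with respect to \vec b to zero gives
-2 X^T ⋅ {K_e}^{-1} ⋅ \p{\vec y - X ⋅ \vec b} + 2 K_b ⋅ \p{\vec b - {\vec b}_0} = 0
which rearranges to
\p{X^T ⋅ {K_e}^{-1} ⋅ X + K_b} ⋅ \hat{\vec b} = X^T ⋅ {K_e}^{-1} ⋅ \vec y + K_b ⋅ {\vec b}_0
and hence to the stated closed form whenever the left-hand matrix is invertible.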
To do. Derive expected value and variance of \hat{\vec b} like here.
Algorithm. While the closed form solution can be used directly, it is numerically unstable because forming X^T ⋅ {K_e}^{-1} ⋅ X squares the condition number of the problem. A more stable algorithm is the following: first, the problem is reduced to a generalized least squares regression
\begin{aligned} \hat{\vec b} &= \arg\min_{\vec b} \norm{ \begin{bmatrix} \vec y \\ {\vec b}_0 \end{bmatrix} - \begin{bmatrix} X\p{\vec x} \\ I \end{bmatrix} ⋅ \vec b }_K & K &≜ \begin{bmatrix} K_e & 0 \\ 0 & K_b \end{bmatrix} \end{aligned}
which is then reduced to an ordinary least squares using a whitening transform with W^*W = K^{-1}
\hat{\vec b} ≜ \arg\min_{\vec b} \norm{ W ⋅ \begin{bmatrix} \vec y \\ {\vec b}_0 \end{bmatrix} - W ⋅ \begin{bmatrix} X\p{\vec x} \\ I \end{bmatrix} ⋅ \vec b }
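This reduction preserves the objective: with W^* ⋅ W = K^{-1} and the Mahalanobis convention \norm{\vec r}_K^2 = {\vec r}^* ⋅ K^{-1} ⋅ \vec r used above, any residual \vec r satisfies
\norm{W ⋅ \vec r}^2 = {\vec r}^* ⋅ W^* ⋅ W ⋅ \vec r = {\vec r}^* ⋅ K^{-1} ⋅ \vec r = \norm{\vec r}_K^2
so the whitened ordinary least squares problem has the same minimizer.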
Finally, the resulting ordinary least squares problem is solved using the Moore-Penrose pseudoinverse, computed via the SVD.
import numpy as np

def whitening_transform(K):
    # Whitening matrix W with W.T @ W == inv(K), computed from the
    # eigendecomposition of the symmetric positive definite covariance K.
    w, V = np.linalg.eigh(K)
    return np.diag(1.0 / np.sqrt(w)) @ V.T

def least_squares(y, X, K_e=None, K_b=None, b_0=None):
    n, m = X.shape
    assert y.shape == (n,)
    if K_e is None:  # Ordinary least squares
        K_e = np.eye(n)
    if K_e.shape == (n,):  # Weighted least squares
        K_e = np.diag(K_e)
    assert K_e.shape == (n, n)
    if K_b is not None:  # Tikhonov regularization
        if np.isscalar(K_b):
            K_b = K_b * np.ones(m)
        if K_b.shape == (m,):
            K_b = np.diag(K_b)
        assert K_b.shape == (m, m)
        if b_0 is None:
            b_0 = np.zeros(m)
        assert b_0.shape == (m,)
        # Reduce Tikhonov regularization to generalized least squares
        y = np.concatenate([y, b_0])
        X = np.block([[X], [np.eye(m)]])
        K_e = np.block([
            [K_e, np.zeros((n, m))],
            [np.zeros((m, n)), K_b],
        ])
    # Reduce generalized to ordinary least squares
    W = whitening_transform(K_e)
    y = W @ y
    X = W @ X
    # Solve OLS using the Moore-Penrose inverse
    b = np.linalg.pinv(X) @ y
    return b
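A minimal usage sketch of the routine above on made-up data, with a quadratic design matrix (columns 1, x, x²):

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.vander(x, 3, increasing=True)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(50)

b_ols   = least_squares(y, X)                           # ordinary least squares
b_wls   = least_squares(y, X, K_e=np.full(50, 0.1**2))  # weighted least squares
b_ridge = least_squares(y, X, K_b=1e-2)                 # Tikhonov / ridge with b_0 = 0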
Least Angle Regression
Definition. Given a linear regression model, a parameter mean {\vec b}_0 ∈ \R^m, and covariance matrices K_e ∈ \R^{n × n}, K_b ∈ \R^{m × m} and K_b' ∈ \R^{m × m}, the Generalized Tikhonov-Lasso (elastic net) estimator \hat{\vec b} is the value of \vec b that minimizes the Mahalanobis norm of the residual vector \vec e plus weighted 2-norm and 1-norm penalties on the parameter vector \vec b:
\hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec y - X ⋅ \vec b}_{K_e}^2 + \norm{\vec b - {\vec b}_0}_{K_b}^2 + \norm{\vec b - {\vec b}_0}_{1,K_b'}
https://en.wikipedia.org/wiki/Elastic_net_regularization
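In the ungeneralized special case K_e = I, {\vec b}_0 = 0, K_b = λ_2 I, K_b' = λ_1 I this reduces to the standard elastic net, which is usually solved by coordinate descent. A minimal sketch using scikit-learn; note that ElasticNet parameterizes the penalty with alpha and l1_ratio, which correspond to λ_1, λ_2 only up to constant scaling of the objective.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
b_true = np.array([3.0, -2.0] + [0.0] * 8)          # sparse ground truth
y = X @ b_true + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
model.fit(X, y)
print(model.coef_)   # many true-zero coefficients are driven exactly to zero by the 1-norm term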
To do. The generalized Lasso problem and the LARS(Lasso) solution from Algorithm 3.2a in HTF09.
https://en.wikipedia.org/wiki/Least-angle_regression
To do. Equivalence to support vector machines and the efficient solving algorithms employed there.
Gauss-Markov-Aitken Theorem
To do. This does not generalize to higher dimensions without the additional constraint that the estimator is unbiased. See https://en.wikipedia.org/wiki/Stein%27s_example https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator.
https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem#Generalized_least_squares_estimator
https://en.wikipedia.org/wiki/Best_linear_unbiased_prediction
https://en.wikipedia.org/wiki/Minimum_mean_square_error
To do. https://en.wikipedia.org/wiki/Bayesian_interpretation_of_kernel_regularization https://stats.stackexchange.com/questions/163388/why-is-the-l2-regularization-equivalent-to-gaussian-prior
To do. Generalized Tikhonov corresponds to a multivariate normal prior on \vec b; Lasso corresponds to a Laplacian prior.
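A sketch of the first claim, assuming \vec e ∼ \mathcal N(0, K_e) and the prior \vec b ∼ \mathcal N({\vec b}_0, {K_b}^{-1}) (i.e. K_b acts as a precision matrix, matching the penalty convention above): up to constants, the negative log-posterior is
-\log p\p{\vec b \mid \vec y, X} = \tfrac{1}{2} \p{\vec y - X ⋅ \vec b}^T ⋅ {K_e}^{-1} ⋅ \p{\vec y - X ⋅ \vec b} + \tfrac{1}{2} \p{\vec b - {\vec b}_0}^T ⋅ K_b ⋅ \p{\vec b - {\vec b}_0} + \text{const}
so the MAP estimate is exactly the generalized Tikhonov-regularized least squares estimator; replacing the Gaussian prior with a Laplacian one turns the quadratic penalty into the 1-norm (Lasso) penalty.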
General linear model
https://en.wikipedia.org/wiki/General_linear_model
Y = X ⋅ B + U
Mixed model
https://en.wikipedia.org/wiki/Mixed_model
Non-linear regression model
Definition. Given an input set \mathcal X, parameter set \mathcal B, input vector \vec x ∈ {\mathcal X}^n and model function f : \mathcal X \times \mathcal B → \R, a nonlinear regression model is the relation
y_i = f(x_i, b) + e_i
for some b ∈ \mathcal B.
Nonlinear least squares
Definition. Given a nonlinear regression model, the nonlinear least squares estimator \hat{\vec b} is the value of b that minimizes the norm of the residual vector \vec e (https://en.wikipedia.org/wiki/Non-linear_least_squares):
\hat{\vec b} ≜ \arg\min_{b} \norm{\vec e}
In general there is no closed-form solution; the problem can be solved using iterative or global optimizers. Depending on f, a gradient may be available.
Algorithm. A common method is the Levenberg–Marquardt algorithm; the Gauss-Newton algorithm and gradient descent are its limiting cases.
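A minimal sketch using SciPy's solver; the exponential model and data are made up for illustration, method="lm" selects Levenberg–Marquardt, and the import is renamed to avoid clashing with the least_squares routine defined earlier in these notes.

import numpy as np
from scipy.optimize import least_squares as nlls

def f(x, b):
    # Example nonlinear model f(x, b) = b_0 * exp(-b_1 * x)
    return b[0] * np.exp(-b[1] * x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 40)
y = f(x, [2.0, 1.5]) + 0.05 * rng.standard_normal(40)

result = nlls(lambda b: y - f(x, b), x0=[1.0, 1.0], method="lm")
b_hat = result.x   # approximately [2.0, 1.5]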
To do. Extend to norms other than Euclidean.
Partial least squares
https://en.wikipedia.org/wiki/Partial_least_squares_regression
Principal component regression
https://en.wikipedia.org/wiki/Principal_component_analysis
https://en.wikipedia.org/wiki/Principal_component_regression
Rational function regression
Initialize the optimization problem with a seed value obtained by solving the linearized model p(x) - q(x) ⋅ y ≈ 0 (derived below). The rational regression model is
\vec y = \p{P(\vec x) ⋅ \vec b_p} ⊘ \p{Q(\vec x)⋅ \vec b_q} + \vec e
where ⊘ is Hadamard division.
\p{Q ⋅ \vec b_q} ⊙ \vec y = P ⋅ \vec b_p + \p{Q ⋅ \vec b_q} ⊙ \vec e
\operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y = P ⋅ \vec b_p + \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec e
\operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec e = \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y - P ⋅ \vec b_p
\vec e = \operatorname{diag} \p{Q ⋅ \vec b_q}^{-1} ⋅ \p {\operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y - P ⋅ \vec b_p }
\vec e = \vec y - \operatorname{diag} \p{Q ⋅ \vec b_q}^{-1} ⋅ P ⋅ \vec b_p
\hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec e}_Ω
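A sketch of the linearized seed fit, assuming a univariate input, Vandermonde bases for P and Q, and the constant coefficient of q normalized to 1; the helper name is illustrative, and the seed can then be refined by minimizing \norm{\vec e}_Ω with a nonlinear least squares solver.

import numpy as np

def rational_seed(x, y, deg_p, deg_q):
    # Seed for y ≈ p(x) / q(x) from the linearized model q(x)·y ≈ p(x),
    # with the constant coefficient of q fixed to 1.
    P = np.vander(x, deg_p + 1, increasing=True)     # numerator design matrix
    Q = np.vander(x, deg_q + 1, increasing=True)     # denominator design matrix
    A = np.hstack([P, -Q[:, 1:] * y[:, None]])       # unknowns: [b_p, b_q[1:]]
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    b_p = b[:deg_p + 1]
    b_q = np.concatenate([[1.0], b[deg_p + 1:]])
    return b_p, b_q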
Universal Kriging
http://www.kgs.ku.edu/Conferences/IAMG//Sessions/D/Papers/boogaart.pdf
https://www.nersc.no/sites/www.nersc.no/files/Basics2kriging.pdf
https://en.wikipedia.org/wiki/Regression-kriging
Special case: Gaussian process regression
https://people.cs.umass.edu/~wallach/talks/gp_intro.pdf
http://www.gaussianprocess.org/gpml/chapters/RW2.pdf
Rational Regression Kriging
Goal. Find the best-fit rational-kriging model for an \R^m → \R^n function given samples from \R^m × \R^n.
\vec F(\vec x) = \vec m(\vec x) + \vec \epsilon'(\vec x) + \vec \epsilon''
where \vec m is a rational function, \vec \epsilon' is a Gaussian process and \vec \epsilon'' are normally distributed residual errors.
The hyper-parameters of the model are the numerator and denominator degree of the rational function, the covariance kernel of the Gaussian process and the variance of the residual errors.
https://en.wikipedia.org/wiki/Regression-kriging
https://en.wikipedia.org/wiki/Generalized_least_squares
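A rough regression-kriging sketch under strong simplifications: univariate input, a low-degree polynomial standing in for the rational trend, the trend fitted by ordinary rather than generalized least squares, and scikit-learn's Gaussian process covering \vec \epsilon' + \vec \epsilon''. It is only meant to show the trend-plus-kriged-residual structure, not the full model above.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 60)
y = x / (1.0 + x) + 0.3 * np.sin(x) + 0.05 * rng.standard_normal(60)

# 1. Fit the trend m(x); a cubic polynomial stands in for the rational trend here.
trend_coef = np.polyfit(x, y, deg=3)
residual = y - np.polyval(trend_coef, x)

# 2. Krige the residuals: RBF kernel for the correlated part ε', WhiteKernel for ε''.
gp = GaussianProcessRegressor(kernel=1.0 * RBF() + WhiteKernel())
gp.fit(x[:, None], residual)

# 3. Prediction = trend + kriged residual.
x_new = np.linspace(0.0, 10.0, 200)
y_pred = np.polyval(trend_coef, x_new) + gp.predict(x_new[:, None])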
Rational trend model
https://en.wikipedia.org/wiki/Polynomial_regression
https://en.wikipedia.org/wiki/Polynomial_and_rational_function_modeling
To do. What if we have uncertainty in \vec x like we have in \vec y through \vec \epsilon''?
To do. Can we add gradients to the input values like in gradient-enhanced kriging?
https://hal-emse.ccsd.cnrs.fr/emse-01525674/file/paperHAL.pdf
To do. The polynomial and rational equioscillation theorem.
Gradient-Enhanced Kriging
https://en.wikipedia.org/wiki/Gradient-enhanced_kriging
Support Vector Machines
https://en.wikipedia.org/wiki/Support_vector_machine
Bayesian Optimization
https://www.youtube.com/watch?v=vz3D36VXefI
https://arxiv.org/abs/1807.02811
https://arxiv.org/abs/1009.5419
To do. https://en.wikipedia.org/wiki/Markov_decision_process
https://math.stackexchange.com/questions/924482/least-squares-regression-matrix-for-rational-functions
References
- Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009). “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”. Available online.
- Carl Edward Rasmussen & Christopher K. I. Williams (2006). “Gaussian Processes for Machine Learning”. Available online.
https://stats.stackexchange.com/questions/396914/why-is-computing-ridge-regression-with-a-cholesky-decomposition-much-quicker-tha
https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/
https://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf
- Philippe Rigollet (2016). “18.650 Statistics for Applications”. MIT OCW.