Regression

graph BT;
    GENRLS[Generalized Elastic-net-regularized least squares];
    ENRLS[Elastic-net-regularized least squares]-->GENRLS;
    LASSO[Lasso regression]-->ENRLS;
    GRLS[Generalized Tikhonov-regularized least squares]-->GENRLS;
    LR[Lavrentyev-regularized least squares]-->GRLS;
    RLS[Tikhonov-regularized least squares]-->ENRLS;
    RLS-->GRLS;
    GLS[Generalized Least Squares]-->RLS;
    WLS[Weighted Least Squares]-->GLS;
    OLS[Ordinary Least Squares]-->WLS;

Roadmap:

https://ryxcommar.com/2019/07/14/on-moving-from-statistics-to-machine-learning-the-final-stage-of-grief/

Linear regression

Definition. Given an input set \mathcal X, an input vector \vec x ∈ {\mathcal X}^n and a set of m basis functions f_i : \mathcal X → \R, a linear regression model is the relation

\vec y = X(\vec x) ⋅ \vec b + \vec e

where

X(\vec x) ≜ \begin{bmatrix} f_0(x_0) & f_1(x_0) & f_2(x_0) & ⋯ & f_{m-1}(x_0) \\ f_0(x_1) & f_1(x_1) & f_2(x_1) & ⋯ & f_{m-1}(x_1) \\ f_0(x_2) & f_1(x_2) & f_2(x_2) & ⋯ & f_{m-1}(x_2) \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ f_0(x_{n-1}) & f_1(x_{n-1}) & f_2(x_{n-1}) & ⋯ & f_{m-1}(x_{n-1}) \\ \end{bmatrix}

Note. The generalization where the output space is higher dimensional \vec y ∈ \R^{n \times d}, f_i : \mathcal X → \R^d is really just a special case where we treat each output component as a separate regression.

Note. The matrix X is called the design matrix, model matrix or regressor matrix. The vector \vec y is called the response vector or dependent variable, \vec b is called the parameter vector, \vec e is called the error vector and \vec x is called the independent variable, predictor variable, regressor, covariate, explanatory variable, control variable and, depending on context, exposure variable, risk factor, feature or input variable. (See this overview).

Question. Can this be generalized from \R to \C or to arbitrary normed vector spaces? Are there even normed vector spaces over fields other than \R and \C?

Simple least squares is the special case with \mathcal X = \R, m = 2, f_0(x) = 1 and f_1(x) = x.

Ordinary least squares is the special case with \mathcal X = \R^p, m = p + 1, f_i(\vec x) = x_i for i < p and f_p(\vec x) = 1. In rare cases the constant f_p can be left out.

Polynomial regression is the special case with \mathcal X = \R and f_i(x) = x^i. In this case the design matrix is a Vandermonde matrix.

Note. In ordinary least squares it is common to include a constant intercept by fixing one column of the design matrix to 1. Equivalently, a constant basis function f_p(\vec x) = 1 can be used. In either case, the coefficient associated with the constant function is called the intercept.
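
A minimal numpy sketch of building the design matrix from basis functions; the design_matrix helper and the degree-3 polynomial example are only illustrative.

import numpy as np

def design_matrix(x, basis):
    # X[j, i] = f_i(x_j) for each sample x_j and basis function f_i.
    return np.column_stack([[f(xj) for xj in x] for f in basis])

# Polynomial regression of degree 3: f_i(x) = x**i gives a Vandermonde matrix.
x = np.linspace(0.0, 1.0, 5)
X = design_matrix(x, [lambda t, i=i: t**i for i in range(4)])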

To do. More special cases of the design matrix: ANOVA, ANCOVA, MANCOVA, linear regression, polynomial regression.

To do. Goodness of fit metrics such as R^2, \bar{R}^2, Log-likelihood, Durbin-Watson statistic, Akaike criterion, Schwarz criterion, F-test, Wilks's Lambda. (See here).

To do. Discuss other methods like total least squares, Errors-in-variables models, modelling uncertainty in \vec x.

To do. Discuss other goals like least absolute deviations, median absolute deviation, relative mean absolute difference, mean absolute deviation, minimum maximal deviation, Lasso regression, Basis pursuit denoising. See also here. Iteratively reweighted least squares can solve goals that are p-norms.

Generalized Tikhonov-regularized least squares

Definition. Given a linear regression model, a parameter mean {\vec b}_0 ∈ \R^m and covariance matrices K_e ∈ \R^{n × n} and K_b ∈ \R^{m × m}, the Generalized Tikhonov-regularized least squares estimator \hat{\vec b} is the value of \vec b that minimizes the combined Mahalanobis norm of the residual vector \vec e and the parameter vector \vec b:

\hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec y - X ⋅ \vec b}_{K_e}^2 + \norm{\vec b - {\vec b}_0}_{K_b}^2

Note. The special case where K_b = \lambda I is the (non-generalized) Tikhonov-regularized least squares, also known as ridge regression.

In statistics, the method is known as ridge regression, in machine learning it is known as weight decay, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.

Note. The special case where K_e = X is the Lavrentyev regularization.

Note. The special case where K_b = 0 is generalized least squares. The generalized least squares estimator is unbiased in \vec b, but can be ill-conditioned.

Note. The special case where K_b = 0 and K_e = \operatorname{diag}(\vec w) is weighted least squares.

Note. The special case where K_b = 0 and K_e = I is ordinary least squares.

Note. The matrices X^T ⋅ X and X^T ⋅ \vec y are moment matrices.

Theorem. The generalized Tikhonov-regularized least squares estimator has a closed form solution

\hat{\vec b} = \p{X^T ⋅ {K_e}^{-1} ⋅ X + K_{b}}^{-1} ⋅ \p{X^T ⋅ {K_e}^{-1} ⋅ \vec y + K_b ⋅ {\vec b}_0}
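
A direct numpy sketch of this closed form as stated; the gtls_closed_form name is hypothetical, and linear solves are used instead of forming explicit inverses.

import numpy as np

def gtls_closed_form(y, X, K_e, K_b, b_0):
    # Evaluates the closed form above; np.linalg.solve avoids computing K_e^-1.
    A = X.T @ np.linalg.solve(K_e, X) + K_b
    c = X.T @ np.linalg.solve(K_e, y) + K_b @ b_0
    return np.linalg.solve(A, c)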

To do. Proof by derivation.

To do. Derive expected value and variance of \hat{\vec b} like here.

Algorithm. While the closed form solution can be used directly, it is numerically unstable. A more stable algorithm is the following: First, the problem is reduced to a generalized least squares regression

\begin{aligned} \hat{\vec b} &= \arg\min_{\vec b} \norm{ \begin{bmatrix} \vec y \\ {\vec b}_0 \end{bmatrix} - \begin{bmatrix} X\p{\vec x} \\ I \end{bmatrix} ⋅ \vec b }_K & K &≜ \begin{bmatrix} K_e & 0 \\ 0 & K_b \end{bmatrix} \end{aligned}

which is then reduced to an ordinary least squares problem using a whitening transform W with W^* ⋅ W = K^{-1}

\hat{\vec b} ≜ \arg\min_{\vec b} \norm{ W ⋅ \begin{bmatrix} \vec y \\ {\vec b}_0 \end{bmatrix} - W ⋅ \begin{bmatrix} X\p{\vec x} \\ I \end{bmatrix} ⋅ \vec b }

Finally, this is solved using the Moore-Penrose inverse computed using SVD.
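
The whitening transform is not unique; one possible construction, assuming K is symmetric positive definite, uses the symmetric eigendecomposition. This helper is used by the code below.

import numpy as np

def whitening_transform(K):
    # For K = V diag(w) V^T, return W = V diag(w^-1/2) V^T so that W^T W = K^{-1}.
    w, V = np.linalg.eigh(K)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T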

def least_squares(y, X, K_e=None, K_b=None, b_0=None):
    # Generalized Tikhonov-regularized least squares via reduction to ordinary
    # least squares. K_e weights the residuals, K_b and b_0 regularize the
    # parameters; K_e and K_b must be positive definite when given.
    n, m = X.shape
    assert y.shape == (n,)
    if K_e is None: # Ordinary least squares
        K_e = np.eye(n)
    if K_e.shape == (n,): # Weighted least squares
        K_e = np.diag(K_e)
    assert K_e.shape == (n, n)
    if K_b is not None:
        if np.isscalar(K_b):
            K_b = K_b * np.ones(m)
        if K_b.shape == (m,):
            K_b = np.diag(K_b)
        assert K_b.shape == (m, m)
        if b_0 is None:
            b_0 = np.zeros(m)
        assert b_0.shape == (m,)

        # Reduce Tikhonov regularization to generalized least squares by
        # augmenting the observations with the parameter mean b_0.
        y = np.concatenate([y, b_0])
        X = np.block([[X], [np.eye(m)]])
        K_e = np.block([
            [K_e, np.zeros((n, m))],
            [np.zeros((m, n)), K_b]
        ])
    # Reduce generalized to ordinary least squares by whitening.
    W = whitening_transform(K_e)
    y = np.matmul(W, y)
    X = np.matmul(W, X)

    # Solve the ordinary least squares problem using the Moore-Penrose inverse,
    # which numpy computes via SVD.
    b = np.matmul(np.linalg.pinv(X), y)
    return b
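
A quick usage sketch on synthetic data; the data and parameter values are only illustrative, and least_squares and whitening_transform are the definitions above.

rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x])                  # design matrix with intercept
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=50)

b_ols = least_squares(y, X)                            # ordinary least squares
b_wls = least_squares(y, X, K_e=np.full(50, 0.01))     # weighted least squares
b_tik = least_squares(y, X, K_b=10.0)                  # Tikhonov-regularized, b_0 = 0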

Generalized Elastic-net-regularized least squares

Definition. Given a linear regression model, a parameter mean {\vec b}_0 ∈ \R^m, covariance matrices K_e ∈ \R^{n × n} and K_b ∈ \R^{m × m} and an ℓ1 weight matrix K_b' ∈ \R^{m × m}, the Generalized Tikhonov-Lasso (elastic net) estimator \hat{\vec b} is the value of \vec b that minimizes the combined Mahalanobis norm of the residual vector \vec e and the parameter vector \vec b plus a weighted ℓ1 norm of the parameter vector:

\hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec y - X ⋅ \vec b}_{K_e}^2 + \norm{\vec b - {\vec b}_0}_{K_b}^2 + \norm{\vec b - {\vec b}_0}_{1,K_b'}

https://en.wikipedia.org/wiki/Elastic_net_regularization
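
For the plain special case K_e = I, {\vec b}_0 = 0, K_b = λ_2 I and K_b' = λ_1 I, a minimal proximal-gradient (ISTA) sketch is the following; this is not the LARS algorithm referenced below, and the function name and step count are arbitrary.

import numpy as np

def elastic_net_ista(y, X, lam1, lam2, steps=5000):
    # Minimizes ||y - X b||^2 + lam2 * ||b||^2 + lam1 * ||b||_1 by proximal gradient.
    L = 2.0 * (np.linalg.norm(X, 2) ** 2 + lam2)       # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * (X.T @ (X @ b - y) + lam2 * b)    # gradient of the smooth part
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)  # soft-thresholding
    return b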

To do. The generalized Lasso problem and the LARS(Lasso) solution from Algorithm 3.2a in HTF09.

https://en.wikipedia.org/wiki/Least-angle_regression

To do. Equivalence to support vector machines and the efficient solving algorithms employed there.

Gauss-Markov-Aitken Theorem

To do. The optimality does not generalize to higher dimensions without the additional constraint that the estimator is unbiased: in three or more dimensions, biased estimators can dominate least squares. See https://en.wikipedia.org/wiki/Stein%27s_example and https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator.

https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem#Generalized_least_squares_estimator

https://en.wikipedia.org/wiki/Best_linear_unbiased_prediction

https://en.wikipedia.org/wiki/Minimum_mean_square_error

To do. https://en.wikipedia.org/wiki/Bayesian_interpretation_of_kernel_regularization https://stats.stackexchange.com/questions/163388/why-is-the-l2-regularization-equivalent-to-gaussian-prior

To do. Generalized Tikhonov regularization corresponds to a multivariate normal prior on \vec b; Lasso corresponds to a Laplace prior.
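
Sketch of the correspondence, assuming the Mahalanobis norms are weighted by the inverse covariances: if \vec y \mid \vec b \sim \mathcal N\p{X ⋅ \vec b, K_e} and \vec b \sim \mathcal N\p{{\vec b}_0, K_b}, then up to an additive constant

-2 \log p\p{\vec b \mid \vec y} = \p{\vec y - X ⋅ \vec b}^T ⋅ {K_e}^{-1} ⋅ \p{\vec y - X ⋅ \vec b} + \p{\vec b - {\vec b}_0}^T ⋅ {K_b}^{-1} ⋅ \p{\vec b - {\vec b}_0}

so the maximum a posteriori estimate coincides with the generalized Tikhonov-regularized least squares estimator.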

General linear model

https://en.wikipedia.org/wiki/General_linear_model

Y = X ⋅ B + U
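
A minimal numpy sketch with made-up dimensions, showing that each column of B is fitted independently, as in the note on multi-dimensional outputs above.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
B_true = rng.normal(size=(3, 2))
Y = X @ B_true + 0.01 * rng.normal(size=(20, 2))

# np.linalg.lstsq accepts a matrix right-hand side and solves each response
# column of Y as a separate ordinary least squares problem.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)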

Mixed model

https://en.wikipedia.org/wiki/Mixed_model

Non-linear regression model

Definition. Given an input set \mathcal X, parameter set \mathcal B, input vector \vec x ∈ {\mathcal X}^n and model function f : \mathcal X \times \mathcal B → \R, a nonlinear regression model is the relation

y_i = f(x_i, b) + e_i

for some b ∈ \mathcal B.

Nonlinear least squares

Definition. Given a nonlinear regression model, the nonlinear least squares estimator (https://en.wikipedia.org/wiki/Non-linear_least_squares) is

\hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec e}

There is, in general, no closed form solution. The problem can be solved using iterative or global optimizers; depending on f, a gradient may be available.

Algorithm. One method for solving this is the Levenberg–Marquardt algorithm, which has the Gauss-Newton algorithm and gradient descent as limiting cases.
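
A small sketch using scipy.optimize.curve_fit, which defaults to Levenberg–Marquardt for unconstrained problems; the exponential model and the synthetic data are only illustrative.

import numpy as np
from scipy.optimize import curve_fit

def f(x, b0, b1):
    # Hypothetical model, nonlinear in b1: f(x, b) = b0 * exp(b1 * x).
    return b0 * np.exp(b1 * x)

x = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(0)
y = f(x, 2.0, -1.5) + 0.05 * rng.normal(size=x.size)

b_hat, b_cov = curve_fit(f, x, y, p0=(1.0, -1.0))   # Levenberg–Marquardt by default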

To do. Extend to norms other than Euclidean.

Partial least squares

https://en.wikipedia.org/wiki/Partial_least_squares_regression

Principal component regression

https://en.wikipedia.org/wiki/Principal_component_analysis

https://en.wikipedia.org/wiki/Principal_component_regression

Rational function regression

A rational function model can be fitted as a nonlinear least squares problem. A good seed value for the optimization is obtained by solving the linearized model p(x) - q(x) ⋅ y ≈ 0, which is linear in the coefficients of p and q. The model is

\vec y = \p{P(\vec x) ⋅ \vec b_p} ⊘ \p{Q(\vec x)⋅ \vec b_q} + \vec e

where ⊘ denotes Hadamard (element-wise) division and P(\vec x) and Q(\vec x) are the design matrices of the numerator and denominator polynomials. Multiplying through by Q ⋅ \vec b_q and rewriting the Hadamard products as diagonal matrices isolates the error vector:

\p{Q ⋅ \vec b_q} ⊙ \vec y = P ⋅ \vec b_p + \p{Q ⋅ \vec b_q} ⊙ \vec e

\operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y = P ⋅ \vec b_p + \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec e

\operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec e = \operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y - P ⋅ \vec b_p

\vec e = \operatorname{diag} \p{Q ⋅ \vec b_q}^{-1} ⋅ \p {\operatorname{diag} \p{Q ⋅ \vec b_q} ⋅ \vec y - P ⋅ \vec b_p }

\vec e = \vec y - \operatorname{diag} \p{Q ⋅ \vec b_q}^{-1} ⋅ P ⋅ \vec b_p

\hat{\vec b} ≜ \arg\min_{\vec b} \norm{\vec e}_Ω
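
A sketch of the linearized seed fit described above, assuming polynomial (Vandermonde) bases for p and q and the normalization q_0 = 1; the helper name rational_seed is hypothetical.

import numpy as np

def rational_seed(x, y, deg_p, deg_q):
    # Linearized seed for y ≈ p(x) / q(x): with q_0 fixed to 1, solve
    # y = p(x) - y * (q(x) - 1) by ordinary least squares.
    P  = np.vander(x, deg_p + 1, increasing=True)
    Q1 = np.vander(x, deg_q + 1, increasing=True)[:, 1:]   # denominator without the constant column
    A  = np.hstack([P, -(y[:, None] * Q1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    b_p, b_q = coef[:deg_p + 1], np.concatenate([[1.0], coef[deg_p + 1:]])
    return b_p, b_q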

Universal Kriging

http://www.kgs.ku.edu/Conferences/IAMG//Sessions/D/Papers/boogaart.pdf

https://www.nersc.no/sites/www.nersc.no/files/Basics2kriging.pdf

https://en.wikipedia.org/wiki/Regression-kriging

Special case: Gaussian-Process-Regression

https://people.cs.umass.edu/~wallach/talks/gp_intro.pdf

http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

Rational Regression Kriging

Goal. Find the best-fit Rational-Kriging model for an \R^m → \R^n function given samples from \R^m × \R^n.

\vec F(\vec x) = \vec m(\vec x) + \vec \epsilon'(\vec x) + \vec \epsilon''

where \vec m is a rational function, \vec \epsilon' is a Gaussian process and \vec \epsilon'' are normally distributed residual errors.

The hyper-parameters of the model are the numerator and denominator degrees of the rational function, the covariance kernel of the Gaussian process and the variance of the residual errors.
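
A rough sketch for the scalar-input, scalar-output case under these assumptions, reusing the hypothetical rational_seed helper above for the trend \vec m and scikit-learn's GaussianProcessRegressor for \vec \epsilon'; the RBF kernel and noise level are arbitrary placeholders.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_rational_kriging(x, y, deg_p=2, deg_q=1, noise=1e-2):
    # Fit the rational trend first, then a Gaussian process on the residuals.
    b_p, b_q = rational_seed(x, y, deg_p, deg_q)
    trend = np.polyval(b_p[::-1], x) / np.polyval(b_q[::-1], x)
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=noise).fit(x[:, None], y - trend)
    return b_p, b_q, gp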

https://en.wikipedia.org/wiki/Regression-kriging

https://en.wikipedia.org/wiki/Generalized_least_squares

Rational trend model

https://en.wikipedia.org/wiki/Polynomial_regression

https://en.wikipedia.org/wiki/Polynomial_and_rational_function_modeling

To do. What if we have uncertainty in \vec x like we have in \vec y through \vec \epsilon''?

To do. Can we add gradients to the input values like in Gradient-enhanced kriging?

https://hal-emse.ccsd.cnrs.fr/emse-01525674/file/paperHAL.pdf

To do. The polynomial and rational equioscillation theorem.

Gradient-Enhanced Kriging

https://en.wikipedia.org/wiki/Gradient-enhanced_kriging

Support Vector Machines

https://en.wikipedia.org/wiki/Support_vector_machine

Bayesian Optimization

https://www.youtube.com/watch?v=vz3D36VXefI

https://arxiv.org/abs/1807.02811

https://arxiv.org/abs/1009.5419

To do. https://en.wikipedia.org/wiki/Markov_decision_process

https://math.stackexchange.com/questions/924482/least-squares-regression-matrix-for-rational-functions

References

https://stats.stackexchange.com/questions/396914/why-is-computing-ridge-regression-with-a-cholesky-decomposition-much-quicker-tha

https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/

https://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf
