Lasso algorithm

The lasso is closely related to basis pursuit denoising.

History

Robert Tibshirani originally introduced the lasso to improve the prediction accuracy and interpretability of regression models. He modified the model-fitting process so that only a subset of the covariates is selected for the final model, rather than all of them being used. His work built on Breiman's nonnegative garrote, which has a similar aim but takes a different approach.

The lasso combines these two approaches: by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, it forces some coefficients to be exactly zero, effectively selecting a simpler model that excludes the corresponding covariates. In this respect it resembles ridge regression, in which the sum of the squares of the coefficients is forced to be less than a fixed value; the difference is that ridge regression only shrinks the coefficient values and never sets any of them to zero.

Basic form

Lasso was originally introduced in the context of least squares, and considering this case first illustrates many of the estimator's properties in a straightforward way.

Let ${\displaystyle y_{i}}$ denote the outcome and ${\displaystyle x_{i}}$ the covariate vector for the ${\displaystyle i}$-th of ${\displaystyle N}$ cases, and let ${\displaystyle {\bar {y}}}$ and ${\displaystyle {\bar {x}}}$ be their sample means. For any fixed slope vector ${\displaystyle \beta }$, the intercept that minimizes the residual sum of squares is ${\displaystyle {\hat {\beta }}_{0}={\bar {y}}-{\bar {x}}^{T}\beta }$, so the residuals satisfy

${\displaystyle y_{i}-{\hat {\beta }}_{0}-x_{i}^{T}\beta =y_{i}-({\bar {y}}-{\bar {x}}^{T}\beta )-x_{i}^{T}\beta =(y_{i}-{\bar {y}})-(x_{i}-{\bar {x}})^{T}\beta ,}$

and the intercept can therefore be dropped by centering the data. With ${\displaystyle y}$ and the columns of ${\displaystyle X}$ centered in this way, the lasso estimate solves the constrained problem

${\displaystyle \min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}\right\}{\text{ subject to }}\|\beta \|_{1}\leq t,}$

where ${\displaystyle t}$ is a prespecified free parameter that determines the degree of regularization.

This constrained problem can equivalently be written in the so-called Lagrangian form

${\displaystyle \min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}+\lambda \|\beta \|_{1}\right\}}$

where the exact relationship between ${\displaystyle t}$ and ${\displaystyle \lambda }$ is data dependent.
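The Lagrangian form can be minimized numerically; one standard approach is proximal gradient descent (ISTA), which alternates a gradient step on the least-squares term with soft thresholding. The sketch below is a minimal pure-Python illustration; the toy data, step-size bound, and iteration count are illustrative assumptions, not part of the original formulation.

```python
# Proximal-gradient descent (ISTA) for the Lagrangian lasso objective
# (1/N) * ||y - X b||_2^2 + lam * ||b||_1, on toy data, in pure Python.

def lasso_ista(X, y, lam, n_iter=500):
    N, p = len(X), len(X[0])
    b = [0.0] * p
    # Step size 1/L, where L bounds the Lipschitz constant of the
    # gradient (2/N) * X^T (X b - y); the squared Frobenius norm of X
    # gives a crude but valid bound without any linear-algebra library.
    L = (2.0 / N) * sum(x ** 2 for row in X for x in row)
    step = 1.0 / L
    for _ in range(n_iter):
        # Gradient step on the smooth least-squares term ...
        r = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(N)]
        g = [(2.0 / N) * sum(X[i][j] * r[i] for i in range(N)) for j in range(p)]
        # ... followed by soft thresholding at level step * lam.
        for j in range(p):
            z = b[j] - step * g[j]
            a = step * lam
            b[j] = 0.0 if abs(z) <= a else (z - a if z > 0 else z + a)
    return b
```

For an orthonormal design this reproduces the coordinate-wise soft-thresholding solution discussed in the next section; for general X it converges to a minimizer of the Lagrangian objective.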

Orthonormal covariates

Some basic properties of the lasso estimator can now be considered.

Assuming first that the covariates are orthonormal so that ${\displaystyle (x_{i}\mid x_{j})=\delta _{ij}}$, where ${\displaystyle (\cdot \mid \cdot )}$ is the inner product and ${\displaystyle \delta _{ij}}$ is the Kronecker delta, or, equivalently, ${\displaystyle X^{T}X=I}$, then using subgradient methods it can be shown that

${\displaystyle {\hat {\beta }}_{j}=S_{N\lambda /2}({\hat {\beta }}_{j}^{\text{OLS}})={\hat {\beta }}_{j}^{\text{OLS}}\max \left(0,1-{\frac {N\lambda }{2|{\hat {\beta }}_{j}^{\text{OLS}}|}}\right),\qquad {\text{where }}{\hat {\beta }}^{\text{OLS}}=(X^{T}X)^{-1}X^{T}y.}$

${\displaystyle S_{\alpha }}$ is referred to as the soft thresholding operator, since it translates values towards zero (making them exactly zero if they are small enough) instead of setting smaller values to zero and leaving larger ones untouched as the hard thresholding operator, often denoted ${\displaystyle H_{\alpha }}$, would.
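As a concrete sketch, the soft thresholding operator can be written directly in pure Python; the threshold values in the example calls are arbitrary illustrative choices.

```python
def soft_threshold(z, a):
    # S_a(z): translate z towards zero by a, returning exactly 0
    # once |z| <= a; equivalent to z * max(0, 1 - a / |z|).
    if abs(z) <= a:
        return 0.0
    return z - a if z > 0 else z + a

print(soft_threshold(3.0, 1.5))   # 1.5  (shrunk towards zero)
print(soft_threshold(-3.0, 1.5))  # -1.5
print(soft_threshold(1.0, 1.5))   # 0.0  (small value zeroed out)
```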

This can be compared to ridge regression, where the objective is to minimize

${\displaystyle \min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\|y-X\beta \|_{2}^{2}+\lambda \|\beta \|_{2}^{2}\right\}}$

yielding

${\displaystyle {\hat {\beta }}_{j}=(1+N\lambda )^{-1}{\hat {\beta }}_{j}^{\text{OLS}}.}$

So ridge regression shrinks all coefficients by a uniform factor of ${\displaystyle (1+N\lambda )^{-1}}$ and does not set any coefficients to zero.

It can also be compared to regression with best subset selection, in which the goal is to minimize

${\displaystyle \min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}+\lambda \|\beta \|_{0}\right\}}$

where ${\displaystyle \|\cdot \|_{0}}$ is the "${\displaystyle \ell ^{0}}$ norm", defined by ${\displaystyle \|z\|_{0}=m}$ if exactly m components of z are nonzero. In this case, it can be shown that

${\displaystyle {\hat {\beta }}_{j}=H_{\sqrt {N\lambda }}\left({\hat {\beta }}_{j}^{\text{OLS}}\right)={\hat {\beta }}_{j}^{\text{OLS}}\mathrm {I} \left(\left|{\hat {\beta }}_{j}^{\text{OLS}}\right|\geq {\sqrt {N\lambda }}\right)}$

where ${\displaystyle H_{\alpha }}$ is the so-called hard thresholding function and ${\displaystyle \mathrm {I} }$ is an indicator function (it is 1 if its argument is true and 0 otherwise).

Therefore, the lasso estimates share features of the estimates from both ridge and best subset selection regression since they both shrink the magnitude of all the coefficients, like ridge regression, but also set some of them to zero, as in the best subset selection case. Additionally, while ridge regression scales all of the coefficients by a constant factor, lasso instead translates the coefficients towards zero by a constant value and sets them to zero if they reach it.
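The contrast between the three coordinate-wise rules can be made concrete by applying each to the same OLS estimates. In the sketch below the shrinkage and threshold levels are arbitrary illustrative choices, not values derived from a particular ${\displaystyle \lambda }$.

```python
def ridge_scale(b, c):
    # Ridge: uniform multiplicative shrinkage by (1 + c)^(-1); never zero.
    return b / (1.0 + c)

def soft(b, a):
    # Lasso: translate towards zero by a, clamping small values to zero.
    return 0.0 if abs(b) <= a else (b - a if b > 0 else b + a)

def hard(b, a):
    # Best subset: keep b unchanged if |b| >= a, otherwise drop it.
    return b if abs(b) >= a else 0.0

ols = [4.0, 1.0, -2.5, 0.3]
print([ridge_scale(b, 1.0) for b in ols])  # [2.0, 0.5, -1.25, 0.15]
print([soft(b, 1.0) for b in ols])         # [3.0, 0.0, -1.5, 0.0]
print([hard(b, 1.0) for b in ols])         # [4.0, 1.0, -2.5, 0.0]
```

Ridge rescales every coefficient but keeps all of them; the lasso both shrinks the large coefficients and zeroes out the small ones; hard thresholding drops the small ones and leaves the rest untouched.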

Correlated covariates

Returning to the general case, in which the covariates need not be independent, consider the special case in which two covariates, say j and k, are identical in every case, so that ${\displaystyle x_{(j)}=x_{(k)}}$, where ${\displaystyle x_{(j),i}=x_{ij}}$. Then the values of ${\displaystyle \beta _{j}}$ and ${\displaystyle \beta _{k}}$ that minimize the lasso objective function are not uniquely determined. In fact, if there is some solution ${\displaystyle {\hat {\beta }}}$ in which ${\displaystyle {\hat {\beta }}_{j}{\hat {\beta }}_{k}\geq 0}$, then for any ${\displaystyle s\in [0,1]}$, replacing ${\displaystyle {\hat {\beta }}_{j}}$ by ${\displaystyle s({\hat {\beta }}_{j}+{\hat {\beta }}_{k})}$ and ${\displaystyle {\hat {\beta }}_{k}}$ by ${\displaystyle (1-s)({\hat {\beta }}_{j}+{\hat {\beta }}_{k})}$, while keeping all the other ${\displaystyle {\hat {\beta }}_{i}}$ fixed, gives another solution, so the lasso objective function has a continuum of valid minimizers. Several variants of the lasso, including the elastic net, have been designed to address this shortcoming; they are discussed below.
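This non-uniqueness is easy to verify numerically: with two duplicated columns, every split of a shared total weight between them yields exactly the same objective value. A minimal sketch, with illustrative data and an arbitrary total weight:

```python
def lasso_objective(X, y, b, lam):
    # (1/N) * ||y - X b||_2^2 + lam * ||b||_1
    N, p = len(X), len(b)
    rss = sum((y[i] - sum(X[i][j] * b[j] for j in range(p))) ** 2
              for i in range(N))
    return rss / N + lam * sum(abs(v) for v in b)

# Design matrix whose two columns are identical: x_(j) = x_(k).
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

# Splitting a shared total weight c as (s*c, (1-s)*c) leaves both the
# fit X b and the L1 penalty |s*c| + |(1-s)*c| = c unchanged for s in [0, 1].
c = 1.5
vals = [lasso_objective(X, y, [s * c, (1 - s) * c], 0.1)
        for s in (0.0, 0.25, 0.5, 1.0)]
print(max(vals) - min(vals))  # 0.0: every split attains the same objective
```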

References

1. Tibshirani, Robert. "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B (Methodological). 1996, 58 (1): 267–288. http://www.jstor.org/stable/2346178 (archived at the Wayback Machine).
2. Breiman, Leo. "Better Subset Regression Using the Nonnegative Garrote". Technometrics. 1995-11-01, 37 (4): 373–384. ISSN 0040-1706. doi:10.2307/1269730 (archived from the original on 2020-06-08).
3. Tibshirani, Robert. "The Lasso Method for Variable Selection in the Cox Model". Statistics in Medicine. 1997-02-28, 16 (4): 385–395. ISSN 1097-0258. doi:10.1002/(sici)1097-0258(19970228)16:4%3C385::aid-sim380%3E3.0.co;2-3.