Stanford EE364A Convex Optimization I | Stephen Boyd | 2023 | Lecture 10
20 Mar 2024
Penalty Functions
- A penalty function expresses how much we are "irritated" by a residual of a given value.
- The goal is to minimize total irritation by minimizing the sum of these penalties.
- L1 (Lasso) penalty functions are sparsifying, often resulting in many zero entries in the solution to an optimization problem.
- Relative to quadratic penalties, L1 penalties are relaxed about large residuals but more upset about small ones; quadratic penalties are the opposite.
- Huber penalties blend quadratic and L1 penalties, matching least squares for small residuals and resembling an L1 penalty for large residuals.
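A minimal CVXPY sketch (not from the lecture; the data, sizes, and Huber threshold are made up) comparing quadratic, L1, and Huber penalty approximation on the same residuals:

```python
import cvxpy as cp
import numpy as np

# Synthetic overdetermined system Ax ~ b with a few outliers in b (made-up data).
rng = np.random.default_rng(0)
m, n = 100, 20
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
b[:5] += 10.0  # outliers

x = cp.Variable(n)
r = A @ x - b  # residual

# Quadratic (least squares), l1, and Huber penalties on the residual.
for penalty in [cp.sum_squares(r), cp.norm(r, 1), cp.sum(cp.huber(r, M=1.0))]:
    cp.Problem(cp.Minimize(penalty)).solve()
    print(np.round(np.sort(np.abs(r.value))[-5:], 2))  # five largest residuals
```

With outliers in b, the quadratic fit is pulled hard toward the bad measurements, while the L1 and Huber fits are much less sensitive to them.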
Regularized Approximation
- Regularized approximation is a multicriterion problem, often involving tracking a desired trajectory, keeping the input small, and keeping the input variation small.
- Weights (δ and η) shape the solution and can be adjusted to achieve desired results.
- Changing the penalty function to the sum of absolute values (L1) would likely result in a sparse solution.
- A sparse first difference means the value equals the previous value most of the time, which encourages piecewise constant inputs.
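A CVXPY sketch of this trade-off; the linear system H, the desired trajectory, and the weights δ and η are all made up rather than taken from the lecture:

```python
import cvxpy as cp
import numpy as np

# Made-up tracking problem: output H @ u should follow y_des,
# while keeping the input u small and slowly varying.
rng = np.random.default_rng(1)
T = 200
H = np.tril(0.1 * rng.standard_normal((T, T)) + np.eye(T))
y_des = np.sign(np.sin(np.linspace(0, 6 * np.pi, T)))

u = cp.Variable(T)
track = cp.sum_squares(H @ u - y_des)
mag = cp.sum_squares(u)
delta, eta = 1.0, 0.1  # arbitrary weights shaping the trade-off

# Quadratic penalty on the first difference gives a smooth input ...
cp.Problem(cp.Minimize(track + delta * cp.sum_squares(cp.diff(u)) + eta * mag)).solve()
u_smooth = u.value

# ... an l1 penalty on the first difference gives a piecewise constant
# (sparse-difference) input.
cp.Problem(cp.Minimize(track + delta * cp.norm(cp.diff(u), 1) + eta * mag)).solve()
u_pwc = u.value
```

Swapping the quadratic variation term for the L1 norm of the first difference is the only change needed to get the piecewise constant behavior.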
Signal Reconstruction
- Signal reconstruction aims to form an approximation of a corrupted signal by minimizing a regularization function or smoothing objective.
- The trade-off in signal reconstruction is between deviating from the given corrupted signal and the size of the smoothing cost.
- Cross-validation can be used to select the amount of smoothing by randomly removing a portion of the data and pretending it is missing.
- When the penalty is an L1 norm on the first difference, the regularizer is called the total variation; the solution then has a sparse first difference, corresponding to a piecewise constant approximation.
- Total variation smoothing preserves sharp boundaries in images.
- Excessive regularization of total variation denoising results in a cartoonish effect with constant grayscale values in different regions.
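A small CVXPY sketch of 1D signal reconstruction, using a made-up piecewise constant signal and an arbitrary smoothing weight (in practice chosen by cross-validation, as noted above):

```python
import cvxpy as cp
import numpy as np

# Made-up piecewise constant signal corrupted by noise.
rng = np.random.default_rng(2)
n = 400
x_true = np.concatenate([np.full(100, 0.0), np.full(150, 2.0), np.full(150, -1.0)])
x_cor = x_true + 0.3 * rng.standard_normal(n)

lam = 2.0  # smoothing weight, chosen arbitrarily here
x = cp.Variable(n)
fit = cp.sum_squares(x - x_cor)

# Quadratic smoothing blurs the jumps ...
cp.Problem(cp.Minimize(fit + lam * cp.sum_squares(cp.diff(x)))).solve()
x_quad = x.value

# ... total variation (l1 on the first difference) keeps the sharp boundaries.
cp.Problem(cp.Minimize(fit + lam * cp.norm(cp.diff(x), 1))).solve()
x_tv = x.value
```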
Robust Approximation
- Robust approximation deals with uncertainty in the model, for example in the matrix A.
- The most common methods in practice are to ignore the uncertainty or to summarize it crudely with a mean and variance.
- A better approach is to account for the uncertainty explicitly, for example by simulating multiple plausible scenarios or by making an educated guess about its distribution.
- Regularization in machine learning and statistics addresses uncertainty in data.
Handling Uncertainty
- Different approaches to handling uncertainty include stochastic methods, worst-case methods, and hybrids of these.
- Stochastic methods assume that the uncertain parameter comes from a probability distribution.
- Worst-case methods assume that the uncertain parameter satisfies certain constraints.
- Robust stochastic methods combine elements of both stochastic and worst-case methods.
- A simple example illustrates the different methods and their effects on the resulting model.
- A practical trick for handling uncertainty is to obtain a small number of plausible models and take a min-max (worst-case) approach over them, as in the sketch after this list.
- The speaker discusses how to generate multiple plausible models when faced with uncertainty in data.
- One approach is to use bootstrapping, which is a valid method in this context.
- Another principled approach is to use maximum likelihood estimation and then generate models with parameters that are close to the maximum likelihood estimate.
- The speaker introduces stochastic robust least squares, a regularized least squares method that accounts for uncertainty in the data matrix.
- Regularization, such as Ridge regression, can be interpreted as a way of acknowledging uncertainty in the features of the data.
- The speaker provides an example of a uniform distribution on a unit disc in matrix space to illustrate the concept of uncertainty in model parameters.
- The speaker emphasizes the importance of acknowledging uncertainty in data and models, and suggests that simply admitting uncertainty can provide significant benefits.
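A sketch of the min-max trick over a handful of plausible models, written in CVXPY; here the plausible models are just random perturbations of a nominal matrix (standing in for bootstrap or perturbed-ML models), and all sizes and values are made up:

```python
import cvxpy as cp
import numpy as np

# K made-up "plausible" data matrices (in practice obtained by bootstrapping
# or by perturbing a maximum likelihood estimate).
rng = np.random.default_rng(3)
m, n, K = 50, 10, 5
A_nom = rng.standard_normal((m, n))
b = rng.standard_normal(m)
A_list = [A_nom + 0.1 * rng.standard_normal((m, n)) for _ in range(K)]

x = cp.Variable(n)
# Worst-case (min-max) residual norm over the plausible models.
worst = cp.max(cp.hstack([cp.norm(A @ x - b, 2) for A in A_list]))
cp.Problem(cp.Minimize(worst)).solve()
```

The worst-case objective is a maximum of convex functions, so the problem remains convex.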
Maximum Likelihood Estimation
- The video discusses statistical estimation, specifically maximum likelihood estimation.
- Maximum likelihood estimation aims to find the parameter values that maximize the likelihood of observing the data.
- The log likelihood is often used instead of the likelihood itself; since the logarithm is monotone increasing, maximizing either gives the same estimate.
- Regularization can be added to the maximum likelihood function to prevent overfitting.
- Maximum likelihood estimation is a convex problem when the log likelihood function is concave.
- An example of a linear measurement model is given, where the goal is to estimate the unknown parameter X from a set of measurements.
- The log likelihood function for this model is derived and shown to be concave in the parameter when the noise is Gaussian, so the ML problem is convex.
- The maximum likelihood solution for this model is the least squares solution.
- The noise distribution can also be Laplacian, which has heavier tails than a Gaussian distribution.
- Maximum likelihood estimation with Laplacian noise is equivalent to L1 estimation.
- L1 estimation is more robust to outliers compared to least squares estimation because it is less sensitive to large residuals.
- The difference between L2 fitting and L1 approximation can be explained by the difference in the assumed noise density.
- Huber fitting can be interpreted as maximum likelihood estimation with a noise density that is Gaussian near zero and has heavier, exponential tails.
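A CVXPY sketch of ML estimation for the linear measurement model under the two noise assumptions above; the data, including a few deliberate outliers, are made up:

```python
import cvxpy as cp
import numpy as np

# Linear measurement model y = A x + v, with made-up data and a few outliers.
rng = np.random.default_rng(4)
m, n = 100, 15
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + 0.1 * rng.standard_normal(m)
y[:5] += 5.0  # outliers

x = cp.Variable(n)
# Gaussian noise: maximizing the log likelihood == minimizing ||Ax - y||_2^2.
cp.Problem(cp.Minimize(cp.sum_squares(A @ x - y))).solve()
x_ls = x.value

# Laplacian noise: maximizing the log likelihood == minimizing ||Ax - y||_1,
# which is far less sensitive to the outliers.
cp.Problem(cp.Minimize(cp.norm(A @ x - y, 1))).solve()
x_l1 = x.value

print(np.linalg.norm(x_ls - x_true), np.linalg.norm(x_l1 - x_true))
```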
Logistic Regression
- Logistic regression is a model for binary classification where the probability of an outcome is modeled using a logistic function.
- For separable data, the maximum likelihood logistic curve is infinitely sharp, because making it any less sharp would lower the predicted probabilities of the observed outcomes and decrease the log likelihood.
- When the log-likelihood is unbounded above, it means the data are linearly separable, and this issue can be fixed by adding regularization.
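A minimal CVXPY sketch of regularized ML logistic regression; the data and regularization weight are made up, and the L2 regularizer is just one choice (the lecture only says that adding regularization fixes the unboundedness):

```python
import cvxpy as cp
import numpy as np

# Made-up binary classification data (labels in {0, 1}), possibly separable.
rng = np.random.default_rng(5)
m, n = 80, 5
A = rng.standard_normal((m, n))
b = (A @ rng.standard_normal(n) > 0).astype(float)

w = cp.Variable(n)
# Logistic log likelihood: sum_i [ b_i * a_i^T w - log(1 + exp(a_i^T w)) ].
log_lik = cp.sum(cp.multiply(b, A @ w) - cp.logistic(A @ w))

lam = 0.1  # arbitrary regularization weight
# Without the regularizer the problem is unbounded when the data are separable.
cp.Problem(cp.Maximize(log_lik - lam * cp.sum_squares(w))).solve()
print(w.value)
```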
Hypothesis Testing
- In basic hypothesis testing, a randomized detector is a 2×n matrix whose k-th column gives the probabilities of guessing each of the two distributions when outcome k is observed.
- The confusion matrix, or detection probability matrix, is obtained by multiplying the randomized detector matrix by the probability distributions P and Q.
- The goal is to have the detection matrix be an identity matrix, indicating perfect accuracy in guessing the distribution.
- The choice of the randomized detector involves a multi-criterion problem, balancing the probabilities of false negatives and false positives.
- In some cases, an analytical solution exists for the optimal randomized detector, resulting in a deterministic detector.
- The likelihood ratio test is a simple method for determining the distribution from which an outcome originated based on the ratio of P over Q.
- The same idea extends to continuous distributions: compare the ratio of the two densities to a threshold.
- Deterministic detectors can be used to classify data points based on their density ratio.
- Other approaches, such as minimizing the probability of being wrong, can also be used for classification.
- These approaches often result in randomized (non-deterministic) detectors.
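A CVXPY sketch of designing a detector by minimizing the worse of the two error probabilities, one possible scalarization of the multi-criterion problem above; the distributions p and q are made up:

```python
import cvxpy as cp
import numpy as np

# Two made-up distributions p, q on n outcomes.
rng = np.random.default_rng(6)
n = 10
p = rng.random(n); p /= p.sum()
q = rng.random(n); q /= q.sum()

# Randomized detector: column k gives P(guess 1 | outcome k), P(guess 2 | outcome k).
T = cp.Variable((2, n), nonneg=True)
D = T @ np.column_stack([p, q])  # 2x2 detection probability matrix

# D[1, 0] = P(guess 2 | hypothesis 1), D[0, 1] = P(guess 1 | hypothesis 2).
# Minimax design: minimize the larger of the two error probabilities.
prob = cp.Problem(cp.Minimize(cp.maximum(D[1, 0], D[0, 1])),
                  [cp.sum(T, axis=0) == 1])
prob.solve()
print(D.value)
```

For this minimax objective the optimal detector may genuinely randomize on some outcomes, unlike the likelihood ratio test obtained from other scalarizations.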