We can show that for a suitable choice of stepsize, gradient descent converges to the minimum $\ell_2$-norm solution.
We use the Huber loss function $\ell_{\mathrm{hub}}(\hat{y}, y)$. Thus we can define the empirical (sample) loss as the objective function
$$f(w) := L_S^{\mathrm{hub}}(w) = \frac{1}{m}\sum_{i=1}^{m} \ell_{\mathrm{hub}}\bigl(\langle w, x_i\rangle,\, y_i\bigr).$$
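As a concrete illustration, here is a minimal Python sketch of gradient descent on this objective, assuming the threshold-$1$ Huber loss defined below and synthetic data chosen only for the example. Starting at $w = 0$ keeps every iterate in the span of the $x_i$, which is the mechanism behind convergence to the minimum-$\ell_2$-norm solution.

```python
import numpy as np

def huber_grad(r):
    """Derivative of the threshold-1 scalar Huber loss: the residual clipped to [-1, 1]."""
    return np.clip(r, -1.0, 1.0)

def gd_huber(X, y, steps=20_000):
    """Gradient descent on f(w) = (1/m) * sum_i huber(<w, x_i> - y_i), started at w = 0."""
    m, d = X.shape
    L = np.linalg.norm(X, ord=2) ** 2 / m   # an upper bound on the smoothness constant of f
    w = np.zeros(d)                          # starting at 0 keeps w in the row span of X
    for _ in range(steps):
        r = X @ w - y                        # residuals <w, x_i> - y_i
        w -= (1.0 / L) * (X.T @ huber_grad(r) / m)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 50))            # underdetermined: many interpolating solutions
    y = X @ rng.normal(size=50)              # realizable labels
    w_gd = gd_huber(X, y)
    w_min_norm = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-L2-norm interpolator
    print(np.linalg.norm(w_gd - w_min_norm))             # small if the claim holds here
```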
Claim: $f$ is convex and differentiable.
To see that $f$ is convex with respect to $w$, we recognize that $f$ is a sum of functions. Each summand is either $\frac{1}{2}(\langle w, x\rangle - y)^2$, which is the composition of the convex function $z \mapsto z^2$ with an affine function and hence convex in $w$, or $|\langle w, x\rangle - y| - \frac{1}{2}$, where $|z| = \max(z, -z)$ is a pointwise maximum of convex functions and thus convex, so again we have the composition of a convex function with an affine function and obtain a convex function of $w$.
Since a sum of convex functions is convex, we conclude that $f$ is convex.
To see that $f$ is differentiable, it is enough to show that its partial derivatives exist and are continuous. Writing $h$ for the scalar Huber function, so that $f(w) = \frac{1}{m}\sum_{i} h(\langle w, x_i\rangle - y_i)$, we have
$$\nabla f(w) = \frac{1}{m}\sum_{i=1}^{m} h'\bigl(\langle w, x_i\rangle - y_i\bigr)\, x_i, \qquad h'(a) = \begin{cases} a, & |a| \le 1,\\ \operatorname{sgn}(a), & |a| > 1, \end{cases}$$
and $h'$ is continuous, so the partial derivatives of $f$ are continuous and $f$ is differentiable.
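As a quick numerical sanity check of this gradient formula, here is a small finite-difference comparison; the random data and the step size $\varepsilon$ are arbitrary choices for the sketch.

```python
import numpy as np

def huber_objective(w, X, y):
    """f(w) = (1/m) sum_i huber(<w, x_i> - y_i) with threshold 1, plus its gradient."""
    r = X @ w - y
    vals = np.where(np.abs(r) <= 1, 0.5 * r**2, np.abs(r) - 0.5)
    grad = X.T @ np.clip(r, -1.0, 1.0) / len(y)   # h'(r): the residual clipped to [-1, 1]
    return vals.mean(), grad

rng = np.random.default_rng(1)
X, y = rng.normal(size=(30, 10)), rng.normal(size=30)
w = rng.normal(size=10)
_, g = huber_objective(w, X, y)

eps = 1e-6
g_fd = np.zeros_like(w)
for j in range(len(w)):                 # central finite differences, coordinate by coordinate
    e = np.zeros_like(w)
    e[j] = eps
    g_fd[j] = (huber_objective(w + e, X, y)[0] - huber_objective(w - e, X, y)[0]) / (2 * eps)

print(np.max(np.abs(g - g_fd)))         # tiny (around 1e-8 or smaller) away from the kinks
```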
We can use this to obtain a learning (generalization) guarantee if we know that the algorithm is on-average-replace-one stable. For the non-negative squared loss function we see that
and we note that for $\ell(w, z_i) = \frac{1}{2}\bigl(y_i - w^\top x_i\bigr)^2$, under the assumption that $\|x\| \le 1$ and $|y| \le B$ with probability $1$, we have the upper bound
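For instance, via the triangle and Cauchy–Schwarz inequalities these assumptions give a bound of the following shape (a sketch only; the exact form, and any bound imposed on $\|w\|$, come from the omitted derivation):
$$\ell(w, z_i) = \tfrac{1}{2}\bigl(y_i - w^\top x_i\bigr)^2 \le \tfrac{1}{2}\bigl(|y_i| + \|w\|\,\|x_i\|\bigr)^2 \le \tfrac{1}{2}\bigl(B + \|w\|\bigr)^2.$$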
Here we have the hypothesis class $\mathcal{H}_k = \{x \mapsto \langle w, x\rangle : \|w\|_0 \le k\}$ of $k$-sparse linear predictors.
We can show the hardness of learning $k$-sparse linear predictors in the realizable case by a reduction from vertex cover, which is an NP-hard problem.
Let us be given any graph $G = (V, E)$. We can enumerate the vertices arbitrarily and, with this ordering, encode each edge $e = (u, v)$ as a vector in $\mathbb{R}^{|V|}$ whose entries at positions $u$ and $v$ are $1$ and which is $0$ everywhere else.
Now we write $v_e$ for this encoded vector, and we let our data be $(x_i, y_i) = (v_{e_i}, 1)$, i.e., all the encoded edges, each with label $1$.
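A small Python sketch of this construction (the example graph and the helper name are illustrative):

```python
import numpy as np

def edges_to_dataset(num_vertices, edges):
    """Encode each edge (u, v) as a vector in R^{|V|} with 1 at positions u and v, 0 elsewhere.

    Every example gets the label +1, matching the reduction in the text.
    """
    X = np.zeros((len(edges), num_vertices))
    for i, (u, v) in enumerate(edges):
        X[i, u] = X[i, v] = 1.0
    return X, np.ones(len(edges))

# Example: a 4-cycle on vertices 0..3.
X, y = edges_to_dataset(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
print(X)
```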
Let us denote the sub-hypothesis class $\mathcal{H}^+_k = \{x \mapsto \langle w, x\rangle : w \succeq 0,\ \|w\|_0 \le k\}$, i.e., all the $k$-sparse linear predictors whose entries are non-negative. We also follow the convention that $\operatorname{sgn}(0) = -1$.
Then the vertex cover decision problem (i.e., whether there exists a vertex cover of at most $k$ vertices) is equivalent to the $\mathrm{CONS}_{\mathcal{H}^+_k}$ decision problem
$$\exists\, w \in \mathbb{R}^{|V|},\ w \succeq 0,\ \|w\|_0 \le k \ :\ \forall i,\ w^\top x_i > 0.$$
We recognize that
$$\forall i,\ w^\top x_i > 0 \iff \forall i,\ y_i\operatorname{sgn}(w^\top x_i) > 0,$$
so a $w \in \mathcal{H}^+_k$ satisfying the left-hand side is precisely a witness that $\mathrm{CONS}_{\mathcal{H}^+_k} = 1$,
and moreover, if such a consistent $w$ exists, then the vertices corresponding to its positive entries form a set of at most $\|w\|_0 \le k$ vertices; since $w \succeq 0$, the condition $w^\top x_e = w_u + w_v > 0$ forces at least one endpoint of every edge $e = (u, v)$ to carry a positive entry, so every edge is incident to this set and it is a vertex cover of size at most $k$. Conversely, if a vertex cover $C$ with $|C| \le k$ exists, then the indicator vector of $C$ is a non-negative $k$-sparse $w$ with $w^\top x_i \ge 1 > 0$ for every edge, so a consistent $w$ exists. This means that $\mathrm{VCOVER} = 1 \iff \mathrm{CONS}_{\mathcal{H}^+_k} = 1$.
Then, we simply recognize that the assumption of non-negative entries for our decision problem does not matter since
$$\exists\, w \in \mathcal{H}^+_k :\ \forall i,\ w^\top x_i > 0 \quad\Longleftrightarrow\quad \exists\, w \in \mathcal{H}_k :\ \forall i,\ w^\top x_i > 0.$$
The forward implication is trivial, since any $k$-sparse linear predictor with non-negative entries is in particular a $k$-sparse linear predictor. The backward implication holds since, if such a $w$ exists, we can replace every negative entry by a non-negative one: for an edge $(u, v)$ with $w_u < 0$ and $w_u + w_v > 0$ we must have $w_v > 0$, so this breaks no constraint, and replacing by $0$ does not increase $\|w\|_0$. We thus obtain a $w \succeq 0$ in $\mathcal{H}^+_k$ such that $w^\top x_i > 0$ for all $i$.
Then we have that $\mathrm{CONS}_{\mathcal{H}_k} = 1 \iff \mathrm{CONS}_{\mathcal{H}^+_k} = 1$, which means that for the vertex cover decision problem
$$\mathrm{VCOVER} = 1 \iff \mathrm{CONS}_{\mathcal{H}_k} = 1,$$
which concludes our reduction from vertex cover to the decision problem of whether there exists a consistent $k$-sparse linear predictor, and shows the hardness of learning $k$-sparse linear predictors in the realizable case.
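On small graphs both sides of this equivalence can be brute-forced, which gives a concrete check of the reduction. The sketch below (helper names and the toy graph are illustrative) restricts the search to $0/1$ indicator vectors, which is exact here because the positive support of any witness $w$ is itself a witness once its entries are set to $1$.

```python
from itertools import combinations
import numpy as np

def encode_edges(num_vertices, edges):
    """Each edge (u, v) becomes a vector in R^{|V|} with 1 at u and v, 0 elsewhere."""
    X = np.zeros((len(edges), num_vertices))
    for i, (u, v) in enumerate(edges):
        X[i, u] = X[i, v] = 1.0
    return X

def has_vertex_cover(num_vertices, edges, k):
    """Brute force: is some set of at most k vertices incident to every edge?"""
    return any(
        all(u in cover or v in cover for u, v in edges)
        for size in range(k + 1)
        for cover in combinations(range(num_vertices), size)
    )

def has_consistent_sparse_w(num_vertices, edges, k):
    """Brute force: is there w >= 0 with ||w||_0 <= k and <w, x_e> > 0 for every edge?"""
    X = encode_edges(num_vertices, edges)
    for size in range(k + 1):
        for support in combinations(range(num_vertices), size):
            w = np.zeros(num_vertices)
            w[list(support)] = 1.0           # indicator vector of the candidate support
            if np.all(X @ w > 0):
                return True
    return False

# Toy check on a triangle: it has a vertex cover of size 2 but not of size 1.
edges = [(0, 1), (1, 2), (2, 0)]
for k in (1, 2):
    assert has_vertex_cover(3, edges, k) == has_consistent_sparse_w(3, edges, k)
print("vertex cover and consistent sparse predictor answers agree on the toy graph")
```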
Stability
Here we are given that for any $p \in (1, 2]$, $\Psi(w) = \|w\|_p^2$ is $2(p-1)$-strongly convex (with respect to $\|\cdot\|_p$). We then show that RERM with a suitable choice of $p$ and $\lambda$ can learn convex, $G$-Lipschitz, $B$-bounded problems.
Since $\Psi(w)$ is $2(p-1)$-strongly convex, following the work in 4b) we can easily get that $\lambda\Psi(w)$ is $2\lambda(p-1)$-strongly convex. In particular, applying the $\beta$-stability result from 4a), we get a stability rate of
Now we pick $p = 1 + \frac{1}{\log d}$, noting that, assuming $d > e$, we have $p \in (1, 2)$, and we also pick $\lambda = \sqrt{\frac{G^2 \log d}{B^2 m}}$. Then our learning guarantee upper bound becomes
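A sketch of what this bound looks like, assuming the guarantee from 4a)/4b) takes the usual form $\lambda B^2 + \frac{G^2}{\lambda(p-1)m}$ (the exact constants in the original bound may differ): with $p - 1 = \frac{1}{\log d}$, this choice of $\lambda$ balances the two terms, giving
$$\lambda B^2 + \frac{G^2}{\lambda(p-1)m} = GB\sqrt{\frac{\log d}{m}} + \frac{G^2 \log d}{\lambda m} = 2\,GB\sqrt{\frac{\log d}{m}} = O\!\left(GB\sqrt{\frac{\log d}{m}}\right).$$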