7.2.4
Consider the case when $p = 2$. Because the objective is positive, minimizing it is equivalent to minimizing its square $\sum_{x \in \mathcal{X}} (\theta^{T} x)^{-2}$. The problem is given formally as
\[
\min_{\theta}\ \sum_{x \in \mathcal{X}} \frac{1}{(\theta^{T} x)^{2}}
\qquad \text{subject to} \qquad \theta^{T} x > 0 \ \ \forall x \in \mathcal{X}, \quad \|\theta\| = 1
\]
This is a nonlinear programming problem and can be solved using the standard techniques. In this work, a gradient-based, non-parametric algorithm is used to obtain a numerical solution.
The perceptron algorithm [Rosenblatt 1962] has already provided a solution for learning a correct parameter estimate. It iterates on $\theta$ to increase an objective of the form $\theta^{T} x$ based on the gradient information, where the gradient is simply $x$. In one cycle, all $x \in \mathcal{X}$ are tried in turn to adjust $\theta$. If $\theta^{T} x > 0$, meaning $x$ is correctly classified under $\theta$, no action is taken. Otherwise, if $\theta^{T} x \le 0$, then $\theta$ is adjusted according to the following gradient-ascent rule:
\[
\theta \leftarrow \theta + \mu x
\]
where $\mu$ is a small constant. The updating is followed by a normalization operation $\theta \leftarrow \theta / \|\theta\|$ when $\|\theta\| = 1$ is required. The cycle is repeated until a correct classification is made for all the data and thus a correct $\theta$ is learned. With the assumption that a solution exists, which also means that the data set $\mathcal{X}$ is linearly separable from the origin of the data space, the algorithm is guaranteed to converge after a finite number of iterations [Duda and Hart 1973].
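As an illustration, this correction stage can be sketched in Python with NumPy as follows. The function name correct, the step size mu, and the cycle limit are choices made for the sketch, not notation taken from the procedure in Fig. 7.3.
\begin{verbatim}
import numpy as np

def correct(theta, X, mu=0.1, max_cycles=1000):
    """Perceptron-style stage: learn theta with theta^T x > 0 for all rows x of X."""
    theta = theta / np.linalg.norm(theta)
    for _ in range(max_cycles):
        updated = False
        for x in X:                      # one cycle: every x is tried in turn
            if theta @ x <= 0:           # x is misclassified under theta
                theta = theta + mu * x   # gradient-ascent step on theta^T x
                theta = theta / np.linalg.norm(theta)   # keep ||theta|| = 1
                updated = True
        if not updated:                  # all data correctly classified
            return theta
    raise RuntimeError("no correct theta found; data may not be "
                       "linearly separable from the origin")
\end{verbatim}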
Figure 7.3: Algorithm for finding the optimal combination of parameters.
From (Li 1995c) with permission; © 1995 Kluwer.
The objective function $\sum_{x \in \mathcal{X}} (\theta^{T} x)^{-2}$ can be minimized in a similar way. The gradient is
\[
\nabla f(\theta) = -2 \sum_{x \in \mathcal{X}} \frac{x}{(\theta^{T} x)^{3}}
\]
An updating goes as follows:
\[
\theta \leftarrow \theta - \mu \nabla f(\theta) = \theta + 2\mu \sum_{x \in \mathcal{X}} \frac{x}{(\theta^{T} x)^{3}}
\]
where $\mu$ is a small constant. If $\theta$ were unconstrained, the above might diverge when some $\theta^{T} x$ becomes too small. However, the updating is again followed by the normalization $\theta \leftarrow \theta / \|\theta\|$. This process repeats until $\theta$ converges to the optimal estimate $\theta^{*}$.
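Under this reading of the $p = 2$ objective, the constrained descent can be sketched as follows; the step size mu, the fixed iteration count, and the function name optimize are again illustrative.
\begin{verbatim}
import numpy as np

def optimize(theta, X, mu=1e-3, iters=10000):
    """Minimize sum_x 1/(theta^T x)^2 over the unit sphere.

    Starts from a correct theta; assumes mu is small enough that
    theta^T x > 0 is preserved from one step to the next.
    """
    theta = theta / np.linalg.norm(theta)
    for _ in range(iters):
        d = X @ theta                              # theta^T x for every x (all positive)
        grad = -2.0 * np.sum(X / d[:, None]**3, axis=0)
        theta = theta - mu * grad                  # gradient-descent step
        theta = theta / np.linalg.norm(theta)      # renormalize to ||theta|| = 1
    return theta
\end{verbatim}
The normalization after each step is what keeps the update bounded even when some $\theta^{T} x$ is small.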
In our implementation, the two stages are combined into one procedure, as shown in Fig. 7.3. In the procedure, initialize() sets $\theta$ at random, correct($\theta$) learns a correct $\theta$ from $\mathcal{X}$, and a convergence test ($\|\Delta\theta\| < \epsilon$), where $\epsilon$ is a small number, verifies the convergence. The algorithm is very stable.
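One possible arrangement of the combined procedure, reusing the correct() and optimize() sketches above and with epsilon as the small convergence threshold, is the following; the exact control flow of Fig. 7.3 may differ.
\begin{verbatim}
import numpy as np

def find_optimal_combination(X, epsilon=1e-6, max_outer=100):
    """Combined procedure in the spirit of Fig. 7.3 (correct() and optimize() as above)."""
    rng = np.random.default_rng()
    theta = rng.standard_normal(X.shape[1])        # initialize(): set theta at random
    for _ in range(max_outer):
        theta_new = correct(theta, X)              # correct(theta): learn a correct theta from X
        theta_new = optimize(theta_new, X)         # improve the p = 2 objective
        if np.linalg.norm(theta_new - theta) < epsilon:   # change below epsilon: converged
            return theta_new
        theta = theta_new
    return theta
\end{verbatim}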
The amount of change in each minimization iteration is $\Delta\theta = 2\mu \sum_{x \in \mathcal{X}} x / (\theta^{T} x)^{3}$. The influence from $x$ is weighted by $1/(\theta^{T} x)^{3}$. This means that those $x$ with smaller $\theta^{T} x$ values (closer to the hyperplane) exert a bigger force in pushing the hyperplane away from themselves, whereas those with big $\theta^{T} x$ values (far away from the hyperplane) have small influence.
This effectively stabilizes the learning process.
When $p \rightarrow \infty$, we are facing the minimax problem (7.23). An algorithm for solving this is the ``Generalized Portrait Technique'' [Vapnik 1982], which is designed for constructing hyperplanes with maximum separability. It has been extended by Boser et al. [Boser et al. 1992] to train classifiers of the form $\theta^{T} \varphi(x)$, where $\varphi(x)$ is a vector of functions of $x$. The key
idea is to transform the problem into the dual space by means of the
Lagrangian. This gives a quadratic optimization with constraints. The
optimal parameter estimate is expressed as a linear combination of
supporting patterns where the supporting patterns correspond to the data
points nearest to the hyperplane. Two benefits are gained from this
method: there are no local minima in the quadratic optimization and the
obtained maximum separability is insensitive to small changes of the
learned parameters.
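As an illustration of this dual construction (not the implementation used here), a maximum-separability hyperplane through the origin can be computed with a general-purpose optimizer. The sketch below assumes the correctness constraints are scaled as $\theta^{T} x \ge 1$ and uses SciPy's bounded L-BFGS-B routine to maximize the dual.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def max_separability_hyperplane(X):
    """Dual of: minimize ||theta||^2 subject to theta^T x >= 1 for every row x of X.

    Maximizing W(a) = sum(a) - 0.5 * a^T K a over a >= 0 gives
    theta* = sum_i a_i x_i; rows with a_i > 0 are the supporting patterns.
    """
    K = X @ X.T                                    # Gram matrix of the data
    def neg_dual(a):                               # minimize the negated dual
        return 0.5 * a @ K @ a - a.sum()
    def neg_dual_grad(a):
        return K @ a - np.ones_like(a)
    a0 = np.ones(X.shape[0])
    res = minimize(neg_dual, a0, jac=neg_dual_grad,
                   bounds=[(0, None)] * X.shape[0], method="L-BFGS-B")
    alpha = res.x
    theta = X.T @ alpha                            # linear combination of supporting patterns
    support = np.flatnonzero(alpha > 1e-8)         # indices of supporting patterns
    return theta / np.linalg.norm(theta), support
\end{verbatim}
The quadratic dual is convex, so the solver cannot be trapped in a local minimum, which reflects the first of the two benefits noted above.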
The $\theta^{*}$ computed using the non-parametric procedure is optimal with respect to the training data $\mathcal{X}$. It is the best result that can be obtained from $\mathcal{X}$ for generalization to other data. Better results may be obtained, provided that more knowledge about the training data is available.