
7.2.4 A Non-parametric Learning Algorithm

Consider the case when $p=2$. Because the $p=2$ criterion is positive, minimizing it is equivalent to minimizing its square

\[
f(\theta) \;=\; \sum_{i=1}^{m} \frac{1}{(\theta^{T} x_i)^{2}}
\]

where $X = \{x_1, \ldots, x_m\}$ is the set of training vectors. The problem is given formally as

\[
\min_{\theta}\; \sum_{i=1}^{m} \frac{1}{(\theta^{T} x_i)^{2}}
\quad \mbox{subject to} \quad
\theta^{T} x_i > 0, \;\; i = 1, \ldots, m, \qquad \|\theta\|_2 = 1
\]

This is a nonlinear programming problem and can be solved using standard techniques. In this work, a gradient-based, non-parametric algorithm is used to obtain a numerical solution.
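To make the criterion concrete, the following sketch (in Python, with illustrative function names) evaluates the squared $p=2$ criterion given above and checks the correctness constraint $\theta^{T} x_i > 0$; it assumes the training vectors are stacked as rows of a matrix X.

import numpy as np

def is_correct(theta, X):
    """Correctness: theta puts every training vector x_i on the
    positive side of the hyperplane, i.e. theta^T x_i > 0."""
    return bool(np.all(X @ theta > 0))

def objective_p2(theta, X):
    """Assumed p=2 criterion: sum_i 1 / (theta^T x_i)^2.
    Smaller values mean the data lie farther from the hyperplane."""
    margins = X @ theta                 # theta^T x_i for every training vector
    return float(np.sum(1.0 / margins**2))

# Example usage on a toy, linearly separable data set (illustrative only).
X = np.array([[1.0, 0.5], [0.8, 1.2], [1.5, 0.2]])
theta = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit-norm parameter vector
if is_correct(theta, X):
    print(objective_p2(theta, X))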

The perceptron algorithm [Rosenblatt 1962] already provides a solution for learning a correct parameter estimate $\theta$. It iterates on $\theta$ to increase an objective of the form $\theta^{T} x$ based on the gradient information, where the gradient is simply $x$. In one cycle, all $x \in X$ are tried in turn to adjust $\theta$. If $\theta^{T} x > 0$, meaning $x$ is correctly classified under $\theta$, no action is taken. Otherwise, if $\theta^{T} x \le 0$, $\theta$ is adjusted according to the following gradient-ascent rule:

\[
\theta \leftarrow \theta + \mu\, x
\]

where $\mu > 0$ is a small constant. The updating is followed by a normalization operation

\[
\theta \leftarrow \theta / \|\theta\|_2
\]

when $\|\theta\|_2 = 1$ is required. The cycle is repeated until a correct classification is made for all the data and thus a correct $\theta$ is learned. Under the assumption that a solution exists, which means that the data set is linearly separable from the origin of the data space, the algorithm is guaranteed to converge after a finite number of iterations [Duda and Hart 1973].
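The correction stage described above can be sketched as follows; the step size mu, the cycle limit, and the function name correct() are illustrative choices rather than part of the original algorithm.

import numpy as np

def correct(theta, X, mu=0.1, max_cycles=1000):
    """Perceptron-style correction stage: cycle through the training
    vectors and adjust theta until theta^T x > 0 holds for all of them."""
    for _ in range(max_cycles):
        updated = False
        for x in X:
            if theta @ x <= 0:                         # x misclassified under theta
                theta = theta + mu * x                 # gradient-ascent step on theta^T x
                theta = theta / np.linalg.norm(theta)  # keep ||theta||_2 = 1
                updated = True
        if not updated:                                # all data correctly classified
            break
    return theta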

  
Figure 7.3: Algorithm for finding the optimal combination of parameters. From (Li 1995c) with permission; © 1995 Kluwer.

The objective function $f(\theta)$ can be minimized in a similar way. The gradient is

\[
\nabla f(\theta) \;=\; -2 \sum_{i=1}^{m} \frac{x_i}{(\theta^{T} x_i)^{3}}
\]

An update is performed as follows:

\[
\theta \leftarrow \theta - \mu\, \nabla f(\theta)
       \;=\; \theta + 2\mu \sum_{i=1}^{m} \frac{x_i}{(\theta^{T} x_i)^{3}}
\]

where $\mu > 0$ is a small constant. If $\theta$ were unconstrained, the above might diverge when $\theta^{T} x_i$ becomes too small. However, the updating is again followed by the normalization $\theta \leftarrow \theta / \|\theta\|_2$. This process repeats until $\theta$ converges to the optimum $\theta^{*}$.
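Under the same assumed form of the objective, the minimization stage might be sketched as follows; the step size, convergence threshold, and iteration cap are again illustrative.

import numpy as np

def minimize(theta, X, mu=0.01, eps=1e-6, max_iters=10000):
    """Gradient-descent stage: decrease sum_i 1/(theta^T x_i)^2 while
    keeping ||theta||_2 = 1 by renormalizing after each step."""
    for _ in range(max_iters):
        margins = X @ theta                               # theta^T x_i
        grad = -2.0 * (X / margins[:, None]**3).sum(axis=0)
        new_theta = theta - mu * grad                     # plain gradient step
        new_theta /= np.linalg.norm(new_theta)            # project back to ||theta||_2 = 1
        if np.linalg.norm(new_theta - theta) < eps:       # converged
            return new_theta
        theta = new_theta
    return theta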

In our implementation, the two stages are combined into one procedure, as shown in Fig. 7.3. In the procedure, initialize() sets $\theta$ at random, correct() learns a correct $\theta$ from $X$, and a test of the form $\|\theta^{(t)} - \theta^{(t-1)}\| < \epsilon$, where $\epsilon$ is a small number, verifies the convergence. The algorithm is very stable.
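A possible rendering of the combined procedure of Fig. 7.3, built from the two sketches above (correct() and minimize()); the control flow of the original figure may differ in detail, so this is only an indicative outline.

import numpy as np

def learn_parameters(X, mu=0.01, eps=1e-6, seed=0, max_rounds=100):
    """Combined procedure: random initialization, perceptron correction,
    then constrained gradient descent until theta stops changing."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(X.shape[1])          # initialize() at random
    theta /= np.linalg.norm(theta)
    for _ in range(max_rounds):
        theta = correct(theta, X)                    # enforce theta^T x_i > 0
        new_theta = minimize(theta, X, mu=mu, eps=eps)
        if np.linalg.norm(new_theta - theta) < eps:  # convergence check
            return new_theta
        theta = new_theta
    return theta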

The amount of change in each minimization iteration is $\Delta\theta = 2\mu \sum_{i} x_i / (\theta^{T} x_i)^{3}$. The influence from $x$ is weighted by $1/(\theta^{T} x)^{3}$. This means that those $x$ with smaller $\theta^{T} x$ values (closer to the hyperplane) exert a bigger force in pushing the hyperplane away from themselves, whereas those with big $\theta^{T} x$ values (far away from the hyperplane) have small influence. This effectively stabilizes the learning process.
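For a concrete sense of this weighting, assuming the $1/(\theta^{T} x)^{3}$ factor above, a point ten times closer to the hyperplane contributes a thousand times more strongly to the update:

\[
\theta^{T} x = 0.1 \;\Rightarrow\; \frac{1}{(\theta^{T} x)^{3}} = 1000,
\qquad
\theta^{T} x = 1 \;\Rightarrow\; \frac{1}{(\theta^{T} x)^{3}} = 1 .
\]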

When $p=\infty$, we are facing the minimax problem (7.23). An algorithm for solving this is the ``Generalized Portrait Technique'' [Vapnik 1982], which is designed for constructing hyperplanes with maximum separability. It was extended by Boser et al. [Boser et al. 1992] to train classifiers of the form $\theta^{T} \phi(x)$, where $\phi(x)$ is a vector of functions of $x$. The key idea is to transform the problem into the dual space by means of the Lagrangian. This gives a quadratic optimization problem with constraints. The optimal parameter estimate is expressed as a linear combination of supporting patterns, where the supporting patterns correspond to the data points nearest to the hyperplane. Two benefits are gained from this method: there are no local minima in the quadratic optimization, and the obtained maximum separability is insensitive to small changes of the learned parameters.
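In outline, and using the notation of this section rather than the exact equations of the cited references, the maximum-separability problem for a hyperplane through the origin and its Lagrangian dual take the familiar quadratic-programming form, with the supporting patterns being the $x_i$ whose multipliers $\alpha_i$ are nonzero:

\[
\min_{\theta}\ \tfrac{1}{2}\,\|\theta\|_2^{2}
\quad \mbox{subject to} \quad \theta^{T} x_i \ge 1, \;\; i = 1, \ldots, m
\]
\[
\max_{\alpha_i \ge 0}\ \sum_{i=1}^{m} \alpha_i
 \;-\; \tfrac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j\, x_i^{T} x_j,
\qquad
\theta^{*} = \sum_{i=1}^{m} \alpha_i^{*}\, x_i .
\]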

The $\theta^{*}$ computed using the non-parametric procedure is optimal with respect to the training data $X$. It is the best result that can be obtained from $X$ for generalization to other data. Better results may be obtained, provided that more knowledge about the training data is available.


