Given a dataset \(\mathcal{D} = \{(\vec{x}_1, y_1), \cdots, (\vec{x}_N, y_N)\}\) and a hypothesis set \(\mathcal{H}\), our learning algorithm \(\mathcal{A}\) tries to learn a function \(g \in \mathcal{H}\) that approximates the underlying, true function \(f: \mathcal{X} \to \mathcal{Y}\), which generates the points in \(\mathcal{D}\).
Given a customer applying for a credit card, we want to build a system that decides whether to grant the application based on the customer's information, such as age, annual salary, years in job, etc. The bank's historical credit approval data can be seen as a dataset \(\mathcal{D} = \{(\vec{x}_i, y_i)\}_{i=1}^N\), where each \(\vx_i \in \mathcal{X}\) represents a customer. There is a target function \(f: \mathcal{X} \to \mathcal{Y}\) that gives \(\vx\)'s credit behavior \(f(\vec{x}) = y\). Each \(\vx\) is a multidimensional vector whose components are features (age, for example). Our learning algorithm \(\mathcal{A}\) considers a hypothesis class \(\mathcal{H}\), takes the dataset \(\mathcal{D}\), and tries to produce a function \(g \in \mathcal{H}\) that performs similarly to \(f\). We then use \(g\) as our system for approving credit cards.
Let \(\vec{x} = (x_1, x_2, \cdots, x_d)\) be a customer. We compute a weighted score \(\sum^d_{i=1} w_ix_i\) such that if the score exceeds a threshold \(t\), we approve the credit card; otherwise we don't. Furthermore, suppose \(\mathcal{Y} = \{+1, -1\}\), where \(+1\) means good credit and \(-1\) means bad credit. Then each \(h \in \mathcal{H}\) looks like \[ h(\vec{x}) = \sign\left( \left( \sum^d_{i=1} w_ix_i \right) - t\right) \] The \(h\) here depends on, or is parametrized by, \(\{w_i\}\) and \(t\). The set of all possible \(h\) (all possible combinations of \(\{w_i\}\) and \(t\)) is called the perceptron.
Observe that we can move \(t\) into the summation by extending \(\vec{w}\) and \(\vx\) with \(w_0 = -t\) and \(x_0 = 1\). Then we can write the summation as an inner product: \[ h(\vx) = \sign(\vec{w}^T\vx). \] Essentially, this \(\vec{w}\) represents a hypothesis. From now on we assume that \(\vec{w}\) always includes the (negative) threshold \(t\).
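As a quick illustration, here is a minimal sketch of this hypothesis in Python with NumPy (the function name and the toy numbers are my own, not from any particular library):

```python
import numpy as np

def perceptron_h(w, x):
    """h(x) = sign(w^T x), with the threshold folded into w.

    Assumes x carries the artificial coordinate x_0 = 1,
    so that w_0 plays the role of the negative threshold -t.
    """
    return np.sign(w @ x)

# Toy example: two features plus the constant coordinate.
x = np.array([1.0, 0.3, 0.7])   # x_0 = 1, then (x_1, x_2)
w = np.array([-0.5, 1.0, 1.0])  # w_0 = -t with t = 0.5
print(perceptron_h(w, x))       # +1: the weighted score 1.0 exceeds t = 0.5
```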
In \(\mathbb{R}^2\), \(\vx = [x_1 \quad x_2]^T\), and the hypothesis is \[ h(\vx) = \sign(w_0 + w_1x_1 + w_2x_2). \] We can plot all \(\vx\) in the dataset on the plane, using \(\circ\) to represent \(+1\) and \(\times\) to represent \(-1\). Observe that \(w_0 + w_1x_1 + w_2x_2 = 0\) is a line: \(h(\vx) = +1\) if \(\vx\) lies on one side of the line, and \(-1\) if it lies on the other. Since the perceptron essentially uses a line to make its decision, we also call such a hypothesis class a linear (binary) classifier.
With \(\mathcal{H}\), the set of all possible perceptrons, how do we select the best one, denoted \(g\), such that \(g\) is close to \(f\), which we never get to see? Since \(\mathcal{D}\) is generated by \(f\), we ideally want \[ g(\vx_i) = f(\vx_i) = y_i. \] We can focus on finding a \(g\) with this property. However, this is difficult: each \(h \in \mathcal{H}\) is parametrized by a line, and there are infinitely (uncountably) many candidates to choose from.
We can start from some \(g_0\) and iteratively correct its mistakes on \(\mathcal{D}\). We use \(\vec{w}_0\) to represent \(g_0\), and we may take \(\vec{w}_0 = \vec{0}\).
Algorithm (Perceptron Learning Algorithm):
For \(t = 0, 1, \ldots\)
1. Find a mistake of \(\vec{w}_t\), called \((\vx_{n(t)}, y_{n(t)})\). By a mistake, we mean that \(\sign(\vec{w}^T_t\vx_{n(t)}) \ne y_{n(t)}\).
2. (Try to) correct the mistake by \(\vec{w}_{t + 1} \leftarrow \vec{w}_t + y_{n(t)}\vx_{n(t)}\).
Until no more mistakes.
Return the last \(\vec{w}\) (called \(\vec{w}_{PLA}\)) as \(g\).
Algorithm (Cyclic PLA):
For \(t = 0, 1, \ldots\)
1. Find the next mistake of \(\vec{w}_t\) in cyclic order through \(\mathcal{D}\), called \((\vx_{n(t)}, y_{n(t)})\), i.e. \(\sign(\vec{w}^T_t\vx_{n(t)}) \ne y_{n(t)}\).
2. (Try to) correct the mistake by \(\vec{w}_{t + 1} \leftarrow \vec{w}_t + y_{n(t)}\vx_{n(t)}\).
Until a full cycle through \(\mathcal{D}\) encounters no mistakes.
Return the last \(\vec{w}\) (called \(\vec{w}_{PLA}\)) as \(g\).
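Here is a minimal sketch of the cyclic variant in Python with NumPy. It assumes the constant coordinate \(x_0 = 1\) has already been appended to each \(\vx_n\); the names and the `max_cycles` safeguard are my own additions:

```python
import numpy as np

def cyclic_pla(X, y, max_cycles=1000):
    """Cyclic PLA sketch (names and the max_cycles guard are my own).

    X: (N, d+1) array whose first column is the constant 1 (x_0).
    y: (N,) array of labels in {+1, -1}.
    Returns the final w once a full cycle over D finds no mistakes.
    """
    w = np.zeros(X.shape[1])                 # start from w_0 = 0
    for _ in range(max_cycles):
        mistake_found = False
        for n in range(len(y)):              # one full cycle over D
            if np.sign(w @ X[n]) != y[n]:    # a mistake of w_t
                w = w + y[n] * X[n]          # w_{t+1} <- w_t + y_n x_n
                mistake_found = True
        if not mistake_found:                # clean cycle: halt
            return w
    return w                                 # caution: may not have converged
```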
Suppose there is a mistake at iteration \(t\), on instance \(\vx_n\). We observe that \[ y_n\vec{w}^T_{t +1} \vx_n \ge y_n \vec{w}^T_t\vx_n \] is always true, which shows that the algorithm is indeed trying to correct the mistake. Why? Expanding the update, \[ y_n\vec{w}^T_{t+1}\vx_n = y_n(\vec{w}_t + y_n\vx_n)^T\vx_n = y_n\vec{w}^T_t\vx_n + \norm{\vx_n}^2 \ge y_n\vec{w}^T_t\vx_n, \] since \(y_n^2 = 1\). Because \((\vx_n, y_n)\) is a mistake, \(\sign(\vec{w}_t^T \vx_n) \ne y_n\), so \(y_n \vec{w}^T_t\vx_n\) is negative. The update pushes this quantity upward, toward being positive, which is the correct direction.
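A quick numeric check of this inequality (the numbers are toy values of my own choosing):

```python
import numpy as np

# A made-up mistake: y_n * (w_t . x_n) is negative before the update.
w_t = np.array([0.2, -1.0])
x_n = np.array([1.0, 0.5])
y_n = 1.0

before = y_n * (w_t @ x_n)    # -0.3 < 0, so (x_n, y_n) is a mistake
w_next = w_t + y_n * x_n      # PLA update
after = y_n * (w_next @ x_n)  # before + ||x_n||^2 = -0.3 + 1.25 = 0.95

assert after >= before        # the score moved in the correct direction
```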
We now ask: does the algorithm halt? And if it halts, how does \(g\) perform on instances outside \(\mathcal{D}\)?
If PLA halts, i.e. there are no more mistakes, then \(\mathcal{D}\) admits some \(\vw\) that makes no mistakes. We call such a \(\mathcal{D}\) linearly separable. But given a linearly separable \(\mathcal{D}\), will PLA halt?
Saying \(\mathcal{D}\) is linearly separable is equivalent to saying there exists a perfect \(\vw_f\) such that \(y_n = \sign(\vw^T_f \vx_n)\) for all \(n\). In other words, \[ \min_n y_n \vw^T_f \vx_n > 0. \] So, for any \(\vx_i\), \[ y_i\vw^T_f\vx_i \ge \min_ny_n\vw^T_f\vx_n > 0. \] To measure the similarity between \(\vw_f\) and \(\vw_T\), the weights learned at step \(T\), we can use their inner product: the larger the inner product, the closer the two vectors are. The following derivation shows that as the algorithm runs from one iteration to the next, our learned line gets closer and closer to the true one. \[
\begin{align*} \vw_f^T\vw_T &= \vw_f^T(\vw_{T-1} + y_{n(T-1)}\vx_{n(T-1)})\\ &= \vw_f^T\vw_{T-1} + y_{n(T-1)}\vw_f^T\vx_{n(T-1)} \\ &> \vw_f^T\vw_{T-1} \end{align*}\] where the last step uses \(y_{n(T-1)}\vw_f^T\vx_{n(T-1)} \ge \min_n y_n\vw_f^T\vx_n > 0\). If you are careful, you will notice that the magnitude of an inner product also depends on the magnitudes of the vectors. This is true; we therefore want \(\norm{\vw_T}\) to not grow too fast as \(T\) grows. \[
\begin{align*} \norm{\vw_T}^2 &= \norm{\vw_{T-1} + y_{n(T-1)}\vx_{n(T-1)}}^2 \\ &= \norm{\vw_{T-1}}^2 + 2y_{n(T-1)}\vw_{T-1}^T\vx_{n(T-1)} + \norm{y_{n(T-1)}\vx_{n(T-1)}}^2\\ &\le \norm{\vw_{T-1}}^2 + \norm{y_{n(T-1)}\vx_{n(T-1)}}^2\\ &= \norm{\vw_{T-1}}^2 + \norm{\vx_{n(T-1)}}^2\\ &\le \norm{\vw_{T-1}}^2 + \max_n\norm{\vx_n}^2\\ \end{align*}\] where the first inequality holds because the update is made on a mistake, so \(2y_{n(T-1)}\vw_{T-1}^T\vx_{n(T-1)} \le 0\). With this bound, suppose our \(\vw_0\) is the zero vector; applying the bound \(T\) times gives \[ \norm{\vw_T}^2 \le \norm{\vw_0}^2 + TR^2 = TR^2 \] where \(R^2 = \max_n \norm{\vx_n}^2\) (think of \(R\) as the radius of the dataset). Let \(\rho = \min_n y_n \frac{\vw_f^T}{\norm{\vw_f}}\vx_n\) (the closest distance from any \(\vx_n\) to the hyperplane defined by \(\vw_f\), i.e. the margin). Then the normalized inner product has the following property: \[
\begin{align*} \frac{\vw^T_f}{\norm{\vw_f}}\frac{\vw_T}{\norm{\vw_T}} &\ge \frac{\vw^T_f}{\norm{\vw_f}}\frac{\vw_T}{\sqrt{TR^2}}\\ &= \frac{\vw^T_f}{\norm{\vw_f}}\frac{\vw_T}{\sqrt{T}R}\\ &= \frac{1}{\sqrt{T}R}\frac{\vw^T_f}{\norm{\vw_f}}\vw_T\\ &= \frac{1}{\sqrt{T}R}\frac{\vw^T_f}{\norm{\vw_f}}(\vw_0 + y_{n(0)}\vx_{n(0)} + y_{n(1)}\vx_{n(1)} + \cdots + y_{n(T-1)}\vx_{n(T-1)})\\ &= \frac{1}{\sqrt{T}R}\frac{\vw^T_f}{\norm{\vw_f}}(y_{n(0)}\vx_{n(0)} + y_{n(1)}\vx_{n(1)} + \cdots + y_{n(T-1)}\vx_{n(T-1)})\\ &\ge \frac{1}{\sqrt{T}R}T\rho\\ &= \frac{\sqrt{T}\rho}{R}.\\ \end{align*}\] where the last inequality holds because each of the \(T\) update terms satisfies \(y_{n(i)}\frac{\vw_f^T}{\norm{\vw_f}}\vx_{n(i)} \ge \rho\) by the definition of \(\rho\). The inner product of two unit vectors is at most \(1\), so \[ 1 \ge \frac{\sqrt{T}\rho}{R}. \] Hence, \[ \sqrt{T} \le \frac{R}{\rho} \Longleftrightarrow T \le \frac{R^2}{\rho^2}. \] This means the number of updates PLA makes before halting is bounded by \(R^2 /\rho^2\), which depends only on the dataset and \(\vw_f\). So, as long as the dataset is linearly separable, PLA halts.
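A small sketch to check this bound empirically, assuming NumPy and a made-up linearly separable toy set. Since we never have the true \(\vw_f\), and the derivation holds for any perfect separator, the code reuses the separator PLA finds to compute a valid \(\rho\):

```python
import numpy as np

def pla_count_updates(X, y):
    """Run PLA (always fixing the first mistake) and count updates."""
    w, updates = np.zeros(X.shape[1]), 0
    while True:
        mistakes = np.flatnonzero(np.sign(X @ w) != y)
        if mistakes.size == 0:
            return w, updates
        n = mistakes[0]
        w, updates = w + y[n] * X[n], updates + 1

# Made-up separable data: label by a hidden line, with x_0 = 1 prepended.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.uniform(-1, 1, (20, 2))])
w_hidden = np.array([0.1, 1.0, -1.0])
y = np.sign(X @ w_hidden)

w, T = pla_count_updates(X, y)
R2 = np.max(np.sum(X**2, axis=1))              # R^2 = max_n ||x_n||^2
rho = np.min(y * (X @ w)) / np.linalg.norm(w)  # margin of this separator
print(T, "updates; bound R^2/rho^2 =", R2 / rho**2)
```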
Pros: PLA is simple to implement, fast, and works in any dimension \(d\). Cons: we assume \(\mathcal{D}\) is linearly separable, but in reality this property is unknown in advance. Moreover, we are never fully sure how long halting will take, because \(\rho\) depends on \(\vw_f\), which we never have access to in the first place. So if we run PLA and it does not halt for a long time, we cannot tell whether \(\mathcal{D}\) is not linearly separable or the algorithm is simply taking a long time to halt.
Taking a step back, we have so far assumed that each \(y_n\) is generated by applying the target function \(f\) to \(\vx_n\). In reality, however, there may be noise, so that \(y_n = f(\vx_n) + \text{noise}\). This means a noisy \(\mathcal{D}\) may not be linearly separable, even when its noiseless version is. How do we learn a \(\vw\) in this case?
First, we assume that the noise is small (which is reasonable), i.e. \[ y_n = f(\vx_n), \quad \text{usually}. \] Therefore, if \(g \approx f\), then \[ y_n = g(\vx_n), \quad \text{usually}. \] We want to find a line (hyperplane) \(\vw_g\) such that the number of mistakes made on \(\mathcal{D}\) is minimized: \[ \vw_g \leftarrow \arg \min_\vw \sum^N_{n=1} \mathbb{1}_{[y_n \ne \sign(\vw^T\vx_n)]}. \] Unfortunately, solving this optimization problem is NP-hard. However, we can modify PLA to get an approximately good \(g\).
Algorithm (Pocket Algorithm):
Initialize pocket weights \(\hat{\vw}\).
For \(t = 0, 1, \ldots\)
1. Find a (random) mistake of \(\vw_t\), called \((\vx_{n(t)}, y_{n(t)})\).
2. (Try to) correct the mistake by \(\vw_{t + 1} \leftarrow \vw_t + y_{n(t)}\vx_{n(t)}\).
3. If \(\vw_{t + 1}\) makes fewer mistakes on \(\mathcal{D}\) than \(\hat{\vw}\), replace \(\hat{\vw}\) with \(\vw_{t + 1}\).
Until enough iterations.
Return \(\hat{\vw}\) (called \(\vw_{POCKET}\)) as \(g\).
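A minimal sketch of the pocket algorithm in Python with NumPy (the function and variable names are my own, and `max_iters` stands in for "enough iterations"):

```python
import numpy as np

def pocket(X, y, max_iters=1000, seed=0):
    """Pocket algorithm sketch (names are my own, not from the notes).

    Runs PLA updates on randomly chosen mistakes, but keeps the best
    weights seen so far ("in the pocket"), judged by mistakes on D.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_hat = w.copy()
    best = np.count_nonzero(np.sign(X @ w) != y)   # pocket's mistake count
    for _ in range(max_iters):
        mistakes = np.flatnonzero(np.sign(X @ w) != y)
        if mistakes.size == 0:                     # D happened to be separable
            return w
        n = rng.choice(mistakes)                   # a random mistake of w_t
        w = w + y[n] * X[n]                        # usual PLA correction
        errs = np.count_nonzero(np.sign(X @ w) != y)
        if errs < best:                            # fewer mistakes than pocket?
            w_hat, best = w.copy(), errs
    return w_hat                                   # w_POCKET
```

Note the design difference from plain PLA: the running weights \(\vw_t\) may get worse from one update to the next on non-separable data, so the pocket keeps a separate copy of the best-so-far hypothesis and only that copy is returned.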