# Kernel Tricks

# Kernel Tricks with Logistic Regression and SVMs

## Intro

The motivation behind this notebook is to investigate the utility of kernel
functions with models *other* than support vector machines (SVMs).

## Defined

A *kernel function* is any mapping $\phi: \chi \mapsto \nu$ from a space in the feature space to a higher dimensional space in which classes are linearly separable. In practice, it’s possible (and much easier) to use a kernel function $k(\mathbf{x}, \mathbf{x’}) = \langle \phi(\mathbf{x}), \phi(\mathbf{x’})\rangle_\nu$, in which $\phi$ is no longer explicity defined.

Mercer’s theorem holds under the following condition: where $c_i c_j$ are any positive choices of real numbers.

If the above condition is met, then the mapping $\phi$ is guaranteed to exist.

```
from __future__ import division, print_function
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from IPython.display import HTML
import matplotlib.pyplot as plt, mpld3
import matplotlib.animation as animation
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.mlab as mlab
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
%matplotlib inline
```

## Process

I’m going to create my own data set. Knowing the distribution from which I’m drawing samples will free me of any statistical uncertainty!

```
# Positives are drawn from a random normal centered at 0, 0
means = np.array([0, 0])
cov_m = np.array([[7, 3], [3, 5]])
pos_points = np.random.multivariate_normal(means, cov_m, 500)
def pos_distr(x1_array, x2_array):
"""
The positive points probability density function (PDF)
Inputs:
x_array: Array of shape (n, 2) for n points
Outputs:
array of shape (n, ) corresponding to the evaluated function
"""
factor = np.linalg.det(2 * np.pi * cov_m) ** (-1. / 2)
x_array = np.vstack([x1_array, x2_array])
def compute_point(x_point):
diff = (x_point - means)
return np.exp(-0.5 * diff.T.dot(np.linalg.inv(cov_m)).dot(diff))
return factor * np.apply_along_axis(compute_point, axis=0, arr=x_array)
```

```
# Make a random "ring" distribution with some degree of thickness
def get_cartesian(rs, thetas):
"""Given polar coordinates, returns equivalent cartesian coords"""
result = np.zeros((rs.size, 2))
result[:, 0] = np.cos(thetas) * rs
result[:, 1] = np.sin(thetas) * rs
return result
def get_polar(xs, ys):
"""Given cartesian, returns polar"""
rs = np.sqrt(xs**2 + ys**2)
return np.arccos(xs/rs) * rs, np.arcsin(ys/rs) * rs
size = 500
loc = 13.
scale = 3.
rs = np.random.normal(loc=loc, scale=scale, size=size)
thetas = np.random.uniform(low=0.0, high=(2. * np.pi), size=size)
neg_points = get_cartesian(rs, thetas)
```

```
def neg_distr(x1_array, x2_array):
r_array = np.sqrt(x1_array ** 2 + x2_array ** 2)
factor = ((16 * np.pi * scale) ** 2) ** (-1. / 2)
diff = (r_array - loc)
return factor * np.exp(-(diff ** 2) / (2 * (scale ** 2)))
x_array = (np.array([0, 0]), np.array([0, 13]))
neg_distr(*x_array) # Returns: array([ 5.54710331e-07, 6.63145596e-03])
```

```
fig = plt.figure(figsize=(10, 10))
ax = fig.gca(projection='3d')
prec = 300
X1 = np.linspace(-20, 20, prec)
X2 = X1
X1g, X2g = np.meshgrid(X1, X2)
Z_neg = neg_distr(X1g, X2g)
Z_pos = mlab.bivariate_normal(X1g, X2g, sigmax=7, sigmay=5, sigmaxy=3)
def init():
ax.plot_surface(
X1g, X2g, Z_neg, cmap=cm.coolwarm,
linewidth=0, antialiased=False, alpha=0.3
);
ax.plot_surface(
X1g, X2g, Z_pos, cmap=cm.copper,
linewidth=0, antialiased=False
);
return fig,
def animate(i):
ax.view_init(elev=i, azim=23+i)
return fig,
anim = animation.FuncAnimation(
fig, animate, init_func=init,
frames=40, interval=20, blit=True
);
HTML(anim.to_html5_video())
```

# Bayesian Decision Boundary

## Finding the boundary mathematically

The basic procedure is to just equalize the two distributions.

Distribution 1:

Substituting $\mathbf{\mu}$ and $\mathbf{\mathrm{\Sigma}}$ and taking advantage of symmetries allows us to simplify things in polar coordinates

Distribution 2:

Setting these equations equal allows us to find where the probability densities intersect, which is the *actual* bayesian decision boundary.

This “simplifies” to:

which allows us to get $r \sim f(\theta)$ by solving a quadratic equation:

…

This is an algebra exercise left for the reader to complete. 😉

The solution is an ellipse when projected onto the x, y plane. That’s what’s important.

## Looking at our class data in $\mathbb{R}^2$

```
fig = plt.figure(figsize=(8, 8))
plt.plot(neg_points[:,0], neg_points[:,1], 'r.', alpha=0.3, label='Neg');
plt.plot(pos_points[:,0], pos_points[:,1], 'b+', alpha=0.3, label='Pos');
plt.legend();
```

## Scatter Plot

```
# Make it look like we were handed the data
data = np.vstack((
np.hstack((neg_points, np.zeros((neg_points.shape[0], 1)))),
np.hstack((pos_points, np.ones((pos_points.shape[0], 1))))
))
np.random.shuffle(data)
```

```
data[:10] # Print out the head
```

```
array([[ 12.27223441, 8.53757204, 0. ],
[ 0.67737101, -14.12379135, 0. ],
[ 1.18904235, 10.9653481 , 0. ],
[ 3.61474628, -1.55084327, 1. ],
[ 4.11860015, 6.33836676, 1. ],
[ -4.26303044, -15.27184675, 0. ],
[ 0.79370745, -2.89958874, 1. ],
[-12.70257559, -3.2291174 , 0. ],
[ 6.11823128, 0.99146382, 1. ],
[ 1.74475667, 1.44796774, 1. ]])
```

# Building an actual $\phi$

Since Sklearn’s `LogisticRegression`

model doesn’t include any option
for using a kernel function, we’re going to implement our own and use
the *exact* same process with the SVC to see how they perform and if there
are any pros and cons to each approach.

In this case, it’s pretty obvious how to linearly separate the data. Add quadratic terms!

```
X = data[:, :2]
# Adding quadratic term
quad_term = X[:, 0] ** 2 + X[:, 1] ** 2
X = np.hstack((X, quad_term.reshape(quad_term.size, 1)))
y = data[:, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

# 3-D visualization with new quadratic feature

```
fig = plt.figure(figsize=(8,8))
ax = Axes3D(fig)
pos_filter = y.astype(bool)
def init():
ax.scatter(
X[pos_filter,0], X[pos_filter,1], X[pos_filter,2],
c='b', alpha=0.4, marker='+'
);
ax.scatter(
X[~pos_filter,0], X[~pos_filter,1], X[~pos_filter,2],
c='r', alpha=0.4, marker='.'
);
return fig,
def animate(i):
ax.view_init(elev=23., azim=i)
return fig,
anim = animation.FuncAnimation(
fig, animate, init_func=init,
frames=360, interval=20, blit=True
)
HTML(anim.to_html5_video())
```

## Linearly seperable

It’s now clear that the two classes are somewhat linearly separable.

# Model 1: SVM Classifier

We’ll determine the optimal hyperparameter (C, which controls the penalty for misclassification) through 5-fold cross validation on the training data. We’ll test only a linear kernel because I already added a quadratic term earlier.

**Note**: I’m also setting `probability`

to true so I can compare with logistic regression later

```
C_vals = np.logspace(-2,5,num=20)
params = [{'svc__C': C_vals}]
svc = SVC(kernel='linear', probability=True)
# in this case, scaling isn't really necessary, but I'll do it it anyway
pipeline = Pipeline([
('scale', StandardScaler()),
('svc', svc)])
g_svc = GridSearchCV(pipeline, param_grid=params, cv=5).fit(X_train, y_train)
print("Chosen params:", g_svc.best_params_)
print("Train CV accuracy:", g_svc.best_score_)
```

```
Chosen params: {'svc__C': 0.69519279617756058}
Train CV accuracy: 0.982666666667
```

```
svc_model = g_svc.best_estimator_
print("Test accuracy: {:.2f}".format(svc_model.score(X_test, y_test)))
```

```
Test accuracy: 0.98
```

# Model 2: Kernel Logistic Regression

I’m going to tune the $\beta$ parameters for the logistic regression on the training data through cross validation, and test on the test set using a threshold of 50%.

```
lrcv = LogisticRegressionCV(cv=5, solver='liblinear')
lrcv.fit(X_train, y_train)
```

```
LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
fit_intercept=True, intercept_scaling=1.0, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
refit=True, scoring=None, solver='liblinear', tol=0.0001,
verbose=0)
```

```
print("Test accuracy: {:.2f}".format(lrcv.score(X_test, y_test)))
```

```
Test accuracy: 0.98
```

# Probability prediction comparison

What is the root mean squared difference in predicted probabilities?

*1* indicates total disagreement *0* indicates complete agreement.

```
rmsd = np.sqrt(
np.mean(
(lrcv.predict_proba(X_test)[:,0] - svc_model.predict_proba(X_test)[:,0]) ** 2
)
)
```

```
print('Root mean squared difference in probabilities:', rmsd)
```

```
Root mean squared difference in probabilities: 0.107123116443
```

This means that there’s about an average of a 10% difference in predicted probabilities for any one point.

# Which parts of the probability distributions are in disagreement?

```
fig = plt.figure(figsize=(8,8))
ax = Axes3D(fig)
pos_test_filter = y_test.astype(bool)
y_lr_p = lrcv.predict_proba(X_test)[:,0]
y_svc_p = svc_model.predict_proba(X_test)[:,0]
abs_disagree = y_lr_p - y_svc_p
def init():
ax.scatter(
X_test[pos_test_filter,0], X_test[pos_test_filter,1],
abs_disagree[pos_test_filter], c='b'
);
ax.scatter(
X_test[~pos_test_filter,0], X_test[~pos_test_filter,1],
abs_disagree[~pos_test_filter], c='r'
);
return fig,
def animate(i):
ax.view_init(elev=23., azim=i)
return fig,
anim = animation.FuncAnimation(
fig, animate, init_func=init,
frames=360, interval=20, blit=True
)
HTML(anim.to_html5_video())
```

# Conclusion

This demonstrates that the kernel logistic regression has extreme utility. It was able to more accurately capture the actual underlying probability density functions. The central core has a very high probability of belonging to the positive class which should decrease linearly moving outward.

Still, as classifiers in the real world, they performed about the same. In instances where you’d rather have accurate predictions of probabilities, which can allow you to adjust the threshold to your own needs, the logistic regression together with a kernel function is a great option that is probably underused.