Kernel Tricks with Logistic Regression and SVMs

Intro

The motivation behind this notebook is to investigate the utility of kernel functions with models other than support vector machines (SVMs).

Defined

A feature map is any mapping $\phi: \mathcal{X} \to \mathcal{V}$ from the original feature space to a higher-dimensional space in which the classes become linearly separable. In practice, it’s possible (and much easier) to work with a kernel function $k(\mathbf{x}, \mathbf{x'}) = \langle \phi(\mathbf{x}), \phi(\mathbf{x'})\rangle_\mathcal{V}$, so that $\phi$ never has to be explicitly defined.

Mercer’s theorem holds under the following condition:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i\, c_j\, k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$

for every finite set of points $\mathbf{x}_1, \dots, \mathbf{x}_n$ and any choice of real numbers $c_1, \dots, c_n$ (i.e., $k$ must be positive semi-definite).

If the above condition is met, then the mapping $\phi$ is guaranteed to exist.
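
To make this concrete, here’s a quick numerical sketch using the quadratic kernel $k(\mathbf{x}, \mathbf{x'}) = (\mathbf{x}^T\mathbf{x'})^2$ on $\mathbb{R}^2$: its values match the inner product of an explicit feature map $\phi$, and any Gram matrix built from it is positive semi-definite, as Mercer’s condition requires.

import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def quad_kernel(x, z):
    """The same inner product, computed without ever forming phi."""
    return x.dot(z) ** 2

rng = np.random.RandomState(0)
x, z = rng.randn(2), rng.randn(2)

# k(x, z) equals <phi(x), phi(z)>
print(np.isclose(quad_kernel(x, z), phi(x).dot(phi(z))))  # True

# Mercer / positive semi-definiteness: the Gram matrix of any point set
# has no (meaningfully) negative eigenvalues
pts = rng.randn(50, 2)
gram = pts.dot(pts.T) ** 2
print(np.linalg.eigvalsh(gram).min() >= -1e-8)  # True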

from __future__ import division, print_function

import numpy as np

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from IPython.display import HTML

import matplotlib.pyplot as plt, mpld3
import matplotlib.animation as animation
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.mlab as mlab
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

%matplotlib inline

Process

I’m going to create my own data set. Knowing the distribution from which I’m drawing samples will free me of any statistical uncertainty!

# Positives are drawn from a random normal centered at 0, 0

means = np.array([0, 0])
cov_m = np.array([[7, 3], [3, 5]])

pos_points = np.random.multivariate_normal(means, cov_m, 500)

def pos_distr(x1_array, x2_array):
    """
    The positive points probability density function (PDF)

    Inputs:
        x_array: Array of shape (n, 2) for n points
    Outputs:
        array of shape (n, ) corresponding to the evaluated function
    """
    factor = np.linalg.det(2 * np.pi * cov_m) ** (-1. / 2)
    x_array = np.vstack([x1_array, x2_array])
    def compute_point(x_point):
        diff = (x_point - means)
        return np.exp(-0.5 * diff.T.dot(np.linalg.inv(cov_m)).dot(diff))
    return factor * np.apply_along_axis(compute_point, axis=0, arr=x_array)
# Make a random "ring" distribution with some degree of thickness

def get_cartesian(rs, thetas):
    """Given polar coordinates, returns equivalent cartesian coords"""
    result = np.zeros((rs.size, 2))
    result[:, 0] = np.cos(thetas) * rs
    result[:, 1] = np.sin(thetas) * rs
    return result

def get_polar(xs, ys):
    """Given cartesian coords, returns equivalent polar (r, theta) coords"""
    rs = np.sqrt(xs**2 + ys**2)
    return rs, np.arctan2(ys, xs)

size = 500
loc = 13.
scale = 3.

rs = np.random.normal(loc=loc, scale=scale, size=size)
thetas = np.random.uniform(low=0.0, high=(2. * np.pi), size=size)

neg_points = get_cartesian(rs, thetas)
def neg_distr(x1_array, x2_array):
    """Approximate PDF of the negative 'ring' class: Gaussian in the
    radius, centered at loc with standard deviation scale."""
    r_array = np.sqrt(x1_array ** 2 + x2_array ** 2)
    factor = ((16 * np.pi * scale) ** 2) ** (-1. / 2)
    diff = (r_array - loc)
    return factor * np.exp(-(diff ** 2) / (2 * (scale ** 2)))

x_array = (np.array([0, 0]), np.array([0, 13]))
neg_distr(*x_array)  # Returns: array([  5.54710331e-07,   6.63145596e-03])
fig = plt.figure(figsize=(10, 10))
ax = fig.gca(projection='3d')

prec = 300

X1 = np.linspace(-20, 20, prec)
X2 = X1
X1g, X2g = np.meshgrid(X1, X2)

Z_neg = neg_distr(X1g, X2g)
# bivariate_normal takes standard deviations (and sigmaxy as the covariance)
Z_pos = mlab.bivariate_normal(
    X1g, X2g, sigmax=np.sqrt(7), sigmay=np.sqrt(5), sigmaxy=3
)

def init():
    ax.plot_surface(
        X1g, X2g, Z_neg, cmap=cm.coolwarm,
        linewidth=0, antialiased=False, alpha=0.3
    );
    ax.plot_surface(
        X1g, X2g, Z_pos, cmap=cm.copper,
        linewidth=0, antialiased=False
    );
    return fig,

def animate(i):
    ax.view_init(elev=i, azim=23+i)
    return fig,

anim = animation.FuncAnimation(
    fig, animate, init_func=init,
    frames=40, interval=20, blit=True
);

HTML(anim.to_html5_video())

Bayesian Decision Boundary

Finding the boundary mathematically

The basic procedure is just to set the two class-conditional densities equal to each other.

Distribution 1 (the positive class, a bivariate normal):

$$p_+(\mathbf{x}) = \frac{1}{2\pi\sqrt{|\mathbf{\Sigma}|}}\, \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

Substituting $\boldsymbol{\mu} = (0, 0)^T$ and $\mathbf{\Sigma} = \begin{pmatrix}7 & 3\\ 3 & 5\end{pmatrix}$ and taking advantage of symmetry lets us simplify things in polar coordinates ($x_1 = r\cos\theta$, $x_2 = r\sin\theta$):

$$p_+(r, \theta) = \frac{1}{2\pi\sqrt{26}}\, \exp\!\left(-\frac{r^2\left(5\cos^2\theta - 6\cos\theta\sin\theta + 7\sin^2\theta\right)}{52}\right)$$

Distribution 2 (the negative class, the ring, with $r = \sqrt{x_1^2 + x_2^2}$ and the normalization used in neg_distr above):

$$p_-(r) = \frac{1}{16\pi\sigma}\, \exp\!\left(-\frac{(r - r_0)^2}{2\sigma^2}\right), \qquad r_0 = 13,\ \sigma = 3$$

Setting these two densities equal lets us find where they intersect; since the classes have equal priors (500 points each), that locus is the actual Bayes decision boundary.

This “simplifies” to:

$$\left(\frac{1}{2\sigma^2} - \frac{5\cos^2\theta - 6\cos\theta\sin\theta + 7\sin^2\theta}{52}\right) r^2 \;-\; \frac{r_0}{\sigma^2}\, r \;+\; \frac{r_0^2}{2\sigma^2} + \ln\frac{16\pi\sigma}{2\pi\sqrt{26}} \;=\; 0$$

which lets us get $r$ as a function of $\theta$ by solving that quadratic, $a(\theta)\, r^2 + b\, r + c = 0$, for each $\theta$:

$$r(\theta) = \frac{-b \pm \sqrt{b^2 - 4\,a(\theta)\,c}}{2\,a(\theta)},$$

keeping the positive root that lies between the Gaussian core and the ring.

This is an algebra exercise left for the reader to complete. 😉

Projected onto the $x$, $y$ plane, the solution traces out a closed, roughly elliptical curve between the Gaussian core and the ring. That’s what’s important.
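
As a sanity check, here’s a minimal numerical sketch of that boundary: for each angle $\theta$ it builds the quadratic’s coefficients from the same constants used to generate the data (with the ring’s normalization matching neg_distr above) and keeps the positive root lying between the Gaussian core and the ring.

import numpy as np
import matplotlib.pyplot as plt

# Constants matching the data-generating process
cov = np.array([[7., 3.], [3., 5.]])
inv_cov = np.linalg.inv(cov)
r0, sigma = 13., 3.

# Normalizing constants of the two densities
norm_pos = 1. / (2. * np.pi * np.sqrt(np.linalg.det(cov)))
norm_neg = 1. / (16. * np.pi * sigma)

def boundary_radius(theta):
    """Solve a(theta) * r^2 + b * r + c = 0 for the boundary radius at angle theta."""
    u = np.array([np.cos(theta), np.sin(theta)])
    a = 1. / (2. * sigma ** 2) - 0.5 * u.dot(inv_cov).dot(u)
    b = -r0 / sigma ** 2
    c = r0 ** 2 / (2. * sigma ** 2) + np.log(norm_pos / norm_neg)
    roots = np.roots([a, b, c])
    real = roots[np.isreal(roots)].real
    # Smallest positive root = the crossing between the core and the ring
    return real[real > 0].min()

theta_grid = np.linspace(0., 2. * np.pi, 200)
r_bound = np.array([boundary_radius(t) for t in theta_grid])

plt.figure(figsize=(6, 6))
plt.plot(r_bound * np.cos(theta_grid), r_bound * np.sin(theta_grid))
plt.title('Numerical Bayes decision boundary');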

Looking at our class data in $\mathbb{R}^2$

fig = plt.figure(figsize=(8, 8))
plt.plot(neg_points[:,0], neg_points[:,1], 'r.', alpha=0.3, label='Neg');
plt.plot(pos_points[:,0], pos_points[:,1], 'b+', alpha=0.3, label='Pos');
plt.legend();

Scatter Plot

# Make it look like we were handed the data

data = np.vstack((
    np.hstack((neg_points, np.zeros((neg_points.shape[0], 1)))),
    np.hstack((pos_points, np.ones((pos_points.shape[0], 1))))
))

np.random.shuffle(data)
data[:10]  # Print out the head
array([[ 12.27223441,   8.53757204,   0.        ],
       [  0.67737101, -14.12379135,   0.        ],
       [  1.18904235,  10.9653481 ,   0.        ],
       [  3.61474628,  -1.55084327,   1.        ],
       [  4.11860015,   6.33836676,   1.        ],
       [ -4.26303044, -15.27184675,   0.        ],
       [  0.79370745,  -2.89958874,   1.        ],
       [-12.70257559,  -3.2291174 ,   0.        ],
       [  6.11823128,   0.99146382,   1.        ],
       [  1.74475667,   1.44796774,   1.        ]])

Building an actual $\phi$

Since Sklearn’s LogisticRegression model doesn’t include any option for using a kernel function, we’re going to build the feature map $\phi$ explicitly, feed the transformed features to both models, and run the exact same process with the SVC to see how they perform and whether there are any pros and cons to each approach.

In this case, it’s pretty obvious how to make the data linearly separable: add a quadratic term, $x_1^2 + x_2^2$!

X = data[:, :2]

# Adding quadratic term
quad_term = X[:, 0] ** 2 + X[:, 1] ** 2
X = np.hstack((X, quad_term.reshape(quad_term.size, 1)))

y = data[:, 2]

X_train, X_test, y_train, y_test = train_test_split(X, y)

3-D visualization with new quadratic feature

fig = plt.figure(figsize=(8,8))
ax = Axes3D(fig)

pos_filter = y.astype(bool)

def init():
    ax.scatter(
        X[pos_filter,0], X[pos_filter,1], X[pos_filter,2],
        c='b', alpha=0.4, marker='+'
    );

    ax.scatter(
        X[~pos_filter,0], X[~pos_filter,1], X[~pos_filter,2],
        c='r', alpha=0.4, marker='.'
    );
    return fig,

def animate(i):
    ax.view_init(elev=23., azim=i)
    return fig,

anim = animation.FuncAnimation(
    fig, animate, init_func=init,
    frames=360, interval=20, blit=True
)

HTML(anim.to_html5_video())

Linearly separable

It’s now clear that, with the quadratic feature added, the two classes are very nearly linearly separable.

Model 1: SVM Classifier

We’ll determine the optimal hyperparameter ($C$, which controls the penalty for misclassification) through 5-fold cross-validation on the training data. We’ll test only a linear kernel, since the quadratic term has already been added to the features.

Note: I’m also setting probability to True so I can compare predicted probabilities with the logistic regression later.

C_vals = np.logspace(-2,5,num=20)

params = [{'svc__C': C_vals}]

svc = SVC(kernel='linear', probability=True)
# in this case, scaling isn't really necessary, but I'll do it anyway
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('svc', svc)])

g_svc = GridSearchCV(pipeline, param_grid=params, cv=5).fit(X_train, y_train)

print("Chosen params:", g_svc.best_params_)
print("Train CV accuracy:", g_svc.best_score_)
Chosen params: {'svc__C': 0.69519279617756058}
Train CV accuracy: 0.982666666667
svc_model = g_svc.best_estimator_

print("Test accuracy: {:.2f}".format(svc_model.score(X_test, y_test)))
Test accuracy: 0.98

Model 2: Kernel Logistic Regression

I’m going to use LogisticRegressionCV to tune the regularization strength $C$ on the training data through cross-validation (the $\beta$ coefficients themselves come from maximum likelihood), and evaluate on the test set using a probability threshold of 50%.

lrcv = LogisticRegressionCV(cv=5, solver='liblinear')

lrcv.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='liblinear', tol=0.0001,
           verbose=0)
print("Test accuracy: {:.2f}".format(lrcv.score(X_test, y_test)))
Test accuracy: 0.98

Probability prediction comparison

What is the root mean squared difference in predicted probabilities?

An RMSD of 1 would indicate total disagreement; 0 indicates complete agreement.

rmsd = np.sqrt(
    np.mean(
        (lrcv.predict_proba(X_test)[:,0] - svc_model.predict_proba(X_test)[:,0]) ** 2
    )
)
print('Root mean squared difference in probabilities:', rmsd)
Root mean squared difference in probabilities: 0.107123116443

In other words, for a typical point the two models’ predicted probabilities differ by roughly 10 percentage points.

Which parts of the probability distributions are in disagreement?

fig = plt.figure(figsize=(8,8))
ax = Axes3D(fig)

pos_test_filter = y_test.astype(bool)
y_lr_p = lrcv.predict_proba(X_test)[:,0]
y_svc_p = svc_model.predict_proba(X_test)[:,0]
prob_diff = y_lr_p - y_svc_p  # signed difference in predicted P(negative): positive where LR is higher


def init():
    ax.scatter(
        X_test[pos_test_filter,0], X_test[pos_test_filter,1],
        prob_diff[pos_test_filter], c='b'
    );

    ax.scatter(
        X_test[~pos_test_filter,0], X_test[~pos_test_filter,1],
        prob_diff[~pos_test_filter], c='r'
    );
    return fig,

def animate(i):
    ax.view_init(elev=23., azim=i)
    return fig,

anim = animation.FuncAnimation(
    fig, animate, init_func=init,
    frames=360, interval=20, blit=True
)

HTML(anim.to_html5_video())

Conclusion

This suggests that kernelized logistic regression has real utility here: it captures the shape of the underlying posterior probabilities more faithfully. The central core has a very high probability of belonging to the positive class, and that probability should fall off smoothly as you move outward toward the ring.

Still, as classifiers in the real world, they performed about the same. In cases where you’d rather have accurate probability estimates, which let you adjust the decision threshold to your own needs, logistic regression together with a kernel function is a great option that is probably underused.
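
For completeness, here’s a minimal sketch of what pairing logistic regression with a kernel can look like when an explicit $\phi$ isn’t obvious: approximate an RBF kernel’s feature map with scikit-learn’s Nystroem transformer and put a plain LogisticRegression on top. It reuses X_train and X_test from above, dropping the hand-built quadratic column; gamma=0.1 and n_components=100 are illustrative choices rather than tuned values.

from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Approximate RBF kernel features, then an ordinary (linear) logistic regression:
# effectively a kernelized logistic regression without writing phi by hand
kernel_lr = Pipeline([
    ('scale', StandardScaler()),
    ('kernel', Nystroem(kernel='rbf', gamma=0.1, n_components=100, random_state=0)),
    ('lr', LogisticRegression()),
])

# Fit on the raw two-dimensional features (no hand-built quadratic term)
kernel_lr.fit(X_train[:, :2], y_train)
print("Test accuracy: {:.2f}".format(kernel_lr.score(X_test[:, :2], y_test)))

The Nystroem step plays the role of $\phi$ here; with enough components it approaches working with the full RBF Gram matrix.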