Nonparametric Kernel Estimation and Regression in Julia

2016/02/22

When it comes to nonparametric density estimation and local regression in Julia, I would strongly recommend the package KernelEstimator, because I developed it.

Compared to KernelDensity from the JuliaStats group, KernelEstimator provides more flexible kernels. In KernelEstimator a kernel is just a function, but in KernelDensity a kernel has to be of type Distribution with a closed-form characteristic function. KernelDensity uses the Fourier transform to reduce the computational complexity, and it is much more efficient. The price to pay, however, is that it can only be used in rather limited situations. It is not wise to equate a kernel with a density function:

  1. A kernel may not be a meaningful density function. For example, the Epanechnikov kernel is not an interesting distribution in its own right, yet to define a Distribution corresponding to it we would still have to define its rand, cdf, logpdf and quantile methods (see the sketch after this list).

  2. A density function may not have a simple closed-form characteristic function.

  3. A kernel need not even be a density. When we estimate a cumulative distribution function, the kernel we use should itself be in CDF form.

  4. A kernel does not have to be nonnegative. Certain kernels that take negative values can also be used to estimate a density, as long as they satisfy some conditions; search for "bias reduced kernel".
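
To make the first point concrete, here is a minimal sketch (not the KernelEstimator API) showing that a kernel only needs to be a plain function: the Epanechnikov kernel and a hand-rolled density estimate, with no Distribution type, rand, cdf or quantile methods anywhere.

# Epanechnikov kernel as a plain function: K(u) = 0.75 * (1 - u^2) on [-1, 1]
epa(u) = abs(u) <= 1 ? 0.75 * (1 - u^2) : 0.0

# Hand-rolled kernel density estimate at a point: (1/(n*h)) * sum of K((x - X_i) / h)
kde_at(x, data, h) = sum([epa((x - xi) / h) for xi in data]) / (length(data) * h)

data = randn(200)
fhat = [kde_at(x, data, 0.5) for x in -3:0.1:3]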

In addition, the Fourier transform approach assumes the kernel stays the same at every data point except for a shift in location. But the shape of some kernels, such as the Beta and Gamma kernels, differs at different data points.
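
As an illustration (a rough sketch of the idea, not the package's internal code), Chen's gamma kernel evaluates a Gamma density whose shape parameter depends on the evaluation point, so the kernel is a different asymmetric bump at every $$x$$ and cannot be written as one fixed kernel shifted along the axis:

using Distributions

# Chen-style gamma kernel estimate at x: average of Gamma(x/b + 1, b) densities
# evaluated at the data points; the kernel's shape changes with x
gamma_kde_at(x, data, b) = mean([pdf(Gamma(x / b + 1, b), xi) for xi in data])

gamma_kde_at(0.1, rand(Chisq(2), 1000), 0.2)   # estimate near the boundary at 0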

The Fourier transform approach also makes prediction at an arbitrary point difficult. It is designed to predict on grid points in an interval, so to predict at an arbitrary $$x$$ it has to interpolate.

The feature of KernelEstimator I am most proud of is that it provides the Beta kernel and Gamma kernel for bounded density estimation.

Why Boundary Matters

Usually the choice of kernel does not matter much. There is an exception, however, when the data are bounded. When the domain of $$x$$ is bounded and the density close to the boundary is large, regular kernel estimation suffers from boundary bias. Think of it this way: to estimate the density near the boundary we need data there, but then the kernel function has part of its mass leaking outside of the boundary. That means we are underestimating the true density. Cutting the kernel off at the boundary, or piling all the leaked mass onto a single point at the boundary, does not help.
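
A quick back-of-the-envelope check of how much mass leaks: take a Gaussian kernel with bandwidth 0.3 centered at a data point at 0.1, just above a boundary at 0 (numbers chosen only for illustration).

using Distributions

h = 0.3                             # bandwidth
leaked = cdf(Normal(0.1, h), 0.0)   # kernel mass falling below the boundary at 0
# about 0.37, so over a third of this kernel's weight lies outside the support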

See the example of $$\chi^2(2)$$ below. The red estimate using the normal kernel is very wiggly and has a large error near 0. The blue estimate using the gamma kernel fits the truth closely. Both estimates obtain their bandwidths via cross validation.

If we manually increase the bandwidth of the normal kernel, the variance becomes much smaller but the bias near the boundary gets larger.

A similar problem also exists in kernel regression.

How to use

The sample code for the example above:

using Distributions, KernelEstimator, RCall

x = rand(Chisq(2), 1000)       # sample bounded below at 0
xs = linspace(0.01, 10, 500)   # evaluation grid
dentrue = pdf(Chisq(2), xs)    # true density on the grid
dengamma = kerneldensity(x, xeval=xs, kernel=gammakernel, lb=0.0)   # gamma kernel with lower bound 0
dennormal = kerneldensity(x, xeval=xs)                              # default gaussian kernel
dennormal2 = kerneldensity(x, xeval=xs, h=.3)                       # gaussian kernel with manually chosen bandwidth

@rput xs dentrue dengamma dennormal dennormal2
rprint("""
png("~/Downloads/gamma_normal.png")
plot(xs, dentrue, type="l", lwd=3)
lines(xs, dengamma, lwd=2, col="blue")
lines(xs, dennormal, lwd=2, lty=2, col="red")
#lines(xs, dennormal2, lwd=2, lty=3, col="yellow")
legend("topright", c("Truth", "Gamma Kernel", "Normal kernel"),
lwd=c(3,2,2,2), lty=c(1,1,2,3), col=c("black", "blue", "red"))
dev.off()
""")

The basic usage is just

kerneldensity(x)

By default this uses the Gaussian kernel with no boundaries and chooses the bandwidth via cross validation. We can specify where to evaluate the density via xeval. The default value of xeval is x itself, because the primary purpose of kernel density estimation is prediction, not plotting.

The kernel choices are gaussiankernel, betakernel, gammakernel and ekernel. ekernel stands for the Epanechnikov kernel, which is the theoretically optimal kernel. betakernel is used when the data are bounded on both sides, while gammakernel is used when the data are bounded on one side.

If the data are bounded, lb and ub set the lower and upper bounds. If both are set to finite values, betakernel is used regardless of the user's kernel choice. If only one of them is finite, gammakernel is used no matter what kernel the user specifies.
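
For example, for data supported on [0, 1] we can set both bounds and betakernel will be picked automatically; the Beta(2, 5) sample below is only for illustration.

using Distributions, KernelEstimator

u = rand(Beta(2, 5), 500)
us = linspace(0.01, 0.99, 99)
denbeta = kerneldensity(u, xeval=us, lb=0.0, ub=1.0)   # both bounds finite, so betakernel is used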

Kernel Regression

Local constant and local linear regression are provided. Usage can be as simple as

y = 2 .* x.^2 + rand(Normal(), length(x))   # noisy quadratic signal on the earlier x
yfit0 = localconstant(x, y, xeval=xs)
yfit1 = locallinear(x, y, xeval=xs)
yfit0 = npr(x, y, xeval=xs, reg=localconstant)   # the same fits via the npr interface
yfit1 = npr(x, y, xeval=xs, reg=locallinear)

gammakernel and betakernel are also provided for kernel regression, since a boundary in $$x$$ affects the prediction of $$y$$.
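
Assuming the regression functions take the same kernel and bound keyword arguments as kerneldensity, a call for data bounded below at 0 would look something like this:

# assuming npr accepts the same kernel / lb keywords as kerneldensity
yfitg = npr(x, y, xeval=xs, reg=locallinear, kernel=gammakernel, lb=0.0)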

In addition, a bootstrap confidence band can be obtained with

cb = bootstrapCB(x, y, xeval=xs)