Statistics using Julia

2016/02/20

Although still in the alpha stage, Julia is already quite usable. It does not have as many packages as R yet, but the basic building blocks, such as distributions and optimization, are ready. This is enough for research use, because most of the time the existing packages do not fit the purpose and we need to reimplement our own methods anyway.

In this post I will introduce some packages and features essential for statistical research.

Distributions

Distributions is an excellent package providing a good number of common distributions. Each distribution is of type Distribution, and there is a common interface including pdf, logpdf, cdf, loglikelihood, entropy and quantile. For example, the following code generates $$1000$$ random numbers from a Gaussian mixture model and calculates the log-likelihood.

~~~ julia
using Distributions

# Three-component Gaussian mixture: (mean, sd) pairs and mixing weights
m = MixtureModel(Normal, [(-2.0, 1.2),
    (0.0, 1.0), (3.0, 2.5)], [0.2, 0.5, 0.3])
x = rand(m, 1000)      # 1000 draws from the mixture
loglikelihood(m, x)    # log-likelihood of the sample under m
~~~
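The same generic functions work across distribution types. Here is a minimal sketch of the common interface on a single normal; the values are only illustrative:

~~~ julia
using Distributions

d = Normal(0.0, 1.0)
pdf(d, 0.5)           # density at 0.5
cdf(d, 0.5)           # distribution function at 0.5
quantile(d, 0.975)    # upper 2.5% quantile, about 1.96
entropy(d)            # differential entropy
~~~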

The density functions in Distributions are imported from another package, StatsFuns, which uses the Rmath library. A problem with Rmath is that it only provides scalar density functions, which can be inefficient when evaluating a vector of values. For example, to obtain the log-likelihood of a Beta distribution on $$10000$$ data points, the Beta density has to be called $$10000$$ times. When I was developing the KernelEstimator package, I realized that writing the kernel function in pure Julia can be more efficient than calling Rmath. In addition, it is possible to speed up exp and log on vectors using the Yeppp package. I hope that one day the density functions in StatsFuns will be rewritten in Julia instead of being imported from Rmath.
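To illustrate the idea, here is a minimal sketch of a vectorized Gaussian log-density in pure Julia; the function normlogpdf_vec is hypothetical, not from any package. The normalizing constant is computed once instead of once per scalar call:

~~~ julia
# Hypothetical helper: Gaussian log-density over a vector in pure Julia,
# hoisting the normalizing constant out of the loop.
function normlogpdf_vec(x::Vector{Float64}, mu::Float64, sd::Float64)
    out = similar(x)
    c = -0.5 * log(2pi) - log(sd)   # computed once for all points
    @inbounds for i in 1:length(x)
        z = (x[i] - mu) / sd
        out[i] = c - 0.5 * z * z
    end
    out
end
~~~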

Plotting

There is no really handy plotting package in Julia right now. Gadfly is ambitious, but its grammar is lengthy, its plotting is slow, and the graphs are unsatisfactory for formal use. PyPlot is flexible, but I am not familiar with its usage. Most of the time I send the data to R via RCall and plot with base graphics or ggplot2. For example:

~~~ julia
using Distributions, RCall

m = MixtureModel(Normal, [(-2.0, 1.2),
    (0.0, 1.0), (3.0, 2.5)], [0.2, 0.5, 0.3])
xs = linspace(-5, 6, 500)
den = pdf(m, xs)
@rput xs den    # copy both vectors into the R session
reval("""
plot(xs, den, type="l")
""")
~~~

RCall is definitely more than just plotting. It starts an R session, sends data from Julia to R via @rput, and fetches results from R via @rget. This allows us to call the large number of available R packages from within Julia.
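As a minimal sketch of the round trip (the variable names are my own), we can fit a linear model in R and pull the coefficients back into Julia:

~~~ julia
using RCall

x = randn(100)
y = 2.0 * x + randn(100)
@rput x y                         # push the data into R
reval("co <- coef(lm(y ~ x))")    # fit a linear model in R
@rget co                          # co is now a Julia vector of length 2
~~~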

Parallel Computing

Julia has its own parallel computing framework. Starting Julia with

$ julia -p 4

will attach 4 worker processes, provided the computer has at least 4 cores. We can then do parallel computing via the pmap function or the @parallel macro. For example, in GaussianMixtureTest I want to find the largest log-likelihood among several candidate $$\tau$$ values:

~~~ julia
using Distributions
import GaussianMixtureTest
@everywhere using GaussianMixtureTest   # load the package on every worker
x = rand(Normal(), 1000)
vtau = [.5, .3, .1]
# Fit with each fixed tau in parallel and keep the largest log-likelihood
# (re[4] is the log-likelihood returned by gmm)
@parallel (max) for i in 1:3
    re = gmm(x, 2, tau = vtau[i], taufixed = true)
    re[4]
end
~~~

Parallel Computing on a Linux Cluster

Parallel computing within a single computer can only use a few processes. But a typical simulation study may have to be repeated thousands of times, and each replication may take several hours; in that case several hundred processes are needed. Julia can combine workers across many nodes. For example, on a PBS system we can start Julia with 160 workers as follows. First request 10 nodes with 16 processes each,

$ qsub -I -l nodes=10:ppn=16

Then start Julia with

$ julia --machinefile=$PBS_NODEFILE

This attaches all requested processes to Julia. The amazing thing is that, to the user, these 160 processes look no different from the 4 local processes started by julia -p 4 on a single computer. In other words, Julia can run in parallel across several machines without MPI, so code that runs locally will run on a Linux server without any change.

Here is an example of repeating a hypothesis test 160 times to examine its asymptotic distribution. The code first defines a function that runs one simulation and then uses pmap to run it on all workers.

~~~ julia
import GaussianMixtureTest, Distributions
@everywhere using GaussianMixtureTest, Distributions

@everywhere function brun(b::Int)
    srand(b)    # seed by replication index for reproducibility
    mu_true = [-2.0858, -1.4879]
    wi_true = [0.0828, 0.9172]
    sigmas_true = [0.6735, 0.2931]
    n = 282
    m = MixtureModel(map((u, v) -> Normal(u, v), mu_true, sigmas_true), wi_true)
    x = rand(m, n)
    T1, P = GaussianMixtureTest.kstest(x, 2)
    T1
end

Tvec = pmap(brun, 1:160)
~~~
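Once Tvec is collected, its empirical quantiles can be compared against the reference distribution. A small sketch; the conversion is there because pmap may return a generic array:

~~~ julia
Tstat = convert(Vector{Float64}, Tvec)   # pmap can return Array{Any}
quantile(Tstat, [0.9, 0.95, 0.99])       # empirical upper quantiles
~~~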

More

There are other useful packages as well, such as the numerical optimization packages Optim and NLopt, the numerical integration package Cubature, the Gaussian quadrature package FastGaussQuadrature, and the nonparametric estimation packages KernelDensity and KernelEstimator.
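For instance, here is a minimal sketch of maximum likelihood estimation with Optim, assuming its optimize and Optim.minimizer interface; the objective and starting values are my own:

~~~ julia
using Optim, Distributions

x = rand(Normal(1.0, 2.0), 500)

# Negative log-likelihood; sd is parametrized on the log scale to stay positive
negloglik(theta) = -loglikelihood(Normal(theta[1], exp(theta[2])), x)

res = optimize(negloglik, [0.0, 0.0])   # Nelder-Mead by default
Optim.minimizer(res)                    # estimates of mu and log(sd)
~~~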

Given all these useful packages and Julia's amazing speed, everyone who wants to develop statistical methods from scratch should give Julia a try.