Although still in alpha stage, Julia is already quite usable. It does not have so many packages as R yet but the basic recipes, such as distributions and optimization, are ready. This is enough for research usage because most of time the existing packages does not fit into the purpose. We need to reimplement our own methods any way.
In this post I will introduce some packages and features essential for statistical research.
Distributions is an excellent package providing quite amount of typical distributions. Distributions are of type
Distribution. There are common interface to
Distribution, such as
quantile… For example the following code generates $$1000$$ random number from Gaussian mixture model and calculate the log-likelihood.
using Distributions m = MixtureModel(Normal, [(-2.0, 1.2), (0.0, 1.0), (3.0, 2.5)], [0.2, 0.5, 0.3]) x = rand(m, 1000) loglikelihood(m, x)
The density functions of
Distributions are imported from another package
StatsFuns which use the
Rmath library. A problem with
Rmath is it only provide scalar density function which can be inefficient when we are evaluating on a vector of values. For example, to obtain the log-likelihood a Beta distribution on $$10000$$ data points the Beta function will be calculated for $$10000$$ times. When I was developing the
KernelEstimator package, I realized writing the kernel function in pure Julia can be more efficient than using
Rmath. In addition, it is possible to optimize the
log in vector case using
Yeppp package. Hope one day the density functions in
StatsFuns be rewritten in julia instead of importing
There is no very handy plotting package in Julia right now. The
Gadfly is ambitious but the fact is its lengthy grammar, slow plotting and unsatisfactory graph for displaying in formal situations.
PyPlot is flexible but I am not familiar with its usage. Most of the time I send the data to R via
RCall and use the
ggplot2. For example
using Distributions, RCall m = MixtureModel(Normal, [(-2.0, 1.2), (0.0, 1.0), (3.0, 2.5)], [0.2, 0.5, 0.3]) xs = linspace(-5,6, 500) den = pdf(m, xs) @rput xs, den rprint(""" plot(xs, den, type="l") """)
RCall is definitely more than just plotting. It can start an R session, sending data from julia to R via
@rput and fetching from R via
@rget. That allow us to call the large amount of available R packages within julia.
Julia has its own parallel computing framework. Starting Julia with
$ julia -p 4
will attach 4 workers if a computer has more than 4 processes. Then we can do parallel computing via the
pmap function or the
@parallel macro. For example, in
GaussianMixtureTest I want to find out the largest log-likelihood among several possible $$\tau$$ values
using Distributions import GaussianMixtureTest @everywhere using GaussianMixtureTest x = rand(Normal(), 1000) vtau = [.5, .3, .1;] @parallel (max) for i in 1:3 re=gmm(x, 2, tau = vtau[i], taufixed=true) re end
Parallel on Linux Cluster
Parallel computing within a single computer can only use a few processes. But a typical simulation study may have to be repeated for thousands times while each simulation may take several hours. In this case several hundred processes are needed. It is possible for julia to combine workers or processes across many nodes. For example in a PBS system, we can start julia with 160 workers in the following way. First request 10 nodes and 16 processes on each,
$ qsub -I -l nodes=10:ppn=16
Then start julia with
$ julia --machinefile=$PBS_NODEFILE
This will attach all requested processes into julia. The amazing thing is the 160 processes appear no difference with the 4 local processes started by
julia -p 4 in a single computer to the user. Or in other words it can run in parallel across several computers without using MPI and thus code running locally will run on linux serve without any change.
Here is an example of repeating a hypothesis test for 160 times and see its asymptotic distribution. The following code first defines a function to do the simulation and uses
pmap to run it on all workers.
~~~ julia import GaussianMixtureTest, Distributions @everywhere using GaussianMixtureTest, Distributions @everywhere function brun(b::Int) srand(b) mu_true = [-2.0858,-1.4879] wi_true = [0.0828,0.9172] sigmas_true = [0.6735,0.2931] n = 282 m = MixtureModel(map((u, v) -> Normal(u, v), mu_true, sigmas_true), wi_true) x = rand(m, n) T1, P = GaussianMixtureTest.kstest(x, 2) T1 end Tvec = pmap(brun, 1:160) ~~~~~
There are some other useful packages, such as numerical optimization packages
NLopt, numerical integration package
Cubature, Gauss Quadrature calculation package
FastGaussQuadrature and Nonparametric estimation
Given all these useful packages and julia’s amazing speed, everyone who want to develop some statistical method from scratch should have a try in julia.