Although still in alpha stage, Julia is already quite usable. It does not have so many packages as R yet but the basic recipes, such as distributions and optimization, are ready. This is enough for research usage because most of time the existing packages does not fit into the purpose. We need to reimplement our own methods any way.

In this post I will introduce some packages and features essential for statistical research.

## Distributions

The `Distributions`

is an excellent package providing quite amount of typical distributions. Distributions are of type `Distribution`

. There are common interface to `Distribution`

, such as `pdf`

, `logpdf`

, `cdf`

, `loglikelihood`

, `entropy`

and `quantile`

… For example the following code generates $$1000$$ random number from Gaussian mixture model and calculate the log-likelihood.

```
using Distributions
m = MixtureModel(Normal, [(-2.0, 1.2),
(0.0, 1.0), (3.0, 2.5)], [0.2, 0.5, 0.3])
x = rand(m, 1000)
loglikelihood(m, x)
```

The density functions of `Distributions`

are imported from another package `StatsFuns`

which use the `Rmath`

library. A problem with `Rmath`

is it only provide scalar density function which can be inefficient when we are evaluating on a vector of values. For example, to obtain the log-likelihood a Beta distribution on $$10000$$ data points the Beta function will be calculated for $$10000$$ times. When I was developing the `KernelEstimator`

package, I realized writing the kernel function in pure Julia can be more efficient than using `Rmath`

. In addition, it is possible to optimize the `exp`

and `log`

in vector case using `Yeppp`

package. Hope one day the density functions in `StatsFuns`

be rewritten in julia instead of importing `Rmath`

.

## Plotting

There is no very handy plotting package in Julia right now. The `Gadfly`

is ambitious but the fact is its lengthy grammar, slow plotting and unsatisfactory graph for displaying in formal situations. `PyPlot`

is flexible but I am not familiar with its usage. Most of the time I send the data to R via `RCall`

and use the `plot`

from `base`

or `ggplot2`

. For example

```
using Distributions, RCall
m = MixtureModel(Normal, [(-2.0, 1.2),
(0.0, 1.0), (3.0, 2.5)], [0.2, 0.5, 0.3])
xs = linspace(-5,6, 500)
den = pdf(m, xs)
@rput xs, den
rprint("""
plot(xs, den, type="l")
""")
```

`RCall`

is definitely more than just plotting. It can start an R session, sending data from julia to R via `@rput`

and fetching from R via `@rget`

. That allow us to call the large amount of available R packages within julia.

## Parallel Computing

Julia has its own parallel computing framework. Starting Julia with

```
$ julia -p 4
```

will attach 4 workers if a computer has more than 4 processes. Then we can do parallel computing via the `pmap`

function or the `@parallel`

macro. For example, in `GaussianMixtureTest`

I want to find out the largest log-likelihood among several possible $$\tau$$ values

```
using Distributions
import GaussianMixtureTest
@everywhere using GaussianMixtureTest
x = rand(Normal(), 1000)
vtau = [.5, .3, .1;]
@parallel (max) for i in 1:3
re=gmm(x, 2, tau = vtau[i], taufixed=true)
re[4]
end
```

## Parallel on Linux Cluster

Parallel computing within a single computer can only use a few processes. But a typical simulation study may have to be repeated for thousands times while each simulation may take several hours. In this case several hundred processes are needed. It is possible for julia to combine workers or processes across many nodes. For example in a PBS system, we can start julia with 160 workers in the following way. First request 10 nodes and 16 processes on each,

```
$ qsub -I -l nodes=10:ppn=16
```

Then start julia with

```
$ julia --machinefile=$PBS_NODEFILE
```

This will attach all requested processes into julia. The amazing thing is the 160 processes appear no difference with the 4 local processes started by `julia -p 4`

in a single computer to the user. Or in other words it can run in parallel across several computers without using MPI and thus code running locally will run on linux serve without any change.

Here is an example of repeating a hypothesis test for 160 times and see its asymptotic distribution. The following code first defines a function to do the simulation and uses `pmap`

to run it on all workers.

~~~ julia import GaussianMixtureTest, Distributions @everywhere using GaussianMixtureTest, Distributions @everywhere function brun(b::Int) srand(b) mu_true = [-2.0858,-1.4879] wi_true = [0.0828,0.9172] sigmas_true = [0.6735,0.2931] n = 282 m = MixtureModel(map((u, v) -> Normal(u, v), mu_true, sigmas_true), wi_true) x = rand(m, n) T1, P = GaussianMixtureTest.kstest(x, 2) T1 end Tvec = pmap(brun, 1:160) ~~~~~

## More

There are some other useful packages, such as numerical optimization packages `Optim`

or `NLopt`

, numerical integration package `Cubature`

, Gauss Quadrature calculation package `FastGaussQuadrature`

and Nonparametric estimation `KernelDensity`

and `KernelEstimator`

.

Given all these useful packages and julia’s amazing speed, everyone who want to develop some statistical method from scratch should have a try in julia.