Thursday, August 13, 2009

Examining distributions in R

In this post, I want to look at some distributions using R. Let's get 100 samples from the normal distribution with mean = 5:

set.seed(157)
v = rnorm(100,mean=5)
summary(v)


> summary(v)
Min. 1st Qu. Median
2.375 4.292 5.032
Mean 3rd Qu. Max.
5.072 5.865 8.349


The sort function does what it says. We make a plot called a "step" plot:

n = length(v)
s = sort(v)
plot(s,(1:n)/n,
type='s',lwd=2,col='darkred',
ylim=c(0,1))
points(mean(v),0.5,
pch=16,cex=2,col='blue')




You can also use draw a Q-Q plot (quantile-quantile). This plots the quantiles for our distribution (y-axis) against those for a normal distribution (x-axis). A straight line indicates that our distribution is close to normal.

qqnorm(v)




Let's get a larger sample:

v = rnorm(10000,mean=5)
m = mean(v)


> m
[1] 4.988964


s = sd(v)


> s
[1] 1.000791


The quantile function accepts an argument telling it how to make the quantiles:

S = seq(0,1,by=0.001)
x = quantile(v,S)
round(x[25:27],2)
round(x[975:977],2)


> round(x[25:27],2)
2.4% 2.5% 2.6%
3.04 3.05 3.06
> round(x[975:977],2)
97.4% 97.5% 97.6%
6.93 6.95 6.98


As expected, since the mean is 5 and the standard deviation is 1, 97.5 % of the values are < 5 + 1.96.

I showed this trick for plotting the normal distribution before. Now we plot it in red, and then overlay it with the density from the t-distribution (obtained with the function dt) with degrees of freedom (df) = 9,6,3,2,1.

plot(function(x) dnorm(x),
-5:5,max(v),lwd=2,col='red')

color.list=c(
'purple','blue','steelblue',
'darkred','magenta')
L = c(9,6,3,2,1)
for (i in 1:5) {
x = L[i]
plot(function(x) dt(x,df=i),
-5:5,max(v),lwd=2,
col=color.list[i],
add=T) }




The t distribution has fatter tails. If we had plotted it for larger values of df, it would asymptotically approach the normal distribution.

By using a very large sample, we can get a good idea for the cut-offs for 95% and 97.5% from the normal distribution:

S = seq(0,1,by=0.001)
v = rnorm(100000)
x = quantile(v,S)
round(x[950:952],2)
round(x[975:977],2)


> round(x[950:952],2)
94.9% 95.0% 95.1%
1.63 1.64 1.65
> round(x[975:977],2)
97.4% 97.5% 97.6%
1.94 1.96 1.98


And now compare them to the cut-offs for the t-distribution with df=9. We'll use these in the next example.

w = rt(100000,df=8)
y = quantile(w,S)
round(y[950:952],2)
round(y[975:977],2)


> w = rt(100000,df=8)
> y = quantile(w,S)
> round(y[950:952],2)
94.9% 95.0% 95.1%
1.84 1.85 1.86
> round(y[975:977],2)
97.4% 97.5% 97.6%
2.28 2.30 2.33