While SPSS, SAS and Stata are the most widely used statistical analysis programs today, another program is gaining ground in universities and smaller research shops.
R is open source (read: FREE!), lightweight and has features that blow the trads out of the water.
I first heard of it a few years ago and tried using it, but I got bogged down in the syntax. One of the reasons SPSS is so popular is that its drop-down menus make it so easy to use. SAS has drop-downs, too, but the syntax is so easy to write, why bother? I’m not as familiar with Stata, but my impression is that it is more similar to SAS than to SPSS.
R doesn’t have drop-downs; you tell it what you want it to do. The advantage is that it is extremely customizable. I haven’t used SPSS for a while so this may be out of date, but back when I used it, you couldn’t modify your charts and graphs. You could always pick out a graph made in SPSS because of the thick bright red fill (in other words, it was boring). R allows you to define pretty much everything.
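As a rough sketch of what "define pretty much everything" can look like, here is a base-graphics bar chart with the fill, outline, axis range, title and labels all set by hand. The counts and category names are made up purely for illustration.

```r
# Hypothetical counts, made up purely to have something to draw
counts <- c(Agree = 12, Neutral = 7, Disagree = 5)

# barplot() returns the x-position of each bar, handy for adding labels
mids <- barplot(counts,
                col    = c("steelblue", "grey70", "tomato"),  # fill per bar
                border = NA,                                  # no outline
                ylim   = c(0, 15),                            # room for labels
                main   = "Survey responses (made-up data)",
                ylab   = "Number of respondents",
                las    = 1)                                   # horizontal axis text

# Write each count just above its bar
text(x = mids, y = counts, labels = counts, pos = 3)
```

Every argument here is optional, so you can start with a plain barplot(counts) and add one setting at a time.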
I was forced to learn R for a statistics class in grad school. My prof liked to make us do everything the longest (and most difficult) way possible. For one project we had to run the Pearson Chi-square test for independence as a permutation test.
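Permutation tests allow you to test for independence without assuming a reference distribution for the test statistic. For example, the Pearson Chi-square test for independence assumes the test statistic approximately follows a Chi-square distribution, which is a large-sample result. But what if your sample is small, or some expected counts are low, and that approximation is poor? Well, then you do a permutation.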
This basically takes your observed data and rearranges it a bunch of times (like 10,000 times, for example), recomputing the test statistic after each rearrangement. That gives you the distribution of the statistic assuming the Null Hypothesis is true (i.e., your response and explanatory variables are independent). So instead of leaning on an assumed distribution, which can throw off your Type I error rate when it fits poorly, running a permutation allows you to compare the observed statistic to its own distribution under the null.
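To make the shuffling concrete, here is a minimal sketch on made-up data. The vectors gender and response, the counts behind them, the helper x2() and the choice of B = 2000 are all assumptions for illustration, not data from the class project.

```r
# Hypothetical raw data, one entry per person (made-up counts)
set.seed(1)
gender   <- rep(c("F", "M"), times = c(35, 35))
response <- c(rep(c("yes", "no"), times = c(20, 15)),   # the 35 F rows
              rep(c("yes", "no"), times = c(10, 25)))   # the 35 M rows

# Helper: Pearson X^2 for two label vectors, no continuity correction
x2 <- function(a, b) {
  unname(chisq.test(table(a, b), correct = FALSE)$statistic)
}

# Statistic for the pairing we actually observed
observed <- x2(gender, response)

# Under the null the pairing is arbitrary, so shuffle one column
# B times and recompute X^2 for each shuffle
B <- 2000
perm <- replicate(B, x2(gender, sample(response)))

# Permutation p-value: share of shuffles at least as extreme as observed
p.perm <- mean(perm >= observed)
p.perm
```

Shuffling only one of the two columns is enough, because it breaks any link between them while keeping the row and column totals fixed.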
There’s a one-liner in R that will do all of this for you. The short way:
chisq.test(gender.table2, correct=FALSE, simulate.p.value = TRUE, B = 1000)
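The contents of gender.table2 are not shown in this post, so here is a sketch with a hypothetical 2x2 table of made-up counts that lets you run the call and compare it with the ordinary Chi-square p-value. The row and column labels are assumptions too.

```r
# Hypothetical counts (made up); rows = gender, columns = response
gender.table2 <- matrix(c(20, 15,
                          10, 25),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(gender   = c("F", "M"),
                                        response = c("yes", "no")))

# The simulated p-value changes from run to run unless you fix the seed
set.seed(8067)
sim <- chisq.test(gender.table2, correct = FALSE,
                  simulate.p.value = TRUE, B = 1000)

# Same test, but with the p-value taken from the Chi-square approximation
asym <- chisq.test(gender.table2, correct = FALSE)

sim$statistic   # the X^2 statistic is the same in both calls
sim$p.value     # Monte Carlo p-value from B simulated tables
asym$p.value    # p-value from the Chi-square distribution with 1 df
```

With simulate.p.value = TRUE, R draws random tables that have the same row and column totals as the observed one, which is the same idea as shuffling one column of the raw data.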
And here’s the long way:
#Start with an empty two-column matrix (row index, column index)
all.data<- matrix(data=NA, nrow=0, ncol = 2)
#Put data into "raw" form: one row per observation counted in the table
for (i in 1:nrow(gender.table2)) {
  for (j in 1:ncol(gender.table2)) {
    all.data<- rbind(all.data, matrix(data = c(i,j), nrow = gender.table2[i,j], ncol=2, byrow=T))
  }
}
all.data
save<- xtabs(~all.data[,1]+ all.data[,2])
#First do one permutation to illustrate:
set.seed(8067)
all.data.star<- cbind(all.data[,1], sample(all.data[,2], replace = F))
all.data.star
calc.stat<- chisq.test(all.data.star[,1], all.data.star[,2], correct = F)
calc.stat$statistic
save.star<- xtabs(~all.data.star[,1] + all.data.star[,2])
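#Now wrap one permutation in a function so a for loop can repeat it
do.it<- function(data.set) {
  #Shuffle column 2 without replacement, then return the X^2 statistic
  all.data.star<- cbind(data.set[,1], sample(data.set[,2], replace = F))
  chisq.test(all.data.star[,1], all.data.star[,2], correct=F)$statistic
}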
summarize<- function(result.set, statistic, df, B) {
  par(mfrow = c(1,2))
  #Histogram
  hist(x= result.set, main = expression(paste("Histogram of ", X^2, " perm. dist.")))
  segments(x0 = statistic, y0 = -10, x1 = statistic, y1 = 10)
  #QQ Plot
  chi.quant<- qchisq(p= seq(from=1/(B+1), to = 1-1/(B+1), by = 1/(B+1)), df=df)
  plot(x= sort(result.set), y = chi.quant, main = expression(paste("QQ-Plot of ", X^2, " perm. dist.")))
  abline(a = 0, b = 1)
  par(mfrow = c(1,1))
  #p-value
  mean(result.set>= statistic)
}
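#Do.it once as a check, then repeat it B = 1,000 times
do.it(data.set = all.data)
B<- 1000
results<- matrix(data = NA, nrow = B, ncol = 1)
set.seed(8067)
for(i in 1:B) {
  results[i,1]<- do.it(all.data)
}
#Observed X^2 from the original table, to compare against the permutation distribution
x.sq<- chisq.test(gender.table2, correct = FALSE)
summarize(results, x.sq$statistic, (nrow(gender.table2) - 1) * (ncol(gender.table2)-1), B)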
Here’s the easy way again:
chisq.test(gender.table2, correct=FALSE, simulate.p.value = TRUE, B = 1000)