Rev note: Added some (possibly) useful commands and experiments done in class.
Remove the id column from the 2017 data and write the result to a csv file. The rows are also randomly reordered for anonymity. (R code FYR, For Your Reference.)
#t<-matrix(scan("midstat2017m.csv"),ncol=3,byrow=TRUE)
## The first column is id and intentionally left out
#exam17<-data.frame(year=as.factor(t[,2]),mid=t[,3])
#exam17<-exam17[sample(nrow(exam17)),]   # shuffle the rows
## Two different csv outputs
#write.csv(exam17, file = "stat17noid2.csv")
#write.table(exam17, file = "stat17noid.csv",row.names=FALSE, na="",col.names=FALSE, sep=" ")
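The commented steps above can be tried on a small made-up table. This is only a sketch: the ids and scores below are invented for illustration, the output goes to a temporary file rather than stat17noid.csv, and set.seed() is added only so the shuffle is repeatable.

```r
# Toy stand-in for the raw file: columns id, year, mid (values are made up)
raw <- data.frame(id   = c(101, 102, 103, 104, 105),
                  year = c(2, 2, 3, 4, 2),
                  mid  = c(55, 30, 78, 12, 91))

# Drop the id column, keeping year as a factor
noid <- data.frame(year = as.factor(raw$year), mid = raw$mid)

# Randomly reorder the rows so row position no longer hints at identity
set.seed(2017)                      # only to make the example repeatable
noid <- noid[sample(nrow(noid)), ]

# Write without row names (see the note below)
out <- tempfile(fileext = ".csv")   # hypothetical output path
write.csv(noid, file = out, row.names = FALSE)

# Read the file back and confirm that only year and mid remain
back <- read.csv(out)
names(back)
```

A shuffled data frame keeps its original row names, and write.csv() writes row names by default, so row.names = FALSE is what keeps the original order out of the file.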
Import stat 2017 data
library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
#stat17<-fread('stat17noid.csv')
stat17<-fread('http://faculty.ndhu.edu.tw/~chtsao/ftp/stat17noid.csv')
trying URL 'http://faculty.ndhu.edu.tw/~chtsao/ftp/stat17noid.csv'
Content type 'text/plain' length 438 bytes
==================================================
downloaded 438 bytes
colnames(stat17)<-c("year","mid")
stat17$year<-as.factor(stat17$year)
summary(stat17)
 year        mid
 2:47   Min.   :-50.00
 3:10   1st Qu.: 15.00
 4: 5   Median : 30.00
 7: 1   Mean   : 39.75
        3rd Qu.: 58.00
        Max.   :108.00
Data cleansing using subset command
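# A negative midterm score is not valid; show the offending row(s) first
stat17[which(mid < 0)]
# Keep only the rows with a non-negative score
stat17<-subset(stat17,mid >= 0)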
summary(stat17)
 year        mid
 2:47   Min.   :  4.00
 3:10   1st Qu.: 15.00
 4: 5   Median : 30.50
 7: 0   Mean   : 41.19
        3rd Qu.: 58.00
        Max.   :108.00
# You may also view the whole dataframe in the Environment pane ~ View(stat17)
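After the subset step, the summary still lists year 7 with a count of 0, because a factor keeps its levels even when no rows use them. A minimal sketch of the same cleansing in base R, on made-up scores, assuming that dropping the empty level is wanted:

```r
# Toy scores (made up); one impossible negative value, belonging to year 7
d <- data.frame(year = factor(c(2, 2, 3, 4, 7)),
                mid  = c(40, 15, 62, 30, -50))

# Inspect the suspicious rows before removing anything
bad <- d[d$mid < 0, ]
print(bad)

# Keep valid rows; subset() also drops rows where the condition is NA
clean <- subset(d, mid >= 0)
table(clean$year)    # level 7 is still listed, with count 0

# droplevels() removes factor levels that no longer occur
clean <- droplevels(clean)
table(clean$year)
```

Whether to drop the empty level is a judgment call: keeping it records that such a group existed, while dropping it tidies tables and plot legends.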
Import stat 2016 data
stat16 <- fread('http://faculty.ndhu.edu.tw/~chtsao/ftp/stat2016.txt')
trying URL 'http://faculty.ndhu.edu.tw/~chtsao/ftp/stat2016.txt'
Content type 'text/plain' length 558 bytes
==================================================
downloaded 558 bytes
colnames(stat16)<-c("year","mid","final")
stat16$year<-as.factor(stat16$year)
head(stat16) # Take a quick look at the first few cases
summary(stat16)
 year        mid             final
 2:42   Min.   :  0.00   Min.   :-10.00
 3:15   1st Qu.: 25.50   1st Qu.: 13.50
 4: 3   Median : 46.00   Median : 31.00
 5: 3   Mean   : 46.52   Mean   : 34.11
        3rd Qu.: 67.50   3rd Qu.: 53.00
        Max.   :110.00   Max.   :100.00
Now we have two data frames: stat16 (year, mid, final) and stat17 (year, mid).
Where are we now? What do we know? What do we want to know (but unknown now)?
Walk (prog) before you run. Think before you prog.
Some handy functions/commands for exploratory data analysis and data cleansing
suppressMessages(library(dplyr)) # load package dplyr but suppress its messages
stat16.23<-filter(stat16, year == 2 | year == 3)
stat16.2<-filter(stat16,year==2 )
stat16.3<-filter(stat16,year==3)
summary(stat16.23)
 year        mid             final
 2:42   Min.   :  0.00   Min.   :-10.00
 3:15   1st Qu.: 25.00   1st Qu.: 14.00
 4: 0   Median : 44.00   Median : 33.00
 5: 0   Mean   : 46.07   Mean   : 34.63
        3rd Qu.: 70.00   3rd Qu.: 53.00
        Max.   :110.00   Max.   :100.00
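The year filter above can also be written without dplyr. The following is a base R sketch on made-up numbers with the same column names as stat16; using %in% and aggregate() here is a choice for this illustration, not something done in class.

```r
# Toy data (made up) with the same columns as stat16
d <- data.frame(year  = factor(c(2, 2, 3, 3, 4, 5)),
                mid   = c(40, 60, 55, 75, 30, 20),
                final = c(35, 50, 45, 80, 25, 10))

# %in% is a compact alternative to year == 2 | year == 3;
# factor values are compared by their labels, hence the strings
d23 <- d[d$year %in% c("2", "3"), ]

# Mean mid and final score for each year that is present, one row per year
means <- aggregate(cbind(mid, final) ~ year, data = d23, FUN = mean)
print(means)
```

With more than two groups to keep, the %in% form stays one short condition, while chained == comparisons grow with every added year.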
library(ggplot2)
Stackoverflow is a great place to get help:
http://stackoverflow.com/tags/ggplot2.
scatter <- ggplot(data=stat16.23, aes(x = mid, y = final))
scatter + geom_point(aes(color=year, shape=year)) +
xlab("midterm") + ylab("final") +
ggtitle("Midterm vs Final Plot (Stat16.23)")
More
smooth <- ggplot(data=stat16.23, aes(x=mid, y=final, color=year)) +
geom_point(aes(shape=year), size=1.5) + xlab("mid") + ylab("final") +
ggtitle("Scatterplot with smoothers")
# Linear model
smooth + geom_smooth(method="lm")
#Double check with console output
plot(final~mid, data=stat16.2)
m16.2<-lm(final~mid,data=stat16.2); summary(m16.2);
Call:
lm(formula = final ~ mid, data = stat16.2)
Residuals:
    Min      1Q  Median      3Q     Max
-31.051 -14.020  -4.080   7.962  55.367
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.5001     6.2849   1.512    0.139
mid           0.6388     0.1202   5.315 4.32e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.65 on 40 degrees of freedom
Multiple R-squared: 0.4139, Adjusted R-squared: 0.3993
F-statistic: 28.25 on 1 and 40 DF, p-value: 4.317e-06
abline(m16.2)
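The numbers in the regression printout are tied together by a few identities, and checking them by hand is a good way to learn to read the output. A small sketch using only values copied from the summary of m16.2; the midterm score of 50 used for the prediction is a hypothetical input chosen for illustration.

```r
# Numbers copied from the printed summary of m16.2
b0 <- 9.5001;  se0 <- 6.2849     # intercept and its standard error
b1 <- 0.6388;  se1 <- 0.1202     # slope for mid and its standard error
r2 <- 0.4139;  df  <- 40         # R-squared and residual degrees of freedom

# t value = estimate / standard error
t0 <- b0 / se0
t1 <- b1 / se1

# With a single predictor, F equals the squared t value of the slope,
# and also (R^2 / (1 - R^2)) * residual df
f_from_t  <- t1^2
f_from_r2 <- r2 / (1 - r2) * df

# Two-sided p-value for the slope
p1 <- 2 * pt(-abs(t1), df)

# Fitted final score for a hypothetical midterm score of 50
pred50 <- b0 + b1 * 50

round(c(t0 = t0, t1 = t1, F_t = f_from_t, F_r2 = f_from_r2, pred50 = pred50), 3)
p1
```

Small differences from the printout in the last digit are expected, because the inputs here are already rounded.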
stat17.23<-filter(stat17, year == 2 | year == 3)
boxplot(mid~year, data=stat17.23)
stat17.2<-filter(stat17,year==2 )
stat17.3<-filter(stat17,year==3)
summary(stat17.2);summary(stat17.3)
 year        mid
 2:47   Min.   :  4.00
 3: 0   1st Qu.: 14.00
 4: 0   Median : 31.00
 7: 0   Mean   : 39.79
        3rd Qu.: 58.00
        Max.   :108.00
 year        mid
 2: 0   Min.   : 20.00
 3:10   1st Qu.: 23.00
 4: 0   Median : 34.50
 7: 0   Mean   : 48.90
        3rd Qu.: 74.75
        Max.   :105.00
par(mfrow=c(1,2));
hist(stat17.3$mid);hist(stat17.2$mid)
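smid17.2<-sort(stat17.2$mid, decreasing=TRUE)
# empirical probabilities; the 0.5 offset keeps them below 1, since qnorm(1) is Inf
ep<-(rank(smid17.2)-0.5)/length(smid17.2)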
summary(smid17.2);sd(smid17.2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.00   14.00   31.00   39.79   58.00  108.00
[1] 31.68317
head(smid17.2)
[1] 108 105 100 95 95 95
smid17.2[27]
[1] 27
# normal quantiles at the empirical probabilities, using the sample mean and sd printed above
qnorm.ep<-qnorm(ep,39.79,31.68)
smid17.2
[1] 108 105 100 95 95 95 88 80 78 60 60 58 58 55 48 48 45 45 44 44
[21] 44 40 32 31 28 27 27 24 21 20 16 15 15 15 15 13 10 10 10 8
[41] 7 6 6 6 6 5 4
plot(smid17.2~qnorm.ep)
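The last plot is a hand-made normal quantile plot: sorted scores against normal quantiles at matching probabilities. Below is a sketch of the same idea with base R helpers, using the 47 scores copied from the printout of smid17.2. The use of ppoints(), qqnorm() and qqline() is a suggestion for comparison, not the method used in class.

```r
# The 47 year-2 midterm scores, copied from the printout of smid17.2
x <- c(108, 105, 100, 95, 95, 95, 88, 80, 78, 60, 60, 58, 58, 55, 48, 48,
       45, 45, 44, 44, 44, 40, 32, 31, 28, 27, 27, 24, 21, 20, 16, 15, 15,
       15, 15, 13, 10, 10, 10, 8, 7, 6, 6, 6, 6, 5, 4)
n <- length(x)

# ppoints() gives plotting positions strictly inside (0, 1);
# a probability of exactly 1 would map to qnorm(1) = Inf
p <- ppoints(n)

# Theoretical quantiles from a normal with the sample mean and sd,
# paired with the scores sorted in increasing order
q <- qnorm(p, mean = mean(x), sd = sd(x))
plot(q, sort(x), xlab = "normal quantiles", ylab = "sorted mid scores")
abline(0, 1)   # points near this line would suggest approximate normality

# Built-in equivalent on the standard normal scale
qqnorm(x); qqline(x)
```

Because the theoretical quantiles already use the sample mean and sd, the reference line in the first plot is simply y = x. Systematic curvature away from it, for example from the many low scores, points to skewness rather than a normal shape.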