martes, 24 de enero de 2012

Loop for merging several datasets

So, a twitter follower asked me for a favor. He basically had several thousand *.dta files (Stata files), and he only needed two variables contained in each. He didn't want to type the name of each file separately (who would?) and the desired outcome was a dataset containing the two variables, and all of the observations of the files. He had never used Stata. How to do it? I tried doing it in Stata, but building local macros and specifying the return type is cumbersome and boring. So, of course, I switched to R. The result was a 10 line Rfile, pretty simple, and super fast to run. First, you will need to put all of the original files in the same folder. Second, make sure that each file is indexed with a root name (in this example, the rootname is file, yes, I know, I am so original!). Finally, the output will be done in a *.csv file. If you want it in another format, well, good luck!

## BEGIN BY SETTING THE WORKING DIRECTORY...
setwd("/Users/Myname/Desktop/Statahelp/")
library("foreign")
## VARSK ARE THE VARIABLES TO KEEP, IN THIS CASE, X1 AND X2.
varsk <- c("x1","x2")
## BEGIN A LOOP. FOR THE EXAMPLE, I AM ASSUMING THAT THERE ARE 500 FILES
for (i in 1:500){
strd <- paste('file',as.character(i),'.dta', sep='')
strk <- paste('file',as.character(i),'.csv', sep='')
data <- read.dta(strd)
datak <- data[varsk]
write.table(datak, file = strk, sep = ",", append = T)
}
If your files are in a different directory (other than the one specified in wd), then change the string in strd.
I also strongly suggest adding a variable in datak that is equal to [i]. This is solely for purposes of identifying the dataset that that particular observation came from. This is done by simply adding a line:
datak$fileid <- i
Done! Hope this was helpful!