0.1 A simple task
Sounds like a simple task. You have multiple CSV files in a folder, and you want to import all of them into your environment as a single data frame. Here is the challenge: most new R users have only used read.csv() to import single files. I remember the very first time, as an R beginner, that I wanted to import a whole bunch of CSV files: running read.csv() again and again sounded to me like a job for a loop.
0.2 Get comfortable with apply()
I have learnt that the sooner you get comfortable with apply() and all its variants, the sooner you will discover how much easier they make your life in R.
0.2.1 Here is the situation I am dealing with
# bash commands below
echo -n "Total number of files: " && ls ~/Desktop/rawdata | wc -l
## Total number of files: ls: cannot access '/home/rachit/Desktop/rawdata': No such file or directory
## 0
The total number of CSV files I have in my folder is 14. I already know that the files contain exactly the same columns. So all I have to do is import them into my environment one by one and then combine them into a single data frame.
0.2.2 Step 1: import them into the environment
For this I will use the read_csv() function from the readr package.
# get the names of all CSV files in the working directory
filenames <- list.files(pattern = "\\.csv$")
# read the files into a single list
all <- lapply(filenames, function(name) {
  readr::read_csv(name)
})
Now the files are in a list called all. Notice I could have written a for loop to get this done, which is what I used to do when I first started to ramp up my R game and deal with multiple files. But then I started to get adventurous with apply and its siblings: lapply, sapply, tapply and vapply.
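If the CSV files live in a folder other than the working directory, the same lapply pattern still applies as long as the file names carry their full path. The following is a minimal sketch under a few assumptions: the temporary folder, the three small demo files and the names data_dir, paths and tables are all hypothetical, and base read.csv() stands in for readr::read_csv() so the example runs without extra packages.

```r
# Hypothetical setup: write three small CSV files into a temporary folder
data_dir <- file.path(tempdir(), "rawdata_demo")
dir.create(data_dir, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(
    data.frame(id = 1:2, value = k * c(10, 20)),
    file.path(data_dir, paste0("file", k, ".csv")),
    row.names = FALSE
  )
}

# full.names = TRUE returns paths that include the folder,
# so the current working directory does not matter
paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into one list of data frames
# (swap read.csv for readr::read_csv if readr is installed)
tables <- lapply(paths, read.csv)

# Label each list element with the file it came from
names(tables) <- basename(paths)
```

Naming the list elements is optional, but it makes it easier to trace a problem back to a specific file later on.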
Earlier my code would have looked like this:
all <- list()
for (i in filenames) {
  # [[ ]] stores the whole data frame as one list element
  all[[i]] <- readr::read_csv(i)
}
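Now, there is nothing wrong with the above code. It does work, but look at the elegance of the lapply code. It needs fewer lines and returns the finished list directly, with no empty list to set up first. Any speed difference between the two is usually small, because reading the files takes up most of the run time.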
0.2.3 Step 2: convert all list items into a single dataframe
Now we will see the power of rbind.fill(), which can be found in the plyr package. It is one line of code that will collapse the entire list into a single data frame.
df <- plyr::rbind.fill(all)
There you go! Job done. The more comfortable you get with the tidyverse (the umbrella package for the tidy R philosophy), the easier your R experience is going to be.
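When every file has exactly the same columns, as is the case here, base R can do the same job without plyr. The sketch below is only an illustration: the list tables is a hypothetical stand-in for the list of imported files, and do.call(rbind, ...) is a swapped-in base R technique, not the method used above. Unlike rbind.fill(), which fills missing columns with NA, rbind() expects the column names to match.

```r
# Hypothetical stand-in for the list produced by lapply()
tables <- list(
  data.frame(id = 1:2, value = c(10, 20)),
  data.frame(id = 3:4, value = c(30, 40)),
  data.frame(id = 5:6, value = c(50, 60))
)

# do.call() passes every list element to rbind() as a separate argument,
# so this is equivalent to rbind(tables[[1]], tables[[2]], tables[[3]])
combined <- do.call(rbind, tables)

# Number of rows in the combined data frame
nrow(combined)
```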
Let’s take a look to see if everything is as expected. We can do a simple check to compare the total number of observations in the list with the total number of observations in the dataframe.
# total number of rows in the list
sum(sapply(all, nrow))
## [1] 33306
# total number of rows in the final data frame
nrow(df)
## [1] 33306
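The same comparison can be turned into an automatic check, so that a script stops as soon as the counts disagree. This is a minimal sketch with hypothetical stand-in data (tables and combined are illustrative names, not objects from the steps above). It uses vapply, one of the apply siblings, which works like sapply but insists that every result is a single integer.

```r
# Hypothetical stand-ins for the imported list and the combined data frame
tables <- list(
  data.frame(id = 1:3, value = c(1.5, 2.5, 3.5)),
  data.frame(id = 4:5, value = c(4.5, 5.5))
)
combined <- do.call(rbind, tables)

# vapply() returns an integer vector with one row count per list element
rows_per_file <- vapply(tables, nrow, integer(1))

# stopifnot() raises an error if the totals do not match
stopifnot(sum(rows_per_file) == nrow(combined))
```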
There you go!