Importing Data into R
January 21, 2018
howto notes R tutorial- Packages
- CSV Files
- Tab-delimited files
- Excel Files
- Load a specific worksheet
- Other Table files
- Examples
- Select the hot dog with the least calories: Lily
- Select the observation with the most sodium: Tom
- Print lily and tom
Packages
CSV files
read.csvread_csvfrom the tidyversereadrpackagefreadfrom thedata.tablepackage
Tab-separated files
read.delimread_tsvfor tab-delimited files; from thereadrpackagefreadfrom thedata.tablepackage
Excel files
readxlpackage for .xls and .xlsx filesread_excel: import sheet into Rexcel_sheets: lists the different sheets in an Excel file
Other files
read.tableto read any tabular dataread_delimfrom thereadrpackage; is thereadrequivalent ofread.tablefreadfrom thedata.tablepackage; is thedata.tableequiv ofread.tableread_tablefromreadrif separated by whitespace
The
readrpackages are faster than the base packages, butfread(fromdata.table) is even faster.
CSV Files
read.csv
The read.csv function is part of the utils package with loads with base R, so you won’t need to install anything to get going.
chi_books <- read.csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv",
stringsAsFactors = FALSE,
col.names = c("Author", "Record", "Year", "Rank", "Title")
)
str(chi_books)
## 'data.frame': 25 obs. of 5 variables:
## $ Author: chr "Tartt, Donna." "Flynn, Gillian, 1971-" "Grisham, John." "Connelly, Michael, 1956-" ...
## $ Record: chr "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316055437" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780307588364" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780385537131" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316069519" ...
## $ Year : int 2013 2012 2013 2013 2014 2013 2013 2013 2013 2013 ...
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : chr "The goldfinch" "Gone girl: a novel" "Sycamore row" "The gods of guilt: a novel" ...
By default, the stringsAsFactors argument is set to true, so you will have to use the stringsAsFactors = FALSE argument if the variable is not a factor/categorical variable.
Tip: use the colClasses argument, e.g. colClasses = c("character", "NULL", "numeric", "factor") to specifically assign the type for each variable. In this case, you don’t really need to use the stringAsFactors argument. Of note, using "NULL" means that column will be skipped and not loaded into the dataframe.
read_csv
library(readr)
chi_booksR <- read_csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv")
head(chi_booksR)
## # A tibble: 6 x 5
## author
## <chr>
## 1 Tartt, Donna.
## 2 Flynn, Gillian, 1971-
## 3 Grisham, John.
## 4 Connelly, Michael, 1956-
## 5 Kidd, Sue Monk.
## 6 Evanovich, Janet.
## # ... with 4 more variables: catalog_record <chr>, pub_year <int>,
## # rank <int>, title <chr>
Set column types in readr with col_types
Using the base utils package we can do this using colClasses as described above. In the readr package, this can be done easily using the col_types argument. You can do this using shorthand notation:
- c = character
- d = double
- i = integer
- l = logical
- _ (underscore) is used to skip a column
Alternatively, you can use collector arguments, such as col_factor, col_integer, col_double(), col_character(), etc. You then need to pass these using col_types = list(col_character(), col_integer(), col_factor(levels = c("Small", "Medium", "Large"))).
chi_booksR <- read_csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv",
col_types = "cciic")
head(chi_booksR)
## # A tibble: 6 x 5
## author
## <chr>
## 1 Tartt, Donna.
## 2 Flynn, Gillian, 1971-
## 3 Grisham, John.
## 4 Connelly, Michael, 1956-
## 5 Kidd, Sue Monk.
## 6 Evanovich, Janet.
## # ... with 4 more variables: catalog_record <chr>, pub_year <int>,
## # rank <int>, title <chr>
Tab-delimited files
read.delim
tab <- read.delim("https://gist.githubusercontent.com/Stevoisiak/d1b8c7b15256e86dea5baaefcef428bd/raw/5c8896389279bb5383b4c69c65ed12524c10c3d1/system_list.txt", stringsAsFactors = FALSE)
str(tab)
## 'data.frame': 171 obs. of 3 variables:
## $ value : chr "32X" "3DO" "AcornA" "AcornE" ...
## $ name : chr "32X" "3DO" "Acorn Archimedes" "Acorn Electron" ...
## $ display: chr "yes" "" "" "" ...
read_tsv
tabR <- read_tsv("https://gist.githubusercontent.com/Stevoisiak/d1b8c7b15256e86dea5baaefcef428bd/raw/5c8896389279bb5383b4c69c65ed12524c10c3d1/system_list.txt",
col_names = c("Code", "Name", "Display")
)
## Parsed with column specification:
## cols(
## Code = col_character(),
## Name = col_character(),
## Display = col_character()
## )
head(tabR)
## # A tibble: 6 x 3
## Code Name Display
## <chr> <chr> <chr>
## 1 value name display
## 2 32X 32X yes
## 3 3DO 3DO <NA>
## 4 AcornA Acorn Archimedes <NA>
## 5 AcornE Acorn Electron <NA>
## 6 AMG Amiga <NA>
Excel Files
List the available sheets
library(readxl)
excel_sheets("file.xlsx") # gives a list of the different sheets in the file
Import sheets into R
library(readxl)
tumor <- read_excel("tumor.xlsx") # by default, loads 1st sheet
# Load a specific worksheet
tumor3 <- read_excel("tumor.xlsx", sheet = 3)
tumor_cKIT <- read_excel("tumor.xlsx", sheet = "cKIT")
Import Multiple sheets w/ lapply
data_list <- lapply(excel_sheets("data.xlsx"),
read_excel,
path = "data.xlsx")
TODO: put up the code to do this with purrr also
Other readxl arguments
sheetcol_names- If first row contains column names,
col_names = TRUE(the default setting) - Set column names using a name vector, e.g.
col_names = c("name", "age", "grade")
- If first row contains column names,
col_types- Set column types using a vector, e.g.
col_types = c("text", "numeric", "date", "blank") - Setting a column type as
blankjust ignores that column
- Set column types using a vector, e.g.
skip- specify number of rows you want to skip
Other Table files
Let’s use an example where instead of a comma, tab, or other traditional separation, we find ourselves dealing with a table that happens to be separated by question marks, i.e. ?
read.table
table <- read.table("file.txt",
header = TRUE,
sep = "?"
stringsAsFactors = FALSE,
col.names = c("name", "age", "grade"))
read.table has a lot of different possible arguments, so look up the documentation if you are dealing with a wacky format.
read_delim from readr
tableR <- read_delim("file.txt",
delim = "?",
col_names = c("name", "age", "grade")
)
The equivalent of the sep argument (from the base utils package) for readr is delim as is illustrated in the above examples.
fread from data.table
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2
chi_fread <- fread("https://data.cityofchicago.org/resource/v3hq-i8r8.csv")
head(chi_fread)
## author
## 1: Tartt, Donna.
## 2: Flynn, Gillian, 1971-
## 3: Grisham, John.
## 4: Connelly, Michael, 1956-
## 5: Kidd, Sue Monk.
## 6: Evanovich, Janet.
## catalog_record
## 1: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316055437
## 2: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780307588364
## 3: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780385537131
## 4: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316069519
## 5: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780670024780
## 6: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=0345542886
## pub_year rank title
## 1: 2013 1 The goldfinch
## 2: 2012 2 Gone girl: a novel
## 3: 2013 3 Sycamore row
## 4: 2013 4 The gods of guilt: a novel
## 5: 2014 5 The invention of wings
## 6: 2013 6 Takedown twenty
The upsides of fread over read.table is that it is much faster (as are most things with data.table) and it just kinda knows how to handle the data without specifying things so much.
A few (of many) fread arguments:
drop = 2:4to exclude columns 2 to 4drop = c("b", "c", "d")to exclude columns \(b\), \(c\), and \(d\)select = c(1, 5)to select columns 1 and 5select = c("a", "e")to select columns \(a\) and \(e\)
Examples
Lily wants the hotdog with the least calories and Tom wants the hotdog with the most calories 1 -
hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))
# Select the hot dog with the least calories: Lily
lily <- hotdogs[which.min(hotdogs$calories), ]
# Select the observation with the most sodium: Tom
tom <- hotdogs[which.max(hotdogs$sodium), ]
# Print lily and tom
lily
tom
DataCamp example↩