Importing Data into R

January 21, 2018

howto notes R tutorial

Packages
CSV Files
- read.csv
- read_csv
  - Set column types in readr with col_types
Tab-delimited files
- read.delim
- read_tsv
Excel Files
- List the available sheets
- Import sheets into R
Load a specific worksheet
- Import Multiple sheets w/ lapply
- Other readxl arguments
Other Table files
Examples
Select the hot dog with the least calories: Lily
Select the observation with the most sodium: Tom
Print lily and tom

Packages

CSV files

read.csv
read_csv from the tidyverse readr package
fread from the data.table package

Tab-separated files

read.delim
read_tsv for tab-delimited files; from the readr package
fread from the data.table package

Excel files

readxl package for .xls and .xlsx files
- read_excel: import sheet into R
- excel_sheets: lists the different sheets in an Excel file

Other files

read.table to read any tabular data
read_delim from the readr package; is the readr equivalent of read.table
fread from the data.table package; is the data.table equiv of read.table
read_table from readr if separated by whitespace

The readr packages are faster than the base packages, but fread (from data.table) is even faster.

CSV Files

`read.csv`

The read.csv function is part of the utils package with loads with base R, so you won’t need to install anything to get going.

chi_books <- read.csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv",
                      stringsAsFactors = FALSE,
                      col.names = c("Author", "Record", "Year", "Rank", "Title")
                      )
str(chi_books)

## 'data.frame':    25 obs. of  5 variables:
##  $ Author: chr  "Tartt, Donna." "Flynn, Gillian, 1971-" "Grisham, John." "Connelly, Michael, 1956-" ...
##  $ Record: chr  "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316055437" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780307588364" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780385537131" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316069519" ...
##  $ Year  : int  2013 2012 2013 2013 2014 2013 2013 2013 2013 2013 ...
##  $ Rank  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title : chr  "The goldfinch" "Gone girl: a novel" "Sycamore row" "The gods of guilt: a novel" ...

By default, the stringsAsFactors argument is set to true, so you will have to use the stringsAsFactors = FALSE argument if the variable is not a factor/categorical variable.

Tip: use the colClasses argument, e.g. colClasses = c("character", "NULL", "numeric", "factor") to specifically assign the type for each variable. In this case, you don’t really need to use the stringAsFactors argument. Of note, using "NULL" means that column will be skipped and not loaded into the dataframe.

`read_csv`

library(readr)
chi_booksR <- read_csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv") 
head(chi_booksR)

## # A tibble: 6 x 5
##                     author
##                      <chr>
## 1            Tartt, Donna.
## 2    Flynn, Gillian, 1971-
## 3           Grisham, John.
## 4 Connelly, Michael, 1956-
## 5          Kidd, Sue Monk.
## 6        Evanovich, Janet.
## # ... with 4 more variables: catalog_record <chr>, pub_year <int>,
## #   rank <int>, title <chr>

Set column types in `readr` with `col_types`

Using the base utils package we can do this using colClasses as described above. In the readr package, this can be done easily using the col_types argument. You can do this using shorthand notation:

c = character
d = double
i = integer
l = logical
_ (underscore) is used to skip a column

Alternatively, you can use collector arguments, such as col_factor, col_integer, col_double(), col_character(), etc. You then need to pass these using col_types = list(col_character(), col_integer(), col_factor(levels = c("Small", "Medium", "Large"))).

chi_booksR <- read_csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv",
                       col_types = "cciic") 
head(chi_booksR)

## # A tibble: 6 x 5
##                     author
##                      <chr>
## 1            Tartt, Donna.
## 2    Flynn, Gillian, 1971-
## 3           Grisham, John.
## 4 Connelly, Michael, 1956-
## 5          Kidd, Sue Monk.
## 6        Evanovich, Janet.
## # ... with 4 more variables: catalog_record <chr>, pub_year <int>,
## #   rank <int>, title <chr>

Tab-delimited files

`read.delim`

tab <- read.delim("https://gist.githubusercontent.com/Stevoisiak/d1b8c7b15256e86dea5baaefcef428bd/raw/5c8896389279bb5383b4c69c65ed12524c10c3d1/system_list.txt", stringsAsFactors = FALSE)
str(tab)

## 'data.frame':    171 obs. of  3 variables:
##  $ value  : chr  "32X" "3DO" "AcornA" "AcornE" ...
##  $ name   : chr  "32X" "3DO" "Acorn Archimedes" "Acorn Electron" ...
##  $ display: chr  "yes" "" "" "" ...

`read_tsv`

tabR <- read_tsv("https://gist.githubusercontent.com/Stevoisiak/d1b8c7b15256e86dea5baaefcef428bd/raw/5c8896389279bb5383b4c69c65ed12524c10c3d1/system_list.txt",
                 col_names = c("Code", "Name", "Display")
                 )

## Parsed with column specification:
## cols(
##   Code = col_character(),
##   Name = col_character(),
##   Display = col_character()
## )

head(tabR)

## # A tibble: 6 x 3
##     Code             Name Display
##    <chr>            <chr>   <chr>
## 1  value             name display
## 2    32X              32X     yes
## 3    3DO              3DO    <NA>
## 4 AcornA Acorn Archimedes    <NA>
## 5 AcornE   Acorn Electron    <NA>
## 6    AMG            Amiga    <NA>

Excel Files

List the available sheets

library(readxl)

excel_sheets("file.xlsx") # gives a list of the different sheets in the file

Import sheets into R

library(readxl)

tumor <- read_excel("tumor.xlsx") # by default, loads 1st sheet

# Load a specific worksheet

tumor3 <- read_excel("tumor.xlsx", sheet = 3)
tumor_cKIT <- read_excel("tumor.xlsx", sheet = "cKIT")

Import Multiple sheets w/ `lapply`

data_list <- lapply(excel_sheets("data.xlsx"),
                      read_excel,
                      path = "data.xlsx")

TODO: put up the code to do this with purrr also

Other `readxl` arguments

sheet
col_names
- If first row contains column names, col_names = TRUE (the default setting)
- Set column names using a name vector, e.g. col_names = c("name", "age", "grade")
col_types
- Set column types using a vector, e.g. col_types = c("text", "numeric", "date", "blank")
- Setting a column type as blank just ignores that column
skip - specify number of rows you want to skip

Other Table files

Let’s use an example where instead of a comma, tab, or other traditional separation, we find ourselves dealing with a table that happens to be separated by question marks, i.e. ?

`read.table`

table <- read.table("file.txt",
                    header = TRUE,
                    sep = "?"
                    stringsAsFactors = FALSE,
                    col.names = c("name", "age", "grade"))

read.table has a lot of different possible arguments, so look up the documentation if you are dealing with a wacky format.

`read_delim` from `readr`

tableR <- read_delim("file.txt",
                     delim = "?",
                     col_names = c("name", "age", "grade")
                     )

The equivalent of the sep argument (from the base utils package) for readr is delim as is illustrated in the above examples.

`fread` from `data.table`

library(data.table)

## Warning: package 'data.table' was built under R version 3.4.2

chi_fread <- fread("https://data.cityofchicago.org/resource/v3hq-i8r8.csv")
head(chi_fread)

##                      author
## 1:            Tartt, Donna.
## 2:    Flynn, Gillian, 1971-
## 3:           Grisham, John.
## 4: Connelly, Michael, 1956-
## 5:          Kidd, Sue Monk.
## 6:        Evanovich, Janet.
##                                                                                                        catalog_record
## 1: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316055437
## 2: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780307588364
## 3: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780385537131
## 4: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316069519
## 5: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780670024780
## 6:    http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=0345542886
##    pub_year rank                      title
## 1:     2013    1              The goldfinch
## 2:     2012    2         Gone girl: a novel
## 3:     2013    3               Sycamore row
## 4:     2013    4 The gods of guilt: a novel
## 5:     2014    5     The invention of wings
## 6:     2013    6            Takedown twenty

The upsides of fread over read.table is that it is much faster (as are most things with data.table) and it just kinda knows how to handle the data without specifying things so much.

A few (of many) fread arguments:

drop = 2:4 to exclude columns 2 to 4
drop = c("b", "c", "d") to exclude columns \(b\), \(c\), and \(d\)
select = c(1, 5) to select columns 1 and 5
select = c("a", "e") to select columns \(a\) and \(e\)

Examples

Lily wants the hotdog with the least calories and Tom wants the hotdog with the most calories ¹ -

hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))

# Select the hot dog with the least calories: Lily
lily <- hotdogs[which.min(hotdogs$calories), ]

# Select the observation with the most sodium: Tom
tom <- hotdogs[which.max(hotdogs$sodium), ]

# Print lily and tom
lily
tom

DataCamp example↩

Python Basics: From Zero to Full Monty

September 27, 2017

notes study tutorial python

Tables: Converting Markdown to Huxtable

September 21, 2017

blog howto markdown R

Normal Distribution & Central Limit Theorem

September 19, 2017

notes review study