Importing Data into R

January 21, 2018
howto notes R tutorial

Packages

CSV files

  • read.csv
  • read_csv from the tidyverse readr package
  • fread from the data.table package

Tab-separated files

  • read.delim
  • read_tsv for tab-delimited files; from the readr package
  • fread from the data.table package

Excel files

  • readxl package for .xls and .xlsx files
    • read_excel: import sheet into R
    • excel_sheets: lists the different sheets in an Excel file

Other files

  • read.table to read any tabular data
  • read_delim from the readr package; is the readr equivalent of read.table
  • fread from the data.table package; is the data.table equiv of read.table
  • read_table from readr if separated by whitespace

The readr packages are faster than the base packages, but fread (from data.table) is even faster.

CSV Files

read.csv

The read.csv function is part of the utils package with loads with base R, so you won’t need to install anything to get going.

chi_books <- read.csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv",
                      stringsAsFactors = FALSE,
                      col.names = c("Author", "Record", "Year", "Rank", "Title")
                      )
str(chi_books)
## 'data.frame':    25 obs. of  5 variables:
##  $ Author: chr  "Tartt, Donna." "Flynn, Gillian, 1971-" "Grisham, John." "Connelly, Michael, 1956-" ...
##  $ Record: chr  "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316055437" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780307588364" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780385537131" "http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316069519" ...
##  $ Year  : int  2013 2012 2013 2013 2014 2013 2013 2013 2013 2013 ...
##  $ Rank  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title : chr  "The goldfinch" "Gone girl: a novel" "Sycamore row" "The gods of guilt: a novel" ...

By default, the stringsAsFactors argument is set to true, so you will have to use the stringsAsFactors = FALSE argument if the variable is not a factor/categorical variable.

Tip: use the colClasses argument, e.g. colClasses = c("character", "NULL", "numeric", "factor") to specifically assign the type for each variable. In this case, you don’t really need to use the stringAsFactors argument. Of note, using "NULL" means that column will be skipped and not loaded into the dataframe.

read_csv

library(readr)
chi_booksR <- read_csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv") 
head(chi_booksR)
## # A tibble: 6 x 5
##                     author
##                      <chr>
## 1            Tartt, Donna.
## 2    Flynn, Gillian, 1971-
## 3           Grisham, John.
## 4 Connelly, Michael, 1956-
## 5          Kidd, Sue Monk.
## 6        Evanovich, Janet.
## # ... with 4 more variables: catalog_record <chr>, pub_year <int>,
## #   rank <int>, title <chr>

Set column types in readr with col_types

Using the base utils package we can do this using colClasses as described above. In the readr package, this can be done easily using the col_types argument. You can do this using shorthand notation:

  • c = character
  • d = double
  • i = integer
  • l = logical
  • _ (underscore) is used to skip a column

Alternatively, you can use collector arguments, such as col_factor, col_integer, col_double(), col_character(), etc. You then need to pass these using col_types = list(col_character(), col_integer(), col_factor(levels = c("Small", "Medium", "Large"))).

chi_booksR <- read_csv("https://data.cityofchicago.org/resource/v3hq-i8r8.csv",
                       col_types = "cciic") 
head(chi_booksR)
## # A tibble: 6 x 5
##                     author
##                      <chr>
## 1            Tartt, Donna.
## 2    Flynn, Gillian, 1971-
## 3           Grisham, John.
## 4 Connelly, Michael, 1956-
## 5          Kidd, Sue Monk.
## 6        Evanovich, Janet.
## # ... with 4 more variables: catalog_record <chr>, pub_year <int>,
## #   rank <int>, title <chr>

Tab-delimited files

read.delim

tab <- read.delim("https://gist.githubusercontent.com/Stevoisiak/d1b8c7b15256e86dea5baaefcef428bd/raw/5c8896389279bb5383b4c69c65ed12524c10c3d1/system_list.txt", stringsAsFactors = FALSE)
str(tab)
## 'data.frame':    171 obs. of  3 variables:
##  $ value  : chr  "32X" "3DO" "AcornA" "AcornE" ...
##  $ name   : chr  "32X" "3DO" "Acorn Archimedes" "Acorn Electron" ...
##  $ display: chr  "yes" "" "" "" ...

read_tsv

tabR <- read_tsv("https://gist.githubusercontent.com/Stevoisiak/d1b8c7b15256e86dea5baaefcef428bd/raw/5c8896389279bb5383b4c69c65ed12524c10c3d1/system_list.txt",
                 col_names = c("Code", "Name", "Display")
                 )
## Parsed with column specification:
## cols(
##   Code = col_character(),
##   Name = col_character(),
##   Display = col_character()
## )
head(tabR)
## # A tibble: 6 x 3
##     Code             Name Display
##    <chr>            <chr>   <chr>
## 1  value             name display
## 2    32X              32X     yes
## 3    3DO              3DO    <NA>
## 4 AcornA Acorn Archimedes    <NA>
## 5 AcornE   Acorn Electron    <NA>
## 6    AMG            Amiga    <NA>

Excel Files

List the available sheets

library(readxl)

excel_sheets("file.xlsx") # gives a list of the different sheets in the file

Import sheets into R

library(readxl)

tumor <- read_excel("tumor.xlsx") # by default, loads 1st sheet

# Load a specific worksheet

tumor3 <- read_excel("tumor.xlsx", sheet = 3)
tumor_cKIT <- read_excel("tumor.xlsx", sheet = "cKIT")

Import Multiple sheets w/ lapply

data_list <- lapply(excel_sheets("data.xlsx"),
                      read_excel,
                      path = "data.xlsx")

TODO: put up the code to do this with purrr also

Other readxl arguments

  • sheet
  • col_names
    • If first row contains column names, col_names = TRUE (the default setting)
    • Set column names using a name vector, e.g. col_names = c("name", "age", "grade")
  • col_types
    • Set column types using a vector, e.g. col_types = c("text", "numeric", "date", "blank")
    • Setting a column type as blank just ignores that column
  • skip - specify number of rows you want to skip

Other Table files

Let’s use an example where instead of a comma, tab, or other traditional separation, we find ourselves dealing with a table that happens to be separated by question marks, i.e. ?

read.table

table <- read.table("file.txt",
                    header = TRUE,
                    sep = "?"
                    stringsAsFactors = FALSE,
                    col.names = c("name", "age", "grade"))

read.table has a lot of different possible arguments, so look up the documentation if you are dealing with a wacky format.

read_delim from readr

tableR <- read_delim("file.txt",
                     delim = "?",
                     col_names = c("name", "age", "grade")
                     )

The equivalent of the sep argument (from the base utils package) for readr is delim as is illustrated in the above examples.

fread from data.table

library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2
chi_fread <- fread("https://data.cityofchicago.org/resource/v3hq-i8r8.csv")
head(chi_fread)
##                      author
## 1:            Tartt, Donna.
## 2:    Flynn, Gillian, 1971-
## 3:           Grisham, John.
## 4: Connelly, Michael, 1956-
## 5:          Kidd, Sue Monk.
## 6:        Evanovich, Janet.
##                                                                                                        catalog_record
## 1: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316055437
## 2: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780307588364
## 3: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780385537131
## 4: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780316069519
## 5: http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=9780670024780
## 6:    http://chipublib.bibliocommons.com/search?suppress=true&custom_edit=false&custom_query=identifier%3A=0345542886
##    pub_year rank                      title
## 1:     2013    1              The goldfinch
## 2:     2012    2         Gone girl: a novel
## 3:     2013    3               Sycamore row
## 4:     2013    4 The gods of guilt: a novel
## 5:     2014    5     The invention of wings
## 6:     2013    6            Takedown twenty

The upsides of fread over read.table is that it is much faster (as are most things with data.table) and it just kinda knows how to handle the data without specifying things so much.

A few (of many) fread arguments:

  • drop = 2:4 to exclude columns 2 to 4
  • drop = c("b", "c", "d") to exclude columns \(b\), \(c\), and \(d\)
  • select = c(1, 5) to select columns 1 and 5
  • select = c("a", "e") to select columns \(a\) and \(e\)

Examples

Lily wants the hotdog with the least calories and Tom wants the hotdog with the most calories 1 -

hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))

# Select the hot dog with the least calories: Lily
lily <- hotdogs[which.min(hotdogs$calories), ]

# Select the observation with the most sodium: Tom
tom <- hotdogs[which.max(hotdogs$sodium), ]

# Print lily and tom
lily
tom

  1. DataCamp example

Python Basics: From Zero to Full Monty

September 27, 2017
notes study tutorial python

Tables: Converting Markdown to Huxtable

September 21, 2017
blog howto markdown R

Normal Distribution & Central Limit Theorem

September 19, 2017
notes review study