Data Validation / Testing

Two recent packages, assertr and validate, check the data in a data frame. A third package from Hadley, assertthat, is more of a replacement for stopifnot().

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

suppressPackageStartupMessages(library(dplyr))
library(assertr)

mtcars %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))
## # A tibble: 3 × 2
##     cyl  avg.mpg
##   <dbl>    <dbl>
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000
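By default a failed check stops the pipeline with an error. When you would rather get a TRUE/FALSE answer and decide what to do yourself, one option is to swap in different success and error callbacks. The sketch below assumes the `success_logical` and `error_logical` helpers shipped with recent assertr releases; `mpg_in_range` is a hypothetical wrapper written only for this illustration.

```r
library(assertr)

# Hypothetical helper (not part of assertr): returns TRUE when every mpg
# value lies within [lower, upper], and FALSE otherwise, instead of stopping.
mpg_in_range <- function(data, lower, upper) {
  assert(
    data,
    within_bounds(lower, upper),
    mpg,
    success_fun = success_logical,  # return TRUE when the check passes
    error_fun   = error_logical     # return FALSE instead of raising an error
  )
}

ok  <- mpg_in_range(mtcars, 0, 50)  # every car in mtcars is below 50 mpg
bad <- mpg_in_range(mtcars, 0, 20)  # several cars exceed 20 mpg
```

Here `ok` should be TRUE and `bad` should be FALSE, so the result can feed an `if` statement in a script rather than halting it.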

The validate R package makes it super-easy to check whether data lives up to the expectations you have based on domain knowledge. It works by letting you define data validation rules independently of the code or data set.

library(validate)
data(women)
cf <- check_that(women, height > 0, weight > 0, height/weight > 0.5)
cf
## Object of class 'validation'
## Call:
##     check_that(women, height > 0, weight > 0, height/weight > 0.5)
## 
## Confrontations: 3
## With fails    : 1
## Warnings      : 0
## Errors        : 0
v <- validator(height > 0, weight > 0, height/weight > 0)
confront(women, v)
## Object of class 'validation'
## Call:
##     confront(x = women, dat = v)
## 
## Confrontations: 3
## With fails    : 0
## Warnings      : 0
## Errors        : 0

If you can include these checks in an automated data processing step, they will help confirm that the underlying data behaves the way you think it does.
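As a rough sketch of what such an automated step might look like, the snippet below reuses the rules from the first example and pulls out only the rules that had failures. It relies on `summary()` returning one row per rule with a `fails` column, which is how the validate documentation describes it; the variable names `rules`, `failed`, and `all_passed` are made up for this illustration.

```r
library(validate)

# Same three rules as in the check_that() example
rules <- validator(height > 0, weight > 0, height / weight > 0.5)
out   <- confront(women, rules)

# summary() gives one row per rule, including pass and fail counts
res <- summary(out)

# Keep only the rules that at least one record failed
failed <- res[res$fails > 0, c("name", "fails", "expression")]
print(failed)

# A single flag that a scheduled job could branch on
all_passed <- nrow(failed) == 0
```

A scheduled script could then log `failed` and stop, or send a notification, whenever `all_passed` is FALSE.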

assertthat provides a drop-in replacement for stopifnot() that makes it easy to check the pre- and post-conditions of a function, while producing useful error messages.

library(assertthat)
x <- 1:10
stopifnot(is.character(x))
## Error: is.character(x) is not TRUE
assert_that(is.character(x))
## Error: x is not a character vector
assert_that(length(x) == 5)
## Error: length(x) not equal to 5
assert_that(is.numeric(x))
## [1] TRUE
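To show the pre- and post-condition idea inside an actual function, here is a minimal sketch. `scale_to_unit` is a hypothetical function invented for this example; `noNA()` and the `msg` argument come from assertthat, assuming a reasonably recent version of the package.

```r
library(assertthat)

# Hypothetical function used only for illustration:
# rescales a numeric vector so that it runs from 0 to 1.
scale_to_unit <- function(x) {
  # Pre-conditions: numeric input, no missing values, some spread in the data
  assert_that(is.numeric(x), noNA(x))
  assert_that(max(x) > min(x),
              msg = "x must contain at least two distinct values")

  out <- (x - min(x)) / (max(x) - min(x))

  # Post-condition: the result spans exactly [0, 1]
  assert_that(min(out) == 0, max(out) == 1)
  out
}

scaled <- scale_to_unit(1:10)
```

Calling `scale_to_unit("a")` or `scale_to_unit(c(3, 3))` should fail at the pre-conditions with a readable message rather than producing a confusing error further down.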

Data Science Workflow

  • Git and GitHub
    • This tutorial strikes a good balance between comprehensive coverage of GitHub and brevity, so you can quickly find what you want

Textbooks