Reproducible Research Week 1

Replication

The ultimate standard for strengthening scientific evidence is replication of findings: conducting studies with independent

Investigators
Data
Analytical Methods
Laboratories
Instruments

Replication is particularly important in studies that can impact broad policy or regulatory decisions

What's wrong with replication?

Some studies cannot be replicated

No time, opportunistic
No money
Unique

Reproducible Research: Make analytic data and code available so that others may reproduce findings

Reproducibility bridges the gap between full replication, which is the ideal, and doing nothing.

Why do we need reproducible research?

New technologies are increasing data collection throughput; data are more complex and extremely high dimensional

Existing databases can be merged into new "megadatabases"

Computing power is greatly increased, allowing more sophisticated analyses

For every field "X" there is a field "Computational X"

Research Pipeline

Measured Data -> Analytic Data -> Computational Results -> Figures/Tables/Numeric Summaries -> Articles -> Text

Data/metadata used to develop the test should be made publicly available

The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available

"Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis need to be transparently reported" -- IOM Report

What do we need for reproducible research?

Analytic data are available
Analytic code is available
Documentation of code and data
Standard means of distribution

Who is the audience for reproducible research?

Authors:

Want to make their research reproducible
Want tools for reproducible research to make their lives easier (or at least not much harder)

Readers:

Want to reproduce (and perhaps expand upon) interesting findings
Want tools for reproducible research to make their lives easier.

Challenges for reproducible research

Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
Readers must download data/results individually and piece together which data go with which code sections, etc.
Readers may not have the same resources as authors
Few tools to help authors/readers

What happens in reality

Authors:

Just put stuff on the web
Journal supplementary materials (infamous for disorganization)
There are some central databases for various fields (e.g., biology, ICPSR)

Readers:

Just download the data and (try to) figure it out
Piece together the software and run it

Literate (Statistical) Programming

An article is a stream of text and code

Analysis code is divided into text and code "chunks"

Each code chunk loads data and computes results

Presentation code formats results (tables, figures, etc.)

Article text explains what is going on

Literate programs can be woven to produce human-readable documents and tangled to produce machine-readable documents

Literate programming is a general concept that requires

A documentation language (human readable)
A programming language (machine readable)

knitr is an R package that supports a variety of documentation languages, such as LaTeX, Markdown, and HTML
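The weave and tangle steps can be sketched with knitr itself. This is a minimal illustration, assuming the knitr package is installed; the file names, chunk label, and document contents are made up for the example. The chunk fence is assembled with strrep() only so that the example can sit inside a fenced block.

```r
library(knitr)

# Build a tiny R Markdown source: prose plus one code chunk.
fence <- strrep("`", 3)
rmd_lines <- c(
  "# Demo report",
  "",
  "The mean of 1 to 10 is computed below.",
  "",
  paste0(fence, "{r compute-mean}"),
  "x <- 1:10",
  "mean(x)",
  fence
)

# Hypothetical working directory and file names
work_dir <- tempfile("literate-demo-")
dir.create(work_dir)
rmd_file <- file.path(work_dir, "demo.Rmd")
writeLines(rmd_lines, rmd_file)

# Weave: run the chunks and interleave their output with the text
md_file <- knit(rmd_file, output = file.path(work_dir, "demo.md"), quiet = TRUE)

# Tangle: drop the prose and keep only the code
r_file <- purl(rmd_file, output = file.path(work_dir, "demo.R"), quiet = TRUE)

woven <- readLines(md_file)
tangled <- readLines(r_file)
```

Here `woven` holds the human-readable Markdown (text, code, and the computed mean), while `tangled` holds the machine-readable R script.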

Quick summary so far

Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate

Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available

There is a growing number of tools for creating reproducible documents

Golden Rule of Reproducibility: Script Everything

Steps in a Data Analysis

Define the question
Define the ideal data set
Determine what data you can access
Obtain the data
Clean the data
Exploratory data analysis
Statistical prediction/modeling
Interpret results
Challenge results
Synthesize/write up results
Create reproducible code

"Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?" -- Dan Meyer

Defining a question is the most powerful dimension reduction tool you can ever employ.

An example for step 1 (define the question)

Start with a general question

Can I automatically detect whether an email is SPAM?

Make it concrete

Can I use quantitative characteristics of emails to classify them as SPAM?

Define the ideal data set

The data set may depend on your goal

Descriptive goal -- a whole population
Exploratory goal -- a random sample with many variables measured
Inferential goal -- The right population, randomly sampled
Predictive goal -- a training and test data set from the same population
Causal goal -- data from a randomized study
Mechanistic goal -- data about all components of the system

Determine what data you can access

Sometimes you can find data free on the web

Other times you may need to buy the data

Be sure to respect the terms of use

If the data don't exist, you may need to generate them yourself.

Obtain the data

Try to obtain the raw data

Be sure to reference the source

Polite emails go a long way

If you load the data from an Internet source, record the URL and time accessed
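Recording the URL and access time can itself be scripted. The sketch below is one possible approach, not a standard tool: `download_with_log` and `log_source` are hypothetical helper names, the log format is an arbitrary choice, and the URL is a placeholder rather than a real data source. Only the logging step is run, so no network access is needed.

```r
# Append one provenance line: access time (UTC), destination file, source URL
log_source <- function(url, destfile, log_file, accessed = Sys.time()) {
  stamp <- format(accessed, "%Y-%m-%d %H:%M:%S UTC", tz = "UTC")
  entry <- paste(stamp, basename(destfile), url, sep = "\t")
  cat(entry, "\n", sep = "", file = log_file, append = TRUE)
  invisible(entry)
}

# Hypothetical wrapper: download a file, then log where and when it came from
download_with_log <- function(url, destfile, log_file = "data_sources.log") {
  download.file(url, destfile = destfile, mode = "wb")
  log_source(url, destfile, log_file)
}

# Demonstrate the logging step alone with a placeholder URL
log_file <- tempfile(fileext = ".log")
entry <- log_source("https://example.org/raw/spam.csv", "data/raw/spam.csv", log_file)
readLines(log_file)
```

Keeping the log next to the raw data means the source and access date travel with the files.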

Clean the data

Raw data often needs to be processed

If it is pre-processed, make sure you understand how

Understand the source of the data (census, sample, convenience sample, etc.)

May need reformatting, subsampling -- record these steps

Determine if the data are good enough -- If not, quit or change data

Exploratory Data Analysis

Look at summaries of the data

Check for missing data

-> Why is there missing data?

Look for outliers

Create exploratory plots

Perform exploratory analyses such as clustering

If the points in a plot are bunched together and hard to read, consider plotting an axis on the log base 10 scale

# Adding 1 keeps log10() defined if a value is 0
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
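The exploratory steps in this section can be run end to end on a small example. The sketch below uses simulated data as a stand-in for a real email data set; the data frame `emails`, its column names, and every parameter of the simulation are assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in data: a skewed numeric variable, a count, a two-level
# label, and a few values removed to mimic missing data
n <- 200
emails <- data.frame(
  type = factor(rep(c("nonspam", "spam"), each = n / 2)),
  capitalAve = c(rlnorm(n / 2, meanlog = 0.5), rlnorm(n / 2, meanlog = 1.5)),
  numDollar = c(rpois(n / 2, 0.2), rpois(n / 2, 2))
)
emails$capitalAve[c(5, 50, 150)] <- NA

# Summaries of every column
summary(emails)

# Missing values per column
missing_counts <- colSums(is.na(emails))

# Candidate outliers by the boxplot rule (beyond 1.5 * IQR from the hinges)
outliers <- boxplot.stats(emails$capitalAve)$out

# Exploratory plot: the log10 scale spreads out bunched-up values
boxplot(log10(capitalAve + 1) ~ type, data = emails)

# Exploratory clustering of the complete rows on the log scale
complete <- na.omit(emails)
clusters <- hclust(dist(log10(complete[, c("capitalAve", "numDollar")] + 1)))
groups <- cutree(clusters, k = 2)
table(groups, complete$type)
```

Comparing the cluster labels with the known types is only a quick check on whether the variables separate the groups at all.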

Statistical prediction/modeling

Should be informed by the results of your exploratory analysis

Exact methods depend on the question of interest

Transformations/processing should be accounted for when necessary

Measures of uncertainty should be reported.
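As a small illustration of reporting uncertainty and accounting for a transformation, the sketch below fits a linear model to R's built-in `mtcars` data. The data set and the choice of model are assumptions made for the example and are unrelated to the spam analysis.

```r
# Fit on the log10 scale; mtcars is a built-in data set used for illustration
fit <- lm(log10(mpg) ~ wt, data = mtcars)

# Point estimates with standard errors and p-values
coef(summary(fit))

# 95% confidence intervals for the coefficients (still on the log10 scale)
ci <- confint(fit, level = 0.95)

# Account for the transformation when reporting: back-transform the slope
# to a multiplicative change in mpg per additional 1000 lb of weight
effect <- 10^c(
  estimate = unname(coef(fit)["wt"]),
  lower = ci["wt", 1],
  upper = ci["wt", 2]
)
effect
```

Reporting the interval alongside the estimate lets a reader judge how precisely the effect is known.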

Interpret Results

Use the appropriate language

Describes
Correlates with/associated with
Leads to/Causes
Predicts

Gives an explanation

Interpret Coefficients

Interpret measures of uncertainty

Challenge Results

Challenge all steps:

Question
Data Source
Processing
Analysis
Conclusions

Challenge measures of uncertainty

Challenge choices of terms to include in models

Think of potential alternative analyses

Synthesize/Write-up Results

Lead with the question

Summarize the analyses into the story

Don't include every analysis; include one only

If it is needed for the story
If it is needed to address a challenge

Order analyses according to the story, rather than chronologically

Include "pretty" figures that contribute to the story

In the lecture example...

Lead with the question

Can I use quantitative characteristics of the emails to classify them as SPAM?

Describe the approach

Collected data from UCI -> created training/test sets

Explored Relationships

Chose a logistic model on the training set by cross-validation

Applied to test, 78% test set accuracy
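The shape of this approach (split, fit on training data, evaluate on test data) can be sketched as follows. This is not the lecture's analysis: the data are simulated rather than taken from UCI, a single predictor is fixed in advance instead of being chosen by cross-validation, and the 0.5 threshold is an arbitrary choice, so the accuracy it produces says nothing about the 78% figure above.

```r
set.seed(3435)

# Simulated stand-in data: spam messages tend to contain more dollar signs
n <- 1000
is_spam <- rbinom(n, size = 1, prob = 0.4)
emails <- data.frame(
  type = factor(ifelse(is_spam == 1, "spam", "nonspam")),
  numDollar = rpois(n, lambda = ifelse(is_spam == 1, 2, 0.2))
)

# Random split into training and test sets
in_train <- rbinom(n, size = 1, prob = 0.5) == 1
train <- emails[in_train, ]
test <- emails[!in_train, ]

# Fit a logistic regression on the training set only
fit <- glm(type ~ numDollar, data = train, family = binomial)

# Predict on the held-out test set and classify at a 0.5 threshold
p_spam <- predict(fit, newdata = test, type = "response")
predicted <- ifelse(p_spam > 0.5, "spam", "nonspam")

# Test-set accuracy and confusion table
accuracy <- mean(predicted == test$type)
table(predicted, actual = test$type)
```

The test set is touched only once, after the model is fixed, so the accuracy is an honest estimate for new data from the same population.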

Interpret results

Number of dollar signs seems a reasonable predictor, e.g. "Make more money with Viagra $ $ $ $"

Challenge Results

78% isn't that great

Could use more variables

Why use logistic regression?

Data Analysis Files

Data

Raw Data
Processed Data

Figures

Exploratory Figures
Final Figures

R Code

Raw/Unused Scripts
Final Scripts
R Markdown Files

Text

README files
Text of Analysis/Report
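One way to set up this layout is to script it, in keeping with the rule to script everything. The folder names below are one possible naming scheme, not a required convention, and the project is created under a temporary directory so the sketch is harmless to run.

```r
# Hypothetical project root inside a temporary directory
project <- file.path(tempdir(), "spam-analysis")

# One folder per file type listed above (names are an assumption)
folders <- c(
  "data/raw", "data/processed",
  "figures/exploratory", "figures/final",
  "code/raw", "code/final", "code/rmarkdown",
  "text"
)

for (folder in folders) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# README stub for data sources and the script-to-data mapping
readme <- file.path(project, "text", "README.txt")
writeLines(
  c(
    "Project README",
    "",
    "Raw data sources (URL, description, date accessed):",
    "",
    "Processing scripts and the processed files they create:"
  ),
  readme
)

list.dirs(project, full.names = FALSE, recursive = TRUE)
```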

Raw Data

Should be stored in the analysis folder

If accessed from the web, include URL, description, and date accessed in README

Processed Data

Processed data should be named so it is easy to see which script generated the data

The mapping from each processing script to the processed data it creates should be documented in the README

Processed data should be tidy

Exploratory Figures

Figures made during the course of your analysis, not necessarily part of your final report

They do not need to be "pretty"

Final Figures

Usually a small subset of the original figures

Axes/Colors set to make the figure clear

Possibly multiple panels
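A final figure is best produced by a script that writes it to a file, so the same image can be regenerated at any time. The sketch below is an illustration only: it uses the built-in `mtcars` data rather than the spam data, and the file name, colors, and labels are arbitrary choices.

```r
# Write a two-panel figure to a file (hypothetical file name)
fig_file <- file.path(tempdir(), "final_figure.pdf")
pdf(fig_file, width = 8, height = 4)

# Two panels side by side
par(mfrow = c(1, 2))

# Panel 1: labeled axes and a deliberate color choice
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue",
     xlab = "Weight (1000 lb)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")

# Panel 2: the same outcome split by a grouping variable
boxplot(mpg ~ cyl, data = mtcars, col = "grey80",
        xlab = "Number of cylinders", ylab = "Miles per gallon",
        main = "Fuel economy by cylinders")

# Close the device so the file is fully written
dev.off()
```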

Raw Scripts

May be less commented (but comments help you!)

May be multiple versions

May include analyses that are later discarded

Final Scripts

Clearly commented

Small comments liberally - what, when, why, how
Bigger commented blocks for whole sections

Include processing details

Only analyses that appear in the final write-up

R Markdown Files

R Markdown files can be used to generate reproducible reports

Text and R code are integrated

Very easy to create in RStudio

README Files

Not necessary if you use R Markdown

Should contain step-by-step instructions for analysis

Text of the document

It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)

It should tell a story

It should not include every analysis you performed

References should be included for statistical methods