mirror of
https://github.com/Brandon-Rozek/website.git
synced 2024-11-29 11:27:11 -05:00
397 lines
9.5 KiB
Markdown
# Reproducible Research Week 1

## Replication

The ultimate standard for strengthening scientific evidence is replication of findings and conducting studies with independent

- Investigators
- Data
- Analytical Methods
- Laboratories
- Instruments

Replication is particularly important in studies that can impact broad policy or regulatory decisions

### What's wrong with replication?

Some studies cannot be replicated

- No time, opportunistic
- No money
- Unique

*Reproducible Research:* Make analytic data and code available so that others may reproduce findings

Reproducibility bridges the gap between replication, which is awesome, and doing nothing.

## Why do we need reproducible research?

New technologies are increasing data collection throughput; data are more complex and extremely high dimensional

Existing databases can be merged into new "megadatabases"

Computing power is greatly increased, allowing more sophisticated analyses

For every field "X" there is a field "Computational X"

## Research Pipeline

Measured Data -> Analytic Data -> Computational Results -> Figures/Tables/Numeric Summaries -> Articles -> Text

Data/metadata used to develop the test should be made publicly available

The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available

"Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis need to be transparently reported" -- IOM Report

### What do we need for reproducible research?

- Analytic data are available
- Analytic code is available
- Documentation of code and data
- Standard means of distribution

### Who is the audience for reproducible research?

Authors:

- Want to make their research reproducible
- Want tools for reproducible research to make their lives easier (or at least not much harder)

Readers:

- Want to reproduce (and perhaps expand upon) interesting findings
- Want tools for reproducible research to make their lives easier.

### Challenges for reproducible research

- Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
- Readers must download data/results individually and piece together which data go with which code sections, etc.
- Readers may not have the same resources as authors
- Few tools to help authors/readers

### What happens in reality

Authors:

- Just put stuff on the web
- Journal supplementary materials (infamous for disorganization)
- There are some central databases for various fields (e.g. biology, ICPSR)

Readers:

- Just download the data and (try to) figure it out
- Piece together the software and run it

## Literate (Statistical) Programming

An article is a stream of text and code

Analysis code is divided into text and code "chunks"

Each code chunk loads data and computes results

Presentation code formats results (tables, figures, etc.)

Article text explains what is going on

Literate programs can be woven to produce human-readable documents and tangled to produce machine-readable documents

Literate programming is a general concept that requires

1. A documentation language (human readable)
2. A programming language (machine readable)

knitr is an R package that supports a variety of documentation languages, such as LaTeX, Markdown, and HTML

### Quick summary so far

Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate

Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available

There is a growing number of tools for creating reproducible documents

**Golden Rule of Reproducibility: Script Everything**

## Steps in a Data Analysis

1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modeling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
11. Create reproducible code

"Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?" -- Dan Meyer

Defining a question is the most powerful dimension reduction tool you can ever employ.

### An Example for #1

**Start with a general question**

Can I automatically detect whether an email is SPAM?

**Make it concrete**

Can I use quantitative characteristics of emails to classify them as SPAM?

### Define the ideal data set

The data set may depend on your goal

- Descriptive goal -- a whole population
- Exploratory goal -- a random sample with many variables measured
- Inferential goal -- the right population, randomly sampled
- Predictive goal -- a training and test data set from the same population
- Causal goal -- data from a randomized study
- Mechanistic goal -- data about all components of the system

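For a predictive goal, the training and test sets can be carved out of one sample by random assignment. The following is a minimal sketch under stated assumptions: the data frame `emails` is simulated, and its column names (`capitalAve`, `type`) are stand-ins chosen for illustration. The seed is fixed so the split itself is reproducible.

```r
# Simulated stand-in for an email data set (hypothetical values, not real data).
set.seed(3435)
n <- 200
emails <- data.frame(
  capitalAve = rexp(n, rate = 0.2),
  type = factor(sample(c("nonspam", "spam"), n, replace = TRUE))
)

# Assign each row to the training set with probability 0.5.
trainIndicator <- rbinom(n, size = 1, prob = 0.5)
trainSpam <- emails[trainIndicator == 1, ]
testSpam  <- emails[trainIndicator == 0, ]

# Every row lands in exactly one of the two sets.
table(trainIndicator)
```

Because both sets are drawn from the same data frame, they come from the same population, which is what the predictive goal asks for.
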
### Determine what data you can access

Sometimes you can find data free on the web

Other times you may need to buy the data

Be sure to respect the terms of use

If the data don't exist, you may need to generate them yourself.

### Obtain the data

Try to obtain the raw data

Be sure to reference the source

Polite emails go a long way

If you load the data from an Internet source, record the URL and time accessed

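One way to record the URL and access time is to write them to a small file stored next to the raw data. This is only a sketch: the URL is a placeholder, the file name `SOURCE.csv` is an arbitrary choice, and a temporary folder stands in for the real analysis folder.

```r
# Hypothetical source URL, used only for illustration.
fileUrl <- "https://example.org/data/emails.csv"

# Record where the data came from and when it was accessed.
provenance <- data.frame(
  url = fileUrl,
  accessed = format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"),
  stringsAsFactors = FALSE
)

# Store the record next to the raw data (a temporary folder here).
rawDir <- file.path(tempdir(), "data", "raw")
dir.create(rawDir, recursive = TRUE, showWarnings = FALSE)
provenanceFile <- file.path(rawDir, "SOURCE.csv")
write.csv(provenance, provenanceFile, row.names = FALSE)

# The download itself would be a call such as:
# download.file(fileUrl, destfile = file.path(rawDir, "emails.csv"))
```
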
### Clean the data

Raw data often needs to be processed

If it is pre-processed, make sure you understand how

Understand the source of the data (census, sample, convenience sample, etc.)

May need reformatting, subsampling -- record these steps

**Determine if the data are good enough** -- If not, quit or change data

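Recording the reformatting and subsampling steps is easiest when the cleaning is itself a script. A minimal sketch, assuming a simulated raw data frame (the columns and the five missing values are invented for illustration):

```r
# Simulated raw data with some missing values (hypothetical).
set.seed(1)
raw <- data.frame(
  id = 1:100,
  capitalAve = c(rexp(95, rate = 0.2), rep(NA, 5)),
  type = sample(c("nonspam", "spam"), 100, replace = TRUE)
)

# Step 1: drop rows with missing measurements.
complete <- raw[complete.cases(raw), ]

# Step 2: reformat the label as a factor.
complete$type <- factor(complete$type, levels = c("nonspam", "spam"))

# Step 3: subsample 50 rows; the seed above makes the draw repeatable.
processed <- complete[sample(nrow(complete), 50), ]

# Keep a log of what each step did to the row count.
cleaningLog <- data.frame(
  step = c("raw", "drop incomplete rows", "subsample"),
  rows = c(nrow(raw), nrow(complete), nrow(processed))
)
cleaningLog
```

The log makes it easy to judge whether enough data survive the cleaning to continue.
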
### Exploratory Data Analysis

Look at summaries of the data

Check for missing data

-> Why is there missing data?

Look for outliers

Create exploratory plots

Perform exploratory analyses such as clustering

If a plot is hard to read because the points are all bunched up, consider taking the log base 10 of an axis

`plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)`

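The exploratory checks above can be sketched in base R as follows. The data frame is simulated, and the column names other than `capitalAve` and `type` are assumptions made for illustration; this is not the lecture's data set.

```r
# Simulated training data (hypothetical; not the lecture's data set).
set.seed(42)
n <- 150
trainSpam <- data.frame(
  capitalAve = c(rexp(n - 5, rate = 0.3), rep(NA, 5)),
  charDollar = rexp(n, rate = 5),
  charExclamation = rexp(n, rate = 3),
  type = factor(rep(c("nonspam", "spam"), length.out = n))
)

# Summaries of the data.
summary(trainSpam)
table(trainSpam$type)

# Missing values per column.
missingCounts <- colSums(is.na(trainSpam))

# Flag outliers using the boxplot rule (beyond 1.5 * IQR from the quartiles).
outliers <- boxplot.stats(trainSpam$capitalAve)$out

# Exploratory plot on a log10 scale; pdf(NULL) draws without creating a file.
pdf(NULL)
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
dev.off()

# Hierarchical clustering of the numeric variables, using complete rows only.
numericCols <- c("capitalAve", "charDollar", "charExclamation")
logged <- log10(na.omit(trainSpam[, numericCols]) + 1)
hCluster <- hclust(dist(t(logged)))
```

Adding 1 before taking the log avoids `log10(0)` for variables that can be zero.
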
### Statistical prediction/modeling

Should be informed by the results of your exploratory analysis

Exact methods depend on the question of interest

Transformations/processing should be accounted for when necessary

Measures of uncertainty should be reported.

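As an illustration of this step, the sketch below picks a single predictor for a logistic model by cross validation on the training set, evaluates the chosen model once on the test set, and reports an interval for the accuracy. Everything here is an assumption for demonstration: the data are simulated, the helper `cvError` is hypothetical, and the 0.5 cutoff and 5 folds are arbitrary choices.

```r
# Simulated data (hypothetical): spam messages tend to contain more dollar signs.
set.seed(2024)
n <- 400
type <- rbinom(n, size = 1, prob = 0.4)      # 1 = spam, 0 = nonspam
emails <- data.frame(
  numType = type,
  charDollar = rexp(n, rate = ifelse(type == 1, 4, 20)),
  capitalAve = rexp(n, rate = 0.2)           # unrelated to the label here
)
trainRows <- sample(n, n / 2)
train <- emails[trainRows, ]
test  <- emails[-trainRows, ]

# K-fold cross-validated misclassification rate for a one-predictor logistic model.
cvError <- function(predictor, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  model <- reformulate(predictor, response = "numType")
  errors <- sapply(seq_len(k), function(i) {
    fit <- glm(model, family = binomial, data = data[folds != i, ])
    prob <- predict(fit, newdata = data[folds == i, ], type = "response")
    mean((prob > 0.5) != data$numType[folds == i])
  })
  mean(errors)
}

# Pick the candidate with the lowest cross-validated error.
candidates <- c("charDollar", "capitalAve")
cvErrors <- sapply(candidates, cvError, data = train)
best <- names(which.min(cvErrors))

# Refit on the full training set and evaluate once on the test set.
finalFit <- glm(reformulate(best, response = "numType"),
                family = binomial, data = train)
predicted <- predict(finalFit, newdata = test, type = "response") > 0.5
accuracy <- mean(predicted == test$numType)

# Report a measure of uncertainty: an exact 95% interval for test accuracy.
accuracyCI <- binom.test(sum(predicted == test$numType), nrow(test))$conf.int
```

The test set is touched only once, after the model has been chosen, so the reported accuracy is not inflated by the selection step.
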
### Interpret Results

Use the appropriate language

- Describes
- Correlates with/associated with
- Leads to/Causes
- Predicts

Give an explanation

Interpret coefficients

Interpret measures of uncertainty

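For a logistic model, interpreting coefficients and their uncertainty usually means moving from the log-odds scale to odds ratios. A minimal sketch on simulated data (the variable names and the true effect size are assumptions for illustration):

```r
# Simulated data (hypothetical), with a known positive effect of charDollar.
set.seed(7)
n <- 500
charDollar <- rexp(n, rate = 5)
numType <- rbinom(n, size = 1, prob = plogis(-1 + 4 * charDollar))
fit <- glm(numType ~ charDollar, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios.
oddsRatios <- exp(coef(fit))

# Wald 95% confidence intervals, also moved to the odds-ratio scale.
oddsRatioCI <- exp(confint.default(fit))

cbind(oddsRatio = oddsRatios, oddsRatioCI)
```

The `charDollar` row reads as "a one-unit increase in `charDollar` is associated with the odds of spam being multiplied by this factor". With observational data, "is associated with" is the appropriate wording, not "causes".
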
### Challenge Results

Challenge all steps:

- Question
- Data Source
- Processing
- Analysis
- Conclusions

Challenge measures of uncertainty

Challenge choices of terms to include in models

Think of potential alternative analyses

### Synthesize/Write-up Results

Lead with the question

Summarize the analyses into the story

Don't include every analysis; include an analysis only

- If it is needed for the story
- If it is needed to address a challenge

Order analyses according to the story, rather than chronologically

Include "pretty" figures that contribute to the story

### In the lecture example...

**Lead with the question**

Can I use quantitative characteristics of the emails to classify them as SPAM?

**Describe the approach**

- Collected data from UCI -> created training/test sets
- Explored relationships
- Chose logistic model on training set by cross validation
- Applied to test, 78% test set accuracy

**Interpret results**

Number of dollar signs seems reasonable, e.g. "Make more money with Viagra $ $ $ $"

**Challenge results**

- 78% isn't that great
- Could use more variables
- Why use logistic regression?

## Data Analysis Files

Data

- Raw Data
- Processed Data

Figures

- Exploratory Figures
- Final Figures

R Code

- Raw/Unused Scripts
- Final Scripts
- R Markdown Files

Text

- README files
- Text of Analysis/Report

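The layout above can be created from a script, in keeping with the rule to script everything. The folder names below are one possible convention, not a prescribed one, and a temporary directory stands in for the real project location.

```r
# Build the layout under a temporary folder; folder names are one possible convention.
projectDir <- file.path(tempdir(), "analysis")
folders <- c(
  "data/raw", "data/processed",
  "figures/exploratory", "figures/final",
  "code/raw", "code/final", "code/rmarkdown",
  "text"
)
for (folder in folders) {
  dir.create(file.path(projectDir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Start a README that will hold data sources and the script-to-data mapping.
readmeFile <- file.path(projectDir, "text", "README.md")
writeLines(
  c("# Analysis", "", "Raw data source, URL, and date accessed go here."),
  readmeFile
)

list.dirs(projectDir, full.names = FALSE)
```
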
### Raw Data

Should be stored in the analysis folder

If accessed from the web, include URL, description, and date accessed in README

### Processed Data

Processed data should be named so it is easy to see which script generated the data

The mapping from each processing script to the processed data it produces should be recorded in the README

Processed data should be tidy

### Exploratory Figures

Figures made during the course of your analysis, not necessarily part of your final report

They do not need to be "pretty"

### Final Figures

Usually a small subset of the original figures

Axes/Colors set to make the figure clear

Possibly multiple panels

### Raw Scripts

May be less commented (but comments help you!)

May be multiple versions

May include analyses that are later discarded

### Final Scripts

Clearly commented

- Small comments liberally - what, when, why, how
- Bigger commented blocks for whole sections

Include processing details

Only analyses that appear in the final write-up

### R Markdown Files

R Markdown files can be used to generate reproducible reports

Text and R code are integrated

Very easy to create in RStudio

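To show how text and R code sit together, the sketch below writes a tiny R Markdown document from R. The file name, title, and chunk labels are made up for illustration, and it uses R's built-in `cars` data set. Rendering is left commented out because it assumes the `rmarkdown` package is installed.

```r
# Three backticks, built programmatically so this example nests cleanly.
fence <- strrep("`", 3)

rmdLines <- c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  "The built-in `cars` data set has the following summary:",
  "",
  paste0(fence, "{r cars-summary}"),
  "summary(cars)",
  fence,
  "",
  paste0(fence, "{r cars-plot, echo=FALSE}"),
  "plot(cars)",
  fence
)

# Write the document to a temporary file.
rmdFile <- file.path(tempdir(), "report.Rmd")
writeLines(rmdLines, rmdFile)

# Rendering requires the rmarkdown package (an assumption about your setup):
# rmarkdown::render(rmdFile)
```

Each fenced chunk is run when the document is rendered, so the summary and the plot in the report always match the code that produced them.
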
### Readme Files

Not necessary if you use R Markdown

Should contain step-by-step instructions for analysis

### Text of the document

It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)

It should tell a story

It should not include every analysis you performed

References should be included for statistical methods