# Reproducible Research Week 1
## Replication
The ultimate standard for strengthening scientific evidence is replication of findings: conducting studies with independent
- Investigators
- Data
- Analytical Methods
- Laboratories
- Instruments
Replication is particularly important in studies that can impact broad policy or regulatory decisions
### What's wrong with replication?
Some studies cannot be replicated
- No time, opportunistic
- No money
- Unique
*Reproducible Research:* Make analytic data and code available so that others may reproduce findings
Reproducibility bridges the gap between replication (which is awesome) and doing nothing.
## Why do we need reproducible research?
New technologies increasing data collection throughput; data are more complex and extremely high dimensional
Existing databases can be merged into new "megadatabases"
Computing power is greatly increased, allowing more sophisticated analyses
For every field "X" there is a field "Computational X"
## Research Pipeline
Measured Data -> Analytic Data -> Computational Results -> Figures/Tables/Numeric Summaries -> Articles -> Text
Data/metadata used to develop the test should be made publicly available
The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available
"Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis needs to be transparently reported" -- IOM Report
### What do we need for reproducible research?
- Analytic data are available
- Analytic code are available
- Documentation of code and data
- Standard means of distribution
### Who is the audience for reproducible research?
Authors:
- Want to make their research reproducible
- Want tools for reproducible research to make their lives easier (or at least not much harder)
Readers:
- Want to reproduce (and perhaps expand upon) interesting findings
- Want tools for reproducible research to make their lives easier.
### Challenges for reproducible research
- Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
- Readers must download data/results individually and piece together which data go with which code sections, etc.
- Readers may not have the same resources as authors
- Few tools to help authors/readers
### What happens in reality
Authors:
- Just put stuff on the web
- Journal supplementary materials (infamous for disorganization)
- There are some central databases for various fields (e.g. biology, ICPSR)
Readers:
- Just download the data and (try to) figure it out
- Piece together the software and run it
## Literate (Statistical) Programming
An article is a stream of text and code
Analysis code is divided into text and code "chunks"
Each code chunk loads data and computes results
Presentation code formats results (tables, figures, etc.)
Article text explains what is going on
Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable documents
Literate programming is a general concept that requires
1. A documentation language (human readable)
2. A programming language (machine readable)
Knitr is an R package that supports a variety of documentation languages, such as LaTeX, Markdown, and HTML
### Quick summary so far
Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate
Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available
There is a growing number of tools for creating reproducible documents
**Golden Rule of Reproducibility: Script Everything**
## Steps in a Data Analysis
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modeling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
11. Create reproducible code
"Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?" -- Dan Myer
Defining a question is the most powerful dimension-reduction tool you can ever employ.
### An Example for #1
**Start with a general question**
Can I automatically detect emails that are SPAM or not?
**Make it concrete**
Can I use quantitative characteristics of emails to classify them as SPAM?
### Define the ideal data set
The data set may depend on your goal
- Descriptive goal -- a whole population
- Exploratory goal -- a random sample with many variables measured
- Inferential goal -- The right population, randomly sampled
- Predictive goal -- a training and test data set from the same population
- Causal goal -- data from a randomized study
- Mechanistic goal -- data about all components of the system
### Determine what data you can access
Sometimes you can find data free on the web
Other times you may need to buy the data
Be sure to respect the terms of use
If the data don't exist, you may need to generate it yourself.
### Obtain the data
Try to obtain the raw data
Be sure to reference the source
Polite emails go a long way
If you load the data from an Internet source, record the URL and time accessed
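A minimal sketch of recording this in R (the URL, file name, and directory are placeholders):
```R
fileUrl <- "https://example.com/dataset.csv"           # placeholder URL
download.file(fileUrl, destfile = "raw/dataset.csv")   # the "raw/" directory must already exist
dateDownloaded <- date()                               # record the access time alongside the URL
```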
### Clean the data
Raw data often needs to be processed
If it is pre-processed, make sure you understand how
Understand the source of the data (census, sample, convenience sample, etc)
May need reformatting, subsampling -- record these steps
**Determine if the data are good enough** -- If not, quit or change data
### Exploratory Data Analysis
Look at summaries of the data
Check for missing data
-> Why is there missing data?
Look for outliers
Create exploratory plots
Perform exploratory analyses such as clustering
If your plots are hard to read because the data are all bunched up, consider taking the log base 10 of an axis
`plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)`
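A sketch of a few other exploratory steps on the same (assumed) `trainSpam` data frame from the lecture example:
```R
summary(trainSpam)                  # numeric summaries; also reveals missing values
table(trainSpam$type)               # how many spam vs. nonspam messages?
boxplot(capitalAve ~ type, data = trainSpam)             # bunched up near zero
boxplot(log10(capitalAve + 1) ~ type, data = trainSpam)  # log10 transform spreads it out
```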
### Statistical prediction/modeling
Should be informed by the results of your exploratory analysis
Exact methods depend on the question of interest
Transformations/processing should be accounted for when necessary
Measures of uncertainty should be reported.
### Interpret Results
Use the appropriate language
- Describes
- Correlates with/associated with
- Leads to/Causes
- Predicts
Gives an explanation
Interpret Coefficients
Interpret measures of uncertainty
### Challenge Results
Challenge all steps:
- Question
- Data Source
- Processing
- Analysis
- Conclusions
Challenge measures of uncertainty
Challenge choices of terms to include in models
Think of potential alternative analyses
### Synthesize/Write-up Results
Lead with the question
Summarize the analyses into the story
Don't include every analysis; include one only
- If it is needed for the story
- If it is needed to address a challenge
- Order analyses according to the story, rather than chronologically
- Include "pretty" figures that contribute to the story
### In the lecture example...
Lead with the question
Can I use quantitative characteristics of the emails to classify them as SPAM?
Describe the approach
Collected data from UCI -> created training/test sets
Explored Relationships
Chose a logistic regression model on the training set by cross-validation
Applied it to the test set: 78% test set accuracy
Interpret results
The number of dollar signs seems reasonable, e.g. "Make more money with Viagra $ $ $ $"
Challenge Results
78% isn't that great
Could use more variables
Why use logistic regression?
## Data Analysis Files
Data
- Raw Data
- Processed Data
Figures
- Exploratory Figures
- Final Figures
R Code
- Raw/Unused Scripts
- Final Scripts
- R Markdown Files
Text
- README files
- Text of Analysis/Report
### Raw Data
Should be stored in the analysis folder
If accessed from the web, include URL, description, and date accessed in README
### Processed Data
Processed data should be named so it is easy to see which script generated the data
The mapping from processing script to processed data should be recorded in the README
Processed data should be tidy
### Exploratory Figures
Figures made during the course of your analysis, not necessarily part of your final report
They do not need to be "pretty"
### Final Figures
Usually a small subset of the original figures
Axes/Colors set to make the figure clear
Possibly multiple panels
### Raw Scripts
May be less commented (but comments help you!)
May be multiple versions
May include analyses that are later discarded
### Final Scripts
Clearly commented
- Small comments liberally - what, when, why, how
- Bigger commented blocks for whole sections
Include processing details
Only analyses that appear in the final write-up
### R Markdown Files
R Markdown files can be used to generate reproducible reports
Text and R code are integrated
Very easy to create in RStudio
### Readme Files
Not necessary if you use R Markdown
Should contain step-by-step instructions for analysis
### Text of the document
It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)
It should tell a story
It should not include every analysis you performed
References should be included for statistical methods

## Coding Standards for R
1. Always use text files/text editor
2. Indent your code
3. Limit the width of your code (80 columns?)
4. Author suggests indentation of 4 spaces at minimum
5. Limit the length of individual functions
## What is Markdown?
Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML/HTML
## Markdown Syntax
`*This text will appear italicized!*`
*This text will appear italicized!*
`**This text will appear bold!**`
**This text will appear bold!**
`## This is a secondary heading`
`### This is a tertiary heading`
## This is a secondary heading
### This is a tertiary heading
Unordered Lists
`- first item in list`
`- second item in list`
- first item in list
- second item in list
Ordered lists
`1. first item in list`
`2. second item in list`
`3. third item in list`
1. first item in list
2. second item in list
3. third item in list
Create links
`[Download R](http://www.r-project.org/)`
[Download R](http://www.r-project.org/)
Advanced linking
`I spent so much time reading [R bloggers][1] and [Simply Statistics][2]!`
`[1]: http://www.r-bloggers.com/ "R bloggers"`
`[2]: http://simplystatistics.org/ "Simply Statistics"`
I spent so much time reading [R bloggers][1] and [Simply Statistics][2]!
[1]: http://www.r-bloggers.com/ "R bloggers"
[2]: http://simplystatistics.org/ "Simply Statistics"
Newlines require a double space after the end of a line
## What is Markdown?
Created by John Gruber and Aaron Swartz. It is a simplified version of "markup" languages. It allows one to focus on writing as opposed to formatting. Markdown provides a simple, minimal, and intuitive way of formatting elements.
You can easily convert Markdown to valid HTML (and other formats) using existing tools.
## What is R Markdown?
R Markdown is the integration of R code with markdown. It allows one to create documents containing "live" R code. R code is evaluated as part of the processing of the markdown and its results are inserted into the Markdown document. R Markdown is a core tool in **literate statistical programming**
R Markdown can be converted to standard Markdown using the `knitr` package in R. Markdown can then be converted to HTML using the `markdown` package in R. This workflow can be easily managed using RStudio. One can create PowerPoint-like slides using the `slidify` package.
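A minimal sketch of that two-step workflow (the file names are made up):
```R
library(knitr)
library(markdown)
knit("document.Rmd")                            # produces document.md
markdownToHTML("document.md", "document.html")  # produces document.html
```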
## Problems, Problems
- Authors must undertake considerable effort to put data/results on the web
- Readers must download data/results individually and piece together which data go with which code sections, etc.
- Authors/readers must manually interact with websites
- There is no single document that integrates the data analysis with textual representations; i.e. data, code, and text are not linked
One of the ways to resolve this is to simply put the data and code together in the same document so that people can execute the code in the right order, and the data are read at the right times. You can have a single document that integrates the data analysis with all the textual representations.
## Literate Statistical Programming
- Original idea comes from Don Knuth
- An article is a stream of **text** and **code**
- Analysis code is divided into text and code "chunks"
- Presentation code formats results (tables, figures, etc.)
- Article text explains what is going on
- Literate programs are weaved to produce human-readable documents and tangled to produce machine-readable documents.
## Literate Statistical Programming
- Literate programming is a general concept. We need
- A documentation language
- A programming language
- `knitr` supports a variety of documentation languages
## How Do I Make My Work Reproducible?
- Decide to do it (ideally from the start)
- Keep track of everything, hopefully through a version control system
- Use software in which operations can be coded
- Don't save output
- Save data in non-proprietary formats
## Literate Programming: Pros
- Text and code all in one place, logical order
- Data, results automatically updated to reflect external changes
- Code is live -- automatic "regression test" when building a document
## Literate Programming: Cons
- Text and code are all in one place; can make documents difficult to read, especially if there is a lot of code
- Can substantially slow down processing of documents (although there are tools to help)
## What is Knitr Good For?
- Manuals
- Short/Medium-Length technical documents
- Tutorials
- Reports (Especially if generated periodically)
- Data Preprocessing documents/summaries
## What is knitr NOT good for?
- Very long research articles
- Complex time-consuming computations
- Documents that require precise formatting
## Non-GUI Way of Creating R Markdown documents
```R
library(knitr)
setwd("~/path/to/working/directory")  # replace with your working directory
knit2html('document.Rmd')
browseURL('document.html')
```
## A few notes about knitr
- knitr will fill a new document with filler text; delete it
- Code chunks begin with ` ```{r}` and end with ` ``` `
- All R code goes in between these markers
- Code chunks can have names, which is useful when we start making graphics
` ```{r firstchunk}`
`## R code goes here`
` ``` `
- By default, code in a code chunk is echoed, as will the results of the computation (if there are results to print)
## Processing of knitr documents
- You write RMarkdown document (.Rmd)
- knitr produces a Markdown document (.md)
- knitr converts the Markdown document into HTML (by default)
- .Rmd -> .md -> .html
- You should NOT edit (or save) the .md or .html documents until you are finished
## Inline Text Computations
You can reference variables inline in R Markdown as follows
```
`The current time is `r time`. My favorite random number is `r rand`
```
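For this to work, `time` and `rand` need to be computed in an earlier code chunk; a minimal sketch (the chunk name is made up):
` ```{r computetime, echo=FALSE}`
`time <- format(Sys.time(), "%a %b %d %X %Y")`
`rand <- rnorm(1)`
` ``` `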
## Setting Global Options
- Sometimes we want to set options for every code chunk that are different from the defaults
- For example, we may want to suppress all code echoing and results output
- We have to write some code to set these global options
Example of suppressing code echoing and results output for all chunks:
` ```{r setoptions, echo=FALSE}`
`opts_chunk$set(echo = FALSE, results = "hide")`
` ``` `
## Some Common Options
- Output
- Results: "axis", "hide"
- echo: TRUE, FALSE
- Figures
- fig.height: numeric
- fig.width: numeric
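As a sketch (the chunk name is made up), per-chunk options go in the chunk header; `pressure` is a built-in R dataset:
` ```{r pressureplot, echo=FALSE, results="hide", fig.height=4, fig.width=6}`
`plot(pressure)  # the figure appears in the output; the code and printed results are suppressed`
` ``` `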
## Caching Computations
- What if one chunk takes a long time to run?
- All chunks have to be re-computed every time you re-knit the file
- The `cache=TRUE` option can be set on a chunk-by-chunk basis to store results of computation
- After the first run, results are loaded from cache
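A sketch (the chunk name is made up) of caching one slow chunk:
` ```{r slowsim, cache=TRUE}`
`simResults <- replicate(1000, mean(rnorm(10000)))  # re-computed only if this chunk changes`
` ``` `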
## Caching Caveats
- If the data or code (or anything external) changes, you need to re-run the cached code chunks
- Dependencies are not checked explicitly!!!!
- Chunks with significant *side effects* may not be cacheable
## Summary of knitr
- Literate statistical programming can be a useful way to put text, code, data, output all in one document
- knitr is a powerful tool for integrating code and text in a simple document format

## tl;dr
People are busy, especially managers and leaders. Results of data analyses are sometimes presented in oral form, but often the first cut is presented via email.
It is often useful, therefore, to break down the results of an analysis into different levels of granularity/detail
## Hierarchy of Information: Research Paper
- Title / Author List
- Speaks about what the paper is about
- Hopefully interesting
- No detail
- Abstract
- Motivation of the problem
- Bottom Line Results
- Body / Results
- Methods
- More detailed results
- Sensitivity Analysis
- Implication of Results
- Supplementary Materials / Gory Details
- Details on what was done
- Code / Data / Really Gory Details
- For reproducibility
## Hierarchy of Information: Email Presentation
- Subject Line / Subject Info
- At a minimum: include one
- Can you summarize findings in one sentence?
- Email Body
- A brief description of the problem / context: recall what was proposed and executed; summarize findings / results. (Total of 1-2 paragraphs)
  - If action needs to be taken as a result of this presentation, suggest some options and make them as concrete as possible
- If questions need to be addressed, try to make them yes / no
- Attachment(s)
- R Markdown file
- knitr report
- Stay Concise: Don't spit out pages of code
- Links to Supplementary Materials
- Code / Software / Data
- Github Repository / Project Website
## DO: Start with Good Science
- Remember: garbage in, garbage out
- Find a coherent focused question. This helps solve many problems
- Working with good collaborators reinforces good practices
- Something that's interesting to you will hopefully motivate good habits
## DON'T: Do Things By Hand
- Editing spreadsheets of data to "clean it up"
- Removing outliers
- QA / QC
- Validating
- Editing tables or figures (e.g. rounding, formatting)
- Downloading data from a website
- Moving data around your computer, splitting, or reformatting files.
Things done by hand need to be precisely documented (this is harder than it sounds!)
## DON'T: Point and Click
- Many data processing / statistical analysis packages have graphical user interfaces (GUIs)
- GUIs are convenient / intuitive but the actions you take with a GUI can be difficult for others to reproduce
- Some GUIs produce a log file or script which includes equivalent commands; these can be saved for later examination
- In general, be careful with data analysis software that is highly interactive; ease of use can sometimes lead to non-reproducible analyses.
- Other interactive software, such as text editors, is usually fine.
## DO: Teach a Computer
If something needs to be done as part of your analysis / investigation, try to teach your computer to do it (even if you only need to do it once)
In order to give your computer instructions, you need to write down exactly what you mean to do and how it should be done. Teaching a computer almost guarantees reproducibility
For example, by hand you can:
1. Go to the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/
2. Download the Bike Sharing Dataset
Or you can teach your computer to do it using R
```R
# the destination directory ("ProjectData/") must already exist
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip",
              "ProjectData/Bike-Sharing-Dataset.zip")
```
Notice here that:
- The full URL to the dataset file is specified
- The name of the file saved to your local computer is specified
- The directory to which the file was saved is specified ("ProjectData")
- Code can always be executed in R (as long as link is available)
## DO: Use Some Version Control
It helps you slow down and commit changes in small chunks (don't just do one massive commit). It allows you to track / tag snapshots so that you can revert to older versions of the project. Services like GitHub / Bitbucket / SourceForge make it easy to publish results.
## DO: Keep Track of Your Software Environment
If you work on a complex project involving many tools / datasets, the software and computing environment can be critical for reproducing your analysis.
**Computer Architecture**: CPU (Intel, AMD, ARM), CPU Architecture, GPUs
**Operating System**: Windows, Mac OS, Linux / Unix
**Software Toolchain**: Compilers, interpreters, command shell, programming language (C, Perl, Python, etc.), database backends, data analysis software
**Supporting software / infrastructure**: Libraries, R packages, dependencies
**External dependencies**: Websites, data repositories, remote databases, software repositories
**Version Numbers:** Ideally, for everything (if available)
This function in R helps report a bunch of information relating to the software environment
```R
sessionInfo()
```
## DON'T: Save Output
Avoid saving data analysis output (tables, figures, summaries, processed data, etc.), except perhaps temporarily for efficiency purposes.
If a stray output file cannot be easily connected with the means by which it was created, then it is not reproducible
Save the data + code that generated the output, rather than the output itself.
Intermediate files are okay as long as there is clear documentation of how they were created.
## DO: Set Your Seed
Random number generators generate pseudo-random numbers based on an initial seed (usually a number or set of numbers)
In R, you can use the `set.seed()` function to set the seed and to specify the random number generator to use
Setting the seed allows for the stream of random numbers to be exactly reproducible
Whenever you generate random numbers for a non-trivial purpose, **always set the seed**.
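For example:
```R
set.seed(20150401)   # any integer works; record the one you used
rnorm(5)             # this sequence of five numbers is now exactly reproducible
set.seed(20150401)
rnorm(5)             # same five numbers again
```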
## DO: Think About the Entire Pipeline
- Data analysis is a lengthy process; it is not just tables / figures/ reports
- Raw data -> processed data -> analysis -> report
- How you got to the end is just as important as the end itself
- The more of the data analysis pipeline you can make reproducible, the better for everyone
## Summary: Checklist
- Are we doing good science?
- Is this interesting or worth doing?
- Was any part of this analysis done by hand?
- If so, are those parts precisely documented?
- Does the documentation match reality?
- Have we taught a computer to do as much as possible (i.e. coded)?
- Are we using a version control system?
- Have we documented our software environment?
- Have we saved any output that we cannot reconstruct from original data + code?
- How far back in the analysis pipeline can we go before our results are no longer (automatically) reproducible?
## Replication and Reproducibility
Replication
- Focuses on the validity of the scientific claim
- Is this claim true?
- The ultimate standard for strengthening scientific evidence
- New investigators, data, analytical methods, laboratories, instruments, etc.
- Particularly important in studies that can impact broad policy or regulatory decisions.
Reproducibility
- Focuses on the validity of the data analysis
- Can we trust this analysis?
- Arguably a minimum standard for any scientific study
- New investigators, same data, same methods
- Important when replication is impossible
## Background and Underlying Trends
- Some studies cannot be replicated: No time, no money, or just plain unique / opportunistic
- Technology is increasing data collection throughput; data are more complex and high-dimensional
- Existing databases can be merged to become bigger databases (but data are used off-label)
- Computing power allows more sophisticated analyses, even on "small" data
- For every field "X", there is a "Computational X"
## The Result?
- Even basic analyses are difficult to describe
- Heavy computational requirements are thrust upon people without adequate training in statistics and computing
- Errors are more easily introduced into long analysis pipelines
- Knowledge transfer is inhibited
- Results are difficult to replicate or reproduce
- Complicated analyses cannot be trusted
## What Problem Does Reproducibility Solve?
What we get:
- Transparency
- Data Availability
- Software / Methods Availability
- Improved Transfer of Knowledge
What we do NOT get:
- Validity / Correctness of the analysis
An analysis can be reproducible and still be wrong
We want to know: "Can we trust this analysis?"
Does requiring reproducibility deter bad analysis?
## Problems with Reproducibility
The premise of reproducible research is that with data/code available, people can check each other and the whole system is self-correcting
- Addresses the most "downstream" aspect of the research process -- Post-publication
- Assumes everyone plays by the same rules and wants to achieve the same goals (i.e. scientific discovery)
## Who Reproduces Research?
- For reproducibility to be effective as a means to check validity, someone needs to do something
- Re-run the analysis; check results match
- Check the code for bugs/errors
- Try alternate approaches; check sensitivity
- The need for someone to do something is inherited from the traditional notion of replication
- Who is "someone" and what are their goals?
## The Story So Far
- Reproducibility brings transparency (wrt code+data) and increased transfer of knowledge
- A lot of discussion about how to get people to share data
- The key question, "Can we trust this analysis?", is not addressed by reproducibility
- Reproducibility addresses potential problems long after they've occurred ("downstream")
- Secondary analyses are inevitably colored by the interests/motivations of others.
## Evidence-based Data Analysis
- Most data analyses involve stringing together many different tools and methods
- Some methods may be standard for a given field, but others are often applied ad hoc
- We should apply thoroughly studied (via statistical research), mutually agreed upon methods to analyze data whenever possible
- There should be evidence to justify the application of a given method
## Evidence-based Data Analysis
- Create analytic pipelines from evidence-based components - standardize it
- A deterministic statistical machine
- Once an evidence-based analytic pipeline is established, we shouldn't mess with it
- Analysis with a "transparent box"
- Reduce the "research degrees of freedom"
- Analogous to a pre-specified clinical trial protocol
## Case Study: Estimating Acute Effects of Ambient Air Pollution Exposure
- Acute / Short-term effects typically estimated via panel studies or time series studies
- Work originated in the late 1970s and early 1980s
- Key question "Are short-term changes in pollution associated with short-term changes in a population health outcome?"
- Studies are usually conducted at a community level
- Long history of statistical research investigating proper methods of analysis
## Case Study: Estimating Acute Effects of Ambient Air Pollution Exposure
- Can we encode everything that we have found in statistical / epidemiological research into a single package?
- Time series studies do not have a huge range of variation; they typically involve similar types of data and similar questions
- Can we create a deterministic statistical machine for this area?
## DSM Modules for Time Series Studies of Air Pollution and Health
1. Check for outliers, high leverage, overdispersion
2. Fill in missing data? No!
3. Model selection: Estimate degrees of freedom to adjust for unmeasured confounders
- Other aspects of model not as critical
4. Multiple lag analysis
5. Sensitivity analysis wrt
- Unmeasured confounder adjustment
- Influential points
## Where to Go From Here?
- One DSM is not enough, we need many!
- Different problems warrant different approaches and expertise
- A curated library of machines providing state-of-the-art analysis pipelines
- A CRAN/CPAN/CTAN/... for data analysis
- Or a "Cochrane Collaboration" for data analysis
## A Curated Library of Data Analysis
- Provide packages that encode data analysis pipelines for given problems, technologies, questions
- Curated by experts knowledgeable in the field
- Documentation / references given supporting each module in the pipeline
- Changes introduced after passing relevant benchmarks/unit tests
## Summary
- Reproducible research is important, but does not necessarily solve the critical question of whether a data analysis is trustworthy
- Reproducible research focuses on the most "downstream" aspect of research documentation
- Evidence-based data analysis would provide standardized best practices for given scientific areas and questions
- Gives reviewers an important tool without dramatically increasing the burden on them
- More effort should be put into improving the quality of "upstream" aspects of scientific research

## The `cacher` Package for R
- Add-on package for R
- Evaluates code written in files and stores intermediate results in a key-value database
- R expressions are given SHA-1 hash values so that changes can be tracked and code reevaluated if necessary
- "Chacher packages" can be built for distribution
- Others can "clone" an analysis and evaluate subsets of code or inspect data objects
The value of this is that other people can clone the analysis and look at subsets of the code, or, more specifically, at the data objects. People who want to run your code may not have the resources that you have; because of that, they may not want to run the entire Markov chain Monte Carlo simulation that you did to get the posterior distribution or the histogram at the end.
The idea is that you peel the onion a little bit rather than going straight to the core.
## Using `cacher` as an Author
1. Parse the R source file; create the necessary cache directories and subdirectories
2. Cycle through each expression in the source file
- If an expression has never been evaluated, evaluate it and store any resulting R objects in the cache database
  - If any cached results exist, lazy-load the results from the cache database and move to the next expression
- If an expression does not create any R objects (i.e, there is nothing to cache), add the expression to the list of expressions where evaluation needs to be forced
- Write out metadata for this expression to the metadata file
- The `cachepackage` function creates a `cacher` package storing
- Source File
- Cached data objects
- Metadata
- Package file is zipped and can be distributed
- Readers can unzip the file and immediately investigate its contents via `cacher` package
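A sketch of the author-side calls (the file name is made up, and the exact arguments are an assumption based on the description above):
```R
library(cacher)
cacher("analysis.R")   # evaluate the source file, caching results as described above
cachepackage()         # bundle source, cached objects, and metadata for distribution
```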
## Using `cacher` as a Reader
A journal article says
"... the code and data for this analysis can be found in the cacher package 092dcc7dda4b93e42f23e038a60e1d44dbec7b3f"
```R
library(cacher)
clonecache(id = "092dcc7dda4b93e42f23e038a60e1d44dbec7b3f")
clonecache(id = "092d") ## Same as above
# Created cache directory `.cache`
showfiles()
# [1] "top20.R"
sourcefile("top20.R")
```
## Cloning an Analysis
- Local directories are created
- Source code files and metadata are downloaded
- Data objects are *not* downloaded by default (may be really large)
- References to data objects are loaded and corresponding data can be lazy-loaded on demand
`graphcode()` gives a node graph representing the code
## Running Code
- The `runcode` function executes code in the source file
- By default, expressions that result in an object being created are not run, and the resulting objects are lazy-loaded into the workspace
- Expressions not resulting in objects are evaluated
## Checking Code and Objects
- The `checkcode` function evaluates all expressions from scratch (no lazy-loading)
- Results of evaluation are checked against stored results to see if the results are the same as what the author calculated
- Setting RNG seeds is critical for this to work
- The integrity of data objects can be verified with the `checkobjects` function to check for possible corruption of data perhaps during transit.
You can inspect data objects with `loadcache`. This loads in pointers to each of the data objects into the workspace. Once you access the object, it will transfer it from the cache.
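A sketch of a reader's session continuing the example above (default arguments are assumed; object names depend on the analysis being inspected):
```R
runcode()      # run the analysis; objects are lazy-loaded from the cache where possible
checkcode()    # re-evaluate everything from scratch and compare against cached results
loadcache()    # load pointers to cached objects; an object is fetched when first accessed
```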
## `cacher` Summary
- The `cacher` package can be used by authors to create cache packages from data analyses for distribution
- Readers can use the `cacher` package to inspect others' data analyses by examining cached computations
- `cacher` is mindful of readers' resources and efficiently loads only those data objects that are needed.
# Case Study: Air Pollution
Particulate Matter -- PM
When doing air pollution studies you're looking at particulate matter pollution. The dust is not just one monolithic piece of dirt or soot; it is actually composed of many different chemical constituents.
These include metals and inert things like salts, among other components, so there is a possibility that a subset of those constituents are the really harmful elements.
PM is composed of many different chemical constituents, and it's important to understand that the Environmental Protection Agency (EPA) monitors the chemical constituents of particulate matter and has been doing so on a national basis since 1999 or 2000.
## What causes PM to be Toxic?
- PM is composed of many different chemical elements
- Some components of PM may be more harmful than others
- Some sources of PM may be more dangerous than others
- Identifying harmful chemical constituents may lead us to strategies for controlling sources of PM
## NMMAPS
- The National Morbidity, Mortality, and Air Pollution Study (NMMAPS) was a national study of the short-term health effects of ambient air pollution
- Focused primarily on particulate matter ($PM_{10}$) and Ozone ($O_3$)
- Health outcomes include mortality from all causes and hospitalizations for cardiovascular and respiratory diseases
- Key publications
- http://www.ncbi.nlm.nih.gov/pubmed/11098531
- http://www.ncbi.nlm.nih.gov/pubmed/11354823
- Funded by the Health Effects Institute
- Roger Peng currently serves on the Health Effects Institute Health Review Committee
## NMMAPS and Reproducibility
- Data made available at the Internet-based Health and Air Pollution Surveillance System (http://www.ihapss.jhsph.edu)
- Research and software also available at iHAPSS
- Many studies (over 67 published) have been conducted based on the public data http://www.ncbi.nlm.nih.gov/pubmed/22475833
- Has served as an important test bed for methodological development
## What Causes Particulate Matter to be Toxic?
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1665439
- Lippmann et al. found strong evidence that Ni modified the short-term effect of $PM_{10}$ across 60 US communities
- No other PM chemical constituent seemed to have the same modifying effect
- Too simple to be true?
## A Reanalysis of the Lippmann et al. Study
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2137127
- Re-examined the data from NMMAPS and linked them with PM chemical constituent data
- Are the findings sensitive to levels of nickel in New York City?
## Does Nickel Make PM Toxic?
- Long-term average nickel concentrations appear correlated with PM risk
- There appear to be some outliers on the right-hand side (New York City)
## Does Nickel Make PM Toxic?
One of the most important things about those three points on the right is that they are high leverage points, and the regression line can be very sensitive to high leverage points. Removing those three points from the dataset brings the regression line's slope down a little bit, which produces a relationship that is no longer statistically significant (p-value about 0.31).
## What Have We Learned?
- New York does have very high levels of nickel and vanadium, much higher than any other US community
- There is evidence of a positive relationship between Ni concentrations and $PM_{10}$ risk
- The strength of this relationship is highly sensitive to the observations from New York City
- Most of the information in the data is derived from just 3 observations
## Lessons Learned
- Reproducibility of NMMAPS allowed for a secondary analysis (and linking with PM chemical constituent data) investigating a novel hypothesis (Lippmann et al.)
- Reproducibility also allowed for a critique of that new analysis and some additional new analysis (Dominici et al.)
- Original hypothesis not necessarily invalidated, but evidence not as strong as originally suggested (more work should be done)
- Reproducibility allows for the scientific discussion to occur in a timely and informed manner
- This is how science works.