<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="author" content="Brandon Rozek"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="robots" content="noindex" /> <title>Brandon Rozek</title> <link rel="stylesheet" href="themes/bitsandpieces/styles/main.css" type="text/css" /> <link rel="stylesheet" href="themes/bitsandpieces/styles/highlightjs-github.css" type="text/css" /> </head> <body> <aside class="main-nav"> <nav> <ul> <li class="menuitem "> <a href="index.html%3Findex.html" data-shortcut=""> Home </a> </li> <li class="menuitem "> <a href="index.html%3Fcourses.html" data-shortcut=""> Courses </a> </li> <li class="menuitem "> <a href="index.html%3Flabaide.html" data-shortcut=""> Lab Aide </a> </li> <li class="menuitem "> <a href="index.html%3Fpresentations.html" data-shortcut=""> Presentations </a> </li> <li class="menuitem "> <a href="index.html%3Fresearch.html" data-shortcut=""> Research </a> </li> <li class="menuitem "> <a href="index.html%3Ftranscript.html" data-shortcut=""> Transcript </a> </li> </ul> </nav> </aside> <main class="main-content"> <article class="article"> <h1>Reproducible Research Week 1</h1> <h2>Replication</h2> <p>The ultimate standard for strengthening scientific evidence is replication of findings and conducting studies with independent</p> <ul> <li>Investigators</li> <li>Data</li> <li>Analytical Methods</li> <li>Laboratories</li> <li>Instruments</li> </ul> <p>Replication is particularly important in studies that can impact broad policy or regulatory decisions</p> <h3>What's wrong with replication?</h3> <p>Some studies cannot be replicated</p> <ul> <li>No time, opportunistic</li> <li>No money</li> <li>Unique</li> </ul> <p><em>Reproducible Research:</em> Make analytic data and code available so that others may reproduce findings</p> <p>Reproducibility bridges the gap between replication, which is awesome, and doing nothing.</p> <h2>Why do we need reproducible research?</h2> <p>New
technologies are increasing data collection throughput; data are more complex and extremely high dimensional</p> <p>Existing databases can be merged into new "megadatabases"</p> <p>Computing power is greatly increased, allowing more sophisticated analyses</p> <p>For every field "X" there is a field "Computational X"</p> <h2>Research Pipeline</h2> <p>Measured Data -> Analytic Data -> Computational Results -> Figures/Tables/Numeric Summaries -> Articles -> Text</p> <p>Data/metadata used to develop the test should be made publicly available</p> <p>The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available</p> <p>"Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis need to be transparently reported" -- IOM Report</p> <h3>What do we need for reproducible research?</h3> <ul> <li>Analytic data are available</li> <li>Analytic code is available</li> <li>Documentation of code and data</li> <li>Standard means of distribution</li> </ul> <h3>Who is the audience for reproducible research?</h3> <p>Authors:</p> <ul> <li>Want to make their research reproducible</li> <li>Want tools for reproducible research to make their lives easier (or at least not much harder)</li> </ul> <p>Readers:</p> <ul> <li>Want to reproduce (and perhaps expand upon) interesting findings</li> <li>Want tools for reproducible research to make their lives easier</li> </ul> <h3>Challenges for reproducible research</h3> <ul> <li>Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)</li> <li>Readers must download data/results individually and piece together which data go with which code sections, etc.</li> <li>Readers may not have the same resources as authors</li> <li>Few tools to help authors/readers</li> </ul> <h3>What happens in
reality</h3> <p>Authors:</p> <ul> <li>Just put stuff on the web</li> <li>Journal supplementary materials (infamous for disorganization)</li> <li>There are some central databases for various fields (e.g. biology, ICPSR)</li> </ul> <p>Readers:</p> <ul> <li>Just download the data and (try to) figure it out</li> <li>Piece together the software and run it</li> </ul> <h2>Literate (Statistical) Programming</h2> <p>An article is a stream of text and code</p> <p>Analysis code is divided into text and code "chunks"</p> <p>Each code chunk loads data and computes results</p> <p>Presentation code formats results (tables, figures, etc.)</p> <p>Article text explains what is going on</p> <p>Literate programs can be woven to produce human-readable documents and tangled to produce machine-readable documents</p> <p>Literate programming is a general concept that requires</p> <ol> <li>A documentation language (human readable)</li> <li>A programming language (machine readable)</li> </ol> <p>knitr is an R package that supports a variety of documentation languages such as LaTeX, Markdown, and HTML</p> <h3>Quick summary so far</h3> <p>Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate</p> <p>Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available</p> <p>There is a growing number of tools for creating reproducible documents</p> <p><strong>Golden Rule of Reproducibility: Script Everything</strong></p> <h2>Steps in a Data Analysis</h2> <ol> <li>Define the question</li> <li>Define the ideal data set</li> <li>Determine what data you can access</li> <li>Obtain the data</li> <li>Clean the data</li> <li>Exploratory data analysis</li> <li>Statistical prediction/modeling</li> <li>Interpret results</li> <li>Challenge results</li> <li>Synthesize/write up results</li> <li>Create reproducible code</li> </ol> <p>"Ask yourselves, what problem have you solved, ever, that was worth solving,
where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?" -- Dan Meyer</p> <p>Defining a question is probably the most powerful dimension reduction tool you can ever employ.</p> <h3>An Example for #1</h3> <p><strong>Start with a general question</strong></p> <p>Can I automatically detect whether emails are SPAM or not?</p> <p><strong>Make it concrete</strong></p> <p>Can I use quantitative characteristics of emails to classify them as SPAM?</p> <h3>Define the ideal data set</h3> <p>The data set may depend on your goal</p> <ul> <li>Descriptive goal -- a whole population</li> <li>Exploratory goal -- a random sample with many variables measured</li> <li>Inferential goal -- the right population, randomly sampled</li> <li>Predictive goal -- a training and test data set from the same population</li> <li>Causal goal -- data from a randomized study</li> <li>Mechanistic goal -- data about all components of the system</li> </ul> <h3>Determine what data you can access</h3> <p>Sometimes you can find data free on the web</p> <p>Other times you may need to buy the data</p> <p>Be sure to respect the terms of use</p> <p>If the data don't exist, you may need to generate them yourself.</p> <h3>Obtain the data</h3> <p>Try to obtain the raw data</p> <p>Be sure to reference the source</p> <p>Polite emails go a long way</p> <p>If you load the data from an Internet source, record the URL and time accessed</p> <h3>Clean the data</h3> <p>Raw data often need to be processed</p> <p>If the data are pre-processed, make sure you understand how</p> <p>Understand the source of the data (census, sample, convenience sample, etc.)</p> <p>The data may need reformatting or subsampling -- record these steps</p> <p><strong>Determine if the data are good enough</strong> -- If not, quit or change data</p> <h3>Exploratory Data Analysis</h3> <p>Look at summaries of the data</p> <p>Check for missing
data</p> <p>-> Why are the data missing?</p> <p>Look for outliers</p> <p>Create exploratory plots</p> <p>Perform exploratory analyses such as clustering</p> <p>If your plots are hard to read because the points are bunched up, consider taking the log base 10 of an axis</p> <p><code>plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)</code></p> <h3>Statistical prediction/modeling</h3> <p>Should be informed by the results of your exploratory analysis</p> <p>Exact methods depend on the question of interest</p> <p>Transformations/processing should be accounted for when necessary</p> <p>Measures of uncertainty should be reported.</p> <h3>Interpret Results</h3> <p>Use the appropriate language</p> <ul> <li>Describes</li> <li>Correlates with/associated with</li> <li>Leads to/Causes</li> <li>Predicts</li> </ul> <p>Give an explanation</p> <p>Interpret coefficients</p> <p>Interpret measures of uncertainty</p> <h3>Challenge Results</h3> <p>Challenge all steps:</p> <ul> <li>Question</li> <li>Data Source</li> <li>Processing</li> <li>Analysis</li> <li>Conclusions</li> </ul> <p>Challenge measures of uncertainty</p> <p>Challenge choices of terms to include in models</p> <p>Think of potential alternative analyses</p> <h3>Synthesize/Write-up Results</h3> <p>Lead with the question</p> <p>Summarize the analyses into the story</p> <p>Don't include every analysis; include one only</p> <ul> <li>If it is needed for the story</li> <li>If it is needed to address a challenge</li> </ul> <p>Order analyses according to the story, rather than chronologically</p> <p>Include "pretty" figures that contribute to the story</p> <h3>In the lecture example...</h3> <p>Lead with the question</p> <p> Can I use quantitative characteristics of the emails to classify them as SPAM?</p> <p>Describe the approach</p> <p> Collected data from UCI -> created training/test sets</p> <p> Explored relationships</p> <p> Chose a logistic model on the training set by cross-validation</p> <p> Applied to the test set, 78% test set accuracy</p>
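<p>The approach described above could be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not the lecture's actual code: it assumes the <code>spam</code> data set shipped with the <code>kernlab</code> package (a copy of the UCI Spambase data), a 50/50 random split with an arbitrary seed, <code>charDollar</code> as the single predictor, and a 0.5 classification cutoff. It also skips the cross-validated model selection step, so the accuracy it prints may differ from the 78% noted above.</p>
<pre><code class="language-r"># Assumption: the kernlab package is installed; its `spam` data set
# is a copy of the UCI Spambase data.
library(kernlab)
data(spam)

# Split rows at random into training and test halves (seed is arbitrary)
set.seed(1)
inTrain &lt;- rbinom(nrow(spam), size = 1, prob = 0.5) == 1
trainSpam &lt;- spam[inTrain, ]
testSpam &lt;- spam[!inTrain, ]

# Fit a logistic model with one predictor: frequency of "$" characters
fit &lt;- glm(type ~ charDollar, data = trainSpam, family = binomial)

# Predict P(spam) on the test set and classify with a 0.5 cutoff
pSpam &lt;- predict(fit, newdata = testSpam, type = "response")
predicted &lt;- ifelse(pSpam &gt; 0.5, "spam", "nonspam")

# Confusion table and overall test set accuracy
table(predicted, actual = testSpam$type)
mean(predicted == testSpam$type)
</code></pre>
<p>Keeping the split, the model fit, and the evaluation in one script follows the "Script Everything" rule: anyone with the same data and seed can rerun the whole analysis.</p>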
<p>Interpret results</p> <p> Number of dollar signs seems reasonable, e.g. "Make more money with Viagra $ $ $ $"</p> <p>Challenge Results</p> <p> 78% isn't that great</p> <p> Could use more variables</p> <p> Why use logistic regression?</p> <h2>Data Analysis Files</h2> <p>Data</p> <ul> <li>Raw Data</li> <li>Processed Data</li> </ul> <p>Figures</p> <ul> <li>Exploratory Figures</li> <li>Final Figures</li> </ul> <p>R Code</p> <ul> <li>Raw/Unused Scripts</li> <li>Final Scripts</li> <li>R Markdown Files</li> </ul> <p>Text</p> <ul> <li>README files</li> <li>Text of Analysis/Report</li> </ul> <h3>Raw Data</h3> <p>Should be stored in the analysis folder</p> <p>If accessed from the web, include URL, description, and date accessed in README</p> <h3>Processed Data</h3> <p>Processed data should be named so it is easy to see which script generated the data</p> <p>The mapping from each processing script to the processed data it generates should be recorded in the README</p> <p>Processed data should be tidy</p> <h3>Exploratory Figures</h3> <p>Figures made during the course of your analysis, not necessarily part of your final report</p> <p>They do not need to be "pretty"</p> <h3>Final Figures</h3> <p>Usually a small subset of the original figures</p> <p>Axes/Colors set to make the figure clear</p> <p>Possibly multiple panels</p> <h3>Raw Scripts</h3> <p>May be less commented (but comments help you!)</p> <p>May be multiple versions</p> <p>May include analyses that are later discarded</p> <h3>Final Scripts</h3> <p>Clearly commented</p> <ul> <li>Small comments liberally -- what, when, why, how</li> <li>Bigger commented blocks for whole sections</li> </ul> <p>Include processing details</p> <p>Only analyses that appear in the final write-up</p> <h3>R Markdown Files</h3> <p>R Markdown files can be used to generate reproducible reports</p> <p>Text and R code are integrated</p> <p>Very easy to create in RStudio</p> <h3>README Files</h3> <p>Not necessary if you use R Markdown</p> <p>Should contain step-by-step
instructions for analysis</p> <h3>Text of the document</h3> <p>It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)</p> <p>It should tell a story</p> <p>It should not include every analysis you performed</p> <p>References should be included for statistical methods</p> </article> </main> <script src="themes/bitsandpieces/scripts/highlight.js"></script> <script src="themes/bitsandpieces/scripts/mousetrap.min.js"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ], processEscapes: true } }); </script> <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> </script> <script> hljs.initHighlightingOnLoad(); document.querySelectorAll('.menuitem a').forEach(function(el) { if (el.getAttribute('data-shortcut').length > 0) { Mousetrap.bind(el.getAttribute('data-shortcut'), function() { location.assign(el.getAttribute('href')); }); } }); </script> </body> </html>