<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="author" content="Fredrik Danielsson, http://lostkeys.se">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="robots" content="noindex" />
<title>Brandon Rozek</title>
<link rel="stylesheet" href="themes/bitsandpieces/styles/main.css" type="text/css" />
<link rel="stylesheet" href="themes/bitsandpieces/styles/highlightjs-github.css" type="text/css" />
</head>
<body>
<aside class="main-nav">
<nav>
<ul>
<li class="menuitem ">
<a href="index.html%3Findex.html" data-shortcut="">
Home
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Fcourses.html" data-shortcut="">
Courses
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Flabaide.html" data-shortcut="">
Lab Aide
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Fpresentations.html" data-shortcut="">
Presentations
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Fresearch.html" data-shortcut="">
Research
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Ftranscript.html" data-shortcut="">
Transcript
</a>
</li>
</ul>
</nav>
</aside>
<main class="main-content">
<article class="article">
<h1>Reproducible Research Week 1</h1>
<h2>Replication</h2>
<p>The ultimate standard for strengthening scientific evidence is replication of findings, that is, conducting studies with independent</p>
<ul>
<li>Investigators</li>
<li>Data</li>
<li>Analytical Methods</li>
<li>Laboratories</li>
<li>Instruments</li>
</ul>
<p>Replication is particularly important in studies that can impact broad policy or regulatory decisions</p>
<h3>What's wrong with replication?</h3>
<p>Some studies cannot be replicated</p>
<ul>
<li>No time, opportunistic</li>
<li>No money</li>
<li>Unique</li>
</ul>
<p><em>Reproducible Research:</em> Make analytic data and code available so that others may reproduce findings</p>
<p>Reproducibility bridges the gap between replication, which is awesome, and doing nothing.</p>
<h2>Why do we need reproducible research?</h2>
<p>New technologies are increasing data collection throughput; data are more complex and extremely high dimensional</p>
<p>Existing databases can be merged into new &quot;megadatabases&quot;</p>
<p>Computing power is greatly increased, allowing more sophisticated analyses</p>
<p>For every field &quot;X&quot; there is a field &quot;Computational X&quot;</p>
<h2>Research Pipeline</h2>
<p>Measured Data -&gt; Analytic Data -&gt; Computational Results -&gt; Figures/Tables/Numeric Summaries -&gt; Articles -&gt; Text</p>
<p>Data/metadata used to develop the test should be made publicly available</p>
<p>The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available</p>
<p>&quot;Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis need to be transparently reported&quot; -- IOM Report</p>
<h3>What do we need for reproducible research?</h3>
<ul>
<li>Analytic data are available</li>
<li>Analytic code is available</li>
<li>Documentation of code and data</li>
<li>Standard means of distribution</li>
</ul>
<h3>Who is the audience for reproducible research?</h3>
<p>Authors:</p>
<ul>
<li>Want to make their research reproducible</li>
<li>Want tools for reproducible research to make their lives easier (or at least not much harder)</li>
</ul>
<p>Readers:</p>
<ul>
<li>Want to reproduce (and perhaps expand upon) interesting findings</li>
<li>Want tools for reproducible research to make their lives easier.</li>
</ul>
<h3>Challenges for reproducible research</h3>
<ul>
<li>Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)</li>
<li>Readers must download data/results individually and piece together which data go with which code sections, etc.</li>
<li>Readers may not have the same resources as authors</li>
<li>Few tools to help authors/readers</li>
</ul>
<h3>What happens in reality</h3>
<p>Authors:</p>
<ul>
<li>Just put stuff on the web</li>
<li>Journal supplementary materials (infamous for disorganization)</li>
<li>There are some central databases for various fields (e.g. biology, ICPSR)</li>
</ul>
<p>Readers:</p>
<ul>
<li>Just download the data and (try to) figure it out</li>
<li>Piece together the software and run it</li>
</ul>
<h2>Literate (Statistical) Programming</h2>
<p>An article is a stream of text and code</p>
<p>Analysis code is divided into text and code &quot;chunks&quot;</p>
<p>Each code chunk loads data and computes results</p>
<p>Presentation code formats results (tables, figures, etc.)</p>
<p>Article text explains what is going on</p>
<p>Literate programs can be woven to produce human-readable documents and tangled to produce machine-readable documents</p>
<p>Literate programming is a general concept that requires</p>
<ol>
<li>A documentation language (human readable)</li>
<li>A programming language (machine readable)</li>
</ol>
<p>knitr is an R package that supports a variety of documentation languages, such as LaTeX, Markdown, and HTML</p>
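<p>A minimal sketch of weaving and tangling, assuming the knitr package is installed. The file names and the chunk contents are made up for illustration. Here <code>knit()</code> weaves an R Markdown file into a Markdown document and <code>purl()</code> tangles the same file into a plain R script:</p>
<pre><code class="language-r">library(knitr)

# Three backticks, built with strrep() so the chunk delimiters are easy to read
fence &lt;- strrep("`", 3)

# Hypothetical literate document: prose plus one named R code chunk
rmd_lines &lt;- c(
  "# Example report",
  "",
  "The mean of the numbers 1 through 10 is computed below.",
  "",
  paste0(fence, "{r compute-mean}"),
  "mean(1:10)",
  fence
)

work_dir &lt;- tempdir()
rmd_file &lt;- file.path(work_dir, "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Weave: run the chunks and write a human-readable Markdown document
md_file &lt;- knit(rmd_file, output = file.path(work_dir, "example.md"), quiet = TRUE)

# Tangle: extract only the code into a machine-readable R script
r_file &lt;- purl(rmd_file, output = file.path(work_dir, "example.R"), quiet = TRUE)
</code></pre>
<p>Both calls return the path of the file they wrote, so the woven document and the tangled script can be inspected with <code>readLines()</code>.</p>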
<h3>Quick summary so far</h3>
<p>Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate</p>
<p>Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available</p>
<p>There is a growing number of tools for creating reproducible documents</p>
<p><strong>Golden Rule of Reproducibility: Script Everything</strong></p>
<h2>Steps in a Data Analysis</h2>
<ol>
<li>Define the question</li>
<li>Define the ideal data set</li>
<li>Determine what data you can access</li>
<li>Obtain the data</li>
<li>Clean the data</li>
<li>Exploratory data analysis</li>
<li>Statistical prediction/modeling</li>
<li>Interpret results</li>
<li>Challenge results</li>
<li>Synthesize/write up results</li>
<li>Create reproducible code</li>
</ol>
<p>&quot;Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?&quot; -- Dan Meyer</p>
<p>Defining a question is about the most powerful dimension reduction tool you can ever employ.</p>
<h3>An Example for #1</h3>
<p><strong>Start with a general question</strong></p>
<p>Can I automatically detect emails that are SPAM or not?</p>
<p><strong>Make it concrete</strong></p>
<p>Can I use quantitative characteristics of emails to classify them as SPAM?</p>
<h3>Define the ideal data set</h3>
<p>The data set may depend on your goal</p>
<ul>
<li>Descriptive goal -- a whole population</li>
<li>Exploratory goal -- a random sample with many variables measured</li>
<li>Inferential goal -- The right population, randomly sampled</li>
<li>Predictive goal -- a training and test data set from the same population</li>
<li>Causal goal -- data from a randomized study</li>
<li>Mechanistic goal -- data about all components of the system</li>
</ul>
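<p>For the predictive goal, a training and test set from the same population can be sketched as a random split of one data set. The data below are simulated stand-ins (the values and the seed are arbitrary assumptions, not course data):</p>
<pre><code class="language-r">set.seed(3435)  # fix the random seed so the split itself is reproducible

# Simulated stand-in for an email data set (values are made up)
n &lt;- 200
emails &lt;- data.frame(
  capitalAve = rexp(n, rate = 0.2),
  type = factor(sample(c("nonspam", "spam"), n, replace = TRUE))
)

# Assign each row to training (1) or test (0) with probability 0.5
train_indicator &lt;- rbinom(n, size = 1, prob = 0.5)
trainSpam &lt;- emails[train_indicator == 1, ]
testSpam  &lt;- emails[train_indicator == 0, ]

# Every row lands in exactly one of the two sets
nrow(trainSpam) + nrow(testSpam) == n
</code></pre>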
<h3>Determine what data you can access</h3>
<p>Sometimes you can find data free on the web</p>
<p>Other times you may need to buy the data</p>
<p>Be sure to respect the terms of use</p>
<p>If the data don't exist, you may need to generate them yourself.</p>
<h3>Obtain the data</h3>
<p>Try to obtain the raw data</p>
<p>Be sure to reference the source</p>
<p>Polite emails go a long way</p>
<p>If you load the data from an Internet source, record the URL and time accessed</p>
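<p>One way to record the URL and access time is to append them to a README from the same script that obtains the data. This is only a sketch: <code>record_source()</code> is a hypothetical helper, and the URL is a placeholder rather than a real data set, so the download step is left commented out:</p>
<pre><code class="language-r"># Append the source URL and access time to a README-style log file
record_source &lt;- function(url, readme) {
  entry &lt;- sprintf("Source: %s (accessed %s)", url,
                   format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"))
  cat(entry, "\n", file = readme, append = TRUE, sep = "")
  invisible(entry)
}

data_url &lt;- "https://example.org/data/raw.csv"  # placeholder, not a real data set
readme &lt;- file.path(tempdir(), "README.txt")

# The real download step would go here:
# download.file(data_url, destfile = file.path(tempdir(), "raw.csv"))
entry &lt;- record_source(data_url, readme)
readLines(readme)
</code></pre>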
<h3>Clean the data</h3>
<p>Raw data often needs to be processed</p>
<p>If it is pre-processed, make sure you understand how</p>
<p>Understand the source of the data (census, sample, convenience sample, etc)</p>
<p>May need reformatting, subsampling -- record these steps</p>
<p><strong>Determine if the data are good enough</strong> -- If not, quit or change data</p>
<h3>Exploratory Data Analysis</h3>
<p>Look at summaries of the data</p>
<p>Check for missing data</p>
<p>-&gt; Why is there missing data?</p>
<p>Look for outliers</p>
<p>Create exploratory plots</p>
<p>Perform exploratory analyses such as clustering</p>
<p>If a plot is hard to read because the values are bunched together, consider taking the log base 10 of an axis (adding 1 first avoids taking the log of zero)</p>
<p><code>plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)</code></p>
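<p>The exploratory steps above can be sketched on simulated data. The data frame below only mimics the shape of a spam training set (right-skewed values, two classes, a couple of missing entries) and all of its values are made up:</p>
<pre><code class="language-r">set.seed(1)

# Simulated, right-skewed measurements with two missing values (made up)
trainSpam &lt;- data.frame(
  capitalAve = c(rexp(98, rate = 0.5), NA, NA),
  type = factor(rep(c("nonspam", "spam"), each = 50))
)

summary(trainSpam$capitalAve)            # numeric summary, including the NA count
colSums(is.na(trainSpam))                # missing values per column
boxplot.stats(trainSpam$capitalAve)$out  # points beyond the boxplot whiskers

# Boxplots by class on a log10 scale; adding 1 avoids log10(0)
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)

# Exploratory hierarchical clustering of the rows without missing values
complete &lt;- trainSpam[!is.na(trainSpam$capitalAve), ]
clusters &lt;- hclust(dist(log10(complete$capitalAve + 1)))
plot(clusters)
</code></pre>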
<h3>Statistical prediction/modeling</h3>
<p>Should be informed by the results of your exploratory analysis</p>
<p>Exact methods depend on the question of interest</p>
<p>Transformations/processing should be accounted for when necessary</p>
<p>Measures of uncertainty should be reported.</p>
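<p>As one illustration of reporting uncertainty, a fitted model's estimates can be shown with standard errors and confidence intervals. The sketch below uses a simple linear model on simulated data, which is an assumption for the example and not a method prescribed by the course:</p>
<pre><code class="language-r">set.seed(42)

# Simulated data with a known linear relationship plus noise (made up)
x &lt;- runif(100, min = 0, max = 10)
y &lt;- 2 + 0.5 * x + rnorm(100, sd = 1)

fit &lt;- lm(y ~ x)

# Point estimates together with their standard errors
summary(fit)$coefficients

# 95% confidence intervals for the intercept and slope
ci &lt;- confint(fit, level = 0.95)
ci
</code></pre>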
<h3>Interpret Results</h3>
<p>Use the appropriate language</p>
<ul>
<li>Describes</li>
<li>Correlates with/associated with</li>
<li>Leads to/Causes</li>
<li>Predicts</li>
</ul>
<p>Give an explanation</p>
<p>Interpret Coefficients</p>
<p>Interpret measures of uncertainty</p>
<h3>Challenge Results</h3>
<p>Challenge all steps:</p>
<ul>
<li>Question</li>
<li>Data Source</li>
<li>Processing</li>
<li>Analysis</li>
<li>Conclusions</li>
</ul>
<p>Challenge measures of uncertainty</p>
<p>Challenge choices of terms to include in models</p>
<p>Think of potential alternative analyses</p>
<h3>Synthesize/Write-up Results</h3>
<p>Lead with the question</p>
<p>Summarize the analyses into the story</p>
<p>Don't include every analysis; include one only</p>
<ul>
<li>If it is needed for the story</li>
<li>If it is needed to address a challenge</li>
</ul>
<p>Order analyses according to the story, rather than chronologically</p>
<p>Include &quot;pretty&quot; figures that contribute to the story</p>
<h3>In the lecture example...</h3>
<p>Lead with the question</p>
<p> Can I use quantitative characteristics of the emails to classify them as SPAM?</p>
<p>Describe the approach</p>
<p> Collected data from UCI -&gt; created training/test sets</p>
<p> Explored Relationships</p>
<p> Chose logistic model on training set by cross validation</p>
<p> Applied to test, 78% test set accuracy</p>
<p>Interpret results</p>
<p> Number of dollar signs seems a reasonable predictor, e.g. &quot;Make more money with Viagra $ $ $ $&quot;</p>
<p>Challenge Results</p>
<p> 78% isn't that great</p>
<p> Could use more variables</p>
<p> Why use logistic regression?</p>
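<p>A rough sketch of the fit-on-training, evaluate-on-test workflow follows. It uses simulated data rather than the UCI data, and a single fixed predictor instead of one chosen by cross validation, so its accuracy will not match the figure quoted above. The variable names are made up for the example:</p>
<pre><code class="language-r">set.seed(2020)

# Simulated stand-in for the spam data: one predictor, binary label (made up)
n &lt;- 400
dollar_count &lt;- rpois(n, lambda = 2)
is_spam &lt;- rbinom(n, size = 1, prob = plogis(-1.5 + 0.8 * dollar_count))
emails &lt;- data.frame(dollar_count, is_spam)

# Random split into training and test sets
in_train &lt;- rbinom(n, size = 1, prob = 0.5) == 1
train &lt;- emails[in_train, ]
test  &lt;- emails[!in_train, ]

# Fit a logistic regression on the training set only
model &lt;- glm(is_spam ~ dollar_count, family = binomial, data = train)

# Predict spam probabilities on the test set and classify at a 0.5 cutoff
test_prob &lt;- predict(model, newdata = test, type = "response")
test_pred &lt;- as.integer(test_prob &gt;= 0.5)

# Proportion of test emails classified correctly
accuracy &lt;- mean(test_pred == test$is_spam)
accuracy
</code></pre>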
<h2>Data Analysis Files</h2>
<p>Data</p>
<ul>
<li>Raw Data</li>
<li>Processed Data</li>
</ul>
<p>Figures</p>
<ul>
<li>Exploratory Figures</li>
<li>Final Figures</li>
</ul>
<p>R Code</p>
<ul>
<li>Raw/Unused Scripts</li>
<li>Final Scripts</li>
<li>R Markdown Files</li>
</ul>
<p>Text</p>
<ul>
<li>README files</li>
<li>Text of Analysis/Report</li>
</ul>
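<p>In keeping with scripting everything, the file layout above can itself be created by a script. The folder names below are one possible arrangement, chosen for illustration, and the project is created in a temporary directory:</p>
<pre><code class="language-r"># Create the folder layout under a project root (here a temporary directory)
project &lt;- file.path(tempdir(), "analysis-project")
folders &lt;- c(
  "data/raw", "data/processed",
  "figures/exploratory", "figures/final",
  "code/raw", "code/final", "code/rmarkdown",
  "text"
)
for (folder in folders) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# A README at the top level to document data sources and processing steps
file.create(file.path(project, "README.md"))

list.dirs(project, recursive = TRUE, full.names = FALSE)
</code></pre>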
<h3>Raw Data</h3>
<p>Should be stored in the analysis folder</p>
<p>If accessed from the web, include URL, description, and date accessed in README</p>
<h3>Processed Data</h3>
<p>Processed data should be named so it is easy to see which script generated the data</p>
<p>The mapping from each processing script to the processed data it generates should be recorded in the README</p>
<p>Processed data should be tidy</p>
<h3>Exploratory Figures</h3>
<p>Figures made during the course of your analysis, not necessarily part of your final report</p>
<p>They do not need to be &quot;pretty&quot;</p>
<h3>Final Figures</h3>
<p>Usually a small subset of the original figures</p>
<p>Axes/Colors set to make the figure clear</p>
<p>Possibly multiple panels</p>
<h3>Raw Scripts</h3>
<p>May be less commented (but comments help you!)</p>
<p>May be multiple versions</p>
<p>May include analyses that are later discarded</p>
<h3>Final Scripts</h3>
<p>Clearly commented</p>
<ul>
<li>
<p>Small comments liberally - what, when, why, how</p>
</li>
<li>Bigger commented blocks for whole sections</li>
</ul>
<p>Include processing details</p>
<p>Only analyses that appear in the final write-up</p>
<h3>R Markdown Files</h3>
<p>R Markdown files can be used to generate reproducible reports</p>
<p>Text and R code are integrated</p>
<p>Very easy to create in RStudio</p>
<h3>Readme Files</h3>
<p>Not necessary if you use R Markdown</p>
<p>Should contain step-by-step instructions for analysis</p>
<h3>Text of the document</h3>
<p>It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)</p>
<p>It should tell a story</p>
<p>It should not include every analysis you performed</p>
<p>References should be included for statistical methods</p>
</article>
</main>
<script src="themes/bitsandpieces/scripts/highlight.js"></script>
<script src="themes/bitsandpieces/scripts/mousetrap.min.js"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
processEscapes: true
}
});
</script>
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script>
hljs.initHighlightingOnLoad();
document.querySelectorAll('.menuitem a').forEach(function(el) {
if (el.getAttribute('data-shortcut').length > 0) {
Mousetrap.bind(el.getAttribute('data-shortcut'), function() {
location.assign(el.getAttribute('href'));
});
}
});
</script>
</body>
</html>