website/static/~brozek/index.html?courses%2FReproducibleResearch%2Fweek1.html
2022-02-15 01:14:58 -05:00

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="author" content="Brandon Rozek">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="robots" content="noindex" />
<title>Brandon Rozek</title>
<link rel="stylesheet" href="themes/bitsandpieces/styles/main.css" type="text/css" />
<link rel="stylesheet" href="themes/bitsandpieces/styles/highlightjs-github.css" type="text/css" />
</head>
<body>
<aside class="main-nav">
<nav>
<ul>
<li class="menuitem ">
<a href="index.html%3Findex.html" data-shortcut="">
Home
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Fcourses.html" data-shortcut="">
Courses
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Flabaide.html" data-shortcut="">
Lab Aide
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Fpresentations.html" data-shortcut="">
Presentations
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Fresearch.html" data-shortcut="">
Research
</a>
</li>
<li class="menuitem ">
<a href="index.html%3Ftranscript.html" data-shortcut="">
Transcript
</a>
</li>
</ul>
</nav>
</aside>
<main class="main-content">
<article class="article">
<h1>Reproducible Research Week 1</h1>
<h2>Replication</h2>
<p>The ultimate standard for strengthening scientific evidence is replication of findings: conducting studies with independent</p>
<ul>
<li>Investigators</li>
<li>Data</li>
<li>Analytical Methods</li>
<li>Laboratories</li>
<li>Instruments</li>
</ul>
<p>Replication is particularly important in studies that can impact broad policy or regulatory decisions</p>
<h3>What's wrong with replication?</h3>
<p>Some studies cannot be replicated</p>
<ul>
<li>No time, opportunistic</li>
<li>No money</li>
<li>Unique</li>
</ul>
<p><em>Reproducible Research:</em> Make analytic data and code available so that others may reproduce findings</p>
<p>Reproducibility bridges the gap between replication, which is awesome, and doing nothing.</p>
<h2>Why do we need reproducible research?</h2>
<p>New technologies are increasing data collection throughput; data are more complex and extremely high dimensional</p>
<p>Existing databases can be merged into new &quot;megadatabases&quot;</p>
<p>Computing power is greatly increased, allowing more sophisticated analyses</p>
<p>For every field &quot;X&quot; there is a field &quot;Computational X&quot;</p>
<h2>Research Pipeline</h2>
<p>Measured Data -&gt; Analytic Data -&gt; Computational Results -&gt; Figures/Tables/Numeric Summaries -&gt; Articles -&gt; Text</p>
<p>Data and metadata used to develop the test should be made publicly available</p>
<p>The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available</p>
<p>&quot;Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis need to be transparently reported&quot; -- IOM Report</p>
<h3>What do we need for reproducible research?</h3>
<ul>
<li>Analytic data are available</li>
<li>Analytic code is available</li>
<li>Documentation of code and data</li>
<li>Standard means of distribution</li>
</ul>
<h3>Who is the audience for reproducible research?</h3>
<p>Authors:</p>
<ul>
<li>Want to make their research reproducible</li>
<li>Want tools for reproducible research to make their lives easier (or at least not much harder)</li>
</ul>
<p>Readers:</p>
<ul>
<li>Want to reproduce (and perhaps expand upon) interesting findings</li>
<li>Want tools for reproducible research to make their lives easier.</li>
</ul>
<h3>Challenges for reproducible research</h3>
<ul>
<li>Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)</li>
<li>Readers must download data/results individually and piece together which data go with which code sections, etc.</li>
<li>Readers may not have the same resources as authors</li>
<li>Few tools to help authors/readers</li>
</ul>
<h3>What happens in reality</h3>
<p>Authors:</p>
<ul>
<li>Just put stuff on the web</li>
<li>Journal supplementary materials (infamous for disorganization)</li>
<li>There are some central databases for various fields (e.g. biology, ICPSR)</li>
</ul>
<p>Readers:</p>
<ul>
<li>Just download the data and (try to) figure it out</li>
<li>Piece together the software and run it</li>
</ul>
<h2>Literate (Statistical) Programming</h2>
<p>An article is a stream of text and code</p>
<p>Analysis code is divided into text and code &quot;chunks&quot;</p>
<p>Each code chunk loads data and computes results</p>
<p>Presentation code formats results (tables, figures, etc.)</p>
<p>Article text explains what is going on</p>
<p>Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable documents</p>
<p>Literate programming is a general concept that requires</p>
<ol>
<li>A documentation language (human readable)</li>
<li>A programming language (machine readable)</li>
</ol>
<p>knitr is an R package that supports a variety of documentation languages, such as LaTeX, Markdown, and HTML</p>
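<p>To make weaving and tangling concrete, here is a minimal sketch. knitr is not part of base R, so the sketch swaps in base R's <code>Sweave</code> and <code>Stangle</code> functions (from the <code>utils</code> package) as a stand-in. The file names and the chunk label <code>sample-mean</code> are hypothetical.</p>
<pre><code class="language-r"># Hypothetical literate document: LaTeX text plus one R code chunk.
rnw_lines &lt;- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the sample is computed below.",
  "&lt;&lt;sample-mean&gt;&gt;=",
  "x &lt;- c(2, 4, 6)",
  "mean(x)",
  "@",
  "\\end{document}"
)

work_dir &lt;- tempdir()
rnw_file &lt;- file.path(work_dir, "example.Rnw")
writeLines(rnw_lines, rnw_file)

# Weave: run the chunk and write a human-readable LaTeX document.
tex_file &lt;- file.path(work_dir, "example.tex")
utils::Sweave(rnw_file, output = tex_file, quiet = TRUE)

# Tangle: drop the text and keep only the machine-readable R code.
r_file &lt;- file.path(work_dir, "example.R")
utils::Stangle(rnw_file, output = r_file, quiet = TRUE)

tangled &lt;- readLines(r_file)
print(tangled)
</code></pre>
<p>knitr offers the same two operations for its supported formats through <code>knit()</code> and <code>purl()</code>.</p>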
<h3>Quick summary so far</h3>
<p>Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate</p>
<p>Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available</p>
<p>There is a growing number of tools for creating reproducible documents</p>
<p><strong>Golden Rule of Reproducibility: Script Everything</strong></p>
<h2>Steps in a Data Analysis</h2>
<ol>
<li>Define the question</li>
<li>Define the ideal data set</li>
<li>Determine what data you can access</li>
<li>Obtain the data</li>
<li>Clean the data</li>
<li>Exploratory data analysis</li>
<li>Statistical prediction/modeling</li>
<li>Interpret results</li>
<li>Challenge results</li>
<li>Synthesize/write up results</li>
<li>Create reproducible code</li>
</ol>
<p>&quot;Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?&quot; -- Dan Meyer</p>
<p>Defining a question is the most powerful dimension reduction tool you can ever employ.</p>
<h3>An Example for #1</h3>
<p><strong>Start with a general question</strong></p>
<p>Can I automatically detect which emails are SPAM?</p>
<p><strong>Make it concrete</strong></p>
<p>Can I use quantitative characteristics of emails to classify them as SPAM?</p>
<h3>Define the ideal data set</h3>
<p>The data set may depend on your goal</p>
<ul>
<li>Descriptive goal -- a whole population</li>
<li>Exploratory goal -- a random sample with many variables measured</li>
<li>Inferential goal -- the right population, randomly sampled</li>
<li>Predictive goal -- a training and test data set from the same population</li>
<li>Causal goal -- data from a randomized study</li>
<li>Mechanistic goal -- data about all components of the system</li>
</ul>
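<p>For the predictive goal, a random split is one simple way to get a training and a test set from the same population. The sketch below is a minimal illustration on made-up data: the <code>emails</code> data frame, its columns, and the 50/50 split are assumptions, not the lecture's actual data.</p>
<pre><code class="language-r">set.seed(3435)  # fix the random seed so the split is reproducible

# Hypothetical stand-in for a real data set: 200 emails, two variables.
emails &lt;- data.frame(
  capitalAve = rexp(200, rate = 0.2),
  type = factor(sample(c("nonspam", "spam"), 200, replace = TRUE))
)

# Randomly assign roughly half of the rows to the training set.
in_train &lt;- rbinom(nrow(emails), size = 1, prob = 0.5) == 1
train &lt;- emails[in_train, ]
test &lt;- emails[!in_train, ]

# Every row lands in exactly one of the two sets.
nrow(train) + nrow(test) == nrow(emails)
</code></pre>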
<h3>Determine what data you can access</h3>
<p>Sometimes you can find data free on the web</p>
<p>Other times you may need to buy the data</p>
<p>Be sure to respect the terms of use</p>
<p>If the data don't exist, you may need to generate them yourself.</p>
<h3>Obtain the data</h3>
<p>Try to obtain the raw data</p>
<p>Be sure to reference the source</p>
<p>Polite emails go a long way</p>
<p>If you load the data from an Internet source, record the URL and time accessed</p>
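<p>Recording the source can itself be scripted. The following is a minimal sketch, assuming a hypothetical URL and a temporary folder. The download line is commented out so the example runs without network access.</p>
<pre><code class="language-r"># Hypothetical source; replace with the real location of the raw data.
data_url &lt;- "https://example.org/data/spam.csv"
raw_dir &lt;- file.path(tempdir(), "data", "raw")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Uncomment to actually fetch the file (requires network access):
# download.file(data_url, destfile = file.path(raw_dir, "spam.csv"))

# Record where the data came from and when it was accessed.
accessed &lt;- format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")
provenance &lt;- sprintf("Source: %s\nAccessed: %s", data_url, accessed)
readme &lt;- file.path(raw_dir, "README.txt")
writeLines(provenance, readme)

cat(readLines(readme), sep = "\n")
</code></pre>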
<h3>Clean the data</h3>
<p>Raw data often needs to be processed</p>
<p>If it is pre-processed, make sure you understand how</p>
<p>Understand the source of the data (census, sample, convenience sample, etc.)</p>
<p>May need reformatting, subsampling -- record these steps</p>
<p><strong>Determine if the data are good enough</strong> -- If not, quit or change data</p>
<h3>Exploratory Data Analysis</h3>
<p>Look at summaries of the data</p>
<p>Check for missing data</p>
<p>-&gt; Why is there missing data?</p>
<p>Look for outliers</p>
<p>Create exploratory plots</p>
<p>Perform exploratory analyses such as clustering</p>
<p>If a plot is hard to read because the points are all bunched up, consider taking the log base 10 of an axis</p>
<p><code>plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)</code></p>
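<p>The checks in this section can be sketched together as follows. The <code>trainSpam</code> data frame here is synthetic, built only so the example runs on its own, and the 1.5 &times; IQR outlier rule is an assumed convention rather than something the lecture prescribes.</p>
<pre><code class="language-r">set.seed(1)

# Hypothetical stand-in for the lecture's trainSpam data frame.
trainSpam &lt;- data.frame(
  capitalAve = c(rexp(98, rate = 0.5), 450, NA),  # one extreme value, one NA
  type = factor(rep(c("nonspam", "spam"), each = 50))
)

summary(trainSpam$capitalAve)   # numeric summary, including the NA count
colSums(is.na(trainSpam))       # missing values per column
table(trainSpam$type)           # class balance

# Flag values far above the upper quartile as candidate outliers.
upper &lt;- quantile(trainSpam$capitalAve, 0.75, na.rm = TRUE) +
  1.5 * IQR(trainSpam$capitalAve, na.rm = TRUE)
outliers &lt;- which(trainSpam$capitalAve &gt; upper)

# Side-by-side boxplots on a log10 scale; the + 1 avoids log10(0).
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
</code></pre>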
<h3>Statistical prediction/modeling</h3>
<p>Should be informed by the results of your exploratory analysis</p>
<p>Exact methods depend on the question of interest</p>
<p>Transformations/processing should be accounted for when necessary</p>
<p>Measures of uncertainty should be reported.</p>
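<p>As an illustration of reporting uncertainty alongside estimates, here is a minimal sketch of a logistic regression with Wald confidence intervals. The data are simulated, and the names <code>charDollar</code> and <code>is_spam</code> are assumptions chosen to echo the spam example.</p>
<pre><code class="language-r">set.seed(42)

# Hypothetical data: spam probability rises with the number of dollar signs.
n &lt;- 500
charDollar &lt;- rexp(n, rate = 5)
is_spam &lt;- rbinom(n, size = 1, prob = plogis(-1 + 4 * charDollar))
trainSpam &lt;- data.frame(charDollar, is_spam)

# Logistic regression of spam status on a single predictor.
fit &lt;- glm(is_spam ~ charDollar, family = binomial, data = trainSpam)

# Point estimates with standard errors (log-odds scale).
coef(summary(fit))

# Wald 95% confidence intervals, reported alongside the estimates.
ci &lt;- confint.default(fit)
ci
</code></pre>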
<h3>Interpret Results</h3>
<p>Use the appropriate language</p>
<ul>
<li>Describes</li>
<li>Correlates with/associated with</li>
<li>Leads to/Causes</li>
<li>Predicts</li>
</ul>
<p>Give an explanation</p>
<p>Interpret coefficients</p>
<p>Interpret measures of uncertainty</p>
<h3>Challenge Results</h3>
<p>Challenge all steps:</p>
<ul>
<li>Question</li>
<li>Data Source</li>
<li>Processing</li>
<li>Analysis</li>
<li>Conclusions</li>
</ul>
<p>Challenge measures of uncertainty</p>
<p>Challenge choices of terms to include in models</p>
<p>Think of potential alternative analyses</p>
<h3>Synthesize/Write-up Results</h3>
<p>Lead with the question</p>
<p>Summarize the analyses into the story</p>
<p>Don't include every analysis; include one only</p>
<ul>
<li>If it is needed for the story</li>
<li>If it is needed to address a challenge</li>
</ul>
<p>Order analyses according to the story, rather than chronologically</p>
<p>Include &quot;pretty&quot; figures that contribute to the story</p>
<h3>In the lecture example...</h3>
<p>Lead with the question</p>
<ul>
<li>Can I use quantitative characteristics of the emails to classify them as SPAM?</li>
</ul>
<p>Describe the approach</p>
<ul>
<li>Collected data from UCI -&gt; created training/test sets</li>
<li>Explored relationships</li>
<li>Chose a logistic model on the training set by cross validation</li>
<li>Applied it to the test set: 78% test set accuracy</li>
</ul>
<p>Interpret results</p>
<ul>
<li>Number of dollar signs seems a reasonable predictor, e.g. &quot;Make more money with Viagra $ $ $ $&quot;</li>
</ul>
<p>Challenge results</p>
<ul>
<li>78% isn't that great</li>
<li>Could use more variables</li>
<li>Why use logistic regression?</li>
</ul>
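<p>The approach in this example (choose a model by cross validation on the training set, then score the test set once) can be sketched in base R. This is not the lecture's code: the data are simulated, the two candidate predictors and the helper functions <code>error_rate</code> and <code>cv_error</code> are hypothetical, and the 0.5 cutoff and 5 folds are assumptions.</p>
<pre><code class="language-r">set.seed(7)

# Hypothetical data standing in for the UCI spam data.
n &lt;- 1000
charDollar &lt;- rexp(n, rate = 5)
capitalAve &lt;- rexp(n, rate = 0.5)  # unrelated to the outcome by construction
is_spam &lt;- rbinom(n, size = 1, prob = plogis(-1 + 4 * charDollar))
emails &lt;- data.frame(charDollar, capitalAve, is_spam)

in_train &lt;- rbinom(n, size = 1, prob = 0.5) == 1
train &lt;- emails[in_train, ]
test &lt;- emails[!in_train, ]

# Misclassification rate with a 0.5 cutoff on the predicted probability.
error_rate &lt;- function(actual, prob) mean(actual != (prob &gt; 0.5))

# k-fold cross-validated error of a one-predictor logistic model.
cv_error &lt;- function(predictor, data, k = 5) {
  fold &lt;- sample(rep(seq_len(k), length.out = nrow(data)))
  model &lt;- reformulate(predictor, response = "is_spam")
  errors &lt;- sapply(seq_len(k), function(i) {
    fit &lt;- glm(model, family = binomial, data = data[fold != i, ])
    held_out &lt;- data[fold == i, ]
    error_rate(held_out$is_spam, predict(fit, held_out, type = "response"))
  })
  mean(errors)
}

# Pick the candidate with the lowest cross-validated error.
candidates &lt;- c("charDollar", "capitalAve")
cv_errors &lt;- sapply(candidates, cv_error, data = train)
best &lt;- names(which.min(cv_errors))

# Refit the chosen model on the full training set; score the test set once.
final_fit &lt;- glm(reformulate(best, response = "is_spam"),
                 family = binomial, data = train)
test_accuracy &lt;- 1 - error_rate(test$is_spam,
                                predict(final_fit, test, type = "response"))
test_accuracy
</code></pre>
<p>Keeping the test set out of the model choice is what makes the final accuracy an honest estimate.</p>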
<h2>Data Analysis Files</h2>
<p>Data</p>
<ul>
<li>Raw Data</li>
<li>Processed Data</li>
</ul>
<p>Figures</p>
<ul>
<li>Exploratory Figures</li>
<li>Final Figures</li>
</ul>
<p>R Code</p>
<ul>
<li>Raw/Unused Scripts</li>
<li>Final Scripts</li>
<li>R Markdown Files</li>
</ul>
<p>Text</p>
<ul>
<li>README files</li>
<li>Text of Analysis/Report</li>
</ul>
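<p>One way to set up this layout is to script it. The folder names below are an assumed convention that mirrors the list above, and the project lives in a temporary directory so the sketch has no side effects on real work.</p>
<pre><code class="language-r"># Hypothetical project root inside a temporary directory.
project &lt;- file.path(tempdir(), "spam-analysis")

folders &lt;- c(
  "data/raw", "data/processed",
  "figures/exploratory", "figures/final",
  "code/raw", "code/final", "code/rmarkdown",
  "text"
)

for (folder in folders) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# A top-level README documents how the pieces fit together.
writeLines("Steps to reproduce the analysis go here.",
           file.path(project, "README.md"))

list.dirs(project, full.names = FALSE)
</code></pre>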
<h3>Raw Data</h3>
<p>Should be stored in the analysis folder</p>
<p>If accessed from the web, include URL, description, and date accessed in README</p>
<h3>Processed Data</h3>
<p>Processed data should be named so it is easy to see which script generated the data</p>
<p>The mapping from each processing script to the processed data it generates should be documented in the README</p>
<p>Processed data should be tidy</p>
<h3>Exploratory Figures</h3>
<p>Figures made during the course of your analysis, not necessarily part of your final report</p>
<p>They do not need to be &quot;pretty&quot;</p>
<h3>Final Figures</h3>
<p>Usually a small subset of the original figures</p>
<p>Axes/Colors set to make the figure clear</p>
<p>Possibly multiple panels</p>
<h3>Raw Scripts</h3>
<p>May be less commented (but comments help you!)</p>
<p>May be multiple versions</p>
<p>May include analyses that are later discarded</p>
<h3>Final Scripts</h3>
<p>Clearly commented</p>
<ul>
<li>
<p>Use small comments liberally: what, when, why, how</p>
</li>
<li>Bigger commented blocks for whole sections</li>
</ul>
<p>Include processing details</p>
<p>Only analyses that appear in the final write-up</p>
<h3>R Markdown Files</h3>
<p>R Markdown files can be used to generate reproducible reports</p>
<p>Text and R code are integrated</p>
<p>Very easy to create in RStudio</p>
<h3>Readme Files</h3>
<p>Not necessary if you use R Markdown</p>
<p>Should contain step-by-step instructions for analysis</p>
<h3>Text of the document</h3>
<p>It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)</p>
<p>It should tell a story</p>
<p>It should not include every analysis you performed</p>
<p>References should be included for statistical methods</p>
</article>
</main>
<script src="themes/bitsandpieces/scripts/highlight.js"></script>
<script src="themes/bitsandpieces/scripts/mousetrap.min.js"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
processEscapes: true
}
});
</script>
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script>
hljs.initHighlightingOnLoad();
document.querySelectorAll('.menuitem a').forEach(function(el) {
if (el.getAttribute('data-shortcut').length > 0) {
Mousetrap.bind(el.getAttribute('data-shortcut'), function() {
location.assign(el.getAttribute('href'));
});
}
});
</script>
</body>
</html>