mirror of
				https://github.com/Brandon-Rozek/website.git
				synced 2025-10-30 13:41:12 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			337 lines
		
	
	
	
		
			13 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
| <!DOCTYPE html>
 | ||
| <html>
 | ||
| <head>
 | ||
|   <meta charset="utf-8" />
 | ||
|   <meta name="author" content="Brandon Rozek">
 | ||
|   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 | ||
|   <meta name="robots" content="noindex" />
 | ||
|     <title>Brandon Rozek</title>
 | ||
|   <link rel="stylesheet" href="themes/bitsandpieces/styles/main.css" type="text/css" />
 | ||
|   <link rel="stylesheet" href="themes/bitsandpieces/styles/highlightjs-github.css" type="text/css" />
 | ||
| </head>
 | ||
| <body>
 | ||
| 
 | ||
| <aside class="main-nav">
 | ||
| <nav>
 | ||
|   <ul>
 | ||
|           <li class="menuitem ">
 | ||
|         <a href="index.html%3Findex.html" data-shortcut="">
 | ||
|           Home
 | ||
|                   </a>
 | ||
|       </li>
 | ||
|           <li class="menuitem ">
 | ||
|         <a href="index.html%3Fcourses.html" data-shortcut="">
 | ||
|           Courses
 | ||
|                   </a>
 | ||
|       </li>
 | ||
|           <li class="menuitem ">
 | ||
|         <a href="index.html%3Flabaide.html" data-shortcut="">
 | ||
|           Lab Aide
 | ||
|                   </a>
 | ||
|       </li>
 | ||
|           <li class="menuitem ">
 | ||
|         <a href="index.html%3Fpresentations.html" data-shortcut="">
 | ||
|           Presentations
 | ||
|                   </a>
 | ||
|       </li>
 | ||
|           <li class="menuitem ">
 | ||
|         <a href="index.html%3Fresearch.html" data-shortcut="">
 | ||
|           Research
 | ||
|                   </a>
 | ||
|       </li>
 | ||
|           <li class="menuitem ">
 | ||
|         <a href="index.html%3Ftranscript.html" data-shortcut="">
 | ||
|           Transcript
 | ||
|                   </a>
 | ||
|       </li>
 | ||
|       </ul>
 | ||
| </nav>
 | ||
| </aside>
 | ||
| <main class="main-content">
 | ||
|   <article class="article">
 | ||
|     <h1>Reproducible Research Week 1</h1>
 | ||
| <h2>Replication</h2>
 | ||
| <p>The ultimate standard for strengthening scientific evidence is replication of findings by conducting studies with independent</p>
 | ||
| <ul>
 | ||
| <li>Investigators</li>
 | ||
| <li>Data</li>
 | ||
| <li>Analytical Methods</li>
 | ||
| <li>Laboratories</li>
 | ||
| <li>Instruments</li>
 | ||
| </ul>
 | ||
| <p>Replication is particularly important in studies that can impact broad policy or regulatory decisions</p>
 | ||
| <h3>What's wrong with replication?</h3>
 | ||
| <p>Some studies cannot be replicated</p>
 | ||
| <ul>
 | ||
| <li>No time, opportunistic</li>
 | ||
| <li>No money</li>
 | ||
| <li>Unique</li>
 | ||
| </ul>
 | ||
| <p><em>Reproducible Research:</em> Make analytic data and code available so that others may reproduce findings</p>
 | ||
| <p>Reproducibility bridges the gap between full replication, which is the ideal, and doing nothing.</p>
 | ||
| <h2>Why do we need reproducible research?</h2>
 | ||
| <p>New technologies are increasing data collection throughput; data are more complex and extremely high dimensional</p>
 | ||
| <p>Existing databases can be merged into new "megadatabases"</p>
 | ||
| <p>Computing power is greatly increased, allowing more sophisticated analyses</p>
 | ||
| <p>For every field "X" there is a field "Computational X"</p>
 | ||
| <h2>Research Pipeline</h2>
 | ||
| <p>Measured Data -> Analytic Data -> Computational Results -> Figures/Tables/Numeric Summaries -> Articles -> Text</p>
 | ||
| <p>Data and metadata used to develop the test should be made publicly available</p>
 | ||
| <p>The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available</p>
 | ||
| <p>"Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps. All aspects of the analysis need to be transparently reported" -- IOM Report</p>
 | ||
| <h3>What do we need for reproducible research?</h3>
 | ||
| <ul>
 | ||
| <li>Analytic data are available</li>
 | ||
| <li>Analytic code is available</li>
 | ||
| <li>Documentation of code and data</li>
 | ||
| <li>Standard means of distribution</li>
 | ||
| </ul>
 | ||
| <h3>Who is the audience for reproducible research?</h3>
 | ||
| <p>Authors:</p>
 | ||
| <ul>
 | ||
| <li>Want to make their research reproducible</li>
 | ||
| <li>Want tools for reproducible research to make their lives easier (or at least not much harder)</li>
 | ||
| </ul>
 | ||
| <p>Readers:</p>
 | ||
| <ul>
 | ||
| <li>Want to reproduce (and perhaps expand upon) interesting findings</li>
 | ||
| <li>Want tools for reproducible research to make their lives easier.</li>
 | ||
| </ul>
 | ||
| <h3>Challenges for reproducible research</h3>
 | ||
| <ul>
 | ||
| <li>Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)</li>
 | ||
| <li>Readers must download data/results individually and piece together which data go with which code sections, etc.</li>
 | ||
| <li>Readers may not have the same resources as authors</li>
 | ||
| <li>Few tools to help authors/readers</li>
 | ||
| </ul>
 | ||
| <h3>What happens in reality</h3>
 | ||
| <p>Authors:</p>
 | ||
| <ul>
 | ||
| <li>Just put stuff on the web</li>
 | ||
| <li>(Infamous for disorganization) Journal supplementary materials</li>
 | ||
| <li>There are some central databases for various fields (e.g. biology, ICPSR)</li>
 | ||
| </ul>
 | ||
| <p>Readers:</p>
 | ||
| <ul>
 | ||
| <li>Just download the data and (try to) figure it out</li>
 | ||
| <li>Piece together the software and run it</li>
 | ||
| </ul>
 | ||
| <h2>Literate (Statistical) Programming</h2>
 | ||
| <p>An article is a stream of text and code</p>
 | ||
| <p>Analysis code is divided into text and code "chunks"</p>
 | ||
| <p>Each code chunk loads data and computes results</p>
 | ||
| <p>Presentation code formats results (tables, figures, etc.)</p>
 | ||
| <p>Article text explains what is going on</p>
 | ||
| <p>Literate programs can be woven to produce human-readable documents and tangled to produce machine-readable documents</p>
 | ||
| <p>Literate programming is a general concept that requires</p>
 | ||
| <ol>
 | ||
| <li>A documentation language (human readable)</li>
 | ||
| <li>A programming language (machine readable)</li>
 | ||
| </ol>
 | ||
| <p>knitr is an R package that supports a variety of documentation languages, such as LaTeX, Markdown, and HTML</p>
 | ||
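| <p>As a rough illustration of weaving and tangling, the sketch below writes a tiny R Markdown file and, if knitr is installed, weaves it with <code>knitr::knit</code> and tangles it with <code>knitr::purl</code>. The file names and the document contents are made up for this example.</p>

```r
# Build a tiny R Markdown document in a temporary directory.
# The fence string is assembled with strrep() only so that this example
# does not have to contain literal chunk delimiters.
fence <- strrep("`", 3)
rmd_lines <- c(
  "# A tiny report",
  "",
  "The mean of three numbers is computed in the chunk below.",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
)
rmd_file <- file.path(tempdir(), "tiny_report.Rmd")  # hypothetical file name
writeLines(rmd_lines, rmd_file)

md_file <- file.path(tempdir(), "tiny_report.md")
r_file  <- file.path(tempdir(), "tiny_report.R")

# knitr is an add-on package, so only call it when it is installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::knit(rmd_file, output = md_file, quiet = TRUE)  # weave: text + results
  knitr::purl(rmd_file, output = r_file, quiet = TRUE)   # tangle: code only
}
```

| <p>The woven Markdown file is the human-readable version; the tangled R script holds only the code chunks.</p>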
| <h3>Quick summary so far</h3>
 | ||
| <p>Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate</p>
 | ||
| <p>Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available</p>
 | ||
| <p>There is a growing number of tools for creating reproducible documents</p>
 | ||
| <p><strong>Golden Rule of Reproducibility: Script Everything</strong></p>
 | ||
| <h2>Steps in a Data Analysis</h2>
 | ||
| <ol>
 | ||
| <li>Define the question</li>
 | ||
| <li>Define the ideal data set</li>
 | ||
| <li>Determine what data you can access</li>
 | ||
| <li>Obtain the data</li>
 | ||
| <li>Clean the data</li>
 | ||
| <li>Exploratory data analysis</li>
 | ||
| <li>Statistical prediction/modeling</li>
 | ||
| <li>Interpret results</li>
 | ||
| <li>Challenge results</li>
 | ||
| <li>Synthesize/write up results</li>
 | ||
| <li>Create reproducible code</li>
 | ||
| </ol>
 | ||
| <p>"Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn't have a surplus of information and have to filter it out, or you had insufficient information and had to go find some?" -- Dan Meyer</p>
 | ||
| <p>Defining a question is perhaps the most powerful dimension reduction tool you can employ.</p>
 | ||
| <h3>An Example for #1</h3>
 | ||
| <p><strong>Start with a general question</strong></p>
 | ||
| <p>Can I automatically detect whether an email is SPAM?</p>
 | ||
| <p><strong>Make it concrete</strong></p>
 | ||
| <p>Can I use quantitative characteristics of emails to classify them as SPAM?</p>
 | ||
| <h3>Define the ideal data set</h3>
 | ||
| <p>The data set may depend on your goal</p>
 | ||
| <ul>
 | ||
| <li>Descriptive goal -- a whole population</li>
 | ||
| <li>Exploratory goal -- a random sample with many variables measured</li>
 | ||
| <li>Inferential goal -- The right population, randomly sampled</li>
 | ||
| <li>Predictive goal -- a training and test data set from the same population</li>
 | ||
| <li>Causal goal -- data from a randomized study</li>
 | ||
| <li>Mechanistic goal -- data about all components of the system</li>
 | ||
| </ul>
 | ||
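| <p>For the predictive goal, one possible way to obtain training and test sets from the same population is a random split of a single data set. The sketch below assumes a made-up data frame called <code>emails</code>; the column names and the 50/50 split are illustrative choices.</p>

```r
set.seed(2024)  # fix the seed so the split itself is reproducible

# Hypothetical stand-in for a real data set: 100 observations.
emails <- data.frame(
  id = 1:100,
  capitalAve = rexp(100, rate = 0.2)
)

# Randomly assign roughly half of the rows to the training set.
in_train <- rbinom(nrow(emails), size = 1, prob = 0.5) == 1
trainSet <- emails[in_train, ]
testSet  <- emails[!in_train, ]

# Every row lands in exactly one of the two sets.
nrow(trainSet) + nrow(testSet) == nrow(emails)
```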
| <h3>Determine what data you can access</h3>
 | ||
| <p>Sometimes you can find data free on the web</p>
 | ||
| <p>Other times you may need to buy the data</p>
 | ||
| <p>Be sure to respect the terms of use</p>
 | ||
| <p>If the data don't exist, you may need to generate them yourself.</p>
 | ||
| <h3>Obtain the data</h3>
 | ||
| <p>Try to obtain the raw data</p>
 | ||
| <p>Be sure to reference the source</p>
 | ||
| <p>Polite emails go a long way</p>
 | ||
| <p>If you load the data from an Internet source, record the URL and time accessed</p>
 | ||
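| <p>Recording the URL and access time can itself be scripted. The helper below, <code>record_source</code>, is a hypothetical function written for these notes (it is not part of any package), and the URL in the usage comment is a placeholder.</p>

```r
# Hypothetical helper: append one line per downloaded file recording
# the local file name, the source URL, and the time of access.
record_source <- function(url, dest_file,
                          log_file = "README_data_sources.txt") {
  entry <- sprintf("%s\t%s\t%s", dest_file, url,
                   format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"))
  cat(entry, "\n", sep = "", file = log_file, append = TRUE)
  invisible(entry)
}

# Typical use (placeholder URL, not a real data source):
# data_url <- "https://example.org/data.csv"
# download.file(data_url, destfile = "raw_data.csv")
# record_source(data_url, "raw_data.csv")
```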
| <h3>Clean the data</h3>
 | ||
| <p>Raw data often needs to be processed</p>
 | ||
| <p>If it is pre-processed, make sure you understand how</p>
 | ||
| <p>Understand the source of the data (census, sample, convenience sample, etc)</p>
 | ||
| <p>May need reformatting, subsampling -- record these steps</p>
 | ||
| <p><strong>Determine if the data are good enough</strong> -- If not, quit or change data</p>
 | ||
| <h3>Exploratory Data Analysis</h3>
 | ||
| <p>Look at summaries of the data</p>
 | ||
| <p>Check for missing data</p>
 | ||
| <p>-> Why is there missing data?</p>
 | ||
| <p>Look for outliers</p>
 | ||
| <p>Create exploratory plots</p>
 | ||
| <p>Perform exploratory analyses such as clustering</p>
 | ||
| <p>If a plot is hard to read because the values are bunched together, consider plotting the log base 10 of that axis (adding 1 first so that zeros stay finite)</p>
 | ||
| <p><code>plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)</code></p>
 | ||
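| <p>The exploratory steps above might look roughly like this in R. The data frame here is synthetic (the values are made up and two missing values are injected on purpose), so it only stands in for the real spam training data; the figure is written to a temporary PDF so the plot is scripted as well.</p>

```r
set.seed(1)

# Synthetic stand-in for the spam training data; all values are made up.
trainSpam <- data.frame(
  capitalAve = c(rexp(50, rate = 1 / 2), rexp(50, rate = 1 / 20)),
  type = factor(rep(c("nonspam", "spam"), each = 50))
)
trainSpam$capitalAve[c(3, 77)] <- NA  # inject missing values for the demo

# Look at summaries of the data (the NA count is reported too).
summary(trainSpam$capitalAve)

# Check for missing data, and whether it is concentrated in one class.
n_missing <- sum(is.na(trainSpam$capitalAve))
table(trainSpam$type, is.na(trainSpam$capitalAve))

# Exploratory plot on the log10 scale; adding 1 keeps zeros finite.
fig_file <- file.path(tempdir(), "capitalAve_by_type.pdf")
pdf(fig_file)
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
dev.off()
```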
| <h3>Statistical prediction/modeling</h3>
 | ||
| <p>Should be informed by the results of your exploratory analysis</p>
 | ||
| <p>Exact methods depend on the question of interest</p>
 | ||
| <p>Transformations/processing should be accounted for when necessary</p>
 | ||
| <p>Measures of uncertainty should be reported.</p>
 | ||
| <h3>Interpret Results</h3>
 | ||
| <p>Use the appropriate language</p>
 | ||
| <ul>
 | ||
| <li>Describes</li>
 | ||
| <li>Correlates with/associated with</li>
 | ||
| <li>Leads to/Causes</li>
 | ||
| <li>Predicts</li>
 | ||
| </ul>
 | ||
| <p>Give an explanation</p>
 | ||
| <p>Interpret Coefficients</p>
 | ||
| <p>Interpret measures of uncertainty</p>
 | ||
| <h3>Challenge Results</h3>
 | ||
| <p>Challenge all steps:</p>
 | ||
| <ul>
 | ||
| <li>Question</li>
 | ||
| <li>Data Source</li>
 | ||
| <li>Processing</li>
 | ||
| <li>Analysis</li>
 | ||
| <li>Conclusions</li>
 | ||
| </ul>
 | ||
| <p>Challenge measures of uncertainty</p>
 | ||
| <p>Challenge choices of terms to include in models</p>
 | ||
| <p>Think of potential alternative analyses</p>
 | ||
| <h3>Synthesize/Write-up Results</h3>
 | ||
| <p>Lead with the question</p>
 | ||
| <p>Summarize the analyses into the story</p>
 | ||
| <p>Don't include every analysis; include one only</p>
 | ||
| <ul>
 | ||
| <li>If it is needed for the story</li>
 | ||
| <li>If it is needed to address a challenge</li>
 | ||
| <li>Order analyses according to the story, rather than chronologically</li>
 | ||
| <li>Include "pretty" figures that contribute to the story</li>
 | ||
| </ul>
 | ||
| <h3>In the lecture example...</h3>
 | ||
| <p>Lead with the question</p>
 | ||
| <p>   Can I use quantitative characteristics of the emails to classify them as SPAM?</p>
 | ||
| <p>Describe the approach</p>
 | ||
| <p>   Collected data from UCI -> created training/test sets</p>
 | ||
| <p>   Explored Relationships</p>
 | ||
| <p>   Chose a logistic model on the training set by cross-validation</p>
 | ||
| <p>   Applied to test, 78% test set accuracy</p>
 | ||
| <p>Interpret results</p>
 | ||
| <p>   The number of dollar signs seems like a reasonable predictor, e.g. "Make more money with Viagra $ $ $ $"</p>
 | ||
| <p>Challenge Results</p>
 | ||
| <p>   78% isn't that great</p>
 | ||
| <p>   Could use more variables</p>
 | ||
| <p>   Why use logistic regression?</p>
 | ||
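| <p>A minimal sketch of the fit-then-evaluate part of this workflow is shown below. It uses synthetic data with a single made-up predictor named <code>charDollar</code>, fits one logistic regression rather than selecting among models by cross-validation, and will not reproduce the 78% figure from the lecture.</p>

```r
set.seed(42)
n <- 400

# Synthetic data: one made-up predictor loosely related to the class label.
charDollar <- rexp(n, rate = 5)
is_spam <- rbinom(n, size = 1, prob = plogis(-1 + 8 * charDollar))
emails <- data.frame(charDollar, is_spam)

# Split into training and test sets from the same population.
in_train <- rbinom(n, size = 1, prob = 0.5) == 1
trainSet <- emails[in_train, ]
testSet  <- emails[!in_train, ]

# Fit the logistic regression on the training set only.
fit <- glm(is_spam ~ charDollar, family = binomial, data = trainSet)

# Predict on the held-out test set and classify at a 0.5 threshold.
p_test <- predict(fit, newdata = testSet, type = "response")
pred <- as.integer(p_test > 0.5)

# Proportion of test emails classified correctly.
accuracy <- mean(pred == testSet$is_spam)
accuracy
```

| <p>Evaluating on rows that were never used for fitting is what makes the reported accuracy an honest estimate; the 0.5 threshold is an arbitrary choice for this sketch.</p>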
| <h2>Data Analysis Files</h2>
 | ||
| <p>Data</p>
 | ||
| <ul>
 | ||
| <li>Raw Data</li>
 | ||
| <li>Processed Data</li>
 | ||
| </ul>
 | ||
| <p>Figures</p>
 | ||
| <ul>
 | ||
| <li>Exploratory Figures</li>
 | ||
| <li>Final Figures</li>
 | ||
| </ul>
 | ||
| <p>R Code</p>
 | ||
| <ul>
 | ||
| <li>Raw/Unused Scripts</li>
 | ||
| <li>Final Scripts</li>
 | ||
| <li>R Markdown Files</li>
 | ||
| </ul>
 | ||
| <p>Text</p>
 | ||
| <ul>
 | ||
| <li>README files</li>
 | ||
| <li>Text of Analysis/Report</li>
 | ||
| </ul>
 | ||
| <h3>Raw Data</h3>
 | ||
| <p>Should be stored in the analysis folder</p>
 | ||
| <p>If accessed from the web, include URL, description, and date accessed in README</p>
 | ||
| <h3>Processed Data</h3>
 | ||
| <p>Processed data should be named so it is easy to see which script generated the data</p>
 | ||
| <p>The mapping from each processing script to the processed data it generates should be documented in the README</p>
 | ||
| <p>Processed data should be tidy</p>
 | ||
| <h3>Exploratory Figures</h3>
 | ||
| <p>Figures made during the course of your analysis, not necessarily part of your final report</p>
 | ||
| <p>They do not need to be "pretty"</p>
 | ||
| <h3>Final Figures</h3>
 | ||
| <p>Usually a small subset of the original figures</p>
 | ||
| <p>Axes/Colors set to make the figure clear</p>
 | ||
| <p>Possibly multiple panels</p>
 | ||
| <h3>Raw Scripts</h3>
 | ||
| <p>May be less commented (but comments help you!)</p>
 | ||
| <p>May be multiple versions</p>
 | ||
| <p>May include analyses that are later discarded</p>
 | ||
| <h3>Final Scripts</h3>
 | ||
| <p>Clearly commented</p>
 | ||
| <ul>
 | ||
| <li>
 | ||
| <p>Small comments liberally - what, when, why, how</p>
 | ||
| </li>
 | ||
| <li>Bigger commented blocks for whole sections</li>
 | ||
| </ul>
 | ||
| <p>Include processing details</p>
 | ||
| <p>Only analyses that appear in the final write-up</p>
 | ||
| <h3>R Markdown Files</h3>
 | ||
| <p>R Markdown files can be used to generate reproducible reports</p>
 | ||
| <p>Text and R code are integrated</p>
 | ||
| <p>Very easy to create in RStudio</p>
 | ||
| <h3>Readme Files</h3>
 | ||
| <p>Not necessary if you use R Markdown</p>
 | ||
| <p>Should contain step-by-step instructions for analysis</p>
 | ||
| <h3>Text of the document</h3>
 | ||
| <p>It should contain a title, introduction (motivation), methods (statistics you used), results (including measures of uncertainty), and conclusions (including potential problems)</p>
 | ||
| <p>It should tell a story</p>
 | ||
| <p>It should not include every analysis you performed</p>
 | ||
| <p>References should be included for statistical methods</p>
 | ||
|   </article>
 | ||
| </main>
 | ||
| 
 | ||
| <script src="themes/bitsandpieces/scripts/highlight.js"></script>
 | ||
| <script src="themes/bitsandpieces/scripts/mousetrap.min.js"></script>
 | ||
| <script type="text/x-mathjax-config">
 | ||
|   MathJax.Hub.Config({
 | ||
|     tex2jax: {
 | ||
|       inlineMath: [ ['$','$'], ["\\(","\\)"] ],
 | ||
|       processEscapes: true
 | ||
|     }
 | ||
|   });
 | ||
| </script>
 | ||
| 
 | ||
| <script type="text/javascript"
 | ||
|     src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
 | ||
| </script>
 | ||
| <script>
 | ||
|   hljs.initHighlightingOnLoad();
 | ||
|   
 | ||
|   document.querySelectorAll('.menuitem a').forEach(function(el) {
 | ||
|     if (el.getAttribute('data-shortcut').length > 0) {
 | ||
|       Mousetrap.bind(el.getAttribute('data-shortcut'), function() {
 | ||
|         location.assign(el.getAttribute('href'));
 | ||
|       });       
 | ||
|     }
 | ||
|   });
 | ||
| </script>
 | ||
| 
 | ||
| </body>
 | ||
| </html>
 |