The Case for Deterministic Imputation in Predictive Modeling

While multiple imputation is widely accepted for handling missing data in clinical research, its default use in predictive modeling may be inappropriate. Multiple imputation relies on access to the outcome variable to avoid bias, an assumption that breaks down in real-world deployment where the outcome is unknown. This talk argues that deterministic imputation methods, which do not depend on the outcome and are computationally efficient, are better suited for building predictive models intended for deployment. We present theoretical results and simulation evidence demonstrating that deterministic imputation maintains model validity and performance without introducing information leakage. We conclude that for predictive tasks, particularly in clinical settings where transparency, reproducibility, and alignment with deployment conditions are essential, deterministic imputation should be the standard.

The Why Behind Including Y in your Imputation Model

Handling missing data is a frequent challenge in analyses of health data, and imputation techniques are often employed to address this issue. This talk focuses on scenarios where a covariate with missing values is to be imputed and examines the prevailing recommendation to include the outcome variable in the imputation model. Specifically, we delve into stochastic imputation methods and their effects on accurately estimating the relationship between the imputed covariate and the outcome. Through mathematical proofs and a series of simulations, we demonstrate that incorporating the outcome variable in imputation models is essential for achieving unbiased results with stochastic imputation. Furthermore, we address the concern that this practice constitutes “double dipping” or data dredging. By providing both theoretical and empirical evidence, we show why including the outcome variable is a legitimate and necessary approach rather than a source of bias.

Untangling Causal Effects: Understanding the Limits of Statistics

This talk will delve into two major causal inference obstacles: (1) identifying which variables to account for and (2) assessing the impact of unmeasured variables. The first half of the talk will showcase a Causal Quartet. In the spirit of Anscombe’s Quartet, this is a set of four datasets with identical statistical properties, yet different true causal effects due to differing data generating mechanisms. These simple datasets provide a straightforward example for statisticians to point to when explaining these concepts to collaborators and students. The second half of the talk will focus on how statistical techniques can be leveraged to examine the impact of a potential unmeasured confounder. We will examine sensitivity analyses under several scenarios with varying levels of information about potential unmeasured confounders, introducing the tipr R package, which provides tools for conducting sensitivity analyses in a flexible and accessible manner.

It’s ME hi, I’m the collider it’s ME

This talk will focus on framing measurement error as a collider from a causal inference perspective. We will begin by demonstrating how to visually display measurement error in directed acyclic graphs (DAGs). We will then show how these graphs can be used to help communicate when corrections for measurement error are needed and how to implement these corrections in order to estimate unbiased effects. Finally, we will demonstrate how sensitivity analyses traditionally used to address omitted variable bias can be used to quantify the potential impact of measurement error.

Including the outcome in your imputation model – why isn’t this ‘double dipping’?

An often repeated question is whether including the outcome in an imputation model is ‘double dipping’ or ‘peeking’ at the outcome in a way that can negatively impact the Type 1 error in studies. This talk will dive into this myth and help dispel these concerns. We mathematically demonstrate that including the outcome variable in imputation models when using stochastic methods is required to avoid biased results. A discussion of these results along with practical advice will follow.

Power and sample size calculations for testing the ratio of reproductive values in phylogenetic samples

The quality of the inferences we make from pathogen sequence data is determined by the number and composition of pathogen sequences that make up the sample used to drive that inference. However, there remains limited guidance on how to best structure and power studies when the end goal is phylogenetic inference. One question that we can attempt to answer with molecular data is whether some people are more likely to transmit a pathogen than others. In this talk we will present an estimator to quantify differential transmission, as measured by the ratio of reproductive numbers between people with different characteristics, using transmission pairs linked by molecular data, along with a sample size calculation for this estimator. We will also provide extensions to our method to correct for imperfect identification of transmission linked pairs, overdispersion in the transmission process, and group imbalance. We validate this method via simulation and provide tools to implement it in an R package, phylosamp.

Evaluating the Alignment of a Data Analysis between Analyst and Audience

A challenge that all data analysts face is building a data analysis that is useful for a given audience. In this talk, we will begin by proposing a set of principles for describing data analyses. We will then introduce a concept that we call the alignment of a data analysis between the data analyst and audience. We define a successfully aligned data analysis as the matching of principles between the analyst and the audience for whom the analysis is developed. We will propose a statistical model and general framework for evaluating the alignment of a data analysis. This framework can be used as a guide for practicing data scientists and students in data science courses for how to build better data analyses.

Why You Must Include the Outcome in Your Imputation Model (and Why It’s Not Double Dipping)

Handling missing data is a frequent challenge in analyses of health data, and imputation techniques are often employed to address this issue. This talk focuses on scenarios where a covariate with missing values is to be imputed and examines the prevailing recommendation to include the outcome variable in the imputation model. Specifically, we delve into stochastic imputation methods and their effects on accurately estimating the relationship between the imputed covariate and the outcome. Through mathematical proofs and a series of simulations, we demonstrate that incorporating the outcome variable in imputation models is essential for achieving unbiased results with stochastic imputation. Furthermore, we address the concern that this practice constitutes “double dipping” or data dredging. By providing both theoretical and empirical evidence, we show why including the outcome variable is a legitimate and necessary approach rather than a source of bias.

Statistical Rigor in Academic Medicine: Practical Strategies for Improvement

This talk will focus on best practices for using modern statistics in health sciences.

A Framework for Developing AI-powered Negotiation Agents to Examine Producer-Consumer Dynamics in Data Science

This talk will discuss a framework for developing AI-powered agents for observing data analytic negotiations between an analyst and client.

When to Include the Outcome in Your Imputation Model: A Mathematical Demonstration and Practical Advice

Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address this issue. This talk will investigate the scenario where a covariate used in an analysis has missingness and will be imputed. There are recommendations to include the outcome from the analysis model in the imputation model for missing covariates, but it is not necessarily clear if this recommendation always holds and why this is sometimes true. We examine deterministic imputation (i.e., single imputation with a fixed value) and stochastic imputation (i.e., single or multiple imputation with random values) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement to achieve unbiased results when using stochastic imputation methods. Likewise, we mathematically demonstrate that including the outcome variable in imputation models when using deterministic methods is not recommended, and doing so will induce biased results. A discussion of these results along with practical advice will follow.

Bridging the Gap Between Theory and Practice: When to Include the Outcome in Your Imputation Model

Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address this issue. This talk will investigate the scenario where a covariate used in an analysis has missingness and will be imputed. There are recommendations to include the outcome from the analysis model in the imputation model for missing covariates, but it is not necessarily clear if this recommendation always holds and why this is sometimes true. We examine deterministic imputation (i.e., single imputation with a fixed value) and stochastic imputation (i.e., single or multiple imputation with random values) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement to achieve unbiased results when using stochastic imputation methods. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This talk aims to bridge the gap between imputation in theory and in practice, providing mathematical derivations to explain common statistical recommendations.

Bridging the Gap Between Imputation Theory and Practice

Handling missing data presents a significant challenge in epidemiological data analysis, with imputation frequently employed to handle this issue. It is often advised to use the outcome variable in the imputation model for missing covariates, though the rationale of this advice is not always clear. This presentation will explore both deterministic imputation (i.e., single imputation using fixed values) and stochastic imputation (i.e., single or multiple imputation using random values) approaches and their effects on estimating the association between an imputed covariate and outcome. We will show that the inclusion of the outcome variable in imputation models is not merely a suggestion but a necessity for obtaining unbiased estimates in stochastic imputation approaches. Furthermore, we will clarify misconceptions regarding deterministic imputation models and explain why the outcome variable should be excluded from these models. The goal of this presentation is to connect theory behind imputation and its practical application, offering mathematical proofs to elucidate common statistical guidelines.

Causal Inference is Not Just a Statistics Problem

In this talk we will discuss four datasets, similar to Anscombe’s quartet, that aim to highlight the challenges involved when estimating causal effects. Each of the four datasets is generated based on a distinct causal mechanism: the first involves a collider, the second involves a confounder, the third involves a mediator, and the fourth involves the induction of M-Bias by an included factor. Despite the fact that the statistical summaries and visualizations for each dataset are identical, the true causal effect differs, and estimating it correctly requires knowledge of the data-generating mechanism. These example datasets can help practitioners gain a better understanding of the assumptions underlying causal inference methods and emphasize the importance of gathering more information beyond what can be obtained from statistical tools alone.

Integrating Design Thinking in the Data Analytic Process

As biostatisticians, we are often tasked with collaborating on a data analysis with many stakeholders. While much has been written about statistical thinking when designing these analyses, a complementary form of thinking that appears in the practice of data analysis is design thinking – the problem-solving process to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle differences in how a biostatistician (or producer of a data analysis) constructs, creates, or designs a data analysis, including differences in the choice of methods, tooling, and workflow. These choices can affect the data analysis products themselves and the experience of the consumer of the data analysis. Therefore, the role of a producer can be thought of as designing the data analysis with a set of design principles. This talk will introduce six design principles for data analysis and describe how they can be mapped to data analyses in a quantitative and informative manner. We also provide empirical evidence of variation of these principles within and between producers of data analyses. We then provide a mathematical framework for alignment between the data analysts and their audience. This will hopefully provide guidance for future work in characterizing the data analytic process.

Analytic Design Theory: Framework for Alignment Between Analyst and Audience

This talk will introduce six design principles for data analysis and describe how they can be mapped to data analyses in a quantitative and informative manner. We will then introduce a mathematical framework for describing the alignment in these principles between the analyst and audience.

Moving Towards Best Practice: Dispelling Statistical Myths

This talk will focus on best practices for using modern statistics in health sciences.

Design Principles of Data Analysis

The data revolution has sparked greater interest in data analysis practices. While much attention has been given to statistical thinking, another type of complementary thinking that appears in data analysis is design thinking – a problem-solving approach focused on understanding the intended users of a product. When facing a problem, differences arise in how data analysts construct data analyses, including choices in methods, tools, and workflows. These choices impact the analysis outputs and user experience. Therefore, a data analyst’s role can be seen as designing the analysis with specific principles. This webinar will introduce six design principles for data analysis and describe how they can be mapped to data analyses in a quantitative and informative manner. We also provide empirical evidence of variation of these principles within and between data analysts. This will hopefully provide guidance for future work in characterizing the data analytic process.

Design Principles of Data Analysis

The data revolution has sparked greater interest in data analysis practices. While much attention has been given to statistical thinking, another type of complementary thinking that appears in data analysis is design thinking – a problem-solving approach focused on understanding the intended users of a product. When facing a problem, differences arise in how data analysts construct data analyses, including choices in methods, tools, and workflows. These choices impact the analysis outputs and user experience. Therefore, a data analyst’s role can be seen as designing the analysis with specific principles. This webinar will introduce six design principles for data analysis and describe how they can be mapped to data analyses in a quantitative and informative manner. We also provide empirical evidence of variation of these principles within and between data analysts. This will hopefully provide guidance for future work in characterizing the data analytic process.

The ‘Why’ Behind Including ‘Y’ in Your Imputation Model

Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address this issue. In this talk, we investigate the scenario where covariates used in an analysis have missingness and will be imputed. There are recommendations to include the ultimate outcome in the imputation model for missing covariates, but it’s not necessarily clear when this recommendation holds and why this is true. We examine deterministic imputation (i.e., single imputation where the imputed values are treated as fixed) and stochastic imputation (i.e., single imputation with a random value or multiple imputation) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement to achieve unbiased results when using stochastic imputation methods. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This talk aims to bridge the gap between imputation in theory and practice, providing mathematical derivations to explain common statistical recommendations. We offer a better understanding of the considerations involved in imputing missing covariates and emphasize when it is necessary to include the outcome variable in the imputation model.

Causal Quartet: When Statistics Alone Do Not Tell the Full Story

This talk will delve into two major causal inference obstacles: (1) identifying which variables to account for and (2) assessing the impact of unmeasured variables. The first half of the talk will showcase a Causal Quartet. In the spirit of Anscombe’s Quartet, this is a set of four datasets with identical statistical properties, yet different true causal effects due to differing data generating mechanisms. These simple datasets provide a straightforward example for statisticians to point to when explaining these concepts to collaborators and students. The second half of the talk will focus on how statistical techniques can be leveraged to examine the impact of a potential unmeasured confounder. We will examine sensitivity analyses under several scenarios with varying levels of information about potential unmeasured confounders, introducing the tipr R package, which provides tools for conducting sensitivity analyses in a flexible and accessible manner.

Estimating Causal Effects: This be Madness, Yet There is Method in It

This talk will delve into two major causal inference obstacles: (1) identifying which variables to account for and (2) assessing the impact of unmeasured variables. The first half of the talk will showcase a Causal Quartet. In the spirit of Anscombe’s Quartet, this is a set of four datasets with identical statistical properties, yet different true causal effects due to differing data generating mechanisms. These simple datasets provide a straightforward example for statisticians to point to when explaining these concepts to collaborators and students. To adjust or not adjust, that is the question; we demonstrate that statistics alone cannot be used to establish which variables to adjust for when estimating causal effects. The second half of the talk will focus on how statistical techniques can be leveraged to address unmeasured confounding. We will examine sensitivity analyses under several scenarios with varying levels of information about potential unmeasured confounders. These techniques will be applied using the tipr R package, which provides tools for conducting sensitivity analyses in a flexible and accessible manner.

A Visual Diagnostic Tool for Causal Inference

This talk covers several diagnostic plots that are helpful when attempting to answer a causal question. They can be used to visualize the target population, balance, and treatment effect heterogeneity.

Causal Inference is Not a Statistics Problem

In this session, Dr. Lucy D’Agostino McGowan will discuss some of the major challenges in causal inference, and why statistical tools alone cannot uncover the data-generating mechanism when attempting to answer causal questions. As part of this talk, Lucy will showcase the Causal Quartet, which consists of four datasets that have the same statistical properties, but different true causal effects due to different ways in which the data was generated. These examples illustrate the limitations of relying solely on statistical tools in data analyses and highlight the crucial role of domain-specific knowledge.

Causal Inference Challenges that Go Beyond Statistics

This talk will delve into two major causal inference obstacles: (1) identifying which variables to account for and (2) assessing the impact of unmeasured variables. The first half of the talk will showcase a Causal Quartet. In the spirit of Anscombe’s Quartet, this is a set of four datasets with identical statistical properties, yet different true causal effects due to differing data generating mechanisms. These simple datasets provide a straightforward example for biostatisticians to point to when explaining these concepts to collaborators and students. Here, statistics can’t solve your causal inference problem because statistics alone cannot be used to establish which variables to adjust for when estimating causal effects. Statistics can help us explore the impact of unmeasured variables. The second half of the talk will focus on how statistical techniques can be leveraged to address unmeasured confounding. We will examine sensitivity analyses under several scenarios with varying levels of information about potential unmeasured confounders. These techniques will be applied using the tipr R package, which provides tools for conducting sensitivity analyses in a flexible and accessible manner.

Practical Principles for Data Analysis Design

The data revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking – the problem-solving process to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle differences in how a data analyst (or producer of a data analysis) constructs, creates, or designs a data analysis, including differences in the choice of methods, tooling, and workflow. These choices can affect the data analysis products themselves and the experience of the consumer of the data analysis. Therefore, the role of a producer can be thought of as designing the data analysis with a set of design principles. This talk will introduce six design principles for data analysis and describe how they can be mapped to data analyses in a quantitative and informative manner. We also provide empirical evidence of variation of these principles within and between producers of data analyses. This will hopefully provide guidance for future work in characterizing the data analytic process.

Statistical Myths and Misuse: Abandoning Outdated Statistical Practices

This talk will focus on best practices for using modern statistics in health sciences.

ConTESSA: A Shiny Application to Assist with Evaluating the Impact of COVID-19 test-trace-isolate Programs

This talk will focus on an application, ConTESSA, along with the accompanying R package, tti, designed to help quantify the impact of contact tracing programs. The talk will walk through the technical aspects of the underlying model as well as highlight how R, and in particular shiny, were used to create this product.

Design Thinking: Empirical Evidence for Six Principles of Data Analysis

The data revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking – the problem-solving process to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle differences in how a data analyst (or producer of a data analysis) constructs, creates, or designs a data analysis, including differences in the choice of methods, tooling, and workflow. These choices can affect the data analysis products themselves and the experience of the consumer of the data analysis. Therefore, the role of a producer can be thought of as designing the data analysis with a set of design principles. This talk will introduce six design principles for data analysis and describe how they can be mapped to data analyses in a quantitative and informative manner. We also provide empirical evidence of variation of these principles within and between producers of data analyses. This will hopefully provide guidance for future work in characterizing the data analytic process.

Modern Statistical Communication in the Social Media Era

This talk will focus on leveraging social media to communicate statistical concepts. From summarizing other’s content to promoting your own work, we will discuss best practices for effective statistical communication that simultaneously is clear, engaging, and understandable while remaining rigorous and mathematically correct. It is increasingly important for people to be able to sift through what is important and what is noise, what is evidence and what is an anecdote. This talk focuses on techniques to strike an appropriate balance, with specifics on how to communicate complex statistical concepts in an engaging manner without sacrificing truth and content.

The Journey to True: Accurate Statistical Communication

Clear statistical communication is both an educational and public health priority. This talk will focus on best practices for effective statistical communication that simultaneously is clear, engaging, and understandable while remaining rigorous and mathematically correct. It is increasingly important for people to be able to sift through what is important and what is noise, what is evidence and what is an anecdote. This talk focuses on techniques to strike an appropriate balance, with specifics on how to communicate complex statistical concepts in an engaging manner without sacrificing truth and content.

Bringing Data Science Communication into the Classroom

Data Science, as a broad and interdisciplinary field, is one of the fastest growing areas of student interest (and employment opportunities). The traditional introductory statistics courses that would typically serve as a gateway to data science need modernized curricula and pedagogy in order to adapt to today’s increasingly large and complex data sources and data science questions. In this session, we share our experience to address the following issues: • What constitutes the fundamentals of good data science practice? • How to teach a data science course with innovative pedagogy? • How to improve communication skills to bridge data scientists and practitioners? • How to take advantage of virtual learning? Discussant: Linda Zhao Speakers: Leanna House on Adapting student engagement strategies for a virtual environment, Lucy D’Agostino McGowan on Bringing Data Science Communication into the Classroom, and Nusrat Jahan on Data Science Education in Undergraduate Setting. There will be three speakers and a discussant in this session.

Examining the Impact of Software Instruction on Completion of Data Analysis Tasks

We are interested in studying best practices for introducing students in statistics or data science to the programming language R. The “tidyverse” is a suite of R packages created to help with common statistics and data science tasks that follow a consistent philosophy. We have created two sets of online learning modules, one that introduces tidyverse concepts first and then dives into idiosyncrasies of R as a programming language, the second that takes a more traditional approach, first introducing R broadly and then following with an introduction to a particular suite of packages, the tidyverse. We have created a randomized study to examine whether the order certain concepts are introduced impacts whether learning objectives are met and/or how engaged students are with the material. This talk will focus on the mechanics of this study: how it was designed, how we enrolled participants, and how we evaluated outcomes.

Let’s Get Meta: Analyzing your R code with tidycode

This talk will cover two R packages: matahari ( https://github.com/jhudsl/matahari) and tidycode ( https://lucymcgowan.github.io/tidycode/). The matahari package is a simple package for tidy logging of everything you type into the R console. The tidycode package allows users to analyze R expressions in a tidy way (i.e. take the code captured from matahari and put it in a tidy table for downstream analysis with the tidyverse).

Hill’s Criteria for the Data Scientist: Incorporating Causal Inference Techniques

This talk will walk through Sir Austin Bradford Hill’s viewpoints for causality, using XKCD comics along the way.

Designing Randomized Studies using Shiny

This talk will walk through building a self-contained randomized study using Shiny and learnr modules. We will discuss building informed consent, the randomization process, demographic surveys, and R-based studies into a single online framework to allow users to seamlessly enroll and participate in randomized studies via a single URL. The talk will include both practical recommendations as well as technical code snippets.

Applied Demonstration of the tipr R Package

This talk will focus on the tipr R package.

ConTESSA: A Shiny App to Help Quantify Contact Tracing Efficacy

This talk will focus on an application, ConTESSA, along with the accompanying R package, tti, designed to help quantify the efficacy of contact tracing programs. The talk will walk through the technical aspects of the underlying model as well as highlight how R, and in particular shiny, were used to create this product.

The Ups and Downs of Communicating Complex Statistics

In the age of “big data” there is an information overload. It is increasingly important for people to be able to sift through what is important and what is noise, what is evidence and what is an anecdote. Accordingly, the effective communication of statistical concepts to diverse audiences is currently an education and public health priority. This talk focuses on techniques to strike an appropriate balance, with specifics on how to communicate complex statistical concepts in an engaging manner without sacrificing truth and content, specifically addressing how to help the general public read past headlines to the actual evidence, or lack there of. We will discuss engaging with the public via organizations such as TED Ed - focusing both best practices and lessons learned.

Best Practices for Teaching R A Randomized Controlled Trial

We are interested in studying best practices for introducing students in statistics or data science to the programming language R. The “tidyverse” is a suite of R packages created to help with common statistics and data science tasks that follow a consistent philosophy. We have created two sets of online learning modules, one that introduces tidyverse concepts first and then dives into idiosyncrasies of R as a programming language, the second that takes a more traditional approach, first introducing R broadly and then following with an introduction to a particular suite of packages, the tidyverse. We have created a randomized study to examine whether the order certain concepts are introduced impacts whether learning objectives are met and/or how engaged students are with the material. This talk will focus on the mechanics of this study: how it was designed, how we enrolled participants, and how we evaluated outcomes.

Tools for Analyzing R code the Tidy Way

With the current emphasis on reproducibility and replicability, there is an increasing need to examine how data analyses are conducted. In order to analyze the between researcher variability in data analysis choices as well as the aspects within the data analysis pipeline that contribute to the variability in results, we have created two R packages: matahari and tidycode. These packages build on methods created for natural language processing; rather than allowing for the processing of natural language, we focus on R code as the substrate of interest. The matahari package facilitates the logging of everything that is typed in the R console or in an R script in a tidy data frame. The tidycode package contains tools to allow for analyzing R calls in a tidy manner. We demonstrate the utility of these packages as well as walk through two examples.

Challenges in Augmenting Randomized Trials with Observational Health Records

This talk addresses challenges with making health record data and clinical trial data compatible. The data collected in trials is collected regularly and in an organized way, while data from health records is messier and more haphazard. A clinical trial has a clear start and endpoint, while health record data is collected continuously. Additionally, clinical trial participants may be healthier than patients we see in health records. Covariates are defined in advance for a trial, but must be predicted or imputed from the health record. In this talk I will discuss some of the challenges we have encountered in trying to integrate trial data with observational health records to improve power and design new trials.

Exploring Finite-sample Bias in Propensity Score Weights

The principle limitation of all observational studies is the potential for unmeasured confounding. Various study designs may perform similarly in controlling for bias due to measured confounders while differing in their sensitivity to unmeasured confounding. Design sensitivity (Rosenbaum, 2004) quantifies the strength of an unmeasured confounder needed to nullify an observed finding. In this presentation, we explore how robust certain study designs are to various unmeasured confounding scenarios. We focus particularly on two exciting new study designs - ATM and ATO weights. We illustrate the performance in a large electronic health records based study and provide recommendations for sensitivity to unmeasured confounding analyses in ATM and ATO weighted studies, focusing primarily on the potential reduction in finite-sample bias.

Making Causal Claims as a Data Scientist: Tips and Tricks Using R

Making believable causal claims can be difficult, especially with the much repeated adage “correlation is not causation”. This talk will walk through some tools often used to practice safe causation, such as propensity scores and sensitivity analyses. In addition, we will cover principles that suggest causation such as the understanding of counterfactuals, and applying Hill’s criteria in a data science setting. We will walk through specific examples, as well as provide R code for all methods discussed.

papr: Tinder for pre-prints, a Shiny Application for collecting gut-reactions to pre-prints from the scientific community

papr is an R Shiny web application and social network for evaluating bioRxiv pre-prints. The app serves multiple purposes, allowing the user to quickly swipe through pertinent abstracts as well as find a community of researchers with similar interests. It also serves as a portal for accessible “open science”, getting abstracts into the hands of users of all skill levels. Additionally, the data could help build a general understanding of what research the community finds exciting.

We allow the user to log in via Google to track multiple sessions and have implemented a recommender engine, allowing us to tailor which abstracts are shown based on each user’s previous abstract rankings. While using the app, users view an abstract pulled from bioRxiv and rate it as “exciting and correct”, “exciting and questionable”, “boring and correct”, or “boring and questionable” by swiping the abstract in a given direction. The app includes optional social network features, connecting users who provide their twitter handle to users who enjoy similar papers.

This presentation will demonstrate how to incorporate tactile interfaces, such as swiping, into a Shiny application using a package we created for this functionality shinysense, store real-time user data on Dropbox using drop2, login in capabilities using googleAuthR and googleID, how to implement a recommender engine using principle component analysis, and how we have handled issues of data safety/security through proactive planning and risk mitigation. Finally, we will report the app activity, summarizing both the user traffic and what research users are finding exciting.

Streamline Your Workflow: Integrating SAS, LaTeX, and R into a Single Reproducible Document

There is an industry-wide push toward making workflows seamless and reproducible. Incorporating reproducibility into the workflow has many benefits; among them are increased transparency, time savings, and accuracy. We walk through how to seamlessly integrate SAS®, LaTeX, and R into a single reproducible document. We also discuss best practices for general principles such as literate programming and version control.

Simplifying and Contextualizing Sensitivity to Unmeasured Confounding Tipping Point Analyses

The strength of evidence provided by epidemiological and observational studies is inherently limited by the potential for unmeasured confounding. Thus, we would expect every observational study to include a quantitative sensitivity to unmeasured confounding analysis. However, we reviewed 90 recent studies with statistically significant findings, published in top tier journals, and found 41 mentioned the issue of unmeasured confounding as a limitation, but only 4 included a quantitative sensitivity analysis. Moreover, the rule of thumb that considers effects 2 or greater as robust can be misleading in being too low for studies missing an important confounder and too high for studies that extensively control for confounding. We simplify the seminal work of Rosenbaum and Rubin (1983) and Lin, Pstay, and Kronmal (1998). We focus on three key quantities: the observed bound of the confidence interval closest to the null, a plausible residual effect size for an unmeasured binary confounder, and a realistic prevalence difference for this hypothetical confounder. We offer guidelines to researchers for anchoring the tipping point analysis in the context of the study and provide examples.

Assessing the Association Between Accident Injury Severity and NCAP Car Safety Ratings”

The U.S. New Car Assessment Program (NCAP) evaluates the safety of new cars through their 5-Star Safety Ratings program. In 2010, this program enhanced their protocol, making the ratings more stringent for cars in model years 2011 and onwards. We are interested in assessing this rating system’s ability to predict accident injury severity. To evaluate this question, we use data reported in the National Highway Traffic Safety Administration’s (NHTSA) General Estimates System (GES) database for the years 2011 to 2014, matched to NCAP overall safety ratings for 291 unique make, model, model year combinations. We fit a proportional odds regression model predicting injury severity for 23,641 individual passengers involved in car crashes, adjusting for accident-level covariates, such as the speed of the car and point of impact, and individual-level covariates, such as age and seating position.

Integrating SAS and R to Perform Optimal Propensity Score Matching

In studies where randomization is not possible, imbalance in baseline covariates (confounding by indication) is a fundamental concern. Propensity score matching (PSM) is a popular method to minimize this potential bias, matching individuals who received treatment to those who did not, to reduce the imbalance in pre-treatment covariate distributions. PSM methods continue to advance, as computing resources expand. Optimal matching, which selects the set of matches that minimizes the average difference in propensity scores between mates, has been shown to outperform less computationally intensive methods. However, many find the implementation daunting. SAS/IML® software allows the integration of optimal matching routines that execute in R, e.g. the R optmatch package. This presentation walks through performing optimal PSM in SAS® through implementing R functions, assessing whether covariate trimming is necessary prior to PSM. It covers the propensity score analysis in SAS, the matching procedure, and the post-matching assessment of covariate balance using SAS/STAT® 13.2 and SAS/IML procedures.

Using PROC SURVEYREG and PROC SURVEYLOGISTIC to Assess Potential Bias

The Behavioral Risk Factor Surveillance System (BRFSS) collects data on health practices and risk behaviors via telephone survey. This study focuses on the question, On average, how many hours of sleep do you get in a 24-hour period? Recall bias is a potential concern in interviews and questionnaires, such as BRFSS. The 2013 BRFSS data is used to illustrate the proper methods for implementing PROC SURVEYREG and PROC SURVEYLOGISTIC, using the complex weighting scheme that BRFSS provides.

Using SAS/STAT® Software to Validate a Health Literacy Prediction Model in a Primary Care Setting

Existing health literacy assessment tools developed for research purposes have constraints that limit their utility for clinical practice. The measurement of health literacy in clinical practice can be impractical due to the time requirements of existing assessment tools. Single Item Literacy Screener (SILS) items, which are self-administered brief screening questions, have been developed to address this constraint. We developed a model to predict limited health literacy that consists of two SILS and demographic information (for example, age, race, and education status) using a sample of patients in a St. Louis emergency department. In this paper, we validate this prediction model in a separate sample of patients visiting a primary care clinic in St. Louis. Using the prediction model developed in the previous study, we use SAS/STAT® software to validate this model based on three goodness of fit criteria: rescaled R-squared, AIC, and BIC. We compare models using two different measures of health literacy, Newest Vital Sign (NVS) and Rapid Assessment of Health Literacy in Medicine Revised (REALM-R). We evaluate the prediction model by examining the concordance, area under the ROC curve, sensitivity, specificity, kappa, and gamma statistics. Preliminary results show 69% concordance when comparing the model results to the REALM-R and 66% concordance when comparing to the NVS. Our conclusion is that validating a prediction model for inadequate health literacy would provide a feasible way to assess health literacy in fast-paced clinical settings. This would allow us to reach patients with limited health literacy with educational interventions and better meet their information needs.

Small Areal Estimation of Racial Disparities in Diabetes Using Multilevel Reweighted Regression

Introduction: The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized. This disconnect may be slowing the progress made in reducing these disparities. We use a small area analysis technique to fill the void for county level disparities data. Methods: Behavioral Risk Factor Surveillance System data is used to estimate the prevalence of diabetes by county among Non-Hispanic Whites and Non-Hispanic Blacks. A modified weighting system was developed based on demographics at the county-level. A multilevel reweighted regression model is fit to obtain county level prevalence estimates by race. To examine whether racial disparities exist at the county-level, these rates are compared using risk difference and rate ratio. Results: The District of Columbia was ranked as having the largest average disparity in both absolute and relative terms (risk difference and risk ratio). Based on the average risk difference of counties within a state, the next five states with the largest average disparity are: Massachusetts, Kansas, Ohio, North Carolina, and Kentucky. The next five states with the largest average relative disparity, calculated with rate ratio, were: Massachusetts, Colorado, Kansas, Illinois, and Ohio. Discussion: Addressing disparities based on factors such as race/ethnicity, geographic location, and socioeconomic status is a current public health priority. This study takes a first step in developing the statistical infrastructure needed to target disparities interventions and resources to the local areas with greatest need.

Using PROC GLIMMIX and PROC SGPLOT to Demonstrate County-level Racial Disparities in Obesity in North Carolina

The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized. This disconnect may be slowing the progress in reducing these disparities. Behavioral Risk Factor Surveillance System data is used to estimate the prevalence of obesity by county among Non-Hispanic Whites and Non-Hispanic Blacks. A modified weighting system was developed based on demographics at the county-level, and a multilevel reweighted regression model using PROC GLIMMIX is fit to obtain county-level prevalence estimates by race. To examine whether racial disparities exist at the county-level, these rates are compared using risk difference and rate ratio. These county-level estimates are then compared graphically using PROC SGPLOT. The distribution of prevalence estimates for Blacks is shifted to the right in comparison to the distribution for Whites; based on a two-sample test for differences in proportions the mean of the distribution of obesity prevalence estimates for Blacks is 35.7% higher than for Whites in North Carolina. This difference is statistically significant (p<.0001). Addressing disparities based on factors such as race/ethnicity, geographic location, and socioeconomic status is a current public health priority. This study takes a first step in developing the statistical infrastructure needed to target disparities interventions and resources to the local areas with greatest need as well as providing a graphical representation of disparities, allowing for the implementation of interventions and dissemination of information to occur more effectively and efficiently.

Using PROC GLIMMIX and PROC SGPLOT to Demonstrate County-level Racial Disparities in Obesity in North Carolina

The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized. This disconnect may be slowing the progress in reducing these disparities. Behavioral Risk Factor Surveillance System data is used to estimate the prevalence of obesity by county among Non-Hispanic Whites and Non-Hispanic Blacks. A modified weighting system was developed based on demographics at the county-level, and a multilevel reweighted regression model using PROC GLIMMIX is fit to obtain county-level prevalence estimates by race. To examine whether racial disparities exist at the county-level, these rates are compared using risk difference and rate ratio. These county-level estimates are then compared graphically using PROC SGPLOT. The distribution of prevalence estimates for Blacks is shifted to the right in comparison to the distribution for Whites; based on a two-sample test for differences in proportions the mean of the distribution of obesity prevalence estimates for Blacks is 35.7% higher than for Whites in North Carolina. This difference is statistically significant (p<.0001). Addressing disparities based on factors such as race/ethnicity, geographic location, and socioeconomic status is a current public health priority. This study takes a first step in developing the statistical infrastructure needed to target disparities interventions and resources to the local areas with greatest need as well as providing a graphical representation of disparities, allowing for the implementation of interventions and dissemination of information to occur more effectively and efficiently.

Multilevel Reweighted Regression Models to Estimate County-Level Racial Health Disparities Using PROC GLIMMIX

The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized, and this disconnect may be slowing the progress made in reducing these disparities. This paper focuses on establishing county-level prevalence estimates of diabetes among Non-Hispanic Whites and Non-Hispanic Blacks. These estimates are produced using multilevel reweighted regression models through the GLIMMIX procedure with 2006-2010 Behavioral Risk Factor Surveillance System data and 2010 census data. To examine whether racial disparities exist at the county level, the paper estimates the risk difference of prevalence estimates between races. It subsequently ranks counties and states by the magnitude of disparities.

Oral Presentation