Design Thinking: Empirical Evidence for Six Principles of Data Analysis

The data revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking – the problem-solving process to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle differences in how a data analyst (or producer of a data analysis) constructs, creates, or designs a data analysis, including differences in the choice of methods, tooling, and workflow. These choices can affect the data analysis products themselves and the experience of the consumer of the data analysis. Therefore, the role of a producer can be thought of as designing the data analysis with a set of design principles. This talk will introduce six design principles for data analysis and describe how they can be mapped to data analyses in a quantitative and informative manner. We also provide empirical evidence of variation of these principles within and between producers of data analyses. This will hopefully provide guidance for future work in characterizing the data analytic process.

Bringing Data Science Communication into the Classroom

Data Science, as a broad and interdisciplinary field, is one of the fastest growing areas of student interest (and employment opportunities). The traditional introductory statistics courses that would typically serve as a gateway to data science need modernized curricula and pedagogy in order to adapt to today’s increasingly large and complex data sources and data science questions. In this session, we share our experience addressing the following issues:

• What constitutes the fundamentals of good data science practice?
• How to teach a data science course with innovative pedagogy?
• How to improve communication skills to bridge data scientists and practitioners?
• How to take advantage of virtual learning?

The session has three speakers and a discussant. Discussant: Linda Zhao. Speakers: Leanna House on Adapting student engagement strategies for a virtual environment, Lucy D’Agostino McGowan on Bringing Data Science Communication into the Classroom, and Nusrat Jahan on Data Science Education in Undergraduate Setting.

Designing Randomized Studies using Shiny

This talk will walk through building a self-contained randomized study using Shiny and learnr modules. We will discuss building informed consent, the randomization process, demographic surveys, and R-based studies into a single online framework to allow users to seamlessly enroll and participate in randomized studies via a single URL. The talk will include both practical recommendations as well as technical code snippets.

R-Ladies Panel: Improving Gender Diversity in a Male-dominated Community

R-Ladies is a worldwide organization whose mission is to promote gender diversity in the R community. We are interested in presenting a panel of regional leaders in the R-Ladies movement. We will discuss topics such as diversity data in the R community, best practices for starting up a meetup in your own community, best practices for running and sustaining a meetup in your community, and funding opportunities. We will also diagnose different obstacles and discuss how we address them, for example increasing women’s competence versus confidence versus recognition in the R community. Finally, we will provide resources and details about how to get involved with local meetups.

Contextualizing Sensitivity Analysis in Observational Studies: Calculating Bias Factors for Known Covariates

The strength of evidence provided by epidemiological and observational studies is inherently limited by the potential for unmeasured confounding. While methods exist to quantify the potential effect of a specified unmeasured confounder, these methods should be anchored and contextualized within each study. We put forward a method for merging sensitivity to unmeasured confounding analyses with the impacts of the observed covariates. We graphically display what we call the observed bias factors with the tipping point sensitivity analysis. We illustrate the method under various study designs and provide an application created to simplify the implementation of this methodology.

papr: Tinder for pre-prints, a Shiny Application for collecting gut-reactions to pre-prints from the scientific community

papr is an R Shiny web application and social network for evaluating bioRxiv pre-prints. The app serves multiple purposes, allowing the user to quickly swipe through pertinent abstracts as well as find a community of researchers with similar interests. It also serves as a portal for accessible “open science”, getting abstracts into the hands of users of all skill levels. Additionally, the data could help build a general understanding of what research the community finds exciting.

We allow the user to log in via Google to track multiple sessions and have implemented a recommender engine, allowing us to tailor which abstracts are shown based on each user’s previous abstract ratings. While using the app, users view an abstract pulled from bioRxiv and rate it as “exciting and correct”, “exciting and questionable”, “boring and correct”, or “boring and questionable” by swiping the abstract in a given direction. The app includes optional social network features, connecting users who provide their Twitter handle to users who enjoy similar papers.

This presentation will demonstrate how to incorporate tactile interfaces, such as swiping, into a Shiny application using shinysense, a package we created for this functionality; store real-time user data on Dropbox using rdrop2; add login capabilities using googleAuthR and googleID; implement a recommender engine using principal component analysis; and handle issues of data safety and security through proactive planning and risk mitigation. Finally, we will report the app activity, summarizing both the user traffic and what research users are finding exciting.

Simplifying and Contextualizing Sensitivity to Unmeasured Confounding Tipping Point Analyses

The strength of evidence provided by epidemiological and observational studies is inherently limited by the potential for unmeasured confounding. Thus, we would expect every observational study to include a quantitative sensitivity to unmeasured confounding analysis. However, we reviewed 90 recent studies with statistically significant findings, published in top-tier journals, and found 41 mentioned the issue of unmeasured confounding as a limitation, but only 4 included a quantitative sensitivity analysis. Moreover, the rule of thumb that considers effects 2 or greater as robust can be misleading in being too low for studies missing an important confounder and too high for studies that extensively control for confounding. We simplify the seminal work of Rosenbaum and Rubin (1983) and Lin, Psaty, and Kronmal (1998). We focus on three key quantities: the observed bound of the confidence interval closest to the null, a plausible residual effect size for an unmeasured binary confounder, and a realistic prevalence difference for this hypothetical confounder. We offer guidelines to researchers for anchoring the tipping point analysis in the context of the study and provide examples.
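
The three quantities above can be combined in a short calculation. The sketch below is illustrative only: it assumes the commonly used multiplicative bias factor for an unmeasured binary confounder on the ratio scale, an observed effect above the null (so the bound closest to the null is the lower bound), and made-up input values. The function names are hypothetical and are not taken from the talk or any package.

```python
def bias_factor(gamma, p_exposed, p_unexposed):
    """Multiplicative bias from an unmeasured binary confounder (assumed form).

    gamma       : assumed confounder-outcome effect on the ratio scale
    p_exposed   : assumed confounder prevalence among the exposed
    p_unexposed : assumed confounder prevalence among the unexposed
    """
    return (gamma * p_exposed + 1 - p_exposed) / (gamma * p_unexposed + 1 - p_unexposed)


def adjusted_bound(observed_bound, gamma, p_exposed, p_unexposed):
    """Confidence bound after dividing out the assumed bias factor."""
    return observed_bound / bias_factor(gamma, p_exposed, p_unexposed)


def tipping_point_gamma(observed_bound, p_exposed, p_unexposed):
    """Confounder-outcome effect at which the adjusted bound equals 1.

    Returns None when no finite effect size can move the bound to the null
    for the given prevalences.
    """
    denominator = p_exposed - observed_bound * p_unexposed
    if denominator <= 0:
        return None
    return (observed_bound * (1 - p_unexposed) - (1 - p_exposed)) / denominator


# Hypothetical inputs: lower bound 1.2, prevalence 0.5 (exposed) vs 0.2 (unexposed)
gamma = tipping_point_gamma(1.2, 0.5, 0.2)
print(round(gamma, 3))
print(round(adjusted_bound(1.2, gamma, 0.5, 0.2), 3))
```

A researcher could then judge whether a confounder with that effect size and prevalence difference is plausible in the context of the study.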

Assessing the Association Between Accident Injury Severity and NCAP Car Safety Ratings

The U.S. New Car Assessment Program (NCAP) evaluates the safety of new cars through their 5-Star Safety Ratings program. In 2010, this program enhanced their protocol, making the ratings more stringent for cars in model years 2011 and onwards. We are interested in assessing this rating system’s ability to predict accident injury severity. To evaluate this question, we use data reported in the National Highway Traffic Safety Administration’s (NHTSA) General Estimates System (GES) database for the years 2011 to 2014, matched to NCAP overall safety ratings for 291 unique make, model, model year combinations. We fit a proportional odds regression model predicting injury severity for 23,641 individual passengers involved in car crashes, adjusting for accident-level covariates, such as the speed of the car and point of impact, and individual-level covariates, such as age and seating position.
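
As a rough illustration of what a proportional odds model implies, the sketch below turns a linear predictor into probabilities for ordered injury-severity categories. It assumes the parameterization P(Y ≤ j) = logistic(cutpoint_j − linear predictor); the cutpoints and the per-star coefficient are invented for illustration and are not estimates from this study.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(cutpoints, linear_predictor):
    """Category probabilities under a proportional odds model.

    cutpoints must be increasing; with K cutpoints there are K + 1 ordered
    categories, from least to most severe.
    """
    cumulative = [logistic(c - linear_predictor) for c in cutpoints] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical values: three cutpoints (four severity levels) and a
# coefficient of -0.3 per additional safety star.
cutpoints = [-1.0, 0.5, 2.0]
beta_per_star = -0.3

for stars in (3, 5):
    probs = category_probabilities(cutpoints, beta_per_star * stars)
    print(stars, [round(p, 3) for p in probs])
```

With a negative coefficient, a higher star rating shifts probability toward the less severe categories, and the same shift applies at every cutpoint, which is the proportional odds assumption.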

Census Tract-Level Disparities: Examining Food Swamps and Food Deserts

Examining disparities in resources at the census-tract level is currently a public health priority. The Modified Retail Food Environment Index (mRFEI), provided by the CDC, incorporates two food environment metrics: ‘food deserts’, areas with no access to healthy foods, and ‘food swamps’, areas in which the quantity of unhealthy food options overwhelms healthy ones. We assess the association between the census tract racial make-up and food environment. Multiple logistic regression models are fit, controlling for census tract-level covariates from 2008-2012 ACS estimates, as well as state. Percent black is significantly associated with food swamps, with an absolute increase of 14.4 percent black living in food swamps (p < 0.01). Percent Hispanic is associated with food swamps, with an absolute increase of 9.1 percent Hispanic living in food swamps (p < 0.01), but inversely related to food deserts (absolute difference -6.8, p < 0.01). After adjustment, all associations remain significant. The strong association between the census tract-level racial make-up and food swamps shown here will allow for targeted interventions to census tracts where these disparities exist.

Developing County-Level Estimates of Racial Disparities in Obesity Using Multilevel Reweighted Regression

Background: The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized. This disconnect may be slowing the progress made in reducing these disparities. We use a small area analysis technique to fill the void for county-level disparities data. Methods: Behavioral Risk Factor Surveillance System data is used to estimate the prevalence of obesity by county among Non-Hispanic Whites and Non-Hispanic Blacks. A modified weighting system was developed based on demographics at the county level. A multilevel reweighted regression model is fit to obtain county-level prevalence estimates by race. To examine whether racial disparities exist at the county level, these rates are compared using risk difference and rate ratio. Results: Gulf County, Florida was ranked as having the largest disparity in absolute terms (risk difference). New York County, New York was ranked as having the largest disparity in relative terms (risk ratio). Based on the average risk difference, the top five states with the largest average disparity were: Oklahoma, Kentucky, Ohio, Washington D.C., and Kansas. The top five states with the largest average relative disparity were: Washington D.C., Massachusetts, Colorado, Kentucky, and New York. Conclusions: Addressing disparities based on factors such as race/ethnicity, geographic location, and socioeconomic status is a current public health priority. This study takes a first step in developing the statistical infrastructure needed to target disparities interventions and resources to the local areas with greatest need.

Small Areal Estimation of Racial Disparities in Diabetes Using Multilevel Reweighted Regression

Introduction: The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized. This disconnect may be slowing the progress made in reducing these disparities. We use a small area analysis technique to fill the void for county level disparities data. Methods: Behavioral Risk Factor Surveillance System data is used to estimate the prevalence of diabetes by county among Non-Hispanic Whites and Non-Hispanic Blacks. A modified weighting system was developed based on demographics at the county-level. A multilevel reweighted regression model is fit to obtain county level prevalence estimates by race. To examine whether racial disparities exist at the county-level, these rates are compared using risk difference and rate ratio. Results: The District of Columbia was ranked as having the largest average disparity in both absolute and relative terms (risk difference and risk ratio). Based on the average risk difference of counties within a state, the next five states with the largest average disparity are: Massachusetts, Kansas, Ohio, North Carolina, and Kentucky. The next five states with the largest average relative disparity, calculated with rate ratio, were: Massachusetts, Colorado, Kansas, Illinois, and Ohio. Discussion: Addressing disparities based on factors such as race/ethnicity, geographic location, and socioeconomic status is a current public health priority. This study takes a first step in developing the statistical infrastructure needed to target disparities interventions and resources to the local areas with greatest need.

Mining Through Resumes: Utilizing SAS to Increase Efficiency and Objectivity in the Hiring Process

In the current job market, it is common to be inundated with resumes and applications. It has become increasingly important to streamline the evaluation process in order to sift through these candidates. Anecdotally, we recently received 50 resumes for 2 positions, many of which did not meet the minimum qualifications for employment. In order to minimize the time spent evaluating these resumes, and maximize the objectivity and efficiency of the process, we developed a SAS macro to determine which candidates should progress to a first round interview.

Using PROC GLIMMIX and PROC SGPLOT to Demonstrate County-level Racial Disparities in Obesity in North Carolina

The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized. This disconnect may be slowing the progress in reducing these disparities. Behavioral Risk Factor Surveillance System data is used to estimate the prevalence of obesity by county among Non-Hispanic Whites and Non-Hispanic Blacks. A modified weighting system was developed based on demographics at the county-level, and a multilevel reweighted regression model using PROC GLIMMIX is fit to obtain county-level prevalence estimates by race. To examine whether racial disparities exist at the county-level, these rates are compared using risk difference and rate ratio. These county-level estimates are then compared graphically using PROC SGPLOT. The distribution of prevalence estimates for Blacks is shifted to the right in comparison to the distribution for Whites; based on a two-sample test for differences in proportions the mean of the distribution of obesity prevalence estimates for Blacks is 35.7% higher than for Whites in North Carolina. This difference is statistically significant (p<.0001). Addressing disparities based on factors such as race/ethnicity, geographic location, and socioeconomic status is a current public health priority. This study takes a first step in developing the statistical infrastructure needed to target disparities interventions and resources to the local areas with greatest need as well as providing a graphical representation of disparities, allowing for the implementation of interventions and dissemination of information to occur more effectively and efficiently.

SAS® for Budgeting an Ideal Wedding

When considering beverages at a wedding reception, there are often two possible payment options: (1) a set price per person per hour; (2) a fixed price per drink. We developed a SAS macro to help choose the most cost-effective option.
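
The comparison between the two payment options can be sketched as follows. This is an illustration in Python rather than the SAS macro itself, and it assumes a constant average number of drinks per guest per hour; all names and numbers are hypothetical.

```python
def per_person_cost(guests, hours, price_per_person_hour):
    """Option 1: a set price per person per hour."""
    return guests * hours * price_per_person_hour


def per_drink_cost(guests, hours, drinks_per_guest_hour, price_per_drink):
    """Option 2: a fixed price per drink, assuming a constant drinking rate."""
    return guests * hours * drinks_per_guest_hour * price_per_drink


def cheaper_option(guests, hours, price_per_person_hour,
                   drinks_per_guest_hour, price_per_drink):
    """Return the name of the cheaper option and both totals."""
    option_1 = per_person_cost(guests, hours, price_per_person_hour)
    option_2 = per_drink_cost(guests, hours, drinks_per_guest_hour, price_per_drink)
    name = "per person per hour" if option_1 <= option_2 else "per drink"
    return name, option_1, option_2


# Hypothetical reception: 100 guests, 4 hours, $8 per person per hour,
# versus 1.5 drinks per guest per hour at $6 per drink.
print(cheaper_option(100, 4, 8, 1.5, 6))
```

Under this assumption the options break even when the drinking rate times the price per drink equals the hourly price per person, so the choice depends mainly on how much guests are expected to drink.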

Multilevel Reweighted Regression Models to Estimate County-Level Racial Health Disparities Using PROC GLIMMIX

The agenda to reduce racial health disparities has been set primarily at the national and state levels. These levels may be too far removed from the individual level where health outcomes are realized, and this disconnect may be slowing the progress made in reducing these disparities. This paper focuses on establishing county-level prevalence estimates of diabetes among Non-Hispanic Whites and Non-Hispanic Blacks. These estimates are produced using multilevel reweighted regression models through the GLIMMIX procedure with 2006-2010 Behavioral Risk Factor Surveillance System data and 2010 census data. To examine whether racial disparities exist at the county level, the paper estimates the risk difference of prevalence estimates between races. It subsequently ranks counties and states by the magnitude of disparities.
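
Once county-level prevalence estimates by race are available, the disparity measures and the ranking step are simple to compute. The sketch below uses made-up prevalence values for three hypothetical counties; it does not reproduce the GLIMMIX models or any results from the paper.

```python
def disparity_measures(prevalence_black, prevalence_white):
    """Absolute (risk difference) and relative (risk ratio) disparity."""
    return {
        "risk_difference": prevalence_black - prevalence_white,
        "risk_ratio": prevalence_black / prevalence_white,
    }


def rank_counties(estimates, measure):
    """Order county names from largest to smallest disparity on one measure."""
    scored = {
        county: disparity_measures(black, white)[measure]
        for county, (black, white) in estimates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical estimates: (prevalence among Non-Hispanic Blacks,
#                          prevalence among Non-Hispanic Whites)
estimates = {
    "County A": (0.40, 0.30),
    "County B": (0.15, 0.08),
    "County C": (0.25, 0.22),
}

print(rank_counties(estimates, "risk_difference"))
print(rank_counties(estimates, "risk_ratio"))
```

In this toy example the two rankings disagree about the top county, which is why absolute and relative disparities are worth reporting separately.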
