DAVID RITTENHOUSE LABORATORY, ROOM 3N6

The purpose of this sequence of four workshops is to take students from no experience in R or data science to a level of proficiency at which they can (1) visualize a dataset, (2) manipulate it, and (3) apply introductory statistical methods to it. The sequence is divided into four full-day workshops, each with a morning and an afternoon session consisting of lecture, pair programming, and review.

Because no programming experience is expected, the workshops will teach each programming concept together with its R implementation. I personally believe that larger concepts in programming are best taught after students learn an implementation, so each session will start with hands-on R programming and follow it, where appropriate, with slides codifying the larger concept.

The course structure is loosely designed around R for Data Science by Grolemund and Wickham (freely available at http://r4ds.had.co.nz/index.html). The main difference between this course and typical R courses is that this one forgoes the usual programming-based introduction (teaching about data types, etc.) in favor of beginning with real datasets and real analysis. We will incorporate important base R functions and programming techniques along the way, but they are not a focus.

I am a firm believer that learning to figure out code on your own is the most important skill a programmer can have, so I will emphasize problem solving, help files, and extensive use of online resources.

**Workshop I: Intro to R and Data Visualization**

The biggest hurdle in starting to use R is simply not being intimidated by writing code. This workshop introduces students to the basic structure of RStudio and code-based commands. It does so in the context of the important and rewarding task of visualizing data.

Session Objectives: Students will be able to:

- Load a data set into the workspace and explore values manually.
- Create variables in a dataframe.
- Use ggplot to create scatter plots, line plots, histograms, and violin plots.
- Use numeric and logical vectors.
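
As a rough illustration of the first, second, and fourth objectives, here is a minimal base R sketch. The file, the column names (`id`, `weight_g`, `smoker`), and all values are invented stand-ins, not the workshop datasets.

```r
# Stand-in data: a tiny CSV written to a temporary file (hypothetical values).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,weight_g,smoker",
  "1,3200,FALSE",
  "2,2400,TRUE",
  "3,3550,FALSE",
  "4,NA,TRUE"
), csv_path)

# Load the data set into the workspace and explore it manually.
births <- read.csv(csv_path)
head(births)
dim(births)

# Create a new variable (column) in the data frame.
births$weight_kg <- births$weight_g / 1000

# A logical vector records which rows meet a condition (NA stays NA).
is_low <- births$weight_g < 2500
table(is_low, useNA = "ifany")

# Subset rows with the logical vector; which() drops the NA row.
low_births <- births[which(is_low), ]

# Numeric summaries need na.rm = TRUE when values are missing.
mean_weight <- mean(births$weight_g, na.rm = TRUE)
```

Writing the CSV to a temporary file keeps the sketch self-contained; in the workshop, `read.csv()` would point at a real file instead.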

Commands learned:

- Base R: <-, ?, read.csv(), head(), [], $, 1:4, c(), comparison operators (<, >, ==), table(), order(), dim(), length().
- Basic math and logical operations: max, min, which, which.max, sum, mean, &, |, ==, <, >, !=, NA, is.na()
- Visualizations: library(), ggplot, aes, geom_point, geom_line, geom_histogram, facet_wrap, ggtitle, scale_x_log10, scale_color_continuous, scale_color_discrete, geom_text, geom_smooth, ggsave
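
To give a feel for how several of the visualization commands fit together, here is a small sketch. It assumes the ggplot2 package is installed, and the `wages` data frame is invented for illustration; it is not the ACS Philadelphia data. In ggplot2, the scatter plot layer is drawn with `geom_point()`.

```r
library(ggplot2)

# Hypothetical wage data, invented for illustration only.
wages <- data.frame(
  age    = c(25, 32, 41, 48, 55, 29, 37, 60),
  wage   = c(28000, 41000, 56000, 62000, 70000, 33000, 150000, 45000),
  sector = c("public", "private", "private", "public",
             "private", "public", "private", "public")
)

# Map columns to aesthetics with aes(), then add layers with `+`.
p_scatter <- ggplot(wages, aes(x = age, y = wage, color = sector)) +
  geom_point() +                 # one point per row
  facet_wrap(~ sector) +         # one panel per sector
  ggtitle("Wage by age (made-up data)")

# A histogram of wages on a log10 x-axis.
p_hist <- ggplot(wages, aes(x = wage)) +
  geom_histogram(bins = 5) +
  scale_x_log10()

# Save a plot to disk; the temporary PDF path is only a placeholder.
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p_scatter, width = 6, height = 4)
```

Each plot is built by starting from `ggplot()` and adding one layer or setting at a time, which makes it easy to experiment by adding or removing a single line.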

Datasets Used:

- Smoking and Birthweight from “The Costs of Low Birth Weight,” Quarterly Journal of Economics, August 2005, 120(3): 1031-1083.
- ACS Philadelphia Wage Data

**About the Instructor**

Jonathan Tannen, Ph.D., is a Director at Econsult Solutions, Inc. (ESI). Jonathan’s dissertation research used GIS and large-scale computational techniques to develop a Bayesian method to measure the movement of neighborhood boundaries. Broadly, his work showed that gentrification between 2000 and 2010 in Philadelphia and other dense cities occurred through emergent boundaries moving through space (the gentrified regions expanded and blocks switched dichotomously) rather than through gradual block-level demographic changes.

Jonathan was born and raised in West Philadelphia, which is still his home. From 2007 to 2009, he taught at West Philadelphia High School through Teach for America, an experience that heavily informs his understanding of cycles of poverty and the nature of segregation in Philadelphia.

Jonathan received his Ph.D. in Public Policy in Urban and Population Policy from the Woodrow Wilson School at Princeton University in June 2016. Jonathan’s research interests include GIS and spatial statistics. His research with Douglas Massey on trends in Black hypersegregation was cited by the New York Times Editorial Board. Jonathan received his BA in Physics and Math cum laude from Harvard University in 2007, and an M.S.Ed. in Urban Education from the University of Pennsylvania in 2009.

**Location**

DAVID RITTENHOUSE LABORATORY, ROOM 3N6