Fagin 114

The purpose of this sequence of four workshops is to take students from no experience in R or data science to a level of proficiency at which they can (1) visualize, (2) manipulate, and (3) apply introductory statistical methods to a dataset. The sequence is divided into four full-day workshops, each with a morning and an afternoon session consisting of lecture, pair programming, and review.

Because no programming experience is expected, each session teaches programming concepts and their R implementation together. I believe that larger concepts in programming are best taught after students have learned an implementation, so sessions alternate between hands-on R programming and slides that codify the larger concept where appropriate.

The course structure is loosely designed around R for Data Science by Grolemund and Wickham (freely available at http://r4ds.had.co.nz/index.html). The main difference between this course and typical R courses is that this one forgoes the usual programming-based introduction (data types, etc.) in favor of beginning with real datasets and real analysis. We will incorporate important base R functions and programming techniques along the way, but they are not a focus.

I am a firm believer that figuring out how to code on your own is the most important skill one can have, so the course emphasizes problem solving, reading help files, and extensive use of online resources.

**Workshop IV: Regression in R**

The original use case for R was statistical modeling, and that is still perhaps what R does best. In this workshop, we introduce regression, focusing both on the basics of fitting and interpreting regressions and on working with the regression output objects. We will also introduce the concept of holding data out of sample, and use this as an opportunity to introduce for loops.

Session Objectives - Students will be able to:

- Fit a regression in R.
- Assess the results of the regression using (1) output summaries, (2) visualizations of residuals and predicted values, and (3) out-of-sample prediction.
- Replicate a task using a for loop.
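
As a rough illustration of how these three objectives fit together, here is a minimal sketch in base R. It is not the workshop's actual exercise: the built-in `mtcars` data stands in for the workshop datasets, and the formula `mpg ~ wt + hp`, the 20 repetitions, and the hold-out size of 8 rows are arbitrary choices made only for this example.

```r
# Assumption: mtcars (built into R) stands in for the workshop datasets.
set.seed(1)  # make the random hold-out splits reproducible

# 1. Fit a regression and inspect the output summary.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(summary(fit))  # coefficient table, R-squared, residual standard error

# 2. Visualize residuals against predicted (fitted) values.
plot(fitted(fit), resid(fit), xlab = "Fitted mpg", ylab = "Residual")
abline(h = 0, lty = 2)

# 3. Out-of-sample prediction, repeated with a for loop.
n_reps <- 20             # hypothetical number of repetitions
n_test <- 8              # hypothetical hold-out size
rmse <- numeric(n_reps)  # pre-allocate storage for the results

for (i in seq_len(n_reps)) {
  test_rows <- sample.int(nrow(mtcars), n_test)  # rows to hold out
  train <- mtcars[-test_rows, ]
  test <- mtcars[test_rows, ]

  fit_i <- lm(mpg ~ wt + hp, data = train)  # fit on training rows only
  pred <- predict(fit_i, newdata = test)    # predict the held-out rows
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

mean(rmse)  # average out-of-sample root mean squared error
```

Each pass through the loop draws a different hold-out set, so the spread of `rmse` gives a sense of how much the out-of-sample error depends on which rows happen to be held out.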

Commands Learned:

- Regressions: lm, summary, coef, resid, predict, tidy, augment
- Base R: str(x, max.level = 2), for(), sample.int
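
For orientation, the sketch below shows one way the listed commands might be used on a fitted model. It makes two assumptions: `mtcars` again stands in for the workshop datasets, and `tidy` and `augment` are taken to be the functions of those names in the broom package, so they run only if broom is installed.

```r
# Assumption: mtcars stands in for the workshop datasets.
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)         # named vector of estimates: (Intercept) and wt
head(resid(fit))  # one residual per row of the data

# Predicted mpg for a hypothetical car with wt = 3 (weight in 1000 lbs)
predict(fit, newdata = data.frame(wt = 3))

# The fitted model is a list; str() shows what is stored inside it.
str(fit, max.level = 2)

# Assumption: tidy() and augment() come from the broom package.
if (requireNamespace("broom", quietly = TRUE)) {
  coef_table <- broom::tidy(fit)    # one row per coefficient
  row_table <- broom::augment(fit)  # data plus columns such as .fitted and .resid
  print(coef_table)
  print(head(row_table))
}
```

The data-frame outputs of `tidy` and `augment` are convenient for plotting and further manipulation, whereas `coef` and `resid` return plain vectors.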

Datasets Used:

- Smoking and Birthweight from Douglas Almond, Kenneth Chay, and David Lee, “The Costs of Low Birth Weight,” Quarterly Journal of Economics, August 2005, 120(3): 1031-1083.
- Twins earnings: Ashenfelter and Rouse Twinsburg Data

**About the Instructor**

Jonathan Tannen, Ph.D., is a Director at Econsult Solutions, Inc. (ESI). Jonathan’s dissertation research used GIS and large-scale computational techniques to develop a Bayesian method to measure the movement of neighborhood boundaries. Broadly, his work showed that gentrification between 2000 and 2010 in Philadelphia and other dense cities occurred through emergent boundaries moving through space (the gentrified regions expanded and blocks switched dichotomously) rather than through gradual block-level demographic changes.

Jonathan was born and raised in West Philadelphia, which is still his home. From 2007 to 2009, he taught at West Philadelphia High School through Teach for America, an experience that heavily informs his understanding of cycles of poverty and the nature of segregation in Philadelphia.

Jonathan received his Ph.D. in Public Policy in Urban and Population Policy from the Woodrow Wilson School at Princeton University in June 2016. Jonathan’s research interests include GIS and spatial statistics. His research with Douglas Massey on trends in Black hypersegregation was cited by the New York Times Editorial Board. Jonathan received his BA in Physics and Math cum laude from Harvard University in 2007, and an M.S.Ed. in Urban Education from the University of Pennsylvania in 2009.
