Methodological Workshops

Text Analysis for Applied Social Science

From June 8th to 12th, Professor Molly Roberts (University of California, San Diego) will teach a 15-hour course.

The course will be taught in English, is free of charge and is designed for social scientists (professors, research fellows, graduate students) who are interested in text analysis for applied social science.

All classes run from 10:00 to 13:00.

Those interested in attending the course should contact Magdalena Nebreda (secretaria@march.uc3m.es) before May 20th and provide a CV and a brief explanation of their interest in the course. If a large number of applications is received, a selection process will be applied; applicants will be notified of the results before May 25th.
Course Description
Statistical analysis of text data has become increasingly common in the social sciences (Grimmer and Stewart, 2013), with applications in political science, economics, sociology, and psychology, among other fields. In this week-long workshop we introduce scholars to the tools needed to do text analysis in a rigorous, replicable way. We cover both the practical aspects and the statistical details of workhorse text analysis models.
Course Schedule with References
For the full syllabus, see (here)
Day 1: Introduction to Text Analysis
This unit gives an overview of text analysis as a methodology. It introduces text as data and the basics of text processing needed to use these tools. This unit will cover:

Day 2: Word Counts and Basic Text Manipulations
This unit will discuss using word counts to analyze text data. It will introduce software for counting words and for identifying discriminating words. This unit will cover:
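
As an illustration of the word-count idea described above (not taken from the course materials), the following is a minimal Python sketch that counts words and ranks discriminating words between two groups of documents. The function names, the toy documents, and the choice of a smoothed log-odds ratio as the scoring rule are all assumptions made for this example; other scoring rules are possible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(documents):
    """Return one Counter of token frequencies over a list of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def discriminating_words(docs_a, docs_b, top_n=2, smoothing=1.0):
    """Rank words by a smoothed log-odds ratio between two document groups.

    Returns (words most typical of group A, words most typical of group B).
    """
    counts_a, counts_b = word_counts(docs_a), word_counts(docs_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + smoothing) / total_a
        p_b = (counts_b[word] + smoothing) / total_b
        scores[word] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    # Sort from most A-like to most B-like; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n], ranked[-top_n:]


# Toy documents, invented for illustration only.
group_a = ["taxes taxes jobs", "cut taxes now"]
group_b = ["health care now", "care for health jobs"]

counts = word_counts(group_a)
typical_a, typical_b = discriminating_words(group_a, group_b)
print(counts.most_common(3))
print(typical_a, typical_b)
```

The smoothing term keeps words that appear in only one group from producing infinite scores.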

Day 3: Supervised Text Methods
This unit will focus on supervised methods for text analysis. Supervised methods leverage some form of human training or guidance which is then used directly in the analysis of textual data. We will cover the statistical foundations of the models and describe their use. This unit will cover:
  • ReadMe (Hopkins and King, 2010)
  • Classifying political parties from speech (Yu, Kaufmann and Diermeier, 2008)
  • Classifiers and ensembles (Hillard, Purpura and Wilkerson, 2008)
  • RTextTools (Jurka et al., 2011)
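
To make the supervised workflow concrete (hand-label some documents, train a model, then classify new text), here is a minimal Python sketch of a multinomial naive Bayes classifier with add-one smoothing. It is a generic textbook technique chosen for brevity, not the method of ReadMe, RTextTools, or any of the papers listed above; the class name, labels, and toy documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for doc, label in zip(documents, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, document):
        """Return the unnormalized log posterior of each label."""
        n_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(document) if t in self.vocab]
        scores = {}
        for label, n_label in self.doc_counts.items():
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n_label / n_docs)  # log prior
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / total)
            scores[label] = score
        return scores

    def predict(self, document):
        scores = self.log_scores(document)
        return max(scores, key=scores.get)


# Hand-labeled toy training set, invented for illustration only.
train_docs = [
    "lower taxes and smaller government",
    "cut taxes for business",
    "expand public health care",
    "fund schools and health programs",
]
train_labels = ["party_a", "party_a", "party_b", "party_b"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict("we will cut taxes"))
print(model.predict("more health care"))
```

Words never seen in training are ignored at prediction time, which is one common convention among several.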

Day 4: Unsupervised Text Methods
This unit will focus on unsupervised methods for text analysis. Unsupervised methods use statistical tools to discover common patterns in textual data, which then require human interpretation and validation. We will cover the statistical foundations of the models and describe their use. This unit will cover:
  • Introduction to inference for latent variable models (Bishop et al., 2006, Chapter 1)
  • Latent Dirichlet Allocation (Blei, Ng and Jordan, 2003)
  • Structural Topic Models (Roberts et al., 2014, 2013)
  • Clustering (Grimmer and King, 2011)
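
As a rough illustration of how a topic model discovers patterns without labels, the sketch below fits Latent Dirichlet Allocation to a handful of toy documents in plain Python. It uses a collapsed Gibbs sampler, which is a different inference method from the variational approach in Blei, Ng and Jordan (2003) and is swapped in here only because it is short. The function name, hyperparameter values, and documents are assumptions for this example, and the resulting topics would still need human interpretation and validation.

```python
import random
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fit_lda(documents, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (theta, phi, vocab): per-document topic proportions,
    per-topic word probabilities, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    docs = [tokenize(d) for d in documents]
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens per doc and topic
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per topic and word
    topic_total = [0] * n_topics                          # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
        for row in doc_topic
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


# Toy documents, invented for illustration only.
toy_docs = [
    "taxes budget taxes spending",
    "budget spending taxes",
    "health hospital care",
    "care health doctors hospital",
]
theta, phi, vocab = fit_lda(toy_docs, n_topics=2, n_iter=100)
for k, row in enumerate(phi):
    top_words = sorted(zip(row, vocab), reverse=True)[:3]
    print(k, [w for _, w in top_words])
```

Because the sampler is stochastic, the topics found depend on the seed and the number of iterations.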

Day 5: New Applications
This final day will cover applications of the text analysis methods described above to interesting social science questions.
  • Reverse engineering censorship in China (King, Pan and Roberts, 2013, 2014)
  • Text analysis for comparative politics (Lucas et al., 2015)
  • Measuring political communication (Grimmer, 2010)
  • Measuring anti-Americanism (Jamal et al., 2014)

Statistical Packages
Throughout the course we will use several statistical packages that we or others have contributed to the open-source community. These packages will be helpful for students wishing to complete the optional workshops. They include:
  • Python for web-scraping
  • Python package BeautifulSoup
  • Yoshikoder for word counts http://www.yoshikoder.org/
  • R package textir for multinomial inverse regression
  • R package ReadMe
  • R package RTextTools for classifiers
  • R package stm, which implements the Structural Topic Model (Roberts, Stewart and Tingley, Submitted) (available at www.structuraltopicmodel.com)