Statistical Semantics and History with R

My English class this semester has turned out to be really interesting. It’s a course offered by the Center for Digital Humanities at Carolina. The center facilitates the use of computing and technical methodologies in humanities research. They also build pretty, dynamic sites where the public can explore research data and findings. I’d like to work there doing some development next semester.

In this particular class we are learning how to use R to do textual analysis. As a person who likes words and statistics, it’s fun. Most of our readings so far have been theory. The first was Widdows’ “Geometry and Meaning”, a lucid explanation of how the frequencies of words in context can be mapped to vectors, then manipulated to quantify associations and semantics. A review by Turney and Pantel outlined specific vector space model methodologies.
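To make the vector space idea concrete for myself, here is a minimal sketch in R. The words, contexts, and counts in the little matrix below are invented for illustration only, not taken from anything we’ve actually read or analyzed.

```r
# Toy term-context matrix: rows are target words, columns are context
# words, and each cell counts how often the pair co-occurs.
m <- matrix(
  c(8, 0, 1, 5,   # loch
    0, 9, 0, 2,   # mill
    1, 0, 7, 6),  # clan
  nrow = 3, byrow = TRUE,
  dimnames = list(c("loch", "mill", "clan"),
                  c("water", "factory", "kinship", "land"))
)

# Cosine similarity between two row vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(m["loch", ], m["clan", ])  # higher: the two words share contexts
cosine(m["loch", ], m["mill", ])  # lower: their contexts barely overlap
```

The whole trick is that “meaning” falls out of nothing more than co-occurrence counts and a bit of linear algebra.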

These methods are the most basic forms of search engine information retrieval, natural language processing, and sentiment analysis. I think that if I had been asked to produce some of the code we have used in an algorithms class instead, it would have been more difficult, because the engineering professors I’ve had would not have given a theoretical grounding for why you can analyze semantic meaning quantitatively. Maybe I’m neurotic in not liking to crunch out code that I don’t understand, but it sure is refreshing to have a course that explains the theory.

We also read some out-there essays about space and place, including “On Space” by de Certeau. There were some interesting bits about discourse being part of the practice of place. I do think my professor just wanted to throw some capital-T Theory at us to mess with us. It was fun to read, but I take that flowery, verbose style with a grain of salt. I can’t imagine writing extended essays interpreting it.

The case study we are working on now has to do with the industrialization of Scotland and the transformation of Highlands culture and identity. 18th- and 19th-century Scottish literature, gazetteers, and census records form the corpus for our textual analysis in R. I would also like to use R’s (limited) geographic information systems capabilities to make some nice maps. If the corpus has enough relevant information, I plan to examine how industrialization and the erosion of clanship affected class inequality.
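For the maps, something like the sf and ggplot2 packages would probably do the job. Here is a rough sketch of what I have in mind, assuming a hypothetical places.csv with name, lon, lat, and mentions columns pulled from the gazetteer; the real corpus may look quite different.

```r
library(sf)
library(ggplot2)

# Hypothetical file: one row per gazetteer place, with coordinates and
# how many times the place is mentioned in the corpus.
places <- read.csv("places.csv")

# Turn the plain data frame into spatial points (WGS84 lon/lat).
places_sf <- st_as_sf(places, coords = c("lon", "lat"), crs = 4326)

# Plot the points, scaling each by its number of mentions.
ggplot(places_sf) +
  geom_sf(aes(size = mentions), alpha = 0.6) +
  labs(title = "Place mentions in the Scottish corpus", size = "Mentions")
```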

I’m looking forward to the final project, which is an open choice of corpus to explore using the methodologies and theory covered. However, the available text must be “georeferentially rich.” I have two ideas right now. The first would be to examine white supremacist literature and recorded actions in the South in the 20th century. Southerners have an obsession with revisionist history deeply tied to place. It seems like every public square in South Carolina has a statue which heroizes a slaveholder, segregationist, or known vitriolic racist. Ubiquitous historical markers downplay the atrocities perpetrated against people of color, in places where it is known that innocent blood was shed. On Charleston plantation tours, tourists learn about how nice the master was to his chattel.

It’s disgusting. The South has been telling “alternative facts” way longer than Kellyanne. They’re not just falsehoods, they’re lies with an aim to rewrite history in support of an insidious cause.

I would like to map the apologist narratives, which are well-referenced to place, against unbiased historical records of racism.

If that turns out to be too much, my plan B is to explore textual and Internet references to real-world queer spaces in the past 50 years. It may come down to whichever has the most accessible data that fits the required methodology. I plan to do much more playing around with R than the scope of this class requires, and possibly put some documentation of the process up on GitHub.