Debate Project Data Cleaning

Data Acquisition and Cleaning

  1. Locate transcripts online
  2. Copy paste save as text files
  3. Basic clean up (MSWord) : record time/place; edit name variations (first mentions usually include full titles such as "Former governor of X and one time Y So and So"), basic structure should be in sketchy-semi-BNF form:
<debate> ::= <debate_header> <debate_text>
<debate_header> ::= <date> <place>
<debate_text> ::= <utterance> | <utterance> <debate_text>
<utterance> ::= <speaker> <tab> <paragraph> | <utterance> <paragraph>
<paragraph> ::= <sentence> <crlf> | <sentence> <paragraph>
<sentence> ::= <word> <period> | <word> <space> <sentence>

word = [a-z]+

numbers in transcribed text come in many forms – spelled out, with or without commas, etc. Many can be caught simply by treating them as words that start with a digit with the extra allowance for internal commas and periods.


Quick and Dirty Contraction cleaner-upper

isn't aren't don't haven't hadn't wouldn't weren't i've you've we've i'm it's
won't can't need second fix wo not > will not, can not > cannot

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(LOWER(F166),"n't"," not"),"'re"," are"),"i'm","i am"),"it's","it is"),"'ve"," have")


One quick and dirty approach to stemming.

  1. Sort list of words
  2. Identify all words that are word above +s, +es, +d, +ed, +_s (from 's)
  3. Produce two side by side lists, one with the root/stem and one with the added word
  4. Use this as a lookup/substitute concorandance
  5. Repeat 2 or three times to pick up multiple forms.

=IF(OR(C2=(C1&"ing")),C2,"") But discarded cases when verb stem is common expression of personal stance such as feel, guess, think, etc.
NOTE :the above is just looking to collapse words in our corpus. In cases where we only have the inflected form we don't do anything at this point.