Data Acquisition and Cleaning
- Locate transcripts online
- Copy paste save as text files
- Basic clean up (MSWord) : record time/place; edit name variations (first mentions usually include full titles such as "Former governor of X and one time Y So and So"), basic structure should be in sketchy-semi-BNF form:
<debate> | ::= | <debate_header> <debate_text> |
<debate_header> | ::= | <date> <place> |
<debate_text> | ::= | <utterance> | <utterance> <debate_text> |
<utterance> | ::= | <speaker> <tab> <paragraph> | <utterance> <paragraph> |
<paragraph> | ::= | <sentence> <crlf> | <sentence> <paragraph> |
<sentence> | ::= | <word> <period> | <word> <space> <sentence> |
word = [a-z]+
numbers in transcribed text come in many forms – spelled out, with or without commas, etc. Many can be caught simply by treating them as words that start with a digit with the extra allowance for internal commas and periods.
Quick and Dirty Contraction cleaner-upper
isn't aren't don't haven't hadn't wouldn't weren't i've you've we've i'm it's
won't can't need second fix wo not > will not, can not > cannot
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(LOWER(F166),"n't"," not"),"'re"," are"),"i'm","i am"),"it's","it is"),"'ve"," have")
Stemming
One quick and dirty approach to stemming.
- Sort list of words
- Identify all words that are word above +s, +es, +d, +ed, +_s (from 's)
- Produce two side by side lists, one with the root/stem and one with the added word
- Use this as a lookup/substitute concorandance
- Repeat 2 or three times to pick up multiple forms.
=IF(ISERROR(VLOOKUP(D2,'suffix-stemmer'!A$1:B$699,2,FALSE)),"xxx",VLOOKUP(D2,'suffix-stemmer'!A$1:B$699,2,FALSE))
=IF(OR(C8240=(C8239&"er"),C8240=(C8239&"est")),C8240,"")
=IF(OR(C8240=(C8239&"s"),C8240=(C8239&"d"),C8240=(C8239&"es"),C8240=(C8239&"ed"),C8240=(C8239&"_s")),C8240,"")
=IF(OR(C2=(C1&"ing")),C2,"") But discarded cases when verb stem is common expression of personal stance such as feel, guess, think, etc.
NOTE :the above is just looking to collapse words in our corpus. In cases where we only have the inflected form we don't do anything at this point.