In this workshop we will use Microsoft Office (Word and Excel) to analyze the language in works of literature. Along the way we will learn something about:

  • character encoding and non-printing characters
  • tab (tsv) and comma separated value (csv) file formats
  • sophisticated find/replace operations, wildcards, and regular expressions
  • functions in Excel
  • stepwise refinement as a method for constructing complex functions
  • spreadsheets vs. databases
  • data as case by variable tables
  • using pivot tables to convert caseXvariable to variableXvariable
  • frequency distributions
  • n-grams
  • indexes and lookup tables
  • formal description of languages (e.g., BNF)
  • pattern recognition
  • iterative problem solving

Getting Started

  1. Find an interesting text via Project Gutenberg.
  2. Note the font and format of the text on the web page. We will be working in "text only" mode - our data will contain only text, punctuation, "white space," and a few non-printing characters (specifically, the "end-of-paragraph" mark).
  3. Copy and paste the entire text into a new MS Word document (use PASTE-SPECIAL UNFORMATTED TEXT).
  4. Skim over the document to familiarize yourself with it.
  5. Delete any ancillary material at the top or bottom of the file so that the text begins with the title and finishes with "The End."
  6. Save a copy as your "SURNAME-TITLE-rawdata" and then save again as "SURNAME-TITLE-working-copy."

Step One: Book to List of Words

To expedite the first part of the project we'll just look at an excerpt of the book. Note the total number of words and the number of chapters and delete enough to leave you with around 10,000 words.

Examine the text and make a list of punctuation marks that appear in it (e.g., " ' , . ! - ? _). We will use the find/replace function in Word to remove these.
But since sometimes only a punctuation mark might separate two words, we don't want simply to cut the punctuation. We will, instead, replace each punctuation mark with a space. For each one you replace, note how many
! 135
? 981
; 234

, 7268
. 5621
- 1080
_ 268
" 6106
' 3341

  • 32

Now scan the text to see if there are others and to inspect the results of your work. Note that there are a lot of places where we have separated an "s" from a noun where there was a possessive and a "t" from a contraction. Let's go back two steps and see if we can fix this.

Look, and think. If an apostrophe is used as a quote within a quote we want to remove it, but if it is used as possessive or contraction we'd like to keep it. We can approach this empirically. We saw above how many apostrophes (or single quotes) were found. Let's look for them again, except this time, we will look for an apostrophe followed by any letter

3205 - this tells me there are some that are not 's or 't in contractions.
SP' 154
'SP 126
3485

Next we will put each word on a different line by replacing all spaces with paragraph marks. And then we will collapse multiple paragraph marks into single ones.

Copy to Excel (do it in google sheets so we can collaborate?). Do pivot table. Observe

We forgot parentheses. Ampersand. Upper/lower case

UNIQUE WORDS and STEMMING
Let's first create a simple list of unique entries in our word list and a count (frequency) of each. Copy and paste (special-values only) the list into a Sheets document starting in cell B2
Label the column in cell B1 "Words"
Highlight the entire column and use the DATA>PIVOT TABLE command to create a pivot table in a new worksheet with Words as the row variable and words as the value variable.
Glance down the data and notice that we have cases where we have the a word like animal followed by animals or advise followed by advised and advises. We often do not want to count these inflected forms separately. When we take a text corpus and reduce inflected and derived words to their word stems we call it stemming. Copy column A of this new sheet and paste (special-values only) into column C changing the column title to Stems.
We will collectively carry out the task by connecting to a shared google sheet.
We will then create a new column on the original sheet and put Stem as its header. Then we will use the function VLOOKUP to match the original raw data word to its row in the pivot table sheet and then copy the text that appears the third column of that row back to the first sheet.
And then we will create a new pivot table using this new column with STEM as the row variable and count of STEM as the value variable. Note that we can continue to improve our stemming and update this frequency counting of our data.