How To Collect Twitter Data

See also

Twitter Data Processing
Simple Twitter Analysis
Illocution Inc, a source for sampled Twitter data

Basics

  1. http://twitter.com
  2. Twitter Help Center (start here)
  3. See also: http://onemansblog.com/2011/04/27/how-to-search-twitter-for-old-tweets-and-how-to-archive-them/

Project

  1. Build an RSS feed that gathers all tweets on some topic
    1. See How To Find Your RSS Feed or RSS
    2. "Feed" can go in two directions. We might want to know how we can send something we update (like our blog) to Twitter. Or we might want to capture something going on out there.

Strategy 1

Go to hashtags.org
Enter a term and take note of the current tweeting on this term.

Strategy 2

Download The Archivist
Simple Example

Strategy 3

See also http://search.twitter.com

Subtopics

  1. Twitter Lists

Strategy 4

Obtained data from http://simplymeasured.com/blog/2010/06/lakers-vs-celtics-social-media-breakdown-nba/. Uncompressed it, cleaned it up a bit (details on that another time), and took a subset of about 10k tweets.

Format is

Term Username Name Tweet Time(PDT)

Use an Excel pivot table on names to create a lookup table, and then VLOOKUP to code each case with a tweeter ID.
Use Excel "substitute" function to turn all spaces in Tweet into a findable string including the record number.
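
One possible reading of the SUBSTITUTE step, sketched in Python rather than Excel: every space in a tweet is replaced by a marker that carries the record number, so that each word can later be traced back to the tweet it came from. The marker format `|17|` and the function name `tag_spaces` are assumptions made for illustration; the notes above do not say what the findable string looked like.

```python
def tag_spaces(tweet, record_no):
    """Replace each space with a marker containing the record number.

    The "|<record_no>|" marker is a hypothetical choice, not the
    format used in the original spreadsheet.
    """
    return tweet.replace(" ", f"|{record_no}|")


tagged = tag_spaces("go lakers go", 17)
print(tagged)
```

Any marker would do, as long as it is a string that cannot occur in a tweet by accident.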

Data Cleaning

See also DATA CLEANING on my Twitter project.

We start with data found on http://simplymeasured.com/blog/2010/06/lakers-vs-celtics-social-media-breakdown-nba/ (RowFeeder for Celtics and Lakers compressed.zip).
Uncompressed data contains about 45,000 tweets for each game and looks like this:

Service Term Username Name Update Location URL Friends Followers Time(PDT) City State/Region Country Metro Latitude Longitude

For our first trial run we will throw away some of the columns and just work with a subset of 10,000 tweets. Our data will include:

Term Username Name Tweet Time(PDT)
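
The column trimming and subsetting could also be done outside Excel. The following is a minimal sketch, assuming the uncompressed file is tab-delimited with a header row, that the "Update" column is what we call "Tweet", and that the subset is simply the first 10,000 rows. None of this is confirmed by the notes above, and the sample rows are made up.

```python
import csv
import io
from itertools import islice

# Columns to keep from the full file (names taken from the header row above).
KEEP = ["Term", "Username", "Name", "Update", "Time(PDT)"]


def subset_rows(lines, limit=10_000, delimiter="\t"):
    """Return the first `limit` rows, reduced to the KEEP columns.

    "Update" is renamed to "Tweet" (an assumption about how the
    two column lists correspond).
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    rows = []
    for row in islice(reader, limit):
        rows.append(
            {("Tweet" if col == "Update" else col): row[col] for col in KEEP}
        )
    return rows


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "Service\tTerm\tUsername\tName\tUpdate\tLocation\tTime(PDT)\n"
    "twitter\tlakers\tfan1\tFan One\tGo #Lakers!\tLA\t6/17/2010 18:05\n"
    "twitter\tceltics\tfan2\tFan Two\tceltics in 7\tBoston\t6/17/2010 18:06\n"
)
rows = subset_rows(sample, limit=1)
print(rows)
```

With the real data, the `io.StringIO` sample would be replaced by an open file handle; the delimiter may need adjusting if the file turns out to be comma-separated.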

Initial examination of data shows lots of junk characters that may make for analysis headaches. Trying to clean some of this up pre-emptively:

Remove blanks from names

We have both username and name. Not sure which we will eventually want to use. For now, assume it is USERNAME. We take this column and use a pivot table to create a list of unique names. From 10,000 tweets we have 8,760 unique usernames.

Put a serial number next to each unique username. Then, back in the data, we create a new column and use the formula =VLOOKUP(B2,'name-id list'!A$1:C$8760,3), which we autofill down and then copy and paste special as values. Now we can delete the username and name columns.
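
The same coding step can be sketched in Python with a dictionary standing in for the name-id list. This is an illustration, not the method used above: the function name `assign_ids` is hypothetical, and IDs are handed out in order of first appearance, whereas the pivot table list would normally be sorted alphabetically.

```python
def assign_ids(usernames):
    """Give each distinct username a serial number and code every row.

    Returns (lookup, coded): lookup maps username -> ID, and coded is
    the list of IDs in the same order as the input rows.
    """
    lookup = {}
    coded = []
    for name in usernames:
        if name not in lookup:
            lookup[name] = len(lookup) + 1  # next unused serial number
        coded.append(lookup[name])
    return lookup, coded


usernames = ["fan1", "fan2", "fan1", "fan3"]  # made-up sample column
lookup, coded = assign_ids(usernames)
print(coded)  # [1, 2, 1, 3]
```

The dictionary lookup is always an exact match. VLOOKUP with its fourth argument omitted does an approximate match and relies on the lookup list being sorted, so it is worth spot-checking a few coded rows in the spreadsheet.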

Next we note that the tweets themselves are full of "irregular" text. Some of it is "tweetese", so we don't want to sweep it all aside, but we do want to establish some cleaning rules. One thing to note is that some tweeters use the hashtag and some do not: lots of tweets with #lakers and lots without. We also note mixed upper and lower case, as well as various punctuation marks and URLs.

SUGGESTION: For a first cut, let's zap most of this stuff. We'll turn periods, commas, parentheses and brackets, exclamation and question marks, semicolons, colons, etc. into blank spaces. But note that some of these are parts of emoticons.
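
A rough sketch of what this first-cut cleaning might look like in Python. The emoticon list, the choice to drop URLs entirely, the lowercasing, and the optional `strip_hash` flag are all assumptions made for illustration; only the punctuation-to-blanks rule comes from the suggestion above.

```python
import re

URL = re.compile(r"https?://\S+")
PUNCT = re.compile(r"[.,()\[\]!?;:]")
# Illustrative list only; real tweets contain many more emoticons.
EMOTICONS = {":)", ":(", ";)", ":D", ":P"}


def clean_tweet(text, strip_hash=False):
    """Drop URLs, turn punctuation into blanks, and lowercase the rest.

    Tokens that are exactly a known emoticon are kept untouched.
    With strip_hash=True, "#lakers" and "lakers" become the same word.
    """
    text = URL.sub(" ", text)
    out = []
    for token in text.split():
        if token in EMOTICONS:
            out.append(token)
            continue
        token = PUNCT.sub(" ", token).lower()
        if strip_hash:
            token = token.replace("#", "")
        out.extend(token.split())  # also collapses extra blanks
    return " ".join(out)


print(clean_tweet("GO #Lakers!!! :) see http://example.com/x (game 7)"))
# go #lakers :) see game 7
```

Emoticons glued to a word (for example "win:)") are not recognized by this sketch and lose their punctuation like any other token.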

twitter-001.jpg

References

The Twitter Fan Wiki
Twitter Social Network Analysis
http://blog.magicbeanlab.com/sna/some-twitter-social-network-analysis/
Geographic Data Analysis and Visualization at U of Oregon
http://www.crimsonhexagon.com/

See also