Digital Text Analysis and Text Mining

Finding word trends across large bodies of text.

Getting to the Plain Text

Typically, your goal should be to obtain or create a Plain Text (UTF-8) version of the work(s) you will be analyzing. This may involve:

  • Finding a copy that is already in Plain Text
  • Finding a digital copy and converting it (if necessary) to Plain Text
  • Scanning and then extracting the text (using Optical Character Recognition) from a physical text
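
If a digital copy is already text but stored in an older encoding, re-saving it as UTF-8 can be scripted. The following is a minimal sketch in Python, not a prescribed method. The function name save_as_utf8 and the default source encoding (cp1252, common for older Windows files) are assumptions made for illustration; check what encoding your own file actually uses.

```python
from pathlib import Path


def save_as_utf8(src, dst, source_encoding="cp1252"):
    """Read a text file in a known encoding and re-save it as UTF-8.

    source_encoding is an assumption about the input file; if it is
    wrong, accented characters and curly quotes may come out garbled.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

A call such as save_as_utf8("novel_old.txt", "novel_utf8.txt") (both filenames hypothetical) leaves the original untouched and writes a UTF-8 copy beside it.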

Optical Character Recognition (OCR):

Many free tools exist, but pay attention to their privacy and security practices before giving them your texts. Here are a few that function well with Carleton infrastructure.

Converting File Types

Clean & Manipulate the Text

Most plain text will need some cleaning and manipulation before it's useful for analysis. This can also be an iterative process as you move through your project. Here are common reasons for cleaning and manipulation:

  • Remove extraneous text (license agreements, recurring page markers added to show which page of the source document is reproduced, etc.)
  • Normalize spellings across the document/collection
  • Add unique strings of characters to the text to mark each appearance of a feature you want to analyze (e.g. add "#character-1" every time the first of a few unique and identifiable but unnamed characters is referenced in a text)
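
The three kinds of cleaning listed above can be sketched in a few lines of Python using regular expressions. This is only an illustration under stated assumptions: the page-marker format ("[Page 12]"), the spelling pairs, and the phrase "the old sailor" are all hypothetical and would need to be replaced with patterns from your own texts.

```python
import re

# Hypothetical examples; replace with patterns found in your own texts.
PAGE_MARKER = re.compile(r"^\[Page \d+\][ \t]*\n?", re.MULTILINE)
SPELLINGS = {"colour": "color", "to-day": "today"}
TAGS = {"the old sailor": "#character-1"}


def clean(text):
    # 1. Remove extraneous lines such as recurring page markers.
    text = PAGE_MARKER.sub("", text)
    # 2. Normalize spellings (whole words only, case-sensitive).
    for old, new in SPELLINGS.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    # 3. Append a unique tag after each reference to a feature of interest.
    for phrase, tag in TAGS.items():
        text = re.sub(
            re.escape(phrase),
            lambda m, tag=tag: f"{m.group(0)} {tag}",
            text,
            flags=re.IGNORECASE,
        )
    return text


sample = "[Page 12]\nThe old sailor liked the colour red to-day.\n"
print(clean(sample))
# prints: The old sailor #character-1 liked the color red today.
```

Because cleaning is often iterative, keeping the patterns in dictionaries like these makes it easy to add new cases and re-run the script on the original files.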

Common Tools for Cleaning Text

Scripting and More

You may find that the options here are not powerful enough for you. For example, maybe you want to remove a list of language- or domain-specific Stop Words from your texts. Your librarian can help you identify lists of stop words, and Paula Lackie (plackie@carleton.edu) or the Data Squad can provide data-manipulation support.
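
As a rough idea of what such a script involves, here is a minimal Python sketch of stop-word removal. It assumes the stop words are supplied as a simple list of words; the function name remove_stop_words and the sample words are hypothetical, and the tokenizing rule (lowercase runs of letters, digits, and apostrophes) is a simplification that may not suit every language.

```python
import re


def remove_stop_words(text, stop_words):
    """Return the lowercased words of text, minus any stop words."""
    stop = {w.lower() for w in stop_words}
    # Simplified tokenizer: word characters, allowing internal apostrophes.
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    return [t for t in tokens if t not in stop]


print(remove_stop_words("The whale and the sea", ["the", "and"]))
# prints: ['whale', 'sea']
```

The resulting list of words can then be counted or passed on to other analysis tools.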