Research Guides: Digital Text Analysis and Text Mining: Preparing Texts for Analysis

Getting to the Plain Text

Typically, your goal should be to obtain or create a Plain Text (UTF-8) version of the work(s) you will be analyzing. This may involve:

Finding a copy that is already in Plain Text
Finding a digital copy and converting it (if necessary) to Plain Text
Scanning and then extracting the text (using Optical Character Recognition) from a physical text
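
For the second option, one simple case is a digital copy that is already a text file but was saved in an older encoding. The sketch below re-saves such a file as UTF-8. The function name convert_to_utf8 and the default source encoding (Windows-1252, "cp1252") are assumptions for illustration; check how your own file was actually saved before relying on the default.

```python
def convert_to_utf8(source, destination, source_encoding="cp1252"):
    """Re-save a text file as UTF-8 plain text and return its contents.

    source_encoding is an assumption: if it does not match the real
    encoding of the file, accented letters and curly quotes will be garbled.
    """
    # Reading in text mode also converts Windows and old Mac line endings to "\n".
    with open(source, "r", encoding=source_encoding) as infile:
        text = infile.read()

    # newline="\n" keeps the "\n" line endings unchanged on every platform.
    with open(destination, "w", encoding="utf-8", newline="\n") as outfile:
        outfile.write(text)

    return text
```

Open the converted file and spot-check a few accented characters or quotation marks. If they look wrong, the source encoding was probably something other than the one assumed here.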

Optical Character Recognition (OCR):

Many free tools exist, but pay attention to their privacy and security practices before giving them your texts. Here are a few that function well with Carleton infrastructure.

Adobe Acrobat - Edit Scanned PDFs
Adobe's documentation on creating editable PDFs.
Google Drive - Convert PDFs and Images to Text
Google's instructions on converting PDFs and Images to Text files.
OneNote - Copy Text from Picture
Microsoft's instructions on how to copy text from a picture.

Converting File Types

Calibre: Ebook Management
Powerful free tool for reading and working with ebooks. Note their documentation on converting ebooks to plain text.

Clean & Manipulate the Text

Most plain text will need some cleaning and manipulation before it's useful for analysis. This can also be an iterative process as you move through your project. Here are common reasons for cleaning and manipulation:

Remove extraneous text (license agreements, recurring markers that were added to show which page of the source document is reproduced, etc.)
Normalize spellings across the document/collection
Add unique strings of characters to the text to mark each appearance of a feature you want to analyze (e.g., add "#character-1" every time the first of several identifiable but unnamed characters is referenced in a text)
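
The three kinds of cleaning above can be sketched in Python with regular expressions. Everything specific in this example is hypothetical and chosen only for illustration: the "[Page 12]" marker format, the spelling pairs, and the phrase "the old sailor" standing in for an unnamed character. Your own texts will need their own patterns.

```python
import re

# Hypothetical page marker on its own line, such as "[Page 12]".
PAGE_MARKER = re.compile(r"^\[Page \d+\][ \t]*$\n?", flags=re.MULTILINE)

# Hypothetical spelling variants mapped to the form you want to keep.
SPELLINGS = {"colour": "color", "to-day": "today"}

# Hypothetical phrases that refer to an unnamed character, with the tag to add.
CHARACTER_PHRASES = {"the old sailor": "#character-1"}


def remove_page_markers(text):
    """Delete every line that consists only of a page marker."""
    return PAGE_MARKER.sub("", text)


def normalize_spellings(text, spellings=SPELLINGS):
    """Replace each variant with its preferred form (whole words, case-sensitive)."""
    for variant, preferred in spellings.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", preferred, text)
    return text


def tag_characters(text, phrases=CHARACTER_PHRASES):
    """Append a tag after each mention of a phrase, ignoring letter case."""
    for phrase, tag in phrases.items():
        pattern = re.compile(rf"\b{re.escape(phrase)}\b", flags=re.IGNORECASE)
        text = pattern.sub(lambda match: f"{match.group(0)} {tag}", text)
    return text


def clean(text):
    """Run the three steps in order."""
    return tag_characters(normalize_spellings(remove_page_markers(text)))
```

Given the input "[Page 12]\nThe old sailor liked the colour of to-day.\n", clean returns "The old sailor #character-1 liked the color of today.\n". Because cleaning is iterative, keeping each step in its own function makes it easier to rerun or adjust one step later without redoing the others.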

Common Tools for Cleaning Text

BBEdit
Even the free version of this software allows for excellent plain text manipulations, including find/replace across multiple files.
OpenRefine
This is a powerful tool for working with data (text or otherwise). You can clean and normalize messy data, transform it from one format into another, and much more. Be sure to take a look at the three short tutorial videos on the main page before beginning your work.
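
If you would rather script a find/replace across multiple files than use an editor, the sketch below shows one way to do it in Python. The function name replace_in_files and the "*.txt" file pattern are assumptions, and the script overwrites files in place, so run it on a copy of your texts.

```python
from pathlib import Path


def replace_in_files(folder, old, new, pattern="*.txt"):
    """Replace old with new in every matching file in folder.

    Returns the names of the files that were changed. Files are assumed
    to be UTF-8 plain text and are overwritten in place.
    """
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        updated = text.replace(old, new)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

The returned list gives you a quick record of which files were touched, which is useful when you repeat the cleaning step later in a project.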

Scripting and More

You may find that the options here are not powerful enough for you. For example, maybe you want to remove a list of language- or domain-specific stop words from your texts. Your librarian can help you identify lists of stop words, and Paula Lackie (plackie@carleton.edu) or the Data Squad can provide data-manipulation support.
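
As a rough illustration of what stop word removal looks like in a script, here is a minimal Python sketch. The seven-word list is a placeholder, not a recommended list, and the definition of a "word" (a run of letters, optionally with internal apostrophes) is an assumption that will not suit every language or project.

```python
import re

# Placeholder list for illustration only; a real project would load a
# fuller language- or domain-specific list, for example from a text file.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

# Assumption: a word is a run of letters, optionally joined by apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased words of text with stop words dropped."""
    words = WORD.findall(text.lower())
    return [word for word in words if word not in stop_words]
```

For example, remove_stop_words("The whale and the sea of Nantucket") returns ["whale", "sea", "nantucket"]. Passing a different set as stop_words lets you swap in the list your librarian helps you find.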