Research Guides: Digital Text Analysis and Text Mining: Preparing Texts for Analysis

Getting to the Plain Text

Typically, your goal should be to obtain or create a plain text (UTF-8) version of the work(s) you will be analyzing. This may involve:

Finding a copy that is already in Plain Text
Finding a digital copy and converting it (if necessary) to Plain Text
Scanning and then extracting the text (using Optical Character Recognition) from a physical text
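
If a text file you find is not already UTF-8, a short script can re-save it. The following is a minimal Python sketch, not a definitive method: the function names (decode_text, save_as_utf8) are hypothetical, and the fallback encoding of cp1252 (common for older Windows files) is an assumption you should adjust to match your source.

```python
from pathlib import Path


def decode_text(raw: bytes, fallback_encoding: str = "cp1252") -> str:
    """Decode bytes as UTF-8, falling back to a legacy encoding if that fails."""
    try:
        # "utf-8-sig" also strips a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: files that are not valid UTF-8 use the fallback encoding.
        return raw.decode(fallback_encoding)


def save_as_utf8(source_path: str, target_path: str,
                 fallback_encoding: str = "cp1252") -> None:
    """Read a text file in an uncertain encoding and write a UTF-8 copy."""
    raw = Path(source_path).read_bytes()
    text = decode_text(raw, fallback_encoding)
    Path(target_path).write_text(text, encoding="utf-8")
```

Always spot-check the output for garbled accents or punctuation, since a wrong fallback encoding will decode without error but produce incorrect characters.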

Optical Character Recognition (OCR):

Many free tools exist, but pay attention to their privacy and security practices before giving them your texts. Here are a few that function well with Carleton infrastructure.

Adobe Acrobat - Edit Scanned PDFs
Adobe's documentation on creating editable PDFs.
Google Drive - Convert PDFs and Images to Text
Google's instructions on converting PDFs and Images to Text files.
OneNote - Copy Text from Picture
Microsoft's instructions on how to copy text from a picture.

Converting File Types

Calibre: Ebook Management
Powerful free tool for reading and working with ebooks. Note their documentation on converting ebooks to plain text.

Clean & Manipulate the Text

Most plain text will need some cleaning and manipulation before it's useful for analysis. This can also be an iterative process as you move through your project. Here are common reasons for cleaning and manipulation:

Remove extraneous text (license agreements, recurring markers that indicate which page of the source document is reproduced, etc.)
Normalize spellings across the document/collection
Add unique strings of characters to the text to mark places where features you want to analyze appear (e.g. add "#character-1" every time the first of several identifiable but unnamed characters is referenced in a text)
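
For those comfortable with a little scripting, the three tasks above can be sketched in Python with regular expressions. This is an illustrative sketch only: the function names, the "[Page 12]" marker format, the spelling map, and the phrase being tagged are all hypothetical and would need to match your own texts.

```python
import re


def remove_page_markers(text: str) -> str:
    """Delete lines consisting only of a marker such as "[Page 12]" (assumed format)."""
    return re.sub(r"^\[Page \d+\][ \t]*\n?", "", text, flags=re.MULTILINE)


def normalize_spellings(text: str, spelling_map: dict) -> str:
    """Replace each variant spelling (whole words only, case-sensitive)."""
    for variant, standard in spelling_map.items():
        pattern = r"\b" + re.escape(variant) + r"\b"
        # A function replacement avoids special handling of backslashes.
        text = re.sub(pattern, lambda m, s=standard: s, text)
    return text


def tag_references(text: str, phrase: str, tag: str) -> str:
    """Append a unique tag after every occurrence of a phrase, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.sub(pattern, lambda m: m.group(0) + " " + tag, text,
                  flags=re.IGNORECASE)


sample = "The stranger admired the colour.\n[Page 12]\nThen the stranger left."
cleaned = remove_page_markers(sample)
cleaned = normalize_spellings(cleaned, {"colour": "color"})
cleaned = tag_references(cleaned, "the stranger", "#character-1")
print(cleaned)
```

Because cleaning is iterative, keeping each step as its own small function makes it easier to rerun or adjust one step without redoing the others.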

Common Tools for Cleaning Text

BBEdit
Even the free version of this software allows for excellent plain text manipulations, including find/replace across multiple files.
Data Tidying with OpenRefine
OpenRefine allows you to tidy and modify your data in a simple way without having to learn much, if any, programming. The three six-minute tutorials on this page will teach you how to modify a whole dataset with just a few quick commands.

Colectica (an Excel add-on) and Tableau Prep are good alternatives.

Scripting and More

You may find that the options here are not powerful enough for you. For example, maybe you want to remove a list of language- or domain-specific stop words from your texts. Your librarian can help you identify lists of stop words, and Paula Lackie (plackie@carleton.edu) or the Data Squad can provide data-manipulation support.
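
As a rough illustration of what stop word removal involves, here is a minimal Python sketch. The function name remove_stop_words and the tiny stop word list are hypothetical; in practice you would load a list suited to your language or domain. The tokenizer is deliberately naive (it splits on anything that is not a letter, digit, or underscore, so "don't" becomes "don" and "t").

```python
import re


def remove_stop_words(text: str, stop_words) -> list:
    """Return the lowercased words of a text, minus any stop words."""
    stop = {word.lower() for word in stop_words}
    # Naive tokenization: runs of word characters, lowercased.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop]


# Hypothetical, very short stop word list for demonstration.
example_stop_words = ["the", "and", "of"]
print(remove_stop_words("The whale and the sea", example_stop_words))
```

Whether to lowercase, how to treat apostrophes and hyphens, and which list to use are analytical choices, so record them as part of your project documentation.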