Text and text files

Summary
We consider some basic mechanisms for working with files that contain unformatted text.
Prerequisites
An abbreviated introduction to Racket. Characters and strings. List basics.

As you likely know from your experience with computers, when we want to keep information on the computer for reuse, we store it in what we usually refer to as a file. There are many different kinds of files. Some files store images (in many different possible formats). Some files store formatted text. Some store data. Some store what we often refer to as “plain text”, text without additional markup or formatting information.

As you might expect, digital humanists often work with both formatted and unformatted text files. However, they also work with a wide variety of other kinds of files, including images and geographic data. Because text-processing algorithms are often more straightforward than image-processing algorithms, we will primarily focus on text-processing algorithms. However, we will touch upon some other kinds of humanistic data and algorithms later in the semester.

It is often much easier to work with formatted text. However, creating formatted text is often a labor-intensive process, and formatting introduces some additional complexities. In many situations, plain text is a more appropriate tool for storing data.

Computer scientists and digital humanists work with text files in a variety of ways. They might, for example, search for particular words or attempt to rewrite the text in a file into a new form or a new language. They might look for some statistical properties of the text to try to gain some insight. We will consider similar issues.

Reading from plain text files

The csc151 package provides four basic operations for working with text files: file->chars, which reads the contents of a text file and presents the contents as a list of characters; file->words, which reads the contents of a text file and presents the contents as a list of strings, each of which represents one “word” in the file (using a simple metric for “word”); file->lines, which reads the contents of a text file and presents the contents as a list of strings, each of which represents one line of the input file; and file->string, which reads the contents of a text file and presents the contents as a single string.

Suppose we had the previous paragraph in a file (using the not-quite-plain-text format we tend to use for writing these readings). Here’s what we might get reading it each way.

> (take (file->chars "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20)
'(#\T #\h #\e #\space #\` #\c #\s #\c #\1 #\5 #\1 #\` #\space #\p #\a #\c #\k #\a #\g #\e)
> (take (file->words "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 10)
'("The" "csc151" "package" "provides" "four" "basic" "operations" "for" "working" "with")
> (take (file->lines "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 3)
'("The `csc151` package provides four basic operations for working with" 
  "text files: `file->chars`, which reads the contents of a text file and" 
  "presents the contents as a list of characters; `file->words`, which")
> (substring (file->string "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20 120)
"e provides four basic operations for working with\ntext files: `file->chars`, which reads the content"

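If you are curious how procedures like these might be built, here is a minimal sketch that uses only standard Racket. It is not the csc151 implementation. The names sketch-file->chars and sketch-file->words are made up for illustration, and the rule for a “word” (any run of non-whitespace characters) is an assumption that is probably simpler than the one csc151 uses, so punctuation stays attached to words. The sketch relies on file->string and file->lines from standard Racket's racket/file library.

```racket
#lang racket
;; A sketch, not the csc151 source. `#lang racket` already includes
;; racket/file, which supplies file->string and file->lines.

;; Hypothetical name: present a file's contents as a list of characters.
(define (sketch-file->chars fname)
  (string->list (file->string fname)))

;; Hypothetical name: present a file's contents as a list of "words".
;; Assumption: a word is any run of non-whitespace characters.
(define (sketch-file->words fname)
  (string-split (file->string fname)))

;; Build a small temporary file to experiment with.
(define demo-file (make-temporary-file "reading-demo-~a.txt"))
(display-to-file "Hello there.\nHow are you?\n" demo-file #:exists 'truncate)

(define demo-chars (sketch-file->chars demo-file))
(define demo-words (sketch-file->words demo-file))
(define demo-lines (file->lines demo-file))

(take demo-chars 5) ; '(#\H #\e #\l #\l #\o)
demo-words          ; '("Hello" "there." "How" "are" "you?")
demo-lines          ; '("Hello there." "How are you?")

;; Clean up the temporary file.
(delete-file demo-file)
```

Notice that the three views differ only in how the same characters are grouped: not at all, at whitespace, or at line breaks.
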
Writing to plain text files

At times, we’ll want to save the text we create to a file. The csc151 package currently provides only one procedure for writing text to a file: (string->file str fname) writes the given string to the named file.

> (file->string "/home/username/example.txt")
Error! . . ../../Applications/Racket v7.1/collects/racket/private/kw.rkt:1279:57: open-input-file: cannot open input file
Error!   path: /home/username/example.txt
Error!   system error: No such file or directory; errno=2
> (string->file "This is an example.\n" "/home/username/example.txt")
> (file->string "/home/username/example.txt")
"This is an example.\n"

Warning! The string->file procedure will overwrite an existing file, completely eliminating any previous content.

> (take (file->lines "/home/username/exam1.txt") 3)
'("Exam 1" "Random J. Student" "Time required: 10 hours")
> (string->file "I am the 1337 h4x0r. Phear me!" "/home/username/exam1.txt")
> (file->string "/home/username/exam1.txt")
"I am the 1337 h4x0r. Phear me!"

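If overwriting worries you, standard Racket lets you choose what happens when the target file already exists. The sketch below is an illustration under assumptions, not the csc151 code: sketch-string->file and string->new-file are hypothetical names, and we are only guessing that the overwriting behavior described above corresponds to the 'truncate option of display-to-file. The second procedure uses the 'error option, so it refuses to touch a file that is already there.

```racket
#lang racket
;; Hypothetical helpers built on display-to-file from racket/file.
;; They are not part of csc151.

;; Write str to fname, replacing any previous contents.
(define (sketch-string->file str fname)
  (display-to-file str fname #:exists 'truncate))

;; A more cautious variant: raise an error if fname already exists.
(define (string->new-file str fname)
  (display-to-file str fname #:exists 'error))

;; Experiment inside a fresh temporary directory.
(define demo-dir (make-temporary-file "writing-demo-~a" 'directory))
(define target (build-path demo-dir "example.txt"))

;; The file does not exist yet, so the cautious write succeeds.
(string->new-file "This is an example.\n" target)
(define first-contents (file->string target))

;; A second cautious write fails; we catch the error and report 'refused.
(define second-attempt
  (with-handlers ([exn:fail:filesystem? (lambda (e) 'refused)])
    (string->new-file "Oops!\n" target)))

;; The overwriting version silently replaces the old contents.
(sketch-string->file "Replaced.\n" target)
(define final-contents (file->string target))

;; Clean up the temporary directory and everything in it.
(delete-directory/files demo-dir)
```

Here second-attempt ends up as 'refused and the original text survives until the overwriting write replaces it.
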
Naming files

Racket is surprisingly clueless about finding files. We might say “It’s right there.” But “there” is not clear to the computer. To avoid issues, we will generally store files in the same directory in which we are working. If you want to try other approaches, please discuss them with course staff or drop-in tutors.
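One way to see where Racket looks for a file: it resolves a name without a leading directory against what it calls the current directory. The sketch below uses only standard Racket procedures. The helper where-is and the file name notes.txt are made up for illustration. In DrRacket, the current directory is usually the folder that contains your saved definitions file, but treat that as an assumption to check on your own setup.

```racket
#lang racket
;; Hypothetical helper: the complete path Racket would use for fname,
;; worked out relative to the current directory.
(define (where-is fname)
  (path->complete-path fname (current-directory)))

;; Pretend a fresh temporary directory is the place we are working.
(define demo-dir (make-temporary-file "naming-demo-~a" 'directory))

(define lookups
  (parameterize ([current-directory demo-dir])
    ;; A plain name like "notes.txt" lands in the current directory.
    (display-to-file "Some notes.\n" "notes.txt")
    (list (file-exists? "notes.txt")      ; we just created this one
          (file-exists? "missing.txt")))) ; we never created this one

lookups ; '(#t #f)

;; Where would Racket look for notes.txt right now?
(where-is "notes.txt")

;; Clean up the temporary directory and everything in it.
(delete-directory/files demo-dir)
```

If a file-reading call reports “No such file or directory”, evaluating (current-directory) in the interactions pane is a quick way to find out where Racket is actually looking.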

Self checks

Check 1: Ways of reading files (‡)

Suppose scene.txt contains the following lines.

Prof: Student, how are you today?
Student: Please don't address me in the generic.
Prof: Stu, how are you today?
Student: I'm pretty well.  Thanks for asking.

What output do you expect to get if you call file->chars, file->words, file->lines, and file->string on that file?

Check your work by creating scene.txt within DrRacket and then using Racket to check the results of these function calls. Keep in mind that you’ll need to (require csc151) to call these file-reading functions.