As you likely know from your experience with computers, when we want to keep information on the computer for reuse, we store it in what we usually refer to as a file. There are many different kinds of files. Some files store images (in many different possible formats). Some files store formatted text. Some store data. Some store what we often refer to as “plain text”, text without additional markup or formatting information.
As you might expect, digital humanists often work with both formatted and unformatted text files. However, they also work with a wide variety of other kinds of files, including images and geographic data. Because text-processing algorithms are often more straightforward than image-processing algorithms, we will primarily focus on text-processing algorithms. However, we will touch upon some other kinds of humanistic data and algorithms later in the semester.
It is often much easier to work with formatted text. However, creating formatted text is often a labor-intensive process, and formatting also introduces some additional complexities. In many situations, plain text is a more appropriate tool for storing data.
Computer scientists and digital humanists work with text files in a variety of ways. They might, for example, search for particular words or attempt to rewrite the text in a file into a new form or a new language. They might look for some statistical properties of the text to try to gain some insight. We will consider similar issues.
The csc151 package provides four basic operations for working with text files: file->chars, which reads the contents of a text file and presents the contents as a list of characters; file->words, which reads the contents of a text file and presents the contents as a list of strings, each of which represents one “word” in the file (using a simple metric for “word”); file->lines, which reads the contents of a text file and presents the contents as a list of strings, each of which represents one line of the input file; and file->string, which reads the contents of a text file and presents the contents as a single string.
Suppose we had the previous paragraph in a file (using the not-quite-plain-text format we tend to use for writing these readings). Here’s what we might get reading it each way.
> (take (file->chars "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20)
'(#\T #\h #\e #\space #\` #\c #\s #\c #\1 #\5 #\1 #\` #\space #\p #\a #\c #\k #\a #\g #\e)
> (take (file->words "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 10)
'("The" "csc151" "package" "provides" "four" "basic" "operations" "for" "working" "with")
> (take (file->lines "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 3)
'("The `csc151` package provides four basic operations for working with"
"text files: `file->chars`, which reads the contents of a text file and"
"presents the contents as a list of characters; `file->words`, which")
> (substring (file->string "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20 120)
" provides four basic operations for working with\ntext files: `file->chars`, which reads the contents"
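To make the four views of a file more concrete, here is a minimal sketch that approximates them using only standard Racket procedures. The helper names (my-file->chars, my-file->words, and so on) are hypothetical and are not part of csc151, and the whitespace-based notion of “word” used here is an assumption; the csc151 version may treat punctuation differently.

```racket
#lang racket
;; Sketch only: these helpers approximate the csc151 readers with standard
;; Racket. The my-... names are hypothetical, not part of csc151.

;; Write a small sample file to a temporary location so the example
;; does not depend on any existing file.
(define sample-path (make-temporary-file "sample-~a.txt"))
(display-to-file "The quick brown fox\njumps over the lazy dog.\n"
                 sample-path
                 #:exists 'truncate)

;; The whole file as one string (file->string here is the racket/file version).
(define (my-file->string path)
  (file->string path))

;; The file as a list of characters.
(define (my-file->chars path)
  (string->list (file->string path)))

;; The file as a list of whitespace-separated "words".
;; Assumption: csc151 may define "word" differently.
(define (my-file->words path)
  (string-split (file->string path)))

;; The file as a list of lines, without the line terminators.
(define (my-file->lines path)
  (file->lines path))

(take (my-file->chars sample-path) 4) ; '(#\T #\h #\e #\space)
(my-file->words sample-path)
(my-file->lines sample-path)
```

Note that in this sketch the word "dog." keeps its period, because splitting on whitespace alone does nothing about punctuation.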
At times, we’ll want to save the text we create to a file. The csc151 package currently provides only one procedure for writing text to a file: (string->file str fname) writes the given string to the named file.
> (file->string "/home/username/example.txt")
Error! . . ../../Applications/Racket v7.1/collects/racket/private/kw.rkt:1279:57: open-input-file: cannot open input file
Error! path: /home/username/example.txt
Error! system error: No such file or directory; errno=2
> (string->file "This is an example.\n" "/home/username/example.txt")
> (file->string "/home/username/example.txt")
"This is an example.\n"
Warning! The string->file procedure will overwrite an existing file, completely eliminating any previous content.
> (take (file->lines "/home/username/exam1.txt") 3)
'("Exam 1" "Random J. Student" "Time required: 10 hours")
> (string->file "I am the 1337 h4x0r. Phear me!" "/home/username/exam1.txt")
> (file->string "/home/username/exam1.txt")
"I am the 1337 h4x0r. Phear me!"
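One way to guard against accidental overwrites is to check whether the file exists before writing. The following is a minimal sketch using standard Racket procedures; careful-string->file is a hypothetical helper, not something csc151 provides.

```racket
#lang racket
;; Sketch only: careful-string->file is a hypothetical helper, not part of
;; csc151. It writes str to path, but refuses to replace an existing file.
(define (careful-string->file str path)
  (if (file-exists? path)
      (error 'careful-string->file "refusing to overwrite ~a" path)
      (display-to-file str path)))

;; Demo: make-temporary-file creates an empty file, so this path already exists.
(define existing-path (make-temporary-file "exam-~a.txt"))

;; Attempt the write, catching the error so we can see what happened.
(define result
  (with-handlers ([exn:fail? (lambda (e) 'refused)])
    (careful-string->file "I am the 1337 h4x0r.\n" existing-path)
    'written))

result                        ; 'refused
(file->string existing-path)  ; "" (the file was left untouched)
```

The design choice here is to fail loudly rather than silently replace the content; whether that is the right behavior depends on what you are writing and why.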
Racket is surprisingly clueless about finding files. We might say “It’s right there.” But “there” is not clear to the computer. To avoid issues, we will generally store files in the same directory as the one in which we are working. If you want to try other approaches, please discuss them with course staff or drop-in tutors.
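If you are unsure where Racket is looking, the following sketch may help. It uses only standard Racket procedures (nothing specific to csc151), and "scene.txt" is just an example file name. In DrRacket, the current directory is typically the directory of the saved definitions file, but treat that as an assumption to verify on your own setup.

```racket
#lang racket
;; Sketch using standard Racket procedures (not csc151-specific).

;; current-directory reports the directory that relative paths are
;; resolved against.
(define here (current-directory))

;; build-path joins a directory and a file name into one path.
;; "scene.txt" is only an example name.
(define scene-path (build-path here "scene.txt"))

;; file-exists? lets us check before reading, rather than hitting an error.
(if (file-exists? scene-path)
    (printf "Found ~a\n" scene-path)
    (printf "No file at ~a\n" scene-path))
```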
Suppose scene.txt contains the following lines.
Prof: Student, how are you today?
Student: Please don't address me in the generic.
Prof: Stu, how are you today?
Student: I'm pretty well. Thanks for asking.
What output do you expect to get if you call file->chars, file->words, file->lines, and file->string on that file?
Check your work by creating scene.txt within DrRacket and then using Racket to check the results of these function calls. Keep in mind that you’ll need to (require csc151) to call these file-reading functions.