Reading data from files

Summary: We begin our consideration of ways to store data in files and how to work with data once they are stored in files. Along the way, we consider some of the complexities of representing data.

Introduction: Storing data

In our recent exploration of tables we found that it is possible to work with tables of data when they are represented as lists of lists. And that makes sense if we’re working in Scheme. However, people work with data in many languages, so we should not store it in Scheme format. Fortunately, there are a variety of formats that computer scientists use to store data. We will explore a few such formats and the procedures we use to work with data in those formats.

Comma-separated values

The most common file format for tables of data is the comma-separated values file, usually referred to by the initials “CSV” or “csv”. Each line of a csv file contains one row of a table, with the entries separated by commas (hence the name). In general, the assumption in csv files is that something that looks like a number is a number and everything else is a string. For example, here is an entry in our “state capitals” file for Des Moines.

Iowa,Des Moines,41.590939,-93.620866

At first glance, the basic rules seem simple and straightforward: (1) values are separated by commas, (2) numbers represent numbers, (3) everything else is a string. But when we start to probe the details, things get a bit more complicated.

What if we want to include a comma in one of the entries, such as “The Roswell, New Mexico Crash” from our UFO table? The designers of csv sensibly decided that we should be able to put strings that include commas in quotation marks. Hence, we might write the following.

1947,"The Roswell, New Mexico Crash","July, 1947",USA,Yes,Yes,Yes,No
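
To see why the quotation marks matter, consider what happens if we simply split a line at every comma. The sketch below assumes that Scamper's string-split procedure takes a string and a separator string; naive-split is a name made up for this illustration.

```scheme
;;; (naive-split line) -> list?
;;;   line: string?
;;; Split line at every comma, paying no attention to quotation marks.
;;; (naive-split is a hypothetical helper; it assumes that string-split
;;; takes a string and a separator string.)
(define naive-split
  (lambda (line)
    (string-split line ",")))

; A line with no quoted entries: one piece per entry.
(length (naive-split "Iowa,Des Moines,41.590939,-93.620866"))

; A line with commas inside quoted entries: more pieces than entries.
(length
  (naive-split
    "1947,\"The Roswell, New Mexico Crash\",\"July, 1947\",USA,Yes,Yes,Yes,No"))
```

The Roswell line has eight entries but nine commas, so splitting at every comma should give ten pieces rather than eight. A csv reader has to keep track of quotation marks to recover the intended entries.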

Quotation marks are also useful when we want to represent something that looks like a number, but should be treated verbatim. For example, zip codes look like numbers. However, we typically want the leading zeros, such as 02158 rather than 2158. In representing our data that includes zip codes, we would therefore put each zip code in quotation marks.

"02158",42.385096,-71.208399,"Newton","MA","Middlesex"

There are a few other subtleties, such as how you represent a string that includes both a comma and a double-quotation mark. But those are rare enough that we’ll leave them as an appendix.

Reading from csv files

In Scamper, we need to import the data package with (import data) to access several of the needed procedures.

Scamper allows us to read data from files using the procedure with-file. The file is loaded from browser storage.

(import data)

;;; (with-file filename fn) -> void
;;;   filename: string?
;;;   fn: procedure?
;;; Loads filename from browser storage and passes 
;;; its contents to fn as input. The output of fn is then 
;;; rendered to the screen.
(with-file "Pokemon.csv" 
    (lambda (x) x))
"#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
12,Butterfree,Bug,Flying,395,60,45,50,90,80,70,1,False
17,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,1,False"

The first argument to with-file is the name of the file to load; the second is a procedure that tells Scamper how to process the contents of the file. In the example above, that procedure is the identity procedure, which returns exactly its input. So in this case the printed result is simply the contents of the file, unchanged, as a string.

If we want the data in a different format, we can use one of parse-csv, string->chars, or string->words. Let’s see how each works.

(with-file "Pokemon.csv" 
    parse-csv)
(("#" "Name" "Type 1" "Type 2" "Total" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary") ("1" "Bulbasaur" "Grass" "Poison" "318" "45" "49" "49" "65" "65" "45" "1" "False") ("4" "Charmander" "Fire" "" "309" "39" "52" "43" "60" "50" "65" "1" "False") ("7" "Squirtle" "Water" "" "314" "44" "48" "65" "50" "64" "43" "1" "False") ("12" "Butterfree" "Bug" "Flying" "395" "60" "45" "50" "90" "80" "70" "1" "False") ("17" "Pidgeotto" "Normal" "Flying" "349" "63" "60" "55" "50" "50" "71" "1" "False"))
; result is a list of lists: one list per line of the file, with each line split at the commas; every entry is a string

(with-file "Pokemon.csv" 
    string->chars)
(#\# #\, #\N #\a #\m #\e #\, #\T #\y #\p #\e #\space #\1 #\, #\T #\y #\p #\e #\space #\2 #\, #\T #\o #\t #\a #\l #\, #\H #\P #\, #\A #\t #\t #\a #\c #\k #\, #\D #\e #\f #\e #\n #\s #\e #\, #\S #\p #\. #\space #\A #\t #\k #\, #\S #\p #\. #\space #\D #\e #\f #\, #\S #\p #\e #\e #\d #\, #\G #\e #\n #\e #\r #\a #\t #\i #\o #\n #\, #\L #\e #\g #\e #\n #\d #\a #\r #\y #\newline #\1 #\, #\B #\u #\l #\b #\a #\s #\a #\u #\r #\, #\G #\r #\a #\s #\s #\, #\P #\o #\i #\s #\o #\n #\, #\3 #\1 #\8 #\, #\4 #\5 #\, #\4 #\9 #\, #\4 #\9 #\, #\6 #\5 #\, #\6 #\5 #\, #\4 #\5 #\, #\1 #\, #\F #\a #\l #\s #\e #\newline #\4 #\, #\C #\h #\a #\r #\m #\a #\n #\d #\e #\r #\, #\F #\i #\r #\e #\, #\, #\3 #\0 #\9 #\, #\3 #\9 #\, #\5 #\2 #\, #\4 #\3 #\, #\6 #\0 #\, #\5 #\0 #\, #\6 #\5 #\, #\1 #\, #\F #\a #\l #\s #\e #\newline #\7 #\, #\S #\q #\u #\i #\r #\t #\l #\e #\, #\W #\a #\t #\e #\r #\, #\, #\3 #\1 #\4 #\, #\4 #\4 #\, #\4 #\8 #\, #\6 #\5 #\, #\5 #\0 #\, #\6 #\4 #\, #\4 #\3 #\, #\1 #\, #\F #\a #\l #\s #\e #\newline #\1 #\2 #\, #\B #\u #\t #\t #\e #\r #\f #\r #\e #\e #\, #\B #\u #\g #\, #\F #\l #\y #\i #\n #\g #\, #\3 #\9 #\5 #\, #\6 #\0 #\, #\4 #\5 #\, #\5 #\0 #\, #\9 #\0 #\, #\8 #\0 #\, #\7 #\0 #\, #\1 #\, #\F #\a #\l #\s #\e #\newline #\1 #\7 #\, #\P #\i #\d #\g #\e #\o #\t #\t #\o #\, #\N #\o #\r #\m #\a #\l #\, #\F #\l #\y #\i #\n #\g #\, #\3 #\4 #\9 #\, #\6 #\3 #\, #\6 #\0 #\, #\5 #\5 #\, #\5 #\0 #\, #\5 #\0 #\, #\7 #\1 #\, #\1 #\, #\F #\a #\l #\s #\e)
; result is a list of characters

(with-file "Pokemon.csv" 
    string->words)
("#,Name,Type" "1,Type" "2,Total,HP,Attack,Defense,Sp." "Atk,Sp." "Def,Speed,Generation,Legendary" "1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False" "4,Charmander,Fire,,309,39,52,43,60,50,65,1,False" "7,Squirtle,Water,,314,44,48,65,50,64,43,1,False" "12,Butterfree,Bug,Flying,395,60,45,50,90,80,70,1,False" "17,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,1,False")
; result is a list of strings, split on whitespace (spaces and newlines)

The contents of the file will be unavailable to you outside of the call to with-file. For this reason, if you want to do anything more than print the data to the screen, it’s best to save the data using a local binding.

(with-file "Pokemon.csv"
  (lambda (text)
    (let ([data (parse-csv text)])
      data)))
 (("#" "Name" "Type 1" "Type 2" "Total" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary") ("1" "Bulbasaur" "Grass" "Poison" "318" "45" "49" "49" "65" "65" "45" "1" "False") ("4" "Charmander" "Fire" "" "309" "39" "52" "43" "60" "50" "65" "1" "False") ("7" "Squirtle" "Water" "" "314" "44" "48" "65" "50" "64" "43" "1" "False") ("12" "Butterfree" "Bug" "Flying" "395" "60" "45" "50" "90" "80" "70" "1" "False") ("17" "Pidgeotto" "Normal" "Flying" "349" "63" "60" "55" "50" "50" "71" "1" "False"))

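Once we’ve read a csv file, we can treat it just as we’ve been treating any other list of lists. Note that parse-csv leaves every entry as a string, as the output above shows, so entries that look like numbers must be converted before we can compute with them.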
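
One common first step is converting a numeric column, since parse-csv produces strings such as "318" rather than numbers. Here is a minimal sketch, assuming that Scamper provides the standard Scheme procedures string->number and list-ref; pokemon-total is a name made up for this illustration.

```scheme
(import data)

;;; (pokemon-total row) -> number?
;;;   row: list? of strings, a data row from Pokemon.csv
;;; Convert the "Total" entry of row (index 4) from a string to a number.
;;; (pokemon-total is a hypothetical helper, not part of the data package.)
(define pokemon-total
  (lambda (row)
    (string->number (list-ref row 4))))

;;; Extract the totals of all the Pokemon in the file.
(with-file "Pokemon.csv"
  (lambda (text)
    (let* ([table (parse-csv text)]
           [rows (cdr table)])   ; skip the header row
      (map pokemon-total rows))))
```

We drop the header row with cdr because its entries, such as "Total", are labels rather than numerals. For the six-line file shown earlier, the result should be a list of five numbers, one per Pokemon.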

More stuff about files

Note that the file in the examples above, "Pokemon.csv", must already be loaded into Scamper's browser storage. Upload it the same way you upload .scm files.

There is an alternative version of with-file called with-file-chooser.

;;; (with-file-chooser fn) -> void
;;;   fn: procedure?
;;; Renders a file chooser widget. When the user selects a 
;;; file, its contents are passed to fn as input. The output of 
;;; fn is then rendered to the screen.
(with-file-chooser 
   (lambda (x) x))

If you try running this code, you’ll see a button on the output pane that you can click and choose a file from anywhere on your computer. You can even click it again and choose a different file and the code will re-run with the new file.
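
Any procedure that takes the text of a file as input can be passed to with-file-chooser. As a small sketch, the following reports how many rows are in whichever csv file is chosen; count-rows is a name made up for this illustration.

```scheme
(import data)

;;; (count-rows text) -> integer?
;;;   text: string?, the contents of a csv file
;;; Count the rows of the table that text represents, header included.
;;; (count-rows is a hypothetical helper, not part of the data package.)
(define count-rows
  (lambda (text)
    (length (parse-csv text))))

;;; Render a file chooser; the row count appears once a file is selected.
(with-file-chooser count-rows)
```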

An aside: Pairs

A pair is a Scheme data type that we have not yet mentioned. A pair is a collection of exactly two values. Pairs are useful wherever we want to store or pass around two values as one. (pair x1 x2) makes a pair from two values x1 and x2. To get the first item from a pair named p, call (car p); to get the second item, call (cdr p).
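
For instance, a procedure can hand back two values at once by packaging them as a pair. The sketch below assumes the standard list-ref procedure is available; name-and-type, bulbasaur, and summary are names made up for this illustration.

```scheme
;;; (name-and-type row) -> pair?
;;;   row: list? of strings, a row of Pokemon data
;;; Package the name (index 1) and first type (index 2) of row as a pair.
;;; (name-and-type is a hypothetical helper for illustration.)
(define name-and-type
  (lambda (row)
    (pair (list-ref row 1) (list-ref row 2))))

(define bulbasaur
  (list "1" "Bulbasaur" "Grass" "Poison" "318" "45" "49" "49" "65" "65" "45" "1" "False"))

(define summary (name-and-type bulbasaur))

(car summary)   ; the name, "Bulbasaur"
(cdr summary)   ; the first type, "Grass"
```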

(define pair-example (pair 4 "even"))
(car pair-example)
4
(cdr pair-example)
"even"

Finding data in a table

Pairs give us a handy tool called an association list: a list of pairs in which the first element of each pair is a key and the second element is the data associated with that key. For the Pokemon data above, a natural key is the first entry of each row, the number.

(pair "1" ("Bulbasaur" "Grass" "Poison" "318" "45" "49" "49" "65" "65" "45" "1" "False"))

First of all, how would we change our data to that format? A pretty simple procedure can split a list into two pieces, and then we can map that procedure onto each element of the list.

;;; (key-pair lst) -> pair?
;;;    lst: list?, non-empty
;;; Change the contents of lst into a pair where the first
;;; element of the pair is the first element of lst, and the second
;;; element of the pair is the remainder of lst.
(define key-pair
  (lambda (lst)
    (pair (car lst) (cdr lst))))

;;; Look at our Pokemon data in a keyed arrangement
(with-file "Pokemon.csv"
  (lambda (text)
    (let* ([data (parse-csv text)]
           [keyed-data (map key-pair data)])
      keyed-data)))
((pair "#" ("Name" "Type 1" "Type 2" "Total" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary")) (pair "1" ("Bulbasaur" "Grass" "Poison" "318" "45" "49" "49" "65" "65" "45" "1" "False")) (pair "4" ("Charmander" "Fire" "" "309" "39" "52" "43" "60" "50" "65" "1" "False")) (pair "7" ("Squirtle" "Water" "" "314" "44" "48" "65" "50" "64" "43" "1" "False")) (pair "12" ("Butterfree" "Bug" "Flying" "395" "60" "45" "50" "90" "80" "70" "1" "False")) (pair "17" ("Pidgeotto" "Normal" "Flying" "349" "63" "60" "55" "50" "50" "71" "1" "False")))

Second of all, why would we want our data in this format? A list of key-value pairs is an association list, and Scamper provides procedures for working with association lists. Once we have our data, we may want to find a particular entry by its key. For example, which entry has key “4”? We can find out with a procedure called assoc-ref.

;;; (assoc-ref k l) -> any
;;;   k: any
;;;   l: list?, an association list
;;; Returns the value associated with key k in association list l.

Let’s see it in action.

;;; (key-pair lst) -> pair?
;;;    lst: list?, non-empty
;;; Change the contents of lst into a pair where the first
;;; element of the pair is the first element of lst, and the second
;;; element of the pair is the remainder of lst.
(define key-pair
  (lambda (lst)
    (pair (car lst) (cdr lst))))
 
;;; Find the value in the data with key "4"
(with-file "Pokemon-super-short.csv"
  (lambda (text)
    (let* ([data (parse-csv text)]
           [key-data (map key-pair data)])
       (assoc-ref "4" key-data))))
("Charmander" "Fire" "" "309" "39" "52" "43" "60" "50" "65" "1" "False")

How about with key “2”?

(with-file "Pokemon-super-short.csv"
  (lambda (text)
    (let* ([data (parse-csv text)]
           [key-data (map key-pair data)])
       (assoc-ref "2" key-data))))
 void
Runtime error [13:8-13:31]: (assoc-ref) assoc-ref: key 2 not found in association list

Other helpful procedures for working with association lists include assoc-key? and assoc-set.

;;; (assoc-key? k l) -> boolean?
;;;   k: any
;;;   l: list?, an association list
;;; Returns #t if k is a key in association list l and #f otherwise.

;;; (assoc-set k v l) -> list?
;;;   k: any
;;;   v: any
;;;   l: list?, an association list
;;; Returns a new association list containing the same key-value pairs
;;; as l, except that k is associated with v.
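
Since assoc-ref reports an error when the key is missing, one way to look up a key safely is to check with assoc-key? first. The following sketch uses a small hand-built association list; types and lookup-or-default are names made up for this illustration.

```scheme
;;; A small association list from Pokemon numbers to first types.
(define types
  (list (pair "1" "Grass") (pair "4" "Fire") (pair "7" "Water")))

;;; (lookup-or-default key alist default) -> any
;;;   key: any
;;;   alist: list?, an association list
;;;   default: any
;;; Return the value associated with key in alist, or default if
;;; key does not appear in alist.
;;; (lookup-or-default is a hypothetical helper for illustration.)
(define lookup-or-default
  (lambda (key alist default)
    (if (assoc-key? key alist)
        (assoc-ref key alist)
        default)))

(lookup-or-default "4" types "unknown")   ; key present: "Fire"
(lookup-or-default "2" types "unknown")   ; key absent: "unknown"

;;; Build a new association list in which "2" is associated with "Grass".
;;; The original list, types, is unchanged.
(assoc-set "2" "Grass" types)
```

Checking for the key first costs a second pass over the list, which is a reasonable trade for small tables like these.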

Self checks

Check 1: Your own csv file (‡)

a. Create a csv file that represents your schedule for the semester. For example, a line of your file might read

CSC-151,01,MWF,2:30,3:50,Professor Rosseforp

b. Confirm that you can read the file using with-file.

(Include in your Gradescope submission the contents of your csv file, and the code you wrote to read the file)

Check 2: More Keys

Think about how you might modify the procedure key-pair so that a different element of the list becomes the key. For example, perhaps the name would be more useful as a key than the number.

Copyright © Eric Autry, Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Nicole Eikmeier, Fahmida Hamid, Priscilla Jiménez, Barbara Johnson, Titus Klinge, Peter-Michael Osera, Leah Perlmutter, Samuel A. Rebelsky, William Rebelsky, John David Stone, Anya Vostinar, Henry Walker, and Jerod Weinman.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
