While some people can spot trends and patterns in tables of numbers, many find that visual representations of the data make it easier to see possible trends, clusters, and other characteristics. This is particularly common when a data set is too large to consider line by line.
As we explore the power of data science, we have some responsibility to consider how to visualize data. While there is a wide array of visualization tools available, both generally and in DrRacket, we will focus on a few core types of visualization. In particular, we will consider histograms (aka “bar charts”), scatterplots, and line graphs. Those interested in exploring more capabilities can read the DrRacket reference manual.
DrRacket provides a package called plot that supports a wide variety of visualizations. All of them can be created by a variant of a call to the plot procedure.
The first parameter to that procedure is a description of the data to be plotted and the way in which to plot those data. For example, we describe a scatterplot using points, a line plot using lines, and histograms with discrete-histogram or stacked-histogram.
Let’s start with a simple call to plot to create a plot of a few points. (In general, we will generate data from our program; this just gets you started thinking about how things are shown.)
> (plot (points (list (list 0 0) (list 10 10) (list 3 5) (list 1 4)
                      (list 8 2))))
As the image suggests, this expression gives us a fairly simple diagram with some potential flaws, such as the difficulty of seeing some of the points or the default labels, which tell us very little about the data. We will consider how to override these values in the subsequent sections.
As you’ve seen, if we have a collection of points stored as a list of lists of x,y pairs, we can plot that collection with
(plot (points list-of-points))
Doing so creates a simple scatterplot with the labels “x axis” and “y axis”, the x axis running from the smallest x value to the largest x value, and the y axis running from the smallest y value to the largest y value.
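Since we will usually generate data from a program rather than type it in, here is a minimal sketch of building such a list of points. The helper name function->points and the choice of squaring as the function are assumptions made for illustration; they are not part of the plot package. The code is written for the definitions pane rather than the interactions pane.

```racket
#lang racket
(require plot)

;; Hypothetical helper (not part of plot): pair each x in xs with (f x),
;; giving a list of two-element lists that points can accept.
(define (function->points f xs)
  (map (lambda (x) (list x (f x))) xs))

;; The points (0 0), (1 1), (2 4), ..., (10 100).
(define squares (function->points sqr (range 0 11)))

;; A scatterplot of those points with the default labels and ranges.
(plot (points squares))
```

Any procedure that takes a number and returns a number could stand in for sqr here.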
There are at least five things we would likely want to customize about this plot: We should add a title. We should give a name to the x axis (typically, the independent variable). We should give a name to the y axis (typically, the dependent variable). We should expand the x axis so that it’s easier to see the smallest and largest x values. We should expand the y axis so that it’s easier to see the smallest and largest y values.
We can do all of these things with so-called optional parameters (the Racket documentation calls them keyword arguments). In DrRacket, an optional parameter has the form #:NAME VAL. That is, a pound sign, a colon, the name of the optional parameter, a space (or as much whitespace as we want), and a corresponding value. For example, to add a title to our plot, we add an optional parameter of the form #:title followed by the title as a string.
> (plot (points (list (list 0 0) (list 10 10) (list 3 5) (list 1 4)
                      (list 8 2)))
        #:title "This and That")
We label the x axis and the y axis similarly, using #:x-label and #:y-label.
> (plot (points (list (list 0 0) (list 10 10) (list 3 5) (list 1 4)
                      (list 8 2)))
        #:title "This and That"
        #:x-label "This"
        #:y-label "That")
While those labels are associated with the plot as a whole, changes to the ranges of the axes fall within the points command. The optional arguments are #:x-min, #:x-max, #:y-min, and #:y-max. Let’s expand the range a bit to see the two extreme points.
> (plot (points (list (list 0 0) (list 10 10) (list 3 5) (list 1 4)
                      (list 8 2))
                #:x-min -1
                #:x-max 11
                #:y-min -1
                #:y-max 11)
        #:title "This and That"
        #:x-label "This"
        #:y-label "That")
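When the data come from a program, we may not know the extreme values in advance. The following sketch computes the axis bounds from the data instead of typing them in. The helper padded-range and the padding of 1 are assumptions chosen to reproduce the ranges used above; neither is part of the plot package.

```racket
#lang racket
(require plot)

(define data
  (list (list 0 0) (list 10 10) (list 3 5) (list 1 4) (list 8 2)))

;; Hypothetical helper (not part of plot): a two-element list holding
;; the smallest value minus pad and the largest value plus pad.
(define (padded-range vals pad)
  (list (- (apply min vals) pad)
        (+ (apply max vals) pad)))

;; The x values are the first element of each point; the y values the second.
(define x-range (padded-range (map first data) 1))
(define y-range (padded-range (map second data) 1))

(plot (points data
              #:x-min (first x-range)
              #:x-max (second x-range)
              #:y-min (first y-range)
              #:y-max (second y-range))
      #:title "This and That"
      #:x-label "This"
      #:y-label "That")
```

For this data set, both ranges come out as -1 to 11, matching the hand-written values.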
We can choose a wide variety of symbols using the #:sym optional parameter and can set the color of those symbols with #:color or #:fill-color.
You can find the list of symbols in the Racket documentation. Here’s one example.
> (plot (points (list (list 0 0) (list 10 10) (list 3 5) (list 1 4)
                      (list 8 2))
                #:fill-color "red"
                #:sym 'fullcircle6
                #:x-min -1
                #:x-max 11
                #:y-min -1
                #:y-max 11)
        #:title "This and That"
        #:x-label "This"
        #:y-label "That")
What if we want to plot more than one set of points, perhaps in different colors or with different shapes? We can make a list of calls to points.
> (plot (list (points (list (list 0 0) (list 10 10) (list 3 5) (list 1 4)
                            (list 8 2))
                      #:fill-color "red"
                      #:sym 'fullcircle6
                      #:x-min -1
                      #:x-max 11
                      #:y-min -1
                      #:y-max 11)
              (points (list (list 5 10) (list 6 9) (list 8 7))
                      #:fill-color "black"
                      #:sym 'fullcircle6))
        #:title "This and That"
        #:x-label "This"
        #:y-label "That")
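The separate sets of points need not be typed in by hand. As a sketch, we might split one data set into two groups with filter and plot each group in its own color. The predicate above-diagonal? and the rule it encodes (y greater than x) are assumptions made only for illustration.

```racket
#lang racket
(require plot)

(define data
  (list (list 0 0) (list 10 10) (list 3 5) (list 1 4) (list 8 2)))

;; Hypothetical predicate: does the point lie above the line y = x?
(define (above-diagonal? pt)
  (> (second pt) (first pt)))

;; The points above the diagonal, and all of the remaining points.
(define above (filter above-diagonal? data))
(define at-or-below
  (filter (lambda (pt) (not (above-diagonal? pt))) data))

(plot (list (points above
                    #:fill-color "red"
                    #:sym 'fullcircle6
                    #:x-min -1
                    #:x-max 11
                    #:y-min -1
                    #:y-max 11)
            (points at-or-below
                    #:fill-color "black"
                    #:sym 'fullcircle6))
      #:title "This and That"
      #:x-label "This"
      #:y-label "That")
```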
The DrRacket plot package also supports line plots. They act much like point plots except that you use lines rather than points and you don’t specify symbols. (You can still set the color of the line with #:color.)
> (plot (lines (list (list 0 0) (list 10 10) (list 3 5) (list 1 4)
                     (list 8 2))))
We can also combine line plots and scatterplots.
> (define elements
    (list (list 0 0) (list 10 10) (list 3 5) (list 1 4) (list 8 2)))
> (plot (list (lines elements)
              (points elements #:sym 'fullcircle6)))
DrRacket’s plot package also supports two kinds of histograms: discrete histograms, in which we have only one value associated with each category, and stacked histograms, in which we might have multiple values associated with each. Here’s a simple example.
> (plot (discrete-histogram (list (list "A" 5) (list "B" 8) (list "C" 3))))
Here’s one using stacked-histogram and some optional parameters.
> (plot (stacked-histogram (list (list "A" (list 6 12))
                                 (list "B" (list 5 10 7))
                                 (list "C" (list 15 5)))
                           #:labels (list "This" "That" "Other")
                           #:colors (list "black" "red" "gray")))
The plot package contains a wide variety of other plotting options and opportunities, from additional parameters to each of the approaches we’ve looked at already to other options for plotting data, including 3D plots and vector fields. We will leave you to explore those on your own.
As you might expect, we will apply each of these approaches to data from our “standard” data sets, such as zip codes or public domain novels. Our first goal will be to determine which visualization might be most appropriate for each form of data. Our second goal will be to turn the data into a form that these procedures can use. Our third will be to use plot to visualize the data.
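As a small illustration of turning data into a form these procedures accept, the sketch below tallies a list of category names into the category/count pairs that discrete-histogram expects. The sample list grades and the helper tally are assumptions invented for this example; they do not come from the standard data sets.

```racket
#lang racket
(require plot)

;; Made-up sample data: one letter grade per student.
(define grades (list "A" "B" "B" "C" "A" "B" "A" "B"))

;; Hypothetical helper (not part of plot): for each distinct item, in
;; order of first appearance, build a list of the item and how many
;; times it appears in items.
(define (tally items)
  (map (lambda (category)
         (list category
               (count (lambda (item) (equal? item category)) items)))
       (remove-duplicates items)))

;; One bar per category, with height equal to its count.
(plot (discrete-histogram (tally grades)))
```

This version rescans the list once per category, which is fine for small data sets; a larger data set might call for a hash table instead.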
Does the order in which we present the points matter in a scatterplot? Why or why not?
Does the order in which we present the points matter in a line plot? Why or why not?