Class Project: Exploring Data

Proposal Due: Wednesday November 26

Project Due: Friday December 12

Presentations Due: Wednesday December 10

Summary: At this point in your career, you’ve learned a number of techniques for gathering and displaying data. This project is an opportunity for you to explore some of those techniques in greater depth.

Purposes: To explore some aspect of data science in depth. To emphasize the more creative components of this course. To encourage more purposeful reflection on algorithms.

Collaboration: You will work in groups of size 2 to 4. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.

Some Resources

  • The grading rubric at the bottom of this page will be an important guide for you.
  • Kaggle and UCI have some interesting open datasets available.

Assignment

Background: Specification

In this project, your group will identify a data set of interest, design and implement one or more nontrivial algorithms for manipulating the data set and extracting information, run those algorithms, and present the results. The primary intent is for you to demonstrate, in a novel way, your mastery of the concepts and skills introduced in the class.

General expectations

Reasonable size: Your project should be of a scope that it can be completed by your group with approximately ten hours of work per team member over a two-week period (five hours per week).

Moderately large data: You should identify a moderately large data set, preferably with a few thousand or more data points or with many columns of data to work with. Different problems may require different kinds or sizes of data. For example, if you are working with a group of literary works, you may decide to use only a few dozen, since each work has hundreds of thousands of words.

Nontrivial, novel algorithm: Your project should demonstrate the group’s ability to design and implement a nontrivial algorithm that differs reasonably from any of the algorithms we have already defined. You might combine ideas we have discussed in new ways or you may develop a completely new algorithm.

Alternative outcomes: As you’ve likely discovered, we tend to underestimate how much time it takes to complete a computer project. Hence, your project should have three targets: (a) an intended outcome: what you expect to be able to achieve; (b) a satisficing outcome: something not as complex as the intended outcome, but complex enough that it meets the general expectations for the project; and (c) a reach outcome: something that you can try to achieve if the intended goal is more straightforward than you expected.

For example, if the project is to use part of the Gutenberg dataset to develop an algorithm that can distinguish between 18th-century and 19th-century British literature, your satisficing goal might be to extract a set of a dozen characteristics from any work that you could then compare to another work, and your reach goal might be to take any two sets of literature and generate a process that distinguishes between them.

Sample categories

Traditional data analysis

Convert a data set from some format to CSV. Clean the data. Compute interesting summary statistics (e.g., tallies of different subgroups). Visualize those summaries. The focus of this kind of project is primarily the complexity of the data and the kinds of ways you process more complex data. A creative visualization might also serve as the focus.
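
As a rough illustration of the "tallies of different subgroups" idea, here is a minimal sketch. It is written in Python only to show the shape of the computation; your project code must still run in Scamper. The sample data, the column names, and the helper `tally_by_column` are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; a real project would read a much larger file.
SAMPLE_CSV = """name,major,year
Ada,CS,2
Ben,Math,3
Cy,CS,1
Di,Bio,2
Eve,CS,3
"""

def tally_by_column(csv_text, column):
    """Count how many rows fall into each distinct value of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Strip whitespace so " CS" and "CS" land in the same subgroup (basic cleaning).
    return Counter(row[column].strip() for row in reader)

major_counts = tally_by_column(SAMPLE_CSV, "major")
```

A tally like this is only a starting point; a project in this category would be expected to handle messier data and more involved summaries.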

Text generation

After gathering information from a group of texts (e.g., individual word frequency, likelihood of one word following another word (or another sequence of words)), probabilistically generate moderately clear text. The focus of this kind of project is likely the algorithm for successfully gathering data and using that data.
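
One common way to approach this is a table that records which words follow each word, then repeated random sampling from that table. The sketch below assumes that approach; it is in Python purely for illustration (project code must run in Scamper), and the names `build_followers` and `generate` and the toy corpus are hypothetical.

```python
import random
from collections import defaultdict

def build_followers(words):
    """Map each word to the list of words observed immediately after it."""
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, rng):
    """Generate up to `length` words by repeatedly sampling a follower."""
    output = [start]
    while len(output) < length:
        choices = followers.get(output[-1])
        if not choices:  # dead end: the last word never had a follower
            break
        # Duplicates in the list make frequent followers proportionally more likely.
        output.append(rng.choice(choices))
    return output

# Toy corpus; a real project would gather counts from whole texts.
corpus = "the cat sat on the mat and the cat ran".split()
table = build_followers(corpus)
sentence = generate(table, "the", 8, random.Random(151))
```

Keeping duplicate followers in each list is a simple way to make sampling reflect observed frequencies. Conditioning on longer sequences of preceding words usually produces clearer text, at the cost of needing more data.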

Text identification

Given two sets of written works (e.g., books by Bronte and Doyle; books from different centuries; books from different genres), identify some distinguishing characteristics (e.g., number of distinct words, sentence length, common or uncommon words) and use those characteristics to classify new works as belonging to one set or the other.
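
To make the idea concrete, the sketch below computes two of the characteristics mentioned above (average sentence length and the ratio of distinct words) and assigns a new text to whichever set has the closer average. This nearest-average rule is just one possible choice, not a required method. The code is Python for illustration only (project code must run in Scamper), and the function names and toy texts are hypothetical.

```python
import re

def features(text):
    """Return (average sentence length in words, ratio of distinct words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    avg_sentence_len = len(words) / len(sentences)
    distinct_ratio = len(set(words)) / len(words)
    return (avg_sentence_len, distinct_ratio)

def centroid(texts):
    """Average the feature vectors of a set of texts, feature by feature."""
    vectors = [features(t) for t in texts]
    return tuple(sum(col) / len(col) for col in zip(*vectors))

def classify(text, centroids):
    """Pick the label whose centroid is closest (squared Euclidean distance)."""
    fv = features(text)
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(fv, centroids[label]))
    return min(centroids, key=dist)

# Toy training sets standing in for two collections of works.
short_set = ["Go now. Run fast. Stop here.", "He ran. She sat. We won."]
long_set = [
    "The long and winding road leads onward to the distant hills beyond the river.",
    "Once upon a time there lived a quiet scholar who studied the stars every single night.",
]
centroids = {"short": centroid(short_set), "long": centroid(long_set)}
label = classify("It rained. We left.", centroids)
```

Note that the features here are not rescaled, so sentence length dominates the distance. A fuller project would likely normalize the features and use many more of them.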

Data analysis tools for a particular data set

Given a particular data set, develop tools that someone in the first week or two of CSC 151 could use to explore the data set. You might, for example, allow someone to gather trend data from a larger data set or combine data in new ways.

Part Zero: Group Assignments

To be discussed in class.

Part One: Topic Exploration

Coordinate with your group members to come up with a list of at least three possible topics, along with datasets. Submit these topic ideas via Gradescope by Friday November 21 at 10:30 pm CST. I will get feedback to you as soon as possible about whether the ideas are of appropriate scope for this project.

Part Two: Proposal

Your project proposal describes the core aspects of your project:

  • The general theme of the project. “We are writing a program to distinguish the works of Bronte and Doyle.” “We are writing a program that identifies trends in Twitter posts based on geographical location.” “We are writing a program that visualizes change in income disparity across the past three decades.”
  • The data set or sets with which you are working. You should indicate where the data come from and describe what one “row” of data looks like. If you expect to need to massage or clean the data before processing it, you might explain what transformations you expect to need to do. (You may also have completed those transformations by the time you submit the proposal.)
  • A high-level overview of the algorithm or algorithms you intend to implement.
  • A short description of the intended outcome, satisficing outcome, and reach outcome.

Your proposal should employ correct grammar and spelling. Approximately one or two pages should suffice. Be sure to consult the rubric!

We will do our best to respond to your proposal in a timely manner. However, given other constraints, we may not be able to do so.

Part Three: Project

After finishing your proposal, you should set to work on implementing your design, making sure to meet all of the specifications outlined above. Be sure to consult the rubric!

Your final project should be accompanied by a short report, most likely based on your proposal. Describe the primary goal of the project, the data, and the algorithm you implemented. Conclude with a short description of what your analysis showed (e.g., “the novels of Bronte and Doyle can easily be distinguished based on …” or “by using the technique of … our algorithm is able to correctly classify cancer cells 95% of the time” or “Throughout the past two decades, naming practices in the US have evolved from dominance by a few names to a more equal distribution amongst a much wider set of names”).

Your final project should also be accompanied by a set of straightforward instructions for running the code.

Part Four: Presentation

You will give a quick (3-5 min) presentation on your project to your peers, in class.

Submission guidelines

Part Zero: Make sure to tell your instructor at the end of class who is in your group
Part One: Submit the three topics and data sets on Gradescope
Part Two: Submit your proposal on Gradescope
Part Three: Submit your program, data set, and project report to Gradescope
Part Four: Will be presented in class

Questions

Can we reuse code from the assignments and labs?
You may certainly reuse code from the assignments and labs, provided you cite that code. However, you should make sure that your project goes beyond what you did for the assignment or lab. Hence, you will likely want to extend or otherwise rewrite that code. (Even if you extend or rewrite code, you should still cite its origin and influence.)

Grading rubric

In grading your submission, we will look for the following at each level. Note that if a submission does not meet the criteria at a lower level, we will likely not check the criteria at the higher levels. We may also identify other characteristics that move your work between levels.

You should read through the rubric and verify that your submission meets the rubric.

Redo or above

Submissions that lack any of these characteristics will get an N.

[] Part 1 is submitted with three unique project ideas
[] Part 2 is submitted
[] Part 3 is submitted 
[] Part 3: Acknowledges appropriately
[] Part 3: All code runs in Scamper.
[] Part 3: Instructions are provided on how to run code.

Meets expectations or above

Submissions that lack any of these characteristics but have all of the prior characteristics will get an R.

[] Part 2: The proposal includes all of the required components:
      a heading (with authors) and title, a short description of
      the project goals, a description of the data (with a citation), a description 
      of the algorithm(s), and the intended, satisficing and reach outcomes.
[] Part 2: The proposal is written clearly without overly distracting typos or issues with grammar
[] Part 2: The data description clearly indicates the source of the data,
      the form of the data, and the type and range of each data element.
[] Part 2: The proposal includes a description of each algorithm to be implemented
       as well as an outline of how they will be implemented.
[] Part 2: The proposal includes a clear, well-written description of the 
      intended outcome, satisficing outcome, and reach outcome and each 
      outcome is appropriately scoped for a two-week project.
[] Part 3: The submission includes a report, a Scheme file, and all necessary
      data files.
[] Part 3: Code has appropriate indentation which makes it easy to read
[] Part 3: Code has some amount of documentation
[] Part 3: The Report includes all of the required components:
      a heading (with authors) and title, a short description of
      the project goals, a description of the data, a description 
      of the algorithm or algorithms, a description of what your analysis
      showed, and instructions for running your code.
[] Part 3: The Report is written clearly without overly distracting typos or issues with grammar
[] Part 3: The Report contains a data description which clearly indicates the source of the data,
      the form of the data, and the type and range of each data element.
[] Part 3: The Report includes a description of each algorithm as well as a brief 
      overview of how each was implemented.
[] Part 3: The Report includes a description of what your analysis showed.
[] Part 4: The presentation is approximately the correct length, and provides a complete
      overview of the theme, data sets, goal, and analysis of the project.

Exemplary / Exceeds expectations

Submissions that lack any of these characteristics but have all of the prior characteristics will get an M.

[] Part 3: All code is exceptionally organized and easy to read, through the use of comments (to explain the purpose of different pieces of the code), decomposition, and highly intuitive naming choices.
[] Part 3: Code is documented in the 151 style for all procedures, and contains correct information.
[] Part 3: The algorithms implemented are non-trivial and use many new techniques 
       and/or combine techniques in a novel way.
[] Part 3: The Report addresses any objections or biases that may have affected your results.
[] Part 4: The presentation is organized well, appropriate in length, and includes a brief overview of the algorithms implemented.