Proposal Due: Wednesday November 26
Project Due: Friday December 12
Presentations Due: Wednesday December 10
Summary: At this point in your career, you’ve learned a number of techniques for gathering and displaying data. This project is an opportunity for you to explore some techniques in greater depth.
Purposes: To explore some aspect of data science in depth. To emphasize the more creative components of this course. To encourage more purposeful reflection on algorithms.
Collaboration: You will work in groups of size 2 to 4. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Some Resources
In this project, your group will identify a data set of interest, design and implement a nontrivial algorithm or algorithms for manipulating the data set and extracting information, run the algorithm, and present the results of the algorithm. The primary intent is that you demonstrate, in a novel way, your mastery of the concepts and skills introduced in the class.
Reasonable size: Your project should be of a scope that it can be completed by your group with approximately ten hours of work per team member over a two-week period (five hours per week).
Moderately large data: You should identify a moderately large data set, preferably with a few thousand or more data points or with many columns of data to work with. Different problems may require different kinds or sizes of data. For example, if you are working with a group of literary works, you may decide to use only a few dozen, since each work has hundreds of thousands of words.
Nontrivial, novel algorithm: Your project should demonstrate the group’s ability to design and implement a nontrivial algorithm that differs reasonably from any of the algorithms we have already defined. You might combine ideas we have discussed in new ways or you may develop a completely new algorithm.
Alternative outcomes: As you’ve likely discovered, we tend to underestimate how much time it takes to complete a computer project. Hence, your project should have three targets: (a) an intended outcome - what you expect to be able to achieve; (b) a satisficing outcome - something not as complex as the intended outcome, but complex enough that it meets the general expectations for the project; and (c) a reach outcome - something that you can try to achieve if the intended goal is more straightforward than you expected.
For example, if the project were to use part of the Gutenberg dataset to develop an algorithm that can distinguish between 18th century and 19th century British literature, your satisficing goal might be to extract a set of a dozen characteristics from any work that you could then compare to another work, and your reach goal might be to take any two sets of literature and generate a process that distinguishes between them.
Convert a data set from some format to CSV. Clean the data. Compute interesting summary statistics (e.g., tallies of different subgroups). Visualize those summaries. The focus of this kind of project is primarily the complexity of the data and the kinds of ways you process more complex data. A creative visualization might also serve as the focus.
After gathering information from a group of texts (e.g., individual word frequency, likelihood of one word following another word (or another sequence of words)), probabilistically generate moderately clear text. The focus of this kind of project is likely the algorithm for successfully gathering data and using that data.
Given two sets of written works (e.g., books by Bronte and Doyle; books from different centuries; books from different genres), identify some distinguishing characteristics (e.g., number of distinct words, sentence length, common or uncommon words) and use those characteristics to classify new works as belonging to one set or the other.
Given a particular data set, develop tools that someone in the first week or two of CSC 151 could use to explore the data set. You might, for example, allow someone to gather trend data from a larger data set or combine data in new ways.
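As a rough illustration of the text-generation idea above, the following minimal sketch records which words follow each word in a sample and then generates text by repeatedly choosing a follower at random. It is written in Python purely for illustration; your project code still needs to run in Scamper. The names `build_bigram_table`, `generate`, and `sample` are hypothetical, and a real project would read a much larger text and handle punctuation and capitalization.

```python
import random
from collections import defaultdict


def build_bigram_table(words):
    """Map each word to the list of words seen immediately after it."""
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return dict(table)


def generate(table, start, length, rng):
    """Build up to `length` words, choosing each next word from the followers
    of the current word. Repeated followers are kept in the lists, so more
    frequent followers are proportionally more likely to be chosen."""
    output = [start]
    while len(output) < length:
        followers = table.get(output[-1])
        if not followers:  # dead end: this word never had a follower
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Tiny hypothetical sample; a real project would load a full text.
sample = "the cat sat on the mat and the cat ran".split()
table = build_bigram_table(sample)
print(generate(table, "the", 8, random.Random(0)))
```

Keeping duplicate followers in each list is a simple way to get frequency-weighted choices without storing explicit probabilities. Longer contexts (for example, pairs of words as keys) tend to give clearer text at the cost of needing more data.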
To be discussed in class.
Coordinate with your group members to come up with a list of at least 3 possible topics, along with datasets. Submit these topic ideas via Gradescope by Friday November 21 at 10:30 pm CST. I will get feedback to you as soon as possible about whether the ideas are of appropriate scope for this project.
Your project proposal describes the core aspects of your project:
Your proposal should employ correct grammar and spelling. Approximately one or two pages should suffice. Be sure to consult the rubric!
We will do our best to respond to your proposal in a timely manner. However, given other constraints, we may not be able to do so.
After finishing your proposal, you should set to work on implementing your design, making sure to meet all of the specifications outlined above. Be sure to consult the rubric!
Your final project should be accompanied by a short report, most likely based on your proposal. Describe the primary goal of the project, the data, and the algorithm you implemented. Conclude with a short description of what your analysis showed (e.g., “the novels of Bronte and Doyle can easily be distinguished based on …” or “by using the technique of … our algorithm is able to correctly classify cancer cells 95% of the time” or “Throughout the past two decades, naming practices in the US have evolved from dominance by a few names to a more equal distribution amongst a much wider set of names”).
Your final project should also be accompanied by a set of straightforward instructions for running the code.
You will give a quick (3-5 min) presentation on your project to your peers, in class.
In grading your submission, we will look for the following at each level. Note that if a criterion is not met at a lower level, we will likely not check the criteria at the higher levels. We may also identify other characteristics that move your work between levels.
You should read through the rubric and verify that your submission meets the rubric.
Submissions that lack any of these characteristics will get an N.
[] Part 1 is submitted with three unique project ideas
[] Part 2 is submitted
[] Part 3 is submitted
[] Part 3: Credits discussions and other help appropriately.
[] Part 3: All code runs in Scamper.
[] Part 3: Instructions are provided on how to run code.
Submissions that lack any of these characteristics but have all of the prior characteristics will get an R.
[] Part 2: The proposal includes all of the required components:
a heading (with authors) and title, a short description of
the project goals, a description of the data (with a citation), a description
of the algorithm(s), and the intended, satisficing and reach outcomes.
[] Part 2: The proposal is written clearly without overly distracting typos or issues with grammar
[] Part 2: The data description clearly indicates the source of the data,
the form of the data, and the type and range of each data element.
[] Part 2: The proposal includes a description of each algorithm to be implemented
as well as an outline of how they will be implemented.
[] Part 2: The proposal includes a clear, well-written description of the
intended outcome, satisficing outcome, and reach outcome and each
outcome is appropriately scoped for a two-week project.
[] Part 3: The submission includes a report, a Scheme file, and all necessary
data files.
[] Part 3: Code has appropriate indentation which makes it easy to read
[] Part 3: Code has some amount of documentation
[] Part 3: The Report includes all of the required components:
a heading (with authors) and title, a short description of
the project goals, a description of the data, a description
of the algorithm or algorithms, a description of what your analysis
showed, and instructions for running your code.
[] Part 3: The Report is written clearly without overly distracting typos or issues with grammar
[] Part 3: The Report contains a data description which clearly indicates the source of the data,
the form of the data, and the type and range of each data element.
[] Part 3: The Report includes a description of each algorithm as well as a brief
overview of how each was implemented.
[] Part 3: The Report includes a description of what your analysis showed.
[] Part 4: The presentation is approximately the correct length, and provides a complete
overview of the theme, data sets, goal, and analysis of the project.
Submissions that lack any of these characteristics but have all of the prior characteristics will get an M.
[] Part 3: All code is exceptionally organized and easy to read, through the use of comments (to explain the purpose of different pieces of the code), decomposition, and highly intuitive naming choices.
[] Part 3: Code is documented in the 151 style for all procedures, and contains correct information.
[] Part 3: The algorithms implemented are non-trivial and use many new techniques
and/or combine techniques in a novel way.
[] Part 3: The Report addresses any objections or bias that may have affected your results.
[] Part 4: The presentation is organized well, appropriate in length, and includes a brief overview of the algorithms implemented.