CSCE41403 Project

There are two group projects: a practice project (10 points) and a data analysis project (20 points). Each group consists of 3-4 students and works on both projects together.

Practice Project (10 points):

This practice project focuses on one of the most studied datasets, the Adult dataset, and involves writing your own code (e.g., Python with scikit-learn) for the following specific tasks.

  1. Download the data and read the descriptions in the file adult.names. Remove records with unknown (?) values from both the train and test data sets, and remove all continuous attributes. For each multi-domain categorical attribute, you can use one-hot encoding to transform the data (this step is needed if you use scikit-learn to build the decision tree and naïve Bayes classifiers). In your report, briefly describe how you developed your solution for the following two tasks and include 2-4 screenshots of your algorithm settings and output. (4 points)
    1. Build a decision tree classifier (single tree) and report per-class performance (TP rate, FP rate, precision, recall, F1) on the test data.
    2. Build a naïve Bayes classifier and report per-class performance (TP rate, FP rate, precision, recall, F1) on the test data.
  2. Remove records with unknown (?) values from both the train and test data sets. For each multi-domain categorical attribute, use one-hot encoding to transform the data; for each numerical attribute, use its mean value as the threshold to transform it into a binary attribute. (4 points)
    1. Run the k-means clustering algorithm on the train data with varied k values (3, 5, 10), using a distance function of your choice, and report the centroids of the clusters.
    2. Take the last 10 records of the test data and use the kNN algorithm (with varied k values: 3, 5, 10) to report the prediction accuracy.
  3. Continuing with the train data set from step 2, build an SVM classifier and report the prediction accuracy on the test data. (1 point)
  4. Continuing with the train data set from step 2, build a neural network classifier and report the prediction accuracy on the test data. (1 point)
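
For the per-class measures requested in step 1, the following is a minimal sketch in plain Python of how TP rate, FP rate, precision, recall, and F1 can be derived from actual and predicted labels. It is an illustration under stated assumptions, not a required implementation: the function name per_class_metrics and the toy labels are made up for this example and are not results on the Adult data. Each class is treated in turn as the positive class, so the TP rate is the same quantity as recall.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: {tp_rate, fp_rate, precision, recall, f1}}.

    Hypothetical helper for illustration: each label is treated in turn
    as the positive class (one-vs-rest).
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    total = len(y_true)
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fn = sum(n for (actual, pred), n in pairs.items()
                 if actual == label and pred != label)
        fp = sum(n for (actual, pred), n in pairs.items()
                 if actual != label and pred == label)
        tn = total - tp - fn - fp
        # Guard every ratio against an empty denominator.
        recall = tp / (tp + fn) if tp + fn else 0.0
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {
            "tp_rate": recall,  # TP rate and recall are the same quantity
            "fp_rate": fp_rate,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return report


# Toy labels invented for illustration (not Adult results).
y_true = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]
y_pred = [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"]

report = per_class_metrics(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

If you use scikit-learn, confusion_matrix and classification_report in sklearn.metrics can be used to cross-check precision, recall, and F1; the FP rate still has to be computed from the confusion matrix yourself.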

Timeline

You need to form your group by Sept 4.

Every team member needs to study and finish this practice project by Sept 30 so you can be ready to execute the Data Analysis Project.

Email your practice report to our TA at fshumbus@uark.edu by 11am Oct 7. Note that each group only needs to submit a single report (Word or PDF) that contains the group number, team member names, screenshots of the major code and results, and optional notes/discussion.

Data Analysis Project (20 points):

Projects in this category involve the development and application of data mining techniques covered in class to one real-world data analysis problem. The goal is to fully understand, explore, and mine the dataset (including the business problem to be solved, the data mining task to be performed, the data mining methods/algorithms, and the evaluation of discovered patterns or findings). You may develop, implement, and evaluate any data mining techniques discussed in class or found elsewhere (e.g., clustering, association/sequence analysis, classification), combine various techniques, and use any available data mining or analytical tools/software in your data mining solution.

Choose one of the following problems for your group project:

Note

More datasets can be found here, especially the widely used UCI KDD Repository, Stanford Large Network Dataset Collection, KDD Cup, and Network data collection.

Although this is a group project (with a lot of collaboration), I encourage each student to work through it separately before your group meeting. By the time you finish this project, you are expected to be able to conduct data analysis by yourself. In the final report, I need a one-page private statement from each team member that specifies the responsibilities and contributions of each member of your group in detail. Your grade will be adjusted based on your contributions. If you think the contributions from team members are (roughly) equal, the one-page statement is not required.

Timeline

The final project report is due at 11:59pm on Dec 4. You can bring a hard copy of your report to class on Dec 4 or email it to xintaowu@uark.edu before the deadline.

Presentation requirement

Each team needs to prepare a 7-minute presentation (including 1 minute of Q/A) to receive feedback from students and the instructor. The presentations will be given in mid-November classes.

Final project report requirement (6-10 pages)

The final project report is expected to include the following:

  • Introduction and motivation
  • Proposed goals and summarized accomplishments (if you work on the San Francisco Crime or Airbnb project, include in the final report a screenshot of your best score, based on the contest metric, received from the organizer)
  • Background and related work
  • Design and Implementation
  • Evaluation
  • Summary and Future work
  • References
  • Appendix (optional)

Document Format

Double-column, single-spaced, 10pt font, reasonable margins

You are encouraged to write all of your documents using LaTeX. It is the de facto tool in which most CS papers are written. You can use the ACM SIG Proceedings Templates LaTeX style for your papers. A Microsoft Word template is also provided there.

Research Project:

Students who plan to pursue doctoral study may consider doing a research project instead of the above practice and data analysis projects. This is an individual project. In some situations, a small team (2-3 students) is allowed. You need to discuss your topic with the instructor and get permission before September 4.