(The description here may change over time. It is your responsibility to check this page for updates and changes when you do your project.)
The class roster is here. Form your group and select your topic by Sept 12.
We have two group projects: a practice project (10 points) and a data analysis project (20 points). Each group consists of 3-4 students and will work on both projects together.
Practice Project (10 points):
This practice project focuses on one of the most studied datasets, the Adult dataset, and involves writing your own code (e.g., in Python) or using data mining software (e.g., Weka) for the specific tasks below. Most students are expected to use scikit-learn/Python for this project; non-CS majors who do not have a strong programming background can use data mining software (e.g., Weka) directly.
- Task 1: Download the data and read the descriptions in the file adult.names. Remove records with unknown (?) values from both the train and test data sets, and remove all continuous attributes. For each multi-domain categorical attribute, you can use one-hot encoding to transform the data (this step is needed if you choose scikit-learn to build the decision tree and naïve Bayes classifiers; it is optional if you choose Weka). In your report, briefly describe how you developed your algorithm or applied the software for the following two subtasks, and include 2-4 screenshots of your algorithm settings and output. (4 points)
  - Build a decision tree classifier (single tree) and report the per-class accuracy measures (TP rate, FP rate, precision, recall, F1) on the test data.
  - Build a naïve Bayes classifier and report the per-class accuracy measures (TP rate, FP rate, precision, recall, F1) on the test data.
- Task 2: Remove records with unknown (?) values from both the train and test data sets. For each multi-domain categorical attribute, use one-hot encoding to transform the data; for each numerical attribute, use its mean value as the threshold to transform it into a binary attribute. (4 points)
  - Run k-means clustering on the train data with varied k values (3, 5, 10), using a distance function of your choice, and report the centroids of the clusters.
  - Apply the kNN algorithm (with varied k values 3, 5, 10) to the last 10 records of the test data and report the prediction accuracy.
- Task 3: Continuing with the train data set from Task 2, build an SVM classifier and report its prediction accuracy on the test data. (1 point)
- Task 4: Continuing with the train data set from Task 2, build a neural network classifier and report its prediction accuracy on the test data. (1 point)
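The four practice tasks above can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the course's sample code: `make_toy` is a hypothetical stand-in for the Adult files (with only four of the attributes) so that the sketch runs offline, `BernoulliNB` is one possible naïve Bayes variant for one-hot features, numerical attributes are coded 1 when above the training mean (the direction and the use of the training mean for the test data are assumptions), and all models use default hyperparameters. Note also that scikit-learn's `KMeans` supports only Euclidean distance, so a different distance function would need another implementation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

NUM_COLS = ["age", "hours_per_week"]   # continuous attributes (illustrative subset)
CAT_COLS = ["workclass", "education"]  # categorical attributes (illustrative subset)
LABELS = ["<=50K", ">50K"]


def make_toy(n, seed):
    """Hypothetical stand-in for adult.data / adult.test (not the real data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 70, n),
        "workclass": rng.choice(["Private", "State-gov", "?"], n, p=[0.6, 0.3, 0.1]),
        "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
        "hours_per_week": rng.integers(10, 60, n),
    })
    score = (df["age"] > 40).astype(int) + (df["education"] != "HS-grad").astype(int)
    df["income"] = np.where(score + rng.random(n) > 1.5, ">50K", "<=50K")
    return df


def drop_unknown(df):
    """Remove every record that contains the unknown marker '?'."""
    return df[~(df == "?").any(axis=1)].reset_index(drop=True)


def one_hot(train, test, cols):
    """One-hot encode `cols`, aligning the test columns to the training columns."""
    tr = pd.get_dummies(train[cols], dtype=int)
    te = pd.get_dummies(test[cols], dtype=int).reindex(columns=tr.columns, fill_value=0)
    return tr, te


def per_class_report(y_true, y_pred, labels=LABELS):
    """TP rate, FP rate, precision, recall and F1 for each class."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp  # rows are true classes
    fp = cm.sum(axis=0) - tp  # columns are predicted classes
    tn = cm.sum() - tp - fn - fp
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0)
    return pd.DataFrame({"TP rate": tp / (tp + fn), "FP rate": fp / (fp + tn),
                         "precision": prec, "recall": rec, "F1": f1}, index=labels)


train, test = drop_unknown(make_toy(400, 0)), drop_unknown(make_toy(200, 1))
y_train, y_test = train["income"], test["income"]

# Task 1: categorical attributes only, one-hot encoded.
X1_train, X1_test = one_hot(train, test, CAT_COLS)
reports = {}
for name, model in {"decision tree": DecisionTreeClassifier(random_state=0),
                    "naive Bayes": BernoulliNB()}.items():
    reports[name] = per_class_report(y_test, model.fit(X1_train, y_train).predict(X1_test))
    print(name, reports[name].round(3), sep="\n")

# Task 2: add numerical attributes, binarized at the training mean (1 if above it).
means = train[NUM_COLS].mean()
X2_train = X1_train.join((train[NUM_COLS] > means).astype(int))
X2_test = X1_test.join((test[NUM_COLS] > means).astype(int))

centroids, knn_acc = {}, {}
for k in (3, 5, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2_train)
    centroids[k] = km.cluster_centers_  # array of shape (k, n_features)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X2_train, y_train)
    knn_acc[k] = knn.score(X2_test.tail(10), y_test.tail(10))  # last 10 test records
print("kNN accuracy by k:", knn_acc)

# Tasks 3-4: SVM and neural network on the Task 2 features.
models = {"SVM": SVC(), "neural network": MLPClassifier(max_iter=500, random_state=0)}
accuracy = {name: m.fit(X2_train, y_train).score(X2_test, y_test)
            for name, m in models.items()}
print("test accuracy:", accuracy)
```

To run this on the real data, `make_toy` would be replaced by something like `pd.read_csv("adult.data", header=None, names=..., skipinitialspace=True)` with the full attribute list from adult.names. In the UCI distribution, the adult.test file starts with an extra non-data line and its class labels end with a period, so both need handling before the train and test labels match.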
Timeline
No report is required or graded. Every team member needs to study and finish this project by Oct 1 so that you are ready to carry out the Data Analysis Project. Sample code for tasks 1-2 and tasks 3-4, along with some helpful information, is available here.
Data Analysis Project (20 points):
Projects in this category involve the development and application of data mining techniques covered in class to one real-world data analysis problem. The goal is to fully understand, explore, and mine the dataset (including the business problem to be solved, the data mining task to be performed, the data mining methods/algorithms, and the evaluation of discovered patterns or findings). You may develop, implement, and evaluate any data mining techniques discussed in class or found elsewhere (e.g., clustering, association/sequence analysis, classification), combine various techniques, and use any available data mining or analytical tools/software in your data mining solution.
Choose one of the following problems for your group project:
- San Francisco Crime Classification from Kaggle
- Airbnb New User Bookings from Kaggle
- Wikipedia vandal (early) detection. Check the link for raw/processed datasets (which you should focus on in your course project) as well as some papers. More loosely related resources that could be helpful include the best paper of CIKM 2016, “Vandalism Detection in Wikidata”, which can be downloaded from https://cn.aminer.org/archive/57fdf417654a3f2774eccd31 (the source code and data can be found at http://www.heindorf.me/wdvd.html). Please also check this link: http://www.uni-weimar.de/en/media/chairs/webis/corpora/corpus-wdvc-15/.
- Microsoft Malware Prediction from Kaggle
- Insider Threat Detection from CMU, the original paper
- Hateful Memes Challenge and Dataset from Facebook AI, the original paper, the winners and a paper based on VLMs
Note
A good data mining software package is Weka, which comes with an introduction to data mining with Weka. More software can be found here, especially the lists of Web usage mining, Web content mining, Web searching, and Web analytics software. More datasets can be found here, especially the widely used UCI KDD Repository, Stanford Large Network Dataset Collection, KDD Cup, and Network data collection.
Although this is a group project (with a lot of collaboration), I encourage each student to do the work separately first, before your group meeting. You are expected to be able to analyze data by yourself when you finish this project. In the final report, I need a one-page private statement from each team member that specifies the responsibilities and contributions within your group in detail. Your grade will be adjusted based on your contributions. If you think the contributions from team members are (roughly) equal, the one-page statement is not required.
Timeline
The final project report is due at 11:59 pm on Dec 5.
Presentation requirement
Each team needs to prepare a 7-minute presentation (including 1 minute of Q/A) to receive feedback from students and the instructor. The presentations will be given in class on Nov 19 and 21.
Final project report requirement (6-10 pages)
The final project report is expected to include the following:
- Introduction and motivation
- Proposed goals and summarized accomplishments (if you work on San Francisco Crime, include your best result based on the metric used. A confirmation email from the organizer will also be needed.)
- Background and related work
- Design and Implementation
- Evaluation
- Summary and Future work
- References
- Appendix (optional)
Document Format
It is recommended that both the proposal and the final report follow these format guidelines: double-column, single-spaced, 10pt font, reasonable margins.
You are encouraged to write all of your documents using LaTeX. It is the de facto tool in which most CS papers are written. You can use the ACM SIG Proceedings Templates LaTeX style for your papers. A Microsoft Word template is also provided there.
Research Project:
Students who plan doctoral study may consider doing a research project instead of the above practice/data analysis projects. This is an individual project. In some situations, a small team (2-3 students) is allowed. You need to discuss your topic with the instructor and get permission before September 12.