Data Science Group Project

1. Project Overview

This course does not have a final exam. Instead, students will complete a team-based data science project. Students should form groups of 2–4 members. Working alone is allowed but not recommended.

  • The project applies the full pipeline covered in this course:
    • data acquisition
    • data cleaning
    • exploratory analysis & statistical reasoning
    • modeling
    • visualization
    • report & presentation
  • Each team will:

    • Select one dataset from the approved list

    • Formulate their own data-driven problem: Define the target variables and the type of problem (e.g., regression, classification, clustering)

    • Apply required analytical components described below

    • Present results in a final report & presentation

​ Each team creates one shared GitHub repository and add instructor as a collaborator.

2. Required Project Components (All Teams Must Complete)

(1) Reproducible Data Science Workflow (15%)

Your project must be reproducible.

Required:

  • Public or private GitHub repository (invite instructor as collaborator)

  • Clear folder structure, e.g.:

    data/ # put your dataset here
    notebooks/ # you can use a notebook for communication
    src/  # put your code here
    figures/ # put figure here
    report/ # put your final report here
    
  • README file must include:

    • Team name & team members (full names, emails, and GitHub usernames)
    • Project title
    • Dataset source (website link)
    • Brief project description – What problems are planned to be studied, including which aspects.

Version control usage will be evaluated (GitHub commit history and individual contribution traces will be considered during grading).

(2) Data Cleaning & Exploratory Data Analysis (EDA) (20%)

You must demonstrate:

  • Identification and handling of missing values
  • Detection of outliers or anomalies
  • Use of:
    • Group-by / aggregation
    • Descriptive statistics

Simply dropping data without justification is not acceptable.

(3) Statistical Analysis & Hypothesis Testing (20%)

You must perform at least one formal statistical analysis, such as:

  • t-test
  • ANOVA
  • Chi-square test
  • Correlation tests for numerical features

For each test:

  • State the null hypothesis (default assumption)
  • Explain why the test is appropriate
  • Report test statistic and p-value
  • Interpret the result in context

(4) Data Modeling (20%)

You must implement:

  • One baseline model: Logistic Regression, Decision Tree, Linear Regression;
  • One more advanced model: Random Forest, XGBoost, LightGBM, Neural Networks;

Deep learning models are not required. For time-series datasets: LSTM, GRU, Prophet can be used.

(5) Visualization (15%)

Your project must include at least five figures, covering:

  • Data exploration
  • Statistical analysis
  • Model results
  • Final insights

At least one figure should be:

  • Multi-panel, or
  • Interactive (e.g., Plotly, Altair)

Figures must be readable, labeled, and interpreted.

(6) Final Report & Presentation (10%)

  • Written report (Jupyter notebook is required): No length requirement.
  • Clear structure and logical flow
  • Final presentation during Weeks 15–16, each group will deliver an in-class presentation of approximately 15–20 minutes.

3. Academic Integrity and Use of AI Tools

The use of AI tools is allowed when used responsibly for learning purposes such as understanding concepts, debugging code or understanding error messages. However, all submitted work must reflect the students’ own understanding. Students are expected to write and modify code by hand, make independent analytical decisions, and fully understand the methods and results they present. During project presentations, students may be asked to explain any part of their report, code, or analysis; inability to clearly explain submitted work will result in grade penalties, including partial credit loss or individual score adjustments.

4. Datasets

  • Fast Food Consumption & Health Impact Dataset

    https://www.kaggle.com/datasets/prince7489/fast-food-consumption-and-health-impact-dataset

  • Career Path Recommendations Dataset

    https://www.kaggle.com/datasets/ahsanneural/career-path-recommendations-dataset?resource=download

  • Breast Cancer Dataset

    https://www.kaggle.com/datasets/neurocipher/breast-cancer-dataset

  • EuroMillions Historical Data

    https://www.kaggle.com/datasets/duartepereiradacruz/euromillions-historical-data

  • Delivery Logistics Dataset

    https://www.kaggle.com/datasets/muhammadahmaddaar/delivery-logistics-dataset-india-multi-partner

  • Synthetic Diabetes Prediction Dataset

  • https://www.kaggle.com/datasets/miadul/synthetic-diabetes-prediction-dataset

  • Student Placement Dataset

    https://www.kaggle.com/datasets/sonalshinde123/student-placement-dataset

  • Hospital Readmission Risk dataset

    https://www.kaggle.com/datasets/miadul/hospital-readmission-risk-dataset

  • Dirty Iranian Transactions Dataset

    https://www.kaggle.com/datasets/hosseinbadrnezhad/dirty-iranian-transactions-dataset

  • Customer Churn Dataset

    https://www.kaggle.com/datasets/sonalshinde123/customer-churn-prediction-dataset

  • Crop Yield Prediction

    https://www.kaggle.com/datasets/miadul/smart-crop-recommendation-dataset

  • COVID-19 Patient Symptoms & Diagnosis Dataset

    https://www.kaggle.com/datasets/miadul/covid-19-patient-symptoms-and-diagnosis-dataset

  • Fertilizer Recommendation Dataset

    https://www.kaggle.com/datasets/miadul/fertilizer-recommendation-dataset

  • Synthetic Credit Risk Dataset

    https://www.kaggle.com/datasets/emirhanakku/synthetic-credit-risk-dataset-extreme-imbalance

You can download all datasets here.