Traditional Statistical Learning: Classification of Self-Assessed Financial Health Status

Disclaimer

This project is an improvement on the final project for the upper-year Statistics course “STAT441: Statistical Learning - Classification” at the University of Waterloo, by Bolun Cui and Joe Liang.

Video

A video explanation of this project can be found here. (Note: the video was made for the original university course project, so some parts of it may not match the current files.)

Background

The dataset in this project corresponds to responses to the German General Social Survey (ALLBUS) between 2005 and 2019. The target variable for machine learning is the last variable, “health”. It is an ordinal variable with five categories from 1 to 5 and represents the self-assessed financial health of each survey respondent.

The dataset comes in two parts, “train.csv” and “test.csv”. The samples in “train.csv” include the “health” variable and are used for model training; “test.csv” does not include the “health” variable. The goal of this project is to train a classification model on “train.csv” and classify the survey responses in “test.csv” into one of the five financial health categories.
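A minimal sketch of loading the two parts of the dataset in R (assuming both files sit in the working directory):

```r
# Load the two parts of the dataset; "health" is present only in train.csv
train <- read.csv("train.csv")
test  <- read.csv("test.csv")

# The target is ordinal with five categories (1 to 5), so treat it as
# an ordered factor rather than a numeric value
train$health <- factor(train$health, levels = 1:5, ordered = TRUE)
```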

Highlights

Complete documentation can be found in the “Supervised Learning code.rmd” file.

Exploratory Data Analysis

  • Outlier analysis


  • Target variable distribution analysis and normalization


Feature Engineering

During the analysis, based on our domain knowledge, we derived a new x-variable: the average living space (m²) per person in the household.
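A sketch of how such a derived feature can be computed; the column names `living_space` and `household_size` are placeholders for the corresponding ALLBUS variables, not the actual names in the dataset:

```r
# Derived feature: average living space (m²) per person in the household.
# pmax(..., 1) guards against division by zero for empty/missing household sizes.
train$space_per_person <- train$living_space / pmax(train$household_size, 1)
test$space_per_person  <- test$living_space  / pmax(test$household_size, 1)
```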

Random Forest

  • Out-of-bag (OOB) error tuning for the number of variables sampled at each split and the number of trees

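One common way to carry out this tuning with the randomForest package is `tuneRF`, which searches over the number of variables tried at each split using the OOB error estimate (a sketch, not the exact tuning code from the report):

```r
library(randomForest)

# Tune mtry (variables sampled per split) by OOB error: tuneRF scales mtry
# up and down by stepFactor and keeps candidates that improve OOB error.
set.seed(441)
tuned <- tuneRF(x = train[, setdiff(names(train), "health")],
                y = train$health,
                ntreeTry  = 500,    # number of trees used for each trial
                stepFactor = 1.5,
                improve    = 0.01,
                trace = TRUE, plot = TRUE)
```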

  • Variable importance from OOB samples (randomly permute each variable and measure the decrease in accuracy)

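This permutation-based importance measure is available directly from the randomForest package when the model is fitted with `importance = TRUE` (a sketch under that assumption):

```r
library(randomForest)

# Permutation importance: each variable is randomly permuted in the OOB
# samples and the resulting mean decrease in accuracy is recorded.
rf <- randomForest(health ~ ., data = train, ntree = 500, importance = TRUE)

importance(rf, type = 1)  # type = 1: mean decrease in accuracy
varImpPlot(rf, type = 1)  # dot plot of the same measure
```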

  • Tracing model performance across different numbers of trees


Neural Network

  • Pipeline implementation of a neural network with two hidden layers

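In the keras R interface, such a network is typically assembled with the pipe-style API; the layer widths below are illustrative, not the tuned values, and `x_train` is assumed to be the dummy-encoded feature matrix:

```r
library(keras)

# Two hidden layers, softmax output over the five health categories
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = ncol(x_train)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 5, activation = "softmax")

model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
```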

  • Tuning the number of epochs (training iterations) to balance the bias-variance tradeoff

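Epoch tuning of this kind can be sketched by fitting with a validation split and inspecting where validation loss bottoms out (`x_train`/`y_train` are assumed to be the prepared feature matrix and one-hot targets):

```r
# Track training vs. validation loss per epoch; the epoch where validation
# loss stops improving marks the bias-variance sweet spot.
history <- model %>% fit(x_train, y_train,
                         epochs = 100, batch_size = 32,
                         validation_split = 0.2)

plot(history)  # loss/accuracy curves across epochs
```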

  • Tuning the number of nodes and layers using validation cross-entropy

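A self-contained sketch of node-count tuning scored by validation cross-entropy; the candidate widths are illustrative, and `x_train`/`y_train` are again assumed to be prepared features and one-hot targets:

```r
library(keras)

# Fit one network per candidate width and record its best validation loss
# (categorical cross-entropy); the smallest value picks the width.
val_loss <- sapply(c(16, 32, 64, 128), function(width) {
  m <- keras_model_sequential() %>%
    layer_dense(units = width, activation = "relu",
                input_shape = ncol(x_train)) %>%
    layer_dense(units = width, activation = "relu") %>%
    layer_dense(units = 5, activation = "softmax")
  m %>% compile(optimizer = "adam", loss = "categorical_crossentropy")
  h <- m %>% fit(x_train, y_train, epochs = 30,
                 validation_split = 0.2, verbose = 0)
  min(h$metrics$val_loss)
})
```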

Performance of the Model


Environment

This project uses R with R Markdown for better visualization. Please visit the official websites for documentation and installation of R and R Markdown. RStudio is recommended for opening the .rmd file.

The packages required to execute the code in the .rmd file are listed below and can be installed from CRAN using

install.packages("package_name")

in R or RStudio.

  • randomForest: a comprehensive package for training random forest models
  • caret: a machine learning framework with many integrated features, such as cross-validation
  • fastDummies: a package for converting categorical variables into indicator (dummy) variables
  • keras: a comprehensive neural network package built on TensorFlow (a TensorFlow installation is required)
  • gbm: generalized boosted models
  • nnet: multinomial logistic regression
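For convenience, all of the packages above can be installed in one call; `keras::install_keras()` additionally sets up the required TensorFlow backend:

```r
# Install every required package from CRAN in one call
install.packages(c("randomForest", "caret", "fastDummies",
                   "keras", "gbm", "nnet"))

keras::install_keras()  # installs the TensorFlow backend for keras
```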
Joe (Jiazhou) Liang
Data Scientist | Master's Student @ University of Toronto

My research interests include temporal clustering algorithms and their applications to real-world problems.