Traditional Statistical Learning: Classification of Self-Assessed Financial Health Status

Disclaimer
This project is an improved version of the final project for the upper-year statistics course “STAT441: Statistical Learning - Classification” at the University of Waterloo, by Bolun Cui and Joe Liang.
Video
A video explanation of this project can be found here. (Note: the video was made for the university course project, so some parts of it may not match the current files.)
Background
The dataset in this project consists of responses to the German General Social Survey (ALLBUS) collected between 2005 and 2019. The target variable for machine learning is the last variable, “health”. It is an ordinal variable with five categories, from 1 to 5, and represents the self-assessed financial health reported in each survey response.
The dataset has two parts, “train.csv” and “test.csv”. The samples in “train.csv” include the “health” variable and are used for model training; “test.csv” does not contain the “health” variable. The goal of this project is to train a classification model on “train.csv” and use it to classify the survey responses in “test.csv” into one of the five financial health categories.
Highlights
Complete documentation can be found in the “Supervised Learning code.rmd” file.
Exploratory Data Analysis
- Outlier analysis

- Distribution analysis and normalization of the target variable
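The two checks above can be sketched in base R as follows. This is only an illustration on simulated data: the vectors `health` and `income` are stand-ins invented here, not columns confirmed from the ALLBUS files, and the 1.5 × IQR rule used by `boxplot.stats` is one common outlier criterion rather than necessarily the one used in the .rmd file.

```r
# Simulated stand-ins (assumptions): an ordinal target with values 1 to 5
# and one numeric predictor containing a few extreme values.
set.seed(1)
health <- sample(1:5, size = 500, replace = TRUE,
                 prob = c(0.10, 0.25, 0.35, 0.20, 0.10))
income <- c(rnorm(495, mean = 2000, sd = 400), 9000, 9500, 10000, 12000, 15000)

# Outlier analysis: values beyond 1.5 * IQR from the quartiles.
outliers <- boxplot.stats(income)$out

# Target distribution: absolute and relative class frequencies.
class_counts <- table(factor(health, levels = 1:5))
class_share  <- prop.table(class_counts)
print(round(class_share, 3))

# Bar chart of the class distribution.
barplot(class_counts, xlab = "health category", ylab = "number of responses")
```

A strongly unbalanced `class_share` would be a signal to consider class weights or stratified sampling when training.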

Feature Engineering
During the analysis, drawing on our domain knowledge, we derived a new predictor variable: the average living space in m² per person in the household.
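A minimal sketch of how such a feature could be derived is shown below. The column names `living_space_m2` and `household_size` are hypothetical and chosen for illustration; the actual ALLBUS variable names may differ.

```r
# Toy data with hypothetical column names (assumption, not the real schema).
households <- data.frame(
  living_space_m2 = c(80, 120, 45, 60),
  household_size  = c(2, 4, 1, 0)   # 0 stands in for an invalid entry
)

# Treat non-positive household sizes as missing to avoid division by zero.
size <- households$household_size
size[!is.na(size) & size <= 0] <- NA

# Derived feature: average living space per person.
households$space_per_person <- households$living_space_m2 / size
print(households)
```

Rows with an invalid household size end up with `NA` in the new column, so they can be imputed or dropped explicitly instead of silently producing `Inf`.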
Random Forest
- Tuning the number of variables tried at each split and the number of trees using out-of-bag (OOB) samples

- Variable importance from OOB samples (randomly permuting each variable and measuring the decrease in accuracy)

- Tracing the performance across different numbers of trees
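The three steps above can be sketched with the `randomForest` package roughly as follows. The data are simulated, and the grid of `mtry` values and the tree count of 300 are arbitrary choices for illustration, not the settings used in the project.

```r
library(randomForest)

set.seed(441)
# Simulated data (assumption): four numeric predictors, five balanced classes.
n <- 300
x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
score <- x$a + 0.5 * x$b + rnorm(n, sd = 0.5)
y <- cut(score, breaks = quantile(score, probs = 0:5 / 5),
         include.lowest = TRUE, labels = 1:5)

# 1. Tune mtry by comparing the final OOB error of each candidate.
mtry_grid <- 1:4
oob_error <- sapply(mtry_grid, function(m) {
  fit <- randomForest(x = x, y = y, mtry = m, ntree = 300)
  fit$err.rate[fit$ntree, "OOB"]
})
best_mtry <- mtry_grid[which.min(oob_error)]

# 2. Refit with the chosen mtry and request permutation importance.
rf <- randomForest(x = x, y = y, mtry = best_mtry, ntree = 300,
                   importance = TRUE)
print(importance(rf, type = 1))   # mean decrease in accuracy per variable

# 3. Trace the OOB error as trees are added.
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB error")
```

Because each tree is evaluated on the samples left out of its bootstrap draw, the OOB error gives a validation estimate without a separate hold-out set; the curve from step 3 shows where adding trees stops helping.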

Neural Network
- Pipeline implementation of a neural network with two hidden layers

- Tuning the number of epochs (training iterations) to balance the bias-variance tradeoff

- Tuning the number of nodes and layers using validation cross-entropy
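As a rough sketch of the steps above, a two-hidden-layer network can be built and tuned with the `keras` R package along these lines. Everything specific here is an assumption made for illustration: the simulated data, the layer sizes in the grid, the Adam optimizer, the 20% validation split, and the use of early stopping to choose the number of epochs. A working TensorFlow installation is required.

```r
library(keras)

set.seed(441)
# Simulated data (assumption): 10 numeric predictors, labels 1 to 5.
n <- 500
p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- sample(1:5, n, replace = TRUE)
y_onehot <- to_categorical(y - 1, num_classes = 5)  # keras expects 0-based classes

# Two hidden layers followed by a softmax output over the five categories.
build_model <- function(units1, units2) {
  model <- keras_model_sequential() %>%
    layer_dense(units = units1, activation = "relu", input_shape = p) %>%
    layer_dense(units = units2, activation = "relu") %>%
    layer_dense(units = 5, activation = "softmax")
  model %>% compile(loss = "categorical_crossentropy",
                    optimizer = "adam", metrics = "accuracy")
  model
}

# Small grid over layer sizes, scored by validation cross-entropy.
grid <- expand.grid(units1 = c(16, 32), units2 = c(8, 16))
grid$val_loss <- NA_real_
for (i in seq_len(nrow(grid))) {
  model <- build_model(grid$units1[i], grid$units2[i])
  history <- model %>% fit(
    x, y_onehot,
    epochs = 50, batch_size = 32, validation_split = 0.2, verbose = 0,
    # Stop once validation loss has not improved for 5 epochs.
    callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 5))
  )
  grid$val_loss[i] <- min(history$metrics$val_loss)
}
print(grid[which.min(grid$val_loss), ])
```

Training for too few epochs underfits (high bias), while too many epochs lets the network memorize the training set (high variance); monitoring the validation loss is one way to pick a stopping point between the two.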


Performance of the Model

Environment
This project uses R with R Markdown for better visualization. Please visit the official websites for documentation and installation of R and R Markdown. RStudio is recommended for opening the .rmd file.
The packages required to execute the code in the .rmd file are listed below and can be installed from CRAN using
install.packages("package_name")
in R or RStudio.
- randomForest: a comprehensive package for Random Forest Model training
- caret: a machine learning platform with many integrated features, such as cross-validation
- fastDummies: a package that converts categorical variables into indicator (dummy) variables
- keras: an R interface to Keras, running on TensorFlow, for neural networks (a TensorFlow installation is required)
- gbm: generalized boosted regression models (gradient boosting)
- nnet: provides multinomial logistic regression models
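To avoid reinstalling packages that are already present, a small helper such as the one below can be used. The function name `missing_packages` is introduced here for illustration and is not part of the project.

```r
# Packages listed above.
required <- c("randomForest", "caret", "fastDummies", "keras", "gbm", "nnet")

# Hypothetical helper: returns the packages that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
print(to_install)
```

Running `install.packages(to_install)` then installs only what is missing. For keras, the TensorFlow backend can afterwards be set up with `keras::install_keras()`.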