Traditional Statistical Learning: Classification of Self-Assessed Financial Health Status

Disclaimer

This project is an improvement on the final project for the upper-year Statistics course “STAT441: Statistical Learning - Classification” at the University of Waterloo, by Bolun Cui and Joe Liang.

Video

A video explanation of this project can be found here. (Note: the video was made for the original university course project, so some parts of it may not match the current files.)

Background

The dataset in this project corresponds to responses to the German General Social Survey (ALLBUS) between 2005 and 2019. The target variable for machine learning is the last variable, “health”. It is an ordinal variable with five categories from 1 to 5 and represents the self-assessed financial health of each survey respondent.

The dataset comes in two parts, “train.csv” and “test.csv”. The samples in “train.csv” include the “health” variable and are used for model training; “test.csv” does not include the “health” variable. The goal of this project is to train a classification model on “train.csv” and classify the survey responses in “test.csv” into one of the five financial health categories.
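A minimal sketch of loading the two parts of the dataset in R (assuming both files sit in the working directory):

```r
# Load the two parts of the dataset; "health" is present only in train.csv
train <- read.csv("train.csv")
test  <- read.csv("test.csv")

# The target is ordinal with five categories (1 to 5), so treat it as
# an ordered factor rather than a numeric value
train$health <- factor(train$health, levels = 1:5, ordered = TRUE)
```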

Highlights

Complete documentation can be found in the “Supervised Learning code.rmd” file.

Exploratory Data Analysis

  • Outlier analysis


  • Target variable distribution analysis and normalization


Feature Engineering

During the analysis, based on our domain knowledge, we derived a new x-variable: the average living space (m²) per person in the household.
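A sketch of how such a derived feature can be computed; the column names `living_space` and `household_size` are placeholders for the corresponding ALLBUS variables, not the actual names in the dataset:

```r
# Derived feature: average living space (m²) per person in the household.
# pmax(..., 1) guards against division by zero for empty/missing household sizes.
train$space_per_person <- train$living_space / pmax(train$household_size, 1)
test$space_per_person  <- test$living_space  / pmax(test$household_size, 1)
```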

Random Forest

  • Out-of-bag (OOB) error tuning for the number of variables sampled at each split and the number of trees

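One common way to carry out this tuning with the randomForest package is `tuneRF`, which searches over the number of variables tried at each split using the OOB error estimate (a sketch, not the exact tuning code from the report):

```r
library(randomForest)

# Tune mtry (variables sampled per split) by OOB error: tuneRF scales mtry
# up and down by stepFactor and keeps candidates that improve OOB error.
set.seed(441)
tuned <- tuneRF(x = train[, setdiff(names(train), "health")],
                y = train$health,
                ntreeTry  = 500,    # number of trees used for each trial
                stepFactor = 1.5,
                improve    = 0.01,
                trace = TRUE, plot = TRUE)
```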

  • Variable importance from OOB samples (randomly permute each variable and measure the decrease in accuracy)

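This permutation-based importance measure is available directly from the randomForest package when the model is fitted with `importance = TRUE` (a sketch under that assumption):

```r
library(randomForest)

# Permutation importance: each variable is randomly permuted in the OOB
# samples and the resulting mean decrease in accuracy is recorded.
rf <- randomForest(health ~ ., data = train, ntree = 500, importance = TRUE)

importance(rf, type = 1)  # type = 1: mean decrease in accuracy
varImpPlot(rf, type = 1)  # dot plot of the same measure
```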

  • Tracing model performance across different numbers of trees


Neural Network

  • Pipeline implementation of a neural network with two hidden layers

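In the keras R interface, such a network is typically assembled with the pipe-style API; the layer widths below are illustrative, not the tuned values, and `x_train` is assumed to be the dummy-encoded feature matrix:

```r
library(keras)

# Two hidden layers, softmax output over the five health categories
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = ncol(x_train)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 5, activation = "softmax")

model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
```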

  • Tuning the number of epochs (training iterations) to balance the bias-variance tradeoff

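Epoch tuning of this kind can be sketched by fitting with a validation split and inspecting where validation loss bottoms out (`x_train`/`y_train` are assumed to be the prepared feature matrix and one-hot targets):

```r
# Track training vs. validation loss per epoch; the epoch where validation
# loss stops improving marks the bias-variance sweet spot.
history <- model %>% fit(x_train, y_train,
                         epochs = 100, batch_size = 32,
                         validation_split = 0.2)

plot(history)  # loss/accuracy curves across epochs
```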

  • Tuning the number of nodes and layers using validation cross-entropy

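A self-contained sketch of node-count tuning scored by validation cross-entropy; the candidate widths are illustrative, and `x_train`/`y_train` are again assumed to be prepared features and one-hot targets:

```r
library(keras)

# Fit one network per candidate width and record its best validation loss
# (categorical cross-entropy); the smallest value picks the width.
val_loss <- sapply(c(16, 32, 64, 128), function(width) {
  m <- keras_model_sequential() %>%
    layer_dense(units = width, activation = "relu",
                input_shape = ncol(x_train)) %>%
    layer_dense(units = width, activation = "relu") %>%
    layer_dense(units = 5, activation = "softmax")
  m %>% compile(optimizer = "adam", loss = "categorical_crossentropy")
  h <- m %>% fit(x_train, y_train, epochs = 30,
                 validation_split = 0.2, verbose = 0)
  min(h$metrics$val_loss)
})
```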

Performance of the Model


Environment

This project uses R with R Markdown for better visualization. Please visit the official websites for documentation and installation of R and R Markdown. RStudio is recommended for opening the .rmd file.

The packages required to execute the code in the .rmd file are listed below and can be installed from CRAN using

install.packages("package_name")

in R or RStudio.

  • randomForest: a comprehensive package for training random forest models
  • caret: a machine learning framework with many integrated features, such as cross-validation
  • fastDummies: a package for converting categorical variables into indicator (dummy) variables
  • keras: a comprehensive neural network package built on TensorFlow (a TensorFlow installation is required)
  • gbm: generalized boosted models
  • nnet: multinomial logistic regression
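For convenience, all of the packages above can be installed in one call; `keras::install_keras()` additionally sets up the required TensorFlow backend:

```r
# Install every required package from CRAN in one call
install.packages(c("randomForest", "caret", "fastDummies",
                   "keras", "gbm", "nnet"))

keras::install_keras()  # installs the TensorFlow backend for keras
```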
Joe (Jiazhou) Liang
Data Scientist | Master's Student @ University of Toronto

My research interests include temporal clustering algorithms and their applications to real-world problems.