Rent Prediction Project



Background & Purpose



This project was for my Data Science course at Bradley. We were tasked with using Jupyter Notebook, Pandas, NumPy, and Matplotlib to build predictive models on a dataset. We chose the UC Irvine Machine Learning Repository's 2019 "Apartment for Rent Classified" dataset for the United States as our dataframe. Our objective was to create a model that could accurately predict the price of rent in the United States based on parameters entered by the user (square footage, bathroom count, location, etc.).


This was a collaborative project with a group of four students. Along the way we used multiple models to predict on the data, which are showcased below, along with the struggles, outcomes, and suggestions for improvement. The model I worked on personally was a Random Forest Regression model, and I helped out elsewhere where I could.




Cleaning the Data



The dataset was quite verbose to begin with and needed cleaning for what we intended to do. Columns such as the street address of a property weren't needed for rent prediction and could do more harm than good: a street name in one city may be in a high-value area, whereas in another city it could be in a low-income one, skewing the model. We took our data, dropped the unwanted columns, and removed any rows with NaN values.

Before and after of our dataset.


After cleaning, we shuffled the dataset using Pandas before doing our test splits in an attempt to eliminate bias. We then noticed some outliers that skewed our model, specifically in the price column, so we decided to cut off any properties with rent above $15,000. This, on top of scaling the data, left us in a much more confident spot to begin analyzing the data using different models.
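The cleaning steps above can be sketched with Pandas roughly as follows. The column names here are assumptions for illustration, not the dataset's exact schema:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rentals dataframe; column names are assumptions.
df = pd.DataFrame({
    "address": ["12 Main St", "9 Oak Ave", "3 Elm Rd", "77 Pine Ln"],
    "price": [950.0, 1800.0, np.nan, 20000.0],
    "square_feet": [600, 900, 750, 4000],
})

# Drop columns that could mislead the model, then any rows with NaN values.
df = df.drop(columns=["address"]).dropna()

# Cut off outliers: keep only listings with rent at or below $15,000.
df = df[df["price"] <= 15000]

# Shuffle with Pandas before splitting to reduce ordering bias.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```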




My Random Forest Regression Model



Building out the RFR model was interesting, as it forced me to determine which parameters were best. To do this, I plotted the mean absolute error (MAE) against the number of estimators at a depth of 10, and ran that for different test-size splits to determine which parameters I'd use going forward. I ran this on the 10k version of the dataset to improve compute time. After leaving it to run overnight, I had this graph:

MAE vs Number of Estimators for different test size splits on 10k dataset.

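A minimal sketch of that parameter sweep, using synthetic stand-in data in place of the 10k dataset and far smaller estimator counts so it runs quickly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real sweep used the 10k dataset.
rng = np.random.default_rng(0)
X = rng.uniform(300, 3000, size=(500, 2))
y = 0.8 * X[:, 0] + 200 * X[:, 1] + rng.normal(0, 50, 500)

# Grid over test-size splits and estimator counts at a fixed depth of 10.
results = {}
for test_size in (0.25, 0.35, 0.45):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42)
    for n_estimators in (50, 100, 200):
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=10, random_state=42)
        model.fit(X_tr, y_tr)
        mae = mean_absolute_error(y_te, model.predict(X_te))
        results[(test_size, n_estimators)] = mae

# The (split, estimators) pair with the lowest MAE wins.
best_split, best_n = min(results, key=results.get)
```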

This was conclusive in determining that 0.35 was the best ratio for splitting my data. I noticed that 500 estimators gave the lowest MAE, which made me think increasing n_estimators further would improve it. This was futile: it had reached a point of diminishing returns, improving the MAE by only fractions of a percent, making it not worth the compute time whatsoever. I then applied these findings to the 100k version of the dataset, which showed an improvement in the MAE, which I attribute to the RFR having an order of magnitude more datapoints to work from.

MAE vs Number of Estimators at 0.35 test split on 100k dataset.


As before, compute time was an issue for the scope of the project, and anything past 400 estimators was too slow to work with. For this reason, I chose 400 as my number of estimators going forward. I ran the model and was pretty pleased with the result: an MAE of 164.92, which is higher than I would've hoped for, but an R2 score of 0.89 and an overall accuracy of 89%. I've plotted the best-fit line for my results against the accuracy baseline below.

Result of running the RFR on the 100k dataset.

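The final fit can be sketched like this, with synthetic data again standing in for the 100k dataset (the 400-estimator, 0.35-split values are the ones chosen above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 100k dataset.
rng = np.random.default_rng(1)
X = rng.uniform(300, 3000, size=(800, 3))
y = 1.1 * X[:, 0] + 150 * X[:, 1] + 40 * X[:, 2] + rng.normal(0, 80, 800)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.35, random_state=42)

# Chosen parameters: 400 trees, depth 10, 0.35 test split.
rfr = RandomForestRegressor(n_estimators=400, max_depth=10, random_state=42)
rfr.fit(X_tr, y_tr)
pred = rfr.predict(X_te)

# MAE and R2 are the two metrics reported for this model.
mae = mean_absolute_error(y_te, pred)
r2 = r2_score(y_te, pred)
```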

I then applied Sklearn's feature importance metrics to my model, which produced the graph of the top 10 features shown below. In hindsight, it's easy to see that having longitude and latitude, as well as the states themselves, as separate features could be causing some bias in prediction. If I were to do this again, I would go state-by-state, as state legislation can result in pricing differences, perhaps even breaking it down to per-county within each state.

Top 10 Features of Importance.

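Extracting those importances is a one-liner on a fitted model; a sketch with made-up feature names and synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up feature names and synthetic data for illustration only.
rng = np.random.default_rng(2)
feature_names = ["square_feet", "bathrooms", "bedrooms", "latitude", "longitude"]
X = rng.normal(size=(300, len(feature_names)))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, 300)

rfr = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Rank the features by importance and keep the top 10.
importances = pd.Series(rfr.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
```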



The Other Models



The other members of the group worked on various other model implementations on this data in an effort to obtain a higher accuracy. A linear regression model was created using Sklearn, but the best accuracy with this model was 50%. Although linear regression is sometimes used by economists for housing market predictions, I believe this particular dataset was not the best fit for the model. The specific parameters of this model are unknown.

Linear Regression model on the 100k dataset. Unknown parameters.

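Since the parameters of that model weren't recorded, here is only a generic sketch of how a linear regression baseline could be scored with Sklearn, on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the actual dataset or parameters.
rng = np.random.default_rng(3)
X = rng.uniform(300, 3000, size=(400, 2))
y = 0.9 * X[:, 0] + 150 * X[:, 1] + rng.normal(0, 400, 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.35, random_state=42)
lr = LinearRegression().fit(X_tr, y_tr)

# score() returns R2, the "accuracy" figure quoted for these models.
accuracy = lr.score(X_te, y_te)
```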

We then applied a Gradient Boosting Regressor to the data, achieving an overall accuracy of 81.65%, lower than the 89% from the RFR. This model used n_estimators of 100 and a max_depth of 3, which could have been improved upon.

Gradient Boosting Regressor on the 100k dataset.

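For reference, n_estimators=100 and max_depth=3 are Sklearn's defaults for this model, which may be why they were left untuned. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the actual 100k dataset.
rng = np.random.default_rng(4)
X = rng.uniform(300, 3000, size=(400, 2))
y = 0.9 * X[:, 0] + 180 * X[:, 1] + rng.normal(0, 60, 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.35, random_state=42)

# n_estimators=100 and max_depth=3 match scikit-learn's defaults.
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
gbr.fit(X_tr, y_tr)
score = gbr.score(X_te, y_te)  # R2 on the held-out split
```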

KNN regression was our next model, which led us to run a principal component analysis (PCA) on our numeric values to lower the dimensionality of the data. We then tested which k gave the best performance for our model, which we found to be k=2. Running this on the 100k dataset gave an MAE of 264.76 with an overall accuracy of 79.41%.

Principal component analysis on our numerical datapoints.
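A sketch of that pipeline (scale, PCA, then a search over k), assuming synthetic data and a k range of 1-10:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the actual numeric columns.
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(400, 6))
y = 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 30, 400)

# Scale the numeric values, then reduce dimensionality with PCA.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=3).fit_transform(X_scaled)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_pca, y, test_size=0.35, random_state=42)

# Search for the best k by held-out MAE.
maes = {k: mean_absolute_error(
            y_te,
            KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr).predict(X_te))
        for k in range(1, 11)}
best_k = min(maes, key=maes.get)
```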

We then took our KNN model and converted it into a KNN classifier. This required us to create "buckets" for the model to predict: essentially, which price range does the model think an apartment should fall into based on its datapoints? First, we needed to define those ranges. We went with < $1100, $1100-$1599, and >= $1600, which we felt was a good spread for the average apartment cost.

Confusion matrix of our KNN classifier.
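Bucketing and classifying can be sketched with pd.cut and Sklearn's KNeighborsClassifier; the single proxy feature below is an assumption standing in for the real datapoints:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
price = rng.uniform(400, 3000, size=400)
# One noisy feature correlated with price stands in for the real features.
X = (price / 2.5 + rng.normal(0, 40, 400)).reshape(-1, 1)

# Bucket prices into the three ranges described above.
labels = pd.cut(price, bins=[0, 1100, 1600, np.inf],
                labels=["< $1100", "$1100-$1599", ">= $1600"]).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.35, random_state=42)
knn = KNeighborsClassifier(n_neighbors=2).fit(X_tr, y_tr)

# A 3x3 confusion matrix shows which buckets get confused for which.
cm = confusion_matrix(y_te, knn.predict(X_te))
```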



Conclusion



Ultimately, the RFR model had the best overall accuracy. I believe that if the doubled-up location data were accounted for, this result could be vastly different. Here are all of the models' performances listed head-to-head:

List of all the models used and their performance metrics.

Working with the team and collaborating to solve problems was a rewarding experience. In addition to strengthening my communication and teamwork skills, I also had the opportunity to deepen my understanding of data analysis by learning how to use Pandas and NumPy. I am proud of the work done by the team, and am grateful for the learning experience.