Machine learning is like a drug. The euphoria one gets from building a model that can make predictions with an accuracy level over 90% is overwhelming. I have been working a lot on machine learning over the last two years both for work and as a hobby. My greatest source of knowledge has been blogs and posts from data scientists who have posted their knowledge and code. I have taken a lot from the community and feel like it’s time to give something back.
Recently, I came across a detailed dataset on crimes committed in New South Wales, published by the Bureau of Crime Statistics and Research. The dataset records the number and types of crimes committed in the various Local Government Areas (LGAs) in NSW.
I split the data into a training set and a test set. The training set contains data from 1995–2015 and the test set has data from 2015 onwards. I fitted the model on the training data and tested its accuracy by comparing its predictions with the test data.
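A time-based split like this can be sketched in a few lines. This is a Python illustration rather than the R code behind this post; the function name `split_by_cutoff`, the 1 January 2015 cutoff and the series values are all assumptions made up for the example.

```python
from datetime import date

def split_by_cutoff(series, cutoff):
    """Split (date, value) pairs into training (before cutoff) and test (on or after)."""
    train = [(d, v) for d, v in series if d < cutoff]
    test = [(d, v) for d, v in series if d >= cutoff]
    return train, test

# Toy monthly series with made-up values (not the real NSW figures).
series = [(date(2014, m, 1), 50000 + 100 * m) for m in range(1, 13)]
series += [(date(2015, m, 1), 51000 + 100 * m) for m in range(1, 13)]

# Assumed cutoff: everything before 2015 trains the model, the rest tests it.
train, test = split_by_cutoff(series, date(2015, 1, 1))
print(len(train), "training months,", len(test), "test months")
```

Splitting on a date rather than at random matters for time series: the model must never see observations that come after the period it is asked to predict.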
My goal for this exercise was to see if I could predict the overall number of crimes committed in NSW from 2015 to 2017. I would simply be using past trends to predict the future.
Firstly, I removed the Offence Category and Subcategory columns and summarised everything to state level by combining the data for all LGAs.
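The roll-up to state level amounts to a group-by-and-sum over the time column. Below is a minimal Python sketch of that idea, not the R code used for this post; the field names (`lga`, `offence`, `month`, `count`) and all the numbers are hypothetical and do not reflect the real layout of the dataset.

```python
from collections import defaultdict

def aggregate_to_state(rows):
    """Sum counts across all LGAs and offence types for each month."""
    totals = defaultdict(int)
    for row in rows:
        # Offence category and LGA are deliberately ignored here.
        totals[row["month"]] += row["count"]
    return dict(sorted(totals.items()))

# Made-up rows in a hypothetical long format.
rows = [
    {"lga": "LGA A", "offence": "Theft", "month": "2015-01", "count": 120},
    {"lga": "LGA A", "offence": "Assault", "month": "2015-01", "count": 30},
    {"lga": "LGA B", "offence": "Theft", "month": "2015-01", "count": 80},
    {"lga": "LGA A", "offence": "Theft", "month": "2015-02", "count": 110},
    {"lga": "LGA B", "offence": "Theft", "month": "2015-02", "count": 90},
]

state_totals = aggregate_to_state(rows)
print(state_totals)
```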
Crime numbers look to have been fairly stable since 2005, but they are trending up. In 2015, the number of crimes committed reached a 10-year peak.
I used R with the forecast and sweep libraries for my modelling. After some exploratory analysis, I found that an ETS model (Error, Trend, Seasonal; the exponential smoothing family) gave the lowest RMSE. Its mean absolute error was 1076, meaning that the predicted values were off by around 1076 on average. Not bad considering the number of crimes committed ranged from 40,000 to 60,000.
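For readers unfamiliar with the two error measures, here is a small Python sketch of how MAE and RMSE are computed and how a model can be picked by lowest RMSE. It is an illustration only, not the R code behind this post; the candidate names `model_a` and `model_b` and every number are made up.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the misses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but large misses are penalised more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up test-set values and forecasts from two hypothetical models.
actual = [50000, 52000, 48000, 51000]
candidates = {
    "model_a": [49000, 53000, 49500, 50500],
    "model_b": [47000, 55000, 45000, 54000],
}

# Keep the candidate with the lowest RMSE on the test set.
best = min(candidates, key=lambda name: rmse(actual, candidates[name]))
print(best, mae(actual, candidates[best]), round(rmse(actual, candidates[best]), 2))
```

RMSE is never smaller than MAE on the same forecasts, and a large gap between the two hints that a few big misses dominate the error.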
Residuals = Actual Values - Predicted Values. The residuals are randomly distributed across the years, which suggests that the model is not biased in either direction.
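A quick numeric check can back up what the residual plot shows. The Python sketch below (an illustration, not the R code used here, with made-up numbers) computes the residuals, their mean, and the share that are positive. A mean close to zero relative to the size of the series, and a roughly even mix of signs, is what an unbiased model would be expected to produce.

```python
from statistics import mean

def residuals(actual, predicted):
    """Residual = actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]

# Made-up values for illustration only.
actual = [50000, 52000, 48000, 51000]
predicted = [49000, 53000, 49500, 50500]

res = residuals(actual, predicted)
mean_residual = mean(res)  # persistently positive or negative means bias
share_positive = sum(r > 0 for r in res) / len(res)

print(res, mean_residual, share_positive)
```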
The next step is to decompose the data and look at the long-term trend and any seasonality.
As mentioned earlier, crime has been fairly stable over the past 10 years. It is interesting that seasonality has remained constant over the last 20 years.
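To show what a decomposition does under the hood, here is a Python sketch of a classical additive decomposition: a centred moving average estimates the trend, and the average detrended value for each calendar month estimates the seasonal effect. This is one standard approach and not necessarily the method used for the chart in this post; the function names and the toy series are assumptions.

```python
def centered_moving_average(values, period):
    """Trend estimate: 2 x period centred moving average (period must be even)."""
    half = period // 2
    trend = [None] * len(values)  # undefined at both ends of the series
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        # End points get half weight so the window spans exactly one period.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[i] = total / period
    return trend

def seasonal_indices(values, trend, period):
    """Average detrended value per position in the cycle, centred on zero."""
    buckets = [[] for _ in range(period)]
    for i, (v, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(v - t)
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period
    return [r - offset for r in raw]

# Toy series: three years of monthly data, a linear trend plus a fixed pattern.
pattern = [300, -200, 100, -100, 0, -300, -400, -100, 100, 200, 100, 300]
values = [50000 + 10 * i + pattern[i % 12] for i in range(36)]

trend = centered_moving_average(values, 12)
seasonal = seasonal_indices(values, trend, 12)
print([round(s) for s in seasonal])
```

On this toy series the recovered seasonal effects match the pattern that was built in, which is the same sense in which constant seasonality shows up in a decomposition plot.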
Here is the comparison of predicted and actual values side by side:
I am pretty impressed with the results. I have successfully predicted 24 months ahead with an accuracy of more than 90%, with just a few lines of code and without any optimisation. This just shows the power of R as an open-source platform and the countless hours people have spent making and optimising R libraries.
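"Accuracy" for a forecast can be defined in more than one way. One common reading, assumed here purely for illustration, is 100% minus the mean absolute percentage error (MAPE). The Python sketch below uses made-up numbers and is not the R code behind this post.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up values: the forecasts miss by 10% and 5% respectively.
actual = [50000, 40000]
predicted = [45000, 42000]

error_pct = mape(actual, predicted)
accuracy_pct = 100 - error_pct  # assumed definition of "accuracy"
print(round(error_pct, 2), round(accuracy_pct, 2))
```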