We are now over six months into our Innovate UK Machine Learning project in association with the University of Plymouth. In addition to making good progress, we’re also discovering the innovative ways that research and investment in AI and machine learning will benefit the future of the insurance industry.
During the first three months of this KTP, as the resident data scientist on the project, I worked with real internal company data and carried out an initial exploratory analysis of key variables. This involved investigating correlations between those variables, carrying out statistical tests, and cleansing and preparing the data for the development of machine learning algorithms. At the end of this first phase, an initial predictive model was established as a baseline. After testing several machine learning algorithms, we were confident it could assess risk, achieving reasonable scores on commonly used performance metrics.
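The workflow described above can be sketched in miniature. The snippet below uses synthetic data with hypothetical column names (the real internal variables are confidential), but it follows the same shape: inspect correlations, then fit a baseline classifier and score it with a common metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative only: a synthetic stand-in for internal site data,
# with hypothetical predictor names.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "building_age": rng.integers(1, 100, n),
    "floor_area": rng.normal(2000, 500, n),
    "sprinkler_coverage": rng.uniform(0, 1, n),
})
# Synthetic target loosely tied to two of the predictors.
signal = 0.02 * df["building_age"] - 2.0 * df["sprinkler_coverage"]
df["high_risk"] = (signal + rng.normal(0, 1, n) > 0).astype(int)

# Exploratory step: pairwise correlations between variables.
print(df.corr(numeric_only=True).round(2))

# Baseline predictive model, evaluated with a common metric (ROC AUC).
X, y = df.drop(columns="high_risk"), df["high_risk"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Baseline ROC AUC: {auc:.2f}")
```

The point of a baseline like this is not to be the final model, but to give later, richer models something concrete to beat.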
The very nature of the dataset we initially worked with meant that the sites that had previously been visited in person held the richest levels of data and, therefore, were not the ones in need of additional risk prediction input; the existing risk management mechanisms are already well placed to assess risk levels there. To mitigate this, we investigated using predictors from open and external data in the modelling process, with a view to providing more accurate risk prediction at those unvisited sites where available data was more sparse.
Open data for increased risk predictability
Consequently, in the next three months we focused on bringing in predictors from external data sources. We worked with the Impact Lab in Exeter, a group specialising in open source datasets, and drew on valuable data from OpenStreetMap and natural catastrophe datasets. We also made plans to start looking into non-conventional ways to bring in open data via social media platforms.
OpenStreetMap data is of high quality but suffers from missing values and gaps in less developed countries. I spent a great deal of time addressing this issue and investigating OpenStreetMap APIs, and was able to gather a viable set of predictors to help in assessing risk. A common problem with machine learning models is that, as the dimensionality of the data increases, noise in the modelling process also increases, degrading the performance of the model. This was another issue that had to be overcome.
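One common way to tame rising dimensionality is to rank candidate predictors and keep only the most informative. The sketch below is illustrative (synthetic features standing in for OpenStreetMap-derived predictors, and univariate selection as just one of several possible techniques):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 50 candidate map-derived predictors,
# of which only 5 actually carry signal about the target.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5,
    n_redundant=0, random_state=0,
)

# Univariate feature selection: keep the 10 highest-scoring
# predictors, discarding dimensions that mostly add noise.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (400, 50) -> (400, 10)
```

Trimming uninformative dimensions before modelling generally improves both training speed and generalisation.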
In addition, I was able to identify import points and areas of interruption within OpenStreetMap, which is really useful from a risk engineering perspective. I also spent some time replicating genuine case studies from a research paper featuring site data analysis from three sites in Portland, Oregon, USA. This enabled us to demonstrate a good proof of concept for the predictors we are currently using in the modelling process. Thankfully, the results so far have validated our original choice of approach.
As previously stated, it is the sites that have not been physically visited that pose the greatest challenge in terms of predicting risk. Through this research, our aim is to save a surveyor or risk engineer time by automatically enhancing the insight they can gather about a site before either visiting it or deciding whether to visit it. This will ultimately reduce the strain on resources and the overall cost of managing the risk assessment process.
What next for Software Solved Machine Learning?
Our next challenge is to introduce further historic data to help validate the risk. We will therefore be looking to use Natural Language Processing techniques such as information extraction and named entity recognition to extract useful data from the broker chain.
When an insurer is presented with a new client, the information they have available on that client is usually only what the broker passes on to them. Once a client has been insured by the same insurer for a number of years, however, a body of data accumulates from which well-structured information can be extracted and reviewed using these methods.
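As a simplified illustration of the information extraction idea, the snippet below pulls structured fields out of free-text broker correspondence using hand-written patterns. A production system would use a trained named entity recognition model rather than regular expressions, and the field names and sample text here are hypothetical, but the principle is the same: turn unstructured text into reviewable data.

```python
import re

# Rule-based stand-in for NER: patterns for a few fields one might
# want to extract from broker correspondence (illustrative only).
PATTERNS = {
    "sum_insured": re.compile(r"£[\d,]+(?:\.\d{2})?"),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def extract_entities(text):
    """Return every pattern match found in a piece of broker text."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

note = ("Renewal due 01/04/2024 for the warehouse at EX1 1AB; "
        "sum insured £1,250,000.")
print(extract_entities(note))
# {'sum_insured': ['£1,250,000'], 'postcode': ['EX1 1AB'], 'date': ['01/04/2024']}
```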
The next phase of the project will also look at using social media platforms to further enrich the data for risk engineering and modelling. By running a search on Twitter, for example, for specific hashtags such as #crime against a set of co-ordinates, you can pick up valuable information on trends and on positive or negative sentiment at a given site or area. We are also looking into mining data from Google Trends to assess whether search data yields valuable insight into risk factors.
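A toy version of the sentiment idea is sketched below. In practice the posts would come from a platform API filtered by hashtag and co-ordinates, and the sentiment model would be far richer than a word list; the lexicon and example posts here are purely illustrative.

```python
import re

# Tiny illustrative sentiment lexicon (not a real model).
POSITIVE = {"safe", "improved", "great", "friendly"}
NEGATIVE = {"crime", "theft", "vandalism", "unsafe"}

def sentiment_score(text):
    """Crude trend signal: positive minus negative word count."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical posts matching a hashtag near a given location.
posts = [
    "#crime reported again near the depot, feels unsafe",
    "Area much improved, feels safe and friendly now",
]
for p in posts:
    print(sentiment_score(p), p)  # -2 for the first, +3 for the second
```

Aggregating such scores over time for an area gives a rough sentiment trend that could sit alongside the other risk predictors.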
Insurers: want to get involved?
The further we progress into this project, the more we appreciate that there is a wealth of opportunities for us to explore, given the amount of valuable data out there. Our aim is to get more insurers involved so we can draw on other data sources and different examples to hone the model further. Being able to call on wider datasets and access better predictive risk scoring techniques will vastly improve client onboarding for insurance companies.