Data & model quality: What we learned from Yields.io

Posted on 9-apr-2021 14:16:17

At our first DataOps Ghent Meetup of 2021 we had the opportunity to learn more about DataOps in FinTech. The second speaker of the night was Jos Gheerardyn, co-founder and CEO of Yields.io, creators of Chiron, an AI Platform for model risk management that uses AI for real-time model testing and validation on an enterprise-wide scale.

Yields.io is determined to instil trust in the planet's most impactful algorithms and make tools that help users and developers of models ensure their models live up to their standards.

“Data is today's new oil, it’s the oil of the industry. It’s what many industries are consuming to generate value.” - Jos Gheerardyn

We can compare data to oil in the 19th century: its value and potential are apparent, but it requires processing to create value. Those who invest their time in learning to extract, refine, and utilize data will see a great ROTI (Return On Time Invested), just as the oil industry did.


How Data drives value

Data is used in many ways. A primary use case is displaying data to reveal obvious patterns: the visualisation of the data leads to insights. Typically, this is where BI tools and dashboards come in.

On the other hand, new, non-obvious patterns can also be discovered in data. For this, different types of mathematical algorithms are used.

Issues in data

 “Many algorithms are used on data that isn’t correct and might contain issues”

Let’s look at two situations where this is the case and what effect it can have:

Incomplete data


A good algorithm needs the full picture, or in this case all the relevant (Roomba) data. 
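A minimal sketch of what guarding against incomplete data can look like in practice. The field names, the toy records, and the 5% threshold below are illustrative assumptions, not part of the talk:

```python
# Minimal completeness check: refuse to train when too much data is missing.
# Field names and the 5% threshold are illustrative assumptions.

def completeness_report(rows, required_fields, max_missing_ratio=0.05):
    """Return per-field missing ratios and whether the dataset passes."""
    counts = {f: 0 for f in required_fields}
    for row in rows:
        for f in required_fields:
            if row.get(f) is None:
                counts[f] += 1
    ratios = {f: counts[f] / len(rows) for f in required_fields}
    passed = all(r <= max_missing_ratio for r in ratios.values())
    return ratios, passed

rooms = [
    {"room": "kitchen", "area_m2": 12.0},
    {"room": "hall", "area_m2": None},   # missing measurement
    {"room": "bedroom", "area_m2": 14.5},
]
ratios, ok = completeness_report(rooms, ["room", "area_m2"])
print(ratios, ok)  # area_m2 is ~33% missing, so the check fails
```

Running a check like this before training makes the "incomplete picture" problem visible up front instead of surfacing as silently degraded model output.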

Anomalies

An example of the impact of data quality issues on the performance of mathematical models is the simulated Melbourne building in Microsoft Flight Simulator. The simulator generates 3D maps from 2D map data using machine learning, and during this process the height of one building was entered incorrectly. As you can see in the image below, the building ended up towering over the landscape and defying physics. It might have broken immersion for some, but it proved the importance of data quality checks for everyone.
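A check that could have caught an error like this can be sketched as a robust outlier test. The heights and the threshold below are invented for illustration; this is one common technique (a modified z-score based on the median absolute deviation), not the method any particular team used:

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Uses the median absolute deviation (MAD), which stays robust even
    when the outliers themselves would skew a mean/stdev-based test.
    Assumes MAD > 0 (i.e. the values are not mostly identical).
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # 1.4826 scales MAD to be comparable with a standard deviation.
    return [v for v in values if abs(v - med) / (1.4826 * mad) > threshold]

# Illustrative building heights in metres; one entry is wildly wrong.
heights = [12, 15, 9, 22, 18, 11, 14, 16, 10, 13, 700]
print(flag_outliers(heights))  # → [700]
```

A plain z-score test can miss a single extreme value in a small dataset, because that value inflates the standard deviation it is measured against; the median-based version avoids this.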

These are just two of a long list of issues with algorithms caused by data. It’s important to realise that there’s quite a bit of risk related to using models on just about any dataset.


Meaning of a model

Model risk has already been actively managed in the financial sector for quite some time. The reasons: the sector produces massive amounts of data, it has historically been a very intensive user of mathematical models, and plenty of accidents have happened in the past; for example, banks have lost billions of dollars because of mistakes in models.

Definition of a model, as published by the FED (the Federal Reserve):

“The term model refers to a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates.” (FED, 2011)

Putting it simply: the model is what’s between the input data and the end results, whether that’s a desired output, behaviour, estimate, or prediction.

If you’re comparing data to oil, the model is the machine powered by it, producing your desired results. And just like any mechanical engineer will tell you, such machines need to be managed correctly.

A mismanaged model, just like a mismanaged machine, carries risks that endanger the quality of your end product. Which leads us to model risk.

Definition of model risk, as published by the FED (the Federal Reserve):

“The potential for adverse consequences from decisions based on incorrect or misused model outputs and reports. Model risk can lead to financial loss, poor business and strategic decision-making, or damage to a banking organization’s reputation.” (FED, 2011)

In summary: model risk is an assessment of what can go wrong as data moves into, through, and out of the model.


Issues with models

There are roughly two types of issues with models. First, there are the simple ones, such as bugs and software issues. Second, there are the more subtle and more frequently encountered ones, such as models applied in contexts for which they were not developed.

An example of the latter is building a model to predict tomorrow's temperature. To make this happen you need massive amounts of climate data: you train the model on historical climate data and then apply it to today's climate. It probably won't work as expected, because the climate has changed since the reference period.
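This effect, often called distribution drift, can be shown with a toy numerical sketch. The temperatures, the naive "predict the historical mean" model, and the 2°C warming offset are all invented for illustration:

```python
import statistics

# Hypothetical daily temperatures (°C) from a historical reference period.
historical = [14.0, 15.2, 13.8, 14.9, 15.1, 14.4, 13.9, 15.0]

# A naive "model": predict tomorrow's temperature as the historical mean.
model_prediction = statistics.mean(historical)

# On data from the period it was trained on, the prediction is close.
in_sample_error = statistics.mean(abs(t - model_prediction) for t in historical)

# Today's climate: the same pattern, shifted 2°C warmer (invented offset).
current = [t + 2.0 for t in historical]
drift_error = statistics.mean(abs(t - model_prediction) for t in current)

print(f"in-sample error: {in_sample_error:.2f}°C, after drift: {drift_error:.2f}°C")
```

The model itself is unchanged; only the world it is applied to has moved, and the prediction error grows accordingly. This is why the training context of a model matters as much as the model code.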

When you’re using a model, you need to understand what data it has been trained on and for what context it has been developed.


Evolution of models

Models have evolved a lot over time. About ten years ago, they were primarily built bottom-up: you start with statistical assumptions, and then you build a model based on those assumptions.

Nowadays, we’re able to learn from data and extract non-linear patterns from it with neural networks and other types of algorithms. This gives rise to new kinds of issues, e.g. bias and adversarial attacks.

Model risk management 

Knowing how to handle the uncertainty that comes with mathematical modelling is called model risk management.

Over the years, models have become so common and impactful that it’s important to understand the risks and know how to avoid them, because the consequences can be drastic. This is where Yields.io comes in.


Eager to learn more about Jos Gheerardyn, Yields.io and their platform for model risk management? 

You can watch the talk at the DataOps Ghent YouTube channel.

Topics: DataOps