Welcome to The Godel POD. This episode is about the secrets of machine learning, where we are joined by Jorge Garcia de Bustos, Technical Presales Consultant at Godel Technologies, to discuss why it is that so many of the machine learning projects that companies undertake fail.

Sarah Foster: Hi, it’s Sarah, Client Director here at Godel. Today I’m joined by Jorge Garcia de Bustos, who is a Technical Presales Consultant at Godel Technologies. Jorge, can you do a little bit of an intro about yourself?

Jorge Garcia de Bustos: Yeah, of course. I’m Jorge, I’m originally Spanish, and I’ve been in the UK for about 20 years. I’m originally an electronics engineer, so my background is lots of maths, lots of physics, lots of signals and a little bit of software. During my degree and very early on, I went into software, became a developer and rose to be a dev manager, and before I joined Godel three years ago I was running development teams in different locations. My role now is to act as a liaison between the teams that we have in our delivery locations and our partners in the UK. What attracted me initially to machine learning is mostly the heavy maths involved, and the chance to use some of the mathematics I learned when I was in my teens and early 20s.

Now we can actually build machines and programmes that can, to an extent, emulate some of our human capabilities.

Sarah Foster: So, you’ve kind of given the game away there, but the topic that we’re talking about today is why machine learning projects that companies undertake often fail. So machine learning isn’t magic, but what is it all about?

Jorge Garcia de Bustos: Think about the way that we typically write software, the way we typically operate in a software development house. Someone has ideas about what the software should do; we’ll have stakeholders or product managers, people like that, who know what the software needs to do and what business processes it has to support. We work a lot top down: we think about what the software needs to do, and we encode that in a natural language, English in most cases, or Russian in the case of some of our delivery centres. The idea is that we first write down what the system needs to do and why, we try to understand the business process and the universe we are modelling, and we identify the rules, equations and formulas that govern its behaviour. We write those requirements in very low-level detail, because when we write code we need that low-level detail, and then we encode those low-level, natural-language instructions in a computer programming language. That works well for loads of things, so there are loads of processes that we can encode really well that way.

But not everything works like that. I can think, for instance, of situations where it’s very difficult for our human brains to comprehend the full complexity of the interdependencies between the things that we’re trying to model, or situations where the requirements may involve thousands or even millions of low-level rules. I’ll give you an example. Imagine that you wanted to write a programme that scans a bit of text and tells you whether the sentiment of that text is positive or negative. It could be a review of a product or a film or a piece of music, and I want to know whether the person who wrote that review is basically expressing a positive or a negative feeling about it. If we do that from requirements, the natural thing would be to say: okay, if it contains such-and-such word it’s a positive review, and if it contains such-and-such word it’s a negative review. But think about the richness of the English language, and the fact that we can sometimes use words with a paradoxical meaning. We can say “this is not very good”: if you only look for “good”, you might think the review is positive, but the “not very” at the beginning modifies the overall meaning. So the number of rules that we would have to build if we went for the top-down approach is just insane.
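
To make that concrete, here is a minimal, hypothetical keyword rule in Python; the word list and function name are invented for the example, and the negation defeats it exactly as described.

```python
# A naive keyword-rule sentiment checker (illustrative only): it marks any
# review containing a "positive" word as positive, so "this is not very good"
# is misclassified because the rule never sees the negation.
POSITIVE_WORDS = {"good", "great", "excellent"}

def rule_based_sentiment(review: str) -> str:
    words = set(review.lower().split())
    return "positive" if words & POSITIVE_WORDS else "negative"

print(rule_based_sentiment("this is not very good"))  # -> "positive" (wrong)
```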

So, there are problems that are not well suited to that top-down, requirements-driven approach. The idea is that instead of going top down from requirements, we go bottom up, where we basically work with the data. We choose a core algorithm that is relatively simple but has very flexible behaviour, and then we feed data to this algorithm, data that very often already has the right answers. We tell the algorithm: these are thousands of positive reviews, and these are thousands of negative reviews, and we expect the machine, based on those, to figure out what the hallmarks are, what the signs of a positive or a negative review are. The machine forms, say, a sort of understanding of the universe. It’s not understanding like a human being like you or me; in many cases it’s not even reading the language, it’s just literally looking at a sequence of things in a very dry way. But the accuracy of some of those models can be astonishing, much better than anything that we could build from a top-down approach.
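
A minimal sketch of that bottom-up approach, assuming a tiny invented set of labelled reviews and using scikit-learn; a real model would need thousands of examples, as discussed.

```python
# Give the algorithm labelled examples and let it find the hallmarks of
# positive and negative reviews. The in-line dataset is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = [
    "loved it, great film",
    "excellent product, works perfectly",
    "terrible quality, very disappointed",
    "this is not very good",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words (including two-word phrases such as "not very") plus a simple
# probabilistic classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["really great, would recommend"]))
```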

Sarah Foster: So obviously that’s a couple of the kinds of problems machine learning can handle. Can you give me any other examples at all?

Jorge Garcia de Bustos: Yes. There are good problems and bad problems for machine learning, and choosing the right problem is one of the keys, so choosing a bad one is one of the typical mistakes that people make when trying to apply machine learning. We are not trying to build something that has the cognitive abilities of a human; we’re just trying to build a computer programme that will use statistics and maths to build associations between things, and there are problems that are really good fits and problems that are really awful fits. One of the good ones is something called classification, which is the prediction of whether a data record belongs in a given category. For that we need to give the programme correctly pre-classified historical data, and it is really, really good for problems like spam detection, likelihood of loan default or customer churn. We can tell the machine: this is what happened in the past, and we have divided the cases between positive and negative outcomes; you find out which bits in the data are highly correlated with positive and negative outcomes, so that in the future we can ask you to classify data that you have never seen.
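
A hedged sketch of that kind of classification on pre-labelled, tabular historical data; the churn scenario, features and synthetic labels below are invented purely for illustration.

```python
# Classification on historical records: fit a model on labelled past cases,
# then inspect which fields correlate with the outcome and score unseen data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical customer features: months as a customer, support tickets raised.
X = np.column_stack([rng.integers(1, 60, n), rng.integers(0, 10, n)])
# Synthetic label: short-tenure, high-ticket customers churn more often.
churned = (X[:, 1] * 3 - X[:, 0] * 0.5 + rng.normal(0, 5, n)) > 0

X_train, X_test, y_train, y_test = train_test_split(X, churned, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

print("accuracy on unseen records:", clf.score(X_test, y_test))
print("learned weight per feature:", clf.coef_)  # which fields correlate with churn
```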

And we can also use it for problems that we call regression, which is the prediction of the likely value of a real-world variable given a historical data series. Say, for instance, you want to write a programme that gives you the market value of a house or a car based on a series of data points; based on historical data about a particular car model, we could try to predict a sensible value that you could get for your car on the open market, that kind of prediction of a real-world variable. We can also use it for things like natural language processing: machines have become really, really good at spotting sequences and patterns in written text and in spoken language, which can be used for things like language translation and sentiment analysis. And machine learning models have also become really, really good at image processing, so they can identify and locate objects, and even people, inside images, and even generate or reconstruct images that are damaged. So there are certain good problems, and it’s important to say that choosing a bad problem can already set your machine learning effort on the wrong footing, because these models don’t have magical capabilities.
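
And a similarly hedged sketch of regression, predicting a car’s market value from invented, synthetic data points.

```python
# Regression: predict a continuous value (a price) from numeric features.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
age_years = rng.uniform(0, 15, n)
mileage_k = rng.uniform(0, 200, n)
# Synthetic "true" price: newer, lower-mileage cars are worth more.
price = 20000 - 900 * age_years - 40 * mileage_k + rng.normal(0, 1000, n)

X = np.column_stack([age_years, mileage_k])
reg = LinearRegression().fit(X, price)

# Estimate the value of a 5-year-old car with 60k miles on the clock.
print(reg.predict([[5, 60]]))
```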

Sarah Foster: So, with that, let’s talk about data.

Jorge Garcia de Bustos: Machine learning is considered almost like data-driven engineering, so the main thing that we need when we do machine learning is high-quality data about the phenomenon that we are trying to model. And when I say high-quality data, there are loads of things that you need to look for. The first thing is that most machine learning engines need very clean, very consistent, very flat data, the equivalent of a nice big spreadsheet with lots of rows and loads of columns and no blanks anywhere. It also needs to be gathered with consistent criteria over time: if you feed one of these models a time series of 10 years’ worth of data, it’s very important that the gathering criteria from 10 years ago are the same as the ones you have right now, because it is essential to have consistency over time, and many models react very badly to changing conditions or to data that has empty gaps in it. It’s essential to have data that is captured, or manipulated, to a very high standard.

And just to give you an example, imagine you’re a company with customers and you want to build a model to predict their behaviour. The type of data set that you would need to build a model that predicts your customers’ behaviour is something like a single customer view: an aggregate, consistent representation of all the data that your organisation knows about each customer. It’s not just a dirty Excel spreadsheet that you extract and massage manually when you need it; it needs to be a real-time or near-real-time data set that is created by a pipeline that is reproducible, auditable and testable. And it needs to be something that you can run not just on demand but consistently, with consistent data definitions and no gaps. You need a flat representation where every row is one of your customers and every column is a bit of information that your IT systems hold about them, scattered all over the organisation. Only when you have a data set like that, with thousands of rows representing each of your thousands of customers and hundreds or even thousands of columns denoting every bit of information that you know about them, can you start building sensible models that predict the behaviour of your customers.
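
A small sketch of what flattening scattered records into that single customer view can look like; the tables and column names are hypothetical, and a real pipeline would of course be automated, auditable and fed from live systems rather than in-line data.

```python
# One row per customer, one column per piece of information pulled together
# from different systems (orders, support tickets, ...).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Ben", "Cho"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2],
                       "amount": [120.0, 80.0, 45.0]})
tickets = pd.DataFrame({"customer_id": [2, 3, 3],
                        "ticket_id": [11, 12, 13]})

# Aggregate per customer, then join everything onto one flat row per customer.
order_totals = orders.groupby("customer_id")["amount"].agg(["count", "sum"])
ticket_counts = tickets.groupby("customer_id").size().rename("open_tickets")

single_customer_view = (customers.set_index("customer_id")
                        .join(order_totals)
                        .join(ticket_counts)
                        .fillna(0))
print(single_customer_view)
```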

And the single customer view generalises. If you want to do modelling or predictions about your products, your orders, your inventory, your sales, your employees, your cases, your loan book, your policy book, then for each of the entities that you want to make predictions about using machine learning, you need a single entity view that looks the same: one row per entity, one column for every bit of data you have scattered all over the organisation, gathered and manipulated in a consistent way, not just ad hoc manipulation. The data volumes that you need for this depend a little bit on the technology you want to use, or the type of algorithm, but they can range from the low thousands for some of the simplest machine learning models up to, in some cases, millions of data points for the most sophisticated ones. So this is not something that you can do on the back of data that you capture casually. This is data capture and data manipulation that you need to be running on an industrial scale, if you like.

Sarah Foster: So, data scientists, what do they need, what makes them happy?

Jorge Garcia de Bustos: What makes data scientists happy? Well, the key is in the job title: they are scientists. What they’re going to try to do is follow a scientific process: the formulation of a hypothesis, the design of an experiment to test that hypothesis, and the gathering of data based on those single entity views that you give them, in an almost unfettered way, plus the possibility of making their own custom transformations when a bit of data, with a little bit of manipulation, will yield better results. So give them not just access to this massive catalogue of data, but also the tools for some additional manipulation of it. Then what they will do is run those experiments and test against the null hypothesis, which says: my initial hypothesis is wrong. This is a bit like the opposite of criminal law: your model is guilty, or useless, unless proven useful. Very often the result is a hypothesis refinement: my initial hypothesis was wrong, I need to formulate another one, design another experiment and test it, and do this all over again, in an iterative process, until we find something that is demonstrably better than a coin toss. It is very important that we prove the model is better than a random choice in a scientific, testable way, and there are loads of statistical tests that you need to pass before you can decide that your model is better than a random choice.
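
A minimal sketch of that “useless unless proven useful” check, comparing a candidate model against a trivial baseline on held-out data; the data here is synthetic, and a real evaluation would add proper statistical tests, as mentioned.

```python
# Compare the model's held-out accuracy against a trivial baseline before
# trusting it. Data and model are placeholders; any classifier could stand in.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, learnable signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

print("baseline accuracy:", baseline.score(X_te, y_te))
print("model accuracy:   ", model.score(X_te, y_te))
# Only if the model clearly beats the baseline (ideally confirmed with a
# statistical test across multiple resamples) is it worth keeping.
```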

Sarah Foster: So, is there a trade-off between perfect models and perfect data?

Jorge Garcia de Bustos: There are always trade-offs, and they are at the heart of ML. By the way, you’ll never hear me say “artificial intelligence” or “AI”; I’m only going to talk about ML, because I absolutely despise the expression AI. So, trade-offs are at the heart of machine learning: you’re trying to build a programme that will form an understanding of the world so that in the future it can make predictions, and we have a choice about how powerful or flexible we want our core algorithm to be. In some cases we have simpler, less powerful algorithms that typically require less training data and are not very sophisticated in the mathematics underneath, which means they will tend to be just good enough, kind of middling: when you test the algorithm it will give you predictions that are more or less okay, and when you put it into production it will give you predictions that are more or less okay. The good thing about those is that they’re very easy to interpret or explain, because the complexity of these models is relatively low.

So these are models that have the advantage that training times and the volumes of data they need tend to be smaller. On the other hand, we have algorithms that are much more powerful and flexible, neural networks for instance come to mind, which can really capture a lot of the complexity of your real-world phenomenon. But the flip side is that they require huge volumes of data, and not just the data but also computing resources and time. Because the flexibility of the model is huge, there are loads of internal parameters and lots of things that the model needs to tweak internally, so you need to give it massive volumes of data and train it on properly dedicated hardware over long periods of time, and then your results can capture the complexity of the system very accurately.
But the inconvenience of this is that they do behave like almost inscrutable black boxes: it’s impossible to give a simple reason why your model predicts a positive or a negative result when you ask it to make a prediction. They’re also very prone to what’s called overfitting, which is almost like, in your student days, revising for the test: the model matches the complexities of the training data too well, but it doesn’t generalise to data it hasn’t seen when you put it into production. And that leads to an interesting problem with these very flexible, powerful models, which is that in many cases we require explainability. Say, for instance, that you want to know the criteria that led a machine learning model to accept or turn down a loan request. There are models that will give you an idea of which fields were looked at in order to reach a decision, but there are very complex models that are black boxes, where the model cannot tell you why it made that decision.
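
A small sketch of overfitting in action, using synthetic data: an unconstrained decision tree scores perfectly on the data it has seen but typically does worse on data it hasn’t, compared with a deliberately simple model.

```python
# The unconstrained tree "revises for the test": it memorises the noise in
# the training set instead of learning something that generalises.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 10))
y = ((X[:, 0] > 0) ^ (rng.random(400) < 0.2)).astype(int)  # signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # unconstrained
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

print("deep tree    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow tree train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```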

And sometimes they can latch on to things that are nonsensical. What comes to mind is an experiment someone did with a neural network classifier to tell apart huskies and wolves, and it turned out, when they analysed that model very closely, that the reason it was so good is that it was looking at whether the background had white snow or not, assuming that white snow meant wolves rather than huskies. That went back to the training data: the images the model had been trained on showed wolves in the wild with snowy backdrops, and huskies that were mostly domestic animals, so the model didn’t look at the faces of the animals, or the shape of the tail, or the coat; it just looked at whether there was a lot of white in the background. So sometimes the model is going to take shortcuts, and it can be very difficult to tell whether it is taking those shortcuts. That means that in some cases, take medical diagnosis for instance, you want to know that the criteria the model uses to make a positive or negative determination are sound, and so sometimes you need to exclude really powerful models from certain applications on the basis that you need the predictions to be explainable, and you need the explanation to be sensible.
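
In that spirit, here is a hedged sketch of one way to probe what a model actually relies on, using permutation importance from scikit-learn on invented features; if the “background” column dominates, the model has learned a shortcut.

```python
# Permutation importance: shuffle one column at a time and measure how much
# the model's score drops, revealing which inputs it really depends on.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
n = 600
animal_shape = rng.normal(size=n)        # genuinely informative feature
snowy_background = rng.normal(size=n)    # spurious shortcut feature
y = (animal_shape + 2.0 * snowy_background > 0).astype(int)

X = np.column_stack([animal_shape, snowy_background])
clf = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["animal_shape", "snowy_background"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
# If "snowy_background" dominates, the classifier is using the shortcut.
```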

Sarah Foster: So, I’m going to go off the questions that I had for you, because something you just mentioned there has piqued my interest. Machine learning in the real world: does that not bring ethical risks in letting machines handle decision making?

Jorge Garcia de Bustos: Absolutely, and to their credit, the giants of the machine learning world have at least recognised that, and people like Microsoft, for instance, are very vocal about what they call the ethics of AI; as I said, I despise the expression AI, but the ethics of machine learning are important. I’ll give you a practical example. Imagine that you’re a financial institution deciding whether to grant loans, and one of your fields is ethnicity, and your past data has shown that certain ethnic minorities have a higher chance of defaulting on their loans. So what do you do? Do you factor ethnicity into your criteria, which, if it’s strongly correlated with loan default, strictly speaking you should? Or do you do the sensible thing, which is to understand that ethnicity is a proxy for much more complex socio-economic problems, the fact that ethnic minorities have worse economic outcomes in many cases due to discrimination. People are not defaulting on those loans because they are from a minority background; in many cases they are defaulting because they are playing the game of life with harder settings, as a result of the way many people interact with them. It’s a harder life, and that means lower salaries, worse housing, worse health outcomes and everything else, so chances are they might have a higher chance of default.

So, as a machine learning practitioner, it is essential that you don’t build those coarse biases into your machine learning models: you try to go back to the source, you look for the racist or discriminatory proxy measure, and you try to understand the model much more closely, to get a view that is fairer and less prone to discrimination.

Sarah Foster: So what are some of the best practices that engineering teams can follow?

Jorge Garcia de Bustos: We talked about data already: you cannot do machine learning without a really mature data engineering practice, and you need consistent, auditable, testable data pipelines. What you do in your data world should typically be a reflection of what you do in your engineering world as well. Data cannot be the poor relation of your technical organisation, so in machine learning, and in data-driven engineering in general, you should be as strict as you are in your traditional software engineering. You want a granular, microservices-style design approach, or what is called a lambda architecture, that allows you to do complex on-the-spot transformations, so that your data scientists and your machine learning engineers can retrieve and manipulate data on the spot. You want service-oriented architectures, you want a culture where peers review each other’s experiments and code before contributing to your code base, you want functional and non-functional test automation, and you want automated pipelines to periodically rebuild, test and deploy your models.

An essential concern is something we call model drift: what your model learns from a static set of data is good for the time when the data was captured, but if the real world changes and you don’t retrain that model periodically with fresh data, you’re effectively working with an understanding that might be lagging months or even years behind. Imagine that you built a chatbot model of how the English language was used five to ten years ago: it’s going to struggle when you put it in front of people using neologisms and new uses of the language, so it is very important to retrain your models periodically. And of course, try to apply the same architecture improvements that we use in our normal engineering: adoption of containerisation, adoption of cloud environments with infrastructure as code, and so on. Apply those consistent engineering standards not just to your requirements-driven software engineering but also to your data-driven software engineering, and adopt something like MLOps: just as you have a DevOps culture, you have an MLOps culture on your machine learning side.
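
A minimal sketch of what guarding against model drift might look like; the accuracy threshold, the window of fresh labelled data and the retraining hook are all placeholders.

```python
# Monitor the live model's accuracy on recent, labelled data and retrain it
# when performance degrades below a chosen threshold.
ACCURACY_THRESHOLD = 0.80  # placeholder value; tune per use case

def check_and_retrain(model, recent_X, recent_y, train_fn):
    """Score the live model on the latest labelled data; retrain if it drifts."""
    recent_accuracy = model.score(recent_X, recent_y)
    if recent_accuracy < ACCURACY_THRESHOLD:
        print(f"accuracy dropped to {recent_accuracy:.2f}, retraining on fresh data")
        return train_fn(recent_X, recent_y)  # e.g. refit the same pipeline
    return model
```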

Sarah Foster: And just to finish up, is there anything else that you would like to add about the secrets of machine learning?

Jorge Garcia de Bustos: The secret of machine learning is that it’s not a secret: it’s all over the literature that something like 90% of projects never make it to production. There are loads and loads of pitfalls, starting with choosing the wrong problem, not working with the right data, and treating this as a one-off as opposed to a process that you try to industrialise, with the aim of putting models into production consistently. One of my favourites is when the machine learning and data science teams don’t work in close collaboration with your engineering teams. They might report to, say, a VP of Marketing, because typically the insights are used by marketing teams, and then people are surprised that there are massive differences between the engineering practices followed on the data science side and on the regular engineering side. The people writing data-driven engineering modules, machine learning models, really benefit from close collaboration with the people building requirements-driven, top-down engineering modules. The best outcomes are typically in organisations that acknowledge that and don’t segregate the developers and data engineers on one end from the data scientists in a different department on the other. Having all your data people, the people generating the data and the people experimenting with that data, working on the same data, under common rules and following the same professional practices, typically leads to the best outcomes.

Sarah Foster: So, thank you for your time Jorge, it’s been great to learn more.

Jorge Garcia de Bustos: Thank you.