Data governance (DG) is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal data standards and policies that also control data usage. Effective data governance ensures that data is consistent and trustworthy and doesn’t get misused.

Data governance has not just become prominent over the past several years; organisations have always cared about governance and data quality, and the field is full of overlapping terminology. Data governance is frequently confused with other closely related terms and concepts, including data management and master data management.

Many organisations try to drive these initiatives through data governance, and anyone working with a data platform will sooner or later talk and think about it. In my understanding and experience, there are plenty of frameworks and ideas for how to set everything up, but from a practical perspective the situation is more difficult than expected, because a lot of challenges come with it.

Data governance in practice

When you come across a small problem, you can create a small but viable solution. Within a client engagement, one of the main things we focus on is data; sometimes it is the only project focused on data, covering governance, quality, security and so on. This client started from ground zero, which is the starting point for a data platform: they began to grow the platform with the data they already had in the company, without any framework for data governance or security.

Another challenge we have come across, and one we will see more and more in data projects, is that a lot of companies are moving to microservices, which is quite a buzzword at the moment. Mostly it is huge companies moving in this direction. They started moving to microservices about five years ago, which was a new challenge even for us as data engineers, because the typical way we worked in the ‘ancient’ times, ten years ago for example, was very different.

The data catalogue is one element of data governance that did not exist at the time. Back then there were a handful of “core” business systems, and the entities they operated on were well known to at least a dozen people, even in relatively small enterprises. As engineers of analytical data platforms, we were also pretty sure what to pull from those systems, from where and how.

With microservices, the idea is to be very flexible and decoupled from the rest of the system. This brings flexibility and robustness, because there is no single point of failure in a system of microservices, but it also creates a challenge for data understanding: how is everything connected to each other, and what is the source of the data?

Below are three themes we explored to achieve improvements in data governance for a Godel client.

Event Gateway and Schema Registry

Our client is a UK PLC that runs more than 300 different microservices to operate its business. Of course, data analytics can still be considered secondary to operational activities (a controversial statement in data-driven companies, which are increasingly shifting towards data), so the company is forced to “teach” each of its services to additionally send useful information about its activities to a so-called event gateway.

When I joined the team, they already had this gateway to accept information from internal processes. This is a very good first step towards understanding what is happening across the cloud and the microservices. Anyone who wants to send information to the gateway first has to register a schema describing that information.
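To make that flow concrete, here is a minimal sketch of the pattern. It assumes a Confluent-style Schema Registry and a hypothetical HTTP event gateway; the endpoints, subject name and payload shape are illustrative, not the client’s actual setup. A producer registers an Avro schema first, then publishes events that reference it:

```python
import json
import requests

# Hypothetical endpoints; the client's real gateway is internal.
SCHEMA_REGISTRY = "http://schema-registry:8081"
EVENT_GATEWAY = "http://event-gateway:8080/events"

# 1. Register the event's Avro schema before anything may publish it
#    (Confluent-style Schema Registry REST API).
order_schema = {
    "type": "record",
    "name": "OrderPlaced",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "placed_at", "type": "string"},
    ],
}
resp = requests.post(
    f"{SCHEMA_REGISTRY}/subjects/order-placed-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(order_schema)}),
)
resp.raise_for_status()
schema_id = resp.json()["id"]

# 2. Publish an event that references the registered schema.
event = {"order_id": "A-1001", "amount": 42.5, "placed_at": "2023-01-15T10:00:00Z"}
requests.post(
    EVENT_GATEWAY, json={"schema_id": schema_id, "payload": event}
).raise_for_status()
```

The important property is that the registry, not each producer, becomes the source of truth for what every event looks like.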

At Godel, we look to move forward in our expertise and give our clients a complete solution. Our particular role on this project, as a data division, comes with a new challenge created by the gateway. For example, the client might outsource some of their requirements to a third party that is not aware of the gateway at all and does not know how to provide us with new information. We act as mediators between the third party and the gateway: we take information from the third party, including data, KPIs and files, and pass it on in the form the gateway expects. This unified architecture suits more or less all use cases.

From my point of view, it’s a good pattern of working because we have a single architecture for everything, although we need to build separate pieces of software to act as those mediators. Again, it gives some understanding of the data source.
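As an illustration of what such a mediator can look like, here is a small sketch that reads a hypothetical third-party CSV export and republishes each row through the gateway under an already-registered schema. All field names and endpoints are invented for the example:

```python
import csv
import requests

GATEWAY_URL = "http://event-gateway:8080/events"  # hypothetical internal endpoint

def forward_third_party_file(path: str, schema_id: int) -> None:
    """Read a third-party CSV export and republish each row
    through the event gateway under a registered schema."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            event = {
                "order_id": row["OrderRef"],       # map third-party fields
                "amount": float(row["TotalGBP"]),  # onto the registered schema
                "placed_at": row["CreatedAt"],
            }
            requests.post(
                GATEWAY_URL,
                json={"schema_id": schema_id, "payload": event},
                timeout=10,
            ).raise_for_status()
```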

Enterprise Data Catalogue

Next year, we hope to build an enterprise data catalogue (EDC) with the client: a single source of all the information needed to work effectively with data. When comparing and selecting data governance tools, it’s important to focus on tools that help you realise the business benefits laid out in your data governance strategy. For us that tool is DataHub: it is a good way to visualise the data and to search across different sources and layers, but you also need to feed it the information to visualise. We are currently in the middle of pushing this data into DataHub to create metadata about our sources and processes.
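Feeding DataHub typically happens through its Python emitter. Below is a minimal sketch of pushing metadata for one dataset using the acryl-datahub package’s REST emitter; the GMS endpoint, platform and dataset names here are hypothetical, not the client’s:

```python
# pip install acryl-datahub
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Hypothetical DataHub metadata service endpoint.
emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")

# Describe one source dataset so it shows up in the catalogue.
dataset_urn = make_dataset_urn(
    platform="postgres", name="orders.public.order_placed", env="PROD"
)
properties = DatasetPropertiesClass(
    description="Order events collected through the event gateway.",
    customProperties={"owner_team": "data-platform", "source": "event-gateway"},
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))
```

Once this kind of metadata is emitted for each source and process, the catalogue can answer the “what is this and where did it come from” questions that microservices make hard.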

There are proprietary offerings from Informatica, Oracle and Teradata, and they have led the way with these types of tools. Now there are a lot of small and medium-sized companies that have huge amounts of data, which is also one of the triggers, or challenges, in our data practice. When I first started working with data, the tools were expensive and only available to wealthier companies. Now even smaller companies have a lot of data they are trying to work with, and new open-source tools have appeared on the market to compete with the old ones.

Data Lineage

Another theme we came across for the future is the OpenLineage project, an open platform for the collection and analysis of data lineage. A lot of these tools started as in-house solutions at big companies that could afford to build something production-ready.

Such a company can then open-source the project and try to spread it amongst many others. It’s a good approach, as a lot of people can contribute both to the ideas behind the tool and to its implementation, creating a unified data model and making connections with the metadata. The complexity of the architecture is considerable.
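To show what collecting lineage looks like in practice, here is a minimal sketch using the openlineage-python client: a job run reports its inputs and outputs to a lineage backend such as Marquez. The endpoint, namespaces and dataset names are invented for the example:

```python
# pip install openlineage-python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Hypothetical OpenLineage collector endpoint (e.g. a Marquez instance).
client = OpenLineageClient(url="http://marquez:5000")

run = Run(runId=str(uuid.uuid4()))
job = Job(namespace="data-platform", name="orders_daily_load")

# Report a completed job run with its input and output datasets,
# so the lineage graph can connect source and target.
client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="https://example.com/data-platform/jobs",
        inputs=[Dataset(namespace="postgres://orders-db", name="public.order_placed")],
        outputs=[Dataset(namespace="s3://analytics", name="warehouse/orders")],
    )
)
```

Because every job emits events in the same model, the lineage graph can be assembled across tools rather than per tool, which is exactly the unification the project is after.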

Conclusion

Solving the problems surrounding the data catalogue and data governance will allow you to work with huge amounts of data. When you have just one database or file, it’s easy to review everything with data architects or data stewards; at scale, that stops being possible.

Overall, if you have a huge solution, you must implement specific tools. However, it’s important to remember that tools do not solve everything; the integration still has to be done by a human. These innovative tools also mean we can begin plugging the gap around the challenge of machine learning. You can’t do everything manually, so having an automated way to do it will play a big part. It’s really easy to miss something if you input the data manually, so a data catalogue built as part of the data platform keeps this automated, making it easier to migrate in the future.

Illustrations created by Artsiom Rabets, Sr. User Experience Designer at Godel