Introduction
With the world’s growing user base across devices and applications in recent years, we have seen a huge surge not just in the volume of data being collected but also in the number and variety of sources. The pandemic has accelerated this trend even further, and high-quality, consistent data has become mission-critical for both business and data leaders to drive successful business outcomes.
If you are part of a data team in any capacity, be it as a data engineer, data scientist, data product manager or data analyst, you will have come across common data governance issues of different kinds, depending on the type of data you work with and its primary user groups.
- Data standardization and integrity: upstream data format changes that make the data, and hence its derivatives, less usable
- Data definitions: misunderstood, duplicated or inconsistent definitions of data fields, columns or attributes
- Data access and user personas: ambiguity about who is accessing data and for what purposes, due to poor logging or insufficient tagging
A data lake or data warehouse that hosts data for consumer and producer user groups without proper governance in place very quickly leads to chaotic operations and unplanned emergencies. Controls are therefore necessary over data and its content, structure, use, privacy and security. Every organization needs these controls at different levels, depending on the complexity and types of datasets it handles and on the regulatory requirements around the data and the usage patterns of its producers and consumers.
Enabling these controls typically involves three main steps:
- Creation and agreement of policy framework with relevant stakeholders
- Consistent implementation of these policies
- Commitment to continuous evaluation and adaptation
Data governance as a concept covers all of these aspects. Traditionally it has been an operations function limited to specifying decision rights and an accountability framework to ensure appropriate behaviour in the valuation, creation, consumption and control of data and analytics. However, legacy policies and formal meetings to enforce them will not build continuous data governance. Lately, data governance has come to be viewed as a bureaucratic way to control data, one that impedes its usage and impairs a data-driven decision-making culture; instead of helping democratize data, it is seen as a blocking function. There is no doubt, though, that we need data governance to reduce the risk of non-compliance, cut costs through reusability, improve productivity and, in general, give data consumers confidence in their decision making.
So, if we want to treat data as a strategic asset, modern data governance should keep technology at the centre and help drive the people, processes and tools that enable organizations to formally manage the reliability, accountability, usability, trust and compliance of data in support of business objectives, with as much automation and self-serve capability as possible.
Recently, DataOps has also emerged as a concept to help us move in this direction and reimagine data governance by bringing together data engineering, analysts, operations, data scientists, data stewards and business teams. Gartner defines DataOps as “a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization”.
How can we achieve this?
Good data governance has a few primary goals or objectives. The following outlines the data governance framework we currently use:
- Building trust and reliability:
An aligned single source of truth, including the related change management.
- Accountability definitions:
Defining the owners, governance stewards, subject matter or functional experts, and support teams for the full lifecycle of all data.
- Regulatory compliance:
Handling legal and privacy requirements as a centralized component.
- Continuous communication:
Covering usability end to end, from data discovery, profiling and business-friendly definitions to the flexibility for users to bring their own tools and compute, plus a menu of provisioning services as part of the tech stack.
Ideally, all of these capabilities should be part of the data platform’s governance and cataloguing, supporting transparency and accountability of the platform and keeping the tools and processes maintained and up to date, making this a collective responsibility of all the teams involved.
In this article, we will discuss in detail the “Trust and Reliability” aspect of the above Data Governance framework.
Trust and Reliability
In order to build trust in a platform, it is important to bring transparency and agreement to what constitutes the truth, and to align on when, how and where it changes.
What constitutes truth?
- Metadata management
Metadata management with proactive documentation is perhaps one of the hardest parts of the overall data management practice.
There are two different types of metadata: business and technical.
Business metadata involves defining each column from a use-case perspective and maintaining a contextual data catalogue and documentation for easy use.
Technical metadata involves assigning data types and defining the format of the collected data. This is typically part of the data schema and needs to be aligned between the producers and consumers of the data, or of any business-transformed version of it.
It also helps to follow standardized naming conventions at a platform level for predictability and usability of the data.
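To make this concrete, one lightweight approach is to keep business and technical metadata together in a schema contract declared in code and validate incoming data against it. The sketch below is a minimal example in Python with pandas; the “orders” dataset, its column names and the snake_case convention are illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class ColumnSpec:
    name: str         # technical: column name, following a snake_case convention
    dtype: str        # technical: expected pandas dtype
    description: str  # business: definition of the column from a use-case perspective

# Hypothetical contract for an "orders" dataset
ORDERS_CONTRACT = [
    ColumnSpec("order_id", "int64", "Unique identifier of a customer order"),
    ColumnSpec("order_ts", "datetime64[ns]", "Timestamp when the order was placed (UTC)"),
    ColumnSpec("order_amount_usd", "float64", "Total order value in US dollars"),
]

def validate_schema(df: pd.DataFrame, contract: list[ColumnSpec]) -> list[str]:
    """Return human-readable violations of the schema contract."""
    issues = []
    for spec in contract:
        if spec.name not in df.columns:
            issues.append(f"missing column: {spec.name}")
        elif str(df[spec.name].dtype) != spec.dtype:
            issues.append(f"{spec.name}: expected {spec.dtype}, got {df[spec.name].dtype}")
    return issues

orders = pd.DataFrame({
    "order_id": [1, 2],
    "order_ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "order_amount_usd": [19.99, 5.50],
})
print(validate_schema(orders, ORDERS_CONTRACT))  # [] means the data matches the contract
```

The same contract objects can also feed the data catalogue, so the business definitions shown to users and the technical checks run in pipelines come from a single place.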
- Assess quality
Define metrics in advance to measure data quality, including data freshness; there are many data observability frameworks to track these. Nobody likes stale data, so it is important to keep data fresh with a regular cadence or a rolling refresh window for each dataset. Because every dataset is different, technology that can be configured to the business need is a must to build trust in the data. Some examples, with a sketch of such checks after this list:
- Missing or invalid data in columns – the percentage of values that are null or in an unexpected format
- Data volume levels – thresholds to detect unusual increases or decreases in the volume of data
- Data freshness level – a stakeholder-aligned cadence or a rolling refresh window for each dataset
- For ad-hoc, one-time-upload datasets – monitoring data quality status, with archive functionality turned on as required
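As a sketch of how the example metrics above could be computed, the snippet below derives null percentages, a volume check and a freshness check with pandas. The thresholds, column names and 24-hour freshness window are assumptions for illustration; in practice a dedicated observability framework would usually provide this.

```python
import pandas as pd

def quality_report(df: pd.DataFrame,
                   ts_column: str,
                   expected_rows: tuple[int, int],
                   max_staleness: pd.Timedelta = pd.Timedelta(hours=24)) -> dict:
    """Compute a few basic data-quality signals for one dataset."""
    report = {}

    # 1. Missing data: percentage of null values per column
    report["null_pct"] = (df.isna().mean() * 100).round(2).to_dict()

    # 2. Data volume: flag unusual increases or decreases against agreed thresholds
    low, high = expected_rows
    report["row_count"] = len(df)
    report["volume_ok"] = low <= len(df) <= high

    # 3. Data freshness: compare the newest record against the agreed window
    #    (assumes ts_column holds timezone-naive UTC timestamps)
    newest = pd.to_datetime(df[ts_column]).max()
    now_utc = pd.Timestamp.now(tz="UTC").tz_localize(None)
    report["staleness"] = now_utc - newest
    report["fresh"] = report["staleness"] <= max_staleness
    return report

# Illustrative usage with a tiny, hypothetical events table
events = pd.DataFrame({
    "user_id": [1, 2, None],
    "event_ts": pd.to_datetime(["2024-05-01 10:00", "2024-05-01 11:00", "2024-05-01 12:00"]),
})
print(quality_report(events, ts_column="event_ts", expected_rows=(1, 100)))
```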
When there is a change
- Metadata changes
A clear handshake and a standardized process are necessary ahead of time whenever changes to the format, structure, volume or definition of columns in a dataset are planned. Protocols need to be defined, with clearly documented policies for change. This should inherently be part of the overall data team culture, with buy-in from leadership teams as part of cross-team collaboration.
- To achieve reliability, it is important to define service level agreements not just on the last leg of the data pipelines leading to the end consumer, but all the way upstream to the source (raw logs). Data engineering and business teams need to align on service level agreements and objectives for data availability, and appropriate monitoring and alerting should be put in place based on them. Since each dataset has a different priority for each consumer, provide configuration functions within the platform that can be self-subscribed to reduce the number of alarms and alerts; a sketch of such self-subscribed configuration follows after this list.
- Mechanisms to communicate and handle change within the platform should be aligned on and standardized.
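One way to make alerting self-subscribed, as suggested above, is to let each consuming team declare the datasets it depends on and its own thresholds, and have the platform evaluate those declarations rather than hard-coding alerts. The dataset name, channels and thresholds below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FreshnessSLA:
    dataset: str
    max_delay_hours: int  # threshold agreed with the consuming team
    notify: list[str]     # self-subscribed channels or addresses

# Hypothetical subscriptions: each consumer opts in with its own thresholds,
# so the platform only alerts the people who asked to be alerted.
SUBSCRIPTIONS = [
    FreshnessSLA("orders_daily", max_delay_hours=6, notify=["#finance-data"]),
    FreshnessSLA("orders_daily", max_delay_hours=24, notify=["#ml-team"]),
]

def overdue_alerts(dataset: str, delay_hours: float) -> list[str]:
    """Return the channels to notify for a dataset that is `delay_hours` late."""
    return [
        channel
        for sla in SUBSCRIPTIONS
        if sla.dataset == dataset and delay_hours > sla.max_delay_hours
        for channel in sla.notify
    ]

print(overdue_alerts("orders_daily", delay_hours=8))  # ['#finance-data']
```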
How to Handle the Change?
- Tools and Technology
Assessing the impact of a change from both a business and a technical standpoint is required. Traceability and data lineage are features that support version control, whether to assess the impact of potential changes or for auditing and resets.
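As a minimal illustration of how lineage supports impact assessment, the sketch below models lineage as a mapping from each dataset to its direct downstream dependents and walks the graph to find everything a proposed change would touch. The dataset names and graph are purely illustrative.

```python
from collections import deque

# Hypothetical lineage: dataset -> direct downstream dependents
LINEAGE = {
    "raw_events": ["clean_events"],
    "clean_events": ["daily_sessions", "revenue_report"],
    "daily_sessions": ["exec_dashboard"],
    "revenue_report": [],
    "exec_dashboard": [],
}

def impacted_by(dataset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk of the lineage graph to collect all downstream datasets."""
    impacted, queue = set(), deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node not in impacted:
            impacted.add(node)
            queue.extend(lineage.get(node, []))
    return impacted

print(impacted_by("clean_events", LINEAGE))
# {'daily_sessions', 'revenue_report', 'exec_dashboard'}
```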
- Process
When the goal is a common source of truth across the board, combined with the flexibility for consumers to collaborate, there is a high chance of misalignment. This is a complex problem, and we have faced similar challenges before in software delivery: DevOps as a function helped developers and operations teams collaborate and deliver faster by automating workflows, infrastructure and code testing, and by continuously measuring performance.
Similarly, before we publish a dataset or a new source of truth, it is important to validate it not only from a data content and definition standpoint but also in terms of the dependencies associated with it, such as SQL queries or scripts. This type of DataOps activity is slowly becoming part of the data team’s function, to assess and bridge this kind of gap and support data governance, much as DevOps teams do for software development.
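A simple form of such a pre-publish gate could run in CI before a new version of a dataset is exposed: validate the content against its contract and confirm that registered dependent queries still reference columns that exist. The helper names, the query registry and the deliberately rough SQL parsing below are assumptions for illustration, not a specific DataOps tool.

```python
import re
import pandas as pd

# Hypothetical dependent queries registered against the "orders" dataset
DEPENDENT_QUERIES = {
    "weekly_revenue.sql": "SELECT order_id, order_amount_usd FROM orders",
    "order_counts.sql": "SELECT order_id, order_ts FROM orders",
}

def referenced_columns(sql: str) -> set[str]:
    """Very rough extraction of selected column names (illustrative only)."""
    select_clause = re.search(r"SELECT\s+(.*?)\s+FROM", sql, re.IGNORECASE | re.DOTALL)
    return {c.strip() for c in select_clause.group(1).split(",")} if select_clause else set()

def prepublish_check(df: pd.DataFrame) -> list[str]:
    """Block publication if any dependent query references a column that is gone."""
    problems = []
    for name, sql in DEPENDENT_QUERIES.items():
        missing = referenced_columns(sql) - set(df.columns)
        if missing:
            problems.append(f"{name} depends on missing columns: {sorted(missing)}")
    return problems

new_version = pd.DataFrame(columns=["order_id", "order_amount_usd"])  # order_ts dropped
print(prepublish_check(new_version))
# ["order_counts.sql depends on missing columns: ['order_ts']"]
```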
Conclusion
Typically, a data strategy focuses on how to get the most value out of the data. A better data strategy, however, takes into account end-to-end data lifecycle management, not just the optimal usage patterns of data. A key aspect of this end-to-end strategy is proactive data governance instead of reactive fixes. It better enables companies not only to measure, experiment, scale and leverage new data sources, but also to create new data products and quickly drive higher-value end-user usage and adoption of data. What’s more, it creates a shared responsibility between data producers and consumers across all data platform activities, elevating the data-driven culture across the company – the holy grail for leadership teams.