Without high-quality data, every AI and analytics initiative will be underwhelming at best and actively damaging to the business at worst.
To overcome this problem, data producers must be willing to take ownership of production data and collaborate with data consumers to support high-value use cases.
Under a data mesh governance structure, this becomes the default paradigm, and data contracts are one promising way to put it into practice.
Data contracts are API-based agreements between producers and consumers designed to solve exactly that problem. They can be implemented through the following steps:
Identify use cases for your data
Create requirements around the schema and values of the data
Document the expected semantics of the dataset
Collaborate with data producers to define the potential value of the use cases
Identify masking policies based on regulatory and privacy requirements
Define the contract and infrastructure, as code, in a source control repository
Validate a robust curation layer above raw CDC events
Automate schema compatibility checks in the CI/CD workflow
Make the data available through the business-facing layer, with role-based masking
Write integration tests to verify semantic validity
Create data tests to verify data correctness
Generate monitors to alert on shifts in semantics and on anomalies
Push all contracts to a catalogue for discoverability and re-use
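To make a few of these steps concrete, here is a minimal Python sketch of a contract defined as code, with value checks, role-based masking, and a schema compatibility check that could run in CI. Everything in it is an assumption for illustration: the `Field` and `DataContract` classes, the `breaking_changes` helper, and the `orders` dataset are hypothetical names, not part of the original post or of any particular data contract tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Field:
    """One column in the contract (hypothetical structure)."""
    name: str
    dtype: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # value-level rule
    masked: bool = False  # hidden from roles without clearance


@dataclass(frozen=True)
class DataContract:
    name: str
    version: str
    fields: tuple

    def validate(self, row: dict) -> list:
        """Return human-readable violations for a single record."""
        errors = []
        for f in self.fields:
            if f.name not in row or row[f.name] is None:
                if f.required:
                    errors.append(f"{f.name}: missing required field")
                continue
            value = row[f.name]
            if not isinstance(value, f.dtype):
                errors.append(
                    f"{f.name}: expected {f.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
            elif f.check is not None and not f.check(value):
                errors.append(f"{f.name}: value {value!r} failed check")
        return errors

    def mask(self, row: dict, can_see_pii: bool) -> dict:
        """Return a copy of the row with masked fields hidden if needed."""
        if can_see_pii:
            return dict(row)
        hidden = {f.name for f in self.fields if f.masked}
        return {k: ("***" if k in hidden else v) for k, v in row.items()}


def breaking_changes(old: DataContract, new: DataContract) -> list:
    """List changes in `new` that would break consumers of `old`.

    Only removed fields and changed types are detected in this sketch.
    """
    new_fields = {f.name: f for f in new.fields}
    problems = []
    for f in old.fields:
        if f.name not in new_fields:
            problems.append(f"removed field {f.name}")
        elif new_fields[f.name].dtype is not f.dtype:
            problems.append(f"changed type of {f.name}")
    return problems


# Hypothetical contract for an "orders" dataset.
orders_v1 = DataContract(
    name="orders",
    version="1.0.0",
    fields=(
        Field("order_id", str),
        Field("amount_cents", int, check=lambda v: v >= 0),
        Field("customer_email", str, masked=True),
    ),
)

# A proposed revision that changes a type and drops a field.
orders_v2 = DataContract(
    name="orders",
    version="2.0.0",
    fields=(
        Field("order_id", str),
        Field("amount_cents", float),
    ),
)

if __name__ == "__main__":
    record = {"order_id": "o-1", "amount_cents": 1250,
              "customer_email": "a@example.com"}
    print(orders_v1.validate(record))
    print(orders_v1.mask(record, can_see_pii=False))
    print(breaking_changes(orders_v1, orders_v2))
```

In a real setup the contract would more likely live in a declarative file under source control, and a CI job would fail the build whenever `breaking_changes` returns a non-empty list. The sketch only shows the shape of the idea.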
This framework treats data as code and results in an explosion of conversation and collaboration around what data is meaningful, what it semantically means, how it is used, and where it should be used.
Data contracts are not a new concept. They are simply new implementations of a very old idea: producers and consumers should work together to generate high-quality, semantically valid data from the ground up, instead of insisting on modelling poor-quality data only after it lands in a data lake.
Data contracts could eventually form a backbone of production-quality data and are important in driving AI/ML, advanced analytics, and other high-value use cases.
Original post by Chad Sanderson, with some edits and added ideas by me.