The Future of Data Integrity

Compliant Data Lakes III

June 25, 2020

Data Lakes and Proof Pools

Last week we examined data lakes and how they address the challenges of data mass and complexity. This week we continue our educational series by examining the idea of repositories of digital proofs.

Implementing such a repository of digital proofs effectively means meeting several requirements. Each digital proof needs to immutably verify the origin and existence of data in a certain state at a certain point in time. This includes the need for long-term proof: the proof must not expire or become compromised. Further, the stream of proofs must be as consistent as the data stream itself, meaning it needs to be gapless in order to provide gapless integrity and gapless trust. Even when data is updated, the proof for the new version is stored as well, making it possible for partners, auditors and machines to transparently verify the entire lifecycle of the data. Lastly, the proofs should be enriched with unique identifiers and metadata that not only let you query for them but also connect them to the original raw data in your data lake without exposing the rest of your lake.
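The requirements above can be illustrated with a minimal sketch, assuming a simple hash-chained record as the proof format. The names `Proof`, `make_proof` and `verify_chain` are hypothetical and not taken from the paper, and the timestamp is supplied by the caller. A real long-term proof would additionally need a trusted timestamp or anchoring mechanism, which this sketch does not provide.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class Proof:
    """Hypothetical proof record; the field names are assumptions."""
    record_id: str    # identifier linking back to the raw data in the lake
    version: int      # increments with every update of the record
    data_hash: str    # SHA-256 of the data; the data itself is never stored
    timestamp: float  # caller-supplied time at which the proof was created
    prev_hash: str    # digest of the preceding proof, "" for the first one

    def digest(self) -> str:
        # Hash a canonical JSON form so the digest is reproducible.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def make_proof(record_id: str, version: int, data: bytes,
               timestamp: float, prev: Optional[Proof]) -> Proof:
    """Create a proof for one state of the data and link it to its predecessor."""
    return Proof(
        record_id=record_id,
        version=version,
        data_hash=hashlib.sha256(data).hexdigest(),
        timestamp=timestamp,
        prev_hash=prev.digest() if prev is not None else "",
    )


def verify_chain(proofs: List[Proof]) -> bool:
    """Return True only if every proof links to the one directly before it."""
    expected_prev = ""
    for proof in proofs:
        if proof.prev_hash != expected_prev:
            return False  # a proof is missing, reordered or altered
        expected_prev = proof.digest()
    return True


# Usage: three versions of one record, each update producing a new proof.
p1 = make_proof("invoice-17", 1, b"amount=100", 1.0, None)
p2 = make_proof("invoice-17", 2, b"amount=120", 2.0, p1)
p3 = make_proof("invoice-17", 3, b"amount=125", 3.0, p2)
chain_ok = verify_chain([p1, p2, p3])
gap_detected = not verify_chain([p1, p3])
```

Because each proof commits to the digest of its predecessor, removing or altering a proof in the middle of the stream breaks verification. Note that this sketch cannot detect a missing proof at the very end of the chain; that would require an external reference to the latest digest.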

The original data cannot be exposed through this separate pool without your permission. The pool can, however, help identify the original data in a separate or simplified search, and it can provide proof that corresponding data exists without exposing that data. For this, a proof pool would need a schema corresponding to the one already implemented in the data lake and a directory for easy searches. Not only would it lighten the load on the data lake, but it would also passively reinforce the lake's management and security and protect the privacy of the original data while actively verifying it. Implementing end-to-end encryption in addition would open up the opportunity to let stakeholders and machines verify data autonomously, making digital proofs, and the trust they carry, flexible and portable. Lastly, the access and rights management of such a repository needs to be actively and effectively manageable, so that it supports quick searches while you stay in control of activity in your digital environment. This could entail measures like zero trust authorization that is valid for only one session, further reducing the attack surface.
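As a rough sketch of the idea above, the following shows a proof pool that stores only hashes in a searchable directory and answers verification requests for authorized sessions. The class `ProofPool` and its methods are hypothetical, the session handling is a simplified stand-in for a real authorization system, and end-to-end encryption is out of scope here.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List, Set


class ProofPool:
    """Hypothetical proof pool: keeps hashes only, never the original data."""

    def __init__(self) -> None:
        # Directory: record identifier -> hashes of all known versions.
        self._directory: Dict[str, List[str]] = {}
        # Tokens of sessions that are currently authorized.
        self._sessions: Set[str] = set()

    def register(self, record_id: str, data: bytes) -> None:
        """Store a proof (hash) for one version of a record."""
        data_hash = hashlib.sha256(data).hexdigest()
        self._directory.setdefault(record_id, []).append(data_hash)

    def open_session(self) -> str:
        """Issue a random token that is valid for this session only."""
        token = secrets.token_hex(16)
        self._sessions.add(token)
        return token

    def close_session(self, token: str) -> None:
        """Revoke the token; later requests with it are rejected."""
        self._sessions.discard(token)

    def verify(self, token: str, record_id: str, candidate: bytes) -> bool:
        """Check whether the candidate data matches a stored proof."""
        if token not in self._sessions:
            raise PermissionError("session is not authorized")
        candidate_hash = hashlib.sha256(candidate).hexdigest()
        known = self._directory.get(record_id, [])
        # compare_digest avoids leaking information through timing.
        return any(hmac.compare_digest(candidate_hash, h) for h in known)


# Usage: an auditor who already holds a copy of the data checks it
# against the pool without the pool ever revealing the data.
pool = ProofPool()
pool.register("invoice-17", b"amount=100")
session = pool.open_session()
matches = pool.verify(session, "invoice-17", b"amount=100")
tampered = pool.verify(session, "invoice-17", b"amount=999")
pool.close_session(session)
```

The verifier has to present the data it wants checked, so the pool confirms existence and integrity but cannot be used to read the lake. After `close_session`, the same token no longer grants access.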

Next week we will examine applications for a repository of proofs. If you cannot wait, you can read the entire paper here.