The Future of Data Integrity

Compliant Data Lakes II

June 18, 2020

New Repositories for a New Era

Last week we reviewed past and expected changes to the digital web to set the scene for the question "what do we do with all that data?". This week we expand on that topic by introducing data lakes and how they address the challenges of data volume and complexity. This post also covers some challenges specific to data lakes.

To better handle the diversity of data types, embrace technological advances such as AI and machine learning, and support M2M as well as complex analytics programs, many organizations turn to ‘data lakes’. In contrast to more rigid ‘data warehouses’, which store structured, processed data in storage silos, data lakes can store both structured and unstructured raw data. Data lakes are also intended to make data more flexible and accessible and to open up the possibility of big data analytics to increase efficiency. Another defining element is a flat architecture, in contrast to the hierarchical structure of data warehouses. In a data lake, each piece of data is assigned a unique identifier and tagged with metadata; to find and use data, the lake is queried. This places an additional burden on users: the data has to be kept comprehensible, with schemas or other means, so that the lake does not turn into a data swamp. For these reasons it is usually data scientists who exploit the potential of data lakes, while business users tend to run into this problem more often. (Alrehamy H., 2015) (Rouse, 2020)
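The flat architecture described above can be illustrated with a minimal Python sketch. The `DataLake` and `LakeObject` classes, the in-memory dictionary, and the metadata keys (`source`, `format`) are hypothetical names chosen for illustration; real data lakes sit on object storage and dedicated query engines.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: an opaque payload plus descriptive metadata."""
    object_id: str
    payload: bytes
    metadata: dict = field(default_factory=dict)


class DataLake:
    """Hypothetical in-memory lake with a flat namespace (no folders, no tables)."""

    def __init__(self):
        self._objects = {}

    def ingest(self, payload: bytes, **metadata) -> str:
        # Every item gets a unique identifier and is tagged with metadata.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = LakeObject(object_id, payload, dict(metadata))
        return object_id

    def query(self, **criteria):
        # Data is found by querying its metadata, not by walking a hierarchy.
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(key) == value for key, value in criteria.items())
        ]


lake = DataLake()
lake.ingest(b'{"temp": 21.5}', source="sensor-7", format="json")
lake.ingest(b"\x89PNG...", source="camera-2", format="png")
lake.ingest(b"free-text maintenance note")  # untagged: stored, but hard to find again

json_hits = lake.query(format="json")
```

The third item hints at how a swamp forms: it is stored without complaint, but with no metadata no query will ever return it.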

Although data lakes are better suited than earlier storage systems to handling the growing amount of diverse data, complying with data privacy and security regulations becomes harder as data volumes grow and the digital environment becomes more complex. Many processes require quick verification that data has not been corrupted, and fast verification in turn increases the overall velocity of the processes that rely on it. How urgent this need is depends on the worth and weight of the data in question, and businesses will probably be the first to find a solution. Many have already introduced more or less elegant solutions to ensure a degree of data integrity. Often, these solutions slow processes down by implementing extensive encryption and decryption and/or considerably increase the data load. The added load could be an electronic timestamp, a cryptographic key, or a digital proof of some kind.
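As one illustration of a lightweight digital proof, the sketch below uses a SHA-256 digest from Python's standard library. The choice of SHA-256, the function names `make_proof` and `verify`, and the sample record are assumptions made for this example; the post does not prescribe a particular scheme.

```python
import hashlib
import hmac


def make_proof(data: bytes) -> bytes:
    """Return a fixed-size fingerprint of the data (assumed scheme: SHA-256)."""
    return hashlib.sha256(data).digest()


def verify(data: bytes, proof: bytes) -> bool:
    """Recompute the fingerprint and compare it with the stored proof."""
    return hmac.compare_digest(hashlib.sha256(data).digest(), proof)


# A made-up record, repeated to simulate a larger payload.
record = b"order-id=1234;status=paid" * 1000
proof = make_proof(record)

# Simulate corruption by changing a single field.
tampered = record.replace(b"paid", b"open", 1)

print(len(record), len(proof))
print(verify(record, proof), verify(tampered, proof))
```

Verification here needs no decryption step, and the proof stays 32 bytes regardless of how large the record is. A bare digest only helps if the proof itself is stored somewhere trustworthy.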

How could this need for verifiably true data help to accelerate the processes that follow, rather than slowing them down?

Considering the increase in inter-organizational cooperation and the properties of a data lake laid out above, something similar may work for digital proofs. Because digital proofs are much smaller than the data they refer to, such a store would be more akin to a pool than to the grand lake that massive amounts of raw data create: a sort of ‘proof pool’.
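To give a rough sense of the size difference, here is a speculative Python sketch of a store that keeps only proofs, keyed by a data identifier. The `ProofPool` class, its methods, and the use of SHA-256 digests as proofs are all assumptions made for illustration, not the design proposed in the paper.

```python
import hashlib


class ProofPool:
    """Hypothetical store that holds only proofs, never the data itself."""

    def __init__(self):
        self._proofs = {}

    def register(self, data_id: str, data: bytes) -> None:
        # Assumed proof format: a 32-byte SHA-256 digest of the data.
        self._proofs[data_id] = hashlib.sha256(data).digest()

    def check(self, data_id: str, data: bytes) -> bool:
        expected = self._proofs.get(data_id)
        return expected is not None and hashlib.sha256(data).digest() == expected

    def size_in_bytes(self) -> int:
        return sum(len(proof) for proof in self._proofs.values())


# Ten made-up documents of 100,000 bytes each stand in for the lake's raw data.
documents = {f"doc-{i}": bytes([i]) * 100_000 for i in range(10)}

pool = ProofPool()
for data_id, data in documents.items():
    pool.register(data_id, data)

lake_size = sum(len(data) for data in documents.values())
pool_size = pool.size_in_bytes()
print(lake_size, pool_size)
```

Under these assumptions the pool grows with the number of items rather than with their size, which is what would make it practical to share between cooperating organizations.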

Next week we will continue this topic by expanding on the notion of proof pools as repositories for data verification proofs. If you cannot wait, you can read the entire paper here.