Saturday, November 4, 2017

Data Virtualization deep dive
Data evolves over time and with the introduction of new processes. As data ages, it becomes harder to reorganize. In some cases, the business uses the data so actively that it cannot tolerate any downtime. Moreover, as data grows, it may be repurposed to meet changing requirements. As more departments and organizations access the data, it may require separation of concerns. For example, one organization may need to see a customer's identity but not the customer's credit cards. Similarly, another might need to see the items a customer purchased but not the shipping addresses. Data also grows at a phenomenal rate, and once it starts accruing it does not stop until the business shuts down.
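The separation of concerns described above can be sketched with database views. This is only an illustration under assumptions: the table names, column names and sample row are hypothetical, and SQLite itself does not enforce per-user permissions, so a real deployment would grant each department access to its view only.

```python
import sqlite3


def build_customer_views() -> sqlite3.Connection:
    """Create hypothetical tables plus one view per consumer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (
            id INTEGER PRIMARY KEY,
            name TEXT,
            credit_card TEXT,
            shipping_address TEXT
        );
        CREATE TABLE purchases (customer_id INTEGER, item TEXT);

        INSERT INTO customers
            VALUES (1, 'Ada', '4111-0000-0000-0000', '1 Main St');
        INSERT INTO purchases VALUES (1, 'keyboard'), (1, 'monitor');

        -- Identity only: no credit card, no shipping address.
        CREATE VIEW customer_identity AS
            SELECT id, name FROM customers;

        -- Purchased items only: no shipping address.
        CREATE VIEW customer_purchases AS
            SELECT c.id AS customer_id, c.name, p.item
            FROM customers c
            JOIN purchases p ON p.customer_id = c.id;
        """
    )
    return conn


def columns(conn: sqlite3.Connection, view: str) -> list:
    """Return the column names a consumer of `view` can see."""
    cursor = conn.execute(f"SELECT * FROM {view}")
    return [description[0] for description in cursor.description]
```

Each consumer queries its own view with ordinary SQL, while the underlying tables stay where they are.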
Organizations struggle to tame the data with compartmentalized databases. Databases are convenient for storing data because they ensure the atomicity, consistency, isolation and durability of transactions. They are also efficient in how the data is physically stored and accessed. By keeping separate databases for different purposes, companies try to stay nimble and reduce the time it takes to release operations to production. However, this only helps bring new offerings to market quickly. It does not address data analysis and insights. Consequently, data is staged from operations and loaded into a warehouse, which is better suited to gathering all the data for analysis. Even then, warehouses proliferate. In addition, workflows that extract, transform and load (ETL) data between operational databases are reused for different databases, which creates more copies of the data. Syntax and semantics vary for the same entity from database to database. Databases also become distributed and separated across regions, requiring web services to pull and process the data.
Companies use many types of databases because they serve different purposes. A relational database organizes data for efficient querying. A NoSQL database often organizes data for large-scale distributed batch processing. A graph database persists many forms of relationships between entities. Together, these databases fragment the view of the data from the perspective of the business domain. This calls for a unified experience regardless of where or how the data is stored. Data virtualization tries to address this with consistent, holistic, unified views and manipulation. It introduces a platform and tooling that abstract away the real topology of how the data is organized.
The word virtual indicates that we are no longer looking at the physical representation of the data but at its semantics. With data virtualization, we can explore and discover related information. We can also view the entire collection of databases as a unified repository. The actual data source need not be a database. It could be a database, a data warehouse, an Online Analytical Processing (OLAP) application, a web service, a Software-as-a-Service offering, a NoSQL database or any mix of these.
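To make the idea of a unified repository concrete, here is a minimal sketch of a virtual layer over two different kinds of sources. Everything in it is an assumption for illustration: `VirtualCatalog`, `sqlite_source` and the sample data are hypothetical names, not the API of any data virtualization product, and the in-memory list merely stands in for a web service response.

```python
import sqlite3
from typing import Callable, Dict, List

Row = Dict[str, object]


def sqlite_source(conn: sqlite3.Connection, table: str) -> Callable[[], List[Row]]:
    """Wrap a SQLite table so it yields rows as plain dictionaries."""
    def fetch() -> List[Row]:
        cursor = conn.execute(f"SELECT * FROM {table}")
        names = [description[0] for description in cursor.description]
        return [dict(zip(names, row)) for row in cursor]
    return fetch


class VirtualCatalog:
    """Hypothetical virtual layer: callers see named row sets, not sources."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], List[Row]]] = {}

    def register(self, name: str, fetch: Callable[[], List[Row]]) -> None:
        self._sources[name] = fetch

    def scan(self, name: str) -> List[Row]:
        return list(self._sources[name]())

    def join(self, left: str, right: str, key: str) -> List[Row]:
        """Hash join two sources on a shared key, wherever they live."""
        index: Dict[object, List[Row]] = {}
        for row in self.scan(right):
            index.setdefault(row[key], []).append(row)
        return [
            {**l_row, **r_row}
            for l_row in self.scan(left)
            for r_row in index.get(l_row[key], [])
        ]


def build_demo_catalog() -> VirtualCatalog:
    # Source 1: a relational database (hypothetical customers table).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")

    # Source 2: stand-in for a web service or SaaS response.
    orders = [
        {"customer_id": 1, "item": "keyboard"},
        {"customer_id": 1, "item": "monitor"},
    ]

    catalog = VirtualCatalog()
    catalog.register("customers", sqlite_source(conn, "customers"))
    catalog.register("orders", lambda: orders)
    return catalog
```

A call such as `build_demo_catalog().join("customers", "orders", "customer_id")` combines rows from both sources without the caller knowing which one is a database and which one is a service.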
Data virtualization users prefer a certain degree of consolidation and consistency. It is easier to query everything with the same syntax than to change it for every source. Even though virtualization may aim to span a vast breadth of technologies and software stacks, it cannot be a panacea. Virtualization therefore runs the risk of becoming fragmented just like the databases. Some have taken the question a step further. Can each database come with its own logic and be granular enough to be made available over the web? In other words, can each data source be a service in itself, so that databases and data virtualization are no longer the frontend for users, who can instead mix and match different data sources with the same programmability over the web? This so-called microservices architecture puts clean boundaries around the source of truth and still hides the complexity of a farm or a cluster behind the service. While services are great for programmers, they are not intended for users who want to work with the data visually using a tool. Data virtualization has therefore moved even closer to the user by pushing the microservices down as sources of data. Finally, data virtualization offers extensive capabilities for browsing and searching the data across all of these sources.
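One way to picture a microservice pushed down as a source of data is an adapter that translates a tabular filter into service parameters. The sketch below is an assumption, not a description of any particular product: `ServiceSource` and `fake_orders_service` are hypothetical, and the fake function stands in for an HTTP call to the service.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


def fake_orders_service(params: Dict[str, object]) -> List[Row]:
    """Hypothetical stand-in for an HTTP GET on an orders microservice."""
    data = [
        {"order_id": 10, "customer_id": 1, "item": "keyboard"},
        {"order_id": 11, "customer_id": 2, "item": "mouse"},
    ]
    return [
        row for row in data
        if all(row.get(key) == value for key, value in params.items())
    ]


class ServiceSource:
    """Expose a service as a row source with simple equality filters."""

    def __init__(
        self,
        call: Callable[[Dict[str, object]], List[Row]],
        filterable: Iterable[str],
    ) -> None:
        self._call = call
        # Fields the service itself is assumed to be able to filter on.
        self._filterable = set(filterable)

    def rows(self, where: Optional[Row] = None) -> List[Row]:
        where = where or {}
        # Filters the service understands are sent along with the request.
        pushed = {k: v for k, v in where.items() if k in self._filterable}
        # The remaining filters are applied locally on the returned rows.
        residual = {k: v for k, v in where.items() if k not in self._filterable}
        fetched = self._call(pushed)
        return [
            row for row in fetched
            if all(row.get(key) == value for key, value in residual.items())
        ]
```

With `source = ServiceSource(fake_orders_service, ["customer_id"])`, a call like `source.rows({"customer_id": 1})` lets the service do the filtering, while a filter on `item` is applied by the adapter. The user of a visual tool sees the same row-oriented interface either way.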
