While the discussion of SQL and NoSQL stacks for data storage and querying has leaned on the separation of transactional versus analytical processing, there is another angle to this dichotomy from a data science perspective. NoSQL storage, particularly key-value stores, is incredibly fast and efficient at ingesting data, but its queries are inefficient. SQL stores are just the opposite: they are efficient at querying the data but ingest it slowly and inefficiently.
Organizations need data stacks that deliver the results of heavy data processing within a limited time window, and SQL databases are often overused here for historical reasons. When results are expected within a fixed window, the database's inefficiency at taking in data delays the actual processing, which in turn creates an operational risk. The limitation comes from the data structure used in the relational database. Storing a 1 TB table requires the B+ tree to grow to six levels. If the database server has on the order of 125 GB of memory, less than 25 percent of the data can stay in cache. Every insertion must then read, on average, three data blocks from disk to reach the leaf node, and it evicts the same number of blocks from the cache. This dramatic increase in I/O is what makes these databases so inefficient at ingesting data.
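As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. The block size and fan-out are illustrative assumptions, not properties of any particular database engine; they simply show how a 1 TB table ends up with a six-level tree of which only a small fraction fits in a 125 GB cache.

```python
import math

# Illustrative assumptions, not measurements: 1 TB of row data stored in
# 8 KB blocks, a B+ tree fan-out of 64 pointers per node, and roughly
# 125 GB of buffer cache on the database server.
TABLE_BYTES = 1 * 1024**4
BLOCK_BYTES = 8 * 1024
FANOUT = 64
CACHE_BYTES = 125 * 1024**3

leaf_blocks = TABLE_BYTES // BLOCK_BYTES
height = 1 + math.ceil(math.log(leaf_blocks, FANOUT))  # internal levels plus the leaf level
cached_fraction = CACHE_BYTES / TABLE_BYTES

print(f"leaf blocks            : {leaf_blocks:,}")       # 134,217,728
print(f"tree height            : {height} levels")       # 6 with these assumptions
print(f"fraction kept in cache : {cached_fraction:.0%}") # about 12%, well under 25%
# With most of the tree uncached, each insert walks several uncached
# blocks on its way down to a leaf and evicts as many blocks from cache.
```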
When the data is stored in NoSQL stores, querying is inefficient because a single horizontal data fragment can be spread across many files, which lengthens reads. An index on the data helps, but the volume is still large. If a single B+ tree is assumed to hold 1,024 blocks, each search needs to access log2(1024) = 10 blocks, and there can be many files, each with its own index; with 16 B+ trees, a total of 160 blocks would be read instead of the 10 blocks through one index on a database server.
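The arithmetic behind that comparison is small enough to write down; the block and index counts below are the ones used in the text.

```python
import math

# Sixteen per-file indexes of 1,024 blocks each versus one consolidated
# index on a database server, using the counts from the text.
BLOCKS_PER_INDEX = 1024
NUM_INDEXES = 16

reads_single_index = int(math.log2(BLOCKS_PER_INDEX))       # 10 block reads per search
reads_per_file_indexes = NUM_INDEXES * reads_single_index   # 160 block reads per search

print(f"one consolidated index : {reads_single_index} blocks per search")
print(f"{NUM_INDEXES} per-file indexes    : {reads_per_file_indexes} blocks per search")
```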
Cloud document stores can provide high throughput and low latency for queries by offering a single container for data of unlimited size and charging based on reads and writes. To improve the performance of such a cloud database, set the connection policy to direct mode, set the protocol to TCP, avoid the startup latency of the first request, collocate clients in the same Azure region as the data, and increase the number of threads or tasks.
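A minimal sketch of that client-side tuning, assuming an Azure Cosmos DB account and the azure-cosmos Python SDK, is shown below. The account URL, key, database and container names, region, and user ids are placeholders; direct-mode TCP connectivity is a connection-policy setting in the SDKs that expose it (the .NET SDK, for example), so the Python sketch focuses on the remaining levers: one shared client, an explicit warm-up call, a preferred (collocated) region, and more concurrent tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from azure.cosmos import CosmosClient

ACCOUNT_URL = "https://<account>.documents.azure.com:443/"   # placeholder
ACCOUNT_KEY = "<primary-key>"                                 # placeholder

# Create the client once and reuse it; per-request clients re-pay the
# connection and metadata startup cost on every call.
client = CosmosClient(
    ACCOUNT_URL,
    credential=ACCOUNT_KEY,
    preferred_locations=["East US"],  # assumed keyword and region: keep clients and data collocated
)
container = client.get_database_client("insurance").get_container_client("claims")

def warm_up() -> None:
    """Issue a cheap query so the first real request does not absorb the startup latency."""
    list(container.query_items("SELECT VALUE COUNT(1) FROM c",
                               enable_cross_partition_query=True))

def fetch_claims(user_id: str) -> list:
    """Run a parameterized query for one user's claims."""
    return list(container.query_items(
        query="SELECT * FROM c WHERE c.userId = @uid",
        parameters=[{"name": "@uid", "value": user_id}],
        enable_cross_partition_query=True,
    ))

if __name__ == "__main__":
    warm_up()
    user_ids = ["u1001", "u1002", "u1003", "u1004"]  # placeholder ids
    # More threads/tasks keep the provisioned throughput busy.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetch_claims, user_ids))
```

In practice, keeping one long-lived client and parallelizing the queries tends to matter more than any single connection setting.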
Ingestion also depends on the data sources. Different data sources often require different connectors, but fortunately these come out of the box from cloud services. Many analytical stacks can connect to the storage through existing, readily available connectors, reducing the need for custom integration. The analysis services offered by the public cloud are rich, robust, and flexible to work with.
When companies want to generate value and revenue from accumulated data assets, they are not looking to treat all data equally the way some analytical systems do. Creating an efficient pipeline that scales to different activities, and using the appropriate storage and analytical systems for specific types of data, helps meet business goals with rapid development. The data collected for insurance technology software can be as varied as user information, claim details, social media or government data, demographic information, the current state of the user, medical history, income category or credit score, agent-customer interactions, and call center or support tickets. ML templates and BI tools may differ for each of these data categories, but using data pipelines and data engineering best practices brings a cloud-first approach that delivers on the performance requirements expected for those use cases.