While the discussion of SQL and NoSQL stacks for data storage and querying has leaned on the separation of transactional versus analytical processing, there is another angle to this dichotomy from a data science perspective. NoSQL storage, particularly key-value stores, is incredibly fast and efficient at ingesting data, but its queries are inefficient. SQL stores are just the opposite: they are efficient at querying the data but ingest it slowly and inefficiently.
Organizations need data stacks that deliver the results of heavy data processing within a limited time window, and SQL databases are often overused here for historical reasons. When results are expected within a fixed window, the database's inefficiency at taking in data delays the actual processing, which in turn creates an operational risk. The limitation comes from the data structure used in the relational database. Storing a 1 TB table requires the B+ tree to grow to six levels. If the database server has on the order of 125 GB of memory, less than 25 percent of the data can stay in cache. Every insertion must then read, on average, three data blocks from disk to reach the leaf node, and it evicts the same number of blocks from the cache. This dramatic increase in I/O is what makes these databases so inefficient at ingesting data.
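As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. The block size and fan-out are illustrative assumptions, not properties of any particular database engine; they simply show how a 1 TB table ends up with a six-level tree of which only a small fraction fits in a 125 GB cache.

```python
import math

# Illustrative assumptions, not measurements: 1 TB of row data stored in
# 8 KB blocks, a B+ tree fan-out of 64 pointers per node, and roughly
# 125 GB of buffer cache on the database server.
TABLE_BYTES = 1 * 1024**4
BLOCK_BYTES = 8 * 1024
FANOUT = 64
CACHE_BYTES = 125 * 1024**3

leaf_blocks = TABLE_BYTES // BLOCK_BYTES
height = 1 + math.ceil(math.log(leaf_blocks, FANOUT))  # internal levels plus the leaf level
cached_fraction = CACHE_BYTES / TABLE_BYTES

print(f"leaf blocks            : {leaf_blocks:,}")       # 134,217,728
print(f"tree height            : {height} levels")       # 6 with these assumptions
print(f"fraction kept in cache : {cached_fraction:.0%}") # about 12%, well under 25%
# With most of the tree uncached, each insert walks several uncached
# blocks on its way down to a leaf and evicts as many blocks from cache.
```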
When the data is stored in NoSQL stores, querying is inefficient because a single horizontal data fragment can be spread across many files, which lengthens reads. An index on the data helps, but the volume is still large. If a single B+ tree is assumed to hold 1,024 blocks, each search needs to access log2(1024) = 10 blocks, and there can be many files, each with its own index; with 16 B+ trees, a total of 160 blocks would be read instead of the 10 blocks through one index on a database server.
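The arithmetic behind that comparison is small enough to write down; the block and index counts below are the ones used in the text.

```python
import math

# Sixteen per-file indexes of 1,024 blocks each versus one consolidated
# index on a database server, using the counts from the text.
BLOCKS_PER_INDEX = 1024
NUM_INDEXES = 16

reads_single_index = int(math.log2(BLOCKS_PER_INDEX))       # 10 block reads per search
reads_per_file_indexes = NUM_INDEXES * reads_single_index   # 160 block reads per search

print(f"one consolidated index : {reads_single_index} blocks per search")
print(f"{NUM_INDEXES} per-file indexes    : {reads_per_file_indexes} blocks per search")
```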
Cloud document stores can provide high throughput and low latency for queries by offering a single container for data of unlimited size and charging based on reads and writes. To improve the performance of such a cloud database, set the connection policy to direct mode, set the protocol to TCP, avoid the startup latency of the first request, collocate clients in the same Azure region as the data, and increase the number of threads or tasks.
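A minimal sketch of that client-side tuning, assuming an Azure Cosmos DB account and the azure-cosmos Python SDK, is shown below. The account URL, key, database and container names, region, and user ids are placeholders; direct-mode TCP connectivity is a connection-policy setting in the SDKs that expose it (the .NET SDK, for example), so the Python sketch focuses on the remaining levers: one shared client, an explicit warm-up call, a preferred (collocated) region, and more concurrent tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from azure.cosmos import CosmosClient

ACCOUNT_URL = "https://<account>.documents.azure.com:443/"   # placeholder
ACCOUNT_KEY = "<primary-key>"                                 # placeholder

# Create the client once and reuse it; per-request clients re-pay the
# connection and metadata startup cost on every call.
client = CosmosClient(
    ACCOUNT_URL,
    credential=ACCOUNT_KEY,
    preferred_locations=["East US"],  # assumed keyword and region: keep clients and data collocated
)
container = client.get_database_client("insurance").get_container_client("claims")

def warm_up() -> None:
    """Issue a cheap query so the first real request does not absorb the startup latency."""
    list(container.query_items("SELECT VALUE COUNT(1) FROM c",
                               enable_cross_partition_query=True))

def fetch_claims(user_id: str) -> list:
    """Run a parameterized query for one user's claims."""
    return list(container.query_items(
        query="SELECT * FROM c WHERE c.userId = @uid",
        parameters=[{"name": "@uid", "value": user_id}],
        enable_cross_partition_query=True,
    ))

if __name__ == "__main__":
    warm_up()
    user_ids = ["u1001", "u1002", "u1003", "u1004"]  # placeholder ids
    # More threads/tasks keep the provisioned throughput busy.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetch_claims, user_ids))
```

In practice, keeping one long-lived client and parallelizing the queries tends to matter more than any single connection setting.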
Ingestion also depends on the data sources. Different data sources often require different connectors, but fortunately these come out of the box from cloud services. Many analytical stacks can connect to the storage through existing, readily available connectors, reducing the need for custom integration. The analysis services offered by the public cloud are rich, robust, and flexible to work with.
When companies want to generate value and revenue from accumulated data assets, they are not looking to treat all data equally the way some analytical systems do. Creating an efficient pipeline that scales to different activities, and using the appropriate storage and analytical systems for specific types of data, helps meet business goals with rapid development. The data collected for insurance technology software can be as varied as user information, claim details, social media or government data, demographic information, the current state of the user, medical history, income category or credit score, agent-customer interactions, and call center or support tickets. ML templates and BI tools may differ for each of these data categories, but using data pipelines and data engineering best practices brings a cloud-first approach that delivers on the performance requirements expected for those use cases.