The unification of processing over sequences:
While we have elaborated on dedicated storage types for sequences, processing over these storage types has so far only been enumerated as sequential, batch, or stream-oriented, depending on the data store. In this section, we argue that processing exists to serve analytics and is not easily confined to a specific kind of storage. Analytical packages like Apache Flink are tending towards combining these processing options while remaining a champion of, say, stream processing.
There are a few advantages to this scenario for developers engaged in analysis. First, they have a broad range of capabilities at hand from the same package. The code that they write is more maintainable: it is written once and targets all the capabilities via a single package. Second, the package decouples the processing from storage concerns, allowing algorithms to change for the same strategy and the same data set. Third, it is easier for developers to target the data set with the same package if they do not have to concern themselves with scaling to large data sizes. When data sizes increase by orders of magnitude, the code to process the sequences changes considerably, but if that onus is taken on by a package rather than by the developer's custom code, the situation improves considerably and the code requires little or no attention in the future. Finally, the packages themselves are tried and tested in their integration with other packages or storage products that serve tier 2 storage over streams, blobs, files or blocks. This makes it more appealing to use the same package for a variety of purposes. Open source packages have demonstrated code reusability far more than buy-your-own options. Code written against open source packages is also suitable to code being published to other
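The second advantage, decoupling processing from storage, can be illustrated with a minimal sketch in plain Python. This is not Flink code; the names `count_terms`, `from_batch` and `from_stream` are hypothetical, and the two sources merely stand in for a stored batch and a continuous stream. The point is only that the algorithm is written once against an iterable of sequences and does not know which kind of source it reads from.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_terms(sequences: Iterable[str]) -> Counter:
    """Count terms across sequences; written once, unaware of the source."""
    counts: Counter = Counter()
    for sequence in sequences:
        counts.update(sequence.split())
    return counts


def from_batch(records: List[str]) -> Iterable[str]:
    """Hypothetical batch source: the whole data set is already in memory."""
    return records


def from_stream(records: List[str]) -> Iterator[str]:
    """Hypothetical stream source: records arrive one at a time."""
    for record in records:
        yield record


data = ["a b a", "b c"]
batch_result = count_terms(from_batch(data))
stream_result = count_terms(from_stream(data))
print(batch_result == stream_result)
```

Swapping the source does not touch `count_terms`, which is the kind of separation a unified package would provide at a much larger scale.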
When groups and sequences run in large numbers, they can be collected in batches. When these batches are stored, they can be stored as blobs or files. Blobs have several advantages similar to log indexes: they can participate in index creation and term-based search while remaining web accessible, and the product offers unlimited storage. The latter aspect allows all groups to be stored without any capacity planning, since capacity can be increased without limit. Streams, on the other hand, are continuous, which helps with processing groups in windows or segments. Both have their benefits.
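As a rough sketch of the contrast between batches and windows, the plain-Python example below cuts both a finite batch and an unbounded stream into fixed-size segments with the same helper. The names `tumbling_windows` and `events` are hypothetical, and count-based windows are an assumption made for illustration; a real stream processor would typically offer time-based windows as well.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def tumbling_windows(source: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping windows of at most `size` items."""
    iterator = iter(source)
    while True:
        window = list(islice(iterator, size))
        if not window:
            return  # the source is exhausted (only happens for finite input)
        yield window


def events() -> Iterator[int]:
    """Hypothetical unbounded stream of events."""
    n = 0
    while True:
        yield n
        n += 1


# A finite batch is segmented completely; the last window may be short.
batch_windows = list(tumbling_windows(range(10), 4))

# A continuous stream never ends, so we only take the first three windows.
stream_windows = list(islice(tumbling_windows(events(), 4), 3))

print(batch_windows)
print(stream_windows)
```

The batch case terminates on its own because the stored data is finite, while the stream case has to be bounded by the consumer, which is why windows are the natural unit of work for continuous data.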