Monday, August 3, 2020

Stream store with TLS

Testing pravega-tls mode:
Pravega is a stream store that comprises a control plane server and a data plane server. The former is called the controller and the latter is called the segmentstore. Both communicate with clients, but only the controller provisions the REST API.
Both the controller and the segmentstore can enable TLS. The Pravega source includes default certificates, keys, a keystore and a truststore in its config folder.
TLS is enabled by including the following options in the controller configuration:
  controller.security.tls.enable: "true"
  controller.security.tls.trustStore.location: /opt/pravega/conf/client.truststore.jks
  controller.security.tls.server.certificate.location: /opt/pravega/conf/server-cert.crt
  controller.security.tls.server.privateKey.location: /opt/pravega/conf/server-key.key
  controller.security.tls.server.keyStore.location: /opt/pravega/conf/server.keystore.jks
  controller.security.tls.server.keyStore.pwd.location: /opt/pravega/conf/server.keystore.jks.passwd
  controller.segmentstore.connect.channel.tls: "true"
where the corresponding files are available in the pravega/pravega:latest image that runs as a container on the host.
This configuration did not take effect because the controller did not pick up the settings, as shown in the log excerpt below:
2020-08-03 00:04:17,403 1512 [ControllerServiceStarter STARTING] INFO  i.p.c.s.ControllerServiceStarter - Controller serviceConfig = ControllerServiceConfigImpl(threadPoolSize=80, storeClientConfig=StoreClientConfigImpl(storeType=PravegaTable, zkClientConfig=Optional[ZKClientConfigImpl(connectionString: zookeeper-client:2181, namespace: pravega/pravega, initialSleepInterval: 5000, maxRetries: 5, sessionTimeoutMs: 10000, secureConnectionToZooKeeper: false, trustStorePath is unspecified, trustStorePasswordPath is unspecified)]), hostMonitorConfig=io.pravega.controller.store.host.impl.HostMonitorConfigImpl@3b434741, controllerClusterListenerEnabled=true, timeoutServiceConfig=TimeoutServiceConfig(maxLeaseValue=120000), tlsEnabledForSegmentStore=, eventProcessorConfig=Optional[ControllerEventProcessorConfigImpl(scopeName=_system, commitStreamName=_commitStream, commitStreamScalingPolicy=ScalingPolicy(scaleType=FIXED_NUM_SEGMENTS, targetRate=0, scaleFactor=0, minNumSegments=2), abortStreamName=_abortStream, abortStreamScalingPolicy=ScalingPolicy(scaleType=FIXED_NUM_SEGMENTS, targetRate=0, scaleFactor=0, minNumSegments=2), scaleStreamName=_requeststream, scaleStreamScalingPolicy=ScalingPolicy(scaleType=FIXED_NUM_SEGMENTS, targetRate=0, scaleFactor=0, minNumSegments=2), commitReaderGroupName=commitStreamReaders, commitReaderGroupSize=1, abortReaderGroupName=abortStreamReaders, abortReaderGroupSize=1, scaleReaderGroupName=scaleGroup, scaleReaderGroupSize=1, commitCheckpointConfig=CheckpointConfig(type=Periodic, checkpointPeriod=CheckpointConfig.CheckpointPeriod(numEvents=10, numSeconds=10)), abortCheckpointConfig=CheckpointConfig(type=Periodic, checkpointPeriod=CheckpointConfig.CheckpointPeriod(numEvents=10, numSeconds=10)), scaleCheckpointConfig=CheckpointConfig(type=None, checkpointPeriod=null), rebalanceIntervalMillis=120000)], gRPCServerConfig=Optional[GRPCServerConfigImpl(port: 9090, publishedRPCHost: null, publishedRPCPort: 9090, authorizationEnabled: false, userPasswordFile is unspecified, tokenSigningKey is specified, accessTokenTTLInSeconds: 600, tlsEnabled: false, tlsCertFile is unspecified, tlsKeyFile is unspecified, tlsTrustStore is unspecified, replyWithStackTraceOnError: false, requestTracingEnabled: true)], restServerConfig=Optional[RESTServerConfigImpl(host: 0.0.0.0, port: 10080, tlsEnabled: false, keyFilePath is unspecified, keyFilePasswordPath is unspecified)])
When the values were overridden, the following error was encountered:
java.lang.IllegalArgumentException: File does not contain valid certificates: /opt/pravega/conf/client.truststore.jks
        at io.netty.handler.ssl.SslContextBuilder.trustManager(SslContextBuilder.java:182)
        at io.pravega.client.netty.impl.ConnectionPoolImpl.getSslContext(ConnectionPoolImpl.java:280)
        at io.pravega.client.netty.impl.ConnectionPoolImpl.getChannelInitializer(ConnectionPoolImpl.java:237)
        at io.pravega.client.netty.impl.ConnectionPoolImpl.establishConnection(ConnectionPoolImpl.java:194)
        at io.pravega.client.netty.impl.ConnectionPoolImpl.getClientConnection(ConnectionPoolImpl.java:128)
        at io.pravega.client.netty.impl.ConnectionFactoryImpl.establishConnection(ConnectionFactoryImpl.java:62)
        at io.pravega.client.netty.impl.RawClient.<init>(RawClient.java:87)
        at io.pravega.controller.server.SegmentHelper.updateTableEntries(SegmentHelper.java:403)
        at io.pravega.controller.store.stream.PravegaTablesStoreHelper.lambda$addNewEntry$10(PravegaTablesStoreHelper.java:178)
        at io.pravega.controller.store.stream.PravegaTablesStoreHelper.lambda$null$54(PravegaTablesStoreHelper.java:534)
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.security.cert.CertificateException: found no certificates in input stream
        at io.netty.handler.ssl.PemReader.readCertificates(PemReader.java:98)
        at io.netty.handler.ssl.PemReader.readCertificates(PemReader.java:64)
        at io.netty.handler.ssl.SslContext.toX509Certificates(SslContext.java:1071)
        at io.netty.handler.ssl.SslContextBuilder.trustManager(SslContextBuilder.java:180)
        ... 19 common frames omitted

This required passing a custom certificate, key, truststore and keystore to the controller, in the form of:
        TLS_ENABLED = "true"
        TLS_KEY_FILE = "/etc/secret-volume/controller01.key.pem"
        TLS_CERT_FILE = "/etc/secret-volume/controller01.pem"
        TLS_TRUST_STORE = "/etc/secret-volume/client.truststore.jks"
        TLS_ENABLED_FOR_SEGMENT_STORE = "true"
        REST_KEYSTORE_FILE_PATH = "/etc/secret-volume/server01.keystore.jks"
        REST_KEYSTORE_PASSWORD_FILE_PATH = "/etc/secret-volume/pass-secret-tls"

These options are also hard to pass without the use of environment variables, especially when Pravega is installed and configured by other software that manages the container image. One way to overcome this is to edit the ConfigMap with the overrides above and recreate the pods.
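Once the controller presents a valid certificate, a client can verify the TLS setup by connecting over a tls:// controller URI with the matching CA certificate. A minimal sketch, assuming the Pravega Java client and illustrative endpoint and certificate paths (none of these values come from the deployment above):
    import io.pravega.client.ClientConfig;
    import io.pravega.client.admin.StreamManager;
    import java.net.URI;

    public class TlsClientCheck {
        public static void main(String[] args) {
            // The tls:// scheme makes the client negotiate TLS with the controller.
            ClientConfig config = ClientConfig.builder()
                    .controllerURI(URI.create("tls://localhost:9090"))   // illustrative endpoint
                    .trustStore("/etc/secret-volume/ca-cert.crt")        // CA certificate path is an assumption
                    .validateHostName(false)                             // relax host name checks for test certificates
                    .build();
            // Any admin call exercises the handshake; listing scopes is a cheap probe.
            try (StreamManager manager = StreamManager.create(config)) {
                manager.listScopes().forEachRemaining(System.out::println);
            }
        }
    }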

Sunday, August 2, 2020

Support for small to large footprint introspection database and query

An application can choose to offer a set of packaged queries that the user picks from a dropdown menu, while internalizing the upload of the corresponding programs, their execution and the return of the results. One of the restrictions that comes with packaged queries exported via REST APIs is their limited ability to scale, since they consume significant resources on the backend and continue to run for a long time. These restrictions cannot be relaxed without some reduction in their resource usage. The API must provide a way for consumers to launch several queries with trackers, and the queries should complete reliably even if they are done one by one. This is facilitated with the help of a reference to the query and a progress indicator. The reference is merely an opaque identifier that only the system issues and uses to look up the status. The indicator could be another API that takes the reference and returns the status. It is relatively easy for the system to separate read-only status information from read-write operations, so the number of times the status indicator is called does not degrade the rest of the system. There is a clean separation between the status information, which is usually collected periodically or pushed, and the rest of the system. Separating read-write from read-only also allows them to be treated differently. For example, it is possible to replace the technology for the read-only path separately from the technology for the read-write path, and even the read-only technology can be swapped from one to another for improvements on that side.
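A minimal sketch of this reference-and-status pattern, assuming an in-memory tracker behind the API layer; the class and method names (QueryTracker, submit, status) are illustrative rather than from any particular product:
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical tracker: submit() returns an opaque reference, status() is a read-only lookup.
    public class QueryTracker {
        public enum State { RUNNING, COMPLETED, FAILED }
        public record QueryStatus(State state, String result) {}

        private final ExecutorService pool = Executors.newFixedThreadPool(4);
        private final Map<String, QueryStatus> statuses = new ConcurrentHashMap<>();

        public String submit(Callable<String> packagedQuery) {
            String reference = UUID.randomUUID().toString();   // opaque identifier issued by the system
            statuses.put(reference, new QueryStatus(State.RUNNING, null));
            pool.submit(() -> {
                try {
                    statuses.put(reference, new QueryStatus(State.COMPLETED, packagedQuery.call()));
                } catch (Exception e) {
                    statuses.put(reference, new QueryStatus(State.FAILED, e.getMessage()));
                }
            });
            return reference;
        }

        // Callers can poll this as often as they like without touching the read-write path.
        public QueryStatus status(String reference) {
            return statuses.get(reference);
        }
    }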
The design of REST APIs generally follows convention. This practice gives well recognized URI qualifier patterns, query parameters and methods. Exceptions, errors and logging are typically handled by making the best use of the HTTP protocol, and the HTTP/1.1 revision of that protocol helps with this.
REST API calls can also be translated to SQL queries with the help of engines like Presto or LogParser.
Tools like LogParser allow SQL queries to be executed over enumerables. SQL has supported user-defined operators for a while now. Aggregation operations have the benefit that they can support both batch and streaming mode operations. These operations can therefore work over large datasets because they view only a portion at a time. The batch operation can be parallelized across a number of processors, which has generally been the case with Big Data and cluster mode operations. Both batch and stream processing can operate on unstructured data.
Presto from Facebook is a distributed SQL query engine that can operate on data from various data sources, supporting ad hoc queries in near real-time.
Just like the standard query operators of .NET, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, offers to run over any kind of data source, not just tables.
Although Apache Spark query code and Apache Flink query code look very similar, the former treats stream processing as a special case of batch processing while the latter does just the reverse.
Also, Apache Flink provides a SQL abstraction over its Table API.
While Apache Spark provides ".map(...)" and ".reduce(...)" programmability syntax to support batch-oriented processing, Apache Flink provides a Table API with ".groupBy(...)" and ".orderBy(...)" syntax. It provides a SQL abstraction and supports stream processing as the norm.
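A small illustration of that Table API style, assuming a Flink Table environment and a synthetic datagen source; the table and column names are made up for the sketch:
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import static org.apache.flink.table.api.Expressions.$;

    public class TableApiSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inStreamingMode().build());
            // Synthetic source so the sketch is self-contained; a real deployment points at a stream.
            tEnv.executeSql(
                    "CREATE TABLE Orders (userId STRING, amount DOUBLE) " +
                    "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");
            // groupBy/select on the Table API; the same query could be written in Flink SQL.
            Table totals = tEnv.from("Orders")
                    .groupBy($("userId"))
                    .select($("userId"), $("amount").sum().as("total"));
            totals.execute().print();
        }
    }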
The APIs vary when the engines change, even though they form an interface through which any virtualized query can be executed, and each API has its own query processing syntax. They vary because the requests and responses vary in size and content. Some APIs are better suited for stream processing and others for batch processing.
Analytical applications that perform queries on data are best visualized via catchy charts and graphs. Most applications using storage and analytical products have a user interface that lets users create containers for their data but gives the customer very little leverage to author streaming queries. Even the option to generate a query is often nothing more than the upload of a program as a compiled-language artifact. The application reuses a template to make requests and responses, but this is very different from the long-running and streaming responses that are typical for these queries.
Query Interface is not about keywords.
When it comes to a user interface for querying data, people often think of the Google search interface and its search terms. Avoid that model: it is not necessarily the right interface for streaming queries. If anything, it is simplistic and keyword based, not operator based. The Splunk interface is much cleaner since it has been seasoned by search over machine data. Streaming queries are not table queries, and they don't operate only on big data in the Hadoop file system. They do have standard operators that translate directly to operator keywords and can be chained in a Java file in the same manner as pipe operators on the search bar. This translation of query operator expressions on the search bar into a Java-language Flink streaming query is a patentable design for a streaming user interface, and one that will prove immensely flexible for direct execution of user queries as compared to packaged queries.
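A rough sketch of that translation idea, assuming a toy pipe expression such as "events | filter level=ERROR | countBy host" and a Flink DataStream of log events; the LogEvent type and the mapping are invented for illustration:
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SearchBarTranslator {
        // Invented event type for the sketch; a Flink POJO with public fields.
        public static class LogEvent {
            public String level;
            public String host;
            public LogEvent() {}
            public LogEvent(String level, String host) { this.level = level; this.host = host; }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<LogEvent> events = env.fromElements(
                    new LogEvent("ERROR", "host-1"),
                    new LogEvent("INFO", "host-1"),
                    new LogEvent("ERROR", "host-2"));

            // "events | filter level=ERROR | countBy host" chained as DataStream operators.
            events.filter(e -> "ERROR".equals(e.level))
                  .map(e -> e.host)
                  .returns(Types.STRING)
                  .keyBy(host -> host)
                  .print();   // printing the keyed stream keeps the sketch short; a real countBy would aggregate

            env.execute("search-bar translation sketch");
        }
    }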

 

Saturday, August 1, 2020

Support for small to large footprint introspection database and query

One of the restrictions that comes with packaged queries exported via REST APIs is their limited ability to scale, since they consume significant resources on the backend and continue to run for a long time. These restrictions cannot be relaxed without some reduction in their resource usage. The API must provide a way for consumers to launch several queries with trackers, and the queries should complete reliably even if they are done one by one. This is facilitated with the help of a reference to the query and a progress indicator. The reference is merely an opaque identifier that only the system issues and uses to look up the status. The indicator could be another API that takes the reference and returns the status. It is relatively easy for the system to separate read-only status information from read-write operations, so the number of times the status indicator is called does not degrade the rest of the system. There is a clean separation between the status information, which is usually collected periodically or pushed, and the rest of the system. Separating read-write from read-only also allows them to be treated differently. For example, it is possible to replace the technology for the read-only path separately from the technology for the read-write path, and even the read-only technology can be swapped from one to another for improvements on that side.
The design of REST APIs generally follows convention. This practice gives well recognized URI qualifier patterns, query parameters and methods. Exceptions, errors and logging are typically handled by making the best use of the HTTP protocol, and the HTTP/1.1 revision of that protocol helps with this.
REST API calls can also be translated to SQL queries with the help of engines like Presto or LogParser.
Tools like LogParser allow SQL queries to be executed over enumerables. SQL has supported user-defined operators for a while now. Aggregation operations have the benefit that they can support both batch and streaming mode operations. These operations can therefore work over large datasets because they view only a portion at a time. The batch operation can be parallelized across a number of processors, which has generally been the case with Big Data and cluster mode operations. Both batch and stream processing can operate on unstructured data.
Presto from Facebook is a distributed SQL query engine that can operate on data from various data sources, supporting ad hoc queries in near real-time.
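One way an API layer could push such a query down to Presto is through its JDBC driver. A hedged sketch, assuming the presto-jdbc driver is on the classpath and a coordinator at localhost:8080 with a hive catalog; the host, catalog, table and column names are placeholders:
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PrestoQuerySketch {
        public static void main(String[] args) throws Exception {
            // The Presto JDBC URL takes the form jdbc:presto://host:port/catalog/schema.
            String url = "jdbc:presto://localhost:8080/hive/default";
            try (Connection conn = DriverManager.getConnection(url, "introspection", null);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT host, count(*) AS errors FROM logs WHERE level = 'ERROR' GROUP BY host")) {
                while (rs.next()) {
                    System.out.println(rs.getString("host") + " -> " + rs.getLong("errors"));
                }
            }
        }
    }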
Just like the standard query operators of .NET, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, offers to run over any kind of data source, not just tables.
Although Apache Spark query code and Apache Flink query code look very similar, the former treats stream processing as a special case of batch processing while the latter does just the reverse.
Also, Apache Flink provides a SQL abstraction over its Table API.
While Apache Spark provides ".map(...)" and ".reduce(...)" programmability syntax to support batch-oriented processing, Apache Flink provides a Table API with ".groupBy(...)" and ".orderBy(...)" syntax. It provides a SQL abstraction and supports stream processing as the norm.
The APIs vary when the engines change, even though they form an interface through which any virtualized query can be executed, and each API has its own query processing syntax. They vary because the requests and responses vary in size and content. Some APIs are better suited for stream processing and others for batch processing.

Friday, July 31, 2020

Support for small to large footprint introspection database and query

The query processing ability from the API will also help automations that rely on scripts run from the command line. Administrators generally prefer invocation from the command line whenever possible. A health check performed by querying the introspection datastore, as described above, is not only authoritative but also richer, since all the components contribute to the data.
The REST API is a front for other tools and packages like the command line interface and SDK, which enable programmability for packaged queries of the introspection store. An SDK does away with the requirement on the client to make the REST calls directly. This is a best practice favored by many cloud providers who know their cloud computing capability will be called from a variety of clients and devices. A variety of programming languages can be used to provide the SDK, which widens the client base. The SDKs are provided as a mere convenience, so this could be something that the developer community initiates and maintains. Another benefit of using the SDK is that the API versioning and service maintenance onus is somewhat reduced, since more customers use their language-friendly SDK while the internal wire-call compatibility may change. SDKs also support platform-independent and interoperable programming, which helps testing and automation.
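A thin SDK of this kind can be little more than a typed method over the REST call. A minimal sketch using the JDK's built-in HTTP client, assuming a /queries endpoint that returns the opaque query reference discussed in the other posts; the endpoint and class names are illustrative:
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Hypothetical client-side SDK class that hides the REST call from the caller.
    public class IntrospectionClient {
        private final HttpClient http = HttpClient.newHttpClient();
        private final String baseUrl;

        public IntrospectionClient(String baseUrl) {
            this.baseUrl = baseUrl;
        }

        // Submits a packaged query by name and returns the opaque reference issued by the service.
        public String submitPackagedQuery(String queryName) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/queries"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"name\":\"" + queryName + "\"}"))
                    .build();
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();   // assume the body carries the query reference
        }
    }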
Similarly, a command line interface (CLI) also improves the customer base. Tests and automations can make use of commands and scripts to drive the service. This is a very powerful technique for administrators as well as for day-to-day usage by end users. Command line usage has one additional benefit: it can be used for troubleshooting and diagnostics. While SDKs provide an additional layer and are hosted on the client side, a command line interface may be available on the server side and provides fewer variables to go wrong. With detailed logging and request-response capture, the CLI and SDK help ease the calls made to the services.
One of the restrictions that comes with packaged queries exported via REST APIs is their limited ability to scale, since they consume significant resources on the backend and continue to run for a long time. These restrictions cannot be relaxed without some reduction in their resource usage. The API must provide a way for consumers to launch several queries with trackers, and the queries should complete reliably even if they are done one by one. This is facilitated with the help of a reference to the query and a progress indicator. The reference is merely an opaque identifier that only the system issues and uses to look up the status. The indicator could be another API that takes the reference and returns the status. It is relatively easy for the system to separate read-only status information from read-write operations, so the number of times the status indicator is called does not degrade the rest of the system. There is a clean separation between the status information, which is usually collected periodically or pushed, and the rest of the system. Separating read-write from read-only also allows them to be treated differently. For example, it is possible to replace the technology for the read-only path separately from the technology for the read-write path, and even the read-only technology can be swapped from one to another for improvements on that side.

Thursday, July 30, 2020

Support for small to large footprint introspection database and query

The REST API is a front for other tools and packages like the command line interface and SDK, which enable programmability for packaged queries of the introspection store. An SDK does away with the requirement on the client to make the REST calls directly. This is a best practice favored by many cloud providers who know their cloud computing capability will be called from a variety of clients and devices. A variety of programming languages can be used to provide the SDK, which widens the client base. The SDKs are provided as a mere convenience, so this could be something that the developer community initiates and maintains. Another benefit of using the SDK is that the API versioning and service maintenance onus is somewhat reduced, since more customers use their language-friendly SDK while the internal wire-call compatibility may change. SDKs also support platform-independent and interoperable programming, which helps testing and automation.
Similarly, a command line interface (CLI) also improves the customer base. Tests and automations can make use of commands and scripts to drive the service. This is a very powerful technique for administrators as well as for day-to-day usage by end users. Command line usage has one additional benefit: it can be used for troubleshooting and diagnostics. While SDKs provide an additional layer and are hosted on the client side, a command line interface may be available on the server side and provides fewer variables to go wrong. With detailed logging and request-response capture, the CLI and SDK help ease the calls made to the services.

Wednesday, July 29, 2020

Support for small to large footprint introspection database and query

The application that uses the introspection store may access it over APIs. In such a case, the execution of a packaged query could imply launching a Flink job on a cluster and processing the results of the job. This is typical for any API implementation that involves asynchronous processing.
The execution of a query can take an arbitrary amount of time. The REST-based API implementation does not have to be blocking in nature, since the call sequences in the underlying analytical and storage systems are also asynchronous. The REST API can provide a handle and a status checker, or a webhook for completion reporting. This kind of mechanism is common at any layer and is also available over REST.
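On the caller's side, the handle-and-status pattern reduces to a polling loop. A short sketch with the JDK HTTP client, assuming a GET /queries/{reference}/status endpoint that eventually reports a terminal state; the path and state strings are illustrative:
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class QueryPoller {
        // Polls the hypothetical status endpoint until the query reaches a terminal state.
        public static String waitForCompletion(String baseUrl, String reference) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create(baseUrl + "/queries/" + reference + "/status")).GET().build();
            while (true) {
                String status = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
                if (status.contains("COMPLETED") || status.contains("FAILED")) {
                    return status;
                }
                Thread.sleep(2_000);   // back off between polls instead of hammering the service
            }
        }
    }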
The ability to provide the execution results inlined in the HTTP response is specific to the API implementation. Even if the results are large, the API response can return a file stream so that the caller can process it at their end. The ability to upload and download files is easy to implement in the API with a suitable parameter and return value for the corresponding calls.
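Returning a large result as a stream is straightforward in most REST frameworks. A hedged sketch using JAX-RS StreamingOutput, assuming the result has already been written to a file whose location can be derived from the query reference; the resource path and file location are placeholders:
    import java.nio.file.Files;
    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;
    import javax.ws.rs.core.StreamingOutput;

    @Path("/queries")
    public class QueryResultResource {

        @GET
        @Path("/{reference}/result")
        @Produces(MediaType.APPLICATION_OCTET_STREAM)
        public Response downloadResult(@PathParam("reference") String reference) {
            // Placeholder: a real service would resolve the reference to the result file location.
            java.nio.file.Path resultFile = java.nio.file.Path.of("/tmp/results/" + reference + ".json");
            // Stream the file to the caller instead of buffering it in memory.
            StreamingOutput stream = output -> Files.copy(resultFile, output);
            return Response.ok(stream)
                    .header("Content-Disposition", "attachment; filename=\"" + reference + ".json\"")
                    .build();
        }
    }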
Upload APIs are also useful for uploading analytical queries in the form of Flink artifacts that the API layer can forward and execute on a cluster. The results of the query can then similarly be passed back to the user. The result of the processing is typically a summary and will generally be smaller in size than the overall data on which the query is run. This makes it convenient for the API to return the results in one call.
The query processing ability from the API will also help automations that rely on scripts run from the command line. Administrators generally prefer invocation from the command line whenever possible. A health check performed by querying the introspection datastore, as described above, is not only authoritative but also richer, since all the components contribute to the data.

Tuesday, July 28, 2020

Support for small to large footprint introspection database and query

Propagating information from the introspection store along channels other than the send-home channel will help augment the value for their subscribers.
Finally, there is a benefit to including packaged queries with the introspection store in high-end deployments. Flink applications are known for improved programmability with both stream and batch processing, and packaged queries can present summary information from introspection stores.
The use of windows helps process the events sequentially. The order is maintained with the help of virtual time, and this helps with distributed processing as well. It is a significant win over traditional batch processing because the events are ingested continuously and the latency for results is low.
Stream processing is facilitated with the DataStream API instead of the DataSet API; the latter is more suitable for batch processing. Both APIs operate on the same runtime, which handles the distributed streaming dataflow. A streaming dataflow consists of one or more transformations between a source and a sink.
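A small DataStream sketch of that source-transformation-sink shape, with a tumbling processing-time window standing in for the windowing described above; the socket source and port are illustrative (feed it with something like nc -lk 9999):
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;

    public class WindowedWordCountSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Source: lines from a local socket; the host and port are illustrative.
            env.socketTextStream("localhost", 9999)
               // Transformation: split lines into (word, 1) pairs.
               .flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
                       for (String word : line.split("\\s+")) {
                           out.collect(Tuple2.of(word, 1L));
                       }
                   })
               .returns(Types.TUPLE(Types.STRING, Types.LONG))
               .keyBy(pair -> pair.f0)
               // Window: ten-second tumbling windows keep the running counts bounded.
               .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
               .sum(1)
               // Sink: print the per-window counts to stdout.
               .print();

            env.execute("windowed word count sketch");
        }
    }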
The difference between Kafka and Flink is that Kafka is better suited for data pipelines while Flink is better suited for streaming applications. Kafka tends to reduce the batch size for processing, while Flink tends to run an operator on a continuous basis, processing one record at a time. Both are capable of running in standalone mode. Kafka may be deployed in a high-availability cluster mode, whereas Flink can scale up in standalone mode because it allows iterative processing to take place on the same node. Flink manages its memory requirements better because it can perform processing on a single node without requiring cluster involvement. It has a dedicated master node for coordination.
Both systems are able to execute with stateful processing. They can persist the states to allow the processing to pick up where it left off. The states also help with fault tolerance, and this persistence of state protects against failures including data loss. The consistency of the states can also be independently validated with a checkpointing mechanism available from Flink. Checkpointing can persist the local state to a remote store. Stream processing applications often take in the incoming events from an event log. This event log stores and distributes event streams which are written to a durable, append-only log on tier 2 storage, where they remain sequential by time. Flink can recover a stateful streaming application by restoring its state from a previous checkpoint and adjusting the read position on the event log to match the state from the checkpoint. Stateful stream processing is not only suited for fault tolerance but also for reentrant processing and improved robustness with the ability to make corrections. It has become the norm for event-driven applications, data pipeline applications and data analytics applications.
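The checkpointing described here is enabled with a couple of lines on the execution environment. A minimal sketch, assuming a filesystem checkpoint location; the path is a placeholder and the storage call follows newer Flink releases:
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Take a checkpoint of operator state every 30 seconds.
            env.enableCheckpointing(30_000);
            // Persist checkpoints to a durable location so a restarted job can restore from them.
            env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
        }
    }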
Whether Kafka applications or Flink applications are bundled with the introspection store can depend on the usage and preference of the administrator. The benefit to the administrator in both cases is that no code has to be written for the analytics.