Saturday, January 25, 2020

Filters:
Filters are common in applications that query data. When data is stored in tables, a filter usually appears as a predicate that reduces the number of records returned. Tables carry a lot of metadata, and queries can be prepared and cached, which makes running these filters very fast.
With the advent of big data storage, applications built on map-reduce had to do their own filtering. There was neither an obvious way of preparing the data nor an easy way of caching the results.
The same is true for streaming applications. An application that writes a filter such as the following:
    // A Flink FilterFunction that keeps only event records.
    private static class EventFilter implements FilterFunction<String> {
        @Override
        public boolean filter(String line) throws Exception {
            return !line.contains("NonEvent");
        }
    }
has no knowledge of how much time it will take to go through the data in the stream to produce the filtered results.
If the filters are not re-used, they can be in-lined with the application logic, since evaluating one record end to end does not take much longer than applying the single filter operation to it.
If the filters are re-used, then it is possible to package them independently. This allows for the filtering logic to be run again and again on different data streams.
Another strategy to make filtering fast is to do it once rather than several times. For example, filtering one data stream yields another stream with just the data of interest, and this resulting stream can then be used as the feed for all application logic.
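As a minimal sketch of this filter-once pattern (assuming Flink's DataStream API and the EventFilter above; the socket source and the two downstream consumers are placeholders):
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FilterOnceJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Placeholder source; any DataStream<String> works here.
            DataStream<String> lines = env.socketTextStream("localhost", 9999);
            // Apply the filter exactly once; the derived stream feeds every consumer.
            DataStream<String> events = lines.filter(new EventFilter());
            events.print();                          // consumer 1: inspect the events
            events.map(String::toUpperCase).print(); // consumer 2: a second piece of logic
            env.execute("filter-once");
        }
    }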
Yet another strategy is to combine the filtering with the computation of statistics. For example, if the purpose of the filtering is only to count, the counts can be collected from the non-transformed stream via map-reduce and saved in a form that can be re-used later.
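A similar sketch, again with a placeholder source, folds the filter into a running count so that no separate filtered stream needs to be materialized:
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class EventCountJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> lines = env.socketTextStream("localhost", 9999); // placeholder source
            // Each record contributes 1 if it is an event and 0 otherwise,
            // so the filter and the statistic are computed in a single pass.
            lines.map(new MapFunction<String, Tuple2<String, Long>>() {
                     @Override
                     public Tuple2<String, Long> map(String line) {
                         return Tuple2.of("events", line.contains("NonEvent") ? 0L : 1L);
                     }
                 })
                 .keyBy(value -> value.f0)
                 .sum(1)
                 .print(); // running count of events, available for later re-use
            env.execute("count-events");
        }
    }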
These are some of the techniques for filtering on large data sets.

Friday, January 24, 2020

Organizing the APIs was only needed for gRPC. REST APIs are independently provisioned per resource. This flat enumeration of APIs fits the microservice model and makes them available for mashups at a higher level. SOAP requires tools to inspect the message; REST can be intercepted by a web proxy and displayed with a browser and add-ons.
SOAP methods require a declarative address, binding, and contract. REST is URI based and has qualifiers for resources.
There is a technique called HATEOAS where a client can discover more information from the web API server with the help of hypermedia. It lets machines, rather than humans reading the documentation, find out what actions are available. Since storage products are considered commodities, this technique is suitable for swapping one storage product for another: clients that use the web API programmatically can then switch products with ease.
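As an illustration of what such a hypermedia-bearing response might look like, here is a sketch using Spring HATEOAS; the paths and relation names are invented for the example:
    import org.springframework.hateoas.EntityModel;
    import org.springframework.hateoas.Link;

    public class StreamRepresentations {
        // Wrap the payload with links that tell the client what it can do next,
        // independent of which storage product sits behind the API.
        public EntityModel<String> describeStream(String scope, String stream) {
            String base = "/v1/scopes/" + scope + "/streams/" + stream;
            return EntityModel.of("stream:" + scope + "/" + stream,
                    Link.of(base).withSelfRel(),
                    Link.of(base + "/segments", "segments"),
                    Link.of(base + "/scaling-policy", "scaling-policy"));
        }
    }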
Some implementors of REST APIs follow a convention, for example nesting collections and items in the URI such as /scopes/{scope}/streams/{stream}. This convention makes it easy to construct the APIs for the resources and brings consistency across the APIs for different resources.
REST APIs also make authentication and authorization simpler to set up and check. There are several solutions to choose from, some of which offload the functionality, while others use Java frameworks that support annotation-based checks for adequate privilege and access control.
Supported actions are filtered. Whether a user can read or write is determined by whether the user holds the corresponding read or write permission, so the user is checked for the permission involved. This can be RBAC-style access control, where the check is simply against the role of the user; the system user, for instance, has all privileges. The capabilities a user has are determined, for instance, by the license manager or by explicitly added capabilities. This is also necessary for audit purposes. If auditing is not necessary, the check against capabilities could always return true; on the other hand, if all denies were audited, each one would leave a trail in the audit log.
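A minimal sketch of such a capability check is shown below; the User and Permission types and the audit call are hypothetical placeholders, not any particular product's API:
    import java.util.Set;

    enum Permission { READ, WRITE }                 // illustrative permissions

    class User {
        String role;
        Set<Permission> capabilities;               // e.g. granted by a license manager
        boolean systemUser;
    }

    class AccessControl {
        boolean isAllowed(User user, Permission permission) {
            if (user.systemUser) {
                return true;                        // the system user has all privileges
            }
            boolean allowed = user.capabilities.contains(permission);
            if (!allowed) {
                // Audited denies leave a trail; return true unconditionally if auditing is not needed.
                System.out.println("AUDIT deny: role=" + user.role + " permission=" + permission);
            }
            return allowed;
        }
    }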
Pravega serves as a stream store. In standalone mode, its control path is available over the REST API on port 9090, and the data path goes over the Flink connector to the segment store on port 6000. The functionality for the POST method should not involve abstractions and higher-level modules; instead it should stay as close to the methods of the component as possible.
@Override
public CompletableFuture<Void> createEvent(String scopeName, String streamName, String message) {
    // Create a client factory for the scope and an event writer for the stream.
    final ClientFactoryImpl clientFactory = new ClientFactoryImpl(scopeName, this);
    final Serializer<String> serializer = new JavaSerializer<>();
    final Random random = new Random();
    final Supplier<String> keyGenerator = () -> String.valueOf(random.nextInt());
    EventStreamWriter<String> writer = clientFactory.createEventWriter(streamName, serializer, EventWriterConfig.builder().build());
    // Write the message with a random routing key; writeEvent returns a CompletableFuture.
    return writer.writeEvent(keyGenerator.get(), message);
}

Thus, a mix of conventional design and new APIs broadens the audience for a streaming data platform.

Thursday, January 23, 2020

There is also one more advantage to exposing REST APIs from the store: the APIs can be deployed with the same deployment charts as the store itself.
The deployment charts are a significant utility for any solution involving streams of data. They make it easy to deploy the stream store and its components along with their APIs and services. Whether the deployment is over one container orchestration framework or another, the charts help declare the requirements. These same charts also help with the performance and scalability of the store because they determine the container deployment strategies. The use of containers over standalone deployment enables resilience, fault tolerance, horizontal scaling and load balancing. All of the techniques that the container orchestration frameworks are well known for are equally applicable to the stream store when it is deployed with its charts.
The number of resources for each component depends on the load. Some storage products can be deployed in high-availability mode, which involves a cluster instead of a single host, and the number of nodes in the cluster can be scaled dynamically based on load. The stream store is no different: deployment charts provide the ability to scale the nodes of the store services. The deployment charts also define the order of creation and destruction of pods with the help of a StatefulSet. The pod deployment policies, their security and their initialization routines are based on the application charts.
Another important aspect of APIs is the ability to include documentation. Swagger, for example, regenerates the documentation automatically whenever the API descriptions change. Documentation is a critical aspect of the popularity of APIs.
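As an illustration, a JAX-RS resource carrying Swagger (OpenAPI) annotations keeps the generated documentation in step with the handler; the path and summary below are placeholders:
    import io.swagger.v3.oas.annotations.Operation;
    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("/v1/scopes")
    public class ScopesResource {

        // Swagger picks up the annotation and regenerates the API documentation on change.
        @GET
        @Produces(MediaType.APPLICATION_JSON)
        @Operation(summary = "List all scopes in the stream store")
        public String listScopes() {
            return "[]"; // placeholder body
        }
    }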
Thus, a mix of conventional design and new APIs broadens the audience for a streaming data platform.

Tuesday, January 21, 2020

Stream storage is not a commodity; it is an emerging market. There are a handful of players at best and very few solutions built on top of their products. Solution architects find developer appeal in the new products that combine analytics with storage. However, devOps engineers prefer scriptability over programmability to expedite plugging the product into data pipelines.
The conventional solution to the need for scriptability is demonstrated very well by the Amazon Web Services (AWS) software development kit (SDK). There is one for every popular language, which widens access to the underlying services. Each service is accessible over the web with a Representational State Transfer (REST) application programming interface (API), which is then automated and made more scriptable with a client-side local SDK that integrates easily with existing and new solutions because it speaks the language the client wants.
The stream stores integrate with analytical software through store-specific connectors that put the gRPC protocol to good use. The REST interface for the data path was somewhat lacking, which made it difficult to embrace the conventional solution. REST and gRPC are essentially both data transfer mechanisms, except that REST is a way of requesting ‘resources’ from the remote end via standard verbs such as GET and PUT, while gRPC is a way of invoking arbitrary procedures on the server. These routines are the equivalent of verbs and resources, and some treat this communication as a refinement of the erstwhile Simple Object Access Protocol (SOAP).
The advantages of REST are:
Requires only HTTP/1.1, which is universally supported
Supports subscription mechanisms with REST hooks
Comes with widely accepted tool and browser support
Well defined road to development of the service that provides this communication
Supports discovery of resource identifiers with subsequent request response models.
Is supportive of software development kits, so more than one language can be supported for the use of these communication interfaces.
The disadvantages are:
Is considered chatty because there are a number of requests and responses
Is considered heavy because the payload is usually large.
Is considered inflexible at times with versioning costs
The advantages of gRPC are:
Supports high speed communication because it is lightweight and does not require the traversal of stack all the way up and down the networking layers.
The messages are encoded with “Protocol Buffers”, which is known for packing and unpacking data efficiently
It works over newer HTTP/2
Best for traffic from devices (IoT)
The disadvantages are:
Requires client to write code
Does not support browser

The conventional solution therefore has to be enhanced with the provisioning of REST APIs, because command-line invocation and browser support win hands down for scriptability.
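To illustrate how little a client needs, a plain HTTP GET from the JDK's built-in client is enough to script against such a REST endpoint; the URL below is a placeholder for the store's REST control path:
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestProbe {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Any REST resource can be queried with a simple GET; no generated stubs required.
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9090/v1/scopes"))
                    .GET()
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }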

Monday, January 20, 2020

The Jenkins automation to read upstream job information is made possible with APIs. For example:

    for (String jobName : jobNames.keySet()) { 
        def imageName = jobNames.get(jobName) 
        // Fetch the upstream job's build JSON and pull the version out of its description field.
        def buildJson = ["wget", "-qO-", "${uri}"].execute().text 
        def start = buildJson.indexOf('"description":"') 
        def end = buildJson.indexOf('"', start+15)     // 15 is the length of '"description":"'
        def remoteBuildVersion = "" 
        println("${start}:${end}") 
        if (start != -1 && end != -1 && end > start) { 
            remoteBuildVersion = buildJson.substring(start+15, end) 
        } 
        // Compare against the version recorded in the local manifest for this image.
        manifestFileLines.each { line -> 
             if (line.contains("${imageName}:")) { 
                 localBuildVersion = line.split(":")[1] 
                 if (remoteBuildVersion != localBuildVersion) { 
                     pattern=remoteBuildVersion 
                     replacement=localBuildVersion 
                     println("replacing ${pattern} with ${replacement} in ${imageName}") 
                     // Back up each affected file, then substitute the new build version in place.
                     filesToBeModified.eachFileRecurse( 
                            {file -> 
                                  fileText = file.text; 
                                  def backupFile = file.path + ".bak" 
                                  writeFile(file: backupFile, text: fileText) 
                                  fileText = fileText.replaceAll(pattern, replacement) 
                                  writeFile(file: file.path, text: fileText) 
                            }) 
                 } else { 
                     println("${jobName} build version ${remoteBuildVersion} matches ${localBuildVersion}") 
                 } 
             } 
        } 
    } 

Note that the above script is specific to a Jenkinsfile and avoids popular Groovy idioms. Even though Groovy can otherwise be used in a Jenkinsfile, we might see errors such as: “Scripts not permitted to use staticMethod org.codehaus.groovy.runtime.DefaultGroovyMethods execute java.util.List. Administrators can decide whether to approve or reject this signature. 
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: Scripts not permitted to use staticMethod org.codehaus.groovy.runtime.DefaultGroovyMethods execute java.util.List” 

Sunday, January 19, 2020

There are many ways to use configuration annotations in a Spring Java application. The @Configuration annotation is not the same as a bean. A POJO defined with @ConfigurationProperties can be imported into an @Configuration class.
For example, the proper annotations to use with Kubernetes secrets in a Spring Java application:
@Configuration 
@Primary 
@EnableWebSecurity 
@EnableConfigurationProperties(KubernetesSecrets.class) 
public class WebConfig extends WebSecurityConfigurerAdapter { 
    private static final Logger LOG = LoggerFactory.getLogger(WebConfig.class); 
    // Constructor injection of the externally sourced configuration properties.
    private final KubernetesSecrets secrets; 
    public WebConfig(KubernetesSecrets secrets) { 
        this.secrets = secrets; 
    } 
} 

Here the KubernetesSecrets object is a @ConfigurationProperties class whose values are supplied externally.
When a new instance of KubernetesSecrets is created inside a method of such a class, that method carries the @Bean annotation. The same annotation does not apply to the member variable and constructor shown above.
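A sketch of that alternative is shown below; the prefix and property names are illustrative, and the getters and setters that Spring needs for binding are omitted:
    import org.springframework.boot.context.properties.ConfigurationProperties;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    class KubernetesSecrets {
        private String username;
        private String password;
        // getters and setters omitted for brevity
    }

    @Configuration
    class SecretsConfig {
        @Bean
        @ConfigurationProperties(prefix = "secrets")  // values bound from external configuration
        KubernetesSecrets kubernetesSecrets() {
            // The method, not a field or constructor, carries the @Bean annotation.
            return new KubernetesSecrets();
        }
    }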

The differences between @Configuration, @ConfigurationProperties and @Bean are somewhat unclear when they are used merely to refer to an external source. @ConfigurationProperties only says that these properties will have values defined externally, while @Configuration is placed on a class that uses such ConfigurationProperties.

Please note that we don't use @Autowired for the member variable or the constructor above. If it were a @Bean, that would have been appropriate. 

A configuration is essential for context initialization which in turn helps the SpringBootApplication with initialization.

The @Bean annotation is used to declare a single bean explicitly. Spring does it automatically when a @Component, @Service, or @Repository is used, because those come from classpath scanning.
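A minimal sketch of the two routes side by side, with invented class names:
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.stereotype.Component;

    // Discovered automatically through classpath scanning.
    @Component
    class AuditService {
    }

    // Declared explicitly with @Bean inside a configuration class.
    @Configuration
    class AppConfig {
        @Bean
        ReportService reportService() {
            return new ReportService();
        }
    }

    class ReportService {
    }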