Saturday, May 16, 2020

Data import and export tool

The importer faces one challenge that the exporter does not. The destination for the exporter can accept multipart uploads, but it is limited in how it returns the same payload: other than sending it back as an OutputStream, there is no piecewise retrieval, so the importer has to wait until the transfer completes. The exporter, on the other hand, can pause and resume independently of the destination. This lets the sender be smart about orchestrating simultaneous transfers to multiple destinations without requiring more replicas. 
As with any application, importers and exporters can each be a dedicated single thread of activity that can be tested, serviced and monitored with best practices from testing, dev-ops and call-home functionality.  These capabilities can be added independently via their own stacks or applications that sit alongside those deployed by the tool. There is little need for the application to take on this burden itself, since specialized products already serve these functions across applications. For example, reporting stacks can work off the logs and read-only queries from the importer and exporter.
API functionality is a separate concern from the above and belongs exclusively to the tool. The tool may take parameters over the API, requiring little or no redeployment for the end user. This kind of functionality alleviates the setup and teardown associated with ad hoc and changing requirements.  
The importer and exporter have the ability to append or read sections of the stream.
The importer and exporter also make push and pull models easy by acting as a relay in the middle. Their role then becomes that of an adapter between heterogeneous systems. The API, for example, is a pull model, but most metrics and time-series databases follow a push model, relying on agents like Telegraf to transfer data. 
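As a rough illustration of that relay/adapter idea (not part of the tool itself), the sketch below pulls a batch from a hypothetical read-only endpoint and pushes it to an InfluxDB 1.x-style /write endpoint in line protocol; the URLs, database name and measurement are placeholders, and each pulled line is assumed to be a single numeric sample.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Pull from a read-only API (pull model) and push to a metrics store (push model).
fun main() {
    val http = HttpClient.newHttpClient()

    // Pull: fetch one line-oriented batch from the exporter's read-only API (hypothetical URL).
    val pull = HttpRequest.newBuilder(URI.create("http://exporter.local/api/metrics")).GET().build()
    val batch = http.send(pull, HttpResponse.BodyHandlers.ofString()).body()

    // Adapt: convert each pulled record into InfluxDB line protocol (measurement field=value timestamp).
    // Assumes every non-blank line is a numeric sample.
    val lines = batch.lineSequence()
        .filter { it.isNotBlank() }
        .joinToString("\n") { "stream_events value=${it.trim()} ${System.currentTimeMillis() * 1_000_000}" }

    // Push: write the batch to an InfluxDB 1.x-style endpoint (hypothetical host and database).
    val push = HttpRequest.newBuilder(URI.create("http://tsdb.local:8086/write?db=exports"))
        .POST(HttpRequest.BodyPublishers.ofString(lines))
        .build()
    val status = http.send(push, HttpResponse.BodyHandlers.discarding()).statusCode()
    println("relay push returned HTTP $status")
}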
The importer and the exporter enable a stream store to participate in a data pipeline. This is a critical business value for the stream store because it adds value as a co-inhabitant of a data ecosystem as opposed to competing with time-series databases.

Friday, May 15, 2020

Data export and import tool


Let us change topics today and resume another thread of discussion: the data export and import tool.


The tool can work in both export mode and import mode. The export mode sends data from a source stream to a target bucket. The import mode moves the data back from the S3 bucket to the local stream.  The export is done by readers that read from the stream. The import is done by writers that write to the stream. The readers and writers are both capable of being paused and resumed, which they do with the help of the stream store's functionality. The S3 store is already web-accessible, so the requests and responses are granular and the uploads can be multipart. A writer may be created for each transfer operation, with the ability to perform its operation over a long time. This kind of action is independent and isolated for both readers and writers. There can be many readers on the same stream without affecting each other, and each writer writes to a stream reserved for it. Since each event is sequenced, the last position is known, which helps with progress and time remaining. The size of an event is finite. When the data exceeds an event, it can be written into another event. The size of the object and the size of the event do not have to match; each can spill over into another object or event. 
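A minimal sketch of the export side, assuming the AWS SDK for Java v1 for the bucket; the stream store client is not named here, so the reader interface below is hypothetical, and the bucket, key and part size are placeholders.

import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest
import com.amazonaws.services.s3.model.PartETag
import com.amazonaws.services.s3.model.UploadPartRequest
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream

// Hypothetical reader over the source stream; a real stream store client would supply this.
interface StreamReader {
    fun readNextEvent(timeoutMillis: Long): ByteArray?   // null when the stream is drained
    fun position(): Long                                 // last read position, usable for pause/resume and progress
}

// Export: accumulate events into parts and ship them to S3 as a multipart upload.
fun export(reader: StreamReader, bucket: String, key: String) {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    val uploadId = s3.initiateMultipartUpload(InitiateMultipartUploadRequest(bucket, key)).uploadId
    val etags = mutableListOf<PartETag>()
    val buffer = ByteArrayOutputStream()
    var partNumber = 1

    while (true) {
        val event = reader.readNextEvent(1000) ?: break
        buffer.write(event)
        // S3 requires every part except the last to be at least 5 MB.
        if (buffer.size() >= 5 * 1024 * 1024) {
            etags += uploadPart(s3, bucket, key, uploadId, partNumber++, buffer.toByteArray())
            buffer.reset()
            println("exported up to position ${reader.position()}")   // progress / time-remaining hook
        }
    }
    if (buffer.size() > 0) {
        etags += uploadPart(s3, bucket, key, uploadId, partNumber, buffer.toByteArray())
    }
    s3.completeMultipartUpload(CompleteMultipartUploadRequest(bucket, key, uploadId, etags))
}

private fun uploadPart(
    s3: AmazonS3, bucket: String, key: String, uploadId: String, partNumber: Int, bytes: ByteArray
): PartETag =
    s3.uploadPart(
        UploadPartRequest()
            .withBucketName(bucket).withKey(key).withUploadId(uploadId)
            .withPartNumber(partNumber)
            .withInputStream(ByteArrayInputStream(bytes))
            .withPartSize(bytes.size.toLong())
    ).partETag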


The importer faces one challenge that the exporter does not. The destination for the exporter can accept multipart uploads, but it is limited in how it returns the same payload: other than sending it back as an OutputStream, there is no piecewise retrieval, so the importer has to wait until the transfer completes. The exporter, on the other hand, can pause and resume independently of the destination. This lets the sender be smart about orchestrating simultaneous transfers to multiple destinations without requiring more replicas.  
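To make that constraint concrete, a minimal sketch of the import side (assuming the AWS SDK for Java v1; the writer interface, bucket, key and event size are hypothetical) reads the object back as one stream and carves it into events until the stream is exhausted, which is why the importer rides out the whole transfer.

import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Hypothetical writer into the destination stream; a real stream store client would supply this.
interface StreamWriter {
    fun writeEvent(event: ByteArray)
    fun flush()
}

// Import: S3 hands the payload back only as a single object stream, so the importer
// consumes it to completion, splitting the bytes into finite-sized events as it goes.
fun import(writer: StreamWriter, bucket: String, key: String, eventSize: Int = 1024 * 1024) {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    s3.getObject(bucket, key).objectContent.use { input ->
        val chunk = ByteArray(eventSize)
        while (true) {
            val read = input.read(chunk)
            if (read <= 0) break                   // no pause/resume here: we wait until the object is drained
            writer.writeEvent(chunk.copyOf(read))  // spill over into another event when the data exceeds one
        }
    }
    writer.flush()
}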


As with any application, importers and exporters can each be a dedicated single thread of activity that can be tested, serviced and monitored with best practices from testing, dev-ops and call-home functionality.  These capabilities can be added independently via their own stacks or applications that sit alongside those deployed by the tool. There is little need for the application to take on this burden itself, since specialized products already serve these functions across applications. For example, reporting stacks can work off the logs and read-only queries from the importer and exporter. 


API functionality is a separate concern from the above and belongs exclusively to the tool. The tool may take parameters over the API, requiring little or no redeployment for the end user. This kind of functionality alleviates the setup and teardown associated with ad hoc and changing requirements.   


Thursday, May 14, 2020

Kotlin vs Java continued

Sample code to print the stack trace in Kotlin
try {
     throw RuntimeException("printStackTrace")
} catch (e: Exception) {
     e.printStackTrace()
     LOG.error("Exception: ", e)   // assumes an SLF4J-style logger named LOG
}
As an aside, stack traces are valuable for the analysis of faults and their categorization.
For example, software support folks in the field often encounter crash dumps at the customer's premises. They may have access to some dumps, but they find situations where they do not have any access to symbols, particularly when the customer's operating system is Windows. What they see in the dumps or crash log reports are stack frames with raw addresses such as 0x7ff34555, not yet resolved as module!class::method. Resolving stack frames is the first step towards recognizing and categorizing them so that the software components generating exceptions can be identified. A debugger helps with finding the raw frames, but symbols are needed to resolve them. The production support personnel can't take the symbols to the customer's machine, which results in a round-trip or call-home with the exception or the dump so that the symbols and the frames can be put together. This can be achieved by using the same module that the debuggers use to translate the stack trace - dbghelp.dll - when the symbols are available. This module exports methods such as StackWalk and StackWalk64 that give the stack trace when pointed to the dump, and it has the methods SymFromAddr and SymFromAddrW that do just what we want. A call to SymInitialize is required first to initialize the symbol handler for the process handle. SymFromAddr takes the following parameters :
- the process handle previously passed to the SymInitialize method
- an Address pertaining to a frame in the stack for which the symbol should be located.
- an optional pointer that receives the displacement from the beginning of the symbol, or zero
- a buffer to receive the SYMBOL_INFO resolved from the address.
If the function succeeds, it returns true. The SYMBOL_INFO struct has a member called MaxNameLen that should be set to the size of the name in characters. The name can be requested to be undecorated for easy readability.
The code from MSDN is something like this:
 
DWORD64  dwDisplacement = 0;
DWORD64  dwAddress = SOME_ADDRESS;
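 
// Not part of the MSDN fragment: the symbol handler must be initialized first (see SymInitialize above)
HANDLE   hProcess = GetCurrentProcess();  // or a handle to the process whose dump is being inspected
SymInitialize(hProcess, NULL, TRUE);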
 
char buffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME * sizeof(TCHAR)];
PSYMBOL_INFO pSymbol = (PSYMBOL_INFO)buffer;
 
pSymbol->SizeOfStruct = sizeof(SYMBOL_INFO);
pSymbol->MaxNameLen = MAX_SYM_NAME;
 
if (SymFromAddr(hProcess, dwAddress, &dwDisplacement, pSymbol))
{
    // SymFromAddr returned success
   printf("%s\n", buffer);
}
else
{
    // SymFromAddr failed
    DWORD error = GetLastError();
    printf("SymFromAddr returned error : %d\n", error);
}
 
The address we pass should be relative to the module's base address. A stack-hasher application involves parsing stack traces, hashing them into buckets and reporting them, say via bar charts, based on categories.
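A rough sketch of that bucketing idea, with made-up frame strings, normalizes a top-of-stack signature, hashes it into a bucket and counts occurrences per bucket.

// Minimal stack-hashing sketch: drop raw offsets, hash the top frames, count stacks per bucket.
fun bucketOf(frames: List<String>, depth: Int = 5): Int =
    frames.take(depth)
        .map { it.substringBefore('+').trim() }   // drop offsets like "+0x1a" so similar stacks group together
        .joinToString("|")
        .hashCode()

fun main() {
    val traces = listOf(
        listOf("app!Service::handle+0x1a", "app!Server::dispatch+0x40"),
        listOf("app!Service::handle+0x2b", "app!Server::dispatch+0x40"),
        listOf("driver!Io::read+0x10")
    )
    val report = traces.groupingBy { bucketOf(it) }.eachCount()
    println(report)   // bucket hash -> number of stacks in that bucket
}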
Now, returning to the runtime determination of the origin of objects using stack traces, we find that the Kotlin runtime adds a layer over the native runtime, resulting in quite deep stack traces, unlike the Go language. 
The Kotlin runtime is also tightly bound to the garbage collector. GC allocation errors are frequently tied to the runtime requirements of POJO-like objects. When the objects are marshaled across boundaries, they are often serialized and deserialized, resulting in lots of byte-range copying and small allocations depending on the type and composition of the objects.


Wednesday, May 13, 2020

Data Import tool

We talked about a data export tool as follows:


When applications are hosted on Kubernetes, they choose to persist their state using persistent volumes. The data stored on these volumes is available between application restarts. The StorageClass that provides storage for these persistent volumes is external to the pods and the containers on which the application is running. When the tier-2 storage is NFS, the persistent volumes appear as a mounted file system, usable with all standard shell tools, including those for backup and export such as duplicity. The backups usually exist together with the source, as another persistent volume, which can then be exposed to users via curl requests. Therefore, there is a two-part separation: one part involves an extract-transform-load between a source and a destination, and another relays the prepared data to the customer.

Both can take an arbitrary amount of data and prolonged processing. In the Kubernetes world, with the arbitrary lifetime of pods and containers, this kind of processing becomes prone to failures. It is this special consideration that sets the application logic apart from traditional data export techniques. The ETL may be written in Java, but a Kubernetes Job will need to be specified in the operator code base so that jobs can be launched on user demand and survive all the interruptions and movements possible in the control plane of Kubernetes.

A Kubernetes Job runs to completion. It creates one or more pods and, as the pods complete, the Job tracks the completions. The Job owns the pods, so the pods are cleaned up when the Job is deleted. The Job spec describes the job and usually requires the pod template, apiVersion, kind and metadata fields. The selector field is optional.  Jobs may be sequential, parallel with a fixed completion count, or parallel as in a work queue - all of which are suitable for multi-part export of data.
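A sketch of what launching such a Job from operator code might look like, assuming the fabric8 Kubernetes client; the image, names, arguments and counts below are placeholders.

import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder
import io.fabric8.kubernetes.client.KubernetesClientBuilder

fun main() {
    KubernetesClientBuilder().build().use { client ->
        // A Job with a fixed completion count; the pods it owns are cleaned up when the Job is deleted.
        val job = JobBuilder()
            .withNewMetadata().withName("stream-export").endMetadata()
            .withNewSpec()
                .withCompletions(4)          // fixed completion count, e.g. one per part of a multi-part export
                .withParallelism(2)          // how many parts run at a time
                .withBackoffLimit(3)
                .withNewTemplate()
                    .withNewSpec()
                        .addNewContainer()
                            .withName("exporter")
                            .withImage("example.com/stream-exporter:latest")   // placeholder image
                            .withArgs("--mode=export", "--source=my-stream", "--target=s3://my-bucket/")
                        .endContainer()
                        .withRestartPolicy("OnFailure")
                    .endSpec()
                .endTemplate()
            .endSpec()
            .build()

        client.batch().v1().jobs().inNamespace("default").resource(job).create()
    }
}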

Data export from the Kubernetes data plane can be made on demand and associated with a corresponding K8s resource, custom or standard, for visibility in the control plane.

An alternative technique is to enable a multipart download REST API that exposes the filesystem or S3 storage directly. This pattern keeps the data transfer out of the Kubernetes control plane; the API is exposed only internally and is then used from the user interface.

The benefit of this technique is that the actions are tied to user-interface-based authentication and all actions are on demand. The trade-off is that the user interface has to relay the API call to another pod, and it does not work for long downloads without interruptions.

Regardless of how the data to be streamed to the client is prepared behind an API call, it is better not to require relays in the data transfer. The API call is useful for making the request for the prepared data on demand, and the implementation can scale to as many requests as necessary. 
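One way to keep the on-demand behaviour without relaying bytes through a pod, sketched here with the AWS SDK for Java v1 and placeholder bucket and key names, is to answer the API call with a short-lived presigned URL so the client downloads directly from S3.

import com.amazonaws.HttpMethod
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest
import java.net.URL
import java.util.Date

// Handle the user's export-download request by handing back a link instead of relaying the data.
fun presignDownload(bucket: String, key: String, validMinutes: Long = 15): URL {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    val expiration = Date(System.currentTimeMillis() + validMinutes * 60_000)
    val request = GeneratePresignedUrlRequest(bucket, key)
        .withMethod(HttpMethod.GET)
        .withExpiration(expiration)
    return s3.generatePresignedUrl(request)   // the caller downloads directly from S3; no relay pod in the path
}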


This can also be a data import tool

The tool can work in both export mode and import mode. The export mode sends data from a source stream to a target bucket. The import mode moves the data back from the S3 bucket to the local stream.  The export is done by readers that read from the stream. The import is done by writers that write to the stream. The readers and writers are both capable of being paused and resumed, which they do with the help of the stream store's functionality. The S3 store is already web-accessible, so the requests and responses are granular and the uploads can be multipart. A writer may be created for each transfer operation, with the ability to perform its operation over a long time. This kind of action is independent and isolated for both readers and writers. There can be many readers on the same stream without affecting each other, and each writer writes to a stream reserved for it. Since each event is sequenced, the last position is known, which helps with progress and time remaining. The size of an event is finite. When the data exceeds an event, it can be written into another event. The size of the object and the size of the event do not have to match; each can spill over into another object or event. 

 

Tuesday, May 12, 2020

Kotlin vs Java continued

A dynamic component system can be specified with the help of the OSGi specifications. Components are packaged in bundles which communicate locally or via services across the gateway, hence the name Open Services Gateway initiative. Kotlin offers support for this specification with a separate kotlin-osgi-bundle library, which replaces the regular libraries such as kotlin-runtime, kotlin-stdlib and kotlin-reflect. The usual Gradle tag to specify exclusions also works for the libraries referenced above. The manifest requirement that other languages use to specify the OSGi bundle is also possible in Kotlin, but it is not sufficient. That option is certainly the most preferred way, but it does not solve a well-known package split issue.
The Kotlin standard library includes:
- higher-order functions such as let, apply, synchronized and use (illustrated in the sketch after this list)
- extension functions that support querying operations
- utilities to work with String and CharSequence
- extensions for JDK classes that help with programming for IO, threading and files
Kotlin core functions and types are supported on all platforms and are available via the kotlin package.
- The kotlin.annotation package provides the annotation facility.
- kotlin.browser provides top-level browser properties.
- Regular collection types are packaged in kotlin.collections.
- kotlin.comparisons and kotlin.concurrent provide comparator and concurrent programming facilities.
- kotlin.contracts and kotlin.coroutines provide the contract DSL and suspend facilities.
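A small, self-contained illustration of those scope and utility functions; the temp-file name and the printed strings are arbitrary.

import java.io.File
import kotlin.concurrent.thread

fun main() {
    // let: operate on a value and return the lambda's result
    val length = "kotlin stdlib".let { it.trim().length }

    // apply: configure an object and return the object itself
    val sb = StringBuilder().apply {
        append("length=")
        append(length)
    }

    // synchronized: run a block while holding a lock
    val lock = Any()
    synchronized(lock) { println(sb.toString()) }

    // use: work with a Closeable and close it automatically (extensions for JDK io classes)
    File.createTempFile("demo", ".txt").apply { deleteOnExit() }
        .bufferedWriter().use { it.write(sb.toString()) }

    // kotlin.concurrent: a small threading utility
    thread { println("from a helper thread") }.join()
}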

Monday, May 11, 2020

Kotlin vs Java continued

Kotlin has support for a number of tools to help with the compilation and build. 

Annotation processing is done with the help of the kapt tool. A plugin for Gradle by the name kotlin-kapt is also available. Key-value pairs can be passed in to kapt as arguments.

Kapt tasks can be run in parallel. Kapt can also leverage Gradle's compile avoidance feature to skip annotation processing altogether. The plugin is available in jar form to make it easy to run from the command line. 
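A minimal build.gradle.kts sketch of wiring that up; the Kotlin version, the processor coordinates and the key/value pair are placeholders.

plugins {
    kotlin("jvm") version "1.3.72"
    kotlin("kapt") version "1.3.72"   // the kotlin-kapt Gradle plugin
}

repositories {
    mavenCentral()
}

dependencies {
    implementation(kotlin("stdlib"))
    kapt("com.google.dagger:dagger-compiler:2.27")   // any annotation processor; Dagger is just an example
}

kapt {
    arguments {
        arg("someKey", "someValue")   // key/value pairs handed to the processor
    }
}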

The documentation language that fits Kotlin is KDoc. It takes the comments preceding class and member declarations and converts them into documentation. 
There is a convention to be followed for those comments to appear in the docs: the first line is usually a summary, followed by a more detailed description, and then the parameters and return types. The latter are specified with the help of block tags such as @param, @return, @constructor, @property, @throws, @exception, @sample, @see, @author, @since and @suppress. Text can also include links to references on the same page by enclosing terms in square brackets. Module and package documentation is provided via declarations in a separate file that is passed to a tool called Dokka.
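For instance, a KDoc comment following that convention might look like this; the function and its behaviour are made up for illustration.

/**
 * Copies events from a source stream into a destination bucket.
 *
 * The first line above is the summary; this paragraph is the detailed description.
 *
 * @param source the name of the stream to read from
 * @param destination the bucket that receives the exported data
 * @return the number of events copied
 * @throws IllegalArgumentException if [source] is blank
 */
fun export(source: String, destination: String): Long {
    require(source.isNotBlank()) { "source must not be blank" }
    return 0L   // placeholder body; the actual transfer is out of scope here
}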

A dynamic component system can be specified with the help of the OSGi specifications. 

Components are packaged in bundles which communicate locally or via services across the gateway, hence the name Open Services Gateway initiative.

Kotlin offers support for this specification with a separate kotlin-osgi-bundle library, which replaces the regular libraries such as kotlin-runtime, kotlin-stdlib and kotlin-reflect. The usual Gradle tag to specify exclusions also works for the libraries referenced above. The manifest requirement that other languages use to specify the OSGi bundle is also possible in Kotlin, but it is not sufficient. That option is certainly the most preferred way, but it does not solve a well-known package split issue. 
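A hedged build.gradle.kts sketch of that substitution; the version number is a placeholder, and the exclusion list should match whatever regular Kotlin libraries arrive transitively in a given build.

dependencies {
    // Use the OSGi-ready bundle instead of the regular Kotlin libraries
    implementation("org.jetbrains.kotlin:kotlin-osgi-bundle:1.3.72")
}

configurations.all {
    // Keep the regular libraries from sneaking in transitively
    exclude(group = "org.jetbrains.kotlin", module = "kotlin-runtime")
    exclude(group = "org.jetbrains.kotlin", module = "kotlin-stdlib")
    exclude(group = "org.jetbrains.kotlin", module = "kotlin-reflect")
}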


Sunday, May 10, 2020

Kotlin vs Java continued

Kotlin has a wide variety of tools available, from Gradle, Maven and Ant to compiler options, plugins, kapt, Dokka and OSGi. These tools and their plugins help Kotlin support multiple languages.  
Gradle has a number of plugins which are determined at the outset, very early, before the compilation of the code.

Gradle can optimize the loading and re-use of plugin classes, allow different plugins for different versions of classes and provide editors which detail information about the potential properties and values in the buildscript.

The plugins can also be resolved during the build rather than requiring pre-installation. It might look like the traditional apply() method in Gradle is no different from the plugins block, and that they both serve to list plugins and their versions, but the latter is more recent and has more rigorous checks, constraints and restrictions (compare the two styles in the sketch below).  If we want to avoid the restrictions, we can make use of the buildscript block.  Gradle has support for multi-project builds, so the build.gradle is composable for different projects.
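For comparison, a build.gradle.kts sketch of the two styles; the plugin version is a placeholder, and the older style is commented out so the two do not clash in one script.

// Newer style: the plugins block, resolved early, with stricter constraints
plugins {
    kotlin("jvm") version "1.3.72"
}

// Older, less restricted style: put the plugin on the build script classpath and apply() it
// buildscript {
//     repositories { mavenCentral() }
//     dependencies { classpath("org.jetbrains.kotlin:kotlin-gradle-plugin:1.3.72") }
// }
// apply(plugin = "org.jetbrains.kotlin.jvm")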

The kotlinOptions block in Gradle allows us to set compiler options. Common options include:

-nowarn, which suppresses warnings during compilation, and its opposite,

-Werror which turns warnings into errors

-script which evaluates a Kotlin script file

-kotlin-home, which specifies a custom path to the Kotlin compiler

-plugin and corresponding version to include plugins

-progressive, which enables a mode where the compiler applies deprecations and bug fixes for unstable code right away instead of going through a graceful migration cycle. Progressive mode is important for the stability and backward compatibility of code because it forces the code to adapt to breaking changes in the compiler. A breaking change for a compiler is one where the compiler now throws an error where it did not earlier.

In addition to the above parameters, JVM parameters can also be passed to the build.
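A build.gradle.kts sketch of setting those options, including a JVM target; the particular flag choices are illustrative, not prescriptive.

import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

tasks.withType<KotlinCompile>().configureEach {
    kotlinOptions {
        jvmTarget = "1.8"                                     // a JVM parameter passed to the build
        allWarningsAsErrors = true                            // the -Werror behaviour
        suppressWarnings = false                              // flip to true for -nowarn
        freeCompilerArgs = freeCompilerArgs + "-progressive"  // any other flag, e.g. progressive mode
    }
}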