Saturday, June 6, 2020

Application troubleshooting continued...

Application Scaling: 

The runtime for stream processing, such as Flink, is deployed as a dedicated cluster along with the necessary software components such as ZooKeeper, the stream analytics platform components and even clients to the stream store. When the application is deployed in the isolation of a project in the stream analytics platform, the cpu, memory and disk for all these software components are specified upfront. It therefore becomes necessary to predict the resource usage of the application.

This section focuses on just the Flink cluster with its jobManager, taskManager, resourceManager and dispatcher. The jobManager is the master process that controls the execution of a single application. Each application is controlled by a different jobManager and is usually represented by a logical dataflow graph called the jobGraph and a jar file. The jobManager converts the jobGraph into a physical dataflow graph and parallelizes as much of the execution as possible.

Flink provides a different resource manager for each cluster type it can run on. The resource manager is responsible for managing the task managers' slots, which are the unit of processing resources. It offers idle slots to the jobManager and launches new containers.

The task managers are the worker processes of Flink. There are multiple task managers running in a Flink setup. Each task manager provides a certain number of slots. These slots limit the number of tasks that a task manager can execute.
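Since slots cap what each task manager can run, the number of task managers needed for a job can be estimated with simple arithmetic. The sketch below assumes Flink's default slot sharing, where a job needs as many slots as its highest operator parallelism. The function name and the sample numbers are made up for illustration.

```python
import math


def task_managers_needed(max_parallelism: int, slots_per_task_manager: int) -> int:
    """Estimate how many task managers a job needs.

    Assumption: default slot sharing, so the job needs one slot per
    parallel instance of its most parallel operator.
    """
    if max_parallelism < 1 or slots_per_task_manager < 1:
        raise ValueError("both arguments must be positive")
    # Round up: a partially used task manager is still a whole process.
    return math.ceil(max_parallelism / slots_per_task_manager)


if __name__ == "__main__":
    # A job with parallelism 10 on task managers offering 4 slots each.
    print(task_managers_needed(10, 4))  # prints 3
```

If slot sharing is disabled or custom slot sharing groups are used, the required slot count is higher and this estimate is only a lower bound.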

The dispatcher runs across job executions. When an application is submitted via its REST interface, it starts a jobManager and hands the application over. The REST interface enables the dispatcher to serve as an HTTP entrypoint to clusters that are behind a firewall.
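As a rough illustration of submitting through the REST interface, the sketch below builds (but does not send) a request against the `/jars/:jarid/run` endpoint for a jar that has already been uploaded. The host name, jar id and entry class are placeholders, and the helper name is hypothetical; check the REST API documentation of the Flink version in use before relying on the exact fields.

```python
import json
import urllib.request


def build_run_request(base_url: str, jar_id: str, entry_class: str,
                      parallelism: int) -> urllib.request.Request:
    """Build a POST request that asks Flink to run an uploaded jar."""
    body = json.dumps({
        "entryClass": entry_class,   # main class inside the jar
        "parallelism": parallelism,  # default parallelism for the job
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jars/{jar_id}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    req = build_run_request("http://flink.example:8081", "myjob.jar",
                            "com.example.StreamingJob", 4)
    print(req.get_method(), req.full_url)
    # To actually submit: urllib.request.urlopen(req)
```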

Since Flink runs as a standalone cluster within the project namespace in the analytics platform, its dashboard is made accessible as-is with a project namespace qualified name. This gives us enough indicators of the resource usage of Flink job execution and enables the project user to deploy the project with adequate resources for mission critical deployments.
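The same indicators that the dashboard shows are also served as JSON by the cluster's REST API. The sketch below takes a payload shaped like the `/overview` response and computes how much of the slot capacity is in use. The sample numbers are invented, and the function name is only for illustration.

```python
def slot_utilization(overview: dict) -> float:
    """Return the fraction of task slots currently in use (0.0 to 1.0)."""
    total = overview["slots-total"]
    available = overview["slots-available"]
    if total == 0:
        # No task managers registered yet, so nothing is in use.
        return 0.0
    return (total - available) / total


if __name__ == "__main__":
    # Invented sample shaped like the /overview payload.
    sample = {
        "taskmanagers": 3,
        "slots-total": 12,
        "slots-available": 3,
        "jobs-running": 2,
    }
    print(f"{slot_utilization(sample):.0%}")  # prints 75%
```

A utilization that stays close to 100% suggests that the project needs more task managers or more slots per task manager before additional jobs are deployed.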

It is recommended to scale out the cluster based on the load on each software component. Compute-intensive components will require more members in the cluster, while storage-intensive components may require a larger fileshare allocation.

 

Multiple levels of scaling: 

There are multiple levels of scaling involved. The first level is the Kubernetes level. The second level is the per-component cluster level. The third level is the resources assigned to the application at the application level. Not all scaling is the same. Some levels increase replicas while others increase the resources per cluster.
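The difference between adding replicas and adding resources can be made concrete with a small model. This is only a sketch: the `Component` class, its fields and the numbers are hypothetical and are not taken from any platform API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Component:
    """Hypothetical description of one software component in the project."""
    name: str
    replicas: int
    cpu_per_replica: float
    memory_gib_per_replica: float

    def total_cpu(self) -> float:
        return self.replicas * self.cpu_per_replica

    def total_memory_gib(self) -> float:
        return self.replicas * self.memory_gib_per_replica


def scale_out(c: Component, extra_replicas: int) -> Component:
    """Add members to the cluster; each member keeps its size."""
    return replace(c, replicas=c.replicas + extra_replicas)


def scale_up(c: Component, factor: float) -> Component:
    """Give each existing member more cpu and memory."""
    return replace(
        c,
        cpu_per_replica=c.cpu_per_replica * factor,
        memory_gib_per_replica=c.memory_gib_per_replica * factor,
    )


if __name__ == "__main__":
    tm = Component("taskmanager", replicas=2, cpu_per_replica=1.0,
                   memory_gib_per_replica=2.0)
    print(scale_out(tm, 2))   # 4 replicas of the original size
    print(scale_up(tm, 2.0))  # 2 replicas, each twice as large
```

Both calls end up with the same total cpu and memory, but scaling out also adds task slots and failure domains, while scaling up leaves the member count unchanged.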
