Saturday, December 14, 2019

We discussed event time, we now discuss state. State is useful to resume task. Task can be done in various scopes and levels. There is an operator state bound to one operator instance. If there are multiple parallel instances, each instance maintains this state.
Keyed state is an Operator state that is bound to a tupe of <operator, key> where the key corresponds to a partition.  There is a state partition per key.
Keyed state comprises of key groups which are distribution units so that they can be transferred during parallelism. Each parallel instance works with the keys of one or more key groups.
State regardless of keyed state or operator state exists either as managed or raw.
Managed state usually is stored in internal hash tables or Rocks DB.  The state is encoded and written into the checkpoints.
Raw state is never shared outside the operator. When it is checkpointed, the raw state is dumped as byte range.  The Flink runtime knowns nothing about these raw bytes.
DataStream functions use managed state.  The serialization should be standard otherwise it may break between versions of Flink runtime
A program makes uses of state with methods such as
ValueState<T> which keeps a value that can be updated or retrieved
ListState<T> which keeps an iterable over state elements
ReducingState<T> which keeps a single value aggregated from the state
AggregatingState<T> which also keeps a single value aggregated from the state but not necessarily of the same type
FoldingState<T> which is similar to the above but makes use of a folding function
MapState<UK, UV> which keeps a list of mappings
Since state is stored externally, it has a handle often reffered to as the StateDescriptor similar to the notion of a filedescriptor
State is accessed only via the runtimecontext.
For example:
Sum = getRuntimeContext().getState(descriptor);

Also, every state has a time to live also called TTL.  State has to be ‘cleaned up’.

No comments:

Post a Comment