public class OperatorCoordinatorHolder extends Object implements OperatorCoordinatorCheckpointContext, AutoCloseable
OperatorCoordinatorHolder holds the OperatorCoordinator and manages all its
interactions with the remaining components. It provides the context and is responsible for
checkpointing and exactly once semantics.
The semantics are described under OperatorCoordinator.checkpointCoordinator(long,
CompletableFuture).
This implementation can handle only one checkpoint being triggered at a time. If another checkpoint is triggered before the triggering of the first one has completed or been aborted, this class throws an exception. That is in line with the capabilities of the Checkpoint Coordinator, which can handle multiple concurrent checkpoints on the TaskManagers, but only one concurrent triggering phase.
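The one-trigger-at-a-time rule can be illustrated with a small guard. This is a minimal sketch, not the holder's actual code: the class `SingleTriggerGuard`, its method names other than `abortCurrentTriggering`, the `NONE` marker, and the choice of `IllegalStateException` are all assumptions made for illustration.

```java
/** Hypothetical sketch: allows only one checkpoint triggering phase at a time. */
final class SingleTriggerGuard {
    // Assumed marker meaning "no checkpoint is currently being triggered".
    private static final long NONE = -1L;

    private long currentlyTriggering = NONE;

    /** Begins the triggering phase; rejects a second trigger that overlaps the first. */
    void startTriggering(long checkpointId) {
        if (currentlyTriggering != NONE) {
            // The exception type is an assumption; the source only says "an exception".
            throw new IllegalStateException(
                    "Checkpoint " + currentlyTriggering + " is still being triggered");
        }
        currentlyTriggering = checkpointId;
    }

    /** Ends the triggering phase once the checkpoint future has completed. */
    void completeTriggering(long checkpointId) {
        if (currentlyTriggering == checkpointId) {
            currentlyTriggering = NONE;
        }
    }

    /** Ends the triggering phase without a result, allowing a new trigger. */
    void abortCurrentTriggering() {
        currentlyTriggering = NONE;
    }

    public static void main(String[] args) {
        SingleTriggerGuard guard = new SingleTriggerGuard();
        guard.startTriggering(1L);
        try {
            guard.startTriggering(2L); // overlaps with checkpoint 1
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        guard.completeTriggering(1L);
        guard.startTriggering(2L); // accepted now
    }
}
```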
The mechanism for exactly once semantics is as follows:
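- Events from the coordinator pass through an OperatorEventValve. If we are not currently triggering a checkpoint, then events simply pass through.
- When the checkpoint future of the coordinator completes, the valve is closed. Events sent after that point are held back in the valve, because they belong to the time after the checkpoint.
- Once all coordinators have completed the checkpoint, the barriers are injected into the sources. After that (see afterSourceBarrierInjection(long)) the valves are opened again and the events are sent.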
IMPORTANT: A critical assumption is that all events from the scheduler to the Tasks are transported strictly in order. Events being sent from the coordinator after the checkpoint barrier was injected must not overtake the checkpoint barrier. This is currently guaranteed by Flink's RPC mechanism.
Consider this example:

    Coordinator one events: => a . . b . |trigger| . . |complete| . . c . . d . |barrier| . e . f
    Coordinator two events: => . . x . . |trigger| . . . . . . . . . .|complete||barrier| . . y . . z

Two coordinators trigger checkpoints at the same time. 'Coordinator Two' takes longer to complete, and in the meantime 'Coordinator One' sends more events.
'Coordinator One' emits events 'c' and 'd' after it finished its checkpoint, meaning the events must take place after the checkpoint. But they are emitted before the barrier injection, so the runtime task would see them before the checkpoint if they were transported immediately.
'Coordinator One' closes its valve as soon as the checkpoint future completes. Events 'c' and 'd' get held back in the valve. Once 'Coordinator Two' completes its checkpoint, the barriers are sent to the sources. Then the valves are opened, and events 'c' and 'd' can flow to the tasks where they are received after the barrier.
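The behavior of 'Coordinator One' in this example can be replayed with a toy valve. This is a minimal sketch under stated assumptions: `EventValveSketch` and its `send`, `shut`, and `open` methods are hypothetical names, events are plain strings, and the barrier is modeled as just another entry in the list that the task receives. It does not reproduce the real OperatorEventValve API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of a valve that buffers events while it is shut. */
final class EventValveSketch {
    private final Consumer<String> downstream;
    private final List<String> buffered = new ArrayList<>();
    private boolean shut;

    EventValveSketch(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    /** Forwards the event directly, or holds it back if the valve is shut. */
    void send(String event) {
        if (shut) {
            buffered.add(event);
        } else {
            downstream.accept(event);
        }
    }

    /** Called when the coordinator's checkpoint future completes. */
    void shut() {
        shut = true;
    }

    /** Called after the barrier injection; releases buffered events in order. */
    void open() {
        shut = false;
        for (String event : buffered) {
            downstream.accept(event);
        }
        buffered.clear();
    }

    public static void main(String[] args) {
        List<String> receivedByTask = new ArrayList<>();
        EventValveSketch valve = new EventValveSketch(receivedByTask::add);

        valve.send("a");
        valve.send("b");
        valve.shut();                        // |complete| of Coordinator One
        valve.send("c");                     // held back
        valve.send("d");                     // held back
        receivedByTask.add("|barrier|");     // barrier injected into the source
        valve.open();                        // after the barrier injection
        valve.send("e");
        valve.send("f");

        System.out.println(receivedByTask);  // [a, b, |barrier|, c, d, e, f]
    }
}
```

The task sees 'c' and 'd' only after the barrier, which matches the requirement that they belong to the time after the checkpoint.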
This component runs strictly in the Scheduler's main-thread-executor. All calls "from the outside" are either already in the main-thread-executor (when coming from Scheduler) or put into the main-thread-executor (when coming from the CheckpointCoordinator). We rely on the executor to preserve strict order of the calls.
Actions from the coordinator to the "outside world" (like completing a checkpoint and sending an event) are also enqueued back into the scheduler main-thread executor, strictly in order.
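The ordering guarantee that this threading model relies on can be sketched with a plain single-threaded executor from the JDK. This is an illustration only: `MainThreadOrderingSketch` is a hypothetical class, the JDK executor stands in for the scheduler's main-thread executor, and the logged strings are placeholders for the real calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: actions enqueued into one thread run in enqueue order. */
final class MainThreadOrderingSketch {

    static List<String> run() {
        // Stand-in for the scheduler's main-thread executor (an assumption).
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        List<String> log = Collections.synchronizedList(new ArrayList<>());

        // Calls arriving "from the outside" are enqueued instead of run inline.
        mainThread.execute(() -> log.add("checkpointCoordinator(1)"));
        // Actions going to the "outside world" are enqueued the same way.
        mainThread.execute(() -> log.add("send event c"));
        mainThread.execute(() -> log.add("afterSourceBarrierInjection(1)"));

        mainThread.shutdown();
        try {
            if (!mainThread.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("executor did not terminate in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Because a single thread drains the queue, the three actions are always logged in the order in which they were enqueued.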
| Modifier and Type | Method and Description |
|---|---|
| void | abortCurrentTriggering() |
| void | afterSourceBarrierInjection(long checkpointId) |
| void | checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> result) |
| void | close() |
| OperatorCoordinator | coordinator() |
| static OperatorCoordinatorHolder | create(org.apache.flink.util.SerializedValue<OperatorCoordinator.Provider> serializedProvider, ExecutionJobVertex jobVertex, ClassLoader classLoader, CoordinatorStore coordinatorStore) |
| int | currentParallelism() |
| void | handleEventFromOperator(int subtask, OperatorEvent event) |
| void | lazyInitialize(GlobalFailureHandler globalFailureHandler, org.apache.flink.runtime.concurrent.ComponentMainThreadExecutor mainThreadExecutor) |
| int | maxParallelism() |
| void | notifyCheckpointAborted(long checkpointId)<br>We override the method here to remove the checked exception. |
| void | notifyCheckpointComplete(long checkpointId)<br>We override the method here to remove the checked exception. |
| OperatorID | operatorId() |
| void | resetToCheckpoint(long checkpointId, byte[] checkpointData)<br>Resets the coordinator to the checkpoint with the given state. |
| void | start() |
| void | subtaskFailed(int subtask, Throwable reason) |
| void | subtaskReset(int subtask, long checkpointId)<br>Called if a task is recovered as part of a partial failover, meaning a failover handled by the scheduler's failover strategy (by default recovering a pipelined region). |
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface OperatorInfo: getIds

public void lazyInitialize(GlobalFailureHandler globalFailureHandler, org.apache.flink.runtime.concurrent.ComponentMainThreadExecutor mainThreadExecutor)

public OperatorCoordinator coordinator()

public OperatorID operatorId()
Specified by: operatorId in interface OperatorInfo

public int maxParallelism()
Specified by: maxParallelism in interface OperatorInfo

public int currentParallelism()
Specified by: currentParallelism in interface OperatorInfo

public void close() throws Exception
Specified by: close in interface AutoCloseable
Throws: Exception

public void handleEventFromOperator(int subtask, OperatorEvent event) throws Exception
Throws: Exception

public void subtaskReset(int subtask, long checkpointId)
Description copied from interface: OperatorCoordinatorCheckpointContext
Called if a task is recovered as part of a partial failover, meaning a failover handled by the scheduler's failover strategy (by default recovering a pipelined region). In contrast to this method, the OperatorCoordinatorCheckpointContext.resetToCheckpoint(long, byte[]) method is called in the case of a global failover, which is the case when the coordinator (JobManager) is recovered.
Specified by: subtaskReset in interface OperatorCoordinatorCheckpointContext

public void checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> result)
Specified by: checkpointCoordinator in interface OperatorCoordinatorCheckpointContext

public void notifyCheckpointComplete(long checkpointId)
Description copied from interface: OperatorCoordinatorCheckpointContext
We override the method here to remove the checked exception. See CheckpointListener.notifyCheckpointComplete(long) for the detailed semantics of the method.
Specified by: notifyCheckpointComplete in interface org.apache.flink.api.common.state.CheckpointListener
Specified by: notifyCheckpointComplete in interface OperatorCoordinatorCheckpointContext

public void notifyCheckpointAborted(long checkpointId)
Description copied from interface: OperatorCoordinatorCheckpointContext
We override the method here to remove the checked exception. See CheckpointListener.notifyCheckpointAborted(long) for the detailed semantics of the method.
Specified by: notifyCheckpointAborted in interface org.apache.flink.api.common.state.CheckpointListener
Specified by: notifyCheckpointAborted in interface OperatorCoordinatorCheckpointContext

public void resetToCheckpoint(long checkpointId, @Nullable byte[] checkpointData) throws Exception
Description copied from interface: OperatorCoordinatorCheckpointContext
Resets the coordinator to the checkpoint with the given state.
This method is called with a null state argument in the following situations:
In both cases, the coordinator should reset to an empty (new) state.
Specified by: resetToCheckpoint in interface OperatorCoordinatorCheckpointContext
Throws: Exception

public void afterSourceBarrierInjection(long checkpointId)
Specified by: afterSourceBarrierInjection in interface OperatorCoordinatorCheckpointContext

public void abortCurrentTriggering()
Specified by: abortCurrentTriggering in interface OperatorCoordinatorCheckpointContext

public static OperatorCoordinatorHolder create(org.apache.flink.util.SerializedValue<OperatorCoordinator.Provider> serializedProvider, ExecutionJobVertex jobVertex, ClassLoader classLoader, CoordinatorStore coordinatorStore) throws Exception
Throws: Exception

Copyright © 2014–2023 The Apache Software Foundation. All rights reserved.