@PublicEvolving public interface FileRecordFormat<T> extends Serializable, org.apache.flink.api.java.typeutils.ResultTypeQueryable<T>
This format is for cases where the readers need access to the file directly or need to create a custom
stream. For readers that can work directly on input streams, consider using the StreamFormat, which
is more robust.
The outer class FileRecordFormat acts mainly as a configuration holder and factory for the reader.
The actual reading is done by the FileRecordFormat.Reader, which is created based on an
input stream in the createReader(Configuration, Path, long, long) method
and restored (from checkpointed positions) in the method
restoreReader(Configuration, Path, long, long, long).
File splitting means dividing a file into multiple regions that can be read independently.
Whether a format supports splitting is indicated via the isSplittable() method.
Splitting has the potential to increase parallelism and performance, but poses additional constraints on the format readers: readers need to be able to find a consistent starting point within the file near the offset where the split starts (like the next record delimiter, a block start, or a sync marker). This is not necessarily possible for all formats, which is why splitting is optional.
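As an illustration of the "consistent starting point" constraint, the following is a minimal, self-contained sketch for newline-delimited records held in a byte array. It does not use any Flink classes; the class SplitAlignmentSketch and the method readSplit are hypothetical names, and the rule "a record belongs to the split in which its first byte lies" is an assumption chosen for the example, not something this interface prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not part of Flink: aligning a split to record delimiters. */
class SplitAlignmentSketch {

    /**
     * Returns the newline-delimited records owned by the split
     * [splitOffset, splitOffset + splitLength). Assumption for this sketch:
     * a record is owned by the split that contains its first byte.
     */
    static List<String> readSplit(byte[] file, long splitOffset, long splitLength) {
        int pos = (int) splitOffset;
        int end = (int) Math.min(file.length, splitOffset + splitLength);

        // Unless the split starts at the beginning of the file, move to the first
        // byte after the next delimiter. The search starts one byte early so that
        // a record beginning exactly at splitOffset is not skipped.
        if (pos > 0) {
            pos--;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            pos++;
        }

        List<String> records = new ArrayList<>();
        // Read every record that starts inside the split, even if it ends beyond it.
        while (pos < end) {
            int start = pos;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            records.add(new String(file, start, pos - start, StandardCharsets.UTF_8));
            pos++; // step over the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "aa\nbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(file, 0, 4));
        System.out.println(readSplit(file, 4, 6));
    }
}
```

With this rule, two adjacent splits together return every record exactly once, even when a split boundary falls in the middle of a record.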
Readers can optionally return their current position via
FileRecordFormat.Reader.getCheckpointedPosition(). This can improve recovery speed from
a checkpoint.
By default (if that method is not overridden or returns null), recovery from a checkpoint works by reading the split again and skipping the number of records that were processed before the checkpoint. Implementing this method allows formats to directly seek to that position, rather than read and discard a number of records.
The position is a combination of offset in the file and a number of records to skip after
this offset (see CheckpointedPosition). This helps formats that cannot describe all
record positions by an offset, for example because records are compressed in batches or stored
in a columnar layout (e.g., ORC, Parquet).
The default behavior can be viewed as returning a CheckpointedPosition where the offset
is always zero and only the CheckpointedPosition.getRecordsAfterOffset() is incremented
with each emitted record.
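The relationship between an exact position and the default position can be sketched as follows. This is a self-contained illustration, not Flink code: Position is a hypothetical stand-in for CheckpointedPosition, and the "offset" is simplified to the index of a block of records (standing in for the byte offset of a compressed batch), which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not part of Flink: resuming from offset plus records-to-skip. */
class CheckpointPositionSketch {

    /** Stand-in for a checkpointed position: an offset and a number of records to skip after it. */
    static final class Position {
        final long offset;
        final long recordsAfterOffset;

        Position(long offset, long recordsAfterOffset) {
            this.offset = offset;
            this.recordsAfterOffset = recordsAfterOffset;
        }
    }

    /**
     * Resumes reading: seeks to the block at position.offset, then reads and
     * discards position.recordsAfterOffset records before returning the rest.
     * Records inside a block have no individual offset in this sketch.
     */
    static List<String> resume(List<List<String>> blocks, Position position) {
        List<String> remaining = new ArrayList<>();
        long toSkip = position.recordsAfterOffset;
        for (int block = (int) position.offset; block < blocks.size(); block++) {
            for (String record : blocks.get(block)) {
                if (toSkip > 0) {
                    toSkip--; // already processed before the checkpoint
                } else {
                    remaining.add(record);
                }
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"),
                Arrays.asList("f"));

        // Four records (a, b, c, d) were emitted before the checkpoint.
        Position exact = new Position(1, 1);       // seek to block 1, skip one record
        Position defaultLike = new Position(0, 4); // offset zero, skip four records

        System.out.println(resume(blocks, exact));
        System.out.println(resume(blocks, defaultLike));
    }
}
```

Both positions resume at the same record; the exact position only avoids re-reading the first block.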
Like many other API classes in Flink, the outer class is serializable to support sending instances to distributed workers for parallel execution. This is purely short-term serialization for RPC and no instance of this will be long-term persisted in a serialized form.
Internally in the file source, the readers pass batches of records from the reading threads (that perform the typically blocking I/O operations) to the async mailbox threads that do the streaming and batch data processing. Passing records in batches (rather than one at a time) greatly reduces the thread-to-thread handover overhead.
This batching is by default based on a number of records. See RECORDS_PER_FETCH
to configure that handover batch size.
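The effect of batching on the handover can be sketched with plain java.util.concurrent classes. This is not the file source's actual implementation: the class BatchHandoverSketch, the method run, and the use of a bounded queue with an end marker are assumptions made for the illustration; recordsPerFetch plays the role of the configured batch size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch, not part of Flink: handing records between threads in batches. */
class BatchHandoverSketch {

    // End-of-input marker, recognized by identity (real batches are never empty).
    private static final List<String> END = new ArrayList<>();

    /** Passes the records from a reading thread to the caller and returns the size of each handover. */
    static List<Integer> run(List<String> records, int recordsPerFetch) {
        BlockingQueue<List<String>> handover = new ArrayBlockingQueue<>(2);

        // Reading thread: groups records into batches and hands over whole batches.
        Thread reader = new Thread(() -> {
            try {
                List<String> batch = new ArrayList<>(recordsPerFetch);
                for (String record : records) {
                    batch.add(record);
                    if (batch.size() == recordsPerFetch) {
                        handover.put(batch);
                        batch = new ArrayList<>(recordsPerFetch);
                    }
                }
                if (!batch.isEmpty()) {
                    handover.put(batch); // final, possibly smaller batch
                }
                handover.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        // Processing side: one queue operation per batch rather than per record.
        List<Integer> batchSizes = new ArrayList<>();
        try {
            while (true) {
                List<String> batch = handover.take();
                if (batch == END) {
                    break;
                }
                batchSizes.add(batch.size());
            }
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for records", e);
        }
        return batchSizes;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Five records with a batch size of two need three handovers instead of five; larger batches reduce the number of handovers further at the cost of holding more records in memory at once.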
| Modifier and Type | Interface and Description |
|---|---|
| static interface | FileRecordFormat.Reader<T><br>The actual reader that reads the records. |
| Modifier and Type | Field and Description |
|---|---|
| static org.apache.flink.configuration.ConfigOption<Integer> | RECORDS_PER_FETCH<br>Config option for the number of records to hand over in each fetch. |
| Modifier and Type | Method and Description |
|---|---|
| FileRecordFormat.Reader<T> | createReader(org.apache.flink.configuration.Configuration config, org.apache.flink.core.fs.Path filePath, long splitOffset, long splitLength)<br>Creates a new reader to read in this format. |
| org.apache.flink.api.common.typeinfo.TypeInformation<T> | getProducedType()<br>Gets the type produced by this format. |
| boolean | isSplittable()<br>Checks whether this format is splittable. |
| FileRecordFormat.Reader<T> | restoreReader(org.apache.flink.configuration.Configuration config, org.apache.flink.core.fs.Path filePath, long restoredOffset, long splitOffset, long splitLength)<br>Restores a reader from a checkpointed position. |
static final org.apache.flink.configuration.ConfigOption<Integer> RECORDS_PER_FETCH
The number should be large enough so that the thread-to-thread handover overhead is amortized across the records, but small enough so that these records together do not consume too much memory to be feasible.
FileRecordFormat.Reader<T> createReader(org.apache.flink.configuration.Configuration config, org.apache.flink.core.fs.Path filePath, long splitOffset, long splitLength) throws IOException
Creates a new reader to read in this format. See restoreReader(Configuration, Path, long, long, long) for details on when this method is also used during recovery.
Throws: IOException
FileRecordFormat.Reader<T> restoreReader(org.apache.flink.configuration.Configuration config, org.apache.flink.core.fs.Path filePath, long restoredOffset, long splitOffset, long splitLength) throws IOException
Restores a reader from a checkpointed position. This method is called when the reader previously returned from FileRecordFormat.Reader.getCheckpointedPosition() a value with non-negative
offset. That value is supplied as the restoredOffset.
If the reader never produced a CheckpointedPosition with a non-negative offset before, then
this method is not called. Instead, the reader is created in the same way as a fresh reader via the method
createReader(Configuration, Path, long, long), and the appropriate number of
records are read and discarded to position the reader at the checkpointed position.
Throws: IOException
boolean isSplittable()
Checks whether this format is splittable. See top-level JavaDocs (section "Splitting") for details.
Copyright © 2014–2020 The Apache Software Foundation. All rights reserved.