public final class ChunkerBuilder
extends java.lang.Object
Builder for Chunker instances.
The builder offers highly customizable content-defined-chunking algorithms. Offered algorithms are:
- FastCDC (original) - Wen Xia et al. (publication)
- FastCDC Rust - Nathan Fiedler (source), slightly modified version of the original algorithm
- Fixed-Size-Chunking (FSC) - Baseline, chunks the data stream every x-th byte, without interpreting the content
Chunker.
A custom algorithm can be set by using setChunker(Chunker) for full control or setChunkerCore(IterativeStreamChunkerCore)
for a simplified interface.
setChunkerOption(ChunkerOption) can be used to choose from the predefined algorithms.
The algorithms aim for an expected chunk size given by setExpectedChunkSize(int),
with a minimal chunk size derived from setMinimalChunkSizeFactor(double) and a maximal chunk size derived from setMaximalChunkSizeFactor(double). Both factors are applied to the expected chunk size.
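As a rough illustration of how the two factors relate to the expected size, the following sketch derives the size bounds for the default values listed further down on this page. The class ChunkSizeBounds and its method are hypothetical, and the rounding (a plain cast to int) is an assumption, not necessarily what the builder does internally.

```java
final class ChunkSizeBounds {
    // Hypothetical helper: scales the expected size by a factor.
    // Truncating to int is an assumed rounding mode.
    static int scale(int expectedChunkSize, double factor) {
        return (int) (expectedChunkSize * factor);
    }

    public static void main(String[] args) {
        int expected = 8 * 1024;             // default expected size
        int minimal = scale(expected, 0.25); // default minimal size factor
        int maximal = scale(expected, 8);    // default maximal size factor
        System.out.println(minimal + " <= chunk size <= " + maximal);
        // prints "2048 <= chunk size <= 65536"
    }
}
```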
Most of the algorithms internally use a hash table as a source of predictable noise to steer the algorithm. A custom
table can be provided with setHashTable(long[]).
Alternatively, setHashTableOption(HashTableOption) can be used to choose from predefined tables.
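To make the role of the table more concrete, here is a minimal sketch of a gear-style rolling fingerprint, the kind of hash FastCDC is built on, driven by a table with one entry per byte value. The table contents are generated from an arbitrary seed with java.util.Random purely for illustration; they are not the values of any predefined table, and the class GearHashSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

final class GearHashSketch {
    // Builds an illustrative table with one pseudo-random 64-bit value per byte value.
    static long[] randomTable(long seed) {
        Random random = new Random(seed);
        long[] table = new long[256];
        for (int i = 0; i < table.length; i++) {
            table[i] = random.nextLong();
        }
        return table;
    }

    // Gear-style update: shift the fingerprint and add the table entry of the next byte.
    static long update(long fingerprint, byte next, long[] table) {
        return (fingerprint << 1) + table[next & 0xFF];
    }

    public static void main(String[] args) {
        long[] table = randomTable(42L); // arbitrary seed, assumption for this sketch
        long fingerprint = 0L;
        for (byte b : "example data".getBytes(StandardCharsets.UTF_8)) {
            fingerprint = update(fingerprint, b, table);
        }
        System.out.println(Long.toHexString(fingerprint));
    }
}
```

Because each byte value always maps to the same table entry, identical content yields identical fingerprints, which is what lets cut-points depend on the data rather than on its position in the stream.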
The algorithms are heavily steered by masks, which define the cut-points. By default, the masks are generated randomly using
a fixed seed that can be changed with setMaskGenerationSeed(long). Several techniques for generating masks are available;
one can be selected with setMaskOption(MaskOption).
To keep the distribution of chunk sizes as close as possible to the expected size, normalization levels are used
during mask generation. The level can be changed with setNormalizationLevel(int). The higher the level, the closer
the sizes are to the expected size, at the cost of a worse deduplication rate.
Alternatively, masks can be set manually using setMaskSmall(long) for the mask used when the chunk is still
smaller than the expected size and setMaskLarge(long) for bigger chunks respectively.
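The effect of the normalization level on the two masks can be sketched as follows. This is a minimal illustration under the assumption, taken from the idea of normalized chunking in FastCDC, that a cut-point is declared when (fingerprint & mask) == 0, so a mask with k set bits matches a uniformly random fingerprint with probability 2^-k. The class MaskSketch, its methods and the way the bits are placed are hypothetical; the sketch will not reproduce the masks the builder generates, even with the same seed.

```java
import java.util.Random;

final class MaskSketch {
    // Returns a mask with exactly 'oneBits' bits set, at positions drawn from the generator.
    static long randomMask(int oneBits, Random random) {
        if (oneBits < 0 || oneBits > 64) {
            throw new IllegalArgumentException("oneBits must be between 0 and 64");
        }
        long mask = 0L;
        while (Long.bitCount(mask) < oneBits) {
            mask |= 1L << random.nextInt(64);
        }
        return mask;
    }

    // Base number of mask bits for an expected size: floor(log2(size)).
    static int bitsFor(int expectedChunkSize) {
        return 31 - Integer.numberOfLeadingZeros(expectedChunkSize);
    }

    public static void main(String[] args) {
        int bits = bitsFor(8 * 1024); // 13 bits for the default expected size
        int level = 2;                // default normalization level
        Random random = new Random(941568351L);
        // More bits: cuts are less likely while the chunk is smaller than expected.
        long maskSmall = randomMask(bits + level, random);
        // Fewer bits: cuts are more likely once the chunk is larger than expected.
        long maskLarge = randomMask(bits - level, random);
        System.out.println(Long.bitCount(maskSmall) + " bits: " + Long.toHexString(maskSmall));
        System.out.println(Long.bitCount(maskLarge) + " bits: " + Long.toHexString(maskLarge));
    }
}
```

Under these assumptions, a higher level widens the gap between the two masks, which pushes chunk sizes toward the expected size.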
After a chunk has been read, a hash is generated based on its content. The algorithm used for this can be
set with setHashMethod(String); it has to be supported and accepted by MessageDigest.
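For reference, hashing a chunk's bytes with MessageDigest might look like the sketch below. The class ChunkHashSketch and the hex formatting are illustrative assumptions; how the library stores or exposes the hash is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChunkHashSketch {
    // Hashes the chunk content with the given algorithm and returns the digest as lowercase hex.
    static String hexHash(byte[] chunkData, String hashMethod) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(hashMethod);
        } catch (NoSuchAlgorithmException e) {
            // The name was not accepted by MessageDigest on this platform.
            throw new IllegalArgumentException("Unsupported hash method: " + hashMethod, e);
        }
        byte[] hash = digest.digest(chunkData);
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] chunkData = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(hexHash(chunkData, "SHA-1"));
        // prints a9993e364706816aba3e25717850c26c9cd0d89d
    }
}
```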
Finally, a chunker using the selected properties can be created using build().
The default configuration of the builder is:
- Chunker option: ChunkerOption.FAST_CDC
- Expected size: 8 * 1024
- Minimal size factor: 0.25
- Maximal size factor: 8
- Hash table option: HashTableOption.RTPAL
- Mask generation seed: 941568351
- Mask option: MaskOption.FAST_CDC
- Normalization level: 2
- Hash method: SHA-1
fastCdc(), nlFiedlerRust() and fsc() can be used to get a configuration
that uses the given algorithms as originally proposed.
Author:
- Daniel Tischner <zabuza.dev@gmail.com>
-
Constructor Summary
Constructors
- ChunkerBuilder()
Method Summary
Modifier and Type | Method | Description
Chunker | build() | Builds a chunker using the set properties.
ChunkerBuilder | fastCdc() | Sets the builder to a configuration for the original FastCDC algorithm.
ChunkerBuilder | fsc() | Sets the builder to a configuration for the baseline Fixed-Size-Chunking algorithm.
ChunkerBuilder | nlFiedlerRust() | Sets the builder to a configuration for the modified FastCDC algorithm of Nathan Fiedler's Rust implementation.
ChunkerBuilder | setChunker(Chunker chunker) | Sets the chunker to use.
ChunkerBuilder | setChunkerCore(IterativeStreamChunkerCore chunkerCore) | Sets the core to use for an iterative stream chunker.
ChunkerBuilder | setChunkerOption(ChunkerOption chunkerOption) | Sets the chunker option to use.
ChunkerBuilder | setExpectedChunkSize(int expectedChunkSize) | Sets the expected size of chunks, in bytes.
ChunkerBuilder | setHashMethod(java.lang.String hashMethod) | Sets the hash method to use for representing the data of chunks.
ChunkerBuilder | setHashTable(long[] hashTable) | Sets the hash table to use by the chunker algorithm.
ChunkerBuilder | setHashTableOption(HashTableOption hashTableOption) | Sets the option to use for the hash table used by the chunker algorithm.
ChunkerBuilder | setMaskGenerationSeed(long maskGenerationSeed) | Sets the seed to use for mask generation.
ChunkerBuilder | setMaskLarge(long maskLarge) | Sets the mask for the fingerprint that is used for bigger windows, to increase the likelihood of a split.
ChunkerBuilder | setMaskOption(MaskOption maskOption) | Sets the algorithm used to generate the masks used by certain chunkers.
ChunkerBuilder | setMaskSmall(long maskSmall) | Sets the mask for the fingerprint that is used for smaller windows, to decrease the likelihood of a split.
ChunkerBuilder | setMaximalChunkSizeFactor(double maximalChunkSizeFactor) | Sets the factor to apply to the expected chunk size to receive the maximal chunk size.
ChunkerBuilder | setMinimalChunkSizeFactor(double minimalChunkSizeFactor) | Sets the factor to apply to the expected chunk size to receive the minimal chunk size.
ChunkerBuilder | setNormalizationLevel(int normalizationLevel) | Sets the normalization level used for choosing the masks in certain chunkers.
-
Constructor Details
-
ChunkerBuilder
public ChunkerBuilder()
-
-
Method Details
-
build
Builds a chunker using the set properties.
Returns:
- A chunker using the set properties
-
fastCdc
Sets the builder to a configuration for the original FastCDC algorithm.
Returns:
- This builder instance
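For orientation, the following is a simplified sketch in the spirit of FastCDC's normalized chunking; it is not the library's implementation. It assumes a gear-style fingerprint, a cut-point whenever (fingerprint & mask) == 0, a stricter mask below the expected size and a looser one above it. The class NormalizedChunkingSketch, the random table and the two masks are hypothetical values chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class NormalizedChunkingSketch {
    // Finds the length of the next chunk that starts at 'start'.
    static int nextLength(byte[] data, int start, long[] table, int minSize, int expectedSize,
            int maxSize, long maskSmall, long maskLarge) {
        int remaining = data.length - start;
        if (remaining <= minSize) {
            return remaining; // too little data left, emit it as the final chunk
        }
        int limit = Math.min(remaining, maxSize);
        long fingerprint = 0L;
        // Positions below the minimal size are skipped, no cut-point is allowed there.
        for (int i = minSize; i < limit; i++) {
            fingerprint = (fingerprint << 1) + table[data[start + i] & 0xFF];
            // Stricter mask before the expected size, looser mask after it.
            long mask = i < expectedSize ? maskSmall : maskLarge;
            if ((fingerprint & mask) == 0L) {
                return i + 1; // cut right after the byte that produced the match
            }
        }
        return limit; // no match, cut at the maximal size or at the end of the data
    }

    // Splits the data and returns the chunk lengths in order.
    static List<Integer> chunkLengths(byte[] data, long[] table, int minSize, int expectedSize,
            int maxSize, long maskSmall, long maskLarge) {
        List<Integer> lengths = new ArrayList<>();
        int start = 0;
        while (start < data.length) {
            int length = nextLength(data, start, table, minSize, expectedSize, maxSize,
                    maskSmall, maskLarge);
            lengths.add(length);
            start += length;
        }
        return lengths;
    }

    public static void main(String[] args) {
        Random random = new Random(7L); // arbitrary seed for demo data and table
        byte[] data = new byte[200_000];
        random.nextBytes(data);
        long[] table = new long[256];
        for (int i = 0; i < table.length; i++) {
            table[i] = random.nextLong();
        }
        int expected = 8 * 1024;
        // Illustrative masks only: the 15 and 11 highest bits set (13 bits plus/minus level 2).
        long maskSmall = -1L << (64 - 15);
        long maskLarge = -1L << (64 - 11);
        System.out.println(chunkLengths(data, table, expected / 4, expected, expected * 8,
                maskSmall, maskLarge));
    }
}
```

In this sketch every chunk except possibly the last is longer than the minimal size and no chunk exceeds the maximal size.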
-
fsc
Sets the builder to a configuration for the baseline Fixed-Size-Chunking algorithm.
Returns:
- This builder instance
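Fixed-size chunking itself is simple enough to sketch in a few lines. The class below is a hypothetical stand-alone illustration of cutting a byte array every chunkSize bytes without looking at the content; it does not use the library's Chunker interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FixedSizeChunkingSketch {
    // Splits data into consecutive chunks of chunkSize bytes; the last chunk may be shorter.
    static List<byte[]> chunk(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int end = Math.min(data.length, offset + chunkSize);
            chunks.add(Arrays.copyOfRange(data, offset, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<byte[]> chunks = chunk(new byte[20_000], 8 * 1024);
        for (byte[] c : chunks) {
            System.out.println(c.length);
        }
        // prints 8192, 8192 and 3616 on separate lines
    }
}
```

Since the cut-points depend only on the offset, inserting a single byte near the start of the data shifts every later chunk, which is why this approach serves as a baseline only.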
-
nlFiedlerRust
Sets the builder to a configuration for the modified FastCDC algorithm of Nathan Fiedler's Rust implementation.
Returns:
- This builder instance
-
setChunker
Sets the chunker to use. Has priority over setChunkerCore(IterativeStreamChunkerCore) and setChunkerOption(ChunkerOption).
Parameters:
- chunker - The chunker to use
Returns:
- This builder instance
-
setChunkerCore
Sets the core to use for an iterative stream chunker. Has priority over setChunkerOption(ChunkerOption).
Parameters:
- chunkerCore - The core to use
Returns:
- This builder instance
-
setChunkerOption
Sets the chunker option to use.
Parameters:
- chunkerOption - The option to use
Returns:
- This builder instance
-
setExpectedChunkSize
Sets the expected size of chunks, in bytes.
Parameters:
- expectedChunkSize - The expected size of chunks, in bytes. Must be positive.
Returns:
- This builder instance
-
setHashMethod
Sets the hash method to use for representing the data of chunks.
Parameters:
- hashMethod - The hash method to use, has to be accepted and supported by MessageDigest.
Returns:
- This builder instance
-
setHashTable
Sets the hash table to use by the chunker algorithm. Has priority over setHashTableOption(HashTableOption).
Parameters:
- hashTable - The hash table to use. Must have a length of exactly 256, one hash per byte value.
Returns:
- This builder instance
-
setHashTableOption
Sets the option to use for the hash table used by the chunker algorithm.
Parameters:
- hashTableOption - The option to use for the hash table
Returns:
- This builder instance
-
setMaskGenerationSeed
Sets the seed to use for mask generation.
Parameters:
- maskGenerationSeed - The seed to use
Returns:
- This builder instance
-
setMaskLarge
Sets the mask for the fingerprint that is used for bigger windows, to increase the likelihood of a split.
Parameters:
- maskLarge - The mask to set
Returns:
- This builder instance
-
setMaskOption
Sets the algorithm used to generate the masks used by certain chunkers.
Parameters:
- maskOption - The mask option to set
Returns:
- This builder instance
-
setMaskSmall
Sets the mask for the fingerprint that is used for smaller windows, to decrease the likelihood of a split.
Parameters:
- maskSmall - The mask to set
Returns:
- This builder instance
-
setMaximalChunkSizeFactor
Sets the factor to apply to the expected chunk size to receive the maximal chunk size.
Parameters:
- maximalChunkSizeFactor - The factor to apply, must be greater than or equal to 1.0
Returns:
- This builder instance
-
setMinimalChunkSizeFactor
Sets the factor to apply to the expected chunk size to receive the minimal chunk size.
Parameters:
- minimalChunkSizeFactor - The factor to apply, must be less than or equal to 1.0
Returns:
- This builder instance
-
setNormalizationLevel
Sets the normalization level used for choosing the masks in certain chunkers.
Parameters:
- normalizationLevel - The normalization level to use for choosing the masks in certain chunkers, must be positive.
Returns:
- This builder instance
-