Class ChunkerBuilder

java.lang.Object
de.zabuza.fastcdc4j.external.chunking.ChunkerBuilder

public final class ChunkerBuilder
extends java.lang.Object
Builder for convenient construction of Chunker instances.

The builder offers highly customizable content-defined-chunking algorithms. Offered algorithms are:

  • FastCDC (original) - Wen Xia et al. (publication)
  • FastCDC Rust - Nathan Fiedler (source), a slightly modified version of the original algorithm
  • Fixed-Size-Chunking (FSC) - Baseline, chunks the data stream every x-th byte, without interpreting the content
It is also possible to add custom algorithms by simply implementing Chunker. A custom algorithm can be set by using setChunker(Chunker) for full control or setChunkerCore(IterativeStreamChunkerCore) for a simplified interface. setChunkerOption(ChunkerOption) can be used to choose from the predefined algorithms.
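As an illustration of the Fixed-Size-Chunking baseline described above, the following minimal sketch computes the chunk boundaries for a buffer of a given length. The class and method names are hypothetical and not part of this library; the sketch only shows the idea of cutting every x-th byte without looking at the content.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of Fixed-Size-Chunking; not part of the library. */
final class FixedSizeChunkingSketch {

    /** Returns the exclusive end offset of every chunk for data of the given length. */
    static List<Integer> cutPoints(int dataLength, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Integer> ends = new ArrayList<>();
        // Cut every chunkSize bytes, regardless of the content.
        for (int end = chunkSize; end < dataLength; end += chunkSize) {
            ends.add(end);
        }
        if (dataLength > 0) {
            ends.add(dataLength); // the last chunk may be shorter than chunkSize
        }
        return ends;
    }

    public static void main(String[] args) {
        System.out.println(cutPoints(10, 4)); // prints [4, 8, 10]
    }
}
```

Because the boundaries depend only on the offset, inserting a single byte at the front of the data shifts every later chunk, which is why FSC serves as a baseline rather than as a deduplication-friendly algorithm.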

The algorithms aim for an expected chunk size given by setExpectedChunkSize(int), bounded by a minimal chunk size derived from setMinimalChunkSizeFactor(double) and a maximal chunk size derived from setMaximalChunkSizeFactor(double).
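The relation between the expected size and the two factors can be sketched as follows. The helper names, the example values (8192, 0.25, 8.0) and the truncating cast are assumptions made for illustration; the library may round differently.

```java
/** Hypothetical illustration of how size factors relate to the expected chunk size. */
final class ChunkSizeBoundsSketch {

    /** Minimal chunk size: expected size scaled by a factor of at most 1.0. */
    static int minimalSize(int expectedSize, double minimalFactor) {
        return (int) (expectedSize * minimalFactor);
    }

    /** Maximal chunk size: expected size scaled by a factor of at least 1.0. */
    static int maximalSize(int expectedSize, double maximalFactor) {
        return (int) (expectedSize * maximalFactor);
    }

    public static void main(String[] args) {
        int expectedSize = 8 * 1024; // example value, not a documented default
        System.out.println(minimalSize(expectedSize, 0.25)); // prints 2048
        System.out.println(maximalSize(expectedSize, 8.0));  // prints 65536
    }
}
```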

Most of the algorithms internally use a hash table as a source of predictable noise to steer the algorithm. A custom table can be provided by setHashTable(long[]). Alternatively, setHashTableOption(HashTableOption) can be used to choose from predefined tables.
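To show what such a table is used for, here is a minimal sketch of a gear-style rolling fingerprint as described in the FastCDC publication: each input byte selects one of 256 table entries, which is added to the shifted fingerprint. The class, the method names and the seeded random table are assumptions for illustration and do not reproduce the library's internals or its predefined tables.

```java
import java.util.Random;

/** Hypothetical illustration of a gear-style fingerprint driven by a 256-entry table. */
final class GearFingerprintSketch {

    /** Creates a reproducible table with one pseudo-random value per byte value. */
    static long[] randomTable(long seed) {
        Random random = new Random(seed);
        long[] table = new long[256];
        for (int i = 0; i < table.length; i++) {
            table[i] = random.nextLong();
        }
        return table;
    }

    /** Folds all bytes into a fingerprint: shift left by one, then add the table entry. */
    static long fingerprint(byte[] data, long[] table) {
        if (table.length != 256) {
            throw new IllegalArgumentException("table must have exactly 256 entries");
        }
        long fingerprint = 0L;
        for (byte b : data) {
            // b & 0xFF maps the signed byte to an index in the range 0..255.
            fingerprint = (fingerprint << 1) + table[b & 0xFF];
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        long[] table = randomTable(42L); // arbitrary example seed
        byte[] data = "hello world".getBytes();
        System.out.println(Long.toHexString(fingerprint(data, table)));
    }
}
```

Since the table is fixed, the same bytes always yield the same fingerprint, which is what makes the resulting cut-points depend on the content rather than on the position in the stream.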

The algorithms are heavily steered by masks which define the cut-points. By default, the masks are generated randomly using a fixed seed that can be changed by using setMaskGenerationSeed(long). Different techniques are available to generate masks; they can be selected using setMaskOption(MaskOption). To achieve a distribution of chunk sizes as close as possible to the expected size, normalization levels are used during mask generation. setNormalizationLevel(int) is used to change the level. The higher the level, the closer the sizes are to the expected size, at the cost of a worse deduplication rate. Alternatively, masks can be set manually using setMaskSmall(long) for the mask used while the chunk is still smaller than the expected size and setMaskLarge(long) for the mask used once the chunk is bigger.
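The interplay of seed, normalization level and the two masks can be sketched as follows, based on the normalized chunking idea from the FastCDC publication: the small-chunk mask has more one-bits (a split is less likely), the large-chunk mask has fewer (a split is more likely). All names, the bit placement strategy and the example values are assumptions for illustration; the library's mask options may generate masks differently.

```java
import java.util.Random;

/** Hypothetical illustration of seeded mask generation and mask-based cut-point checks. */
final class MaskSketch {

    /** Builds a mask with exactly bitCount one-bits at pseudo-random positions. */
    static long randomMask(int bitCount, Random random) {
        if (bitCount < 0 || bitCount > 64) {
            throw new IllegalArgumentException("bitCount must be in 0..64");
        }
        long mask = 0L;
        while (Long.bitCount(mask) < bitCount) {
            mask |= 1L << random.nextInt(64);
        }
        return mask;
    }

    /** Floor of the base-2 logarithm of a positive value. */
    static int log2(int value) {
        return 31 - Integer.numberOfLeadingZeros(value);
    }

    /** A position is a cut-point if all masked fingerprint bits are zero. */
    static boolean isCutPoint(long fingerprint, int currentSize, int expectedSize,
            long maskSmall, long maskLarge) {
        long mask = currentSize < expectedSize ? maskSmall : maskLarge;
        return (fingerprint & mask) == 0L;
    }

    public static void main(String[] args) {
        int expectedSize = 8 * 1024;    // example value
        int normalizationLevel = 2;     // example value
        Random random = new Random(42L); // fixed seed keeps the masks reproducible

        int bits = log2(expectedSize);  // 13 for 8 KiB
        long maskSmall = randomMask(bits + normalizationLevel, random);
        long maskLarge = randomMask(bits - normalizationLevel, random);

        System.out.println(Long.bitCount(maskSmall)); // prints 15
        System.out.println(Long.bitCount(maskLarge)); // prints 11
    }
}
```

For a uniformly distributed fingerprint, a mask with n one-bits matches with probability 2^-n, so raising the normalization level makes early splits rarer and late splits more frequent, pulling chunk sizes towards the expected size.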

After a chunk has been read, a hash is generated based on its content. The algorithm used for this process can be set by setHashMethod(String); it has to be supported and accepted by MessageDigest.
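The following sketch shows how a hash method name is resolved through MessageDigest and applied to the bytes of a chunk. The class and method names are hypothetical, and "SHA-256" is only an example of a name that MessageDigest accepts, not necessarily the builder's default.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of hashing chunk content with a MessageDigest algorithm. */
final class ChunkHashSketch {

    /** Hashes the chunk data with the given method and returns lowercase hex. */
    static String hexHash(byte[] chunkData, String hashMethod) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(hashMethod);
        } catch (NoSuchAlgorithmException e) {
            // The name is not supported by any installed security provider.
            throw new IllegalArgumentException("Unsupported hash method: " + hashMethod, e);
        }
        byte[] hash = digest.digest(chunkData);
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] chunkData = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(hexHash(chunkData, "SHA-256"));
    }
}
```

An unsupported name fails at lookup time, so it is worth validating the hash method before chunking large inputs.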

Finally, a chunker using the selected properties can be created using build().

The default configuration of the builder is:

The methods fastCdc(), nlFiedlerRust() and fsc() can be used to get a configuration that uses the given algorithms as originally proposed.
Author:
Daniel Tischner <zabuza.dev@gmail.com>
  • Constructor Details

  • Method Details

    • build

      public Chunker build()
      Builds a chunker using the set properties.
      Returns:
      A chunker using the set properties
    • fastCdc

      public ChunkerBuilder fastCdc()
      Sets the builder to a configuration for the original FastCDC algorithm.
      Returns:
      This builder instance
    • fsc

      public ChunkerBuilder fsc()
      Sets the builder to a configuration for the baseline Fixed-Size-Chunking algorithm.
      Returns:
      This builder instance
    • nlFiedlerRust

      public ChunkerBuilder nlFiedlerRust()
      Sets the builder to a configuration for the modified FastCDC algorithm of Nathan Fiedler's Rust implementation.
      Returns:
      This builder instance
    • setChunker

      public ChunkerBuilder setChunker(Chunker chunker)
      Sets the chunker to use. Has priority over setChunkerCore(IterativeStreamChunkerCore) and setChunkerOption(ChunkerOption).
      Parameters:
      chunker - The chunker to use
      Returns:
      This builder instance
    • setChunkerCore

      public ChunkerBuilder setChunkerCore(IterativeStreamChunkerCore chunkerCore)
      Sets the core to use for an iterative stream chunker. Has priority over setChunkerOption(ChunkerOption).
      Parameters:
      chunkerCore - The core to use
      Returns:
      This builder instance
    • setChunkerOption

      public ChunkerBuilder setChunkerOption(ChunkerOption chunkerOption)
      Sets the chunker option to use.
      Parameters:
      chunkerOption - The option to use
      Returns:
      This builder instance
    • setExpectedChunkSize

      public ChunkerBuilder setExpectedChunkSize(int expectedChunkSize)
      Sets the expected size of chunks, in bytes.
      Parameters:
      expectedChunkSize - The expected size of chunks, in bytes. Must be positive.
      Returns:
      This builder instance
    • setHashMethod

      public ChunkerBuilder setHashMethod(java.lang.String hashMethod)
      Sets the hash method to use for representing the data of chunks.
      Parameters:
      hashMethod - The hash method to use, has to be accepted and supported by MessageDigest.
      Returns:
      This builder instance
    • setHashTable

      public ChunkerBuilder setHashTable(long[] hashTable)
      Sets the hash table to use by the chunker algorithm. Has priority over setHashTableOption(HashTableOption).
      Parameters:
      hashTable - The hash table to use. Must have a length of exactly 256, one hash per byte value.
      Returns:
      This builder instance
    • setHashTableOption

      public ChunkerBuilder setHashTableOption(HashTableOption hashTableOption)
      Sets the option to use for the hash table used by the chunker algorithm.
      Parameters:
      hashTableOption - The option to use for the hash table
      Returns:
      This builder instance
    • setMaskGenerationSeed

      public ChunkerBuilder setMaskGenerationSeed(long maskGenerationSeed)
      Sets the seed to use for mask generation.
      Parameters:
      maskGenerationSeed - The seed to use
      Returns:
      This builder instance
    • setMaskLarge

      public ChunkerBuilder setMaskLarge(long maskLarge)
      Sets the mask for the fingerprint that is used for bigger windows, to increase the likelihood of a split.
      Parameters:
      maskLarge - The mask to set
      Returns:
      This builder instance
    • setMaskOption

      public ChunkerBuilder setMaskOption(MaskOption maskOption)
      Sets the algorithm used to generate the masks used by certain chunkers.
      Parameters:
      maskOption - The mask option to set
      Returns:
      This builder instance
    • setMaskSmall

      public ChunkerBuilder setMaskSmall(long maskSmall)
      Sets the mask for the fingerprint that is used for smaller windows, to decrease the likelihood of a split.
      Parameters:
      maskSmall - The mask to set
      Returns:
      This builder instance
    • setMaximalChunkSizeFactor

      public ChunkerBuilder setMaximalChunkSizeFactor(double maximalChunkSizeFactor)
      Sets the factor by which the expected chunk size is multiplied to obtain the maximal chunk size.
      Parameters:
      maximalChunkSizeFactor - The factor to apply, must be greater than or equal to 1.0
      Returns:
      This builder instance
    • setMinimalChunkSizeFactor

      public ChunkerBuilder setMinimalChunkSizeFactor(double minimalChunkSizeFactor)
      Sets the factor by which the expected chunk size is multiplied to obtain the minimal chunk size.
      Parameters:
      minimalChunkSizeFactor - The factor to apply, must be less than or equal to 1.0
      Returns:
      This builder instance
    • setNormalizationLevel

      public ChunkerBuilder setNormalizationLevel(int normalizationLevel)
      Sets the normalization level used for choosing the masks in certain chunkers.
      Parameters:
      normalizationLevel - The normalization level to use for choosing the masks in certain chunkers, must be positive.
      Returns:
      This builder instance