zChunk provides three main chunking strategies to cover different document processing needs:
- NaiveChunk (fixed-size chunking):
  - How it works: splits text mechanically at a preset number of characters
  - Scenario: simple, well-formatted documents (e.g. log files)
  - Advantages: fast processing, low resource consumption
- SemanticChunk (embedding similarity chunking):
  - How it works: clustering analysis based on text embedding vectors
  - Scenario: ordinary documents where paragraph integrity must be preserved
  - Advantages: balances performance and semantic coherence
- zChunk Algorithm (LLM hint chunking):
  - How it works: uses Llama-70B to generate intelligent segmentation hints
  - Scenario: complex professional documents (e.g. legal contracts)
  - Advantages: captures semantic boundaries accurately, supports dynamic adaptation
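To make the first two strategies in the list above concrete, here is a minimal Python sketch. It is not zChunk's actual code: the function names are hypothetical, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the similarity strategy is simplified to comparing adjacent sentences against a threshold rather than running a full clustering analysis. The LLM-based strategy is left out because it needs access to a model.

```python
import math
import re
from collections import Counter


def naive_chunk(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embed(sentence: str) -> Counter:
    """Assumption: a word-count vector standing in for a real embedding."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunk(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group adjacent sentences, opening a new chunk when similarity
    to the previous sentence falls below the threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(toy_embed(chunks[-1][-1]), toy_embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks
```

The fixed-size splitter ignores content entirely, which is why it is cheap but can cut through a sentence. The similarity splitter pays for an embedding per sentence in exchange for boundaries that follow topic changes.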
The hyperparameter tuning pipeline lets you switch freely among these three strategies. The recommended approach is to start with the simplest strategy and move up to a more capable one only as document complexity requires.
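One way to picture the strategy choice as a tunable hyperparameter is the sketch below. The source does not describe the pipeline's actual interface, so everything here is an assumption: the `CANDIDATES` registry, the `select_strategy` helper, and the caller-supplied scoring function are hypothetical, and paragraph splitting stands in for the more expensive semantic strategies.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def fixed_size(size: int) -> Chunker:
    """Build a fixed-size chunker for the given character count."""
    def run(text: str) -> list[str]:
        return [text[i:i + size] for i in range(0, len(text), size)]
    return run


def by_paragraph(text: str) -> list[str]:
    """Stand-in for a semantic strategy: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# Hypothetical search space: each entry is one strategy configuration.
CANDIDATES: dict[str, Chunker] = {
    "naive-20": fixed_size(20),
    "naive-40": fixed_size(40),
    "paragraph": by_paragraph,
}


def select_strategy(text: str, score: Callable[[list[str]], float]) -> str:
    """Return the name of the candidate whose chunks score highest."""
    return max(CANDIDATES, key=lambda name: score(CANDIDATES[name](text)))
```

A caller would pass a sample document and a quality metric, for example the fraction of chunks that end on sentence punctuation, and use the returned name for the rest of the corpus.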
This answer comes from the article "zChunk: a generic semantic chunking strategy based on Llama-70B".































