Document Screening Principle
Yek uses a multi-layer filtering mechanism to ensure that high-value text content is processed:
- Basic Filtration::
- Strict enforcement of .gitignore rules
- Automatic skipping of binary files (by content detection)
- Exclude oversized files (default threshold is configurable)
- Advanced Screening::
- Analyzing Git commit frequency to identify core files
- Determining file activity in conjunction with last modification time
- Support for extending filtering rules via the yek.toml configuration file
The design ensures processing efficiency while focusing on the source code and documentation resources that are most valuable for LLM training.
This answer comes from the articleYek: reading git repository text files and quickly chunking them for use in large modelsThe































