The complete knowledge base construction process
- Data preprocessing: convert PDF/Word documents to JSON format (each entry contains entity and description fields)
- Vectorization: run the generate_kb_embeddings.py script, with a choice of embedding models such as OpenAI or MiniLM
- Model augmentation: use integrate.py to inject the *.npy vector files into a base model such as Llama
- Dynamic update: after modifying the source JSON, regenerate the vectors and perform incremental integration (no full retraining required)
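The dynamic update step depends on knowing which entries changed in the source JSON. The following is a minimal sketch of one way to find them, assuming entries with the entity and description fields described above. The helper names entry_fingerprint and changed_entries are hypothetical and are not part of KBLaM; the sample entries are illustrative only.

```python
import hashlib
import json


def entry_fingerprint(entry):
    """Return a stable hash of one knowledge entry (entity + description)."""
    payload = json.dumps(
        {"entity": entry["entity"], "description": entry["description"]},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_entries(old_entries, new_entries):
    """Return the entries of new_entries that are new or modified."""
    seen = {entry_fingerprint(e) for e in old_entries}
    return [e for e in new_entries if entry_fingerprint(e) not in seen]


# Illustrative data: the previous and the edited version of the source JSON.
old_kb = [
    {"entity": "KBLaM", "description": "A method for embedding external knowledge in LLMs."},
    {"entity": "Llama", "description": "A family of open-weight base models."},
]
new_kb = [
    {"entity": "KBLaM", "description": "A method for embedding external knowledge in LLMs."},
    {"entity": "Llama", "description": "A family of open-weight base models from Meta."},
    {"entity": "MiniLM", "description": "A compact sentence embedding model."},
]

# Only these entries would need their vectors regenerated.
to_reembed = changed_entries(old_kb, new_kb)
print([e["entity"] for e in to_reembed])
```

Only the entries returned by changed_entries would then be passed back through the embedding step, which is what makes the update incremental rather than a full rebuild.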
Key parameter configuration
- Embedding dimension: defaults to 768 (must match the hidden size of the base model)
- Batch size: lower the -B parameter when GPU memory is insufficient
- Similarity threshold: controls how strictly knowledge is activated (set with -threshold)
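To illustrate how a similarity threshold changes which knowledge is activated, here is a small sketch. It assumes cosine similarity between a query vector and each knowledge vector, which the source does not confirm as the measure KBLaM uses; the function names and the three-dimensional toy vectors are hypothetical. It also rejects vectors whose dimension differs from the query, mirroring the alignment requirement above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def activated(query_vec, kb_vecs, threshold):
    """Return names of knowledge vectors whose similarity to the query meets the threshold."""
    for name, vec in kb_vecs.items():
        if len(vec) != len(query_vec):
            raise ValueError(f"dimension mismatch for {name!r}")
    return [name for name, vec in kb_vecs.items() if cosine(query_vec, vec) >= threshold]


# Toy vectors; real embeddings would have e.g. 768 dimensions.
kb = {
    "a": [1.0, 0.0, 0.0],  # similarity 1.0 to the query
    "b": [0.6, 0.8, 0.0],  # similarity 0.6 to the query
    "c": [0.0, 0.0, 1.0],  # similarity 0.0 to the query
}
query = [1.0, 0.0, 0.0]

loose = activated(query, kb, threshold=0.5)
strict = activated(query, kb, threshold=0.9)
print(loose, strict)
```

A higher threshold admits fewer entries: the loose setting activates two entries here, while the strict one activates only the exact match.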
Best practices
It is recommended to first run entity extraction and de-duplication on the documents. Microsoft's official example shows that a structured knowledge base can improve Q&A accuracy by 42%. For Chinese documents, a word segmentation tool must additionally be configured.
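The de-duplication step could look like the sketch below. It assumes entries in the entity/description JSON format and treats entity names as duplicates when they match after trimming whitespace and ignoring case; both the deduplicate helper and the rule of keeping the longest description are illustrative choices, not something the source specifies.

```python
def deduplicate(entries):
    """Keep one entry per entity name, preferring the longest description."""
    kept = {}
    for entry in entries:
        # Normalize the name: collapse whitespace and ignore case.
        key = " ".join(entry["entity"].split()).casefold()
        current = kept.get(key)
        if current is None or len(entry["description"]) > len(current["description"]):
            kept[key] = entry
    return list(kept.values())


raw = [
    {"entity": "KBLaM", "description": "Knowledge tool."},
    {"entity": " kblam ", "description": "Knowledge tool for large models."},
    {"entity": "Llama", "description": "Base model."},
]
clean = deduplicate(raw)
print(len(clean))
```

Real documents usually need fuzzier matching (aliases, abbreviations), so a name-normalization rule like this one is only a starting point.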
This answer comes from the article "KBLaM: An Open Source Enhanced Tool for Embedding External Knowledge in Large Models".