Medical-RAG provides an automated processing pipeline designed around the characteristics of Chinese medical data. It contains three modules:
Intelligent labeling system
- Supports HTTP/GPU dual-mode invocation of an LLM (e.g., Qwen2:7b) for batch labeling
- Automatically identifies the department (6 major categories) and the question type (8 major categories) of each medical question
- Outputs structured annotation results for subsequent search and filtering
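The labeling step can be sketched roughly as follows. This is a minimal illustration, not the project's annotation.py: the label names are placeholders (the 6 departments and 8 question types are not listed here), and `call_llm` is a hypothetical callable standing in for whichever HTTP or local GPU backend is used.

```python
import json
from typing import Callable, Dict, List, Optional

# Placeholder label sets: the real 6 departments and 8 question types are
# not listed in the text above, so these names are purely illustrative.
DEPARTMENTS = [f"department_{i}" for i in range(1, 7)]
QUESTION_TYPES = [f"type_{i}" for i in range(1, 9)]


def build_prompt(question: str) -> str:
    """Ask the model for a JSON object restricted to the allowed labels."""
    return (
        "Classify the medical question. Reply with JSON only, using the keys "
        '"department" and "question_type".\n'
        f"Allowed departments: {DEPARTMENTS}\n"
        f"Allowed question types: {QUESTION_TYPES}\n"
        f"Question: {question}"
    )


def annotate(question: str, call_llm: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Label one question; labels outside the allowed sets become None."""
    raw = call_llm(build_prompt(question))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    dept = parsed.get("department")
    qtype = parsed.get("question_type")
    return {
        "question": question,
        "department": dept if dept in DEPARTMENTS else None,
        "question_type": qtype if qtype in QUESTION_TYPES else None,
    }


def annotate_batch(questions: List[str], call_llm: Callable[[str], str]) -> List[dict]:
    """Label a batch sequentially; a real backend might batch or parallelize."""
    return [annotate(q, call_llm) for q in questions]
```

Passing the backend in as a callable keeps the validation logic identical for both invocation modes, and rejecting labels outside the allowed sets means downstream filtering never sees free-form model output.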
Domain lexicon construction
- Uses multithreading to process large volumes of medical text
- Integrates a word segmenter with a medical-domain model (pkuseg) to extract specialized terminology
- Generates a compressed vocabulary file (vocab.pkl.gz) to optimize BM25 retrieval efficiency
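A rough sketch of how such a vocabulary might be built and stored is shown below. It is not the project's build_vocab.py: the dictionary layout (`token_to_id`, `doc_freq`, `num_docs`) and the function names are assumptions made for illustration, and the tokenizer is passed in as a parameter so the example runs without pkuseg installed.

```python
import gzip
import pickle
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def build_vocab(
    texts: List[str],
    tokenize: Callable[[str], List[str]],
    min_freq: int = 1,
    workers: int = 4,
) -> Dict[str, object]:
    """Tokenize documents in a thread pool and collect document frequencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        token_lists = list(pool.map(tokenize, texts))

    doc_freq: Counter = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document

    # Drop rare tokens; sort so that token ids are stable across runs.
    kept = sorted(t for t, c in doc_freq.items() if c >= min_freq)
    return {
        "token_to_id": {t: i for i, t in enumerate(kept)},
        "doc_freq": {t: doc_freq[t] for t in kept},
        "num_docs": len(token_lists),
    }


def save_vocab(vocab: dict, path: str) -> None:
    """Write the vocabulary as a gzip-compressed pickle (e.g. vocab.pkl.gz)."""
    with gzip.open(path, "wb") as f:
        pickle.dump(vocab, f)


def load_vocab(path: str) -> dict:
    """Read a vocabulary written by save_vocab."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

With pkuseg installed, the tokenizer argument could be something like `pkuseg.pkuseg(model_name="medicine").cut`. Storing document frequencies next to the token ids lets BM25 weights be computed later without rescanning the corpus.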
Hybrid vector generation
- Generates dense vectors (via an embedding model) and sparse vectors (based on the vocabulary) in parallel
- Supports batch embedding and incremental updates, adapting to the dynamic expansion of the knowledge base
- Automatically handles text chunking and metadata association to preserve retrieval context
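The idea of producing a dense and a sparse vector per chunk can be sketched as follows. This is an illustrative outline under several assumptions, not the project's insert_data_to_collection.py: `embed` is a hypothetical batch embedding callable, the vocabulary is assumed to be a dictionary with `token_to_id`, `doc_freq`, and `num_docs` keys, chunking is by character count with overlap, and the sparse weights use a common BM25 variant with a smoothed IDF.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def bm25_sparse(tokens: List[str], vocab: dict, avg_len: float,
                k1: float = 1.5, b: float = 0.75) -> Dict[int, float]:
    """Return {token_id: BM25 weight} for the in-vocabulary tokens of one chunk."""
    tf = Counter(t for t in tokens if t in vocab["token_to_id"])
    n = vocab["num_docs"]
    norm = k1 * (1 - b + b * len(tokens) / max(avg_len, 1e-9))
    vec = {}
    for tok, f in tf.items():
        df = vocab["doc_freq"][tok]
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # smoothed, non-negative
        vec[vocab["token_to_id"][tok]] = idf * f * (k1 + 1) / (f + norm)
    return vec


def build_records(doc_id: str, text: str, metadata: dict,
                  embed: Callable[[List[str]], List[List[float]]],
                  tokenize: Callable[[str], List[str]],
                  vocab: dict, avg_len: float) -> List[dict]:
    """Chunk one document, then compute dense and sparse vectors concurrently."""
    chunks = chunk_text(text)
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(embed, chunks)  # one batched embedding call
        sparse_future = pool.submit(
            lambda: [bm25_sparse(tokenize(c), vocab, avg_len) for c in chunks]
        )
        dense, sparse = dense_future.result(), sparse_future.result()
    # Each record keeps its source metadata plus its position in the document.
    return [
        {"id": f"{doc_id}-{i}", "text": c, "dense": d, "sparse": s,
         "metadata": {**metadata, "chunk_index": i}}
        for i, (c, d, s) in enumerate(zip(chunks, dense, sparse))
    ]
```

Because every record carries a deterministic id derived from the document id and chunk index, re-running the function on a new or changed document yields records that could be upserted individually, which is one simple way to approach incremental updates.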
The entire process is carried out by three scripts, annotation.py, build_vocab.py, and insert_data_to_collection.py, which automate the end-to-end processing so that users only need to prepare raw QA data.
This answer comes from the article "Medical-RAG: A Retrieval-Augmented Generation Framework for Constructing Chinese Medical Q&As".