ImBD (Imitate Before Detect) is a pioneering machine-generated text detection project, published at AAAI 2025. With the widespread adoption of large language models (LLMs) such as ChatGPT, identifying AI-generated text has become increasingly challenging. ImBD proposes a novel "imitate before detect" approach that improves detection by deeply understanding and imitating the stylistic characteristics of machine-generated text. The method is the first to align with the stylistic preferences of machine text, establishing a comprehensive detection framework that can effectively identify machine-generated text even after human revision. The project is released under the Apache 2.0 open-source license and provides a complete code implementation, pre-trained models, and detailed documentation, making it easy for researchers and developers to build further research and applications on top of it.

Demo address: https://ai-detector.fenz.ai/ai-detector
Function List
- Supports high-precision detection of machine-generated text
- Provides pre-trained models for immediate deployment
- Implements a novel text-style feature alignment algorithm
- Includes detailed experimental datasets and evaluation benchmarks
- Provides complete training and inference code
- Supports fine-tuning on custom training data
- Includes detailed API documentation and usage examples
- Provides command-line tools for quick testing and evaluation
- Supports batch text processing
- Includes visualization tools for displaying detection results
Usage Guide
1. Environment setup
First, set up your Python environment and install the required dependencies:
git clone https://github.com/Jiaqi-Chen-00/ImBD
cd ImBD
pip install -r requirements.txt
2. Data preparation
Before using ImBD, prepare training and test data containing the following two categories:
- Human-written original text
- Machine-generated or machine-revised text
Data format requirements:
- Text files must be UTF-8 encoded
- One sample per line
- It is recommended to split the dataset into training, validation, and test sets in an 8:1:1 ratio
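The format above (UTF-8, one sample per line) and the recommended 8:1:1 split can be sketched in a few lines of Python; the output file names here are illustrative and not part of the ImBD tooling:

```python
import random

def split_dataset(path, seed=42):
    """Read one sample per line (UTF-8) and split 8:1:1 into train/val/test."""
    with open(path, encoding="utf-8") as f:
        samples = [line.rstrip("\n") for line in f if line.strip()]
    random.Random(seed).shuffle(samples)  # deterministic shuffle for reproducibility
    n = len(samples)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    splits = {
        "train.txt": samples[:n_train],
        "val.txt": samples[n_train:n_train + n_val],
        "test.txt": samples[n_train + n_val:],
    }
    for name, lines in splits.items():
        with open(name, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    return {name: len(lines) for name, lines in splits.items()}
```

The seeded shuffle keeps splits reproducible across runs, which matters when comparing training configurations.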
3. Model training
Run the following command to start training:
python train.py \
--train_data path/to/train.txt \
--val_data path/to/val.txt \
--model_output_dir path/to/save/model \
--batch_size 32 \
--learning_rate 2e-5 \
--num_epochs 5
4. Model evaluation
Evaluate model performance on the test set:
python evaluate.py \
--model_path path/to/saved/model \
--test_data path/to/test.txt \
--output_file evaluation_results.txt
5. Text detection
Detect a single text:
python detect.py \
--model_path path/to/saved/model \
--input_text "Text to be detected" \
--output_format json
Detect a batch of texts:
python batch_detect.py \
--model_path path/to/saved/model \
--input_file input.txt \
--output_file results.json
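The exact schema of results.json depends on the release; assuming each entry carries the input text and a machine-probability score (the field names `text` and `machine_prob` are hypothetical, for illustration only), a simple post-processing pass might look like:

```python
import json

def summarize_results(path, threshold=0.5):
    """Count how many texts score at or above the machine-text threshold.

    Assumes a hypothetical schema: a JSON list of objects with
    "text" and "machine_prob" fields. Check the actual output of
    batch_detect.py before relying on these names.
    """
    with open(path, encoding="utf-8") as f:
        results = json.load(f)
    flagged = [r for r in results if r["machine_prob"] >= threshold]
    return {"total": len(results), "flagged": len(flagged)}
```

Keeping the raw JSON and summarizing it separately makes it easy to re-run the analysis with a different threshold later.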
6. Advanced functions
6.1 Model fine-tuning
If you need to optimize for domain-specific text, you can fine-tune the model using your own dataset:
python finetune.py \
--pretrained_model_path path/to/pretrained/model \
--train_data path/to/domain/data \
--output_dir path/to/finetuned/model
6.2 Visualization analysis
Use the built-in visualization tools to analyze detection results:
python visualize.py \
--results_file path/to/results.json \
--output_dir path/to/visualizations
6.3 API Service Deployment
Deploy the model as a REST API service:
python serve.py \
--model_path path/to/saved/model \
--host 0.0.0.0 \
--port 8000
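Once the service is running, clients can send texts over HTTP. The endpoint path and request schema below are assumptions for illustration (check the serve.py source for the actual API contract); the sketch uses only the Python standard library:

```python
import json
import urllib.request

def detect_text(text, host="127.0.0.1", port=8000, endpoint="/detect"):
    """POST a text sample to the detection service and return the parsed JSON reply.

    The endpoint path and the {"text": ...} payload are hypothetical;
    adjust them to match the deployed service.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}:{port}{endpoint}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Using the standard library keeps the client dependency-free; for production use you would typically add timeouts and retry handling.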
7. Caveats
- Using a GPU for model training is recommended to improve efficiency
- Training data quality has a significant impact on model performance
- Update the model regularly to keep up with new characteristics of AI-generated text
- Pay attention to model versioning when deploying to production environments
- Save detection results for later analysis and model optimization
8. Frequently asked questions
Q: What languages does the model support?
A: English is currently the primary supported language; other languages require training on corresponding datasets.
Q: How can I improve detection accuracy?
A: Accuracy can be improved by adding training data, tuning model parameters, and fine-tuning on domain-specific data.
Q: How can detection speed be optimized?
A: Detection speed can be improved by batch processing, model quantization, and using GPU acceleration.