
FineTuningLLMs: A Practical Guide to Efficiently Fine-Tuning Large Language Models on a Single GPU

2025-07-09

FineTuningLLMs is a GitHub repository created by author dvgodoy, based on his book A Hands-On Guide to Fine-Tuning LLMs with PyTorch and Hugging Face. This repository provides developers with a practical and systematic guide focused on efficiently fine-tuning Large Language Models (LLMs) on a single consumer GPU. It explains how to optimize model performance using tools such as PyTorch, LoRA adapters, quantization techniques, and more, in conjunction with the Hugging Face ecosystem. The repository covers the complete process from model loading to deployment and is suitable for machine learning practitioners and researchers. Users can learn methods for fine-tuning, deployment, and troubleshooting through code examples and detailed documentation. The project is shared as open source to encourage community contribution and learning.


Feature List

  • Provides a complete LLM fine-tuning workflow covering data preprocessing, model loading, and parameter optimization.
  • Supports LoRA and quantization techniques to reduce the hardware requirements of single-GPU fine-tuning.
  • Integrates with the Hugging Face ecosystem and provides sample configurations for pre-trained models and tools.
  • Includes a performance comparison of Flash Attention and PyTorch SDPA to optimize training speed.
  • Supports converting fine-tuned models to GGUF format for easy local deployment.
  • Provides deployment guides for Ollama and llama.cpp to simplify putting models into production.
  • Includes a troubleshooting guide that lists common errors and their solutions.

 

Usage Guide

Installation process

FineTuningLLMs is a GitHub-hosted code repository, so users first need to set up the necessary development environment. The detailed installation and configuration steps are as follows:

  1. Clone the repository
    Open a terminal and run the following commands to clone the repository locally:

    git clone https://github.com/dvgodoy/FineTuningLLMs.git
    cd FineTuningLLMs
    
  2. Installing the Python Environment
    Make sure Python 3.8 or later is installed on your system. A virtual environment is recommended to isolate dependencies:

    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    venv\Scripts\activate     # Windows
    
  3. Install dependencies
    The repository provides a requirements.txt file listing the necessary Python libraries (e.g., PyTorch and Hugging Face transformers). Run the following command to install them:

    pip install -r requirements.txt
    
  4. Install optional tools
    • If you need to deploy the model, install Ollama or llama.cpp. According to the official documentation, Ollama can be installed with the following command:
      curl https://ollama.ai/install.sh | sh
      
    • If you plan to use GGUF-format models, install llama.cpp and follow its GitHub page to complete the setup.
  5. Verify the environment
    Run the repository's sample script test_environment.py (if present) to make sure the dependencies are installed correctly:

    python test_environment.py
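
If the repository does not include such a script, a minimal check along the following lines (an assumption, not a file from the repository) confirms that PyTorch is installed and can see the GPU:

import torch

# Print the installed PyTorch version and whether a CUDA-capable GPU is visible
print(torch.__version__)
print(torch.cuda.is_available())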
    

Main Features

1. Data pre-processing

FineTuningLLMs provides data formatting tools to help users prepare training datasets suitable for fine-tuning. Users need to prepare datasets in JSON or CSV format containing the input text and the target output. The repository's data_preprocessing.py script (a sample file) can be used to clean and format the data. Run:

python data_preprocessing.py --input input.json --output formatted_data.json

Make sure the input data follows Hugging Face's dataset conventions; fields typically include text and label.
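
For reference, the formatted file can be loaded with the Hugging Face datasets library; the sketch below assumes the text and label field names mentioned above:

from datasets import load_dataset

# Load the formatted JSON file produced by the preprocessing step
dataset = load_dataset("json", data_files="formatted_data.json", split="train")

# Inspect one example; the "text" and "label" fields are assumed here
print(dataset[0]["text"], dataset[0]["label"])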

2. Model fine-tuning

The repository's core function is fine-tuning models on a single GPU using LoRA and quantization techniques. Users can choose pre-trained models provided by Hugging Face (e.g., LLaMA or Mistral). LoRA parameters (e.g., rank and alpha values) are configured in the config/lora_config.yaml file. Example configuration:

lora:
  rank: 8
  alpha: 16
  dropout: 0.1

Run the fine-tuning script:

python train.py --model_name llama-2-7b --dataset formatted_data.json --output_dir ./finetuned_model

The script loads the model, applies the LoRA adapter and starts training. Quantization options (e.g. 8-bit integers) can be enabled via command line arguments:

python train.py --quantization 8bit
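
For readers curious what 8-bit loading amounts to, the following is a minimal sketch using transformers and bitsandbytes; the model id is only an example, and this is not necessarily how train.py implements it:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a causal LM with 8-bit weights (requires the bitsandbytes and accelerate packages)
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",     # example model id; gated, requires access approval
    quantization_config=bnb_config,
    device_map="auto",
)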

3. Performance optimization

The repository supports both the Flash Attention and PyTorch SDPA attention mechanisms. Users select one in train.py with --attention flash or --attention sdpa. Flash Attention is usually faster but has stricter hardware requirements. Run the following command to compare their performance:

python benchmark_attention.py --model_name llama-2-7b

The script outputs training speed and memory usage data, making it easy for the user to choose a suitable configuration.
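
For context, the choice between the two mechanisms typically maps onto the attn_implementation argument in transformers; the sketch below is an assumption about the underlying API, not necessarily how train.py wires it up:

from transformers import AutoModelForCausalLM

# "sdpa" uses PyTorch's scaled_dot_product_attention;
# "flash_attention_2" needs the flash-attn package and a supported GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",     # example model id
    attn_implementation="sdpa",     # or "flash_attention_2"
)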

4. Model deployment

The fine-tuned model can be converted to GGUF format for local inference. Run the conversion script:

python convert_to_gguf.py --model_path ./finetuned_model --output_path model.gguf

Use Ollama to deploy the model. Ollama loads GGUF files through a Modelfile, so create one that points at the converted model and register it (the model name finetuned is just an example):

echo "FROM ./model.gguf" > Modelfile
ollama create finetuned -f Modelfile
ollama run finetuned

Users can interact with the model via the HTTP API or the command line:

curl http://localhost:11434/api/generate -d '{"model": "finetuned", "prompt": "Hello, world!"}'
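
The same request can be issued from Python with the requests library; this sketch assumes the example model name finetuned created above and a running Ollama server:

import requests

# Ask the local Ollama server for a single, non-streamed completion
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "finetuned", "prompt": "Hello, world!", "stream": False},
)
print(response.json()["response"])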

5. Troubleshooting

The repository contains a troubleshooting.md file that lists common problems such as out-of-memory errors or model loading failures. Users can refer to this file to resolve errors. For example, if you run into insufficient CUDA memory, try reducing the batch size:

python train.py --batch_size 4
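
If training goes through the Hugging Face Trainer (an assumption about the script's internals), the same idea can be expressed by lowering the per-device batch size and compensating with gradient accumulation:

from transformers import TrainingArguments

# A smaller per-device batch plus gradient accumulation keeps the
# effective batch size (4 x 4 = 16) while lowering peak GPU memory
args = TrainingArguments(
    output_dir="./finetuned_model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)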

Featured Functions

LoRA fine-tuning

LoRA (Low-Rank Adaptation) is the repository's core technique: it updates only a small subset of the model's parameters, significantly reducing computational requirements. Users set the rank and scaling factor (alpha) in config/lora_config.yaml. When fine-tuning runs, the LoRA adapters are automatically applied to the model's attention layers. The effect of LoRA can be verified with the following command:

python evaluate.py --model_path ./finetuned_model --test_data test.json
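
As a rough sketch of how such a configuration is usually applied with the peft library (the model id and target module names are assumptions, mirroring the rank/alpha/dropout values from config/lora_config.yaml):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model id

# Mirror the values from config/lora_config.yaml
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection layers
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters LoRA actually trains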

Quantization support

Quantization converts model weights from 16-bit floating-point numbers to 8-bit integers, reducing memory footprint; for a 7B-parameter model that is roughly 14 GB of weights in fp16 versus about 7 GB in int8. Users can enable quantization during training or inference:

python train.py --quantization 8bit

The quantized models also run efficiently on consumer GPUs such as the NVIDIA RTX 3060.

Local deployment

With Ollama or llama.cpp, users can deploy the model on local devices. Ollama exposes a simple local HTTP API that is convenient for quick testing. Start the server with:

ollama serve

The API is then available at http://localhost:11434, where users can interact with the model.

 

Application Scenarios

  1. Personalized Chatbots
    Users can fine-tune the model to generate domain-specific conversations, such as customer service or technical support. After the user prepares a dataset of domain-relevant conversations and runs the fine-tuning scripts, the model can generate answers that better fit the specific scenario.
  2. Text Generation Optimization
    Writers and content creators can use the fine-tuned model to generate text that matches a specific style, such as technical documentation or creative writing. By adjusting the training data, the model can mimic the target text style.
  3. Local Model Deployment
    Enterprises and developers can deploy fine-tuned models to local servers for offline inference. Support for the GGUF format and Ollama makes it possible to run models in low-resource environments.
  4. Education and Research
    Students and researchers can use the repository to learn LLM fine-tuning techniques. The code samples and documentation are suitable for beginners and help them understand the implementation of quantization, LoRA, and attention mechanisms.

 

Q&A

  1. Is FineTuningLLMs suitable for beginners?
    Yes. The repository provides detailed code comments and documentation suitable for beginners with basic Python and machine learning knowledge. Users should understand basic PyTorch and Hugging Face operations.
  2. Is a high-end GPU required?
    No. The repository focuses on single-GPU fine-tuning that can run on consumer GPUs such as an RTX 3060 with 12 GB of VRAM, and LoRA and quantization techniques further reduce hardware requirements.
  3. How do I choose the right pre-trained model?
    Choose a model based on the task requirements. Hugging Face offers models such as LLaMA and Mistral that suit most NLP tasks. The repository documentation recommends starting with a smaller model (e.g., 7B parameters) for testing.
  4. Do I need additional tools to deploy the model?
    Yes, Ollama or llama.cpp is recommended for deployment. Both are open source and easy to install; see the repository's deploy_guide.md.
