Posição atual:fig. início " Tutoriais práticos de IA

O Unsloth resolve o problema de inferência duplicada na versão quantificada do QwQ-32B

2025-03-08

455

Recentemente, a equipe da Qwen lançou QwQ-32B um modelo de inferência que tem sido usado em muitas Referências É um show que rivaliza com o DeepSeek-R1 O desempenho do sistema é excelente. No entanto, muitos usuários se depararam com geração infinita, excesso de duplicatas, problemas de token e problemas de ajuste fino. Este artigo tem como objetivo fornecer um guia detalhado para ajudá-lo a depurar e resolver esses problemas e liberar todo o potencial do QwQ-32B.

Sem pano O modelo carregado pela equipe corrige os erros acima e permite melhor suporte a ferramentas e estruturas como ajuste fino, vLLM e Transformers. Para aqueles que usam llama.cpp e outros usuários com o llama.cpp como mecanismo de back-end, consulte a seção este link Obtenha orientação para corrigir o problema de geração infinita.

Remoção do modelo QwQ-32B (bug corrigido):

Configurações oficiais recomendadas

⚙️ Configurações oficiais recomendadas

Com base nas recomendações oficiais da Qwen, a seguir estão as configurações de parâmetros recomendadas para a inferência do modelo:

Temperatura: 0,6
Top_K: 40 (faixa recomendada de 20 a 40)
Min_P: 0,1 (opcional, mas funciona bem)
Top_P: 0,95
Penalidade de repetição: 1,0 (em llama.cpp e transformadores, 1,0 significa desativado)
Modelo de bate-papo:<|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n

Configurações recomendadas do llama.cpp

A equipe do Unsloth notou que muitos usuários preferem usar uma versão do Repetition Penalty No entanto, essa abordagem realmente interfere no mecanismo de amostragem do llama.cpp. A penalidade de duplicata tinha o objetivo de reduzir o número de duplicatas geradas, mas os experimentos mostraram que essa abordagem não tem o efeito desejado.

Dito isso, a desativação total da penalidade de repetição (definida como 1,0) também é uma opção. No entanto, a equipe do Unsloth descobriu que uma penalidade de repetição adequada ainda é eficaz na supressão da geração infinita.

Para usar a penalidade de repetição de forma eficaz, a ordem dos amostradores no llama.cpp deve ser ajustada para garantir que, ao aplicar a penalidade Repetition Penalty antes da amostragem, caso contrário, isso resultará em geração infinita. Para fazer isso, adicione o seguinte parâmetro:

--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"

Por padrão, o llama.cpp usa a seguinte ordem de amostradores:

--samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"

A ordem ajustada da equipe do Unsloth basicamente troca as posições de temperatura e seco, e move o min_p para frente. Isso significa que o amostrador será aplicado na seguinte ordem:

top_k=40 
top_p=0.95
min_p=0.1
temperature=0.6
dry
typ_p
xtc

Se o problema persistir, tente mover o --repeat-penalty O valor de 1,0 foi ligeiramente aumentado para 1,2 ou 1,3.

Agradecemos a @krist486 por nos alertar sobre o problema da direção de amostragem no llama.cpp.

Penalidade de Repetição Seca

☀️ Dry Repetição de penalidade

A equipe do Unsloth analisou as sugestões dry penalty e tentamos usar um valor de 0,8. No entanto, os resultados experimentais mostram que odry penalty É mais provável que cause erros de sintaxe, especialmente ao gerar código. Se o usuário ainda encontrar problemas, tente definir a opção dry penalty Aumentar para 0,8.

Se você optar por usar dry penaltyA ordem de amostragem ajustada pode ser igualmente útil.

Ollama executando tutoriais do QwQ-32B

🦙 Ollama Tutorial de execução do QwQ-32B

Se ainda não estiver instalado ollamaInstale-o primeiro!

apt-get update 
apt-get install pciutils -y
curl -fSSL [https://ollama.com/install.sh](https://www.google.com/url?sa=E&q=https%3A%2F%2Follama.com%2Finstall.sh) | sh

Execute o modelo! Se a execução falhar, tente executá-lo em outro terminal ollama serveA equipe do Unsloth incluiu todas as correções e parâmetros sugeridos (temperatura, etc.) no modelo de upload do Hugging Face. param Documentação!

ollama run hf.co/unsloth/QwQ-32B-GGUF:Q4_K_M

llama.cpp Tutorial para execução do QwQ-32B

llama.cpp executando os tutoriais do QwQ-32B

através de (uma lacuna) llama.cpp Obter a versão mais recente llama.cpp. Você pode consultar as instruções de compilação a seguir para compilar. Se você não tiver uma GPU ou quiser apenas fazer inferência na CPU, defina o parâmetro -DGGML_CUDA=ON Substituir por -DGGML_CUDA=OFF.

apt-get update 
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone [https://github.com/ggerganov/llama.cpp](https://www.google.com/url?sa=E&q=https%3A%2F%2Fgithub.com%2Fggerganov%2Fllama.cpp)
cmake llama.cpp -B llama.cpp/build
-DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Faça o download do modelo (durante a instalação) pip install huggingface_hub hf_transfer (depois). Q4_K_M ou outras versões quantificadas (por exemplo, BF16 de precisão total) podem ser selecionadas. Mais versões estão disponíveis em: https://huggingface.co/unsloth/QwQ-32B-GGUF.

# !pip install huggingface_hub hf_transfer
import os 
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download( 
repo_id="unsloth/QwQ-32B-GGUF",
local_dir="unsloth-QwQ-32B-GGUF",
allow_patterns=[" Q4_K_M "],  # For Q4_K_M
)

Execute o script de teste do Flappy Bird fornecido pelo Unsloth e o resultado será salvo no arquivo Q4_K_M_yes_samplers.txt Documentação.
Ajuste os parâmetros de acordo com a situação real.--threads 32 Defina o número de threads da CPU.--ctx-size 16384 Definir o comprimento do contexto do--n-gpu-layers 99 Defina o número de camadas de descarregamento da GPU. Se a GPU estiver com pouca memória, tente ajustar o --n-gpu-layers valor. Remova esse parâmetro se apenas a inferência da CPU for usada.
--repeat-penalty 1.1 responder cantando --dry-multiplier 0.5 são os parâmetros de penalidade de repetição e penalidade seca, que podem ser ajustados pelo usuário conforme necessário.

./llama.cpp/llama-cli  
--model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf
--threads 32
--ctx-size 16384
--n-gpu-layers 99
--seed 3407
--prio 2
--temp 0.6
--repeat-penalty 1.1
--dry-multiplier 0.5
--min-p 0.1
--top-k 40
--top-p 0.95
-no-cnv
--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n <think> \n"
2>&1 | tee Q4_K_M_yes_samplers.txt

As dicas do jogo Flappy Bird acima foram retiradas do site Unsloth's DeepSeekR1-Dynamic 1.58bit Blogs. A palavra-chave completa está abaixo:

<|im_start|>user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>
<|im_start|>assistant
<think>

Aqui estão as partes inicial e final do código Python gerado pelo modelo (com o processo de pensamento removido):

import pygame
import random
import sys
pygame.init()
### Continues
class Bird:
def __init__(self):
### Continues
def main():
best_score = 0
current_score = 0
game_over = False
pipes = []
first_time = True  # Track first game play
# Initial setup
background_color = (173, 216, 230)  # Light blue initially
land_color = random.choice(land_colors)
bird = Bird()
while True:
for event in pygame.event.get():
### Continues
if not game_over:
# Update bird and pipes
bird.update()
### Continues
# Drawing
### Continues
pygame.display.flip()
clock.tick(60)
if __name__ == "__main__":
main()

O modelo gerou com sucesso um jogo Flappy Bird funcional!

Unsloth 解决 QwQ-32B 量化版本重复推理问题-1

接下来，尝试移除 –samplers “top_k;top_p;min_p;temperature;dry;typ_p;xtc” 参数，在不使用 Unsloth 修复的情况下运行相同的命令。输出结果将保存到 Q4_K_M_no_samplers.txt 文件。

./llama.cpp/llama-cli
--model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf
--threads 32
--ctx-size 16384
--n-gpu-layers 99
--seed 3407
--prio 2
--temp 0.6
--repeat-penalty 1.1
--dry-multiplier 0.5
--min-p 0.1
--top-k 40
--top-p 0.95
-no-cnv
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"
2>&1 | tee Q4_K_M_no_samplers.txt

在不使用修复的情况下，模型生成结果会出现循环，并且 Python 语法错误频出，以及其他各种问题。例如，以下代码片段看似正确，但实际上存在错误！第 39 行 pipes.clear() ### <<< NameError: name ‘pipes’ is not defined. Did you forget to import ‘pipes’? 提示 pipes 未定义。

import pygame
import random
pygame.init()
# Constants
WIDTH, HEIGHT = 800, 600
GROUND_HEIGHT = 20
GRAVITY = 0.7
PIPE_SPEED = -3
BIRD_SIZE = 45
MIN_GAP = 130
MAX_GAP = 200
PIPE_COLORS = [(0, 96, 0), (205, 133, 63), (89, 97, 107)]
DARK_BROWN = (94, 72, 4)
YELLOW = (252, 228, 6)
screen = pygame.display.set_mode((WIDTH, HEIGHT))
clock = pygame.time.Clock()
def random_light_color():
return (
random.randint(180, 230),
random.randint(190, 300),
random.randint(250, 255)
)
def reset_game():
global bird_x, bird_y
global pipes, score
global background_color, land_color
global bird_shape, bird_color
# Bird properties
bird_x = WIDTH * 0.3
bird_y = HEIGHT // 2
bird_vel = -5  # Initial upward thrust
pipes.clear() ### <<< NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?

如果进一步将 –repeat-penalty 提高到 1.5，情况会变得更糟，语法错误更加明显，代码也完全无法运行。

import pygame
from random import randint  # For generating colors/shapes/positions randomly 
pygame.init()
# Constants:
WIDTH, HEIGHT =456 ,702   #
BACKGROUND_COLOR_LIGHTS=['lightskyblue']
GAP_SIZE=189           #
BIRD_RADIUS=3.  
PIPE_SPEED=- ( )    ? 
class Game():
def __init__(self):
self.screen_size=( )
def reset_game_vars():
global current_scor e
# set to zero and other initial states.
# Main game loop:
while running :
for event in pygame.event.get() : 
if quit ... etc
pygame.quit()
print("Code is simplified. Due time constraints, full working version requires further implementation.")

可能有人会认为这只是 Q4_K_M 量化版本的问题？BF16 全精度版本应该没问题吧？然而，事实并非如此。即使使用 BF16 全精度模型，如果不应用 Unsloth 团队提供的 -samplers “top_k;top_p;min_p;temperature;dry;typ_p;xtc” 修复方案，并使用 Repetition Penalty，同样会出现生成失败的情况。

O token não é exibido?

🤔 token Não foi mostrado?

Há comentários de usuários de que alguns sistemas podem não ser capazes de gerar o processo de pensamento corretamente devido ao token adicionado por padrão ao modelo de bate-papo. Os usuários precisam editar manualmente o modelo Jinja para incluir:

{%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0]['role'] == 'system' %} {{- messages[0]['content'] }} {%- else %} {{- '' }} {%- endif %} {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0]['role'] == 'system' %} {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" and not message.tool_calls %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role }} {%- if message.content %} {{- '\n' + content }} {%- endif %} {%- for tool_call in message.tool_calls %} {%- if tool_call.function is defined %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '\n<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {{- tool_call.arguments | tojson }} {{- '}\n</tool_call>' }} {%- endfor %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %}

Modificado para remover o \n final. A modificação exige que os modelos adicionem manualmente \n durante a inferência, mas isso pode nem sempre funcionar. A equipe do DeepSeek também modificou todos os modelos para adicionar tokens por padrão para forçar o modelo a entrar no modo de inferência.

因此，将 {%- if add_generation_prompt %} {{- ‘<|im_start|>assistant\n\n’ }} {%- endif %} 修改为 {%- if add_generation_prompt %} {{- ‘<|im_start|>assistant\n’ }} {%- endif %} ，即移除 \n 。

Modelo jinja completo com a parte \n excluída.

{%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0]['role'] == 'system' %} {{- messages[0]['content'] }} {%- else %} {{- '' }} {%- endif %} {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0]['role'] == 'system' %} {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" and not message.tool_calls %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role }} {%- if message.content %} {{- '\n' + content }} {%- endif %} {%- for tool_call in message.tool_calls %} {%- if tool_call.function is defined %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '\n<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {{- tool_call.arguments | tojson }} {{- '}\n</tool_call>' }} {%- endfor %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}

Notas adicionais

Inicialmente, a equipe do Unsloth supôs que o problema poderia ter origem no seguinte:

O comprimento do contexto do QwQ pode não ser os 128K nativos, mas 32K mais a extensão YaRN. Consulte, por exemplo, o arquivo readme em https://huggingface.co/Qwen/QwQ-32B:

{
...,
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
}

A equipe do Unsloth tentou reescrever o tratamento do YaRN no llama.cpp, mas o problema persistiu.

--override-kv qwen2.context_length=int:131072
--override-kv qwen2.rope.scaling.type=str:yarn
--override-kv qwen2.rope.scaling.factor=float:4
--override-kv qwen2.rope.scaling.original_context_length=int:32768
--override-kv qqwen2.rope.scaling.attn_factor=float:1.13862943649292 \

A equipe do Unsloth também suspeitou que o valor do RMS Layernorm epsilon poderia estar incorreto e talvez devesse ser 1e-6 em vez de 1e-5. Por exemplo. Este link. em rms_norm_eps=1e-06 e Este link. em rms_norm_eps=1e-05. A equipe do Unsloth também tentou reescrever esse valor, mas o problema ainda não foi resolvido:

--override-kv qwen2.attention.layer_norm_rms_epsilon=float:0.000001 \

Graças a @kalomaze, a equipe do Unsloth também testou os IDs do tokenizador entre o llama.cpp e o Transformers para ver se eles coincidiam. Os resultados mostram que eles correspondem, portanto, a incompatibilidade nos IDs do tokenizador não é a fonte do problema.

Aqui estão os resultados do experimento da equipe do Unsloth:

61KB file_BF16_no_samplers.txt

BF16 Precisão total, nenhum reparo de amostra aplicado

55KB file_BF16_yes_samplers.txt

BF16 Precisão total, reparo de amostragem aplicado

71KB final_Q4_K_M_no_samplers.txt

Q4_K_M Precisão, sem correção de amostra aplicada

65KB final_Q4_K_M_yes_samplers.txt

Q4_K_M Precisão, correção de amostragem aplicada

Correções de erros do tokenizador

✏️ Correção de bug do tokenizador

Unsloth 团队还发现了一些影响微调的具体问题！EOS token 是正确的，但 PAD token 更合理的选择应该是 “<|vision_pad|>” 。Unsloth 团队已在 https://huggingface.co/unsloth/QwQ-32B/blob/main/tokenizer_config.json 中更新了配置。

"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",

Quantificação dinâmica de 4 bits

🛠️ Quantização dinâmica de 4 bits

A equipe do Unsloth também carregou um modelo de quantização dinâmica de 4 bits, que melhora significativamente a precisão do modelo em comparação com a quantização simples de 4 bits! A figura abaixo mostra a análise de erros dos valores e pesos de ativação do modelo QwQ durante o processo de quantificação:

![alt text](https://docs.unsloth.ai/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252F32wjrIWeUEQTMq9PhmbS%252FQwQ%2520quantization%2520errors.png%3Falt%3Dmedia%26token%3D0733fd33-9fe9-4aad-812c-75dbad00373f&width=768&dpr=4&quality=100&sign=aafe447c&sv=2)

A equipe do Unsloth fez o upload do modelo quantitativo dinâmico de 4 bits para: https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit.

desde vLLM A partir da versão 0.7.3 (20 de fevereiro de 2024) https://github.com/vllm-project/vllm/releases/tag/v0.7.3, o vLLM começou a oferecer suporte ao carregamento de modelos quantitativos dinâmicos de 4 bits do Unsloth!

Todos os modelos de formato GGUF podem ser encontrados em https://huggingface.co/unsloth/QwQ-32B-GGUF!

O Unsloth resolve o problema de inferência duplicada na versão quantificada do QwQ-32B

Configurações oficiais recomendadas

Configurações recomendadas do llama.cpp

Penalidade de Repetição Seca

Ollama executando tutoriais do QwQ-32B

llama.cpp Tutorial para execução do QwQ-32B

O token não é exibido?

Notas adicionais

Correções de erros do tokenizador

Quantificação dinâmica de 4 bits

Artigos relacionados

Recomendado

Não consegue encontrar ferramentas de IA? Tente aqui!

depoimentos

mais recente