
Fine-Tuning with Hugging Face Transformers and PEFT: Complete Tutorial

A hands-on tutorial for fine-tuning open-source LLMs using Hugging Face Transformers, PEFT, and TRL libraries, covering setup, training configuration, evaluation, and pushing to the Hugging Face Hub.

The Hugging Face Fine-Tuning Stack

Hugging Face provides a complete stack for fine-tuning open-source models. The core libraries are:

  • transformers — model loading, tokenization, and inference
  • peft — parameter-efficient fine-tuning (LoRA, QLoRA)
  • trl — training utilities specifically for LLMs, including SFTTrainer
  • datasets — data loading and preprocessing
  • bitsandbytes — quantization support for QLoRA

Together, these libraries handle everything from data loading to model deployment. This tutorial walks through a complete fine-tuning workflow from start to finish.

Environment Setup

# Install required packages
# pip install torch transformers peft trl datasets bitsandbytes accelerate

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Verify GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
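Before loading any model, it can also help to confirm the whole stack is installed. This quick sanity check uses only the standard library, so it works even when some packages are missing:

```python
from importlib.metadata import version, PackageNotFoundError

# Check each library in the fine-tuning stack and report its installed version
stack = ("transformers", "peft", "trl", "datasets", "bitsandbytes", "accelerate")
installed = {}
for pkg in stack:
    try:
        installed[pkg] = version(pkg)
    except PackageNotFoundError:
        installed[pkg] = None

for pkg, ver in installed.items():
    print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```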

Loading the Base Model with QLoRA

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Preparing the Dataset

The SFTTrainer works best with datasets in conversational format — a messages column containing lists of role/content dicts.
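For reference, a single record in that format looks like the following (the content here is a hypothetical example), serialized as one JSON object per line in a JSONL file:

```python
import json

# One conversational training record: a "messages" list of role/content dicts
record = {
    "messages": [
        {"role": "system", "content": "You are a medical coding assistant."},
        {"role": "user", "content": "Code for a routine annual physical exam."},
        {"role": "assistant", "content": "CPT 99395 (established patient, age 18-39)."},
    ]
}

# JSONL layout: each record is a single line of JSON
line = json.dumps(record)
parsed = json.loads(line)
print([m["role"] for m in parsed["messages"]])  # ['system', 'user', 'assistant']
```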

flowchart LR
    DATA[("Curated dataset<br/>instruction or chat")]
    CLEAN["Clean and dedupe<br/>PII filter"]
    TOK["Tokenize and pack"]
    METHOD{"Method"}
    LORA["LoRA or QLoRA<br/>adapters only"]
    SFT["Full SFT<br/>all params"]
    DPO["DPO or RLHF<br/>preference learning"]
    EVAL["Held out eval<br/>plus regression suite"]
    DEPLOY[("Adapter or<br/>merged model")]
    DATA --> CLEAN --> TOK --> METHOD
    METHOD --> LORA --> EVAL
    METHOD --> SFT --> EVAL
    METHOD --> DPO --> EVAL
    EVAL --> DEPLOY
    style METHOD fill:#4f46e5,stroke:#4338ca,color:#fff
    style EVAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DEPLOY fill:#059669,stroke:#047857,color:#fff

from datasets import Dataset
import json

def load_training_data(filepath: str) -> Dataset:
    """Load JSONL training data into a Hugging Face Dataset."""
    examples = []
    with open(filepath, "r") as f:
        for line in f:
            data = json.loads(line)
            examples.append({"messages": data["messages"]})
    return Dataset.from_list(examples)

# Load and split dataset
full_dataset = load_training_data("training_data.jsonl")
split = full_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = split["train"]
eval_dataset = split["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Evaluation examples: {len(eval_dataset)}")

# Inspect one example
print(json.dumps(train_dataset[0]["messages"], indent=2))

Configuring LoRA

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
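As a rough sanity check on what r=16 buys, LoRA adds r * (d_in + d_out) trainable parameters per targeted weight matrix. The back-of-envelope sketch below assumes Llama-3.1-8B's shapes (hidden size 4096, GQA key/value projections to 1024, MLP width 14336, 32 layers); verify these against the model config before relying on the numbers:

```python
def lora_param_count(shapes, r=16):
    """Trainable params added by LoRA: r * (d_in + d_out) per target matrix."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Per-layer projection shapes for Llama-3.1-8B (assumed, see model config)
per_layer = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]

total = lora_param_count(per_layer, r=16) * 32  # 32 transformer layers
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~41.9M
```

Around 42M trainable parameters against 8B total, or roughly 0.5% — which is what `print_trainable_parameters()` should report later.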

Setting Up the SFT Trainer

The SFTTrainer from TRL handles chat template formatting, packing, and training loop management.

# Training configuration
training_args = SFTConfig(
    output_dir="./llama3-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch size: 4 * 4 = 16
    gradient_checkpointing=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=3,
    max_seq_length=2048,
    packing=False,                    # Set True to pack multiple examples
    report_to="none",                 # Use "wandb" for experiment tracking
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

# Check trainable parameters
trainer.model.print_trainable_parameters()
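Before launching, it is worth estimating how many optimizer steps the configuration above implies. This sketch assumes a single GPU; the dataset size is a placeholder, so substitute your own:

```python
num_examples = 1800       # placeholder: len(train_dataset)
per_device_batch = 4      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
epochs = 3                # num_train_epochs

effective_batch = per_device_batch * grad_accum        # 16
steps_per_epoch = -(-num_examples // effective_batch)  # ceil division -> 113
total_steps = steps_per_epoch * epochs                 # 339
warmup_steps = int(0.1 * total_steps)                  # warmup_ratio=0.1 -> 33

print(f"{total_steps} steps total, {warmup_steps} warmup")
```

With eval_steps=50, that works out to six or seven evaluation passes over the run, which is usually enough to spot overfitting early.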

Training

# Start training
train_result = trainer.train()

# Print training metrics
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training runtime: {train_result.metrics['train_runtime']:.0f}s")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.1f}")

# Save the LoRA adapter
trainer.save_model("./llama3-finetune/final")
tokenizer.save_pretrained("./llama3-finetune/final")

Evaluation

from transformers import pipeline

# Load the fine-tuned model for inference
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def evaluate_on_test(pipe, test_data, num_samples=20):
    """Run model on test examples and collect results."""
    results = []
    for i in range(min(num_samples, len(test_data))):
        example = test_data[i]
        messages = example["messages"]

        # Use all messages except the last (assistant response) as input
        prompt_messages = messages[:-1]
        expected = messages[-1]["content"]

        output = pipe(
            prompt_messages,
            max_new_tokens=512,
            temperature=0.1,
            do_sample=True,
        )
        generated = output[0]["generated_text"][-1]["content"]

        results.append({
            "input": messages[-2]["content"][:100],
            "expected": expected[:100],
            "generated": generated[:100],
        })

    return results

results = evaluate_on_test(pipe, eval_dataset)
for r in results[:5]:
    print(f"Input:    {r['input']}")
    print(f"Expected: {r['expected']}")
    print(f"Got:      {r['generated']}")
    print("---")
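Eyeballing transcripts is a start, but a number helps track progress across runs. A crude first metric over those results is normalized exact match; this helper is a hypothetical starting point, and real generative tasks usually need task-specific scoring on top of it:

```python
def exact_match_rate(results):
    """Fraction of results whose generated text matches the expected text
    after lowercasing and whitespace normalization."""
    def norm(s):
        return " ".join(s.lower().split())
    if not results:
        return 0.0
    hits = sum(norm(r["expected"]) == norm(r["generated"]) for r in results)
    return hits / len(results)

# Toy sample: one normalized match, one miss
sample = [
    {"expected": "CPT 99395", "generated": "cpt  99395"},
    {"expected": "CPT 99213", "generated": "CPT 99214"},
]
print(exact_match_rate(sample))  # 0.5
```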

Pushing to Hugging Face Hub

# Login to Hugging Face (run once)
# huggingface-cli login --token hf_YOUR_TOKEN

# Push the LoRA adapter to Hub
trainer.model.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)
tokenizer.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)

# To merge and push the full model, reload the adapter on a full-precision
# base first: merging into the 4-bit quantized model degrades the weights.
from peft import AutoPeftModelForCausalLM

merged = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-finetune/final",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).merge_and_unload()
merged.push_to_hub(
    "your-username/llama3-medical-coder-merged",
    private=True,
)

FAQ

What is the difference between SFTTrainer and the standard Trainer?

SFTTrainer (Supervised Fine-Tuning Trainer) from TRL is specifically designed for LLM fine-tuning. It automatically handles chat template formatting, supports packing multiple short examples into a single sequence for efficiency, and integrates seamlessly with PEFT adapters. The standard Trainer from transformers works for general training but requires you to handle tokenization, padding, and label masking manually for language model fine-tuning.


How do I choose between packing=True and packing=False?

Packing concatenates multiple training examples into a single sequence to maximize GPU utilization. Enable packing when your examples are short (under 25% of max_seq_length) and you want faster training. Disable packing when example boundaries matter — for instance, if your system prompts vary between examples, packing can create confusing boundaries. Start with packing disabled and enable it only if training is slow due to short sequences.
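Conceptually, packing does something like the following toy sketch. TRL's real implementation also handles attention masking and EOS boundaries between examples, so treat this only as an illustration of the idea:

```python
def pack(sequences, max_len):
    """Greedily concatenate tokenized examples into sequences of at most max_len."""
    packed, current = [], []
    for seq in sequences:
        # Start a new packed sequence when the next example would overflow
        if current and len(current) + len(seq) > max_len:
            packed.append(current)
            current = []
        current.extend(seq)
    if current:
        packed.append(current)
    return packed

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack(examples, max_len=6))  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

Four short examples become two full sequences, which is exactly why packing improves GPU utilization on short data.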

How do I resume training from a checkpoint if it gets interrupted?

SFTTrainer saves checkpoints automatically based on your save_strategy configuration. To resume, pass the checkpoint directory to the resume_from_checkpoint parameter: trainer.train(resume_from_checkpoint="./llama3-finetune/checkpoint-150"). The trainer restores the model weights, optimizer state, learning rate schedule, and data loader position so training continues exactly where it left off.


#HuggingFace #PEFT #Transformers #TRL #FineTuning #SFT #AgenticAI #LearnAI #AIEngineering
