
Fine-Tuning with Hugging Face Transformers and PEFT: Complete Tutorial

A hands-on tutorial for fine-tuning open-source LLMs using Hugging Face Transformers, PEFT, and TRL libraries, covering setup, training configuration, evaluation, and pushing to the Hugging Face Hub.

The Hugging Face Fine-Tuning Stack

Hugging Face provides a complete stack for fine-tuning open-source models. The core libraries are:

  • transformers — model loading, tokenization, and inference
  • peft — parameter-efficient fine-tuning (LoRA, QLoRA)
  • trl — training utilities specifically for LLMs, including SFTTrainer
  • datasets — data loading and preprocessing
  • bitsandbytes — quantization support for QLoRA

Together, these libraries handle everything from data loading to model deployment. This tutorial walks through a complete fine-tuning workflow from start to finish.

Environment Setup

# Install required packages
# pip install torch transformers peft trl datasets bitsandbytes accelerate

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Verify GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
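Before loading any model, it can also help to confirm the whole stack is installed. This quick sanity check uses only the standard library, so it works even when some packages are missing:

```python
from importlib.metadata import version, PackageNotFoundError

# Check each library in the fine-tuning stack and report its installed version
stack = ("transformers", "peft", "trl", "datasets", "bitsandbytes", "accelerate")
installed = {}
for pkg in stack:
    try:
        installed[pkg] = version(pkg)
    except PackageNotFoundError:
        installed[pkg] = None

for pkg, ver in installed.items():
    print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```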

Loading the Base Model with QLoRA

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Preparing the Dataset

The SFTTrainer works best with datasets in conversational format — a messages column containing lists of role/content dicts.
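For reference, a single record in that format looks like the following (the content here is a hypothetical example), serialized as one JSON object per line in a JSONL file:

```python
import json

# One conversational training record: a "messages" list of role/content dicts
record = {
    "messages": [
        {"role": "system", "content": "You are a medical coding assistant."},
        {"role": "user", "content": "Code for a routine annual physical exam."},
        {"role": "assistant", "content": "CPT 99395 (established patient, age 18-39)."},
    ]
}

# JSONL layout: each record is a single line of JSON
line = json.dumps(record)
parsed = json.loads(line)
print([m["role"] for m in parsed["messages"]])  # ['system', 'user', 'assistant']
```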

flowchart LR
    DATA[("Curated dataset<br/>instruction or chat")]
    CLEAN["Clean and dedupe<br/>PII filter"]
    TOK["Tokenize and pack"]
    METHOD{"Method"}
    LORA["LoRA or QLoRA<br/>adapters only"]
    SFT["Full SFT<br/>all params"]
    DPO["DPO or RLHF<br/>preference learning"]
    EVAL["Held out eval<br/>plus regression suite"]
    DEPLOY[("Adapter or<br/>merged model")]
    DATA --> CLEAN --> TOK --> METHOD
    METHOD --> LORA --> EVAL
    METHOD --> SFT --> EVAL
    METHOD --> DPO --> EVAL
    EVAL --> DEPLOY
    style METHOD fill:#4f46e5,stroke:#4338ca,color:#fff
    style EVAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DEPLOY fill:#059669,stroke:#047857,color:#fff

from datasets import Dataset
import json

def load_training_data(filepath: str) -> Dataset:
    """Load JSONL training data into a Hugging Face Dataset."""
    examples = []
    with open(filepath, "r") as f:
        for line in f:
            data = json.loads(line)
            examples.append({"messages": data["messages"]})
    return Dataset.from_list(examples)

# Load and split dataset
full_dataset = load_training_data("training_data.jsonl")
split = full_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = split["train"]
eval_dataset = split["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Evaluation examples: {len(eval_dataset)}")

# Inspect one example
print(json.dumps(train_dataset[0]["messages"], indent=2))

Configuring LoRA

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
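As a rough sanity check on what r=16 buys, LoRA adds r * (d_in + d_out) trainable parameters per targeted weight matrix. The back-of-envelope sketch below assumes Llama-3.1-8B's shapes (hidden size 4096, GQA key/value projections to 1024, MLP width 14336, 32 layers); verify these against the model config before relying on the numbers:

```python
def lora_param_count(shapes, r=16):
    """Trainable params added by LoRA: r * (d_in + d_out) per target matrix."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Per-layer projection shapes for Llama-3.1-8B (assumed, see model config)
per_layer = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]

total = lora_param_count(per_layer, r=16) * 32  # 32 transformer layers
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~41.9M
```

Around 42M trainable parameters against 8B total, or roughly 0.5% — which is what `print_trainable_parameters()` should report later.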

Setting Up the SFT Trainer

The SFTTrainer from TRL handles chat template formatting, packing, and training loop management.

# Training configuration
training_args = SFTConfig(
    output_dir="./llama3-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch size: 4 * 4 = 16
    gradient_checkpointing=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=3,
    max_seq_length=2048,
    packing=False,                    # Set True to pack multiple examples
    report_to="none",                 # Use "wandb" for experiment tracking
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

# Check trainable parameters
trainer.model.print_trainable_parameters()
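Before launching, it is worth estimating how many optimizer steps the configuration above implies. This sketch assumes a single GPU; the dataset size is a placeholder, so substitute your own:

```python
num_examples = 1800       # placeholder: len(train_dataset)
per_device_batch = 4      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
epochs = 3                # num_train_epochs

effective_batch = per_device_batch * grad_accum        # 16
steps_per_epoch = -(-num_examples // effective_batch)  # ceil division -> 113
total_steps = steps_per_epoch * epochs                 # 339
warmup_steps = int(0.1 * total_steps)                  # warmup_ratio=0.1 -> 33

print(f"{total_steps} steps total, {warmup_steps} warmup")
```

With eval_steps=50, that works out to six or seven evaluation passes over the run, which is usually enough to spot overfitting early.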

Training

# Start training
train_result = trainer.train()

# Print training metrics
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training runtime: {train_result.metrics['train_runtime']:.0f}s")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.1f}")

# Save the LoRA adapter
trainer.save_model("./llama3-finetune/final")
tokenizer.save_pretrained("./llama3-finetune/final")

Evaluation

from transformers import pipeline

# Load the fine-tuned model for inference
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def evaluate_on_test(pipe, test_data, num_samples=20):
    """Run model on test examples and collect results."""
    results = []
    for i in range(min(num_samples, len(test_data))):
        example = test_data[i]
        messages = example["messages"]

        # Use all messages except the last (assistant response) as input
        prompt_messages = messages[:-1]
        expected = messages[-1]["content"]

        output = pipe(
            prompt_messages,
            max_new_tokens=512,
            temperature=0.1,
            do_sample=True,
        )
        generated = output[0]["generated_text"][-1]["content"]

        results.append({
            "input": messages[-2]["content"][:100],
            "expected": expected[:100],
            "generated": generated[:100],
        })

    return results

results = evaluate_on_test(pipe, eval_dataset)
for r in results[:5]:
    print(f"Input:    {r['input']}")
    print(f"Expected: {r['expected']}")
    print(f"Got:      {r['generated']}")
    print("---")
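Eyeballing transcripts is a start, but a number helps track progress across runs. A crude first metric over those results is normalized exact match; this helper is a hypothetical starting point, and real generative tasks usually need task-specific scoring on top of it:

```python
def exact_match_rate(results):
    """Fraction of results whose generated text matches the expected text
    after lowercasing and whitespace normalization."""
    def norm(s):
        return " ".join(s.lower().split())
    if not results:
        return 0.0
    hits = sum(norm(r["expected"]) == norm(r["generated"]) for r in results)
    return hits / len(results)

# Toy sample: one normalized match, one miss
sample = [
    {"expected": "CPT 99395", "generated": "cpt  99395"},
    {"expected": "CPT 99213", "generated": "CPT 99214"},
]
print(exact_match_rate(sample))  # 0.5
```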

Pushing to Hugging Face Hub

# Login to Hugging Face (run once)
# huggingface-cli login --token hf_YOUR_TOKEN

# Push the LoRA adapter to Hub
trainer.model.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)
tokenizer.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)

# To merge and push the full model, reload the adapter on a full-precision
# base first: merging into the 4-bit quantized model degrades the weights.
from peft import AutoPeftModelForCausalLM

merged = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-finetune/final",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).merge_and_unload()
merged.push_to_hub(
    "your-username/llama3-medical-coder-merged",
    private=True,
)

FAQ

What is the difference between SFTTrainer and the standard Trainer?

SFTTrainer (Supervised Fine-Tuning Trainer) from TRL is specifically designed for LLM fine-tuning. It automatically handles chat template formatting, supports packing multiple short examples into a single sequence for efficiency, and integrates seamlessly with PEFT adapters. The standard Trainer from transformers works for general training but requires you to handle tokenization, padding, and label masking manually for language model fine-tuning.


How do I choose between packing=True and packing=False?

Packing concatenates multiple training examples into a single sequence to maximize GPU utilization. Enable packing when your examples are short (under 25% of max_seq_length) and you want faster training. Disable packing when example boundaries matter — for instance, if your system prompts vary between examples, packing can create confusing boundaries. Start with packing disabled and enable it only if training is slow due to short sequences.
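Conceptually, packing does something like the following toy sketch. TRL's real implementation also handles attention masking and EOS boundaries between examples, so treat this only as an illustration of the idea:

```python
def pack(sequences, max_len):
    """Greedily concatenate tokenized examples into sequences of at most max_len."""
    packed, current = [], []
    for seq in sequences:
        # Start a new packed sequence when the next example would overflow
        if current and len(current) + len(seq) > max_len:
            packed.append(current)
            current = []
        current.extend(seq)
    if current:
        packed.append(current)
    return packed

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack(examples, max_len=6))  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

Four short examples become two full sequences, which is exactly why packing improves GPU utilization on short data.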

How do I resume training from a checkpoint if it gets interrupted?

SFTTrainer saves checkpoints automatically based on your save_strategy configuration. To resume, pass the checkpoint directory to the resume_from_checkpoint parameter: trainer.train(resume_from_checkpoint="./llama3-finetune/checkpoint-150"). The trainer restores the model weights, optimizer state, learning rate schedule, and data loader position so training continues exactly where it left off.


#HuggingFace #PEFT #Transformers #TRL #FineTuning #SFT #AgenticAI #LearnAI #AIEngineering
