> peft-fine-tuning
Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when a user asks to fine-tune a language model, train a custom LLM, adapt a model to their data, use LoRA or QLoRA, fine-tune Llama or Mistral, or train a model on consumer GPUs. Covers PEFT methods for 7B-70B parameter models.
PEFT Fine-Tuning
Overview
Fine-tune large language models efficiently using Parameter-Efficient Fine-Tuning (PEFT) methods. Train 7B to 70B parameter models on consumer GPUs (16-48 GB VRAM) using LoRA, QLoRA, and 25+ adapter methods from the Hugging Face PEFT library. Avoid the cost and hardware requirements of full fine-tuning while achieving comparable results.
Instructions
When a user asks to fine-tune a model, choose the task below that matches their hardware and goal:
Task A: Set up the environment
pip install torch transformers datasets peft accelerate bitsandbytes trl
# For Flash Attention 2 (recommended for speed)
pip install flash-attn --no-build-isolation
Verify GPU availability:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Task B: Fine-tune with LoRA (16+ GB VRAM)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer
# 1. Load base model
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                 # Rank (8-64; higher = more capacity)
    lora_alpha=32,        # Scaling factor (usually 2x rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output with all seven target modules at r=16:
# roughly 41.9M trainable params out of 8.07B total (~0.52%)
# 3. Load and format dataset
dataset = load_dataset("your-dataset")
def format_prompt(example):
    # Build one training string per example in an instruction/response template
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
# 4. Train
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size: 4 x 4 = 16
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)
# Note: passing max_seq_length to SFTTrainer follows older TRL releases;
# newer releases may expect the length limit in SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    formatting_func=format_prompt,
    max_seq_length=2048,
)
trainer.train()
trainer.save_model("./lora-adapter")
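The figure reported by print_trainable_parameters() in step 2 can be sanity-checked by hand: each adapted linear layer with shape (in, out) gains r x (in + out) LoRA weights. The sketch below is plain arithmetic, not a PEFT call. It assumes the published Llama 3.1 8B dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers), and the helper name lora_param_count is made up for illustration.

```python
# Assumed Llama 3.1 8B projection shapes as (in_features, out_features).
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_param_count(target_modules, rank, shapes=SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: each adapted layer adds rank * (in + out) parameters."""
    per_layer = sum(rank * sum(shapes[name]) for name in target_modules)
    return per_layer * num_layers


attention_only = lora_param_count(["q_proj", "k_proj", "v_proj", "o_proj"], rank=16)
all_linear = lora_param_count(list(SHAPES), rank=16)
print(f"attention only: {attention_only:,}")  # 13,631,488
print(f"all linear:     {all_linear:,}")      # 41,943,040
```

Under these assumptions, targeting only the attention projections gives about 13.6M trainable parameters, while targeting all seven linear layers gives about 41.9M, so the count depends directly on target_modules and r.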
Task C: Fine-tune with QLoRA (8+ GB VRAM)
QLoRA loads the base model in 4-bit precision and trains LoRA adapters on top, cutting weight memory to roughly a quarter of fp16:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
# Freeze the base weights and prepare the quantized model for k-bit training
model = prepare_model_for_kbit_training(model)
# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
# Now fine-tune with the same SFTTrainer setup from Task B
Approximate VRAM requirements with QLoRA (short sequences and small batches; longer contexts need more):
- 7B model: ~6 GB
- 13B model: ~10 GB
- 70B model: ~36 GB
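A rough lower bound for these figures comes from the quantized weights alone: parameters x bits / 8. The sketch below is a back-of-the-envelope estimate, not a measurement. The helper name quantized_weight_gb is hypothetical, and it deliberately ignores LoRA adapters, optimizer state, activations, quantization constants, and CUDA overhead, which account for the gap between its output and the totals listed above.

```python
def quantized_weight_gb(params_billion, bits=4):
    """Memory for the quantized base weights only, in GB (1e9 bytes)."""
    return params_billion * 1e9 * bits / 8 / 1e9


for size in (7, 13, 70):
    print(f"{size}B -> {quantized_weight_gb(size):.1f} GB of 4-bit weights")
```

If the estimate for a model is already close to the card's VRAM, expect to need a shorter sequence length, a smaller batch, or a second GPU.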
Task D: Merge and export the fine-tuned model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Merge adapter weights into base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.save_pretrained("./merged-model")
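# Convert to GGUF for Ollama/llama.cpp using the conversion script from the llama.cpp repository
# git clone https://github.com/ggml-org/llama.cpp
# pip install -r llama.cpp/requirements.txt
# python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile model.gguf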
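Conceptually, merging folds the low-rank update into each base weight matrix: W_merged = W + (lora_alpha / r) * B * A. The toy sketch below uses plain Python lists to show that the merged matrix produces the same output as running the base layer and the adapter side by side. It is not the PEFT implementation; the matrices, sizes, and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale_matrix(m, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[s * x for x in row] for row in m]


r, lora_alpha = 2, 4
scaling = lora_alpha / r                       # 2.0

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]   # base weight (out=3, in=3)
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]                     # LoRA down-projection (r, in)
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # LoRA up-projection (out, r)
x = [[1.0], [2.0], [3.0]]                                  # input column vector

# Adapter attached: base output plus the scaled low-rank path
adapter_out = add(matmul(W, x), scale_matrix(matmul(B, matmul(A, x)), scaling))

# Adapter merged: fold the update into the weight once, then use it like a normal layer
W_merged = add(W, scale_matrix(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out)   # [[22.0], [36.0], [62.0]]
print(merged_out)    # [[22.0], [36.0], [62.0]]
```

Because the merged weight has the same shape as the original, the exported model needs no PEFT dependency at inference time.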
Task E: Prepare a custom dataset
from datasets import Dataset, load_dataset
# Format: instruction-response pairs
data = [
    {"instruction": "Summarize this contract clause.", "input": "...", "output": "..."},
    {"instruction": "Extract the key dates.", "input": "...", "output": "..."},
]
# Create Hugging Face dataset
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)
# Or load from JSONL file
dataset = load_dataset("json", data_files="training_data.jsonl")
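Before loading a JSONL file for training, it can help to write and re-read it with a quick validity check. The sketch below uses only the standard library. The required keys, the helper names write_jsonl and read_valid_jsonl, and the sample records are assumptions for illustration, not part of the datasets API.

```python
import json
import os
import tempfile

# Assumed schema: every record needs a non-empty instruction and output.
REQUIRED_KEYS = ("instruction", "output")


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_valid_jsonl(path):
    """Return (valid_records, skipped_count), dropping records with missing or empty fields."""
    valid, skipped = [], 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if all(isinstance(record.get(k), str) and record[k].strip() for k in REQUIRED_KEYS):
                valid.append(record)
            else:
                skipped += 1
    return valid, skipped


# Made-up sample records; the third one has an empty output and should be skipped.
records = [
    {"instruction": "Summarize this contract clause.", "input": "Clause text", "output": "Short summary"},
    {"instruction": "Extract the key dates.", "input": "Contract text", "output": "List of dates"},
    {"instruction": "Identify the parties.", "input": "Contract text", "output": ""},
]
path = os.path.join(tempfile.mkdtemp(), "training_data.jsonl")
write_jsonl(records, path)
valid, skipped = read_valid_jsonl(path)
print(f"{len(valid)} valid records, {skipped} skipped")  # 2 valid records, 1 skipped
```

Running a check like this first keeps empty or malformed rows out of the training split.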
Examples
Example 1: Fine-tune Llama 3.1 8B for customer support
User request: "Fine-tune Llama 8B on our support ticket data"
# Format support tickets as instruction pairs
def format_support(example):
    return (
        f"### Customer Query:\n{example['question']}\n\n"
        f"### Support Response:\n{example['answer']}"
    )
# Use QLoRA for 8GB VRAM GPUs
# Train for 3 epochs with lr=2e-4, rank=16
# Result: ~2 hours on RTX 4090, adapter size ~30 MB
Example 2: Domain-adapt a model for medical text
User request: "Adapt Mistral 7B to understand medical terminology"
Use continued pre-training with LoRA on a medical corpus, then instruction-tune on medical QA pairs. Set r=32 for higher capacity on specialized domains.
Example 3: Fine-tune a 70B model with QLoRA on 2x A100
User request: "Fine-tune Llama 70B on our internal documents"
Use QLoRA with device_map="auto" to shard across GPUs. Set per_device_train_batch_size=1 with gradient_accumulation_steps=16. Expect ~24 hours for 3 epochs on 10K samples.
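The batch settings above translate into a fixed number of optimizer steps, which is useful for planning warmup and checkpointing. The sketch below is plain arithmetic under stated assumptions: the helper name training_steps is hypothetical, and it assumes a single training process (device_map="auto" shards one model copy across GPUs rather than adding data-parallel replicas).

```python
import math


def training_steps(num_samples, per_device_batch, grad_accum, epochs, num_processes=1):
    """Return (effective_batch_size, steps_per_epoch, total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * num_processes
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Example 3 settings: 10K samples, batch size 1, 16 accumulation steps, 3 epochs
print(training_steps(10_000, 1, 16, 3))  # (16, 625, 1875)
```

The effective batch size of 16 matches the Task B configuration (4 per device x 4 accumulation steps), so the same learning rate is a reasonable starting point.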
Guidelines
- Start with QLoRA if VRAM is limited; it matches LoRA quality in most benchmarks.
- Use rank r=16 as a default. Increase to r=32-64 for complex domain adaptation; decrease to r=8 for simple style tuning.
- Set lora_alpha = 2 * r as a starting point.
- Target all linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) for best results.
- Use a cosine learning rate scheduler with 3% warmup for stable training.
- Monitor training loss: it should decrease steadily. If it plateaus early, increase rank or learning rate.
- Evaluate on a held-out test set after each epoch to detect overfitting.
- Save checkpoints every epoch; adapter files are small (~30-100 MB).
- Clean, well-formatted training data matters more than quantity. 1,000 high-quality examples often beat 10,000 noisy ones.
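To make the scheduler guideline concrete, the sketch below computes the learning rate under a linear warmup followed by half-cosine decay. It is a standalone approximation of that schedule shape, not the Trainer's own code, so rounding of the warmup step count may differ slightly. The function name lr_at_step and the 1,000-step run are assumptions for illustration.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical run length; 3% warmup covers the first 30 steps
for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

The rate climbs from zero to 2e-4 over the warmup, sits at half the peak midway through the decay, and reaches zero at the final step.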