Training Loop Implementation¶
This document details the training implementation for LoRA fine-tuning using the transformers and peft libraries, providing a simplified, high-level approach to medical AI model training.
🔄 Training Flow Overview¶
Our training pipeline uses HuggingFace's Trainer class, which handles the complex training loop internally. The process follows these key steps:
graph TD
A[Load Configuration] --> B[Setup Model & Tokenizer]
B --> C[Apply LoRA Configuration]
C --> D[Prepare Datasets]
D --> E[Create Trainer]
E --> F[Start Training]
F --> G[Automatic Evaluation]
G --> H[Save Best Model]
🚀 Main Training Pipeline¶
Core Training Implementation¶
The main training function in main.py orchestrates the entire process:
def run_training(cfg: SimpleConfig):
    """Simplified training pipeline using transformers."""
    logger.info("🚀 Starting training...")

    # Set environment
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    # Load and prepare data
    raw_dataset = load_and_prepare_data(cfg.data.train_file, cfg.data, cfg.seed)

    # Setup model with quantization
    model, tokenizer = setup_model(cfg.model.name, cfg.seed)
    model = setup_lora(model, cfg.lora)

    # Prepare datasets for training
    train_dataset, eval_dataset, test_dataset = prepare_datasets(
        raw_dataset, tokenizer, cfg.data
    )

    # Print GPU memory usage
    print_gpu_memory_usage()

    # Create and run trainer
    trainer = create_trainer(
        model, tokenizer, train_dataset, eval_dataset, cfg.output_dir, cfg.training
    )
    trainer.train()  # HuggingFace handles the entire training loop!

    # Evaluate and save
    test_results = trainer.evaluate(test_dataset)
    adapter_dir = save_model(model, tokenizer, cfg.output_dir, cfg.model.name)
    return adapter_dir
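For orientation, this is one way the pipeline might be invoked from a command-line entry point at the bottom of main.py. The OmegaConf-based loading is an assumption for illustration; the project may construct its SimpleConfig differently.

# Hypothetical entry point -- OmegaConf loading is an assumption,
# not necessarily how the project builds SimpleConfig.
import argparse
from omegaconf import OmegaConf

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="LoRA fine-tuning")
    parser.add_argument("--config", default="config.yaml")
    args = parser.parse_args()

    cfg = OmegaConf.load(args.config)  # supports cfg.model.name-style access
    run_training(cfg)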
🛠 Key Components¶
1. Model Setup with Quantization¶
def setup_model(model_name: str, seed: int):
    """Setup model and tokenizer with 4-bit quantization."""
    set_seed(seed)  # Fix all random seeds so runs are reproducible

    # 4-bit quantization for memory efficiency
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )

    # Load model with quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=bnb_config,
        dtype=torch.float16,
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Reuse EOS as padding token
    return model, tokenizer
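A rough back-of-envelope calculation shows why 4-bit NF4 loading matters. The parameter count below is illustrative, not a measurement of this model:

# Illustrative weight-memory estimate (parameter count is an assumption)
params = 3.8e9                    # e.g. a ~3.8B-parameter model
fp16_gb = params * 2 / 1024**3    # 2 bytes per param  -> ~7.1 GB
nf4_gb = params * 0.5 / 1024**3   # 4 bits per param   -> ~1.8 GB
print(f"fp16: {fp16_gb:.1f} GB  vs  nf4: {nf4_gb:.1f} GB")

Double quantization (bnb_4bit_use_double_quant=True) additionally quantizes the per-block quantization constants, shaving off a further fraction of a bit per parameter.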
2. LoRA Configuration¶
def setup_lora(model, cfg):
    """Apply LoRA configuration to quantized model."""
    # Prepare model for k-bit training
    model = prepare_model_for_kbit_training(model)
    model.config.use_cache = False  # KV cache is incompatible with gradient checkpointing

    # Create LoRA configuration
    peft_config = LoraConfig(
        r=cfg.r,                            # Low-rank dimension (16)
        lora_alpha=cfg.alpha,               # Scaling factor (32)
        target_modules=cfg.target_modules,  # [q_proj, v_proj, k_proj, o_proj]
        lora_dropout=cfg.dropout,           # Dropout (0.1)
        bias="none",
        task_type="CAUSAL_LM",
    )

    # Apply LoRA to model
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # Shows only LoRA params are trainable
    return model
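Conceptually, each targeted projection keeps its frozen weight W and gains a trainable low-rank update scaled by alpha/r. This is a toy sketch of that math, not peft's internal implementation:

import torch

# Toy LoRA forward pass -- illustrates the idea, not the peft internals
d, r, alpha = 1024, 16, 32
W = torch.randn(d, d)          # frozen base weight
A = torch.randn(r, d) * 0.01   # trainable down-projection
B = torch.zeros(d, r)          # trainable up-projection, zero-init so the
                               # adapted model starts identical to the base
x = torch.randn(d)

h = W @ x + (alpha / r) * (B @ (A @ x))
# Trainable params per layer: 2 * d * r = 32,768 vs d * d = 1,048,576 (~3%)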
3. Dataset Preparation¶
def prepare_datasets(raw_dataset, tokenizer, cfg):
    """Format and tokenize datasets using chat templates."""

    def format_example(example):
        # Use HuggingFace chat templates for proper formatting
        messages = [
            {"role": "system", "content": cfg.system_prompt},
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=False
        )
        return {"text": text}

    def tokenize_batch(batch):
        tokenized = tokenizer(
            batch["text"],
            max_length=cfg.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        # Fallback only: DataCollatorForLanguageModeling(mlm=False) rebuilds
        # labels at collation time and masks padding tokens to -100
        tokenized["labels"] = tokenized["input_ids"].clone()
        return tokenized

    # Apply formatting and tokenization
    formatted = raw_dataset.map(
        format_example, remove_columns=raw_dataset["train"].column_names
    )
    tokenized = formatted.map(
        tokenize_batch, batched=True, remove_columns=["text"]
    )
    return (
        tokenized["train"].with_format("torch"),
        tokenized["validation"].with_format("torch"),
        tokenized["test"].with_format("torch"),
    )
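Before training, it can be worth rendering one example through the template to confirm the formatting. The exact markup is defined by the model's chat template, and the messages below are purely illustrative:

# Sanity check -- output markup depends on the model's chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")
messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are common symptoms of anemia?"},
    {"role": "assistant", "content": "Fatigue, pallor, and shortness of breath."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False,
                                    add_generation_prompt=False))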
4. Trainer Creation¶
def create_trainer(model, tokenizer, train_dataset, eval_dataset, output_dir, cfg):
    """Create HuggingFace Trainer with optimized settings."""
    training_args = TrainingArguments(
        output_dir=output_dir,
        max_steps=cfg.max_steps,  # Steps instead of epochs for better control
        per_device_train_batch_size=cfg.batch_size,
        gradient_accumulation_steps=cfg.gradient_accumulation_steps,
        learning_rate=cfg.learning_rate,
        # Evaluation and saving (save_steps must be a round multiple of
        # eval_steps when load_best_model_at_end=True)
        eval_strategy="steps",
        eval_steps=cfg.logging_steps,
        save_steps=cfg.logging_steps * 2,
        save_strategy="steps",
        save_total_limit=3,
        load_best_model_at_end=True,
        # Optimization
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        weight_decay=0.01,
        # Memory optimization
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        # Logging
        logging_steps=cfg.logging_steps,
        report_to="none",  # The string "none" disables wandb/tensorboard
        remove_unused_columns=False,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        processing_class=tokenizer,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
        callbacks=[
            EarlyStoppingCallback(early_stopping_patience=cfg.early_stopping_patience)
        ],
    )
    return trainer
⚙️ Configuration¶
Training Configuration in config.yaml¶
training:
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2e-4
  max_steps: 100
  logging_steps: 10
  early_stopping_patience: 3

lora:
  r: 16
  alpha: 32
  dropout: 0.1
  target_modules: [q_proj, v_proj, k_proj, o_proj]

model:
  name: microsoft/Phi-4-mini-instruct
  max_length: 512
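With these values, each optimizer step processes an effective batch of batch_size × gradient_accumulation_steps = 4 × 8 = 32 sequences, so the 100-step budget covers up to 3,200 examples, roughly 4.4 passes over the 720-example training set, unless early stopping fires first.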
📊 What the Trainer Does Automatically¶
The HuggingFace Trainer class handles all the complex training loop details:
- ✅ Forward/backward passes
- ✅ Loss computation
- ✅ Gradient accumulation
- ✅ Optimizer steps
- ✅ Learning rate scheduling
- ✅ Gradient clipping
- ✅ Evaluation loops
- ✅ Model checkpointing
- ✅ Mixed precision training
- ✅ Distributed training support
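For instance, gradient clipping runs by default with max_grad_norm=1.0, and mixed precision activates once fp16=True or bf16=True is set in TrainingArguments; none of this requires custom loop code.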
🔍 Training Monitoring¶
Built-in Logging¶
def print_gpu_memory_usage():
    """Print current GPU memory usage - called before/during training."""
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / (1024**3)  # GB
            reserved = torch.cuda.memory_reserved(i) / (1024**3)  # GB
            total_memory = torch.cuda.get_device_properties(i).total_memory
            total = total_memory / (1024**3)  # GB
            logger.info(f"🖥️ GPU {i} ({torch.cuda.get_device_name(i)}):")
            logger.info(f"   📊 Memory: {allocated:.2f}GB allocated, "
                        f"{reserved:.2f}GB reserved, {total:.2f}GB total")
            logger.info(f"   💾 Free: {total - reserved:.2f}GB")
    else:
        logger.info("❌ No CUDA GPU available")
Training Output Example¶
🚀 Starting training...
📚 Loading dataset...
Train: 720, Val: 80, Test: 80
⚙️ Setting up model...
✅ Loaded microsoft/Phi-4-mini-instruct from cache
🔧 Configuring LoRA...
trainable params: 83,886,080 || all params: 14,888,534,016 || trainable%: 0.56%
🏃 Creating trainer...
📊 Trainer will run for up to 100 steps for fine-tuning model.
{'train_runtime': 45.67, 'train_samples_per_second': 63.45, 'train_steps_per_second': 2.19,
'train_loss': 1.2345, 'epoch': 2.78}
📊 Evaluating on test dataset...
🎯 Test Results:
eval_loss: 1.1876
eval_samples_per_second: 156.78
💾 Saving model...
✅ Model saved to: ./checkpoints/model/my_custom_llm_Phi-4-mini-instruct
✅ Training complete!
🎯 Key Benefits of This Approach¶
- Simplicity: No custom training loops to debug
- Robustness: Battle-tested HuggingFace implementation
- Feature-rich: Built-in evaluation, checkpointing, early stopping
- Memory efficient: Automatic gradient checkpointing and mixed precision
- Scalable: Easy to extend with callbacks and custom metrics, as sketched below
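As one example of that extensibility, a minimal custom callback could log GPU memory after each evaluation pass. This sketch assumes the print_gpu_memory_usage helper shown earlier is importable; it is illustrative, not part of the project:

from transformers import TrainerCallback

class GPUMemoryCallback(TrainerCallback):
    """Illustrative callback: log GPU memory after every evaluation."""

    def on_evaluate(self, args, state, control, **kwargs):
        print_gpu_memory_usage()  # helper defined earlier in this document

# Hypothetical usage inside create_trainer:
# callbacks=[GPUMemoryCallback(), EarlyStoppingCallback(...)]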
This streamlined approach leverages the power of modern ML libraries to focus on what matters: getting great results quickly and reliably.