Modern LLM and VLM Training Methods - Comprehensive 2025 Guide

Direct Preference Optimization1 and parameter-efficient methods have changed how AI models are trained, enabling high-performance customization at a fraction of traditional costs. A comprehensive 2024 study found that well-tuned PPO still outperforms DPO on complex alignment tasks2, while parameter-efficient techniques like QLoRA3 reach 95–98% of full fine-tuning performance with 99% fewer trainable parameters. Frameworks like VeRL4, together with the DAPO algorithm5, have pushed reasoning capabilities further: a Qwen2.5-32B model trained with DAPO scored 50 points on AIME 2024 using 50% of the training steps of the previous best result.

The training landscape has shifted from pure scale to smarter methodologies. Modern approaches emphasize efficiency, alignment, and specialized capabilities while dramatically reducing computational requirements. Compute-optimal test-time scaling has been shown to use inference compute more than 4x more efficiently than a best-of-N baseline, and in some settings to outperform much larger models; synthetic data generation has become mainstream; and constitutional AI6 enables alignment with minimal human oversight.

TL;DR: LLM & VLM Training Methods Comparison Table

| Training Method | GitHub Implementation | Stars/Activity | Data Requirements | Effectiveness | Cost | Use Cases |
|---|---|---|---|---|---|---|
| Pretraining | DeepSpeed + Megatron-LM | 36.5k + 9.8k | 8-15T tokens (web, books, code) | Foundation performance | $78M-$191M | New model creation |
| | ColossalAI | 38.7k | Same as above | High performance | High | Alternative to DeepSpeed |
| | LitGPT | 10.1k | Same as above | Optimized efficiency | High | Production pretraining |
| Supervised Fine-tuning | TRL (HuggingFace) | 12.5k | 1K-1M examples (JSONL) | 90-95% base → task | $100-$10K | Task adaptation |
| | Axolotl | 7.8k | Same as above | High quality | Low-Medium | YAML-based ease |
| | Unsloth | 20k+ | Same as above | 2-5x faster training | Very Low | Speed optimization |
| | TorchTune | 4.2k | Same as above | Native PyTorch | Low-Medium | PyTorch ecosystem |
| Parameter-Efficient (PEFT) | | | | | | |
| LoRA | PEFT (HuggingFace) | 15.8k | 1K-100K examples | 85-95% of full FT | <$10 | Memory-constrained |
| QLoRA | QLoRA (artidoro) | 10.1k | Same as LoRA | 95-98% of full FT | <$5 | Single GPU training |
| AdaLoRA | AdaLoRA | 1.6k | Same as LoRA | 90-96% of full FT | <$10 | Dynamic adaptation |
| Reinforcement Learning | | | | | | |
| PPO | TRL | 12.5k | 10K-100K preferences | Superior for complex tasks | $1K-$50K | Chat alignment |
| | OpenRLHF | 2.1k | Same as above | Production-ready | Medium-High | Scalable RLHF |
| DPO | TRL | 12.5k | 10K-100K preference pairs | Simpler than PPO | $500-$5K | Easy alignment |
| GRPO | TRL, OpenRLHF | 12.5k, 2.1k | Same as PPO | Memory efficient | Medium | DeepSeek approach |
| DAPO | VeRL | 2.1k | Reasoning datasets | SOTA reasoning (50 AIME) | Medium-High | Math/reasoning |
| Vision-Language Models | | | | | | |
| VLM Pretraining | LLaVA | 19.5k | 400M-5B image-text pairs | Foundation VLM | Very High | Multimodal base |
| VLM Fine-tuning | LLaVA-Instruct | - | 50K-500K image instructions | Task-specific VLM | High | VL applications |
| Frameworks | | | | | | |
| VeRL | volcengine/verl | 2.1k | Any RL method data | 1.5-20x speedup | Variable | Production RLHF |
| DeepSpeed | microsoft/deepspeed | 36.5k | Large-scale data | Industry standard | High | Enterprise scale |
| OpenRLHF | OpenRLHF/OpenRLHF | 2.1k | RLHF datasets | Ray-based scaling | Medium-High | Distributed RLHF |

Data Format Requirements

| Method | Format | Example Structure | Typical Size |
|---|---|---|---|
| Pretraining | Raw text | Plain text files | 15TB+ |
| SFT | JSONL | {"messages": [{"role": "user", "content": "…"}, {"role": "assistant", "content": "…"}]} | 1-100MB |
| RLHF/DPO | Preference pairs | {"prompt": "…", "chosen": "…", "rejected": "…"} | 10-500MB |
| VLM | Image-text | {"image": "path.jpg", "conversations": […]} | 1-100GB |

Performance Effectiveness Ratings

| Category | Method | Performance vs Full Training | Speed | Memory | Best For |
|---|---|---|---|---|---|
| Efficient FT | QLoRA | 95-98% | 1.4x slower | 33% less | Single GPU |
| | LoRA | 85-95% | Similar | 90% less | General use |
| | Full FT | 100% | Baseline | Baseline | Maximum quality |
| Alignment | PPO | Best for complex | Slow | High | Production chat |
| | DPO | Good general | Fast | Low | Simple alignment |
| | DAPO | SOTA reasoning | Medium | Medium | Math/coding |
| Frameworks | VeRL | SOTA algorithms | 1.5-20x faster | Optimized | Research + prod |
| | DeepSpeed | Proven scale | Good | Optimized | Large models |
| | Unsloth | Speed focus | 2-5x faster | 70% less | Resource limited |

Cost Analysis (USD)

| Training Type | Method | Compute Cost | Time | GPU Requirements |
|---|---|---|---|---|
| Pretraining 7B | Standard | $78M+ | Months | 1000+ H100s |
| Full Fine-tune 7B | Standard | $10K-$50K | Days | 8x A100s |
| LoRA 7B | QLoRA | $5-$50 | Hours | 1x RTX 4090 |
| DPO 7B | TRL | $500-$5K | Hours | 4x A100s |
| PPO 7B | OpenRLHF | $1K-$10K | Days | 8x A100s |

Repository Activity Status (2025)

| Repository | Last Update | Community | Documentation | Production Ready |
|---|---|---|---|---|
| DeepSpeed | Active (weekly) | Very Large | Excellent | ✅ Yes |
| VeRL | Active (monthly) | Growing | Good | ✅ Yes |
| TRL | Active (daily) | Large | Excellent | ✅ Yes |
| Unsloth | Active (weekly) | Large | Good | ✅ Yes |
| QLoRA | Stable | Medium | Good | ✅ Yes |
| OpenRLHF | Active (monthly) | Medium | Good | ✅ Yes |
| Axolotl | Active (weekly) | Medium | Excellent | ✅ Yes |

Quick Selection Guide

For Beginners: Unsloth + LoRA (fastest setup, lowest cost)
For Production Chat: TRL + DPO → PPO (proven pipeline)
For Research: VeRL + DAPO (cutting-edge algorithms)
For Scale: DeepSpeed + OpenRLHF (enterprise-grade)
For Vision: LLaVA ecosystem (mature VLM tools)
For Speed: Unsloth (2-5x faster training)
For Memory: QLoRA (single GPU training)

Core Training Methodologies and Their Evolution

Pretraining Foundations

Modern LLM pretraining employs sophisticated multi-stage pipelines rather than single-phase training. Core pretraining on 8–15 trillion tokens is followed by context lengthening phases and high-quality annealing on curated datasets. The Phi-3 approach demonstrates that training on smaller, higher-quality datasets can match or exceed the performance of models trained on larger, noisier corpora.

For VLMs, pretraining focuses on vision-language alignment through contrastive learning (CLIP-style), masked multimodal modeling, and joint training on image-text pairs7. Recent innovations include progressive resolution scaling (224px → 448px → 896px) and pixel shuffle strategies that compress visual information while maintaining aspect ratios.
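As a rough illustration of the CLIP-style contrastive objective mentioned above, the sketch below computes a symmetric InfoNCE loss over a tiny batch of matched image and text embeddings in plain Python. The function name, the toy embeddings, and the temperature of 0.07 are assumptions chosen for illustration; real training would use a tensor library and learned encoders.

```python
import math

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) embeddings."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # The matching pair for row i sits on the diagonal (column i).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the image-to-text and text-to-image directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

# Toy batch: correctly paired embeddings vs. deliberately swapped pairs.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is closest to its own caption and grows large when the pairs are mismatched, which is the signal that pulls the two embedding spaces into alignment.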

Key 2025 developments include instruction pretraining (integrating Q&A data directly into pretraining), continuous pretraining on domain-specific data8, and the emergence of synthetic data as a primary training source. Hugging Face’s FineWeb dataset9 provides 15 trillion tokens of cleaned Common Crawl data, while specialized datasets like RedPajama10 offer 1.2 trillion tokens replicating successful training recipes.

Supervised Fine-Tuning Advances

Supervised fine-tuning has evolved beyond simple instruction following to encompass multi-task training across diverse domains11. Modern SFT typically uses 500K–1M examples for 2–3 epochs, with an emphasis on high-quality curation over raw volume. The standard JSONL format12 structures conversations with system, user, and assistant roles, enabling complex multi-turn dialogue training.
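To make the chat-style JSONL format concrete, here is a minimal sketch that validates records and serializes them one JSON object per line. The helper names and the specific checks (allowed roles, non-empty content, final assistant turn) are assumptions for illustration, not a specification required by any particular framework.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_sft_record(record):
    """Return a list of problems found in one chat-format record (empty if valid)."""
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant turn to learn from")
    return problems

def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

good = {"messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A low-rank adapter method for fine-tuning."},
]}
bad = {"messages": [{"role": "tool", "content": ""}]}

jsonl_text = to_jsonl([good])
```

Running a check like this before training is a cheap way to catch malformed rows early, which matters when curation quality counts for more than volume.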

Data quality requirements have become paramount. Successful datasets like Alpaca13 (52K examples) and Dolly14 (15K human-generated examples) demonstrate that smaller, high-quality datasets often outperform larger, noisier alternatives. Synthetic data integration using LLM-generated examples has become standard practice for specialized domains.

VLM fine-tuning requires multimodal instruction datasets combining visual question answering, image captioning, and vision-language reasoning tasks. LLaVA-Instruct’s15 158K visual instruction examples represent the current standard, with formats combining image paths and conversational structures.

Parameter-Efficient Fine-Tuning Revolution

QLoRA has fundamentally changed3 large model training accessibility. By combining 4-bit NormalFloat quantization with double quantization and paged optimizers, QLoRA enables fine-tuning of 65B-parameter models on a single 48GB GPU while preserving 16-bit fine-tuning performance. The Guanaco models trained this way reached 99.3% of ChatGPT’s performance level on the Vicuna benchmark after just 24 hours of fine-tuning on a single GPU.
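The idea behind 4-bit block-wise quantization can be sketched in a few lines. This is a simplified stand-in, not the NF4 data type itself: the codebook below is built from evenly spaced normal quantiles as an assumption, whereas the real NF4 codebook is constructed differently (it includes an exact zero), and double quantization of the per-block scales is omitted.

```python
from statistics import NormalDist

def normal_codebook(levels=16):
    """Evenly spaced quantiles of N(0, 1), rescaled to [-1, 1] (simplified, not NF4)."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def quantize_block(block, codebook):
    """Store one absmax scale per block plus a 4-bit code index per weight."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / scale))
             for w in block]
    return scale, codes

def dequantize_block(scale, codes, codebook):
    """Recover approximate weights from the scale and code indices."""
    return [scale * codebook[c] for c in codes]

codebook = normal_codebook()
weights = [0.31, -0.12, 0.05, -0.44, 0.02, 0.18, -0.07, 0.27]
scale, codes = quantize_block(weights, codebook)
restored = dequantize_block(scale, codes, codebook)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as an index into 16 levels, so a block costs 4 bits per weight plus one scale. Placing levels at normal quantiles spends more of the codebook where normally distributed weights are dense.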

LoRA variants have proliferated rapidly. AdaLoRA16 provides dynamic rank adjustment during training, DoRA17 decomposes weights into magnitude and direction components, and VB-LoRA18 introduces vector banks for reusable parameter sharing. These methods consistently achieve 85–95% of full fine-tuning performance with less than 1% trainable parameters19.

Implementation best practices include rank selection in the r=4–32 range (r=8 commonly optimal), targeting all transformer layers rather than just attention matrices, and using alpha parameters at 2x the rank value. Results remain remarkably stable across multiple runs, making these methods highly reliable for production use20.
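The rank and alpha settings above are easier to reason about with the LoRA forward pass written out. The sketch below is a plain-Python toy, with the class name, initialization scale, and tiny dimensions all chosen for illustration: the frozen weight W is augmented by a low-rank update scaled by alpha / r, and B starts at zero so the adapted layer initially matches the base layer.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, r=8, alpha=16, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                         # frozen, d_out x d_in
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]   # zero init: no change at step 0
        self.scaling = alpha / r                     # alpha = 2 * r gives 2.0

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, r=2, alpha=4)
x = [1.0, 2.0, 3.0]
initial_output = layer.forward(x)
```

Because only A and B are trained, the parameter count grows with r * (d_in + d_out) rather than d_in * d_out, which is where the sub-1% trainable fraction comes from at realistic layer sizes.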


Reinforcement Learning and Preference Optimization

Direct Preference Optimization Ascendant

DPO has become a widely used alignment method in recent models, including Llama 3 and Qwen 2.5, often as one stage of a larger post-training pipeline. By eliminating the separate reward model and treating the language model as an implicit reward function, DPO provides simplified implementation21 with improved stability compared to traditional RLHF. The standard format requires prompt-chosen-rejected triplets, with 10K–100K preference pairs typically sufficient for effective alignment.
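The implicit-reward idea can be written out for a single preference triplet. The sketch below assumes the summed token log-probabilities of each response under the policy and a frozen reference model have already been computed; the function name and the default beta of 0.1 are illustrative choices.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors a response than the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) in a numerically stable softplus form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)      # policy equals reference
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)      # policy prefers the chosen answer
regressed = dpo_loss(-15.0, -5.0, -10.0, -10.0)     # policy prefers the rejected answer
```

When the policy equals the reference, the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one, with no reward model or sampling loop involved.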

However, recent comprehensive research reveals limitations2. The April 2024 study “Is DPO Superior to PPO for LLM Alignment?” found that a well-tuned PPO outperforms DPO across dialogue and code generation benchmarks, with DPO showing particular sensitivity to distribution shift between the preference data and the model’s own outputs. On CodeContest, the study’s PPO model reached 22.4% (10@1k), up from the 16.4% of AlphaCode-41B. Production systems from OpenAI, Anthropic, and others are reported to continue relying on PPO-style RLHF.

Advanced Reinforcement Learning Methods

Group Relative Policy Optimization (GRPO)22 eliminates the value function by sampling multiple responses per prompt and scoring each one relative to the rest of its group, using the group’s reward statistics as the baseline for advantage estimation. This approach, used prominently in DeepSeek models, reduces memory requirements while maintaining effectiveness, particularly for mathematical reasoning tasks.
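A minimal sketch of the group-based advantage estimate: each sampled response's reward is normalized against the mean and standard deviation of its own group, so no learned value function is needed. The use of the population standard deviation and the small epsilon are assumptions made here for simplicity.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, while a group with identical rewards yields all zeros.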

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)5 is a recent state-of-the-art method for training reasoning models. Built on the VeRL framework4, DAPO introduces four techniques: clip-higher (a larger upper clipping bound that counters entropy collapse), dynamic sampling (filtering out prompts whose samples carry no gradient signal), token-level loss computation, and overlong reward shaping. With a Qwen2.5-32B base model, DAPO reached 50 points on AIME 2024 using 50% of the training steps of the previous best result.
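Three of these ideas are simple enough to sketch directly. The code below is an illustrative approximation rather than the VeRL implementation: the function names are hypothetical, and the clipping bounds of 0.2 and 0.28 are assumed example values. Overlong reward shaping is left out.

```python
def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with a larger upper bound than lower bound."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def dynamic_sampling_filter(groups):
    """Keep prompt groups whose sampled rewards are not all identical."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

def token_level_loss(per_token_objectives):
    """Average over every token in the batch instead of averaging per sequence first."""
    tokens = [t for seq in per_token_objectives for t in seq]
    return -sum(tokens) / len(tokens)

groups = [
    {"prompt": "a", "rewards": [1, 1, 1]},   # all correct: zero advantage everywhere
    {"prompt": "b", "rewards": [1, 0, 1]},   # mixed: useful signal
    {"prompt": "c", "rewards": [0, 0]},      # all wrong: zero advantage everywhere
]
kept = dynamic_sampling_filter(groups)

# One short sequence and one long sequence: tokens are weighted equally.
loss = token_level_loss([[1.0], [0.0, 0.0, 0.0]])
```

With a positive advantage, a probability ratio of 1.25 passes through unclipped here, whereas a symmetric bound of 0.2 would cap it at 1.2; that extra headroom for raising low-probability tokens is what the clip-higher idea is after.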

VeRL itself has become the premier production-ready RL training framework, supporting multiple algorithms (PPO, GRPO, DAPO, DPO) with efficient distributed training capabilities. Its hybrid-controller programming model and 3D-HybridEngine enable 1.53–20.57x speedup improvements over conventional approaches while supporting models up to 70B parameters.


Open-Source Implementations and Ecosystem

Leading Training Frameworks

DeepSpeed23 (36.5k stars) remains the industry standard for large-scale training, with ZeRO optimizations enabling models up to 170B parameters. Its 3D parallelism and integration with Megatron-LM make it the preferred choice for pre-training applications. Microsoft continues active development with regular performance improvements and expanded model support.

VeRL4 (2.1k stars) represents the cutting edge of RLHF frameworks despite being relatively new. Developed by Volcano Engine, it supports the latest algorithms including DAPO and provides state-of-the-art throughput performance. The framework’s hybrid programming model and efficient model resharding capabilities make it particularly attractive for research and production RLHF deployment.

Unsloth24 (20k+ stars) has gained massive popularity for its exceptional speed optimizations, delivering 2–5x faster training with 70% less memory usage. Its manual backprop engine and OpenAI Triton kernels achieve these gains with 0% accuracy loss, making it particularly attractive for practitioners with limited computational resources.

TRL25 (12.5k stars) provides the official HuggingFace RLHF integration, supporting SFT, PPO, DPO, GRPO, and KTO methods. Its native ecosystem integration and comprehensive documentation make it the go-to choice for teams already invested in the HuggingFace ecosystem.

Framework Selection Guidance

For large-scale pre-training: DeepSpeed + Megatron-LM26 offers proven scalability and performance optimization. ColossalAI27 provides an emerging alternative with advanced memory management features.

For RLHF training: VeRL delivers cutting-edge performance with the latest algorithms, while OpenRLHF28 offers mature Ray-based scaling and TRL provides seamless HuggingFace integration.

For efficient fine-tuning: Unsloth excels in speed optimization, Axolotl29 provides YAML-based ease of use, and TorchTune30 offers native PyTorch integration. The choice depends on existing infrastructure and team preferences.

For production deployment: LitGPT31 provides optimized implementations without abstraction overhead, while DeepSpeed offers battle-tested scalability for high-throughput serving scenarios.


Data Requirements and Practical Considerations

Training Data Specifications

Pre-training requirements have scaled to 15+ trillion tokens for state-of-the-art models, with data sourced from Common Crawl (15T tokens in FineWeb9), books (~21T tokens from 180M titles), academic papers (~2.7T tokens), and code repositories (~0.8T tokens). Preprocessing involves sophisticated filtration for quality and toxicity, deduplication with overlap analysis, and PII redaction for privacy compliance32.
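A toy version of such a preprocessing pass is sketched below: a length filter, exact deduplication on normalized text, and redaction of email addresses. The thresholds, the regular expression, and the `[EMAIL]` placeholder are assumptions for illustration; production pipelines typically add near-duplicate detection and learned quality and toxicity classifiers.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()

def preprocess(documents, min_words=5):
    """Drop very short documents and exact duplicates, then redact emails."""
    seen = set()
    kept = []
    for doc in documents:
        if len(doc.split()) < min_words:
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(EMAIL_RE.sub("[EMAIL]", doc))
    return kept

docs = [
    "Contact the maintainers at team@example.com for dataset access.",
    "contact the  maintainers at team@example.com for dataset access.",
    "Too short.",
    "Low-rank adapters reduce the number of trainable parameters.",
]
cleaned = preprocess(docs)
```

Hashing normalized text keeps memory bounded to one digest per unique document, which is why exact deduplication is usually the first and cheapest filter applied.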

Fine-tuning datasets typically require 1K–1M examples depending on complexity, with high-quality datasets like Alpaca’s13 52K examples often outperforming larger alternatives. The standard JSONL conversation format12 has become ubiquitous, with structured messages containing system, user, and assistant roles.

VLM training demands 400M–5B image-text pairs for pre-training (LAION-5B33 provides 5B+ multilingual pairs) and 50K–500K image-instruction pairs for fine-tuning (LLaVA-Instruct’s15 158K examples represent current standards). Storage requirements scale to terabytes due to image data, requiring careful consideration of infrastructure costs.

RLHF preference data needs 10K–100K preference pairs, with human-annotated quality being crucial. Popular datasets include Anthropic HH-RLHF34 for harmlessness/helpfulness preferences, UltraFeedback35 with 64K multi-aspect preferences, and Stanford SHP36 with 385K Reddit-derived preferences.

Cost-Effectiveness Analysis

Training cost stratification reveals dramatic differences between methods37. Pre-training frontier models costs an estimated $78M–$191M (GPT-4 estimated range), while LoRA fine-tuning typically costs under $10 for 4 hours on a single GPU. That is a difference of roughly seven orders of magnitude, and for task adaptation LoRA still retains 95–98% of full fine-tuning performance.

Memory requirements similarly vary dramatically. Full fine-tuning requires 12x model size due to optimizer states and gradients, while LoRA adds only 16.78MB for 7B parameter models. QLoRA provides an additional 33% memory savings compared to LoRA, though with 39% longer training time38.
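The 16.78MB figure can be reproduced under one plausible configuration, which is an assumption here rather than something the source states: rank-8 adapters on two attention projections (for example query and value) in each of 32 layers with hidden size 4096, stored in 32-bit floats.

```python
def lora_adapter_params(hidden_size, num_layers, rank, matrices_per_layer):
    """Count LoRA parameters when each adapted matrix is hidden_size x hidden_size."""
    # Each adapted matrix gains A (rank x hidden) and B (hidden x rank).
    return num_layers * matrices_per_layer * rank * (hidden_size + hidden_size)

# Assumed 7B-class shape: 32 layers, hidden size 4096, r=8 on two projections.
params = lora_adapter_params(hidden_size=4096, num_layers=32, rank=8,
                             matrices_per_layer=2)
adapter_mb = params * 4 / 1e6          # 4 bytes per fp32 parameter, decimal megabytes
trainable_fraction = params / 7e9      # share of a 7B-parameter base model
```

Under these assumptions the adapter holds 4,194,304 parameters, about 16.78MB in fp32 and roughly 0.06% of the base model. Targeting more matrices or raising the rank scales the count linearly.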

Resource optimization strategies include synthetic data generation for specialized domains, active learning to prioritize informative examples, transfer learning leveraging existing pre-trained models, and efficient data structures with compression for storage optimization39.


Performance Evaluation and Effectiveness

Benchmark Performance Insights

Method comparison studies reveal nuanced performance trade-offs40. While DPO offers implementation simplicity, comprehensive 2024 research2 found PPO ahead across its experimental benchmarks. The production success of hybrid approaches (Llama 3 combined SFT, rejection sampling, PPO, and DPO) suggests that combining methods yields the best results.

Parameter-efficient methods achieve remarkable effectiveness, with LoRA configurations delivering 85–95% of full fine-tuning performance using 0.08–0.1% trainable parameters41. Medical imaging PEFT studies42 across 17 algorithms and 700+ experiments confirm these findings hold across diverse domains and data regimes.

VLM evaluation has advanced through standardized benchmarks: MMMU43 (11.5K multimodal challenges) and MMBench44 (3,000 single-choice questions across 20 skills) provide reliable performance assessment. The Open VLM Leaderboard45, tracking 54 VLMs across 22 benchmarks, has become a widely used evaluation resource.

Practical Implementation Recommendations

For resource-constrained scenarios: LoRA/QLoRA provide 95–98% of full fine-tuning performance with 99% fewer trainable parameters and 90%+ cost savings. Training times drop from weeks to hours, making advanced model customization accessible to individual researchers and small teams46.

For maximum performance: Hybrid approaches combining multiple alignment methods (SFT → PPO → DPO) deliver superior results. Constitutional AI6 integration provides safety-critical alignment, while ensemble methods enhance production reliability.

For multiple tasks: Single base models with multiple LoRA adapters enable efficient multi-task deployment. S-LoRA47 provides efficient serving of multiple adapters, while adapter merging optimizes deployment performance.
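Adapter merging amounts to folding the low-rank update into the base weight, giving W + (alpha / r) * B @ A. The plain-Python sketch below shows one shared base matrix with two hypothetical task adapters; the adapter names and values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_adapter(weight, A, B, alpha):
    """Return W + (alpha / r) * B @ A, the merged weight for one adapter."""
    r = len(A)
    delta = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# One shared base weight and two hypothetical rank-1 task adapters.
base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "summarize": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]], "alpha": 2},
    "translate": {"A": [[0.0, 2.0]], "B": [[0.0], [1.0]], "alpha": 2},
}
merged = {name: merge_adapter(base, ad["A"], ad["B"], ad["alpha"])
          for name, ad in adapters.items()}
```

Merging removes the extra matrix multiplications at inference time for a single task, while keeping adapters separate lets one copy of the base weights serve many tasks. Which trade-off is better depends on how many adapters are live at once.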


Emerging Techniques and Future Directions

Breakthrough Innovations

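Test-time scaling48 has emerged as a complement, and sometimes an alternative, to parameter scaling: a compute-optimal strategy was shown to be more than 4x more efficient than a best-of-N baseline, and on some problems a smaller model with extra test-time compute can outperform a much larger one. Models generate multiple candidate solutions, process reward models (PRMs) act as verifiers to score and select among them, and compute is allocated adaptively based on problem difficulty.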
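The selection step can be sketched as best-of-N sampling with a verifier, plus a simple rule that spends more samples on harder prompts. Everything here is a hypothetical stand-in: the toy verifier scores closeness to a known target, whereas a real system would score candidates with a learned reward model, and the linear allocation rule is an assumed heuristic.

```python
def best_of_n(candidates, verifier):
    """Return the candidate the verifier scores highest."""
    return max(candidates, key=verifier)

def samples_for_difficulty(difficulty, min_samples=1, max_samples=16):
    """Allocate more samples to harder prompts; difficulty is clamped to [0, 1]."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return min_samples + round(difficulty * (max_samples - min_samples))

def toy_verifier(answer, target=42):
    """Hypothetical verifier: higher score for answers closer to the target."""
    return -abs(answer - target)

candidates = [40, 45, 42, 39]          # pretend these were sampled from a model
chosen = best_of_n(candidates, toy_verifier)
easy_budget = samples_for_difficulty(0.0)
hard_budget = samples_for_difficulty(1.0)
```

The quality of the verifier bounds the quality of the result: best-of-N can only surface a good answer that was sampled and that the verifier ranks first.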

Constitutional AI evolution6 now incorporates Collective Constitutional AI (CCAI) with public input and principle-driven training. This enables alignment with minimal human oversight through constitutional principles, reducing dependence on extensive human feedback.

Architecture simplification through Dynamic Tanh (DyT) operations49 replacing Layer Norm delivers an 8.2% training time reduction and 7.8% inference speedup while maintaining performance across vision, language, speech, and DNA modeling tasks.

Mixture of Experts (MoE) scaling50 enables massive parameter scaling with efficient inference. Models like Mixtral 8x7B route each token to 2 of 8 experts per layer, while DBRX uses a finer-grained design that selects 4 of 16 experts for enhanced specialization.
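Sparse routing of this kind can be sketched with scalar "experts". The code below picks the top-k router logits, applies a softmax over only those logits, and mixes the selected experts' outputs; the scalar experts and logit values are toy assumptions, and real routers operate per token on vectors with learned gating weights.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Eight toy experts; expert s simply multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
routing = top_k_route(logits)
output = moe_forward(3.0, experts, logits)
```

Only two of the eight experts run for this input, so compute per token tracks the active parameters rather than the total parameter count.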

Market Trajectory and Investment Patterns

Synthetic data use continues to expand; earlier forecasts projected that 60% of AI training data would be synthetic by 202451. This trend, exemplified by Phi-4’s primarily GPT-4o-generated training data, offers enhanced control over data quality and domain specialization.

Market growth projections indicate the AI training dataset market will reach $34B by 2033 with 24.9% CAGR from 2025–203252. Major funding continues flowing into data annotation platforms and curation technologies, reflecting the critical importance of high-quality training data.


Conclusion

The training methodology landscape has matured from experimental research to production-ready implementations. Parameter-efficient methods have democratized access to advanced AI customization, while sophisticated frameworks like VeRL enable cutting-edge research and deployment. The shift toward hybrid training approaches, combined with emerging techniques like test-time scaling and constitutional AI, provides clear pathways for developing capable, aligned, and efficient AI systems across diverse applications and resource constraints.

References

  1. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. ArXiv preprint arXiv:2305.18290. https://arxiv.org/abs/2305.18290
  2. Xu, S., Fu, W., Gao, J., Ye, W., Liu, W., Mei, Z., Wang, G., Yu, C., & Wu, Y. (2024). Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study. ArXiv preprint arXiv:2404.10719. https://arxiv.org/abs/2404.10719
  3. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. ArXiv preprint arXiv:2305.14314. https://arxiv.org/abs/2305.14314
  4. VeRL Team. (2024). VeRL: Volcano Engine Reinforcement Learning for LLMs. GitHub repository. https://github.com/volcengine/verl
  5. ByteDance Research. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. ArXiv preprint arXiv:2503.14476. https://arxiv.org/html/2503.14476v1
  6. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosiute, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., DasSarma, N., Lasenby, R., Larson, R., Ringer, S., Johnston, S., Kravec, S., Showk, S. E., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei, D., Joseph, N., McCandlish, S., Brown, T., & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. ArXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
  7. Scalable Vision Language Model Training via High Quality Data Curation. (2025). ArXiv preprint arXiv:2501.05952. https://arxiv.org/html/2501.05952v1
  8. Chen, S., Wong, S., Chen, L., & Tian, Y. (2024). Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs. EMNLP Findings 2024. https://arxiv.org/html/2409.14988v1
  9. HuggingFace. (2024). FineWeb Dataset. HuggingFace Datasets. https://huggingface.co/datasets/HuggingFaceFW/fineweb
  10. Together Computer. (2023). RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset. GitHub repository. https://github.com/togethercomputer/RedPajama-Data
  11. SuperAnnotate. (2025). Fine-tuning large language models (LLMs) in 2025. SuperAnnotate Blog. https://www.superannotate.com/blog/llm-fine-tuning
  12. CodeFriends. (2024). JSONL Format for Fine-Tuning. CodeFriends Resources. https://resources.codefriends.net/en/ai/fine-tuning/basics/chapter-2/jsonl-for-training
  13. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). Stanford Alpaca: An Instruction-following LLaMA model. Stanford GitHub repository. https://github.com/tatsu-lab/stanford_alpaca
  14. Databricks. (2023). Dolly: A Large Language Model Trained on the Databricks Machine Learning Platform. Databricks Blog. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
  15. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. ArXiv preprint arXiv:2304.08485. https://github.com/haotian-liu/LLaVA
  16. Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., & Zhao, T. (2023). AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. ICLR 2023. https://arxiv.org/abs/2303.10512
  17. Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ArXiv preprint arXiv:2402.09353. https://arxiv.org/abs/2402.09353
  18. Li, Y., Han, S., & Ji, S. (2024). VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks. ArXiv preprint arXiv:2405.15179. https://arxiv.org/abs/2405.15179
  19. Raschka, S. (2024). Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation). Sebastian Raschka’s Magazine. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
  20. Lightning AI. (2024). Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments. Lightning AI Blog. https://lightning.ai/pages/community/lora-insights/
  21. Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., & Wolf, T. (2023). Zephyr: Direct Distillation of LM Alignment. ArXiv preprint arXiv:2310.16944. https://arxiv.org/abs/2310.16944
  22. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. ArXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300
  23. Microsoft. (2020). DeepSpeed: Extreme-scale model training for everyone. GitHub repository. https://github.com/microsoft/DeepSpeed
  24. Unsloth AI. (2024). Unsloth: Finetune Qwen3, Llama 4, TTS, DeepSeek-R1 & Gemma 3 LLMs 2x faster with 70% less memory! GitHub repository. https://github.com/unslothai/unsloth
  25. HuggingFace. (2022). TRL: Transformer Reinforcement Learning. GitHub repository. https://github.com/huggingface/trl
  26. NVIDIA. (2021). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. GitHub repository. https://github.com/NVIDIA/Megatron-LM
  27. HPC-AI Tech. (2021). ColossalAI: Making large AI models cheaper, faster and more accessible. GitHub repository. https://github.com/hpcaitech/ColossalAI
  28. OpenRLHF Team. (2024). OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. GitHub repository. https://github.com/OpenRLHF/OpenRLHF
  29. Axolotl AI. (2023). Axolotl: Go ahead and axolotl questions. GitHub repository. https://github.com/axolotl-ai-cloud/axolotl
  30. PyTorch Team. (2024). TorchTune: PyTorch native post-training library. GitHub repository. https://github.com/pytorch/torchtune
  31. Lightning AI. (2023). LitGPT: 20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale. GitHub repository. https://github.com/Lightning-AI/litgpt
  32. Labonne, M. (2024). Data Collection and Preprocessing for LLMs. Labellerr Blog. https://www.labellerr.com/blog/data-collection-and-preprocessing-for-large-language-models/
  33. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. ArXiv preprint arXiv:2210.08402. https://arxiv.org/abs/2210.08402
  34. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., Mann, B., & Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. ArXiv preprint arXiv:2204.05862. https://arxiv.org/abs/2204.05862
  35. Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., & Sun, M. (2023). UltraFeedback: Boosting Language Models with High-quality Feedback. ArXiv preprint arXiv:2310.01377. https://arxiv.org/abs/2310.01377
  36. Ethayarajh, K., Choi, Y., & Swayamdipta, S. (2022). Understanding Dataset Difficulty with V-Usable Information (source of the Stanford Human Preferences Dataset, SHP). ICML 2022. https://huggingface.co/datasets/stanfordnlp/SHP
  37. CUDO Compute. (2024). What is the cost of training large language models? CUDO Compute Blog. https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models
  38. Red Marble AI. (2024). The Cost of Fine Tuning an LLM. Red Marble Blog. https://redmarble.ai/cost-of-fine-tuning-an-llm/
  39. Turing. (2025). A Comprehensive Guide to LLM Development in 2025. Turing Resources. https://www.turing.com/resources/the-complete-guide-to-llm-development
  40. Raschka, S. (2024). How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? Sebastian Raschka’s Blog. https://sebastianraschka.com/blog/2024/how-good-open-llm.html
  41. Databricks. (2024). Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection for Large Language Models. Databricks Blog. https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms
  42. Dutt, R., Ericsson, L., Sanchez, P., Tsaftaris, S. A., & Hospedales, T. (2024). Parameter-Efficient Fine-Tuning for Medical Image Analysis: The Missed Opportunity. PMLR. https://proceedings.mlr.press/v250/dutt24a.html
  43. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., & Chen, W. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. ArXiv preprint arXiv:2311.16502. https://arxiv.org/abs/2311.16502
  44. Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., & Lin, D. (2023). MMBench: Is Your Multi-modal Model an All-around Player? ArXiv preprint arXiv:2307.06281. https://arxiv.org/abs/2307.06281
  45. Open VLM Leaderboard. (2024). Hugging Face Open VLM Leaderboard. Hugging Face Spaces. https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
  46. Mercity AI. (2024). In-depth guide to fine-tuning LLMs with LoRA and QLoRA. Mercity Blog. https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
  47. Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez, J. E., & Stoica, I. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. ArXiv preprint arXiv:2311.03285. https://arxiv.org/abs/2311.03285
  48. Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. ArXiv preprint arXiv:2408.03314. https://arxiv.org/abs/2408.03314
  49. Zhu, J., Chen, X., He, K., LeCun, Y., & Liu, Z. (2025). Transformers without Normalization. ArXiv preprint arXiv:2503.10622. https://arxiv.org/abs/2503.10622
  50. HuggingFace. (2024). Mixture of Experts Explained. HuggingFace Blog. https://huggingface.co/blog/moe
  51. MarkTechPost. (2024). Large Language Model (LLM) Training Data Is Running Out. How Close Are We To The Limit? MarkTechPost. https://www.marktechpost.com/2024/05/14/large-language-model-llm-training-data-is-running-out-how-close-are-we-to-the-limit/
  52. Fortune Business Insights. (2025). AI Training Dataset Market Size, Share | Global Report [2032]. Fortune Business Insights. https://www.fortunebusinessinsights.com/ai-training-dataset-market-109241

http://blog.chivier.site/2025-06-09/2025/Modern-LLM-and-VLM-Training-Methods---Comprehensive-2025-Guide/
Author
Chivier Humber
Posted on
June 9, 2025
Licensed under