Modern LLM and VLM Training Methods - Comprehensive 2025 Guide

Direct Preference Optimization¹ and parameter-efficient methods have reshaped AI model training, enabling high-performance customization at a fraction of traditional costs. A 2024 comparative study found that PPO still outperforms DPO on complex alignment tasks², while parameter-efficient techniques like QLoRA³ achieve 95–98% of full fine-tuning performance with roughly 99% fewer trainable parameters. Frameworks like VeRL⁴, together with the DAPO algorithm⁵, have pushed reasoning capabilities further: DAPO reports 50 points on AIME 2024 using 50% fewer training steps than the previous best comparable result.

The training landscape has shifted from pure scale to smarter methodologies. Modern approaches emphasize efficiency, alignment, and specialized capabilities while dramatically reducing computational requirements. Compute-optimal test-time scaling is reported to be more than 4x as efficient as a best-of-N baseline and can, in some settings, let a smaller model match a much larger one; synthetic data generation has become mainstream; and constitutional AI⁶ enables alignment with minimal human oversight.

TL;DR: LLM & VLM Training Methods Comparison Table

Training Method	GitHub Implementation	Stars/Activity	Data Requirements	Effectiveness	Cost	Use Cases
Pretraining	DeepSpeed + Megatron-LM	36.5k + 9.8k	8-15T tokens (web, books, code)	Foundation performance	$78M-$191M	New model creation
	ColossalAI	38.7k	Same as above	High performance	High	Alternative to DeepSpeed
	LitGPT	10.1k	Same as above	Optimized efficiency	High	Production pretraining
Supervised Fine-tuning	TRL (HuggingFace)	12.5k	1K-1M examples (JSONL)	90-95% base → task	$100-$10K	Task adaptation
	Axolotl	7.8k	Same as above	High quality	Low-Medium	YAML-based ease
	Unsloth	20k+	Same as above	2-5x faster training	Very Low	Speed optimization
	TorchTune	4.2k	Same as above	Native PyTorch	Low-Medium	PyTorch ecosystem
Parameter-Efficient (PEFT)
LoRA	PEFT (HuggingFace)	15.8k	1K-100K examples	85-95% of full FT	<$10	Memory-constrained
QLoRA	QLoRA (artidoro)	10.1k	Same as LoRA	95-98% of full FT	<$5	Single GPU training
AdaLoRA	AdaLoRA	1.6k	Same as LoRA	90-96% of full FT	<$10	Dynamic adaptation
Reinforcement Learning
PPO	TRL	12.5k	10K-100K preferences	Superior for complex tasks	$1K-$50K	Chat alignment
	OpenRLHF	2.1k	Same as above	Production-ready	Medium-High	Scalable RLHF
DPO	TRL	12.5k	10K-100K preference pairs	Simpler than PPO	$500-$5K	Easy alignment
GRPO	TRL, OpenRLHF	12.5k, 2.1k	Same as PPO	Memory efficient	Medium	DeepSeek approach
DAPO	VeRL	2.1k	Reasoning datasets	SOTA reasoning (50 AIME)	Medium-High	Math/reasoning
Vision-Language Models
VLM Pretraining	LLaVA	19.5k	400M-5B image-text pairs	Foundation VLM	Very High	Multimodal base
VLM Fine-tuning	LLaVA-Instruct	-	50K-500K image instructions	Task-specific VLM	High	VL applications
Frameworks
VeRL	volcengine/verl	2.1k	Any RL method data	1.5-20x speedup	Variable	Production RLHF
DeepSpeed	microsoft/deepspeed	36.5k	Large-scale data	Industry standard	High	Enterprise scale
OpenRLHF	OpenRLHF/OpenRLHF	2.1k	RLHF datasets	Ray-based scaling	Medium-High	Distributed RLHF

Data Format Requirements

Method	Format	Example Structure	Typical Size
Pretraining	Raw text	Plain text files	15TB+
SFT	JSONL	`{"messages": [{"role": "user", "content": "…"}, {"role": "assistant", "content": "…"}]}`	1-100MB
RLHF/DPO	Preference pairs	`{"prompt": "…", "chosen": "…", "rejected": "…"}`	10-500MB
VLM	Image-text	`{"image": "path.jpg", "conversations": […]}`	1-100GB
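
As a minimal sketch of the SFT and preference formats in the table above, the following Python serializes records as one JSON object per line and checks the required keys. The helper names (`validate_record`, `to_jsonl`) and the sample strings are illustrative assumptions, not part of any framework's API.

```python
import json

# Required top-level keys per format, taken from the table above.
REQUIRED_KEYS = {
    "sft": {"messages"},
    "preference": {"prompt", "chosen", "rejected"},
}

def validate_record(record: dict, kind: str) -> bool:
    """Return True if the record has the keys expected for this format."""
    if not REQUIRED_KEYS[kind].issubset(record):
        return False
    if kind == "sft":
        # Every message needs both a role and a content field.
        return all({"role", "content"}.issubset(m) for m in record["messages"])
    return True

def to_jsonl(records: list) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Hypothetical sample records.
sft_example = {"messages": [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A low-rank adapter method."},
]}
pref_example = {
    "prompt": "Explain DPO.",
    "chosen": "A clear, on-topic answer.",
    "rejected": "An off-topic answer.",
}

print(to_jsonl([sft_example]))
print(validate_record(pref_example, "preference"))
```

Validating keys before training is cheap and catches most malformed rows; individual frameworks may expect additional or differently named fields, so check the loader you actually use.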

Performance Effectiveness Ratings

Category	Method	Performance vs Full Training	Speed	Memory	Best For
Efficient FT	QLoRA	95-98%	1.4x slower	33% less	Single GPU
	LoRA	85-95%	Similar	90% less	General use
	Full FT	100%	Baseline	Baseline	Maximum quality
Alignment	PPO	Best for complex	Slow	High	Production chat
	DPO	Good general	Fast	Low	Simple alignment
	DAPO	SOTA reasoning	Medium	Medium	Math/coding
Frameworks	VeRL	SOTA algorithms	1.5-20x faster	Optimized	Research + prod
	DeepSpeed	Proven scale	Good	Optimized	Large models
	Unsloth	Speed focus	2-5x faster	70% less	Resource limited

Cost Analysis (USD)

Training Type	Method	Compute Cost	Time	GPU Requirements
Pretraining 7B	Standard	$78M+	Months	1000+ H100s
Full Fine-tune 7B	Standard	$10K-$50K	Days	8x A100s
LoRA 7B	QLoRA	$5-$50	Hours	1x RTX 4090
DPO 7B	TRL	$500-$5K	Hours	4x A100s
PPO 7B	OpenRLHF	$1K-$10K	Days	8x A100s

Repository Activity Status (2025)

Repository	Last Update	Community	Documentation	Production Ready
DeepSpeed	Active (weekly)	Very Large	Excellent	✅ Yes
VeRL	Active (monthly)	Growing	Good	✅ Yes
TRL	Active (daily)	Large	Excellent	✅ Yes
Unsloth	Active (weekly)	Large	Good	✅ Yes
QLoRA	Stable	Medium	Good	✅ Yes
OpenRLHF	Active (monthly)	Medium	Good	✅ Yes
Axolotl	Active (weekly)	Medium	Excellent	✅ Yes

Quick Selection Guide

For Beginners: Unsloth + LoRA (fastest setup, lowest cost)
For Production Chat: TRL + DPO → PPO (proven pipeline)
For Research: VeRL + DAPO (cutting-edge algorithms)
For Scale: DeepSpeed + OpenRLHF (enterprise-grade)
For Vision: LLaVA ecosystem (mature VLM tools)
For Speed: Unsloth (2-5x faster training)
For Memory: QLoRA (single GPU training)

Core Training Methodologies and Their Evolution

Pretraining Foundations

Modern LLM pretraining employs sophisticated multi-stage pipelines rather than single-phase training. Core pretraining on 8–15 trillion tokens is followed by context lengthening phases and high-quality annealing on curated datasets. The Phi-3 approach demonstrates that training on smaller, higher-quality datasets can match or exceed the performance of models trained on larger, noisier corpora.

For VLMs, pretraining focuses on vision-language alignment through contrastive learning (CLIP-style), masked multimodal modeling, and joint training on image-text pairs⁷. Recent innovations include progressive resolution scaling (224px → 448px → 896px) and pixel shuffle strategies that compress visual information while maintaining aspect ratios.

Key 2025 developments include instruction pretraining (integrating Q&A data directly into pretraining), continuous pretraining on domain-specific data⁸, and the emergence of synthetic data as a primary training source. Hugging Face’s FineWeb dataset⁹ provides 15 trillion tokens of cleaned Common Crawl data, while RedPajama¹⁰ offers 1.2 trillion tokens assembled to reproduce the LLaMA training data recipe.

Supervised Fine-Tuning Advances

Supervised fine-tuning has evolved beyond simple instruction following to encompass multi-task training across diverse domains¹¹. Modern SFT typically uses 500K–1M examples for 2–3 epochs, with an emphasis on high-quality curation over raw volume. The standard JSONL format¹² structures conversations with system, user, and assistant roles, enabling complex multi-turn dialogue training.

Data quality requirements have become paramount. Successful datasets like Alpaca¹³ (52K examples) and Dolly¹⁴ (15K human-generated examples) demonstrate that smaller, high-quality datasets often outperform larger, noisier alternatives. Synthetic data integration using LLM-generated examples has become standard practice for specialized domains.

VLM fine-tuning requires multimodal instruction datasets combining visual question answering, image captioning, and vision-language reasoning tasks. LLaVA-Instruct’s¹⁵ 158K visual instruction examples represent the current standard, with formats combining image paths and conversational structures.

Parameter-Efficient Fine-Tuning Revolution

QLoRA³ has fundamentally changed the accessibility of large model training. By combining 4-bit NormalFloat quantization with double quantization and paged optimizers, QLoRA enables fine-tuning of a 65B-parameter model on a single 48GB GPU while preserving 16-bit fine-tuning performance. The Guanaco models trained this way reached 99.3% of ChatGPT’s performance on the Vicuna benchmark after 24 hours of fine-tuning on a single GPU.

LoRA variants have proliferated rapidly. AdaLoRA¹⁶ provides dynamic rank adjustment during training, DoRA¹⁷ decomposes weights into magnitude and direction components, and VB-LoRA¹⁸ introduces vector banks for reusable parameter sharing. These methods consistently achieve 85–95% of full fine-tuning performance with less than 1% trainable parameters¹⁹.

Implementation best practices include selecting a rank in the r=4–32 range (r=8 is a common starting point), applying LoRA to all linear layers (attention and MLP projections) rather than only the attention matrices, and setting alpha to 2x the rank. Results remain stable across repeated runs, making these methods reliable for production use²⁰.
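
To make the rank and alpha guidance concrete, here is a small arithmetic sketch. It assumes the original LoRA formulation, in which a weight of shape d_out × d_in gains two low-rank factors (A of shape r × d_in, B of shape d_out × r) and the update B·A is scaled by alpha / r. The 4096 × 4096 layer size is a hypothetical example, not a figure from the sources cited here.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has r * d_in entries, B has d_out * r entries.
    return r * (d_in + d_out)

def lora_scale(alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update B @ A."""
    return alpha / r

# Hypothetical layer: a square 4096 x 4096 projection with rank 8 and alpha = 2 * r.
d, r = 4096, 8
full = d * d
added = lora_param_count(d, d, r)

print(added, full, f"{100 * added / full:.2f}%")  # 65536 16777216 0.39%
print(lora_scale(alpha=2 * r, r=r))               # 2.0
```

With alpha fixed at 2x the rank, the effective scale stays at 2.0 as r changes, which is one reason this heuristic makes rank sweeps easier to compare.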

Reinforcement Learning and Preference Optimization

Direct Preference Optimization Ascendant

DPO has emerged as the preferred alignment method for most recent models, including Llama 3 and Qwen 2.5. By eliminating the reward model and treating the language model as an implicit reward function, DPO provides simplified implementation²¹ with improved stability compared to traditional RLHF. The standard format requires prompt-chosen-rejected triplets, with 10K–100K preference pairs typically sufficient for effective alignment.
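
The implicit-reward idea can be written down in a few lines. The sketch below computes the DPO loss for a single prompt-chosen-rejected triplet from summed response log-probabilities; the function name and the default `beta=0.1` are assumptions for illustration, and a real trainer would compute the log-probabilities from the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet.

    Each argument is the summed token log-probability of a response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors a response than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is zero and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

No reward model appears anywhere: the only inputs are log-probabilities from two copies of the language model, which is where the implementation simplicity comes from.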

However, subsequent research points to limitations². The April 2024 study “Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study” found that a well-tuned PPO outperforms DPO across its dialogue and code generation benchmarks, and that DPO is particularly sensitive to distribution shift between the preference data and the model’s own outputs. On CodeContest, the study’s PPO model reached 22.4% (10@1k), up from the 16.4% previously reported for AlphaCode-41B. Production systems from OpenAI, Anthropic, and others are reported to continue relying primarily on PPO-based methods.

Advanced Reinforcement Learning Methods

Group Relative Policy Optimization (GRPO)²² eliminates the learned value function by sampling a group of responses per prompt and estimating each response’s advantage relative to the others in its group, typically by normalizing its reward against the group mean and standard deviation. This approach, used prominently in DeepSeek models, reduces memory requirements while maintaining effectiveness, particularly for mathematical reasoning tasks.
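
A minimal sketch of the group-relative advantage computation follows. It assumes scalar rewards, one group per prompt, and normalization by the population standard deviation with a small epsilon for stability; the function name and those details are illustrative choices rather than a reference implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses above the group mean get positive advantages, those below get negative ones.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one math prompt, rewarded 1.0 if correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline is the group mean, no critic network has to be trained or held in memory, which is the source of the memory savings described above.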

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)⁵ represents the current state-of-the-art in reasoning model training. Introduced through the VeRL framework⁴, DAPO incorporates four key innovations: clip-higher strategy (preventing entropy collapse), dynamic sampling (filtering ineffective training samples), token-level loss computation, and overlong reward shaping. DAPO achieved 50 points on AIME 2024 with 50% fewer training steps than previous methods.
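
Three of these ideas are compact enough to sketch directly. The code below is a simplified illustration, not the VeRL implementation: the function names are hypothetical, the clip bounds (`eps_low=0.2`, `eps_high=0.28`) are assumed defaults, and overlong reward shaping is omitted.

```python
def keep_group(rewards):
    """Dynamic sampling: keep a prompt only if its sampled responses differ in reward.

    If every response got the same reward, all group-relative advantages are zero
    and the prompt contributes no gradient signal.
    """
    return len(set(rewards)) > 1

def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate for one token with decoupled lower and upper clip bounds.

    A larger upper bound (clip-higher) leaves more room to raise the probability
    of low-probability tokens, which helps keep entropy from collapsing.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def token_level_loss(responses):
    """Average the negated objective over all tokens in the batch.

    `responses` is a list of responses; each response is a list of
    (probability_ratio, advantage) pairs, one per token. Averaging over tokens
    rather than per response keeps long responses from being down-weighted.
    """
    terms = [clipped_token_objective(r, a) for resp in responses for (r, a) in resp]
    return -sum(terms) / len(terms)

print(keep_group([1.0, 1.0, 1.0]), keep_group([1.0, 0.0, 1.0]))
print(clipped_token_objective(1.5, 1.0))
```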

VeRL itself has become a leading production-ready RL training framework, supporting multiple algorithms (PPO, GRPO, DAPO, DPO) with efficient distributed training. Its hybrid-controller programming model and 3D-HybridEngine are reported to deliver 1.53–20.57x throughput improvements over baseline systems while supporting models up to 70B parameters.

Open-Source Implementations and Ecosystem

Leading Training Frameworks

DeepSpeed²³ (36.5k stars) remains the industry standard for large-scale training, with ZeRO optimizations enabling models up to 170B parameters. Its 3D parallelism and integration with Megatron-LM make it the preferred choice for pre-training applications. Microsoft continues active development with regular performance improvements and expanded model support.

VeRL⁴ (2.1k stars) represents the cutting edge of RLHF frameworks despite being relatively new. Developed by Volcano Engine, it supports the latest algorithms including DAPO and provides state-of-the-art throughput performance. The framework’s hybrid programming model and efficient model resharding capabilities make it particularly attractive for research and production RLHF deployment.

Unsloth²⁴ (20k+ stars) has gained massive popularity for its exceptional speed optimizations, delivering 2–5x faster training with 70% less memory usage. Its manual backprop engine and OpenAI Triton kernels achieve these gains with 0% accuracy loss, making it particularly attractive for practitioners with limited computational resources.

TRL²⁵ (12.5k stars) provides the official HuggingFace RLHF integration, supporting SFT, PPO, DPO, GRPO, and KTO methods. Its native ecosystem integration and comprehensive documentation make it the go-to choice for teams already invested in the HuggingFace ecosystem.

Framework Selection Guidance

For large-scale pre-training: DeepSpeed + Megatron-LM²⁶ offers proven scalability and performance optimization. ColossalAI²⁷ provides an emerging alternative with advanced memory management features.

For RLHF training: VeRL delivers cutting-edge performance with the latest algorithms, while OpenRLHF²⁸ offers mature Ray-based scaling and TRL provides seamless HuggingFace integration.

For efficient fine-tuning: Unsloth excels in speed optimization, Axolotl²⁹ provides YAML-based ease of use, and TorchTune³⁰ offers native PyTorch integration. The choice depends on existing infrastructure and team preferences.

For production deployment: LitGPT³¹ provides optimized implementations without abstraction overhead, while DeepSpeed offers battle-tested scalability for high-throughput serving scenarios.

Data Requirements and Practical Considerations

Training Data Specifications

Pre-training requirements have scaled to 15+ trillion tokens for state-of-the-art models, with data sourced from Common Crawl (15T tokens in FineWeb⁹), books (~21T tokens from 180M titles), academic papers (~2.7T tokens), and code repositories (~0.8T tokens). Preprocessing involves sophisticated filtration for quality and toxicity, deduplication with overlap analysis, and PII redaction for privacy compliance³².

Fine-tuning datasets typically require 1K–1M examples depending on complexity, with high-quality datasets like Alpaca’s¹³ 52K examples often outperforming larger alternatives. The standard JSONL conversation format¹² has become ubiquitous, with structured messages containing system, user, and assistant roles.

VLM training demands 400M–5B image-text pairs for pre-training (LAION-5B³³ provides 5B+ multilingual pairs) and 50K–500K image-instruction pairs for fine-tuning (LLaVA-Instruct’s¹⁵ 158K examples represent current standards). Storage requirements scale to terabytes due to image data, requiring careful consideration of infrastructure costs.

RLHF preference data needs 10K–100K preference pairs, with human-annotated quality being crucial. Popular datasets include Anthropic HH-RLHF³⁴ for harmlessness/helpfulness preferences, UltraFeedback³⁵ with 64K multi-aspect preferences, and Stanford SHP³⁶ with 385K Reddit-derived preferences.

Cost-Effectiveness Analysis

Training costs differ dramatically between methods³⁷. Pre-training frontier models costs an estimated $78M–$191M (GPT-4 estimated range), while LoRA fine-tuning typically costs under $10 for 4 hours on a single GPU. That is a difference of roughly seven orders of magnitude, although the two are not substitutes: LoRA adapts an already pretrained model, reaching 95–98% of full fine-tuning performance.

Memory requirements similarly vary dramatically. Full fine-tuning requires 12x model size due to optimizer states and gradients, while LoRA adds only 16.78MB for 7B parameter models. QLoRA provides an additional 33% memory savings compared to LoRA, though with 39% longer training time³⁸.

Resource optimization strategies include synthetic data generation for specialized domains, active learning to prioritize informative examples, transfer learning leveraging existing pre-trained models, and efficient data structures with compression for storage optimization³⁹.

Performance Evaluation and Effectiveness

Benchmark Performance Insights

Method comparison studies reveal nuanced performance trade-offs⁴⁰. While DPO offers implementation simplicity, comprehensive 2024 research² demonstrates PPO’s consistent superiority across experimental benchmarks. The production success of hybrid approaches (Llama 3’s SFT → PPO → DPO pipeline) suggests combining methods yields optimal results.

Parameter-efficient methods achieve remarkable effectiveness, with LoRA configurations delivering 85–95% of full fine-tuning performance using 0.08–0.1% trainable parameters⁴¹. Medical imaging PEFT studies⁴² across 17 algorithms and 700+ experiments confirm these findings hold across diverse domains and data regimes.

VLM evaluation has advanced through standardized benchmarks like MMMU⁴³ (11.5K multimodal challenges) and MMBench⁴⁴ (3,000 single-choice questions across 20 skills), which provide reliable performance assessment. The Open VLM Leaderboard⁴⁵, tracking 54 VLMs across 22 benchmarks, has become a widely used evaluation resource.

Practical Implementation Recommendations

For resource-constrained scenarios: LoRA/QLoRA provide 95–98% performance with 99% parameter reduction and 90%+ cost savings. Training times reduce from weeks to hours, making advanced model customization accessible to individual researchers and small teams⁴⁶.

For maximum performance: Hybrid approaches combining multiple alignment methods (SFT → PPO → DPO) deliver superior results. Constitutional AI⁶ integration provides safety-critical alignment, while ensemble methods enhance production reliability.

For multiple tasks: Single base models with multiple LoRA adapters enable efficient multi-task deployment. S-LoRA⁴⁷ provides efficient serving of multiple adapters, while adapter merging optimizes deployment performance.
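
Adapter merging can be illustrated with plain lists. The sketch below assumes the standard LoRA update, W' = W + (alpha / r)·B·A, and shows that the merged weight gives the same output as running the base weight and the adapter path separately; the helper names and the tiny 2 × 2 example are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(w, x):
    """Multiply a matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def merge_lora(w, a, b, alpha, r):
    """Fold a LoRA adapter into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Tiny example: 2 x 2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape r x d_in
B = [[1.0], [0.0]]    # shape d_out x r
x = [1.0, 1.0]

merged = merge_lora(W, A, B, alpha=2, r=1)
print(matvec(merged, x))
```

Merging removes the extra matrix multiplications at inference time, at the cost of fixing one adapter per copy of the weights; serving many adapters from one base model, as S-LoRA does, keeps them unmerged instead.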

Emerging Techniques and Future Directions

Breakthrough Innovations

Test-time scaling⁴⁸ has emerged as a complement, and in some settings an alternative, to parameter scaling: a compute-optimal allocation is reported to be more than 4x as efficient as a best-of-N baseline. Models generate multiple candidate solutions, verifiers such as process reward models (PRMs) evaluate and select among them, and compute is allocated adaptively based on problem difficulty.
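
The simplest form of this idea is best-of-N selection, sketched below. The `generate` and `score` callables are placeholders for a sampling model and a verifier; the toy versions here are assumptions for illustration only, and a real PRM would score intermediate reasoning steps rather than just the final answer.

```python
import random

def best_of_n(prompt, generate, score, n=4):
    """Sample n candidate answers and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Toy stand-ins: a "model" that guesses integers and a "verifier" that prefers guesses near 42.
rng = random.Random(0)

def toy_generate(prompt):
    return rng.randint(0, 100)

def toy_score(prompt, candidate):
    return -abs(candidate - 42)

print(best_of_n("What is 6 * 7?", toy_generate, toy_score, n=8))
```

Raising `n` spends more inference compute on the same model; adaptive schemes go further by choosing `n` (or a search strategy) per prompt based on estimated difficulty.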

Constitutional AI evolution⁶ now incorporates Collective Constitutional AI (CCAI) with public input and principle-driven training. This enables alignment with minimal human oversight through constitutional principles, reducing dependence on extensive human feedback.

Architecture simplification through Dynamic Tanh (DyT) operations⁴⁹ replacing Layer Norm delivers an 8.2% training time reduction and 7.8% inference speedup while maintaining performance across vision, language, speech, and DNA modeling tasks.
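
For reference, the DyT operation is an element-wise function, DyT(x) = γ · tanh(αx) + β, with a learnable scalar α and per-channel γ and β. The sketch below is a framework-free illustration; the class layout and the initial value `alpha_init=0.5` are assumptions, and a real implementation would make all three quantities trainable parameters.

```python
import math

class DyT:
    """Dynamic Tanh: an element-wise stand-in for a normalization layer."""

    def __init__(self, dim, alpha_init=0.5):
        self.alpha = alpha_init      # scalar controlling how quickly tanh saturates
        self.gamma = [1.0] * dim     # per-channel scale
        self.beta = [0.0] * dim      # per-channel shift

    def __call__(self, x):
        # No mean or variance statistics are computed, unlike Layer Norm.
        return [g * math.tanh(self.alpha * xi) + b
                for xi, g, b in zip(x, self.gamma, self.beta)]

layer = DyT(dim=3)
print(layer([-100.0, 0.0, 2.0]))
```

Because the function needs no reduction over the feature dimension, it avoids the statistics computation that Layer Norm performs, which is consistent with the reported training and inference time savings.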

Mixture of Experts (MoE) scaling⁵⁰ enables massive parameter scaling with efficient inference. Models like Mixtral 8x7B have 8 experts per layer but route each token to only 2, while DBRX uses a finer-grained design with 16 experts, 4 of which are active per token.
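
Sparse routing of this kind can be sketched in a few lines. The example assumes the common top-k scheme in which the router's k largest logits are softmax-normalized and used to weight the selected experts' outputs; the function names and the scalar toy experts are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by routing weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Eight toy experts; expert i multiplies its input by i.
experts = [lambda x, i=i: i * x for i in range(8)]
logits = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0]

print(top_k_route(logits, k=2))
print(moe_forward(1.0, experts, logits, k=2))
```

Only k of the experts run per token, so compute per token scales with k while total parameter count scales with the number of experts.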

Market Trajectory and Investment Patterns

The share of synthetic data continues to expand; one widely cited forecast expected 60% of AI training data to be synthetic by 2024⁵¹. This trend, exemplified by Phi-4’s primarily GPT-4o-generated training data, offers enhanced control over data quality and domain specialization.

Market growth projections indicate the AI training dataset market will reach $34B by 2033 with 24.9% CAGR from 2025–2032⁵². Major funding continues flowing into data annotation platforms and curation technologies, reflecting the critical importance of high-quality training data.

Conclusion

The training methodology landscape has matured from experimental research to production-ready implementations. Parameter-efficient methods have democratized access to advanced AI customization, while sophisticated frameworks like VeRL enable cutting-edge research and deployment. The shift toward hybrid training approaches, combined with emerging techniques like test-time scaling and constitutional AI, provides clear pathways for developing capable, aligned, and efficient AI systems across diverse applications and resource constraints.

References

1. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. ArXiv preprint arXiv:2305.18290. https://arxiv.org/abs/2305.18290
2. Xu, S., Fu, W., Gao, J., Ye, W., Liu, W., Mei, Z., Wang, G., Yu, C., & Wu, Y. (2024). Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study. ArXiv preprint arXiv:2404.10719. https://arxiv.org/abs/2404.10719
3. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. ArXiv preprint arXiv:2305.14314. https://arxiv.org/abs/2305.14314
4. VeRL Team. (2024). VeRL: Volcano Engine Reinforcement Learning for LLMs. GitHub repository. https://github.com/volcengine/verl
5. ByteDance Research. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. ArXiv preprint arXiv:2503.14476. https://arxiv.org/html/2503.14476v1
6. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., DasSarma, N., Lasenby, R., Larson, R., Ringer, S., Johnston, S., Kravec, S., Showk, S. E., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei, D., Joseph, N., McCandlish, S., Brown, T., & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. ArXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
7. Scalable Vision Language Model Training via High Quality Data Curation. (2025). ArXiv preprint arXiv:2501.05952. https://arxiv.org/html/2501.05952v1
8. Chen, S., Wong, S., Chen, L., & Tian, Y. (2024). Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs. EMNLP Findings 2024. https://arxiv.org/html/2409.14988v1
9. HuggingFace. (2024). FineWeb Dataset. HuggingFace Datasets. https://huggingface.co/datasets/HuggingFaceFW/fineweb
10. Together Computer. (2023). RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset. GitHub repository. https://github.com/togethercomputer/RedPajama-Data
11. SuperAnnotate. (2025). Fine-tuning large language models (LLMs) in 2025. SuperAnnotate Blog. https://www.superannotate.com/blog/llm-fine-tuning
12. CodeFriends. (2024). JSONL Format for Fine-Tuning. CodeFriends Resources. https://resources.codefriends.net/en/ai/fine-tuning/basics/chapter-2/jsonl-for-training
13. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). Stanford Alpaca: An Instruction-following LLaMA model. Stanford GitHub repository. https://github.com/tatsu-lab/stanford_alpaca
14. Databricks. (2023). Dolly: A Large Language Model Trained on the Databricks Machine Learning Platform. Databricks Blog. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
15. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. ArXiv preprint arXiv:2304.08485. https://github.com/haotian-liu/LLaVA
16. Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., & Zhao, T. (2023). AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. ICLR 2023. https://arxiv.org/abs/2303.10512
17. Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ArXiv preprint arXiv:2402.09353. https://arxiv.org/abs/2402.09353
18. Li, Y., Han, S., & Ji, S. (2024). VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks. ArXiv preprint arXiv:2405.15179. https://arxiv.org/abs/2405.15179
19. Raschka, S. (2024). Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation). Sebastian Raschka’s Magazine. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
20. Lightning AI. (2024). Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments. Lightning AI Blog. https://lightning.ai/pages/community/lora-insights/
21. Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., & Wolf, T. (2023). Zephyr: Direct Distillation of LM Alignment. ArXiv preprint arXiv:2310.16944. https://arxiv.org/abs/2310.16944
22. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. ArXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300
23. Microsoft. (2020). DeepSpeed: Extreme-scale model training for everyone. GitHub repository. https://github.com/microsoft/DeepSpeed
24. Unsloth AI. (2024). Unsloth: Finetune Qwen3, Llama 4, TTS, DeepSeek-R1 & Gemma 3 LLMs 2x faster with 70% less memory! GitHub repository. https://github.com/unslothai/unsloth
25. HuggingFace. (2022). TRL: Transformer Reinforcement Learning. GitHub repository. https://github.com/huggingface/trl
26. NVIDIA. (2021). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. GitHub repository. https://github.com/NVIDIA/Megatron-LM
27. HPC-AI Tech. (2021). ColossalAI: Making large AI models cheaper, faster and more accessible. GitHub repository. https://github.com/hpcaitech/ColossalAI
28. OpenRLHF Team. (2024). OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. GitHub repository. https://github.com/OpenRLHF/OpenRLHF
29. Axolotl AI. (2023). Axolotl: Go ahead and axolotl questions. GitHub repository. https://github.com/axolotl-ai-cloud/axolotl
30. PyTorch Team. (2024). TorchTune: PyTorch native post-training library. GitHub repository. https://github.com/pytorch/torchtune
31. Lightning AI. (2023). LitGPT: 20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale. GitHub repository. https://github.com/Lightning-AI/litgpt
32. Labonne, M. (2024). Data Collection and Preprocessing for LLMs. Labellerr Blog. https://www.labellerr.com/blog/data-collection-and-preprocessing-for-large-language-models/
33. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Worts, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. ArXiv preprint arXiv:2210.08402. https://arxiv.org/abs/2210.08402
34. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., Mann, B., & Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. ArXiv preprint arXiv:2204.05862. https://arxiv.org/abs/2204.05862
35. Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., & Sun, M. (2023). UltraFeedback: Boosting Language Models with High-quality Feedback. ArXiv preprint arXiv:2310.01377. https://arxiv.org/abs/2310.01377
36. Ethayarajh, K., Xu, Y., Muennighoff, S., Jurafsky, D., & Kiela, D. (2022). SHP: Stanford Human Preferences Dataset for Zero-shot Reward Modeling. ArXiv preprint arXiv:2212.10560. https://arxiv.org/abs/2212.10560
37. CUDO Compute. (2024). What is the cost of training large language models? CUDO Compute Blog. https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models
38. Red Marble AI. (2024). The Cost of Fine Tuning an LLM. Red Marble Blog. https://redmarble.ai/cost-of-fine-tuning-an-llm/
39. Turing. (2025). A Comprehensive Guide to LLM Development in 2025. Turing Resources. https://www.turing.com/resources/the-complete-guide-to-llm-development
40. Raschka, S. (2024). How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? Sebastian Raschka’s Blog. https://sebastianraschka.com/blog/2024/how-good-open-llm.html
41. Databricks. (2024). Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection for Large Language Models. Databricks Blog. https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms
42. Dutt, R., Ericsson, L., Sanchez, P., Tsaftaris, S. A., & Hospedales, T. (2024). Parameter-Efficient Fine-Tuning for Medical Image Analysis: The Missed Opportunity. PMLR. https://proceedings.mlr.press/v250/dutt24a.html
43. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., & Chen, W. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. ArXiv preprint arXiv:2311.16502. https://arxiv.org/abs/2311.16502
44. Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., & Lin, D. (2023). MMBench: Is Your Multi-modal Model an All-around Player? ArXiv preprint arXiv:2307.06281. https://arxiv.org/abs/2307.06281
45. Open VLM Leaderboard. (2024). Hugging Face Open VLM Leaderboard. Hugging Face Spaces. https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
46. Mercity AI. (2024). In-depth guide to fine-tuning LLMs with LoRA and QLoRA. Mercity Blog. https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
47. Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez, J. E., & Stoica, I. (2023). S-LoRA: Serving thousands of concurrent LoRA adapters. ArXiv preprint arXiv:2311.03285. https://arxiv.org/abs/2311.03285
48. Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. ArXiv preprint arXiv:2408.03314. https://arxiv.org/abs/2408.03314
49. Zhu, J., Chen, X., He, K., LeCun, Y., & Liu, Z. (2025). Transformers without Normalization. ArXiv preprint arXiv:2503.10622. https://arxiv.org/abs/2503.10622
50. HuggingFace. (2024). Mixture of Experts Explained. HuggingFace Blog. https://huggingface.co/blog/moe
51. MarkTechPost. (2024). Large Language Model (LLM) Training Data Is Running Out. How Close Are We To The Limit? MarkTechPost. https://www.marktechpost.com/2024/05/14/large-language-model-llm-training-data-is-running-out-how-close-are-we-to-the-limit/
52. Fortune Business Insights. (2025). AI Training Dataset Market Size, Share | Global Report [2032]. Fortune Business Insights. https://www.fortunebusinessinsights.com/ai-training-dataset-market-109241

http://blog.chivier.site/2025-06-09/2025/Modern-LLM-and-VLM-Training-Methods---Comprehensive-2025-Guide/

Author

Chivier Humber

Posted on

June 9, 2025
