Philipp Schmid 5/2/2023

How to scale LLM workloads to 20B+ with Amazon SageMaker using Hugging Face and PyTorch FSDP


This article provides a step-by-step guide to scaling fine-tuning workloads for large language models (LLMs) with more than 20 billion parameters. It shows how to use PyTorch Fully Sharded Data Parallel (FSDP) with the Hugging Face Transformers library on Amazon SageMaker's multi-node, multi-GPU clusters (such as ml.p4d.24xlarge instances) to shard model states across GPUs during training, covering environment setup, data preparation, and the training run itself.
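To make the setup concrete, the following is a minimal sketch of launching such a multi-node FSDP job with the SageMaker Python SDK's HuggingFace estimator. The entry-point script name, model checkpoint, framework versions, and hyperparameters here are illustrative assumptions, not the article's exact configuration; the full article covers the training script and data preparation in detail.

```python
# Minimal sketch (assumptions noted inline): launching a multi-node FSDP
# fine-tuning job via the SageMaker Python SDK's HuggingFace estimator.
import sagemaker
from sagemaker.huggingface import HuggingFace

# IAM role for the training job (resolves automatically inside SageMaker;
# pass an explicit role ARN when running elsewhere).
role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",          # hypothetical Trainer-based training script
    source_dir="./scripts",            # assumed local directory holding the script
    instance_type="ml.p4d.24xlarge",   # 8x A100 40GB GPUs per node
    instance_count=2,                  # multi-node: 16 GPUs in total
    transformers_version="4.26",       # illustrative framework versions
    pytorch_version="1.13",
    py_version="py39",
    role=role,
    # torchrun-based launcher; SageMaker wires up the distributed environment
    # (ranks, world size, rendezvous) across both nodes.
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={
        "model_id": "EleutherAI/gpt-neox-20b",  # example 20B-parameter model
        "epochs": 3,
        # forwarded to Hugging Face TrainingArguments to enable FSDP sharding;
        # inner quotes keep the multi-word value intact through the CLI.
        "fsdp": '"full_shard auto_wrap"',
    },
)

# Start the training job; pass data channels (e.g. an S3 URI) as needed.
huggingface_estimator.fit()
```

With `full_shard`, FSDP shards parameters, gradients, and optimizer states across all GPUs instead of replicating them, which is what makes a 20B+ model fit; `auto_wrap` wraps the model's transformer blocks into FSDP units automatically.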
