Date: 9 April 2025 @ 16:00 - 17:00

Topic: "Too Big to Train: Large model training in PyTorch with Fully Sharded Data Parallel"

Speaker: Collin Wilson, SHARCNET

Video link

With the popularity of Large Language Models and the general trend of scaling up model and dataset sizes come challenges in training. Despite hardware improvements, many models are too large to fit onto a single GPU, or large enough that only small batch sizes are feasible, which leads to long training times. One strategy for parallelizing training is Fully Sharded Data Parallel (FSDP), provided by PyTorch. FSDP splits a model into shards and distributes the shards across multiple GPUs, which makes it possible to train very large models and to scale up training. In this talk, we'll discuss implementing FSDP in your training code, examine training performance from an efficiency perspective, and compare it with another parallelization strategy, data parallelism. Some experience with Python, PyTorch and deep learning is expected.

The Compute Ontario Colloquia are weekly Zoom presentations on Advanced Research Computing, High Performance Computing, Research Data Management, and Research Software topics, delivered by staff from three Compute Ontario consortia (CAC, SciNet, SHARCNET) and guest speakers. The series began in January 2023 and superseded similar series previously delivered by individual consortia (e.g. General Interest Seminars by SHARCNET or User Group Meeting TechTalks by SciNet). The colloquia are one hour long and include time for questions. No registration is required. Presentations are usually recorded and uploaded to the hosting consortium's video channel (colloquia hosted by SHARCNET go to our YouTube channel).

Keywords: RDM, Research Data Management, GPU, HPC, Machine Learning, AI, Python, Programming

Venue: online

