SHARCNET: Webinar "Checkpoints: why, when and how"
Date: 7 May 2025 @ 16:00 - 17:00
Topic: "Checkpoints: why, when and how"
Speaker: Weiguang Guan, SHARCNET
Video link
---
Checkpointing is a technique that enables programs to save their current state and later resume execution from a saved state. This mechanism is useful for long-running jobs, which may be interrupted by unpredictable causes such as system failures (either hardware or software), bugs in the running program, timeouts, etc. We have a wiki page about checkpoints that only gives general guidelines. In this webinar, we will introduce checkpointing through a few concrete examples to illustrate what the state of a program is and how its states at different points of execution are saved and restored. We will discuss various topics related to checkpoints, such as saving frequency, checkpoint file types, and how to implement the checkpointing mechanism in different computational job categories: serial, threaded, and MPI.
---
The Compute Ontario Colloquia are weekly Zoom presentations on Advanced Research Computing, High Performance Computing, Research Data Management, and Research Software topics, delivered by staff from three Compute Ontario consortia (CAC, SciNet, SHARCNET) and guest speakers. The series began in January 2023 and superseded similar series previously delivered by individual consortia (e.g. General Interest Seminars by SHARCNET or User Group Meeting TechTalks by SciNet). The colloquia are one hour long and include time for questions. No registration is required. Presentations are usually recorded and uploaded to the hosting consortium's video channel (colloquia hosted by SHARCNET go to our YouTube channel).
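As a rough illustration of the save-and-resume idea described in the webinar abstract above, here is a minimal sketch for a serial job. It is not taken from the webinar. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, the `save_every` interval, and the `stop_at` argument used to simulate an interruption are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash mid-write
    # never leaves a half-written checkpoint under the real name.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, save_every=100, stop_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `save_every` steps.

    `stop_at` is a hypothetical hook that simulates an interruption
    (for example, a job hitting its time limit) at the given step.
    """
    state = load_checkpoint(path)  # resume from the last saved state, if any
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state  # simulated interruption: unsaved progress is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final state
    return state
```

Here the program state is just the loop counter and the running total. If a run is interrupted at step 250 with `save_every=100`, the next call to `run` with the same path resumes from step 200, so at most `save_every` steps of work are repeated. A shorter interval loses less work but spends more time writing files, which is the saving-frequency trade-off mentioned in the abstract.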
Keywords: RDM, Research Data Management
Venue: Online