Early last month, we first wrote about the attempt by an international group of academic volunteers to break into the large language model (LLM) field with a 176-billion-parameter model.
Designed to go toe-to-toe with similar models from tech giants such as Google and OpenAI, the BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) not only supports multiple languages but was also trained in complete transparency – which cannot be said of other LLMs available today.
The work finally concluded on July 16 with the release of BLOOM, after a final training run of 117 days on a supercomputer that started on March 11. And for those interested in how it was put together, Stas Bekman, a software engineer at Hugging Face, has published a blog post with the details.
The hardware and dataset
If you recall, BLOOM was trained on Jean Zay, a French government-funded supercomputer managed by GENCI and installed at IDRIS, the national computing center of the French National Center for Scientific Research (CNRS). The compute time was donated by GENCI as part of a grant.
The training of BLOOM took approximately one million compute hours, with the bulk of the work done on 384 Nvidia A100 80GB GPUs spread across 48 nodes (eight GPUs per node) of Jean Zay.
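As a quick sanity check on that compute figure, here is a back-of-the-envelope calculation in Python, assuming the 384-GPU count and the 117-day final run mentioned above; the real accounting also includes earlier warm-up runs and spare nodes, so treat this as illustrative only.

```python
# Rough sanity check, not an official accounting: multiply the GPU count by
# the length of the final run to approximate the total GPU-hours consumed.
gpus = 384   # A100 80GB GPUs used for the main run (see above)
days = 117   # length of the final training run

gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")  # 1,078,272 -- roughly the quoted one million compute hours
```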
The BigScience initiative had access to a far larger pool of contributors than the relatively small teams helming LLM projects at the large tech firms. They used this to good effect by hand-picking a significant amount of the dataset (nearly two-thirds). By comparison, the majority (60%) of GPT-3’s training data came from the web-crawled Common Crawl dataset.
The result was a 1.5TB repository of deduplicated and cleaned-up text spanning 46 languages. According to Bekman, this was converted into roughly 350B unique tokens, with the model’s vocabulary size pegged at 250,680 tokens.
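For readers who want to poke at that vocabulary themselves, here is a minimal sketch using the Hugging Face transformers library. It assumes the library is installed and downloads only the tokenizer files from the bigscience/bloom repository on the Hugging Face Hub, not the 176-billion-parameter weights.

```python
# Minimal sketch: inspect BLOOM's multilingual tokenizer via the Hugging Face Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print(tokenizer.vocab_size)  # expected to be in the region of 250,000 entries

# The same tokenizer handles text from any of the training languages.
for sample in ["Hello, world!", "Bonjour le monde !", "你好，世界"]:
    ids = tokenizer(sample)["input_ids"]
    print(sample, "->", len(ids), "tokens")
```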
The dataset, described in “The BigScience Corpus: A 1.6TB Composite Multilingual Dataset”, is available here.
Software and parallelism
It is worth pointing out that the BigScience team had the support of experts from various tech firms, including the Microsoft DeepSpeed team, which created the DeepSpeed deep learning optimization library, and the Nvidia team that developed Megatron-LM, which pioneered work in model parallelism for training LLMs.
In addition, the PyTorch team also came in to fix various bugs that were discovered and to improve the usability of the PyTorch components that the BigScience team relied on for training. PyTorch is an open-source machine learning framework based on the Torch library.
The BLOOM model was trained using Megatron-DeepSpeed, says Bekman, which is a combination of DeepSpeed and Megatron-LM; the resulting model follows a GPT-style transformer architecture similar to that of GPT-3. The team essentially forked the original Megatron-DeepSpeed and made some improvements to it to better support the parallelism the training required.
“The DeepSpeed team developed a 3D parallelism-based implementation by combining ZeRO sharding and pipeline parallelism from the DeepSpeed library with Tensor Parallelism from Megatron-LM,” wrote Bekman, as he went into a lengthy explanation of the design decisions that the team took.
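To make the arithmetic of that “3D” layout concrete, here is a small illustrative sketch. The degrees of parallelism shown are one plausible factorization of a 384-GPU cluster and should not be read as BLOOM’s exact production configuration.

```python
# Illustrative only: in 3D parallelism, the GPU pool is factored into three
# dimensions -- tensor parallelism (splitting individual layers), pipeline
# parallelism (splitting the stack of layers into stages), and data
# parallelism (replicating the whole model over different shards of data).
def world_size(tensor_parallel: int, pipeline_parallel: int, data_parallel: int) -> int:
    """Total GPUs required: each data-parallel replica spans a TP x PP grid."""
    return tensor_parallel * pipeline_parallel * data_parallel

# Hypothetical layout for a 384-GPU cluster: 4-way tensor parallelism inside a
# node, 12 pipeline stages across nodes, and 8 data-parallel replicas.
tp, pp, dp = 4, 12, 8
assert world_size(tp, pp, dp) == 384
print(f"TP={tp} x PP={pp} x DP={dp} -> {world_size(tp, pp, dp)} GPUs")
```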
Difficulties and lessons learned
From previous experience training smaller models (104 billion parameters) and from other LLM efforts, the team knew that using FP16, or half-precision floating point, on Nvidia A100 GPUs for training LLMs is fraught with difficulties.
This is because FP16 can only represent numbers up to roughly 64K, so even the multiplication of relatively small numbers can overflow and destabilize training. To get around this, a team member developed a “BF16Optimizer” workaround for BLOOM built on the BF16 format, which trades precision for a much wider dynamic range.
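A quick way to see the difference is to compare the two 16-bit formats in PyTorch. This small snippet, assuming a reasonably recent PyTorch install, shows a product that overflows in FP16 but stays finite in BF16.

```python
import torch

# FP16 tops out at roughly 65,504, so even modest values overflow when
# multiplied; BF16 keeps FP32's exponent range at the cost of precision.
a = torch.tensor(300.0)
b = torch.tensor(300.0)

print(a.half() * b.half())          # tensor(inf, dtype=torch.float16)
print(a.bfloat16() * b.bfloat16())  # finite, roughly 90,000 in bfloat16
```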
The team encountered various hurdles, including hardware failures stemming from the fact that the cluster it was using was new. On top of that came a couple of GPU failures a week, which led to some lost work because checkpoints were only saved every three hours.
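The mitigation is conceptually simple: checkpoint on a fixed wall-clock interval so that any single failure can only destroy a bounded amount of work. Here is a hypothetical sketch of that idea; the helper names (train_one_step, save_checkpoint) are made up for illustration and are not taken from the BLOOM codebase.

```python
import time

CHECKPOINT_INTERVAL_S = 3 * 60 * 60  # three hours, matching the cadence described above

def training_loop(state, batches, train_one_step, save_checkpoint):
    """Run training, checkpointing on a wall-clock interval.

    train_one_step and save_checkpoint are hypothetical callables standing in
    for the real training step and the real checkpoint-writing logic.
    """
    last_save = time.monotonic()
    for batch in batches:
        state = train_one_step(state, batch)
        if time.monotonic() - last_save >= CHECKPOINT_INTERVAL_S:
            save_checkpoint(state)  # a crash now loses at most ~3 hours of progress
            last_save = time.monotonic()
    return state
```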
Of course, considering that the team only gained access to the supercomputer at the very last moment, there was also the stress of setting everything up and overcoming bugs that only cropped up when multiple nodes were spun up. Once those initial hurdles were overcome, however, the training was smooth sailing.
Already, the BigScience team is looking to the future. In a separate blog entry, they wrote: “[We are] slated to add more languages, compress the model into a more usable version with the same level of performance, and use it as a starting point for more complex architectures.”
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/Andy