A New Open Science Initiative Is Setting AI Free

An international collaboration of academic volunteers is breaking into the large language model field with a new 176 billion parameter model as part of an open science initiative.

Trained with US$7 million worth of publicly funded computing time, the BLOOM language model will go toe-to-toe with similar models from tech giants such as Google and OpenAI.

Aside from its collaborative roots and the decision to open-source the project, BLOOM is also the first model of this scale to be multilingual, and it will be made freely available for research.

An open approach

Large language models are ML algorithms that can recognize, predict, and generate human language by drawing on the enormous text-based data sets used to train them.

They can respond to questions, write essays, or generate computer code from minimal instructions. Indeed, GitHub Copilot, which helps software developers write code, is powered by Codex, itself derived from the well-known GPT-3 model.
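The prediction idea at the heart of these models can be illustrated, in grossly simplified form, with a toy bigram predictor. This is a hypothetical sketch for intuition only; BLOOM and its peers use neural networks trained on hundreds of billions of words, not lookup tables.

```python
from collections import Counter, defaultdict

# Toy corpus; real models train on hundreds of billions of words.
corpus = "the cat sat on the mat and the cat slept".split()

# Count which word follows which (a bigram table).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(word):
    """Return the most frequent next word seen in the corpus, or None."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict("the"))  # "cat" follows "the" twice, "mat" only once -> "cat"
```

A large language model does conceptually the same thing, predicting the next token, but learns those statistics across vast corpora and long contexts rather than from adjacent word pairs.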

Due to the massive amount of computing required to train such models, large language models have to date been built by large tech firms with strong financial backing. Behind the scenes, however, they are helmed by relatively small teams, who turn to easily available resources such as online repositories and popular sites like Reddit for the data to train their models.

For its part, BLOOM is the work of hundreds of researchers, mostly academics such as ethicists, legal scholars, and philosophers, according to Nature. Data sources were identified through a series of workshops with a much broader base of collaborators, including community groups around the world.

It is understood that the researchers hand-picked nearly two-thirds of the 341-billion-word data set from some 500 sources. This was topped off with a multilingual web crawl filtered for quality.

A publicly-funded model

“Everything happens completely in the open, anyone can participate, and all research artifacts are shared with the entire research community,” Giada Pistilli, an ethicist at AI firm Hugging Face, was quoted as saying about BLOOM.

“[BLOOM] is designed as an interdisciplinary research workshop, gathering researchers – academic, industrial, and independent – with a diversity of research interests, [including] AI, natural language processing, social science, legal, ethics, and public policy,” she said.

BLOOM is currently being trained on Jean Zay, a French government-funded supercomputer installed at IDRIS, the national computing center of the French National Center for Scientific Research (CNRS).

Access to 384 Nvidia A100 GPUs with 80GB of memory each has been allocated for BLOOM for several months, offering approximately 1.2 million GPU hours. For comparison, Thailand’s National Science and Technology Development Agency’s (NSTDA) new supercomputer will be powered by 704 Nvidia A100 Tensor Core GPUs.

When fully trained, BLOOM will have 176 billion parameters and will have consumed more than 350 billion words across 46 different languages. You can read more about BLOOM's architecture and its various design decisions in the accompanying research paper.

AI for everyone

Already, some have called BLOOM the most important AI model of the decade, ahead of Google’s 540-billion parameter Pathways Language Model (PaLM) and the trail-blazing GPT-3.

With BLOOM, “state-of-the-art AI is no longer reserved for big corporations with big pockets”, argues AI analyst Alberto Romero in a contributed opinion piece.

Romero noted that the funding and building of an open large language model puts intense pressure on the various tech giants to open-source their own models. Seen from this perspective, BLOOM is the “spearhead” of an impending wave of change in the AI field, he argued.

While the groundwork was put in place last year, the actual training of BLOOM started in April. Just a month later, in May, Meta AI announced that it would give away its massive new language model OPT-175B as part of its effort to democratize AI. Coincidence?

For now, the fully trained BLOOM model will be made available for download, though running it will require powerful hardware that few researchers have access to. However, smaller, less hardware-intensive versions will also be released.

In addition, Hugging Face has committed to releasing a web application to query BLOOM online. As the code and data set behind BLOOM are open, it is hoped that researchers can study the model to help improve future iterations.

There is no question that an age of more open AI is at hand.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/RossellaApostoli