
AI2, the nonprofit AI research institute founded by the late Microsoft co-founder Paul Allen, is releasing several GenAI language models that it says are more "open" than others and, crucially, licensed so that developers can use them freely for training, experimentation and even commercial use.
The models, called OLMo (short for "Open Language MOdels"), and the data set used to train them, Dolma, one of the largest public data sets of its kind, were created to advance the science of text-generating AI, AI2 senior software engineer Dirk Groeneveld said.
“‘Open’ is a vague term when it comes to [text-generating models],” Groeneveld said in an email interview with TechCrunch. “We hope researchers and practitioners will use the OLMo framework as a chance to examine a model trained on one of the largest public data sets ever released, along with all the tools needed to build the models.”
There is no shortage of open source text-generating models, with organizations from Meta to Mistral releasing capable models for any developer to use and adapt. But Groeneveld argues that many of these can't truly be called open because they were trained "in secret" and on proprietary, opaque data sets.
The OLMo models, built in collaboration with partners including Harvard, AMD and Databricks, ship with the code used to produce their training data, as well as training and evaluation metrics and logs.
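For readers who want to poke at the training data themselves, here is a minimal sketch of streaming a few Dolma records with Hugging Face's `datasets` library. The repository identifier `allenai/dolma` and the `text` field name are assumptions, and the full data set is very large and may require accepting a license on Hugging Face first; check the official dataset card for the exact details.

```python
# Minimal sketch: streaming a few records from the Dolma training data with
# Hugging Face's `datasets` library. The repo id "allenai/dolma" and the
# "text" field name are assumptions; consult the official dataset card.
from datasets import load_dataset

# streaming=True avoids downloading the entire (multi-terabyte-scale) data set
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

for i, record in enumerate(dolma):
    print(record["text"][:200])  # print the first 200 characters of each document
    if i >= 2:
        break
```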
The most powerful OLMo model, OLMo 7B, is a “convincing and robust” alternative to Meta’s Llama 2, Groeneveld claims — depending on the use case. On some benchmarks, especially those related to reading comprehension, OLMo 7B beats Llama 2. But on others, especially question-answering tests, OLMo 7B is slightly behind.
The OLMo models also have some drawbacks, like poor-quality outputs in languages other than English (Dolma mostly has English-language content) and weak code-generation abilities. But Groeneveld emphasized that it’s still early.
"OLMo is not meant to be multilingual — yet," he said. "[And while] code generation [was not] the main focus of the OLMo framework at this point, to help future code-based fine-tuning projects, OLMo's data mix has about 15% code in it."
I asked Groeneveld whether he was concerned that the OLMo models, which can be used commercially and are powerful enough to run on consumer GPUs like the Nvidia 3090, might be exploited in unintended, potentially harmful ways by bad actors. A recent study by Democracy Reporting International's Disinfo Radar project, which aims to identify and counter disinformation trends and technologies, found that two popular open text-generating models, Hugging Face's Zephyr and Databricks' Dolly, reliably produce toxic content, generating "imaginative" harmful output in response to malicious prompts.
Groeneveld believes the benefits will outweigh the harms in the long run.
“[B]uilding this open platform will actually enable more research on how these models can be risky and what we can do to improve them,” he said. “Yes, it’s possible open models may be used wrongly or for unintended purposes. [However, this] approach also encourages technical improvements that lead to more ethical models; is a requirement for verification and reproducibility, as these can only be done with access to the full stack; and lowers a growing concentration of power, creating more fair access.”
In the next few months, AI2 plans to release bigger and more capable OLMo models, including multimodal models (i.e. models that comprehend modalities beyond text), and more data sets for training and fine-tuning. As with the first OLMo and Dolma release, all resources will be free on GitHub and the AI project hosting platform Hugging Face.
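As a rough illustration of the consumer-GPU point above, here is a minimal sketch of loading an OLMo 7B checkpoint in half precision with the `transformers` library. The model identifier `allenai/OLMo-7B` and the need for `trust_remote_code` are assumptions on my part; consult the official model card on Hugging Face for the exact instructions.

```python
# Minimal sketch: running OLMo 7B in half precision on a single consumer GPU
# (e.g., a 24 GB Nvidia 3090). The repo id "allenai/OLMo-7B" and the need for
# trust_remote_code=True are assumptions; check the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed Hugging Face identifier

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps the 7B weights near ~14 GB
    trust_remote_code=True,
).to("cuda")

prompt = "Open language models are"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Half precision is the main design choice here: at 16 bits per parameter, a 7-billion-parameter model fits comfortably within the 24 GB of memory on a card like the 3090, which is what makes the "runs on consumer GPUs" claim plausible.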