OpenAI’s Sora, which can generate videos and interactive 3D environments on the fly, is a remarkable demonstration of the cutting edge of Generative AI (GenAI).
Interestingly, one of the key innovations that contributed to its development, an AI model architecture informally referred to as the diffusion transformer, has been part of the AI research landscape for several years.
The diffusion transformer, which also underpins Stability AI’s latest image generator, Stable Diffusion 3, seems set to transform the GenAI field by enabling models to scale beyond what was previously possible.
The research project that gave birth to the diffusion transformer was initiated in June 2022 by Saining Xie, a computer science professor at NYU. Along with his mentee William Peebles, who was then an intern at Meta’s AI research lab and is now the co-lead of Sora at OpenAI, Xie merged two machine learning concepts — diffusion and the transformer — to forge the diffusion transformer.
The majority of contemporary AI-powered media generators, including OpenAI’s DALL-E 3, depend on a process known as diffusion to produce images, videos, speech, music, 3D meshes, artwork, and more.
While it may not be the most intuitive concept, the essence is this: noise is gradually added to a piece of media, such as an image, until it becomes unrecognizable. This process is repeated to build a dataset of noisy media. A diffusion model, when trained on this dataset, learns to progressively strip away the noise, moving closer with each step to a target output (for example, a new image).
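To make that concrete, here is a minimal sketch of the forward noising step and the training target, assuming a standard DDPM-style linear noise schedule; the names and constants are illustrative, not drawn from Sora or any particular model:

```python
# A toy illustration of the forward (noising) process, assuming a
# standard DDPM-style linear schedule. All values here are assumptions.
import torch

T = 1000                                    # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)       # per-step noise variance
alpha_bars = torch.cumprod(1.0 - betas, 0)  # cumulative signal kept at step t

def add_noise(x0, t):
    """Return the noisy sample x_t for clean media x0 at integer step t."""
    eps = torch.randn_like(x0)              # the Gaussian noise being added
    x_t = alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * eps
    return x_t, eps

# Training pairs: the model sees (x_t, t) and learns to predict eps,
# i.e. the noise it would have to remove to get back toward x0.
```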
Diffusion models usually have a core structure, or backbone, called a U-Net. The U-Net backbone learns to estimate the noise that needs to be removed, and it does this well. But U-Nets are complex, with specially engineered modules that can dramatically slow down the diffusion pipeline.
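At sampling time, that backbone’s noise estimate is applied over and over. Below is a rough sketch of one reverse step under the same schedule as above, with `model` standing in for whatever backbone predicts the noise (a U-Net here, a transformer later); the update rule follows the standard DDPM form:

```python
@torch.no_grad()
def denoise_step(model, x_t, t):
    """One reverse step: subtract the backbone's noise estimate from x_t."""
    eps_hat = model(x_t, t)                  # backbone's noise estimate
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_hat) / alpha_t.sqrt()
    if t == 0:
        return mean                          # final step: fully denoised sample
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```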
Luckily, transformers can take the place of U-Nets, resulting in enhanced efficiency and performance.
Transformers are the architecture of choice for complex reasoning tasks, powering models like GPT-4 and Gemini as well as products like ChatGPT. They have several unique characteristics, but by far transformers’ defining feature is their “attention mechanism.” For every piece of input data (in the case of diffusion, image noise), transformers weigh the relevance of every other input (other noise in an image) and draw from them to generate the output (an estimate of the image noise).
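In code, the mechanism is surprisingly compact. Here is a bare-bones sketch of single-head scaled dot-product attention; the shapes and projection matrices are illustrative:

```python
import torch

def attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) inputs; Wq/Wk/Wv: (d_model, d_head) projections."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / k.shape[-1] ** 0.5   # each input scores every other input
    weights = scores.softmax(dim=-1)        # relevance weights, rows sum to 1
    return weights @ v                      # output drawn from all inputs
```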
Not only does the attention mechanism make transformers simpler than other model architectures, it also makes the architecture parallelizable. In other words, larger and larger transformer models can be trained with significant, but not unattainable, increases in compute.
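Putting the two ideas together yields the diffusion transformer: chop the noisy image into patch tokens, let attention operate over them, and read the noise estimate back out. The sketch below loosely follows the published DiT recipe but simplifies heavily (real DiTs operate on latents and condition on the timestep via adaptive layer norm, not the additive embedding used here); every size and name is an assumption for illustration:

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    def __init__(self, patch=4, dim=256, depth=4, heads=4, channels=3, steps=1000):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch * patch * channels, dim)    # patch -> token
        self.t_embed = nn.Embedding(steps, dim)                  # timestep embedding
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.unembed = nn.Linear(dim, patch * patch * channels)  # token -> noise patch

    def forward(self, x_t, t):
        B, C, H, W = x_t.shape
        p = self.patch
        # Patchify: (B, C, H, W) -> (B, num_patches, C*p*p) token sequence.
        tok = x_t.unfold(2, p, p).unfold(3, p, p)                # (B, C, H/p, W/p, p, p)
        tok = tok.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        h = self.embed(tok) + self.t_embed(t)[:, None, :]        # add timestep info
        h = self.blocks(h)                                       # attention over patches
        out = self.unembed(h)                                    # per-patch noise estimate
        # Un-patchify back to image shape.
        out = out.reshape(B, H // p, W // p, C, p, p)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)

# Usage sketch: noise_estimate = TinyDiT()(x_t, t) for a batch x_t of
# shape (B, 3, 32, 32) and integer timesteps t of shape (B,).
```

Because the backbone is a plain stack of transformer blocks, scaling it up is a matter of increasing depth, width, and data, exactly the recipe that has worked for language models.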
“What transformers contribute to the diffusion process is akin to an engine upgrade,” Xie told TechCrunch in an email interview. “The introduction of transformers … marks a significant leap in scalability and effectiveness. This is particularly evident in models like Sora, which benefit from training on vast volumes of video data and leverage extensive model parameters to showcase the transformative potential of transformers when applied at scale.”
Considering that the concept of diffusion transformers has been in existence for some time, one might wonder why it took years for projects like Sora and Stable Diffusion to start utilizing them. According to Xie, the significance of having a scalable backbone model only became apparent quite recently.
Xie praised the Sora team for their efforts in demonstrating just how far this approach can go at scale. In his view, they have essentially established that U-Nets are out and transformers are in for diffusion models from here on.
Xie suggests that diffusion transformers could be a straightforward replacement for existing diffusion models, regardless of whether the models generate images, videos, audio, or any other form of media. He acknowledges that the current training process for diffusion transformers might introduce some inefficiencies and performance loss, but he is optimistic that these issues can be resolved over time.
Xie’s main message is clear: ditch U-Nets and switch to transformers, because they are faster, more effective, and more scalable. He is also interested in merging the realms of content understanding and creation within the framework of diffusion transformers. Today these are like two separate worlds: one for understanding and another for creating. Xie envisions a future where they are integrated, and he believes that achieving this integration requires standardizing the underlying architectures, with transformers being an ideal candidate for the job.
If Sora and Stable Diffusion 3 serve as a glimpse of what to anticipate with diffusion transformers, it seems we’re in for an exciting journey.