Introducing Sora

Background

Many investors and governments are taking significant interest in Sora. This attention stems not only from its capabilities as a video generator but also from its ability to simulate worlds. This dual functionality also highlights the storytelling prowess of the developers and of OpenAI, who consistently present a broad, engaging vision.

Sora offers several notable advantages over previous technologies. First, while earlier solutions were limited to generating a narrow range of visual data, Sora can produce more diverse, expansive, open-world data. Second, previous systems were restricted to short videos of fixed duration and resolution, such as four-second clips at 256×256 pixels. In contrast, Sora can generate videos and images across a range of durations, aspect ratios, and resolutions, including up to a full minute of high-definition video.

Method

From my perspective, two techniques are essential to Sora's functionality. Anyone looking to reimplement Sora should give serious consideration to both. Furthermore, in my view, developing new technologies that surpass Sora would require designing innovative alternatives to these two foundational techniques.

The first technique employed in Sora is Two-Stage Generation, which, despite my personal aversion to it, plays a critical role in modern image and video generators. This method places two generators in the generation pipeline. The first is a Variational Autoencoder (VAE), such as VQGAN or VQVAE, which comprises an encoder and a decoder; during generation, only the decoder is used, so randomly sampled latent data can be decoded into an image or video. The second generator, typically a diffusion model or a language model [1], produces that latent data. Notably, even without this second model, the VAE alone can already produce passable results by decoding randomly sampled latents.
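To make this pipeline concrete, here is a minimal PyTorch sketch of the two-stage idea. The module sizes, latent shapes, and the toy latent "prior" are illustrative assumptions rather than Sora's actual components: the second stage maps noise to structured latents, and the first stage's decoder renders those latents into pixels.

```python
# Minimal sketch of the two-stage pipeline described above.
# All shapes and module sizes are illustrative assumptions, not Sora's configuration.
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Stand-in for a VAE/VQGAN decoder: maps latents back to pixel space."""
    def __init__(self, latent_dim=4, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(64, out_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z)

class ToyLatentPrior(nn.Module):
    """Stand-in for the second-stage generator (a diffusion or sequential model)
    that turns noise into structured latents."""
    def __init__(self, latent_dim=4):
        super().__init__()
        self.refine = nn.Conv2d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noise):
        return self.refine(noise)

# Generation: sample noise -> latent prior produces latents -> decoder renders pixels.
decoder, prior = ToyDecoder(), ToyLatentPrior()
noise = torch.randn(1, 4, 16, 16)   # latent-space noise (shapes are made up)
latents = prior(noise)              # stage 2: generate latent data
frame = decoder(latents)            # stage 1 decoder: latents -> 64x64 RGB image
print(frame.shape)                  # torch.Size([1, 3, 64, 64])
```

In a real system, the toy prior would be replaced by an iterative diffusion sampler (or an autoregressive model over latent tokens), and the toy decoder by a trained VQGAN/VAE decoder.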

However, this heavy reliance on the VAE is a significant limitation of contemporary generative models, and alternatives remain relatively underexplored in the literature. Training the VAE itself, typically with a VQGAN-style adversarial objective, poses additional challenges, given how difficult GANs and VAEs are to train. The difficulty is compounded in video generation, where a 3D VAE is required.

I have proposed a new method that eliminates the need to train a VAE, potentially simplifying the generation process and reducing the barriers to reproducing technologies like Sora.

The second technique utilized in Sora is the Diffusion Transformer, or DiT. Traditionally, diffusion models have relied on a UNet architecture, but recent research has shown clear benefits in replacing the UNet with a transformer. The change is advantageous because it taps into the so-called "scaling law": the observation that increasing both the network size and the volume of training data can significantly enhance the model's performance.

left: base model; right: 32× base model.

It is important to note that transformers have long been used instead of UNets in language models [2] for both image and video generation. This established practice makes the transition from UNet to transformer in diffusion models a logical and intuitive step.
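As a rough illustration of what a DiT-style block looks like, the sketch below denoises a sequence of latent patch tokens with a transformer block whose LayerNorm scale and shift are modulated by a conditioning vector (for example, the diffusion timestep plus a text embedding). The dimensions are arbitrary and the structure follows the published DiT paper, not Sora's undisclosed architecture.

```python
# A minimal DiT-style block: a transformer layer that denoises latent patch tokens,
# with adaptive LayerNorm driven by a conditioning vector. Sizes are illustrative.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning (timestep + text embedding) produces shift/scale for both sublayers.
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, x, cond):
        # x: (batch, num_patches, dim) latent patch tokens; cond: (batch, dim)
        shift1, scale1, shift2, scale2 = self.ada(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        x = x + self.mlp(h)
        return x

tokens = torch.randn(2, 128, 256)      # 128 latent patches per sample (illustrative)
cond = torch.randn(2, 256)             # timestep/text conditioning vector
print(DiTBlock()(tokens, cond).shape)  # torch.Size([2, 128, 256])
```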

Attractive Features

Sora possesses several promising features that set it apart from previous technologies.

Notably, Sora can generate videos with variable durations, resolutions, and aspect ratios using a single model. This flexibility is a significant advance over earlier methods, which often required multiple models to achieve similar coverage. Additionally, Sora improves video composition by keeping objects fully within the frame, addressing a common issue in prior technologies, where videos sometimes featured only partial objects.

Results of different aspect ratios

left: prior art; right: Sora.
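OpenAI's technical report attributes this flexibility to representing videos as sequences of spacetime patches, so that a clip of any duration or aspect ratio simply becomes a variable-length token sequence for the transformer. The sketch below illustrates the idea with made-up patch and latent sizes; the real patchification scheme has not been published in detail.

```python
# Sketch of how one model can handle variable durations and resolutions: any video is
# cut into a flat sequence of spacetime patches, so the transformer only ever sees a
# variable-length token sequence. Patch and latent sizes here are made up.
import torch

def patchify(video, pt=2, ph=8, pw=8):
    """video: (frames, channels, height, width) latent video.
    Returns (num_patches, pt*ph*pw*channels) tokens for any T/H/W divisible by the patch size."""
    t, c, h, w = video.shape
    v = video.reshape(t // pt, pt, c, h // ph, ph, w // pw, pw)
    v = v.permute(0, 3, 5, 1, 4, 6, 2)   # group patches first, pixels and channels last
    return v.reshape(-1, pt * ph * pw * c)

# Two clips with different durations and aspect ratios become sequences of different
# length but identical token dimensionality:
wide = torch.randn(16, 4, 32, 64)   # longer, landscape latent clip
tall = torch.randn(8, 4, 64, 32)    # shorter, portrait latent clip
print(patchify(wide).shape, patchify(tall).shape)
```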

Sora demonstrates a strong ability to follow prompts accurately, a critical aspect of image and video generation, where misalignment between the prompt and the generated content is often a significant challenge. Sora addresses this issue effectively, ensuring better alignment between prompts and outputs. The Diffusion Transformer (DiT) architecture contributes to this by improving the model's capacity to understand and respond to input prompts. In addition, caption quality is significantly improved by ChatGPT-based re-captioning, which expands short texts into more detailed, descriptive inputs, further aiding precise prompt-to-video alignment.

prompt: “an old man wearing blue jeans and a white t-shirt taking a pleasant stroll in Johannesburg, South Africa during a winter storm”
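As a rough illustration of this re-captioning idea, the snippet below expands a short user prompt into a detailed caption before it conditions the video generator. The `llm` callable, the instruction wording, and the stub model are placeholders invented for this sketch; in practice a ChatGPT-style model would fill that role.

```python
# Sketch of prompt re-captioning: a short prompt is expanded into a detailed caption
# by a language model before conditioning the video generator. The interface is an
# assumption for illustration only.
from typing import Callable

def expand_prompt(short_prompt: str, llm: Callable[[str], str]) -> str:
    instruction = (
        "Rewrite the following video prompt as a detailed, descriptive caption, "
        "specifying subjects, setting, lighting, and camera motion:\n"
    )
    return llm(instruction + short_prompt)

# Example with a stub LLM so the sketch runs end to end; a real system would call a
# ChatGPT-style model here.
fake_llm = lambda text: text.upper()
detailed = expand_prompt("an old man strolling in a winter storm", fake_llm)
print(detailed)
```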

The tasks mentioned so far concern text-to-video generation, where video creation is triggered by a textual prompt. Beyond text, Sora can also use images and videos as prompts. For instance, it can animate a still image. Given a video, Sora can extend it backward in time; for example, three different generated videos can all converge seamlessly into the same specified ending video. It can likewise extend a video forward to turn it into an infinite loop. Sora also offers video-to-video editing, and when given two videos, it can generate a connecting clip that links them into a continuous sequence.

left: image; right: video

These three videos converge to the same ending video.

An infinitely looping video (its ending point is also its starting point).

left: source video; right: the same video edited into a lush jungle scene.

left: starting scene; middle: connecting clip; right: ending scene.
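Sora's conditioning mechanism for image and video prompts has not been disclosed, but one common way to obtain this kind of behavior is inpainting-style masked denoising: latents for the frames supplied as the prompt are clamped to their known values at every diffusion step, and only the missing frames are generated. Animation, backward/forward extension, looping, and connecting two clips then differ only in which frames the mask marks as known. The sketch below is purely illustrative, with a stand-in denoiser.

```python
# Illustrative masked-denoising loop (an assumption, not Sora's published method):
# prompt frames are re-imposed after every denoising update, so only missing frames
# are actually generated.
import torch

def masked_denoise_step(latents, known_latents, known_mask, denoiser, t):
    """latents: (frames, ...) noisy latents; known_mask: (frames,) bool; t: timestep."""
    latents = denoiser(latents, t)                   # ordinary denoising update
    latents[known_mask] = known_latents[known_mask]  # clamp the prompt frames
    return latents

# Example: forward-extend a 4-frame clip to 16 frames.
frames, latent_shape = 16, (4, 16, 16)
known = torch.zeros(frames, dtype=torch.bool)
known[:4] = True                                    # the first 4 frames are the prompt video
known_latents = torch.randn(frames, *latent_shape)  # prompt frames encoded to latents (stub)
latents = torch.randn(frames, *latent_shape)        # the rest starts from pure noise
toy_denoiser = lambda x, t: 0.9 * x                 # stand-in for the real diffusion model
for t in reversed(range(10)):
    latents = masked_denoise_step(latents, known_latents, known, toy_denoiser, t)
```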

Sora is also capable of generating images, as it can produce a video consisting of a single frame. This feature allows for the creation of still images using the same technology developed for video generation.

Sora is recognized for its impressive capability to simulate the world, which is arguably its most attractive feature. This is particularly evident in the 3D consistency of the videos it generates. For instance, Sora can create a detailed cityscape that is consistent enough in three dimensions to allow for the reconstruction of the city. Additionally, the videos maintain identity consistency; this means that objects and characters preserve their unique identities and features throughout the video, even if they are temporarily obscured or appear in widely separated segments. This feature is crucial for tasks such as object or person re-identification. Moreover, Sora enables dynamic interactions within the world. For example, if a character in a video eats a hamburger, the hamburger will show a bite mark, reflecting the interaction realistically. Lastly, Sora excels in simulating digital environments, akin to rendering dynamic physical scenes in video games with high fidelity, enhancing the realism and immersion of virtual experiences.

left: 3D consistency; right: long-range identity and feature preservation.

left: dynamic interactions within the world; right: simulating digital worlds.

Limitations

Sora has some notable limitations, particularly in accurately simulating basic physical interactions. For instance, when depicting a scenario where a glass filled with liquid drops onto a table, Sora struggles to generate results that adhere to the correct physical laws. This misalignment with real-world physics can affect the credibility of the generated scenes.

misalignment with real-world physics


[1] To clarify: in this context, a "language model" does not process natural language; it refers to an autoregressive (sequential) model used to generate image or video tokens.

[2] The same caveat as in [1] applies: "language model" here refers to an autoregressive sequence model for image and video generation, not a model of natural language.