No one knows exactly what generative video models are for yet, but that hasn't stopped companies like Runway, OpenAI, and Meta from investing heavily in them. Meta's latest, Movie Gen, turns written prompts into surprisingly realistic video, complete with sound, though not yet with voices. However, the company has chosen not to release it to the public just yet.
Movie Gen is essentially a suite of models, the primary one being a text-to-video model. Meta claims it outperforms competitors like Runway's Gen3, LumaLabs' latest, and Kling1.5, though as always, comparisons like these show mainly that Meta is playing the same game, not that Movie Gen is clearly superior. The technical details are laid out in a paper Meta has published.
The audio is generated to match the video's content, adding, for instance, engine noises that track a car's movements, the rush of a waterfall in the background, or a thunderclap at the right moment. It will even add music if that seems appropriate.
Movie Gen was trained on a combination of licensed and publicly available datasets, the details of which Meta characterizes as "proprietary/commercially sensitive" and declines to elaborate on. Our best guess is that this means a lot of Instagram and Facebook video, plus some partner material and a great deal of other content that is poorly protected from scrapers, in other words, "publicly available."
What Meta is clearly aiming for with Movie Gen is not just to claim the title of best video generator for a month or two, but to build a comprehensive, all-in-one system where a high-quality video can be produced from a simple, natural-language prompt. An example: "imagine me as a baker making a shiny hippo cake in a thunderstorm."
For instance, one sticking point with these video generators has been how difficult they are to edit. Ask for a video of someone walking across the street, then decide you want them walking from left to right instead of right to left, and the whole scene may come out looking different when you repeat the prompt with that extra instruction. Meta is adding a simple, text-based editing method where you can just say "change the background to a busy intersection" or "change her outfit to a red dress" and the model will attempt to make that change, and only that change.
Camera movements are also generally understood, with terms like "tracking shot" and "pan left" taken into account when generating the video. This is still clumsy compared with real camera control, but it's a lot better than nothing.
The model's limitations are a little odd. It produces video 768 pixels wide, a dimension familiar from the famous but outdated 1024×768, but which is also three times 256, making it play well with other HD formats. The Movie Gen system upscales this to 1080p, which is the source of the claim that it generates video at that resolution. That isn't really true, but we'll give it a pass because upscaling is surprisingly effective.
Weirdly, it maxes out at 16 seconds of video, and at 16 frames per second, a frame rate no one in history has ever wanted or asked for. You can, however, also generate 10 seconds of video at 24 frames per second (presumably because both modes come out to roughly the same frame budget: 16 × 16 is 256 frames, 24 × 10 is 240). Lead with that!
As for why it doesn't do voice, there are likely two reasons. First, it's extremely hard. Generating speech is easy now, but matching it to lip movements, and those lips to facial movements, is a far more complicated proposition. I don't blame them for leaving this until later, since it would be a day-one failure case. Imagine someone asking for a clown delivering the Gettysburg Address while riding a tiny bike in circles; the result would be an embarrassment waiting to go viral.
The second reason is probably political: releasing what amounts to a deepfake generator a month before a major election is not great optics. Restricting its capabilities a bit, so that would-be bad actors have to do real work to misuse it, is a sensible precaution. One could certainly combine this generative model with a speech generator and an open-source lip-syncing tool, but it won't simply produce a candidate making wild claims on demand.
"Movie Gen is currently purely an AI research project, and even at this early stage, ensuring safety is our top priority, as it has been with all our generative AI technologies," a Meta representative said in response to questions from TechCrunch.
Unlike, say, the Llama series of large language models, Movie Gen will not be publicly available. You can replicate its techniques somewhat by following the research paper, but the code won't be published, except for the "basic evaluation prompt dataset," that is, the record of the prompts used to generate the test videos.