Meta’s AI researchers have released a new model that’s trained in much the same way as today’s state-of-the-art large language models, but instead of learning from words, it learns from video.
Yann LeCun, who leads Meta’s FAIR (Fundamental AI Research) group, has been arguing over the past year that children learn about the world so quickly because they take in huge amounts of information through their optic nerves and their ears. They learn what things in the world are called and how they work together. Current large language models (LLMs), such as OpenAI’s GPT-4 or Meta’s own Llama models, learn mainly by processing language: they try to learn about the world as it’s described on the internet. That, LeCun argues, is why current LLMs aren’t moving very quickly toward artificial general intelligence, the point at which AI is generally smarter than humans.
LLMs are normally trained on enormous amounts of text in which some of the words are masked, forcing the model to find the best words to fill in the blanks. In doing so the model learns which words are statistically most likely to come next in a sequence, and it gradually picks up a rudimentary sense of how the world works. It learns, for example, that when a car drives off a cliff it doesn’t just hang in the air; it drops very quickly to the rocks below.
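To make the fill-in-the-blank idea concrete, here is a toy sketch, not anything Meta or OpenAI actually ships, that scores candidate words for a masked slot using simple bigram counts. Real LLMs learn these statistics with neural networks over vastly larger corpora; the corpus and function names below are made up for illustration.

```python
# Toy illustration of the masked-word objective described above.
# Real LLMs use neural networks and enormous corpora; this sketch only
# shows the statistical idea with bigram counts.
from collections import Counter

corpus = [
    "the car drives off the cliff",
    "the car drops to the rocks below",
    "the ball drops to the floor",
]

# Count how often each word follows each other word.
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[(prev, nxt)] += 1

def fill_blank(prev_word, next_word, candidates):
    """Score each candidate by how well it fits between its neighbors."""
    return max(
        candidates,
        key=lambda w: bigrams[(prev_word, w)] + bigrams[(w, next_word)],
    )

# "the car ____ off the cliff": "drives" fits the surrounding context best.
print(fill_blank("car", "off", ["drives", "drops", "ball"]))
```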
LeCun believes that if LLMs and other AI models could use the same masking technique, but on video footage, they could learn more like babies do. LeCun’s new baby, and the embodiment of his theory, is a research model called Video Joint Embedding Predictive Architecture (V-JEPA). It learns by processing unlabeled video and figuring out what probably happened in a certain part of the screen during the few seconds it was blacked out.
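Applied to pixels, the recipe looks roughly like the sketch below. This is only an illustration of masked video prediction under stand-in assumptions (toy tensor shapes, a placeholder feature extractor), not Meta’s V-JEPA code, which learns an encoder and predictor that operate in an abstract representation space.

```python
# Illustrative sketch of masked video prediction, not Meta's V-JEPA code.
# The general recipe: black out a spatiotemporal region of a clip and
# train a model so that features predicted from the visible context match
# features of the original, unmasked clip.
import numpy as np

rng = np.random.default_rng(0)

# A toy video clip: (frames, height, width, channels).
clip = rng.random((16, 32, 32, 3)).astype(np.float32)

# Mask a spatiotemporal block (a patch of the screen over several frames).
mask = np.zeros(clip.shape[:3], dtype=bool)
mask[4:12, 8:24, 8:24] = True      # frames 4-11, a central region
visible = clip.copy()
visible[mask] = 0.0                # "black out" the masked region

def encode(x):
    """Stand-in feature extractor: mean color per frame. A real system
    would use a learned video encoder here."""
    return x.reshape(x.shape[0], -1, 3).mean(axis=1)

# Training signal: features computed from the masked clip should match
# features of the original clip (here, a simple squared-error gap).
target_features = encode(clip)
context_features = encode(visible)
loss = float(np.mean((context_features - target_features) ** 2))
print(f"prediction gap to minimize: {loss:.4f}")
```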
“V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning,” said LeCun in a statement.
Note that V-JEPA isn’t a generative model. It doesn’t answer questions by generating video, but rather by describing concepts, like the relationship between two real-world objects. The Meta researchers say that V-JEPA, after pretraining using video masking, “excels at detecting and understanding highly detailed interactions between objects.”
Meta’s next step after V-JEPA is to add audio to the video, which would give the model a whole new dimension of data to learn from—just like a child watching a muted TV then turning the sound up. The child would not only see how objects move, but also hear people talking about them, for example. A model pretrained this way might learn that after a car speeds off a cliff it not only rushes toward the ground but makes a big sound upon landing.
“Our goal is to build advanced machine intelligence that can learn more like humans do,” LeCun said, “forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”
The research could have big implications for both Meta and the broader AI ecosystem.
Meta has talked before about a “world model” in the context of its work on augmented reality glasses. The glasses would use such a model as the brain of an AI assistant that would, among other things, anticipate what digital content to show the user to help them get things done and have more fun. The model would, out of the box, have an audio-visual understanding of the world outside the glasses, but could then learn very quickly about the unique features of a user’s world through the device’s cameras and microphones.
V-JEPA might also lead to a change in the way AI models are trained, full stop. Current pretraining methods for foundation models require massive amounts of time and compute power (which has ecological implications); at the moment, in other words, developing foundation models is reserved for the rich. With more efficient training methods, that could change: smaller developers might be able to train larger and more capable models if training costs came down. That would be in line with Meta’s strategy of releasing much of its research as open source rather than protecting it as valuable IP, as OpenAI and others do.
Meta says it’s releasing the V-JEPA model under a Creative Commons noncommercial license so that researchers can experiment with it and perhaps expand its capabilities.