Meta’s new AI model learns by watching videos

Meta’s AI researchers have released a new model that’s trained in much the same way as today’s state-of-the-art large language models, but instead of learning from words, it learns from video.

Yann LeCun, who leads Meta’s FAIR (Fundamental AI Research) group, has been arguing over the past year that children learn about the world so quickly because they take in huge amounts of information through their optic nerves and their ears. They learn what things in the world are called and how they work together. Current large language models (LLMs), such as OpenAI’s GPT-4 or Meta’s own Llama models, learn mainly by processing language—they try to learn about the world as it’s described on the internet. And that, LeCun argues, is why current LLMs aren’t moving very quickly toward artificial general intelligence (where AI is generally smarter than humans).

LLMs are normally trained on enormous numbers of sentences or phrases in which some of the words are masked, forcing the model to find the best words to fill in the blanks. In doing so, the model learns which words are statistically most likely to come next in a sequence, and it gradually picks up a rudimentary sense of how the world works. It learns, for example, that when a car drives off a cliff it doesn’t just hang in the air—it drops very quickly to the rocks below.
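To make the fill-in-the-blank idea concrete, here is a minimal, purely illustrative sketch in PyTorch. The toy vocabulary, single training sentence, and tiny model are invented for this example and bear no resemblance to how production LLMs are actually built or scaled.

```python
# Toy illustration of masked-word pretraining (not Meta's or OpenAI's code).
# A tiny embedding + linear model is asked to recover a hidden word.
import torch
import torch.nn as nn

vocab = ["<mask>", "the", "car", "drives", "off", "a", "cliff", "and", "falls"]
word_to_id = {w: i for i, w in enumerate(vocab)}

sentence = ["the", "car", "drives", "off", "a", "cliff", "and", "falls"]
masked_pos = 7                                    # hide the word "falls"
inputs = [w if i != masked_pos else "<mask>" for i, w in enumerate(sentence)]

ids = torch.tensor([word_to_id[w] for w in inputs])
target = torch.tensor([word_to_id[sentence[masked_pos]]])

model = nn.Sequential(nn.Embedding(len(vocab), 16),
                      nn.Flatten(0),              # concatenate the 8 word vectors
                      nn.Linear(16 * len(sentence), len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):                           # learn to fill in the blank
    logits = model(ids).unsqueeze(0)
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(vocab[model(ids).argmax().item()])          # should print "falls"
```

After a couple of hundred updates on this one sentence, the toy model should predict “falls” for the hidden slot; full-scale LLMs do the same kind of statistical pattern completion across vastly larger text corpora.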

LeCun believes that if LLMs and other AI models could use the same masking technique, but on video footage, they could learn more like babies do. LeCun’s new baby, and the embodiment of his theory, is a research model called Video Joint Embedding Predictive Architecture (V-JEPA). It learns by processing unlabeled video and figuring out what probably happened in a certain part of the screen during the few seconds it was blacked out.
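Below is a heavily simplified sketch of that joint-embedding idea, assuming made-up clip dimensions and simple linear layers standing in for the real encoders; the point is only to show that the prediction target is a representation of the hidden region rather than its pixels. It is not Meta’s code.

```python
# Illustrative sketch of a joint-embedding predictive objective on video.
# Shapes, modules, and the masking scheme are invented for this example.
import torch
import torch.nn as nn

B, T, C, H, W = 2, 8, 3, 32, 32            # two tiny 8-frame clips
video = torch.randn(B, T, C, H, W)

# Black out a spatiotemporal block: one corner of the screen for a few frames.
mask = torch.zeros(B, T, 1, H, W, dtype=torch.bool)
mask[:, 2:6, :, :16, :16] = True
masked_video = video.masked_fill(mask, 0.0)

feat_dim, flat = 64, C * H * W
context_encoder = nn.Sequential(nn.Flatten(2), nn.Linear(flat, feat_dim))  # sees masked clip
target_encoder  = nn.Sequential(nn.Flatten(2), nn.Linear(flat, feat_dim))  # sees full clip
predictor       = nn.Linear(feat_dim, feat_dim)

with torch.no_grad():                       # targets are representations, not pixels
    target_feats = target_encoder(video)               # (B, T, feat_dim)

pred_feats = predictor(context_encoder(masked_video))  # (B, T, feat_dim)

# Compare prediction and target only where something was hidden,
# and do so in embedding space rather than pixel space.
hidden_frames = mask.flatten(2).any(dim=2)              # (B, T) bool
loss = nn.functional.mse_loss(pred_feats[hidden_frames],
                              target_feats[hidden_frames])
loss.backward()
```

The design choice illustrated here is that the loss penalizes a wrong guess about what is behind the mask, not a failure to repaint it pixel for pixel, which is what separates this family of models from generative video models.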

“V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning,” said LeCun in a statement.

Note that V-JEPA isn’t a generative model. It doesn’t answer questions by generating video, but rather by describing concepts, like the relationship between two real-world objects. The Meta researchers say that V-JEPA, after pretraining using video masking, “excels at detecting and understanding highly detailed interactions between objects.”
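One common way to get answers out of a non-generative model like this is to freeze the pretrained encoder and train only a small task-specific head on top of its features. The sketch below assumes a hypothetical interaction-classification task and reuses the toy encoder shape from the earlier examples; it illustrates the general frozen-feature pattern, not Meta’s evaluation code.

```python
# Frozen-feature probing: the pretrained encoder is not updated; only a
# small classification head learns to map its features to labels.
import torch
import torch.nn as nn

pretrained_encoder = nn.Sequential(nn.Flatten(2), nn.Linear(3 * 32 * 32, 64))
for p in pretrained_encoder.parameters():
    p.requires_grad = False                 # keep the pretrained weights fixed

num_labels = 5                              # hypothetical interaction classes
probe = nn.Linear(64, num_labels)           # the only trainable part

clips = torch.randn(4, 8, 3, 32, 32)                     # (batch, frames, C, H, W)
labels = torch.randint(0, num_labels, (4,))

features = pretrained_encoder(clips).mean(dim=1)         # pool over frames -> (4, 64)
logits = probe(features)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                          # gradients reach only the probe
```

Because only the small head is trained, checking whether the pretrained features capture object interactions in this way is far cheaper than training a model from scratch.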

Meta’s next step after V-JEPA is to add audio to the video, which would give the model a whole new dimension of data to learn from—just like a child watching a muted TV and then turning the sound up. The child would not only see how objects move but also hear people talking about them, for example. A model pretrained this way might learn that after a car speeds off a cliff, it not only rushes toward the ground but also makes a big sound upon landing.

“Our goal is to build advanced machine intelligence that can learn more like humans do,” LeCun said, “forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”

The research could have big implications for both Meta and the broader AI ecosystem.

Meta has talked before about a “world model” in the context of its work on augmented reality glasses. The glasses would use such a model as the brain of an AI assistant that would, among other things, anticipate what digital content to show the user to help them get things done and have more fun. The model would, out of the box, have an audio-visual understanding of the world outside the glasses, but could then learn very quickly about the unique features of a user’s world through the device’s cameras and microphones.

V-JEPA might also lead to a change in the way AI models are trained, full stop. Current pretraining methods for foundation models require massive amounts of time and compute power (which has ecological implications); at the moment, in other words, developing foundation models is reserved for the rich. With more efficient training methods, that could change: smaller developers might be able to train larger and more capable models if training costs went down. That would be in line with Meta’s strategy of releasing much of its research as open source rather than protecting it as valuable IP, as OpenAI and others do.

Meta says it’s releasing the V-JEPA model under a Creative Commons noncommercial license so that researchers can experiment with it and perhaps expand its capabilities.

https://www.fastcompany.com/91029951/meta-v-jepa-yann-lecun?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

February 15, 2024

