Meta’s new AI model learns by watching videos

Meta’s AI researchers have released a new model that’s trained in much the same way as today’s state-of-the-art large language models, but instead of learning from words, it learns from video.

Yann LeCun, who leads Meta’s FAIR (Fundamental AI Research) group, has argued over the past year that children learn about the world so quickly because they take in huge amounts of information through their optic nerves and their ears. They learn what things in the world are called and how they work together. Current large language models (LLMs), such as OpenAI’s GPT-4 or Meta’s own Llama models, learn mainly by processing language: they try to learn about the world as it’s described on the internet. And that, LeCun argues, is why current LLMs aren’t moving very quickly toward artificial general intelligence (AI that is broadly smarter than humans).

LLMs are normally trained on vast numbers of sentences or phrases in which some of the words are masked, forcing the model to find the best words to fill in the blanks. In doing so, the model learns which words are statistically most likely to come next in a sequence, and it gradually picks up a rudimentary sense of how the world works. It learns, for example, that when a car drives off a cliff it doesn’t just hang in the air; it drops very quickly to the rocks below.
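To make that fill-in-the-blank objective concrete, here is a minimal PyTorch sketch. The tiny vocabulary, single sentence, and two-layer stand-in “model” are all invented for illustration; real LLMs put a deep transformer between the embedding and the output head and train on vastly more text:

```python
import torch
import torch.nn as nn

# Toy fill-in-the-blank setup; vocabulary, sentence, and model sizes
# are invented for illustration and far smaller than any real LLM.
vocab = {"[MASK]": 0, "the": 1, "car": 2, "drives": 3, "off": 4, "a": 5, "cliff": 6}
tokens = torch.tensor([[1, 2, 3, 4, 5, 6]])  # "the car drives off a cliff"

# Hide two words; the model is trained to recover them from context.
mask = torch.tensor([[False, False, True, False, False, True]])
inputs = tokens.clone()
inputs[mask] = vocab["[MASK]"]

# Stand-in "model": an embedding plus a linear head over the vocabulary.
embed = nn.Embedding(len(vocab), 32)
head = nn.Linear(32, len(vocab))
logits = head(embed(inputs))  # shape: (batch, sequence, vocab)

# The loss rewards assigning high probability to the hidden words; this
# is how the model absorbs the statistics of word sequences.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```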

LeCun believes that if LLMs and other AI models could use the same masking technique, but on video footage, they could learn more like babies do. LeCun’s new baby, and the embodiment of his theory, is a research model called Video Joint Embedding Predictive Architecture (V-JEPA). It learns by processing unlabeled video and figuring out what probably happened in a certain part of the screen during the few seconds it was blacked out.

“V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning,” said LeCun in a statement.

Note that V-JEPA isn’t a generative model. It doesn’t answer questions by generating video, but rather by describing concepts, like the relationship between two real-world objects. The Meta researchers say that V-JEPA, after pretraining using video masking, “excels at detecting and understanding highly detailed interactions between objects.”
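The loose sketch below illustrates that distinction under stated assumptions: the patch shapes, linear “encoders,” and pooling are invented stand-ins, not Meta’s architecture. The key idea it shows is that the predictor tries to match the embeddings of hidden video patches rather than reconstruct their pixels:

```python
import torch
import torch.nn as nn

# A clip as spatiotemporal patch features: (batch, patches, dim).
# All shapes and modules here are invented stand-ins, not Meta's code.
patches = torch.randn(1, 196, 768)

# Black out a contiguous block of patches, like a hidden screen region.
mask = torch.zeros(196, dtype=torch.bool)
mask[80:120] = True

context_encoder = nn.Linear(768, 768)  # sees only the visible patches
target_encoder = nn.Linear(768, 768)   # embeds the full clip; no gradients
predictor = nn.Linear(768, 768)

pooled = context_encoder(patches[:, ~mask]).mean(dim=1)  # visible-region summary
with torch.no_grad():
    targets = target_encoder(patches)[:, mask]           # hidden-patch embeddings

# Predict the *representations* of the hidden patches rather than their
# pixels; this prediction-in-embedding-space step is what makes the
# approach non-generative.
n_masked = int(mask.sum())
predictions = predictor(pooled).unsqueeze(1).expand(-1, n_masked, -1)
loss = nn.functional.mse_loss(predictions, targets)
loss.backward()
```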

Meta’s next step after V-JEPA is to add audio to the video, which would give the model a whole new dimension of data to learn from, much like a child watching a muted TV and then turning the sound up. The child would not only see how objects move but also hear people talking about them, for example. A model pretrained this way might learn that after a car speeds off a cliff, it not only rushes toward the ground but also makes a big sound upon landing.

“Our goal is to build advanced machine intelligence that can learn more like humans do,” LeCun said, “forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”

The research could have big implications for both Meta and the broader AI ecosystem.

Meta has talked before about a “world model” in the context of its work on augmented reality glasses. The glasses would use such a model as the brain of an AI assistant that would, among other things, anticipate what digital content to show the user to help them get things done and have more fun. The model would, out of the box, have an audio-visual understanding of the world outside the glasses, but could then learn very quickly about the unique features of a user’s world through the device’s cameras and microphones.

V-JEPA might also change the way AI models are trained, full stop. Current pretraining methods for foundation models require massive amounts of time and compute power (which has ecological implications); at the moment, in other words, developing foundation models is reserved for the rich. With more efficient training methods, that could change: smaller developers might be able to train larger and more capable models. That would be in line with Meta’s strategy of releasing much of its research as open source rather than guarding it as valuable IP, as OpenAI and others do.

Meta says it’s releasing the V-JEPA model under a Creative Commons noncommercial license so that researchers can experiment with it and perhaps expand its capabilities.

https://www.fastcompany.com/91029951/meta-v-jepa-yann-lecun?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss
