The AI Diary: JEPA – The Silent Observer
Dec 31, 2025
Over the past few days, I’ve been diving deep into a new paper from Meta’s Chief AI Scientist, Yann LeCun.
If you’ve been following the AI space, you know the narrative: models get bigger, consume more data, and generate text faster. We have become obsessed with “Generative AI” — models that can write sonnets, code Python, and pass the Bar Exam.
But while the world was distracted by chatbots, LeCun has been quietly arguing a different point: Language is not Intelligence.
Today, I want to talk about why he might be right, and why the next era of AI won’t be about who can talk the most, but who can understand the best.
The Problem with “Talking to Think”
Let me ask you a simple question: When you drive a car, do you narrate every single action you take? Do you say to yourself, “I see a red light, therefore I am lifting my foot off the accelerator and applying pressure to the brake”?
Of course not. You just do it. You understand the physics of the road, predict the car’s momentum, and act.
Generative AI (LLMs) doesn’t work like that.

Current models are autoregressive. To understand a video of a man picking up a bottle, an LLM essentially has to “write” the description internally token-by-token: “I see a hand… it is reaching… it touches a cylinder.”
It is like a person who cannot think without speaking out loud. It is slow, computationally expensive, and frankly, inefficient.
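To make that contrast concrete, here is a minimal sketch of what autoregressive decoding looks like in code. The vocabulary, the stub model, and the function names are all toy stand-ins I invented for illustration; a real LLM would run a large transformer forward pass at every step. The point is the loop: every piece of “understanding” has to be spelled out one token at a time.

```python
# Toy sketch of autoregressive decoding (all names and values are illustrative).
# The "model" is a stub that scores the next token given the tokens so far.

TOY_VOCAB = ["<end>", "a", "hand", "reaches", "for", "the", "bottle"]

def toy_next_token_scores(context: list[str]) -> dict[str, float]:
    """Stand-in for an LLM forward pass: returns a score per vocabulary token."""
    # Hard-coded continuation just to make the loop run end to end.
    canned = {0: "a", 1: "hand", 2: "reaches", 3: "for", 4: "the", 5: "bottle"}
    target = canned.get(len(context), "<end>")
    return {tok: (1.0 if tok == target else 0.0) for tok in TOY_VOCAB}

def describe_scene_autoregressively(max_steps: int = 10) -> list[str]:
    tokens: list[str] = []
    for _ in range(max_steps):
        scores = toy_next_token_scores(tokens)   # one full "model call" per token
        next_tok = max(scores, key=scores.get)   # greedy pick of the next word
        if next_tok == "<end>":
            break
        tokens.append(next_tok)
    return tokens

print(describe_scene_autoregressively())  # ['a', 'hand', 'reaches', 'for', 'the', 'bottle']
```

Notice that the cost scales with the length of the description, not with how hard the scene actually is to understand.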
VL-JEPA: The Silent Observer
This is where Meta’s new architecture, VL-JEPA (Vision-Language Joint Embedding Predictive Architecture), changes the game.
Unlike GPT-4 or Gemini, VL-JEPA is a Non-Generative model. It doesn’t predict the next word or the next pixel. It predicts meaning.
Think of it this way:
- Standard Vision Models are like a cheap CCTV motion detector. They look at every frame and shout labels: “Bottle! Hand! Movement! Canister!” They are jumpy and have no memory of what happened two seconds ago.
- VL-JEPA is like a human observer. It watches the video stream and builds a silent internal model of what is happening. It stabilizes its understanding over time (“Oh, he’s picking up the canister”) without needing to generate a text label to “know” what it is seeing. A rough code sketch of this idea follows below.
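Here is a minimal sketch of the joint-embedding predictive idea. This is not Meta’s actual code: the module shapes, dimensions, and names are made up for illustration, and the real VL-JEPA architecture differs in detail. What matters is where the loss lives: in embedding space, with no text or pixel decoder anywhere.

```python
import torch
import torch.nn as nn

# Minimal sketch of a joint-embedding predictive objective (illustrative only).
EMBED_DIM = 256

context_encoder = nn.Sequential(nn.Linear(1024, EMBED_DIM), nn.GELU(), nn.Linear(EMBED_DIM, EMBED_DIM))
target_encoder  = nn.Sequential(nn.Linear(1024, EMBED_DIM), nn.GELU(), nn.Linear(EMBED_DIM, EMBED_DIM))
predictor       = nn.Sequential(nn.Linear(EMBED_DIM, EMBED_DIM), nn.GELU(), nn.Linear(EMBED_DIM, EMBED_DIM))

def jepa_style_loss(visible_frames: torch.Tensor, future_frames: torch.Tensor) -> torch.Tensor:
    """Predict the *representation* of the future, never the future itself."""
    context = context_encoder(visible_frames)         # what the model has seen so far
    with torch.no_grad():                             # targets are not backpropagated through
        target = target_encoder(future_frames)        # what actually happens next, as an embedding
    predicted = predictor(context)                    # silent guess about the future, in embedding space
    return nn.functional.mse_loss(predicted, target)  # distance between meanings, not pixels or tokens

# Fake "frame features" just to show the shapes line up.
loss = jepa_style_loss(torch.randn(8, 1024), torch.randn(8, 1024))
loss.backward()
```

Nothing in this loop ever produces a word or a frame; both training and inference stay in the latent space.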
Why Efficiency Wins
The most shocking part of this new research isn’t just the architecture — it’s the efficiency.
We are used to models with hundreds of billions of parameters. VL-JEPA achieves state-of-the-art results with just 1.6 billion parameters.
Because it doesn’t need a heavy decoder to generate text during training, it uses roughly half the trainable parameters of traditional vision-language models.
The Robotics Gap
Why does this matter for you and your business?
Because this gap is exactly why we still don’t have domestic robots that can fold laundry.
An LLM is great at writing emails, but it is terrible at understanding the physical world. A robot cannot afford to hallucinate or wait for a text generator to describe a falling cup before catching it. It needs a World Model — an intuitive physics engine that predicts consequences immediately.
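To make “predicts consequences immediately” concrete, here is a toy sketch of how a robot controller might use a learned world model: simulate each candidate action in latent space and pick the one whose predicted outcome looks safest. Every module and name here is invented for illustration; it is one simple way to use a world model, not the method from the paper.

```python
import torch
import torch.nn as nn

# Toy world-model rollout for action selection (all modules and sizes are illustrative).
# The idea: before acting, the robot asks its internal model "what happens if I do X?"
# entirely in latent space, with no text generation in the loop.

STATE_DIM, ACTION_DIM = 64, 8

world_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.GELU(), nn.Linear(128, STATE_DIM))
risk_head   = nn.Linear(STATE_DIM, 1)  # scores how bad a predicted state looks (e.g. "cup on floor")

def choose_action(state: torch.Tensor, candidate_actions: torch.Tensor) -> torch.Tensor:
    """Pick the candidate action whose *predicted* consequence has the lowest risk."""
    best_action, best_risk = None, float("inf")
    for action in candidate_actions:
        next_state = world_model(torch.cat([state, action]))  # imagined outcome, never rendered
        risk = risk_head(next_state).item()
        if risk < best_risk:
            best_action, best_risk = action, risk
    return best_action

# Fake current state and three candidate motor commands, just to exercise the loop.
action = choose_action(torch.randn(STATE_DIM), torch.randn(3, ACTION_DIM))
```

The crucial point is that the evaluation happens on predicted representations, fast enough to sit inside a control loop, rather than on generated descriptions.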
Yann LeCun’s thesis is finally being proven: Intelligence is about prediction, not generation.
Prediction
As we move into 2026, I believe we will see a split in the AI market.
- Generative AI will remain the king of creative tasks and coding.
- Predictive AI (World Models) like VL-JEPA will unlock the physical world — robotics, autonomous driving, and real-time video understanding.
The smartest AI of the future won’t be the one that talks the most. It will be the one that stays silent, watches, and understands.