cd /news/machine-learning/llm-style-scaling-laws-hold-for-sens… · home topics machine-learning article
[ARTICLE · art-45889] src=empirical.health ↗ pub= topic=machine-learning verified=true sentiment=· neutral

LLM-style scaling laws hold for sensor data

Google researchers have found that scaling laws similar to those for large language models apply to wearable sensor data, with validation loss scaling predictably with compute, model size, and data size. The scaling law for physiological data shows gains flattening around 10 million hours and 100 million parameters, unlike LLMs which continue improving, suggesting opportunities for smaller startups in non-LLM foundation models.

read5 min views1 publishedJul 1, 2026
LLM-style scaling laws hold for sensor data
Image: source

LLM-style scaling laws hold for sensor data #

Much of the magic of LLMs comes from the fact loss scales predictably with model size, dataset size, the amount of compute used for training. It’s easy to take scaling laws for granted, but they only published in 2020 and their structure underlies both the economics of AI (if not for scaling laws, frontier labs couldn’t invest nine figures in a training run) and AI’s emergent capabilities (the Phillip Anderson quote, “more is different”, comes to mind).

Do similar scaling laws apply to non-language foundation models, such as wearable foundation models? It turns out they do. Is the form exactly the same? It is not, which leads to some interesting questions.

First, let me describe an example of a non-LLM scaling law.

A non-LLM scaling law Google’s Scaling Wearable Foundation Models was, to my knowledge, the first paper to establish a scaling law for physioloigical sensor data from wearables: Scaling performance of a wearable foundation model as a function of data size & model size. Source: Scaling wearable foundation models.

Validation loss scaled as:

where is compute, is the power-law exponent, and is an irreducible floor (more on that later). Across multiple orders of magnitude, loss falls along a nearly straight line on the log-log plot before bending toward the floor (the same shape holds when you vary data hours or parameters instead of compute). LSM tested four model sizes (2M, 7M, 110M, and 328M parameters) against data from a few thousand hours up to 40 million. Bigger models and more data both helped on every generative task they measured: random imputation, temporal interpolation, sensor imputation, and forecasting. The payoff on downstream, post-trained tasks was good too. Fine-tuned LSM improved interpolation and forecasting by 16-23% over baselines and lifted activity recognition by 29%.

Non-LLM scaling laws are similar, but not identical to LLM scaling laws LLM scaling laws were first established in Kaplan et al. (2020), and then refined in the 2022 Chincilla paper. In the Chinchilla scaling laws, for a fixed compute budget, you should scale parameters and tokens together, about 20 tokens per parameter. Chinchilla was a 70B model trained on 1.4 trillion tokens, and it beat models several times its size that had been starved of data.

The Chinchilla LLM scaling laws are expressed as:

Here, is the validation loss; represents the irreducible loss floor; and , , , and are fitted constants (exponents and multipliers). One widely cited finding is that, under a fixed compute budget, optimal results are achieved by scaling data and model size together: specifically, the compute-optimal regime is where the number of training tokens is proportional to the number of parameters (in practice, about 20 tokens per parameter).

One major difference is that LSM’s gains flattened out around 10 million hours of data and roughly 100 million parameters. LLMs have shown no such ceiling at consumer scale. Chinchilla used 1.4 trillion tokens and frontier models have gone well past that, with no flattening yet. (Both scaling laws have an irreducible error term, so this isn’t a difference in functional form but rather an empirical result.)

That’s a potentially interesting opportunity for startups. We trained a JEPA-style wearable foundation model, JETS, on the same order of magnitude of data as Google and Apple with a four-person team. So whereas starting another LLM foundation model company requires billions of dollars of investment, non-LLM domains might actually be open for smaller startups.

[Some open questions I have](#some-open-questions-i-have)

While the power laws rhyme, many of the underlying details are pretty different:
Dimension LLM scaling Wearable sensor scaling
Unit of data Tokens (discrete vocabulary) Hours of continuous, multi-channel signal
Pretraining objective Next-token prediction Masked reconstruction (80% of patches hidden, MSE loss)
Loss Cross-entropy / perplexity Mean squared error on held-out patches
Saturation None yet at trillions of tokens Flattens near 10⁷ hours and ~10⁸ parameters
Compute-optimal recipe ~20 tokens per parameter (Chinchilla) Scale data and model together; total hours dominate
Data supply Finite; the public text pool is being exhausted Renewable; billions of devices generate hours continuously
Economics Oligopoly with $1B+ entry cost Capital light?

This leads to several interesting questions:

Data wall. LLMs are running into a data wall, where the stock of high-quality public text is close to spent and synthetic data is an uneasy substitute. As Ilya Sutskevar put it in his talk on the end of pretraining, “we have but one internet.” Physiological data has the opposite problem. Every watch on every wrist generates roughly 8,760 hours of new signal a year, passively, forever. The binding constraints for sensor models are labeled outcomes, compute, and the messiness of real-world data. If we can find architectures that reduce c to 0, there’s actually a very high ceiling on these laws.Market structure. T Rowe Price put it, “the economic rationale for AI capital expenditure ultimately rests on scaling laws.” Marginal compute must lead to marginal performance, which is why frontier labs are essentially an oligopoly with an entry price of $1B+. If the scaling laws are different, does this mean the compeitive dynamics are different?Same or different latent space? We’ve seen interesting implications from aligning vision models into the LLM’s latent space (e.g., CLIP). Will there ultimately be one latent space for all models? Or will we see physiological and other models have their own latent spaces that make meaningful distinctions that are too subtle to fit in language?

Get your free 30-day heart health guide #

Evidence-based steps to optimize your heart health.

── more in #machine-learning 4 stories · sorted by recency
── more on @google 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/llm-style-scaling-la…] indexed:0 read:5min 2026-07-01 ·