Source: dev.to

WWDC 2026 - Apple's new server LLM on Private Cloud Compute: what's in it for developers

Apple announced a new server-side large language model running on Private Cloud Compute (PCC) at WWDC 2026, accessible via the same Swift API as the on-device model. The PCC model offers 32K context, reasoning capabilities, and Apple's privacy guarantees, with a daily per-user cap. Developers can switch between on-device and server models with a single line of code.

Published Jun 13, 2026 · 4 min read

Last year Apple gave us an on-device LLM through the Foundation Models framework. This year that on-device model gets better, and Apple adds something many of us asked for: a larger server model you can call directly from your app, running on Private Cloud Compute (PCC).

The on-device model is great for fast, private, offline tasks, and this year it improved: it now supports image input, follows instructions more reliably, and is better at calling your custom tools.

But some features just need more headroom. That's where PCC comes in. You get a frontier-class model while keeping Apple's privacy posture intact.

Using most server LLMs means provisioning an account, managing API keys, paying token costs, and shipping a privacy policy that accounts for it. PCC removes most of that.

The trade-off you're accepting: a network connection is required, and there's a per-user daily cap you need to design around (more on that below).

If you've used Foundation Models before, prompting the on-device model is three lines:

import FoundationModels

let session = LanguageModelSession()
let response = try await session.respond(to: "Summarize this article: \(article)")

Switching to the PCC server model is a one-line change. You just hand the session a different model:

import FoundationModels

let session = LanguageModelSession(
    model: PrivateCloudComputeLanguageModel()
)
let response = try await session.respond(to: "Summarize this article: \(article)")

That's the headline ergonomic win. Same unified Swift API, larger model behind it.

@Generable structured output and Tool calling behave the same whether you're on-device or on PCC. You don't rewrite anything to move between them:

import FoundationModels

@Generable
struct ArticleSummary {
    let oneLineSummary: String
    let keyPoints: [String]
}

struct FindRelatedArticlesTool: Tool {
    // ...
}

let session = LanguageModelSession(
    model: PrivateCloudComputeLanguageModel(),
    tools: [FindRelatedArticlesTool()] // the session takes tool instances, not metatypes
)

let response = try await session.respond(
    to: "Summarize this article: \(article)",
    generating: ArticleSummary.self
)

PCC, like the on-device model, only runs on Apple Intelligence devices. Check availability and provide a graceful fallback:

import FoundationModels
import SwiftUI

struct ArticleSummarizationView: View {
    private var model = PrivateCloudComputeLanguageModel()

    var body: some View {
        if model.isAvailable {
            // Show UI for making request
        } else {
            // Fall back
        }
    }
}

Both are private. The rest is a set of trade-offs:

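| Factor         | On-device | PCC server            |
|----------------|-----------|-----------------------|
| Privacy        | Yes       | Yes                   |
| Works offline  | Yes       | No (needs connection) |
| Request limits | None      | Daily per-user limit  |
| Context size   | 4K        | 32K                   |
| Reasoning      | No        | Yes                   |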

The session's advice is worth repeating: pick the model based on data, not vibes. The updated on-device model may handle more than you'd expect, and it has no request limits. The only way to know is to evaluate your specific feature (Apple's new Evaluations framework, covered in "Meet the Evaluations framework," is built for exactly this).

PCC supports reasoning, where the model generates extra "thinking" text in a separate transcript segment before producing the final answer. There are three levels:

- .light gathers a bit of extra context.
- .moderate reasons a little deeper.
- .deep can produce a reasoning segment longer than the answer itself.

You set it per request:

let response = try await session.respond(
    to: prompt,
    contextOptions: ContextOptions(reasoningLevel: .light)
)
// Reasoning levels: .light, .moderate, .deep

Keep in mind that .deep can take a while.

You can now query context size directly instead of hardcoding it:

SystemLanguageModel().contextSize
// 4096 on 26.0
// 8192 on 27.0 (newer devices)

PrivateCloudComputeLanguageModel().contextSize
// 32768
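One way to use the queried value is to check whether a prompt is likely to fit before picking a model. The following is a minimal sketch, not part of the Foundation Models API: estimatedTokenCount and fits are hypothetical helpers, the four-characters-per-token ratio is a rough assumption, and the window sizes are hardcoded here only so the sketch runs on its own. In an app you would presumably pass in the contextSize values shown above.

```swift
// Hypothetical helper: a rough token estimate.
// Assumption: about 4 characters per token. Real tokenizers differ.
func estimatedTokenCount(for text: String) -> Int {
    // Round up so any non-empty string counts as at least one token.
    (text.count + 3) / 4
}

// Hypothetical helper: true when the prompt plus a reserved output
// budget fits inside the given context window.
func fits(prompt: String, contextSize: Int, reservedForOutput: Int = 512) -> Bool {
    estimatedTokenCount(for: prompt) + reservedForOutput <= contextSize
}

// 4,000 repetitions of a 5-character string: 20,000 characters.
let article = String(repeating: "word ", count: 4_000)

// Window sizes as reported in the article, hardcoded for the sketch.
let onDeviceWindow = 4_096
let pccWindow = 32_768

let fitsOnDevice = fits(prompt: article, contextSize: onDeviceWindow) // false
let fitsOnPCC = fits(prompt: article, contextSize: pccWindow)         // true
print(fitsOnDevice, fitsOnPCC)
```

Reserving part of the window for output is a design choice in this sketch: the prompt and the response share the same context, so a prompt that barely fits leaves no room for an answer.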

Because requests are metered against the user's iCloud account, your app will eventually hit a user who's at their daily cap. If the only thing that happens is a thrown error surfaced in the UI, that's a poor, non-actionable experience.

Instead, inspect quotaUsage and render persistent, actionable UI:

import FoundationModels
import SwiftUI

struct ArticleSummarizationView: View {
    private var model = PrivateCloudComputeLanguageModel()

    var body: some View {
        if case .belowLimit(let info) = model.quotaUsage.status {
            if info.isApproachingLimit {
                Text("Nearing usage limit.")
                    .foregroundStyle(Color.orange)
            }
        }
        if model.quotaUsage.isLimitReached {
            Text("Usage limit exceeded.")
                .foregroundStyle(Color.red)
        }
        if let suggestion = model.quotaUsage.limitIncreaseSuggestion {
            Button("Show options") {
                suggestion.show()
            }
        }
    }
}

Design guidance from the session:

- limitIncreaseSuggestion lets the user manage or raise their limit (such as by upgrading their iCloud account).

You don't need to burn real quota to test this. In your scheme, go to Debug > Options and use Simulate Apple Foundation Models Availability. You can select Quota Usage Limit Reached and Nearing Usage Limit to exercise both code paths.

You're not forced to pick one. A common pattern is to route simple work to the on-device model and escalate harder tasks to PCC. The session points to "Build agentic app experiences with Foundation Models" for that workflow.
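The routing decision itself can be kept as plain Swift, separate from any model call, which makes it easy to unit test. This is only a sketch under stated assumptions: ModelTier, RequestProfile, RoutingContext, and chooseTier are hypothetical names invented for illustration, and nothing here calls Foundation Models. It is not the workflow from the referenced session.

```swift
// Hypothetical types for illustration; none of these are Apple API.
enum ModelTier {
    case onDevice
    case privateCloudCompute
}

struct RequestProfile {
    var estimatedPromptTokens: Int
    var needsReasoning: Bool
}

struct RoutingContext {
    var onDeviceContextSize: Int
    var serverAvailable: Bool
    var serverQuotaReached: Bool
}

// Returns the tier to use, or nil when neither model can serve the request.
func chooseTier(for request: RequestProfile, in context: RoutingContext) -> ModelTier? {
    let fitsOnDevice = request.estimatedPromptTokens <= context.onDeviceContextSize
    let serverUsable = context.serverAvailable && !context.serverQuotaReached

    // Prefer on-device when it is sufficient: no request limits, works offline.
    if fitsOnDevice && !request.needsReasoning {
        return .onDevice
    }
    // Escalate when the task needs more context or reasoning.
    if serverUsable {
        return .privateCloudCompute
    }
    // Server unusable: degrade to on-device (without reasoning) if the prompt fits.
    return fitsOnDevice ? .onDevice : nil
}

let normal = RoutingContext(onDeviceContextSize: 4_096, serverAvailable: true, serverQuotaReached: false)
let capped = RoutingContext(onDeviceContextSize: 4_096, serverAvailable: true, serverQuotaReached: true)

let shortTask = RequestProfile(estimatedPromptTokens: 800, needsReasoning: false)
let longTask = RequestProfile(estimatedPromptTokens: 12_000, needsReasoning: false)
let reasoningTask = RequestProfile(estimatedPromptTokens: 800, needsReasoning: true)

let a = chooseTier(for: shortTask, in: normal)     // .onDevice
let b = chooseTier(for: longTask, in: normal)      // .privateCloudCompute
let c = chooseTier(for: reasoningTask, in: capped) // .onDevice (degraded)
let d = chooseTier(for: longTask, in: capped)      // nil
```

In an app, the RoutingContext fields would presumably be filled from the availability and quota checks shown earlier (isAvailable, quotaUsage.isLimitReached), and a nil result is the cue to show the fallback UI rather than throw.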

The server model is available for apps with fewer than 2M downloads, and you apply on the Apple Developer website. If your feature genuinely needs the larger context or reasoning, it's worth applying early.

--

Summary

If you already use Foundation Models, reaching for a bigger model is now a one-line decision, with privacy handled and no token bill to manage. Evaluate, choose the right tier for each task, and design for the daily limit up front.
