# WWDC 2026 - Apple's new server LLM on Private Cloud Compute: what's in it for developers

> Source: <https://dev.to/arshtechpro/wwdc-2026-apples-new-server-llm-on-private-cloud-compute-whats-in-it-for-developers-2edd>
> Published: 2026-06-13 20:48:40+00:00

Last year Apple gave us an on-device LLM through the Foundation Models framework. This year that on-device model gets better, and Apple adds something many of us asked for: a **larger server model** you can call directly from your app, running on **Private Cloud Compute (PCC)**.

The on-device model is great for fast, private, offline tasks, and this year it improved: it now supports **image input**, follows instructions more reliably, and is better at calling your custom tools.

But some features just need more headroom, such as a larger context window or reasoning before the answer.

That's where PCC comes in. You get a frontier-class model while keeping Apple's privacy posture intact.

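Most server LLMs mean provisioning an account, managing API keys, eating token costs, and shipping a privacy policy that accounts for it. PCC removes most of that: requests are metered against the user's iCloud account, so there is no token bill for you to manage.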

The trade-off you're accepting: a network connection is required, and there's a per-user daily cap you need to design around (more on that below).

If you've used Foundation Models before, prompting the on-device model is three lines:

``` swift
import FoundationModels

let session = LanguageModelSession()
let response = try await session.respond(to: "Summarize this article: \(article)")
```

Switching to the PCC server model is a single-line change: you hand the session a different model.

``` swift
import FoundationModels

let session = LanguageModelSession(
    model: PrivateCloudComputeLanguageModel()
)
let response = try await session.respond(to: "Summarize this article: \(article)")
```

That's the headline ergonomic win. Same unified Swift API, larger model behind it.

`@Generable` structured output and `Tool` calling behave the same whether you're on-device or on PCC. You don't rewrite anything to move between them:

``` swift
import FoundationModels

@Generable
struct ArticleSummary {
    let oneLineSummary: String
    let keyPoints: [String]
}

struct FindRelatedArticlesTool: Tool {
    // ...
}

let session = LanguageModelSession(
    model: PrivateCloudComputeLanguageModel(),
    tools: [FindRelatedArticlesTool()] // the session takes tool instances
)

let response = try await session.respond(
    to: "Summarize this article: \(article)",
    generating: ArticleSummary.self
)
```

PCC, like the on-device model, only runs on Apple Intelligence devices. Check availability and provide a graceful fallback:

``` swift
import FoundationModels
import SwiftUI

struct ArticleSummarizationView: View {
    private var model = PrivateCloudComputeLanguageModel()

    var body: some View {
        if model.isAvailable {
            // Show UI for making request
        } else {
            // Fall back
        }
    }
}
```

Both the on-device model and the PCC server model are private. The rest is a set of trade-offs:

| Factor | On-device | PCC server |
|---|---|---|
| Privacy | Yes | Yes |
| Works offline | Yes | No (needs connection) |
| Request limits | None | Daily per-user limit |
| Context size | 4K | 32K |
| Reasoning | No | Yes |

The session's advice is worth repeating: pick the model based on data, not vibes. The updated on-device model may handle more than you'd expect, and it has no request limits. The only way to know is to evaluate your specific feature (Apple's new Evaluations framework, covered in "Meet the Evaluations framework," is built for exactly this).
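Picking on data can start with something as small as a pass-rate comparison over a fixed set of prompts. The following is a minimal sketch in plain Swift, not the Evaluations framework: `EvalCase`, `passRate`, and the two stand-in generator closures are hypothetical names, and in a real app each closure would wrap a call to a session backed by one of the two models.

```swift
/// A hypothetical test case: a prompt plus a check on the generated text.
struct EvalCase {
    let prompt: String
    let passes: (String) -> Bool
}

/// Runs every case through `generate` and returns the fraction that pass (0...1).
func passRate(of cases: [EvalCase], using generate: (String) -> String) -> Double {
    guard !cases.isEmpty else { return 0 }
    let passed = cases.filter { $0.passes(generate($0.prompt)) }.count
    return Double(passed) / Double(cases.count)
}

// Stand-in generators so the sketch runs without any model.
let smallModel: (String) -> String = { prompt in String(prompt.prefix(10)) }
let largeModel: (String) -> String = { prompt in prompt.uppercased() }

let evalCases = [
    EvalCase(prompt: "hello world, this is a test", passes: { $0.count > 10 }),
    EvalCase(prompt: "short", passes: { !$0.isEmpty }),
]

print(passRate(of: evalCases, using: smallModel)) // 0.5
print(passRate(of: evalCases, using: largeModel)) // 1.0
```

If the on-device score is already good enough for a feature, that feature can stay on the tier with no request limits.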

PCC supports reasoning, where the model generates extra "thinking" text in a separate transcript segment before producing the final answer. There are three levels:

- `.light` gathers a bit of extra context.
- `.moderate` reasons a little deeper.
- `.deep` can produce a reasoning segment longer than the answer itself.

You set it per request:

``` swift
let response = try await session.respond(
    to: prompt,
    contextOptions: ContextOptions(reasoningLevel: .light)
)
// Reasoning levels: .light, .moderate, .deep
```

Keep in mind that the reasoning text is generated before the answer, so higher levels add latency, especially `.deep`, which can take a while.

You can now query context size directly instead of hardcoding it:

``` swift
SystemLanguageModel().contextSize
// 4096 on 26.0
// 8192 on 27.0 (newer devices)

PrivateCloudComputeLanguageModel().contextSize
// 32768
```
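With the context size available at run time, an app can check whether a prompt is likely to fit before sending it. The sketch below is plain Swift under stated assumptions: `estimatedTokens` uses a rough 4-characters-per-token heuristic rather than the model's real tokenizer, and `fits` and `reservedForOutput` are hypothetical names. In an app, the `contextSize` argument would come from the model's `contextSize` property instead of a literal.

```swift
/// Very rough token estimate: assumes about 4 characters per token.
/// This is a heuristic for illustration, not the model's real tokenizer.
func estimatedTokens(in text: String) -> Int {
    (text.count + 3) / 4 // integer division, rounded up
}

/// Returns true when the prompt plus a reserved output budget fits in `contextSize`.
func fits(_ prompt: String, contextSize: Int, reservedForOutput: Int = 512) -> Bool {
    estimatedTokens(in: prompt) + reservedForOutput <= contextSize
}

let longArticle = String(repeating: "word ", count: 4_000) // 20,000 characters

print(fits(longArticle, contextSize: 4_096))  // false under the 4-chars-per-token assumption
print(fits(longArticle, contextSize: 32_768)) // true
```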

Because requests are metered against the user's iCloud account, your app will eventually hit a user who's at their daily cap. If the only thing that happens is a thrown error surfaced in the UI, that's a poor, non-actionable experience.

Instead, inspect `quotaUsage`

and render persistent, actionable UI:

``` swift
import FoundationModels
import SwiftUI

struct ArticleSummarizationView: View {
    private var model = PrivateCloudComputeLanguageModel()

    var body: some View {
        if case .belowLimit(let info) = model.quotaUsage.status {
            if info.isApproachingLimit {
                Text("Nearing usage limit.")
                    .foregroundStyle(Color.orange)
            }
        }
        if model.quotaUsage.isLimitReached {
            Text("Usage limit exceeded.")
                .foregroundStyle(Color.red)
        }
        if let suggestion = model.quotaUsage.limitIncreaseSuggestion {
            Button("Show options") {
                suggestion.show()
            }
        }
    }
}
```

Design guidance from the session:

- `limitIncreaseSuggestion` lets the user manage or raise their limit (such as upgrading their iCloud account).

You don't need to burn real quota to test this. In your scheme, go to **Debug > Options** and use **Simulate Apple Foundation Models Availability**. You can select **Quota Usage Limit Reached** and **Nearing Usage Limit** to exercise both code paths.
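The mapping from quota state to banner text can also live outside the view, which makes it easy to unit test. This is a minimal sketch in plain Swift: `QuotaState` and `bannerMessage` are hypothetical app-side names that mirror the states described above, not types from the framework.

```swift
/// Hypothetical app-side mirror of the quota states described above.
enum QuotaState {
    case belowLimit(isApproachingLimit: Bool)
    case limitReached
}

/// Maps a quota state to persistent banner text, or nil when nothing needs showing.
func bannerMessage(for state: QuotaState) -> String? {
    switch state {
    case .belowLimit(let isApproachingLimit):
        // Only warn once the user is close to the cap.
        return isApproachingLimit ? "Nearing usage limit." : nil
    case .limitReached:
        return "Usage limit exceeded."
    }
}

print(bannerMessage(for: .belowLimit(isApproachingLimit: true)) ?? "no banner")
// prints "Nearing usage limit."
```

The view then only has to translate the framework's status into `QuotaState` and render whatever text comes back.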

You're not forced to pick one. A common pattern is to route simple work to the on-device model and escalate harder tasks to PCC. The session points to "Build agentic app experiences with Foundation Models" for that workflow.
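The routing decision itself can be a small pure function. The sketch below assumes a hypothetical `ModelTask` description and `Route` enum; the rule it encodes (escalate when the task needs reasoning or will not fit the on-device context, otherwise stay on-device) is one possible policy, not the one from the session.

```swift
/// Hypothetical description of a unit of work the app wants a model to do.
struct ModelTask {
    let estimatedPromptTokens: Int
    let needsReasoning: Bool
}

/// Hypothetical routing target.
enum Route: Equatable {
    case onDevice
    case privateCloudCompute
}

/// Sends a task to PCC only when it needs reasoning or exceeds the on-device
/// context window, and only when PCC can be used right now (available, online,
/// below the daily limit). Everything else stays on-device.
func route(_ task: ModelTask, onDeviceContextSize: Int, pccUsable: Bool) -> Route {
    let needsServer = task.needsReasoning
        || task.estimatedPromptTokens > onDeviceContextSize
    return (needsServer && pccUsable) ? .privateCloudCompute : .onDevice
}

let summary = ModelTask(estimatedPromptTokens: 10_000, needsReasoning: false)
print(route(summary, onDeviceContextSize: 4_096, pccUsable: true)) // privateCloudCompute
```

When an oversized task is routed on-device because PCC is not usable, the app still has to shorten the input or disable the feature; this function only picks the tier.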

The server model is available for apps with **fewer than 2M downloads**, and you **apply on the Apple Developer website**. If your feature genuinely needs the larger context or reasoning, it's worth applying early.

---

## Summary

If you already use Foundation Models, reaching for a bigger model is now a one-line decision, with privacy handled and no token bill to manage. Evaluate, choose the right tier for each task, and design for the daily limit up front.
