{"slug": "what-s-new-for-managed-service-for-apache-spark-clusters", "title": "What's new for Managed Service for Apache Spark clusters", "summary": "Google Cloud announced at Next '26 that Managed Service for Apache Spark clusters now include Lightning Engine, a native C++ vectorized execution engine that delivers up to 4.9x faster performance than standard open-source Spark without requiring code changes. The update also brings generally available Flexible VMs, which allow users to define up to ten ranked machine types to improve cluster resilience against capacity constraints, and new FinOps features for better cost control.", "body_md": "At Google Cloud, our goal is to let you run large-scale analytical and data science workloads with maximum efficiency so you can process big data pipelines, machine learning, and ETL tasks.\n\nWe recently announced that the Dataproc service is now [Managed Service for Apache Spark](https://cloud.google.com/products/managed-service-for-apache-spark), reflecting our deep integration with the [Agentic Data Cloud](https://cloud.google.com/data-cloud).\n\nTo support the diverse architectural needs of today’s modern data teams, we offer the service in two distinct deployment modes: serverless and managed clusters. 
The serverless deployment mode completely abstracts infrastructure management for ephemeral or ad-hoc jobs, while the managed clusters deployment mode is designed for teams that require fine-grained infrastructure customization, persistent environments, long-running stateful processing, or native integration with custom Compute Engine hardware configurations.\n\nWhen it comes to managed cluster deployments, we’ve re-imagined the experience from the ground up, focusing on three core pillars: making Spark **faster** by supercharging execution speeds, **easier** to run by maximizing resource obtainability and reducing operational overhead, and **smarter** by embedding AI directly into the development and operational lifecycle.\n\nThis blog post focuses specifically on what we announced at Google Cloud Next ’26 for the Managed Spark clusters deployment mode: enhanced flexibility to fine-tune performance and cost through a native execution engine, smarter scaling policies, and Gemini-powered extensions. For the latest on the serverless deployment mode, check out [this blog](https://cloud.google.com/blog/products/data-analytics/serverless-managed-service-for-apache-spark-runtime-3-0-features?e=48754805).\n\nArguably the biggest update for Managed Spark clusters is [Lightning Engine](https://cloud.google.com/dataproc/docs/guides/lightning-engine), which introduces massive performance gains for Spark DataFrame/Dataset APIs and heavy Spark SQL queries. 
Powered by a native C++ vectorized execution engine built on Velox and Gluten, with specialized internal enhancements, Lightning Engine bypasses JVM execution bottlenecks by compiling query plans into native instructions optimized for SIMD (Single Instruction, Multiple Data) vectorization.\n\nThis native execution engine delivers:\n\n**Up to 4.9x faster performance** than standard open-source Spark\n\n**Up to 2x the price-performance** over the leading high-speed Spark alternative\n\nCrucially, taking advantage of these performance gains doesn’t require any code changes to your existing Spark applications. Because your jobs complete faster, you directly reduce your aggregate Compute Engine runtime hours and overall spend.\n\nTo enable Lightning Engine on your managed clusters, simply specify the Lightning Engine option when you’re creating a cluster.\n\nTemporary localized shortages of a specific machine type can stall cluster creation or interrupt autoscaling. To dramatically improve cluster resilience against capacity constraints, [Flexible VMs](https://docs.cloud.google.com/dataproc/docs/concepts/configuring-clusters/flexible-vms) for Managed Spark clusters are now generally available.\n\nFlexible VMs allow you to define up to ten ranked machine types for your master, primary, and secondary worker nodes. Managed Service for Apache Spark pairs this preference with automated regional zone placement, dynamically scanning the entire region to fulfill your capacity requests using the best available hardware layout. 
This helps ensure your pipelines spin up predictably, drastically reducing resource availability errors and maximizing your ability to capture cost-effective Spot VM capacity during periods of peak demand.\n\nTo give you better fiscal control over persistent and developmental environments, we recently announced the general availability of two highly requested FinOps features: [zero-scale clusters](https://docs.cloud.google.com/dataproc/docs/guides/create-zero-scale-cluster) and [cluster scheduled stops](https://docs.cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-stop).\n\n**Zero-scale clusters**: You can now provision environments that use exclusively secondary workers (Spot VMs), enabling the cluster to automatically scale down to zero worker nodes when no processing is active, leaving only the master node online to preserve metadata.\n\n**Cluster scheduled stops**: This feature lets you configure automated cluster shutdown policies based on specific idle-time limits or a precise future timestamp.\n\nBecause these features are natively integrated, they remove the operational friction of deleting and reconstructing your environment, and you stop paying for idle compute during nights and weekends.\n\nTo bridge the gap between generative AI and data engineering, we launched the [Model Context Protocol (MCP) server for Managed Service for Apache Spark](https://docs.cloud.google.com/dataproc/docs/guides/use-dataproc-mcp). This open-standard integration allows LLMs and AI assistants to securely and dynamically interact with your Managed Spark clusters using natural language.\n\nBy utilizing the MCP server, your AI agents can securely connect to your data platform under existing IAM permissions. 
This allows agents to perform cluster-based operations, such as creating a cluster, submitting a job, or adjusting an autoscaling policy, directly from your AI application.\n\nThe [Google Cloud Data Agent Kit](https://docs.cloud.google.com/data-cloud-extension) extension allows data scientists, engineers, and developers to manage their entire data workload lifecycle directly within their preferred development environment. We rolled out native support for this extension on Managed Spark clusters, enabling teams to seamlessly build and deploy specialized Data Agents for code generation and data wrangling.\n\nDevelopers can choose to use [Antigravity 2.0](https://antigravity.google/blog/introducing-google-antigravity-2-0), Google's standalone, agentic development platform, or bring these agentic capabilities into their preferred IDE, including VS Code, Claude Code, or Codex, via the Data Agent Kit extensions and plugins. By pairing this streamlined workflow with the raw processing power of managed clusters, these intelligent agents can securely execute complex workflows directly over petabyte-scale data lakes. 
Specifically, the Data Agent Kit enables developers to:\n\n**Build and orchestrate pipelines:** Author multi-node data pipelines and generate comprehensive code documentation using natural language.\n\n**Perform real-time debugging:** Leverage Gemini Cloud Assist to sift through executor logs, pinpoint root causes of job failures, and recommend actionable fixes.\n\n**Easily connect to Spark resources:** Instantly attach to serverless Spark runtimes or managed clusters without manual network configuration or local Spark installations.\n\n**Streamline Git and CI/CD management:** Commit, merge, and deploy code directly from your IDE of choice, triggering automated testing and deployment pipelines without friction.\n\nWe recently launched [Lakehouse](https://docs.cloud.google.com/lakehouse/docs/introduction), which delivers read/write interoperability between engines like Managed Service for Apache Spark and BigQuery. By leveraging the [Lakehouse runtime catalog](https://docs.cloud.google.com/lakehouse/docs/about-lakehouse-catalogs) as a unified, serverless metadata layer, it removes data silos and the need for complex translation layers. This agentic-first approach allows organizations to process open formats directly from Google Cloud Storage, or even query remote AWS datasets using the newly introduced [cross-cloud Lakehouse](https://docs.cloud.google.com/lakehouse/docs/about-cross-cloud-lakehouse), all while maintaining a single source of truth for security and governance.\n\nFor customers utilizing Managed Spark clusters, this integration unlocks several powerful new capabilities. Data teams can now accelerate their most demanding ETL and data science workloads by up to 4.9x using the optimized Lightning Engine.\n\nKeeping pace with the open-source ecosystem, we rolled out [Cluster Image 3.0](https://docs.cloud.google.com/dataproc/docs/release-notes#May_03_2026) in preview, built on Apache Spark 4.1 and featuring an upgraded default Java runtime, Java 21. 
Spark 4.1 introduces a set of core open-source capabilities, including real-time mode for structured streaming. This enables your Spark environment to support real-time streaming with continuous, sub-second latency processing.\n\nThese updates are live and ready to use today in Managed Spark clusters! You can enable these new features directly through the Google Cloud console or via the `gcloud` CLI.\n\nTo spin up a new managed cluster and unlock the performance of **Lightning Engine**, specify the Lightning Engine option when you create the cluster with the `gcloud` CLI.\n\nAlternatively, navigate to the [Managed Service for Apache Spark page in the console](https://console.cloud.google.com/dataproc), click Create cluster, and select ‘Enable Lightning Engine’ under the cluster configuration settings to automatically activate Lightning Engine for your Spark jobs.\n\nWe look forward to hearing about the environments you build and run as Managed Service for Apache Spark clusters!", "url": "https://wpnews.pro/news/what-s-new-for-managed-service-for-apache-spark-clusters", "canonical_source": "https://cloud.google.com/blog/products/data-analytics/enhancements-to-managed-service-for-apache-spark-clusters/", "published_at": "2026-06-04 16:00:00+00:00", "updated_at": "2026-06-04 16:41:26.229098+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-products", "ai-tools", "machine-learning", "artificial-intelligence"], "entities": ["Google Cloud", "Managed Service for Apache Spark", "Dataproc", "Agentic Data Cloud", "Apache Spark"], "alternates": {"html": "https://wpnews.pro/news/what-s-new-for-managed-service-for-apache-spark-clusters", "markdown": "https://wpnews.pro/news/what-s-new-for-managed-service-for-apache-spark-clusters.md", "text": "https://wpnews.pro/news/what-s-new-for-managed-service-for-apache-spark-clusters.txt", "jsonld": "https://wpnews.pro/news/what-s-new-for-managed-service-for-apache-spark-clusters.jsonld"}}