As AI infrastructure scales, enterprise expectations for operational maturity are increasing. Organizations expect these systems to be provisionable, observable, secure, and manageable at scale—the same standard applied to all critical infrastructure. The moment an AI system moves from development into enterprise deployment, that operational foundation is essential.
NVIDIA DGX Spark and NVIDIA GB10 systems are delivering this foundation with new Enterprise Manageability. As detailed in this post, Enterprise Manageability provides enterprise IT teams with a complete operational framework from first provisioning to end-of-life retirement, including support for fully air-gapped and disconnected deployments.
The DGX Spark manageability framework is a modular stack designed to integrate with the tools enterprise IT teams already use rather than replace them. NVIDIA partners that currently support enterprise manageability for DGX Spark include Progress Chef, Perforce Puppet, and Canonical Landscape.
The operating model is intentionally simple: agentless execution over SSH, with bounded, standardized JSON output. No resident management agent needs to run on the DGX Spark endpoint. Instead, IT teams invoke tools over SSH, and each tool returns a standardized JSON envelope that can feed directly into CMDB, SIEM, and monitoring pipelines. The pattern is the same regardless of which orchestration platform runs it. A representative envelope looks like this:
{
  "tool": "spark_diagctl.py",
  "ts": "2026-01-12T21:17:00Z",
  "host": "DGX_HOST",
  "status": "ok",
  "rc": 0,
  "duration_ms": 842,
  "summary": { "disk": "ok", "network": "ok", "drivers": "ok" },
  "warnings": [],
  "artifacts": []
}
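To make the envelope pattern concrete, here is a minimal Python sketch of how an orchestration script might run a tool over SSH and parse the result. It is an illustration under stated assumptions, not the framework's own client: the `Envelope` class, the `parse_envelope` and `run_remote` helpers, and the choice of which keys to treat as required are all hypothetical, and the remote command is passed in by the caller because the tools' command-line flags are not described here.

```python
import json
import subprocess
from dataclasses import dataclass, field, fields

# Assumption: these keys are treated as mandatory for this sketch only.
REQUIRED_KEYS = {"tool", "ts", "host", "status", "rc"}


@dataclass
class Envelope:
    """Typed view of the JSON envelope shown above."""
    tool: str
    ts: str
    host: str
    status: str
    rc: int
    duration_ms: int = 0
    summary: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        # Healthy means: status "ok", zero return code, and no warnings.
        return self.status == "ok" and self.rc == 0 and not self.warnings


def parse_envelope(raw: str) -> Envelope:
    """Parse tool output; raise ValueError on bad JSON or missing keys."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"envelope missing keys: {sorted(missing)}")
    known = {f.name for f in fields(Envelope)}
    # Ignore keys this sketch does not model instead of failing on them.
    return Envelope(**{k: v for k, v in data.items() if k in known})


def run_remote(host: str, remote_command: list[str], timeout_s: int = 60) -> Envelope:
    """Run a command on `host` over SSH and parse its stdout as an envelope.

    BatchMode=yes makes ssh fail instead of prompting for a password.
    Note that ssh joins the arguments with spaces for the remote shell.
    """
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, *remote_command],
        capture_output=True, text=True, timeout=timeout_s, check=False,
    )
    return parse_envelope(proc.stdout)
```

Because every tool returns the same shape, a pipeline only needs one parser like this; routing to a CMDB, SIEM, or monitoring system can then key off `tool`, `status`, and `warnings`.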
The framework ships with production tools and reference scripts, organized across the following six operational lifecycle phases:
The framework deliberately separates collectors (read-only, unprivileged, safe to run frequently) from controllers (state-changing, gated with least-privilege sudo, subject to change management approval). That design maps directly to how enterprise IT governs access.
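The collector/controller split can be sketched as a small command builder. This is a hedged illustration only: the `ToolSpec` type, the tool paths, and the change-ticket check are hypothetical stand-ins for whatever approval mechanism a given organization uses, and the framework's actual sudo configuration is not shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ToolKind(Enum):
    COLLECTOR = "collector"    # read-only, unprivileged
    CONTROLLER = "controller"  # state-changing, needs sudo and approval


@dataclass(frozen=True)
class ToolSpec:
    path: str
    kind: ToolKind


def build_command(spec: ToolSpec, args: list[str],
                  change_ticket: Optional[str] = None) -> list[str]:
    """Return the argv to run remotely, enforcing the privilege split."""
    if spec.kind is ToolKind.COLLECTOR:
        # Collectors never get elevated privileges.
        return [spec.path, *args]
    if not change_ticket:
        # Controllers are refused without an approved change reference.
        raise PermissionError(f"{spec.path} requires an approved change ticket")
    # "sudo -n" fails rather than prompting, which suits unattended runs.
    return ["sudo", "-n", spec.path, *args]


# Hypothetical tool paths, for illustration only.
health = ToolSpec("/opt/example/collect_health.py", ToolKind.COLLECTOR)
update = ToolSpec("/opt/example/apply_update.py", ToolKind.CONTROLLER)
```

Keeping the decision in one place means a scheduler can run collectors freely while controller invocations are refused unless an approval reference is supplied.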
A substantial portion of the operational complexity in enterprise AI deployments comes from getting the system to a known-good state in the first place, rather than from the running environment. This is particularly true for environments where direct internet access is restricted or prohibited.
DGX Spark Custom Installation directly addresses this challenge. At a high level, it enables enterprise IT teams to:
Under the hood, the patterns rely on cloud-init, an OEM Data partition on the installation USB drive, and a provisioning hook script. An optional on-premises mirror for fully air-gapped fleets can also be used.
This makes it practical to maintain a fully air-gapped DGX Spark fleet using standard enterprise tooling. No custom infrastructure is required beyond an internal server or a USB drive. For the full set of installation patterns and when to use each, see the Enterprise Manageability documentation.
The DGX Spark manageability framework provides tools designed for observability, diagnostics, and incident response. AI infrastructure failures are often expensive to diagnose remotely. Events such as firmware regressions, PCIe issues, and unexpected resets all require evidence collection before a root cause can be determined, and collecting that evidence at scale without disrupting the running system is nontrivial.
The manageability framework provides two diagnostic tools designed to address these challenges: spark_diagctl.py and reset_reason_reporter.py.
spark_diagctl.py is the primary diagnostic tool in the framework. It’s a single script that runs remotely over SSH, providing IT teams with visibility into the health and state of any DGX Spark system without requiring physical access or a resident agent. It operates in two modes:
reset_reason_reporter.py addresses one of the more persistent diagnostic challenges in AI infrastructure: explaining why a system rebooted. The tool correlates multiple evidence sources (system event logs, BMC records, kernel oops reports, and firmware events) and produces a structured root cause assessment. It deliberately uses conservative classifications, flagging ambiguity rather than speculating, which makes the output more reliable for incident triage and stability trending.
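The idea of conservative classification can be illustrated with a short Python sketch. This is not the logic of reset_reason_reporter.py, which is not described in detail here; the `Evidence` type, the cause labels, and the two-source corroboration threshold are assumptions chosen to show the principle of reporting ambiguity instead of guessing.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str  # hypothetical labels, e.g. "sel", "bmc", "kernel", "firmware"
    cause: str   # cause suggested by that source, e.g. "kernel_panic"


def classify_reset(evidence: list[Evidence], min_sources: int = 2) -> dict:
    """Name a cause only when independent sources agree and none disagree."""
    votes = Counter()
    # De-duplicate so each source counts at most once per cause.
    for _source, cause in {(e.source, e.cause) for e in evidence}:
        votes[cause] += 1

    if not votes:
        return {"classification": "unknown", "confidence": "none", "candidates": []}

    top_cause, top_votes = votes.most_common(1)[0]
    contested = len(votes) > 1
    if top_votes >= min_sources and not contested:
        return {"classification": top_cause, "confidence": "corroborated",
                "candidates": [top_cause]}
    # Conflicting or uncorroborated evidence: list candidates, do not pick one.
    return {"classification": "ambiguous", "confidence": "low",
            "candidates": sorted(votes)}
```

A single uncorroborated source or any disagreement between sources yields `ambiguous` with the candidate list attached, so a triage engineer sees what the evidence suggests without the tool overstating its certainty.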
Both tools emit the same JSON envelope format. This means that the same Ansible playbook, Tanium package, or Landscape script that runs health checks can also trigger incident response collections with no changes to the integration layer.
Keeping a fleet of AI systems current can be challenging. DGX Spark brings together tightly coupled layers: kernel, GPU driver, firmware, container runtime, AI frameworks, and security patches. A failed update in any one layer can destabilize the environment. Updates also need to happen inside change management windows, with appropriate rollback options.
spark_updatectl.py is the update control plane. It exposes the system’s current update posture as a JSON report. This includes items such as packages that need updating, firmware updates that are applicable, and whether a reboot is pending. It then provides controlled update operations that coordinate with maintenance window scheduling. It supports staged rollouts across device rings, precheck and postcheck evidence capture, and firmware rollback visibility.
The tool is designed to be driven by whatever orchestration platform the team already uses. An Ansible playbook can query update posture across a fleet, identify systems that are lagging, and stage updates in waves with appropriate approval gates, all using the same agentless SSH execution model as the rest of the framework.
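A wave-based rollout of this kind can be sketched in a few lines of Python. The sketch assumes a simplified posture record; the `Posture` fields, the ring numbering, and the `apply_update` and `approve` callbacks are hypothetical and stand in for the real posture report and for whatever approval gate the orchestration platform provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Posture:
    host: str
    ring: int               # hypothetical: lower rings update first
    pending_packages: int
    reboot_pending: bool


def plan_waves(fleet: list[Posture]) -> list[list[str]]:
    """Group lagging hosts by ring, lowest ring first."""
    waves: dict[int, list[str]] = {}
    for p in fleet:
        if p.pending_packages > 0 or p.reboot_pending:
            waves.setdefault(p.ring, []).append(p.host)
    return [sorted(waves[ring]) for ring in sorted(waves)]


def rollout(waves: list[list[str]],
            apply_update: Callable[[str], bool],
            approve: Callable[[int, list[str]], bool]) -> list[str]:
    """Update wave by wave; stop at the first denied approval or failure."""
    updated: list[str] = []
    for index, wave in enumerate(waves):
        if not approve(index, wave):
            break  # approval gate not passed
        results = {host: apply_update(host) for host in wave}
        updated.extend(host for host, ok in results.items() if ok)
        if not all(results.values()):
            break  # do not promote to the next ring after a failure
    return updated
```

Stopping at the first failed wave keeps a bad update confined to the earliest ring, which is the main reason to stage rollouts in the first place.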
Enterprise AI systems increasingly hold proprietary models, sensitive datasets, and internal intellectual property. Security posture must be auditable, and compliance evidence must be producible on demand. The framework treats security as a first-class requirement throughout.
Specific capabilities include:
The RBAC design reflects a least-privilege model throughout. Collector tools (those that only read state) run without elevated privileges. Controller tools (those that modify state) require explicit sudo grants scoped to the specific operation. This maps cleanly to role separation in enterprise environments where change management and read-only access are governed separately.
Canonical Landscape integration provides a practical path for extending existing Ubuntu fleet management operations to DGX Spark. The reference scripts cover the full security and lifecycle surface: signing verification, verified boot, backup levels, factory reset, health watchdogs, support bundle collection, log retrieval, and encryption-at-rest reporting. Organizations already running Landscape for other Ubuntu infrastructure can bring DGX Spark into the same operational view without building a separate management layer.
Enterprise AI infrastructure carries enterprise expectations. Provisioning, observability, security posture validation, compliance evidence, and lifecycle management stop being optional once AI systems move into production.
The DGX Spark Enterprise Manageability framework is designed to meet your IT team where they are: working with the orchestration tools they already use, operating within the security and change management policies they already enforce, and managing systems that may be fully disconnected from the public internet. Stay tuned for deeper dives into specific enterprise manageability capabilities.
Ready to get started? Download these guides:
Both guides are built as operational references, featuring concrete examples, integration patterns, and production-ready sample scripts designed to adapt to the tools and policies each individual team already has in place. For additional documentation, visit DGX Spark Enterprise Manageability.