The NVIDIA GH200 marks a major step forward in AI infrastructure by merging a Grace CPU and Hopper GPU into a single superchip. Together they expose 576 GB of unified memory, combining 96 GB of HBM3 on the GPU with 480 GB of LPDDR5X on the CPU. A dedicated NVLink-C2C interface moves data between the two processors at 900 GB per second, removing the PCIe bottleneck and enabling direct, high-speed coordination for demanding workloads.
Released in May 2023 and now entering broad cloud availability in 2025, the GH200 is built for trillion-parameter models and long-context inference where memory capacity and bandwidth are the defining limits. It allows large-scale AI systems to operate efficiently without the need for complex model partitioning or multi-node orchestration.
This article examines the NVIDIA GH200 specifications, pricing landscape, and where it can be rented or deployed. It also introduces Fluence, a decentralized GPU marketplace expanding its list of GPU models and offering cost-effective access to high-performance hardware for production-scale inference.
Note: GH200 is on Fluence’s roadmap, with availability expected as supply grows.
NVIDIA GH200 at a Glance
The NVIDIA GH200 sits at the top of NVIDIA’s AI hardware lineup, engineered for memory-intensive inference, hybrid compute, and large-model deployment. It integrates a Grace CPU and a Hopper GPU into one unified superchip, eliminating latency and bandwidth limits common in discrete setups. The result is a single addressable memory space where both CPU and GPU share data directly, enabling streamlined performance for large-scale AI workloads.
| Specification | Detail |
| --- | --- |
| Architecture Family | Grace Hopper Superchip (Hopper GPU + Grace CPU) |
| GPU Memory | 96 GB HBM3 |
| CPU Memory | 480 GB LPDDR5X |
| Total Unified Memory | 576 GB |
| Memory Bandwidth | 4,000 GB/s (GPU) + 512 GB/s (CPU) + 900 GB/s (NVLink-C2C) |
| CUDA Cores | 16,896 |
| Tensor Cores | 528 |
| FP32 Performance | 67 TFLOPS |
| FP16 Performance | 990 TFLOPS |
| INT8 Performance | 1,979 TOPS |
| TDP | 450 W to 1,000 W (combined CPU, GPU, and memory) |
| Form Factor | Superchip (integrated module) |
These specifications define the NVIDIA GH200 GPU as a single-module compute platform built for workloads that exceed the limits of conventional GPU memory, making it ideal for large-scale AI inference and scientific computing.
NVIDIA GH200 Specifications and Architecture
The NVIDIA GH200 fuses a Grace CPU with a Hopper GPU into one coherent compute unit, removing the traditional separation between host and accelerator. Through NVLink-C2C, it delivers unified memory access and high bidirectional bandwidth, enabling both processors to operate on shared data without transfer overhead.
CPU-GPU Integration via NVLink-C2C
A 900 GB per second NVLink-C2C interconnect links the Grace CPU and Hopper GPU, replacing PCIe’s bandwidth ceiling with a seamless, low-latency channel. This direct bridge makes hybrid workloads (CPU preprocessing and GPU inference) behave as if they run on a single device. The unified memory model simplifies programming and data management for large-scale AI systems.
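A minimal sketch of what this unified model looks like in practice, using CuPy's managed-memory allocator (a real CuPy API); the note that oversubscribed pages migrate over NVLink-C2C on GH200 reflects the architecture described above rather than a benchmarked guarantee:

```python
import cupy as cp

# Route all CuPy allocations through cudaMallocManaged so the CPU and GPU
# share a single address space. On GH200, NVLink-C2C lets the driver
# migrate (or directly access) these pages at up to 900 GB/s.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# Allocate a buffer; with 576 GB of unified capacity, oversubscribing the
# 96 GB of HBM3 results in page migration rather than an out-of-memory
# error, so much larger arrays than usual become practical.
x = cp.zeros((1_000_000_000,), dtype=cp.float32)  # ~4 GB, scale as needed

# A GPU kernel touches the data...
x += 1.0

# ...and the host can then inspect the same allocation.
print(cp.asnumpy(x[:10]))
```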
Memory Subsystem
The GH200 combines 96 GB of HBM3 GPU memory with 480 GB of LPDDR5X CPU memory, presenting 576 GB of unified memory to both processors. Bandwidth peaks at 4,000 GB/s on the GPU and 512 GB/s on the CPU, allowing models exceeding 200 billion parameters and 100K-token contexts to run without sharding or offloading.
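To make those capacity numbers concrete, here is a back-of-the-envelope sizing check; the 2-bytes-per-FP16-parameter rule is a standard approximation, and the helper below is illustrative:

```python
def fits_on_gh200(params_billion: float, dtype_bytes: int = 2) -> str:
    """Rough check of where a model's weights land on a GH200.

    Weight memory ~= parameters x bytes per parameter (2 for FP16/BF16,
    1 for INT8/FP8). Ignores KV cache and activations, so treat the
    result as a lower bound.
    """
    weights_gb = params_billion * dtype_bytes  # 1B params ~= 1 GB per byte
    HBM3_GB, UNIFIED_GB = 96, 576
    if weights_gb <= HBM3_GB:
        return f"{weights_gb:.0f} GB: fits entirely in HBM3"
    if weights_gb <= UNIFIED_GB:
        return f"{weights_gb:.0f} GB: spills into LPDDR5X but fits unified memory"
    return f"{weights_gb:.0f} GB: exceeds one GH200; needs sharding"

for p in (70, 200, 400):
    print(f"{p}B FP16 -> {fits_on_gh200(p)}")
# 70B  -> 140 GB: spills into LPDDR5X but fits unified memory
# 200B -> 400 GB: spills into LPDDR5X but fits unified memory
# 400B -> 800 GB: exceeds one GH200; needs sharding
```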
Compute Capabilities
The Hopper GPU integrates 16,896 CUDA cores and 528 Tensor Cores, reaching 67 TFLOPS (FP32), 990 TFLOPS (FP16), and 1,979 TOPS (INT8), the latter two with sparsity. Its Transformer Engine and hardware sparsity acceleration sustain high throughput for transformer and generative workloads within a single node.
Power and Thermal Profile
Power consumption ranges from 450 W to 1,000 W, with liquid cooling recommended for full-load operation. Despite the high envelope, efficiency per watt aligns closely with the H100 thanks to tighter integration and optimized data paths.
Comparison with Alternatives
| Aspect | GH200 | H100 | A100 | AMD MI300X |
| --- | --- | --- | --- | --- |
| GPU Memory | 96 GB HBM3 | 80 GB HBM3 | 80 GB HBM2e | 192 GB HBM3 |
| CPU Memory | 480 GB LPDDR5X | None | None | None |
| Total Memory | 576 GB | 80 GB | 80 GB | 192 GB |
| Memory Bandwidth | 4,000 GB/s | 3,350 GB/s | 2,039 GB/s | 5,300 GB/s |
| CPU-GPU Bandwidth | 900 GB/s | N/A | N/A | N/A |
| Best For | Long-context, memory-heavy AI | Balanced workloads | Cost-efficient inference | Extreme memory workloads |
| Release Date | May 2023 | Sept 2022 | May 2020 | Dec 2023 |
Fluence’s decentralized GPU marketplace, while not yet hosting GH200 hardware, continues to expand its lineup with new GPU models to serve developers seeking scalable, high-memory performance at significantly lower cost than hyperscalers.
Performance Profile and Ideal Workloads for NVIDIA GH200
The NVIDIA GH200 performs best when workloads are constrained by memory capacity or bandwidth rather than compute. Its 576 GB of unified memory and 900 GB/s CPU-GPU interconnect allow massive models and hybrid pipelines to run efficiently on a single system, eliminating the need for multi-GPU orchestration.
When GH200 Beats Alternatives
The GH200 shines in scenarios where data movement or model size dominates runtime:
- Long-context inference: Processes 100K+ token context windows for open-weight LLMs such as Llama 2 70B.
- Memory-intensive models: Handles 100B–200B+ parameter models without sharding.
- Hybrid workloads: Executes preprocessing and inference together in one node.
- Batch inference: Maintains stable latency at large batch sizes beyond 256.
- Multi-model serving: Keeps several models resident in unified memory.
- Scientific simulations: Supports physics and molecular dynamics with large data pools.
Proven Use Cases
For long-context LLM inference, GH200 delivers roughly 4x faster token generation at 100K context lengths versus DDR-based baselines and sustains real-time throughput for retrieval-augmented generation systems. In batch inference, MLPerf v5.0 benchmarks show H100-level throughput with lower cost per token. Unified memory also enables multi-model serving, running three to five large models simultaneously with minimal overhead.
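As a worked example of the long-context claim, the standard KV-cache accounting below plugs in Llama 2 70B's published shapes (80 layers, 8 KV heads under grouped-query attention, head dimension 128); the function is plain arithmetic, not a profiler:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, dtype_bytes: int = 2) -> float:
    """KV cache size: 2 (keys + values) x layers x kv_heads x head_dim
    x dtype bytes, per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 1e9

# Llama 2 70B: 80 layers, 8 KV heads (grouped-query attention),
# head dimension 128, FP16 cache.
cache = kv_cache_gb(tokens=100_000, layers=80, kv_heads=8, head_dim=128)
weights = 70 * 2  # ~140 GB of FP16 weights

print(f"KV cache @100K tokens: {cache:.1f} GB")        # ~32.8 GB
print(f"Weights + cache:       {weights + cache:.1f} GB")
# ~173 GB total: over the 96 GB of HBM3, comfortably inside 576 GB unified.
```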
In data processing, the Grace CPU’s 72 Arm Neoverse V2 cores manage preprocessing while the Hopper GPU accelerates vector operations. Shared memory removes PCIe transfer limits, giving the GH200 strong performance for analytics and ETL pipelines.
When GH200 Is Not the Right Choice
The GH200 is less suited for single-GPU training or compute-bound tasks. The H100 and H200 offer better efficiency for pure training, while the A100 and L40 remain cost-effective for smaller deployments. Teams tied to ROCm or extreme bandwidth workloads may prefer the AMD MI300X.
Pricing and Cost Dynamics for NVIDIA GH200
By 2025, NVIDIA GH200 pricing has stabilized as supply expanded across cloud and decentralized platforms. Costs vary by provider, but total ownership depends on workload duration, egress patterns, and storage needs. Teams now have the choice between direct purchase, traditional cloud rentals, and decentralized options such as Fluence.
Direct Purchase (2025)
A single GH200 superchip sells for $35,000 to $45,000, while a full DGX GH200 system with 256 chips exceeds $500,000. Typical lead times range from 2 to 4 weeks, and systems depreciate roughly 20% per year. Direct ownership suits organizations running continuous production inference or operating their own data centers.
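A rough break-even sketch against on-demand rental rates helps frame that decision; the figures come from this article's pricing tables and deliberately ignore power, cooling, and the ~20% annual depreciation noted above, so they flatter ownership:

```python
# Break-even between buying a GH200 (~$40K, the midpoint of the
# $35,000-$45,000 range above) and renting on demand.
purchase_usd = 40_000
hourly_rates = {"Lambda Labs": 1.49, "Vultr (low)": 1.99, "CoreWeave": 6.50}

for provider, rate in hourly_rates.items():
    hours = purchase_usd / rate
    years_247 = hours / (24 * 365)
    print(f"{provider:12s}: break-even at {hours:>9,.0f} GPU-hours "
          f"(~{years_247:.1f} years of 24/7 use)")
# Lambda Labs : ~26,846 GPU-hours (~3.1 years of 24/7 use)
# Vultr (low) : ~20,101 GPU-hours (~2.3 years)
# CoreWeave   :  ~6,154 GPU-hours (~0.7 years)
```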
Cloud Rental Pricing (per GPU-hour)
| Provider | Hourly Price (USD) | GPU Type | Reliability | Egress Fees | Best Fit |
| --- | --- | --- | --- | --- | --- |
| Fluence | Coming soon | Data center | High | None | Production, egress-heavy workloads on supported GPUs |
| Lambda Labs | $1.49 | Data center | High | Yes | Research, training |
| Sesterce | $1.64 | Data center | High | Yes | Development, testing |
| Vultr | $1.99–$2.99 | Data center | High | Yes | Enterprise, multi-region |
| CoreWeave | $6.50 | Data center | High | Yes | Premium, guaranteed availability |
The NVIDIA GH200 price per hour typically ranges from $1.50 to $6.50, depending on provider tier and service level. Fluence applies a similar transparent, zero-egress pricing model across its supported GPUs and will extend the same structure to GH200 when it joins the marketplace.
Cost Comparison: GH200 vs. H100 vs. Multi-GPU Clusters
On Fluence, current GPU offerings already show significant cost savings versus hyperscalers. For example, large-memory nodes on Fluence maintain high throughput while avoiding egress fees that can add $0.08 to $0.12 per GB on centralized clouds. This same economic advantage is expected to carry over once GH200 capacity becomes available.
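The egress effect is easy to quantify. The sketch below uses the $0.08–$0.12 per GB range cited above; the 50 GB/hour transfer volume is an invented illustrative workload profile, not a measurement:

```python
# Effective hourly cost once egress is included.
def effective_hourly(base_rate: float, egress_per_gb: float,
                     gb_transferred_per_hour: float) -> float:
    return base_rate + egress_per_gb * gb_transferred_per_hour

workload_gb_per_hour = 50  # e.g., serving large artifacts or embeddings

centralized = effective_hourly(1.49, 0.10, workload_gb_per_hour)
zero_egress = effective_hourly(1.49, 0.00, workload_gb_per_hour)

print(f"Centralized cloud: ${centralized:.2f}/hr")  # $6.49/hr
print(f"Zero-egress:       ${zero_egress:.2f}/hr")  # $1.49/hr
```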
Pricing Models and Billing
Most providers offer on-demand hourly billing, with reserved plans granting 10–30% discounts for commitments of three to thirty-six months. Spot capacity is rarely available due to limited supply. Minimum session time is typically three hours. Payment options vary: USD or cryptocurrency on Fluence and USD on centralized clouds.
Where to Run NVIDIA GH200: Clouds, Marketplaces, and DePIN
As of 2025, NVIDIA GH200 is available across hyperscalers, specialist GPU clouds, and decentralized marketplaces. Each category balances cost, reliability, and control differently, giving teams flexibility to match deployment needs to budget and compliance requirements.
Hyperscaler Options
Major providers such as AWS, Azure, and Google Cloud offer GH200 systems through limited specialized instances. Pricing typically ranges from $8 to $12 per hour, with egress fees of $0.08–$0.12 per GB. Reliability reaches 99.9 to 99.99% SLA, alongside certifications such as SOC 2, ISO 27001, and HIPAA. These platforms best suit enterprise and regulated workloads that demand compliance, multi-region support, and contractual uptime guarantees.
Specialist GPU Cloud Providers
Specialist clouds provide lower-cost GH200 access with API-first provisioning and predictable performance:
- Lambda Labs: $1.49 per hour, suitable for training and research.
- Vultr: $1.99–$2.99 per hour, strong regional availability and flexible billing.
- CoreWeave: $6.50 per hour, premium service and guaranteed uptime.
- Sesterce: $1.64 per hour, transparent decentralized sourcing.
All operate verified data centers with high reliability. They are ideal for research, model development, and cost-optimized production workloads.
Decentralized Marketplaces (DePIN)
Decentralized GPU networks bring competitive pricing and open access to GH200 capacity.
- Fluence: GH200 is not yet listed, but the marketplace continues to expand rapidly with enterprise-grade GPUs, zero egress fees, and verified providers.
- Vast.ai: $1.94–$2.50 per hour, mixed consumer and data center hardware.
- io.net: Emerging marketplace with early GH200 listings.
These platforms operate through smart contracts and transparent billing, giving users cost control and no vendor lock-in. They fit egress-heavy, cost-sensitive production pipelines and teams aligned with decentralized ecosystems.
Fluence as an Option for NVIDIA GH200
Fluence is preparing to offer the NVIDIA GH200 through its decentralized GPU marketplace, pairing enterprise reliability with transparent zero-egress pricing. It targets production AI workloads where predictable performance, cost control, and independence from hyperscalers are key.
GH200 Availability on Fluence
As of December 2025, GH200 capacity is not yet available on Fluence. The marketplace is working with verified data center providers to onboard GH200 hardware as supply expands in 2026.
Why Fluence Fits GH200 Workloads
Fluence combines economic efficiency with architectural flexibility. Its zero-egress model eliminates $8–$12 per 100 GB transfer costs typical of centralized clouds. Providers are enterprise-grade and verified for uptime. The decentralized model prevents vendor lock-in and encourages competition.
Although GH200 listings are pending, Fluence already supports high-memory GPUs such as H100 and MI300X, allowing developers to run similar large-scale inference workloads today. Horizontal scaling also enables multiple high-memory instances to run in parallel for GH200-class workloads.
Developers can tailor deployments through:
- Custom OS images and persistent storage sizing
- Full networking control with public IPs and optional VPN
- API-based automation for CI/CD pipelines (see the sketch after this list)
- Free data movement through zero-egress design
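For the API-based automation above, a hypothetical provisioning call might look like the following; the endpoint URL, payload fields, and response shape are illustrative assumptions, not Fluence's published API, so consult the official documentation for the real schema:

```python
import os
import requests

# Hypothetical provisioning call for a CI/CD pipeline. The endpoint,
# payload fields, and response shape are illustrative placeholders,
# not Fluence's documented API.
API_URL = "https://api.fluence.example/v1/deployments"  # placeholder URL
TOKEN = os.environ["FLUENCE_API_TOKEN"]                 # assumed env var

payload = {
    "gpu_model": "H100",          # GH200 once listed
    "os_image": "ubuntu-22.04",
    "storage_gb": 512,
    "public_ip": True,
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Instance ID:", resp.json().get("id"))
```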
Fluence’s Distributed Advantage
Fluence sources compute from a globally distributed network of independent providers connected by transparent smart-contract governance. Competition lowers cost while maintaining data-center-level reliability. Native USDC billing supports crypto-aligned and DePIN-native teams. The model ensures vendor independence, letting users shift capacity without rebuilding their infrastructure.
When NVIDIA GH200 Is (and Is Not) the Right Choice
The NVIDIA GH200 is built for workloads where memory size, data bandwidth, and CPU-GPU cooperation drive performance. Choosing it depends on workload type, budget, and infrastructure priorities. The following framework helps teams decide when the GH200 provides a real advantage, and when other accelerators fit better.
Choose GH200 When
GH200 is the right fit when workloads exceed typical GPU memory limits or require unified CPU-GPU processing:
- Models exceed 80 GB or require over 200 GB total memory.
- Long-context inference runs at 100K+ tokens for LLMs or RAG systems.
- Unified CPU-GPU workflows combine preprocessing and inference on one node.
- Cost per token matters in large-batch inference.
- Egress-heavy pipelines need zero-cost data transfer (e.g., via Fluence).
- Single-machine simplicity is preferred over multi-GPU orchestration.
Choose H100 When
Teams focused on cost efficiency or distributed training may prefer the H100:
- Balanced training and inference performance with 40–50% cost savings versus GH200.
- NVLink clustering enables multi-GPU scaling for large training runs.
- Ideal for compute-bound workloads using FP32 or FP16 precision.
- Hourly pricing of $1.50–$1.70, with wider cloud availability.
Choose H200 When
The H200 extends H100’s performance with more memory bandwidth and capacity:
- 141 GB of HBM3e memory and 4.8 TB/s of bandwidth.
- Suited for GPU-only inference without CPU overhead.
- Increasing availability across providers through 2025.
Choose AMD MI300X When
AMD’s MI300X excels in extreme memory and bandwidth scenarios:
- 192 GB HBM3 and 5.3 TB/s bandwidth for large transformer models.
- Competitive price-performance for large batch inference.
- Best suited for teams optimized for ROCm software.
The GH200 stands out for its unified memory and simplicity rather than pure compute. It is ideal for teams running large inference workloads or hybrid pipelines that benefit from CPU-GPU coherence.
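That framework distills into a small helper; the thresholds below mirror the memory figures quoted in the bullets and are deliberate simplifications of a nuanced choice:

```python
def pick_accelerator(model_mem_gb: float, long_context: bool = False,
                     hybrid_cpu_gpu: bool = False, rocm_ready: bool = False,
                     training_focused: bool = False) -> str:
    """Toy decision helper mirroring the framework above.

    Thresholds are simplified from the bullets: 80 GB (H100 HBM),
    141 GB (H200), 192 GB (MI300X), 576 GB (GH200 unified).
    """
    if training_focused and model_mem_gb <= 80:
        return "H100: best cost for compute-bound training"
    if rocm_ready and model_mem_gb <= 192:
        return "MI300X: most HBM per GPU on ROCm"
    if model_mem_gb <= 80 and not (long_context or hybrid_cpu_gpu):
        return "H100: cheapest fit"
    if model_mem_gb <= 141 and not hybrid_cpu_gpu:
        return "H200: GPU-only inference with more HBM"
    if model_mem_gb <= 576:
        return "GH200: unified memory, single-node simplicity"
    return "Multi-GPU cluster: exceeds one GH200"

print(pick_accelerator(400, long_context=True))     # GH200
print(pick_accelerator(60, training_focused=True))  # H100
```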
Conclusion
The NVIDIA GH200 defines 2025’s top tier of AI infrastructure, unifying CPU and GPU memory into a single 576 GB space for seamless, high-bandwidth processing. It removes PCIe bottlenecks and enables trillion-parameter inference within one node, ideal for long-context models, multi-model serving, and hybrid CPU-GPU workloads.
Teams should adopt GH200 when model size or memory bandwidth are the main constraints. The H100 and H200 remain better for compute-heavy training, while the MI300X suits extreme memory and ROCm environments.
Fluence, though not yet hosting GH200, is rapidly expanding its GPU marketplace with zero egress fees, verified providers, and enterprise reliability. It already supports large-scale inference on high-memory GPUs at a fraction of hyperscaler costs and is positioned to deliver the same economic edge once GH200 capacity becomes available.