NVIDIA L40S: Pricing, Specs, Best Uses & Where to Run (2026)

NVIDIA L40S defines a new balance of performance and versatility for data centers in 2026. Built for both AI and graphics workloads, it delivers strong results across training, inference, and rendering without needing separate hardware. Released in late 2023, it filled the supply gap left by H100 and A100 shortages and quickly became the preferred option for teams demanding flexibility and reliability.

For AI engineers, infrastructure architects, startup CTOs, and DePIN builders, the L40S stands out as a universal GPU that adapts to changing workloads with minimal configuration overhead. It handles large-scale inference, model fine-tuning, 3D rendering, and video acceleration in a single package that fits standard server environments.

This article breaks down everything that matters about the NVIDIA L40S in 2026: its specifications, architecture, performance profile, pricing across clouds and marketplaces, and where it runs best. It concludes with a practical decision guide to help you match the L40S to the right workload and cost model.

NVIDIA L40S at a Glance

NVIDIA L40S runs on the Ada Lovelace architecture, the same generation as the RTX 4090. Announced in August 2023, it pairs 48 GB of GDDR6 memory with 864 GB/s bandwidth and delivers 18,176 CUDA cores, 568 fourth-generation Tensor Cores, and 142 third-generation RT Cores. With FP8 precision through NVIDIA’s Transformer Engine, it reaches up to 1,466 TFLOPS with sparsity. The PCIe Gen4 dual-slot design and 350 W power draw make it easy to deploy in standard servers.

L40S stands out because it handles AI and graphics workloads on the same card. It’s built for both compute and visualization, combining AI training and inference with rendering, simulation, and media acceleration. The three NVENC and three NVDEC engines with AV1 encode and decode enable high-quality video processing, real-time streaming, and Omniverse-ready 3D pipelines.

It drops some features found in heavier data-center GPUs. L40S has no MIG partitioning and no NVLink interconnect, which limits scaling and GPU sharing. This trade-off makes sense for teams that prioritize versatility and broad workload support over clustered HPC performance. NVIDIA presents L40S as a universal GPU for multimodal generative AI, LLM training and inference, rendering, and digital twin environments.

NVIDIA L40S Specs and Architecture

NVIDIA L40S builds on Ada Lovelace architecture, tuned for both AI acceleration and advanced graphics workloads. It strikes a middle ground between the compute-heavy H100 and the graphics-oriented RTX 4090, offering strong FP8 performance and broad compatibility with standard server hardware.

Ada Lovelace Architecture Foundation

The Ada Lovelace architecture introduces fourth-generation Tensor Cores with FP8 precision for transformer models, third-generation RT Cores for ray tracing, and enhanced CUDA cores for improved FP32 throughput. The Transformer Engine dynamically switches between FP8 and FP16 formats to balance precision and speed, giving L40S an efficiency boost in LLM and generative AI tasks.
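
As a rough illustration of how FP8 is exposed in practice, NVIDIA's Transformer Engine library wraps FP8 execution in a context manager. A minimal PyTorch sketch, assuming transformer-engine is installed on an FP8-capable GPU such as L40S (layer sizes are illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hybrid FP8: E4M3 for forward activations/weights, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# GEMMs inside this context run in FP8 with dynamic scaling factors
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)  # torch.Size([16, 4096])
```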

Memory and Bandwidth

L40S carries 48 GB of GDDR6 memory with ECC and 864 GB/s bandwidth. While this is lower than A100’s 2 TB/s or H100’s 3.35 TB/s, it remains ample for models in the 40–70B parameter range at FP8 or 4-bit quantization with moderate batch sizes. It offers a strong balance between capacity, cost, and availability for inference and fine-tuning.
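
For a back-of-envelope check on what fits in 48 GB, weight memory is roughly parameter count times bytes per parameter, plus headroom for KV cache and activations. A quick sketch with illustrative numbers only:

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Rough weight footprint: params × bytes/param (excludes KV cache, activations)."""
    return params_b * 1e9 * (bits / 8) / 1e9

for params_b in (8, 40, 70):
    for bits, name in ((16, "FP16"), (8, "FP8"), (4, "INT4")):
        # Leave ~10% of the 48 GB card for KV cache and runtime overhead
        fits = "fits" if weight_gb(params_b, bits) < 48 * 0.9 else "too big"
        print(f"{params_b}B @ {name}: {weight_gb(params_b, bits):5.1f} GB ({fits})")
# 8B fits at any precision; 40B fits at FP8; 70B fits in 48 GB only at 4-bit.
```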

Compute Performance

L40S delivers 91.6 TFLOPS at FP32, 183 TFLOPS in TF32 Tensor mode (366 with sparsity), and up to 733 TFLOPS at FP8 (1,466 with sparsity). It offers no meaningful FP64 throughput, signaling that it’s not aimed at traditional HPC or scientific workloads but rather at AI and visualization tasks where FP32 and FP8 dominate.

Media and Graphics Capabilities

The GPU includes three NVENC and three NVDEC engines, each supporting AV1 encoding and decoding. It also provides four DisplayPort 1.4a outputs and achieves 212 TFLOPS of RT Core performance. These capabilities make it ideal for video generation, real-time rendering, streaming, and NVIDIA Omniverse applications.

Form Factor and Deployment

Built on PCIe Gen4 x16 with 64 GB/s bidirectional bandwidth, the L40S uses passive cooling in a dual-slot design. It’s NEBS Level 3 ready, includes Secure Boot with Root of Trust, and supports NVIDIA vGPU virtualization. It does not support MIG partitioning or NVLink interconnects, which limits multi-GPU scaling and workload isolation but simplifies deployment in standard racks.

Spec Comparison

| Spec | L40S | A100 80GB | H100 80GB |
|---|---|---|---|
| Architecture | Ada Lovelace | Ampere | Hopper |
| Memory | 48 GB GDDR6 | 80 GB HBM2e | 80 GB HBM3 |
| Bandwidth | 864 GB/s | 2 TB/s | 3.35 TB/s |
| CUDA Cores | 18,176 | 6,912 | 14,592 |
| FP8 TFLOPS (sparsity) | 1,466 | N/A | 3,958 |
| FP32 TFLOPS | 91.6 | 19.5 | 67 |
| Media Engines | 3× NVENC / 3× NVDEC | No NVENC | No NVENC |
| MIG Support | No | Yes | Yes |
| NVLink | No | Yes (600 GB/s) | Yes (900 GB/s) |
| TDP | 350 W | 400 W | 700 W |
| Form Factor | PCIe dual-slot | SXM4 | SXM5 |

These specifications make L40S a versatile platform for teams balancing AI workloads with demanding media and visualization tasks. Its efficiency and flexibility give it an edge in environments where cost, availability, and deployment speed matter more than clustered performance. The next section explores how this design translates into real-world throughput across LLM inference, fine-tuning, and generative AI workloads.

Performance Profile and Ideal Workloads for NVIDIA L40S

NVIDIA L40S delivers balanced performance across inference, training, and graphics-intensive workloads. It is not designed to compete with H100 in raw throughput but to offer dependable speed and efficiency for small to mid-scale AI deployments. The following benchmarks and use cases illustrate where the GPU performs best and how teams can size it for production workloads.

LLM Inference Performance

Benchmarks show L40S achieving 43.79 tokens per second on Llama 3.1 8B at batch size 1 and 325.14 tokens per second at batch size 8. That’s roughly 50–55% of H100 throughput and around 55% of A100 performance at small batch sizes. Single-batch latency is around twice that of H100 and about 1.8× that of A100.

These numbers make it best suited for small to medium batch inference, typically between 1 and 8, and for serving models in the 7B to 40B range. NVIDIA reports around 1.2 times faster Stable Diffusion inference versus A100, reinforcing its edge for generative workloads that mix image and text.
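
To reproduce tokens-per-second figures like these on your own hardware, a minimal vLLM throughput probe looks roughly like this (model name, batch size, and sampling settings are illustrative, and gated models require Hugging Face access):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the benefits of FP8 inference."] * 8  # batch of 8

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across the batch to get aggregate throughput
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s across the batch")
```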

LLM Training and Fine-Tuning

According to NVIDIA, the L40S runs GPT-40B LoRA training about 1.7 times faster than the A100. That positions it well for fine-tuning and smaller model training tasks below 70B parameters. It’s not ideal for large-scale pre-training since it lacks NVLink, which limits multi-GPU communication efficiency.
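
A LoRA fine-tune on a single L40S is commonly set up with Hugging Face PEFT. A minimal sketch, where the model name, rank, and target modules are illustrative choices rather than recommendations:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # illustrative; any causal LM works
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of base weights
```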

Image and Video Generation

L40S handles roughly 100 Stable Diffusion images per minute at 512×512 and about 25 images per minute on SDXL at 1024×1024. Its triple NVENC/NVDEC setup supports real-time video encoding and decoding, making it effective for generative AI services, content pipelines, and streaming workloads that need both compute and media acceleration.
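
Images-per-minute claims are easy to sanity-check with the diffusers library. A minimal SDXL probe, where step count and batch size are illustrative rather than the benchmark settings behind the numbers above:

```python
import time
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

start = time.perf_counter()
images = pipe(
    prompt="a rack of GPUs in a data center, cinematic lighting",
    num_inference_steps=30, num_images_per_prompt=4,
).images
rate = len(images) / (time.perf_counter() - start) * 60
print(f"{rate:.1f} images/min at 1024x1024")
```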

Graphics and Rendering

With 212 TFLOPS of RT Core performance and DLSS 3 support, L40S excels in real-time rendering and simulation tasks. It is well suited for 3D visualization, CAD, and digital twin applications running in Omniverse or similar environments that rely on GPU ray tracing and AI upscaling.

Multi-Modal AI Workloads

The 48 GB memory footprint allows L40S to run text, image, and video models concurrently, creating room for hybrid applications that combine vision and language tasks. This makes it suitable for AI services and industrial simulation environments that merge inference, graphics, and compute within the same node.

Proven Use Cases

  • Production LLM inference for 7B–70B models
  • Image and video generation for Stable Diffusion, SDXL, and video synthesis
  • AI-powered graphics and real-time rendering in Omniverse
  • Multi-modal AI services combining text, image, and video processing
  • Fine-tuning and LoRA customization for existing transformer models

Pricing and Cost Dynamics for NVIDIA L40S

NVIDIA L40S spans a wide price range in 2026. Teams can either buy cards outright or rent them across clouds and marketplaces. The best option depends on utilization, workload egress volume, and reliability requirements.

Direct Purchase Pricing (2026)

Direct purchase makes sense for high, steady utilization. Unit prices typically fall between $7,500 and $10,000 per GPU, with lead times of two to four weeks. Under continuous workloads, breakeven against cloud rental occurs in less than a year when rental rates average $1–$2 per hour. Including power, cooling, and maintenance, total ownership costs sit around $0.20–$0.30 per hour.
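
The breakeven arithmetic is simple enough to sanity-check. Using midpoint assumptions drawn from the figures above:

```python
purchase_usd = 9_000      # midpoint of the $7,500–$10,000 range
opex_per_hr = 0.25        # power, cooling, maintenance (midpoint of $0.20–$0.30)
rental_per_hr = 1.50      # midpoint of the $1–$2/hr rental band

hours = purchase_usd / (rental_per_hr - opex_per_hr)
print(f"Breakeven after {hours:,.0f} GPU-hours "
      f"(~{hours / 24 / 30.4:.1f} months at full utilization)")
# ~7,200 hours, i.e. just under 10 months of continuous use
```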

Cloud Rental Pricing Landscape

Hourly rental pricing varies by provider type. Rates range from $0.32 to $2.24 per hour, with Fluence positioned at $0.94 to $2.04 per hour through a verified Sesterce data center.

  • Budget marketplaces: $0.32–$0.68/hr (Salad, Vast.ai, Novita)
  • Mid-tier specialists: $0.86–$1.29/hr (Runpod, DataCrunch, Civo, DigitalOcean)
  • Enterprise clouds: $1.57–$2.24/hr (AWS, DigitalOcean, Scaleway, Vultr)
  • Fluence (DePIN): $0.94–$2.04/hr via Sesterce, enterprise-grade infrastructure

Cost Comparison: L40S vs Alternatives

| GPU Model | Hourly Price Range (USD) | Relative Performance | Notes |
|---|---|---|---|
| NVIDIA L40S | $0.32 – $2.24 | 50–70% of H100 | Balanced cost and performance |
| A100 80GB | $1.19 – $3.67 | 75–90% of H100 | Strong training GPU with MIG and NVLink |
| H100 80GB | $1.50 – $7.00 | 100% baseline | Highest throughput, premium cost |
| RTX 4090 | $0.34 – $0.65 | 40–50% of L40S | Consumer-grade, limited reliability |

L40S delivers about half to two-thirds of H100 performance at roughly one-third to one-half of the cost, positioning it as the most efficient choice for production inference, fine-tuning, and mixed AI-graphics workloads.

Hidden Costs: Egress and Bandwidth

Network transfer often determines real total cost. Hyperscalers charge around $0.08–$0.12 per GB for data egress, adding $8–$12 to every 100 GB model checkpoint. Fluence charges zero egress fees, while most marketplaces keep rates low or free. Workloads that regularly transfer checkpoints or rendered media save substantially on providers without outbound data costs.
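
The egress line item compounds quickly for pipelines that export artifacts on a schedule. Illustrative monthly numbers, using the per-gigabyte rates above:

```python
checkpoint_gb = 100          # size of one exported model checkpoint
exports_per_month = 20       # e.g. nightly fine-tune runs
rates = {"hyperscaler (low)": 0.08, "hyperscaler (high)": 0.12,
         "zero-egress provider": 0.0}

for tier, per_gb in rates.items():
    print(f"{tier}: ${checkpoint_gb * exports_per_month * per_gb:,.0f}/month")
# 2 TB of monthly egress costs $160–$240 on hyperscalers, $0 where egress is free
```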

Where to Run NVIDIA L40S: Clouds, Marketplaces, and DePIN

Placement choices for NVIDIA L40S split across hyperscalers, specialists, marketplaces, and DePIN. The right venue depends on reliability, egress costs, and how quickly you need capacity. Many teams blend options, using enterprise clouds for regulated workloads and cost-efficient networks for burst or egress-heavy jobs.

Cloud Rental Options Overview

Hyperscalers like AWS offer SLA-backed reliability and ecosystem depth. Specialists such as DataCrunch and Lambda focus on performance and price. Marketplaces including Vast.ai, Runpod, Salad, and Novita provide flexible inventory at lower rates. DePIN options like Fluence connect you to independent enterprise-grade data centers with transparent pricing and no lock-in. Mid-tier clouds such as DigitalOcean, Vultr, Scaleway, Civo, and UpCloud round out the landscape.

Comparison: Where to Run L40S

| Provider | Rental per Hour (USD) | GPU Type | Reliability | Egress Fees | Best Fit / Use Case |
|---|---|---|---|---|---|
| Fluence | $0.94 – $2.04 | Data center | High (verified providers) | Zero | Production AI, egress-heavy workloads, cost-conscious teams |
| AWS (g6e.xlarge) | $1.86 | Data center | High | Yes ($0.08–$0.12/GB) | Enterprise, compliance, AWS ecosystem |
| DataCrunch | $0.91 | Data center | High | Low | Training, research, cost-optimized production |
| Runpod | $0.86 | Mixed | Variable | Low | Dev, test, burst workloads |
| Vast.ai | $0.68 | Mixed (consumer + data center) | Variable | Yes | Experimentation, spot workloads |
| Salad | $0.32 | Consumer-grade | Variable | Yes | Budget dev/test, non-critical workloads |

Fluence pricing spans $0.94 to $2.04 per hour through a verified Sesterce provider and stands out with zero egress fees. AWS provides predictable performance and integration with managed services. Marketplaces and specialists trade some consistency for lower pricing and faster access to capacity.

Deployment Models

Virtual machines provide full OS control and persistent storage on most providers, including Fluence. Containers enable fast startup and orchestration, with Fluence containers available from $1.27 per hour. Bare metal removes virtualization overhead for maximum performance. Spot or on-demand choices lower cost at the expense of preemption risk, so match the model to your tolerance for interruptions.
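
If you take the spot discount, build for interruption: checkpoint frequently and resume idempotently. A minimal PyTorch pattern, where the path and cadence are illustrative:

```python
import os
import torch

CKPT = "/persistent/ckpt.pt"   # illustrative path on persistent storage

def save(step, model, optim):
    torch.save({"step": step, "model": model.state_dict(),
                "optim": optim.state_dict()}, CKPT)

def resume(model, optim):
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cuda")
        model.load_state_dict(state["model"])
        optim.load_state_dict(state["optim"])
        return state["step"]
    return 0

# In the training loop: start = resume(...); call save() every N steps so a
# preempted spot instance loses at most N steps of work on restart.
```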

Fluence as an Option for NVIDIA L40S

Fluence adds a DePIN path for running NVIDIA L40S at production quality. It connects buyers to independent, enterprise-grade data centers through a marketplace model. Users get a web console and an API, transparent hourly pricing, and no lock-in or egress fees. This structure targets teams that want predictable costs without moving into long-term contracts.


Fluence Platform Overview

Fluence operates as a decentralized GPU marketplace. Supply comes from verified data center providers, not consumer hardware. The platform exposes pricing clearly, supports automation through API access, and avoids punitive data transfer charges. The result is a simpler path to budget control for workloads that move data frequently.

L40S Availability on Fluence

NVIDIA L40S is available from $0.94 to $2.04 per hour via a Sesterce provider. Capacity spans seven global locations across the United States, Europe, and Asia-Pacific. The current list includes Culpeper, Des Moines, and Houston in the US, Helsinki, Paris, and Warsaw in Europe, and Melbourne in Asia-Pacific. Instances come in multiple configurations, typically 8 to 22 vCPU and 48 to 147 GB RAM. Deployments run as VMs with custom OS images and full root access.

Economics

Rates from $0.94 per hour sit competitively against mid-tier specialists. Zero egress fees remove a major source of cost volatility compared to hyperscalers that charge per gigabyte. Billing uses a three-hour minimum, then hourly increments. Fluence claims up to 80% lower total cost versus hyperscalers once egress is included.

Architecture and Reliability

Supply is drawn from enterprise-grade data centers operated by Sesterce and similar providers. Nodes use data center GPUs, not consumer cards. Providers are verified, uptime is high, and deployments follow a standard VM model. An API supports programmatic provisioning and lifecycle management, which fits CI workflows and multi-region automation.
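
Programmatic provisioning typically reduces to an authenticated REST call. The endpoint and payload below are hypothetical placeholders for illustration, not Fluence's documented API; consult the actual API reference for real URLs and field names:

```python
import os
import requests

BASE = "https://api.example-fluence.dev/v1"   # hypothetical base URL
headers = {"Authorization": f"Bearer {os.environ['FLUENCE_API_KEY']}"}

# Hypothetical payload shape for a single L40S VM
payload = {"gpu_model": "l40s", "gpu_count": 1,
           "region": "eu-helsinki", "image": "ubuntu-22.04"}

resp = requests.post(f"{BASE}/vms", json=payload, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())   # e.g. VM id and connection details
```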

Best Fits for Fluence L40S

Production inference benefits from steady pricing and predictable performance. Egress-heavy pipelines such as checkpoint export and video rendering avoid transfer charges. Cost-conscious teams, research groups, and independent AI builders can scale capacity without long contracts. Teams that prefer decentralized networks and want to avoid hyperscaler lock-in gain an additional deployment option. Multi-region setups are supported across the seven listed locations.

When NVIDIA L40S Is (and Is Not) the Right Choice

NVIDIA L40S covers a wide performance and cost range, but it’s not the right choice for every workload. The decision comes down to priorities: workload type, scaling needs, precision requirements, and cost per token or frame.

Choose L40S When

  • You need multi-modal AI infrastructure that runs text, image, and video workloads together.
  • Your focus is LLM inference at scale, especially for 7B–70B parameter models with batch sizes up to 32.
  • You run image or video generation tasks like Stable Diffusion or SDXL.
  • You handle graphics + AI hybrid workloads such as rendering, Omniverse, or digital twins.
  • You need media acceleration with 3× NVENC/NVDEC engines for encoding and streaming.
  • You are budget-conscious compared to H100 but still need strong inference and rendering performance.
  • You use standard PCIe servers without NVLink infrastructure.
  • You perform fine-tuning or LoRA on pre-trained transformer models.
  • You mostly use single-GPU workloads without distributed training requirements.

Choose A100 When

  • Budget is the primary constraint and you can source discounted A100 capacity, which on some marketplaces undercuts L40S by 20–40%.
  • Workloads are memory-bound, requiring 80 GB of HBM2e and 2 TB/s bandwidth.
  • You need MIG support for running multiple isolated inference tasks.
  • Your models don’t use FP8, making A100’s TF32 or FP16 efficiency sufficient.
  • Your infrastructure is already optimized around Ampere-based systems.

Choose H100 When

  • You need maximum performance for LLM training, with up to 2–3× L40S throughput.
  • You run large-scale multi-GPU clusters that depend on NVLink for fast interconnects.
  • You train with extreme batch sizes or memory-bound transformer models.
  • Budget is not a limiting factor and you can pay a premium for speed.
  • You need confidential computing or MIG multi-tenancy for regulated workloads.

Choose RTX 4090 When

  • You are experimenting or developing models and want the lowest entry cost.
  • You run single-user workloads without 24/7 uptime requirements.
  • You need budget prototyping for testing small-scale ideas or pipelines.

Do Not Choose L40S When

  • You rely on large-scale distributed training, since L40S lacks NVLink.
  • You require MIG-based multi-tenancy or strong workload isolation.
  • You run high-precision scientific computing, which demands FP64.
  • You have extreme memory bandwidth needs, such as 3D simulations or 175B+ parameter models.
  • Your workloads exceed 70B parameters and need multi-GPU scaling.
  • You prioritize maximum performance regardless of cost, where H100 leads decisively.

Rule of Thumb

L40S is a versatile workhorse for inference, fine-tuning, and multi-modal AI that mixes compute and graphics. A100 remains the best value for traditional training and MIG-based sharing. H100 leads in raw throughput for large-scale, high-precision AI. RTX 4090 fits small teams or early-stage development on a tight budget.
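
The rule of thumb condenses to a few branches. An illustrative, deliberately simplified encoding of this article's decision guide (not an exhaustive sizing tool):

```python
def pick_gpu(params_b: float, distributed: bool, needs_mig: bool,
             needs_fp64: bool, graphics_or_media: bool, budget_tight: bool) -> str:
    """Encodes the article's rule of thumb; real sizing needs benchmarks."""
    if needs_fp64 or distributed or params_b > 70:
        return "H100 (or a multi-GPU NVLink system)"
    if needs_mig:
        return "A100"
    if budget_tight and not graphics_or_media and params_b <= 13:
        return "RTX 4090 (dev/prototyping only)"
    return "L40S"

print(pick_gpu(params_b=70, distributed=False, needs_mig=False,
               needs_fp64=False, graphics_or_media=True, budget_tight=False))
# -> L40S
```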

Conclusion

NVIDIA L40S lands in 2026 as the universal GPU sweet spot. It combines AI, graphics, and media acceleration on a single card, giving teams flexibility that previously required multiple hardware types. With 50–70% of H100’s performance at roughly 30–50% of the cost, it delivers strong value for production inference, fine-tuning, and hybrid AI workloads.

Hourly pricing varies from $0.32 to $2.24, depending on provider and reliability tier. Fluence positions itself at $0.94–$2.04 per hour through enterprise-grade data centers, offering zero egress fees and transparent DePIN-based pricing. That combination makes it especially practical for inference and egress-heavy workloads where hyperscaler data transfer costs can quickly dominate total spend.

L40S fits teams running production inference, multi-modal AI, and image or video generation at scale. It’s not designed for large training clusters or MIG-based multi-tenancy but excels in single-node deployments where cost, speed, and availability must align.
