TL;DR
- You can run Gemma 4 E4B on a single RTX A6000 GPU VM without relying on a hyperscaler stack
- A simple stack of Fluence GPU VM + Ollama + Open WebUI is enough for a working browser-based chat
- Ollama handles model serving, while Open WebUI provides a clean interface for testing and demos
- SSH tunneling gives you secure local access without exposing ports publicly
- Renting a GPU only when needed is often more practical than forcing large models onto a laptop
- The setup is reproducible: any developer can follow the same steps and end up with the same working stack
Running modern open models sits in an awkward middle ground right now. Smaller variants run locally without much friction, but once you move up in capability, you quickly run into memory ceilings, inconsistent performance, and thermal limits on a typical laptop. That gap becomes obvious the moment you try to move from a quick “it works” test to something you can actually demo or share.
Gemma 4 is interesting because it spans that boundary. You can run parts of the model family locally, but the experience varies heavily depending on your hardware. If you want something stable, repeatable, and closer to a real deployment, a local machine quickly stops being the best reference environment.
In this walkthrough, I deploy Gemma 4 E4B on a rented RTX A6000 GPU VM, serve it with Ollama, connect it to Open WebUI, and access everything through a browser. The goal is not just to make it run, but to show a setup that feels practical, reproducible, and usable for real testing or demos.
Why Gemma 4 E4B is the right model for this demo
Gemma 4 E4B strikes a practical balance between capability and deployability. It is large enough to produce solid, useful responses, but still fits comfortably on a single RTX A6000 without requiring model sharding or multi-GPU setup. That keeps the deployment simple and predictable.
From an operational standpoint, E4B avoids common constraints like VRAM pressure and unstable latency. It runs cleanly through Ollama with minimal tuning, which means less time spent debugging memory limits and more time validating the actual setup.
It also makes sense from a cost perspective. Larger models would push this into higher-cost or multi-GPU territory, while smaller ones would weaken the realism of the demo. E4B sits in a range where a single rented GPU is enough for interactive use without overprovisioning.
With the model choice set, the next step is to address whether a local setup would have been sufficient in the first place.
Why not just run it locally?
Running Gemma 4 locally works for quick validation, but it becomes unreliable when consistency and usability matter. Hardware differences directly affect memory limits, load times, and inference speed, which makes it difficult to present a setup that others can reproduce with similar results. A GPU VM provides a fixed environment where performance and behavior are predictable.
Local machines also run into tighter VRAM and system memory constraints. Even if the model runs, it often requires quantization or tuning that impacts output quality or responsiveness. Under sustained use, latency can fluctuate, especially when the system is handling other tasks, which makes the experience less stable for demos or repeated testing.
Access is another limitation. A local setup is tied to a single device unless additional networking steps are introduced. That adds unnecessary complexity when the goal is to expose a simple browser-based interface that behaves like a hosted tool.
In practice, the trade-offs look like this:
- Local setups are fast for initial testing, but inconsistent across different machines
- Memory and performance constraints can affect output quality and latency
- Sharing access requires extra networking steps and setup
- A GPU VM provides a stable, reproducible, and browser-accessible environment
With those constraints in mind, the next step is to look at the deployment stack itself.
What this deployment stack looks like
This setup runs entirely on a single GPU VM, with Ollama serving the model and Open WebUI providing browser access. Keeping everything on one machine avoids unnecessary networking complexity and makes the deployment easier to manage.
At a high level:
GPU VM → Ollama → Gemma 4 E4B → Open WebUI → Browser
Ollama handles model execution and exposes a local API, while Open WebUI connects to it and renders a chat interface. Both run on the same VM, so communication stays local and predictable.
Access is handled through SSH tunneling instead of exposing public ports. This keeps the service private while still allowing browser-based interaction from a local machine, without needing to configure firewalls or TLS.
The result is a minimal, single-node architecture with fewer failure points and straightforward debugging.
With the stack defined, the next step is the step-by-step deployment.
Step-by-step walkthrough
You will:
- launch a Fluence GPU VM with an RTX A6000
- connect to it over SSH
- install Ollama
- pull Gemma 4 E4B
- run Open WebUI in Docker
- access the model from your browser through a simple chat UI
This is a good fit for Gemma 4 E4B because Google lists E4B at roughly 5 GB at Q4, 7.5 GB at 8-bit, and 15 GB at BF16 just for model load, while Ollama’s gemma4:e4b package is a 9.6 GB quantized model. That leaves comfortable headroom within the RTX A6000's 48 GB of VRAM.
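The headroom claim can be checked with a quick back-of-the-envelope calculation. The sketch below is only an approximation: it subtracts the model size from total VRAM and ignores KV cache, context length, and runtime overhead, so real usage will be higher. The function name headroom_gb is a hypothetical helper, not part of Ollama or any tool used here.

```shell
# Rough VRAM headroom estimate: total GPU memory minus model size, in GB.
# Assumption: ignores KV cache, context length, and runtime overhead.
headroom_gb() {
  # $1 = total VRAM in GB, $2 = model size in GB
  awk -v total="$1" -v model="$2" 'BEGIN { printf "%.1f\n", total - model }'
}

# Figures quoted in this walkthrough: 48 GB card, 9.6 GB gemma4:e4b package.
echo "Approximate headroom: $(headroom_gb 48 9.6) GB"
```

Swapping in the 15 GB BF16 figure shows that even the unquantized load size would leave most of the card free.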
What you need before starting
You need:
- a Fluence account with enough balance for a GPU VM
- a working SSH key already attached in Fluence
- a Mac with Terminal
- the ability to log in to the GPU VM as ubuntu
- a Fluence GPU VM using Ubuntu 22.04
- Docker available on the VM, or permission to install it
Step 1: Launch the GPU VM in Fluence

In Fluence Console:
- create a GPU VM
- choose RTX A6000
- choose 1 GPU
- select Ubuntu Server 22.04
- make sure your working SSH key is attached
- launch the VM
- wait until the status becomes Active
Once the VM is active, copy the public IP address from the instance details page.
Step 2: Connect to the VM over SSH
On your Mac, open Terminal and connect with:
ssh -i ~/.ssh/YOUR_KEY -o IdentitiesOnly=yes ubuntu@YOUR_GPU_IP
Example, using an ed25519 key (replace YOUR_GPU_IP with the public IP you copied in Step 1):
ssh -i ~/.ssh/id_ed25519 -o IdentitiesOnly=yes ubuntu@YOUR_GPU_IP
If SSH asks whether you trust the host, type:
yes
If your key has a passphrase, enter it.
When login succeeds, you should land at a prompt that looks like:
ubuntu@server-name:~$
Step 3: Verify the GPU is visible
Run:
nvidia-smi
You want to see NVIDIA driver information and the RTX A6000 listed.
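If you prefer a scripted check over reading the full nvidia-smi table, something like the following may help. It is a minimal sketch: is_a6000 is a hypothetical helper that simply looks for the substring "RTX A6000" in the reported GPU name, and the query only runs when nvidia-smi is installed.

```shell
# Return success if a CSV line from nvidia-smi names an RTX A6000.
# Assumption: the driver reports a name containing "RTX A6000".
is_a6000() {
  case "$1" in
    *"RTX A6000"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_line="$(nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader | head -n 1)"
  if is_a6000 "$gpu_line"; then
    echo "OK: $gpu_line"
  else
    echo "Unexpected GPU: $gpu_line" >&2
  fi
fi
```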

Step 4: Install Ollama
Install Ollama with:
curl -fsSL https://ollama.com/install.sh | sh

Then verify the install:
ollama --version
Ollama provides an official Linux installer and exposes its local API automatically once the service is running.
Step 5: Pull Gemma 4 E4B

Download the model:
ollama pull gemma4:e4b
Then verify it is present:
ollama ls
You should see:
gemma4:e4b
Ollama’s model library includes the gemma4:e4b tag and lists it as a 9.6 GB model.
Step 6: Run a quick terminal test
Start the model in Ollama:
ollama run gemma4:e4b
Then send a simple prompt:
Explain in simple terms what Gemma 4 E4B is.

If the model answers, inference is working.
To leave the interactive chat and go back to the shell, press:
Ctrl + d
Step 7: Confirm the Ollama service is running
Check the background service:
sudo systemctl status ollama
If it opens a full-screen viewer, press:
q
to return to the prompt.
What you want to see is:
active (running)
Now test the local Ollama API:
curl http://127.0.0.1:11434/api/tags
You should get JSON back, and gemma4:e4b should appear in the results. Ollama’s API is available on port 11434.
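Beyond listing tags, the same API can be used for a one-off, non-interactive prompt. The sketch below builds a request body for Ollama's /api/generate endpoint and sends it only when asked. The build_payload helper and the RUN_REQUEST and OLLAMA_URL variables are illustrative names introduced here, and the helper does no JSON escaping, so it assumes a prompt without double quotes or backslashes.

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# Assumption: the prompt contains no double quotes or backslashes (no escaping here).
build_payload() {
  # $1 = model tag, $2 = prompt text
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

OLLAMA_URL="${OLLAMA_URL:-http://127.0.0.1:11434}"
payload="$(build_payload "gemma4:e4b" "Explain in simple terms what Gemma 4 E4B is.")"

# Send the request only when explicitly asked; otherwise just print the body.
if [ "${RUN_REQUEST:-0}" = "1" ]; then
  curl -s "$OLLAMA_URL/api/generate" -d "$payload"
else
  echo "$payload"
fi
```

Run it on the VM with RUN_REQUEST=1 set to get a single JSON reply instead of a streamed one.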
Step 8: Verify Docker is available
Check Docker:
docker --version
sudo docker ps
If Docker is not installed, install it with:
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
Then test it:
sudo docker run hello-world
Docker’s Ubuntu install docs allow this convenience-script path for development and test setups.
Step 9: Start Open WebUI
The most reliable setup here is to run Open WebUI with host networking and explicitly point it at Ollama on 127.0.0.1:11434.
First remove any old container with the same name:
sudo docker rm -f open-webui
Then start Open WebUI:
sudo docker run -d \
--network=host \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Now verify the container is up:
sudo docker ps
You should see open-webui running.
This pattern matches Open WebUI’s Docker guidance, and host-network mode is the clean fallback when container-to-host Ollama networking is unreliable.
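The container can take a little while to start answering requests after docker run returns. One way to wait for it, sketched here under the assumption that Open WebUI answers plain HTTP on the VM's port 8080, is a small retry helper. The retry function is a generic illustration, not something shipped with Docker or Open WebUI.

```shell
# Retry a command up to N times, sleeping between attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  retry_attempts="$1"
  retry_delay="$2"
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Example (run on the VM): wait up to about a minute for the UI to respond.
# retry 30 2 curl -fsS -o /dev/null http://127.0.0.1:8080
```

If the loop gives up, sudo docker logs open-webui is usually the next place to look.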
Step 10: Create an SSH tunnel from your Mac
Because Open WebUI is running in host network mode, it is listening on the VM’s port 8080.
Open a second Terminal window on your Mac and run:
ssh -i ~/.ssh/YOUR_KEY -o IdentitiesOnly=yes -L 3000:127.0.0.1:8080 ubuntu@YOUR_GPU_IP
Example, using an ed25519 key (replace YOUR_GPU_IP with your VM's public IP):
ssh -i ~/.ssh/id_ed25519 -o IdentitiesOnly=yes -L 3000:127.0.0.1:8080 ubuntu@YOUR_GPU_IP
Leave that tunnel window open.
Step 11: Open Open WebUI in your browser
On your Mac, open:
http://localhost:3000
Create your account and sign in. In Open WebUI, the first account becomes the admin account.
Step 12: Select Gemma 4 E4B and send your first prompt
Open the model dropdown and choose:
gemma4:e4b
Then enter a test prompt:
Explain in simple terms what Gemma 4 E4B is.
If the model responds, the full stack is working:
- Fluence RTX A6000
- Ollama
- Gemma 4 E4B
- Open WebUI
Troubleshooting
1. Open WebUI loads, but no model appears
Run on the VM:
curl http://127.0.0.1:11434/api/tags
ollama ls
If both show gemma4:e4b, Ollama is fine and the issue is the UI connection.
In this runbook, the --network=host container plus:
OLLAMA_BASE_URL=http://127.0.0.1:11434
is the intended fix.
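The Ollama side of this check can be scripted so it gives a clear pass or fail. This is a rough sketch: has_model is a hypothetical helper that only looks for the quoted tag anywhere in the API response, which is enough for a quick diagnosis but is not a real JSON parser.

```shell
# Succeed if the text on stdin contains the given model tag in double quotes.
has_model() {
  grep -qF "\"$1\""
}

# Query the local Ollama API only when curl is installed.
if command -v curl >/dev/null 2>&1; then
  if curl -s --max-time 5 http://127.0.0.1:11434/api/tags | has_model "gemma4:e4b"; then
    echo "Ollama lists gemma4:e4b; check the Open WebUI connection next."
  else
    echo "gemma4:e4b not returned by Ollama; pull the model or check the service." >&2
  fi
fi
```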
2. Docker says the container name is already in use
Run:
sudo docker rm -f open-webui
Then rerun the docker run command. Open WebUI’s update docs use this exact pattern while preserving the persistent data volume.
3. The browser cannot connect
Make sure the SSH tunnel window is still open. If you close it, http://localhost:3000 stops working.
4. systemctl status fills the whole screen
Press:
q
That exits the viewer.
5. ollama run takes over the terminal
Press:
Ctrl + d
to leave the interactive chat and return to the shell.
Command summary
On the Fluence GPU VM
nvidia-smi
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
ollama pull gemma4:e4b
ollama ls
ollama run gemma4:e4b
sudo systemctl status ollama
curl http://127.0.0.1:11434/api/tags
docker --version
sudo docker ps
sudo docker rm -f open-webui
sudo docker run -d \
--network=host \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
sudo docker ps
On your Mac
SSH into the server:
ssh -i ~/.ssh/YOUR_KEY -o IdentitiesOnly=yes ubuntu@YOUR_GPU_IP
Open the SSH tunnel in a second Terminal window:
ssh -i ~/.ssh/YOUR_KEY -o IdentitiesOnly=yes -L 3000:127.0.0.1:8080 ubuntu@YOUR_GPU_IP
Then open in your browser:
http://localhost:3000
Shutdown and cleanup
To stop the UI:
sudo docker rm -f open-webui
To stop the loaded model:
ollama ps
ollama stop gemma4:e4b

To stop GPU billing, terminate the GPU VM in Fluence Console. Fluence bills GPU instances hourly and reserves runtime balance up front.
Why Fluence for this deployment
Fluence works well for this setup because it gives direct, low-friction access to GPU infrastructure via its GPU marketplace without forcing a managed platform model. The result is a deployment that behaves like a normal Linux environment, which makes it easier to install, run, and debug the stack without adapting to provider-specific abstractions.
Instead of navigating a complex cloud service, the workflow stays straightforward: provision a GPU VM, connect over SSH, and run standard tools like Ollama and Docker. This keeps the focus on the model and serving layer rather than the platform itself.
Why Fluence fits this use case:
- Direct GPU access: Provision an RTX A6000 and work in a standard Ubuntu environment
- No platform lock-in: No required APIs, SDKs, or managed inference layers
- Familiar workflow: SSH, Docker, and system tools behave exactly as expected
- On-demand cost control: Start and stop the GPU only when needed
- Full control of the stack: Configure networking, services, and runtime without restrictions
The trade-off is that infrastructure responsibility stays with you. That includes SSH access, container setup, and basic service management. For a single-node deployment like this, that trade-off keeps the system simple and transparent.
Cost-efficiency comparison across RTX A6000 providers
| Provider | vCPU | Memory | Storage | Price (US$/h) | Daily cost (US$/day) | Egress fees |
|---|---|---|---|---|---|---|
| Fluence | 6 vCPUs | 24 GB | 256 GB SSD | $0.468 | $11.23 | None |
| Massed Compute (Alt config) | 6 vCPUs | 32 GB | 256 GB SSD | $0.55 | $13.20 | None |
| Koyeb | 6 vCPUs | 64 GB | 160 GB SSD | $0.75 | $18.00 | Free up to 100 GB/month, then $0.04/GB |
| X-Cloud (customisable) | 6 vCPUs | 24 GB | 256 GB NVMe | $1.485 | $35.64 | None |
| Paperspace (DigitalOcean) | 8 vCPUs | 45 GB | 50 GB SSD | $1.89 | $45.36 | None |
The table shows a clear spread in hourly pricing for similar RTX A6000 setups. Fluence sits at the lower end at around $0.468/h, followed by Massed Compute at $0.55/h. Koyeb moves into a higher tier at $0.75/h, while X-Cloud and Paperspace (DigitalOcean) are significantly more expensive at approximately $1.485/h and $1.89/h.
For short Gemma 4 E4B sessions, this difference directly affects how instances are used. A 1 to 2 hour test cycle costs roughly $0.50 to $1 on Fluence, compared to $2 to $4 on Paperspace. Over repeated iterations, this gap compounds quickly.
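The session figures follow from multiplying the hourly rate by the session length. The sketch below reproduces that arithmetic using the rates quoted above ($0.468/h for Fluence and $1.89/h for Paperspace). The session_cost helper is illustrative, and actual bills may differ because of minimum billing increments or reserved balance.

```shell
# Cost of a session: hourly rate (US$/h) times hours, rounded to cents.
session_cost() {
  # $1 = hourly rate in US$, $2 = session length in hours
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Hourly rates as quoted in this comparison.
for hours in 1 2 24; do
  echo "${hours}h  Fluence: \$$(session_cost 0.468 "$hours")  Paperspace: \$$(session_cost 1.89 "$hours")"
done
```

The 24-hour row matches the daily cost column in the table, which is a quick way to sanity-check the other providers as well.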
The main cost drivers across these providers are:
- Base hourly pricing: Fluence and Massed Compute enable low-cost, short-lived sessions. X-Cloud and Paperspace make short runs comparatively expensive.
- Bandwidth policy: Fluence, Massed Compute, X-Cloud, and Paperspace include free ingress and egress. Koyeb introduces variable cost beyond 100 GB per month.
- Pricing structure: Fluence, Massed Compute, and Koyeb use straightforward instance pricing. X-Cloud’s component model requires combining GPU, CPU, and storage costs.
- Resource fit: Fluence’s 6 vCPU and 24 GB RAM configuration closely matches Gemma 4 E4B requirements. Koyeb and Paperspace allocate more memory and CPU, which increases cost without clear benefit for this workload.
In practice, efficiency comes from running the GPU only when needed. Lower and simpler pricing, as seen with Fluence, makes it practical to spin up an instance, complete a task, and shut it down immediately without needing to optimize around session length or hidden costs.
Local vs GPU cloud: what matters in practice
Local inference is best for quick experimentation, but it becomes limiting once consistency and usability matter. GPU cloud environments provide stable performance, predictable resources, and a setup that can be accessed and reused without depending on a specific machine.
The difference shows up in day-to-day use. Local setups are tied to available memory and competing system workloads, which can affect latency and reliability. GPU VMs isolate those variables, keeping performance steady and making results easier to reproduce across sessions.
There is also a difference in how the system is used. A local model often stays in the terminal or behind ad hoc tooling, while a GPU-backed setup with Open WebUI behaves more like a hosted application. That makes it easier to demo, share, or extend into something closer to a real service.
In practice:
- Local is better for fast, low-effort experimentation
- GPU cloud is better for stable, repeatable, and shareable setups
- Browser-based access makes the model easier to use and present
- Isolated resources remove variability from performance and behavior
The practical takeaway is simple: local setups are useful for trying models, while GPU VMs are better when the goal is to run them in a controlled, usable environment.
Who this setup is best for
This setup is best suited for developers and teams who want a controlled, reproducible way to run open models without committing to dedicated hardware. It works particularly well when the goal is to move beyond quick local tests and into something that behaves like a usable, shareable tool.
It fits a few common scenarios:
- Developers testing and evaluating open models in a stable environment
- Builders who want a browser-based interface instead of terminal interaction
- Teams that need short-lived GPU access without purchasing hardware
- Projects that require more control than a hosted API provides
It is less suited for fully productionized systems that need autoscaling, high availability, or managed endpoints. In those cases, additional infrastructure would be required on top of this baseline.
For its intended use, this setup provides a clean middle ground between local experimentation and more complex cloud deployments.
Conclusion
Gemma 4 E4B is straightforward to deploy when the environment is controlled. Running it on a single RTX A6000 removes most of the friction seen in local setups, while still keeping the infrastructure simple. With Ollama handling inference and Open WebUI providing a browser interface, the stack stays minimal but usable.
The key trade-off is control versus convenience. This approach avoids platform constraints and keeps everything transparent, but it also requires handling basic infrastructure details like SSH access, container setup, and networking. For a single-node deployment, that trade-off keeps the system predictable and easy to debug.
The broader takeaway is practical: local setups are useful for quick tests, but a rented GPU makes it easier to build something stable, repeatable, and shareable. That makes it a better fit for demos, evaluations, and early-stage prototypes.
A useful next step is to extend this setup. Add a reverse proxy for external access, connect the model to an API workflow, or test a larger Gemma variant on a higher-tier GPU to understand how the architecture scales.