Running Uncensored LLMs on a DigitalOcean GPU Droplet

Posted by Michael S. on April 6, 2026

I wanted to run open-source LLMs on my own rented GPU in the cloud. Uncensored ones, and the current state-of-the-art. Here's everything I learned about doing it on DigitalOcean, including why I almost picked NVIDIA but went with AMD instead.


Why Not Run Locally?

I'm on a 2019 MacBook Pro with 16GB RAM and an AMD 5500M (4GB VRAM). That's enough for tiny models (7-8B quantized), but the 27-31B models I want need 20-48GB of VRAM. Cloud GPU it is.

If you have a newer Apple Silicon Mac with 32GB+ unified memory, you can skip the cloud entirely. Just download Ollama or LM Studio and run models directly. No drivers, no setup, no GPU to rent. But for bigger models or longer context windows, you need a real datacenter GPU.


The Models

I wanted to download and run two models, plus keep a third around for reference. All three are reported to work on the AMD MI300X with ROCm.

1. Gemma 4 31B Uncensored (Google)

Google released Gemma 4 in April 2026. The 31B dense model is the one everyone makes uncensored fine-tunes of. Max context: 256K tokens. Apache 2.0 license.

Model     Total Params   Effective/Active   Architecture
E2B       5.1B           ~2.3B effective    Dense, per-layer embeddings
E4B       8B             ~4.5B effective    Dense, per-layer embeddings
26B A4B   26B            3.8B active        MoE
31B       31B            31B                Dense (the one everyone fine-tunes)

The community makes "uncensored" versions using abliteration — a weight-editing technique that identifies the "refusal direction" in the model's activation space and removes it. No retraining needed.
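
To make "removing a direction" concrete, here's a toy sketch that projects one vector out of a small weight matrix, so the layer's output can no longer point along it. The matrix, the "refusal" vector, and the function names are all made up for illustration. Real abliteration tools estimate the direction from activations on contrasting prompts and edit many layers, which this doesn't attempt.

```python
import math

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(weight, direction):
    """Return W' = W - r (r^T W), where r is the unit vector along `direction`.

    weight is a list of rows (out_dim x in_dim); direction has length out_dim.
    """
    r = unit(direction)
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r
    writes_along_r = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - r[i] * writes_along_r[j] for j in range(cols)]
            for i in range(rows)]

def matvec(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2x2 layer and "refusal direction", chosen only for the demo
W = [[1.0, 2.0], [3.0, 4.0]]
refusal = [3.0, 4.0]
W_ablated = ablate_direction(W, refusal)

x = [0.5, -1.5]
before = dot(matvec(W, x), unit(refusal))         # component along the direction
after = dot(matvec(W_ablated, x), unit(refusal))  # should be ~0 after ablation
print(f"before: {before:.3f}, after: {after:.3f}")
```

The point is only that this is a linear edit to existing weights, which is why no retraining is involved.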

Which uncensored variant?

I looked at three options:

  1. 199-biotechnologies/gemma-4-abliterated — The most careful, quality-preserving abliteration. Uses a conservative weight factor of 1.0, cross-validated against 686 prompts. The problem: it's MLX format only (for Apple Silicon Macs). No GGUF, so you can't use it with ollama without converting it yourself.
  2. DavidAU's HERETIC uncensored — Aggressive de-censoring using the HERETIC method. Available as GGUF on HuggingFace, so it's ready for ollama. But there's a catch: raw HuggingFace GGUFs for Gemma 4 ship with a broken chat template (wrong turn delimiters), causing the model to output --- on repeat instead of actual responses.
  3. pmarreck/gemma4-heretical — This is the one I'm using. It wraps the DavidAU HERETIC model but fixes the chat template bug and sets up ollama correctly with one command. It uses ollama's built-in RENDERER gemma4 / PARSER gemma4 to handle the template properly. Requires ollama 0.20.0+.

Why pmarreck? It's the only option that combines aggressive uncensoring (HERETIC) + correct ollama integration (chat template fix) + easy setup (one script). The 199-bio version is higher quality abliteration but isn't in GGUF format. DavidAU's raw GGUF works but has the template bug. pmarreck solves both problems.

2. Qwen 3.5 27B (Alibaba)

Why 3.5 and not 3.6 or 3.7? Because they don't exist as open-weight models. Qwen 3.6 Plus is closed-source (API-only via OpenRouter/Alibaba Cloud) — you can't download the weights or self-host it. Qwen 3.7 hasn't been released. The latest self-hostable Qwen is 3.5.

Why 3.5 27B instead of Qwen3 32B? Qwen 3.5 is newer (February 2026 vs April 2025), has a much larger context window (256K native, extensible to 1M vs 32K native/131K with YaRN), and is the current flagship open-weight Qwen model. The 27B size is slightly smaller than Qwen3 32B, which means slightly less VRAM for weights and more room for KV cache.

Available on Ollama: qwen3.5:27b-q4_K_M (17 GB download).

Caveat: Qwen 3.5 is natively multimodal (text + vision), and the vision component has known issues in ollama with mmproj files. For text-only chat it works fine — the vision issues don't affect text generation. Tool calling also has bugs in the current ollama version, so if you need agentic features, use vLLM instead.

AMD has official Day 0 support for Qwen models on MI300X — this isn't a community hack, AMD actively tests and optimizes for it.

3. GLM-5 (Z.ai / Zhipu AI) — Current #1 on Chatbot Arena (for reference)

GLM-5 holds the highest Chatbot Arena rating (1451 Elo) as of April 2026. MIT license, no usage restrictions.

  • 744B total parameters, 40B active (MoE architecture)
  • Full fp16: 1.65 TB on disk
  • 2-bit quant: 241 GB — too large for 192GB VRAM
  • 1-bit quant: 176 GB — fits on MI300X, but quality is heavily degraded

I'm including this for reference, but I won't run it day-to-day. A 744B model crushed to 1-bit is like photocopying a photocopied photocopy — you lose a lot. A well-quantized Q5 of the 27-31B models above will likely produce better actual output.


Storage & VRAM Requirements

Model                   Quant    Disk Space   VRAM (weights only)   Max Context
Gemma 4 31B (HERETIC)   Q4_K_M   ~19 GB       ~20 GB                256K tokens
Gemma 4 31B (HERETIC)   Q5_K_M   ~22 GB       ~24 GB                256K tokens
Qwen 3.5 27B            Q4_K_M   17 GB        ~18 GB                256K native / 1M extended
GLM-5 (744B)            1-bit    176 GB       ~176 GB

Total disk space for the two models I'll actually use: ~36-39 GB. The MI300X droplet has 720GB boot + 5TB scratch, so storage is a non-issue.

VRAM Isn't Just Model Weights — There's the KV Cache

This is the thing that tripped me up. When an LLM processes your conversation, it stores intermediate calculations (key-value pairs from every attention layer) for every token in the context. This is the KV cache, and it grows with your conversation length. The longer the chat, the more VRAM it eats — on top of the model weights.

Rough KV cache estimates for a 32B model:

Context Length   KV Cache     + Model (Q4)   Total VRAM
8K tokens        ~2-4 GB      ~20 GB         ~24 GB
32K tokens       ~8-16 GB     ~20 GB         ~36 GB
128K tokens      ~30-50 GB    ~20 GB         ~60 GB
256K tokens      ~60-100 GB   ~20 GB         ~100 GB

This is why GPU choice matters so much. At 128K context, a 48GB NVIDIA card would run out of memory with a single model. On the 192GB MI300X, you have room to spare.
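
If you want to sanity-check estimates like these yourself, the usual back-of-the-envelope formula is 2 (keys and values) x layers x KV heads x head dimension x bytes per value x tokens. Here's a sketch of that. The layer count, head count, and head size below are hypothetical placeholders for a 32B-class model, not numbers from any model card, so plug in the real config before trusting the output.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` tokens.

    The leading 2 covers one key vector plus one value vector
    per layer, per KV head, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# ASSUMPTION: made-up 32B-class config with a 16-bit cache, for illustration only
ASSUMED = dict(layers=64, kv_heads=8, head_dim=128, bytes_per_value=2)

def kv_cache_gib(tokens):
    return kv_cache_bytes(tokens, **ASSUMED) / GIB

for context in (8 * 1024, 32 * 1024, 128 * 1024, 256 * 1024):
    print(f"{context // 1024}K tokens: {kv_cache_gib(context):.0f} GiB of KV cache")
```

With these placeholder values the cache costs 256 KiB per token, which lands at the low end of the ranges in the table. Models with more KV heads, or architectures that mix in sliding-window layers, will come out differently.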

Can I Run Both Models Simultaneously?

Yes — and this is a major reason I picked the MI300X. With 192GB of VRAM:

  • Gemma 4 31B Q4: ~20 GB
  • Qwen 3.5 27B Q4: ~18 GB
  • Both loaded: ~38 GB — leaves 154 GB for KV cache

That 154GB of headroom means both models can be loaded in VRAM at once, and you still have massive room for long context windows on whichever one you're actively chatting with. Ollama keeps models in VRAM for 5 minutes after last use by default, so switching between them is instant — no loading, no waiting.

On a 48GB NVIDIA card, you'd have to unload one model to load the other, and even a single model would hit the VRAM ceiling at ~60-100K context.
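
The headroom arithmetic is simple enough to script if you want to try other cards or model combinations. This sketch uses the weight estimates from this post and ignores runtime overhead (compute buffers, the framework itself), so treat the results as upper bounds.

```python
def kv_headroom_gb(total_vram_gb, loaded_weights_gb):
    """VRAM left for KV cache once every listed model's weights are loaded."""
    return total_vram_gb - sum(loaded_weights_gb)

# Weight estimates from this post, in GB
WEIGHTS_GB = {"Gemma 4 31B Q4": 20, "Qwen 3.5 27B Q4": 18}

for card, vram_gb in (("MI300X", 192), ("48GB card", 48)):
    left = kv_headroom_gb(vram_gb, WEIGHTS_GB.values())
    print(f"{card}: {left} GB left for KV cache with both models loaded")
```

On the MI300X that's the 154 GB figure. On a 48GB card both sets of weights technically fit, but only about 10 GB remains, which is not much room for context.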

TurboQuant: Stretching Context Even Further

Google released TurboQuant in March 2026 — a technique that compresses the KV cache itself down to 3-4 bits with negligible quality loss. This gives a 4-6x reduction in KV cache memory.

What that means in practice: if 128K context normally needs ~40GB of KV cache, TurboQuant could bring that down to ~7-10GB. That would let you run 256K+ context on a single 32B model even on a 48GB card — or run both models with long context on the MI300X.
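
Here's that arithmetic as a quick sketch. The 40 GB baseline and the 4-6x range are the figures quoted above. The bit-width ratio assumes the uncompressed cache is 16-bit and ignores any per-block metadata, which is my simplification, not something from Google's write-up.

```python
def compressed_kv_gb(baseline_gb, reduction_factor):
    """KV cache size after applying a given reduction factor."""
    return baseline_gb / reduction_factor

def reduction_from_bits(baseline_bits, compressed_bits):
    """Ideal reduction from bit width alone (no metadata overhead)."""
    return baseline_bits / compressed_bits

baseline_gb = 40  # KV cache at 128K context, the estimate used above
best = compressed_kv_gb(baseline_gb, 6)
worst = compressed_kv_gb(baseline_gb, 4)
print(f"{baseline_gb} GB -> {best:.1f}-{worst:.1f} GB at 4-6x")

# ASSUMPTION: 16-bit baseline cache
print(f"16 -> 4 bits: {reduction_from_bits(16, 4):.1f}x")
print(f"16 -> 3 bits: {reduction_from_bits(16, 3):.1f}x")
```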

Does it work on AMD? Yes. The community has implemented fused kernels that run on both NVIDIA CUDA and AMD ROCm without code changes. There's also a llama.cpp PR with CUDA support and a vLLM plugin. Google's official implementation is expected Q2 2026, but community implementations are already working. It's still bleeding-edge, but by the time you read this it may be standard.


The Droplet: AMD Instinct MI300X

Why AMD Over NVIDIA?

I almost went with the NVIDIA L40S ($1.57/hr, 48GB VRAM). NVIDIA is the safe choice — CUDA "just works," every tutorial assumes it, and the community is 100x larger. When something breaks on NVIDIA, there are 200 GitHub issues with fixes. On AMD, there are 2.

But the math made me pick the MI300X:

The NVIDIA L40S (48GB) can only run one 32B model at a time, and even then you're capped at roughly 60-100K context before VRAM fills up (about 28GB is left for KV cache after ~20GB of weights). Want to switch models? Wait for one to unload and the other to load. Want a long conversation? Hope it doesn't OOM.

The AMD MI300X (192GB) can run both models simultaneously with about 154GB left over for KV cache. Switching is instant. Context can go to 128K+ without sweating. And the ROCm situation in 2026 is genuinely decent:

  • Ollama has ROCm support (ships with ROCm 7)
  • vLLM has official ROCm support — DigitalOcean even has a tutorial for it on MI300X
  • AMD publishes Day 0 support for major models — Qwen3 on MI300X is officially tested
  • DigitalOcean provides AI/ML-ready images with ROCm drivers pre-installed

The cost difference is minimal — $0.42/hr more ($1.99 vs $1.57). At 40 hours/month that's ~$17. For 4x the VRAM and the ability to run two models simultaneously with long context, it's a no-brainer.

The Full GPU Comparison

GPU            VRAM     $/hr    Storage         Verdict
RTX 4000 Ada   20 GB    $0.76   500 GB boot     Small models only (7-13B)
L40S           48 GB    $1.57   500 GB boot     One 32B model, short-medium context
RTX 6000 Ada   48 GB    $1.57   500 GB boot     Same as L40S
H100           80 GB    $3.39   720 GB + 5 TB   One 32B model, long context. Overpriced.
MI300X         192 GB   $1.99   720 GB + 5 TB   Two 32B models + long context. Our pick.

Why Is the MI300X So Cheap?

It's counterintuitive — 192GB for less than an 80GB H100:

  1. Manufacturing: AMD uses a 12-chiplet design (8 compute dies plus 4 I/O dies). Smaller chiplets have higher yields and cost 30-40% less to produce than NVIDIA's monolithic 814mm² H100 die.
  2. Hardware cost: MI300X chips sell for $10-15K vs $30-40K for an H100.
  3. CUDA tax: NVIDIA charges a premium for the CUDA ecosystem. Everyone wants NVIDIA, so they can. AMD undercuts to gain market share.

What You're Actually Renting

A GPU Droplet isn't just a bare GPU — it's a full virtual machine:

  • 1x AMD Instinct MI300X (192 GB HBM3 VRAM)
  • 20 vCPUs
  • 240 GB system RAM
  • 720 GB NVMe boot disk + 5 TB scratch
  • Ubuntu with AMD ROCm drivers pre-installed (AI/ML-ready image)

You SSH in, download models from HuggingFace, run them. At ~20GB per 32B model, the 5TB scratch disk could store hundreds of models.


Setup

Creating the Droplet

Via DigitalOcean web console (easier for first time):

  1. Log in to cloud.digitalocean.com
  2. Create → GPU Droplets
  3. Select AMD Instinct MI300X (1x GPU, 192GB)
  4. Image: AI/ML Ready (comes with ROCm drivers pre-installed)
  5. Region: NYC or TOR (closest to East Coast)
  6. Add your SSH key
  7. Create

Via doctl CLI:

# Install doctl
brew install doctl
doctl auth init  # paste your DigitalOcean API token

# List your SSH keys
doctl compute ssh-key list

# Create the droplet
doctl compute droplet create llm-inference \
  --size gpu-mi300x1-192gb \
  --image gpu-mi300x-ubuntu-22-04 \
  --region nyc1 \
  --ssh-keys YOUR_SSH_KEY_FINGERPRINT

Or via direct API call (if you don't want doctl touching your existing account config):

curl -X POST "https://api.digitalocean.com/v2/droplets" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "name": "llm-inference",
    "region": "nyc1",
    "size": "gpu-mi300x1-192gb",
    "image": "gpu-mi300x-ubuntu-22-04",
    "ssh_keys": ["YOUR_SSH_KEY_FINGERPRINT"]
  }'

Installing Ollama & Downloading Models

# SSH in
ssh root@YOUR_DROPLET_IP

# Install ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen 3.5 27B
ollama pull qwen3.5:27b-q4_K_M                                # 17 GB

# Pull Gemma 4 31B uncensored via pmarreck's heretical wrapper
# (this fixes the broken chat template that plagues raw HuggingFace GGUFs)
apt-get install -y git
git clone https://github.com/pmarreck/gemma4-heretical
cd gemma4-heretical
./get-gemma4-heretical Q5_K_M                                 # ~22 GB
cd ..

# Optionally pull the official censored Gemma 4 for comparison
ollama pull gemma4:31b                                         # ~18 GB

# Check what's installed
ollama list

# Check disk and VRAM usage
df -h
rocm-smi    # AMD equivalent of nvidia-smi

Running Inference

# Interactive chat — switch between models instantly
# Both stay loaded in VRAM (192GB is enough for both + KV cache)
ollama run qwen3.5:27b-q4_K_M
ollama run gemma4-heretical

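# API access (Ollama's native endpoint; an OpenAI-compatible API is also served under /v1)
# "stream": false returns one JSON object instead of a stream of chunks
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:27b-q4_K_M",
  "prompt": "Explain quantum computing in simple terms",
  "stream": false
}'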
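
If you'd rather script this than use curl, here's a minimal standard-library sketch of the same /api/generate call from Python, with streaming turned off so the reply arrives as a single JSON object. The function names are my own, and the 600-second timeout is an arbitrary guess for slow first loads.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address

def build_generate_request(model, prompt, base_url=OLLAMA_URL):
    """Build a POST to /api/generate that asks for one JSON reply, not a stream."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, base_url=OLLAMA_URL, timeout=600):
    """Send the request and return the generated text.

    Needs a running Ollama server; nothing is sent until this is called.
    """
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Usage on the droplet would look like print(generate("qwen3.5:27b-q4_K_M", "Explain quantum computing in simple terms")).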

Using It From Your Mac

SSH tunnel — makes the remote ollama appear local:

# On your Mac, run:
ssh -L 11434:localhost:11434 root@YOUR_DROPLET_IP

# Now http://localhost:11434 on your Mac hits the droplet's ollama
# Any local app that supports ollama (Open WebUI, etc.) just works
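
A quick way to confirm the tunnel works is to ask the remote ollama which models it has, via its /api/tags endpoint. Here's a small sketch of that in Python. The helper names are mine, and the size is rounded to decimal gigabytes for readability.

```python
import json
import urllib.request

def summarize_models(tags_response):
    """Turn a parsed /api/tags response into (name, size in GB) pairs."""
    return [
        (model["name"], round(model["size"] / 1e9, 1))
        for model in tags_response.get("models", [])
    ]

def list_remote_models(base_url="http://localhost:11434", timeout=10):
    """Fetch the model list through the tunnel (the SSH tunnel must be up)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return summarize_models(json.loads(response.read()))
```

If list_remote_models() raises a connection error, the tunnel isn't up or ollama isn't running on the droplet.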

Don't Forget to Turn It Off

At $1.99/hr, leaving it running 24/7 = ~$1,433/month.

# Snapshot first (saves your installed models, ~$0.06/GB/month for storage)
doctl compute droplet-action snapshot DROPLET_ID --snapshot-name llm-inference-snapshot

# Then destroy (stops all billing)
doctl compute droplet delete DROPLET_ID

# To restore later, create a new droplet from the snapshot
doctl compute droplet create llm-inference \
  --size gpu-mi300x1-192gb \
  --image SNAPSHOT_ID \
  --region nyc1 \
  --ssh-keys YOUR_SSH_KEY_FINGERPRINT

Cost

Item                                 Cost
MI300X droplet                       $1.99/hr
10 hours of tinkering                ~$20
40 hours/month (casual use)          ~$80/month
Model downloads                      Free (HuggingFace/Ollama)
Snapshot storage (~60GB of models)   ~$4/month
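
For anyone who wants to plug in their own hours or a different GPU rate, here's the arithmetic behind those numbers as a small sketch. It assumes a 30-day month and the $0.06/GB/month snapshot rate mentioned earlier.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, matching the ~$1,433 figure

def usage_cost(hourly_rate, hours):
    """Cost of running a droplet for a number of hours."""
    return hourly_rate * hours

def snapshot_cost(size_gb, rate_per_gb_month=0.06):
    """Monthly cost of keeping a snapshot of the given size."""
    return size_gb * rate_per_gb_month

rate = 1.99  # MI300X, $/hr
print(f"10 hours:         ${usage_cost(rate, 10):.2f}")
print(f"40 hours/month:   ${usage_cost(rate, 40):.2f}")
print(f"24/7 for a month: ${usage_cost(rate, HOURS_PER_MONTH):.2f}")
print(f"60 GB snapshot:   ${snapshot_cost(60):.2f}/month")
```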


Update (April 6, 3:18 AM): What Actually Happened

Everything above was the plan. Here's what happened when I actually tried to do it.

The MI300X Doesn't Exist

When I queried the DigitalOcean API for available GPU sizes, the MI300X came back with "regions": [] — zero available regions. The hardware is in their catalog, the pricing is on their website, but there are no actual machines to rent. Sold out or not yet deployed.

Here's what the API actually returned on April 6, 2026:

gpu-4000adax1-20gb       $0.76/hr   regions=['tor1']
gpu-l40sx1-48gb          $1.57/hr   regions=['tor1']
gpu-6000adax1-48gb       $1.57/hr   regions=['tor1']
gpu-mi300x1-192gb        $1.99/hr   regions=[]          <-- nothing
gpu-h100x1-80gb          $3.39/hr   regions=['ams3', 'tor1']
gpu-h200x1-141gb         $3.44/hr   regions=['atl1', 'nyc2']
gpu-mi300x8-1536gb       $15.92/hr  regions=[]          <-- nothing
gpu-h100x8-640gb         $23.92/hr  regions=[]
gpu-h200x8-1128gb        $27.52/hr  regions=[]

Lesson: DigitalOcean lists GPU droplets on their pricing page that may not actually be available when you try to provision one. You won't find out until you hit the API. Check availability first:

curl -s "https://api.digitalocean.com/v2/sizes?per_page=200" \
  -H "Authorization: Bearer YOUR_API_TOKEN" | \
  python3 -c "
import sys,json
for s in json.load(sys.stdin)['sizes']:
    if 'gpu' in s['slug']:
        print(f\"{s['slug']:40s} \${s['price_hourly']}/hr  regions={s['regions']}\")
"

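If you want this check inside a script rather than a one-liner, here's a sketch of the same filter as a function that returns only the GPU sizes with at least one region. It takes the already-parsed "sizes" list from the /v2/sizes response. The sample rows are copied from the output above, and the function name is my own.

```python
def available_gpu_sizes(sizes):
    """Return (slug, hourly price, regions) for GPU sizes that list a region."""
    return [
        (size["slug"], size["price_hourly"], size["regions"])
        for size in sizes
        if size["slug"].startswith("gpu-") and size["regions"]
    ]

# Three rows copied from the API output above
sample = [
    {"slug": "gpu-l40sx1-48gb", "price_hourly": 1.57, "regions": ["tor1"]},
    {"slug": "gpu-mi300x1-192gb", "price_hourly": 1.99, "regions": []},
    {"slug": "gpu-h200x1-141gb", "price_hourly": 3.44, "regions": ["atl1", "nyc2"]},
]

for slug, price, regions in available_gpu_sizes(sample):
    print(f"{slug:24s} ${price}/hr  {', '.join(regions)}")
```

The MI300X row drops out because its region list is empty, which is exactly the situation I ran into.
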
Pivoting to the H200

With the MI300X unavailable, here were the realistic options:

GPU    VRAM     $/hr    Can load both models?                KV cache headroom   Context ceiling
L40S   48 GB    $1.57   Barely (~38GB weights, ~10GB left)   ~10 GB              ~8-16K tokens
H100   80 GB    $3.39   Yes (~38GB weights, ~42GB left)      ~42 GB              ~32-64K tokens
H200   141 GB   $3.44   Yes (~38GB weights, ~103GB left)     ~103 GB             ~100-128K+ tokens

The H200 at $3.44/hr was the clear winner. For just $0.05/hr more than the H100, you get nearly double the VRAM (141GB vs 80GB). Both models loaded with ~103GB left for KV cache — enough for long context. And it's NVIDIA, so CUDA means zero ROCm friction.

Yes, it's $1.45/hr more than the MI300X would have been. At 40 hours/month that's an extra ~$58. The NVIDIA tax is real. But when the AMD option doesn't exist in any datacenter, you pay what's available.

GPU Droplet Quota

With the H200 selected, I fired off the API call to create the droplet:

curl -X POST "https://api.digitalocean.com/v2/droplets" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"name":"llm-bench","region":"nyc2","size":"gpu-h200x1-141gb",...}'

Response:

{"id":"unprocessable_entity","message":"creating this/these droplet(s) will exceed your droplet limit"}

Regular droplets and GPU droplets have separate quotas. My account had 3 regular droplets running (well under the limit of 10), but the GPU quota was zero. This isn't documented clearly anywhere — you just hit the wall when you try.

I tried every available GPU size — H200, H100, L40S — same error on all of them. It's not a per-size limit, it's a blanket "no GPU droplets for you" gate.
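
If you're scripting provisioning, it's worth catching this case explicitly so the script can tell you to go request quota instead of retrying. Here's a sketch that recognizes the error body shown above. It matches on the message text from that single response, which is an assumption on my part and could break if DigitalOcean rewords it.

```python
import json

def is_droplet_limit_error(body):
    """True if an API error body looks like the droplet-limit rejection."""
    try:
        error = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(error, dict):
        return False
    # ASSUMPTION: the wording below comes from the one response I received
    return (
        error.get("id") == "unprocessable_entity"
        and "droplet limit" in error.get("message", "")
    )

response_body = (
    '{"id":"unprocessable_entity",'
    '"message":"creating this/these droplet(s) will exceed your droplet limit"}'
)
print(is_droplet_limit_error(response_body))
```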

Requesting a Limit Increase

There's no API endpoint for requesting a GPU quota increase. You have to:

  1. Go to cloud.digitalocean.com/account/team/droplet_limit_increase
  2. Or try to create a GPU droplet via the web console, which prompts you with an "Increase GPU Droplet limit" form
  3. Fill in a reason and how many GPUs you need
  4. Wait for DigitalOcean support to approve it (hours to a day)

The form asks for a "Reason for increase" and I'll be honest — I stared at this for a minute. Do they need a business justification? Do I need to explain my architecture? Will "I want to run uncensored LLMs" get me flagged? My worry was that I'd need to write a compelling pitch for why I deserve access to a $3.44/hr GPU.

Turns out: the reason doesn't matter that much. They're not evaluating whether your use case is worthy — they're checking that you're a real customer who won't rack up charges and disappear. A one-liner about what workload you're running is standard. They see thousands of these. I submitted:

LLM inference (Gemma 4, Qwen 3.5) via ollama. Already running 3 droplets on this account — need 1x H200 GPU droplet.

That's it. Mentioning your existing droplets helps signal "I'm already a paying customer." Don't overthink it.

The quota increase is account-wide — once they grant GPU access, you can spin up any GPU size that's available in any region. No need to specify which exact GPU you want.

So that's where I am now — waiting for quota approval before I can actually spin anything up.

What's Next

Once the quota is approved, the plan is:

  1. Phase 1 (no TurboQuant): Provision the H200, install ollama, pull both models, and benchmark everything — provisioning time, download speeds, inference latency, model switching time, 50K-token context tests, API setup, and remote access from my Mac. Then tear it down.
  2. Phase 2 (with TurboQuant): Do it all again using vLLM with the TurboQuant plugin for KV cache compression. Compare context length limits, VRAM usage, and inference speed. TurboQuant promises a 4-6x reduction in KV cache memory — if it works, it could push context to 256K+ on the H200.

This post isn't sponsored by DigitalOcean, AMD, or any other company mentioned. I'm sharing my own research and setup process.

Coming up next: I recently migrated a 500GB+ OneDrive repository to an external drive, working around macOS FileProvider limitations, API throttling, and "dataless" file issues. The process involved parallel copy scripts, a quarantine-and-retry system, launchd automation, and hash-based verification with rclone. I'll be publishing the full technical guide soon.
