In my ongoing war against shadow AI, I’ve been testing alternatives that almost anyone can use, no matter their technical expertise, though what you can run is somewhat limited by the resources available (laptop CPU/GPU, memory, etc.). If you want to try the newest open models without sending your prompts to the cloud, or you just want a controllable sandbox for demos, the new offering of Ollama on Windows 11 is a great way to run LLMs locally. You’ll trade some performance and convenience compared to Copilot or ChatGPT, but you gain privacy, offline capability, and a lot of tinkering power. The truth is, if you need privacy, this is the way to go, and if you’re learning, playing around on a laptop isn’t the worst option when you’re working with company or private data.
Below is a practical guide: what to expect, how to install, which models to try, and how to squeeze the most out of a Windows laptop or desktop.
Why local instead of public:
- Privacy & control: Everything runs on your machine; prompts and documents don’t leave the PC. That’s attractive for regulated, company data and internal prototypes. Windows-focused outlets have been making the same case, why not use local tools, which can be a smart alternative for many scenarios.
- Offline + cost: No subscription required; you can experiment even without internet.
- Caveat: Copilot (and Microsoft 365 Copilot) integrates deeply with your tenant’s permissions and data boundaries, which is useful in enterprise, if you’re okay with cloud inference.
What you’ll need (realistic expectations)
- CPU-only works, especially with small models (0.5B–7B). A recent article even shows usable results on a 7-year-old laptop, but just temper expectations (think “helper,” not “superhuman coder”).
- GPU helps a lot. Ollama supports NVIDIA GPUs on Windows (CUDA); AMD Radeon acceleration has been introduced for Windows and Linux and continues to mature.
- Model size matters. Smaller, quantized models (e.g., Q4_K_M) load and respond faster but may lose some quality vs. higher-precision variants like Q8_0.
- Context length costs RAM/VRAM. Huge context windows (e.g., 32k–64k tokens) can tank performance; dial them back to keep things a bit snappier.
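If you’re not sure what you’re working with, a quick PowerShell inventory of CPU, RAM, and GPU helps you pick a sensible first model. This is just a convenience sketch using standard CIM classes, not anything Ollama-specific:

# Quick inventory of CPU, RAM, and GPU before picking a model size
$cpu   = (Get-CimInstance Win32_Processor | Select-Object -First 1).Name
$ramGB = [math]::Round((Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB, 1)
$gpus  = (Get-CimInstance Win32_VideoController).Name -join ', '

"CPU : $cpu"
"RAM : $ramGB GB"
"GPU : $gpus"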
Once we run through the install, the GUI is quite intuitive if you’ve used other generative text AI in the past, so I’m going to include the CLI options as well. I appreciate them, and to be honest, Ollama performs better when run from the CLI alone, which should be expected when running on a local laptop.
Install Ollama on Windows 11
The easiest ways to install are the Windows installer or, from the command line, WinGet. Here’s the WinGet command:
- Install via WinGet
winget install --id=Ollama.Ollama -e
Note: Ollama runs a local service at http://localhost:11434 and gives you a CLI (ollama) plus a friendly Windows GUI app that makes chatting with local models easy (no terminal required).
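Because the service listens on http://localhost:11434, you can also hit the REST API straight from PowerShell. Here’s a minimal sketch against the generate endpoint; it assumes you’ve already pulled llama3.2:3b (which we do in the next section), and the API quickstart in the references covers the full set of fields:

# Send a single prompt to the local Ollama API and print the reply
$body = @{
    model  = "llama3.2:3b"
    prompt = "Why is the sky blue? Answer in one sentence."
    stream = $false
} | ConvertTo-Json

(Invoke-RestMethod -Method Post -Uri http://localhost:11434/api/generate -ContentType "application/json" -Body $body).response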
Pull a model and run your first prompt
Browse the Ollama model library to pick something light (e.g., llama3.2:3b, phi4:14b, qwen2.5:7b, or a coding model like qwen2.5-coder:7b). Then:
- Example: pull a small general model
ollama pull llama3.2:3b
Output:
C:\Windows\System32>ollama pull llama3.2:3b pulling manifest pulling dde5aa3fc5ff: 100% ▕██████████████████████████████████████████████████████████▏ 2.0 GB pulling 966de95ca8a6: 100% ▕██████████████████████████████████████████████████████████▏ 1.4 KB pulling fcc5a6bec9da: 100% ▕██████████████████████████████████████████████████████████▏ 7.7 KB pulling a70ff7e570d9: 100% ▕██████████████████████████████████████████████████████████▏ 6.0 KB pulling 56bb8bd477a5: 100% ▕██████████████████████████████████████████████████████████▏ 96 B pulling 34bb5ab01051: 100% ▕██████████████████████████████████████████████████████████▏ 561 B verifying sha256 digest writing manifest success
- Chat from the CLI
ollama run llama3.2:3b "Summarize ACID in relational databases in two sentences."
Output:
C:\Windows\System32>ollama run llama3.2:3b "Summarize ACID in relational databases in two sentences." ACID stands for Atomicity, Consistency, Isolation, and Durability, which are four fundamental principles that ensure the reliability and integrity of database transactions in relational databases. By adhering to these principles, ACID guarantees that database operations are processed reliably, even in the presence of failures or concurrency issues, ensuring data consistency and accuracy.
I was curious how long it took, so I ran it again in my PowerShell console (I’m such a terminal girly…) and this is what I received, minus the larger time fields, which were all zero:
PS C:\WINDOWS\system32> Measure-Command {ollama run llama3.2:3b "Summarize ACID in relational databases in two sentences."}
Seconds           : 12
Milliseconds      : 383
Ticks             : 123839645
TotalMinutes      : 0.206399408333333
TotalSeconds      : 12.3839645
TotalMilliseconds : 12383.9645
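One caveat on that number: if the model wasn’t already sitting in memory, part of those 12 seconds is model load time. My understanding from the API docs is that posting just the model name with no prompt preloads it, so a hedged way to time a “warm” run looks like this:

# Preload the model into memory (a generate call with no prompt just loads it)
Invoke-RestMethod -Method Post -Uri http://localhost:11434/api/generate -ContentType "application/json" -Body '{"model":"llama3.2:3b"}' | Out-Null

# Now time the prompt with the model already resident
Measure-Command { ollama run llama3.2:3b "Summarize ACID in relational databases in two sentences." }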
So how did this compare to running it in ChatGPT 5? I ran it from a browser and updated the prompt to request the time in milliseconds:
ChatGPT brought back more than requested (overachiever!), and I ended up going back to PowerShell to get a real timing beyond the 59 seconds reported inside the browser. A quick check turned up the following command to measure the time without the extra browser overhead:
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$body = @{ prompt = "Summarize ACID in two sentences." } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://localhost:8000/chat -ContentType "application/json" -Body $body | Out-Null
$sw.Stop(); "$($sw.ElapsedMilliseconds) ms"
The total time for ChatGPT 5, a public LLM, to do the same was 4283 ms. In other words:
| Source | Time in milliseconds/seconds |
| --- | --- |
| Ollama locally on a CPU-only Windows 11 laptop | 12383 ms / 12 seconds |
| ChatGPT 5 Pro plan | 4283 ms / 4 seconds |
| Difference (ChatGPT was faster by) | 8100 ms / 8 seconds |
Yes, ChatGPT was quicker to respond once the browser and extra output were removed, but that should be expected. We’re just running this on a laptop, and I don’t have a GPU in sight.
Tuning for speed on Windows
If responses feel sluggish, try the following, in order. There’s no guarantee you’ll get breakneck speeds, but these steps will help if Ollama isn’t performing at an acceptable pace:
- Pick the right model size & quantization.
Start with ~3B–7B parameter models quantized to Q4_K_M for a good size/quality balance; step up only if you need more reasoning.
- Reduce the context length.
Don’t default to 32k or 64k unless you need it; 4k–8k often performs much better on typical PCs (see the sketch after this list).
- Use your GPU (if NVIDIA).
Install current NVIDIA drivers; Ollama will automatically accelerate with CUDA if supported. You can also set Windows “Graphics settings” to force the Ollama app to use the high-performance GPU.
- Keep memory under control.
Limit simultaneous loaded models and how long they stay in memory:
- OLLAMA_MAX_LOADED_MODELS=1
- OLLAMA_KEEP_ALIVE=5m
These are standard Ollama server environment variables you can set in Windows “Environment Variables” (or with setx) before launching Ollama; there’s a short sketch of these and the context-window option after this list.
- Use the new Windows GUI for quick toggles.
The app exposes basic options (like context window) without editing files, which I found handy when I was experimenting.
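To put a couple of these together, here’s a small sketch: the first two lines persist the memory settings for the current user with setx (restart Ollama afterward so it picks them up), and the interactive /set parameter command shrinks the context window for a single session. The exact values are just starting points, not recommendations:

# Persist the memory-related settings for the current user (restart Ollama so they take effect)
setx OLLAMA_MAX_LOADED_MODELS 1
setx OLLAMA_KEEP_ALIVE 5m

# Inside an interactive session, drop the context window for just that session
ollama run llama3.2:3b
>>> /set parameter num_ctx 4096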
When Not to Use Ollama
Consider Copilot when:
- Deep Microsoft 365 integration: If you need tenant-aware grounding across SharePoint, Outlook, Teams, etc., Copilot’s data-boundary logic and permissions are built in.
- Bigger models / turnkey accuracy: Cloud services still win for the largest, most capable frontier models, with no local VRAM constraints.
Consider ChatGPT, Perplexity, Claude, etc. (PAID versions) when:
- Generic questions regarding business/organizational work. No critical data or PII is involved.
- Help reformatting or creating crisper, more concise wording in correspondence/documents.
- Creating what I call “filler content” around any proprietary or intellectual property.
- Remember to never upload or paste critical data/intellectual property into public generative AI.
Troubleshooting quick hits
- “It’s all CPU.” Make sure you’re on a supported NVIDIA GPU with current drivers whenever possible; AMD acceleration on Windows exists but is newer and may vary by model/driver.
- “The app works, but API calls fail.” Check the service at http://localhost:11434/api/tags. If it’s not responding, start Ollama and wait a few seconds.
- “Model downloads are huge.” Choose smaller/quantized variants (e.g., :3b, Q4_K_M) and lower the context window.
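When I’m debugging, these two quick checks cover most of the above (a sketch; exact output columns may differ by version):

# Is the local service answering? If so, this lists the models you have pulled.
(Invoke-RestMethod http://localhost:11434/api/tags).models | Select-Object name, size

# Is the loaded model running on CPU or GPU? Check the PROCESSOR column.
ollama ps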
Summary
If Copilot, ChatGPT (paid, not free), OpenAI, Perplexity, etc. is your daily driver for generative AI, think of Ollama as a local test bench. It’s perfect for validating prompts, exploring new open models, and building small end-to-end prototypes, both privately and offline. Start with a 3B–7B model, keep the context window modest, and let the new Windows app handle the basics while you get hands-on with the API for deeper experiments.
References
These are the go-to resources I used to understand, install, and troubleshoot. They’ve been the most helpful, and I’d be lying if I didn’t say I’ve asked ChatGPT and IntelliSense for help when stuck on challenges a Google search couldn’t solve.
- Download Ollama for Windows (installer & CLI/GUI). Ollama
- API quickstart (generate/chat examples). Ollama
- Model library (browse/choose models). Ollama
- NVIDIA GPU support (what’s supported). ollama.qubitpi.org
- AMD on Windows (support introduced and evolving). Ollama
- PowerShell documentation (support as I learn). Microsoft Learn
- PowerShell timing using the Measure-Command cmdlet.
- Context length and performance tips. Windows Central
- Why local AI can be a smart choice. Windows Central