Questions and Answers from Running a Local LLM

I had a few random questions from my Running a Local LLM on Your Laptop session at the Houston AI-lytics 2026 event last week, so this post looks at some of those questions and my answers.

Note: This stuff is changing rapidly, and there aren’t a lot of factual answers. A lot of what you should look for is guidance and rational reasons for leaning in one direction or another.

Questions below:

  • Do we need an NPU? (Or what do I think of NPUs)
  • How do we audit or test an LLM and know what is happening?
  • In which situations would you run a local model?
  • Which Model is Best?

Do we need an NPU? (Or what do I think of NPUs)

You don’t need an NPU to run a local LLM, but one helps with efficiency. An NPU is a Neural Processing Unit, a specialized processor (alongside the CPU and GPU) designed to run AI-type workloads more efficiently. That could mean training a model or running LLM inference.

I think an NPU is a great idea for efficiency. We already know AI applications use a lot of compute and power. Just look at all the concerns over power/water and investments being made in new data centers for AI. Being more efficient helps.

Just as a GPU offloads graphics work and makes your laptop more efficient, an NPU helps with AI workloads, but it’s not required.

How do we audit or test an LLM and know what is happening?

First, LLMs aren’t deterministic, so they might not return the same thing every time. It’s hard to test a non-deterministic thing, because testing usually means asserting that if a is passed in, b is returned. If I pass in a and sometimes get b, sometimes c, and rarely f, that’s hard to test.
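
One way around the exact-assertion problem is to test statistically: run the same prompt many times and assert a minimum acceptance rate instead of one exact answer. Here’s a minimal sketch of that idea; `ask_model` is a hypothetical stand-in (stubbed with weighted randomness so the example runs), not a real API.

```python
import random

random.seed(0)  # make this illustrative run reproducible

# Hypothetical stand-in for a real model call; in practice,
# ask_model(prompt) would call your local LLM.
def ask_model(prompt: str) -> str:
    # Simulate non-determinism: usually "b", sometimes "c", rarely "f".
    return random.choices(["b", "c", "f"], weights=[80, 15, 5])[0]

def pass_rate(prompt: str, accept, runs: int = 100) -> float:
    """Run the same prompt many times; return the fraction of acceptable answers."""
    hits = sum(1 for _ in range(runs) if accept(ask_model(prompt)))
    return hits / runs

# Instead of asserting an exact answer, assert a minimum acceptance rate.
rate = pass_rate("a", accept=lambda out: out in ("b", "c"))
assert rate >= 0.8, f"model regressed: only {rate:.0%} acceptable answers"
print(f"acceptance rate: {rate:.0%}")
```

The `accept` predicate is where the judgment lives: instead of string equality, you check whatever makes an answer "good enough" for your use case.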

I don’t have a definitive way to test a model for behavior in this case. What you can do is run experiments: if the results are useful more often than not, keep using the model. If it gets you useful results faster, it’s a better model for you. If it’s slower, more expensive, or less useful, it’s a worse one.

Auditing is looking at what happened, which means reaching into the processing of these GPT-type tools. There are some tools to help (AuditLLM), but I can’t speak to whether they are a) worth the effort, b) effective, or c) junk. I’m still learning here, too.

In which situations would you run a local model?

This is a hard one because there are a few situations in which I’d seriously consider a local model (or one I control in a managed service like Amazon Bedrock/Azure AI/Google Vertex).

First, when I’m worried about costs and want to control them. The vendors give you some limits and throttles, but usage can still be expensive. If I want a known, controllable spend, I might run a model in my own service, because I can allocate capacity and know what’s available, what it will cost, and who will be using it. Perhaps the cloud vendors will give us more controls and guarantee we aren’t on “shared” systems, but any efficient use of hardware there works to their benefit, not mine.
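
A rough way to reason about the cost trade-off is a break-even calculation: per-token hosted pricing against a fixed monthly cost for dedicated hardware or capacity. The numbers below are hypothetical placeholders, not any vendor’s actual pricing; substitute your own.

```python
# Back-of-the-envelope spend comparison: hosted per-token pricing vs a
# fixed-cost local/dedicated deployment. Both rates are hypothetical
# placeholders -- substitute your vendor's actual pricing.

PER_1K_TOKENS_USD = 0.002          # hypothetical hosted API rate
LOCAL_MONTHLY_FIXED_USD = 450.00   # hypothetical server/instance cost

def hosted_monthly_cost(tokens_per_month: int) -> float:
    """Variable cost of the hosted API at a given monthly token volume."""
    return tokens_per_month / 1000 * PER_1K_TOKENS_USD

def breakeven_tokens() -> int:
    """Monthly token volume above which the fixed-cost option wins."""
    return int(LOCAL_MONTHLY_FIXED_USD / PER_1K_TOKENS_USD * 1000)

for tokens in (10_000_000, 100_000_000, 500_000_000):
    print(f"{tokens:>12,} tokens/mo -> hosted ${hosted_monthly_cost(tokens):,.2f}")
print(f"break-even at {breakeven_tokens():,} tokens/month")
```

The fixed-cost option only wins past the break-even volume, which is the point the paragraph above is making: the value of local is the *known* spend, not necessarily a lower one.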

Second, when I’m really concerned about data security. Most companies promise they won’t use your data and will delete sessions, but they might not, and they might make mistakes. Would they accidentally train on my data, or hand it over in response to some legal subpoena? If I’m outside the US or really worried, I’d run local models.

Third, if I want to ensure that I have complete control over the training of the model or the prompts, I might use a local model where I know there aren’t any system prompts being injected into my context.

Which Model is Best?

Yes.

There’s no good answer here. If you look at the list of models on Hugging Face, for example, there are lots and lots of models. None of us has time to test more than a small fraction of them. I think you have to depend on the community to help you decide which of these models might be better for your situation.

Think about what you want a model to do, what things are important to your problem space, and then look for a model that people think works well and does the type of things you want to do. Similar to how you interview a person for certain types of work, think about that for a model.

The nice thing outside of the large LLMs is that you can use smaller models to fill certain niches if you find you want to provide a capability widely to your organization. Interpreting code or linting it for best practices, for example, could use a smaller model that consumes less compute and that is trained, or that you fine-tune, for your particular situation (and saves money).
