Ollama
Local LLM inference: serve open models behind an OpenAI-compatible endpoint on your own hardware.
Ollama packages open models — Llama, Mistral, Gemma, Qwen, and others — into a local server with a REST API that mirrors OpenAI's interface. You pull a model, point your existing SDK code at localhost, and iterate without touching a paid API.

My standard development loop runs Ollama before it runs any cloud API. I do prompt iteration, RAG pipeline testing, and tool-call debugging locally first; because the endpoint is OpenAI-compatible, swapping models means changing only the base URL and model name, never the integration code. I use Modelfiles to lock in system prompts and sampling parameters for specific tasks, which keeps experiments reproducible and makes it easy to compare Llama 3, Mistral, and Qwen variants at different GGUF quantization levels. For production I've deployed Ollama on GPU servers for workloads where data couldn't leave a private network: client contracts, internal document processing, compliance-constrained environments. The deployment story carries little infrastructure overhead: Ollama running as a service, network-isolated, with the application talking to it over a local interface.
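The base-URL swap above can be sketched with a stdlib request builder. Names here are illustrative, but the endpoint shape is real: Ollama exposes its OpenAI-compatible API under /v1 on port 11434 by default, and the same /chat/completions payload works against a cloud provider.

```python
import json

def chat_request(base_url: str, model: str, messages: list[dict]) -> dict:
    """Build an OpenAI-style /chat/completions request descriptor."""
    return {
        "url": f"{base_url}/chat/completions",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"model": model, "messages": messages}),
    }

messages = [{"role": "user", "content": "Classify this ticket: 'login broken'"}]

# Local iteration against Ollama (default port 11434, /v1 prefix):
local = chat_request("http://localhost:11434/v1", "llama3", messages)

# Promotion to a cloud endpoint: only base_url and model change.
cloud = chat_request("https://api.openai.com/v1", "gpt-4o", messages)

# The payload shape is identical in both cases.
assert json.loads(local["body"])["messages"] == json.loads(cloud["body"])["messages"]
```

In practice the same swap happens inside whatever OpenAI SDK the application already uses; this sketch just makes the "only the base URL changes" claim concrete.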
Local Development Setup
I pull a model, write a Modelfile to configure the system prompt and temperature for the task at hand, and point the OpenAI SDK at http://localhost:11434/v1. Existing code that targets the OpenAI API works without modification; the only changes are the base URL and a placeholder API key, which Ollama ignores.
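A Modelfile for this setup might look like the following. The base model, parameter values, and system prompt are illustrative; FROM, PARAMETER, and SYSTEM are standard Modelfile directives.

```
FROM llama3
# Deterministic-leaning sampling for a structured-extraction task.
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM """You extract structured fields from support tickets and reply with JSON only."""
```

Built with `ollama create ticket-extractor -f Modelfile`, this pins the prompt and sampling parameters to a named model, so every experiment against "ticket-extractor" runs under the same configuration.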
Model Selection & Evaluation
I run task-specific evals across model families rather than relying on general benchmarks. The main tradeoffs are GGUF quantization level versus context-window fit in the available VRAM, and whether a smaller model at Q4 outperforms a larger one at a more aggressive Q2 for the specific task.
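A task-specific eval loop of that shape can be sketched as follows. The generate callable is a stand-in for whatever client talks to the Ollama endpoint; the model tags and labeled cases are illustrative.

```python
from typing import Callable

def run_eval(generate: Callable[[str, str], str],
             models: list[str],
             cases: list[tuple[str, str]]) -> dict[str, float]:
    """Score each model on the same labeled cases; pick the smallest one
    that clears the quality threshold for the task."""
    scores = {}
    for model in models:
        correct = sum(1 for prompt, expected in cases
                      if generate(model, prompt).strip() == expected)
        scores[model] = correct / len(cases)
    return scores

# Stubbed generate for demonstration; in practice this calls the
# OpenAI-compatible endpoint with the given model name.
def fake_generate(model: str, prompt: str) -> str:
    return "positive" if "great" in prompt else "negative"

cases = [("This is great", "positive"), ("This is awful", "negative")]
scores = run_eval(fake_generate, ["llama3:8b-q4", "mistral:7b-q4"], cases)
```

Because every model sits behind the same endpoint, adding a quantization variant to the comparison is one more string in the models list.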
Private AI Features
For compliance-constrained deployments I run Ollama as a systemd service on a machine with no outbound API access — the application talks to it over a local interface, nothing leaves the network boundary. Setup is straightforward; the harder part is model selection and latency profiling on the target hardware.
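A minimal unit of that shape is sketched below. The binary path is an assumption (the Linux installer typically places ollama on the PATH); OLLAMA_HOST is Ollama's real bind-address variable, and binding it to loopback keeps the API unreachable from outside the host.

```ini
[Unit]
Description=Ollama (local-only)
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
# Bind to loopback so nothing outside the host can reach the API.
Environment="OLLAMA_HOST=127.0.0.1:11434"
Restart=always

[Install]
WantedBy=multi-user.target
```

Network isolation beyond the bind address (firewall rules, no default route) is handled at the host level, outside the unit file.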
LLM Development Environment
I iterate on prompts, RAG retrieval logic, and tool-call schemas locally against Ollama before the code ever touches a paid API. The feedback loop is fast, the cost is zero, and the OpenAI-compatible endpoint means there's no integration code to rewrite when I promote to production.
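A tool-call schema iterated this way is just a function definition in the OpenAI tool format, which Ollama's endpoint also accepts. The tool name and fields below are hypothetical.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format; the
# same dict is sent to Ollama locally and to a cloud API unchanged.
extract_invoice = {
    "type": "function",
    "function": {
        "name": "extract_invoice",
        "description": "Pull structured fields out of an invoice.",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
                "currency": {"type": "string"},
            },
            "required": ["vendor", "total"],
        },
    },
}

# The schema round-trips through JSON, so it can live in a fixture file and
# be replayed against each model during local debugging.
assert json.loads(json.dumps(extract_invoice)) == extract_invoice
```

Debugging the schema locally first means malformed tool calls surface before any paid request is made.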
Privacy-Sensitive AI
For contracts involving client documents or regulated data, I've run inference entirely on a local or private server — no data goes to an external provider. Ollama handles the serving layer; the application code stays identical to what it would be against a cloud API.
Cost-Sensitive Workloads
High-volume tasks like summarization, classification, and structured extraction often don't need GPT-4-class models. I benchmark open models against the task and run the ones that hit the quality threshold locally, keeping cloud API calls for the cases where they're actually justified.
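The routing decision can be sketched as a simple dispatcher. The task names, scores, and threshold are hypothetical; the point is that a task only earns local inference by passing its eval.

```python
# Illustrative router: send a task to the local model only if a prior
# task-specific eval showed it clears the quality threshold.
EVAL_SCORES = {  # task -> best local model's eval score (hypothetical numbers)
    "summarize": 0.94,
    "classify": 0.97,
    "extract": 0.91,
    "long-form-reasoning": 0.62,
}
THRESHOLD = 0.90

def route(task: str) -> str:
    """Return 'local' for tasks a local model handles well, else 'cloud'."""
    return "local" if EVAL_SCORES.get(task, 0.0) >= THRESHOLD else "cloud"
```

Unknown tasks default to the cloud path, so new workloads are only moved local after they have been benchmarked.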
Let's talk Ollama.
No pitch. Just a technical conversation about the problem you're working on.