Run Local LLMs on Your Laptop / Desktop
Preview and scope
This page is a practical guide to installing popular open-weight large language models from OpenAI, Google, Meta, and DeepSeek on a laptop or desktop, giving you a ChatGPT-style experience that runs entirely on your machine. The primary tool is Ollama, a simple local runtime for LLMs that works on macOS, Windows, and Linux. Optional interfaces such as LM Studio and Open WebUI are also referenced.
This page is not a deep exploration of LLM theory, fine-tuning, or production serving. For larger models, fine-tuning, higher throughput, or multi-user serving, use our HPC systems and contact us for support.
1) Before we begin: what do model sizes like "270M / 1B / 4B / 12B / 27B" mean?
These labels give a model's size as the approximate number of learned parameters: M means millions and B means billions (e.g., 270M, 1B, 4B, 12B, 27B).
270M to 1B: small models, fast and light, for basic tasks
4B: a balanced “laptop default” that fits most modern laptops
12B: stronger, needs more memory, and is slower on laptops without a GPU
27B and up: typically requires a powerful laptop/workstation or server
A simple model chooser for your laptop
8–12 GB RAM: start with a 270M–1B parameter model
16 GB RAM: a 4B parameter model is the sweet spot
32 GB RAM or a discrete GPU: a 12B parameter model is reasonable
Workstation or server with a powerful GPU: 27B parameter models and up
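If you want a quick way to apply this table, the short sketch below reads your machine's total RAM and prints a suggested starting size. It is only a rough guide with assumed cut-offs, it ignores GPU VRAM and other running applications, and it assumes the psutil package is installed (pip install psutil).
Example sketch: rough model-size suggestion from total system RAM
import psutil
ram_gb = psutil.virtual_memory().total / 1e9  # total RAM in GB (approximate)
if ram_gb < 12:
    suggestion = "270M-1B (e.g., gemma3:1b)"
elif ram_gb < 24:
    suggestion = "4B (e.g., gemma3:4b)"
elif ram_gb < 48:
    suggestion = "12B (e.g., gemma3:12b)"
else:
    suggestion = "27B and up (e.g., gemma3:27b)"
print(f"Detected about {ram_gb:.0f} GB RAM; a reasonable starting size is {suggestion}")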
Platform notes: macOS, Windows, and Linux
macOS (Apple Silicon)
Uses unified memory. CPU and GPU share the same pool.
16 GB machines work well with 4B models; a 12B parameter model is more comfortable on 32 GB.
Leave headroom for other apps. If the system starts swapping, text generation slows.
Windows and most Linux laptops/desktops
Usually have separate GPU VRAM and system RAM.
With an NVIDIA GPU and 8–12 GB VRAM, mid-size models run more smoothly.
CPU-only machines should stick to 1B–4B for reasonable latency.
Linux workstations/servers
With higher-VRAM GPUs (for example, 24 GB or more), 12B–27B becomes practical.
Use containers and scheduling for reproducibility on multi-user nodes.
General tips
A larger context window increases memory use on every platform.
If you see slowdowns or out-of-memory errors, reduce the context length or step down one model size (a short sketch showing how to cap the context length follows this list).
Start with the plain build listed on the model page. Move to larger sizes only if you need stronger answers and have the headroom.
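As a preview of Section 8, the sketch below shows one way to cap the context length per request. It assumes the Ollama HTTP API described later in this guide; num_ctx sets the context window (in tokens) for that request, which reduces memory use.
Example sketch: capping the context window for a single request
import requests
API = "http://localhost:11434"
r = requests.post(f"{API}/api/generate",
                  json={"model": "gemma3:4b",
                        "prompt": "In one sentence, what is a context window?",
                        "stream": False,                 # return one JSON object instead of a stream
                        "options": {"num_ctx": 2048}})   # cap the context window at 2048 tokens
print(r.json()["response"])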
2) Key terminology when downloading models (in plain language)
Model size: the 270M/1B/4B/12B/27B label described above
Context window: how much text the model can consider at once, shown as a token count such as 32K or 128K. Larger context enables longer prompts and documents, but uses more memory and time
Quantization: shrinking a model by storing its numbers with fewer bits. This makes the model smaller and faster, usually with a small quality drop.
QAT (Quantization-Aware Training): the model was trained knowing it would be quantized, so it usually keeps quality better at the same small size.
Info: If you see a QAT version, prefer it. If not, a regular quantized build is still a good laptop choice.
Multimodal: the model can take in and/or produce multiple data types, such as text, images, audio, or video, and reason across them together.
Inference: running a model to produce an answer from your prompt. It can happen locally or remotely.
Local inference: the model runs on your own machine.
Remote inference: the model runs on a separate server; your laptop is just the client, sending prompts over the network and receiving the answers back.
Variants on model pages: suffixes such as -fp16, -q8_0, or -it-qat. If you are unsure, choose the plain tag without a suffix; that is usually the laptop-friendly default.
Where models come from: the Ollama library provides convenient pre-packaged entries. Hugging Face is a large catalogue of models for discovery and advanced use.
3) Minimum hardware requirements for local installation
Memory
8–12 GB RAM → good fit for a 270M–1B parameter model
16 GB RAM → good fit for a 4B parameter model
32 GB RAM or a discrete GPU → good fit for a 12B parameter model
CPU/GPU
Models run on both CPUs and GPUs; a GPU improves speed. Apple Silicon Mac laptops run models efficiently due to their unified memory architecture (CPU and GPU share one high-bandwidth memory pool).
Disk space
Models are several gigabytes each. Ensure adequate free space.
Network
Local by default. Keep the runtime on localhost unless you explicitly plan a remote setup.
4) Install the runtime engine (Ollama)
Choose your operating system. After installing, verify with a simple command.
4.1 macOS
Download the macOS build from the Ollama website and install it by dragging to Applications
Launch Ollama. You should see its icon in the menu bar
Homebrew users can install via cask if available in your environment
Verification
Open Terminal
Run a quick command from Section 5
4.2 Windows
Download the Windows installer from the Ollama website and run it
Allow any security prompts. You should see an icon in the system tray
Verification
Open Command Prompt
Run a quick command from Section 5
4.3 Linux
Use the official installation instructions for your distribution. A one-line script is typically provided on the Ollama website
If you run Ollama as a service, start and enable it, then check its status
Verification
Open a terminal
Run a quick command from Section 5
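On any of the three platforms you can also confirm that the runtime is responding by querying its local HTTP endpoint. This is an optional sketch; it assumes Python with the requests package installed (pip install requests) and Ollama listening on its default port 11434. The CLI commands in Section 5 work just as well.
Optional check that the Ollama runtime is responding
import requests
# Ollama listens on localhost:11434 by default; /api/version reports the runtime version
r = requests.get("http://localhost:11434/api/version", timeout=5)
print(r.json())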
5) Your first model: pull and run
You can start from the command line or use the built-in Ollama interface.
Command line
Choose a model size from Section 1 (see Section 9 for more details). For a modern 16 GB laptop, a 4B parameter model is a safe default
Pull the model (example shown with the gemma3:4b model):
Command to download the gemma3:4b model
ollama pull gemma3:4b
Run the model and ask a question:
Command to start interactive chat with gemma3:4b
ollama run gemma3:4b
Type a question at the prompt. Type /bye to exit.
List installed models:
Command to list all installed models
ollama list
Built-in Ollama GUI
Open the Ollama interface from the system tray or menu bar
Use the interface to download a model that matches your laptop (see Section 1)
Start a chat in the interface
At this point you have a fully working local setup: models installed on your machine and a ChatGPT-style chat experience. You can download models sized for your hardware (for example, gemma3:4b) and chat through the command line or the built-in GUI.
6) Optional GUIs with more features
Optional front-end interfaces add document and file workflows, conversation history, parameter control (such as temperature), and multi-model management. The tools below provide some or all of these features.
LM Studio. A desktop application with a friendly interface
Open WebUI. A browser interface with many features. It typically runs via a single Docker command and is reachable at
http://localhost:3000
These interfaces make document processing, parameter control, and other advanced features more convenient.
7) Remote option: point a GUI to a remote Ollama server
Use this when a remote server runs larger models and your laptop only needs to run a GUI.
You need
Server address or hostname
Port number. The default port for Ollama is 11434 unless configured differently
Where to set it
In Open WebUI: in Settings, set the Ollama API base URL to http://SERVER:PORT
In desktop GUIs: set the Ollama API URL to http://SERVER:PORT
If you cannot connect, check that the remote server is reachable, the firewall allows the port, and Ollama is listening for remote connections. If models do not appear in the GUI, confirm that they are pulled on the server.
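A quick way to test the connection from your laptop is to ask the remote server for its model list. The sketch below assumes the requests package from Section 8.1 and that you fill in the server placeholder; /api/tags returns the models that have been pulled on that server.
Quick connectivity check against a remote Ollama server
import requests
API = "http://SERVER:PORT"  # replace with your server address and port, for example http://SERVER:11434
r = requests.get(f"{API}/api/tags", timeout=5)
# each entry is a model that has been pulled on the remote server
for m in r.json().get("models", []):
    print(m["name"])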
This section separates local use from remote use. Next is how to call the runtime from Jupyter outside any GUI.
8) Using models programmatically (e.g., a Jupyter notebook or any Python script)
This allows you to integrate these LLMs into automations and call them from your code. The examples we show here use the local endpoint by default. If you are using a remote server, replace the base URL. You only need three things: the endpoint (URL), the model name, and your prompt. Everything else is optional.
8.1 Environment setup
Create or activate a Python environment and install these packages if needed:
Install required Python packages
pip install requests matplotlib pillow
Set the base URL:
Set the Ollama API endpoint URL
API = "http://localhost:11434" # change to "http://SERVER:PORT" for remote8.2 Eg: Minimal text generation (streamed output)
Minimal text generation example with streaming
import requests, json, sys
API = "http://localhost:11434"
# stream the reply and print tokens as they arrive
r = requests.post(f"{API}/api/generate",
                  json={"model": "gemma3:4b", "prompt": "Explain transformers in one sentence."},
                  stream=True)
for line in r.iter_lines():
    if line:
        sys.stdout.write(json.loads(line)["response"])
        sys.stdout.flush()
8.3 Example: Summarize a local file
Example: summarize content from a local text file
import requests, json
API = "http://localhost:11434"
text = open("sample.txt").read()
# ask the model to summarize the file contents
r = requests.post(f"{API}/api/generate",
                  json={"model": "gemma3:4b", "prompt": "Summarize:\n" + text},
                  stream=True)
# join the streamed chunks into one summary string
summary = "".join(json.loads(l)["response"] for l in r.iter_lines() if l)
print(summary.strip())
8.4 Example: Batch summarize a folder and plot summary lengths
Batch summarize text files and visualize summary lengths
import os, requests, json
import matplotlib.pyplot as plt
API = "http://localhost:11434"
def summarize(txt):
    # send one summarization request and collect the streamed response text
    r = requests.post(f"{API}/api/generate",
                      json={"model": "gemma3:4b", "prompt": "Summarize:\n" + txt},
                      stream=True)
    return "".join(json.loads(l)["response"] for l in r.iter_lines() if l).strip()
# summarize every .txt file in the docs folder and record the summary lengths
files = [f for f in os.listdir("docs") if f.endswith(".txt")]
lengths = {f: len(summarize(open(os.path.join("docs", f)).read())) for f in files}
plt.bar(lengths.keys(), lengths.values())
plt.xticks(rotation=45, ha="right")
plt.ylabel("Summary length (characters)")
plt.title("Folder summaries")
plt.show()
8.5 Example (optional): Image captioning if your model supports images
Some multimodal models accept images. If your chosen model supports this capability, you can send a base64-encoded image.
Image captioning with a multimodal model
import base64, requests, json
API = "http://localhost:11434"
# read the image and base64-encode it for the API's images field
img_b64 = base64.b64encode(open("image.jpg", "rb").read()).decode()
r = requests.post(f"{API}/api/generate",
                  json={"model": "gemma3:4b",
                        "prompt": "Describe this image in one sentence.",
                        "images": [img_b64]},
                  stream=True)
caption = "".join(json.loads(l)["response"] for l in r.iter_lines() if l)
print(caption.strip())
If your model does not support images, use a vision-enabled model from the model page.
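The examples above send one standalone prompt per request. For multi-turn conversations, the chat endpoint accepts a list of messages so the model can see earlier turns. A minimal sketch, assuming the same local endpoint and model as the examples above:
Example: multi-turn conversation via the chat endpoint
import requests
API = "http://localhost:11434"
messages = [{"role": "user", "content": "Give me one sentence about glaciers."}]
r = requests.post(f"{API}/api/chat",
                  json={"model": "gemma3:4b", "messages": messages, "stream": False})
reply = r.json()["message"]["content"]
print(reply)
# keep the history so the follow-up question has context
messages += [{"role": "assistant", "content": reply},
             {"role": "user", "content": "Now rewrite it as a haiku."}]
r = requests.post(f"{API}/api/chat",
                  json={"model": "gemma3:4b", "messages": messages, "stream": False})
print(r.json()["message"]["content"])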
9) Choosing a model on the model page
When you open a model page to download, focus on these items.
Name. For example gemma3:1b, gemma3:4b, gemma3:12b, gemma3:27b. The number is the size from Section 1
Size. The download size. Larger files are heavier to run
Context. For example 128K. Higher context allows longer prompts but uses more memory
Variants. Suffixes such as -fp16, -q8_0, or -it-qat. If you are unsure, choose the plain tag with no suffix
Default recommendations
Older or 8–12 GB laptops: …:1b or …:270m
16 GB laptops: …:4b
32 GB RAM or a discrete GPU: …:12b
If a QAT variant is offered, it is a good choice on laptops for stronger fidelity at a similar footprint.
Some popular recommended models:
https://ollama.com/library/deepseek-r1 (reasoning model)
https://ollama.com/library/gpt-oss (from OpenAI; reasoning model)
https://ollama.com/library/gemma3 (from Google; multimodal with vision)
https://ollama.com/library/llama3.2 (from Meta)
https://ollama.com/library/qwen3 (reasoning model)
10) Quick performance tips
Start smaller and scale up only if you need better answers
Keep prompts focused. Very long prompts increase memory use and time
If you see slow responses or memory issues, drop down a size, choose a QAT or plain variant, and close other heavy applications
To compare models, time a fixed prompt and note tokens per second and latency
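The sketch below is one way to run such a comparison. It assumes the Ollama API from Section 8; a non-streamed response includes eval_count (tokens generated) and eval_duration (generation time in nanoseconds), which give an approximate tokens-per-second figure.
Example: timing a fixed prompt and estimating tokens per second
import time, requests
API = "http://localhost:11434"
PROMPT = "Summarize the idea of quantization in two sentences."
for model in ["gemma3:1b", "gemma3:4b"]:  # compare any models you have pulled
    t0 = time.time()
    r = requests.post(f"{API}/api/generate",
                      json={"model": model, "prompt": PROMPT, "stream": False})
    data = r.json()
    latency = time.time() - t0
    # eval_count = tokens generated; eval_duration is reported in nanoseconds
    tokens_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"{model}: {latency:.1f} s total, about {tokens_per_s:.1f} tokens/s")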
11) Troubleshooting
No output in Jupyter. The API streams multiple JSON lines; collect the response field from each line, or request a non-streamed reply (see the sketch at the end of this section)
GUI cannot connect. Ensure Ollama is running. For Open WebUI, ensure Docker Desktop is running. Verify the base URL and port
Model not listed. Run ollama list. Pull the model again if necessary
macOS Python SSL or LibreSSL warnings. Safe to ignore for http://localhost
Low disk space. Remove unused models with:
Command to remove an unused model
ollama rm <name>
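If collecting streamed lines is awkward in a notebook, you can also ask for a single non-streamed reply. A minimal sketch, using the same endpoint and model as Section 8:
Example: non-streamed request that returns one JSON object
import requests
API = "http://localhost:11434"
r = requests.post(f"{API}/api/generate",
                  json={"model": "gemma3:4b",
                        "prompt": "Explain quantization in one sentence.",
                        "stream": False})  # one complete JSON object instead of many lines
print(r.json()["response"])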
12) Hugging Face in practice
Use Hugging Face to discover alternative models and read model cards for size, context, and license. Advanced users can import specific variants into Ollama using a Modelfile and the GGUF format. For most users, the pre-packaged Ollama entries are the simplest path.
13) Where to go next
For larger models, fine-tuning, multi-user serving, and performance work, use our HPC systems. Contact us for architecture reviews, scaling advice, and best practices.