Run Local LLMs on Your Laptop / Desktop

Preview and scope

This page is a practical guide to installing popular open large language models from OpenAI, Google, Meta, and DeepSeek on a laptop or desktop, giving you a ChatGPT-style experience. The primary tool we’ll use is Ollama, a lightweight engine that runs LLMs locally and works on macOS, Windows, and Linux. Optional interfaces such as LM Studio and Open WebUI are also referenced.

This page is not a deep exploration of LLM theory, fine-tuning, or production serving. For larger models, fine-tuning, higher throughput, or multi-user serving, use our HPC systems and contact us for support.

1) Before we begin: what do model sizes like “270M / 1B / 4B / 12B / 27B” mean?

The number in a model’s name is its size, measured as the approximate count of learned parameters: M means millions of parameters and B means billions (e.g., 270M, 1B, 4B, 12B, 27B).

  • 270m to 1B: small models, fast and light, for basic tasks

  • 4B: a balanced “laptop default” that fits most modern laptops

  • 12B: stronger, needs more memory, and is slower on laptops without a GPU

  • 27B and up: typically requires a powerful laptop/workstation or server

A simple model chooser for your laptop

  • 8–12 GB RAM: start with a 270M–1B parameter model

  • 16 GB RAM: 4B parameter model is the sweet spot

  • 32 GB RAM or a discrete GPU: 12B parameter model is reasonable

  • Workstation or server with a powerful GPU: 27B parameter model and up
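The chooser above can be sketched as a small helper function. This is only an illustration of the rules of thumb in this list, not hard limits; real headroom depends on context length and what else is running.

```python
def suggest_model_size(ram_gb: float, has_discrete_gpu: bool = False) -> str:
    """Map available memory to a rough model-size tier (rules of thumb only)."""
    if ram_gb >= 32 or has_discrete_gpu:
        return "12b"          # 27B and up still needs a workstation/server GPU
    if ram_gb >= 16:
        return "4b"           # the balanced "laptop default"
    return "270m-1b"          # small and fast, for basic tasks

print(suggest_model_size(16))   # 4b
```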

Platform notes: macOS, Windows, and Linux

macOS (Apple Silicon)

  • Uses unified memory. CPU and GPU share the same pool.

  • 16 GB machines work well with 4B models; a 12B parameter model is more comfortable on 32 GB.

  • Leave headroom for other apps. If the system starts swapping, text generation slows.

Windows and most Linux laptops/desktops

  • Usually have separate GPU VRAM and system RAM.

  • With an NVIDIA GPU and 8–12 GB VRAM, mid-size models run more smoothly.

  • CPU-only machines should stick to 1B–4B for reasonable latency.

Linux workstations/servers

  • With higher-VRAM GPUs (for example, 24 GB or more), 12B–27B becomes practical.

  • Use containers and scheduling for reproducibility on multi-user nodes.

General tips

  • A larger context window increases memory use on every platform.

  • If you see slowdowns or out-of-memory errors, reduce context length or step down one model size.

  • Start with the plain model build from the model page. Move to larger sizes only if you need stronger answers and have headroom.


2) Key terminology when downloading models (in plain language)

  • Model size: the 270m/1B/4B/12B/27B label described above

  • Context window: how much text the model can consider at once, shown as a token count such as 32K or 128K. Larger context enables longer prompts and documents, but uses more memory and time

  • Quantization: shrinking a model by storing its numbers in fewer bits. This makes the model smaller and faster, with a small quality drop.

  • QAT (Quantization-Aware Training): the model was trained knowing it would be quantized, so it usually keeps quality better at the same small size.

Info: If you see a QAT version, prefer it. Otherwise, a regular quantized build is still a good laptop choice.

  • Multimodal means a model can take in and/or produce multiple data types, like text, images, audio, or video, and reason across them together.

  • Inference: running a model to produce answers from your prompts.

    • Local inference: the model runs on your own machine.

    • Remote inference: the model runs on a separate server; your laptop sends prompts over the network and receives the answers back, acting only as the client.

  • Variants on model pages: suffixes such as -fp16, -q8_0, or -it-qat. If you are unsure, choose the plain tag without a suffix. That is usually the laptop-friendly default

  • Where models come from: the Ollama library provides convenient pre-packaged entries. Hugging Face is a large catalogue of models for discovery and advanced use
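Model size and quantization interact in a simple back-of-envelope way: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes. For example, a 4B model at 16 bits is about 8 GB, but about 2 GB when quantized to 4 bits. The sketch below is a rough estimate only; it ignores the context window and other runtime overhead, so real memory use is higher.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough size of the weights alone, in GB: params x bits / 8.

    Ignores context-window and runtime overhead, so real usage is higher."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(approx_weight_gb(4, 16))  # 8.0 -> fp16: tight on a 16 GB laptop
print(approx_weight_gb(4, 4))   # 2.0 -> 4-bit quantized: comfortable
```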


3) Minimum hardware requirements for local installation

  • Memory
    8–12 GB RAM → good fit for a 270M–1B parameter model
    16 GB RAM → good fit for a 4B parameter model
    32 GB RAM or a discrete GPU → good fit for a 12B parameter model

  • CPU/GPU
    Models run on both CPUs and GPUs; a GPU improves speed. Apple Silicon Macs run models efficiently thanks to their unified memory architecture (CPU and GPU share one high-bandwidth memory pool).

  • Disk space
    Models are several gigabytes each. Ensure adequate free space

  • Network
    Local by default. Keep the runtime on localhost unless you explicitly plan a remote setup


4) Install the runtime engine (Ollama)

Choose your operating system. After installing, verify with a simple command.

4.1 macOS

  • Download the macOS build from the Ollama website and install it by dragging to Applications

  • Launch Ollama. You should see its icon in the menu bar

  • Homebrew users can install via cask if available in your environment

Verification

  1. Open Terminal

  2. Run a quick command from Section 5

4.2 Windows

  • Download the Windows installer from the Ollama website and run it

  • Allow any security prompts. You should see an icon in the system tray

Verification

  1. Open Command Prompt

  2. Run a quick command from Section 5

4.3 Linux

  • Install from the terminal using the one-line script published on the Ollama website:

    curl -fsSL https://ollama.com/install.sh | sh

  • On most distributions the installer also sets up Ollama as a background service

Verification

  1. Open a terminal

  2. Run a quick command from Section 5


5) Your first model: pull and run

You can start from the command line or use the built-in Ollama interface.

Command line

  1. Choose a model size from Section 1 (see Section 9 for more details). For a modern 16 GB laptop, a 4B parameter model is a safe default

  2. Pull the model (example shown with the gemma3:4b model):

    Command to download the gemma3:4b model

    ollama pull gemma3:4b
  3. Run the model and ask a question:

    Command to start interactive chat with gemma3:4b

    ollama run gemma3:4b

    Type a question at the prompt. Type /bye to exit

  4. List installed models:

    Command to list all installed models

    ollama list

Built-in Ollama GUI

  1. Open the Ollama interface from the system tray or menu bar

  2. Use the interface to download a model that matches your laptop (see Section 1)

  3. Start a chat in the interface

At this point you have a fully working local setup: models installed on your machine and a ChatGPT-style chat experience. You can download appropriately sized models (e.g., gemma3:4b) and chat through the command line or the built-in GUI.


6) Optional GUIs with more features

Optional front-end GUIs add document and file workflows, conversation history, generation parameters such as temperature, and multi-model management. The following tools provide some or all of these features.

  • LM Studio. A desktop application with a friendly interface

  • Open WebUI. A browser interface with many features. It typically runs via a single Docker command and is reachable at http://localhost:3000

  • Community desktop GUIs

These interfaces can improve convenience for document processing, parameter control, and other advanced features.


7) Remote option: point a GUI to a remote Ollama server

Use this when a remote server runs larger models and your laptop only needs to run a GUI.

You need

  • Server address or hostname

  • Port number. The default port number for Ollama is 11434 unless configured differently

Where to set it

  • In Open WebUI. In Settings, set the Ollama API base URL to http://SERVER:PORT

  • Desktop GUIs. Set the Ollama API URL to http://SERVER:PORT

If you cannot connect, check that the remote server is reachable, the firewall allows the port, and Ollama is listening for remote connections. If models do not appear in the GUI, confirm that they are pulled on the server.
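Before digging into firewall rules, a quick TCP check tells you whether SERVER:PORT is reachable at all. This is a minimal sketch; the host and port below are placeholders for your own server's address and Ollama port (default 11434).

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace with your server's address and Ollama port:
print(can_connect("localhost", 11434))
```

If this prints False, the problem is network reachability (hostname, firewall, or Ollama not listening for remote connections), not the GUI configuration.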

This section separates local use from remote use. Next is how to call the runtime from Jupyter outside any GUI.


8) Using models programmatically (e.g., from a Jupyter notebook or any Python script)

This allows you to integrate LLMs into automations and call them from your own code. The examples here use the local endpoint by default; if you are using a remote server, replace the base URL. You only need three things: the endpoint (URL), the model name, and your prompt. Everything else is optional.

8.1 Environment setup

Create or activate a Python environment and install these packages if needed:

Install required Python packages
pip install requests matplotlib pillow

Set the base URL:

Set the Ollama API endpoint URL
API = "http://localhost:11434" # change to "http://SERVER:PORT" for remote

8.2 Example: minimal text generation (streamed output)

Minimal text generation example with streaming
import requests, json, sys

API = "http://localhost:11434"

r = requests.post(
    f"{API}/api/generate",
    json={"model": "gemma3:4b", "prompt": "Explain transformers in one sentence."},
    stream=True,
)
for line in r.iter_lines():
    if line:
        sys.stdout.write(json.loads(line)["response"])
        sys.stdout.flush()
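The /api/generate endpoint streams its reply as one JSON object per line, each carrying a small "response" fragment. The sketch below shows how those fragments are assembled, using canned lines in place of a live server so it runs without Ollama; the example text is made up.

```python
import json

# Canned lines shaped like the streaming /api/generate output
# (each chunk carries a "response" fragment; the final one sets "done").
raw_lines = [
    b'{"response": "Transformers ", "done": false}',
    b'{"response": "use attention.", "done": true}',
]

# Skip empty keep-alive lines and concatenate the fragments:
text = "".join(json.loads(line)["response"] for line in raw_lines if line)
print(text)  # Transformers use attention.
```

This is exactly what the `"".join(...)` expressions in the later examples do against a real response.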

8.3 Example: summarize a local file

Example: summarize content from a local text file
import requests, json

API = "http://localhost:11434"

text = open("sample.txt").read()
r = requests.post(
    f"{API}/api/generate",
    json={"model": "gemma3:4b", "prompt": "Summarize:\n" + text},
    stream=True,
)
summary = "".join(json.loads(l)["response"] for l in r.iter_lines() if l)
print(summary.strip())

8.4 Example: batch summarize a folder and plot summary lengths

Batch summarize text files and visualize summary lengths
import os, requests, json
import matplotlib.pyplot as plt

API = "http://localhost:11434"

def summarize(txt):
    r = requests.post(
        f"{API}/api/generate",
        json={"model": "gemma3:4b", "prompt": "Summarize:\n" + txt},
        stream=True,
    )
    return "".join(json.loads(l)["response"] for l in r.iter_lines() if l).strip()

files = [f for f in os.listdir("docs") if f.endswith(".txt")]
lengths = {f: len(summarize(open(os.path.join("docs", f)).read())) for f in files}

plt.bar(lengths.keys(), lengths.values())
plt.xticks(rotation=45, ha="right")
plt.ylabel("Summary length (characters)")
plt.title("Folder summaries")
plt.show()

8.5 Optional example: image captioning if your model supports images

Some multimodal models accept images. If your chosen model supports this capability, you can send a base64-encoded image.

Image captioning with a multimodal model
import base64, requests, json

API = "http://localhost:11434"

img_b64 = base64.b64encode(open("image.jpg", "rb").read()).decode()
r = requests.post(
    f"{API}/api/generate",
    json={
        "model": "gemma3:4b",
        "prompt": "Describe this image in one sentence.",
        "images": [img_b64],
    },
    stream=True,
)
caption = "".join(json.loads(l)["response"] for l in r.iter_lines() if l)
print(caption.strip())

If your model does not support images, use a vision-enabled model from the model page.


9) Choosing a model on the model page

When you open a model page to download, focus on these items.

  • Name. For example gemma3:1b, gemma3:4b, gemma3:12b, gemma3:27b. The number is the size from Section 1

  • Size. The download size. Larger files are heavier to run

  • Context. For example 128K. Higher context allows longer prompts but uses more memory

  • Variants. Suffixes such as -fp16, -q8_0, or -it-qat. If you are unsure, choose the plain tag with no suffix

Default recommendations

  • Older or 8–12 GB laptops: …:1b or …:270m

  • 16 GB laptops: …:4b

  • 32 GB RAM or discrete GPU: …:12b

If a QAT variant is offered, it is a good choice on laptops for stronger fidelity at a similar footprint.

Some popular recommended models:

https://ollama.com/library/gpt-oss (from OpenAI; a reasoning model)

https://ollama.com/library/gemma3 (from Google; multimodal with vision support)


10) Quick performance tips

  • Start smaller and scale up only if you need better answers

  • Keep prompts focused. Very long prompts increase memory use and time

  • If you see slow responses or memory issues, drop down a size, choose a QAT or plain variant, and close other heavy applications

  • To compare models, time a fixed prompt and note tokens per second and latency
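To compare models with a fixed prompt, a minimal timing harness is enough. The sketch below wraps any generate function and reports latency and tokens per second; a stub stands in for a real model call here, so the numbers are illustrative only.

```python
import time

def measure(generate, prompt: str) -> dict:
    """Time one call to generate(prompt) and report latency and tokens/second.

    `generate` must return a (text, token_count) pair."""
    start = time.perf_counter()
    text, tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return {"latency_s": elapsed, "tokens_per_s": tokens / elapsed}

# Stub standing in for a real model call (sleeps to simulate generation):
def stub_generate(prompt):
    time.sleep(0.05)
    return "hello world", 2

stats = measure(stub_generate, "Explain transformers in one sentence.")
print(f"{stats['latency_s']:.3f} s, {stats['tokens_per_s']:.1f} tokens/s")
```

In practice, swap the stub for a function that calls the /api/generate endpoint and counts the streamed chunks.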


11) Troubleshooting

  • No output in Jupyter. The API streams multiple JSON lines. Collect the response field from each line or keep only the final chunk

  • GUI cannot connect. Ensure Ollama is running. For Open WebUI, ensure Docker Desktop is running. Verify the base URL and port

  • Model not listed. Run ollama list. Pull the model again if necessary

  • macOS Python SSL or LibreSSL warnings. Safe to ignore for http://localhost

  • Low disk space. Remove unused models with:

    Command to remove an unused model

    ollama rm <name>

12) Hugging Face in practice

Use Hugging Face to discover alternative models and read model cards for size, context, and license. Advanced users can import specific variants into Ollama using a Modelfile and GGUF formats. For most users, the pre-packaged Ollama entries are the simplest path.
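For the import path mentioned above, a minimal Modelfile sketch is shown below, assuming you have already downloaded a GGUF file and saved it as model.gguf (the filename and parameter value are illustrative):

```
# Modelfile: point Ollama at a local GGUF file
FROM ./model.gguf

# Optional: set a default sampling parameter
PARAMETER temperature 0.7
```

Register it with ollama create my-model -f Modelfile, then chat with ollama run my-model.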


13) Where to go next

For larger models, fine-tuning, multi-user serving, and performance work, use our HPC systems. Contact us for architecture reviews, scaling advice, and best practices.

Center for Computational Sciences