Getting started using local models with Fido AI
In addition to using hosted AI services from Anthropic, Google, OpenAI and others, Fido AI can also use local models running on your device or local network via LM Studio or Ollama.
Hardware requirements
Local AI inference requires significant computing power. While it’s possible to run models using just your CPU, this is very slow. For practical use, a dedicated GPU (graphics card) is recommended.
GPU Memory (VRAM) is the most critical factor – it determines which models you can run:
- 4 GB: Can run only the smallest models (like moondream)
- 8 GB: Good starting point for experimentation with small-to-medium models
- 16 GB or more: Can run larger, more capable models with good performance
Beyond memory capacity, GPU memory bandwidth and the GPU chip generation also affect how quickly you’ll get responses from your local LLM.
To find out what GPU you have:
- Press Windows + X and select “System”. You can tab until you reach storage, then press right arrow to hear information about your graphics card.
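If you prefer the Command Prompt, another option is to save a DirectX diagnostic report to a text file and read it with your screen reader. The dxdiag tool is built into Windows; the file name below is just an example:
dxdiag /t gpu-report.txt
Wait a few seconds for the report to finish, then open gpu-report.txt in Notepad and look for the Display Devices section, which lists your graphics card and its dedicated memory.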
Select models that fit comfortably within your GPU memory. For example:
- moondream (~2 GB) – Good for CPU only or 4 GB GPUs
- llama3.2-vision (~7-8 GB) – Requires an 8 GB+ GPU
- Larger models require 16 GB+ for optimal performance
Ollama
You can download Ollama from https://ollama.com/ and configure it using the GUI if you prefer. However, the GUI is not ideal with a screen reader, so here are the steps using the Command Prompt.
Open Command Prompt (press Windows + R, type cmd, then press Enter)
Install Ollama by typing:
winget install ollama
Winget is the package manager built into Windows that will download the software and then run the installer. After installation, Ollama will run automatically in the background. You can verify it’s running by typing:
ollama list
(This should show an empty list if no models are installed yet)
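You can also check that Ollama’s background server is responding with curl, which is included with Windows 10 and 11:
curl http://localhost:11434
If Ollama is running, it replies with a short message such as “Ollama is running”.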
Next, let’s download a model that can describe images. A small one to start with is moondream. Type:
ollama pull moondream
Once the download completes, the model sits on disk until it’s loaded into memory. Verify the model is available by typing:
ollama list
You should see moondream in the list. You can get information on a model with the show command. Type:
ollama show moondream
Now let’s see what models are running. Type:
ollama ps
(ps stands for “process status”). If your model is not listed, you can load and run it with:
ollama run moondream
This puts you into a chat experience which you can exit by typing /bye.
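You can also send a single prompt without entering the chat. With vision models such as moondream, Ollama will pick up an image file path included in the prompt; the path below is only an example:
ollama run moondream "Describe this image: C:\Users\you\Pictures\photo.jpg"
Ollama loads the image, passes it to the model along with your prompt, and prints the description.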
Now let’s check that Fido can communicate with Ollama:
In Fido, go to Settings / Setup. Ollama exposes its API at http://localhost:11434 by default – check these details are correct.
Select the Check Ollama local server button. If Ollama is running, it will tell you about the loaded model. If you have loaded more than one, Fido will use the first.
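If the check fails, you can ask Ollama directly which models it has, using the same address Fido uses:
curl http://localhost:11434/api/tags
The reply is a block of JSON naming each model you have pulled. If it is empty or the request fails, go back and check the earlier steps.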
To stop a loaded model, type:
ollama stop [model name]
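For example, to unload moondream:
ollama stop moondream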
To have Fido use the local model, go to Fido / Settings / Alt service and select Ollama: ollama-local-server.
Ollama runs continuously in the background. You can close the Command Prompt window – Ollama will keep running. To stop Ollama, right-click the Ollama icon in your system tray and select Quit.
Ollama also installs a desktop app with a chat window. If you can use this, it can be a convenient way to switch models or to load new ones.
LM Studio
An alternative to Ollama is LM Studio. This uses a graphical user interface that is not great with a screen reader.
To use models via LM Studio, install the app, and download a vision model. Select a model with Ctrl + L. Then start the server with Ctrl + R. It will display the URL where LM Studio is reachable.
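If the GUI proves difficult, LM Studio also ships a command-line tool called lms. This is a sketch that assumes lms is installed and on your PATH:
lms server start
lms ls
The first command starts the local server; the second lists the models you have downloaded.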
Now let’s check that Fido can communicate with LM Studio. In Fido, go to Settings / Setup. By default LM Studio exposes its API at http://127.0.0.1:1234 – check the details are correct.
Select the Check LM Studio local server button. If LM Studio is running, it will tell you about the loaded model. If you have loaded more than one, Fido will use the first.
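If the check fails, you can confirm the server is reachable from the Command Prompt. LM Studio serves an OpenAI-compatible API, so the following request (using the default address above) asks it to list the models it is serving:
curl http://127.0.0.1:1234/v1/models
A JSON reply naming one or more models means the server is running and a model is available.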
To have Fido use the local model, go to Fido / Settings / Alt service and select LM Studio: lm-studio-local-server.
Considerations for Use of Local AI
Running AI models locally offers several advantages and tradeoffs compared to using hosted services like Anthropic, Google, or OpenAI. Understanding these will help you decide when to use local models versus cloud-based services.
Privacy and Data Security
Local models process everything on your device. Your documents, images, and prompts never leave your computer. This makes local AI ideal for sensitive or confidential content where you cannot or prefer not to send data to external servers.
Speed and Performance
Speed depends heavily on your hardware. With a capable GPU (16 GB+ VRAM), local models can respond quickly – often faster than waiting for API calls over the internet. However, with limited hardware (CPU only or small GPU), local processing can be significantly slower than hosted services.
Your internet connection doesn’t matter for processing speed with local models, making them reliable for offline work or in locations with poor connectivity.
Quality and Capability
Local models available today are generally less capable than frontier models like GPT-4, Claude Sonnet, or Gemini Pro. You may notice:
- Less nuanced understanding of complex instructions
- Lower-quality conversions from PDF
- Shorter context windows (less ability to process long documents)
- Reduced ability to produce results in languages other than English
However, local models continue to improve rapidly, and for many tasks, smaller local models can produce perfectly acceptable results.
Cost Considerations
Local AI eliminates per-use API costs. Once you have the hardware, you can process unlimited documents without subscription fees or usage charges. However, consider:
- Initial hardware investment (a capable GPU will cost hundreds of dollars)
- Electricity costs (GPUs consume significant power)
- Hosted services may be more economical for occasional use
Availability and Reliability
Local models work offline and have no rate limits or usage quotas. You’re never blocked by:
- API service outages
- Rate limiting during high-demand periods
- Account issues or payment problems
- Changes to service terms or pricing
In the first few months of experimenting with Fido, we’ve experienced all of these (albeit rarely).
When to Choose Local AI
Local models work best for:
- Processing confidential or sensitive documents
- Working offline or with unreliable internet
- High-volume batch processing (to avoid API costs)
- Tasks where “good enough” quality is acceptable
- Learning and experimentation without usage costs
When to Choose Hosted Services
Hosted services are better for:
- Tasks requiring the highest quality results
- Occasional or light usage (more economical than hardware investment)
- Users without dedicated GPU hardware
- Tasks requiring very large context windows
- Latest model capabilities and features
