Getting started using local models with Fido AI
In addition to using hosted AI services from Anthropic, Google, OpenAI and others, Fido AI can also use local models running on your device or local network via LM Studio or Ollama.
Hardware requirements
Local AI inference requires significant computing power. While it’s possible to run models using just your CPU, this is very slow. For practical use, a dedicated GPU (graphics card) is recommended.
GPU Memory (VRAM) is the most critical factor – it determines which models you can run:
- 4 GB: Can run only the smallest models (like moondream)
- 8 GB: Good starting point for experimentation with small-to-medium models
- 16 GB or more: Can run larger, more capable models with good performance
Beyond memory capacity, GPU memory bandwidth and the GPU chip generation also affect how quickly you’ll get responses from your local LLM.
To find out what GPU you have:
- Press Windows + X and select “System”. You can tab until you reach storage, then press right arrow to hear information about your graphics card.
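If you prefer the Command Prompt, another option is to save a DirectX diagnostic report to a text file and read it with your screen reader. The dxdiag tool is built into Windows; the file name below is just an example:
dxdiag /t gpu-report.txt
Wait a few seconds for the report to finish, then open gpu-report.txt in Notepad and look for the Display Devices section, which lists your graphics card and its dedicated memory.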
Select models that fit comfortably within your GPU memory. For example:
- moondream (~2 GB) – Good for CPU only or 4 GB GPUs
- llama3.2-vision (~7-8 GB) – Requires an 8 GB+ GPU
- Larger models require 16 GB+ for optimal performance
Ollama
You can download Ollama from https://ollama.com/ and configure it using the GUI if you prefer. However, the GUI is not ideal with a screen reader, so here are the steps using the Command Prompt.
Open Command Prompt (press Windows + R, type cmd, then press Enter)
Install Ollama by typing:
winget install ollama
Winget is the package manager built into Windows that will download the software and then run the installer. After installation, Ollama will run automatically in the background. You can verify it’s running by typing:
ollama list
(This should show an empty list if no models are installed yet)
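You can also check that Ollama’s background server is responding with curl, which is included with Windows 10 and 11:
curl http://localhost:11434
If Ollama is running, it replies with a short message such as “Ollama is running”.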
Next, let’s download a model that can describe images. A small one to start with is moondream. Type:
ollama pull moondream
Once the download completes, the model sits on disk until it’s loaded into memory. Verify the model is available by typing:
ollama list
You should see moondream in the list. You can get information on a model with the show command. Type:
ollama show moondream
Now let’s see what models are running. Type:
ollama ps
(ps stands for “process status”). If your model is not listed, you can load and run it with:
ollama run moondream
This puts you into a chat experience which you can exit by typing /bye.
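You can also send a single prompt without entering the chat. With vision models such as moondream, Ollama will pick up an image file path included in the prompt; the path below is only an example:
ollama run moondream "Describe this image: C:\Users\you\Pictures\photo.jpg"
Ollama loads the image, passes it to the model along with your prompt, and prints the description.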
Now let’s check that Fido can communicate with Ollama:
In Fido, go to Settings / Setup. Ollama exposes its API at http://localhost:11434 by default – check these details are correct.
Select the Check Ollama local server button. If Ollama is running, it will tell you about the loaded model. If you have loaded more than one, Fido will use the first.
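If the check fails, you can ask Ollama directly which models it has, using the same address Fido uses:
curl http://localhost:11434/api/tags
The reply is a block of JSON naming each model you have pulled. If it is empty or the request fails, go back and check the earlier steps.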
To stop a loaded model, type:
ollama stop [model name]
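For example, to unload moondream:
ollama stop moondream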
To have Fido use the local model, go to Fido / Settings / Alt service and select Ollama: ollama-local-server.
Ollama runs continuously in the background. You can close the Command Prompt window – Ollama will keep running. To stop Ollama, right-click the Ollama icon in your system tray and select Quit.
Ollama also installs a desktop app with a chat window. If you can use this, it can be a convenient way to switch models or to load new ones.
LM Studio
An alternative to Ollama is LM Studio. This uses a graphical user interface that is not great with a screen reader.
To use models via LM Studio, install the app, and download a vision model. Select a model with Ctrl + L. Then start the server with Ctrl + R. It will display the URL where LM Studio is reachable.
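If the GUI proves difficult, LM Studio also ships a command-line tool called lms. This is a sketch that assumes lms is installed and on your PATH:
lms server start
lms ls
The first command starts the local server; the second lists the models you have downloaded.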
Now let’s check that Fido can communicate with LM Studio. In Fido, go to Settings / Setup. By default LM Studio exposes its API at http://127.0.0.1:1234 – check the details are correct.
Select the Check LM Studio local server button. If LM Studio is running, it will tell you about the loaded model. If you have loaded more than one, Fido will use the first.
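If the check fails, you can confirm the server is reachable from the Command Prompt. LM Studio serves an OpenAI-compatible API, so the following request (using the default address above) asks it to list the models it is serving:
curl http://127.0.0.1:1234/v1/models
A JSON reply naming one or more models means the server is running and a model is available.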
To have Fido use the local model, go to Fido / Settings / Alt service and select LM Studio: lm-studio-local-server.
Considerations for Use of Local AI
Running AI models locally offers several advantages and tradeoffs compared to using hosted services like Anthropic, Google, or OpenAI. Understanding these will help you decide when to use local models versus cloud-based services.
Privacy and Data Security
Local models process everything on your device. Your documents, images, and prompts never leave your computer. This makes local AI ideal for sensitive or confidential content where you cannot or prefer not to send data to external servers.
Speed and Performance
Speed depends heavily on your hardware. With a capable GPU (16 GB+ VRAM), local models can respond quickly – often faster than waiting for API calls over the internet. However, with limited hardware (CPU only or small GPU), local processing can be significantly slower than hosted services.
Your internet connection doesn’t matter for processing speed with local models, making them reliable for offline work or in locations with poor connectivity.
Quality and Capability
Local models available today are generally less capable than frontier models like GPT-4, Claude Sonnet, or Gemini Pro. You may notice:
- Less nuanced understanding of complex instructions
- Lower-quality conversions from PDF
- Shorter context windows (less ability to process long documents)
- Reduced ability to produce results in languages other than English
However, local models continue to improve rapidly, and for many tasks, smaller local models can produce perfectly acceptable results.
Cost Considerations
Local AI eliminates per-use API costs. Once you have the hardware, you can process unlimited documents without subscription fees or usage charges. However, consider:
- Initial hardware investment (a capable GPU will cost hundreds of dollars)
- Electricity costs (GPUs consume significant power)
- Hosted services may be more economical for occasional use
Availability and Reliability
Local models work offline and have no rate limits or usage quotas. You’re never blocked by:
- API service outages
- Rate limiting during high-demand periods
- Account issues or payment problems
- Changes to service terms or pricing
In the first few months of experimenting with Fido, we’ve experienced all of these (albeit rarely).
When to Choose Local AI
Local models work best for:
- Processing confidential or sensitive documents
- Working offline or with unreliable internet
- High-volume batch processing (to avoid API costs)
- Tasks where “good enough” quality is acceptable
- Learning and experimentation without usage costs
When to Choose Hosted Services
Hosted services are better for:
- Tasks requiring the highest quality results
- Occasional or light usage (more economical than hardware investment)
- Users without dedicated GPU hardware
- Tasks requiring very large context windows
- Latest model capabilities and features
