![](https://nostr.build/i/p/nostr.build_19e442ed3ade3ebb87c32e547d782eac379e4099d181f9e284e477b819add46c.jpg)
@ John Dee
2025-02-09 01:43:24
Alright, Tabby wouldn't work right with CUDA and it was impossible to troubleshoot anything.
Now we're going to try [Twinny](https://github.com/twinnydotdev/twinny).
You'll need:
* Ubuntu 24.04 or some other Linux flavor
* [VSCodium](https://github.com/VSCodium/vscodium)
* GPU with enough VRAM to hold your chosen models
* It might work with just a CPU, so give it a try!
# Install Ollama
Ollama loads the AI models into VRAM (and offloads into RAM if needed) and offers inference through an API. The Ollama install script adds a dedicated `ollama` user and a systemd service so that Ollama starts automatically at boot. Ollama will unload models from memory when idle.
```
curl -fsSL https://ollama.com/install.sh | sh
```
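A quick sanity check after the script finishes, assuming it set up the `ollama` service as described:
```
# The install script should have created and started a systemd unit
systemctl status ollama
# The CLI prints its version (and warns if it can't reach the local API)
ollama --version
```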
## Configuring Remote Ollama
Skip this section if your GPU is on the same machine you're coding from. Ollama and Twinny default to a local configuration.
I'm using Ollama on a separate machine with a decent GPU, so I need to configure it to bind to the network interface instead of `localhost` by editing the systemd service:
1. Edit the service override file:
```
systemctl edit ollama.service
```
2. Add this in the top section; it will be merged with the default service file.
```
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```
Note: I had a problem with Ollama only binding to the `tcp6` port and wasted a lot of time trying to fix it. Check with `netstat -na | grep 11434` to see whether it only shows a tcp6 listener on port 11434. The workaround for me was using the actual interface IP instead of `0.0.0.0`. If you do this, you'll also need to either set `OLLAMA_HOST` in your shell or prefix every ollama command with `OLLAMA_HOST=<hostname>`, otherwise the CLI complains that it can't connect to Ollama.
3. Restart Ollama:
```
systemctl daemon-reload
systemctl restart ollama
```
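To confirm the override took effect, check the listening socket on the GPU machine and then hit the API from the machine you code on. `<remote-hostname>` is a placeholder for whatever your GPU box is called; `/api/version` is a tiny read-only endpoint that's handy for connectivity tests.
```
# On the GPU machine: the listener should now be on 0.0.0.0 (or your interface IP)
netstat -na | grep 11434

# From the coding machine: make sure the API is reachable over the network
curl http://<remote-hostname>:11434/api/version

# If you had to bind to an interface IP instead of 0.0.0.0, export OLLAMA_HOST
# so plain ollama commands can still find the server
export OLLAMA_HOST=<remote-hostname>:11434
ollama list
```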
## Download Models
Ollama manages models by importing them into its own special folders. You can use Ollama to download files from the [Ollama library](https://ollama.com/library), or you can import already-downloaded models by creating a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md). Modelfiles define various parameters for the model, and getting them right matters for performance, so it's usually better to pull models from the library, even though that wastes bandwidth downloading the same weights over and over for all these tools.
You'll want a model for chatting and a model for fill-in-middle (FIM) that does the code completions.
```
# Chat model, defaults to 7b with Q4_K_M quant, downloads 4.7 GB
ollama pull qwen2.5-coder
# Fill-in-Middle model for code completion, another 4.7 GB
ollama pull qwen2.5-coder:7b-base
```
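Once the downloads finish, a one-shot prompt is an easy way to confirm the chat model loads and responds. Prefix the command with `OLLAMA_HOST=<hostname>` if your GPU lives on another machine.
```
# Loads the model into VRAM on first use, then prints a single response
ollama run qwen2.5-coder "Write a Python function that reverses a string."
```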
The RAG/embeddings feature is optional but probably worth using. It needs to run from a local instance of Ollama. The embedding models are very small and run fine on CPU. Most people use `nomic`, but Twinny defaults to `all-minilm`, which is a little smaller.
```
ollama pull all-minilm
```
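To make sure embeddings work, you can ask the local API for a vector directly; this hits the same `/api/embed` endpoint that Twinny's embedding provider will use.
```
# Returns an "embeddings" array for the given input text
curl http://localhost:11434/api/embed \
  -H "Content-Type: application/json" \
  -d '{ "model": "all-minilm", "input": "hello world" }'
```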
# Install Twinny
Install the [Twinny extension](https://open-vsx.org/extension/rjmacarthy/twinny) by searching for `twinny` in VSCodium extensions.
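If you prefer the command line, the same extension can be installed with VSCodium's CLI; the extension ID comes from the Open VSX page above.
```
codium --install-extension rjmacarthy.twinny
```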
Look for the Twinny icon in the left side panel and click it to open the Twinny panel. Be careful if you drag it over to the right side: sometimes the panel doesn't appear at all. I managed to make it show up by selecting some text and running a Twinny action on it from the context menu.
Twinny assumes you're using a local instance of Ollama. If you want to use a remote instance, you'll need to go into the **Manage Twinny providers** section (the icon looks like a power plug) and add new providers that point to your remote Ollama instance.
These are the settings I'm using for a remote Ollama instance:
## Chat provider
* Type: `chat`
* Provider: `ollama`
* Protocol: `http`
* Model Name: `qwen2.5-coder:latest`
* Hostname: *remote hostname*
* Port: `11434`
* API Path: `/v1` (despite what the Twinny docs and Ollama docs say)
* API Key: *blank*
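If chat doesn't respond, it helps to hit the endpoint Twinny will be using directly. With API Path `/v1`, Twinny is talking to Ollama's OpenAI-compatible API, so a request like this (swap in your hostname) should return a completion:
```
curl http://<remote-hostname>:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:latest",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```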
## FIM Provider
* Type: `fim`
* FIM Template: `codeqwen`
* Provider: `ollama`
* Protocol: `http`
* Model Name: `qwen2.5-coder:7b-base`
* Hostname: *remote hostname*
* Port: `11434`
* API Path: `/api/generate`
* API Key: *blank*
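Twinny builds the actual fill-in-middle prompt from the `codeqwen` template, but a plain request against `/api/generate` is enough to confirm the hostname, port, and model name are right:
```
curl http://<remote-hostname>:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b-base",
    "prompt": "def reverse_string(s):",
    "stream": false,
    "options": { "num_predict": 32 }
  }'
```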
## Embedding Provider
* Type: `embedding`
* Provider: `ollama`
* Protocol: `http`
* Model Name: `nomic-embed-text`
* Hostname: `0.0.0.0`
* Port: `11434`
* API Path: `/api/embed`
* API Key: *blank*
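Note that `nomic-embed-text` wasn't part of the downloads above; if you use it here instead of `all-minilm`, pull it first:
```
ollama pull nomic-embed-text
```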
# Use Twinny
## Chat
Select code and it will be used as context in the chat, or right-click the selection to see more options like refactor, write docs, and write tests.
## Embeddings
Switch over to the Embeddings tab and click **Embed documents** to index all the documents in your workspace for extra context. Type `@` in the chat to reference the entire workspace, specific problems, or specific files.
## Code Completion
Suggested completions show up in gray italics at the cursor location. Press `Tab` to accept.