I’ve been interested in playing around with things like ChatGPT for a while, but sadly they have been locked up behind walled gardens. Luckily someone has created an alternative you can run yourself: LocalAI. I figured I would explore what it would take to deploy it on Kubernetes.

The instructions

Looks like LocalAI publishes a Helm chart, so this might be easy. The chart’s values file makes the setup look simple.
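For reference, the plain Helm equivalent of what I ran would be roughly the following; the go-skynet repo alias is just my naming and the chart installs with its defaults:

helm repo add go-skynet https://go-skynet.github.io/helm-charts/
helm repo update
helm install r0 go-skynet/local-ai --namespace xp-localai --create-namespace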

Kicking things off with Terraform was fairly simple:

resource "helm_release" "xp-r0" {
  repository = "https://go-skynet.github.io/helm-charts/"
  chart      = "local-ai"
  name       = "r0"
  namespace  = "xp-localai"
}

This takes a bit, as an initContainer pulls the GPT4All model first; that took approximately 3 minutes on my homelab. Next the primary container of the pod compiles the service, consuming all 24 CPUs and approximately 3.7GiB of RAM while doing so. That took approximately 15 minutes on 2012-era Xeons.
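To keep an eye on it, something along these lines works; I am assuming the chart names the deployment r0-local-ai, so adjust to whatever kubectl get pods actually shows:

# Watch the pod move from the init container (model download) into the main container (compile)
kubectl -n xp-localai get pods -w

# Follow the build output once the main container starts
kubectl -n xp-localai logs -f deploy/r0-local-ai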

Exploring the API while waiting

Most of the examples are around the Chat Completion API. Each message carries a role. Based on the API reference and the OpenAI knowledge base, roles are how you provide context and state for the underlying model to respond with.
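As a sketch of what the role field buys you, a request can pair a system message (context) with a user message; the model name below is just a placeholder:

curl http://r0-local-ai.xp-localai.svc.workshop.k8s/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "some-model.bin",
     "messages": [
       {"role": "system", "content": "You are a terse assistant running in a homelab."},
       {"role": "user", "content": "Say this is a test!"}
     ],
     "temperature": 0.7
   }'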

Trying it out

My command looked like:

curl http://r0-local-ai.xp-localai.svc.workshop.k8s/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

Which then crashed the pod, causing a restart. There seem to be two possible error messages which might be the cause:

1.) A missing language model: error loading model: failed to open /models/ggml-koala-7b-model-q4_0-r2.bin: No such file or directory\nllama_init_from_file: failed to load model

2.) A SIGILL: illegal instruction raised from cgo
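Both of those surfaced in the logs of the crashed container; this is roughly how I would go digging, again assuming the deployment is named r0-local-ai:

# Logs from the previous (crashed) container instance
kubectl -n xp-localai logs deploy/r0-local-ai --previous

# Last state, exit codes, and restart counts
kubectl -n xp-localai describe pods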

I am guessing #2 is the bigger issue since I am running on older Xeons which lack a number of newer instruction sets. Eventually I would like to figure out how to surface processor extensions as labels on a node. For now I’ll pin the pod to a node and provide storage in hopes of avoiding the 15-minute rebuild on every restart.
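The likely route for those labels, which I have not set up yet, is node-feature-discovery: it publishes CPU capability labels such as feature.node.kubernetes.io/cpu-cpuid.AVX2 or cpu-cpuid.AVX512F that a nodeSelector could key off. Something like this would list them:

# Assuming node-feature-discovery is installed, CPU extensions show up as node labels
kubectl get node molly -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep cpu-cpuid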

The new values.yaml file looks like the following:

models:
  persistence:
    enabled: true
    storageClass: "longhorn"

nodeSelector:
  "kubernetes.io/hostname": "molly"

Welp. Similar issue. After looking at /proc/cpuinfo and double-checking, the Celeron N5105 does not have the AVX512 extension either. I’m guessing that is the underlying reason for the fault.
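For the record, the check is just grepping the flags line on the host:

# List the AVX-family extensions the CPU reports (avx, avx2, avx512f, ...); empty output means none
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep avx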

Raspberry Pi

In an odd twist of fate, the Raspberry Pi took about 45 minutes to compile the service and become ready. Attempting the same command resulted in an actual response:

{"error":{"code":500,"message":"could not load model - all backends returned error: 11 errors occurred:\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\n","type":""}}

Modifying the command to match the given model might be my ticket:

curl http://r0-local-ai.xp-localai.svc.workshop.k8s/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-j.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

This command hung. After reviewing the homelab K8s control plane it looks like the node was knocked off-line, and SSH was no longer responsive on the host. It appears I will need to review the kubelet’s system CPU and memory reservations to avoid resource exhaustion like this again. The device is also hot enough to be uncomfortable to touch.
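I have not made that change yet, but the kubelet flags I would be reaching for are the reservation and eviction ones; the values below are guesses for an 8Gi Pi, and how they get wired in depends on the distro (kubeadm drop-in, k3s config, and so on):

# Keep headroom for the OS and Kubernetes daemons, and evict pods before the node starves
--system-reserved=cpu=500m,memory=512Mi
--kube-reserved=cpu=250m,memory=512Mi
--eviction-hard=memory.available<256Mi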

Last chance!

Fine, I’ll use my EKS cluster. Relevant Terraform:

resource "kubernetes_namespace" "xp_localai" {
  metadata {
    name = "xp-localai"
  }
}

resource "helm_release" "xp-r0" {
  repository = "https://go-skynet.github.io/helm-charts/"
  chart      = "local-ai"
  name       = "r0"
  namespace  = "xp-localai"

  values = [file("${path.module}/xp-localai.yaml")]
  atomic = false
  wait   = false

  depends_on = [kubernetes_namespace.xp_localai]
}

The referenced values file is below. I disabled persistence since my EKS cluster needs some components updated in order to deal with API and credential drift.

models:
  persistence:
    enabled: false

resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 2
    memory: 2Gi

After waiting roughly 15 minutes the service is online and ready to go, and a bit of port forwarding gets me to the API from my laptop.
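The forwarding itself is nothing fancy; the chart’s Service answers on port 80 (same as the in-cluster URL earlier) and kubectl picks a random local port, 56907 in this case:

# Forward a random local port to the Service's port 80; kubectl prints the port it chose
kubectl -n xp-localai port-forward svc/r0-local-ai :80

With the tunnel up, the curl command looks like this: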

curl http://localhost:56907/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-j.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

Which sadly resulted in the following:

llama.cpp: loading model from /models/ggml-gpt4all-j.bin
/build/entrypoint.sh: line 11:  7725 Killed                  ./local-ai "$@"
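That Killed smells like the OOM killer, given the 2Gi limit above. A quick way to confirm is to check the container’s last state for an OOMKilled reason:

# Look for "Last State: Terminated" with "Reason: OOMKilled" and the restart count
kubectl -n xp-localai describe pods | grep -A5 'Last State'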

Preventing Rebuilds

Using LocalAI in anger will require pod restarts with minimal downtime, and the rebuild process currently costs about 15 minutes per restart. There is a solution in the works, however it has not made its way into the Helm chart yet.

RAM requirements

Well, with my t3.mediums running 4Gi of RAM, I do not have enough memory to run GPT4All on those instances. It looks like the 8Gi required by gpt4all is already highly optimized. My most memory-generous system has the RAM but no AVX512, the Pi only has 8Gi in total, and I am unwilling to spin up larger EC2 instances right now, so I guess that just leaves my laptop for now.
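Checking what each node can actually offer makes the shortfall obvious:

# Allocatable memory per node, i.e. what the scheduler can actually hand out
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOC_MEM:.status.allocatable.memory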

Running LocalAI on OSX for M1 Macs

Since M1s are not really supported, this leaves me with building it from source. The instructions are fairly straightforward though, so I guess I will give this a whirl?
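The build boils down to cloning the repo and running make; this is paraphrased from the upstream README rather than copied, and the --models-path flag is how I recall pointing it at a local models directory:

# Clone and build the local-ai binary
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build

# Drop a model into ./models and start the API
./local-ai --models-path ./models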

It took a while to compile the service and download the model, however it worked flawlessly. This might be worth running in the future, especially if I can figure out how to get around the need for more advanced CPU instruction sets.