Exploring LocalAI
• Mark Eschbach
I’ve been interested in playing around with things like ChatGPT for a while. Sadly these have mostly been locked behind walled gardens. Luckily someone has created an alternative you can run yourself: LocalAI. Figured I would explore what it would take to deploy it on Kubernetes.
The instructions
Looks like LocalAI publishes a Helm chart, so this might be super easy. The values file for the Helm chart makes this look super simple.
Kicking things off was fairly simple:
resource "helm_release" "xp-r0" {
  repository = "https://go-skynet.github.io/helm-charts/"
  chart      = "local-ai"
  name       = "r0"
  namespace  = "xp-localai"
}
This takes a bit as an initContainer will pull the GPT4All model. This took approximately 3 minutes on my homelab. Next the primary container of the pod builds something, consuming all 24 CPUs and approximately 3.7GiB of RAM while compiling. Took approximately 15 minutes on 2012-era Xeons.
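Nothing fancy is needed to watch this happen; a couple of stock kubectl commands kept an eye on it (the pod name below is a placeholder for whatever the chart generates):
# Watch the pod move from Init through to Running while the model downloads and the build runs
kubectl get pods -n xp-localai -w
# Once it shows up, dump logs from every container, initContainer included
kubectl logs -n xp-localai <r0-local-ai-pod> --all-containers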
Exploring the API while waiting
Most of the examples are around the Chat Completion API. There seems to be a role field in the messages. Based on the API reference and the OpenAI knowledge base, this provides context and state for the underlying model to respond with.
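As a rough sketch of how role fits into a request, a system message can be sent ahead of the user message to steer the response; the hostname and model name here simply mirror the commands later in this post, not anything the docs prescribe:
curl http://r0-local-ai.xp-localai.svc.workshop.k8s/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-gpt4all-j.bin",
  "messages": [
    {"role": "system", "content": "You are a terse assistant running in a homelab."},
    {"role": "user", "content": "Say this is a test!"}
  ],
  "temperature": 0.7
}'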
Trying it out
My command looked like:
curl http://r0-local-ai.xp-localai.svc.workshop.k8s/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Which then crashed the pod, causing a restart. There seem to be two possible error messages which might be the cause (see the log-retrieval sketch after the list):
1.) The first is a missing language model: error loading model: failed to open /models/ggml-koala-7b-model-q4_0-r2.bin: No such file or directory\nllama_init_from_file: failed to load model
2.) The second is a SIGILL: illegal instruction in cgo
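Since the pod restarts after each crash, the interesting output lives in the previous container instance; something like this pulls it back (pod name is a placeholder):
# Logs from the container that just crashed rather than the freshly restarted one
kubectl logs -n xp-localai <r0-local-ai-pod> --previous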
I am guessing #2 is the bigger issue since I am running on older Xeons which lack a number of newer instructions. Eventually I would like to figure out how to surface processor extensions as labels on a node. For now I’ll pin the pod to a specific node and provide persistent storage in hopes of avoiding the 15-minute rebuild on every restart.
The new values.yaml file looks like the following:
models:
  persistence:
    enabled: true
    storageClass: "longhorn"
nodeSelector:
  "kubernetes.io/hostname": "molly"
Welp. Similar issue. After looking at /proc/cpuinfo and double-checking, the Celeron N5105 does not have the AVX512 extension either. Guessing that is the underlying reason for the fault.
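For what it’s worth, this is roughly how I eyeball the vector extensions a host actually offers:
# List the SIMD-related flags this CPU advertises; no avx512* entries means no AVX512
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'sse|avx|fma' | sort -u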
Raspberry Pi
In an odd twist of fate the Raspberry Pi took about 45 minutes to compile the service and be ready. Attempting the same command resulted in an actual response:
{"error":{"code":500,"message":"could not load model - all backends returned error: 11 errors occurred:\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\t* failed loading model\n\n","type":""}}
Modifying the command to match the given model might be my ticket:
curl http://r0-local-ai.xp-localai.svc.workshop.k8s/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-gpt4all-j.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
This command hung. After reviewing the homelab K8s control plane it looks like the node was knocked offline and ssh was no longer responsive on the host. It appears I will need to review the system CPU and memory reservations to avoid resource exhaustion again. The device is also hot enough to be uncomfortable to touch.
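Once the Pi rejoins the cluster, comparing its physical capacity against what the kubelet actually hands out seems like the right starting point; the node name is a placeholder for the Pi’s hostname:
# Capacity vs. Allocatable shows how much headroom the kubelet leaves for pods on this node
kubectl describe node <pi-node> | grep -A 7 -E 'Capacity|Allocatable'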
Last chance!
Fine, I’ll use my EKS cluster. Relevant Terraform:
resource "kubernetes_namespace" "xp_localai" {
  metadata {
    name = "xp-localai"
  }
}

resource "helm_release" "xp-r0" {
  repository = "https://go-skynet.github.io/helm-charts/"
  chart      = "local-ai"
  name       = "r0"
  namespace  = "xp-localai"
  values     = [file("${path.module}/xp-localai.yaml")]
  atomic     = false
  wait       = false

  depends_on = [kubernetes_namespace.xp_localai]
}
The referenced values file follows. I disabled persistence since my EKS cluster needs some components updated in order to deal with API and credential drift.
models:
  persistence:
    enabled: false
resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 2
    memory: 2Gi
After waiting 15 minutes the service is online and ready to go. Next up, a bit of port forwarding so curl can reach it from my laptop.
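A minimal sketch of the tunnel, assuming the chart exposes the API on service port 80 (the in-cluster URL earlier needed no explicit port); leaving the local port blank lets kubectl pick a random one, which is where 56907 comes from:
# Forward a random local port to the LocalAI service; kubectl prints the port it chose
kubectl port-forward -n xp-localai svc/r0-local-ai :80
With the tunnel up, the curl command looks like this: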
curl http://localhost:56907/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-gpt4all-j.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Which sadly resulted in the following:
llama.cpp: loading model from /models/ggml-gpt4all-j.bin
/build/entrypoint.sh: line 11: 7725 Killed ./local-ai "$@"
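A bare Killed with no stack trace smells like a memory kill rather than another SIGILL. The container status records the termination reason, which is a quick way to confirm (pod name is a placeholder):
# "OOMKilled" here means the container blew past its memory limit rather than hitting a bad instruction
kubectl get pod -n xp-localai <r0-local-ai-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'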
Preventing Rebuilds
Using LocalAI in anger will require container restarts with minimal downtime. The rebuild process currently costs roughly 15 minutes per restart. There is a solution in the works; however, it has not made its way into the Helm chart yet.
RAM requirements
Well, with my t3.mediums running 4Gi of RAM I do not have enough RAM to run GPT4All on those instances. It looks like the 8Gi required by GPT4All is actually highly optimized. With my most memory-generous system having the RAM but no AVX512, the Pi only having 8Gi in total, and me being unwilling to spin up larger EC2 instances right now, I guess that just leaves my laptop for now.
Running LocalAI on OSX for M1 Macs
Since M1s are not really supported, this leaves me with building it from source. The instructions are fairly straightforward though, so I guess I will give this a whirl?
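Roughly, the build boils down to the following; the make target and the --models-path flag are from the LocalAI README as I remember it, so treat them as assumptions and defer to the project’s current instructions:
# Grab the source and build the local-ai binary (this is the long compile step)
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
# Drop a ggml model (e.g. ggml-gpt4all-j.bin) into ./models, then start the API
./local-ai --models-path ./models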
It took a while to compile the system and download the model, however it worked flawlessly. Might be worth considering running it this way in the future, especially if I can figure out how to get around the need for more advanced CPU instruction sets.