Local Deployment of LLM and Text Embedding Models

Introduction⌗
AI technology is advancing rapidly and has spawned many new projects, such as Dify (used in this article) and Ollama, a tool for deploying large language models locally.
These two projects let you build a RAG pipeline locally at zero cost, including context-aware chat Q&A backed by knowledge bases.
System Requirements⌗
Currently, the Windows version of Ollama is still in Preview status. It is recommended to run it on computers with Apple M series processors or Linux servers with GPUs.
The latest llama3.2 requires less memory than the previous generation while maintaining quality comparable to llama3.1. Reportedly 2 GB+ of RAM is sufficient, and it can even run on some mobile devices!
Getting Models⌗
Visit the Ollama official website to download the App for your system (the App includes the CLI tools). Then browse the Models page for the models you need. Here I'm using the latest llama3.2 and mxbai-embed-large.
Purpose of the models:
- llama3.2: Mainly used for Chat text generation
- mxbai-embed-large: Used for document embedding operations
After installation, check the installed Ollama version, since some models require a minimum version:
ollama -v
ollama version is 0.3.12
ollama pull llama3.2
ollama pull mxbai-embed-large
Running Ollama Service⌗
There are two ways to run the Ollama service: open the Ollama App directly, or start it from the CLI. Only one can run at a time; if the App is already running, starting the service from the CLI will fail with a port-binding conflict.
ollama serve --help
Start ollama
Usage:
ollama serve [flags]
Aliases:
serve, start
Flags:
-h, --help help for serve
Environment Variables:
OLLAMA_DEBUG Show additional debug information (e.g. OLLAMA_DEBUG=1)
OLLAMA_HOST IP Address for the ollama server (default 127.0.0.1:11434)
OLLAMA_KEEP_ALIVE The duration that models stay loaded in memory (default "5m")
OLLAMA_MAX_LOADED_MODELS Maximum number of loaded models per GPU
OLLAMA_MAX_QUEUE Maximum number of queued requests
OLLAMA_MODELS The path to the models directory
OLLAMA_NUM_PARALLEL Maximum number of parallel requests
OLLAMA_NOPRUNE Do not prune model blobs on startup
OLLAMA_ORIGINS A comma separated list of allowed origins
OLLAMA_SCHED_SPREAD Always schedule model across all GPUs
OLLAMA_TMPDIR Location for temporary files
OLLAMA_FLASH_ATTENTION Enabled flash attention
OLLAMA_LLM_LIBRARY Set LLM library to bypass autodetection
OLLAMA_GPU_OVERHEAD Reserve a portion of VRAM per GPU (bytes)
OLLAMA_LOAD_TIMEOUT How long to allow model loads to stall before giving up (default "5m")
As you can see, the Ollama service supports configuring startup parameters through environment variables. It is recommended to set the following environment variables:
- OLLAMA_HOST="0.0.0.0"
- OLLAMA_ORIGINS="*"
- OLLAMA_KEEP_ALIVE="24h"
OLLAMA_HOST="0.0.0.0" OLLAMA_ORIGINS="*" OLLAMA_KEEP_ALIVE="24h" ollama serve
2024/09/29 12:37:38 routes.go:1153: INFO server config env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/George/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: http_proxy: https_proxy: no_proxy:]"
time=2024-09-29T12:37:38.522+08:00 level=INFO source=images.go:753 msg="total blobs: 17"
time=2024-09-29T12:37:38.523+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-09-29T12:37:38.523+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.12)"
time=2024-09-29T12:37:38.524+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/var/folders/p6/g0wwf37d165dfg5hvqtr9l180000gn/T/ollama966569504/runners
time=2024-09-29T12:37:38.547+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners=[metal]
time=2024-09-29T12:37:38.577+08:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=metal variant="" compute="" driver=0.0 name="" total="48.0 GiB" available="48.0 GiB"
After the service starts, you can use the following command to view the list of existing models in Ollama:
curl http://127.0.0.1:11434/api/tags | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1391 100 1391 0 0 614k 0 --:--:-- --:--:-- --:--:-- 679k
{
"models": [
{
"name": "llama3.2:latest",
"model": "llama3.2:latest",
"modified_at": "2024-09-28T21:00:44.510551673+08:00",
"size": 2019393189,
"digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"families": [
"llama"
],
"parameter_size": "3.2B",
"quantization_level": "Q4_K_M"
}
},
{
"name": "mxbai-embed-large:latest",
"model": "mxbai-embed-large:latest",
"modified_at": "2024-07-04T16:35:42.71585157+08:00",
"size": 669615493,
"digest": "468836162de7f81e041c43663fedbbba921dcea9b9fefea135685a39b2d83dd8",
"details": {
"parent_model": "",
"format": "gguf",
"family": "bert",
"families": [
"bert"
],
"parameter_size": "334M",
"quantization_level": "F16"
}
}
]
}
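As a side note, the size field in the listing above is in bytes. A quick sketch converting the values shown to GiB:

```python
# Convert the raw "size" values (bytes) from /api/tags into GiB.
models = {
    "llama3.2:latest": 2019393189,
    "mxbai-embed-large:latest": 669615493,
}
for name, size_bytes in models.items():
    print(f"{name}: {size_bytes / 1024**3:.2f} GiB")
# llama3.2:latest: 1.88 GiB
# mxbai-embed-large:latest: 0.62 GiB
```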
Running Models⌗
ollama run llama3.2
>>> Send a message (/? for help)
After running the above command, you have successfully started the LLM. To exit the interactive command line, press Ctrl + D or enter /bye.
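Besides the interactive CLI, the model can also be called programmatically through Ollama's /api/generate endpoint (actually sending the request assumes the service from the previous section is running). A minimal sketch of the request body:

```python
import json

# Request body for Ollama's /api/generate endpoint (non-streaming).
# POST it to http://127.0.0.1:11434/api/generate while the service is running.
payload = {
    "model": "llama3.2",              # any model you pulled earlier
    "prompt": "Why is the sky blue?",
    "stream": False,                  # return one JSON object instead of a stream
}
print(json.dumps(payload, indent=2))

# Equivalent CLI call:
#   curl http://127.0.0.1:11434/api/generate \
#     -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
```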
After exiting, use the following command to view the running models:
ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3.2:latest a80c4f17acd5 4.0 GB 100% GPU 24 hours from now
Note: The Text Embedding model does not need to be started manually with the run command. Once the model is pulled and the Ollama service is running, you can call the Text Embedding API directly!
Try calling Ollama’s Text Embedding API:
curl http://127.0.0.1:11434/api/embed -d '{
"model": "mxbai-embed-large",
"input": "Why is the sky blue?"
}' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12889 0 12820 100 69 214k 1183 --:--:-- --:--:-- --:--:-- 217k
{
"model": "mxbai-embed-large",
"embeddings": [
[
-0.010455716,
-0.0025081504,
0.014609501,
.....
-0.004249276,
-0.010569806,
-0.00813636
]
],
"total_duration": 57530917,
"load_duration": 708625,
"prompt_eval_count": 6
}
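In a RAG pipeline, embedding vectors like the one above are typically compared with cosine similarity to retrieve relevant documents. A minimal sketch with made-up 3-dimensional vectors standing in for real embeddings (mxbai-embed-large actually produces 1024-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embedding vectors:
query_vec = [0.1, 0.2, 0.3]
doc_vec   = [0.1, 0.2, 0.25]
print(cosine_similarity(query_vec, doc_vec))  # ~0.996, i.e. very similar
```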
After calling the Text Embedding API, Ollama loads the model automatically. Running the ps command at this point gives the following result:
ollama ps
NAME ID SIZE PROCESSOR UNTIL
mxbai-embed-large:latest 468836162de7 1.2 GB 100% GPU 24 hours from now
llama3.2:latest a80c4f17acd5 4.0 GB 100% GPU 24 hours from now
Conclusion⌗
At this point, the local deployment of models is complete. The next step is to use the LLM and Text Embedding models running in Ollama to configure your own RAG application in Dify.
Besides the models above, feel free to try others on your own. The AI field iterates quickly; a few days from now, a new model may be released that uses fewer resources while delivering better quality.
What we can do is to continue learning and keep up with the latest developments!
I hope this is helpful. Happy hacking!