Introduction

Technology innovation in the AI field is now advancing rapidly, giving birth to many emerging projects, such as Dify used in this article and Ollama for local deployment of large language models.

These two projects allow you to implement RAG at zero cost locally, as well as context-based chat Q&A using knowledge bases.

System Requirements

Currently, the Windows version of Ollama is still in Preview status. It is recommended to run it on computers with Apple M series processors or Linux servers with GPUs.

The latest llama3.2 requires less memory than the previous generation while maintaining quality comparable to llama3.1. It is said that 2GB+ is sufficient, and it can even run on some mobile devices!

Getting Models

Visit the Ollama official website to download the App for your system. This App includes CLI tools. Then visit Models to check the models you need to use. Here I’m using the latest llama3.2 and mxbai-embed-large.

Purpose of the models:

  • llama3.2: Mainly used for Chat text generation
  • mxbai-embed-large: Used for document embedding operations

After installation, check the corresponding version of Ollama. Some models have special version requirements!

ollama -v
ollama version is 0.3.12
ollama pull llama3.2
ollama pull mxbai-embed-large

Running Ollama Service

There are two ways to run the Ollama service: one is to directly open the Ollama App, and the other is to start it through the CLI. When the App is running, the CLI cannot run because this would cause a port binding conflict.

ollama serve --help
Start ollama

Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help   help for serve

Environment Variables:
      OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
      OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
      OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
      OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU
      OLLAMA_MAX_QUEUE           Maximum number of queued requests
      OLLAMA_MODELS              The path to the models directory
      OLLAMA_NUM_PARALLEL        Maximum number of parallel requests
      OLLAMA_NOPRUNE             Do not prune model blobs on startup
      OLLAMA_ORIGINS             A comma separated list of allowed origins
      OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs
      OLLAMA_TMPDIR              Location for temporary files
      OLLAMA_FLASH_ATTENTION     Enabled flash attention
      OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
      OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)
      OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")

As you can see, the Ollama service supports configuring startup parameters through environment variables. It is recommended to set the following environment variables:

  • OLLAMA_HOST=“0.0.0.0”
  • OLLAMA_ORIGINS="*"
  • OLLAMA_KEEP_ALIVE=“24h”
OLLAMA_HOST="0.0.0.0" OLLAMA_ORIGINS="*" OLLAMA_KEEP_ALIVE="24h" ollama serve
2024/09/29 12:37:38 routes.go:1153: INFO server config env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/George/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: http_proxy: https_proxy: no_proxy:]"
time=2024-09-29T12:37:38.522+08:00 level=INFO source=images.go:753 msg="total blobs: 17"
time=2024-09-29T12:37:38.523+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-09-29T12:37:38.523+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.12)"
time=2024-09-29T12:37:38.524+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/var/folders/p6/g0wwf37d165dfg5hvqtr9l180000gn/T/ollama966569504/runners
time=2024-09-29T12:37:38.547+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners=[metal]
time=2024-09-29T12:37:38.577+08:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=metal variant="" compute="" driver=0.0 name="" total="48.0 GiB" available="48.0 GiB"

After the service starts, you can use the following command to view the list of existing models in Ollama:

curl http://127.0.0.1:11434/api/tags | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1391  100  1391    0     0   614k      0 --:--:-- --:--:-- --:--:--  679k
{
  "models": [
    {
      "name": "llama3.2:latest",
      "model": "llama3.2:latest",
      "modified_at": "2024-09-28T21:00:44.510551673+08:00",
      "size": 2019393189,
      "digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "3.2B",
        "quantization_level": "Q4_K_M"
      }
    },
    {
      "name": "mxbai-embed-large:latest",
      "model": "mxbai-embed-large:latest",
      "modified_at": "2024-07-04T16:35:42.71585157+08:00",
      "size": 669615493,
      "digest": "468836162de7f81e041c43663fedbbba921dcea9b9fefea135685a39b2d83dd8",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "bert",
        "families": [
          "bert"
        ],
        "parameter_size": "334M",
        "quantization_level": "F16"
      }
    }
  ]
}

Running Models

ollama run llama3.2
>>> Send a message (/? for help)

After running the above command, you have successfully run the LLM. If you need to exit the interactive command line, you can use the shortcut key Ctrl + D or enter /bye to exit.

After exiting, use the following command to view the running models:

ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.2:latest    a80c4f17acd5    4.0 GB    100% GPU     24 hours from now

Note: The Text Embedding model does not need to be manually run with the run command. After the model is pulled, and the Ollama service is started, you can directly access the API to call the Text Embedding model!

Try calling Ollama’s Text Embedding API:

curl http://127.0.0.1:11434/api/embed -d '{
  "model": "mxbai-embed-large",
  "input": "Why is the sky blue?"
}' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12889    0 12820  100    69   214k   1183 --:--:-- --:--:-- --:--:--  217k
{
  "model": "mxbai-embed-large",
  "embeddings": [
    [
      -0.010455716,
      -0.0025081504,
      0.014609501,
      .....
      -0.004249276,
      -0.010569806,
      -0.00813636
    ]
  ],
  "total_duration": 57530917,
  "load_duration": 708625,
  "prompt_eval_count": 6
}

After calling the Text Embedding model through the API, the model will be automatically loaded by Ollama. At this point, running the ps command will give you the following result:

ollama ps
NAME                        ID              SIZE      PROCESSOR    UNTIL
mxbai-embed-large:latest    468836162de7    1.2 GB    100% GPU     24 hours from now
llama3.2:latest             a80c4f17acd5    4.0 GB    100% GPU     24 hours from now

Conclusion

At this point, the local deployment of models is complete. The next step is to use the LLM and Text Embedding models running in Ollama to configure your own RAG application in Dify.

Besides the above models, you can try other models on your own. Things in the AI field iterate too quickly, and perhaps a few days later, a new model will be released that reduces resource usage while improving model quality.

What we can do is to continue learning and keep up with the latest developments!

I hope this is helpful, Happy hacking…