Amplifying Security and Speed: Harnessing On-Premise CPU-powered LLMs

Foundation Models have disrupted the AI space and sparked the imagination of the general population. Traditionally, machine learning has driven the majority of use cases in AI by training predictive models on historical data to make future predictions. However, there has been a significant shift in AI with the emergence of models like ChatGPT, DALL-E, etc. These models are trained on vast amounts of data and can be used for multiple tasks such as question answering, sentiment analysis, summarization, and more. Furthermore, the availability of these powerful models through APIs has led to the emergence of new startups and use cases, even for teams lacking GPU resources.

Source: On the Opportunities and Risks of Foundation Models (July 2022)

In this blog, we will discuss the challenges of using third-party APIs in an enterprise setting and explore more secure and faster alternatives that can be used today. To show an example, we will utilize an open-source large language model (LLM), quantize its weights, and deploy it on on-premise CPU infrastructure.

Challenges

Despite all the hype surrounding foundation models, their adoption in the enterprise remains low. We will discuss some of the reasons for this below.

Data Security

A central concern is data management. Ironically, the ownership and aggregation of training data is what enabled Large Language Models (LLMs) to emerge in the first place. An organization's data is intellectual property, alongside its other valuable assets. Sending contextual data through an API to a third-party LLM hosting company is not something organizations feel comfortable with, especially if the data contains personally identifiable information (PII). Organizations need a secure mechanism for using LLMs without risking data breaches or unauthorized disclosures.

Risk Management

The recent saga involving Sam Altman at OpenAI drew attention to the inherent risks of running business-critical applications on a startup's APIs. Although the service was not affected, it became clear that many generative AI companies are young startups that are constantly at risk of closing because of budget constraints, competitive pressure, and other internal and external factors. All of this makes organizations even more cautious about using third-party LLMs.

Cost of Inference

Inference involves deploying a previously trained model to process new input data. According to OpenAI’s 2018 report, a significant portion of computational resources in deep learning is allocated to inference, rather than training. While individual inference steps are more cost-effective than extensive training runs, the cumulative computation required for numerous inference steps can still become a substantial portion of the overall computational load.

To illustrate this, consider the cost of generating 750 words with GPT-3, which amounts to just 6 cents. If we were to create a model with 1000 times more parameters, similar to the leap from GPT-1 to GPT-3, and cost grew in proportion to model size, generating the same 750 words would cost $60, which is approximately the rate charged by a skilled human writer. To have a truly transformative economic impact, inference needs to be significantly cheaper or more efficient than what human writers can provide.
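The arithmetic behind that comparison can be sketched in a few lines. This is only an illustration: it assumes that inference cost scales linearly with parameter count, which is the simplification the comparison above relies on and not a pricing rule from any vendor. The function name is hypothetical.

```python
# Worked version of the cost comparison above.
# Assumption: cost grows linearly with the number of parameters.

BASE_COST_USD = 0.06   # quoted cost of generating 750 words with GPT-3
WORDS = 750
SCALE_FACTOR = 1000    # hypothetical model with 1000x more parameters


def scaled_cost(base_cost_usd: float, scale_factor: float) -> float:
    """Scale a per-request cost by the growth in parameter count."""
    return base_cost_usd * scale_factor


large_model_cost = scaled_cost(BASE_COST_USD, SCALE_FACTOR)
cost_per_word = large_model_cost / WORDS

print(f"Cost for {WORDS} words: ${large_model_cost:.2f}")  # $60.00
print(f"Cost per word: ${cost_per_word:.2f}")              # $0.08
```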

Open Source comes to the rescue

The open-source LLM community is advancing rapidly, closing the gap with proprietary models like ChatGPT and Bard and pushing the boundaries of accessible AI capabilities. Coupled with retrieval-augmented generation (RAG), this allows an organization to completely own and secure its application and data.

Available open-source LLMs

Llama, Mistral, Dolly, Vicuna, and Falcon are just a few of the many LLMs openly available right now. These LLMs report near GPT-3.5 and sometimes GPT-4-level performance on benchmarks.

Quantization

Using techniques such as quantization, LLMs can be run without GPUs, allowing greater flexibility and reduced costs for the organization.

Before we get excited and start building applications with these LLMs, we have to understand that these neural networks have billions of weights and many layers, so at full precision they need a large amount of runtime memory and powerful GPUs to run efficiently. Fortunately, the open-source community has made it easier to work around these constraints using quantization.
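To get a feel for the memory involved, here is a rough back-of-the-envelope estimate. It assumes a 7-billion-parameter model (an illustrative size, not one measured in this post) and counts only the weights; activations, the KV cache, and the per-block scale factors that real quantized formats store are ignored, so actual usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumed 7-billion-parameter model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights need about 14 GB while 4-bit weights need about 3.5 GB, which is the difference between requiring a dedicated accelerator and fitting into the RAM of an ordinary machine.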

Quantization in LLMs has significantly improved operating efficiency and hardware compatibility. This technique involves reducing the precision of the model’s parameters, transforming them from floating-point to lower-bit representations. This process not only shrinks the model’s size but also accelerates its computational speed, making it more feasible to deploy these advanced models on commodity hardware. The true marvel of quantization lies in its ability to maintain most of the model’s performance while significantly reducing its memory utilization and energy footprint.
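The core idea can be shown with a minimal sketch of symmetric 8-bit quantization in plain Python. This is a simplified illustration only: it uses one scale for the whole list and made-up weight values, whereas production schemes such as those in llama.cpp or GPTQ work block by block and are considerably more elaborate.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per weight is bounded by half of the scale.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("integer codes:", q)
print(f"scale: {scale:.6f}, max round-trip error: {max_error:.6f}")
```

Each weight is stored as a small integer plus one shared float scale, which is where the memory saving comes from; the bounded rounding error is why most of the model's quality survives.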

There are many quantization approaches and formats, such as GGML, GPTQ, and NF4, but GGUF is currently the most popular. GGUF is the successor to GGML from the llama.cpp team, whose efforts have made it possible to run LLMs on personal computers.

Fine Tuning

Another significant advantage of running LLMs locally is the ability to fine-tune. Fine-tuning allows developers to customize the LLM to their specific requirements. This can lead to reduced hallucination, improved context filtering, and an ever-expanding set of supported instructions, among other benefits. Here is an example of fine-tuning the mistral-7b model.

Example notebook

Here is a notebook where we demonstrate how you can run LLMs locally. Our experiments corroborate that LLM inference time on CPU improves as the number of CPU cores increases, so a good multi-core CPU with multi-threading can significantly reduce inference time.

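| LLM (Quantized) | Environment | CPU Cores | RAM | Inference Time (secs) | n_batch |
|---|---|---|---|---|---|
| Llama v2 | Google Colab | 2 | 16 GB | 567 | 30 |
| Llama v2 | Macbook Pro | 4 | 16 GB | 460 | 30 |
| Llama v2 | Virtual Machine | 16 | 32 GB | 109 | 30 |
| Llama v2 | Virtual Machine | 16 | 32 GB | 121 | 60 |
| Llama v2 | Virtual Machine | 16 | 32 GB | 109 | 120 |
| Mistral 7B | Virtual Machine | 16 | 32 GB | 98 | 60/120 |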
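For readers who want to try settings like those in the table above, the following is a minimal sketch of loading a quantized GGUF model on CPU. It assumes the llama-cpp-python bindings, which this post does not confirm were used in the notebook; the model path, prompt, helper names, and default values are hypothetical placeholders. The `n_threads` and `n_batch` arguments correspond to the CPU core count and batch size varied in the experiments.

```python
import os


def build_llama_kwargs(model_path, n_threads=None, n_batch=60, n_ctx=2048):
    """Collect constructor arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,                            # context window in tokens
        "n_threads": n_threads or os.cpu_count(),  # default: all visible cores
        "n_batch": n_batch,                        # prompt tokens processed per batch
    }


def generate(prompt, model_path, max_tokens=128, **overrides):
    """Load the model and return the generated text for one prompt."""
    # Imported lazily so build_llama_kwargs works without the package installed.
    from llama_cpp import Llama

    llm = Llama(**build_llama_kwargs(model_path, **overrides))
    result = llm(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"]


if __name__ == "__main__":
    MODEL_PATH = "models/model.gguf"  # hypothetical path to a quantized model file
    if os.path.exists(MODEL_PATH):
        print(generate("Summarize the benefits of local LLMs.", MODEL_PATH,
                       n_threads=16, n_batch=120))
```

Keeping the settings in one small helper makes it straightforward to rerun the same prompt with different thread counts and batch sizes and compare timings on your own hardware.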

With the help of the open-source community, advancements in foundation models now allow organizations and developers to run LLMs locally without compromising on data security and protection, and without third-party dependencies or vendor lock-in.

Local vs Third-Party

Running LLMs locally presents its own challenges. It can raise the Total Cost of Ownership (TCO) and introduce technical debt, because the organization becomes responsible for hosting, load balancing, serving, and LLM operations (LLMOps) in line with company standards. To help organizations evaluate the feasibility and benefits of local LLM deployment, we provide a comparison matrix below.

| Local LLM | Third-party LLM |
|---|---|
| Data, including PII, is protected | No data security other than vendor claims |
| Economical for high-volume use cases | Economical for low-volume use cases |
| Preferable for batch inference | Preferable for real-time inference |
| High code | Low to no code |
| Mid-level accuracy | Higher accuracy |
| High compute (CPU or GPU) | Low to zero compute |
| Low support | Dedicated support |
| Extensive fine-tuning | Limited to no fine-tuning |
| Outdated quickly | Frequently trained |

In a nutshell, the use cases for local LLMs can be driven by at least two factors:

  • Privacy: Making sure your data is secure and private by avoiding the need to send it to a third party.
  • Cost: Getting rid of inference fees, which is especially great for token-intensive applications like text preprocessing (extraction/tagging), summarization, and agent simulations.

If you’re interested in exploring the potential of local LLMs for your projects or business, consider contacting Evolve AI Labs. Our team specializes in advanced AI solutions and can provide you with the expertise and resources to effectively leverage the power of foundation models.
