Foundation Models have disrupted the AI space and sparked the imagination of the general population. Traditionally, machine learning has driven the majority of use cases in AI by training predictive models on historical data to make future predictions. However, there has been a significant shift in AI with the emergence of models like ChatGPT, DALL-E, etc. These models are trained on vast amounts of data and can be used for multiple tasks such as question answering, sentiment analysis, summarization, and more. Furthermore, the availability of these powerful models through APIs has led to the emergence of new startups and use cases, even for teams lacking GPU resources.
In this blog, we will discuss the challenges of using third-party APIs in an enterprise setting and explore more secure and faster alternatives that can be used today. As an example, we will take an open-source large language model (LLM), quantize its weights, and deploy it on on-premise CPU infrastructure.
Challenges
Despite all the hype surrounding foundation models, their adoption in the enterprise remains low. We will discuss some of the reasons for this below.
Data Security
The foremost concern is data management. Ironically, it was the ownership and aggregation of training data that enabled the emergence of Large Language Models (LLMs) in the first place. As an organization, your data is intellectual property, alongside your other valuable assets. Sending contextual data over an API to a third-party LLM hosting company is not something organizations are comfortable with, especially if the data contains personally identifiable information (PII). Organizations require a secure mechanism to utilize LLMs without risking data breaches or unauthorized disclosures.
Risk Management
The recent saga involving Sam Altman at OpenAI brought attention to the inherent risks of relying on startups, particularly when running business-critical applications on their APIs. Although the service itself was not affected, it became clear that many generative AI companies are young startups constantly at risk of shutting down due to budget constraints, competitive pressure, and other internal and external factors. All of this makes organizations even more cautious about depending on third-party LLM providers.
Cost of Inference
Inference involves deploying a previously trained model to process new input data. According to OpenAI’s 2018 report, a significant portion of computational resources in deep learning is allocated to inference, rather than training. While individual inference steps are more cost-effective than extensive training runs, the cumulative computation required for numerous inference steps can still become a substantial portion of the overall computational load.
To illustrate this concept, consider the cost of generating 750 words using GPT-3, which amounts to just 6 cents. However, if we were to create a model with 1,000 times more parameters, similar to the leap from GPT-1 to GPT-3, generating the same 750 words would cost $60, roughly the rate charged by a skilled human writer. To have a truly transformative economic impact, inference will need to become significantly cheaper or more efficient than hiring a human writer.
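To make the arithmetic explicit, here is a back-of-the-envelope sketch using the figures above; the numbers are purely illustrative and assume inference cost scales roughly linearly with parameter count, which is a simplification.

```python
# Back-of-the-envelope cost scaling using the illustrative figures above.
cost_per_750_words = 0.06   # ~6 cents to generate 750 words with GPT-3
scale_factor = 1000         # hypothetical model with 1000x more parameters

# Simplifying assumption: inference cost scales linearly with parameter count.
scaled_cost = cost_per_750_words * scale_factor
print(f"Estimated cost for 750 words at 1000x scale: ${scaled_cost:.2f}")  # ~$60
```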
Open Source comes to the rescue
The open-source community around Large Language Models (LLMs) is rapidly advancing, closing the gap with proprietary models like ChatGPT, Bard, etc., and pushing the boundaries of accessible AI capabilities. This, coupled with retrieval-augmented generation (RAG), allows an organization to fully own and secure its application and data.
Available open-source LLMs
Llama, Mistral, Dolly, Vicuna, and Falcon are just a few of the many LLMs openly available right now. These LLMs boast near GPT-3.5, and sometimes GPT-4-level, performance on benchmarks.
Quantization
Using techniques such as quantization, LLMs can be run without GPUs, giving organizations greater flexibility at reduced cost.
Before we rush to download these LLMs and start building applications, we have to understand that these neural networks have billions of weights across hundreds of layers, which means they need huge amounts of runtime memory and powerful GPUs to run at optimal efficiency. Fret not: the open-source community has made it easier to work around these constraints using quantization.
Quantization in LLMs has significantly improved operating efficiency and hardware compatibility. This technique involves reducing the precision of the model’s parameters, transforming them from floating-point to lower-bit representations. This process not only shrinks the model’s size but also accelerates its computational speed, making it more feasible to deploy these advanced models on commodity hardware. The true marvel of quantization lies in its ability to maintain most of the model’s performance while significantly reducing its memory utilization and energy footprint.
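To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization applied to a single weight tensor with NumPy. Real LLM quantization schemes (GGUF, GPTQ, NF4, etc.) work block-wise with more sophisticated scaling, so treat this purely as an illustration of the precision-versus-size trade-off.

```python
import numpy as np

# A toy float32 "weight matrix" standing in for one layer of an LLM.
weights = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)

# Dequantize when the values are needed in a computation.
deq_weights = q_weights.astype(np.float32) * scale

print(f"float32 size:   {weights.nbytes / 1e6:.1f} MB")    # ~67 MB
print(f"int8 size:      {q_weights.nbytes / 1e6:.1f} MB")  # ~17 MB (4x smaller)
print(f"mean abs error: {np.abs(weights - deq_weights).mean():.5f}")
```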
While there are many quantization approaches, such as GGML, GPTQ, and NF4, GGUF is currently the most popular. GGUF is the successor to GGML from the llama.cpp team, whose efforts have made it possible to run LLMs on personal computers.
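As a sketch of what running a GGUF-quantized model on CPU can look like, here is an example using the llama-cpp-python bindings for llama.cpp. The model path and generation parameters below are placeholders; you would first download a quantized GGUF file (for example, from Hugging Face) and point `model_path` at it.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a locally downloaded GGUF file (placeholder path; adjust to your model).
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=2048,     # context window size
    n_threads=8,    # match this to your physical CPU cores
    n_batch=64,     # prompt tokens processed per batch
)

output = llm(
    "Summarize the benefits of running LLMs on-premise in two sentences.",
    max_tokens=128,
    temperature=0.2,
)
print(output["choices"][0]["text"])
```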
Fine Tuning
Another significant advantage of running LLMs locally is the ability to fine-tune them. Fine-tuning allows developers to customize the LLM according to their specific requirements. This can lead to reduced hallucination, improved context filtering, and an ever-expanding set of instructions, among other benefits. Below is an example of fine-tuning the Mistral-7B model.
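Here is a minimal sketch of parameter-efficient fine-tuning with LoRA using the Hugging Face transformers and peft libraries. The model name, hyperparameters, and the tiny placeholder dataset are all illustrative assumptions; in practice you would supply your own instruction data and, for memory-constrained setups, combine this with quantization (QLoRA-style).

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint; use your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach small trainable LoRA adapters instead of updating all 7B weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# Tiny placeholder dataset; replace with your own instruction/response pairs.
examples = {"text": ["### Instruction: Greet the user.\n### Response: Hello!"]}

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()  # causal LM: labels mirror inputs
    return out

train_dataset = Dataset.from_dict(examples).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./mistral-7b-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```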
Example notebook
Here is a notebook where we demonstrate how you can run LLMs locally. Our experiments corroborate that LLM inference time on CPU improves as the number of CPU cores increases, so a good multi-core CPU with multi-threading can significantly reduce inference time (see the timing sketch after the results table below).
| LLM (Quantized) | Environment | CPU Cores | RAM | Inference Time (s) | n_batch |
|---|---|---|---|---|---|
| Llama v2 | Google Colab | 2 | 16 GB | 567 | 30 |
| Llama v2 | MacBook Pro | 4 | 16 GB | 460 | 30 |
| Llama v2 | Virtual Machine | 16 | 32 GB | 109 | 30 |
| Llama v2 | Virtual Machine | 16 | 32 GB | 121 | 60 |
| Llama v2 | Virtual Machine | 16 | 32 GB | 109 | 120 |
| Mistral 7B | Virtual Machine | 16 | 32 GB | 98 | 60/120 |
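As a rough sketch of how measurements like these can be reproduced, the snippet below times generation with llama-cpp-python while varying the thread count. The model file, prompt, and parameter values are placeholders, and absolute numbers will depend entirely on your hardware.

```python
import time

from llama_cpp import Llama

PROMPT = "Explain quantization of large language models in one paragraph."

for n_threads in (2, 4, 8, 16):
    # Reload the model for each setting so every run starts from a clean state.
    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
        n_threads=n_threads,
        n_batch=60,
    )
    start = time.time()
    llm(PROMPT, max_tokens=256)
    print(f"{n_threads} threads: {time.time() - start:.1f} s")  # generation only
```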
With the help of the open-source community, advances in foundation models now allow organizations and developers to run LLMs locally without compromising data security, and without third-party dependencies or vendor lock-in.
Local vs Third-Party
Running LLMs locally presents certain challenges to consider. It can raise the Total Cost of Ownership (TCO) and introduce technical debt, as the organization becomes responsible for hosting, load balancing, serving, and LLM Operations (LLMOps) to company standards. To help organizations evaluate the feasibility and benefits of local LLM deployment, we provide a comparison matrix below.
| Local LLM | Third-party LLM |
|---|---|
| Data, including PII, stays protected in-house | No data security beyond vendor claims |
| Economical for high-volume use cases | Economical for low-volume use cases |
| Preferable for batch inference | Preferable for real-time inference |
| High code | Low to no code |
| Mid-level accuracy | Higher accuracy |
| High compute (CPU or GPU) | Low to zero compute |
| Low support | Dedicated support |
| Extensive fine-tuning | Limited to no fine-tuning |
| Outdated quickly | Frequently retrained |
In a nutshell, the use cases for local LLMs can be driven by at least two factors:
- Privacy: Making sure your data is secure and private by avoiding the need to send it to a third party.
- Cost: Getting rid of inference fees, which is especially great for token-intensive applications like text preprocessing (extraction/tagging), summarization, and agent simulations.
If you’re interested in exploring the potential of local LLMs for your projects or business, consider contacting Evolve AI Labs. Our team specializes in advanced AI solutions and can provide you with the expertise and resources to effectively leverage the power of foundation models.