- Code
- Context
- Vision
- Obtaining Resources
- Google Colab w/ LLM Tips
- Use Google Drive for the Hugging Face Cache
- Use ray to Deallocate GPU Resources
- Choosing the Models
- Hardware
- Overview of Analysis
- Results
- Conclusions
- Similar Work
- Cheng Li’s “llm-analysis”
- LangChain’s “auto-evaluator”
Code
Context
I recently implemented an iterative summarizer at www.internalize.ai, but was disappointed with OpenAI’s GPT-3.5-turbo endpoint latency. The three most promising options for speedup were:
1. Use OpenAI hosted on Azure.
2. Try a different provider, like Claude via Anthropic.
3. Fine-tune a smaller model on the summarization tasks.
Though (1) allegedly provides a 50% improvement over the standard OpenAI API, I lack the enterprise status required to use it. I also lack an API key to try (2). So I decided to proceed with (3) in search of a faster model.
Vision
The goal here is for the summarization process to appear instantaneous to the user.
Obtaining Resources
Though not impossible, it is extremely unpleasant to perform this type of evaluation without a good GPU. I considered using my Mac M1 but determined that it is both less powerful and less flexible than the numerous hosted solutions.
First, I tried Amazon SageMaker Studio. SageMaker Studio is an IDE that includes a Jupyter notebook interface. Unfortunately, I encountered an error when trying to enable a GPU:
ResourceLimitExceeded: The account-level service limit 'Studio KernelGateway Apps running on ml.g5.xlarge instance' is 0 Apps, with current utilization of 0 Apps and a request delta of 1 Apps. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.
By default, AWS accounts cannot reserve GPU hardware. This is fixable by submitting a quota increase request, but unfortunately these requests can take a few days to be fulfilled.
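For reference, the request can also be scripted with boto3. This is just a sketch; the region and the quota-name filter are my assumptions, and the same thing can be done by hand in the Service Quotas console:
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")  # Region is an assumption.

# Look up the quota code for the limit named in the error message above.
quota = next(
    q
    for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode="sagemaker")
    for q in page["Quotas"]
    if "Studio KernelGateway" in q["QuotaName"] and "ml.g5.xlarge" in q["QuotaName"]
)

# Request capacity for a single app; approval can still take a few days.
sq.request_service_quota_increase(
    ServiceCode="sagemaker",
    QuotaCode=quota["QuotaCode"],
    DesiredValue=1,
)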
While I waited for approval, I moved on to Google Colab. Colab’s feature set is sufficient for this use case, and they did not make me wait to begin. Additionally, Colab is cheaper and provides access to nodes with a single A100 GPU:
| RAM | GPU | AWS $/hr | Colab $/hr |
| --- | --- | --- | --- |
| ≥ 84 GB | A100 40 GB | Single-GPU A100 instances not even available | $1.308 |
| ≥ 25 GB | T4 16 GB | $1.12 (g4dn.2xlarge) | $0.205 |
An in-depth analysis of SageMaker vs. Colab (+ Google VM) is out of scope for this article. For my use case, I chose Colab.
Google Colab w/ LLM Tips
A few tricks I used to make my Colab development easier.
Use Google Drive for the Hugging Face Cache
This prevents redownloading the datasets/models from Hugging Face every time.
# Point the HF cache at Drive so that models/datasets persist across instances.
# Set HF_HOME before importing transformers/datasets so they pick up this path.
import os
from google.colab import drive

drive.mount('/drive')
os.environ["HF_HOME"] = '/drive/MyDrive/HFCache'
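With the cache on Drive, weights downloaded once are reused by every later Colab session. As a minimal sketch using one of the models from this benchmark (the dtype and the trust_remote_code flag are my assumptions about how you would load it):
import torch
from transformers import AutoModelForCausalLM

# The first session downloads the Falcon weights into /drive/MyDrive/HFCache;
# subsequent sessions load them from Drive instead of re-downloading.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,  # Keep memory manageable on a single GPU.
    trust_remote_code=True,      # Falcon required this at the time of writing.
)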
Use ray to Deallocate GPU Resources
Working with multiple models gets annoying in Colab notebooks because the GPU memory is frequently leaked. This HF forum discussion suggested using ray for subprocessing, and I extended the solution there with exception handling to safely handle KeyboardInterrupt when a notebook cell is stopped.
import ray

# max_calls=1 makes the worker process exit after one call, releasing the GPU.
@ray.remote(num_gpus=1, max_calls=1)
def _safe_gpu_usage():
    # Do stuff...
    pass

def safe_gpu_usage():
    ref = _safe_gpu_usage.remote()
    try:
        return ray.get(ref)
    except BaseException:
        # Prevent task retry after a keyboard interrupt.
        ray.cancel(ref, force=True)
        raise
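As a sketch of how this wrapper gets used for benchmarking, the remote task can load a model, time a generation, and return only numbers, so all GPU memory is released when the subprocess exits. The model loading and timing details below are illustrative, not the exact code from my notebook:
import time
import ray

@ray.remote(num_gpus=1, max_calls=1)
def _time_generation(model_name, prompt, max_new_tokens=256):
    # Import inside the task so the heavy libraries live in the subprocess.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # Needs accelerate installed.
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    return time.perf_counter() - start  # Seconds; GPU memory is freed when the task exits.
It can then be called through the same try/except wrapper as safe_gpu_usage above.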
Choosing the Models
For this analysis I chose the following models:
- OpenAI gpt-3.5-turbo
  - Gold standard.
- tiiuae/falcon-7b
  - The best-performing 7b model at the time of writing.
- ybelkada/falcon-7b-sharded-bf16
  - Curious how half-precision affects inference performance.
- llama.cpp (7b)
  - This model is heavily optimized (a loading/timing sketch follows this list).
- huggyllama/llama-7b
  - For comparison against llama.cpp.
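The llama.cpp model is invoked differently from the Hugging Face models. Here is a rough loading/timing sketch using the llama-cpp-python binding; the binding, the model file path, and the GPU-offload setting are assumptions, not necessarily what the linked code does:
import time
from llama_cpp import Llama

# Load a quantized 7b llama model; n_gpu_layers=-1 offloads all layers to the GPU.
llm = Llama(model_path="/drive/MyDrive/models/llama-7b.gguf", n_gpu_layers=-1)

start = time.perf_counter()
out = llm("Summarize the following text: ...", max_tokens=256)
latency = time.perf_counter() - start
print(latency, len(out["choices"][0]["text"]))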
Hardware
All self-hosted models run on a single A100 40 GB GPU in the benchmark below.
Overview of Analysis
The initial experiments I ran are listed below:
For a detailed look at the process, see the code.
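As a rough sketch of the kind of measurement involved on the OpenAI side (the pre-1.0 openai client and the prompt handling are assumptions):
import time
import openai  # Pre-1.0 client interface, matching the era of this benchmark.

def time_chat_completion(prompt, model="gpt-3.5-turbo"):
    """Return (latency in seconds, output length in characters) for one completion."""
    start = time.perf_counter()
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response["choices"][0]["message"]["content"]
    return time.perf_counter() - start, len(text)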
Results
Using the above experiments, it is clear that OpenAI and llama.cpp (7b) perform much better than the other models:
Generating more data for OpenAI vs. llama.cpp (7b):
llama.cpp shows higher variance, but it frequently beats OpenAI when the output is under 2,000 characters. It is also striking that OpenAI’s performance improves as the output length increases.
Conclusions
My prior was that switching to a smaller, self-hosted model would blow OpenAI’s latency out of the water. That is clearly not the case. Without detailed knowledge of OpenAI’s infrastructure, I suspect economies of scale are at play.
Given the added overhead of hosting, fine-tuning, etc., it is not worthwhile to move off of OpenAI.
Similar Work
Cheng Li’s “llm-analysis”
Here, Cheng Li uses analytical methods to calculate memory usage and latency for training and inference, and links to a variety of example papers on the analytical methods used.
LangChain’s “auto-evaluator”
LangChain’s tool is oriented toward evaluating answer quality for the question-answering use case, but it also reports the latency of producing answers.