- Code
- Context
- Vision
- Obtaining Resources
- Google Colab w/ LLM Tips
- Use Google Drive for the Hugging Face Cache
- Use ray to Deallocate GPU Resources
- Choosing the Models
- Hardware
- Overview of Analysis
- Results
- Conclusions
- Similar Work
- Cheng Li’s “llm-analysis”
- LangChain’s “auto-evaluator”
Code
Context
I recently implemented an iterative summarizer at www.internalize.ai, but was disappointed with OpenAI’s GPT-3.5-turbo endpoint latency. The three most promising options for speedup were:
1. Use OpenAI hosted on Azure.
2. Try a different provider, like Claude via Anthropic.
3. Fine-tune a smaller model on the summarization tasks.
Though (1) allegedly provides a 50% improvement over the standard OpenAI API, I lack the enterprise status required to use it. I also lack an API key to try (2). So I decided to proceed with (3) in search of a faster model.
Vision
The goal here is for the summarization process to appear instantaneous to the user.
Obtaining Resources
Though not impossible, it is extremely unpleasant to perform this type of evaluation without a good GPU. I considered using my Mac M1 but determined that it is both less powerful and less flexible than the numerous hosted solutions.
First, I tried Amazon SageMaker Studio. SageMaker Studio is an IDE that includes a Jupyter notebook interface. Unfortunately, I encountered an error when trying to enable a GPU:
ResourceLimitExceeded: The account-level service limit 'Studio KernelGateway Apps running on ml.g5.xlarge instance' is 0 Apps, with current utilization of 0 Apps and a request delta of 1 Apps. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.
By default, AWS accounts cannot reserve GPU hardware. This is fixable by submitting a quota increase request, but unfortunately these requests can take a few days to be fulfilled.
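For reference, the request can also be scripted with boto3. This is just a sketch; the region and the quota-name filter are my assumptions, and the same thing can be done by hand in the Service Quotas console:
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")  # Region is an assumption.

# Look up the quota code for the limit named in the error message above.
quota = next(
    q
    for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode="sagemaker")
    for q in page["Quotas"]
    if "Studio KernelGateway" in q["QuotaName"] and "ml.g5.xlarge" in q["QuotaName"]
)

# Request capacity for a single app; approval can still take a few days.
sq.request_service_quota_increase(
    ServiceCode="sagemaker",
    QuotaCode=quota["QuotaCode"],
    DesiredValue=1,
)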
While I waited for approval, I moved on to Google Colab. Colab’s feature set is sufficient for this use case, and they did not make me wait to begin. Additionally, Colab is cheaper and provides access to nodes with a single A100 GPU:
| RAM | GPU | AWS $/hr | Colab $/hr |
| --- | --- | --- | --- |
| ≥ 84 GB | A100 40 GB | Single-GPU A100 instances not even available | $1.308 |
| ≥ 25 GB | T4 16 GB | $1.12 (g4dn.2xlarge) | $0.205 |
An in-depth analysis of SageMaker vs. Colab (+ Google VM) is out of scope for this article. For my use case, I chose Colab.
Google Colab w/ LLM Tips
A few tricks I used to make my Colab development easier.
Use Google Drive for the Hugging Face Cache
This prevents redownloading the datasets/models from Hugging Face every time.
# Point the HF cache at Drive so that models/datasets persist across instances.
# Set HF_HOME before importing transformers/datasets so they pick up this path.
import os
from google.colab import drive

drive.mount('/drive')
os.environ["HF_HOME"] = '/drive/MyDrive/HFCache'
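With the cache on Drive, weights downloaded once are reused by every later Colab session. As a minimal sketch using one of the models from this benchmark (the dtype and the trust_remote_code flag are my assumptions about how you would load it):
import torch
from transformers import AutoModelForCausalLM

# The first session downloads the Falcon weights into /drive/MyDrive/HFCache;
# subsequent sessions load them from Drive instead of re-downloading.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,  # Keep memory manageable on a single GPU.
    trust_remote_code=True,      # Falcon required this at the time of writing.
)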
Use ray to Deallocate GPU Resources
Working with multiple models gets annoying in Colab notebooks because the GPU memory is frequently leaked. This HF forum discussion suggested using ray for subprocessing, and I extended the solution there with exception handling to safely handle KeyboardInterrupt when a notebook cell is stopped.
import ray

# max_calls=1 makes the worker process exit after one call, releasing the GPU.
@ray.remote(num_gpus=1, max_calls=1)
def _safe_gpu_usage():
    # Do stuff...
    pass

def safe_gpu_usage():
    ref = _safe_gpu_usage.remote()
    try:
        return ray.get(ref)
    except BaseException:
        # Prevent task retry after a keyboard interrupt.
        ray.cancel(ref, force=True)
        raise
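As a sketch of how this wrapper gets used for benchmarking, the remote task can load a model, time a generation, and return only numbers, so all GPU memory is released when the subprocess exits. The model loading and timing details below are illustrative, not the exact code from my notebook:
import time
import ray

@ray.remote(num_gpus=1, max_calls=1)
def _time_generation(model_name, prompt, max_new_tokens=256):
    # Import inside the task so the heavy libraries live in the subprocess.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # Needs accelerate installed.
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    return time.perf_counter() - start  # Seconds; GPU memory is freed when the task exits.
It can then be called through the same try/except wrapper as safe_gpu_usage above.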
Choosing the Models
For this analysis I chose the following models:
- OpenAI gpt-3.5-turbo
  - Gold standard.
- tiiuae/falcon-7b
  - The best-performing 7b model at the time of writing.
- ybelkada/falcon-7b-sharded-bf16
  - Curious how half-precision affects inference performance.
- llama.cpp (7b)
  - This model is heavily optimized (a loading/timing sketch follows this list).
- huggyllama/llama-7b
  - For comparison against llama.cpp.
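The llama.cpp model is invoked differently from the Hugging Face models. Here is a rough loading/timing sketch using the llama-cpp-python binding; the binding, the model file path, and the GPU-offload setting are assumptions, not necessarily what the linked code does:
import time
from llama_cpp import Llama

# Load a quantized 7b llama model; n_gpu_layers=-1 offloads all layers to the GPU.
llm = Llama(model_path="/drive/MyDrive/models/llama-7b.gguf", n_gpu_layers=-1)

start = time.perf_counter()
out = llm("Summarize the following text: ...", max_tokens=256)
latency = time.perf_counter() - start
print(latency, len(out["choices"][0]["text"]))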
Hardware
All self-hosted models run on a single A100 40 GB GPU in the benchmark below.
Overview of Analysis
The initial experiments I ran are listed below:
For a detailed look at the process, see the code.
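As a rough sketch of the kind of measurement involved on the OpenAI side (the pre-1.0 openai client and the prompt handling are assumptions):
import time
import openai  # Pre-1.0 client interface, matching the era of this benchmark.

def time_chat_completion(prompt, model="gpt-3.5-turbo"):
    """Return (latency in seconds, output length in characters) for one completion."""
    start = time.perf_counter()
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response["choices"][0]["message"]["content"]
    return time.perf_counter() - start, len(text)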
Results
Using the above experiments, it is clear that OpenAI and llama.cpp (7b) perform much better than the other models:
Generating more data for OpenAI vs. llama.cpp (7b):
llama.cpp shows higher variance, but it frequently beats OpenAI when the output is under 2,000 characters. It is also striking that OpenAI’s performance improves as the output length increases.
Conclusions
My prior was that switching to a smaller, self-hosted model would blow OpenAI’s latency out of the water. That is clearly not the case. Without detailed knowledge of OpenAI’s infrastructure, I suspect economies of scale are at play.
Given the added overhead of hosting, fine-tuning, etc., it is not worthwhile to move off of OpenAI.
Similar Work
Cheng Li’s “llm-analysis”
Here, Cheng Li uses analytical methods to calculate memory usage and latency for training and inference, and links to a variety of example papers on the analytical methods used.
LangChain’s “auto-evaluator”
LangChain’s tool is oriented toward evaluating answer quality for the question-answering use case, but it also reports the latency of producing answers.