Fine-Tuning Falcon-7B

Code

Context

I recently implemented an iterative summarizer at www.internalize.ai, but was disappointed with the latency of OpenAI’s GPT-3.5-turbo endpoint. Even though I determined that a smaller model would not solve my latency issues, I resolved to give fine-tuning a try as a learning experience.

I heard that the Falcon had landed in the Hugging Face ecosystem, so I chose it as the model to fine-tune.

Vision

Provide either:

  1. Three independent Falcon models trained on my three summarization tasks.
  2. One Falcon model trained on all three summarization tasks.

Getting Started

Hugging Face was kind enough to provide an example Colab notebook for fine-tuning the model. The notebook also takes advantage of QLoRA, which quantizes the frozen base model to 4-bit and trains small low-rank adapters on top of it, making fine-tuning feasible on a single GPU.
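The core of that setup looks roughly like the sketch below; the hyperparameters and LoRA settings here are illustrative rather than the notebook's exact values.

```python
# Rough QLoRA setup for Falcon-7B: 4-bit base model + LoRA adapters.
# Hyperparameters (r, alpha, dropout) are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # Falcon's fused attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```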

Generating Data

Inspired by Alpaca, I used GPT-3.5-turbo to generate extractive summarization completions for a subset of the cnn_dailymail dataset.
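Concretely, the data side looks something like the sketch below; the subset size, prompt wording, and field names are illustrative rather than exactly what I used.

```python
# Build Alpaca-style prompts from a slice of cnn_dailymail.
# The slice size and prompt wording are illustrative.
from datasets import load_dataset

articles = load_dataset("cnn_dailymail", "3.0.0", split="train[:1000]")

def to_prompt(example):
    return {
        "prompt": (
            "Extract the key sentences from the following article verbatim.\n\n"
            + example["article"]
        )
    }

prompts = articles.map(to_prompt)
# The completion for each prompt is then filled in by GPT-3.5-turbo
# (see the sketch in the next section).
```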

OpenAI Robustness Tip

Pairing a rate-limited process pool with automatic retries is a nice way to maximize OpenAI bandwidth.
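Roughly what I mean, using the 0.x-era OpenAI client that was current at the time (the pacing, retry policy, and prompt here are illustrative):

```python
# Rate-limited process pool for bulk GPT-3.5-turbo calls, with exponential backoff.
import os
import time
from concurrent.futures import ProcessPoolExecutor

import openai  # 0.x-era client

openai.api_key = os.environ["OPENAI_API_KEY"]

MAX_WORKERS = 4
SECONDS_PER_REQUEST = 1.0  # per-worker pacing -> roughly MAX_WORKERS requests/sec overall
MAX_RETRIES = 5


def summarize(article: str) -> str:
    """Ask GPT-3.5-turbo for an extractive summary, retrying on transient errors."""
    for attempt in range(MAX_RETRIES):
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "Extract the key sentences from the article verbatim."},
                    {"role": "user", "content": article},
                ],
                temperature=0.0,
            )
            return response["choices"][0]["message"]["content"]
        except openai.error.OpenAIError:
            time.sleep(2 ** attempt)  # exponential backoff on rate limits / flaky responses
        finally:
            time.sleep(SECONDS_PER_REQUEST)  # crude per-worker rate limit
    raise RuntimeError("exceeded retry budget")


def summarize_all(articles: list[str]) -> list[str]:
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(summarize, articles))
```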

Finding Bugs

I mistakenly anticipated an error-free training pipeline. Over the course of training I ran into three issues. Tracking them down was an excellent opportunity to become more familiar with the HF codebase and to start participating, in a small way, in the open-source community.

Issue #1: Gradient Accumulation Breaks at Epoch Boundary

One can find this issue on GitHub here. I need to do some more reading before I can summarize it properly, but it has since been fixed here.

Issue #2: Trainer Silently Drops Data ≤ max_seq_length

One can find this issue on GitHub here. Essentially, the provided training library silently drops data from the training set. This is not good: it left my model with a phenomenal training loss and terrible real-world performance.
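A quick sanity check that would have caught this earlier (illustrative, not something from the original pipeline): look at the distribution of tokenized lengths relative to max_seq_length and compare against the number of examples the trainer actually reports.

```python
# Count how many examples fall on either side of max_seq_length.
# If the trainer reports fewer examples than the total here, it is silently
# dropping data. The file name and text field are hypothetical.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
max_seq_length = 512  # illustrative value

dataset = load_dataset("json", data_files="train.jsonl", split="train")
lengths = [len(tokenizer(example["text"])["input_ids"]) for example in dataset]

n_short = sum(length <= max_seq_length for length in lengths)
n_long = len(lengths) - n_short
print(f"{n_short} examples <= {max_seq_length} tokens, {n_long} longer, {len(lengths)} total")
```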

Issue #3: Falcon End-of-Sequence Tokenization

One can find this issue on GitHub here. Essentially, the tokenization in the notebook provided by Hugging Face is done in such a way that the end-of-sequence (EOS) token is ignored during the loss calculation. As a result the model never learns to emit EOS, so a correct completion is always followed by gibberish up to the maximum generation length.
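As a sketch of the kind of fix this calls for (not necessarily the exact change adopted upstream): append the EOS token to every training example yourself and keep the pad token distinct from EOS, so EOS actually shows up in the labels instead of being masked out as padding.

```python
# Make sure EOS is present in every example and never treated as padding.
# Field names ("prompt", "completion") and max_length are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # don't reuse EOS as padding

def tokenize_example(example):
    text = f"{example['prompt']}\n{example['completion']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)
```

If you add a new pad token this way, remember to resize the model's embeddings with model.resize_token_embeddings(len(tokenizer)) before training.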

Results

Validation beyond spot-checking is out of scope for this post. Manually verified outputs for each model are available in the following toggle list.

Prompts
Outputs

Some high-level comments:

  • The raw model does not know when to emit the EOS token.
  • The extract-only model tries to perform the extraction task on everything and essentially copies its input.
  • The multi-task model’s outputs look pretty good.