- Code
- Context
- Vision
- Getting Started
- Generating Data
- OpenAI Robustness Tip
- Finding Bugs
- Issue #1: Gradient Accumulation Breaks at Epoch Boundary
- Issue #2: Trainer Silently Drops Data ≤ max_seq_length
- Issue #3: Falcon End-of-Stream Tokenization
- Results
Code
Context
I recently implemented an iterative summarizer at www.internalize.ai, but was disappointed with the latency of OpenAI’s GPT-3.5-turbo endpoint. Though I determined that using a smaller model would not solve my latency issues, I resolved to try fine-tuning as a learning experience.
I heard that Falcon had landed in the Hugging Face ecosystem, so I chose it as the model to fine-tune.
Vision
Provide either:
- Three independent Falcon models trained on my three summarization tasks.
- One Falcon model trained on all three summarization tasks.
Getting Started
Hugging Face was kind enough to provide an example Colab notebook for fine-tuning the model. This notebook also takes advantage of QLoRA, which makes training much easier.
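The QLoRA recipe boils down to loading the base model with 4-bit quantized weights and attaching small low-rank adapters, so only a tiny fraction of parameters is trained. The following is a configuration sketch, not the notebook's exact code: it assumes the `transformers`, `peft`, and `bitsandbytes` packages plus a GPU, and the hyperparameter values are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Store the base weights in 4-bit NF4; compute still happens in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach LoRA adapters to Falcon's fused attention projection;
# r / alpha / dropout here are illustrative defaults, not tuned values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter params are trainable
```

The frozen 4-bit base plus trainable adapters is what makes a 7B-parameter fine-tune fit on a single consumer GPU.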
Generating Data
Inspired by Alpaca, I used GPT-3.5-turbo to generate extractive summarization completions for a subset of the cnn_dailymail dataset.
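The generation step amounts to prompting GPT-3.5-turbo once per article and saving the completion as a training target. Here is a hypothetical prompt builder in the chat-messages format the API expects; the exact wording I used is not reproduced here, and the actual API call is omitted.

```python
def build_messages(article: str) -> list[dict]:
    """Build chat messages asking for an extractive summary of one article."""
    return [
        {
            "role": "system",
            # Illustrative instruction: constrain the model to extraction only.
            "content": (
                "You are an extractive summarizer. Respond only with "
                "sentences copied verbatim from the article."
            ),
        },
        {"role": "user", "content": article},
    ]
```

Each returned message list is passed to the chat completions endpoint, and the (article, completion) pair becomes one fine-tuning example.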
OpenAI Robustness Tip
A rate-limited process pool is a nice way to maximize OpenAI bandwidth: you keep many requests in flight while staying under your requests-per-minute quota.
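A sketch of that idea: a shared token-bucket-style limiter gating concurrent workers. This is hypothetical code, not the embedded snippet; `call_openai` is a stand-in for the real request function, and a thread pool is used here because API calls are I/O-bound (the same pattern applies to a process pool).

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Allow at most `rate` calls per second across all workers."""

    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = 0.0  # monotonic time of the next free request slot

    def wait(self) -> None:
        with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.interval
        # Sleep outside the lock so other workers can claim later slots.
        time.sleep(max(0.0, slot - time.monotonic()))

limiter = RateLimiter(rate=20)  # ~20 requests/sec; tune to your quota

def call_openai(article: str) -> str:
    """Hypothetical stand-in for the real chat-completion request."""
    limiter.wait()
    return f"summary of: {article}"

def summarize_all(articles: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(call_openai, articles))
```

Serializing the slot assignment under a lock but sleeping outside it keeps the limiter cheap even with many workers.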
Finding Bugs
I mistakenly anticipated an error-free training pipeline. Over the course of training I encountered three issues. Tracking them down was an excellent opportunity to become more familiar with the HF codebase and to begin participating, in a minor way, in the open source community.
Issue #1: Gradient Accumulation Breaks at Epoch Boundary
One can find this issue on GitHub here. I need to do some more reading before I can summarize it properly, but it was fixed here.
Issue #2: Trainer Silently Drops Data ≤ max_seq_length
One can find this issue on GitHub here. Essentially, the provided training library silently drops data from the training set. The result was a model with a phenomenal training loss but terrible actual performance, because it was only ever trained on the subset of examples that survived preprocessing.
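Until a fix lands upstream, a cheap guard against this class of bug is to compare example counts before and after preprocessing and fail loudly on any mismatch. A minimal sketch; the function and argument names are illustrative, not part of the library.

```python
def check_no_silent_drops(raw_examples: list, processed_examples: list) -> list:
    """Fail fast if the preprocessing pipeline discarded training data."""
    n_raw, n_kept = len(raw_examples), len(processed_examples)
    if n_kept < n_raw:
        raise ValueError(
            f"{n_raw - n_kept}/{n_raw} examples were silently dropped during "
            "preprocessing; inspect max_seq_length and any filtering logic"
        )
    return processed_examples
```

Running this once before training turns a silent data loss into an immediate, debuggable error.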
Issue #3: Falcon End-of-Stream Tokenization
One can find this issue on GitHub here. Essentially, in the notebook provided by Hugging Face, the tokenization is done in such a way that the end-of-stream token is ignored during the loss calculation. This results in the model never learning to output end-of-stream, meaning that your correct result will always be followed by gibberish up to the supplied maximum length.
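The fix is to ensure the EOS token is both appended to each tokenized example and left unmasked in the labels, so the loss actually teaches the model to stop. A minimal sketch over raw token IDs (`eos_token_id` follows the Hugging Face tokenizer convention; the function itself is illustrative, not the notebook's code):

```python
def tokenize_with_eos(input_ids: list[int], eos_token_id: int) -> dict:
    """Append EOS and build labels that keep it in the loss calculation."""
    ids = list(input_ids)
    if not ids or ids[-1] != eos_token_id:
        ids.append(eos_token_id)
    # EOS is NOT masked out of the labels, so the model learns to emit it.
    labels = list(ids)
    return {"input_ids": ids, "labels": labels}
```

With EOS in the labels, generation terminates on its own instead of running to the maximum length.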
Results
Validation beyond spot-checking is out-of-scope for this post. Manually verified outputs for each model are available in the following toggle list.
Some high-level comments:
- The raw model does not know when to output the EOS token.
- The extract-only model tries to do the extract task on everything, basically copying its input.
- The multi-task model's output looks pretty good.