Releasing Persimmon-8B

September 7, 2023 — Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani

We’re open-sourcing Persimmon-8B, the most powerful fully permissively-licensed language model with <10 billion parameters.


We’re excited to open-source Persimmon-8B, the best fully permissively-licensed model in the 8B class. The code and weights are here.

At Adept, we’re working towards an AI agent that can help people do anything they need to do on a computer. We’re not in the business of shipping isolated language models (LMs); Persimmon-8B is an early output of the model scaling program that will support our products.

Over the last year, we’ve been amazed by how smart small models are becoming, and we wanted to give the community access to an even better 8B LM to build on for any use case, with an open Apache license and publicly accessible weights. The 8B size is a sweet spot for most users without access to large-scale compute: models of this size can be finetuned on a single GPU, run at a decent speed on modern MacBooks, and may even fit on mobile devices.

Persimmon-8B has several nice properties:

  1. This is the most capable open-source, fully permissive model with fewer than 10 billion parameters. We are releasing it under an Apache license for maximum flexibility.
  2. We trained it from scratch using a context size of 16K. Many LM use cases are context-bound; our model has 4 times the context size of LLaMA2 and 8 times that of GPT-3, MPT, etc.
  3. Our base model exceeds other ~8B models and matches LLaMA2 performance despite having been trained on only 0.37x as much data as LLaMA2.
  4. The model has 70k unused embeddings for multimodal extensions, and has sparse activations.
  5. The inference code we’re releasing along with the model is unique—it combines the speed of C++ implementations (e.g. FasterTransformer) with the flexibility of naive Python inference.

We’re excited to see how the community takes advantage of these capabilities not present in other open source language models, and we hope this model spurs even greater innovation!

Because this is a raw model release, we have not added further finetuning, postprocessing or sampling strategies to control for toxic outputs.

A more realistic way of doing evals

Determining the quality of a language model is still as much art as science. Model quality is not an absolute metric and depends on how the language model will be used. In most use cases, we expect language models to generate text. However, a common methodology for evaluating language models doesn’t actually ask them to generate any text at all. Consider the following multiple choice question from the common HellaSwag eval set. The goal is to pick which of the four answers best continues the “question.”

A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She…

a) rinses the bucket off with soap and blow dries the dog’s head.
b) uses a hose to keep it from getting soapy.
c) gets the dog wet, then it runs away again.
d) gets into a bathtub with the dog.

One way to evaluate the model is to simply ask it to answer the question and then see which choice it makes – (a), (b), (c), or (d). This mimics the experience of how people actually interact with language models – they ask questions and expect answers. This is analogous to e.g. HELM.

A more common practice in ML is instead to use the implicit probabilities that the language model assigns to each choice. For option (a) above, we calculate the probability of “rinses” given the previous sentences, then the probability of “the” given the previous sentences plus “rinses,” and so on. We then multiply all these probabilities together, giving the probability of the entire sequence for option (a). We do this for all four choices (optionally adding length normalization to account for different length sequences) and select the option with the highest sequence probability. This is a fine way to measure the intrinsic knowledge of a language model, but a poor way to understand what actually interacting with it is like.
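As a rough illustration of the likelihood-based scoring just described, the sketch below sums per-token log-probabilities (equivalent to multiplying the probabilities) and optionally divides by length. The function names and the log-probability values are made up for illustration; in practice the numbers would come from the model being evaluated.

```python
def score_option(token_logprobs, length_normalize=False):
    """Score one answer option from its per-token log-probabilities."""
    # Summing log-probabilities is the same as multiplying probabilities.
    total = sum(token_logprobs)
    if length_normalize:
        total /= len(token_logprobs)
    return total


def pick_option(option_logprobs, length_normalize=False):
    """Return the label of the option with the highest sequence score."""
    scores = {
        label: score_option(logprobs, length_normalize)
        for label, logprobs in option_logprobs.items()
    }
    return max(scores, key=scores.get)


# Hypothetical per-token log-probabilities for options (a) to (d).
example = {
    "a": [-2.0, -1.5, -3.0, -2.5],
    "b": [-1.0, -2.0, -2.5],
    "c": [-0.5, -1.0, -0.7, -0.9, -0.6],
    "d": [-1.2, -1.1],
}

unnormalized_choice = pick_option(example)
normalized_choice = pick_option(example, length_normalize=True)
```

With these made-up numbers the short option (d) wins on raw sequence probability, while option (c) wins once scores are divided by length, which is why length normalization is often added.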

Since we care about interacting with language models, we do all of our evals with the former technique: we directly generate answers from the model. We’re releasing the prompts we use so that others can reproduce these numbers.
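A generation-based eval of this kind needs some way to turn the generated text into a choice. The sketch below is one minimal way to do that, assuming the model is prompted to begin its answer with a letter such as "(c)" or "c."; the parsing rule and function names are hypothetical and are not the released prompts or harness.

```python
import re

# Accept "(c)", "c)", "c.", "c:" or a bare "c" at the very start of the output.
_CHOICE_PATTERN = re.compile(r"\s*\(?([a-d])(?:[\).:]|$)")


def extract_choice(generated_text):
    """Return the leading choice letter in the model output, or None."""
    match = _CHOICE_PATTERN.match(generated_text.strip().lower())
    return match.group(1) if match else None


def accuracy(outputs, gold_labels):
    """Fraction of generated outputs whose parsed choice equals the gold label."""
    correct = sum(
        extract_choice(output) == gold
        for output, gold in zip(outputs, gold_labels)
    )
    return correct / len(gold_labels)
```

Outputs that do not start with a recognizable choice are counted as wrong, which matches the spirit of grading what the model actually says rather than what it assigns probability to.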


We compared Persimmon-8B to the current most powerful model in its size range, Llama 2, and to MPT 7B Instruct. Our instruction-fine-tuned model, Persimmon-8B-FT, is the strongest performing model on all but one of the metrics. Our base model, Persimmon-8B-Base, performs comparably to Llama 2, despite having seen only 37% as much training data.

Eval Task (1-Shot)   MPT 7B Instruct   Llama 2 Base 7B   Persimmon-8B-Base   Persimmon-8B-FT
Arc Easy             32.5              53.7              48.1                64.0
Arc Challenge        28.8              43.8              34.5                46.8
HumanEval            12.8              0 / 12.2 [1]      18.9                20.7

Model Details

Persimmon-8B is a standard decoder-only transformer with several architecture modifications.

We use the squared ReLU activation function [2]. We use rotary positional encodings; our internal experiments found them superior to ALiBi. We add layernorm to the Q and K embeddings before they enter the attention calculation.
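To make two of these modifications concrete, here is a small pure-Python sketch of a squared ReLU and of normalizing query and key vectors before taking their scaled dot product. It is an illustration only: the layernorm here has no learnable scale or bias, operates on a single vector, and the function names are hypothetical rather than taken from the released code.

```python
import math


def squared_relu(xs):
    """Apply max(0, x)^2 elementwise; negative inputs become exact zeros."""
    return [max(0.0, x) ** 2 for x in xs]


def layer_norm(xs, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def qk_score(q, k):
    """Scaled dot-product attention logit with layernorm applied to q and k first."""
    q_n, k_n = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q_n, k_n)) / math.sqrt(len(q))


activations = squared_relu([-2.0, 0.0, 3.0])
```

Because the query and key are normalized first, rescaling either vector leaves the attention logit essentially unchanged in this sketch.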

The checkpoint we are releasing has approximately 9.3B parameters. In order to make pipelining during training more efficient, we chose to decouple the input and output embeddings. Doing this does not increase the capacity of the model; it is purely a systems optimization to avoid all-reducing the gradients for the (very large) embeddings across potentially slow communication links. In terms of inference cost, the model is equivalent to an 8B parameter model with coupled input/output embeddings.

Furthermore, in a space-constrained environment, the 70k unused embeddings (corresponding to reserved tokens) could be removed from the input/output embedding matrices. This would reduce the model size by approximately 570M parameters.
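The parameter figures above can be checked with a few lines of arithmetic. The sketch below uses the hidden size, the roughly 262k vocabulary, and the roughly 70k reserved tokens quoted in this post; the exact vocabulary and reserved-token counts are assumed to be the rounded values.

```python
HIDDEN_SIZE = 4096        # hidden size reported in this post
VOCAB_SIZE = 262_000      # "262k" in this post; exact value assumed
UNUSED_TOKENS = 70_000    # "70k" in this post; exact value assumed
TOTAL_PARAMS = 9.3e9      # approximate size of the released checkpoint


def embedding_params(rows, hidden=HIDDEN_SIZE):
    """Number of parameters in an embedding matrix with the given row count."""
    return rows * hidden


# One of the two decoupled embedding matrices (input or output).
one_embedding_matrix = embedding_params(VOCAB_SIZE)

# Size of an equivalent model that shares a single embedding matrix.
coupled_equivalent = TOTAL_PARAMS - one_embedding_matrix

# Rows for reserved tokens can be dropped from both matrices.
removable = 2 * embedding_params(UNUSED_TOKENS)
```

Under these assumptions one embedding matrix is about 1.07B parameters, so the coupled equivalent is about 8.2B, and removing the reserved rows from both matrices saves about 573M parameters.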

We train the model from start to finish with a sequence length of 16K on 737B tokens uniformly sampled from a much larger dataset, which is a mix of text (~75%) and code (~25%).

Natively training on such long sequences throughout training is made possible by our development of an improved version of FlashAttention (GitHub). We also modified the base for rotary calculations to allow for full position resolution at this longer length. This contrasts with all other open source models, which use a sequence length of at most 4096 for the majority of training. We use a vocabulary of 262k tokens, built using a unigram sentencepiece model.
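For readers unfamiliar with the rotary "base", the sketch below shows where it enters the standard rotary formulation: each pair of dimensions rotates at a frequency of base^(-2i/d), so a larger base stretches the slowest rotations over more positions. The head dimension and base values used here are hypothetical; this post does not state the values used for Persimmon-8B.

```python
import math


def rotary_inv_freq(head_dim, base):
    """Per-pair rotation frequencies base^(-2i/head_dim) for i = 0 .. head_dim/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Number of positions the slowest-rotating pair needs for one full turn."""
    return 2 * math.pi / rotary_inv_freq(head_dim, base)[-1]


def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


# Hypothetical settings, for comparison only.
wavelength_small_base = longest_wavelength(head_dim=64, base=10_000)
wavelength_large_base = longest_wavelength(head_dim=64, base=20_000)
```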

We’ve included a table with important model information below:

Hidden Size           4096
Batch Size            120
Sequence Length       16384
Training Iterations   375000
Tokens Seen           737 Billion
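The token count follows directly from the other rows of the table, assuming the batch size is counted in sequences. The comparison with Llama 2 additionally assumes the roughly 2 trillion training tokens reported for Llama 2.

```python
BATCH_SIZE = 120               # sequences per step (assumed unit)
SEQUENCE_LENGTH = 16_384       # tokens per sequence
TRAINING_ITERATIONS = 375_000  # optimizer steps

# Total tokens seen during training.
tokens_seen = BATCH_SIZE * SEQUENCE_LENGTH * TRAINING_ITERATIONS

# Assumption: Llama 2 was trained on about 2 trillion tokens.
LLAMA2_TOKENS = 2e12
ratio_to_llama2 = tokens_seen / LLAMA2_TOKENS
```

This gives 737.28 billion tokens, and a ratio of about 0.37 relative to Llama 2, consistent with the figures quoted in this post.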

Flexible and Fast Inference

We’re also releasing fast inference code for this model. With a short prompt, we can sample ~56 tokens per second on one 80GB A100 GPU [3]. While most optimized inference code is complicated and brittle, we’ve managed to make ours flexible without sacrificing speed. We can define models in PyTorch, run inference with minimal changes, and still be faster than FasterTransformer.

There are two main things that slow down traditional inference implementations:

  1. First, both the Python runtime and CUDA kernel dispatch incur per-operation overheads.
  2. Second, failing to fuse operations means we spend time writing values to memory and then reading the same values back again; while this overhead might go unnoticed during training (which is compute bound), inference is usually bottlenecked by memory bandwidth.
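To see why memory bandwidth matters so much, consider a back-of-the-envelope bound: when generating one token at a time, every weight has to be read from GPU memory at least once per token. The sketch below assumes 16-bit weights, a nominal bandwidth of about 2 TB/s for an 80GB A100, and ignores the KV cache and activations; all three are simplifying assumptions rather than measurements.

```python
def max_tokens_per_second(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on single-stream decoding speed if weight reads dominate."""
    bytes_read_per_token = n_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_read_per_token


# Assumed values: 9.3B parameters, 2 bytes each, about 2 TB/s of bandwidth.
bound = max_tokens_per_second(
    n_params=9.3e9,
    bytes_per_param=2,
    bandwidth_bytes_per_s=2.0e12,
)
```

Under these assumptions the ceiling is a little over 100 tokens per second, so any extra round trips to memory from unfused operations eat directly into the achievable speed.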

The standard practice for achieving fast inference is to rewrite the entire model inference loop in C++, as in FasterTransformer, and call out to special fused kernels in CUDA. But this means that any changes to the model require painfully reimplementing every feature twice: once in Python / PyTorch in the training code and again in C++ in the inference codebase. We found this process too cumbersome and error prone to iterate quickly on the model.

We wanted a strategy that would fix both of these slowdowns without maintaining a separate C++ codebase.

  1. To handle operator fusion, we’ve extracted one of the attention kernels [4] from NVIDIA’s FasterTransformer repo. During Python inference, we simply replace the attention operation with a call to this kernel. Because our architecture modifications don’t touch the core attention operation, this highly complex kernel can remain unmodified.
  2. To handle the per-operator overheads, we use CUDA graphs to capture and replay the forward pass. We’ve also implemented this in a way that works with tensor parallelism, which lets us easily use multiple GPUs for inference.
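The capture-and-replay idea in the second step can be sketched with PyTorch's public CUDA graph API. This is a generic illustration, not the released inference code: it assumes a CUDA-capable GPU, a model that returns a single tensor with static shapes, and a hypothetical helper name `make_graphed_forward`. It does not cover tensor parallelism.

```python
def make_graphed_forward(model, example_input, warmup_iters=3):
    """Capture model(example_input) in a CUDA graph and return a replay function."""
    import torch  # imported lazily so this file loads on machines without PyTorch

    # Graph replay reuses fixed memory, so inputs are copied into a static buffer.
    static_input = example_input.clone()

    with torch.no_grad():
        # Warm up on a side stream before capture, as the PyTorch docs recommend.
        side_stream = torch.cuda.Stream()
        side_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side_stream):
            for _ in range(warmup_iters):
                model(static_input)
        torch.cuda.current_stream().wait_stream(side_stream)

        # Record every kernel launched by one forward pass into the graph.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_output = model(static_input)

    def replay(new_input):
        # One graph launch replaces many per-operation Python and CUDA dispatches.
        static_input.copy_(new_input)
        graph.replay()
        return static_output.clone()

    return replay
```

A caller would build the replay function once, for example `forward = make_graphed_forward(model, x)`, and then call `forward(next_x)` for each step with an input of the same shape.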

This strategy gives us the best of both worlds: we can write model code in only one place while still doing inference faster than FasterTransformer. We really hope this accelerates the exciting applications that folks in the community can build.

This is just the first small release in a series of things we’re excited to put out this fall and winter. Enjoy!


If you use this model in your work, please use the following BibTeX citation:

  author = {Elsen, Erich and Odena, Augustus and Nye, Maxwell and Ta\c{s}\i{}rlar, Sa\u{g}nak and Dao, Tri and Hawthorne, Curtis and Moparthi, Deepak and Somani, Arushi},
  title = {Releasing {Persimmon-8B}},
  url = {},
  year = {2023}


  1. The Llama 2 base model did not produce valid code in our eval runs, so we additionally report the value from the Llama 2 paper.

  2. In contrast to the more standard SwiGLU and GeLU activations, the squared ReLU often results in output activations consisting of 90+% zeros. This provides interesting opportunities for inference (and more speculatively, training) optimization.

  3. Note that because our vocabulary is larger than that of LLaMA and MPT, the actual inference speed in terms of characters is likely comparatively higher.

  4. The decoder_masked_multihead_attention kernel, in particular.