Adept Fuyu-Heavy: A new multimodal model
January 24, 2024 — Adept Team
Adept Fuyu-Heavy is a new multimodal model designed specifically for digital agents.
We’re excited to introduce Adept Fuyu-Heavy, a new multimodal model designed specifically for digital agents. Fuyu-Heavy is the world’s third-most-capable multimodal model, behind only GPT-4V and Gemini Ultra, which are 10-20 times bigger. We’re excited about this model because:
- It excels at multimodal reasoning. To us the killer feature is UI understanding, but it also performs well on more traditional multimodal benchmarks. In particular, Fuyu-Heavy scores higher on the MMMU benchmark than even Gemini Pro.
- On standard text-based benchmarks, it matches or exceeds the performance of models in the same compute class despite having to devote some of its capacity to image modeling.
- It demonstrates that (with some modifications) we can scale up the Fuyu architecture and reap all of the associated benefits, including handling arbitrary size/shape images and efficiently reusing existing transformer optimizations.
Below we’ve included two sample model interactions. The first is a video of Fuyu-Heavy’s UI understanding capabilities. The second is a sample response on the MMMU benchmark. Read on for further details.
Prompt
According to the table below, which food is the most likely cause of the outbreak of food poisoning:
A. Cold chicken
B. Potato salad
C. Egg sandwiches
D. Fruit pie and cream
Adept
The table shows the number of people who ate different foods and the number of people who ate those foods and got sick. To determine which food is most likely the cause of the outbreak of food poisoning, we’re looking for the food with the highest percentage of people who ate that food and got sick.
Let’s calculate the percentage of people who ate each food and got sick:
- Cold chicken: 34 / 86 = 0.395, which is 39.5%
- Potato salad: 38 / 54 = 0.704, which is 70.4%
- Egg sandwiches: 40 / 76 = 0.526, which is 52.6%
- Fruit pie and cream: 12 / 32 = 0.375, which is 37.5%
- Cheese: 12 / 48 = 0.25, which is 25%
The food with the highest percentage of people who ate that food and got sick is potato salad, with 70.4%. Therefore, potato salad is the most likely cause of the outbreak of food poisoning.
Answer: B
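The attack-rate arithmetic in the response above is easy to verify with a few lines of Python. The (got sick, ate) counts are those quoted in the response; the original table image is not reproduced here:

```python
# Attack-rate check for the food-poisoning example above.
# Each entry maps a food to (number who got sick, number who ate it).
foods = {
    "Cold chicken": (34, 86),
    "Potato salad": (38, 54),
    "Egg sandwiches": (40, 76),
    "Fruit pie and cream": (12, 32),
    "Cheese": (12, 48),
}

# Attack rate = fraction of people who ate a food and got sick.
attack_rates = {food: sick / ate for food, (sick, ate) in foods.items()}
likely_cause = max(attack_rates, key=attack_rates.get)

print(likely_cause)                                 # Potato salad
print(round(attack_rates[likely_cause] * 100, 1))   # 70.4
```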
How we got here
Adept’s goal is to build Useful General Intelligence. In order to accomplish this goal, we first had to:
- Connect our agents to the digital world (see ACT-1)
- Build robust tooling for training, evaluation, inference, and data collection
- Establish a foundational model design that would scale to meet our needs (see Fuyu-8B)
This fall, we were satisfied that we’d accomplished these goals and that it was time to scale up the Fuyu architecture. This wasn’t easy — in addition to the standard hiccups with model scaling, we had to deal with the extra problems associated with training a new architecture on both text and image data. In particular:

- Image data stresses systems — keeping the model fed with data becomes harder, memory usage blows up, cloud storage ingress/egress becomes limiting; even handling image formats/coordinates consistently between training and inference is tricky
- Image models are famously unstable (see this great paper) — we’ve tweaked the Fuyu architecture and training procedure substantially to deal with this
- Finally, high-quality image pretraining data is scarce; we’ve devoted a lot of effort to collecting, curating, and even creating this data. There’s also a delicate balance between text and image tasks — we had to develop recipes for striking this balance at scale
Over the last 4 months, we’ve tackled all these problems and more. Fuyu-Heavy will shortly be powering our enterprise product. We’ve already applied lessons learned from Fuyu-Heavy to train its successor.
Evals
Although our ultimate goal is to build useful digital agents, and we internally benchmark our models with this standard in mind, it’s important for us to sanity-check our progress against commonly used benchmarks as well. According to these benchmarks, Fuyu-Heavy is the strongest multimodal model trained outside of Google or OpenAI.
Despite the tradeoff between language modeling performance and multimodal performance, Fuyu-Heavy performs roughly on par with Gemini Pro on standard text-only evaluations, outperforming it on the commonly used MMLU benchmark. Inflection-2 has stronger performance on some of these text evals, but it’s a much larger model. For all of these evals, we used the standard number of shots in the few-shot prompting regime.
| | MMLU | GSM8K | MATH | HumanEval |
| --- | --- | --- | --- | --- |
| Adept Fuyu-Heavy | 72.1 | 82.9 | 29.5 | 58.0 |
| Gemini Pro | 71.8 | 86.5 (Maj1@32) | 32.6 | 67.7 |
| Inflection-2 | 79.6 | 81.4 | 34.8 | 44.5 |
| Grok-1 | 73 | 62.9 | 23.9 | 63.2 |
It’s also become common to benchmark the ability of models to have long-form conversations. To do this, we put Fuyu-Heavy through a supervised fine-tuning stage followed by a direct preference optimization (DPO) stage on publicly available chat data. We use the two most commonly used chat evaluations — MT-Bench and AlpacaEval 1.0. Fuyu-Heavy Chat performs comparably to Claude 2.0 despite being a smaller model and having to use some of its capacity to model images.
| | MT-Bench | AlpacaEval 1.0 |
| --- | --- | --- |
| Adept Fuyu-Heavy | 8.01 | 92.20% |
| GPT-4 Turbo | 9.32 | 97.70% |
| Mistral Medium | 8.61 | 96.83% |
| Claude 2.0 | 8.06 | 91.60% |
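The DPO stage used for the chat variant optimizes the policy to prefer chosen over rejected responses relative to a frozen reference model. The post doesn’t describe the implementation, so the following is only a minimal sketch of the standard DPO objective; the function name, β value, and log-probabilities are illustrative:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Hypothetical log-probs: the policy already favors the chosen response,
# so the loss is below log(2), the value at indifference.
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0)
print(loss)
```

In practice this is computed batched over token-level log-probs in a framework like PyTorch; the scalar version above just shows the shape of the objective.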
Finally, we evaluate Fuyu-Heavy on what has become the standard for multimodal performance: MMMU. Fuyu-Heavy performs slightly better than Gemini Pro on this metric. For further context on what MMMU measures, we’ve included further examples of our model output for that dataset below. For completeness, we also include results on VQAv2 (which is quite flawed, see this post), on AI2D, a diagram understanding dataset, and on ChartQA, a dataset of question-answer pairs for plots and charts. Adept Fuyu-Heavy outperforms Gemini Pro on all of these metrics and outperforms Gemini Ultra on AI2D.
| | MMMU | VQAv2 | AI2D | ChartQA |
| --- | --- | --- | --- | --- |
| Adept Fuyu-Heavy | 48.3 | 76.2 | 81.2 | 75.4 |
| Gemini Pro | 47.9 | 71.2 | 73.9 | 74.1 |
| Gemini Ultra | 59.4 | 77.8 | 79.5 | 80.8 |
Prompt
Answer the following MMMU question based on the image.
Janet Ludlow’s firm requires all its analysts to use a two-stage dividend discount model (DDM) and the capital asset pricing model (CAPM) to value stocks. Using the CAPM and DDM, Ludlow has valued QuickBrush Company at $63 per share. She now must value SmileWhite Corporation. Calculate the required rate of return for SmileWhite by using the information in the following table:
A. 14%
B. 15%
C. 16%
Adept
The required rate of return for SmileWhite can be calculated using the capital asset pricing model (CAPM), which is:
Required Return = Risk-free Rate + Beta x (Expected Market Return - Risk-free Rate)
The risk-free rate and expected market return are provided in the table, and the beta for SmileWhite is given as 1.15.
Let’s calculate the required rate of return for SmileWhite:
Required Return = 4.50% + 1.15 x (14.50% - 4.50%)
Required Return = 4.50% + 1.15 x 10%
Required Return = 4.50% + 11.50%
Required Return = 16%
Therefore, the answer is:
C: 16%
Answer: C
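The CAPM arithmetic in the response above can be double-checked with a few lines of Python. The variable names are ours; the figures are the ones quoted in the response:

```python
# CAPM: required return = risk-free rate + beta * (market return - risk-free rate)
risk_free = 4.50       # %, quoted in the response
market_return = 14.50  # %, quoted in the response
beta = 1.15            # SmileWhite's beta, quoted in the response

required_return = risk_free + beta * (market_return - risk_free)
print(round(required_return, 2))  # 16.0
```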
Fuyu-Heavy can also perform more complex calculations, such as the following:
Prompt
Table 11.47 provides a recent survey of the youngest online entrepreneurs whose net worth is estimated at one million dollars or more.
Their ages range from 17 to 30.
Each cell in the table illustrates the number of entrepreneurs who correspond to the specific age group and their net worth.
We want to know whether the ages and net worth are independent.
$\chi^2$ test statistic = ______
A. 1.56
B. 1.76
C. 1.96
D. 2.06
Adept
To answer this question, we need to perform a chi-square test of independence.
The null hypothesis for this test is that the ages and net worth are independent, while the alternative hypothesis is that they are not independent.
The test statistic for a chi-square test of independence is:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

where O is the observed frequency, E is the expected frequency, and the sum is taken over all the cells in the contingency table.

The expected frequency for each cell is calculated as:

$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$

The degrees of freedom for a chi-square test of independence are:

$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$
The critical value for a chi-square distribution with df degrees of freedom at a given level of significance (e.g., 0.05) can be found in a chi-square table or calculated using a statistical software package.
If the calculated chi-square value is greater than the critical value, we reject the null hypothesis and conclude that the ages and net worth are not independent. If the calculated chi-square value is less than or equal to the critical value, we fail to reject the null hypothesis and conclude that the ages and net worth are independent.
Let’s calculate the expected frequencies for each cell in the table:
Now, let’s calculate the chi-square test statistic:
The degrees of freedom for this test are:
The critical value for a chi-square distribution with 2 degrees of freedom at the 0.05 level of significance is approximately 5.99.
Since our calculated chi-square value (1.76) is less than the critical value (5.99), we fail to reject the null hypothesis and conclude that the ages and net worth are independent.
Therefore, the answer is:
B. 1.76
Answer: B
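The chi-square procedure the model walks through can be sketched end to end in Python. The survey table itself is not reproduced in the post, so the observed counts below are hypothetical, chosen only to give a 3x2 shape matching the 2 degrees of freedom the response assumes:

```python
# Chi-square test of independence, following the steps described above.
# Hypothetical 3x2 contingency table (rows = age groups, cols = net-worth bands).
observed = [
    [8, 7],
    [6, 5],
    [4, 10],
]

rows, cols = len(observed), len(observed[0])
row_totals = [sum(row) for row in observed]
col_totals = [sum(observed[i][j] for i in range(rows)) for j in range(cols)]
grand_total = sum(row_totals)

# Expected frequency for each cell: (row total * column total) / grand total
expected = [[row_totals[i] * col_totals[j] / grand_total for j in range(cols)]
            for i in range(rows)]

# Test statistic: sum over all cells of (O - E)^2 / E
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(rows) for j in range(cols))
df = (rows - 1) * (cols - 1)

print(df)  # 2
print(round(chi2, 3))
```

With real data one would then compare `chi2` against the chi-square critical value at the chosen significance level (e.g. 5.99 for df = 2 at 0.05), exactly as in the response.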
Research at Adept
We’re building on this work in several exciting ways:
- Conducting fundamental scaling research to improve our base model capabilities
- Converting these base models into useful agents through reward modeling, self-play, and various inference-time search techniques
- Connecting these agents to the world to build useful, reliable products
If any of this is exciting to you, please reach out through our careers page.
Acknowledgements
We’d like to thank the Microsoft Azure team for providing and servicing the cluster that trained Fuyu-Heavy and for answering innumerable configuration questions. We’d also like to thank NVIDIA for helpful collaborations on model efficiency, for advice on various particulars of networking, etc., and, of course, for creating the GPUs :)