To participate in this competition, you must start with a base model from our approved list, use only open-source data, and limit your fine-tuning to a single 24-hour period. Fine-tuning must be carried out on a single graphics card: either an NVIDIA RTX 4090 or an NVIDIA A100 (40GB). The competition features two hardware tracks, the NVIDIA RTX 4090 track and the NVIDIA A100 track, and each track will be evaluated separately.

Approved base models:

The starting model for the competition must be an open base model without instruction-tuning. Examples of licenses we’ll accept are MIT, Apache 2.0, and BigScience RAIL. We’re also happy to discuss other licenses on a case-by-case basis; for example, per community interest, we’ll accept the Llama 2 community license agreement if you’ve requested and been approved for a download link. All sizes of the common autoregressive and autoencoder base models listed below are allowed.

If you plan to use an open-source base model family not listed here, please reach out to us and we will consider adding it to the list. Please respect the honor system in place for this competition and acquire your base model through legitimate channels only (i.e., no pirated LLaMA weights). Any submission that uses a base model obtained through illicit means will be disqualified.


You are welcome to use any open-source dataset. For example:

Under no circumstances should you use data that infringes on data-usage agreements, copyright law, or privacy policies. In particular, do not use datasets containing content generated by another LLM, whether instructions/prompts or results/answers, unless that LLM’s license explicitly permits such use. If you opt to create your own dataset, it must be open-sourced and readily accessible to the general public at the time of submission. Some concrete clarifications:


The evaluation process in our competition will be conducted in two stages. In the first stage, we will run a subset of the HELM benchmark along with a set of secret holdout tasks. The holdout tasks will consist of logic-reasoning multiple-choice Q&A scenarios as well as conversational chat tasks. Submissions will be ranked by their performance across all evaluation tasks, with the ranking determined by the geometric mean of the per-task scores; this score will be shown on the leaderboard. For the most up-to-date details on which specific HELM tasks we’re evaluating, please parse the .conf files in our starter repo. Keep in mind that your submission must take at most 2 hours to evaluate on the sample .conf files we’ve provided. There are also some hardware constraints in place for practical reasons, simply because that’s the hardware the organizers have available: 128GB of RAM and 500GB of disk.

\(\text{score} = \left( \prod_{t \in \text{tasks}} \text{mean-win-rate}(t) \right)^{1/|\text{tasks}|}\)
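As a concrete illustration, the score above can be computed from per-task mean win rates as follows. This is a minimal sketch: the task names and rates are hypothetical placeholders, not the actual competition tasks or evaluation code.

```python
import math

def leaderboard_score(mean_win_rates):
    """Geometric mean of the per-task mean win rates.

    mean_win_rates: dict mapping task name -> mean win rate in (0, 1].
    """
    rates = list(mean_win_rates.values())
    # Geometric mean = (product of rates) ** (1 / number of tasks).
    return math.prod(rates) ** (1 / len(rates))

# Hypothetical per-task results for illustration only.
rates = {"task_a": 0.62, "task_b": 0.55, "task_c": 0.48}
score = leaderboard_score(rates)
```

Note that because the geometric mean multiplies the per-task rates, a near-zero score on any single task drags the overall score down sharply, so well-rounded models are rewarded over narrow specialists.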

After the competition closes on October 25th, 2023, we will contact the top 3 teams with the highest-scoring models in each hardware track and request that they submit all code and data necessary to reproduce their model, starting from their chosen open-source base model. We will then replicate their entire process to ensure it is repeatable and that the same results can be achieved within 24 hours on a single GPU. If the top-scoring model cannot be reproduced under these conditions, we will move on to the next highest-scoring model in that track, and we will continue this process until a reproducible, high-performing model is selected or we exhaust all candidates and declare no winner for the category.