To participate in this competition, you must start from a base model on our approved list, use only open-source data, and limit your fine-tuning to a single 24-hour period. Fine-tuning must be carried out on a single GPU, either an NVIDIA 4090 or an NVIDIA A100 (40GB). The competition features two hardware tracks, the NVIDIA 4090 track and the NVIDIA A100 track, and each track will be evaluated separately.
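If you want a quick sanity check that your environment matches one of the hardware tracks, a minimal sketch (assuming PyTorch is installed; the device-name check is only illustrative, not an official validation step) might look like:

```python
import torch

# A single visible GPU is required; multi-GPU fine-tuning is not allowed.
assert torch.cuda.device_count() == 1, "fine-tuning must use exactly one GPU"

# The reported name should correspond to your track, e.g.
# 'NVIDIA GeForce RTX 4090' or 'NVIDIA A100-SXM4-40GB'.
print("training on:", torch.cuda.get_device_name(0))
```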
Approved Base Models:
The starting model for the competition must be an open base model without instruction-tuning. Examples of licenses we’ll accept are MIT, Apache 2, and BigScience RAIL. We’re also happy to discuss other licenses on a case-by-case basis; for example, per community interest we will accept the Llama 2 community license agreement if you have requested and been approved for a download link. All sizes of the common autoregressive and autoencoder base models listed below are allowed.
- ALBERT
- BART
- BERT
- Bloom
- Cerebras (btlm, GPT)
- Colossal-LLaMA-2-7b-base
- DeBERTa
- DeciLM-6B
- DistilBERT
- Electra
- Falcon
- GPT2
- GPT Neo, J, NeoX, Pythia
- InternLM
- LLaMA or Llama 2
- Mistral
- MPT
- OpenLLaMA
- OPT
- Persimmon
- Qwen
- Red Pajama Base (not the instruction-tuned models)
- RoBERTa
- T5 (not Flan-T5)
- UL2
If you plan to use an open-source base model family not listed here, please reach out to us and we will consider adding it to the list. Please respect the honor system in place for this competition, and acquire your base model through legitimate channels only (i.e., no pirated LLaMA weights). Any submission that uses a base model obtained through illicit means will be disqualified.
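As an illustration, most approved base models can be pulled straight from the Hugging Face Hub. This is only a sketch; the Pythia checkpoint named below is one arbitrary example from the approved families, not a recommendation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint from the GPT NeoX / Pythia family (see the approved list above);
# substitute any other approved, non-instruction-tuned base model.
model_id = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```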
Datasets:
You are welcome to use any open-source dataset, for example (a short loading sketch follows this list):
- Databricks-Dolly-15k
- OpenAssistant Conversations Dataset (oasst1)
- The Flan Collection
- AllenAI Dolma
- RedPajama-Data-1T
- LIMA
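As a minimal sketch (assuming the Hugging Face `datasets` library and the hub ID `databricks/databricks-dolly-15k`), loading one of these datasets might look like:

```python
from datasets import load_dataset

# Load the Dolly instruction data; field names follow the published dataset card.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly[0]["instruction"])
```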
Under no circumstances should you use data that violates data usage agreements, copyright law, or privacy policies. In particular, do not use datasets containing content generated by another LLM, whether instructions/prompts or results/answers, unless that LLM's license explicitly permits such use. If you opt to create your own dataset, it must be open-sourced and readily accessible to the general public at the time of submission. Some concrete clarifications:
- Any LLM-generated dataset must be generated from one of the approved base models
- Under no circumstances can you use datasets generated by ChatGPT
- You can generate a dataset with Llama 2 provided that the dataset is released under the Llama 2 license and that, in your submission, the generated dataset is only consumed by a Llama 2 model. Other models, such as Qwen, have similar license terms
- You can generate a dataset using InternLM (Apache 2 license) to fine-tune any other LLM on the approved model list
Evaluation:
The evaluation process in our competition will be conducted in two stages. In the first stage, we will run a subset of the HELM benchmark along with a set of secret holdout tasks. The holdout tasks will consist of logic-reasoning multiple-choice Q&A scenarios as well as conversational chat tasks. Submissions will be ranked based on their performance across all tasks; the ranking will be determined by the geometric mean across all evaluation tasks, and this score will be shown on the leaderboard. For the most up-to-date details on which specific HELM tasks we’re evaluating, please parse the .conf files in our starter repo https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge. Keep in mind that your submission needs to take at most 2 hours on the sample .conf files we’ve provided. There are also some hardware constraints that we’ll have in place for practical reasons, simply because that’s the hardware the organizers have available: 128GB of RAM and 500GB of disk.
\(\text{score} = \left( \prod_{\text{task}} \text{mean-win-rate}(\text{task}) \right)^{1/N}\), where \(N\) is the number of evaluation tasks.
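For concreteness, here is a small sketch of how that score could be computed from per-task mean win rates; the task names and numbers below are made up for illustration, and the real task list comes from the HELM .conf files:

```python
import math

# Hypothetical per-task mean win rates (illustrative values only).
mean_win_rates = {"task_a": 0.61, "task_b": 0.55, "task_c": 0.48}

# Geometric mean across all evaluation tasks, as in the formula above.
score = math.prod(mean_win_rates.values()) ** (1.0 / len(mean_win_rates))
print(f"leaderboard score: {score:.3f}")
```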
After the competition closes on October 25th, 2023, we will contact the three teams with the highest-scoring models in each hardware category and request that they submit all code and data necessary to reproduce their model, starting from their chosen open-source base model. We will then replicate their entire process to ensure it is repeatable and that the same results can be achieved within 24 hours on a single GPU. If the top-scoring model cannot be reproduced under these conditions, we will move on to the next highest-scoring model in that hardware category, continuing this process until a reproducible, high-performing model is selected or we exhaust all options and declare no winners for the category.