New benchmark approach brings transparency to AI language models
San Francisco, Nov. 18, 2022.
A new benchmarking approach aims to evaluate and shed light on the rapidly growing field of AI language models.
The Stanford AI Center for Research on Foundation Models (CRFM) announced the project, called Holistic Evaluation of Language Models (HELM), to serve as a “map for the world of language models” as they grow in popularity.
CRFM grew out of the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Until now, there has been no single standard for comparing and evaluating language models and considering their fairness, robustness, and other aspects. HELM aims to provide that transparency, according to CRFM, which seeks “collaboration with the broader AI community.”
Key features of HELM, which will be updated regularly, include evaluating models for accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
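To make two of these metric categories concrete, here is a minimal, hypothetical sketch of how accuracy and calibration can be scored from a model's predictions. The function names and sample data are illustrative assumptions, not HELM's actual code or API; the calibration measure shown is a standard expected calibration error (ECE) estimate over equal-width confidence bins.

```python
# Illustrative sketch only -- not HELM's actual implementation.

def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def expected_calibration_error(confidences, correct, num_bins=10):
    """Average gap between a model's stated confidence and its actual
    accuracy, weighted by how many examples fall into each confidence bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        lo, hi = b / num_bins, (b + 1) / num_bins
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == lo)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        bin_acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(avg_conf - bin_acc)
    return ece

# Toy example: four multiple-choice answers with the model's confidences.
preds = ["A", "B", "A", "C"]
gold  = ["A", "B", "C", "C"]
confs = [0.9, 0.8, 0.6, 0.7]
hits  = [p == y for p, y in zip(preds, gold)]
print(accuracy(preds, gold))  # 0.75
print(expected_calibration_error(confs, hits))
```

A well-calibrated model would have a low ECE: when it reports 80% confidence, it should be right about 80% of the time. HELM reports these kinds of scores side by side rather than reducing a model to a single number.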
A team led by CRFM Director Percy Liang evaluated various models for different scenarios. The models were either open-source, private, or offered through commercial APIs. The team concluded that fine-tuning the models using human feedback “is very effective in terms of accuracy, robustness, and fairness” and allows smaller models to compete with those that are ten times their size.
Open models such as Meta’s OPT and BigScience’s BLOOM still perform worse than non-open models such as OpenAI’s InstructGPT davinci v2, although the team noted that open models “have improved dramatically in the last year.”