Study: Is GPT-4 getting dumber over time?

San Francisco, July 21, 2023

A study by Stanford and Berkeley researchers found that OpenAI’s language models performed worse on some tasks in June 2023 than in March 2023. For example, GPT-4’s accuracy in identifying prime numbers dropped from 97.6% to 2.4%.

The study, which has not been peer reviewed, examined the performance of GPT-3.5 and GPT-4 on tasks such as solving math problems, answering dangerous or sensitive questions, generating code, and visual reasoning.

GPT-4 was less willing to answer sensitive questions in June than in March, and both models made more formatting errors when generating code.

The share of GPT-4’s generated code that was directly executable fell from 52% in March to 10% in June.

The paper highlights the problem of model drift, or a decline in the accuracy and performance of models over time.

“Overall, our results show that the behavior of the ‘same’ LLM service can change significantly in a relatively short period of time,” the researchers write, adding that it is important to continuously monitor the performance of the models.
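
For readers who want to picture what such monitoring could look like, the following is a minimal sketch in Python and not the researchers' actual method. It scores yes/no answers to prime-number questions against a ground-truth check and flags a drop relative to an earlier baseline. The `ask_model` callable, the `has_drifted` helper, and the 0.05 tolerance are hypothetical placeholders introduced only for illustration.

```python
from typing import Callable, Iterable


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def prime_accuracy(ask_model: Callable[[int], str], numbers: Iterable[int]) -> float:
    """Fraction of numbers for which the model's yes/no answer is correct.

    `ask_model` is a hypothetical stand-in for a call to a language model;
    it is assumed to return text beginning with "yes" or "no".
    """
    items = list(numbers)
    if not items:
        raise ValueError("need at least one number to score")
    correct = 0
    for n in items:
        said_prime = ask_model(n).strip().lower().startswith("yes")
        if said_prime == is_prime(n):
            correct += 1
    return correct / len(items)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True if accuracy fell by more than `tolerance` (an assumed threshold)."""
    return baseline - current > tolerance
```

Running the same fixed question set against each new model version and comparing the result with a stored baseline is one simple way to notice a change like the one the study reports.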

The findings are consistent with reports from some users that the models are becoming less intelligent.

However, Peter Welinder, OpenAI’s vice president of product, denied that the models had been intentionally changed to make them “dumber” and said users may notice more problems over time simply because they use ChatGPT more often.