In the short span since Codellama's introduction, the coding world has been electrified by groundbreaking open-source large language models that outshine GPT-4's performance on the HumanEval dataset, or so we thought.
Updated HumanEval scores show that GPT-4 has improved since its release, scoring 82.0, still significantly above WizardCoder at 73.2.
While the original Codellama paper revealed its prowess over GPT-3.5, recent evaluations show substantial advancements. This article delves into the intricacies of these new models, exploring their capabilities and real-world prowess.
| Model | HumanEval Score (%) |
| --- | --- |
| GPT-4 (Aug 26, 2023) | 82.0 |
| WizardCoder (Codellama 34B) | 73.2 |
| GPT-3.5 | 72.5 |
| Claude 2 | 71.2 |
| GPT-4 (Mar 15, 2023) | 67.0 |
Phind’s Model: Phind unveiled two remarkable models:
- A fine-tuned Codellama 34 billion parameter model
- A Codellama 34 billion parameter Python model
Their performance on the HumanEval dataset was stellar; for reference, GPT-4 scored 82% on the dataset as of August 26, 2023.
WizardCoder by WizardLM Team
The watershed moment came from the WizardLM team. Their code-centric model, WizardCoder, is a sensation. The newly fine-tuned Codellama 34 billion parameter model, tailored for Python, clinched a score of 73.2% on the HumanEval dataset, significantly rising from the previously reported score.
WizardCoder attains the second position in this benchmark, surpassing GPT-4 (2023/03/15 version, 73.2 vs. 67.0), ChatGPT-3.5 (73.2 vs. 72.5), and Claude 2 (73.2 vs. 71.2).
GPT-3.5 vs. Codellama
GPT-3.5 and Codellama’s performances on the HumanEval dataset were noteworthy. GPT-4, as of August 26, 2023, scored 82% on the dataset. GPT-3.5, in the same period, managed 72.5%.
Nevertheless, the WizardCoder team undertook an independent assessment, reporting performance metrics for GPT-3.5 and GPT-4 variants. Their March 15 results align with the scores OpenAI previously reported, while recent GPT-3.5 and GPT-4 scores on the HumanEval dataset have improved significantly.
In the WizardCoder team’s analysis, their model surpassed GPT-3.5 by a notable margin and even outperformed the recent GPT-4 iteration. Understanding the distinction between OpenAI’s original reports and the WizardCoder team’s evaluations is crucial.
The Evolution of Open-Source Models
This year began without an open-source model rivaling ChatGPT. Today, 34 billion parameter models like WizardCoder rival ChatGPT on coding tasks.
The Need for Diverse Benchmarks
Though the HumanEval dataset remains a benchmarking staple, the community increasingly craves broader benchmark datasets. Solely relying on compact datasets like HumanEval might risk models overfitting. Thus, the future lies in curating more extensive and diverse benchmark datasets.
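For context, HumanEval-style benchmarks measure functional correctness: a model's generated solution is executed against unit tests, and the headline score is the fraction of problems solved. A minimal sketch of that idea follows; this is not the official harness (which sandboxes each candidate in an isolated process), and the function names are illustrative:

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate solution passes its unit tests."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the assertions against it
        return True
    except Exception:
        return False

def pass_at_1(results: list) -> float:
    """Fraction of problems whose first sampled solution passed."""
    return sum(results) / len(results)
```

A benchmark this small (164 problems in HumanEval) is exactly why overfitting is a concern: a model can look strong on those tests without generalizing.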
Beyond benchmarks, a model’s mettle is tested against real-world challenges. WizardCoder was evaluated against two coding tasks:
- Grouping Matching Letters: The objective was crafting a Python function to reorganize characters in a string, ensuring similar letters cluster together. WizardCoder aced this.
- Roman to Integer Conversion: The aim was to design a Python function to transition a Roman numeral into an integer. WizardCoder nailed this as well.
These tasks had previously confounded the original Codellama model, underscoring WizardCoder’s advancements.
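The article does not reproduce the prompts or WizardCoder's generated code, but both tasks are well defined, so here are minimal hand-written sketches of what a correct solution looks like (function names and exact output conventions are assumptions, not taken from the evaluation):

```python
from collections import Counter

def group_matching_letters(s: str) -> str:
    """Reorder a string so identical letters sit together.

    Counter preserves first-seen order (Python 3.7+), so letters
    appear grouped in the order they first occur in the input.
    """
    return "".join(ch * n for ch, n in Counter(s).items())

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to an integer using the subtraction rule."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one (e.g. the I in IV) is subtracted.
        if i + 1 < len(numeral) and ROMAN_VALUES[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total
```

For example, `group_matching_letters("abab")` yields `"aabb"`, and `roman_to_int("MCMXCIV")` yields `1994`.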
The Future of Language Models in Coding
As we stand on the cusp of a new era, one can’t help but wonder about the future. Will we soon see models that can seamlessly understand and write in multiple programming languages? Or models that can write code and optimize it for performance? Only time will tell, and I hope to keep on top of it all.
The world of coding is undergoing a revolution, and models like Codellama and WizardCoder are at the forefront. Their performances, capabilities, and potential applications hint at a future where coding becomes more accessible, efficient, and innovative. For enthusiasts and professionals alike, this is a space to watch closely.
Want to try out the new CodeLlama? Check out our guide on How to Install Code Llama.