Large Language Models for Code: Exploring the Landscape, Opportunities, and Challenges

04 Jul 2024

Large Language Models (LLMs) for Code

GitHub Copilot and other similar models have shown promising results in improving developer productivity, but they are only accessible through APIs, limiting transparency and reproducibility.
The Hugging Face Hub now hosts over 1,700 open-source code models, including pure code completion models and LLMs trained on code.
Training LLMs for code requires substantial resources, including extensive computational power, large amounts of data, and expertise in data processing and model tuning.
The BigCode project, a collaboration between Hugging Face and ServiceNow, aims to promote open and transparent practices in the development of code LLMs.
BigCode emphasizes data transparency, open-sourcing the code for data processing and model training, and releasing model weights under commercially friendly licenses.

The BigCode project has released The Stack, the largest open dataset of source code, and two families of code generation models, StarCoder and StarCoder 2, along with an instruction-tuned model called StarChat 2.
StarCoder, released in May 2023, was the best open code generation model at the time; it has 15 billion parameters and was trained on 512 A100 GPUs for 24 days.
StarCoder 2, released in collaboration with Software Heritage, is a larger and improved model that outperforms StarCoder and many other existing models.
StarCoder 2 outperforms other models such as Code Llama 34B on code and math tasks.
StarCoder 2 is aware of repository context and can draw on other files in the same repository when producing completions.
StarChat 2 is an instruction-tuned version of StarCoder 2 that can follow instructions in various programming languages.
The StarCoder dataset has been used to train new models such as StableCode and CodeGen 2.5, and StarCoder itself has been fine-tuned into models such as WizardCoder and Defog SQLCoder.
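
To make the repository-context point more concrete, here is a minimal sketch of how several files from one repository could be packed into a single completion prompt. The `<repo_name>` and `<file_sep>` separators are assumed to match the format used for StarCoder 2; the exact token strings should be checked against the tokenizer of the model in use. The helper name `build_repo_prompt` and the example repository are hypothetical.

```python
from typing import Dict

# Assumed special tokens; verify them against the tokenizer of the model you use.
REPO_TOKEN = "<repo_name>"
FILE_SEP = "<file_sep>"


def build_repo_prompt(repo_name: str, context_files: Dict[str, str],
                      target_path: str, target_prefix: str) -> str:
    """Concatenate context files, then the beginning of the file to complete."""
    parts = [f"{REPO_TOKEN}{repo_name}"]
    for path, code in context_files.items():
        parts.append(f"{FILE_SEP}{path}\n{code}")
    # The target file goes last so that the model continues writing it.
    parts.append(f"{FILE_SEP}{target_path}\n{target_prefix}")
    return "".join(parts)


prompt = build_repo_prompt(
    repo_name="example/geometry",
    context_files={"shapes.py": "def area(w, h):\n    return w * h\n"},
    target_path="main.py",
    target_prefix="from shapes import area\n\nprint(",
)
```

Placing the unfinished file at the end lets the model see `shapes.py` before it has to complete the call in `main.py`, which is the behavior the repository-awareness claim describes.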

Customizing existing code LLMs to a codebase or new datasets requires fewer resources compared to training from scratch.
Techniques for adapting code LLMs include prompt engineering, in-context learning, tool use, fine-tuning, and continued pre-training.
Open-source libraries like Text Generation Inference (TGI) and vLLM allow efficient deployment of large language models for code generation, including across multiple GPUs.
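
As an illustration of the deployment point, the sketch below sends a completion request to a TGI server over HTTP using only the Python standard library. It assumes a TGI instance is already running and serving a code model; the base URL, the default parameter values, and the helper names `build_generate_request` and `generate` are hypothetical choices for this example.

```python
import json
import urllib.request


def build_generate_request(prompt: str, max_new_tokens: int = 64,
                           temperature: float = 0.2) -> dict:
    """Build the JSON body for TGI's /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(base_url: str, prompt: str, timeout: float = 30.0) -> str:
    """POST the prompt to a running TGI server and return the generated text."""
    payload = json.dumps(build_generate_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["generated_text"]


# Example call, assuming a server is listening locally (hypothetical URL):
# print(generate("http://localhost:8080", "def fibonacci(n):"))
```

Sharding the model over several GPUs is configured when the server is launched, so the client code stays the same regardless of how many GPUs serve the model.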

Better open-source models and high-quality datasets are needed to catch up with closed-source models like GPT-4 and Codex.
Data transparency and governance are important to ensure users know the data used for training and any potential biases.
Improved evaluations are needed that focus on low-resource languages and test class implementations, not just functions.
Smaller, specialized models are desirable for users with limited resources.
Code attribution tools are being developed to help users check whether generated code matches the training data and identify its original authors.
The model training process involves filtering to ensure code quality, but overly aggressive filtering can lead to data loss and reduced model performance.
Legal and ethical considerations regarding data scraping from GitHub include using only public repositories, filtering out restricted licenses, providing opt-out tools for users, and implementing code attribution tools.
Mitigation strategies are being studied to address the risk of model collapse when using synthetic data generated by the model for further training.
The speaker highlights the challenges in generating multi-file or project-size completions with AI models and emphasizes the need for better base models, fine-tuning data, and evaluation metrics.
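
The quality-filtering trade-off mentioned above can be sketched with a few simple per-file heuristics. This is an illustrative example rather than the project's actual pipeline: the function names and the default thresholds (maximum line length, average line length, fraction of alphanumeric characters) are assumptions chosen for the example.

```python
def line_stats(source: str):
    """Return (max line length, average line length, alphanumeric fraction)."""
    lines = source.splitlines() or [""]
    max_len = max(len(line) for line in lines)
    avg_len = sum(len(line) for line in lines) / len(lines)
    alnum_fraction = sum(ch.isalnum() for ch in source) / max(len(source), 1)
    return max_len, avg_len, alnum_fraction


def keep_file(source: str, max_line_len: int = 1000,
              max_avg_line_len: float = 100.0,
              min_alnum_fraction: float = 0.25) -> bool:
    """Keep a file only if it looks like hand-written code, not data or minified output."""
    max_len, avg_len, alnum_fraction = line_stats(source)
    return (max_len <= max_line_len
            and avg_len <= max_avg_line_len
            and alnum_fraction >= min_alnum_fraction)


normal_code = "def add(a, b):\n    return a + b\n"
minified_code = "x=" + "1+" * 2000 + "1"
symbol_noise = "{}[]();" * 10

kept = [keep_file(s) for s in (normal_code, minified_code, symbol_noise)]
```

Tightening any threshold removes more files, so the share of data dropped is worth measuring alongside downstream model quality, which is the risk of over-filtering noted above.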