Microsoft acquired GitHub in 2018 for $7.5 billion and has since integrated the code repository into its developer toolset while maintaining a largely hands-off approach. However, writer, lawyer and programmer Matthew Butterick takes issue with Microsoft’s machine learning-based code assistant, GitHub Copilot, and how it apparently mishandles open source licensing.
GitHub Copilot, a plugin available for Visual Studio and other IDEs, works by offering “suggestions” for code completion as you type. The AI-based system is powered by Codex. But it’s how the AI is trained, or more specifically what it’s trained on, that becomes a problem for developers like Butterick.
According to OpenAI, the developers of Codex (which is licensed by Microsoft):
Codex was trained on “tens of millions of public repositories”, including code on GitHub. Microsoft itself has loosely described the training materials as “billions of lines of public code.” But Copilot researcher Eddie Aftandilian confirmed in a recent podcast (@36:40) that Copilot is “train[ed] on public repositories on GitHub”.
The issue here is that the public repositories hosted on GitHub are typically licensed, and many of those licenses require attribution when code from the repositories is used. Microsoft has been vague about its use of the code, calling it fair use, but Copilot can do more than offer suggestions: it can also emit verbatim snippets of code, as shown by Texas A&M professor and GitHub user Tim Davis:
@github copilot, with “public code” blocked, emits large chunks of my copyrighted code, no attribution, no LGPL license. For example, the simple prompt “sparse matrix transpose, cs_” produces my cs_transpose in CSparse. My code on the left, github on the right. Not OK. pic.twitter.com/sqpOThi8nf
—Tim Davis (@DocSparse) October 16, 2022
For programmers like Butterick, who contribute open source code out of a sense of community, removing attribution from their work is a problem:
It can be said that Microsoft is creating a new walled garden that will prevent programmers from discovering traditional open source communities. Or at the very least, remove any incentive to do so. Over time, this process will starve these communities. User attention and engagement will be moved to Copilot’s walled garden and away from the open source projects themselves, away from their source repositories, issue trackers, mailing lists, discussion forums. This shift in energy will be a painful and permanent loss to open source.
You can check out Butterick’s “GitHub Copilot investigation” for more information.