Comedian and author Sarah Silverman, as well as authors Christopher Golden and Richard Kadrey, are suing OpenAI and Meta each in a US District Court over dual claims of copyright infringement.
Interested to see how this plays out! Their argument that the only way an LLM could summarize their book is by ingesting the full copyrighted work seems a bit suspect, as it could’ve ingested plenty of reviews and summaries written by humans and combined that information.
I’m not confident that they’ll be able to prove OpenAI or Meta infringed copyright, just as I’m not confident they’ll be able to prove that they didn’t violate copyright. I don’t know if anyone really knows what these things are trained on.
We got to where we are now with fair use in search and online commentary because of a ton of lawsuits setting precedent, so it’s not surprising we’ll have to do the same with machine learning.
ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
I think this is where the crux of the case lies since the article mentions these are only available illegally through torrents.
This is starting to touch on the root of why they keep calling this “AI”, “training”, etc. They aren’t doing this strictly for marketing, they are attempting to skew public opinion. These companies know intimately how to do that.
They’re going to argue that if torrents are legal for educational purposes (i.e. the loophole that all trackers use), and they’re just “training” an “AI”, then they’re just engaging in education. And an ignorant public might buy it.
These kinds of cases will be viewed as landmark cases in the future and honestly I don’t have huge hopes. The history of these companies is engineer first, excuse the lack of ethics later. Or the philosophy of “it’s easier to apologize than ask”.
It’s the de facto term for how we fit a statistical model to data, unrelated to any copyright concepts. I’m pretty sure we called it “training” back in 1997 when I was doing neural networks at uni, and it’s probably been used well before then too.
Neural nets are based on the concept of Hebbian learning (from the 1940s), because they are trying to mimic how a biological neural network learns.
This concept of training/learning has persisted because it’s a good analogy of what we are trying to do with these statistical models, even if they aren’t strictly neural networks.
This concept of training/learning has persisted because it’s a good analogy of what we are trying to do with these statistical models, even if they aren’t strictly neural networks.
LLMs are indeed neural networks.
Ahh ok. I didn’t want to assume as I’m not familiar with the details.
TBH I’m not really familiar with how AI has developed over the years. Wikipedia says that ChatGPT is proprietary, which leads me to believe it hasn’t been developed with research grants or government involvement. Is this the case? Can a company legally develop an AI by obtaining its learning material through illegal means? Which it sounds as if OpenAI and Meta did through the use of Bibliotik.
I can’t see how this doesn’t have some legal ramification, but IANAL.
OpenAI is called that for a reason. They absolutely were a non-profit research org initially, so would have been eligible for research grants, etc. They would probably have gotten a pass on using the torrents too, for the same reason.
They went to a private for-profit model later, after they built their AIs and wanted to start selling them as a service. How the hell all of that plays out as the company they are now is anyone’s guess.
Even if they did train the model on the entire text of the book, that’s still not necessarily copyright violation. I would think not, since the resulting model doesn’t actually have a copy of the book embedded within it.
Do we know that it isn’t?
How do we “know” anything where the answers are just being made up as part of humanity’s collective cultural game of Calvinball?
Courts in various jurisdictions will make various rulings. Judges will interpret them in various ways. Legislators will chime in with new legislation and new treaties. Internet arguments will churn away with a whole range of assumptions about what is true or false that may or may not have anything to do with reality.
I present my opinion here. I feel it is well informed and I can back it up in various ways when challenged. But nobody “knows” anything because these aren’t laws of physics or math that we’re talking about here.
Or did you mean whether we know if a copy of the book is embedded in the model? That can be more objectively tested, at least.
AFAIK it takes these large bodies of text and, rather than digesting them and keeping them in some sort of database, it (and I’m generalising here) looks holistically at how often certain words are strung together and takes note of that. Let’s call those notes weights.
Then users can prompt something, and the ‘magic’ here is that it is able to pick out words of different weights based on the prompt, whether you are writing an angry email to your boss, code in Python, or the structure for a book.
But it is unable to recreate the book from a prompt.
People who know the topic more intimately, please correct me if I am wrong.
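To make the “how often certain words are strung together” idea concrete, here is a toy sketch in Python. It only counts which word follows which in a made-up sentence. The function names and the sample text are invented for illustration, and a real LLM is a neural network with learned weights rather than a table of counts, so treat this as an analogy at best.

```python
from collections import Counter, defaultdict

def train_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical toy corpus, not taken from any of the books in the lawsuit.
corpus = "the cat sat on the mat and the cat slept"
counts = train_bigram_counts(corpus)
print(most_likely_next(counts, "the"))
```

Note that the table stores follower counts per word, not the sentence itself, although with a corpus this tiny you could obviously reconstruct most of it.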
It’s difficult to tell to what extent books are encoded into the model. The data might be there in some abstract form or another.
During training it is kind of instructed to plagiarize the text it’s given. The instruction is basically “guess the next word of this unfinished excerpt”. It probably won’t memorize all input it’s given, but there’s a nonzero chance it manages to memorize some significant excerpts.
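A rough sketch of what “guess the next word of this unfinished excerpt” looks like as training data, in Python. The helper name `next_word_examples` and the `context_size` parameter are made up for this illustration, and real systems work on sub-word tokens with much longer contexts, so this is only the general shape of the idea.

```python
def next_word_examples(text, context_size=3):
    """Turn a text into (context, next_word) pairs for next-word prediction."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        # The context is the last `context_size` words before position i.
        context = tuple(words[max(0, i - context_size):i])
        examples.append((context, words[i]))
    return examples

# Hypothetical sample sentence used only to show the shape of the data.
examples = next_word_examples("it was the best of times", context_size=3)
for context, target in examples:
    print(" ".join(context), "->", target)
```

Every word of the source text shows up as a prediction target, which is why some memorization of excerpts is at least possible.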
It’s difficult to tell to what extent books are encoded into the model. The data might be there in some abstract form or another.
This is a court case so the accusers are going to have to prove it.
The evidence provided is that ChatGPT can produce two-page summaries of the books. The summaries are of unknown accuracy, I haven’t read the books myself so I have no idea how much of those summaries are hallucinations. This is very weak.
They have to prove it, but if the case gets far enough they will have the right to ask for discovery and they can see for themselves what was included. That’s why it might just settle quietly to avoid discovery.
The important question is not what was in the training data. The important question is what is in the model. The training data is not magically compressed into the model like some kind of physics-defying ultra-Zip, the model does not contain a copy of the training data.
There are open-source language models out there, you can experiment with training them. Unless you massively over-fit it on a specific source document (an error that real AI training procedures do everything they can to avoid) you won’t be able to extract the source documents from the resulting model.
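If you do run that experiment, one simple way to check whether a model is regurgitating a source document is to measure the longest run of consecutive words that its output shares with the source. Here is a minimal Python sketch. The function name and both sample strings are hypothetical, and how long a run would matter legally is not something this code can answer.

```python
def longest_shared_run(source, output):
    """Length, in words, of the longest run of consecutive words in both texts."""
    a = source.lower().split()
    b = output.lower().split()
    best = 0
    # prev[j] holds the length of the shared run ending at a[i-2] and b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# Hypothetical strings standing in for a source text and a model's output.
source = "call me ishmael some years ago never mind how long"
output = "the narrator says call me ishmael some years back"
print(longest_shared_run(source, output))
```

A short paraphrased summary would score low on this measure, while a verbatim excerpt would score close to its own length.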
But the server used to calculate the model would have a copy of it. If training an AI model is not fair use, then the mere act of loading a book you don’t have a license for into the server would be copyright infringement. Like, textbook. It’s an unauthorized digital copy. It’s all very untested legal ground and it seems like lots of people want to be the first to test it. Not everyone has a great case, but if the courts interpret things a certain way there’s gonna be lots of payouts, so maybe best to get in line early?
Perhaps, but that’s a separate legal issue from the model itself. You might have committed a breach of copyright in the process of gathering the material that the AI was trained on but the model itself is not a copy of that material and so is not itself illegal to train or use. And perhaps not even that, since downloading a pirated book is not the illegal part (uploading it is).
As you say, there’s some untested legal waters here. But it seems likely to me that the best that Silverman will accomplish is some nibbling and quibbling around the edges.
If you can give some vague prompts to the model to obtain something that is close enough to a significant chunk of the work that, had it been written by a human, was susceptible of being considered plagiarism… then I’d say the same laws protecting from plagiarism should operate there.
It doesn’t matter whether it’s really stored there in some form or not (in fact, it’s probably OK to store copyrighted material on a private server as long as it’s lawfully obtained), but whether the output that is being distributed to third parties is violating the license of the work or not.
If you can give some vague prompts to the model to obtain something that is close enough to a significant chunk of the work that, had it been written by a human, was susceptible of being considered plagiarism… then I’d say the same laws protecting from plagiarism should operate there.
Perhaps, but that’s not even remotely what’s being accused in this case. They’re asking ChatGPT for a summary of the book and it’s generating a summary a couple of pages long. Nothing is even close to verbatim, and I don’t know enough about any of the books to know if those summaries are even accurate. In my experience ChatGPT often ends up hallucinating a lot of details when asked stuff like this.
Right but you can sue for what happened on the training server. I’m guessing the training server still exists. I doubt they wiped it completely before the next round of training. If the training server infringes copyright then you still lose the suit. Maybe. Remember that copyright law is not written with the internet in mind. If you have a “copy” and it’s not authorized that might just be enough for a backwards court to find infringement.
I think of it in extremes. Imagine you had a video producing model of the future. Could you then load up every MLB game recorded and train the model to make novel baseball games based on that or would the MLB be pissed you had a server full of every MLB game ever recorded?
It may be that no one currently knows exactly what these things are trained on, but it could be determined. If you know the methodology you can figure out what data is being used. The companies involved are going to resist letting anyone find out, but I’m hoping a court case will break that black box open.
One of the many problems with this form of AI is the degree to which we don’t know where it’s getting its information from. Without that, there is no way to determine the reliability of the results. They can sound perfectly reasonable and be entirely untrue.
It’s very rare for me to want Facebook to win a lawsuit. It’s just as rare for me to want to see Sarah Silverman not succeed. But in this case, I think the Internet needs to see Facebook win.
You think that the internet needs more AI?
All you will get from this will be more bots, advertising in every place for whichever brand paid the most. It’s all about ads. Just like TV shows, the ads will be integrated into the content and be indistinguishable from the original content.
A place like this one will be treasured as a great thing from the past. Is it really what you want?
Honestly? A little bit, yeah. More automated tools with greater function will help as long as we can moderate their use.
My real concern is more related to the fact that this will probably lead to a massive crackdown on sources and shadow libraries that have been used as training data for AIs. If this goes through, I see a lot of ML/AI/bots being forced into an audit, and whenever “potentially infringing” content is found, they won’t just remove it, there will be an aggressive push against the shadow library hosting it.
You want more ads and more bots?
This is reddit, this is where you are coming from.
You seem to have missed the vast majority of my point, so I’ll ask this directly - do you, or do you not, support the continued existence of shadow libraries?
I couldn’t care less about the existence of shadow libraries. What is this attempt at pushing buzz names? Once the data is out it is out.
She is right to bring the lawyers to the party; this is where the GAFAM fall, because they care about laws. If they didn’t, then they wouldn’t hire a legal department the size of a small African nation.
Hitting the AI devs a few times with massive fees for copyright infringement should cool down their enthusiasm, and the party should tone down a notch or two. It’s all about money: when the income from ad money is lower than the outgoing fee money, calm should be restored.
As we’ve said on The Vergecast every time someone gets Nilay going on copyright law, we’re going to see lawsuits centered around this stuff for years to come.
I can’t wait to watch and/or listen to The Vergecast this week. If Nilay is on the podcast this time (he wasn’t this past week) he’s going to talk about it. I for one, can’t wait to hear what he has to say.