It seems very improbable that they scraped a pirate website with forced registration and tight daily download limits (10 books a day max?) to get content that's often mislabeled and not presented in a homogeneous format.
More likely it's just using the excerpt from Amazon (which, with paid API access, is much easier to obtain) as a prompt and building on it.
There have been ongoing suspicions that pirated content was used to train popular LLMs, simply because popular datasets used for training LLMs do include such content. The Washington Post did an article about it.
Google's C4 dataset used for research included illegal websites. What remains to be seen is whether it was cleaned up before training Bard as we know it today. OpenAI has revealed nothing about its dataset.
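For what it's worth, domain-level cleanup of a C4-style dataset is straightforward in principle. Here's a rough sketch of how it could look; the blocklist and the use of the public allenai/c4 copy on Hugging Face are my own illustrative assumptions, not a description of what Google actually did:

```python
# Illustrative sketch only: filtering a C4-style corpus by source domain,
# assuming the goal is to drop pages from known shadow-library sites.
# The blocklist is made up for the example, not any real cleaning pipeline.
from urllib.parse import urlparse

from datasets import load_dataset  # pip install datasets

BLOCKED_DOMAINS = {"b-ok.org", "z-lib.org", "libgen.rs"}  # hypothetical blocklist

def is_allowed(record):
    """Keep a record only if its source URL is not on the blocklist."""
    host = urlparse(record["url"]).netloc.lower()
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

# C4 records expose "text", "url" and "timestamp" fields.
c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
cleaned = c4_stream.filter(is_allowed)

for example in cleaned.take(3):
    print(example["url"])
```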
No, there’s a big difference. It just includes scraped data from pirate websites. As in: the page with the description and the “download now” button.
This is because they didn't do a separate scrape for training; they used what they had already collected while crawling the web for their index. Is zlibrary (b-ok in the article) present in web results? Yes. So parts of those pages (an excerpt stolen from Amazon plus a "login to download now" button) are also present in the model.
Between this and assuming that the bot was specifically programmed to log in to a pirate website, pay for VIP access (1,000 eBook downloads a day instead of 10), and then parse the content of the ebooks, which aren't in a consistent format because they're user-uploaded and vary wildly, there's an ocean of difference.
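To make the distinction concrete, here's roughly what a generic indexing-style crawl of a shadow-library listing page would capture: only the visible landing-page text, not the ebook behind the login wall. The URL is a placeholder and this is just an illustrative sketch, not anyone's actual pipeline:

```python
# Minimal sketch of what a web-indexing crawl sees on a book listing page:
# the title, blurb, and "download" button label, never the ebook file itself.
import requests
from bs4 import BeautifulSoup

def crawl_page_text(url: str) -> str:
    """Return the visible text of a page, the way an indexing crawler would see it."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-visible content
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

# A crawler hitting a listing gets something like:
# "Book Title  Author  Publisher blurb ...  Login to download"
# i.e. metadata and UI strings, which is what would end up in a web-scrape corpus.
print(crawl_page_text("https://example.org/book/12345")[:500])  # placeholder URL
```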
The sources for those websites are all being archived as a huge torrent. You don't have to download every single book one by one if you're interested in all of them…
I was not aware of that
Seems extremely stupid to commit mass piracy for profit
Huh?
https://annas-blog.org/help-seed-zlibrary-on-ipfs.html
https://libgen.rs/repository_torrent/
The website is like that.
It still seems improbable that they committed massive piracy by specifically searching for and downloading illegal torrents.
https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai says:
The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
If Meta used an illegal source (which is extremely stupid, like using drug money to open a bank), it does not mean Google or OpenAI did the same.
The Meta model is not public, probably for that reason; they just trained it on dirty data for research, to see whether it was feasible.
For fun, I searched for the most obscure and niche recent book I could think of: 9791280546517, “Vado e tornerò da voi. Riflessioni sulla Pasqua e sulla Pentecoste”. It's so niche that it's impossible to find a pirated or even a legitimate ebook copy. Even though it was published only a few months ago, Bing AI was able to produce an excerpt and even a short review.
Meta’s LLaMA model actually is publicly available; they released it widely to anyone with a .edu email address, and of course it soon ended up on BitTorrent. Here is the 🧲 link (which you can also hilariously still find in this pull request, despite the DMCA takedowns they’ve sent elsewhere about it).