• luciole (he/him)@beehaw.org
    1 year ago

    There have been ongoing suspicions that pirated content was used to train popular LLMs, simply because the popular datasets used for training do include such content. The Washington Post did an article about it.

    Google’s C4 dataset, used for research, included illegal websites. What remains to be seen is whether it was cleaned up before training Bard as we know it today. OpenAI has revealed nothing about its dataset.

    • Moonrise2473
      1 year ago

      No, there’s a big difference. It just includes scraped data from pirate websites, as in: the page with the description and the “download now” button.

      This is because they didn’t do a separate scrape for training; they used what they already had in their service from their web-indexing scans. Is zlibrary (b-ok in the article) present in web results? Yes. So parts of its pages (an excerpt stolen from Amazon plus a “login to download now” button) are also present in the model.

      Between this and assuming that the bot was specifically programmed to log in to a pirate website, pay for VIP access (download 1000 ebooks a day instead of 10), then parse the content of the ebooks, which aren’t in a consistent format because they’re user-uploaded so it changes constantly, there’s an ocean of difference.