ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • NaibofTabr@infosec.pub · 11 months ago

    ChatGPT is a large language model. The model contains word relationships - a nebulous collection of rules for stringing words together. The model does not contain information. In order for ChatGPT to answer questions flexibly, it must have access to information for reference - information that it can index, tag and sort for keywords.
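
    (As a concrete picture of what “word relationships” means here, below is a minimal sketch using the open GPT-2 model as a stand-in, since ChatGPT’s own weights aren’t public. The model’s entire state is numeric weight tensors, and generating text is just repeated next-token prediction from those tensors.)

```python
# Minimal sketch, using GPT-2 as a public stand-in for ChatGPT.
# The model's state is nothing but floating-point weight tensors;
# generation is repeated next-token prediction from those weights,
# with no document index or keyword lookup involved.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Everything the model "contains" is plain numbers.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters, all floating-point values")

# Greedy generation: pick the most likely next token, over and over.
ids = tokenizer("Large language models are trained on", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
```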

    • TWeaK@lemm.ee · 11 months ago

      information that it can index, tag and sort for keywords.

      The dataset ChatGPT is trained on contains data that was copied unlawfully. They’re not just reading the data at its source, they’re copying it into a training database without sufficient license.
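
      (Roughly, this is the step in question - any pretraining pipeline fetches source pages and writes a local copy before tokenization and training ever start. Sketch below; the URLs are placeholders, not OpenAI’s actual sources.)

```python
# Rough sketch of the copying step in a pretraining data pipeline:
# pages are fetched and persisted to a local corpus before training.
# The URLs are placeholders, not OpenAI's actual sources.
import requests

urls = [
    "https://example.com/article-1",  # placeholder
    "https://example.com/article-2",  # placeholder
]

with open("corpus.txt", "w", encoding="utf-8") as corpus:
    for url in urls:
        page = requests.get(url, timeout=10).text  # one copy now exists in memory
        corpus.write(page + "\n")                  # and another persists on disk
```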

      Whether ChatGPT itself contains all the works is debatable - is it just word relationships when the system can reproduce significant chunks of copyrighted data from those relationships? - but the process of training inherently requires unlicensed copying.
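
      (The “reproduce significant chunks” part can be tested directly: prompt the model with the start of a known passage and check whether the continuation comes back verbatim - the same basic idea behind the extraction results in the headline. A crude sketch, using GPT-2 and a public-domain line purely as an illustration:)

```python
# Crude memorization check: prompt with the opening of a well-known passage
# and test whether the model continues it verbatim. GPT-2 and a public-domain
# sentence are used here only to illustrate the technique.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prefix = "It was the best of times, it was the worst of times,"
expected = " it was the age of wisdom, it was the age of foolishness"

ids = tokenizer(prefix, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=15, do_sample=False)
continuation = tokenizer.decode(out[0][ids["input_ids"].shape[1]:])

print(repr(continuation))
print("verbatim continuation:", continuation.startswith(expected))
```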

      In terms of fair use, they could argue a research exemption, but this isn’t really research, it’s product development. The database isn’t available as part of scientific research; it’s protected as a trade secret. And even if it were considered research, it is absolutely commercial in nature.

      In my opinion, there is a stronger argument that OpenAI have broken copyright for commercial gain than that they are legitimately performing fair use copying for the benefit of society.

    • ayaya@lemdro.id · 11 months ago

      I’m honestly not sure what you’re trying to say here. If by “it must have access to information for reference” you mean it has access while it is running, it doesn’t. Like I said, that information is only available during training. Either you’re trying to make a point I’m just not getting, or you’re misunderstanding how neural networks function.
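
      (The distinction in code, with a tiny PyTorch model standing in for the network: during training the data passes through and the weights change to absorb it; during inference the data only passes through and nothing is stored.)

```python
# Toy illustration of training vs. inference, using a tiny stand-in network.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in for a neural network
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Training: data passes through AND the weights are updated to absorb it.
before = model.weight.detach().clone()
x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
print(torch.equal(before, model.weight))       # False - the weights changed

# Inference: data passes through, the weights stay exactly as they were.
before = model.weight.detach().clone()
with torch.no_grad():
    _ = model(torch.randn(8, 4))
print(torch.equal(before, model.weight))       # True - nothing was stored
```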

      • NaibofTabr@infosec.pub · 11 months ago

        Like I said, that information is only available during training.

        This is not correct. I understand how neural networks function; I also understand that the neural network is not a complete system in itself. In order to be useful, the model is connected to other things, including a source of reference information. For instance, earlier this year ChatGPT was connected to the internet so that it could respond to queries with more up-to-date information. At that point, the neural network was frozen. It was not being actively trained on the internet; it was just connected to it for the sake of completing search queries.
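
        (The pattern described is roughly the sketch below: a frozen model plus a search tool, where fresh information only ever arrives through the prompt text. web_search is a hypothetical placeholder - the actual browsing backend isn’t public - and GPT-2 stands in for the model.)

```python
# Sketch of "frozen model + live data source": new information reaches the
# model only through the prompt text; the weights are never updated by it.
# web_search is a hypothetical placeholder, not a real OpenAI API.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()    # weights stay frozen

def web_search(query: str) -> str:
    # Placeholder: imagine this calls a search API and returns page snippets.
    return "Snippet: ...up-to-date text fetched from the live web..."

def answer(question: str) -> str:
    context = web_search(question)                        # fetched at query time
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():                                 # no training happens here
        out = model.generate(**ids, max_new_tokens=30)
    return tokenizer.decode(out[0][ids["input_ids"].shape[1]:])

print(answer("What is in the news today?"))
```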

        • brianorca@lemmy.world · 11 months ago

          That is an optional feature, not required to make use of an LLM, and not even a feature of most LLMs. ChatGPT was usable before they added it, but it can help when you need recent data. And they do continue to train it, with the current cutoff being April of this year, at least for some models. (But training is expensive, so we can expect it to happen in conjunction with other design changes that require additional training.)