It’s pretty easy to see the problem here: The Internet is brimming with misinformation, and most large language models are trained on a massive body of text obtained from the Internet.

Ideally, substantially higher volumes of accurate information would overwhelm the lies. But is that really the case? A new study by researchers at New York University examines how much medical misinformation can be included in a large language model (LLM) training set before it starts spitting out inaccurate answers. While the study doesn’t identify a lower bound, it does show that by the time misinformation accounts for just 0.001 percent of the training data, the resulting LLM is compromised.
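
For a rough sense of scale, here is a back-of-the-envelope sketch of what a 0.001 percent poisoning rate means in absolute terms. The corpus sizes below are hypothetical round numbers, not figures reported in the NYU study; only the 0.001 percent fraction comes from the article.

```python
# Back-of-the-envelope: absolute amount of misinformation implied by a
# 0.001 percent poisoning rate. Corpus sizes are assumed round numbers,
# not values reported in the study.
POISON_FRACTION = 0.001 / 100  # 0.001 percent of training tokens

for corpus_tokens in (1_000_000_000, 100_000_000_000, 1_000_000_000_000):
    poisoned = int(corpus_tokens * POISON_FRACTION)
    print(f"{corpus_tokens:>17,} training tokens -> {poisoned:>10,} poisoned tokens")
```

At that rate, a hypothetical 100-billion-token corpus would still carry roughly a million tokens of misinformation, which is why such a tiny fraction can matter.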

  • ribhu@lemmy.world · 15 hours ago

    How old is this study? The LLMs mentioned are Llama 2 and GPT-3.5, which in current terms are almost archaic.

    • Zron@lemmy.world · 14 hours ago

      Unfortunately, it’s a lot harder to rigorously test something than it is to shit a new product out into the wild with no regard for its impact.