Unavailable at source.

  • FauxLiving@lemmy.world · 1 day ago

    I’m not sure what standard you’re using to call it unreliable.

    You can see in the example I provided that it correctly answered the question and also correctly cited where the answer came from, in the same amount of time it would take to type the query into Google.

    Yes, LLMs by themselves can hallucinate, and they do so at a high enough rate that they’re unreliable sources of information. That is 100% true. It will never be fixed, because LLMs are trained to be an autocorrect and produce syntactically correct language. You should never depend on raw LLM-generated text from an empty context, like from a chatbot.

    The study of this in academia (example: https://arxiv.org/html/2312.10997v5) has found that LLM hallucination rate can be dropped to almost nothing (less than a human) if the model is given text containing the information it is being asked about. So, if you paste a document into the chat and ask it a question about that document, the hallucination rate drops significantly.

    This finding led to a technique called Retrieval Augmented Generation, where you use some non-AI means of finding data, like a search engine, and then put the retrieved documents into the context window along with the question. This lets you build systems that use LLMs for the tasks they’re accurate and fast at (like summarizing text that is in the context window) and non-AI tools for the things that require accuracy (like searching databases for facts and tracking citations).
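
    As a rough illustration, here is a minimal sketch of that flow in Python. The `search_engine` and `llm` objects are hypothetical stand-ins, not any specific library; they represent whatever non-AI retriever and model endpoint a given system actually uses.

    ```python
    # Minimal RAG sketch: retrieve documents with a non-AI search step, then
    # ask the model to answer only from the retrieved text. search_engine and
    # llm are hypothetical stand-ins, not real libraries.

    def answer_with_citations(query: str, search_engine, llm, top_k: int = 3) -> dict:
        # Non-AI step: find candidate documents (each with its text and source URL).
        docs = search_engine.search(query)[:top_k]

        # Put the retrieved text into the context window along with the question,
        # so the model summarizes what is in front of it instead of relying on
        # whatever happens to be baked into its weights.
        context = "\n\n".join(f"[{i}] {d['text']}" for i, d in enumerate(docs))
        prompt = (
            "Answer the question using only the numbered sources below, "
            "and reference them by number.\n\n"
            f"Sources:\n{context}\n\nQuestion: {query}"
        )

        return {
            "answer": llm.generate(prompt),          # AI step: summarize the context
            "citations": [d["url"] for d in docs],   # citations tracked by the
        }                                            # non-AI retrieval step
    ```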

    You can see in the images I posted that it both answered the question and also correctly cited the source, which was the entire point of contention.

    • The study of this in academia

      you are linking to an arxiv preprint. I do not know these researchers. there is nothing that indicates to me that this source is any more credible than a blog post.

      has found that LLM hallucination rate can be dropped to almost nothing

      where? It doesn’t seem to be in this preprint, which is mostly a history of RAG and mentions hallucinations only as a problem affecting certain types of RAG more than other types. It makes some relative claims about accuracy that suggest including irrelevant data might make models more accurate. It doesn’t mention anything about “hallucination rate being dropped to almost nothing”.

      (less than a human)

      you know what has a 0% hallucination rate about the contents of a text? the text

      You can see in the images I posted that it both answered the question and also correctly cited the source, which was the entire point of contention.

      this is anecdotal evidence, and also not the only point of contention. Another point was, for example, that AI text is horrible to read. I don’t think RAG (or any other tacked-on tool they’ve been trying for the past few years) fixes that.

      • FauxLiving@lemmy.world · 1 day ago

        you know what has a 0% hallucination rate about the contents of a text? the text

        What text are you reading that has a 0% error rate? Google search results? Reddit posts? You seem to be comfortable with the idea that arxiv preprints can have an error rate that isn’t 0%, so ‘the text’ isn’t guaranteed to have no errors.

        Even assuming perfect text, your error rate in summarization isn’t 0% either. Do you not misread passages, or misremember facts and have to search again, or find that you need to edit a rough draft before you finish it? We deal with errors all of the time, and they’re manageable as long as they’re low. The question isn’t ‘can we make a process that has a 0% error rate’ (that’s an impossible standard); the question is whether we can make a system that has an error rate that is close to or lower than a person’s.

        The reason is that these systems scale in a way that you do not. Even if you have savant-level reading, recall and summarization skills that would make Kim Peek envious, how many books’ worth of material can you read and summarize in 10 seconds? 1? 5?

        Could you read and summarize 75 novels (10 million tokens) with a 0% error rate? I’d imagine not, and you certainly couldn’t do it in 30 seconds. In fact, this would be an impossible task for you no matter how high an error rate we allowed. You simply cannot ingest data fast enough to even make a guess at what a summary would look like. Or, to be more accurate to the actual use case, could you read 75 novels and provide a page reference to all of the passages written in iambic pentameter? I can read the passages myself; I just need you to find them and tell me the page. You’d probably take longer than 10 seconds, and you would almost assuredly miss some.

        Meanwhile an LLM could produce a summary, with citations generated and tracked by non-AI systems, with an error rate comparable to a human (assuming the human was given a few months to work on the problem) in seconds.

        • Jack Riddle[Any/All]@lemmy.dbzer0.com · 22 hours ago

          what text are you reading that has a 0% error rate?

          as I said, the text has a 0% error rate about the contents of the text, which is what the LLM is summarising, and to which it adds its own error rate. Then you read that and add your error rate.

          the question is can we make a system that has an error rate that is close to or lower than a person’s

          can we???

          could you read and summarize 75 novels with a 0% error rate?

          why… would I want that? I read novels because I like reading novels? I also think that LLMs are especially bad at summaries, since there is no distinction between “important” and “unimportant” in the architecture. The point of a summary is to get only the important points, so it clashes.

          provide a page reference to all of the passages written in iambic pentameter?

          no LLM can do this. LLMs are notoriously bad at any analysis of this kind of stylistic element because of their architecture. why would you pick this example?

          Meanwhile an LLM could produce a summary, with citations generated and tracked by non-AI systems, with an error rate comparable to a human (assuming the human was given a few months to work on the problem) in seconds.

          I still have not seen any evidence for this, and it still does not address the point that the summary would be pretty much unreadable

          • FauxLiving@lemmy.world · 21 hours ago

            as I said, the text has a 0% error rate about the contents of the text, which is what the LLM is summarising, and to which it adds its own error rate. Then you read that and add your error rate.

            Error rates that you simultaneously haven’t defined and have declared too high to be usable.

            These tools clearly work, much like a search engine clearly works. They have errors (try finding clean search results), but we use them.

            You could make the same argument about search. If you issued a query to Google and compared the results generated by its machine learning systems against those of a human who read the entire Internet specifically to answer your query, you would probably find that, in the end (after a few decades), the human’s results were more responsive to your query, while the Google results start to become random nonsense once you get to page 3 or 4.

            By any measure the Google results are worse than what a human would choose. This is why you have to ‘learn’ to search and to issue queries in a specific way, because otherwise you get errors/bad results.

            The problem with the accurate human results is that all of the people on the planet working full-time, 365 days a year, could not service a single minute’s worth of the queries that Google’s machine learning algorithms serve up 24/7.

            Could you read 3 books and find the answer you want? Or craft some regular expression search to find it? Sure, but you can’t do it faster than it takes to run a RAG search and inference over 10 million tokens’ worth of text.

            The whole point of search is that looking through every document every time you want to find something is a waste of effort; summarization lets you survey larger volumes of data more accurately and zero in on what you’re looking for. You never trust the output of the model, just like you don’t cite Google’s search results page or Wikipedia, because they are there to point you to information, not provide it. A RAG system gives you the citations for the data, so once the summarization indicates that it has found what you’re looking for, you can read the source for yourself.
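
            To make that last step concrete, here is a small sketch of the ‘follow the citation, don’t trust the summary’ workflow. It reuses the dict shape from the hypothetical answer_with_citations() sketch earlier in the thread; fetch_text is likewise a stand-in, not any particular tool’s API.

            ```python
            # Sketch of treating a RAG result as a pointer to sources rather than
            # as the source itself. result is the dict from the earlier hypothetical
            # answer_with_citations(); fetch_text is a stand-in for however you
            # load the cited document.

            def passages_to_read(result: dict, fetch_text) -> list[tuple[str, str]]:
                """Return (url, text) pairs for the cited sources so a human can read them."""
                return [(url, fetch_text(url)) for url in result["citations"]]

            # The model's summary only tells you where to look; what you actually
            # rely on (and cite) is the text behind each returned URL.
            ```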


            the question is can we make a system that has an error rate that is close to or lower than a person’s

            can we???

            Yes.

            Here is a peer-reviewed article published in Nature Medicine: https://pmc.ncbi.nlm.nih.gov/articles/PMC11479659/

            The relevant section from the abstract:

            A clinical reader study with 10 physicians evaluated summary completeness, correctness and conciseness; in most cases, summaries from our best-adapted LLMs were deemed either equivalent (45%) or superior (36%) compared with summaries from medical experts.

            Another peer-reviewed article, published in npj Digital Medicine: https://www.nature.com/articles/s41746-025-01670-7

            Our clinical error metrics were derived from 18 experimental configurations involving LLMs for clinical note generation, consisting of 12,999 clinician-annotated sentences. We observed a 1.47% hallucination rate and a 3.45% omission rate. By refining prompts and workflows, we successfully reduced major errors below previously reported human note-taking rates, highlighting the framework’s potential for safer clinical documentation.


            why… would I want that? I read novels because I like reading novels? I also think that LLMs are especially bad at summaries, since there is no distinction between “important” and “unimportant” in the architecture. The point of a summary is to get only the important points, so it clashes.

            ‘Novel’ is used as a human-scale unit of text, because you may not know what 10 million tokens means in terms of actual length. I’m clearly not talking about reading fictional novels for entertainment.

            Meanwhile an LLM could produce a summary, with citations generated and tracked by non-AI systems, with an error rate comparable to a human (assuming the human was given a few months to work on the problem) in seconds.

            I still have not seen any evidence for this, and it still does not address the point that the summary would be pretty much unreadable

            https://lemmy.world/post/43275879/22220800

            This is an example of a commercial tool that returns both non-LLM-generated citations and an accurate summary of the article’s contents as they relate to the question.