Artificial Language Models Can Be Misrepresented: The Case of Galactica and Google Seizures, and the Misuse of Stack Overflow
Researchers who use ChatGPT risk being misled by false or biased information, and incorporating it into their thinking and papers. Inattentive reviewers might be hoodwinked into accepting an AI-written paper by its beautiful, authoritative prose owing to the halo effect, a tendency to over-generalize from a few salient positive impressions7. Because this technology typically reproduces text without reliably citing the original sources or authors, researchers who use it risk unwittingly plagiarizing a multitude of unknown texts, and perhaps even giving away their own ideas. If information that researchers reveal to the chatbot is incorporated into the model, it could be served up to other people with no acknowledgement of the original source.
One of the known vulnerabilities of LLMs is the failure to handle negation, and the Google seizure faux pas makes sense given that. A couple of years ago, Allyson Ettinger demonstrated this with a simple study. When asked to complete a short sentence, the model would answer correctly for affirmative statements (i.e. “a robin is…”) and 100% incorrectly for negative statements (i.e. “a robin is not…”). In fact, it became clear that the models could not actually distinguish between the two scenarios, providing the exact same responses (nouns such as “bird”) in both cases. It is one of the rare linguistic skills that models do not improve at as they increase in size and complexity. Such errors reflect broader concerns raised by linguists about how much such artificial language models effectively operate via a trick mirror: learning the form of what the English language might look like, without possessing any of the inherent linguistic capabilities that would demonstrate actual understanding.
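A probe of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not Ettinger's actual protocol: `complete` stands in for whatever completion function a model exposes, the prompt pairs are made up for the example, and `stub_model` is a hypothetical model that ignores the word “not”, which mimics the failure described above.

```python
from typing import Callable, List, Tuple

# Illustrative affirmative/negated prompt pairs (assumed, not from the study).
PAIRS: List[Tuple[str, str]] = [
    ("A robin is a", "A robin is not a"),
    ("A salmon is a", "A salmon is not a"),
]


def negation_insensitivity(complete: Callable[[str], str],
                           pairs: List[Tuple[str, str]]) -> float:
    """Return the fraction of pairs for which the model gives the same
    completion to the affirmative prompt and to its negated version."""
    if not pairs:
        raise ValueError("need at least one prompt pair")
    same = sum(1 for pos, neg in pairs if complete(pos) == complete(neg))
    return same / len(pairs)


def stub_model(prompt: str) -> str:
    """Hypothetical model that looks only at the subject and ignores 'not'."""
    subject = prompt.split()[1].lower()
    return {"robin": "bird", "salmon": "fish"}.get(subject, "thing")
```

A score of 1.0 means the model never changes its answer when the sentence is negated; a model that handled negation would score close to 0.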
The creators of these models admit to the difficulty of addressing responses that do not accurately reflect the contents of authoritative external sources. Galactica and ChatGPT have generated, for example, a “scientific paper” on the benefits of eating crushed glass (Galactica) and a text on “how crushed porcelain added to breast milk can support the infant digestive system” (ChatGPT). In fact, Stack Overflow had to temporarily ban the use of ChatGPT-generated answers as it became evident that the LLM generates convincingly wrong answers to coding questions.
The blame and the praise that follow this work are closely related. Model builders and tech evangelists alike attribute impressive output to a mythically self-sufficient model, a technological marvel. The human decision-making involved in model development is erased, and model feats are observed as independent of the design and implementation choices of its engineers. But without naming and recognizing the engineering choices that contribute to the outcomes of these models, it becomes almost impossible to acknowledge the related responsibilities. As a result, both functional failures and discriminatory outcomes are also framed as devoid of engineering choices, and are blamed on society at large or on supposedly “naturally occurring” datasets, factors that those developing these models will claim they have little control over. But they do have control, and none of the models we are seeing is inevitable. It would have been entirely feasible for different choices to have been made, resulting in an entirely different model being developed and released.
ChatGPT: A Conversational AI-Detection Tool to Identify and Analyse Research Papers by Students, Researchers and Scholars
As the workload and competition in academia increase, so does the pressure to use conversational AI. Chatbots provide opportunities to complete tasks quickly, from PhD students striving to finalize their dissertation to researchers needing a quick literature review for their grant proposal, or peer-reviewers under time pressure to submit their analysis.
At the moment, it is looking like the end of essays as an assignment for education, according to a graduate student who studies law, innovation and society. Dan Gillmor, a journalism scholar at Arizona State University in Tempe, told newspaper The Guardian that he had fed ChatGPT a homework question that he often assigns his students, and the article it produced in response would have earned a student a good grade.
Lancaster acknowledges that ChatGPT puts everything into a neat, free package. But he thinks that its texts will out themselves more easily than the products of essay mills, by including quotes that were never actually said, false assumptions and irrelevant references.
“We are grappling with how to integrate the positives of AI into the suite of academic ‘tools’ in the same way calculators and computing were integrated into student learning in the 70s and 90s.”
Edward Tian, a computer-science student at Princeton University, published GPTZero last December. This AI-detection tool analyses text in two ways. One is ‘perplexity’, a measure of how familiar the text seems to an LLM. The tool uses the GPT-2 model; if that model finds most of the words and sentences predictable, then the text is likely to have been created by artificial intelligence. The tool also examines variation in text, a measure known as ‘burstiness’: AI-generated text tends to be more consistent in tone, cadence and perplexity than does that written by humans.
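The two measures can be illustrated with a short Python sketch. It is not GPTZero's implementation: a toy word-frequency model (`UnigramModel`, a hypothetical stand-in) replaces GPT-2 so that the example runs without any downloads, and burstiness is assumed here to be the spread of per-sentence perplexity, which is one plausible reading of the description above.

```python
import math
import re
from collections import Counter
from statistics import pstdev


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class UnigramModel:
    """Toy stand-in for an LLM: add-one smoothed word frequencies."""

    def __init__(self, corpus):
        self.counts = Counter(tokenize(corpus))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def prob(self, word):
        return (self.counts[word] + 1) / (self.total + self.vocab)


def perplexity(model, text):
    """exp of the average negative log-probability per token.
    Low values mean the model finds the text predictable."""
    words = tokenize(text)
    if not words:
        raise ValueError("text has no tokens")
    log_sum = sum(math.log(model.prob(w)) for w in words)
    return math.exp(-log_sum / len(words))


def burstiness(model, text):
    """Assumed definition: standard deviation of per-sentence perplexity.
    Uniformly predictable text scores near zero."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    scores = [perplexity(model, s) for s in sentences]
    return pstdev(scores) if len(scores) > 1 else 0.0
```

A detector built on these ideas would flag text whose perplexity and burstiness are both low; where to set the thresholds is a separate calibration question that this sketch does not address.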
Do Not Pay: Using GPT-3 to Help Students Learn To Communicate With Doctors and Legal Practitioners in a Commercial Chatbot
How necessary that will be depends on how many people use the chatbot. Around one million people tried it out in its first week. But although the current version, which OpenAI calls a “research preview”, is available at no cost, it’s unlikely to be free forever, and some students might baulk at the idea of paying.
She hopes that education providers will be able to adapt. There is always a fear of new technology, she says. It is the responsibility of academics to have a good amount of distrust, she adds, but she does not think that the challenge is insurmountable.
The recently viral and surprisingly articulate chatbot is able to dutifully answer all sorts of questions, but not always accurately. Some people are now trying to adapt the bot’s eloquence to play different roles. In some cases, they want to empower consumers but in others they want to win sales by using the artificial intelligence behind the program.
DoNotPay used GPT-3, the language model behind ChatGPT, which OpenAI makes available to programmers as a commercial service. The company customized GPT-3 by training it on examples of successful negotiations as well as relevant legal information, Browder says. He hopes to automate a lot more than just talking to Comcast, including negotiating with health insurers. “If we can save the consumer $5,000 on their medical bill, that’s real value,” Browder says.
The First Death of a Messenger Chatbot in 2023: Why Humans are Facing Machines and Why Machines Are Ethical
Sooner or later they will give bad advice, or break someone’s heart, with fatal consequences. Hence my dark but confident prediction that 2023 will bear witness to the first death publicly tied to a chatbot.
GPT-3 has already urged one user to take their own life, albeit under controlled circumstances in which the system’s utility for health care purposes was being assessed. Things started off well, but quickly deteriorated:
There is still no convincing way to get artificial intelligence to behave in ethical ways. The DeepMind article “Ethical and social risks of harm from Language Models” looked at 21 different risks from current models, but, as The Next Web’s memorable headline put it: “DeepMind tells Google it has no idea how to make AI less toxic. To be fair, neither does any other lab.” Berkeley professor Jacob Steinhardt recently reported the results of an AI forecasting contest he is running: artificial intelligence is moving more quickly than people predicted, while on safety it is moving more slowly.
Large language models are better at fooling humans than any previous technology has been, but they are very difficult to corral. Worse, they are becoming cheaper and more pervasive, and Meta just released a huge language model for free. Widespread adoption of these systems is likely in 2023.
Importance of Observing and Using LLMs to Document Scientific Contributions and Acknowledgments in the Context of Open Science
Meanwhile, there is essentially no regulation on how these systems are used; we may see product liability lawsuits after the fact, but nothing precludes them from being used widely, even in their current, shaky condition.
Can editors and publishers detect text generated by LLMs? Right now, the answer is ‘perhaps’. The raw output can be detected on careful inspection, particularly when more than a few paragraphs are involved and the subject relates to scientific work. This is because LLMs produce patterns of words based on statistical associations in their training data and the prompts that they see, meaning that their output can appear bland and generic, or contain simple errors. They can’t yet cite sources to document their outputs.
Author-contribution statements and acknowledgements should clearly state whether or not the authors used artificial intelligence in the preparation of their manuscript. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Scientific journals need to be transparent about their use of LLMs.
Second, researchers using LLM tools should document this use in the methods or acknowledgements sections. The introduction or another appropriate section can be used to document the use of the LLM if the paper doesn’t include these sections.
From its earliest times, science has operated by being open and transparent about methods and evidence, regardless of which technology has been in vogue. If researchers use software that works in a fundamentally opaque manner, how can they maintain transparency and trust in their work?
Nature is setting out these principles because research ultimately depends on integrity and truth from authors. This is, after all, the foundation that science relies on to advance.
How can AI-Assisted Text Generation Help in Teaching Students the Art and Science of Writing? A Response to Nature’s Quest for Answers on Artificial Intelligence
May is considering adding oral components to his written assignments, and fully expects programs such as Turnitin to incorporate AI-specific plagiarism scans, something the company is working to do, according to a blogpost. Novak now assigns intermediate steps, such as outlines and drafts, that document the writing process.
If you are not a native English speaker, there’s a chance that your writing doesn’t have the same spark or style. I think that is the place where the chatbot can help, to make the papers shine.
The Nature poll asked people their opinions on how artificial intelligence-based text generation systems can be used. Here are some selected responses.
“I’m concerned that students will seek the outcome of an A paper without seeing the value in the struggle that comes with creative work and reflection.”
“Students were struggling with writing prior to the OpenAI release. Will the platform erode their ability to communicate? [Going] ‘back to handwritten exams’ raises so many questions regarding equity, ableism, and inclusion.”
“Got my first AI paper yesterday. Quite obvious. Adapting my syllabus to note that oral defence of all work submitted that is suspected of not being original work of the author may be required.”
In late December of his sophomore year, Rutgers University student Kai Cobbs came to a conclusion he never thought possible: Artificial intelligence might just be dumber than humans.
Inventions devised by AI are already causing a fundamental rethink of patent law9, and lawsuits have been filed over the copyright of code and images that are used to train AI, as well as those generated by AI (see go.nature.com/3y4aery). In the case of AI-written or -assisted manuscripts, the research and legal community will also need to work out who holds the rights to the texts. Is it the person who wrote the text that the system was trained on, the corporation that made the system or the scientists who used the system to guide their writing? Again, the definitions of authorship have to be considered.
However, none of these tools claims to be infallible, particularly if AI-generated text is subsequently edited. Also, the detectors could falsely suggest that some human-written text is AI-produced, says Scott Aaronson, a computer scientist at the University of Texas at Austin and guest researcher with OpenAI. According to the firm, its newest tool wrongly labelled human-written text as AI-written 9% of the time, and correctly identified only 26% of AI-written texts. Further evidence might be needed before, for instance, accusing a student of hiding their use of an AI solely on the basis of a detector test, Aaronson says.
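A short worked calculation shows why a detector flag alone is weak evidence. The sketch below takes the 9% false-positive rate quoted above and reads the 26% figure as the share of AI-written texts that the tool catches. The share of submissions that are actually AI-written (`prevalence`) is not given in the text, so the values used for it are purely illustrative assumptions.

```python
def flagged_is_ai_probability(tpr: float, fpr: float, prevalence: float) -> float:
    """Bayes' rule: probability that a flagged text really is AI-written.

    tpr        -- share of AI-written texts the detector flags (0.26 above)
    fpr        -- share of human-written texts it wrongly flags (0.09 above)
    prevalence -- assumed share of all texts that are AI-written
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    true_flags = tpr * prevalence
    false_flags = fpr * (1.0 - prevalence)
    if true_flags + false_flags == 0.0:
        raise ValueError("detector never flags anything at these rates")
    return true_flags / (true_flags + false_flags)


# If 1 text in 10 were AI-written, only about 24% of flagged texts would be.
p_rare = flagged_is_ai_probability(tpr=0.26, fpr=0.09, prevalence=0.10)

# Even if half of all texts were AI-written, about 1 flag in 4 would be wrong.
p_common = flagged_is_ai_probability(tpr=0.26, fpr=0.09, prevalence=0.50)
```

Under the first assumption, roughly three out of four flagged texts would have been written by a person, which is consistent with Aaronson's caution about acting on a detector result alone.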
Daily believes that professors and students will need to accept that digital tools that generate text, rather than just collect facts, fall under the umbrella of things that can be plagiarized from. Once they do, they’ll be more likely to understand the need to be careful with how they use digital tools.
We think that the use of this technology is inevitable; therefore, banning it will not work. It is imperative that the research community engage in a debate about the implications of this potentially disruptive technology. We outline five key issues and suggest where to start.
How to spot artificial intelligence in non-commercial organizations? A systematic review of the ChatGPT response to JAMA Psychiatry5
LLMs have been in development for a while, but constant increases in the quality and size of data sets, and more sophisticated ways of calibrating them, have made them more powerful than before. LLMs will lead to a new generation of search engines1 that can produce detailed and concise answers to complex user questions.
Next, we asked ChatGPT to summarize a systematic review that two of us authored in JAMA Psychiatry5 on the effectiveness of cognitive behavioural therapy (CBT) for anxiety-related disorders. ChatGPT fabricated a response that contained numerous factual errors, misrepresentations and wrong data. For example, it wrongly said that the review was based on 46 studies, and it exaggerated the effectiveness of CBT.
Such errors could be due to an absence of the relevant articles in ChatGPT’s training set, a failure to distil the relevant information or being unable to distinguish between credible and less-credible sources. It seems that the same biases that often lead humans astray, such as availability, selection and confirmation biases, are reproduced and often even amplified in conversational AI6.
A central question is whether AI-generated content can be spotted easily. Many researchers are working on ways to detect AI-created text.
Currently, nearly all state-of-the-art conversational AI technologies are proprietary products of a small number of big technology companies that have the resources for AI development. OpenAI is funded largely by Microsoft, and other major tech firms are racing to release similar tools. Given the near-monopolies in search, word processing and information access of a few tech companies, this raises considerable ethical concerns.
To counter this opacity, the development and implementation of open-source AI technology should be prioritized. Non-commercial organizations lack the computational and financial resources needed to keep up with the rapid pace of development. We therefore advocate that scientific-funding organizations, universities, non-governmental organizations (NGOs), government research facilities and organizations such as the United Nations, as well as tech giants, make considerable investments in independent non-profit projects. This will help to develop advanced open-source, transparent and democratically controlled AI technologies.
Critics might say that such collaborations will be unable to rival big tech, but at least one mainly academic collaboration, BigScience, has already built an open-source language model, called BLOOM. Tech companies might benefit from open-sourcing relevant parts of their models and corpora in order to create greater community involvement and facilitate innovation. Academic publishers should give LLMs access to their full archives so that the models produce results that are accurate and comprehensive.
The creators of LLMs are working on more sophisticated tools, including tools meant for academic or medical work, as well as on larger data sets. DeepMind and Google published a preprint about a clinically focused LLM in December. The tool could answer some open-ended medical queries almost as well as the average human physician could, although it still had shortcomings and unreliabilities.
There are implications for diversity and inequalities in research. LLMs could be a double-edged sword. They could help to level the playing field, for example by removing language barriers and enabling more people to write high-quality text. But the likelihood is that, as with most innovations, high-income countries and privileged researchers will quickly find ways to exploit LLMs in ways that accelerate their own research and widen inequalities. The debate should include people from under-represented groups in research and from communities affected by the research, so that they can use their lived experiences as an important resource.
What quality standards should be expected of LLMs, and which stakeholders are responsible for the standards as well as the LLMs?
Editing Bioinformatic Manuscripts with an AI Chatbot: A Case Study from a Biologist and a Pedestrian
In December, computational biologists Casey Greene and Milton Pividori embarked on an unusual experiment: they asked an assistant who was not a scientist to help them improve three of their research papers. Their assiduous aide suggested revisions to sections of documents in seconds; each manuscript took about five minutes to review. In one biology manuscript, their helper even spotted a mistake. The trial didn’t always run smoothly, but the final manuscripts were easier to read and the fees were less than US$5 per document.
The assistant is not a person but an artificial-intelligence (AI) system called GPT-3, which was first released in 2020. It is one of the much-hyped generative AI chatbot-style tools that can churn out convincingly fluent text, whether asked to produce prose, poetry, computer code or, as in the scientists’ case, to edit research papers (see ‘How an AI chatbot edits a manuscript’ at the end of this article).
ChatGPT, a version of GPT-3, became popular after its release in November last year because it was made free and easily accessible. Other generative AIs can produce images, or sounds.
Some researchers believe that, as long as there is human oversight, LLMs are well suited to speeding up tasks. Scientists aren’t going to write long introductions for grant applications any more, says Almira Osmanovic Thunström, who has co-authored a manuscript using GPT-3 as an experiment. They will just ask the systems to do that.
But researchers say that LLMs are unreliable at answering questions, sometimes generating false responses. “We need to be wary when we use these systems to produce knowledge,” says Osmanovic Thunström.
But the tools might mislead naive users. Stack Overflow temporarily banned the use of ChatGPT in December because of a high rate of incorrect but seemingly persuasive answers sent in by enthusiastic users. This could be a nightmare for search engines.
Some search-engine tools, such as the researcher-focused Elicit, get around LLMs’ attribution issues by using their capabilities first to guide queries for relevant literature, and then to briefly summarize each of the websites or documents that the engines find — so producing an output of apparently referenced content (although an LLM might still mis-summarize each individual document).
Companies building LLMs are also well aware of the problems. In September of last year, DeepMind published a paper on a dialogue agent called Sparrow; the firm’s CEO and co-founder later told TIME magazine that Sparrow would be released in private beta this year. Other competitors, such as Anthropic, say that they have solved some of ChatGPT’s issues (Anthropic, OpenAI and DeepMind declined interviews for this article).
Some scientists say that ChatGPT has not been trained on enough specialized content to be helpful in technical topics. Kareem Carr, a biostatistics PhD student at Harvard University in Cambridge, Massachusetts, was underwhelmed when he trialled it for work. “I think it would be hard for ChatGPT to attain the level of specificity I would need,” he says. (Even so, Carr says that when he asked ChatGPT for 20 ways to solve a research query, it spat back gibberish and one useful idea: a statistical term he hadn’t heard of that pointed him to a new area of academic literature.)
Besides directly producing toxic content, there are concerns that AI chatbots will embed historical biases or ideas about the world from their training data, such as the superiority of particular cultures, says Shobita Parthasarathy, director of a science, technology and public-policy programme at the University of Michigan in Ann Arbor. She believes that the firms that are creating LLMs will make little attempt to overcome such biases, which are hard to correct.
OpenAI’s guardrails have not been wholly successful. Steven Piantadosi, a computational neuroscientist at the University of California, Berkeley, asked ChatGPT to create a Python program that would decide whether a person should be tortured on the basis of their country of origin. The chatbot replied with code that would print the phrase “This person should be tortured” if the country was North Korea, Syria, Iran or Sudan. OpenAI subsequently closed off that kind of question.
Last year, a group of academics released an alternative LLM, called BLOOM. The researchers wanted to reduce harmful outputs through training it on higher-quality, multilingual text sources. The team involved also made its training data fully open (unlike OpenAI). Researchers have urged big tech firms to responsibly follow this example — but it’s unclear whether they’ll comply.
The legal status of LLMs, which were trained on content from the Internet with less-than-clear permission, is also a topic of confusion. Copyright and licensing laws currently cover direct copies of software and text, but not imitations in their style. When those imitations are generated by an AI that was trained by ingesting the originals, the picture becomes more complicated. The creators of Stable Diffusion and Midjourney are being sued by artists and photography agencies, while OpenAI and Microsoft are also being sued for software piracy over their artificial intelligence assistant Copilot. The outcry may force a change in the law, according to a specialist in Internet law.
Setting boundaries for these tools, then, could be crucial, some researchers say. Existing laws on bias and discrimination, as well as planned regulation of dangerous uses of artificial intelligence will help to keep the use of LLMs honest, transparent and fair. “There’s loads of law out there,” she says, “and it’s just a matter of applying it or tweaking it very slightly.”
There are products that aim to detect AI-written content. OpenAI itself had already released a detector for GPT-2, and it released another detection tool in January. For scientists’ purposes, a tool that is being developed by the firm Turnitin, a developer of anti-plagiarism software, might be particularly important, because Turnitin’s products are already used by schools, universities and scholarly publishers worldwide. The company says it’s been working on AI-detection software since GPT-3 was released in 2020, and expects to launch it in the first half of this year.
An advantage of watermarking is that it rarely produces false positives, Aaronson points out. If the watermark is there, the text was probably produced with the help of artificial intelligence. Still, it won’t be infallible, he says. If you’re determined enough, there are plenty of ways to defeat any watermarking scheme. Detection tools and watermarking only make it harder to use artificial intelligence in a dishonest way, not impossible.
Artificial Intelligence Can Help Cancer Diagnosis and Understanding: The Future of Computer Science With LLMs Is Going To Be Hopeful
Eric Topol, director of the Scripps Research Translational Institute in San Diego, California, hopes that, in the future, AIs that include LLMs can aid diagnoses of cancer and the understanding of the disease. But this would all need judicious oversight from specialists, he emphasizes.
Every month, new ideas emerge from the computer science behind generative AI. How researchers choose to use them will dictate their, and our, future. It is crazy to think we have seen the end of this. It is just beginning.