A new group is trying to make data licensing ethical
The Dataset Providers Alliance wants to standardize the industry and make it fairer. It has just released a position paper outlining its stance on several major issues. The alliance's members include the music-copyright-management firm Rightsify, a Japanese stock-photo marketplace and the generative-AI startup Calliope Networks. (At least five new members will be announced in the fall.)
The lead of the Data Provenance Initiative, a volunteer collective that audits artificial-intelligence datasets, thinks the alliance's effort is admirable, but he doubts that its opt-in standard will catch on. Under such a regime, he says, you're either going to be data-deficient or you're going to pay a lot; even large tech companies might not be able to license all of that data.
Timothée Poisot, a computational ecologist at the University of Montreal in Canada, is worried about how artificial intelligence will affect the future relationship between science and policy. Chatbots such as Microsoft's Bing, Google's Gemini and ChatGPT, made by the tech firm OpenAI in San Francisco, California, were trained on corpora of data scraped from the Internet, which probably include Poisot's work. Because chatbots rarely cite original content in their outputs, authors cannot tell how their work is being used or check the credibility of the chatbots' statements. It seems likely, Poisot says, that unvetted claims produced by chatbots will make their way into consequential meetings such as the 16th Conference of the Parties (COP16) to the United Nations Convention on Biological Diversity, where they risk drowning out solid science.
International policy has not caught up with the rapid development of artificial-intelligence technology, and answers to questions such as where AI output falls under existing copyright legislation are probably years away. “We are now in this period where there are very fast technological developments, but the legislation is lagging,” says Christophe Geiger, a legal scholar at Luiss Guido Carli University in Rome. “The challenge is how we establish a legal framework that will not disincentivize progress, but still take care of our human rights.”
Tudorache sees the act as ushering in a new reality in which artificial intelligence is here to stay. “We’ve had many other industrial revolutions in the history of mankind, and they all profoundly affected different sectors of the economy and society at large, but I think none of them have had the deep transformative effect that I think AI is going to have,” he says.
Academics often sign their intellectual property over to their institutions, which leaves them with less of a say in how their data are used. But Christopher Cornelison, the director of IP development at Kennesaw State University in Georgia, says it’s worth starting a conversation with your institution or publisher if you have concerns. Those entities could be better placed than an individual to broker a licensing agreement with an AI company, or to pursue litigation when infringement seems to have occurred. “We certainly don’t want an adversarial relationship with our faculty, and the expectation is that we’re working towards a common goal,” he says.
Scientists can already detect whether visual products, such as images or graphics, have been included in a training set, and they have developed tools that can ‘poison’ data so that AI models trained on them break in unpredictable ways. One such tool, Nightshade, created by a team that includes Ben Zhao, a computer-security researcher at the University of Chicago, works by changing the individual pixels of an image so that an AI model trained on it associates the image with something other than what it actually depicts, perceiving a dog as a cat, for instance. Such poisoning, Zhao says, can teach models that a cow is something with four wheels and a nice fender. Unfortunately, there are not yet similar tools for poisoning writing.
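For readers who want a feel for how such poisoning works in principle, here is a simplified sketch in Python. It is not Nightshade's actual algorithm: the `encoder` and `target_embedding` are hypothetical stand-ins for whatever model an attacker targets, and the code merely illustrates the general idea of shifting an image's pixels, within an imperceptible budget, towards a different concept's features.

```python
# A simplified sketch of pixel-level data poisoning; this is NOT Nightshade's
# actual algorithm. The idea: nudge an image's pixels within a small budget
# (eps) so that a feature extractor embeds it near a *different* concept,
# e.g. pushing a dog photo towards 'cat' features. `encoder` and
# `target_embedding` are hypothetical stand-ins for the attacked model.
import torch

def poison_image(image, target_embedding, encoder, steps=200, eps=8 / 255, lr=0.01):
    """Return `image` (a [3, H, W] tensor in [0, 1]) perturbed towards the target concept."""
    delta = torch.zeros_like(image, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        poisoned = (image + delta).clamp(0, 1).unsqueeze(0)
        embedding = encoder(poisoned)
        # Pull the poisoned image's embedding towards the target concept.
        loss = 1 - torch.nn.functional.cosine_similarity(embedding, target_embedding).mean()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the change visually imperceptible
    return (image + delta).clamp(0, 1).detach()
```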
Specialists broadly agree that it’s nearly impossible to completely shield your data from web scrapers, the tools that extract data from the Internet. But some steps can add an extra layer of oversight, such as making resources open and available only on request, or hosting data locally on a private server. Several companies, including IBM, let customers build their own chatbots that are trained on their own data and isolated in this way.
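As a rough illustration of what hosting data on your own server makes possible, the sketch below shows a site refusing requests from known AI crawlers. GPTBot (OpenAI's crawler) and CCBot (Common Crawl's) are real user-agent strings; the Flask app and blocklist are assumptions for the example, and scrapers can spoof their user agent, so this adds oversight rather than real protection.

```python
# A minimal sketch, assuming a Flask-served data site: refuse requests whose
# User-Agent matches a known AI crawler. The route and blocklist are
# illustrative only.
from flask import Flask, abort, request

app = Flask(__name__)
BLOCKED_AGENTS = ("GPTBot", "CCBot")  # extend as new crawlers appear

@app.before_request
def refuse_ai_crawlers():
    user_agent = request.headers.get("User-Agent", "")
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        abort(403)  # forbidden: do not serve the data to these bots

@app.route("/dataset")
def dataset():
    # In a real deployment this would serve the locally hosted data.
    return "dataset available to non-blocked clients only"

if __name__ == "__main__":
    app.run()
```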
Refraining from using GenAI might feel like passing up a golden opportunity. But for certain disciplines, particularly those that involve sensitive data, such as medical diagnoses, giving it a miss could be the more ethical option. “Right now, we don’t really have a good way of making AI forget, so there are still a lot of constraints on using these models in health-care settings,” says Uri Gal, an informatician at the University of Sydney in Australia, who studies the ethics of digital technologies.
Other publishers, such as Wiley and Oxford University Press, have brokered deals with AI companies; Taylor & Francis, for example, has an agreement with Microsoft. Cambridge University Press (CUP) has not yet entered any such partnership, but it is developing policies that will offer authors an ‘opt-in’ agreement under which they will receive remuneration. The company’s managing director of academic publishing, who is based in Oxford, UK, said in a statement that the scheme would eventually cover more than 24,000 e-books and over 300 research journals.
Representatives of the publishers Springer Nature, the American Association for the Advancement of Science (which publishes the Science family of journals), PLOS and Elsevier say they have not entered such licensing agreements, although some, including those for the Science journals, Springer Nature and PLOS, noted that their journals do disclose the use of AI in editing and peer review and to check for plagiarism. (The journal is editorially independent of its publisher.)
In fields where research output is tied to professional success and prestige, losing out on credit can damage a person’s reputation. “Removing people’s names from their work can be really damaging, especially for early-career scientists or people working in places in the global south,” says Evan Spotte-Smith, a computational chemist at Carnegie Mellon University in Pittsburgh, Pennsylvania, who avoids using AI for ethical and moral reasons. Research has shown that members of groups that are marginalized in science have their work published and cited less frequently than average5, and that they have access to fewer opportunities for advancement overall. AI stands to exacerbate these challenges, Spotte-Smith says: failing to attribute someone’s work to them “creates a new form of ‘digital colonialism’, where we’re able to get access to what colleagues are producing without needing to actually engage with them”.
A spokesperson for OpenAI said the company is looking at ways to improve its opt-out process. The firm believes that artificial intelligence offers huge benefits to academia and the progress of science, the spokesperson added. “We respect that some content owners, including academics, may not want their publicly available works used to help teach our AI, which is why we offer ways for them to opt out. We’re also exploring what other tools may be useful.”
Because the technology that underpins GenAI is no longer being developed mainly by public institutions, the private companies behind it have little incentive to prioritize transparency or open access. The inner mechanics of chatbots are almost always a black box: even their creators don’t fully understand them, and it is impossible to know exactly what goes into a model’s answer to a prompt. Users are asked to make sure that outputs they reuse in other work do not violate laws, including intellectual-property and copyright protections, and do not reveal sensitive information, such as a person’s age, ethnicity or contact information. Studies have shown, however, that the tools can do both.
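As one crude example of the kind of screening a cautious user might run before reusing chatbot output, the sketch below flags obvious personal details such as e-mail addresses and phone-like numbers. The patterns and workflow are an illustration only, not a check any provider supplies, and real compliance with privacy or copyright law requires far more than pattern matching.

```python
# Illustrative only: a crude screen for obvious personal details in chatbot
# output before reusing it. Regexes catch only the most blatant cases.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii(text):
    """Return matches per category so a human can review before reuse."""
    hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

print(flag_pii("Contact Jane at jane.doe@example.com or +1 555 123 4567."))
# {'email': ['jane.doe@example.com'], 'phone': ['+1 555 123 4567']}
```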
If we start outsourcing research and synthesis to an artificial intelligence, there is no way to know who did what, or where the information came from.
Poisot has made a successful career out of studying the world’s biodiversity. A guiding principle of his research is that it must be useful, as he hopes it will be later this year, when it joins other work being considered at COP16 in Cali, Colombia. “Every piece of science we produce that is looked at by policymakers and stakeholders is both exciting and a little terrifying, since there are real stakes to it,” he says.