Scientists are looking at how to use a blockbuster model of artificial intelligence
Enhancing trust in artificial intelligence: reconciling synthetic with real-world data, and the implications for tabulated data
The best-known LLMs are pre-trained on hundreds of billions of examples of real data, such as text and images, which enables them to answer user queries with a degree of reliability. But what happens when relevant real-world data are scarce? Can AI still provide reliable answers when trained on much smaller data sets? That is the key question for researchers who use AI to make predictions from tabulated data, for which nowhere close to the required quantity of training data exists. The Nature study suggests an answer: AI models can be trained on randomly generated data that mimic the statistical properties of real-world data.
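The idea of a randomly generated training corpus can be sketched in a few lines. The toy generator below is illustrative only, not the authors' actual procedure: it draws each synthetic data set from a randomly parameterized linear "mechanism", so every data set has the rough shape of a real classification table (numeric feature columns plus a label column) without containing any real-world data.

```python
import math
import random

def sample_synthetic_dataset(n_rows=50, n_features=4, rng=random):
    """Draw one random tabular data set from a randomly parameterized
    generator. This is a stand-in for the far richer generators the
    researchers sample from; the details here are purely illustrative."""
    # Random weights define this data set's hidden "mechanism".
    weights = [rng.gauss(0, 1) for _ in range(n_features)]
    X, y = [], []
    for _ in range(n_rows):
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        score = sum(w * v for w, v in zip(weights, row))
        # A nonlinearity plus a threshold yields a binary label,
        # mimicking classification tasks found in real spreadsheets.
        y.append(1 if math.tanh(score) > 0 else 0)
        X.append(row)
    return X, y

# A tiny synthetic training corpus; the real model saw 100 million sets.
corpus = [sample_synthetic_dataset() for _ in range(100)]
```

Because each data set comes from a different random mechanism, a model trained across many of them must learn general statistical patterns rather than memorizing any single table.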
Hollman and colleagues’ work is an example of necessity spurring innovation: the researchers realized that there were not enough accessible real-world data sets to train their model, and so they found an alternative approach.
Enhancing trust in artificial intelligence and minimizing its harms must remain a priority, even though that effort has been deprioritized by Trump. The president has rescinded an executive order from his predecessor that called on the US National Institute of Standards and Technology (NIST) and AI companies to collaborate on improving both trust in and the safety of AI, including in the use of synthetic data. Trump's new executive order, titled 'Removing barriers to US leadership in artificial intelligence', does not use the word safety. NIST published a report on methods for authenticating and tracking AI in November of last year. Researchers need to keep these efforts going and not let them go to waste.
Synthetic data carry risks, such as producing inaccurate results or 'hallucinations'. It is therefore important that studies using them are replicated. Replication, a cornerstone of science, also reassures users that they can trust the results of their queries.
This advance is the work of computer scientists Noah Hollman, Samuel Müller and Frank Hutter at the University of Freiburg, Germany, and their colleagues. Their model, called TabPFN, is designed to analyse tabulated data, such as those found in spreadsheets. Typically, a user creates a spreadsheet by populating rows and columns with data, then uses mathematical models to make inferences or projections from those data. TabPFN can make predictions on any small data set, from those used in accounting and finance to those from genomics and neuroscience. Moreover, the model's predictions are accurate even though it is trained entirely without real-world data: instead, it is trained on 100 million randomly generated data sets.
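What makes this style of model convenient for spreadsheet users is the interface: the model receives a small labelled table as context and predicts labels for new rows in a single call, with no per-data-set training loop. The pure-Python stand-in below sketches only that interface; the real TabPFN is a pre-trained transformer, whereas this function is just a distance-weighted vote over the context rows.

```python
def predict_in_context(train_rows, train_labels, query_rows):
    """Predict a label for each query row directly from a small
    labelled context table (an illustrative stand-in, not TabPFN)."""
    predictions = []
    for q in query_rows:
        # Weight each context row's vote by inverse squared distance.
        votes = {}
        for row, label in zip(train_rows, train_labels):
            dist2 = sum((a - b) ** 2 for a, b in zip(row, q))
            votes[label] = votes.get(label, 0.0) + 1.0 / (dist2 + 1e-9)
        predictions.append(max(votes, key=votes.get))
    return predictions

# Example: a toy spreadsheet with two numeric columns and a class label.
train_rows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_labels = ["low", "low", "high", "high"]
preds = predict_in_context(train_rows, train_labels,
                           [[0.15, 0.15], [0.85, 0.85]])
# preds == ["low", "high"]: each query lands in the nearer cluster.
```

The appeal, in either the toy or the real model, is that a user supplies the labelled table and the new rows and gets predictions back in one step.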
Scientists are flocking to DeepSeek-R1, a cheap and powerful artificial intelligence (AI) ‘reasoning’ model that sent the US stock market spiralling after it was released by a Chinese firm last week.
January is not yet over, and 2025 is already proving to be a defining year for artificial intelligence (AI). On 21 January, just one day into his presidency, US President Donald Trump announced the Stargate Project, a joint venture between leading technology companies and financiers in the United States, Japan and the United Arab Emirates, which pledged a staggering US$500 billion to develop AI infrastructure in the United States.
Cong Lu, a researcher at the University of British Columbia, says that scores of researchers have been investigating how to train their own reasoning models based on R1 since its launch. That is backed up by data from Hugging Face, an open-science repository for AI that hosts the DeepSeek-R1 code: in the week since its launch, the site logged more than three million downloads of different versions of R1, including those already built on by independent users.
The low cost and excellent performance of DeepSeek-R1 will encourage more scientists to use it in their research without worrying about the price, according to Huan Sun, an AI researcher at the Ohio State University in Columbus, who adds that almost all of his colleagues are talking about it.
Much of the excitement over R1 is because it has been released as ‘open-weight’, meaning the learned connections between different parts of its algorithm are available to build on. Scientists who download R1, or one of the much smaller ‘distilled’ versions also released by DeepSeek, can improve its performance in their field through additional training, known as fine tuning. Given a suitable data set, researchers could train the model to improve at coding tasks specific to the scientific process, says Sun.
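The principle behind fine tuning can be illustrated with a deliberately tiny example. R1 is a large transformer, not the one-parameter linear model below, but the mechanism is the same: start from released ("pre-trained") weights and continue gradient updates on a small domain-specific data set.

```python
def fine_tune(weight, bias, data, lr=0.1, epochs=200):
    """Continue gradient descent on squared error for y ~ weight * x + bias,
    starting from the given (pre-trained) parameters. Illustrative only."""
    for _ in range(epochs):
        for x, y in data:
            err = (weight * x + bias) - y
            weight -= lr * err * x   # gradient step on the weight
            bias -= lr * err         # gradient step on the bias
    return weight, bias

# "Pre-trained" parameters, standing in for downloaded open weights.
w0, b0 = 1.0, 0.0

# A small field-specific data set whose targets follow y = 2x + 1.
domain_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (-1.0, -1.0)]

# After fine tuning, the parameters adapt to the new domain (w ~ 2, b ~ 1).
w, b = fine_tune(w0, b0, domain_data)
```

Open-weight releases matter precisely because this resumption of training is only possible when the learned parameters, not just an API, are published.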
R1 is also showing promise in mathematics. Frieder Simon, a mathematician at the University of Oxford, UK, challenged both R1 and OpenAI's o1 to create a proof in the field of functional analysis, and found R1's argument more promising than o1's. But because such models make mistakes, he says, researchers need the skills to tell a good proof from a bad one.