“We have basically exhausted the accumulated sum of human knowledge… in training AI.” That is what the magnate Elon Musk said on Wednesday night in a public conversation on X with Mark Penn, president of the marketing company Stagwell. This depletion of resources “basically happened last year,” Musk said, suggesting that the nascent artificial intelligence sector will have to find a way past the limit it has just reached.
Musk, who is also the founder of xAI, the company behind the Grok chatbot recently added to the social network, pointed to artificial data as the way to improve models at a faster rate. “The only way to complement [real-world data] is with synthetic data, where AI creates [training data],” the businessman explained.
“With synthetic data… [the AI] will grade itself and go through this self-learning process,” Musk said. The case for artificial data nevertheless draws considerable criticism because of the effect this type of data may have on the final quality of AI responses.
Musk is not the only one to have raised this problem. A month ago, OpenAI co-founder Ilya Sutskever warned that the Internet sources of information used to train AI were running out, and that this will force the industry to change the way it develops artificial intelligence.
Beyond data available on the Internet and artificially generated data, another option would be to capture data in real time using IoT devices.
AI companies will also be able to keep collecting new data generated on the platforms with which they have signed collaboration agreements, such as media outlets, social networks, or forums like Reddit. However, new data is not generated fast enough, so AI algorithms will need to learn more deeply from the data already available. The goal, in other words, is to make them smarter.
Other large companies such as Microsoft, OpenAI and Meta are already using artificial data to train their models. The consulting firm Gartner estimated that 60% of the data used for artificial intelligence and analytics projects in 2024 would be synthetically generated.
The synthetic data problem
Various experts criticize the use of synthetic data because it could create a kind of feedback loop that prevents AI from learning anything genuinely new, leaving it less creative and more biased. This is known as “model collapse”: the point at which an AI model deteriorates because of the poor quality of its information sources.
Some effects of this phenomenon, according to IBM, are worse decision-making, users losing interest as answers become more limited, and a narrower body of knowledge skewed by particular political leanings.
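The feedback loop behind model collapse can be illustrated with a toy simulation. The sketch below is only an assumption-laden analogy, not how any of the companies mentioned train their systems: the “model” is simply a mean and a spread estimated from a handful of numbers, and each generation is trained exclusively on samples produced by the previous one. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy feedback loop: each generation learns only from the previous one's output.

    Returns the spread (standard deviation) of the data at every generation.
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spread_history = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" a model: estimate the mean and spread of the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # Generate synthetic data from that model; it becomes the next training set.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        spread_history.append(statistics.pstdev(data))
    return spread_history


if __name__ == "__main__":
    history = simulate_collapse()
    print(f"spread of the original data: {history[0]:.4f}")
    print(f"spread after {len(history) - 1} generations: {history[-1]:.3e}")
```

Because every generation sees only a small sample of what the previous one produced, estimation errors compound and the variety in the data tends to shrink toward zero. Real language models are vastly more complex, but the critics' concern is the same in spirit: training on one's own output can gradually erase the diversity of the original data.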