A new paper suggests that we may run out of high-quality data on which to train artificial intelligence systems within the next few years, possibly by 2026. In fact, there are allegations that Google has already trained its system on output generated by ChatGPT, though it denies those allegations. Artificial intelligence systems depend on a ton of quality information to keep up their gains. What happens if that information goes away?
Nothing good, likely. First, it is possible that these systems are already reaching the limit of the mass-training-data approach to improvement. ChatGPT may already be getting worse in some respects, though that may be an artifact of better ways of testing its abilities rather than of its performance actually degrading. It is extremely unlikely that these companies will give up on the idea of mass training data as the road to perfecting AI, so we can expect them to train their systems on output from poor sources, including AI systems themselves. Given that these systems already lie, that drop in data quality is likely to hold back their ability to improve. Garbage in, garbage out applies all the more to systems that depend on masses of data to determine what should come next in a sequence. If such systems are fed sequences that are, well, garbage, then they are not likely to get better at predicting similar sequences.
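To make that concrete, here is a deliberately tiny sketch, with made-up data and a bigram counter standing in for anything like a real language model: a sequence predictor retrained on its own output tends to lose variety with each generation.

```python
# Toy sketch of "garbage in, garbage out" for a sequence predictor.
# A tiny bigram model is trained on a clean sentence, then retrained
# repeatedly on its own samples. The vocabulary it can still produce
# tends to shrink generation after generation. Illustrative only;
# this is not how any production system is actually trained.
import random
from collections import Counter, defaultdict

def train_bigram(words):
    """Count which word follows which."""
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def sample(model, start, length, rng):
    """Generate a sequence by sampling from the learned bigram counts."""
    out = [start]
    while len(out) < length:
        followers = model.get(out[-1])
        if not followers:  # dead end: nothing was ever seen after this word
            break
        words, counts = zip(*followers.items())
        out.append(rng.choices(words, weights=counts, k=1)[0])
    return out

rng = random.Random(0)
clean_text = ("the cat sat on the mat and the dog lay by the door while "
              "the bird sang in the tree above the quiet garden").split()

data = clean_text
for generation in range(6):
    model = train_bigram(data)
    vocabulary = set(model) | {w for followers in model.values() for w in followers}
    print(f"generation {generation}: {len(vocabulary)} distinct words left")
    # The next generation trains only on what the previous model produced.
    data = sample(model, clean_text[0], len(clean_text), rng)
```

Real models are vastly more capable than a bigram counter, but the worry is the same: a predictor trained largely on its own (or another model's) output has fewer and fewer genuinely new patterns to learn from.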
Remember, the quality of the data really matters. Cancer-detecting systems, for example, were thrown off by the presence of rulers in the training data because the images of known tumors all had rulers in them. They were not tumor detection systems; they were ruler detection systems. The quality of data is paramount.
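A toy version of that failure, with invented numbers and a deliberately naive one-feature "classifier" (nothing here comes from the actual study): a cue that only accidentally tracks the label can look perfect during training and fall apart the moment the accident goes away.

```python
# Toy sketch of the ruler problem: in training, an irrelevant feature
# (a ruler in the photo) happens to track the label perfectly, so a naive
# learner latches onto it. The data and the one-feature "classifier" here
# are invented purely for illustration.
import random

rng = random.Random(1)

def make_example(malignant, ruler_tracks_label):
    """Each example is (tumor_signal, ruler_present, label)."""
    tumor_signal = int(rng.random() < (0.8 if malignant else 0.2))
    ruler_present = int(malignant) if ruler_tracks_label else int(rng.random() < 0.5)
    return (tumor_signal, ruler_present, int(malignant))

# Training photos of known tumors all include a ruler; new photos do not.
train = [make_example(i % 2 == 0, ruler_tracks_label=True) for i in range(1000)]
test = [make_example(i % 2 == 0, ruler_tracks_label=False) for i in range(1000)]

def accuracy(data, feature_index):
    """Accuracy of a 'classifier' that simply copies one feature as its prediction."""
    return sum(1 for ex in data if ex[feature_index] == ex[2]) / len(data)

# "Training" is just picking whichever single feature scores best on the training set.
chosen = max([0, 1], key=lambda i: accuracy(train, i))
print("picked feature:", ["tumor_signal", "ruler_present"][chosen])
print(f"training accuracy: {accuracy(train, chosen):.2f}")
print(f"test accuracy:     {accuracy(test, chosen):.2f}")
```

The ruler scores perfectly on the training set, so it gets picked, and the resulting "detector" is no better than a coin flip once the rulers are gone.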
Poor data means a decrease in the rate of improvement. That in turn likely means more misinformation, more hallucinations, more need for human oversight, and less functionality in these products. Despite the hype, we could be closer to the upper limit of the abilities of these systems than most people think. The gravy train may be closer to the station, and that station might be a small terminal on the outskirts of town at midnight, not Grand Central at rush hour.
On the one hand, this may mean that fewer jobs are lost than we think. On the other hand, it likely means that the jobs these systems do infiltrate become even worse. The people who decide how and when to use these systems are much more likely to force humans to wade through the machines’ mistakes than to admit that the systems don’t in fact save time and effort. It will still likely be cheaper to pay people to clean up after the machines as “unskilled” labor than to pay “skilled” labor to do the job right from the start.
From a societal point of view, this could be a real loss. These systems, if managed properly and focused on machine usefulness instead of pure automation, can actually bring value. They can help make human beings more effective in important areas such as health care, basic science, and rote work. The fact that current methods may have such a low ceiling for improvement means that many of the expected improvements will not happen and that artificial intelligence will need a new route to success.
But maybe that is not such a bad thing. The current brute-force approach is bad for the environment and, by its nature, impractical for anyone who doesn’t have a ton of computing power and thus a ton of capital. Maybe that paradigm having a lower-than-expected ceiling is, in the medium to long term, better for us all. If nothing else, artificial intelligence may not be a solved problem. And that is always an exciting place to live.