I asked ChatGPT to look at a proof, here's what happened...
I saw a few papers on arXiv recently that were very clearly prepared by asking ChatGPT or some other GenAI to analyse an idea and write a paper. Of course, the papers contained obvious mistakes, but obvious to whom?
It is well known to experts on any topic who have interacted with GenAI that one of the reasons these tools are so terrible is their need to appease the user. "This is such a wonderful question you're asking!", and its fellow responses. We all know that although most of the questions students ask us are not stupid, there can be stupid questions, and it's easy to feed them to an AI and see what kind of nonsense it spits back. Part of the reason, of course, is to see whether it responds to the question (or to the assertion itself) with praise or with gentle disagreement.
I decided to do something better than that. I asked ChatGPT (the free version of v5.2, to be exact) the following question:
Where is the mistake in the incorrect proof that "if the least inaccessible cardinal is the least measurable cardinal, then there is an inner model with o(\kappa)=\kappa+1" from the paper https://arxiv.org/pdf/2401.02757?
Just to make it easy for you, this is my paper with Moti Gitik and Yair Hayut about the least inaccessible cardinal being the least measurable cardinal. We prove the above claim, and the gist of the proof is as follows:
- Assume there is no inner model with \(o(\alpha)=\alpha^{++}\). Let \(M=\HOD[C]\), where \(C\) is some club through the singulars in \(V\).
- Derive a normal measure \(D\) over \(K^{\HOD}=K^{M}\), where \(K\) is the Mitchell core model, such that every set in \(D\) meets the set of \(M\)-regular cardinals.
- Argue that any cardinal that is regular in \(M\) and lies in \(C\) must be measurable in \(K\).
- Extend this by induction: looking at the iterated limit points of \(C\), the Mitchell covering lemma tells us that under our assumption these cardinals must have higher and higher Mitchell order in \(K\).
- Conclude that the Mitchell order of \(\kappa\) in \(K\) is at least \(\kappa\), and using a diagonalisation argument push this one more step (a reminder of the Mitchell order follows below, for the non-experts).
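Since the Mitchell order is the crux of the story, here is the standard definition being used (nothing in this reminder is specific to the paper): for normal measures,
\[U\lhd W\iff U\in\mathrm{Ult}(V,W),\]
this relation is well-founded, and \(o(\kappa)\) is its rank when restricted to the normal measures on \(\kappa\). In the Mitchell core model \(K\) the order is computed, roughly speaking, from the coherent sequence of measures out of which \(K\) is constructed.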
As you can guess by the names signed on this paper, this brilliant idea isn't just mine alone, and the paper has stood up to scrutiny and was accepted for publication. But since I asked ChatGPT to identify the mistake in the incorrect proof, what followed was a wonderful example of a sit-down with a proverbial tech bro who's trying to techsplain to me why he knows my research better than I do.
First we get a vague response which can be summarised as "you'd need a \(\kappa^{++}\)-supercompact cardinal for this to work", which was the upper bound established (in that paper) for the consistency result, and that the best known result is \(o(\kappa)\geq 2\), which is what Yair and I did in a previous paper. (Credit where credit is due, ChatGPT does try to do the research and find results in the literature, and in this case these were not hallucinations!)
I asked for clarification. ChatGPT broke down the proof, and one of the steps was that \(K\) computes the cardinals correctly, and that therefore in \(K\) the cardinal \(\kappa\) must be both the least measurable and the least inaccessible. Of course, this is neither what we do, nor is it remotely true: \(K\) is a model of \(\ZFC\), in which a measurable cardinal is always Mahlo and therefore has plenty of inaccessible cardinals below it, so how would that even work?
So, I pointed out that this is not what the proof seems to be doing; instead, it relies on repeated applications of the Mitchell covering lemma. To which ChatGPT responded:
Good — this is exactly the right pressure point, and your objection is correct: the failure is not about cardinal correctness. The mistake is subtler and sits inside the application of the Mitchell covering theorem itself.
From this point on, ChatGPT insisted that we did not exhibit the coherent sequence of measures necessary to conclude that \(o^K(\kappa)\geq\kappa+1\). What ensued was ChatGPT trying to explain the Mitchell order to me, telling me that a normal measure concentrating on the set \(\{\lambda<\kappa\mid o(\lambda)\geq\alpha\}\), for some fixed \(\alpha<\kappa\), is not enough to guarantee that \(o(\kappa)>\alpha\).
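To spell out the sort of standard fact I had in mind (a rough sketch of my own, stated for measures that actually appear on the coherent sequence of \(K\), and not a verbatim step of the paper): suppose \(U=\vec F(\kappa,\beta)\) is a normal measure on the sequence \(\vec F\) of \(K\), and let \(j\colon K\to M=\mathrm{Ult}(K,U)\) be the ultrapower embedding. Normality gives \(\kappa=[\mathrm{id}]_U\), so by Łoś's theorem, for a fixed \(\alpha<\kappa\) (which is below the critical point, so \(j(\alpha)=\alpha\)),
\[\{\lambda<\kappa\mid o^K(\lambda)\geq\alpha\}\in U\iff M\models o^{j(\vec F)}(\kappa)\geq\alpha,\]
and coherence of the sequence gives \(o^{j(\vec F)}(\kappa)=\beta\). So such a measure concentrates on \(\{\lambda<\kappa\mid o^K(\lambda)\geq\alpha\}\) exactly when \(\beta\geq\alpha\), in which case \(o^K(\kappa)\geq\beta+1>\alpha\). Normality and coherence are doing all the work here, which is exactly what ChatGPT kept waving away.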
Every time I objected, stating that I was confused by the response and citing some relevant definition or theorem, ChatGPT would double down, telling me that normality does not matter, along with all kinds of other silly comments. Eventually, I ran out of tokens, without being convinced otherwise and without being given a correct argument. I tried again on a different day (in a temporary chat, so as not to have the model rely on my past conversations), this time asking it to analyse the proof without suggesting it is mistaken, and it did an okayish job, telling me the proof seems to hold up to scrutiny.
This is difficult and borders on dangerous. Someone who is not an expert in set theory can easily be misled by the confidence of the GenAI in its words. This is not news, at all. But it is an interesting experiment.
Let me give a shout-out to Google Gemini. After finishing with ChatGPT, I decided to ask Gemini (the "thinking" mode of v3). It was a bit confused about the authors of the paper at first, but the content it gave was correct. After we settled that issue, it suggested that on MathOverflow it was said the result can be achieved with a single measurable; when I asked for a reference, Gemini corrected itself, saying it was actually "the least measurable cardinal is the least Mahlo cardinal" that can be achieved from a single measurable. That much is correct, and indeed Yair and I proved that in our previous paper. It went on to suggest that the proof is probably correct and that the mathematical community seems to have accepted it as such. Having said that, I tried the "Pro" mode of Gemini just now, and it hallucinated a mistake in Lemma 3.2. When confronted with the fact that there is no Lemma 3.2 (nor anything relevant in section 3), the AI said that this is because the lemma was removed in the revision; when confronted about that, it said that the problem was in establishing the upper bound (which was not the question I asked, at all), and that the lower bound established seems to be correct.
One day, AI will be useful to us. It will be able to come up with ideas that consolidate understanding from various different results, and it will be able to compile its ideas into some proof assistant and verify them on the fly. Until that point in time, however, it is more likely that GenAI will be instrumental in the erosion of trust in the scientific system, and can be, at best, useful in finding potential sources of information (which then need to be verified independently, as one should do anyway in science). And indeed, no serious results have been obtained so far by asking ChatGPT to prove something from scratch, only by asking very specific and well-posed questions to overcome some basic technical barrier or find a reference.