LLMs Cannot Find Reasoning Errors, but They Can Correct Them!

31 May 2024

Authors:

(1) Gladys Tyen, University of Cambridge, Dept. of Computer Science & Technology, ALTA Institute (work done during an internship at Google Research; e-mail: gladys.tyen@cl.cam.ac.uk);

(2) Hassan Mansoor, Google Research (e-mail: hassan@google.com);

(3) Victor Carbune, Google Research (e-mail: vcarbune@google.com);

(4) Peter Chen, Google Research, equal leadership contribution (e-mail: chenfeif@google.com);

(5) Tony Mak, Google Research, equal leadership contribution (e-mail: tonymak@google.com).

Table of Links

Abstract and Introduction

Conclusion, Limitations, and References

A. Implementational details

B. Annotation

C. Benchmark scores

Abstract

While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performance overall (Huang et al., 2023). In this paper, we break down the self-correction process into two core components: mistake finding and output correction. For mistake finding, we release BIG-Bench Mistake, a dataset of logical mistakes in Chain-of-Thought reasoning traces.

We provide benchmark numbers for several state-of-the-art LLMs, and demonstrate that LLMs generally struggle with finding logical mistakes. For output correction, we propose a backtracking method which provides large improvements when given information on mistake location. We construe backtracking as a lightweight alternative to reinforcement learning methods, and show that it remains effective with a reward model at 60-70% accuracy.

1 Introduction

Large Language Models (LLMs) have dominated the field of NLP in recent years, achieving state-of-the-art performance in a large variety of applications. In particular, LLMs have demonstrated the ability to solve tasks with zero- or few-shot prompting, giving rise to prompting methods such as Chain-of-Thought (CoT) (Wei et al., 2022), Self-Consistency (SC) (Wang et al., 2023), and ReAct (Yao et al., 2022).

Recent literature on few- or zero-shot prompting has focused on the concept of self-correction, i.e. having an LLM correct its own outputs (Shinn et al., 2023; Miao et al., 2023; Madaan et al., 2023; Chen et al., 2023; Saunders et al., 2022). (See Pan et al. (2023) for a review of the literature.)

However, Huang et al. (2023) note that while self-correction may prove effective for improving model outputs in terms of style and quality, there is limited evidence that LLMs can identify and fix their own reasoning and logical errors without external feedback. For example, Reflexion (Shinn et al., 2023) and RCI (Kim et al., 2023) both use ground truth correctness as a signal to halt the self-correction loop. This shortcoming was initially observed by Madaan et al. (2023) on a math reasoning dataset, and Huang et al. (2023) further demonstrate it on two additional datasets.

While previous work typically presents self-correction as a single process, we divide it into mistake finding and output correction.

Mistake finding is a fundamental reasoning skill that has been studied and utilised extensively in philosophy, psychology, and mathematics, spawning concepts such as critical thinking, and logical and mathematical fallacies. One might expect that the ability to find mistakes should also be an important requirement for LLMs. However, our results show that state-of-the-art LLMs currently cannot find mistakes reliably.

Output correction involves partially or completely changing previously generated outputs. In the context of self-correction, this is typically done with outputs generated by the same model (see Pan et al. (2023) for an overview of different strategies). Despite LLMs’ inability to find mistakes, our results show that they can correct outputs using our backtracking method, if given information about the mistakes, for example via a small, supervised reward model.
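To make the idea of correcting an output from a known mistake location concrete, the following is a minimal sketch of one plausible reading of backtracking. It is not the paper's exact procedure, which this section does not specify. The names `backtrack`, `propose`, and `continue_trace`, and the toy stand-ins for the model, are all hypothetical: in practice `propose` and `continue_trace` would be calls to an LLM.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces standing in for LLM calls (assumptions, not from the paper).
# propose(question, previous_steps) -> candidate next steps
ProposeFn = Callable[[str, List[str]], List[str]]
# continue_trace(question, previous_steps) -> the remaining steps up to the answer
ContinueFn = Callable[[str, List[str]], List[str]]


def backtrack(
    question: str,
    steps: List[str],
    mistake_index: Optional[int],
    propose: ProposeFn,
    continue_trace: ContinueFn,
) -> List[str]:
    """Return a revised trace, or a copy of the original if nothing is flagged."""
    if mistake_index is None:
        return list(steps)  # traces judged correct are left untouched
    if not 0 <= mistake_index < len(steps):
        raise ValueError("mistake_index is out of range")

    prefix = list(steps[:mistake_index])  # steps before the flagged mistake are kept
    bad_step = steps[mistake_index]

    # Pick the first candidate that differs from the flagged step.
    replacement = next(
        (c for c in propose(question, prefix) if c != bad_step), None
    )
    if replacement is None:
        return list(steps)  # no alternative found; fall back to the original

    new_prefix = prefix + [replacement]
    # Regenerate everything after the replaced step.
    return new_prefix + continue_trace(question, new_prefix)


# Toy stand-ins for the model, for illustration only.
def toy_propose(question: str, prefix: List[str]) -> List[str]:
    return ["2 + 3 = 6", "2 + 3 = 5"]


def toy_continue(question: str, prefix: List[str]) -> List[str]:
    return ["So the answer is " + prefix[-1].split("= ")[1]]


original = ["We need 2 + 3.", "2 + 3 = 6", "So the answer is 6"]
revised = backtrack("What is 2 + 3?", original, 1, toy_propose, toy_continue)
print(revised)
```

Steps before the flagged location are preserved and a trace with no flagged mistake is returned unchanged, which is one way a correction method can leave originally correct outputs largely unaffected.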

Our contributions for this paper are as follows:

1. With Chain-of-Thought prompting, any task can be turned into a mistake-finding task. We collect and release to the research community BIG-Bench Mistake, a dataset of CoT-style traces generated using PaLM 2, and annotated according to where the first logical mistake is. To our knowledge, BIG-Bench Mistake is the first dataset of its kind that goes beyond problems in mathematics.

2. We produce benchmark results for our dataset to test the reasoning capabilities of state-of-the-art LLMs. We demonstrate that current state-of-the-art LLMs struggle with mistake finding, even for objective, unambiguous cases. We hypothesise that this is a main contributing factor to LLMs’ inability to self-correct reasoning errors, and we call on the research community to pursue further improvements on the mistake finding task.

3. We propose backtracking as an output correction technique that makes use of mistake location information to improve performance on the original task. We demonstrate that this method corrects outputs that are originally incorrect, with minimal effect on outputs that are originally correct.

4. We construe backtracking as a form of “verbal reinforcement learning” (Shinn et al., 2023), allowing iterative improvement on CoT outputs without requiring any weight updates. We propose that backtracking can be used with a trained classifier as a reward model, and demonstrate the effectiveness of backtracking at various reward model accuracies.
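As an illustration of the mistake-finding task described in the first two contributions, the sketch below shows how a trace annotated with the location of its first mistake might be represented and how a mistake finder could be scored against it. The class `AnnotatedTrace`, its field names, the 0-based indexing, and the toy data are all assumptions made for this example; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class AnnotatedTrace:
    """A CoT trace with its annotated first mistake (hypothetical schema)."""
    question: str
    steps: List[str]
    # 0-based index of the first mistaken step, or None if the trace has no mistake.
    first_mistake: Optional[int]


# A mistake finder maps (question, steps) to a predicted location or None.
MistakeFinder = Callable[[str, List[str]], Optional[int]]


def mistake_finding_accuracy(
    traces: Sequence[AnnotatedTrace], finder: MistakeFinder
) -> float:
    """Fraction of traces where the predicted location matches the annotation."""
    if not traces:
        raise ValueError("need at least one trace")
    hits = sum(
        1 for t in traces if finder(t.question, t.steps) == t.first_mistake
    )
    return hits / len(traces)


# A trivial baseline that never reports a mistake (stands in for an LLM call).
def always_no_mistake(question: str, steps: List[str]) -> Optional[int]:
    return None


# Toy data invented for illustration; not taken from the dataset.
toy_traces = [
    AnnotatedTrace("What is 2 + 3?", ["2 + 3 = 6", "The answer is 6"], 0),
    AnnotatedTrace("What is 2 + 3?", ["2 + 3 = 5", "The answer is 5"], None),
    AnnotatedTrace("What is 4 - 1?", ["4 - 1 = 3", "The answer is 4"], 1),
    AnnotatedTrace("What is 4 - 1?", ["4 - 1 = 3", "The answer is 3"], None),
]
accuracy = mistake_finding_accuracy(toy_traces, always_no_mistake)
print(accuracy)
```

Scoring by exact match on the first mistake location is only one possible metric; a finder that flags the right trace but the wrong step would count as a miss here.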

This paper is available on arXiv under a CC 4.0 license.