Earlier this semester, in a seminar on LLM reasoning and planning, I talked about STaR, a method that teaches LLMs to reason through finetuning. Well, that "explanation" kinda oversimplifies a lot of what's in the paper. When I first read it, I had little to no context on LLM reasoning and RL, but the background reading I had to do just to understand the authors' choices was worth it.
[Emergent abilities] shows that a sufficiently large language model trained on a variety of data has many "emergent abilities": abilities that happened to manifest after the model was scaled, like reasoning and few-shot learning. [Chain-of-thought] shows that when prompted with triplets of task, rationale and answer, the model picks up the pattern from the in-context examples and responds to the prompted question with a rationale followed by an answer. There is no weight update.
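To make that concrete, here is a minimal sketch of what such a few-shot chain-of-thought prompt could look like. The arithmetic problems and the prompt format are my own illustrations, not taken from the paper:

```python
# A minimal sketch of a chain-of-thought few-shot prompt: each in-context
# example is a (task, rationale, answer) triplet, and the model is asked to
# continue the pattern for a new question. No weights are updated; the
# "learning" is purely in-context. The example problems are illustrative.

few_shot_examples = [
    {
        "question": "A pencil costs 2 dollars and a pen costs 3 dollars. How much do 2 pencils and 1 pen cost?",
        "rationale": "2 pencils cost 2 * 2 = 4 dollars. Adding one pen gives 4 + 3 = 7 dollars.",
        "answer": "7",
    },
    {
        "question": "There are 5 birds on a wire and 2 fly away. How many remain?",
        "rationale": "5 birds minus the 2 that flew away leaves 5 - 2 = 3 birds.",
        "answer": "3",
    },
]

def build_cot_prompt(examples, new_question):
    """Concatenate (question, rationale, answer) triplets, then append the new question."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}.\n"
        )
    # The model is expected to continue with a rationale and answer of its own.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n".join(blocks)

print(build_cot_prompt(few_shot_examples, "A book costs 10 dollars. How much do 3 books cost?"))
```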
You may ask yourself: why is the model trained through bootstrapping? The model could instead be finetuned on the reasoning traces of a bigger model like GPT-4 or DeepSeek-R1, which would surely be a much simpler approach. I have two theories:
Distillation of bias:
When the student model is finetuned on the teacher model's reasoning traces, it also picks up the teacher's biases along with the useful patterns it is meant to learn.
New reasoning mechanisms:
During rationalization, the search space for the next token is much smaller than when generating rationales from scratch, because the final answer is already known. Since we don't directly grade the quality of rationales or rationalizations, the model implicitly learns to prioritize rationales that lead to correct answers, which gives it a strong training signal.
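For context, here is a rough sketch of one STaR bootstrapping iteration as I understand it. The control flow follows the paper, but `generate`, `extract_answer`, and `finetune` are hypothetical placeholders for a real sampling call, answer parsing, and a finetuning routine:

```python
# A rough sketch of one STaR bootstrapping iteration. The three functions
# below are hypothetical stand-ins, not a real API; only the control flow
# is meant to mirror the method.

def generate(model, prompt):
    """Placeholder for sampling a rationale + answer from the model."""
    raise NotImplementedError

def extract_answer(completion):
    """Placeholder for parsing the final answer out of a generated rationale."""
    raise NotImplementedError

def finetune(base_model, traces):
    """Placeholder for finetuning a model on (question, rationale) pairs."""
    raise NotImplementedError

def star_iteration(current_model, base_model, dataset, cot_prompt, hint_prompt):
    """Collect rationales that reach the correct answer, fall back to
    rationalization (the answer given as a hint) on failures, then
    finetune the original base model on the collected traces."""
    training_traces = []
    for question, gold_answer in dataset:
        # 1. Try to generate a rationale without access to the answer.
        rationale = generate(current_model, cot_prompt + question)
        if extract_answer(rationale) == gold_answer:
            training_traces.append((question, rationale))
            continue
        # 2. Rationalization: condition on the gold answer as a hint,
        #    which shrinks the search space over next tokens.
        hinted = generate(current_model, hint_prompt.format(question=question, answer=gold_answer))
        if extract_answer(hinted) == gold_answer:
            # The hint is dropped from the stored trace, so the model is
            # later trained as if it had produced the rationale unaided.
            training_traces.append((question, hinted))
    # 3. Finetune from the original pretrained checkpoint each iteration
    #    (as in the paper), only on traces that reached the correct answer;
    #    that filtering is the training signal.
    return finetune(base_model, training_traces)
```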