Alibaba's Qwen with Questions reasoning model beats o1-preview



Chinese e-commerce giant Alibaba has released the latest model in its ever-expanding Qwen family. Known as Qwen with Questions (QwQ), it is the latest open-source competitor to OpenAI’s o1 reasoning model.

Like other large reasoning models (LRMs), QwQ uses extra compute cycles during inference to review its answers and correct its mistakes, making it more suitable for tasks that require logical reasoning and planning, such as math and coding.

What is Qwen with Questions (QwQ), and can it be used for commercial purposes?

Alibaba has released a 32-billion-parameter version of QwQ with a 32,000-token context. The model is currently in preview, which means a higher-performing version is likely to follow.

According to Alibaba’s tests, QwQ beats o1-preview on the AIME and MATH benchmarks, which evaluate mathematical problem-solving abilities. It also outperforms o1-mini on GPQA, a benchmark for scientific reasoning. QwQ is inferior to o1 on the LiveCodeBench coding benchmarks but still outperforms other frontier models such as GPT-4o and Claude 3.5 Sonnet.

Figure: Example output of Qwen with Questions

QwQ does not come with an accompanying paper that describes the data or the process used to train the model, which makes it difficult to reproduce the model’s results. However, because the model is open, its “thinking process” is not hidden, unlike that of OpenAI’s o1, and can be used to make sense of how the model reasons when solving problems.

Alibaba has also released the model under an Apache 2.0 license, which means it can be used for commercial purposes.

‘We discovered something profound’

According to a blog post that was published along with the model’s release, “Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun… This process of careful reflection and self-questioning leads to remarkable breakthroughs in solving complex problems.”

This is very similar to what we know about how reasoning models work. By generating more tokens and reviewing their previous responses, the models are more likely to correct potential mistakes. Marco-o1, another reasoning model recently released by Alibaba, might also contain hints of how QwQ works. Marco-o1 uses Monte Carlo Tree Search (MCTS) and self-reflection at inference time to create different branches of reasoning and choose the best answers. The model was trained on a mixture of chain-of-thought (CoT) examples and synthetic data generated with MCTS algorithms.
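
To give a rough sense of why spending more compute at inference time can help, here is a toy Python sketch. It is not how QwQ or Marco-o1 works (Alibaba has not published QwQ's method, and Marco-o1 uses MCTS rather than voting). Instead it shows a simpler, well-known idea, self-consistency-style majority voting: sample several independent reasoning attempts and keep the most common answer. The `noisy_solver` function, its 30% error rate, and the multiplication task are all hypothetical stand-ins for a model's sampled reasoning chains.

```python
import random
from collections import Counter


def noisy_solver(a: int, b: int, rng: random.Random, error_rate: float = 0.3) -> int:
    """Hypothetical stand-in for one sampled reasoning chain.

    Returns a * b, but with probability `error_rate` the result is off by one,
    mimicking an occasional slip in the model's reasoning.
    """
    answer = a * b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer


def pick_majority(answers: list[int]) -> int:
    """Return the most frequent answer among the sampled attempts."""
    return Counter(answers).most_common(1)[0][0]


def solve_with_voting(a: int, b: int, num_samples: int, seed: int = 0) -> int:
    """Sample `num_samples` independent attempts and keep the majority answer."""
    rng = random.Random(seed)
    answers = [noisy_solver(a, b, rng) for _ in range(num_samples)]
    return pick_majority(answers)


if __name__ == "__main__":
    # More samples cost more compute but make the final answer more reliable.
    for n in (1, 5, 101):
        print(n, solve_with_voting(12, 13, num_samples=n))
```

A single attempt is wrong about 30% of the time in this toy setup, while the vote over many attempts is almost always right, because the errors scatter across different wrong answers and the correct one accumulates. Real reasoning models trade compute for accuracy in more sophisticated ways, but the underlying bargain is similar.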

Alibaba points out that QwQ still has limitations such as mixing languages or getting stuck in circular reasoning loops. The model is available for download on Hugging Face and an online demo can be found on Hugging Face Spaces.

The LLM age gives way to LRMs: Large Reasoning Models

The release of o1 has triggered growing interest in creating LRMs, even though not much is known about how the model works under the hood, aside from its use of inference-time scaling to improve its responses.

There are now several Chinese competitors to o1. Chinese AI lab DeepSeek recently released R1-Lite-Preview, its o1 competitor, which is currently only available through the company’s online chat interface. R1-Lite-Preview reportedly beats o1 on several key benchmarks.

Another recently released model is LLaVA-o1, developed by researchers from multiple universities in China, which brings the inference-time reasoning paradigm to open-source vision language models (VLMs). 

The focus on LRMs comes at a time of uncertainty about the future of model scaling laws. Reports indicate that AI labs such as OpenAI, Google DeepMind, and Anthropic are getting diminishing returns on training larger models. And creating larger volumes of quality training data is becoming increasingly difficult as models are already being trained on trillions of tokens gathered from the internet. 

Meanwhile, inference-time scaling offers an alternative that might provide the next breakthrough in improving the abilities of the next generation of AI models. There are reports that OpenAI is using o1 to generate synthetic reasoning data to train the next generation of its LLMs. The release of open reasoning models is likely to stimulate progress and make the space more competitive.


