Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

1ByteDance Seed
2Fudan University 3Institute for AI Industry Research (AIR), Tsinghua University
4Nanjing University 5Shanghai Jiao Tong University
6SIA-Lab of Tsinghua AIR and ByteDance Seed
*Project Lead; †Equal Contribution

Introduction

We introduce Enigmata, the first comprehensive suite tailored to equipping LLMs with puzzle reasoning skills; it integrates seamlessly with reinforcement learning using verifiable, rule-based rewards (RLVR).

Enigmata-Data includes 36 tasks across 7 categories, each with: 1) a generator that produces unlimited examples with controllable difficulty, and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark for assessing puzzle reasoning abilities and guiding research on generalizable reasoning models.

Qwen2.5-32B-Enigmata, trained with RLVR, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such as Enigmata-Eval, ARC-AGI, and ARC-AGI 2. It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multitasking trade-off.

When applied to larger models such as Seed1.5-Thinking (20B activated parameters, 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME, and GPQA (Diamond), demonstrating the generalization benefits of Enigmata.

We hope Enigmata serves as a solid foundation for the community to push forward research on reasoning models!

🧩 Enigmata-Data: Synthetic Verifiable Puzzle Generation

Enigmata-Data comprises 36 distinct task types spanning 7 categories of logical reasoning puzzles. Each task is built with two core components: (1) Generator: Produces massive puzzle instances with precisely controllable difficulty parameters; (2) Verifier: Provides automatic, rule-based solution validation for reliable evaluation.
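The repository defines its own interfaces for these components, but as a minimal sketch of the generator-verifier contract, a toy task might look like the following (class and method names are illustrative, not Enigmata's actual API):

    import random
    from dataclasses import dataclass

    @dataclass
    class PuzzleInstance:
        prompt: str      # natural-language puzzle statement shown to the LLM
        answer: str      # ground-truth answer, used only by the verifier
        difficulty: str  # "easy" | "medium" | "hard"

    class SumPuzzleTask:
        """Toy arithmetic task illustrating the generator/verifier contract."""

        def generate(self, difficulty: str, seed=None) -> PuzzleInstance:
            # Difficulty knob: harder instances use larger operands.
            rng = random.Random(seed)
            hi = {"easy": 99, "medium": 9_999, "hard": 999_999}[difficulty]
            a, b = rng.randint(1, hi), rng.randint(1, hi)
            return PuzzleInstance(
                prompt=f"Compute {a} + {b}. Answer with the number only.",
                answer=str(a + b),
                difficulty=difficulty,
            )

        def verify(self, instance: PuzzleInstance, model_output: str) -> bool:
            # Rule-based check: exact match on the stripped final answer.
            return model_output.strip() == instance.answer

A verifier that returns a clean pass/fail verdict is what lets the generated data plug directly into RLVR: the verdict itself can serve as the reward.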

1. Unlimited Self-Verifying Data: Each task generator can produce an unlimited supply of self-verifying puzzle prompts, which plug seamlessly into the RLVR framework and support long chain-of-thought training.

2. Controlled Difficulty: Programmatic difficulty control allows researchers to mix puzzles in desired difficulty ratios and to conduct fine-grained experiments on how curriculum design influences reinforcement learning (see the sampling sketch below).

3. Flexible Task Sampling: Generators can emit arbitrary sample counts per task, enabling studies of task balancing and cross-task generalization.

Details of 36 tasks in Enigmata
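As a concrete (hypothetical) illustration of points 2 and 3 above, a training batch can be drawn from several tasks with a chosen easy/medium/hard mix; the helper below assumes the toy task interface sketched earlier, and the ratio and counts are experiment knobs rather than values from the paper:

    import random

    def sample_training_prompts(tasks, n, difficulty_ratio=(0.3, 0.5, 0.2), seed=0):
        """Draw n puzzle instances across tasks with a fixed difficulty mix."""
        rng = random.Random(seed)
        levels = ["easy", "medium", "hard"]
        instances = []
        for _ in range(n):
            task = rng.choice(tasks)  # uniform over tasks; reweight to study task balancing
            level = rng.choices(levels, weights=difficulty_ratio, k=1)[0]
            instances.append(task.generate(level, seed=rng.randrange(2**31)))
        return instances

    # e.g. prompts = sample_training_prompts([SumPuzzleTask()], n=1024)

Shifting difficulty_ratio over the course of training is one simple way to run the curriculum experiments mentioned above.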

⚖️ Enigmata-Eval: Evaluating Logical Reasoning Capabilities

Enigmata-Eval is a comprehensive benchmark containing 4,758 puzzle instances across Easy, Medium, and Hard difficulty levels. Each task provides 50 instances per difficulty level where possible, with strict train-eval separation to prevent data leakage.

📥 Download Enigmata-Eval: HuggingFace Dataset
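If you use the Hugging Face datasets library, the benchmark can be loaded in a few lines; the dataset ID and split below are placeholders, so check the dataset card linked above for the exact values:

    from datasets import load_dataset

    # Replace the ID and split with the ones listed on the HuggingFace page above.
    eval_set = load_dataset("<org>/Enigmata-Eval", split="train")

    print(len(eval_set))  # the benchmark contains 4,758 instances in total
    print(eval_set[0])    # one puzzle with its task and difficulty metadata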

🤖 Enigmata-Model: The Training Recipe

Our training methodology follows a two-stage process designed to systematically build reasoning abilities: (1) rejection fine-tuning to establish foundational reasoning patterns, and (2) multi-task RL to develop general reasoning skills that transfer across diverse problem domains.

1. Rejection Fine-tuning: This initial stage builds foundational reasoning by fine-tuning the model on high-quality solutions drawn from a balanced mix of math and puzzle problems, including ARC-AGI.

2. RL with Verifiable Puzzles: The model then undergoes reinforcement learning with VC-PPO, where an automated puzzle verifier provides immediate rewards, enabling a fully automatic RL pipeline for puzzle reasoning (see the reward sketch below).

3. Multi-task Training: To develop general and transferable logical reasoning, training uses multi-task methods such as Mix-training RL and Multi-stage RL, combining diverse puzzle types (Enigmata, ARC-AGI) with challenging mathematical problems (AIME) while maintaining a balanced ratio.

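Step 2 relies on the reward coming straight from the rule-based verifier rather than from a learned reward model. A minimal verifiable-reward function in that spirit could look like the sketch below; the <answer> tag format and the binary reward are illustrative assumptions, not the paper's exact recipe:

    import re

    def puzzle_reward(task, instance, model_output: str) -> float:
        """Binary verifiable reward for RLVR-style training.

        `task` and `instance` follow the generator/verifier sketch above;
        the answer-extraction format is an illustrative assumption.
        """
        # Assume the model is prompted to wrap its final answer in <answer>...</answer>.
        match = re.search(r"<answer>(.*?)</answer>", model_output, flags=re.S)
        if match is None:
            return 0.0  # malformed output earns no reward
        return 1.0 if task.verify(instance, match.group(1).strip()) else 0.0

Because this reward is computed programmatically, the RL loop (VC-PPO in our setup) runs without human labels or a separate reward model.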
👀 Experimental Results

Our model, specifically the 32B parameter version, significantly outperforms most public models on Enigmata-Eval and ARC-AGI, showcasing enhanced general logical reasoning. This success stems from effective rejection fine-tuning (RFT) and multi-task RL strategies, which improve generalization while preserving existing math reasoning abilities.

Performance of reasoning, generic, and our trained LLMs on reasoning benchmarks

On Enigmata-Eval, our Qwen2.5-32B-Enigmata model excels in Crypto, Arithmetic, and Logic tasks, indicating strong rule-based reasoning. It also performs competitively in search tasks, though spatial and sequential categories remain challenging.

Performance of reasoning LLMs, generic LLMs, and our trained LLMs on Enigmata-Eval

🌟 Generalization with Scaling: Free Lunch from Enigmata

Incorporating the Enigmata-Data synthetic puzzle dataset into the training of large-scale models such as Seed1.5-Thinking surprisingly improves performance on challenging benchmarks like AIME and GPQA Diamond. This demonstrates an unexpected generalization benefit for advanced reasoning models.

Results on benchmarks for general reasoning capabilities

📝 Citation

If you find this work useful, please cite our paper:


    @article{2025enigmata,
      title={Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles},
      author={Jiangjie Chen and Qianyu He and Siyu Yuan and Aili Chen and Zhicheng Cai and Weinan Dai and Hongli Yu and Qiying Yu and Xuefeng Li and Jiaze Chen and Hao Zhou and Mingxuan Wang},
      journal={arXiv preprint arXiv:2505.19914},
      year={2025}
    }