Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

1ByteDance Seed
2Fudan University 3Institute for AI Industry Research (AIR), Tsinghua University
4Nanjing University 5Shanghai Jiao Tong University
6SIA-Lab of Tsinghua AIR and ByteDance Seed
*Project Lead; †Equal Contribution

Introduction

We introduce Enigmata, the first comprehensive suite tailored to equipping LLMs with puzzle reasoning skills; it integrates seamlessly with reinforcement learning using verifiable, rule-based rewards (RLVR).

Enigmata-Data includes 36 tasks across 7 categories, each with: 1) a generator that produces unlimited examples with controllable difficulty, and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark for assessing puzzle reasoning abilities and guiding research on generalizable reasoning models.

Qwen2.5-32B-Enigmata, trained with RLVR, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such as Enigmata-Eval, ARC-AGI, and ARC-AGI 2. It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multitasking trade-off.

When applied to larger models such as Seed1.5-Thinking (20B activated parameters, 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME, and GPQA (Diamond), demonstrating the generalization benefits of Enigmata.

We hope Enigmata serves as a solid foundation for the community to push forward research on reasoning models!

🧩 Enigmata-Data: Synthetic Verifiable Puzzle Generation

Enigmata-Data comprises 36 distinct task types spanning 7 categories of logical reasoning puzzles. Each task is built with two core components: (1) Generator: Produces massive puzzle instances with precisely controllable difficulty parameters; (2) Verifier: Provides automatic, rule-based solution validation for reliable evaluation.
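The repository defines its own interfaces for these components, but as a minimal sketch of the generator-verifier contract, a toy task might look like the following (class and method names are illustrative, not Enigmata's actual API):

    import random
    from dataclasses import dataclass

    @dataclass
    class PuzzleInstance:
        prompt: str      # natural-language puzzle statement shown to the LLM
        answer: str      # ground-truth answer, used only by the verifier
        difficulty: str  # "easy" | "medium" | "hard"

    class SumPuzzleTask:
        """Toy arithmetic task illustrating the generator/verifier contract."""

        def generate(self, difficulty: str, seed=None) -> PuzzleInstance:
            # Difficulty knob: harder instances use larger operands.
            rng = random.Random(seed)
            hi = {"easy": 99, "medium": 9_999, "hard": 999_999}[difficulty]
            a, b = rng.randint(1, hi), rng.randint(1, hi)
            return PuzzleInstance(
                prompt=f"Compute {a} + {b}. Answer with the number only.",
                answer=str(a + b),
                difficulty=difficulty,
            )

        def verify(self, instance: PuzzleInstance, model_output: str) -> bool:
            # Rule-based check: exact match on the stripped final answer.
            return model_output.strip() == instance.answer

A verifier that returns a clean pass/fail verdict is what lets the generated data plug directly into RLVR: the verdict itself can serve as the reward.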

1. Unlimited Self-Verifying Data: Each task generator can produce an unlimited supply of self-verifying puzzle prompts, which plug seamlessly into the RLVR framework and support long chain-of-thought training.

2. Controlled Difficulty: Programmatic difficulty control allows researchers to mix puzzles in desired difficulty ratios and to conduct fine-grained experiments on how curriculum design influences reinforcement learning (see the sampling sketch below).

3. Flexible Task Sampling: Generators can emit arbitrary sample counts per task, enabling studies of task balancing and cross-task generalization.

Details of 36 tasks in Enigmata
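As a concrete (hypothetical) illustration of points 2 and 3 above, a training batch can be drawn from several tasks with a chosen easy/medium/hard mix; the helper below assumes the toy task interface sketched earlier, and the ratio and counts are experiment knobs rather than values from the paper:

    import random

    def sample_training_prompts(tasks, n, difficulty_ratio=(0.3, 0.5, 0.2), seed=0):
        """Draw n puzzle instances across tasks with a fixed difficulty mix."""
        rng = random.Random(seed)
        levels = ["easy", "medium", "hard"]
        instances = []
        for _ in range(n):
            task = rng.choice(tasks)  # uniform over tasks; reweight to study task balancing
            level = rng.choices(levels, weights=difficulty_ratio, k=1)[0]
            instances.append(task.generate(level, seed=rng.randrange(2**31)))
        return instances

    # e.g. prompts = sample_training_prompts([SumPuzzleTask()], n=1024)

Shifting difficulty_ratio over the course of training is one simple way to run the curriculum experiments mentioned above.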

⚖️ Enigmata-Eval: Evaluating Logical Reasoning Capabilities

Enigmata-Eval is a comprehensive benchmark containing 4,758 puzzle instances across Easy, Medium, and Hard difficulty levels. Each task provides 50 instances per difficulty level where possible, with strict train-eval separation to prevent data leakage.

📥 Download Enigmata-Eval: HuggingFace Dataset
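If you use the Hugging Face datasets library, the benchmark can be loaded in a few lines; the dataset ID and split below are placeholders, so check the dataset card linked above for the exact values:

    from datasets import load_dataset

    # Replace the ID and split with the ones listed on the HuggingFace page above.
    eval_set = load_dataset("<org>/Enigmata-Eval", split="train")

    print(len(eval_set))  # the benchmark contains 4,758 instances in total
    print(eval_set[0])    # one puzzle with its task and difficulty metadata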

🤖 Enigmata-Model: The Training Recipe

Our training methodology follows a two-stage process designed to systematically build reasoning abilities: (1) rejection fine-tuning to establish foundational reasoning patterns, and (2) multi-task RL to develop general reasoning skills that transfer across diverse problem domains.

1. Rejection Fine-tuning: This initial stage builds foundational reasoning by fine-tuning the model on high-quality solutions drawn from a balanced mix of math and puzzle problems, including ARC-AGI.

2. RL with Verifiable Puzzles: The model then undergoes reinforcement learning with VC-PPO, where an automated puzzle verifier provides immediate rewards, enabling a fully automatic RL pipeline for puzzle reasoning (see the reward sketch below).

3. Multi-task Training: To develop general and transferable logical reasoning, training uses multi-task methods such as Mix-training RL and Multi-stage RL, combining diverse puzzle types (Enigmata, ARC-AGI) with challenging mathematical problems (AIME) while maintaining a balanced ratio.

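Step 2 relies on the reward coming straight from the rule-based verifier rather than from a learned reward model. A minimal verifiable-reward function in that spirit could look like the sketch below; the <answer> tag format and the binary reward are illustrative assumptions, not the paper's exact recipe:

    import re

    def puzzle_reward(task, instance, model_output: str) -> float:
        """Binary verifiable reward for RLVR-style training.

        `task` and `instance` follow the generator/verifier sketch above;
        the answer-extraction format is an illustrative assumption.
        """
        # Assume the model is prompted to wrap its final answer in <answer>...</answer>.
        match = re.search(r"<answer>(.*?)</answer>", model_output, flags=re.S)
        if match is None:
            return 0.0  # malformed output earns no reward
        return 1.0 if task.verify(instance, match.group(1).strip()) else 0.0

Because this reward is computed programmatically, the RL loop (VC-PPO in our setup) runs without human labels or a separate reward model.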
👀 Experimental Results

Our model, specifically the 32B parameter version, significantly outperforms most public models on Enigmata-Eval and ARC-AGI, showcasing enhanced general logical reasoning. This success stems from effective rejection fine-tuning (RFT) and multi-task RL strategies, which improve generalization while preserving existing math reasoning abilities.

Performance of reasoning, generic, and our trained LLMs on reasoning benchmarks

On Enigmata-Eval, our Qwen2.5-32B-Enigmata model excels in Crypto, Arithmetic, and Logic tasks, indicating strong rule-based reasoning. It also performs competitively in search tasks, though spatial and sequential categories remain challenging.

Performance of reasoning LLMs, generic LLMs, and our trained LLMs on Enigmata-Eval

🌟 Generalization with Scaling: Free Lunch from Enigmata

Incorporating the Enigmata-Data synthetic puzzle dataset into the training of large-scale models such as Seed1.5-Thinking surprisingly improves performance on challenging benchmarks like AIME and GPQA Diamond. This demonstrates an unexpected generalization benefit for advanced reasoning models.

Results on benchmarks for general reasoning capabilities

📝 Citation

If you find this work useful, please cite our paper:


    @article{2025enigmata,
      title={Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles},
      author={Jiangjie Chen and Qianyu He and Siyu Yuan and Aili Chen and Zhicheng Cai and Weinan Dai and Hongli Yu and Qiying Yu and Xuefeng Li and Jiaze Chen and Hao Zhou and Mingxuan Wang},
      journal={arXiv preprint arXiv:2505.19914},
      year={2025}
    }