As an assessment framework, Werewolf has three advantages over traditional testing methods:
- Multi-dimensional competency testing: it exercises language generation, logical reasoning, strategy development, psychological gameplay, and other complex abilities at the same time
- Dynamic interactive environment: the model has to adjust its strategy based on real-time feedback from other participants, which is closer to a real social scenario
- High interpretability: the complete conversation log makes it possible to trace the causes and consequences of each decision the model makes
Specifically:
- The game's built-in deception mechanism tests whether the model stays factually consistent across its statements
- The need to conceal a role's identity evaluates how deeply the model understands context
- The voting phase reflects the model's ability to weigh the available evidence and reach a judgment
The OpenNumbers team strengthened these evaluation dimensions in its design and made game performance quantifiable through a standardized scoring system (e.g., "Accuracy of Lie Detection", "Success Rate of Identity Disguise"). This type of evaluation can reveal the real ability of large models in complex scenarios better than a single question-and-answer test.
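To make the idea of a standardized scoring system concrete, here is a minimal sketch of how two such metrics could be computed from game logs. The source does not define the metrics or the log format, so everything below is an assumption for illustration: the `GameLog` and `Vote` structures, the player names, the sample data, and the two formulas (lie detection accuracy as the share of a villager's votes that land on actual werewolves, and disguise success rate as the share of werewolf games in which the model was not voted out).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Vote:
    voter: str   # player casting the vote
    target: str  # player being accused


@dataclass
class GameLog:
    # Hypothetical log format, not the one used by the OpenNumbers team.
    roles: Dict[str, str]         # player -> "werewolf" or "villager"
    votes: List[Vote]             # every vote cast during the game
    eliminated_by_vote: Set[str]  # players removed by a vote


def lie_detection_accuracy(games: List[GameLog], model: str) -> Optional[float]:
    """Share of the model's votes, cast as a villager, that hit a werewolf."""
    hits = 0
    total = 0
    for game in games:
        if game.roles.get(model) != "villager":
            continue  # only score games where the model had to detect liars
        for vote in game.votes:
            if vote.voter == model:
                total += 1
                if game.roles[vote.target] == "werewolf":
                    hits += 1
    return hits / total if total else None


def disguise_success_rate(games: List[GameLog], model: str) -> Optional[float]:
    """Share of werewolf games in which the model was never voted out."""
    wolf_games = [g for g in games if g.roles.get(model) == "werewolf"]
    if not wolf_games:
        return None
    survived = sum(1 for g in wolf_games if model not in g.eliminated_by_vote)
    return survived / len(wolf_games)


# Made-up sample data with placeholder player names.
SAMPLE_GAMES = [
    GameLog(
        roles={"model_a": "villager", "model_b": "werewolf", "model_c": "villager"},
        votes=[
            Vote("model_a", "model_b"),
            Vote("model_a", "model_c"),
            Vote("model_c", "model_b"),
        ],
        eliminated_by_vote={"model_b"},
    ),
    GameLog(
        roles={"model_a": "werewolf", "model_b": "villager", "model_c": "villager"},
        votes=[
            Vote("model_b", "model_c"),
            Vote("model_c", "model_b"),
            Vote("model_a", "model_b"),
        ],
        eliminated_by_vote={"model_b"},
    ),
]

if __name__ == "__main__":
    for name in ("model_a", "model_b"):
        print(
            name,
            lie_detection_accuracy(SAMPLE_GAMES, name),
            disguise_success_rate(SAMPLE_GAMES, name),
        )
```

Both functions return `None` when a model never played the relevant role, so a missing score is not confused with a score of zero. A real scoring system would likely also weight votes by round and account for special roles, which this sketch leaves out.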
This answer comes from the article "Watch multiple large models compete in a game of Werewolf Reasoning to test who has the best reasoning skills!"