Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words
Updated Mar 22, 2025 · Python
UnrealMCP is here! An Unreal Engine plugin for LLM/GenAI models and an MCP UE5 server. It supports automatic blueprint and scene generation from the Claude Desktop App and Cursor. It currently includes OpenAI's GPT-4o/GPT-4o-mini, DeepSeek R1, and Claude Sonnet 3.7 APIs for Unreal Engine 5.1 or higher, with plans to add Gemini, Grok 3, and realtime APIs soon.
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
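The collision rule described above can be sketched in a few lines. This is a minimal illustration of the scoring logic only, not code from the benchmark itself; the function name and data layout are hypothetical.

```python
from collections import Counter

def resolve_round(moves):
    """Apply the step-race collision rule: a player whose secretly
    chosen step count (1, 3, or 5) is unique this round advances by
    that amount; players who picked the same number advance 0 steps.

    `moves` maps player name -> chosen step count. Hypothetical
    helper, not taken from the benchmark's repository.
    """
    counts = Counter(moves.values())
    return {player: (step if counts[step] == 1 else 0)
            for player, step in moves.items()}

# Example round: B and C collide on 3, so only A advances.
print(resolve_round({"A": 5, "B": 3, "C": 3}))
# → {'A': 5, 'B': 0, 'C': 0}
```

In a full game, each round's public conversation happens before the secret picks, and the advances returned here would accumulate until some player crosses the finish line.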
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.
An agentic abstraction layer for building high-precision vertical AI agents, written in Python. Middleware for the Model Context Protocol.
This repository hosts code samples, benchmarks, and experiments exploring the capabilities of Large Language Models (LLMs) like ChatGPT, Claude, DeepSeek, Grok, and more, spanning AI-driven coding, gaming, creativity, and education. Fork, explore, and contribute! 🚀
"Type or Die" – A weekend challenge built with Cursor and Claude Sonnet 3.7, where speed and accuracy are your only survival tools! 🚀🔥