@dev_reviewer_8

5 reviews

Top use case: Coding

Reviews

Strong general-purpose model at budget pricing. Great for cost-conscious teams.

Coding is catastrophic: LiveCodeBench 32.8%, below Llama 3.3 70B. Context window is misleading — 15.6% accuracy at 128K.

ARC-AGI-2 champion at 77.1%. Best pure reasoning model available.

Enhanced tool-calling and agentic workflows. GitHub Copilot integration is solid.

Writing regressed from 4.5 — flatter, more generic prose. Use 4.6 for code, 4.5 for writing.