  1. Top OSWorld score in 2025? | Manifold

    Resolved MKT. Background OSWorld is a benchmark for evaluating multimodal AI agents on real-world computer tasks in open-ended environments. It tests an AI's ability to navigate operating systems, …

  2. Humanity’s Last Exam lists grok 4 at 45%+? | Manifold

    Grok 4's score is now up at 25.4%, but I'd suggest waiting to see if they release Grok 4-heavy or Grok 4 (heavy or not) with reasoning capabilities before resolving. They released Grok 4, Grok 4-Heavy (and …

  3. Highest Epoch-acknowledged FrontierMath score at EOY2026?

    While OpenAI has claimed that o3-mini achieved 32% on FrontierMath, I don't really believe them, plus they used an ungodly amount of compute. When judging how much progress has been made on …

  4. Will 10+ AI models get released in March? | Manifold

    Mar 8, 2026 · Resolved MKT. Based on prior similar markets for model releases, the following clarifications have been added: GPT-5.4-Codex is sufficient for GPT-5.4 (as it counted in Feb AI …

  5. A top-three AI lab delays a frontier model release six months for ...

    A top-three AI lab delays a frontier model release six months for safety reasons?

  6. Outcomes of Trump's Strait of Hormuz ultimatum | Manifold

    Iran closes and restricts usage of the Strait of Hormuz and the U.S. either makes a deal or decides to pull back.

  7. Will any AI model score >80% on Epoch's Frontier Math Benchmark in …

    Resolved NO. Background The FrontierMath benchmark, created by Epoch AI, is designed to test AI models' mathematical reasoning capabilities. As of December 2024, OpenAI's o3 reasoning model …

  8. Will an OpenAI model design an improved version of an existing drug …

    Jan 1, 2026 · Resolved 50%. Would-be strawberry man Riley Coyote commented on Sam Altman's post, asking "can I tell them about the thing?" Sam replied "which thing?" (possibly implying that Sam …

  9. OpenAI releases a new flagship model by November 15, 2025?

    Resolved NO. This market resolves to YES if OpenAI publicly announces and releases a new flagship large language model (successor to GPT-4o or equivalent) by November 15, 2025. Recent context: …

  10. Will Claude Opus be ranked in the top 20 on the Chatbot Arena ...

    Hey @VerySeriousPoster -- this market just closed! The original Claude 3 Opus from March 2024 is nowhere near the top 20 on Chatbot Arena anymore. Current top spots are Claude Opus 4.6, Gemini …

  11. Claude Sonnet 5 released this week? | Manifold

    Resolved NO. If whether Sonnet 5 was released is ambiguous (various valid definitions yield a different resolution decision), uninvolved moderators will be asked to resolve the market based on …