A new AI coding challenge just published its first results, and they aren’t pretty

A new AI coding challenge has revealed its first winner and set a new bar for AI-powered software engineers.

On Wednesday at 5 p.m., the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win itself was his final score: he answered just 7.5% of the questions on the test correctly.

“We’re glad we built a benchmark that’s actually hard,” said Konwinski. “Benchmarks have to be hard if they’re going to matter,” he continued, adding, “Scores would be different if the big labs had entered with their biggest models. But that’s kind of the point. The K Prize runs offline with limited compute, so it favors smaller and open models. I love that. It levels the playing field.”

Konwinski has promised $1 million to the first open-source model that can score higher than 90% on the test.

Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub to gauge how well they can handle real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12. The K Prize organizers then built the test using only GitHub issues flagged after that date.
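To make the timed-entry idea concrete, here is a minimal sketch of how such date-based filtering could work. This is not the K Prize’s actual tooling; it simply queries GitHub’s public issue-search API, and the repository, label, and deadline year shown are illustrative assumptions.

```python
# A minimal sketch of date-based issue filtering to avoid benchmark
# contamination. NOT the K Prize's real pipeline; it only illustrates
# the idea: admit only GitHub issues created after a submission deadline.
import requests

SUBMISSION_DEADLINE = "2025-03-12"  # the article says March 12; the year is assumed

def fresh_issues(repo: str, label: str = "bug", per_page: int = 50) -> list[dict]:
    """Return issues in `repo` created strictly after the deadline."""
    query = f"repo:{repo} is:issue label:{label} created:>{SUBMISSION_DEADLINE}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

if __name__ == "__main__":
    # Example: list post-deadline bug reports from an arbitrary repo.
    for issue in fresh_issues("psf/requests"):
        print(issue["number"], issue["title"])
```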

The 7.5% top score stands in stark contrast to SWE-Bench itself, which currently shows a 75% top score on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski still isn’t sure whether the gap is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch, “because we expect people to adapt to the dynamics of competing on this every few months.”

It may seem like an odd area for AI to fall short, given the wide range of AI coding tools already publicly available. But with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”

For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”
