I think my problem is that I’m not sure I understand whether your evals are testi...

gertlabs · 2026-05-12T17:10:40 1778605840

I'm not so sure there's a difference. The main thing we want to measure for LLMs is broad reasoning capability, but seeing how that ability changes under different constraints (like programming language) is the interesting part.

stingraycharles · 2026-05-13T01:46:57 1778636817

Ok, when you put it that way, I agree that this is in fact interesting. Maybe you should put that sentence, in some form, on your website, because it clarifies a lot.

gertlabs · 2026-05-13T02:16:22 1778638582

Appreciate the feedback. We're taking a lot of good recommendations from this thread, which will be live in a few days: better explanations/tooltips, more samples for any filter sets we offer (or sample counts shown where results are potentially noisy), and some functional languages like Clojure.