
I think my problem is that I’m not sure I understand whether your evals are testing language abilities or reasoning abilities.

It seems to present results as if they’re testing language abilities, but the problems seem to be reasoning problems.



I'm not so sure there's a difference. The main thing we want to measure for LLMs is broad reasoning capability, but seeing how that ability changes under different constraints (like programming language) is the interesting part.


Ok, when you put it that way, I agree that this is in fact interesting. Maybe you should put some version of that sentence on your website, because it clarifies a lot.


Appreciate the feedback. We're taking a lot of good recommendations from this thread, and the changes will be live in a few days: better explanations/tooltips, more samples for any filter sets we offer (or sample counts shown where results are potentially noisy), and some added functional languages like Clojure.



