“Introducing the WeirdML Benchmark” by Håvard Tveit Ihle
21:31
2025/1/16
LessWrong (30+ Karma)
Chapters
Introduction to the WeirdML Benchmark
What Are the Results of the WeirdML Benchmark?
How Was the Evaluation Setup Conducted?
What Is the System Architecture Behind the WeirdML Benchmark?
What Are the Tasks in the WeirdML Benchmark?
Shapes (Easy): How Do LLMs Perform on Simple Shape Recognition?
Shapes (Hard): Can LLMs Handle Complex Shape Recognition?
Image Patch Shuffling (Easy): How Do LLMs Handle Simple Image Patch Tasks?
Image Patch Shuffling (Hard): Can LLMs Tackle Complex Image Patch Tasks?
Chess Game Outcome Prediction: How Do LLMs Predict Chess Game Outcomes?
Unsupervised Digit Recognition: Can LLMs Recognize Digits Without Supervision?
Further Analysis and Insights
What Is the Failure Rate of LLMs in the WeirdML Benchmark?
How Does Model Performance Improve Over Iterations?
What Is the Maximum of k First Submissions (max@k)?
What Are the Future Directions for the WeirdML Benchmark?