This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting.
**Introduction**
A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications.
Using METR's methodology on five cyber benchmarks, with tasks ranging from 0.5s to 25h in human-expert estimated time, I evaluated many state-of-the-art models released over the past five years. I found:
Below I outline the datasets, IRT-based analysis, results and caveats. [...]
Outline:
(00:20) Introduction
(01:34) Methodology
(04:07) Datasets
(11:49) Models
(13:33) Results
(18:26) Limitations
(20:47) Personal Retrospective & Next Steps
(23:08) References
First published: July 2nd, 2025
Source: https://www.lesswrong.com/posts/fjgYkTWKAXSxsxdsj/untitled-draft-zgxc
Narrated by TYPE III AUDIO.