I contributed one (1) task to HCAST, which was used in METR's Long Tasks paper. This gave me some thoughts I feel moved to share.
**Regarding Baselines and Estimates**
METR's tasks have two sources for how long they take humans: most of those used in the paper were Baselined using playtesters under persistent scrutiny, and some were Estimated by METR.
I don’t quite trust the Baselines. Baseliners were allowed/incentivized to drop tasks they weren’t making progress with, and were (mostly, effectively; there's some nuance here I’m ignoring) cut off at the eight-hour mark. Baseline times were found by averaging the time taken for successful runs. Since slow or abandoned attempts never enter that average, this suggests Baseline estimates will be biased to be at least slightly too low, especially for more difficult tasks.[1]
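To make the direction of that bias concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the lognormal distribution of completion times, the `sigma` spread, and the simplification that any run longer than the cutoff is simply dropped. It is not METR's methodology or data, and the function name `simulate_baseline` is hypothetical.

```python
import math
import random
import statistics


def simulate_baseline(true_median_hours, n_runs=10_000, cutoff_hours=8.0,
                      sigma=0.8, seed=0):
    """Compare the true mean completion time with the mean over 'successful' runs.

    Assumption: completion times are lognormal around true_median_hours, and
    any run exceeding cutoff_hours is dropped (counted as unsuccessful).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mu = math.log(true_median_hours)
    times = [rng.lognormvariate(mu, sigma) for _ in range(n_runs)]

    # Only runs finishing within the cutoff contribute to the Baseline.
    successes = [t for t in times if t <= cutoff_hours]

    true_mean = statistics.mean(times)
    baseline_mean = statistics.mean(successes)
    success_rate = len(successes) / n_runs
    return true_mean, baseline_mean, success_rate


if __name__ == "__main__":
    for median in (1.0, 6.0):  # an easier and a harder hypothetical task
        true_mean, baseline_mean, rate = simulate_baseline(median)
        print(f"median={median}h  true mean={true_mean:.2f}h  "
              f"baseline={baseline_mean:.2f}h  success rate={rate:.0%}")
```

Under these assumptions the successful-run average sits below the true mean in both cases, and the gap widens for the harder task, because more of its long runs fall past the cutoff and are excluded.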
I really, really don’t trust the Estimates[2]. My task was never successfully Baselined, so METR's main source for how long it would take – [...]
Outline:
(00:22) Regarding Baselines and Estimates
(02:23) Regarding Task Privacy
(04:00) In Conclusion
The original text contained 9 footnotes which were omitted from this narration.
First published: May 4th, 2025
---
Narrated by TYPE III AUDIO.