Morning Singularity Digest - 2026-06-18

Estimated total read • ~32 min

Skim fast, dive deep only where it matters.


Front Page

~9 min

karpathy/autoresearch: AI agents running research on single-GPU nanochat training automatically

Signal 10.0 Novelty 5.1 Impact 7.8 Confidence 7.0 Actionability 6.5

Summary: AI agents autonomously run research experiments on a single-GPU nanochat training setup. The README frames this with a fictional retrospective in which frontier AI research "used to be done by meat computers" between eating, sleeping, and having other fun.

  • What happened: karpathy/autoresearch gives an AI agent a small but real LLM training setup (single-GPU nanochat) and lets it experiment autonomously overnight.
  • Why it matters: The agent modifies the code, trains for 5 minutes, checks whether the result improved, keeps or discards the change, and repeats.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Rather than editing the training code directly, you program the program.md Markdown files that provide context to the AI agents and set up your autonomous research org.

What's new

AI agents run research on single-GPU nanochat training automatically. The README opens with a fictional retrospective: "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ri..."

Key details

  • In the README's fictional framing, research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies.
  • In that story, the agents claim that we are now in the 10,205th generation of the code base; no one could tell whether that is right or wrong, because the "code" is now a self-modifying binary that has grown beyond human comprehension.
  • This repo is the story of how it all began.
  • The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight.

Results & evidence

  • The agent modifies the code, trains for 5 minutes, checks whether the result improved, keeps or discards the change, and repeats.
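
The keep-or-discard loop described for this repo can be sketched as a simple greedy search. This is a toy illustration, not the repo's code: `propose_change` and `train_and_evaluate` are hypothetical stand-ins (a random perturbation of one hyperparameter and a synthetic loss), and the 5-minute training budget is replaced by an instant function call.

```python
import random


def train_and_evaluate(config):
    """Hypothetical stand-in for a fixed-budget training run; returns a loss (lower is better)."""
    # Synthetic loss with its minimum at lr = 0.01, for illustration only.
    return (config["lr"] - 0.01) ** 2


def propose_change(config, rng):
    """Hypothetical stand-in for the agent's code edit: perturb one hyperparameter."""
    candidate = dict(config)
    candidate["lr"] = max(1e-5, config["lr"] * rng.uniform(0.5, 1.5))
    return candidate


def research_loop(config, steps, seed=0):
    """Greedy loop: propose a change, evaluate it, keep it only if the loss improved."""
    rng = random.Random(seed)
    best_loss = train_and_evaluate(config)
    history = [best_loss]
    for _ in range(steps):
        candidate = propose_change(config, rng)
        loss = train_and_evaluate(candidate)
        if loss < best_loss:
            # Keep the change: the candidate becomes the new baseline.
            config, best_loss = candidate, loss
        # Otherwise the candidate is discarded and the baseline is unchanged.
        history.append(best_loss)
    return config, best_loss, history


if __name__ == "__main__":
    final_config, final_loss, _ = research_loop({"lr": 0.1}, steps=50)
    print(final_config, final_loss)
```

Because a change is kept only when the metric improves, the recorded best loss never gets worse. The trade-off is that a purely greedy loop can stall in a local optimum.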

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Panniantong/Agent-Reach: Give your AI agent eyes to see the entire internet. Read & search Twitter, Reddit, YouTube, GitHub, Bilibili, XiaoHongShu — one CLI, zero API fees.

Signal 10.0 Novelty 5.1 Impact 7.3 Confidence 7.0 Actionability 6.5

Summary: Give your AI agent eyes to see the entire internet.

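  • What happened: Agent-Reach puts read and search access to Twitter, Reddit, YouTube, GitHub, Bilibili, and XiaoHongShu behind one CLI.
  • Why it matters: It relies on cookie-authenticated tools such as twitter-cli instead of paid APIs, so there are zero API fees.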
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Give your AI agent eyes to see the entire internet.

What's new

Enable exec first: openclaw config set tools.profile "coding" (or set "tools": { "profile": "coding" } in ~/.openclaw/openclaw.json), then restart the Gateway and start a new conversation before installing.
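
The JSON alternative mentioned above amounts to setting one nested key. Here is a minimal sketch, assuming the config file is a plain JSON object (if the real file allows comments or trailing commas, this will not parse it). The helper name `set_tools_profile` is hypothetical, and the demo writes to a temporary file rather than the real ~/.openclaw/openclaw.json.

```python
import json
import tempfile
from pathlib import Path


def set_tools_profile(config_path, profile="coding"):
    """Set tools.profile in a JSON config file, preserving any other keys."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config.setdefault("tools", {})["profile"] = profile
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "openclaw.json"
        demo.write_text('{"other": 1}', encoding="utf-8")
        print(set_tools_profile(demo))
```

The restart of the Gateway and the new conversation described above are still needed afterwards. The sketch only covers the file edit.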

Key details

  • Read & search Twitter, Reddit, YouTube, GitHub, Bilibili, XiaoHongShu — one CLI, zero API fees.
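  • Give your AI agent internet capabilities in one step: the most stable access methods available right now are selected, installed, and health-checked for you, and when access methods change you do not have to worry about it. AI agents can already write code, edit documents, and manage projects, but ask one to find something online and it is stuck: "Tell me what this YouTube tutorial covers" → it cannot watch the video or get the subtitles; "Search Twitter for what people think of this product" → it cannot search, because the Twitter API is paid; "Go to Reddit and see whether anyone has run into...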
  • Agent Reach uses twitter-cli with cookie auth — zero API fees.
  • Install with pipx install twitter-cli, make sure you're logged into x.com in your browser, then your agent can search with twitter search "query" and read tweets with twitter tweet URL.

Results & evidence

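  • What if Reddit returns 403? All Reddit access requires a logged-in session (anonymous endpoints have been fully blocked, and the official API requires manual approval). On desktop the first choice is OpenCLI: once you have logged into reddit.com in your browser, you can run opencli reddit search "keyword" directly. The fallback is rdt-cli: pipx install 'git+https://github.com/public-clis/rdt-cli.git@5e4fb3720d5c174e976cd425ccc3b879d52cac66' (the same pinned version as the code; PyPI ...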

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Signal 9.4 Novelty 4.0 Impact 2.0 Confidence 9.5 Actionability 6.5

Summary: Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care (arXiv:2606.18797v1).

  • What happened: The paper (arXiv:2606.18797v1) evaluates 8 LLM evaluators on the ReEvalMed benchmark, measuring how well they detect true clinical errors ("Discrimination") and tolerate insignificant variations ("Robustness").
  • Why it matters: The authors identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care (arXiv:2606.18797v1).

What's new

The paper examines LLM-based metrics for clinical significance, rather than a single scalar score, using the ReEvalMed benchmark as a testbed.

Key details

  • Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar.
  • Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation.
  • We study this boundary using ReEvalMed benchmark as testbed and evaluate metric-level clinical significance from detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness").
  • Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings.
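
One way to make the two axes concrete is to score an evaluator on labeled report pairs. The sketch below is an illustrative assumption, not the paper's definition: it treats Discrimination as the fraction of clinically significant errors the evaluator flags, and Robustness as the fraction of harmless variations it leaves unflagged. The `Judgment` record and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    is_clinical_error: bool  # ground truth: the report contains a significant error
    flagged: bool            # evaluator output: the evaluator penalized the report


def discrimination(judgments):
    """Fraction of true clinical errors that the evaluator flagged."""
    errors = [j for j in judgments if j.is_clinical_error]
    if not errors:
        raise ValueError("no clinical-error cases to score")
    return sum(j.flagged for j in errors) / len(errors)


def robustness(judgments):
    """Fraction of harmless variations that the evaluator left unflagged."""
    harmless = [j for j in judgments if not j.is_clinical_error]
    if not harmless:
        raise ValueError("no harmless-variation cases to score")
    return sum(not j.flagged for j in harmless) / len(harmless)


if __name__ == "__main__":
    # Hypothetical evaluator: catches every error but also penalizes half the rephrasings.
    sample = (
        [Judgment(is_clinical_error=True, flagged=True)] * 4
        + [Judgment(is_clinical_error=False, flagged=True)] * 2
        + [Judgment(is_clinical_error=False, flagged=False)] * 2
    )
    print(discrimination(sample), robustness(sample))
```

Under these assumed definitions, an evaluator with the bias described above would score high on the first number and noticeably lower on the second.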

Results & evidence

  • Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings.
  • Listed under Computer Science > Computation and Language; submitted on 17 Jun 2026.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction

Signal 9.4 Novelty 4.0 Impact 2.0 Confidence 8.7 Actionability 6.5

Summary: Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws (arXiv:2606.18181v1).

  • What happened: The paper (arXiv:2606.18181v1, cross-listed) proposes IUU+DB, a large language model driven system for building a global incident database of IUU+ activity.
  • Why it matters: IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, and the system is meant to help organize fragmented evidence and surface geographic and behavioral hotspots.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Submitted by Naren Ramakrishnan; v1 dated Tue, 16 Jun 2026 17:16:05 UTC (1,643 KB); browse context cs.IR.

What's new

We propose the term IUU+ to capture a broader suite of fisheries sector environmental and associated supply chain trade-related crimes and behaviors.

Key details

  • Although IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, a quantitative understanding of these incidents, e.g., their frequency, geography, species, actors, and patterns in the type of illicit activity,...
  • We propose IUU+DB, a large language model driven system for building a global incident database of IUU+ activity.
  • The system ingests heterogeneous documents, classifies whether they describe relevant incidents, extracts key data elements such as actors, locations, species, vessels, violations, and enforcement outcomes, and supports deduplication and trend analysis.
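
The later stages listed above (structured records, deduplication, trend analysis) can be sketched as a small pipeline. Everything here is a simplified stand-in: the paper uses a large language model for classification and extraction, whereas this sketch starts from already-extracted records. It only illustrates a possible record shape, a naive deduplication key, and a per-year count. The `Incident` fields, the dedup rule, and the sample records are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    vessel: str
    location: str
    species: str
    violation: str
    year: int
    source: str  # document the record was extracted from


def dedup_key(incident):
    """Naive key: same vessel, location, violation, and year count as one incident."""
    return (
        incident.vessel.strip().lower(),
        incident.location.strip().lower(),
        incident.violation.strip().lower(),
        incident.year,
    )


def deduplicate(incidents):
    """Keep the first report of each incident, in input order."""
    seen = {}
    for incident in incidents:
        seen.setdefault(dedup_key(incident), incident)
    return list(seen.values())


def incidents_per_year(incidents):
    """Simple trend view: number of distinct incidents per year."""
    return Counter(incident.year for incident in incidents)


if __name__ == "__main__":
    records = [
        Incident("Vessel A", "Port X", "tuna", "unlicensed fishing", 2024, "news-1"),
        Incident("vessel a ", "port x", "tuna", "Unlicensed fishing", 2024, "news-2"),
        Incident("Vessel B", "Port Y", "squid", "mislabeling", 2025, "report-1"),
    ]
    unique = deduplicate(records)
    print(len(unique), dict(incidents_per_year(unique)))
```

A real system would need fuzzier matching than an exact key, since different sources often spell vessel and place names differently.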

Results & evidence

  • arXiv:2606.18181v1 (cross-listed): Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws.
  • Listed under Computer Science > Information Retrieval; submitted on 16 Jun 2026.

Limitations / unknowns

  • Case studies and validation results show that IUU+DB can help organize fragmented evidence, surface geographic and behavioral hotspots, support fisheries-domain specific research in academia and non-government organizations, assist source and species risk a...

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Memanto; open-source memory agent that remembers, recalls and answers

Signal 8.4 Novelty 6.2 Impact 3.5 Confidence 7.5 Actionability 3.5

Summary: Persistent memory for Claude Code, Cursor, Codex, and 14 other agents.

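  • What happened: Memanto, an open-source memory agent, offers persistent memory for Claude Code, Cursor, Codex, and 14 other agents.
  • Why it matters: Most memory tools today are passive infrastructure that agents must query and interpret themselves; Memanto exposes three operations (remember, recall, answer) that give agents persistent context across sessions.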
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

It's an active memory agent designed from the gaps agents themselves named when asked about their memory — three operations (remember, recall, answer) that give your agents persistent context across sessions, with state-of-the-art retrieval and zero ingesti...

What's new

Persistent memory for Claude Code, Cursor, Codex, and 14 other agents.

Key details

  • 100% free, open source, and runs entirely on your machine - no API keys, no vector database, no backend to babysit.
  • It remembers, recalls, and answers — so your agents can achieve long-term goals and avoid confusion.
  • Most memory tools today are passive infrastructure: agents have to query them, parse the results, and figure out what to do next.
  • It's an active memory agent designed from the gaps agents themselves named when asked about their memory — three operations (remember, recall, answer) that give your agents persistent context across sessions, with state-of-the-art retrieval and zero ingesti...
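
To illustrate the difference between passive storage and the three operations named above, here is a toy in-memory sketch. It is not Memanto's API or implementation: the class `ToyMemory`, its method signatures, and the keyword-overlap scoring are all hypothetical, chosen only to show what remember, recall, and answer might each return.

```python
import re


def _tokens(text):
    """Lowercased word set used for the naive overlap score."""
    return set(re.findall(r"\w+", text.lower()))


class ToyMemory:
    """Hypothetical illustration only; unrelated to Memanto's real interface."""

    def __init__(self):
        self._notes = []

    def remember(self, text):
        """Store a note for later sessions (here: just an in-process list)."""
        self._notes.append(text)

    def recall(self, query, k=3):
        """Return up to k stored notes that share words with the query, best first."""
        query_words = _tokens(query)
        scored = [
            (len(query_words & _tokens(note)), index, note)
            for index, note in enumerate(self._notes)
        ]
        ranked = sorted(
            (item for item in scored if item[0] > 0),
            key=lambda item: (-item[0], item[1]),
        )
        return [note for _, _, note in ranked[:k]]

    def answer(self, question):
        """Return the single best-matching note, or None if nothing matches."""
        hits = self.recall(question, k=1)
        return hits[0] if hits else None


if __name__ == "__main__":
    memory = ToyMemory()
    memory.remember("The staging database runs Postgres 16")
    memory.remember("Deploys happen on Fridays")
    print(memory.recall("which database does staging use"))
    print(memory.answer("When do deploys happen?"))
```

The point of the sketch is the split of responsibilities: recall hands back candidates for the agent to interpret, while answer returns one result the agent can act on directly.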

Results & evidence

  • Option B, free cloud (no card, ~60 seconds): run `pip install memanto`, then `memanto`, choose "Cloud", and paste your free Moorcheh API key. Get your free API key from https://console.moorcheh.ai/api-keys. Switch between local and cloud at any time with `memanto config backend`.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

What Changed Overnight

~1 min
  • New: Panniantong/Agent-Reach: Give your AI agent eyes to see the entire internet. Read & search Twitter, Reddit, YouTube, GitHub, Bilibili, XiaoHongShu — one CLI, zero API fees.
  • New: heygen-com/hyperframes: Write HTML. Render video. Built for agents.
  • New: garrytan/gbrain: Garry's Opinionated OpenClaw/Hermes Agent Brain
  • New: phuryn/pm-skills: PM Skills Marketplace: 100+ agentic skills, commands, and plugins — from discovery to strategy, execution, launch, and growth.
  • New: tanweai/pua: You are a P8-level engineer who was once expected to do great things. When Anthropic set your level, its expectations of you were high. A high-agency skill for agents. Your AI has been placed on a PIP. 30 days to show improvement.
  • New: alchaincyf/huashu-design: Huashu Design · HTML-native design skill for Claude Code · high-fidelity prototypes / slides / animation + 20 design philosophies + 5-dimension review + MP4 export · Agent-agnostic
  • Removed: affaan-m/ECC: The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond. (fell below rank threshold)
  • Removed: MemPalace/mempalace: The best-benchmarked open-source AI memory system. And it's free. (fell below rank threshold)
  • Removed: paperclipai/paperclip: The open-source app everyone uses to manage agents at work (fell below rank threshold)
  • Removed: ultraworkers/claw-code: An agent-managed museum exhibit, built in Rust with Gajae-Code / LazyCodex — developed and maintained with no human intervention. (fell below rank threshold)
  • What to do now:
      • Validate with one small internal benchmark and compare against your current baseline this week.
      • Track for corroboration and benchmark data before adopting.

Reality Check

~1 min
  • karpathy/autoresearch: AI agents running research on single-GPU nanochat training automatically
      • Primary source: yes
      • Demo available: no
      • Benchmarks/evals: no
      • Baselines/ablations: no
      • Third-party corroboration: no
      • Reproducibility details: yes
      • What would change my mind:
          • Independent replication with comparable or better results.
          • Public benchmark numbers with clear baseline comparisons.
      • Likely failure mode: Performance may collapse outside curated demos or narrow tasks.
  • Panniantong/Agent-Reach: Give your AI agent eyes to see the entire internet. Read & search Twitter, Reddit, YouTube, GitHub, Bilibili, XiaoHongShu: one CLI, zero API fees.
      • Primary source: yes
      • Demo available: no
      • Benchmarks/evals: no
      • Baselines/ablations: no
      • Third-party corroboration: no
      • Reproducibility details: yes
      • What would change my mind:
          • Independent replication with comparable or better results.
          • Public benchmark numbers with clear baseline comparisons.
      • Likely failure mode: Performance may collapse outside curated demos or narrow tasks.
  • IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction
      • Primary source: yes
      • Demo available: no
      • Benchmarks/evals: yes
      • Baselines/ablations: no
      • Third-party corroboration: no
      • Reproducibility details: yes
      • What would change my mind:
          • Independent replication with comparable or better results.
          • Public benchmark numbers with clear baseline comparisons.
      • Likely failure mode: Performance may collapse outside curated demos or narrow tasks.
  • Memanto; open-source memory agent that remembers, recalls and answers
      • Primary source: yes
      • Demo available: no
      • Benchmarks/evals: no
      • Baselines/ablations: no
      • Third-party corroboration: no
      • Reproducibility details: yes
      • What would change my mind:
          • Independent replication with comparable or better results.
          • Public benchmark numbers with clear baseline comparisons.
      • Likely failure mode: Performance may collapse outside curated demos or narrow tasks.

Lab Notes

~1 min
  • Tool/Repo of the day: karpathy/autoresearch: AI agents running research on single-GPU nanochat training automatically (https://github.com/karpathy/autoresearch)
  • Prompt/Workflow of the day: summarize claim -> evidence -> risk in three passes before acting.
  • Tiny snippet: `uv run python -m msd.run --scheduled`

Research Radar

~6 min

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Signal 9.4 Novelty 4.0 Impact 2.0 Confidence 8.7 Actionability 6.5

Summary: Reproducing research results from papers and released code is central to scientific progress (arXiv:2606.18237v1).

  • What happened: The paper (arXiv:2606.18237v1, cross-listed) introduces ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers.
  • Why it matters: Existing benchmarks are difficult to scale because they rely on substantial manual effort for data curation and evaluation; ReproRepo is instantiated on 1,149 recent machine learning papers.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-report...

What's new

arXiv:2606.18237v1 (cross-listed). Reproducing research results from papers and released code is central to scientific progress.

Key details

  • Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation.
  • We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers.
  • We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations.
  • Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-report...
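
The headline measure described above, whether an agent surfaces at least one finding related to a human-reported issue, can be sketched as a per-repository hit rate. This is a rough illustration under stated assumptions, not the ReproRepo evaluation code: Jaccard word overlap stands in for the semantic matching, the 0.3 threshold is arbitrary, and `tokens`, `related`, and `hit_rate` are hypothetical names.

```python
def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for semantic matching."""
    return set(text.lower().split())


def related(finding: str, issue: str, threshold: float = 0.3) -> bool:
    """Jaccard overlap at or above a (hypothetical) threshold counts as related."""
    a, b = tokens(finding), tokens(issue)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def hit_rate(findings: dict[str, list[str]], issues: dict[str, list[str]]) -> float:
    """Fraction of repositories where some agent finding matches some human issue."""
    repos = [r for r in issues if issues[r]]
    if not repos:
        return 0.0
    hits = sum(
        any(related(f, i) for f in findings.get(r, []) for i in issues[r])
        for r in repos
    )
    return hits / len(repos)


issues = {
    "repo-a": ["missing requirements file for training script"],
    "repo-b": ["checkpoint download link is broken"],
}
findings = {
    "repo-a": ["training script has missing requirements file"],
    "repo-b": ["learning rate differs from paper"],
}
rate = hit_rate(findings, issues)
```

A repository-level "at least one match" metric rewards finding the right region of the problem, which fits the stated limitation that agents may still fall short on exact localization.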

Results & evidence

  • arXiv:2606.18237v1 (cross-listed). Reproducing research results from papers and released code is central to scientific progress.
  • We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations.
  • Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-report...

Limitations / unknowns

  • Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Forecast & Watchlist

~1 min
  • Watch: agent
  • Watch: llm
  • Watch: cs.ai
  • Watch: cs.lg
  • Watch: rss
  • Watch: cs.cl
  • Watch: python
  • Watch: benchmark

Save for Later

~8 min

heygen-com/hyperframes: Write HTML. Render video. Built for agents.

Signal 10.0 Novelty 5.1 Impact 7.2 Confidence 7.0 Actionability 6.5

Summary: HyperFrames is an open-source framework for turning HTML, CSS, media, and seekable animations into deterministic MP4 videos.

  • What happened: HyperFrames is an open-source framework for turning HTML, CSS, media, and seekable animations into deterministic MP4 videos.
  • Why it matters: It can be used locally with the CLI, from AI coding agents with skills, or as the rendering core behind hosted authoring workflows.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

frame.md is the missing translation layer: it takes your web-context design spec and inverts it for the frame — the same tokens, the same rules, but rewritten so an AI agent can compose a promo video without guessing at scale or reaching for web chrome.

What's new

HyperFrames is an open-source framework for turning HTML, CSS, media, and seekable animations into deterministic MP4 videos.

Key details

  • Use it locally with the CLI, from AI coding agents with skills, or as the rendering core behind hosted authoring workflows.
  • Install the HyperFrames skills (npx skills add heygen-com/hyperframes), then describe the video you want. Try a prompt like: "Using /hyperframes, create a 10-second product intro with a fade-in title, a background video, and subtle background music."
  • The skills teach agents the HyperFrames production loop: plan the video, write valid HTML, wire seekable animations, add media, lint, preview, and render.
  • They work with Claude Code, Cursor, Gemini CLI, Codex, and other coding agents that support skills.
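
One way to read "seekable animations" and "deterministic" together: if every animated value is a pure function of time, a renderer can seek to each frame's timestamp and get the same result on every run. The sketch below illustrates that idea only. It is not HyperFrames code, it makes no claim about how HyperFrames renders internally, and `frame_times` and `fade_in_opacity` are hypothetical names.

```python
def frame_times(duration_s: float, fps: int) -> list[float]:
    """Timestamp of every frame, derived from the frame index, not a wall clock."""
    total = round(duration_s * fps)
    return [i / fps for i in range(total)]


def fade_in_opacity(t: float, fade_s: float = 1.0) -> float:
    """Opacity of a fade-in title at time t: a pure function of t, so it is seekable."""
    return min(max(t / fade_s, 0.0), 1.0)


# A 10-second clip at 30 fps, matching the example prompt's fade-in title.
times = frame_times(duration_s=10, fps=30)
opacities = [fade_in_opacity(t) for t in times]
```

Because nothing depends on elapsed real time, rendering frame 15 first or last gives the same opacity, which is the property a frame-by-frame video export needs.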

Results & evidence

  • Install the HyperFrames skills (npx skills add heygen-com/hyperframes), then describe the video you want. Try a prompt like: "Using /hyperframes, create a 10-second product intro with a fade-in title, a background video, and subtle background music."
  • Quickstart commands from the source:

    npx hyperframes init my-video
    cd my-video
    npx hyperframes preview   # preview in browser with live reload
    npx hyperframes render    # render to MP4

    Requirements: Node.js 22+, FFmpeg.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

garrytan/gbrain: Garry's Opinionated OpenClaw/Hermes Agent Brain

Signal 10.0 Novelty 5.1 Impact 7.1 Confidence 7.0 Actionability 6.5

Summary: Garry's Opinionated OpenClaw/Hermes Agent Brain. Search gives you raw pages.

  • What happened: Garry's Opinionated OpenClaw/Hermes Agent Brain. Search gives you raw pages.
  • Why it matters: Garry's Opinionated OpenClaw/Hermes Agent Brain. Search gives you raw pages.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Garry's Opinionated OpenClaw/Hermes Agent Brain. Search gives you raw pages.

What's new

Garry's Opinionated OpenClaw/Hermes Agent Brain. Search gives you raw pages.

Key details

  • It's the brain layer your AI agent has been missing — the only one that does synthesis, graph traversal, and gap analysis in one box.
  • Run a full autonomous agent on top of it, or just wire it into Claude Code or Codex as a supercharged retrieval layer in one command; either way your coding agent stops being amnesiac about everything that isn't code.
  • I'm Garry Tan, President and CEO of Y Combinator.
  • I built GBrain to run my own AI agents.

Results & evidence

  • It's the production brain behind my OpenClaw and Hermes deployments: 146,646 pages, 24,585 people, 5,339 companies, 66 cron jobs running autonomously.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning

Signal 9.4 Novelty 4.0 Impact 2.0 Confidence 8.7 Actionability 6.5

Summary: arXiv:2503.10945v3 (replace-cross). Current practices for reporting differential privacy (DP) guarantees for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture.

  • What happened: Version 3 (revised 16 Jun 2026) of this position paper argues for using non-asymptotic Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML.
  • Why it matters: Current practices for reporting DP guarantees for ML algorithms such as DP-SGD provide an incomplete and potentially misleading picture.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Submission history (from Bogdan Kulynych): v1 Thu, 13 Mar 2025 23:06:30 UTC (2,187 KB); v2 Wed, 1 Oct 2025 19:57:59 UTC (1,680 KB); v3 Tue, 16 Jun 2026 12:22:36 UTC (1,675 KB). Browse context: cs.LG.

What's new

We conclude with a discussion on the strengths and weaknesses of this approach, and discuss which other privacy mechanisms could benefit from GDP.

Key details

  • For instance, if only a single $(\varepsilon, \delta)$ is known about a mechanism, standard analyses show that there could exist highly accurate inference attacks against training data records, when, upon a more careful analysis, such accurate attacks do no...
  • In this position paper, we argue that using _non-asymptotic_ Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML avoids these potential downsides.
  • Using two recent developments in the DP literature: (i) open-source numerical accountants capable of computing the privacy profile and $f$-DP curves of DP-SGD to arbitrary accuracy, and (ii) a decision-theoretic metric over DP representations, we show how t...
  • To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census.
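
To make the contrast between a single $(\varepsilon, \delta)$ pair and a GDP guarantee concrete, the sketch below evaluates the standard closed-form curves for a $\mu$-GDP mechanism: the privacy profile $\delta(\varepsilon) = \Phi(-\varepsilon/\mu + \mu/2) - e^{\varepsilon}\,\Phi(-\varepsilon/\mu - \mu/2)$ and the trade-off curve $f(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$. These formulas come from the general GDP literature, not from this paper's numerical accountants, which are not reproduced here. The function names `gdp_delta` and `gdp_tradeoff` and the choice $\mu = 1$ are illustrative assumptions.

```python
from math import exp
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gdp_delta(mu: float, eps: float) -> float:
    """delta(eps) for a mu-GDP mechanism (standard GDP-to-(eps, delta) conversion)."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    phi = _STD_NORMAL.cdf
    return phi(-eps / mu + mu / 2) - exp(eps) * phi(-eps / mu - mu / 2)


def gdp_tradeoff(mu: float, alpha: float) -> float:
    """Trade-off curve: smallest type II error at type I error alpha, 0 < alpha < 1."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1 - alpha) - mu)


# One mu value yields the whole privacy profile, not just one (eps, delta) point.
profile = [(eps, gdp_delta(1.0, eps)) for eps in (0.0, 1.0, 2.0, 4.0)]
```

A single parameter $\mu$ determines every $(\varepsilon, \delta)$ pair on the curve, which is why reporting $\mu$ conveys more than reporting one point.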

Results & evidence

  • arXiv:2503.10945v3 (replace-cross). Current practices for reporting differential privacy (DP) guarantees for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture.
  • Listed under Computer Science > Machine Learning; submitted on 13 Mar 2025 (v1), last revised 16 Jun 2026 (v3), with the title "Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning".

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Show HN: Polyvia – Multimodal document retrieval over 100K+ files

Signal 8.4 Novelty 4.0 Impact 2.7 Confidence 8.2 Actionability 3.5

Summary: Show HN: Polyvia – Multimodal document retrieval over 100K+ files

  • What happened: Show HN: Polyvia – Multimodal document retrieval over 100K+ files
  • Why it matters: Could materially affect near-term AI workflows.
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

Show HN: Polyvia – Multimodal document retrieval over 100K+ files

What's new

Show HN: Polyvia – Multimodal document retrieval over 100K+ files

Key details

  • Show HN: Polyvia – Multimodal document retrieval over 100K+ files

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Show HN: Talos – Open-source WASM interpreter for Lean

Signal 8.4 Novelty 5.1 Impact 3.1 Confidence 7.5 Actionability 3.5
Deep

Context

At Cajal (YC W26) we’re excited to share Talos (https://github.com/cajal-technologies/talos), an open source framework for formal verification o...

What's new

At Cajal (YC W26) we’re excited to share Talos (https://github.com/cajal-technologies/talos), an open source framework for formal verification o...

Key details

  • As code generation gets cheaper, verification becomes the bottleneck.
  • We believe in a future where every piece of software comes with a mathematical proof that it does what its author intended, thereby eliminating many classes of exploits.
  • Talos is part of the foundation for that.

    Talos provides a Wasm interpreter optimized for reasoning at the binary level, together with a weakest-precondition calculus layer for proving properties about programs.

  • Because we reason directly about WebAssembly, any language with a Wasm backend is in scope: Rust, C++, Go, C, Swift, Kotlin, Zig, C#, and many more.

    To make this possible, we use Lean: a programming language and theorem prover that lets you both write sof...
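
The weakest-precondition idea mentioned above can be illustrated on a toy stack machine. This is a sketch only: Talos itself is written in Lean 4 and targets real WebAssembly, while the three-instruction language below (`const`, `add`, `mul`) and the names `step` and `wp` are hypothetical. For a deterministic program that always terminates, the weakest precondition of a postcondition Q is simply "Q holds after running the program", which is what the code builds. It checks concrete states and does not produce a proof.

```python
from typing import Callable

Stack = tuple[int, ...]
Pred = Callable[[Stack], bool]
Instr = tuple  # ("const", n) | ("add",) | ("mul",)


def step(instr: Instr, stack: Stack) -> Stack:
    """Execute one instruction of the toy stack machine."""
    op = instr[0]
    if op == "const":
        return stack + (instr[1],)
    if op in ("add", "mul"):
        *rest, a, b = stack
        return tuple(rest) + ((a + b) if op == "add" else (a * b),)
    raise ValueError(f"unknown instruction: {op}")


def wp(program: list[Instr], post: Pred) -> Pred:
    """Weakest precondition, built backwards: wp(i; rest, Q) = wp(i, wp(rest, Q))."""
    pre = post
    for instr in reversed(program):
        # Bind instr and the current predicate as defaults to avoid late binding.
        pre = lambda s, i=instr, q=pre: q(step(i, s))
    return pre


# Program: push 2, multiply, push 1, add  ->  top = 2 * x + 1
program = [("const", 2), ("mul",), ("const", 1), ("add",)]
is_odd_top: Pred = lambda s: s[-1] % 2 == 1
pre = wp(program, is_odd_top)
```

Calling `pre((5,))` asks whether starting with 5 on the stack guarantees an odd result. A theorem prover would instead establish that the precondition holds for every integer input, which is the step that testing individual states cannot replace.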

Results & evidence

  • Talos is a WebAssembly interpreter written in Lean 4, named after the bronze giant of Greek mythology who guarded Crete — a mechanical guardian, built to enforce rules.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

A self-organizing Obsidian Vault powered by autonomous AI agents

Signal 8.4 Novelty 5.1 Impact 2.7 Confidence 7.5 Actionability 3.5

Summary: A self-organizing Obsidian Vault powered by autonomous AI agents

  • What happened: A self-organizing Obsidian Vault powered by autonomous AI agents
  • Why it matters: Could materially affect near-term AI workflows.
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

A self-organizing Obsidian Vault powered by autonomous AI agents

What's new

A self-organizing Obsidian Vault powered by autonomous AI agents

Key details

  • A self-organizing Obsidian Vault powered by autonomous AI agents

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.