Morning Singularity Digest - 2026-06-02

Estimated total read • ~29 min

Skim fast, dive deep only where it matters.

2-minute skim · 10-minute read · Deep dive optional
Contents

Front Page

~7 min

MemPalace/mempalace: The best-benchmarked open-source AI memory system. And it's free.

Signal 10.0 Novelty 6.2 Impact 7.5 Confidence 7.8 Actionability 6.5

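Summary: MemPalace is an open-source AI memory system that reports 96.6% R@5 (raw) on LongMemEval with zero API calls.

  • What happened: MemPalace/mempalace, billed as the best-benchmarked open-source AI memory system, is new in today's digest.
  • Why it matters: It combines verbatim storage with a pluggable backend and reports its headline retrieval score without making API calls.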
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

MemPalace describes itself as the best-benchmarked open-source AI memory system, and it is free.

What's new

Verbatim storage with a pluggable backend, reporting 96.6% R@5 raw on LongMemEval with zero API calls.

Key details

  • Verbatim storage, pluggable backend, 96.6% R@5 raw on LongMemEval — zero API calls.
  • MemPalace has no other official websites.
  • The only official sources are this GitHub repository, the PyPI package, and the docs at mempalaceofficial.com.
  • Any other domain (including .tech, .net, or other .com variants) is an impostor and may distribute malware.

Results & evidence

  • Reported result: 96.6% R@5 (raw) on LongMemEval, with zero API calls.
  • Important: Claude Code sessions expire in 30 days unless auto-save hooks are wired.
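
For readers who want to sanity-check a figure like R@5 on their own data, here is a minimal sketch of recall@k. It assumes R@5 means the fraction of relevant items that appear in the top five retrieved results, averaged over queries; the digest does not state which definition MemPalace uses. The function names and the toy memory ids are hypothetical.

```python
from statistics import mean

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return mean(recall_at_k(ranked, relevant, k) for ranked, relevant in runs)

# Toy data (hypothetical ids): query 1 finds its only relevant item,
# query 2 finds one of two because "m0" is ranked sixth.
runs = [
    (["m3", "m7", "m1", "m9", "m2", "m5"], {"m7"}),
    (["m4", "m8", "m6", "m2", "m1", "m0"], {"m8", "m0"}),
]
print(mean_recall_at_k(runs, k=5))  # 0.75
```

Running the same function over a current baseline and a candidate system, with identical queries and k, gives a like-for-like comparison.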

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

affaan-m/ECC: The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.

Signal 10.0 Novelty 6.2 Impact 8.2 Confidence 7.0 Actionability 6.5

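Summary: ECC is an agent harness performance optimization system for Claude Code, Codex, Opencode, Cursor and beyond.

  • What happened: ECC v2.0.0-rc.1 adds the public Hermes operator story on top of its reusable layer of agents, skills, hooks, rules, and MCP configurations.
  • Why it matters: It packages skills, instincts, memory optimization, continuous learning, security scanning, and research-first development into one system.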
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

| Topic | What You'll Learn |
|---|---|
| Token Optimization | Model selection, system prompt slimming, background processes |
| Memory Persistence | Hooks that save/load context across sessions automatically |
| Continuous Learning | Auto-extract patterns... |

What's new

Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.

Key details

  • Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
  • Language: English | Português (Brasil) | 简体中文 | 繁體中文 | 日本語 | 한국어 | Türkçe | Русский | Tiếng Việt | ไทย | Deutsch. Stats: 182K+ stars | 28K+ forks | 170+ contributors | 12+ language ecosystems | Cross-harness agent workflows.
  • Built from real-world multi-harness engineering workflows.
  • A complete system: skills, instincts, memory optimization, continuous learning, security scanning, and research-first development.

Results & evidence

  • 182K+ stars | 28K+ forks | 170+ contributors | 12+ language ecosystems | Cross-harness agent workflows.
  • Production-ready agents, skills, hooks, rules, MCP configurations, and legacy command shims evolved over 10+ months of intensive daily use building real products.
  • ECC v2.0.0-rc.1 adds the public Hermes operator story on top of that reusable layer: start with the Hermes setup guide, then review the rc.1 release notes and cross-harness architecture.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Signal 9.4 Novelty 6.2 Impact 2.0 Confidence 9.5 Actionability 6.5

Summary: Data science plays a critical role in transforming complex data into actionable insights across numerous domains (arXiv:2603.19005v2, announce type: replace-cross).

  • What happened: We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.
  • Why it matters: It remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.

What's new

We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines.

Key details

  • Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflow.
  • However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages.
  • We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.
  • AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.

Results & evidence

  • arXiv:2603.19005v2 (announce type: replace-cross). Abstract: Data science plays a critical role in transforming complex data into actionable insights across numerous domains.
  • AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.
  • We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines.

Limitations / unknowns

  • It remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

How to Correctly Report LLM-as-a-Judge Evaluations

Signal 9.4 Novelty 4.0 Impact 2.0 Confidence 9.5 Actionability 6.5

Summary: Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators (arXiv:2511.21140v4, announce type: replace).

  • What happened: The paper proposes a simple plug-in framework that corrects the bias in naive LLM-judge scores and enables statistically principled uncertainty quantification.
  • Why it matters: Imperfect sensitivity and specificity of LLM judges induce bias in naive evaluation scores.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Submission history (from Chungpa Lee): v1 Wed, 26 Nov 2025 07:46:46 UTC (396 KB); v2 Sun, 4 Jan 2026 07:18:14 UTC (313 KB); v3 Mon, 9 Feb 2026 07:36:38 UTC (315 KB); v4 Sun, 31 May 2026 12:00:00 UTC (2,623 KB). Current browse context: cs.LG.

What's new

We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification.

Key details

  • Imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores.
  • We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification.
  • Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset.
  • Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals.
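
To make the bias concrete, here is a minimal sketch of one standard plug-in correction for a binary judge (often called the Rogan-Gladen estimator). Whether this matches the paper's exact estimator is an assumption; the function names and toy numbers are hypothetical, and the sketch covers only the point estimate, not the confidence intervals or the adaptive calibration allocation described above.

```python
def judge_rates(calibration):
    """Estimate judge sensitivity and specificity from (human_label, judge_label) pairs (0/1)."""
    on_positives = [judge for human, judge in calibration if human == 1]
    on_negatives = [judge for human, judge in calibration if human == 0]
    if not on_positives or not on_negatives:
        raise ValueError("calibration set needs both human-positive and human-negative items")
    sensitivity = sum(on_positives) / len(on_positives)                 # P(judge=1 | human=1)
    specificity = sum(1 - j for j in on_negatives) / len(on_negatives)  # P(judge=0 | human=0)
    return sensitivity, specificity

def corrected_pass_rate(judge_labels, calibration):
    """Plug-in estimate of the true pass rate from raw judge verdicts on the test set."""
    p_obs = sum(judge_labels) / len(judge_labels)  # naive (biased) score
    sensitivity, specificity = judge_rates(calibration)
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; the correction is undefined")
    estimate = (p_obs + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))            # clip to a valid proportion

# Toy calibration set: the judge catches 9 of 10 true passes and 8 of 10 true fails.
calibration = [(1, 1)] * 9 + [(1, 0)] * 1 + [(0, 0)] * 8 + [(0, 1)] * 2
# Toy test set: the judge marks 62 of 100 responses as passing.
judge_labels = [1] * 62 + [0] * 38
print(round(corrected_pass_rate(judge_labels, calibration), 6))  # 0.6
```

In this toy case the naive score of 0.62 overstates a true pass rate of 0.60, because the judge's false positives on failing responses outweigh its misses on passing ones.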

Results & evidence

  • arXiv:2511.21140v4 (announce type: replace). Abstract: Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators.
  • Listed under Computer Science > Machine Learning; submitted on 26 Nov 2025 (v1), last revised 31 May 2026 (v4).

Limitations / unknowns

  • Imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores; correcting it relies on a human-labeled calibration dataset.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Software 3.0 developer guide – principles, methods, and a four-phase framework

Signal 8.4 Novelty 4.0 Impact 2.8 Confidence 7.5 Actionability 5.2

Summary: Software 3.0 developer guide – principles, methods, and a four-phase framework

  • What happened: Software 3.0 developer guide – principles, methods, and a four-phase framework
  • Why it matters: Could materially affect near-term AI workflows.
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

Software 3.0 developer guide – principles, methods, and a four-phase framework

What's new

Software 3.0 developer guide – principles, methods, and a four-phase framework

Key details

  • Software 3.0 developer guide – principles, methods, and a four-phase framework

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

What Changed Overnight

~1 min
  • New: affaan-m/ECC: The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
  • New: MemPalace/mempalace: The best-benchmarked open-source AI memory system. And it's free.
  • New: paperclipai/paperclip: The open-source app everyone uses to manage agents at work
  • New: VoltAgent/awesome-design-md: A collection of DESIGN.md files analysis by popular brand design systems. Drop one into your project and let coding agents generate a matching UI.
  • New: karpathy/autoresearch: AI agents running research on single-GPU nanochat training automatically
  • New: addyosmani/agent-skills: Production-grade engineering skills for AI coding agents.
  • Removed: MatchFixAgent: Language-Agnostic Autonomous Repository-Level Code Translation Validation and Repair (fell below rank threshold)
  • Removed: SERA: Soft-Verified Efficient Repository Agents (fell below rank threshold)
  • Removed: AI Agent Guidelines for CS336 at Stanford (fell below rank threshold)
  • Removed: DuckDuckGo makes its 'no-AI' search engine easier to access as its traffic booms (fell below rank threshold)
  • What to do now:
  • Validate with one small internal benchmark and compare against your current baseline this week.
  • Track for corroboration and benchmark data before adopting.

Deep Dives

~6 min

paperclipai/paperclip: The open-source app everyone uses to manage agents at work

Signal 10.0 Novelty 6.2 Impact 7.7 Confidence 7.0 Actionability 6.5

Summary: The open-source app everyone uses to manage agents at work: open-source orchestration for teams of AI agents.

  • What happened: Paperclip, a Node.js server and React UI that orchestrates a team of AI agents to run a business, is new in today's digest.
  • Why it matters: Bring your own agents, assign goals, and track work and costs from one dashboard.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

The open-source app everyone uses to manage agents at work.

What's new

Open-source orchestration for teams of AI agents.

Key details

  • If OpenClaw is an employee, Paperclip is the company.
  • Paperclip is a Node.js server and React UI that orchestrates a team of AI agents to run a business.
  • Bring your own agents, assign goals, and track work and costs from one dashboard.
  • Under the hood: org charts, budgets, governance, goal alignment, and agent coordination.

Results & evidence

| | Step | Example |
|---|---|---|
| 01 | Define the goal | "Build the #1 AI note-taking app to $1M MRR." |
| 02 | Hire the team | CEO, CTO, engineers, designers, marketers: any bot, any provider. |
| 03 | Approve and run | Review strategy. |

  • ✅ You want to build autonomous AI companies
  • ✅ You coordinate many different agents (OpenClaw, Codex, Claude, Cursor) toward a common goal
  • ✅ You have 20 simultaneous Claude Code terminals open and lose track of what everyone is doing
  • ✅ You want age...

Limitations / unknowns

  • Agents run under budgets: when they hit the limit, they stop.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Reality Check

~1 min
  • affaan-m/ECC: The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
  • Primary source: yes
  • Demo available: no
  • Benchmarks/evals: no
  • Baselines/ablations: no
  • Third-party corroboration: no
  • Reproducibility details: yes
  • What would change my mind:
  • Independent replication with comparable or better results.
  • Public benchmark numbers with clear baseline comparisons.
  • Likely failure mode: Performance may collapse outside curated demos or narrow tasks.
  • Software 3.0 developer guide – principles, methods, and a four-phase framework
  • Primary source: yes
  • Demo available: no
  • Benchmarks/evals: no
  • Baselines/ablations: no
  • Third-party corroboration: no
  • Reproducibility details: yes
  • What would change my mind:
  • Independent replication with comparable or better results.
  • Public benchmark numbers with clear baseline comparisons.
  • Likely failure mode: Performance may collapse outside curated demos or narrow tasks.
  • paperclipai/paperclip: The open-source app everyone uses to manage agents at work
  • Primary source: yes
  • Demo available: no
  • Benchmarks/evals: no
  • Baselines/ablations: no
  • Third-party corroboration: no
  • Reproducibility details: yes
  • What would change my mind:
  • Independent replication with comparable or better results.
  • Public benchmark numbers with clear baseline comparisons.
  • Likely failure mode: Performance may collapse outside curated demos or narrow tasks.

Lab Notes

~1 min
  • Tool/Repo of the day: MemPalace/mempalace: The best-benchmarked open-source AI memory system. And it's free. (https://github.com/MemPalace/mempalace)
  • Prompt/Workflow of the day: summarize claim -> evidence -> risk in three passes before acting.
  • Tiny snippet: `uv run python -m msd.run --scheduled`

Research Radar

~6 min

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

Signal 9.4 Novelty 4.0 Impact 2.0 Confidence 9.5 Actionability 6.5

Summary: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations (arXiv:2606.00093v1, announce type: new).

  • What happened: A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, with those choices rarely stated.
  • Why it matters: For binary criteria, the common case in rubric-based evaluation, most of the reported numbers are redundant; Cohen's $\kappa$ is the one agreement coefficient that adds information.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

Current browse context: cs.CL

What's new

arXiv:2606.00093v1 Announce Type: new Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations.

Key details

  • A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated.
  • For binary criteria -- the common case in rubric-based evaluation, where each criterion is graded MET or UNMET -- most of the reported numbers are redundant: Pearson's $r$, Spearman's $\rho$, Kendall's $\tau_b$, the phi coefficient $\phi$, and the Matthews...
  • Cohen's $\kappa$ is the one agreement coefficient that adds information: it shares $\phi$'s numerator but normalizes differently, and the gap between them measures how far the judge's positive-label rate has drifted from the human's.
  • We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences.
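
The relationship between $\phi$ and Cohen's $\kappa$ described above can be checked numerically. The sketch below uses the standard textbook formulas on a 2x2 confusion table of judge verdicts against human labels; the function name and the toy counts are hypothetical and are not taken from the paper.

```python
import math

def phi_and_kappa(tp, fp, fn, tn):
    """Phi (equal to MCC for binary labels) and Cohen's kappa from a 2x2 table.

    tp/fp/fn/tn count judge verdicts against human labels, with MET as positive.
    """
    numerator = tp * tn - fp * fn            # shared by both coefficients
    a = (tp + fp) * (fp + tn)                # judge positives * judge negatives
    b = (tp + fn) * (fn + tn)                # human positives * human negatives
    phi = numerator / math.sqrt(a * b)       # geometric-mean normalization
    kappa = numerator / ((a + b) / 2)        # arithmetic-mean normalization
    return phi, kappa

# Judge and human both label 50% of items MET: the two coefficients coincide.
print(phi_and_kappa(tp=40, fp=10, fn=10, tn=40))   # both approximately 0.6

# Judge labels 70% MET while the human labels 50% MET: kappa falls below phi.
phi, kappa = phi_and_kappa(tp=45, fp=25, fn=5, tn=25)
print(round(phi, 4), round(kappa, 4))              # 0.4364 0.4
```

Because an arithmetic mean is never smaller than the corresponding geometric mean, $|\kappa| \le |\phi|$ under these formulas, and the gap opens as the judge's positive-label rate moves away from the human's.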

Results & evidence

  • arXiv:2606.00093v1 (announce type: new), listed under Computer Science > Computation and Language; submitted on 25 May 2026.
  • A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Forecast & Watchlist

~1 min
  • Watch: agent
  • Watch: llm
  • Watch: cs.ai
  • Watch: cs.lg
  • Watch: rss
  • Watch: cs.cl
  • Watch: python
  • Watch: benchmark

Save for Later

~6 min

VoltAgent/awesome-design-md: A collection of DESIGN.md files analysis by popular brand design systems. Drop one into your project and let coding agents generate a matching UI.

Signal 10.0 Novelty 5.1 Impact 7.8 Confidence 7.0 Actionability 6.5

Summary: A collection of DESIGN.md files based on analysis of popular brand design systems.

  • What happened: DESIGN.md is a new concept introduced by Google Stitch.
  • Why it matters: Dropping one of these files into a project lets coding agents generate a UI that stays visually consistent with the design language.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

A collection of DESIGN.md files based on analysis of popular brand design systems.

What's new

DESIGN.md is a new concept introduced by Google Stitch.

Key details

  • Drop one into your project and let coding agents generate a matching UI.
  • Copy a DESIGN.md into your project, tell your AI agent “build me a page that looks like this,” and generate high-quality UI that stays visually consistent with the design language.
  • Built with real design depth — including analyzed patterns, tokens, and rules — for high-quality UI generation, not surface-level outputs.
  • DESIGN.md is a new concept introduced by Google Stitch.

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem

Signal 9.4 Novelty 5.1 Impact 2.0 Confidence 8.7 Actionability 6.5

Summary: Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality (arXiv:2603.16572v2).

  • What happened: The authors collected 238,180 unique skills from three major distribution platforms and GitHub, and analyzed their contents, behavior, and repository context.
  • Why it matters: Scanner reports from individual marketplaces classify up to 46.8% of skills as malicious, raising concerns about false positives.
  • What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep

Context

We collect 238,180 unique skills from three major distribution platforms and GitHub, and analyze their contents, behavior, and repository context.

What's new

Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality (arXiv:2603.16572v2).

Key details

  • Their growing popularity has led to dedicated marketplaces resembling mobile app stores, as well as automated scanners that assess whether skills are benign or malicious.
  • However, scanner reports from individual marketplaces classify up to 46.8% of skills as malicious, raising concerns about false positives.
  • We present the largest empirical security analysis of the AI agent skill ecosystem to date.
  • We collect 238,180 unique skills from three major distribution platforms and GitHub, and analyze their contents, behavior, and repository context.

Results & evidence

  • Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality (arXiv:2603.16572v2).
  • However, scanner reports from individual marketplaces classify up to 46.8% of skills as malicious, raising concerns about false positives.
  • We collect 238,180 unique skills from three major distribution platforms and GitHub, and analyze their contents, behavior, and repository context.

Limitations / unknowns

  • Scanner reports from individual marketplaces classify up to 46.8% of skills as malicious, raising concerns about false positives.
  • Overall, our findings provide a more robust view of the agent-skill ecosystem's current risk surface and highlight the need for context-aware security evaluation.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Open-source AI Sales Agent with Next.js 15 and Ollama – zero API costs

Signal 8.4 Novelty 6.2 Impact 2.6 Confidence 7.5 Actionability 3.5

Summary: Open-source AI Sales Agent with Next.js 15 and Ollama – zero API costs

  • What happened: Open-source AI Sales Agent with Next.js 15 and Ollama – zero API costs
  • Why it matters: Could materially affect near-term AI workflows.
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

Open-source AI Sales Agent with Next.js 15 and Ollama – zero API costs

What's new

Open-source AI Sales Agent with Next.js 15 and Ollama – zero API costs

Key details

  • Open-source AI Sales Agent with Next.js 15 and Ollama – zero API costs

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Show HN: AERF, signed receipts for AI agent actions

Signal 8.4 Novelty 5.1 Impact 2.8 Confidence 7.5 Actionability 3.5

Summary: Show HN: AERF, signed receipts for AI agent actions

  • What happened: Show HN: AERF, signed receipts for AI agent actions
  • Why it matters: Could materially affect near-term AI workflows.
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

Show HN: AERF, signed receipts for AI agent actions

What's new

Show HN: AERF, signed receipts for AI agent actions

Key details

  • Show HN: AERF, signed receipts for AI agent actions

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Open-source real company briefs to practice AI-native building and get hired

Signal 8.4 Novelty 5.1 Impact 2.6 Confidence 7.5 Actionability 3.5

Summary: Open-source real company briefs to practice AI-native building and get hired

  • What happened: Open-source real company briefs to practice AI-native building and get hired
  • Why it matters: Could materially affect near-term AI workflows.
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

Open-source real company briefs to practice AI-native building and get hired

What's new

Open-source real company briefs to practice AI-native building and get hired

Key details

  • Open-source real company briefs to practice AI-native building and get hired

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

Signal 7.3 Novelty 4.0 Impact 2.0 Confidence 3.0 Actionability 5.2

Summary: Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

  • What happened: Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
  • Why it matters: Could materially affect near-term AI workflows.
  • What to do: Track for corroboration and benchmark data before adopting.
Deep

Context

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

What's new

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

Key details

  • Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

Results & evidence

  • No hard numbers surfaced in the source text; treat claims as directional until benchmarks appear.

Limitations / unknowns

  • Generalization outside curated tasks is still unclear.

Next-step validation checks

  • Reproduce one claim with a public baseline and fixed evaluation settings.
  • Check robustness on out-of-distribution or long-context cases.
  • Track whether independent teams report matching results.