Benchmarks

Performance comparisons and evaluations

108 articles

AI & ML

Even (very) noisy LLM evaluators are useful for improving AI agents

One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Disentangling Language Roles in Multilingual LLM Task Execution

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

Training AI chatbots to be warm and empathetic makes them less factually accurate

companies are cutting junior roles over AI while admitting they cant prove AI ROI yet. anyone else notice this tension?

galilai-group/stable-worldmodel — A platform for reproducible world model research and evaluation

I Renovated My Apartment With AI. Here's What Came Out of It

Making LLMs tell you how confident they really are through probe-targeted fine tuning. [R]

Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]

Researchers let AI models run a simulated society. Claude was the safest—and Grok committed 180 crimes and went extinct within 4 days

Was some of the recent anti-AI push beneficial to big corporations?

Blaming the model won't fix your workflow — a white paper on structural enforcement for AI agents

I integrated a local Llama 3.2 model to act as a dynamic Dungeon Master in my indie RPG.

We built a public archive of AI failure patterns. The ones that keep coming back after changes.

Ok, maybe I'll pay for Meta Premium

Looking for an AI image generator, what's the best one

Leaving performance on the table

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Mind Your Tone: Does Tone Alter LLM Performance?

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

The mysterious Hy3 LLM is topping OpenRouter Model Rankings by a large margin

2027 Audi RS5 first drive: A performance PHEV with split personalities

Sources: Amazon has shut down an internal leaderboard that tracked employees' use of AI tools after workers tried to boost their scores with needless tasks

[D] Where do you go for serious AI research discussion online? [D]

AI Adoption Issue Debugging

Confidence Scores for Exam Questions

Chase the next new thing or lock-in on one ecosystem?

Boos, AI-washing, and 'low-value human capital': The psychological traps CEOs are falling into when they botch their AI messaging

[D] Self-Promotion Thread

[D] Monthly Who's Hiring and Who wants to be Hired?

A new dataset with more than 100M hi-quality, curated images, with captions and meta data! [P]

AI-generated CUDA kernels silently break training and inference [R]

Wall-OSS-0.5: 4B VLA with open training code and zero-shot real-robot evaluation [D]

ACM MM 2026 review discussion [D]

Kept context-switching between arxiv, OpenReview, GitHub, and HuggingFace for every paper, so I built this. Chrome extension + website with everything inline, plus citation graph + SPECTER2 neighbors. 3M papers, free, feedback welcome [P]

Training GPT-like model on non-language series [R]

STEM PhD's transitioning to MLE/Data [R]

BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison [R]

I used the N.E.A.T algorithm to teach AI how to control a worm in my game in making! It uses evolution to improve. [P]

Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA [R]

Profiling PyTorch training without accidentally stalling the GPU [D]

EMA-Gated Temporal Sequence Compression in Vision Transformers [P]

[R] GNN Model For Fraud Detection Isn't Performing Well [R]

UK GDPR Small Business Q&A — 5,000 synthetic pairs with article-level citations [D]

Should I attend ICML as a junior? [D]

[R] What 1000+ Harness Experiments Taught Me About Self-Improving Agents [R]

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

Cross-species RSA: same learning rules (BP, PC, STDP, FA) tested against both human fMRI and macaque electrophysiology [P]

Best Text to Text Translation Model? [D]

Physics Informed Neural Networks for damped harmonic oscillator and Burger's Equation (with extrapolation analysis) [P]

[D] Is IEEE Workshop on Machine Learning for Signal Processing Reputable? [D]

The OpenClaw crisis is the most complete case study of agentic AI security failure. Here's the full timeline and technical breakdown.

Bigger rewards dramatically speed up learning in the brain

Best Video Generators for Your Workflow

How does the economy work if everyone gets laid off and human jobs disappear?

Nothing is real anymore. We are reaching the point where crowd scenes can be entirely generated by AI.

Experiment to see what happens when you let AI models run the world

Things that AI cannot do which are surprising.

Nobody on the internet knows if you are a human

I gave my AI agents email instead of better reasoning. They started fixing each other's bugs.

Adding agentic AI to an existing search app without replacing anything

Recommended NotebookLM alternatives

Why do calm AI conversations sometimes feel less exhausting than social media?

IGADA-IoT: IoT Sensor Energy Optimization in Wireless Sensor Networks Driven by Automatic Data Augmentation

$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference

Federated Learning for Multivariate Time Series Anomaly Detection in Industrial Automation

Evaluating Local Explainability Metrics for Machine Learning Models on Tabular Data

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

A new register allocator for ZJIT

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, and says GPT-5.5 is the leader at 70%

Constraint acquisition needs better benchmarks

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

JobBench: Aligning Agent Work With Human Will

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Initial benchmarks show Nvidia's Vera CPU, which features 88 in-house-designed Olympus cores, packs a heavy-hitting punch, beating Intel's and AMD's x86_64 CPUs

DeepSWE: A contamination-free benchmark for long-horizon coding agents

Introducing DoomBench - Can Your Data Stack Run DOOM?

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

Riemannian Archetypal Analysis: Interpretable non-linear data analysis on deformed star distributions

Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

Hopfield Memory in VLA [R]

Social Simulation with LLMs - Fidelity in Applications (CFP @ COLM'26) [R]

Google reached AGI ?🚨🚨
