Models

LLMs, foundation models, and model releases

125 articles

AI & ML

Sources: Microsoft is working on an app that will include GitHub Copilot, Copilot chat, Copilot Cowork, and a new agentic workflow tool called Autopilot

Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

Label-Free Reinforcement Learning via Cross-Model Entropy

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

Enhancing LLM Medical Coding with Structured External Knowledge

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

Learning to Translate from Soft to Hard LLM Prompts

Disentangling Language Roles in Multilingual LLM Task Execution

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

We should be more tired than the model

Show HN: Compile-time model-id validation with declared capability

anthropics/claude-code — Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and he

cursor/plugins — Cursor plugin specification and official plugins

run-llama/liteparse — A fast, helpful, and open-source document parser

galilai-group/stable-worldmodel — A platform for reproducible world model research and evaluation

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

OpenAI says it has briefed the White House on its new biodefense program, which uses GPT-Rosalind to help develop biodefense and pandemic preparedness tools

Making LLMs tell you how confident they really are through probe-targeted fine tuning. [R]

Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]

Social Simulation with LLMs - Fidelity in Applications (CFP @ COLM'26) [R]

Researchers let AI models run a simulated society. Claude was the safest—and Grok committed 180 crimes and went extinct within 4 days

Blaming the model won't fix your workflow — a white paper on structural enforcement for AI agents

I integrated a local Llama 3.2 model to act as a dynamic Dungeon Master in my indie RPG.

Claude Code – Everything You Can Configure That the Docs Don't Tell You

Python utility package for building Claude Code hooks

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Mind Your Tone: Does Tone Alter LLM Performance?

Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration

ReasonOps: Operator Segmentation for LLM Reasoning Traces

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

Microsoft overhauled Copilot's design in Microsoft 365 with a minimalist black-and-white, text-focused interface aimed at creating a more consistent experience

AI researchers ran 15-day simulations of worlds governed by different AI models: Claude Sonnet 4.6 recorded zero crime, while Gemini 3 Flash had the most at 683

The mysterious Hy3 LLM is topping OpenRouter Model Rankings by a large margin

LLMs believe false statements even after explicit warnings that they're false

Various LLM Smells

Apple working to cram massive Gemini model into iPhone to power new Siri

Microsoft 365 Copilot gets a speed boost and cleaner design

The CFTC moves to vacate a $5M settlement with Gemini, reversing a Biden-era enforcement action, following a lobbying campaign by the Winklevoss twins

Training GPT-like model on non-language series [R]

GNN Model For Fraud Detection Isn't Performing Well [R]

Best Text to Text Translation Model? [D]

Tuning LLVM's SLP Vectorizer Cost Model

About LLMs at Zig Days

Dynamic Workflows in Claude Code

Claude’s new model is more ‘honest’ when it messes up

Claude Opus 4.8

Sneak peek at new Siri app reveals Apple’s plans to take on ChatGPT and more

These new iOS 27 renders hint at Siri’s big redesign

A Simple State Space Model Excels at Multivariate Time Series Classification

Gradient Transformer: Learning to Generate Updates for LLMs

Faster Thermal Profiling of a Lunar Rover with Machine Learning Adapted Finite Difference Model

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

Sources: at WWDC, Apple is likely to showcase how 15 years of designing custom silicon chips gives it an advantage in local AI, using a distilled Gemini model

Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

IBM and Red Hat commit $5B to establish a new model for open-source software, dubbed Project Lightwell, and will deploy 20,000 engineers, supported by AI

unclecode/crawl4ai — 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://disc

OpenMOSS/MOSS-TTS — MOSS‑TTS Family is an open‑source speech and sound generation model family from MOSI.AI and the Open

EveryInc/compound-engineering-plugin — Official Compound Engineering plugin for Claude Code, Codex, Cursor, and more

London- and SF-based Orbital Industries, which uses its Orb model to design advanced materials and then sell them directly, raised a $50M Series B led by Plural

Mistral says it is accelerating superintelligence development to ensure Europe's independence from US tech giants, and signs deals to supply Airbus and BMW

Qwen3.7-Max Ran for 35 Hours on Unknown Hardware and Achieved a 10× Speedup

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Soro: A Lightweight Foundation Model and Chatbot for Tajik

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Voluntary Collusion with Secret Tools in Competing LLM Agents

DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

A Policy-Driven Runtime Layer for Agentic LLM Serving

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

A Query Engine for the Agents

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Why Ctrl+V won't paste images in Claude Code on WSL, with a fix

The CFTC files alongside Gemini to nullify Gemini's $5M settlement in January 2025, arguing that the agency's current management wouldn't have pursued the case

Getting Claude to extract data from a 1997 football manager game

Chachamaru127/claude-code-harness — Claude Code Dedicated Development Harness - Achieving High-Quality Development Through an Autonomous

Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction

Gemini, Gophers, and Fingers. Oh My Alternative Internets Beyond HTTPS

harry0703/MoneyPrinterTurbo — Generate high-definition short videos with one click using AI large language models.

ElevenLabs’s new music generation model can switch genres mid-track

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

Co-folding model guided by structural proteomics

The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding

Biohub, the Mark Zuckerberg and Priscilla Chan-funded institute, releases a protein-structure prediction model and more, calling it “a world model” of proteins

Yeachan-Heo/oh-my-claudecode — Teams-first Multi-agent orchestration for Claude Code

Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, and says GPT-5.5 is the leader at 70%

Prompt Politeness Affects LLM Accuracy

Claude Code as a Daily Driver: Claude.md, Skills, Subagents, Plugins, and MCPs

A deep dive into how Claude Code and OpenClaw unleashed the AI agent revolution that is rapidly transforming the modern computing landscape

Q&A with Claude Code creator and head Boris Cherny on how the title “software engineer” is disappearing, why AI may create more jobs than it destroys, and more

Can LLMs Introspect? A Reality Check

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

Mixture of Complementary Agents for Robust LLM Ensemble

Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Eagle 3.1: Collaboration Between the EAGLE Team, vLLM Team, and TorchSpec Team

Amazon fulfillment competitor Stord raises $250M at $3B valuation

Rsync maintainer starts using Claude, regressions mount

Notes from the Mistral AI Now Summit in Paris

Flathub disallows LLM-based submissions

Even (very) noisy LLM evaluators are useful for improving AI agents

Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents
