Machine Learning Interatomic Potentials: Achieving Accurate Force Calculations for Materials and Biomolecular Research

Grayson Bailey Dec 02, 2025 438

Machine learning interatomic potentials (MLIPs) have emerged as a transformative technology, bridging the gap between the high accuracy of quantum mechanical methods and the computational efficiency of classical force fields.

Machine Learning Interatomic Potentials: Achieving Accurate Force Calculations for Materials and Biomolecular Research

Abstract

Machine learning interatomic potentials (MLIPs) have emerged as a transformative technology, bridging the gap between the high accuracy of quantum mechanical methods and the computational efficiency of classical force fields. This article provides a comprehensive overview for researchers and drug development professionals on how MLIPs enable accurate force and energy calculations, which are fundamental to reliable atomistic simulations. We explore the foundational principles of MLIPs, including the critical role of symmetry-equivariant architectures and advanced material descriptors. The review covers state-of-the-art methodologies and their applications across diverse domains, from electrocatalyst design to biomolecular modeling. We address key challenges in model training, uncertainty quantification, and transferability, while presenting rigorous validation frameworks and comparative performance benchmarks. By synthesizing the latest advances, this article serves as an essential guide for leveraging MLIPs to accelerate discovery in computational materials science and drug development.

Understanding Machine Learning Interatomic Potentials: From Quantum Mechanics to Accurate Force Fields

Computational scientists and materials researchers face a persistent challenge: choosing between the high accuracy of quantum mechanical methods like Density Functional Theory (DFT), the efficiency of classical force fields (FFs), or compromising on both. This trilemma has constrained progress in fields ranging from drug development to materials design, where simulating realistic systems at experimental conditions requires both quantum accuracy and nanoscale temporal/spatial scalability [1]. Machine Learning Interatomic Potentials (MLIPs) have emerged as a transformative technology that bridges this divide by learning the quantum-mechanical potential energy surface (PES) from reference data, then enabling simulations with near-quantum accuracy at classical force field computational costs [2] [3].

The fundamental innovation of MLIPs lies in their data-driven approach. Unlike classical FFs with fixed mathematical forms, MLIPs use flexible machine learning models to map atomic configurations to energies and forces, effectively "learning physics from data" while preserving essential physical symmetries [1]. This paradigm shift enables researchers to capture complex atomic interactions—including bond formation/breaking and subtle non-covalent forces—that were previously inaccessible to efficient simulation methods [3]. For pharmaceutical researchers, this technology enables accurate modeling of molecular crystals, protein-ligand interactions, and drug formulation stability with unprecedented fidelity to quantum mechanical benchmarks.

Theoretical Foundation: How MLIPs Reconcile Accuracy and Efficiency

The Potential Energy Surface Framework

The concept of the Potential Energy Surface (PES) is fundamental to understanding atomistic simulations. The PES represents the total energy of a system as a function of its atomic coordinates, providing the foundation for determining stable structures, reaction pathways, and dynamical evolution [3]. In molecular dynamics based on Newton's laws, the force on each atom is derived from the PES through the relation ( Fi = -\partial E/\partial ri ), where the force is the negative gradient of the potential energy ( E ) with respect to atomic position ( r_i ) [3].

Traditional methods for constructing PES face significant limitations. Quantum mechanical (QM) approaches, particularly DFT, provide accurate PES descriptions but scale poorly with system size (typically O(N³) or worse), restricting applications to small systems containing hundreds of atoms [1] [3]. Classical force fields use simplified functional forms to enable rapid energy calculations but lack transferability and accuracy for complex chemistries, particularly when bond formation/breaking occurs [1] [3]. MLIPs resolve this by training on QM reference data to create surrogate models that maintain high accuracy while achieving computational efficiency comparable to classical MD [1].

Architectural Innovations in MLIPs

Modern MLIP architectures incorporate several key innovations that enable their high performance:

Geometric Equivariance: State-of-the-art MLIPs explicitly embed physical symmetries (rotational, translational, and sometimes reflectional invariance) directly into their network architectures [1]. Equivariant layers maintain internal feature representations that transform correctly under symmetry operations, ensuring that scalar predictions (e.g., total energy) remain invariant while vector targets (e.g., forces) exhibit proper equivariant behavior [1]. This approach parallels classical multipole theory in physics, encoding atomic properties as monopole, dipole, and quadrupole tensors and modeling their interactions via tensor products [1].

Graph Neural Networks (GNNs): GNN-based MLIPs represent atomic systems as graphs, where atoms constitute nodes and chemical bonds form edges. Message-passing operations between connected nodes enable effective learning of local chemical environments without handcrafted descriptors [1]. Frameworks such as NequIP and MACE leverage these architectures to achieve superior data efficiency and accuracy across diverse materials systems [1] [4].

Foundation Models for Chemistry: Recent efforts have developed pre-trained MLIPs on large DFT datasets, creating foundation models that qualitatively reproduce PES across broad portions of the periodic table [4]. These models can be fine-tuned to high accuracy for specific applications with minimal additional data, dramatically reducing the computational cost of MLIP development for specialized applications [4].

Quantitative Performance Comparison

Table 1: Comparative Analysis of Computational Methods for Atomistic Simulation

Method	Accuracy Range	Computational Scaling	Typical System Size	Time Scale	Key Limitations
Quantum Chemistry (CCSD(T))	Very High (Chemical Accuracy)	O(N⁷)	10-100 atoms	Static calculations	Prohibitively expensive for dynamics
Density Functional Theory (DFT)	High (5-20 meV/atom)	O(N³)	100-1,000 atoms	Picoseconds	System size, time scale limitations
Classical Force Fields	Low-Medium (>50 meV/atom)	O(N²)	10⁶-10⁹ atoms	Nanoseconds to microseconds	Limited transferability, inaccurate for reactions
Machine Learning IPs	Medium-High (1-20 meV/atom)	O(N²)	10³-10⁶ atoms	Nanoseconds	Training data requirements, transferability

Table 2: Performance Benchmarks for Specific MLIP Implementations

MLIP Framework	Architecture Type	Reported Energy MAE	Reported Force MAE	Key Applications Demonstrated
DeePMD	Deep Neural Network	<1 meV/atom	<20 meV/Å	Water systems, molecular crystals
MACE	Equivariant GNN	Sub-chemical accuracy	~30 meV/Å	Molecular crystals, pharmaceuticals
NequIP	Equivariant GNN	~1 meV/atom	~15-20 meV/Å	Materials phase transitions
NeuralIL	Neural Network	DFT-level accuracy	DFT-level accuracy	Ionic liquids, charged fluids

The performance data reveals MLIPs' unique position in the computational landscape. With energy mean absolute errors (MAEs) potentially below 1 meV/atom and force MAEs under 20 meV/Å, MLIPs approach the accuracy of the quantum methods on which they're trained while maintaining the O(N²) scaling characteristic of classical force fields [1] [4]. This enables simulations of systems containing thousands to millions of atoms at nanosecond timescales—regimes previously inaccessible to quantum-accurate methods [1].

Application Protocols for Pharmaceutical Research

Protocol 1: Predicting Molecular Crystal Stability

Molecular crystal stability prediction is crucial for pharmaceutical development, as crystal forms dictate drug stability, solubility, and bioavailability [4]. The following protocol enables accurate calculation of sublimation enthalpies—a key thermodynamic property—with sub-chemical accuracy (<4 kJ/mol) relative to experiment:

Step 1: Foundation Model Selection

Begin with a pre-trained foundation model such as MACE-MP-0b3, which has been trained on diverse inorganic crystals from the Materials Project database [4].
Foundation models provide qualitatively accurate PES across broad chemical spaces, serving as optimal starting points for system-specific refinement.

Step 2: Minimal Data Generation

Sample the molecular crystal phase space around the equilibrium volume using short MD simulations at multiple lattice volumes.
Use the foundation model itself to run these initial simulations, avoiding costly ab initio MD during data generation.
Randomly sample approximately 10 structures per volume from the trajectories to create an initial training set of ~200 structures [4].

Step 3: Model Fine-Tuning

Fine-tune the foundation model by optimizing parameters to minimize errors on energy, forces, and stress tensor components.
Employ a balanced loss function that weights energy, force, and stress terms appropriately for the target properties.
Typical training requires 50-100 epochs with batch sizes of 5-10 structures, though this should be adjusted based on dataset size.

Step 4: Validation and Iteration

Test the fine-tuned model on equation of state (energy vs. volume) and vibrational properties.
If accuracy is insufficient, particularly at extremes of volume or temperature, iteratively augment training data with additional structures from regions of poor performance.
Final models should achieve energy root mean square errors (RMSE) below 1 meV/atom relative to target DFT references [4].

Step 5: Thermodynamic Property Calculation

Use the validated MLIP to run isothermal-isobaric (NPT) ensemble simulations at target temperatures and pressures.
Calculate sublimation enthalpies from the difference between crystal and gas-phase free energies, incorporating anharmonicity and nuclear quantum effects via path integral methods when necessary.
For the X23 molecular crystal benchmark, this protocol has achieved experimental agreement with average errors <4 kJ/mol [4].

Figure 1: MLIP Fine-Tuning Workflow for Molecular Crystals

Protocol 2: Modeling Charged Fluids and Ionic Liquids

Pharmaceutical applications increasingly utilize ionic liquids for drug formulation, extraction, and delivery. This protocol enables accurate simulation of these complex charged systems:

Step 1: Reference Data Generation

Perform ab initio molecular dynamics (AIMD) simulations of target ionic liquid systems using validated DFT functionals with appropriate dispersion corrections.
For systems with proton transfer reactions, ensure sufficient sampling of reactive events through enhanced sampling techniques or extended trajectory lengths.
Extract diverse configurations, energies, and atomic forces covering the relevant phase space.

Step 2: Neural Network Potential Training

Employ a neural network architecture such as NeuralIL specifically designed for ionic liquids [5].
Train on energies and forces using a committee of networks to enable uncertainty quantification.
Implement active learning by iteratively retraining with configurations extracted from ML-MD simulations where prediction uncertainty is high [5].

Step 3: Large-Scale Production Simulation

Deploy the trained MLIP for large-scale MD simulations of ionic liquid systems containing hundreds of ion pairs (thousands of atoms).
Run 100+ ps trajectories to access relevant timescales for structural relaxation and transport properties.
For reactive systems, validate that proton transfer events and equilibrium constants match reference AIMD and experimental data [5].

Step 4: Property Extraction and Analysis

Calculate structural properties (radial distribution functions, hydrogen bonding patterns).
Extract dynamic properties (mean square displacements, diffusion coefficients, vibrational densities of states).
Compare with classical force field results to identify systematic improvements, particularly for transport properties typically underestimated by non-polarizable FFs [5].

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Software Tools and Datasets for MLIP Development

Tool/Dataset	Type	Primary Function	Relevance to Pharmaceutical Research
DeePMD-kit	Software Package	DeePMD potential training and deployment	High-performance MLIP for large biomolecular systems
MACE	Software Framework	Equivariant GNN potentials	Molecular crystal modeling with high data efficiency
NeuralIL	Specialized MLIP	Ionic liquid simulations	Drug formulation with ionic liquid excipients
QM9	Benchmark Dataset	134k small organic molecules	MLIP validation for drug-like molecules
MD17/MD22	Training Dataset	Molecular dynamics trajectories	Biomolecular force field training
X23	Benchmark Dataset	23 molecular crystals	Pharmaceutical crystal stability prediction

Table 4: Computational Resources for Different Simulation Types

Simulation Type	Typical Hardware	Time Scale Accessible	System Size Limit	Relative Cost
DFT (AIMD)	HPC Cluster	10-100 ps	100-1,000 atoms	1000× (reference)
Classical MD	Workstation/Small Cluster	ns-μs	10⁶-10⁹ atoms	1×
MLIP-MD	Workstation/Medium Cluster	10-100 ns	10³-10⁶ atoms	10-100×

Future Perspectives and Challenges

Despite rapid progress, MLIP technology faces several important challenges that pharmaceutical researchers should consider:

Data Quality and Efficiency: MLIP accuracy remains fundamentally limited by the quality and breadth of training data. Promising approaches include active learning, where models query reference calculations for configurations with high uncertainty, and multi-fidelity frameworks that combine expensive high-accuracy and cheaper lower-accuracy data [1]. For molecular crystals, recent demonstrations achieving sub-chemical accuracy with only ~200 reference structures represent an order-of-magnitude improvement in data efficiency [4].

Transferability and Generalization: MLIPs typically perform best for systems similar to their training data, struggling with out-of-distribution configurations. Foundation models trained on diverse materials datasets show improved transferability, while physically-constrained architectures that embed known symmetries and conservation laws enhance generalization [1] [4].

Interpretability and Explainability: The "black box" nature of complex MLIP architectures poses challenges for extracting physical insights. Research into interpretable AI techniques is crucial, particularly for pharmaceutical applications where mechanistic understanding is as important as predictive accuracy [1].

Scalability and Integration: As system sizes grow, MLIP computational costs and memory requirements increase significantly. Ongoing work on scalable message-passing architectures and efficient implementations will be essential for applying MLIPs to pharmaceutical-relevant systems such as protein-ligand complexes and amorphous solid dispersions [1].

For the pharmaceutical research community, MLIP technology promises to enable accurate prediction of polymorph stability, drug-excipient compatibility, and formulation performance under realistic conditions. By providing quantum accuracy at classical computational costs, these methods are poised to accelerate drug development and materials design while reducing empirical optimization cycles.

The development of machine learning interatomic potentials (MLIPs) represents a paradigm shift in computational materials science and chemistry, enabling molecular dynamics simulations with near quantum-mechanical accuracy at a fraction of the computational cost. The core architectural principles underpinning these advances center on two complementary concepts: invariant descriptors, which provide symmetry-preserving representations of atomic environments, and equivariant neural networks, which embed physical symmetries directly into model architectures. These principles are fundamental for accurate force field and energy predictions in applications ranging from catalyst design to drug development. This document outlines the core theoretical frameworks, provides structured experimental protocols, and details the essential computational tools required for implementing these approaches in research settings.

Theoretical Framework and Key Concepts

Foundational Descriptors for Atomic Systems

Descriptors transform atomic configurations into mathematical representations suitable for machine learning. They are broadly categorized into three classes based on their construction strategy and physical interpretation [6].

Intrinsic statistical descriptors comprise elemental properties such as atomic number, valence electron count, ionization energy, and electronegativity. Tools like Magpie can generate over 100 such attributes for each element [6]. These descriptors require no quantum mechanical calculations, making them computationally inexpensive for initial high-throughput screening of material spaces, though they may lack detailed physical interpretability.

Electronic structure descriptors encode quantum mechanical properties including orbital occupancies, d-band center (εd), magnetic moments, and charge distributions [6]. For instance, the non-bonding d-orbital lone-pair electron count (Nie-d) has served as an effective descriptor for nitrogen reduction reaction activity [6]. While these descriptors offer direct connections to chemical reactivity, they typically require preliminary density functional theory (DFT) calculations.

Geometric and microenvironmental descriptors capture structural information such as interatomic distances, coordination numbers, local strain patterns, and symmetry functions. Examples include the metal second ionization energy combined with structural parameters like M-O-O triangle areas in metal-organic frameworks [6]. The Atom-Centered Symmetry Functions (ACSF) and Smooth Overlap of Atomic Positions (SOAP) are seminal descriptors in this category, with SOAP particularly effective for predicting grain boundary energy with high accuracy (R² = 0.99) [7].

Table 1: Classification and Performance of Foundational Descriptors

Descriptor Category	Examples	Data Requirements	Computational Cost	Key Applications
Intrinsic Statistical	Magpie attributes, elemental properties	None (elemental data)	Very Low	High-throughput screening of single-atom alloys, dual-atom catalysts [6]
Electronic Structure	d-band center, magnetic moments, orbital occupancies	DFT calculations	High	Catalyst mechanistic studies, reactivity prediction [6]
Geometric/Microenvironmental	SOAP, ACSF, coordination numbers, local strain	Atomic coordinates	Medium	Grain boundary energy prediction, complex materials [7]

The Evolution toward Equivariant Architectures

Early MLIPs relied on invariant descriptors handcrafted to be unchanged under rotation and translation of the input structure. These include bond lengths, angles, and dihedral angles [1]. While effective, these approaches required careful manual feature engineering.

A fundamental advancement came with equivariant neural networks, which explicitly embed physical symmetries into the model architecture rather than just the input features. These networks maintain internal feature representations that transform predictably under symmetry operations like rotations and translations [1]. For example, scalar outputs like energy remain invariant, while vector quantities like forces transform equivariantly [1].

The Multi-ACE framework has emerged as a unifying mathematical construction that connects descriptor-based methods with message-passing neural networks [8]. This framework extends the Atomic Cluster Expansion (ACE) to multiple layers, creating a comprehensive design space that encompasses most equivariant MPNN-based interatomic potentials [8].

Diagram 1: Equivariant architecture workflow for MLIPs (76 characters)

Quantitative Comparison of Architectures and Performance

Performance Benchmarks of MLIP Architectures

Different MLIP architectures exhibit distinct performance characteristics across benchmark datasets. The choice between invariant and equivariant approaches involves trade-offs between accuracy, computational cost, and data efficiency.

Table 2: Performance Comparison of MLIP Architectures on Benchmark Tasks

Architecture	Type	Key Features	Test Error (Energy/Forces)	Computational Efficiency	Notable Applications
DeePMD [1]	Invariant	Deep neural network with local environment descriptors	~1 meV/atom / ~20 meV/Å (water)	High (comparable to classical MD)	Large-scale water simulations [1]
NequIP [8]	Equivariant	Message-passing with equivariant features	State-of-the-art accuracy at release (2× improvement)	Moderate	General molecular dynamics [8]
BOTNet [8]	Equivariant	Simplified, interpretable NequIP variant	Competitive with NequIP	Higher than NequIP	Benchmark molecular datasets [8]
MACE [8]	Equivariant	Multi-ACE framework, tensor decomposition	State-of-the-art (2025)	Optimized via decomposition	Materials and molecules [8]
HIPNN [9]	Equivariant (lmax≥1)	Hierarchical message passing with tensor sensitivity	Chemically accurate (<1 kcal/mol)	Varies with lmax	Molecular property prediction [9]

Algorithm Selection Guidelines for Different Data Regimes

The optimal choice of machine learning algorithm depends significantly on dataset size and feature dimensionality, with tree ensembles generally outperforming in medium-to-large sample regimes, while kernel methods excel with smaller datasets [6].

For medium-to-large datasets (N ≈ 2,600 samples, p ≈ 10 features), tree ensemble methods like Gradient Boosting Regressor (GBR) have demonstrated superior performance, achieving test RMSE of 0.094 eV for CO adsorption energy prediction compared to 0.120 eV for Support Vector Regression (SVR) and 0.133 eV for Random Forest [6].

In small-data regimes (N ≈ 200 samples, p ≈ 10 features), kernel methods like Support Vector Regression with radial basis function kernels can achieve exceptional performance (test R² up to 0.98) when paired with physically-informed features [6].

Experimental Protocols and Implementation

Protocol 1: Developing a Custom Equivariant MLIP

Objective: Implement and train an equivariant machine learning interatomic potential for molecular dynamics simulations of catalytic materials.

Materials and Computational Resources:

Quantum chemistry reference data (energies, forces, stresses)
High-performance computing cluster with GPU acceleration
MLIP framework (e.g., DEEPMD-KIT, NequIP, MACE)

Procedure:

Data Generation and Preparation:
- Perform ab initio molecular dynamics (AIMD) simulations to sample relevant configurations
- Extract energies, atomic forces, and stress tensors from DFT calculations
- Split data into training (80%), validation (10%), and test sets (10%)
- Ensure coverage of relevant chemical and conformational space

Descriptor Selection and Feature Engineering:
- For invariant models: Compute SOAP or ACSF descriptors with optimized parameters
- For equivariant models: Design graph representation with appropriate cutoff radius (typically 5-6 Å)
- Include atomic numbers as node attributes
- Implement periodic boundary conditions for crystalline materials
Model Architecture Configuration:
- Select appropriate equivariant architecture (e.g., MACE, NequIP, BOTNet)
- Set tensor order (l_max) based on accuracy requirements (typically 1-3)
- Configure message-passing layers (2-3 layers often sufficient)
- Implement invariant readout function for energy conservation
Training and Optimization:
- Initialize model parameters following established schemes
- Employ combined loss function: L = λ₁Lenergy + λ₂Lforces + λ₃L_stresses
- Use Adam or stochastic gradient descent optimizer with learning rate decay
- Implement early stopping based on validation set performance
- Train until forces converge to <50-100 meV/Å accuracy
Validation and Deployment:
- Evaluate on test set to assess generalization
- Perform molecular dynamics simulations to verify stability
- Calculate relevant materials properties for experimental comparison
- Deploy for production simulations with appropriate monitoring

Protocol 2: Teacher-Student Knowledge Distillation for Efficient MLIPs

Objective: Create a computationally efficient student MLIP through knowledge distillation from a larger teacher model without sacrificing accuracy [9].

Rationale: Foundation MLIPs with up to 10⁹ parameters have high computational and memory requirements that limit large-scale MD simulations. Teacher-student training enables lighter-weight models with faster inference and reduced memory footprint [9].

Procedure:

Teacher Model Training:
- Train a large, accurate teacher model (e.g., HIPNN with lmax=1) on quantum chemistry data
- Use standard energy and force loss functions: Lteacher = |EDFT - Epred|² + |FDFT - Fpred|²
- Validate teacher model achieving chemical accuracy (<1 kcal/mol)

Student Model Architecture Design:
- Design compact architecture with fewer parameters (30-50% reduction)
- Maintain same fundamental operations but reduced feature dimensions
- Ensure compatible atomic energy decomposition: Etotal = Σεi
Knowledge Distillation Training:
- Generate atomic energy predictions (ε_i) from teacher model on training set
- Train student using combined loss function: Lstudent = λQC(|EDFT - Estudent|² + |FDFT - Fstudent|²) + λKDΣ|εi,teacher - ε_i,student|²
- Use λQC = 0.5, λKD = 0.5 for balanced training
- Employ same optimizer settings as teacher training
Validation and Efficiency Assessment:
- Compare student accuracy to teacher on test set
- Benchmark inference speed and memory usage
- Verify stability in molecular dynamics simulations
- Deploy efficient student model for large-scale production runs

Diagram 2: Teacher-student knowledge distillation workflow (76 characters)

Key Software Frameworks and Descriptor Tools

Table 3: Essential Computational Tools for MLIP Development

Tool/Resource	Type	Primary Function	Key Features	Reference/Citation
DEEPMD-Kit [1]	Software Framework	MLIP implementation and training	Deep potential molecular dynamics, high efficiency	[1]
SOAP [7]	Descriptor	Atomic environment representation	Smooth overlap, high accuracy for structures	[7]
ACE [8] [7]	Descriptor & Framework	Atomic cluster expansion	Unified framework, complete basis	[8] [7]
NequIP [8]	Software Framework	Equivariant MLIP implementation	Message-passing, equivariant features	[8]
MACE [8]	Software Framework	Equivariant MLIP	Multi-ACE layers, state-of-the-art accuracy	[8]
HIPNN [9]	Software Framework	Message-passing neural network	Tensor sensitivity, teacher-student compatible	[9]

Benchmark Datasets for MLIP Development

Table 4: Key Benchmark Datasets for Training and Validation

Dataset	System Type	Size	Key Properties	Applications
QM9 [1]	Small organic molecules	134k molecules	Energies, HOMO/LUMO, dipoles	Molecular property prediction
MD17 [1]	Molecular dynamics trajectories	~3-4M configurations	Energies, forces	MLIP training and validation
MD22 [1]	Biomolecular fragments	0.2M configurations	Energies, forces	Large molecule MLIP testing

The architectural principles of invariant descriptors and equivariant neural networks have fundamentally transformed the landscape of machine learning interatomic potentials. Invariant descriptors provide physically meaningful representations of atomic environments, while equivariant networks embed physical symmetries directly into learning architectures, enabling unprecedented accuracy in force field predictions. The emerging Multi-ACE framework offers a unifying mathematical foundation that connects these approaches, facilitating the development of next-generation MLIPs.

Future developments will likely focus on improving data efficiency through techniques like teacher-student training, extending to more complex physical phenomena including electronic properties and magnetic interactions, and enhancing interpretability to provide deeper physical insights. As these architectures continue to mature, they will enable increasingly accurate and efficient simulations across materials science, chemistry, and drug development, accelerating the discovery of novel materials and therapeutic compounds.

The accurate prediction of interatomic forces is a cornerstone of modern computational chemistry and materials science, enabling the exploration of molecular dynamics, catalyst design, and drug discovery. Machine learning interatomic potentials (MLIPs) have emerged as transformative tools that bridge the gap between computationally expensive quantum mechanical methods like density functional theory (DFT) and efficient but often inaccurate classical force fields [1]. A fundamental challenge in developing robust MLIPs lies in ensuring their predictions respect the underlying physical laws governing atomic systems, particularly their behavior under spatial transformations. This is where the mathematical principles of SO(3) (rotation), SE(3) (rotation and translation), and E(3) (rotation, translation, and reflection) equivariance become critical [10].

Integrating these symmetries directly into neural network architectures ensures that model outputs transform consistently with their inputs. For instance, rotating a molecular system should correspondingly rotate the predicted force vectors, while the scalar energy should remain unchanged [11] [1]. Early MLIPs relied on invariant features such as interatomic distances and angles, which preserved symmetry but often lacked the expressive power to unambiguously describe complex local atomic environments [11]. The advent of equivariant models represents a paradigm shift, actively exploiting geometric symmetries to achieve richer representations, superior data efficiency, and enhanced prediction accuracy for both scalar (energy) and vector/tensor (forces, dipole moments) properties [11] [1] [10].

Fundamental Symmetry Groups and Their Physical Interpretation

In the context of molecular systems, physical symmetries are described by specific mathematical groups whose actions on 3D space dictate how physical observables must transform.

E(3) - The Euclidean Group: This group comprises all isometries of three-dimensional Euclidean space, including translations t ∈ ℝ³, rotations R ∈ SO(3), and reflections (where det R = ±1). An element g = (R, t) acts on a point x as g·x = R x + t [10]. E(3)-equivariance is fundamental for ensuring that all predictions transform consistently with physical laws under any rigid-body transformation of the entire system.
SE(3) - The Special Euclidean Group: A subgroup of E(3), SE(3) includes all rotations and translations but excludes reflections. This is particularly relevant for modeling chiral molecules and other systems where reflection symmetry may not hold.
SO(3) - The Special Orthogonal Group: This group contains all rotations about the origin in 3D space, without translations or reflections. SO(3)-equivariance is crucial for handling vectorial outputs like forces, which must rotate in the same way as the input coordinates [12].

A function is deemed equivariant if applying a group transformation to its input is equivalent to applying a corresponding transformation to its output. Formally, for a function f : X → Y and all g in a group G, equivariance satisfies: D_Y[g] f(x) = f(D_X[g] x), where D_X and D_Y are the group representations on the input and output spaces, respectively [10]. In practice, this means that for a rotated molecular configuration, an equivariant model will predict forces that are rotated accordingly, and energies that remain invariant.

Current Architectural Paradigms for Embedding Equivariance

Modern approaches to building equivariant MLIPs can be broadly categorized into three paradigms, each with distinct advantages and implementation strategies. The table below summarizes the core methodologies and representative models.

Table 1: Paradigms for Achieving Equivariance in Machine Learning Interatomic Potentials

Paradigm	Core Methodology	Key Advantages	Representative Models
Hard-Wired Equivariance	Built-in architectural constraints using irreducible representations & tensor products [10].	Guarantees symmetry; high data efficiency; state-of-the-art accuracy [1] [10].	NequIP [1] [10], MACE [11] [13], MACE-MP-0 [13]
Scalar-Vector Dual Representations	Uses separate but interacting scalar and vector features to maintain SE(3)-equivariance [11].	Balances performance and computational cost; more efficient than high-order tensor models [11].	E2GNN [11], PaiNN [11], NewtonNet [11]
Learned Equivariance	Uses generic architectures (e.g., Transformers) with contrastive learning to steer models toward equivariance [12].	Retains hardware efficiency and flexibility of standard models; no complex equivariant layers required [12].	TransIP [12]

The following diagram illustrates the core architectural logic and data flow shared by many equivariant GNNs for interatomic potential prediction.

Quantitative Performance of Equivariant Models

The adoption of equivariant architectures is driven by their demonstrably superior performance and data efficiency compared to invariant models. The following table synthesizes key quantitative results reported across multiple studies.

Table 2: Performance Metrics of Equivariant and Invariant Machine Learning Interatomic Potentials

Model / Framework	Architecture Type	Key Performance Results	Reference / Dataset
NequIP	E(3)-Equivariant GNN	Matched/surpassed state-of-the-art accuracy with 100-1000x less training data vs. non-equivariant methods [10].	[10]
E2GNN	Efficient SE(3)-Equivariant GNN	Consistently outperformed representative baselines; achieved ab initio MD accuracy in solid, liquid, and gas systems [11].	Catalysts, Molecules, Organic Isomers [11]
TransIP	Transformer with Learned Equivariance	Attained comparable performance to state-of-the-art equivariant baselines; 40-60% improvement over data augmentation baseline [12].	Open Molecules (OMol25) [12]
MLIP for α-Fe	Machine-Learned Potential (MTP)	Predicted average grain boundary energy of 1.57 J/m², showing excellent agreement with experimental predictions [14].	α-Fe Polycrystals [14]
MACE-MP-0	Foundation Equivariant Model	Enabled high-throughput prediction of heat capacities for porous materials with accuracy comparable to bespoke ML models in a zero-shot manner [13].	Porous Materials [13]

Application Notes and Experimental Protocols

Protocol 1: Implementing an Equivariant GNN for Force Field Training

This protocol details the procedure for training an equivariant graph neural network for force and energy prediction, based on implementations such as E2GNN [11] and NequIP [10].

1. Data Preparation and Preprocessing

Input Data: Gather reference data from ab initio molecular dynamics (AIMD) trajectories or density functional theory (DFT) calculations. Essential data includes atomic coordinates (ℝ^(M×3)), atomic numbers (Z), total energies (ℝ), and atomic forces (ℝ^(M×3)), where M is the number of atoms [11] [1].
Graph Construction: Represent the atomic system as a graph 𝒢=(𝒱,ℰ,ℛ). Nodes (v_i ∈ 𝒱) correspond to atoms. Connect nodes with edges (ℰ) if the interatomic distance is within a predefined cutoff radius (e.g., 5 Å), ensuring a maximum number of neighbors N for computational efficiency [11].

2. Feature Initialization

Scalar Features: Initialize node scalar features x_i^(0) as a trainable embedding vector dependent on the atomic number z_i: x_i^(0) = E(z_i) ∈ ℝ^F [11].
Vector Features: Initialize node vector features x⃗_i^(0) to zero: x⃗_i^(0) = 0 ∈ ℝ^(F×3) [11].

3. Model Architecture and Training Configuration

Equivariant Layers: Implement a series of equivariant interaction layers. Each layer should perform operations such as local message passing and update functions that preserve SO(3) equivariance [11] [10].
Loss Function: Use a combined loss function that optimizes for both energy and forces: ℒ = λ_E ∥E_hat − E∥² + λ_F (1/(3N)) ∑_i^N ∑_α ∥F_hat_iα − F_iα∥² [10]. This ensures physical consistency, as forces are the negative gradient of the energy with respect to atomic positions.
Radial Basis Functions: Employ Gaussian radial basis functions (λ_h, λ_u, λ_v) to encode interatomic distances, which are then used in the message-passing steps [11].

4. Validation and Deployment

Validation: Evaluate the trained model on hold-out test sets from the AIMD trajectory, calculating mean absolute error (MAE) for energies and forces.
Molecular Dynamics Simulation: Integrate the trained potential into MD simulation packages (e.g., LAMMPS). Run simulations in target ensembles (NVE, NVT) to validate the model's stability and its ability to reproduce structural and dynamic properties [11].

Protocol 2: Enhancing Generalization with Physics-Informed Sharpness-Aware Minimization (PI-SAM)

This protocol is designed for semiconductor materials and other challenging systems with out-of-distribution atomic configurations, based on the PI-SAM framework [15].

1. Base Model and Data Setup

Select a suitable GNN architecture (invariant or equivariant) as the base model for predicting energies and forces [15].
Prepare the training dataset, ensuring it includes diverse atomic configurations that reflect the target application's possible states, including defects and non-equilibrium structures if applicable.

2. Physics-Informed Regularization

Energy-Force Consistency Loss: Add a regularization term that penalizes the discrepancy between predicted forces and the negative gradient of the predicted energy with respect to atomic positions. This explicitly enforces the physical relationship F = −∇E [15].
Potential Energy Surface (PES) Curvature Regularization: Incorporate a term that encourages the model to learn the correct curvature of the PES, which is related to vibrational properties and stability [15].

3. Sharpness-Aware Minimization (SAM) Optimization

During training, the SAM optimizer seeks parameters that minimize not only the loss value but also the loss "sharpness." It does this by considering worst-case perturbations of the model weights within a small neighborhood [15].
The combined PI-SAM loss function is: ℒ_PI-SAM = ℒ_Base + λ_EF * ℒ_EnergyForce + λ_PES * ℒ_PES + ρ * SAM_Perturbation, where λ_EF and λ_PES are regularization coefficients [15].

4. Evaluation on Out-of-Distribution (OOD) Data

Test the final model on a dedicated OOD test set containing atomic configurations not seen during training, such as novel defect structures or surfaces. Compare its performance against a model trained without PI-SAM to quantify the improvement in generalization [15].

The workflow for this protocol, integrating both physical constraints and advanced optimization, is outlined below.

This section catalogs critical datasets, software, and model frameworks necessary for research and development in equivariant MLIPs.

Table 3: Essential Resources for Equivariant MLIP Research

Resource Name	Type	Primary Function	Access / Reference
QM9	Dataset	Benchmarking model performance on stable small organic molecules (134k molecules) [1].	https://figshare.com/collections/Quantumchemistrystructuresandpropertiesof134kilomolecules/978904
MD17/MD22	Dataset	Training and validation on molecular dynamics trajectories of organic molecules and biomolecular fragments [1].	http://quantum-machine.org/datasets/#md-datasets
e3nn / E3x	Software Library	Provides abstractions for building E(3)-equivariant networks, managing irreducible representations and Clebsch-Gordan algebra [10].	[10]
LAMMPS	Simulation Software	High-performance molecular dynamics simulator; supports integration of various MLIPs for large-scale simulations [14].	[14]
MACE-MP-0	Foundation Model	Pretrained equivariant potential (MACE architecture) for zero-shot property prediction on diverse materials [13].	[13]
WANDER	Dual-Functional Model	A physics-informed neural network capable of predicting both atomic forces (like a force field) and electronic structures [16].	[16]

Advanced Applications and Future Outlook

The impact of equivariant models extends far beyond simple force prediction. They are now being applied to predict complex tensorial material properties such as dielectric constants, piezoelectric tensors, and elasticity tensors with state-of-the-art accuracy by decomposing these tensors into their spherical harmonic components [10]. In drug discovery and biophysics, equivariant GNNs like VN-EGNN are being used to identify protein binding sites, leveraging their ability to handle 3D molecular structures robustly [10].

A promising frontier is the development of multi-functional models that bridge the gap between force fields and electronic structure calculations. Frameworks like WANDER (Wannier-based dual functional model for simulating electronic band and structural relaxation) exemplify this trend. By using a deep potential molecular dynamics backbone and sharing information with a Wannier Hamiltonian module, WANDER can simultaneously predict atomic forces and electronic band structures, marking a significant step toward machine-learning models that offer multiple functionalities of first-principles calculations [16].

Future research will likely focus on improving the computational efficiency of equivariant models to enable simulations of even larger systems, extending these approaches to more complex symmetries and conservation laws, and enhancing interpretability to glean new physical insights from the learned representations [1] [10]. The integration of equivariant MLIPs into automated, multi-scale simulation workflows holds the potential to dramatically accelerate the design of new molecules and advanced materials.

Universal Machine Learning Interatomic Potentials (uMLIPs) represent a transformative advancement in computational materials science, enabling accurate atomistic simulations across wide spans of the periodic table. These foundational models learn the mapping from atomic configurations to energies and forces from quantum mechanical data, achieving near-ab initio accuracy at a fraction of the computational cost of density functional theory (DFT) calculations [1]. The shift from system-specific potentials to universal models has been facilitated by innovative graph neural network architectures, the accumulation of large-scale DFT databases, and advanced training protocols [17] [1]. This progress has positioned uMLIPs as powerful tools for predicting diverse materials properties, from thermodynamic stability to vibrational spectra. However, significant challenges remain in their generalization to out-of-distribution regimes, data fidelity requirements, and computational scalability [1]. This article examines the current state of uMLIP development, benchmarks their performance across key applications, and provides detailed protocols for their evaluation and improvement.

Performance Benchmarks and Application-Specific Accuracy

Phonon Property Predictions

The accuracy of uMLIPs in predicting harmonic phonon properties—fundamental to understanding thermal and vibrational behavior—has been systematically evaluated using a dataset of approximately 10,000 ab initio phonon calculations [17]. As shown in Table 1, performance varies considerably across models, with some achieving high accuracy while others exhibit substantial errors despite excelling at energy and force predictions near equilibrium configurations [17].

Table 1: Performance of uMLIPs on phonon properties and structural relaxation

Model	Phonon Prediction Accuracy	Geometry Relaxation Failure Rate (%)	Architecture Type	Forces as Energy Gradients
M3GNet	Moderate	~0.20%	Three-body graph network	Yes
CHGNet	Moderate	0.09%	Graph network	Yes
MACE-MP-0	High	~0.20%	Atomic cluster expansion	Yes
SevenNet-0	Moderate	~0.20%	Equivariant (NequIP-based)	Yes
MatterSim-v1	High	0.10%	M3GNet-based with active learning	Yes
ORB	Variable	High	Smooth overlap + graph network	No
eqV2-M	High	0.85%	Equivariant transformer	No

Notably, models that predict forces as separate outputs rather than as exact derivatives of the energy (ORB and eqV2-M) demonstrate higher failure rates in geometry relaxations, often due to high-frequency errors that prevent convergence [17]. This highlights a critical architectural consideration for uMLIP developers.

High-Pressure Applications

uMLIP performance deteriorates under extreme pressure conditions (0-150 GPa) due to limitations in training data coverage rather than algorithmic constraints [18]. Benchmark studies reveal that while these models excel at standard pressure, their predictive accuracy declines as pressure increases, manifested by inaccurate predictions of compressed bond lengths and volumes per atom [18]. As illustrated in Table 2, targeted fine-tuning on high-pressure configurations can significantly restore model robustness, underscoring the importance of representative training data.

Table 2: uMLIP performance under high-pressure conditions

Pressure (GPa)	First-Neighbor Distance Range (Å)	Volume per Atom Range (Å³)	Typical uMLIP Performance
0	0.74 - ~5.0	10-40 (with tail >100)	Excellent
25	Narrowing	Narrowing	Good
50	Narrowing	Narrowing	Moderate
100	Narrowing	~20	Declining
150	0.72 - ~3.3	~20	Poor

Experimental Protocols and Methodologies

DIRECT Sampling for Robust Training Set Selection

The DImensionality-Reduced Encoded Clusters with sTratified (DIRECT) sampling approach addresses the critical challenge of selecting representative training structures from large configuration spaces [19]. The protocol, visualized in Figure 1, consists of five key steps:

Figure 1: Workflow for DIRECT sampling

Step 1: Configuration Space Generation - Generate a comprehensive configuration space of N structures using methods such as AIMD simulations, random atom displacements, lattice strains, or sampling from universal MLIP molecular dynamics trajectories [19].

Step 2: Featurization/Encoding - Convert the configuration space into fixed-length vectors using the concatenated output of the final graph convolutional layer from pre-trained graph deep learning formation energy models (e.g., M3GNet trained on Materials Project formation energies) [19].

Step 3: Dimensionality Reduction - Apply Principal Component Analysis (PCA) to the normalized features, retaining the first m principal components with eigenvalues >1 (Kaiser's rule) to represent the feature space [19].

Step 4: Clustering - Employ the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to group structures into n clusters based on their locations in the m-dimensional feature space, weighting PCs by explained variance [19].

Step 5: Stratified Sampling - Select k structures from each cluster based on Euclidean distance to centroids. When k=1, choose the feature closest to each centroid; for k>1, select features at constant index intervals after distance sorting [19].

Application of DIRECT sampling to the Materials Project relaxation trajectories dataset with over one million structures has yielded improved M3GNet universal potentials that extrapolate more reliably to unseen structures [19].

Active Learning for Infrared Spectra Prediction

The Python-based Active Learning Code for Infrared Spectroscopy (PALIRS) implements a four-step protocol for efficient IR spectra prediction [20], with the workflow detailed in Figure 2:

Figure 2: PALIRS active learning workflow

Step 1: Initial Dataset Preparation - Sample molecular geometries along normal vibrational modes from DFT calculations to create foundational training data [20].

Step 2: Initial MLIP Training - Train an ensemble of three MACE models on the initial structures to enable uncertainty quantification through force prediction variance [20].

Step 3: Active Learning Loop - Iteratively expand the training set through MLMD simulations at multiple temperatures (300K, 500K, 700K), selecting configurations with highest uncertainty in force predictions to enrich the dataset [20].

Step 4: Dipole Moment Model Training - Train a separate MACE model specifically for dipole moment predictions using the final active learning dataset [20].

Step 5: MLMD Production and IR Spectra Calculation - Perform production MLMD simulations using the refined MLIP, compute dipole moments along trajectories, and derive IR spectra via autocorrelation function analysis [20].

This protocol achieves accurate IR spectra predictions at a fraction of the computational cost of AIMD, with applications demonstrated for small organic molecules relevant to catalysis [20].

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key computational resources for uMLIP development and application

Resource Category	Specific Tools/Databases	Function	Application Context
Benchmark Datasets	MDR Phonon Database [17], Alexandria [18], Materials Project [17] [19]	Provide standardized datasets for training and benchmarking uMLIP performance	Phonon calculations, high-pressure studies, universal potential development
MLIP Architectures	MACE [17] [18], M3GNet [17] [19], NequIP/SevenNet [17], CHGNet [17]	Core model architectures with different accuracy/efficiency trade-offs	Cross-materials applications, system-specific fine-tuning
Training Methodologies	DIRECT Sampling [19], Active Learning (PALIRS) [20]	Advanced strategies for robust training set selection and efficient data generation	Improving model generalizability, reducing computational cost
Specialized Software	DeePMD-kit [1], PALIRS [20]	Implementation frameworks for specific MLIP approaches and applications	Molecular dynamics, IR spectra prediction
Descriptor Schemes	Atomic Cluster Expansion [17], Smooth Overlap of Atomic Positions [17]	Represent atomic environments for machine learning	Encoding structural and chemical information

Critical Challenges and Future Directions

Despite rapid progress, uMLIPs face several significant challenges that limit their universal applicability. A primary concern is generalization to regimes underrepresented in training data, such as high-pressure environments [18] or metastable structures far from dynamical equilibrium [17]. Additionally, models that predict forces as separate outputs rather than energy gradients demonstrate higher failure rates in geometry relaxations [17], highlighting architectural limitations.

Future development should focus on several key areas: (1) incorporating diverse training data covering extreme conditions and rare configurations; (2) developing improved uncertainty quantification methods to detect extrapolation risks; (3) enhancing model architectures for better physical consistency; and (4) creating more comprehensive benchmark datasets spanning multiple materials classes and properties [17] [1] [18]. The integration of active learning strategies, multi-fidelity frameworks, and interpretable AI techniques will be crucial for advancing the next generation of truly universal interatomic potentials [1].

As uMLIP methodologies continue to mature, they hold immense promise for accelerating materials discovery across diverse applications, from catalyst design to high-pressure materials synthesis, ultimately bridging the gap between quantum mechanical accuracy and computational efficiency in atomistic simulations.

Machine learning interatomic potentials (MLIPs) represent a paradigm shift in computational chemistry and materials science, offering near-quantum mechanical accuracy at a fraction of the computational cost of traditional density functional theory (DFT) calculations [21] [1]. The performance and generalizability of these models are fundamentally constrained by the breadth, diversity, and fidelity of their training data [1]. This application note provides a comprehensive overview of current data resources and practical protocols for developing and applying MLIPs across chemical spaces, from small molecules to complex biomolecular systems, within the broader context of force calculation research.

Current Landscape of MLIP Training Datasets

The development of MLIPs has been accelerated by the creation of large-scale, high-quality datasets. These resources vary significantly in scale, chemical diversity, and target applications, enabling researchers to select appropriate training data for specific use cases. The table below summarizes key datasets that have emerged as critical resources for training MLIPs.

Table 1: Overview of Major Datasets for MLIP Training

Dataset Name	Data Scale	Elements Covered	Level of Theory	Key Features	Primary Applications
OMol25 [22] [23]	>100 million DFT calculations	83 elements	ωB97M-V/def2-TZVPD	Includes biomolecules, metal complexes, electrolytes, systems up to 350 atoms	Universal MLIPs, healthcare and energy storage technologies
QDπ [24]	1.6 million structures	13 elements (H, C, N, O, F, P, S, Cl, and others relevant to drug discovery)	ωB97M-D3(BJ)/def2-TZVPPD	Active learning curation, includes conformational energies, intermolecular interactions, tautomers	Drug discovery, biomolecular simulations
QM9 [1]	134k molecules (~1M atoms)	C, H, O, N, F	Not specified in sources	Small organic molecules with ≤9 heavy atoms	Molecular property prediction
MD17/MD22 [1]	MD17: ~3-4M configurations; MD22: 0.2M configurations	Varies by subset	Not specified in sources	Molecular dynamics trajectories	Energy and force prediction

The OMol25 dataset represents a significant milestone in dataset scale and diversity, requiring approximately 6 billion CPU core-hours to generate and containing configurations up to 10 times larger than previous molecular datasets [23]. Its unique value lies in blending "elemental, chemical, and structural diversity" including intermolecular interactions, explicit solvation, variable charge and spin states, conformers, and reactive structures [22].

For drug discovery applications, the QDπ dataset offers strategic advantages through its active learning curation process, which maximizes chemical diversity while minimizing redundant information [24]. This approach demonstrates that targeted, information-dense datasets can effectively cover relevant chemical spaces without requiring exhaustive computation of all possible structures.

Experimental Protocols for MLIP Development and Validation

Active Learning for Data Curation and Expansion

The QDπ dataset development employed a query-by-committee active learning strategy to efficiently sample chemical space [24]. The following protocol details this approach:

Procedure:

Initialization: Begin with an initial set of structures with quantum mechanical reference data
Committee Training: Train 4 independent MLIP models on the current dataset using different random seeds
Candidate Screening: For each structure in source databases, calculate the standard deviation of energy and force predictions across the committee models
Selection Criteria:
- Structures with standard deviations exceeding 0.015 eV/atom for energy or 0.20 eV/Å for forces are considered informative
- Select a random subset of up to 20,000 candidate structures for DFT calculation
Iteration: Repeat steps 2-4 until all structures in source databases either meet inclusion criteria or are excluded

Applications: This protocol is particularly valuable for pruning large datasets to remove redundancy or expanding small datasets through molecular dynamics sampling of thermally accessible conformations [24].

Recent research has evaluated MLIPs trained on the OMol25 dataset for predicting charge-sensitive molecular properties [25]. The following protocol outlines this benchmarking approach:

Procedure:

System Preparation:
- Obtain molecular structures for both reduced and non-reduced states of target species
- For reduction potential calculations, include solvent information

Geometry Optimization:
- Optimize all structures using the target MLIP
- Use robust optimization algorithms (e.g., geomeTRIC 1.0.2)
- Apply convergence criteria for forces (e.g., <0.005 eV/Å)
Energy Evaluation:
- Calculate electronic energies for optimized structures
- For solvation effects, apply implicit solvation models (e.g., CPCM-X)
Property Calculation:
- Reduction potential: ΔE = E(non-reduced) - E(reduced) [in eV, equivalent to volts]
- Electron affinity: E(neutral) - E(anion) [in eV]
Validation:
- Compare predictions against experimental data
- Calculate mean absolute error (MAE), root mean square error (RMSE), and R² values

Key Findings: This protocol revealed that despite not explicitly modeling Coulombic physics, OMol25-trained models like UMA-Small can predict reduction potentials for organometallic species with accuracy comparable to or exceeding traditional DFT methods [25].

Workflow Visualization for MLIP Development

The following diagram illustrates the integrated workflow for developing and applying machine learning interatomic potentials, from data curation to model deployment:

Table 2: Key Research Reagents and Computational Tools for MLIP Development

Resource Category	Specific Tools/Models	Function and Application
Universal MLIP Models	UMA (Universal Model for Atoms) [23], MACE [26], CHGNet [26], MatterSim [26]	Pretrained foundational models for broad chemical spaces; can be used directly or fine-tuned for specific applications
Specialized MLIP Models	AIMNet2 [27], SevenNet [26], eqV2-M [17]	Models with specific strengths: AIMNet2 for charged systems and IR spectra, SevenNet and eqV2-M for high accuracy on derivatives
Software Platforms	AMS/MLPotential [27], DeePMD-kit [1], DP-GEN [24]	Integrated platforms for MLIP training, deployment, and active learning workflows
Benchmarking Resources	Matbench Discovery [17], MDR phonon database [17], Experimental redox datasets [25]	Standardized datasets and metrics for evaluating MLIP performance on specific properties

Model Selection Guidelines

Choosing an appropriate MLIP requires careful consideration of the target application:

For broad exploratory studies across diverse chemical spaces: Universal models like UMA and MatterSim provide excellent starting points [26] [23]
For electronic property predictions: CHGNet incorporates charge information via magnetic moment constraints [26]
For high-accuracy force and stress predictions: MACE and SevenNet demonstrate strong performance on elastic and phonon properties [26]
For molecular systems with charges: AIMNet2 currently offers unique support for charged species and dipole moment predictions [27]

The evolving landscape of data resources for machine learning interatomic potentials is dramatically accelerating molecular simulations across chemical and materials spaces. The emergence of massive, diverse datasets like OMol25, coupled with strategically curated resources like QDπ, provides researchers with unprecedented opportunities to develop accurate, transferable models. The experimental protocols and resources outlined in this application note offer practical pathways for leveraging these data assets to advance force calculation research. As the field progresses, increased emphasis on data quality, specialized benchmarking, and model interpretability will further enhance the utility of MLIPs in scientific discovery and industrial applications, particularly in drug development and materials design.

Implementing MLIPs: Architectures, Training Strategies, and Domain Applications

Machine learning interatomic potentials (MLIPs) have emerged as a transformative tool in computational materials science and drug discovery, bridging the gap between quantum-mechanical accuracy and molecular dynamics efficiency. These models learn the relationship between atomic configurations and potential energies from reference data, enabling simulations across extended time and length scales. The rapid evolution of MLIP architectures has given rise to distinct families, each with characteristic approaches to balancing accuracy, computational efficiency, and physical faithfulness. This application note provides a structured overview of the primary MLIP architecture families—graph networks, symmetry-equivariant models, and high-efficiency frameworks—contextualized within the broader thesis of enabling accurate force calculations for research applications. We summarize quantitative performance data, detail experimental protocols for model evaluation, and visualize key architectural relationships to equip researchers with practical guidance for selecting and implementing these advanced computational tools.

MLIP Architecture Families: Core Principles and Comparative Analysis

Architectural Paradigms and Their Evolution

MLIP architectures have evolved from using invariant descriptors to sophisticated symmetry-aware models. Invariant models initially dominated the field, relying on handcrafted features like bond lengths and angles that remain unchanged under rotational and translational transformations. While computationally efficient, these models face limitations in distinguishing structures with identical local features but different spatial arrangements [11]. Equivariant models represent a significant advancement by explicitly embedding the symmetries of Euclidean space (E(3)) directly into their architecture. These models ensure that their internal representations transform predictably under rotation, leading to superior data efficiency and accuracy [1]. Recently, high-efficiency frameworks have emerged, aiming to preserve the benefits of equivariance while reducing computational overhead through innovative representations and training paradigms [11] [28].

Table 1: Core MLIP Architecture Families and Their Characteristics

Architecture Family	Core Principle	Representative Models	Key Advantages	Inherent Limitations
Invariant Graph Networks	Uses invariant features (distances, angles) to construct potential energy surface.	CGCNN [11], SchNet [11], MEGNet [11], M3GNet [29], ALIGNN [11]	Conceptual simplicity, computational efficiency.	Limited geometric awareness, struggles with stereoisomers and complex spatial configurations [11].
Symmetry-Equivariant Models	Embeds E(3) symmetry directly into network operations via spherical harmonics/tensor products.	NequIP [11] [30], MACE [11] [31], Allegro [21]	High data efficiency & accuracy, excellent generalization from small datasets [11] [30].	High computational cost and memory footprint from tensor operations [11].
High-Efficiency Frameworks	Achieves equivariance without high-order tensors or uses learned symmetry compliance.	E2GNN [11], PaiNN [11], TransIP [28], CAMP [32]	Favorable accuracy-speed trade-off, scalability to larger systems [11] [28].	Potential accuracy trade-offs vs. high-order equivariant models on complex tasks [11].

Quantitative Performance Comparison

Benchmarking studies and model publications provide critical metrics for comparing the performance of different MLIP architectures across diverse datasets. The following table synthesizes reported performance data, offering a quantitative basis for model selection. It is crucial to note that performance is dataset-dependent, and these figures should serve as a guide rather than an absolute ranking.

Table 2: Reported Performance Metrics of Selected MLIP Models

Model	Architecture Family	Dataset	Energy MAE	Force MAE	Key Benchmarking Context
M3GNet	Invariant Graph Network	Materials Project (universal) [29]	-	-	Identified ~1.8M potentially stable materials; 1,578 verified by DFT [29].
E2GNN	High-Efficiency Framework	Catalysts, molecules, organic isomers [11]	Outperforms baselines [11]	Outperforms baselines [11]	Achieved AIMD accuracy in solid, liquid, gas MD simulations [11].
NequIP	Symmetry-Equivariant Model	Diverse systems (Li diffusion, water) [30]	-	-	Unprecedented accuracy & sample efficiency; 1000x fewer data than invariant models [30].
DeePMD	Invariant/Descriptor-Based	~10^6 water configs. [1]	<1 meV/atom	<20 meV/Å	Quantum accuracy with classical MD efficiency [1].

Experimental Protocols for MLIP Evaluation and Application

Standardized Benchmarking with MLIPAudit

Robust evaluation of MLIPs requires going beyond static energy and force errors to assess performance on downstream tasks. The MLIPAudit benchmarking suite provides a standardized framework for this purpose [31].

Protocol Objectives: To evaluate the accuracy, robustness, and transferability of MLIPs across a diverse set of molecular systems and simulation tasks, providing a holistic view of model utility beyond training metrics [31].

Procedure:

System Preparation: Select benchmark systems from the MLIPAudit repository, which includes small organic compounds, molecular liquids, flexible peptides, and folded protein domains [31].
Model Inference: Configure the target MLIP(s) using the provided tools, ensuring compatibility via an ASE calculator interface [31].
Task Execution: Run the standardised benchmark tasks, which may include:
- Static Property Prediction: Calculate energies and forces on held-out validation sets, reporting RMSE and MAE [31].
- Molecular Dynamics Stability: Perform MD simulations to assess energy conservation and structural stability over time [31].
- Conformational Sampling: Evaluate the model's ability to reproduce correct populations of molecular conformers or protein secondary structures [31].
- Property Reproduction: Compare simulated properties (e.g., radial distribution functions, diffusion coefficients) against reference QM or experimental data [31].
Results Submission and Comparison: Submit results to the MLIPAudit leaderboard for standardized comparison against pre-computed evaluations of published models [31].

Key Controls: Consistent cutoff distances, system sizes, and simulation parameters across all models compared; use of identical initial configurations and thermostats for dynamic simulations [31].

Molecular Dynamics Simulation with MLIPs

Applying a trained MLIP to run molecular dynamics simulations is a primary use case. This protocol outlines the steps for a typical NVT (constant number, volume, and temperature) simulation.

Protocol Objectives: To perform a stable MD simulation using a trained MLIP to study thermodynamic properties, dynamic behavior, and structural evolution of a molecular system.

Procedure:

System Initialization:
- Coordinate Preparation: Obtain initial atomic coordinates from a crystal structure, previous simulation, or model builder.
- Box Size and Periodicity: Define the simulation box with periodic boundary conditions appropriate for the system (e.g., ~10 Å padding around solute for solvated systems).
- Model Loading: Load the trained MLIP model and its parameters into an MD engine that supports MLIPs (e.g., LAMMPS, ASE, SchNetPack).

Energy Minimization:
- Perform a geometry optimization (e.g., using the FIRE algorithm) to relax the structure to the nearest local energy minimum, minimizing internal stresses.
- Convergence Criterion: Set a force tolerance (e.g., 0.01 eV/Å) to terminate the minimization.
Equilibration Phase:
- Thermostat Coupling: Initialize velocities according to a Maxwell-Boltzmann distribution for the target temperature. Use a thermostat (e.g., Nose-Hoover, Langevin) to maintain temperature.
- Integration: Use a time integrator like Velocity Verlet with a small time step (0.5-1.0 fs), as MLIPs can model high-frequency vibrations.
- Duration: Run for a sufficient time (e.g., 10-100 ps) for system properties (temperature, pressure, energy) to stabilize.
Production Run:
- Disable any strong coupling to external baths if measuring equilibrium properties.
- Continue the simulation, writing trajectories (atomic positions, velocities) and energies/forces to disk at regular intervals (e.g., every 10-100 steps) for subsequent analysis.
Analysis:
- Calculate relevant properties from the production trajectory, such as radial distribution functions, mean-squared displacement for diffusion coefficients, or root-mean-square deviation for structural stability.

Troubleshooting Note: Poor energy conservation or structural instability during dynamics can indicate a failure of the MLIP to generalize. This underscores the importance of rigorous benchmarking using protocols like MLIPAudit before relying on simulation results [31].

Visualization of MLIP Architectures and Workflows

Conceptual Hierarchy of MLIP Architectures

The following diagram illustrates the logical relationships and evolution between the major MLIP architecture families, highlighting their core strategies for handling geometric symmetries.

MLIP Benchmarking and Application Workflow

This flowchart outlines the standard workflow for developing, benchmarking, and applying an MLIP, from data generation to production simulation, incorporating feedback loops for model improvement.

Successful development and application of MLIPs rely on a suite of software tools, datasets, and computational resources. The following table details key components of the modern MLIP research toolkit.

Table 3: Essential Resources for MLIP Research and Development

Resource Category	Specific Tool / Database	Function and Application
Benchmarking Suites	MLIPAudit [31]	Standardized framework for evaluating MLIP accuracy, robustness, and transferability across diverse molecular systems.
Reference Datasets	Materials Project [29], OMol25 [28], MD17/MD22 [1], QM9 [1]	Sources of high-quality reference data (energies, forces) from DFT calculations for training and testing MLIPs.
Software Packages	DeePMD-kit [1], ASE (Atomic Simulation Environment) [31]	Software ecosystems for training MLIPs (DeePMD-kit) and running simulations with interoperability between codes (ASE).
Pre-trained Universal Models	M3GNet [29], MACE-MP [31], MACE-OFF [31]	Ready-to-use MLIPs trained on extensive datasets across the periodic table or organic molecules, enabling simulations without initial training.

Application Notes

The selection and engineering of molecular descriptors are pivotal for developing robust machine learning interatomic potentials (MLIPs). These descriptors encode atomic environments into mathematical representations, enabling the prediction of potential energy surfaces with near-quantum accuracy. The choice of descriptor strategy directly impacts a model's data efficiency, computational cost, transferability, and accuracy across diverse chemical spaces [1].

Table 1: Comparison of Core Descriptor Strategies for Machine Learning Interatomic Potentials

Descriptor Strategy	Core Principles	Key Advantages	Common Algorithms/Implementations	Target Properties
Geometric Descriptors	Roto-translationally invariant descriptions of local atomic neighborhoods using distances, angles, and dihedral angles [1].	- Strong inductive bias for molecular mechanics [33].- Computationally efficient.- Well-established and interpretable.	- Atom-Centered Symmetry Functions (ACSF) [34].- Smooth Overlap of Atomic Positions (SOAP) [34].- Atomic Cluster Expansion (ACE) [34].	Potential energy, atomic forces, elastic constants, phonon dispersion [35].
Electronic Structure Descriptors	Utilizes features from quantum mechanical (QM) calculations, such as orbital interactions, to inform the model [33].	- Superior for quantum chemical properties [33].- High data efficiency and transferability.- Can model electronic phenomena (e.g., charge transfer) [33].	- OrbNet-Equi [33].- Machine Learning Hamiltonian (ML-Ham) methods [1].	Electronic energies, dipole moments, frontier orbital energies (HOMO/LUMO), electron densities [33].
Custom Composite Descriptors	Combines readily available database features with vectorized property matrices and empirical functions [36].	- Mitigates data scarcity for specific applications.- Leverages low-cost computational features.- Highly tunable for specific property prediction.	- Hybrid feature engineering (e.g., mixing formation heat with vectorized electronegativity) [36].- Descriptor vectorization from property matrices [36].	Band gaps, work functions, and other electronic structure properties of complex materials like 2D systems [36].

The integration of physical symmetries is a critical consideration. Equivariant models explicitly embed symmetries like rotation (SO(3)) and translation into their architecture. This ensures that scalar outputs like energy are invariant, while vector outputs like forces transform correctly, leading to significant improvements in data efficiency and physical consistency [1] [34]. For example, models like MACE and NequIP use equivariant operations to achieve high accuracy with fewer training samples [34] [37].

The paradigm of foundation MLIPs represents a shift towards universal potentials pre-trained on massive datasets. However, adapting these models to specific tasks or higher levels of theory often requires transfer learning. Frameworks like franken enable efficient adaptation by extracting atomic descriptors from pre-trained graph neural networks and fine-tuning them with minimal data, showcasing strong performance with as few as tens of training structures [34].

Protocols

Protocol 2.1: Implementing an Electronic-Structure-Informed Equivariant Model (e.g., OrbNet-Equi)

This protocol details the procedure for training a model that incorporates electronic structure features within an equivariant neural network, suitable for predicting quantum chemical properties [33].

Research Reagent Solutions

Item	Function/Description
GFN-xTB Software	Provides efficient semi-empirical tight-binding calculations to generate initial mean-field electronic structure inputs [33].
OrbNet-Equi Codebase	The core neural network architecture that is equivariant to isometric transformations of the molecular system [33].
QM9 Dataset	A standard benchmark containing ~134,000 small organic molecules with quantum chemical properties [1].
Quantum Chemistry Software (e.g., PySCF, ORCA)	High-fidelity methods (e.g., DFT) used to generate the target training data for energies, dipoles, and other properties [33].

Procedure

Input Featurization: For a given molecular geometry, compute a tight-binding mean-field electronic structure using GFN-xTB. The inputs to the neural network are constructed as a stack of matrices—typically the Fock (F), density (P), core Hamiltonian (H), and overlap (S) matrices—represented in the atomic orbital basis [33].
Descriptor Symmetry Handling: The atomic orbital features transform block-wise under 3D rotations. The equivariant network must be designed to process these features in a way that conserves the physical symmetries of the quantum chemical system [33].
Model Training:
- Architecture: Employ an equivariant neural network that iteratively updates atomic representations by propagating information via the off-diagonal blocks of the input matrices. The network's forward pass should respect the equivariance condition: ( R \cdot \mathcal{F}(T[\Psi0]) \equiv \mathcal{F}(R \cdot T[\Psi0]) ), where ( R ) is a rotation and ( T ) is the input feature matrix [33].
- Training Data: Use a dataset like the ~236,000 molecules from small-molecule libraries (e.g., SDC21) with corresponding high-fidelity target values (e.g., DFT-level energies) [33].
- Loss Function: Minimize a cost function ( \mathcal{L} ) that penalizes differences between the model's predictions and the high-fidelity reference data for the target properties [33].
Validation: Validate the trained model on downstream benchmarks encompassing diverse chemical processes, such as ionization potentials of charge-transfer complexes and open-shell systems, to ensure transferability beyond the training data [33].

Protocol 2.2: Engineering Custom Composite Descriptors for 2D Material Properties

This protocol outlines a method for constructing hybrid descriptors to predict electronic properties of 2D materials, which is particularly useful when dataset size is limited [36].

Research Reagent Solutions

Item	Function/Description
Computational 2D Materials Database (C2DB)	A repository of DFT-calculated properties for about 4,000 two-dimensional materials, serving as a data source [36].
Elemental Property Data	Tabulated values for properties like covalent radius, dipole polarizability, and ionization energy for each element [36].
Tree-Based ML Models (e.g., XGBoost)	Ensemble learning models that can effectively handle the mixed feature types and provide high predictive performance [36].
Descriptor Calculation Script	A custom script (e.g., in Python) to implement the vectorization and hybrid descriptor construction process.

Procedure

Feature Selection and Acquisition:
- Database Features: Select readily available features from the C2DB, such as the number of atoms, cell volume, molecular mass, and formation heat [36].
- Empirical Feature: Calculate a molecular-level electronegativity descriptor using an empirical equation, such as the geometric mean of constituent atomic electronegativities [36].
Descriptor Vectorization:
- For a target property (e.g., band gap), select relevant atomic properties (e.g., covalent radius, ionization energy).
- For a molecule with a reduced stoichiometric formula ( AaBbCc ), construct a property matrix ( Pi ) for each atomic property. The matrix elements ( a_{ij} ) represent the atom-atom pair contribution, computed via a predefined operator (e.g., summation of individual atomic polarizabilities) [36].
- Compute the eigenvalues of each symmetric property matrix. This set of eigenvalues forms a "property vector," which is a flattened, characteristic spectrum of the feature that retains physical meaning while reducing data volume [36].
Hybrid Descriptor Construction: Assemble the final input feature vector by combining the database features, the empirical electronegativity descriptor, and the vectorized descriptors (eigenvalues of covalent radius, polarizability, and ionization energy matrices) [36].
Model Training and Validation:
- Train an ensemble model like Extreme Gradient Boosting (XGBoost) using the hybrid descriptors to predict target properties (e.g., band gap, work function).
- Evaluate performance using metrics like R² and Mean Absolute Error (MAE). The hybrid features have been shown to achieve R² > 0.9 and significantly reduce overfitting compared to using database features alone [36].

Visualization

The following workflow diagram illustrates the integrated protocol for developing and applying advanced descriptor strategies in MLIPs.

Diagram Title: MLIP Descriptor Strategy Workflow

Machine learning interatomic potentials (MLIPs) represent a transformative advancement in atomistic simulations, offering near-quantum mechanical accuracy at a fraction of the computational cost of first-principles methods. The accuracy, efficiency, and transferability of these potentials are fundamentally governed by their training methodologies. This application note details three principal training paradigms—passive learning, active learning, and multi-fidelity learning—framed within the context of developing robust MLIPs for accurate force and energy calculations. These methodologies address the critical challenge of generating high-quality, diverse training datasets without prohibitive computational expense, enabling reliable molecular dynamics (MD) simulations across materials science, chemistry, and drug development.

The selection of a training strategy involves balancing computational cost, data efficiency, and the resulting model accuracy. The table below summarizes the core characteristics, requirements, and typical applications of the three methodologies.

Table 1: Comparative analysis of MLIP training methodologies.

Methodology	Data Selection Strategy	Key Data Requirements	Computational Cost (Data Generation)	Relative Accuracy	Ideal Use Cases
Passive Learning	Static; user-defined and curated	Large, diverse datasets of energies and forces	Very High	Variable; limited by dataset quality	Systems with well-understood configuration space
Active Learning	Dynamic; iterative and uncertainty-driven	Initial seed dataset; queries for new configurations	Moderate (focused on informative samples)	High	Exploring new systems, rare events, and complex phase spaces
Multi-Fidelity Learning	Integrates datasets of varying accuracy	Low-level forces + High-level energies	Low (leverages existing low-cost data)	Very High	Leveraging public datasets; achieving high accuracy without high-level forces

Protocol 1: Passive Learning

Principle and Application Context

Passive learning, the traditional approach, involves training an MLIP on a static, pre-computed dataset. The dataset is typically generated through methods like ab initio molecular dynamics (AIMD) or density functional theory (DFT) calculations on a wide range of atomic configurations, including bulk structures, defects, and thermally perturbed systems, to ensure broad coverage of the potential energy surface (PES) [1] [38]. Its primary advantage is straightforward implementation, but its major limitation is the requirement for a large, comprehensive dataset to avoid poor extrapolation, making data generation computationally expensive [39].

Detailed Experimental Workflow

Step 1: Dataset Construction

Configuration Sampling: Generate a diverse set of atomic configurations. This should include:
- Fundamental Crystal Structures: Unit cells of relevant phases (e.g., HCP, FCC, BCC for metals), scaled to various volumes (e.g., ±10% around equilibrium) to sample different strains [38].
- Defect Structures: Introduce vacancies, interstitial atoms, dislocations, grain boundaries, and stacking faults [38].
- Thermal Perturbations: Perform AIMD simulations at multiple temperatures (e.g., 300 K, 800 K, 1300 K) to sample non-equilibrium structures driven by thermal agitation [38].
Reference Calculations: For each configuration, perform high-fidelity DFT calculations to obtain reference energies, atomic forces, and virial stresses. Ensure numerical convergence with respect to plane-wave cutoff energy and k-point mesh [38] [40].

Step 2: Model Training

Descriptor/Model Selection: Choose an MLIP architecture (e.g., Deep Potential (DP), Moment Tensor Potential (MTP), or a graph neural network like MACE or NequIP) [1] [39].
Loss Function Optimization: Train the model by minimizing a loss function (L) that typically includes energy (E), forces (F), and sometimes stress contributions: L = ω_E × MSE(E_pred, E_DFT) + ω_F × MSE(F_pred, F_DFT) where ω_E and ω_F are weighting factors tuned to balance the influence of energies and forces [1] [40].
Validation: Split the dataset into training and validation sets (e.g., 3:1 ratio) to monitor for overfitting and assess the model's generalization error [38].

Step 3: Model Validation

Property Prediction: Validate the trained potential by predicting key material properties not included in the training set, such as elastic constants, surface energies, phonon spectra, and thermal expansion coefficients [39] [38].
Molecular Dynamics: Run MD simulations to test the model's stability and its ability to reproduce dynamic properties over time.

Workflow Visualization

Protocol 2: Active Learning

Principle and Application Context

Active learning (AL) employs an iterative, uncertainty-driven strategy to build training datasets efficiently. Starting from a small seed dataset, the algorithm queries a sampler (e.g., an MD simulation) for new candidate configurations. These candidates are evaluated by an estimator (the current MLIP), and those with high uncertainty (where the model is least confident) are selected for first-principles calculation and added to the training set [39]. This "smart" sampling focuses computational resources on the most informative regions of the PES, often achieving high accuracy with orders of magnitude fewer data points than passive learning [39]. It is particularly valuable for exploring unknown systems, capturing rare events, and mapping complex phase transitions.

Detailed Experimental Workflow

Step 1: Initialization

Seed Dataset: Generate a small initial dataset (e.g., 50-100 configurations) containing minimal structures like the equilibrium crystal phase and slight random perturbations.
Initial Model: Train a preliminary MLIP on this seed dataset.

Step 2: Active Learning Loop The core of AL is the iterative cycle of exploration and labeling, often automated within frameworks like DP-GEN [38].

Exploration (Sampling): Use the current MLIP to run molecular dynamics or Monte Carlo simulations, exploring a wide range of thermodynamic conditions (temperatures, pressures). This generates a pool of candidate configurations.
Candidate Selection (Uncertainty Estimation): For each candidate, compute an uncertainty metric. Common methods include:
- Committee Models: Train multiple models (a committee); the standard deviation of their predictions serves as the uncertainty [39].
- Built-in Uncertainty Indicators: Some MLIP formats, like MTP, provide inherent indicators based on the descriptor's position relative to the training set.
Labeling: Select the candidates with the highest uncertainty and compute their accurate energies and forces using DFT.
Model Update: Add the newly labeled configurations to the training set and retrain the MLIP.

Step 3: Convergence and Validation

The loop continues until a convergence criterion is met, such as the maximum uncertainty falling below a predefined threshold or the model's predictions on key properties stabilizing.
Finally, the model undergoes rigorous validation as described in Protocol 1.

Workflow Visualization

Protocol 3: Multi-Fidelity Learning

Principle and Application Context

Multi-fidelity learning (MFL) addresses a common problem in computational chemistry: the existence of vast datasets computed with low-cost, low-fidelity methods (e.g., DFT with a moderate basis set) and small, expensive, high-fidelity datasets (e.g., coupled-cluster theory) that often lack atomic forces [41] [42]. MFL trains a single model on multiple datasets of different accuracies ("fidelities"). The key insight is that low-level atomic forces and high-level energies are all you need to create a highly accurate potential [41] [42]. This approach allows the model to learn the high-level PES without the prohibitive cost of generating a large dataset of high-level forces, effectively leveraging the abundance of existing public quantum mechanical data.

Detailed Experimental Workflow

Step 1: Fidelity Dataset Curation

Low-Fidelity Dataset: Source or compute a large dataset (e.g., 10,000s of configurations) that includes both energies and atomic forces, calculated using a fast but less accurate method (e.g., DFT with a generalized gradient approximation functional and a small basis set).
High-Fidelity Dataset: Source or compute a smaller dataset (e.g., 100s of configurations) containing highly accurate energies, calculated using a high-level method (e.g., CCSD(T) with a complete basis set extrapolation). Forces may be absent or sparse in this dataset.

Step 2: Model and Loss Function Design

Architecture: Use a neural network architecture capable of learning complex mappings, such as an equivariant graph neural network (e.g., MACE or NequIP) [41].
Multi-Fidelity Loss Function: Design a composite loss function that simultaneously fits both datasets. A simplified form can be represented as: L = L_Low(E_pred, E_LF, F_pred, F_LF) + ω_High × L_High(E_pred, E_HF) Here, L_Low ensures the model reproduces the low-fidelity PES topography and forces, while L_High constrains the absolute energy scale to the high-fidelity level. The weight ω_High balances the two objectives [41] [42].

Step 3: Training and Validation

Train the model on the combined dataset. The model learns to correct the low-fidelity PES using the high-fidelity energy anchors.
Validate the final model's forces and energies against a held-out set of high-fidelity calculations, if available. The resulting MLIP is shown to be far more accurate than a model trained only on high-fidelity energies and nearly as accurate as a model trained on the (prohibitively expensive) full set of high-fidelity energies and forces [41].

Workflow Visualization

The Scientist's Toolkit: Research Reagent Solutions

The following table lists essential software, frameworks, and datasets used in the development and benchmarking of MLIPs.

Table 2: Essential tools and resources for MLIP research.

Category	Name	Primary Function	Application Context
Active Learning Frameworks	DP-GEN [38]	Automated iterative dataset generation and potential training.	Active learning for materials and molecules.
MLIP Training Packages	DeePMD-kit [1] [39]	Implements the Deep Potential method for training MLIPs.	High-accuracy potentials for complex systems.
	FitSNAP [40]	Trains linear (e.g., SNAP, qSNAP) and spectral MLIPs.	Computationally efficient potential fitting.
Benchmarking Suites	MLIPAudit [31]	Standardized benchmarking suite for MLIPs on biomolecules and molecular liquids.	Evaluating model robustness & transferability.
	Matbench Discovery [31] [43]	Leaderboard and framework for evaluating MLIPs on materials stability prediction.	Materials discovery and property prediction.
Reference Datasets	OC20/OC22 [34] [43]	Large-scale dataset of catalyst relaxations with DFT energies and forces.	Training and testing potentials for catalysis.
	QM9, MD17, MD22 [1]	Quantum chemical properties and molecular dynamics trajectories for organic molecules.	Training MLIPs for molecular systems.

The accurate calculation of forces within biomolecular systems is a cornerstone of understanding fundamental processes such as enzymatic catalysis and ligand binding. Machine learning interatomic potentials (MLIPs) are emerging as transformative tools that bridge the gap between the high accuracy of quantum mechanical methods and the computational feasibility of classical molecular dynamics for large-scale systems [2]. These potentials are trained on first-principles data, enabling them to capture the complex energy landscapes of proteins and other biomolecules with near-quantum accuracy at a fraction of the computational cost. This application note details how MLIPs can be leveraged to study conformational sampling and its critical role in biomolecular function, providing structured experimental data and detailed protocols for the research community.

Application Notes: Key Studies and Quantitative Data

The following studies exemplify the critical relationship between conformational dynamics and biomolecular function, providing a foundation for simulations powered by MLIPs.

Table 1: Quantitative Data on Enzymatic Activity and Conformational Landscapes

Enzyme / System	Catalytic Efficiency (kcat/KM s-1 M-1)	Turnover Number (kcat s-1)	Michaelis Constant (KM μM)	Key Conformational Finding
Bacterial PTE (pdPTE) [44]	1.68 x 10^7	1,600 ± 50	95 ± 10	"Open" and "closed" conformational substates (CSs) linked to substrate access and catalysis.
Bacterial PTE (arPTE) [44]	1.67 x 10^7	3,180 ± 100	190 ± 19	Remote mutations alter conformational landscape, affecting turnover rate (kcat).
AncCDT-3/P188 [45]	~10^1	~10^-2	N/D	Predominantly samples catalytically unproductive open states, a vestige of ancestral binding protein function.
AncCDT-5 [45]	N/D	4	277	Represents an intermediate conformational landscape between ancestor and extant enzyme.
PaCDT (extant enzyme) [45]	~10^6	18	19	Exclusively samples catalytically relevant compact states due to refined conformational landscape.

Table 2: Key Mutations and Their Functional Consequences

Enzyme Variant	Mutation(s)	Impact on Activity and Dynamics
pdPTE H254R [44]	Single point mutation (H254R)	Lower kcat despite R254 being the natural residue in the more active arPTE; altered conformational sampling.
arPTE 4M [44]	K185R, D208G, N265D, T274N	Increased kcat, demonstrating that remote mutations can optimize the conformational distribution for faster turnover.
arPTE 8M [44]	G60A, A80V, R118Q, K185R, Q206P, D208G, I260T, G273S	Reduced kcat, demonstrating the sensitivity of the conformational landscape to sequence changes.
Ancestral CDT Intermediates [45]	Multiple remote substitutions (e.g., P188L, Tyr177Gln)	Cumulative mutations progressively freeze out non-productive wide-open states, shifting the equilibrium toward catalytically competent closed states over evolution.

Notes on Functional Insights

Conformational Substates in Catalysis: Studies on bacterial phosphotriesterase (PTE) have identified distinct conformational substates. A "closed" state is optimally preorganized for the chemical step of hydrolysis but appears to block substrate access and product release. In contrast, an "open" state facilitates solvent access but is poorly organized for catalysis [44]. This highlights a fundamental trade-off that enzymes must overcome for efficient turnover [46].
Evolution of Conformational Landscapes: Research on the evolutionary trajectory from solute-binding proteins (SBPs) to cyclohexadienyl dehydratase (CDT) demonstrates that a key mechanism for increasing catalytic efficiency is the reshaping of the global conformational landscape. Primitive enzymes predominantly sample catalytically unproductive states inherited from their non-catalytic ancestors. Remote mutations, which do not directly alter the active site, gradually freeze out these unproductive states (e.g., wide-open conformations), leading to modern enzymes that exclusively sample compact, catalytically competent states [45].
Ligand Binding Pathways: Proteins can bind ligands via induced fit or conformational selection. The latter model posits that the substrate-free enzyme transiently samples a high-energy, substrate-bound-like conformation. The arrest of adenylate kinase (AdK) in a closed, high-energy state via a engineered disulfide bond provided direct structural evidence for this model, showing that this state is pre-organized for ligand binding and catalysis [47].

Experimental Protocols

The protocols below outline core methodologies for studying conformational states, generating data for MLIP training, and applying MLIPs in biomolecular simulations.

Protocol 1: Trapping and Characterizing a High-Energy Conformational State

This protocol is adapted from a study on adenylate kinase (AdK) to experimentally characterize a high-energy state crucial for catalysis [47].

Design of a Disulfide Trap:
- Identify residue pairs in the protein structure that are distant in the known "open" or "ground" state conformation (e.g., PDB ID: 4AKE) but in close proximity (< 5 Å) in the target "closed" or "high-energy" state conformation (e.g., PDB ID: 1AKE).
- Use computational tools like "Disulfide by Design" [47] to evaluate the feasibility of disulfide bond formation between candidate residues (e.g., G56C and T163C in AdK).
Site-Directed Mutagenesis and Protein Purification:
- Introduce the selected cysteine mutations into the gene of interest using standard molecular biology techniques.
- Express and purify the mutant protein under oxidizing conditions to promote disulfide bond formation.
Validation of Disulfide Arrest:
- SDS-PAGE Analysis: Confirm the formation of an intramolecular disulfide bond by observing a differential migration pattern between the oxidized (AdKcc,ox) and reduced (AdKcc,red) protein on a non-reducing SDS-PAGE gel [47].
- Size Exclusion Chromatography: Verify that the oxidized protein elutes as a monomer, confirming the disulfide is intramolecular.
Biophysical Characterization of the Trapped State:
- X-ray Crystallography: Crystallize the trapped protein, potentially in complex with a transition state analog (e.g., Ap5a for AdK). Solve the structure to confirm it resembles the target high-energy conformation and that catalytic sidechains are properly aligned [47].
- NMR Spectroscopy: Collect 1H-15N HSQC spectra of the trapped (oxidized) and released (reduced) states. Compare the chemical shift perturbations (CSPs) to those observed when a ligand binds to the wild-type enzyme. A similar CSP pattern indicates the trapped state successfully mimics the ligand-bound conformation [47].
- Ligand Binding Affinity: Measure the binding affinity of a substrate or inhibitor to the trapped state. An improved affinity compared to the wild-type enzyme demonstrates that binding to the high-energy state is not occluded by steric hindrance [47].

Protocol 2: Data Generation for Training a Machine Learning Interatomic Potential

This protocol describes the generation of a reference dataset for training an MLIP for biomolecular simulations [2] [17].

System Selection and Preparation:
- Select the target biomolecular system (e.g., a protein-ligand complex or an enzyme in different conformational states).
- Generate an initial set of structures that sample the relevant conformational space. Sources can include:
  - Experimental structures from the PDB.
  - Structures from molecular dynamics (MD) simulations.
  - Systematically distorted geometries around the equilibrium structure [17].
Ab Initio Calculation of Reference Data:
- For each structure in the dataset, perform first-principles calculations using Density Functional Theory (DFT) or a higher-level quantum chemistry method.
- The key outputs to calculate for each configuration are:
  - Total Energy: The potential energy of the system.
  - Atomic Forces: The negative gradient of the energy with respect to atomic coordinates.
  - Stress Tensor: (For periodic systems) The pressure on the unit cell.
Dataset Curation and Splitting:
- Assemble a final dataset containing the structural configurations and their corresponding energies, forces, and stresses.
- Split the dataset into training, validation, and test sets (e.g., 80%/10%/10%) to enable model development and unbiased evaluation.

Protocol 3: Simulating Conformational Transitions with an MLIP

This protocol uses a trained MLIP to study conformational sampling, a key application in enzymology [2].

Model Selection and Setup:
- Select a pre-trained universal MLIP (e.g., M3GNet, CHGNet) or train a new MLIP on a system-specific dataset generated via Protocol 2.
- Prepare the initial simulation system, which could be a crystal structure, a homology model, or a structure pre-equilibrated with classical MD.
Geometry Optimization:
- Use the MLIP to perform an energy minimization or a geometry relaxation of the initial structure to find the nearest local minimum on the potential energy surface. This step is crucial for removing steric clashes and preparing a stable structure for dynamics.
Molecular Dynamics (MD) Simulation:
- Run an MD simulation using the MLIP to calculate energies and forces at each time step.
- Critical Settings:
  - Ensemble: Use NVT or NPT ensembles as required.
  - Temperature: Set to a biologically relevant temperature (e.g., 300 K).
  - Integrator: Use a stable integrator like Velocity Verlet with a time step of 0.5-2 femtoseconds.
  - Simulation Time: Aim for simulations on the scale of nanoseconds to microseconds, as permitted by computational resources, to observe relevant conformational transitions.
Analysis of Trajectory:
- Analyze the MD trajectory to extract thermodynamic and kinetic properties.
- Key Analyses:
  - Root Mean Square Deviation (RMSD): Measure overall structural stability.
  - Root Mean Square Fluctuation (RMSF): Identify flexible regions.
  - Principal Component Analysis (PCA): Identify the major modes of conformational change.
  - Free Energy Landscapes: Project the trajectory onto key reaction coordinates (e.g., inter-domain distances, dihedral angles) to compute the free energy landscape and identify metastable states [45].

Workflow and Pathway Visualizations

Conformational Selection in Ligand Binding

MLIP-Driven Conformational Analysis Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Research Reagents and Computational Tools

Reagent / Tool	Function / Application	Example Use Case
Disulfide by Design Software [47]	Computational identification of residue pairs for engineered disulfide bonds.	Arresting adenylate kinase in a high-energy closed conformation for biophysical study.
Propargyl-DO3A-Gd(III) Tag [45]	Biorthogonal spin label for distance measurements via Double Electron-Electron Resonance (DEER).	Measuring conformational states and populations in solution for ancestral CDT variants.
PDBbind Database [48] [49] [50]	Curated database of protein-ligand complexes with binding affinity data.	Benchmarking and training machine learning scoring functions for binding affinity prediction.
Universal MLIPs (e.g., M3GNet, CHGNet) [2] [17]	Pre-trained machine learning models for accurate energy and force prediction across diverse chemical spaces.	Running large-scale molecular dynamics simulations of biomolecules to study conformational transitions.
AlphaFold [48]	Protein structure prediction tool.	Providing reliable 3D structural models for proteins when experimental structures are unavailable.
Ancestral Sequence Reconstruction [45]	Computational and experimental method to infer and characterize ancient proteins.	Studying the evolution of conformational landscapes and catalytic activity in enzymes.

Application Note: Machine Learning Interatomic Potentials as a Computational Foundation

Machine learning interatomic potentials (ML-IAPs) have emerged as a transformative tool in computational materials science, bridging the critical gap between the high accuracy but computational cost of quantum mechanical methods like Density Functional Theory (DFT) and the efficiency but limited accuracy of classical force fields [1] [51]. ML-IAPs are surrogate models trained on high-fidelity ab initio data that learn the mapping from atomic coordinates to energies and forces, implicitly encoding electronic effects to faithfully recreate the potential energy surface (PES) without explicitly propagating electronic degrees of freedom [1]. This capability enables accurate atomistic simulations over extended time and length scales, which is indispensable for studying complex materials phenomena such as phase transitions, defect dynamics, and catalytic processes [1] [51].

The robustness of modern ML-IAPs stems from advanced neural network architectures, particularly graph neural networks (GNNs) and equivariant models, which explicitly embed the physical symmetries of atomic systems (e.g., rotational, translational, and reflection invariances) [1]. By preserving these symmetries, models like NequIP and MACE achieve greater data efficiency and improved accuracy for predicting both scalar (energy) and vector/tensor (forces, dipole moments) quantities [1]. The development of universal ML-IAPs (uMLIPs), such as M3GNet, CHGNet, and MACE-MP-0, represents a significant advancement, providing foundational models capable of handling diverse chemistries and crystal structures without system-specific retraining [17]. Benchmark studies demonstrate that these uMLIPs can predict critical properties like harmonic phonon spectra with high accuracy, a essential factor for understanding thermal and vibrational behavior in materials [17].

Table 1: Key Machine Learning Interatomic Potential Models and Their Characteristics

Model Name	Architectural Features	Key Capabilities	Reported Performance
DeePMD [1]	Deep neural network with local environment descriptors	High accuracy for large-scale molecular dynamics	Energy MAE <1 meV/atom, Force MAE <20 meV/Å on water datasets
M3GNet [17]	Graph network with three-body interactions	Universal potential for molecules and crystals	Pioneering uMLIP; featured in Matbench Discovery leaderboard
CHGNet [17]	Graph network with small architecture	Universal potential with high computational efficiency	Low failure rate (0.09%) in geometry relaxations; over 400k parameters
MACE-MP-0 [17]	Atomic cluster expansion descriptor	High data efficiency and reduced message-passing steps	Ranked highly in uMLIP benchmarks for phonon properties
NequIP [1]	Equivariant architecture with higher-order tensors	Excellent data efficiency and accuracy for vector/tensor properties	Superior performance in predicting forces and dynamical properties
eqV2-M [17]	Equivariant transformers	Higher-order equivariant representations	Ranked 1st in Matbench Discovery at time of publication

Application Note & Protocol: Phase-Change Materials for Memory Devices

Application Note

Phase-change materials (PCMs) are a class of compounds that rapidly and reversibly switch between amorphous and crystalline phases, accompanied by significant changes in electrical and optical properties [52]. This makes them ideal for non-volatile memory and neuro-inspired computing. The core functionality relies on the kinetics of the phase transition, a process that occurs across extended time scales (nanoseconds to microseconds) and involves complex structural rearrangements of atoms, making it an ideal application for ML-IAPs [51]. ML-IAPs have been successfully trained on ab initio data for PCMs like Ge-Sb-Te (GST) alloys, enabling large-scale molecular dynamics simulations that accurately model the crystallization process, identify defect formation energies, and predict the electronic properties of different phases at a fraction of the computational cost of DFT [1] [51].

Experimental Protocol: Simulating Crystallization Kinetics in Ge₂Sb₂Te₅

Objective: To simulate the crystallization process of Ge₂Sb₂Te₅ from its amorphous phase and calculate key kinetic parameters using an ML-IAP.

Materials & Computational Reagents:

Initial Structure: An amorphous Ge₂Sb₂Te₅ structure (e.g., 180-360 atoms) prepared via melt-quenching using classical MD.
ML-IAP: A pre-trained ML-IAP for Ge-Sb-Te systems (e.g., trained on DeePMD or NequIP frameworks) [1] [51].
Simulation Engine: Molecular dynamics software compatible with the ML-IAP, such as LAMMPS with a DeePMD-kit interface.
Reference Data: DFT-calculated energies and forces for the training set, typically obtained using VASP or Quantum ESPRESSO.

Procedure:

Model Training and Validation: a. Data Generation: Perform ab initio molecular dynamics (AIMD) on a small model system at various temperatures to sample relevant configurations of both amorphous and crystalline phases. b. Training: Train the ML-IAP on the generated dataset, ensuring low errors in energy and forces on a held-out test set. c. Validation: Validate the potential by comparing properties like radial distribution functions and bond-angle distributions with independent DFT calculations.

Molecular Dynamics Simulation: a. Equilibration: Equilibrate the larger amorphous structure at the target temperature (e.g., 600 K) for 100 ps using the ML-IAP in an NPT ensemble (constant number of atoms, pressure, and temperature). b. Production Run: Perform a long-timescale MD simulation (hundreds of nanoseconds to microseconds) in the NVT ensemble (constant number of atoms, volume, and temperature). c. Analysis: i. Crystallization Time: Monitor the potential energy and volume over time; a sharp drop indicates crystallization. ii. Mean-Square Displacement (MSD): Calculate MSD of atoms to quantify atomic diffusion. iii. Structural Analysis: Use order parameters (e.g., Steinhardt bond-order parameters) to distinguish between amorphous and crystalline domains and track the evolution of the crystal front.

Troubleshooting Tips:

If crystallization does not occur on a feasible simulation timescale, consider starting from a partially crystalline seed or increasing the simulation temperature (while remaining below the melting point).
If the ML-IAP produces unphysical forces, revisit the training set to ensure it adequately covers the transition pathway and consider employing active learning to augment the dataset.

Diagram 1: Computational workflow for PCM crystallization simulation.

Application Note & Protocol: Defect Dynamics in Electrocatalysts

Application Note

Defect engineering is a cornerstone of modern electrocatalyst design, where intrinsic point defects (e.g., vacancies, interstitials) and extrinsic doping defects are intentionally introduced to modulate the local electronic structure and surface properties of catalysts, thereby enhancing their activity, selectivity, and stability [53]. Under operating conditions, these defects are not static; they undergo dynamic reconstruction, which is crucial for catalytic performance [54] [53]. ML-IAPs are exceptionally suited to study this dynamic evolution because they can simulate thousands of atoms over time scales long enough to observe defect migration and interaction, all while maintaining near-DFT accuracy [1] [53]. This allows researchers to move beyond static defect models and probe the finite-temperature behavior of defects under realistic electrochemical conditions.

Experimental Protocol: Modeling Oxygen Vacancy Migration in Metal Oxide Electrocatalysts

Objective: To calculate the migration energy barrier and pathway of an oxygen vacancy in a transition metal oxide (e.g., Co₃O₄) under operational conditions using ML-IAP-driven molecular dynamics.

Materials & Computational Reagents:

Model System: A supercell (e.g., 3x3x2) of the spinel Co₃O₄ crystal structure.
Defect Creation: A single oxygen vacancy (Vₒ) created by removing one oxygen atom from the supercell.
ML-IAP: A universal ML-IAP (e.g., CHGNet, MACE-MP-0) pre-trained on diverse oxide materials or a specific potential fine-tuned for the Co-O system [17].
Simulation Engine: Molecular dynamics software with enhanced sampling capabilities (e.g., LAMMPS, ASE).

Procedure:

System Preparation and Equilibration: a. Construct the perfect Co₃O₄ supercell and introduce one oxygen vacancy. b. Relax the defective structure to its minimum energy configuration using the ML-IAP until forces on all atoms are below 0.01 eV/Å.

Nudged Elastic Band (NEB) Calculation: a. Identify Endpoints: Using the relaxed structure, manually create a second configuration where the vacancy has moved to a neighboring oxygen site. Relax this final state. b. Interpolation: Generate a chain of images (typically 5-7) between the initial and final states. c. NEB Optimization: Use the ML-IAP to compute energies and forces for each image and optimize the reaction pathway to find the saddle point. d. Barrier Extraction: The migration energy barrier is the energy difference between the saddle point image and the initial state.
Metadynamics or Umbrella Sampling (Optional): For a more thorough sampling of the free energy landscape at operating temperature, perform well-tempered metadynamics or umbrella sampling simulations using a collective variable such as the coordination number of the migrating oxygen atom or the distance between the vacancy and a metal cation.

Troubleshooting Tips:

If the NEB calculation fails to converge or shows a discontinuous path, increase the number of images or use the climbing-image NEB method.
For complex diffusion pathways, using a machine-learned potential in conjunction with advanced sampling methods is significantly more efficient than using pure DFT [1] [51].

Table 2: Research Reagent Solutions for Defect Dynamics Studies

Research Reagent	Function/Description	Example in Protocol
Universal ML-IAP (uMLIP)	Foundational model trained on diverse materials databases for general applicability without system-specific training.	CHGNet or MACE-MP-0 for simulating Co₃O₄ [17]
Nudged Elastic Band (NEB)	An optimization technique to find the minimum energy path and energy barrier for transient events like defect migration.	Calculating the oxygen vacancy migration pathway [51]
Metadynamics	An enhanced sampling method to accelerate the exploration of free energy surfaces and rare events.	Probing the full free energy landscape of vacancy migration [51]
Supercell Model	A repetition of the unit cell used to model a defect with periodic boundary conditions, minimizing spurious interactions.	3x3x2 supercell of Co₃O₄ to host an isolated oxygen vacancy

Application Note & Protocol: Dynamic Reconstruction of Electrocatalysts for CO₂ Reduction

Application Note

The electrochemical CO₂ reduction reaction (CO₂RR) is a promising pathway for converting greenhouse gases into value-added chemicals. Electrocatalysts for CO₂RR are dynamic entities whose structures evolve in operando, undergoing reconstruction that directly correlates with their catalytic performance [54]. This reconstruction can involve changes in oxidation state, surface amorphization, and the formation of new phases, posing a significant challenge for rational catalyst design. ML-IAPs, particularly when integrated with in situ/operando characterization data, provide a powerful platform to simulate these reconstruction processes atom-by-atom, offering mechanistic insights that are difficult to obtain experimentally [54] [53].

Experimental Protocol: Simulating Surface Reconstruction of a Cu-based CO₂RR Electrocatalyst

Objective: To simulate the potential-induced surface reconstruction of a Cu(100) catalyst in an electrochemical environment and identify the resulting active sites.

Materials & Computational Reagents:

Initial Structure: A slab model of the Cu(100) surface with a vacuum layer and explicit solvation (water molecules).
Electrochemical Potential: Implicit or explicit control of the electrode potential.
ML-IAP: A robust potential for metal-water interfaces, such as a DeePMD model trained on Cu-H₂O DFT data.
Simulation Engine: Ab initio molecular dynamics software with ML-IAP support.

Procedure:

Model and Training Set Construction: a. Build a Cu(100)-water interface model. b. Generate a diverse training set using AIMD simulations of the interface at various potentials (controlled via the computational hydrogen electrode approach or by applying an electric field). c. Train an ML-IAP on this dataset, ensuring accurate reproduction of Cu-water and water-water interactions.

Operando Simulation via ML-IAP-MD: a. Equilibration: Equilibrate the system at the target temperature (e.g., 300 K) for 50 ps. b. Potential Control: Apply a constant electric field corresponding to a reducing potential relevant for CO₂RR (e.g., -1.0 V vs. RHE). c. Long-timescale MD: Run a multi-nanosecond ML-IAP-MD simulation to observe surface dynamics. d. Analysis: i. Surface Structure: Track the root-mean-square deviation (RMSD) of surface Cu atoms to identify restructuring. ii. Coordination Number: Calculate the average coordination number of surface atoms; a decrease may indicate roughening or adatom formation. iii. Adsorption Site Analysis: Identify the formation of under-coordinated sites (e.g., steps, kinks) that are known to be active for CO₂ activation.

Troubleshooting Tips:

The choice of water model and the method for applying the electric field are critical. Consistency between the training data (DFT) and the simulation (ML-IAP) is paramount.
If reconstruction is slow, consider starting from a metastable surface structure or slightly elevating the simulation temperature to accelerate dynamics while remaining physically meaningful.

Diagram 2: Workflow for simulating electrocatalyst surface reconstruction.

Optimizing MLIP Performance: Addressing Training Challenges and Uncertainty

In the field of machine learning interatomic potentials (MLIPs), the choice of data selection strategy is crucial for developing accurate, transferable, and computationally efficient models. MLIPs serve as surrogate potential energy surfaces (PES), enabling molecular dynamics simulations with near-ab initio accuracy at a fraction of the computational cost [1] [21]. The performance and data efficiency of these models are profoundly influenced by how training configurations are selected. This application note examines two fundamental paradigms—active learning and passive learning—within the context of MLIP development for accurate force and energy calculations. We provide a quantitative comparison of these strategies, detailed experimental protocols, and practical guidance for researchers seeking to implement these approaches in materials science and drug development applications.

Comparative Analysis of Learning Strategies

Core Definitions and Conceptual Framework

Passive Learning involves training MLIPs on a static, pre-generated dataset of atomic configurations, typically derived from ab initio molecular dynamics (AIMD), density functional theory (DFT) calculations, or structure enumeration [39]. The dataset is constructed upfront without feedback from the developing model, requiring broad coverage of the PES to ensure transferability, which often necessitates a large number of computationally expensive ab initio calculations.

Active Learning is an iterative, query-based strategy that starts with a small initial dataset. The MLIP is trained, and then used to run simulations (e.g., molecular dynamics). During these simulations, atomic configurations that trigger a high "extrapolation" threshold—indicating the model is operating outside its learned domain—are automatically flagged. These informative configurations are sent for ab initio calculation and then added to the training set, creating a feedback loop that dynamically expands the dataset based on the model's uncertainty [39] [55].

Quantitative Performance Comparison

The following table synthesizes performance metrics from recent studies that directly compare active and passive learning for developing MLIPs, highlighting gains in data efficiency and accuracy.

Table 1: Quantitative Comparison of Active vs. Passive Learning for MLIP Development

Study and Material System	Learning Strategy	Key Performance Metrics	Data Efficiency
TiAlNb Alloys [39]	Active Learning (DeePMD)	Produced a single potential capable of predicting properties of both γ-TiAl and α2-Ti3Al phases.	Achieved high performance with only a fraction of the training samples required by passive learning.
TiAlNb Alloys [39]	Passive Learning (DeePMD)	Required extensive datasets (e.g., 333,340 configurations) and struggled to capture α2-Ti3Al phase characteristics accurately.	Lower data efficiency; required large, static datasets.
TiAlNb Alloys [39]	Active Learning (MTP)	Outperformed passive MTP but exhibited limitations requiring separate training for each phase.	Required fewer samples than passive MTP to achieve comparable error levels.
CoCrFeMnNi HEA [55]	Active Learning (MTP)	Potential outperformed a traditional MEAM potential; demonstrated high accuracy in defect properties, tensile deformation, and nano-indentation.	Dynamic retraining on extrapolative configurations ensured robustness with a minimized dataset.

A key finding across studies is that active learning consistently outperforms passive learning while requiring only a fraction of the training samples [39]. This strategy directly targets the gaps in the model's knowledge, leading to superior data efficiency. For instance, in developing potentials for TiAlNb alloys, active learning with DeePMD yielded a single, unified potential for multiple phases, a task that posed significant challenges for passive methods [39]. It is important to note that the sample selection bias in active learning (e.g., towards higher-energy, non-equilibrium states) can sometimes lead to less accurate predictions for certain properties, such as room-temperature behavior, if the training is not carefully balanced [39].

Experimental Protocols

Protocol for Active Learning of MLIPs

This protocol outlines the steps for developing an MLIP using an active learning strategy, as applied in systems like TiAlNb alloys and CoCrFeMnNi high-entropy alloys [39] [55].

Step 1: Initial Dataset Generation

Select a set of diverse initial structures (e.g., perfect crystals, surfaces, isolated point defects) representative of the chemical system.
Perform ab initio (DFT) calculations on these structures to obtain reference energies, forces, and stresses.
This initial dataset can be relatively small (e.g., a few hundred configurations).

Step 2: MLIP Training

Train an initial MLIP (e.g., a DeePMD or MTP model) on the current dataset.
Validate the model on a separate, held-out test set to monitor for overfitting and establish baseline errors.

Step 3: Sampling and Configuration Querying

Use the trained MLIP to perform exploratory molecular dynamics (MD) or Monte Carlo (MC) simulations at target thermodynamic conditions (e.g., high temperatures to accelerate phase space exploration).
During these simulations, continuously compute the model's extrapolation index or uncertainty metric (e.g., the gamma factor in MTP [55]).
Flag and collect atomic configurations where the uncertainty exceeds a predefined threshold.

Step 4: Ab Initio Labeling and Dataset Augmentation

Perform ab initio calculations on the queried, high-uncertainty configurations to obtain accurate energies and forces.
Add these newly labeled configurations to the training dataset.

Step 5: Iterative Retraining

Retrain the MLIP on the augmented dataset.
Repeat Steps 3-5 until the model's performance and stability converge, as measured by low and stable errors on the test set and no new high-uncertainty configurations being encountered in sampling runs.

Protocol for Passive Learning of MLIPs

Step 1: Comprehensive Dataset Construction

Generate a large and diverse set of atomic configurations a priori. This typically involves:
- AIMD Trajectories: Running AIMD at various temperatures and pressures.
- Structural Enumeration: Creating supercells with vacancies, substitutions, and interstitial defects.
- Elastic and Phonon Deformations: Applying strains to perfect crystals to probe the energy landscape around minima.
Compute reference energies, forces, and stresses for all configurations using ab initio methods.

Step 2: Single-Shot Training

Train the MLIP on the entire static dataset in a single, comprehensive training run.
This process does not involve any feedback loops or further ab initio calculations.

Step 3: Validation and Benchmarking

Validate the final model against a reserved test set and benchmark its predictions against key material properties not directly included in the training (e.g., elastic constants, thermal expansion, phase stability).

Workflow Visualization

The following diagram illustrates the iterative feedback loop that characterizes the active learning approach for MLIP development, contrasting it with the linear workflow of passive learning.

This section details key computational "reagents" and tools required for implementing the protocols described above.

Table 2: Essential Tools and Resources for MLIP Development

Category	Item	Function and Description	Examples/Implementations
MLIP Frameworks	DeePMD-kit [39]	A deep learning package for constructing DeePMD models; maps atomic configurations to energies and forces via deep neural networks.	DeePMD-kit
	Moment Tensor Potential (MTP) [39] [55]	An MLIP using moment tensor descriptors to represent atomic environments; well-suited for active learning with built-in extrapolation detection.	MLIP, LAMMPS interface
Ab Initio Codes	DFT Calculators	Provides high-fidelity reference data (energies, forces, stresses) for training and labeling configurations.	VASP, Quantum ESPRESSO, ABINIT
Simulation Engines	MD Engines	Performs sampling and production MD simulations using the developed MLIPs.	LAMMPS (widely used with DeePMD and MTP)
Data & Infrastructure	Training Datasets	Collections of atomic configurations with reference ab initio data for training and benchmarking.	Public datasets [56]; project-specific AIMD/DFT data
	Computational Hardware	Accelerates the training of MLIPs and the execution of ab initio calculations.	GPUs (critical for efficient training), High-performance computing (HPC) clusters

Active learning represents a paradigm shift in the development of machine learning interatomic potentials, offering superior data efficiency and model robustness compared to traditional passive learning. By strategically querying ab initio data for the most informative configurations, this method builds accurate potentials with significantly reduced computational cost for data generation. The provided protocols and toolkit equip researchers to implement these strategies, accelerating reliable atomistic simulations for materials discovery and molecular design.

Uncertainty Quantification (UQ) is a critical component in the development and deployment of reliable machine learning interatomic potentials (MLIPs) for force prediction. Ensemble-based approaches, which aggregate predictions from multiple models, have emerged as popular and practical methods for UQ. This application note details the core ensemble methodologies, their implementation protocols for MLIPs, and a critical analysis of their limitations, particularly for force and energy predictions in atomistic simulations. Framed within the broader context of force calculation research, this document provides researchers and drug development professionals with structured data, experimental workflows, and a toolkit for applying and evaluating these UQ techniques, supported by benchmarking results from recent studies.

In machine learning, ensemble learning combines multiple base models (weak learners) to create a single, stronger predictive model. This approach often yields better performance than any single constituent model for both regression and classification tasks [57]. The fundamental principle is that by aggregating the predictions of several diverse models, the overall bias and variance can be reduced, leading to more robust and accurate predictions. Common aggregation techniques include max voting for classification, averaging for regression, and weighted averaging where models with higher predictive power are assigned greater importance [57].

For Machine Learning Interatomic Potentials (MLIPs), which enable atomistic simulations with near first-principles accuracy at substantially reduced computational cost, accuracy on held-out datasets of ab initio energies and atomic forces does not guarantee reliability for emergent, system-level behavior [58]. Uncertainty Quantification (UQ) provides tools to evaluate this reliability, which is paramount in high-stakes applications like materials design and drug development. The two primary types of uncertainty are aleatoric uncertainty, caused by inherent noise in the data, and epistemic uncertainty, arising from limited knowledge or data about the model parameters [59]. Ensemble methods are primarily used to quantify epistemic uncertainty. In MLIPs, UQ is essential not only for building trust in simulations but also for guiding active learning workflows, where uncertain predictions can flag the need for additional ab initio data [60].

Core Ensemble Methods for Uncertainty Quantification

Several ensemble strategies have been developed and adapted for UQ in deep learning models, including MLIPs. The table below summarizes the primary ensemble-based UQ methods, their mechanisms, and key characteristics.

Table 1: Core Ensemble-Based Uncertainty Quantification Methods

Method	Core Mechanism	Key Characteristics	Common Use in MLIPs
Bootstrap Ensembles [58] [59]	Trains multiple models on different random subsets (with replacement) of the training data.	Simple to implement; provides a distribution of predictions.	A baseline method for assessing model uncertainty.
Random Initialization Ensembles [58]	Trains multiple models with the same architecture but different random initializations.	Leverages variability in non-convex optimization; computationally expensive.	Common in neural network potentials to capture parameter uncertainty.
Snapshot Ensembles [58]	Uses multiple snapshots of a single model's parameters from different stages of the training cycle.	More efficient than training multiple independent models.	Used to approximate an ensemble from a single training run.
Dropout Ensembles [59]	Activates dropout layers at inference time to generate stochastic forward passes, effectively creating a pseudo-ensemble.	Not a true ensemble; efficient as it uses a single model.	Often used as a lightweight UQ method for large neural network potentials.

The following diagram illustrates the logical workflow and relationships between different ensemble UQ methods and their role in a typical MLIP development cycle.

Performance Benchmarking and Limitations

Quantitative Performance in Material Property Prediction

Ensemble UQ methods are widely used as a heuristic for predictive accuracy. However, systematic benchmarks reveal critical nuances in their performance. A large-scale benchmark of universal MLIPs (uMLIPs) on nearly 11,000 elastically stable materials provides a quantitative comparison for property prediction, which is intrinsically linked to force and energy accuracy [26].

Table 2: Benchmarking uMLIP Performance on Elastic Properties [26]

uMLIP Model	Key Architectural Feature	Reported Performance on Elastic Properties
SevenNet	Scalable EquiVariance-Enabled Neural Network	Achieved the highest accuracy in the benchmark.
MACE	Combines Atomic Cluster Expansion with higher-order equivariant message passing.	Balanced accuracy with computational efficiency.
MatterSim	A large-scale, symmetry-preserving force field based on the M3GNet architecture.	Balanced accuracy with computational efficiency.
CHGNet	Crystal Hamiltonian Graph Neural Network that incorporates charge information.	Performed less effectively overall in this benchmark.

Documented Limitations of Ensemble UQ

Despite their utility, ensemble-based UQ methods possess significant limitations, especially when MLIPs are applied in extrapolative regimes:

Counterintuitive Behavior under Distribution Shift: A systematic study of ensemble-based UQ for neural network interatomic potentials found that uncertainty estimates can behave counterintuitively in out-of-distribution (OOD) settings. Predictive errors can grow while the associated uncertainty estimates plateau or even decrease, providing a false sense of security [58]. This highlights that predictive precision (inverse uncertainty) is not a reliable proxy for accuracy in large-scale, extrapolative applications.
Inability to Capture Model Misspecification: A key, often disregarded, source of error is model misspecification—the inability of any single set of model parameters to exactly fit all training data due to constraints in model architecture. Commonly used loss-based UQ schemes, including standard ensembles, can significantly underestimate parameter uncertainties because the training loss is only sensitive to epistemic and aleatoric uncertainties, not misspecification [60]. This is a dominant form of uncertainty when fitting underparameterized models to deterministic data like DFT calculations.
Computational Overhead: While methods like snapshot ensembles offer improvements, generating a full ensemble of neural networks—each of which may be a large graph model like MACE or CHGNet—incurs a non-trivial computational cost during both training and inference compared to using a single model [58] [60].

Application Notes and Protocols

Protocol for Implementing Ensemble UQ in MLIP Training

The following workflow provides a detailed protocol for implementing and evaluating ensemble UQ for a neural network interatomic potential, drawing from methodologies used in recent comparative studies [58] [61].

Detailed Protocol Steps:

Dataset Curation & Splitting:
- Gather a diverse set of atomic configurations with corresponding ab initio energies and forces. Diversity in element composition, crystal structures, and local atomic environments is critical for robustness [60].
- Split the data into training, validation, and a held-out test set. The test set should ideally contain configurations that are structurally or chemically distinct from the training data to assess OOD performance.
Ensemble Model Generation: Choose one or more ensemble generation strategies. A comparative study often employs multiple in tandem [61]:
- Bootstrap Ensembles: Create multiple training dataset subsets by sampling with replacement. Train one model on each subset.
- Random Initialization Ensembles: Train ( N ) models (e.g., 5-10) with identical architecture and hyperparameters but different random weight initializations on the full training set.
- Snapshot Ensembles: Train a single model using a cyclic learning rate schedule, saving the model parameters (snapshots) at the end of each cycle.
Parallel Model Training: Train all models in the ensemble. Utilizing a framework like PyTorch Lightning [59] can help standardize and parallelize this process.
Prediction & Aggregation: For a new atomic configuration, perform a forward pass with each model in the ensemble to obtain a distribution of predictions for energy and forces. Aggregate the results (e.g., mean for energy, mean for forces).
Uncertainty Quantification: Calculate the standard deviation or variance across the ensemble's predictions for energies and forces. This serves as the quantitative measure of epistemic uncertainty.
Validation on Target Properties:
- In-Distribution (ID) Validation: Calculate standard metrics (MAE, RMSE) on the held-out test set that is similar to the training data. Correlate low UQ values with low prediction errors [58].
- Out-of-Distribution (OOD) Testing: This is critical for assessing real-world reliability. Use the ensemble to predict more complex, system-level properties that were not directly trained, such as:
  - Cold curve energies (energy versus volume) [58].
  - Phonon dispersion relations [58].
  - Phase diagrams and phase stability [62].
  - Elastic constants and derived moduli (bulk, shear) [26].
- Compare the MLIP predictions with direct ab initio calculations or experimental data where available. Critically analyze whether the uncertainty estimates from the ensemble correctly reflect the observed errors.

The Scientist's Toolkit: Research Reagents & Computational Solutions

Table 3: Essential Software Tools and Frameworks for Ensemble UQ in MLIP Research

Tool / Resource	Type	Primary Function	Relevance to Ensemble UQ
Torch-Uncertainty [59]	Software Library	A PyTorch-based framework for training and evaluating deep learning models with UQ.	Provides modular implementations of ensemble methods (and other UQ families) and standardized metrics for benchmarking.
Atomic Simulation Environment (ASE)	Software Library	A set of Python tools for setting up, manipulating, running, visualizing, and analyzing atomistic simulations.	Serves as a common platform for running simulations with various MLIPs and processing the results for UQ analysis.
Materials Project Database [26]	Public Database	A repository of computed material properties for over 150,000 materials and molecules.	Provides a source of diverse structures and reference data (e.g., elastic constants) for OOD testing and benchmarking.
MACE, CHGNet, SevenNet [26]	Pre-trained MLIPs	State-of-the-art universal machine learning interatomic potentials.	Serve as both subjects for UQ benchmarking and base architectures for developing new ensemble models.
PhaseForge [62]	Computational Workflow	A program that integrates MLIPs with the ATAT toolkit for phase diagram calculation.	An example of an application-oriented framework for evaluating MLIP effectiveness, including uncertainty, on thermodynamic properties.

The pursuit of universal machine learning interatomic potentials (ML-IAPs) represents a paradigm shift in computational materials science, offering the promise of near ab initio accuracy across extended spatiotemporal scales [1]. These data-driven surrogates implicitly encode electronic effects through training on quantum mechanical reference data, enabling faithful recreation of potential energy surfaces (PES) across diverse chemical environments [1]. However, a critical challenge persists: models that demonstrate exemplary in-distribution (ID) performance often fail catastrophically when confronted with out-of-distribution (OOD) data involving unseen chemistries or structural symmetries [63].

The fundamental issue lies in the conflation of precision with accuracy in OOD contexts. As demonstrated in materials science settings, heuristic evaluations of generalizability often lead to biased conclusions about model capabilities [63]. Surprisingly, many tasks labeled as OOD demonstrate good performance across models because most test data reside within regions well-covered by training data, while genuinely challenging OOD tasks involve data outside the training domain [63]. This domain misidentification leads to systematic overestimation of true generalizability and scaling benefits.

Current OOD detection methods face fundamental limitations. Feature-based methods incorrectly conflate far feature-space distance with being OOD, while uncertainty-based methods incorrectly conflate high uncertainty with being OOD [64]. This conceptual misalignment means these procedures answer fundamentally different questions than "is this point drawn from a different distribution?" [64]. For ML-IAPs deployed in drug development and materials discovery, where novel chemical spaces are routinely explored, these limitations pose significant reliability concerns.

Quantitative Benchmarking of ML-IAP Performance

Performance Across Chemical Systems

Robust evaluation requires benchmarking across diverse OOD tasks. Recent assessments in materials science have examined over 700 OOD tasks designed to challenge common heuristics based on chemistry or structural symmetry [63]. The results reveal surprising generalization capabilities alongside systematic failure modes.

Table 1: Leave-One-Element-Out Generalization Performance on Formation Energy Prediction (Materials Project Dataset)

Model Architecture	Tasks with R² > 0.95	Problematic Elements	Systematic Bias Observed
ALIGNN	85%	H, F, O	Overestimation of formation energies
XGBoost	68%	H, F, O	Overestimation for H/F; mixed for O
Random Forest	~65%*	H, F, O*	Similar systematic biases*
GMP-NN	~75%*	H, F, O*	Overestimation tendencies*
LLM-Prop	~70%*	H, F, O*	Consistent with deep learning models*

Note: Values marked with * are estimated from trend analysis in the source material [63]

The data reveals that 85% of leave-one-element-out tasks for the ALIGNN model and 68% for the simpler XGBoost model achieved R² scores above 0.95 [63]. This broad generalization across most of the periodic table is surprising given that training sets contain no information about bonding with the left-out elements. However, significant failures occur consistently with nonmetals (H, F, O), where systematic overestimation of formation energies dominates error patterns [63].

Phonon Property Prediction Accuracy

Phonon properties derived from second derivatives of the PES provide stringent tests of ML-IAP robustness. Recent benchmarks of seven universal ML-IAPs on approximately 10,000 ab initio phonon calculations reveal substantial variations in OOD capabilities [17].

Table 2: Universal ML-IAP Performance on Phonon Properties and Structural Relaxation

Model	Force Convergence Failure Rate	Energy MAE (eV/atom)	Volume MAE (Å³/atom)	Phonon Frequency MAE
CHGNet	0.09%	~0.05*	~0.10*	Medium accuracy
MatterSim-v1	0.10%	~0.04*	~0.09*	High accuracy
M3GNet	~0.15%*	~0.035*	~0.08*	Medium accuracy
MACE-MP-0	~0.15%*	~0.03*	~0.07*	High accuracy
SevenNet-0	~0.15%*	~0.03*	~0.07*	Medium accuracy
ORB	~0.60%*	~0.03*	~0.06*	Variable accuracy
eqV2-M	0.85%	~0.02*	~0.05*	High accuracy

Note: Values marked with * are estimated from trend analysis in the source material [17]

Notably, models with the lowest energy MAEs don't necessarily demonstrate the best OOD reliability. The eqV2-M model, while achieving exceptional energy accuracy (∼0.02 eV/atom), exhibits the highest failure rate for force convergence (0.85%) [17]. This precision-accuracy misalignment highlights the danger of optimizing metrics that don't reflect real-world OOD performance.

Experimental Protocols for OOD Assessment

Protocol 1: Leave-One-Group-Out Cross-Validation

Purpose: To evaluate model generalization to completely unseen chemical families or structural classes.

Methodology:

Data Partitioning: Split datasets using chemically meaningful criteria:
- Leave out all compounds containing a specific element (e.g., oxygen)
- Exclude entire groups of the periodic table (e.g., halogens)
- Remove specific crystal systems (e.g., all monoclinic structures)
- Omit space groups with distinct symmetry properties [63]

Evaluation Metrics:
- Primary: Coefficient of determination (R²) and Mean Absolute Error (MAE)
- Secondary: Systematic bias analysis via parity plots
- Tertiary: Failure rate analysis during geometry optimization [17]
SHAP-Based Bias Attribution:
- Train a correction model for each leave-one-out task
- Quantify contributions from compositional versus structural features
- Identify source of OOD failures (chemical vs. geometric origins) [63]

Implementation Considerations:

Minimum test set size of 200 samples to ensure statistical significance
Multiple random seeds to account for training stochasticity
Analysis of latent space coverage to distinguish interpolation from true extrapolation

Protocol 2: Phonon Spectrum Validation

Purpose: To assess model performance on second-order derivatives of the PES, which are highly sensitive to OOD failures.

Methodology:

Dataset Construction:
- Utilize specialized phonon databases (e.g., MDR database with ∼10,000 phonon calculations)
- Ensure functional consistency (PBE vs. PBEsol) with training data [17]

Calculation Workflow:
- Perform geometry optimization to ground state using ML-IAP
- Compute dynamical matrix via finite displacements or density functional perturbation theory
- Calculate phonon densities of states and dispersion relations
- Compare with ab initio reference results [17]
Error Analysis:
- Quantify phonon band center MAEs
- Identify soft modes (imaginary frequencies) indicating dynamical instability
- Assess thermodynamic property accuracy (e.g., vibrational free energy)

Diagnostic Interpretation:

High-frequency phonon inaccuracies suggest short-range interaction failures
Low-frequency errors indicate problems with long-range potentials
Systematic shifts reveal biases in curvature of the PES

Diagram 1: OOD assessment workflow for systematic evaluation of ML-IAP robustness across distribution shifts.

Visualization Framework for OOD Diagnostics

Precision-Accuracy Discrepancy Visualization

The fundamental challenge in OOD detection arises from the conceptual misalignment between what supervised models learn and what true OOD detection requires [64]. This discrepancy manifests in two primary failure modes:

Uncertainty-Based Method Failures: These methods answer "Is the model uncertain about which ID label to assign?" but OOD detection requires asking whether the input comes from a different data distribution entirely [64]. This leads to failures when:

ID inputs are inherently ambiguous (high uncertainty but not OOD)
OOD inputs are confidently misclassified (low uncertainty but truly OOD)

Feature-Based Method Failures: These approaches answer "Does this input lead to features that are far from training features?" but cannot reliably identify OOD points because [64]:

ID and OOD feature distributions can significantly overlap
Selecting discriminative dimensions without OOD data is challenging
Feature distance doesn't necessarily correlate with distributional shift

Diagram 2: Fundamental limitations of supervised models for OOD detection, leading to systematic failure modes.

Latent Space Coverage Analysis

A critical insight from recent studies is that most heuristic OOD tests actually reflect interpolation rather than true extrapolation [63]. Visualization of the materials representation space shows that test data from well-performing OOD tasks largely reside within the training domain, whereas genuinely challenging tasks involve data outside this domain [63].

Implementation Protocol:

Representation Extraction: Compute latent embeddings for both training and OOD test sets
Dimensionality Reduction: Apply PCA or t-SNE for 2D/3D visualization [65]
Density Estimation: Calculate training data coverage in latent space regions
Coverage-Error Correlation: Map model performance to training data density

Diagnostic Interpretation:

High performance in high-density regions indicates successful interpolation
Performance degradation in low-density regions reveals true extrapolation limits
Systematic biases indicate representation learning failures for specific chemical domains

Research Reagent Solutions

Table 3: Essential Computational Tools for ML-IAP Development and OOD Assessment

Research Reagent	Function	Implementation Considerations
Benchmark Datasets
Materials Project [63]	Provides DFT-derived structures and properties for training and validation	Contains > 140,000 materials; bias toward oxides and equilibrium structures
JARVIS [63]	Complementary materials database with different distribution characteristics	Enh robustness of OOD evaluation across data sources
OQMD [63]	Large-scale quantum materials database	Expands chemical diversity for OOD testing
MDR Phonon Database [17]	Specialized dataset for phonon property validation	Enables stringent testing of PES curvature prediction
ML-IAP Architectures
ALIGNN [63]	Graph neural network incorporating bond angles	Strong performance on chemical OOD tasks
CHGNet [17]	Compact architecture with charge information	High reliability despite smaller parameter count
MACE [17]	Atomic cluster expansion with message passing	Leading accuracy-force convergence tradeoff
Equivariant Transformers [17]	Higher-order equivariant representations	State-of-the-art accuracy with potential convergence issues
Assessment Tools
SHAP Analysis [63]	Quantifies compositional vs. structural contributions to errors	Identifies sources of OOD failure modes
Latent Space Visualization [65]	Projects high-dimensional representations to 2D/3D	Distinguishes interpolation from true extrapolation
Phonopy [17]	Phonon calculation package integrated with ML-IAPs	Validates second-derivative properties
Training Strategies
Active Learning [1]	Iteratively expands training set based on uncertainty	Targets coverage gaps in chemical space
Multi-Fidelity Learning [1]	Combines high-low accuracy data sources	Improves data efficiency for broad coverage
Off-Equilibrium Augmentation [17]	Adds distorted structures to training data	Enhances robustness for force predictions

Overcoming OOD failures in machine learning interatomic potentials requires moving beyond heuristic evaluations and simplistic precision metrics. The research community must adopt rigorous assessment protocols that distinguish true extrapolation from interpolation within well-covered domains [63]. Current supervised approaches fundamentally answer the wrong questions for OOD detection [64], necessitating architectural innovations that explicitly model distributional boundaries.

The path forward involves several critical directions: developing benchmark datasets specifically designed for OOD challenges, creating architectures with built-in uncertainty quantification for novel chemistries, implementing active learning strategies that target domain gaps, and establishing standardized evaluation protocols that report both ID and OOD performance. Only through these comprehensive approaches can we transform precision from a misleading proxy into a genuine indicator of accuracy across the chemical and structural spaces relevant to drug development and materials discovery.

Machine learning interatomic potentials (MLIPs) have emerged as transformative tools in computational materials science and chemistry, enabling atomistic simulations at near quantum-mechanical accuracy but with drastically reduced computational cost. The performance and reliability of these models are fundamentally constrained by the quality and diversity of the training databases. This protocol examines the critical relationship between database construction and model transferability, providing researchers with structured methodologies for developing robust MLIPs capable of accurate force predictions across diverse chemical and configurational spaces.

The foundational importance of database design stems from the fact that MLIPs serve as surrogate models for the potential energy surface (PES). Their ability to generalize beyond training data depends entirely on the representative coverage of atomic environments in the database. Traditional databases focused predominantly on equilibrium structures have proven insufficient for molecular dynamics and materials discovery applications, where systems frequently sample non-equilibrium configurations. This document synthesizes recent advances in database construction methodologies that enhance model transferability through strategic configuration selection and quality assurance protocols.

Database Diversity: Configurational and Chemical Space Coverage

The Massive Atomic Diversity (MAD) Framework

The Massive Atomic Diversity (MAD) dataset represents a paradigm shift in database construction philosophy, explicitly designed to maximize configurational coverage through systematic structural perturbations. Unlike conventional databases focused primarily on stable minima, MAD employs aggressive sampling strategies to ensure comprehensive PES coverage [66] [67].

Table 1: MAD Dataset Composition and Diversity Engineering Strategies

Subset Name	Description	Construction Method	Diversity Contribution
MC3D	Stable bulk crystals from 3D databases	Direct inclusion from existing repositories	Baseline equilibrium structures
MC3D-rattled	Geometrically distorted crystals	Large Gaussian noise (20% covariance) applied to atomic positions	High-energy atomic environments
MC3D-random	Chemically randomized systems	Random atomic species substitution on bulk sites with isotropic cell rescaling	Atypical chemical compositions
MC3D-surface	Surface models	Random low-index cuts of bulk crystals	Low-coordination environments
MC3D-cluster	Nanoclusters	Small clusters (2-8 atoms) extracted from bulk environments	Non-periodic, under-coordinated systems
MC2D	Two-dimensional materials	2D database extraction	Dimensionality effects
SHIFTML subsets	Molecular crystals and fragments	Fragmentation and direct inclusion	Molecular and intermolecular interactions

The MAD dataset's distinctive strength lies in its engineered "tails" in energy and force distributions, deliberately incorporating high-energy configurations often absent in conventional databases. This approach significantly enhances model robustness when simulating non-equilibrium processes such as defect migration, surface reactions, and phase transformations [66].

Structural Descriptors and Diversity Quantification

Quantifying database diversity requires sophisticated structural representation methods. The MAD framework utilizes high-dimensional descriptors derived from the Point Edge Transformer (PET) architecture, where each atom-centered environment is encoded as a 512-dimensional vector ξ(Ai). The overall structure is then characterized through statistical moments:

Ξ(A) = [⟨ξ(Ai)⟩, ⟨(ξ(Ai) - ⟨ξ(Ai)⟩)²⟩]

This 1024-dimensional descriptor captures both average local environments and configurational heterogeneity, providing an invariant representation for diversity analysis. Dimensionality reduction techniques like sketch-map enable visualization and gap analysis within the dataset's chemical space [66] [67].

MAD Dataset Construction Workflow: Systematic generation of diverse atomic configurations through multiple perturbation pathways.

Data Quality: Consistency and Fidelity Considerations

Computational Consistency Protocols

Database quality encompasses both the intrinsic accuracy of reference calculations and the consistency of computational parameters across all data points. The MAD dataset employs a uniform density functional theory (DFT) setup with consistent exchange-correlation functional, plane-wave cutoffs (≈110 Ry for wavefunctions, ≈1320 Ry for charge density), and pseudopotentials (SSSP) across all structures. This methodological uniformity ensures a coherent structure-energy mapping, eliminating artifacts arising from inconsistent computational settings [66] [67].

The critical importance of consistent computational protocols becomes evident when integrating data from multiple sources. Different exchange-correlation functionals (e.g., PBE vs. PBEsol) produce systematically different potential energy surfaces, as demonstrated by benchmark studies showing volume differences of 0-2 Å³/atom between these functionals [17]. Such discrepancies introduce noise that impedes model training and reduces predictive accuracy.

Multi-Task Learning for Cross-Domain Transfer

The SevenNet-Omni framework addresses the challenge of integrating heterogeneous databases with incompatible computational settings through a sophisticated multi-task learning approach. This method partitions model parameters into shared (θC) and task-specific (θT) components, formally represented as:

DFTT(𝒢) ≈ f(𝒢; θC, θT) = f(𝒢; θC, 0) + θTᵀ · R(𝒢; θC, θT)

The common PES term f(𝒢; θC, 0) transfers knowledge across domains, while the task-specific correction accommodates systematic differences between computational approaches. This architecture enables effective knowledge transfer from large datasets generated with standard functionals (e.g., PBE) to smaller high-fidelity datasets (e.g., r2SCAN), demonstrating remarkable cross-functional transfer capability even with minimal high-fidelity data (0.5% r2SCAN) [68].

Table 2: Universal MLIP Performance on Phonon Properties Benchmark

Model	Energy MAE (eV/atom)	Force MAE (eV/Å)	Relaxation Failure Rate (%)	Phonon Spectrum Accuracy
CHGNet	~0.03-0.05	~0.05	0.09	Moderate
M3GNet	~0.03-0.05	~0.05	~0.15	Moderate
MACE-MP-0	~0.02-0.04	~0.03-0.04	~0.15	High
SevenNet-0	~0.02-0.04	~0.03-0.04	~0.15	High
MatterSim-v1	~0.02-0.04	~0.03-0.04	0.10	High
ORB	~0.02-0.03	~0.02-0.03	~0.50	Variable
eqV2-M	~0.02-0.03	~0.02-0.03	0.85	Variable

Benchmark results reveal substantial performance variations among universal MLIPs on phonon property predictions, despite similar performance on energy and force predictions for equilibrium structures. Models that directly predict forces without deriving them from energy gradients (ORB, eqV2-M) exhibit significantly higher failure rates in geometry optimization, highlighting the importance of physical consistency in model architecture [17].

Multi-Task Learning Architecture: Knowledge transfer through shared parameters with task-specific corrections for heterogeneous datasets.

Experimental Protocols and Methodologies

Database Construction and Curation Protocol

Objective: Create a diverse, high-quality database for training transferable MLIPs.

Materials and Software:

DFT software (VASP, Quantum ESPRESSO, or ABINIT)
Structure databases (Materials Project, OQMD, ICSD)
Python environment with ASE, pymatgen, or similar libraries
MLIP training framework (DeePMD-kit, LAMMPS, AMPTORCH)

Procedure:

Initial Structure Collection
- Collect 5,000-10,000 stable crystal structures from existing databases
- Ensure broad chemical representation across periodic table
- Include various structure types: bulk, layered, molecular crystals
Configurational Diversity Expansion
- Rattling: Apply Gaussian displacement with 20% covariance to atomic positions
- Chemical Randomization: Substitute elements on crystal sites randomly
- Surface Generation: Create low-index surfaces (100, 110, 111) via cleavage
- Cluster Extraction: Generate non-periodic clusters (2-8 atoms) from bulk
- Molecular Addition: Incorporate molecular crystals and fragments
Reference Calculations
- Employ consistent DFT parameters across all structures
- Use single exchange-correlation functional (PBE recommended for compatibility)
- Set consistent plane-wave cutoff (≥500 eV) and k-point sampling
- Calculate energies, forces, and stresses for all configurations
- Perform phonon calculations for stability verification (subset)
Quality Control and Filtering
- Remove structures with SCF convergence failures
- Filter configurations with unphysical forces (>10 eV/Å)
- Verify energy consistency across similar configurations
- Check for presence of magnetic moments (if applicable)
Dataset Splitting and Validation
- Split data into training (80%), validation (10%), and test (10%) sets
- Ensure chemical and structural diversity in each split
- Create specialized test sets for target applications (surfaces, defects, etc.)

Cross-Domain Transfer Assessment Protocol

Objective: Evaluate MLIP transferability across chemical and computational domains.

Procedure:

Multi-Domain Training
- Implement multi-task architecture with shared and task-specific parameters
- Apply selective regularization to task-specific parameters
- Incorporate domain-bridging sets (0.1% of total data) to align PES across domains
- Train on combined datasets (15+ domains recommended)
Transferability Metrics
- Calculate energy and force MAE on out-of-distribution test sets
- Assess phonon spectrum accuracy compared to DFT reference
- Evaluate geometry optimization success rate (<0.1% failure target)
- Test on targeted challenging properties: adsorption energies, defect formation energies, surface energies
Cross-Functional Transfer Assessment
- Train primarily on large PBE datasets
- Fine-tune with small high-fidelity datasets (r2SCAN, hybrid functionals)
- Evaluate performance on high-fidelity benchmarks
- Target adsorption-energy errors <0.06 eV on metallic surfaces

Table 3: Critical Databases and Software for MLIP Development

Resource	Type	Key Features	Application
MAD Dataset	Atomic structure database	95,595 structures, 85 elements, non-equilibrium focus	Training universal MLIPs
Materials Project	Computational materials database	150,000+ materials, DFT-calculated properties	Initial structure collection
Open Quantum Materials Database (OQMD)	Computational materials database	700,000+ DFT calculations	Chemical space expansion
DeePMD-kit	MLIP training framework	Deep Potential implementation, LAMMPS integration	MLIP development
CHGNet	Pre-trained universal MLIP	Graph neural network, charge-informed	Transfer learning baseline
M3GNet	Pre-trained universal MLIP	Materials graph network, broad element coverage	Transfer learning baseline
SevenNet-Omni	Multi-task MLIP framework	Cross-domain transfer, equivariant architecture	Multi-fidelity learning
VASP	DFT software	High accuracy, extensive functionals	Reference calculations
LAMMPS	MD simulation software	Extensive MLIP support, high performance	Molecular dynamics simulations
ASE	Python package	Atomistic simulation environment	Structure manipulation and analysis

The construction of diverse, high-quality databases represents the foundational step in developing transferable machine learning interatomic potentials. Methodologies such as the MAD dataset's aggressive configurational sampling and SevenNet-Omni's multi-task learning framework provide robust solutions to the critical challenges of configurational diversity and data quality heterogeneity. The experimental protocols outlined in this document enable researchers to create specialized databases optimized for their specific application domains while maintaining sufficient generality for transferability.

Future developments in MLIP database construction will likely focus on incorporating advanced physical effects including explicit magnetism, electron correlations, and van der Waals interactions through multi-fidelity approaches. Automated active learning frameworks that iteratively identify and fill gaps in configurational coverage will further enhance database completeness and efficiency. As these methodologies mature, the materials science community moves closer to the goal of truly universal interatomic potentials that seamlessly bridge quantum-mechanical accuracy with macroscopic simulation scales.

Machine learning interatomic potentials (MLIPs) have emerged as a transformative technology in computational materials science and chemistry, enabling atomistic simulations at near-quantum mechanical accuracy but at a fraction of the computational cost of traditional density functional theory (DFT) calculations [1]. These models serve as critical bridges between the high accuracy but computational intractability of ab initio methods and the efficiency but limited transferability of classical force fields. However, the development and deployment of MLIPs necessitate navigating fundamental trade-offs between accuracy, computational speed, and the scale of addressable system sizes. Understanding these trade-offs is paramount for researchers aiming to apply MLIPs to practical problems in materials design and drug development.

The core challenge stems from inherent tensions between these objectives: higher accuracy typically requires more complex models and extensive training data, which increases computational cost and limits system size. Conversely, models optimized for speed often sacrifice physical fidelity or generalizability. This application note provides a structured framework for evaluating these trade-offs and outlines detailed protocols for selecting, validating, and applying MLIPs to specific research problems across different domains.

Theoretical Foundations of Trade-offs

The Accuracy-Speed Trade-off

The trade-off between accuracy and computational speed represents one of the most fundamental considerations in MLIP development and application. At the architectural level, this manifests as a choice between highly expressive models that capture complex atomic interactions and streamlined models optimized for rapid inference.

Equivariant models exemplify this tension. These architectures explicitly embed physical symmetries (rotational, translational, and reflection invariances) into their structure, leading to superior data efficiency and accuracy for force and energy predictions [11] [1]. For instance, equivariant graph neural networks (GNNs) like NequIP and MACE achieve state-of-the-art accuracy but require computationally intensive tensor operations that can limit their inference speed [11]. In contrast, invariant models or those using simpler representations typically offer faster computation but may struggle with accurately capturing directional dependencies and complex chemical environments [1].

The computational expense scales significantly with model complexity. Models employing higher-order representations or extensive message-passing mechanisms incur substantial overhead during both training and inference [11]. This directly impacts their practicality for large-scale molecular dynamics simulations where thousands of energy/force evaluations are required.

The System Size Challenge

The relationship between system size and computational cost introduces another critical dimension to MLIP optimization. Most conventional MLIPs operate under the "nearsightedness principle," considering only local atomic environments within a fixed cutoff radius to ensure linear scaling with atom count [69]. While this approach enables the simulation of large systems, it systematically neglects crucial long-range interactions such as electrostatic, dispersion, and dipole forces [69].

These long-range interactions are particularly important for accurately modeling biological systems, polar materials, and interfaces where non-local effects significantly influence system behavior and properties. Capturing these effects requires specialized architectural adaptations that increase computational complexity. For example, methods incorporating Ewald summation or self-consistent field iterations for electrostatic interactions add substantial computational overhead compared to models considering only local environments [69].

Table 1: Computational Trade-offs in MLIP Architectural Choices

Architectural Feature	Impact on Accuracy	Impact on Speed	Impact on System Size
Equivariant vs. Invariant	Higher geometric accuracy with equivariance [1]	Slower inference with equivariant operations [11]	Similar scaling for both approaches
Local Environment Only	Limited accuracy for non-local interactions [69]	Faster computation	Enables large system simulation
Long-Range Interactions	Essential for electrostatics, interfaces [69]	Significant computational overhead	Increased memory and computation demands
Model Size (Parameters)	Generally improves accuracy with size [70]	Slower inference	Larger memory footprint

Quantitative Performance Benchmarks

Model Performance Across Applications

Recent comprehensive benchmarking studies reveal how different MLIP architectures perform across various material systems and properties. Universal MLIPs (uMLIPs) demonstrate particularly strong performance across diverse chemical spaces, though with notable variations between specific models.

In the MOFSimBench benchmark evaluating metal-organic frameworks, models like PFP and eSEN-OAM demonstrated excellent accuracy across multiple tasks including structure optimization, molecular dynamics, and host-guest interactions [71]. For structure optimization, PFP achieved the highest performance with 92% of structures optimized within 10% volume change compared to DFT reference, while eSEN-OAM showed top-tier accuracy in bulk modulus predictions [71].

However, performance consistency across different dimensionalities remains challenging. A benchmark evaluating systems of varying dimensionality (0D molecules to 3D bulk materials) found that while most modern uMLIPs perform excellently for 3D systems, accuracy typically degrades for lower-dimensional structures like surfaces (2D), nanowires (1D), and molecules (0D) [70]. The best-performing models—including ORB, equiformerV2, and the equivariant Smooth Energy Network (eSEN)—achieved atomic position errors of 0.01–0.02 Å and energy errors below 10 meV/atom across all dimensionalities [70].

Table 2: Performance Comparison of Selected MLIP Models

Model	Energy MAE (eV/atom)	Force MAE (eV/Å)	Specialized Strengths	Computational Efficiency
PFP	0.006 (QMOF) [71]	N/A	Structure optimization, molecular dynamics [71]	~3.75x faster than MatterSim-v1-5M [71]
eSEN-OAM	N/A	N/A	Bulk modulus, consistent multi-task performance [71]	Slower (280 ms/step) due to 30M parameters [71]
E2GNN	Comparable to ab initio [11]	Comparable to ab initio [11]	Solid, liquid, gas systems with high efficiency [11]	Significant efficiency vs. high-order tensor models [11]
M3GNet	~0.035 (across multiple systems) [17]	N/A	Pioneering uMLIP, broad applicability [17]	Moderate (smaller architecture) [70]
ORB-v2	<0.01 (across dimensionalities) [70]	N/A	Excellent for lower-dimensional systems [70]	Moderate (25M parameters) [70]

Phonon Property Predictions

Phonon calculations represent a particularly demanding test for MLIPs as they depend on the second derivatives of the potential energy surface. A recent benchmark evaluating seven uMLIPs on approximately 10,000 phonon calculations revealed significant performance variations [17].

While models like CHGNet and MatterSim-v1 demonstrated high reliability in geometry optimization (failure rates of 0.09% and 0.10%, respectively), others like eqV2-M showed substantially higher failure rates (0.85%) [17]. Importantly, models that predict forces as separate outputs rather than as exact derivatives of the energy (ORB and eqV2-M) exhibited particular challenges with phonon calculations due to high-frequency errors in the forces that prevented convergence in some cases [17].

This benchmark highlights that excellent performance on energy and force predictions for equilibrium structures does not necessarily translate to accurate phonon properties, emphasizing the need for specialized benchmarking against intended application requirements.

Application Protocols

Model Selection Workflow

Selecting the appropriate MLIP for a specific research application requires systematic evaluation of accuracy requirements, system characteristics, and computational constraints. The following protocol provides a structured approach for model selection:

Step 1: Define Application Requirements

Identify key properties of interest (energies, forces, dynamical properties, phonons)
Determine accuracy thresholds based on intended use (e.g., relative trends vs. quantitative predictions)
Establish computational budget (hardware availability, time constraints)

Step 2: Characterize System Properties

Determine system size (number of atoms) and dimensionality (0D-3D)
Identify dominant interactions (local bonding vs. long-range effects)
Assess chemical complexity (element diversity, known vs. unexplored chemistries)

Step 3: Initial Model Screening

For systems with significant long-range interactions, prioritize models with explicit long-range treatments like SOG-Net [69]
For high-dimensional accuracy requirements, consider equivariant models like E2GNN or MACE [11]
For high-throughput screening of 3D materials, select optimized uMLIPs like PFP or eSEN-OAM [71]

Step 4: Validation and Testing

Perform limited benchmarking on representative system subsets
Validate against available experimental data or high-level theoretical calculations
Assess computational performance on available hardware

Figure 1: Model Selection Workflow for MLIP Applications

Protocol for Accuracy-Focused Applications

For applications demanding high predictive fidelity, such as drug binding affinity calculations or material property prediction for regulatory submissions, the following protocol ensures rigorous model validation:

System Preparation

Generate reference data using consistent DFT parameters (functional, basis set, convergence criteria)
Ensure diverse sampling of relevant chemical space and conformational states
Include off-equilibrium structures to improve model transferability [17]

Model Training and Validation

Implement k-fold cross-validation with chemically meaningful splits
Apply uncertainty quantification to identify low-confidence predictions
Validate against multiple property types (energies, forces, stresses) simultaneously

Performance Assessment

Evaluate on hold-out test sets with comparable size to training data
Compare against baseline models and existing experimental data
Assess robustness through sensitivity analysis and adversarial testing

Protocol for Large-Scale Simulations

For applications requiring large system sizes or extended timescales, such as protein-ligand dynamics or materials phase behavior:

Model Optimization

Select architectures with demonstrated scalability (linear or near-linear scaling)
Implement model compression techniques (pruning, quantization) where applicable
Utilize hardware-specific optimizations (GPU acceleration, distributed computing)

Efficient Sampling Strategies

Employ active learning approaches to focus computational resources on uncertain regions
Implement transfer learning from general uMLIPs to system-specific refinements
Combine with enhanced sampling techniques for rare events

Validation at Scale

Perform smaller-scale validation against reference calculations
Verify conservation laws and thermodynamic consistency
Monitor performance drift during extended simulations

The Scientist's Toolkit

Essential Research Reagents

Table 3: Essential Computational Tools for MLIP Research

Tool Category	Specific Examples	Function	Application Context
Benchmark Datasets	QMOF [71], MOFSimBench [71], MDR Phonon Database [17]	Performance evaluation across diverse chemical spaces	Model validation, comparison, and selection
MLIP Frameworks	DeePMD-kit [1], MACE [17], E2GNN [11]	Ready-to-use implementations of MLIP architectures	Rapid prototyping and application development
Long-Range Methods	SOG-Net [69], DPLR [69]	Capture electrostatic and dispersion interactions	Polar materials, biological systems, interfaces
Universal MLIPs	PFP [71], eSEN [71] [70], ORB [70] [17]	Transferable potentials for diverse chemistries	High-throughput screening, multi-element systems
Validation Tools	Phonon calculators [17], MD analysis suites	Specialized property validation	Ensuring predictive accuracy for specific properties

Experimental Workflow for MLIP Deployment

Figure 2: MLIP Deployment Workflow

The field of machine learning interatomic potentials has matured significantly, with current models offering compelling alternatives to traditional quantum mechanical calculations for many applications. However, the fundamental trade-offs between accuracy, speed, and system size remain central considerations in research planning and implementation.

Researchers must align their model selection with specific application requirements: accuracy-focused applications benefit from equivariant architectures and comprehensive validation; high-throughput screening demands optimized uMLIPs with demonstrated transferability; and systems with significant long-range interactions require specialized architectural treatments. As the field continues to evolve, emerging approaches like hybrid benchmarking, multi-fidelity frameworks, and active learning promise to further optimize these trade-offs, enabling increasingly sophisticated applications across materials science and drug development.

The protocols and guidelines presented herein provide a structured framework for navigating these complex decisions, empowering researchers to leverage MLIP technologies effectively while maintaining awareness of their limitations and appropriate domains of application.

Validating MLIP Accuracy: Benchmarking Force Predictions Across Materials and Properties

The development of Machine Learning Interatomic Potentials (MLIPs) has created a critical need for robust benchmarking methodologies that extend beyond simple energy and force errors. While early MLIP evaluation relied primarily on root-mean-square-error (RMSE) and mean-absolute-error (MAE) for energies and atomic forces, research demonstrates that these static regression metrics show poor correlation with performance on real-world scientific simulation tasks [31]. Models with nearly identical force validation errors can exhibit dramatically different behavior in molecular dynamics simulations, structural relaxations, and property prediction [31] [1]. This protocol establishes comprehensive benchmarking procedures that evaluate MLIPs through the lens of system-level properties, providing researchers with standardized methods for assessing model readiness for computational chemistry, materials science, and drug development applications.

Limitations of Traditional Energy and Force Metrics

Static energy and force metrics, while computationally inexpensive to evaluate, suffer from fundamental limitations in predicting MLIP utility for scientific research:

Limited Sampling of Potential Energy Surface: Standard validation datasets typically sample equilibrium or near-equilibrium configurations, while downstream applications like molecular dynamics routinely explore higher-energy transition states and metastable configurations where MLIPs must extrapolate [31].
Poor Correlation with Dynamical Properties: Studies consistently show that models with similar force MAE values produce substantially different results for dynamical properties, including energy conservation, stability, and sampling accuracy during molecular dynamics simulations [31].
Insensitivity to Physical Realism: Basic error metrics cannot capture whether models reproduce physically correct behavior, such as proper vibrational spectra, thermodynamic properties, or reaction pathways [72] [17].

The following sections provide experimental protocols for benchmarking MLIPs against system-level properties that matter for practical research applications.

Benchmarking Phonon and Vibrational Properties

Experimental Rationale

Phonon spectra and vibrational properties derive from the second derivatives (curvature) of the potential energy surface, making them exceptionally sensitive tests of MLIP accuracy [17]. These properties are fundamental to understanding thermal transport, phase stability, and inelastic scattering phenomena [72]. Benchmarking against experimental phonon data provides critical validation of a model's ability to capture subtle atomic interactions beyond simple energy minimization.

Protocol: Phonon Dispersion and Density of States Calculation

Purpose: To evaluate MLIP performance in predicting harmonic phonon properties compared to reference DFT calculations and experimental measurements.

Materials and Computational Resources:

Relaxed crystal structures (DFT or MLIP-optimized)
Phonon calculation software (Phonopy, ALAMODE, or equivalent)
Reference DFT phonon database (as available)
Experimental inelastic neutron scattering data (for validation)

Methodology:

Structure Optimization:
- Using the MLIP, fully relax the atomic coordinates and lattice vectors of the test structure with convergence criteria of ≤ 0.001 eV/Å for forces and ≤ 0.01 GPa for stress.
- Compare the optimized lattice parameters with DFT and experimental references.

Force Constant Calculation:
- Generate symmetry-inequivalent finite displacements (typically 0.01-0.03 Å) for constructing the dynamical matrix.
- Compute forces on all atoms in supercells (typically 2×2×2 to 4×4×4 depending on system size).
- Extract harmonic force constants through central difference formulation.
Phonon Property Calculation:
- Diagonalize the dynamical matrix along high-symmetry paths for phonon dispersion.
- Calculate phonon density of states using dense q-point meshes.
- Compute derived thermodynamic properties (vibrational entropy, Helmholtz free energy, heat capacity).
Validation Metrics:
- Mean absolute error in phonon frequencies across the Brillouin zone.
- Spearman correlation coefficient for phonon density of states shapes [72].
- Qualitative comparison of phonon dispersion curves with experimental inelastic neutron scattering data [72].

Interpretation: High-performing universal MLIPs should achieve phonon frequency MAE < 0.5 THz and Spearman coefficients > 0.9 compared to reference DFT across diverse materials systems [72].

Table 1: Performance of Selected MLIPs on Phonon Calculations [72] [17]

Model	Phonon Frequency MAE (THz)	Structure Relaxation Success Rate (%)	Lattice Parameter MAE (Å³/atom)
ORB v3	0.32	>99.5	0.18
MatterSim v1	0.35	99.9	0.21
MACE-MP-0	0.41	99.7	0.24
CHGNet	0.52	99.9	0.29
M3GNet	0.61	99.6	0.33

Catalytic Property Benchmarking

Experimental Rationale

Adsorption energy prediction represents a critical test for MLIPs in catalytic applications, as these properties directly correlate with catalytic activity and selectivity [73]. Accurate prediction requires models to faithfully represent potential energy surfaces for both surface structures and molecular adsorbates across diverse chemical environments.

Protocol: Adsorption Energy Validation

Purpose: To systematically evaluate MLIP accuracy for predicting adsorption energies of small and large molecules on catalytic surfaces.

Materials:

CatBench framework or equivalent benchmarking system [73]
DFT-computed adsorption energies for reference (≥47,000 reactions recommended)
Diverse catalyst surfaces (metals, oxides, supported nanoparticles)
Varied adsorbates (from small molecules to complex organics)

Methodology:

System Preparation:
- Generate slab models of catalytic surfaces with appropriate vacuum separation.
- Create adsorbate-surface configurations at relevant binding sites.

Energy Calculation:
- Compute total energy of the relaxed adsorbate-surface system (E_adsorbate+surface).
- Compute total energy of the relaxed clean surface (E_surface).
- Compute total energy of the isolated, relaxed adsorbate molecule (E_adsorbate).
Adsorption Energy Determination:
- Calculate adsorption energy: Eads = Eadsorbate+surface - Esurface - Eadsorbate
- Compare MLIP-predicted adsorption energies with DFT reference values.
Statistical Validation:
- Calculate mean absolute error across diverse reaction spaces.
- Perform multi-class anomaly detection to identify systematic failure modes.
- Assess model robustness across molecule sizes and surface types.

Interpretation: Practically reliable models should achieve adsorption energy MAE ∼0.2 eV or better across both small and large molecules [73].

Table 2: MLIP Performance on Catalytic Property Prediction [73]

Model Class	Small Molecule Adsorption MAE (eV)	Large Molecule Adsorption MAE (eV)	Anomaly Detection Pass Rate (%)
Best Performing	0.18-0.22	0.21-0.26	>95
Intermediate	0.23-0.30	0.27-0.35	80-95
Poor Performing	>0.30	>0.35	<80

Biomolecular Simulation Benchmarking

Experimental Rationale

Biomolecular simulations require MLIPs to maintain stability over extended timescales while reproducing complex phenomena like folding dynamics, solvation effects, and conformational sampling [31]. These applications present distinct challenges due to system size, chemical diversity, and the importance of weak interactions.

Protocol: Stability and Conformational Sampling

Purpose: To assess MLIP performance for biomolecular systems including proteins, peptides, and nucleic acids in aqueous environments.

Materials:

Benchmark systems (small organic compounds, molecular liquids, folded protein domains, flexible peptides) [31]
Solvated systems with explicit water representation
Reference data from experimental structures and/or AIMD simulations

Methodology:

Stability Assessment:
- Initialize molecular dynamics simulations from experimental or DFT-optimized structures.
- Monitor energy conservation and structural drift over 10-100+ ps simulations.
- Quantify root-mean-square deviation (RMSD) from starting structure over time.

Conformational Sampling:
- Perform dihedral scans for backbone and sidechain torsion angles.
- Compare potential of mean force profiles with reference data.
- Assess conformational preferences for flexible peptides.
Solvation and Interactions:
- Calculate radial distribution functions for solute-solvent pairs.
- Evaluate interaction energies for non-covalent complexes.
- Assess hydrogen bonding patterns and lifetimes.

Interpretation: Production-quality MLIPs should maintain stable simulations with energy conservation better than 0.1-1.0 meV/atom/ps and reproduce key conformational preferences within 1 kcal/mol [31].

Integrated Benchmarking Workflow

The following workflow diagram illustrates the comprehensive MLIP benchmarking methodology described in this protocol:

MLIP Benchmarking Workflow

Essential Research Reagents and Computational Tools

Table 3: Essential Research Tools for MLIP Benchmarking

Tool/Resource	Type	Primary Function	Application in Benchmarking
MLIPAudit [31]	Benchmarking Suite	Standardized MLIP evaluation	Provides diverse benchmark systems and leaderboard tracking
CatBench [73]	Specialized Framework	Adsorption energy prediction	Systematic catalyst screening validation
Phonopy [72] [17]	Analysis Software	Phonon property calculation	Harmonic phonon spectra from force constants
Matbench Discovery [17]	Leaderboard	Model performance tracking	Comparative model assessment across properties
ASE (Atomic Simulation Environment) [31]	Computational Toolkit	Atomistic simulations	Interface for MLIP calculators and MD simulations
INSPIRED Software [72]	Specialized Tool	INS spectrum simulation	Experimental validation of phonon properties
Materials Project Database [72] [17]	Reference Data	Crystal structures and properties	Source of benchmark structures and reference data
MD17/MD22 Datasets [1]	Training/Test Data	Molecular dynamics trajectories	Biomolecular force validation

Comprehensive benchmarking that extends beyond energy and force errors to system-level properties is essential for developing reliable MLIPs for scientific research. The protocols presented here for phonon properties, catalytic performance, and biomolecular simulation provide standardized methodologies for assessing model readiness across diverse application domains. By adopting these benchmarking practices, researchers can make informed decisions about model selection, identify areas for improvement, and accelerate the development of more robust and transferable machine learning interatomic potentials.

The accurate prediction of phonon properties represents a critical benchmark for evaluating the performance of machine learning interatomic potentials (MLIPs). Phonons, the quantized lattice vibrations in materials, are fundamental to understanding thermal conductivity, phase stability, and mechanical properties. While MLIPs have demonstrated remarkable accuracy in predicting energies and forces near equilibrium configurations, their ability to capture the second derivatives of the potential energy surface—the harmonic force constants—remains a stringent test of their reliability and transferability. This Application Note examines the current state of MLIPs in predicting phonon properties, providing quantitative benchmarking data, detailed experimental protocols, and essential computational tools for researchers.

Performance Benchmarking of Universal MLIPs

Recent comprehensive benchmarking studies have evaluated the capability of universal MLIPs (uMLIPs) to predict phonon properties across a diverse set of materials. The assessment utilized a dataset of approximately 10,000 non-magnetic semiconductors with phonon calculations derived from density functional theory (DFT), covering a wide range of elements and crystal structures [17].

Table 1: Performance of Universal MLIPs in Predicting Phonon Properties

Model Name	Architecture Foundation	Force Prediction Method	Geometry Relaxation Failure Rate (%)	Phonon Prediction Accuracy
M3GNet	Three-body interactions, atomic positions	Energy derivative	~0.15%	Moderate
CHGNet	Graph neural network	Energy derivative	0.09%	Good (with energy correction)
MACE-MP-0	Atomic cluster expansion	Energy derivative	~0.15%	Moderate
SevenNet-0	NequIP-based, message passing	Energy derivative	~0.15%	Moderate
MatterSim-v1	M3GNet-based, active learning	Energy derivative	0.10%	High
ORB	SOAP with graph network simulator	Separate output	>0.15%	Varies
eqV2-M	Equivariant transformers	Separate output	0.85%	High (when converged)

The benchmarking revealed substantial variations in model performance, with certain uMLIPs achieving high accuracy in predicting harmonic phonon properties while others exhibited significant inaccuracies, despite excelling in energy and force predictions for materials near dynamical equilibrium [17]. Notably, models that predicted forces as a separate output rather than as derivatives of the energy (ORB and eqV2-M) demonstrated higher failure rates in geometry relaxation, highlighting the importance of energy conservation for reliable phonon calculations.

Advanced Computational Methods

Virtual Node Graph Neural Networks (VGNN)

A novel machine-learning framework called the Virtual Node Graph Neural Network (VGNN) has been developed to predict phonon dispersion relations up to 1,000 times faster than other AI-based techniques, with comparable or better accuracy [74].

Table 2: Comparison of Phonon Calculation Methods

Method	Computational Speed	Accuracy	System Size Scalability
Traditional DFT	Baseline (weeks/calculation)	High	Limited
Standard GNNs	100x faster than DFT	Moderate	Limited by fixed graph structure
VGNN Framework	1,000x faster than other AI methods; 1 million x faster than non-AI	High to Very High	Excellent with virtual nodes

The VGNN architecture adds flexible virtual nodes to the fixed crystal structure to represent phonons, enabling the output of the neural network to vary in size and efficiently model high-dimensional quantities like phonon dispersion relations [74]. This approach skips complex calculations when estimating phonon dispersion relations, making it significantly more efficient than standard GNNs while maintaining or improving accuracy.

Gaussian Process Approaches

Gaussian Processes (GPs) provide an alternative machine-learning approach for calculating anharmonic lattice dynamics by building surrogate models of potential energy surfaces [75]. The key advantage of GPs lies in their differentiability – linear operations on a Gaussian process, including differentiation, result in transformed Gaussian processes, enabling simultaneous calculation of force constants of arbitrary order without additional computational cost.

Diagram: Gaussian Process Workflow for Phonon Calculations. This diagram illustrates the process of using Gaussian Processes to calculate phonon properties through automatic differentiation.

Experimental Protocols

Standard Protocol for Phonon Property Benchmarking

Objective: Evaluate the performance of MLIPs in predicting harmonic force constants and phonon properties.

Materials and Software Requirements:

MLIP models (M3GNet, CHGNet, MACE-MP-0, etc.)
Reference DFT phonon database (e.g., MDR database with ~10,000 compounds)
Computational environment with Python and necessary libraries (ASE, Phonopy)
High-performance computing resources for large-scale calculations

Procedure:

Data Preparation
- Obtain the reference phonon dataset containing diverse non-magnetic semiconductors
- Ensure compatibility between the exchange-correlation functional used in reference data and MLIP training (PBE vs. PBEsol)
Geometry Optimization
- Perform structural relaxation using each MLIP with consistent convergence criteria
- Set force convergence threshold to <0.005 eV/Å
- Record failure rates and computational efficiency
Phonon Calculation
- Compute harmonic force constants using finite displacement method
- For each optimized structure, calculate phonon dispersion relations and densities of states
- Compare with reference DFT calculations
Performance Evaluation
- Calculate mean absolute errors in lattice parameters, volumes, and phonon frequencies
- Assess transferability across different crystal systems and chemical compositions
- Identify systematic errors in phonon band structures

Troubleshooting Tips:

For models with high geometry optimization failure rates, consider increasing maximum iteration counts
When phonon spectra exhibit imaginary frequencies, check structural convergence and consider anharmonic effects
For inconsistent results across material classes, examine training data coverage and representation

Protocol for VGNN Implementation

Objective: Implement Virtual Node Graph Neural Networks for accelerated phonon dispersion calculations.

Procedure:

Graph Construction
- Convert atomic structure to crystal graph with nodes representing atoms
- Add virtual nodes connected to real nodes to represent phonon degrees of freedom
- Configure virtual nodes to receive messages from real nodes without affecting accuracy
Model Configuration
- Implement one of three VGNN versions with increasing complexity
- Ensure virtual nodes provide sensitivity to capture subtle changes in phonon structure
- Configure output dimensions to accommodate variable momentum space
Training and Validation
- Train on diverse materials with known phonon dispersion relations
- Use forces and energies as training targets
- Validate prediction accuracy against DFT calculations
Phonon Property Prediction
- Input atomic coordinates directly to obtain phonon dispersion relations
- Calculate thermal properties from predicted phonon spectra
- Extend to alloy systems with complex compositional variations

The Scientist's Toolkit

Table 3: Essential Research Reagents and Computational Tools

Tool/Resource	Type	Function	Access
Phonopy	Software Code	Open-source package for phonon calculations	Open source
M3GNet	Universal MLIP	Pioneering uMLIP with three-body interactions	Open source
CHGNet	Universal MLIP	Graph neural network with small architecture	Open source
VGNN Framework	ML Architecture	Accelerated phonon dispersion predictions	Research code
MDR Database	Reference Dataset	~10,000 phonon calculations for benchmarking	Public database
Gaussian Process Models	Surrogate PES	Bayesian ML for anharmonic force constants	Custom implementation
HiPhive	Software Code	Cluster expansion for anharmonic calculations	Open source

Diagram: MLIP Calculation Pathway for Phonon Properties. This diagram shows the workflow from atomic structure to phonon properties through an MLIP.

The prediction of phonon properties remains a critical test for evaluating the performance and reliability of machine learning interatomic potentials. While universal MLIPs have demonstrated significant capabilities in calculating harmonic force constants, substantial variations exist between different architectures, with models that conserve energy through force derivatives generally showing better reliability for phonon calculations. Emerging methods like Virtual Node GNNs and Gaussian Processes offer promising avenues for accelerated and more accurate phonon calculations, potentially enabling high-throughput screening of thermal properties for materials design. As MLIP development continues, comprehensive benchmarking on phonon properties should be considered an essential validation step to ensure models capture not only energies and forces but also the curvature of the potential energy surface that determines vibrational and thermal behavior.

Universal Machine Learning Interatomic Potentials (uMLIPs) have emerged as transformative tools in computational materials science, offering a bridge between the quantum-mechanical accuracy of ab initio methods and the computational efficiency of classical molecular dynamics simulations. By training on vast datasets derived from density functional theory (DFT) calculations, these models can predict energies, forces, and stresses for diverse atomic systems across the periodic table. This application note provides a detailed comparative analysis of three leading uMLIP frameworks—M3GNet, CHGNet, and MACE—contextualized within the broader research landscape of force field development. We present structured performance benchmarks, detailed experimental protocols, and practical implementation guidelines to assist researchers in selecting and applying these potent tools for simulating complex material behaviors and reactive chemical dynamics.

Performance Benchmarks and Comparative Analysis

Accuracy and Computational Performance

Table 1: Comparative Performance Benchmarks of Universal MLIPs

Model	Energy MAE (meV/atom)	Force MAE (meV/Å)	Phonon Frequency MAE (THz)	Architecture Type	Key Differentiating Features
M3GNet [19] [17] [76]	~35 (MPtrj)	~77 (MPtrj)	~0.45 (PBE benchmark)	3-Body Graph Neural Network	Pioneering universal potential; integrated 3-body interactions
CHGNet [17] [77]	~30 (MPtrj)	~77 (MPtrj)	~0.40 (PBE benchmark)	Charge-Informed GNN	Explicit magnetic moment prediction; charge-aware dynamics
MACE [78] [17] [79]	State-of-the-art	State-of-the-art	~0.30 (PBE benchmark)	Higher-Order Equivariant MPNN	High body-order messages; excellent accuracy/efficiency balance
eqV2-M [17]	-	-	~0.35 (PBE benchmark)	Equivariant Transformer	Top Matbench performer; separate force output
SevenNet-0 [17]	-	-	~0.50 (PBE benchmark)	NequIP-based	Parallelized message-passing; data-efficient

The benchmarking data reveals several critical trends. First, model accuracy is fundamentally linked to architectural choices. MACE's higher-order equivariant message passing enables it to achieve state-of-the-art performance in phonon prediction, a sensitive probe of the potential energy surface's curvature [17]. CHGNet's explicit incorporation of magnetic moments provides a unique charge-informed modeling capability, allowing it to capture electronic degrees of freedom that are crucial for transition metal chemistry [77]. Second, a trade-off exists between force accuracy and energy accuracy. While all models demonstrate reasonable force predictions, energy errors can vary significantly, with CHGNet showing notably higher energy errors without its typical correction procedure [17].

For reactive dynamics applications, studies on hydrogen scattering at copper surfaces indicate that MACE and REANN (a related architecture) provide the best balance between accuracy and computational efficiency, a critical consideration for the millions of force evaluations required in such simulations [78]. The benchmark data also highlights the importance of task-specific validation. While a model may perform well on energy and force prediction near equilibrium, its performance on phonons—which depend on the second derivatives of the energy—may vary significantly [17].

Reliability and Failure Analysis

Table 2: Geometry Relaxation Reliability on Semiconductor Dataset

Model	Failure Rate (%)	Notable Strengths	Notable Weaknesses
CHGNet	0.09	High reliability; small architecture	Systematic force underestimation (softening)
MatterSim-v1	0.10	Active learning for broad accuracy	-
M3GNet	~0.22 (similar to MACE, SevenNet)	Proven universal applicability	-
MACE-MP-0	~0.22 (similar to M3GNet)	Fast, accurate, parallelizable	-
ORB	>0.22 (significantly higher)	SOAP with graph network simulator	Forces not exact energy derivatives
eqV2-M	0.85	Highest accuracy on Matbench	High failure rate; forces not exact derivatives

Reliability in geometry relaxation is a critical metric for practical applications. The failure rates in Table 2 primarily stem from two sources: unphysical force predictions in certain regions of the potential energy surface, and high-frequency noise that prevents convergence when forces are not implemented as exact derivatives of the energy [17]. This is particularly evident for models like ORB and eqV2-M, which do not derive forces via automatic differentiation. For production workflows requiring high-throughput relaxation, CHGNet and MatterSim-v1 currently offer the most robust performance.

Architectural Framework and Methodologies

Model Architectures and Theoretical Foundations

The performance differences between uMLIPs stem from their underlying architectural philosophies. M3GNet incorporates explicit 3-body interactions within a graph network framework, combining traditional many-body potential features with flexible graph representations [19] [76]. This approach provides a balanced description of diverse chemical environments across the periodic table.

CHGNet builds upon graph neural networks but introduces a crucial innovation: the explicit inclusion and prediction of magnetic moments as an atomic property [77]. This allows the model to infer atomic charge states and capture the coupling between electronic and ionic degrees of freedom, which is essential for modeling transition metal compounds and electrochemical systems.

MACE utilizes higher-order equivariant message passing, leveraging the Atomic Cluster Expansion (ACE) framework to construct many-body descriptors with rigorous equivariance properties [78] [17] [79]. This approach enables faster convergence in message-passing steps while maintaining high accuracy, particularly for sensitive properties like phonon dispersion.

Figure 1: Universal MLIP Architectural Framework and Prediction Workflow

Training Data and Sampling Strategies

The accuracy of uMLIPs is fundamentally constrained by the quality and diversity of their training data. The Materials Project Trajectory Dataset (MPtrj) has been instrumental in developing these models, containing over 1.5 million structures with energies, forces, stresses, and magnetic moments [77]. This dataset provides comprehensive coverage of inorganic materials across nearly the entire periodic table.

Advanced sampling strategies like DIRECT (DImensionality-Reduced Encoded Clusters with sTratified) sampling have been developed to address data coverage challenges. This approach employs dimensionality reduction and clustering to select diverse training structures from complex configuration spaces, enabling more robust potential development with fewer active learning iterations [19].

Figure 2: DIRECT Sampling Workflow for Robust Training Set Selection

Experimental Protocols and Application Guidelines

Protocol 1: Structure Relaxation with Pre-trained Models

Objective: Perform crystal structure relaxation using universal potentials. Materials:

Crystal structure file (POSCAR, CIF, or other format)
Python environment with installed MLIP package
Computational resources (CPU or GPU)

Procedure:

Environment Setup: Install the required package (e.g., pip install chgnet or pip install mace-torch)
Structure Initialization: Load your initial crystal structure
Model Initialization: Load the pre-trained potential
Relaxation Execution: Run the relaxation algorithm with appropriate parameters
Result Analysis: Extract final structure, energy, and convergence metrics

Example Code (M3GNet-based Relaxation):

Expected Output: Relaxation from initial 3.3 Å to ~3.169 Å (close to DFT value of 3.168 Å) with energy ~-10.859 eV/atom [76].

Troubleshooting:

For convergence failures, consider increasing the maximum number of ionic steps
For unphysical results, verify the structure initialization and element compatibility
For CHGNet, ensure magnetic moment initialization for magnetic materials

Protocol 2: Molecular Dynamics for Property Prediction

Objective: Perform MD simulations to extract thermodynamic and kinetic properties. Materials:

Equilibrium crystal structure
Fine-tuned or pre-trained uMLIP
MD simulation package (ASE, LAMMPS interface)

Procedure:

System Preparation: Create a supercell appropriate for the target property
Thermalization: Equilibrate the system at target temperature using NVT ensemble
Production Run: Perform NVE or NPT simulation for sufficient duration
Property Extraction: Analyze trajectories for diffusion coefficients, phase behavior, or other properties

Application Example - Li Diffusion in Battery Materials: CHGNet has been successfully applied to study Li diffusion in garnet-type ionic conductors, revealing migration mechanisms and barriers that correlate well with experimental observations [77]. The charge-informed nature of CHGNet enables accurate modeling of the subtle electron redistribution during ion hopping.

Validation:

Compare energy and force distributions with reference DFT calculations
For diffusion, ensure mean-squared displacement shows linear regime
Validate against experimental data (e.g., EXAFS spectra [80]) when available

Protocol 3: Fine-tuning for System-Specific Accuracy

Objective: Improve model accuracy for specific materials systems through fine-tuning. Materials:

System-specific DFT data (energies, forces, stresses)
Pre-trained uMLIP model
Fine-tuning computational framework

Procedure:

Data Generation: Perform targeted DFT calculations (AIMD, distortions, etc.)
Model Preparation: Load pre-trained weights and architecture
Transfer Learning: Continue training with system-specific data
Validation: Test on hold-out structures and properties

Case Study - Transition Metal Dichalcogenides: Fine-tuning CHGNet with ~100 WS₂/MoS₂-specific DFT structures significantly improved agreement with experimental EXAFS spectra and mitigated the systematic force underestimation (softening) common in uMLIPs [80].

Recommended Dataset Sizes:

Minimal fine-tuning: 1-10 structures (viable but limited transfer)
Recommended fine-tuning: ~100 structures (reliable for EXAFS matching)
Comprehensive training: 1000+ structures (for complex phase spaces)

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Computational Tools and Resources for uMLIP Research

Resource Category	Specific Tools	Primary Function	Access Method
Pre-trained Models	M3GNet, CHGNet, MACE-MP-0	Out-of-the-box inference	PyPI packages (`m3gnet`, `chgnet`, `mace-torch`)
Training Datasets	Materials Project Trajectory (MPtrj)	Model training/transfer learning	Materials Project API [77]
Benchmarking Suites	MDR Phonon Database [17]	Phonon property validation	Public repository
Model Architectures	MatGL (M3GNet successor), MACE codebase	Custom model development	GitHub repositories [76] [79]
Simulation Ecosystems	ASE, Pymatgen, DeepMD-kit	Structure manipulation, analysis, MD	PyPI packages

The rapid evolution of universal MLIPs has brought diverse architectures with complementary strengths. M3GNet offers proven reliability and 3-body interactions, CHGNet provides unique charge-informed modeling capabilities, and MACE delivers state-of-the-art accuracy through higher-order equivariance. Selection criteria should prioritize architectural features aligned with specific application needs: charge transfer systems benefit from CHGNet's magnetic moment prediction, high-throughput screening leverages M3GNet's robustness, and highest-accuracy requirements justify MACE's computational demands.

Future development trajectories include improved out-of-equilibrium performance through active learning, integration of electronic property prediction, multi-fidelity training frameworks combining DFT and experimental data, and enhanced interpretability for physical insights. As these models mature, they are poised to become indispensable tools for accelerating materials discovery across energy storage, catalysis, and quantum materials domains.

The accuracy of biomolecular force predictions is a cornerstone of reliable computational chemistry, impacting everything from structure-based drug discovery to the understanding of complex biochemical processes. The emergence of Machine Learning Interatomic Potentials (MLIPs) promises to bridge the long-standing gap between the high accuracy but computational intractability of quantum mechanical methods like Density Functional Theory (DFT) and the speed but limited accuracy of classical force fields. This application note details the protocols for rigorously validating these novel MLIPs, framing the process within the critical context of ensuring that these powerful new tools are trustworthy for real-world biomolecular applications. We focus on benchmarking against DFT reference data and, where possible, experimental observables, providing researchers with a structured framework for evaluation.

The recent release of massive, chemically diverse datasets and the universal models trained on them, such as Meta's Open Molecules 2025 (OMol25) and the Universal Model for Atoms (UMA), marks a potential turning point, described by some as an "AlphaFold moment" for the field [81]. However, the performance of any MLIP is fundamentally constrained by the fidelity of its training data. Recent studies have uncovered unexpectedly large uncertainties in the DFT forces of several popular molecular datasets, highlighting that well-converged numerical settings are a non-negotiable prerequisite for meaningful validation [82]. This note provides methodologies to navigate these challenges and perform robust validation of force predictions for biomolecular systems.

Key Research Reagent Solutions in MLIP Validation

The following table catalogues essential computational tools and data resources that form the modern toolkit for developing and validating MLIPs in biomolecular research.

Table 1: Key Research Reagent Solutions for MLIP Validation

Category	Item	Function in Validation
Benchmark Datasets	OMol25 Dataset [81]	Provides a massive, high-accuracy dataset of over 100 million calculations for training and testing MLIPs on diverse biomolecules, electrolytes, and metal complexes.
	SPICE, Transition-1x, ANI-1x [82]	Serve as standard benchmarks; however, users must be aware of potential force errors in some subsets and verify data quality.
Universal MLIP Models	Universal Model for Atoms (UMA) [81]	A universal architecture trained on OMol25 and other datasets, useful for testing transferability and performance across diverse chemical spaces.
	eSEN Models [81]	Neural network potentials (NNPs) available in both direct-force and conservative-force variants; the conservative variants are recommended for dynamics.
	CHGNet, M3GNet, MACE-MP-0 [17]	Universal MLIPs benchmarked for materials properties; their performance on biomolecular phonon and stability properties can be informative.
Software & Codes	DFT Codes (ORCA, VASP, FHI-aims, Psi4) [82]	Generate reference data for validation. Crucial to use tightly converged numerical settings (e.g., dense integration grids, disabling RIJCOSX) to minimize force errors.
	MLIP Software (DeePMD-kit, NequIP) [1]	Frameworks used to run simulations with MLIPs for validation tasks like geometry optimization and molecular dynamics.
Classical Force Fields	OPLS-AA, CHARMM, AMBER [83]	Traditional force fields used as a baseline for performance comparison, particularly for maintaining native folds in long-timescale simulations.

Quantitative Benchmarking of Force Prediction Accuracy

The first step in validation is a quantitative comparison of MLIP-predicted energies and forces against DFT reference data on standardized benchmarks. The following table summarizes key benchmark findings from recent literature.

Table 2: Quantitative Performance of MLIPs on Energy and Force Prediction

Model / Dataset	Energy Accuracy (MAE)	Force Accuracy (MAE)	Key Findings & Notes
OMol25-trained Models (eSEN, UMA) [81]	Essentially perfect on internal benchmarks	Not explicitly quantified, but reported as highly accurate	Far surpass previous state-of-the-art models; match high-accuracy DFT performance on molecular energy benchmarks.
ANI-1x Dataset Forces [82]	N/A	33.2 meV/Å (error vs. recomputed reference)	This dataset shows significant force component errors. Caution advised when using it for force validation.
SPICE Dataset Forces [82]	N/A	1.7 meV/Å (error vs. recomputed reference)	One of the better datasets for force quality, though some subsets still have issues.
Universal MLIPs (CHGNet, MACE-MP-0, etc.) [17]	~0.035 eV/atom for ground-state geometries	Varies by model	Performance on harmonic phonon properties (derived from force curvature) is mixed, highlighting a key validation target.
Conservative vs. Direct Force Models [81]	Comparable	Conservative models outperform direct counterparts	Conservative-force models are more reliable for molecular dynamics and geometry optimizations.

Critical Protocol: Assessing Underlying DFT Data Quality

Before trusting any benchmark results, it is imperative to assess the quality of the reference DFT data.

Experimental Protocol 1: Validating Reference DFT Forces

Objective: To quantify potential errors in the DFT force components of a benchmark dataset.
Materials: A representative subset (e.g., 100-1000 configurations) of the dataset intended for use.
Software: The DFT code used to generate the original data (e.g., ORCA, VASP, Psi4).
Procedure:
- Check Net Forces: For each configuration in the subset, calculate the magnitude of the net force vector (sum of all atomic forces). A significant net force (> 1 meV/Å per atom) indicates numerical errors [82]. The OMol25 dataset is a benchmark for quality, with net forces exactly zero within numerical precision [82].
- Recompute with Tight Settings: Select a random sample (e.g., 1000 configurations). Recompute the forces using the same functional and basis set but with more robust numerical parameters. Key adjustments often include:
  - Using a denser integration grid (e.g., 99,590 points as in OMol25 [81]).
  - Disabling the RIJCOSX approximation in ORCA, which can be a major source of error [82].
  - Tightening SCF convergence criteria and using a larger plane-wave cutoff (for periodic codes).
- Quantify Discrepancies: For each atom and Cartesian direction, calculate the difference between the original dataset force and the recomputed, tightly-converged force. Report the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) across the sample.
Interpretation: An RMSE in force components significantly larger than the target error of your MLIP (e.g., >10 meV/Å) suggests the dataset may be unsuitable for rigorous validation. In such cases, seek out higher-quality datasets like OMol25 or the subsets of SPICE that show low errors [82].

Experimental Protocols for Biomolecular Validation

Once data quality is established, MLIPs should be validated against biologically relevant benchmarks.

Protocol 2: Stability of Native Protein Folds

This protocol tests an MLIP's ability to maintain the native fold of a biomolecule over time, a critical test for any force field used in drug discovery.

Experimental Protocol 2: Assessing Native Fold Stability

Objective: To evaluate the performance of an MLIP in reproducing the native fold of a protein (e.g., SARS-CoV-2 PLpro) in molecular dynamics (MD) simulations [83].
System Preparation:
- Obtain the experimental structure (e.g., from PDB).
- Place the protein in a solvation box with a water model (e.g., TIP3P). Add ions (e.g., 100 mM NaCl) to replicate physiological conditions.
Simulation Parameters:
- Software: GROMACS, AMBER, or similar MD package with MLIP integration.
- Temperature: 310 K.
- Time Scale: Conduct both short (hundreds of nanoseconds) and long (microsecond+) simulations [83].
- Replicates: Run multiple independent simulations from the same starting structure.
Validation Metrics:
- Root Mean Square Deviation (RMSD): Of backbone atoms relative to the native crystal structure.
- Root Mean Square Fluctuation (RMSF): Of backbone atoms to assess stability of individual residues.
- Key Distances: Monitor specific distances, such as between catalytic residues (e.g., Cα(Cys111)-Cα(His272) in PLpro [83]).
Interpretation: A successful MLIP will maintain low backbone RMSD, show realistic fluctuations, and preserve key structural motifs over time. Benchmark against simulations performed with established force fields like OPLS-AA, which has shown superior performance in maintaining the PLpro fold [83].

Protocol 3: Peptide Folding and Conformational Sampling

Peptides, with their structural plasticity, are a stringent test for an MLIP's balance of forces.

Experimental Protocol 3: Benchmarking on Peptide Systems

Objective: To assess an MLIP's ability to model the conformational ensemble of structured miniproteins, disordered peptides, and folding pathways [84].
System Setup:
- Select a diverse benchmark set of peptides (e.g., tryptophan zipper, disordered sequences like dynorphin A) [84].
- For folded peptides, start simulations from the native state.
- For folding studies, initiate long simulations (e.g., 10 μs) from extended, unfolded conformations [84].
Simulation Parameters:
- Use a 4 fs timestep with hydrogen mass repartitioning.
- Employ the water model recommended for the specific MLIP or force field.
- Save frames frequently (e.g., every 20 ps) for analysis.
Validation Metrics:
- Stability: For folded-state simulations, calculate RMSD to the native structure.
- Folding Behavior: For unfolded-state simulations, analyze if and how the peptide folds into its native state.
- Conformational Ensemble: Compare the population of secondary structures (alpha-helices, beta-sheets) to experimental data (NMR, CD spectroscopy).
Interpretation: No single force field or MLIP currently performs optimally across all peptide types [84]. A robust MLIP should demonstrate reversible fluctuations and avoid strong biases (e.g., over-stabilization of helices or excessive compaction of disordered states).

Workflow Visualization for MLIP Validation

The following diagram illustrates the integrated validation pipeline detailed in this application note, highlighting the critical steps for establishing confidence in biomolecular force predictions.

MLIP Validation Workflow

The validation of machine learning interatomic potentials for biomolecular force prediction is a multi-faceted process that must extend beyond simple energy and force comparisons on small molecules. As this application note outlines, a rigorous protocol begins with a critical evaluation of the underlying DFT data quality, followed by quantitative benchmarking on standardized tasks, and culminates in challenging the model with complex, biologically relevant simulations like protein stability and peptide folding. The recent advent of high-quality, large-scale datasets like OMol25 and universal models like UMA provides an unprecedented foundation for this work. By adhering to these detailed protocols, researchers in computational chemistry and drug development can robustly assess and confidently employ these powerful new tools, accelerating the path from in silico modeling to scientific discovery and therapeutic innovation.

The accurate simulation of defects and fractures is a cornerstone of modern computational materials science, with profound implications for the development of next-generation alloys, energy technologies, and biomedical materials. Machine learning interatomic potentials (MLIPs) have emerged as transformative tools that bridge the quantum accuracy of density functional theory (DFT) with the scale accessibility of classical molecular dynamics, enabling previously impossible investigations of material behavior across relevant time and length scales. However, a critical challenge persists: the transferability of these simulations to complex, real-world systems where multiple defect types interact under non-idealized conditions.

This application note examines the current state of defect and fracture simulations within the broader context of machine learning interatomic potentials for accurate force calculation research. We analyze the methodological frameworks, quantitative performance, and practical protocols that enhance the real-world applicability of computational predictions, providing researchers with actionable guidance for implementing these advanced techniques across diverse material systems.

The Transferability Challenge in Real-World Systems

Real-world material behavior emerges from complex interactions between various defects including vacancies, dislocations, grain boundaries, and fractures. Traditional simulation approaches face significant challenges in capturing this complexity:

Scale Discrepancy: The characteristic scales of extended defects (dislocation networks, general grain boundaries) often exceed computational limits of DFT calculations [85].
Data Bias: MLIPs trained on simple, idealized structures frequently fail when encountering the diverse atomic environments present in actual material deformation processes [55] [85].
Configuration Space Coverage: Existing active learning strategies struggle to achieve comprehensive defect sampling across all relevant loading conditions, particularly in large-scale systems with extended time scales [85].

Table 1: Quantitative Performance Comparison of MLIP Development Approaches

Development Approach	Accuracy Relative to DFT	Computational Efficiency	Extended Defect Coverage	Key Limitations
Classical EAM Potentials	Low	High	Extensive	Poor quantitative accuracy for complex alloys [85]
Basic MLIP (Limited Training)	Medium-High	Medium	Limited	Fails on unseen defect configurations [55] [85]
EIP-GS + PCC-GCMC Framework	High	Medium	Extensive	Requires careful implementation [85]
MTP with Active Learning	High	Medium-High	Good (with proper sampling)	Dependent on training data quality [55]

The reproducibility crisis in machine learning applications further compounds these challenges. Empirical studies indicate that nearly all articles in applied ML fields are not reproducible due to insufficient disclosure across key dimensions, with 72% of articles failing to specify whether datasets are public, proprietary, or commercially available [86]. This severely hampers independent verification and practical implementation of published methods.

Methodological Frameworks for Enhanced Transferability

Integrated Sampling and Reconstruction Framework

The EIP-GS (Empirical Interatomic Potentials-Guided Sampling) and PCC-GCMC (Periodic Configuration Construction via Grand Canonical Monte Carlo) framework addresses fundamental limitations in MLIP development for extended defects [85]:

EIP-GS leverages classical potentials to rapidly generate diverse defect configurations across millions of atoms, capturing complex phenomena like dislocation nucleation, multiplication, and grain boundary interactions.
PCC-GCMC converts non-periodic atomic clusters of selected defects into periodic configurations without vacuum layers, enabling compatibility with standard plane-wave DFT calculations while preserving the local atomic environment.

This approach systematically expands the configuration space covered during MLIP training, incorporating realistic defect structures that exceed traditional DFT capabilities.

Fracture-Aware Pretraining for Generalization

The Garf framework for 3D fracture reassembly demonstrates the value of fracture-aware pretraining in enhancing model generalization [87]. By learning fracture features from individual fragments, models can better handle:

Unseen object shapes and diverse fracture types
Missing or extraneous fragments common in real-world scenarios
Complex breakage patterns beyond synthetic training data

This approach achieves 82.87% lower rotation error and 25.15% higher part accuracy compared to state-of-the-art methods, demonstrating significant improvements in real-world applicability [87].

Active Learning for Targeted Improvement

Active learning (AL) strategies dynamically expand training datasets by identifying and incorporating extrapolative configurations encountered during simulations [55]. This is particularly valuable for:

Defect-Engineered Alloys: MTP development for CoCrFeMnNi high-entropy alloy demonstrated that AL retraining improves potential reliability for simulating vacancies, dislocations, and stacking faults [55].
Efficient Sampling: AL reduces training data size while maintaining accuracy, enabling high-throughput screening of complex alloy systems [55].

Quantitative Assessment of Transferability

Table 2: Performance Metrics Across Simulation Domains

Application Domain	Key Performance Metrics	Traditional Methods	ML-Enhanced Methods	Transferability Gain
3D Fracture Reassembly [87]	Rotation Error (%)	15.42 (baseline)	2.65 (Garf)	82.87% reduction
	Part Accuracy (%)	68.91 (baseline)	86.20 (Garf)	25.15% improvement
Hydraulic Fracturing Prediction [88]	R² Score	0.85-0.93 (ANN)	0.9804 (RF)	5.4-15.3% improvement
	Computational Cost	High (physics-based)	Low (ML-based)	Significant reduction
Extended Defect Modeling [85]	GB Energy Error (%)	>10% (EAM)	<3% (P-MLIP-1)	>70% improvement
	Uncertainty (MAD)	Not quantified	Exceptionally low	Enhanced reliability

The quantitative evidence demonstrates consistent improvements across domains when implementing ML approaches designed for enhanced transferability. Notably, the RF model for hydraulic fracturing evaluation achieved R² = 0.9804 with low computational cost, significantly outperforming SVM and neural networks on a large-scale dataset of 16,000 records [88].

Experimental Protocols

Protocol: MLIP Development for Extended Defects in Metals

This protocol outlines the EIP-GS and PCC-GCMC framework for developing transferable MLIPs, validated for BCC tungsten but applicable to other metal systems [85].

Step 1: Initial Configuration Generation

Perform large-scale molecular dynamics simulations using established EAM potentials
Simulate diverse deformation scenarios: nanoindentation, tensile deformation, shear, fracture
Include polycrystalline and single-crystal systems to capture variety of defect types
Generate at least 100,000 atomic configurations representing different loading conditions

Step 2: Representative Configuration Selection

Apply D-optimality criterion-based algorithm to select representative local atomic environments (LAEs)
Compare generated configurations against basic dataset (ground-state structures, elastic strains, simple defects)
Select distinct atomic clusters (typical size: 100-200 atoms) covering the configuration space
Add vacuum layers (8Å thickness) to create non-periodic structures for initial DFT calculations

Step 3: Periodic Configuration Construction

For each atomic cluster, initialize simulation box slightly larger than atomic coordinates
Insert new atoms via GCMC simulations using EAM potentials, keeping core atoms fixed
Relax coordinates of newly inserted atoms and box size simultaneously
Iterate with gradually increasing box size until lowest-energy configuration is achieved
Select only lowest-energy configurations for DFT calculations

Step 4: MLIP Training and Refinement

Combine basic dataset with new periodic configurations for training foundational model (P-MLIP-0)
For specific applications, perform additional EIP-GS and PCC-GCMC operations
Implement traditional on-the-fly active learning for unforeseen LAEs
Iterate until no new LAEs are produced during validation simulations

Diagram 1: MLIP development workflow for extended defects

Protocol: Active Learning MTP for High-Entropy Alloys

This protocol details development of Moment Tensor Potentials for complex alloys like CoCrFeMnNi, with emphasis on defect engineering applications [55].

Step 1: Training Set Construction

Include various defect-induced configurations: vacancies, dislocations, stacking faults
Incorporate diverse chemical environments representative of HEA composition space
Utilize BFGS unconstrained optimization algorithm for potential training
Implement active learning scheme to dynamically add training data

Step 2: Validation and Error Assessment

Calculate error metrics across different property classes: equation of state, elastic constants
Perform uniaxial tensile deformation simulations to validate mechanical response
Conduct nano-indentation tests to assess potential performance under complex loading
Evaluate solid-liquid interface stability for high-temperature applications

Step 3: Implementation and Scaling

Embed potential into LAMMPS code for public distribution
Verify computational efficiency compared to MEAM potentials and DFT
Test scaling performance on target computational architectures
Document usage for community adoption

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Computational Tools for Defect and Fracture Simulations

Tool/Software	Primary Function	Application Context	Key Considerations
LAMMPS [14] [55]	Large-scale atomic/molecular massively parallel simulator	MD simulations with MLIPs	Supports various MLIP formats; requires potential parameterization
MLIP-3 [85]	Moment Tensor Potential implementation with active learning	OTF-AL for MLIP development	Uses D-optimality criterion and MaxVol algorithm for configuration selection
FracproPT [89]	Hydraulic fracturing design and optimization	Fracture propagation in subsurface formations	Physics-based; can be complemented with ML approaches
PCC-GCMC Code [85]	Periodic configuration construction	Converting defect clusters for DFT calculations	Custom implementation required; integrates with GCMC algorithms
MTP Package [55]	Moment Tensor Potential training and evaluation	HEAs and complex alloys	Active learning capabilities; efficient for multi-element systems

The transferability of defect and fracture simulations to complex real-world systems represents both the primary challenge and most significant opportunity in computational materials science. Methodological frameworks that integrate empirical potential-guided sampling, automated configuration reconstruction, and active learning strategies demonstrate substantial improvements in predictive accuracy across diverse applications—from high-entropy alloys to hydraulic fracturing operations. The quantitative assessments and detailed protocols provided in this application note equip researchers with practical tools to enhance the real-world relevance of their simulations, ultimately accelerating the development of advanced materials through more reliable computational predictions.

As the field progresses, increased emphasis on reproducibility—through detailed methodological disclosure, code sharing, and comprehensive dataset documentation—will be essential for translating computational advances into practical technological innovations. The frameworks outlined here provide a pathway toward this goal, establishing robust foundations for the next generation of transferable simulation approaches.

Conclusion

Machine learning interatomic potentials represent a paradigm shift in atomistic simulations, offering unprecedented opportunities to combine quantum-mechanical accuracy with large-scale simulation capabilities. The development of symmetry-equivariant architectures and sophisticated training strategies has significantly enhanced the accuracy of force predictions, enabling reliable studies of complex systems from metallic alloys to biomolecules. However, challenges remain in ensuring model transferability, particularly for out-of-distribution systems and emergent properties. The implementation of robust uncertainty quantification and active learning frameworks is crucial for building trust in MLIP predictions. Future directions point toward more data-efficient training, improved interpretability, and the integration of electronic structure information through machine learning Hamiltonians. For biomedical research, these advances promise to revolutionize computational drug discovery by enabling accurate simulation of protein-ligand interactions, enzymatic mechanisms, and complex biomolecular dynamics at unprecedented scales, ultimately accelerating the development of novel therapeutics and deepening our understanding of biological processes at the atomic level.

Machine Learning Interatomic Potentials: Achieving Accurate Force Calculations for Materials and Biomolecular Research

Machine Learning Interatomic Potentials: Achieving Accurate Force Calculations for Materials and Biomolecular Research

Abstract

Understanding Machine Learning Interatomic Potentials: From Quantum Mechanics to Accurate Force Fields

Theoretical Foundation: How MLIPs Reconcile Accuracy and Efficiency

The Potential Energy Surface Framework

Architectural Innovations in MLIPs

Quantitative Performance Comparison

Application Protocols for Pharmaceutical Research

Protocol 1: Predicting Molecular Crystal Stability

Protocol 2: Modeling Charged Fluids and Ionic Liquids

The Scientist's Toolkit: Essential Research Reagents

Future Perspectives and Challenges

Theoretical Framework and Key Concepts

Foundational Descriptors for Atomic Systems

The Evolution toward Equivariant Architectures

Quantitative Comparison of Architectures and Performance

Performance Benchmarks of MLIP Architectures

Algorithm Selection Guidelines for Different Data Regimes

Experimental Protocols and Implementation

Protocol 1: Developing a Custom Equivariant MLIP

Protocol 2: Teacher-Student Knowledge Distillation for Efficient MLIPs

Key Software Frameworks and Descriptor Tools

Benchmark Datasets for MLIP Development

Fundamental Symmetry Groups and Their Physical Interpretation

Current Architectural Paradigms for Embedding Equivariance

Quantitative Performance of Equivariant Models

Application Notes and Experimental Protocols

Protocol 1: Implementing an Equivariant GNN for Force Field Training

Protocol 2: Enhancing Generalization with Physics-Informed Sharpness-Aware Minimization (PI-SAM)

Advanced Applications and Future Outlook

Performance Benchmarks and Application-Specific Accuracy

Phonon Property Predictions

High-Pressure Applications

Experimental Protocols and Methodologies

DIRECT Sampling for Robust Training Set Selection

Active Learning for Infrared Spectra Prediction

The Scientist's Toolkit: Essential Research Reagents

Critical Challenges and Future Directions

Current Landscape of MLIP Training Datasets

Experimental Protocols for MLIP Development and Validation

Active Learning for Data Curation and Expansion

Benchmarking MLIPs for Charge-Related Properties

Workflow Visualization for MLIP Development

Model Selection Guidelines

Implementing MLIPs: Architectures, Training Strategies, and Domain Applications

MLIP Architecture Families: Core Principles and Comparative Analysis

Architectural Paradigms and Their Evolution

Quantitative Performance Comparison

Experimental Protocols for MLIP Evaluation and Application

Standardized Benchmarking with MLIPAudit

Molecular Dynamics Simulation with MLIPs

Visualization of MLIP Architectures and Workflows

Conceptual Hierarchy of MLIP Architectures

MLIP Benchmarking and Application Workflow

Application Notes

Protocols

Protocol 2.1: Implementing an Electronic-Structure-Informed Equivariant Model (e.g., OrbNet-Equi)

Protocol 2.2: Engineering Custom Composite Descriptors for 2D Material Properties

Visualization

Protocol 1: Passive Learning

Principle and Application Context

Detailed Experimental Workflow

Workflow Visualization

Protocol 2: Active Learning

Principle and Application Context

Detailed Experimental Workflow

Workflow Visualization

Protocol 3: Multi-Fidelity Learning

Principle and Application Context

Detailed Experimental Workflow

Workflow Visualization

The Scientist's Toolkit: Research Reagent Solutions

Application Notes: Key Studies and Quantitative Data

Notes on Functional Insights

Experimental Protocols

Protocol 1: Trapping and Characterizing a High-Energy Conformational State

Protocol 2: Data Generation for Training a Machine Learning Interatomic Potential

Protocol 3: Simulating Conformational Transitions with an MLIP

Workflow and Pathway Visualizations