Data to Knowledge

Machine learning has transformed materials modelling in the last few years. Given access to high-quality data from computations and/or experiments, machine learning can be used to develop expert systems powered by large language models (such as ChatGPT). It can also be used to train surrogate models that predict the properties of structures, eliminating the need for simulations, or to speed up simulations through machine-learnt interatomic potentials (MLIPs). The Data to Knowledge resource theme is dedicated to making the creation of these models possible by providing the data infrastructure and workflows needed to generate and exploit them.

The Data to Knowledge collections comprise curated datasets designed for use in machine learning or generated through machine learning. An example is the Machine Learning Interatomic Potentials (MLIPs) data collection, which includes the MLIP XYZ files used for training, the trained model itself, and, where possible, related data such as AiiDA provenance records. Making these datasets available enables those without the resources to compute the data themselves to use them for machine learning and modelling.

Training is a central focus of this resource theme. We provide two types of training: general training and tool-specific training. Our general training is provided as self-paced learning online.

Data Sources

Data to Knowledge Community Data Collections

The Data to Knowledge collections provide curated datasets specifically designed for use in machine learning or generated through machine learning processes. An example is the Machine Learning Interatomic Potentials (MLIPs) data collection, which includes MLIP XYZ files used for training, the trained MLIP models, and, where available, additional metadata such as AiiDA provenance information. By making these datasets accessible, researchers without the resources to generate such data themselves can use them for machine learning. Additionally, the MLIP models can be applied directly in modelling tasks, enabling broader exploration and advancements in research.
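To give a feel for what the XYZ training files in such a collection contain, here is a minimal sketch of reading extended-XYZ text using only the Python standard library. It assumes the common layout of an atom-count line, a key=value metadata line, and one row per atom starting with the element symbol and three coordinates. The function name `parse_extxyz`, the `energy` key, and the sample values are illustrative assumptions, not taken from any PSDI dataset. For real work, an established reader such as `ase.io.read` is likely the better choice.

```python
import shlex


def parse_extxyz(text):
    """Parse extended-XYZ text into a list of frames (plain dicts).

    Hypothetical helper for illustration; only the symbol and the first
    three numeric columns of each atom row are read.
    """
    lines = text.strip().splitlines()
    frames = []
    i = 0
    while i < len(lines):
        n_atoms = int(lines[i])  # first line of a frame: number of atoms
        info = {}
        # Second line: space-separated key=value pairs, values may be quoted.
        for token in shlex.split(lines[i + 1]):
            key, sep, value = token.partition("=")
            if sep:
                info[key] = value
        symbols, positions = [], []
        for row in lines[i + 2 : i + 2 + n_atoms]:
            fields = row.split()
            symbols.append(fields[0])
            positions.append(tuple(float(x) for x in fields[1:4]))
        frames.append({"info": info, "symbols": symbols, "positions": positions})
        i += 2 + n_atoms  # jump to the start of the next frame
    return frames


# Made-up sample data, for illustration only.
SAMPLE = """\
3
Lattice="10 0 0 0 10 0 0 0 10" Properties=species:S:1:pos:R:3 energy=-14.2
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
2
Properties=species:S:1:pos:R:3 energy=-6.7
H 0.000 0.000 0.000
H 0.000 0.000 0.740
"""

frames = parse_extxyz(SAMPLE)
energies = [float(frame["info"]["energy"]) for frame in frames]
```

Each frame keeps its metadata as strings, so labels such as energies can be converted and paired with the structures when assembling a training set.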

Tools

janus-core, Machine Learning Interatomic Potentials (MLIP) Tool

janus-core provides versatile tools for exploring a range of machine learning interatomic potentials. It can be accessed through the command line, Python scripts, Jupyter Notebooks, or a web interface. The platform supports various tasks, from basic calculations like single-point energies and geometry optimization to more advanced analyses, such as phonon calculations, molecular dynamics simulations, and nudged elastic band methods.

aiida-mlip, Workflow Management for Machine Learning Interatomic Potentials (MLIP)

AiiDA-MLIP is a plugin for AiiDA, a versatile workflow management system designed for simulations in materials and molecular sciences. Beyond managing workflows, AiiDA distinguishes itself from traditional tools by recording detailed data provenance, ensuring transparency and reproducibility in simulations. AiiDA-MLIP integrates janus-core into the AiiDA ecosystem, enabling machine-learned interatomic potentials (MLIPs) to be used within this framework.

abcd, Data Management for Machine Learning Interatomic Potentials (MLIP)

abcd offers a suite of tools that allow users to programmatically import data from a repository and use it on their local infrastructure. This enables exploration of data in ways that would be too resource-intensive for a general service. Designed specifically for machine-learned interatomic potentials, abcd supports both OpenSearch and MongoDB as backend options, giving users flexibility in their setup. The tool is developed in collaboration with the University of Cambridge.

PSDI Community Data Collections API Python Code

Python code for depositing data to, and retrieving data from, PSDI Community Data Collections (https://data-collections.psdi.ac.uk/) via the API. It can be used by developers who wish to contribute to a PSDI Community Data Collection through automated deposition, for example to perform bulk uploads or automate import pipelines. It can also be used by developers who wish to automate the retrieval of data and metadata from a PSDI Community Data Collection, for example to perform bulk downloads or automate export pipelines.

ML-PEG (Machine Learning Performance and Extrapolation Guide) GitHub Repository

Repository for locally hosting an ML-PEG (Machine Learning Performance and Extrapolation Guide) application as an interactive dashboard. ML-PEG is a comprehensive benchmarking framework and interactive performance guide for evaluating Machine Learning Interatomic Potentials (MLIPs) across diverse systems and properties, going beyond energies and forces alone. The interactive performance guide allows users to explore and compare MLIP performance and to examine errors in depth, connecting performance (or the lack of it) to the underlying chemistry and physics. Please note that this code is currently available as an alpha release and is still under development.
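As a rough illustration of the kind of comparison such a benchmark rests on, the sketch below computes two generic error metrics, mean absolute error (MAE) and root-mean-square error (RMSE), between model predictions and reference values. This is not ML-PEG's own code or its choice of metrics. The function names and the numbers are assumptions made up for the example.

```python
import math


def mae(predicted, reference):
    """Mean absolute error between two equal-length sequences."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


# Made-up illustrative values, not real benchmark results.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 5.0]

mae_value = mae(predicted, reference)
rmse_value = rmse(predicted, reference)
print(f"MAE = {mae_value:.3f}, RMSE = {rmse_value:.3f}")
```

RMSE weights large deviations more heavily than MAE, so reporting both can help distinguish a model that is slightly off everywhere from one that fails badly on a few systems.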

Guidance

Chemistry and Materials Machine Learning School (CaMML)

Training is at the core of this resource theme. The Chemistry and Materials Machine Learning (CaMML) school is run by PSDI in collaboration with a range of communities. Training is targeted towards PhD students, in particular those in the Materials and Molecular Simulations field, who have experience of coding but are not highly experienced with machine learning. The aim of this in-person training is to introduce attendees to the latest methods of machine learning for the atomistic simulation of materials.

Machine Learning Interatomic Potentials (MLIP) Training

Training is at the core of this resource theme. These training materials help users get started with our tools.

What We Provide