A rule-free workflow for the automated generation of databases from scientific literature
Published: 2024-02-23

Luke P. J. Gilligan, Matteo Cobelli, Valentin Taufour & Stefano Sanvito

npj Computational Materials 9: 222 (2023).

Editorial Summary

A workflow for the automated generation of databases from literature

Since the dawn of modern science, the volume of published scientific literature has grown continuously and exponentially. Materials data can provide the foundation for models and theories that navigate the physical and chemical space and ultimately drive discovery. However, accessing information from unstructured literature at such a massive scale presents significant technical and practical challenges. Curated databases are generally scarce and often limited to theoretical data. Large-scale theoretical datasets have proven to be a revolutionary tool in the search for new materials with unique properties and in the discovery of intricate materials trends, and they have served as a platform for constructing machine-learning models with enhanced throughput. Databases containing experimental results, by contrast, are much rarer and typically smaller. The existing landscape of experimental datasets is incomplete and fragmented, and most known experimental results remain accessible only through unstructured scientific literature.

In this work, a team led by Prof. Stefano Sanvito from the School of Physics, Trinity College Dublin, Ireland, presents a workflow based on fine-tuning BERT models for different downstream tasks, which automatically extracts structured information from the unstructured natural language of scientific literature. Unlike existing methods for the automated extraction of structured compound-property relations from similar sources, this workflow does not rely on the definition of intricate grammar rules. It can therefore be adapted to a new task without requiring extensive implementation effort and knowledge. The authors tested their data-extraction workflow by automatically generating two databases, one for Curie temperatures and one for band gaps. These were then compared with manually curated datasets and with those obtained using a state-of-the-art rule-based method.

Furthermore, to showcase the practical utility of the automatically extracted data in a materials-design workflow, the authors used them to construct machine-learning models that predict Curie temperatures and band gaps. In general, although noisier, automatically extracted datasets can grow quickly in volume, and that volume partially compensates for their inaccuracy in downstream tasks.
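
The comparison of an automatically extracted database with a manually curated one can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `compare_to_curated`, the 5% relative tolerance, and the sample records (including the deliberately wrong Ni entry) are all assumptions chosen for illustration. Precision is computed over the compounds that appear in both databases, and recall over the curated reference.

```python
from typing import Dict, Tuple


def compare_to_curated(
    extracted: Dict[str, float],
    curated: Dict[str, float],
    rel_tol: float = 0.05,  # assumed tolerance, not taken from the paper
) -> Tuple[float, float]:
    """Return (precision, recall) of extracted values against a curated reference.

    A record counts as correct when its compound is in the curated set and
    its value lies within rel_tol (relative) of the curated value.
    """
    overlap = 0
    correct = 0
    for compound, value in extracted.items():
        reference = curated.get(compound)
        if reference is None:
            continue  # cannot be verified against the curated set
        overlap += 1
        if abs(value - reference) <= rel_tol * abs(reference):
            correct += 1
    precision = correct / overlap if overlap else 0.0
    recall = correct / len(curated) if curated else 0.0
    return precision, recall


# Illustrative Curie temperatures in kelvin (hypothetical sample records).
curated = {"Fe": 1043.0, "Co": 1388.0, "Ni": 627.0, "Gd": 293.0}
extracted = {
    "Fe": 1043.0,
    "Co": 1390.0,
    "Ni": 354.0,   # noisy record, e.g. a value reported in Celsius
    "EuO": 69.0,   # absent from the curated set, so it is not scored
}

precision, recall = compare_to_curated(extracted, curated)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints: precision=0.67, recall=0.50
```

In this toy case one of the three verifiable records is wrong, which mirrors the kind of noise the summary attributes to automatically extracted datasets.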


 