Jacqueline Cole1,2,3

1, University of Cambridge, Cambridge, United Kingdom
2, Argonne National Laboratory, Chicago, Illinois, United States
3, Rutherford Appleton Laboratory, Harwell, Oxfordshire, United Kingdom

Large-scale data-mining workflows are increasingly able to predict new materials that possess a targeted functionality [1]. The success of such materials-discovery approaches nonetheless depends on having the right database to mine. This presentation shows how to auto-generate tailor-made databases in which to search for functional materials that meet the needs of a given device application.

The talk presents ChemDataExtractor, a 'chemistry-aware' open-source text- and table-mining software tool that extracts large volumes of material-property data from the literature using natural language processing, optical character recognition and machine learning [2]. Machine learning is then employed to populate any missing experimental data.
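
As a rough illustration of the two steps just described (mining property records from text, then filling gaps with a fitted model), the following Python sketch may help. It is not ChemDataExtractor's API or method. The regular expression, the function names (extract_records, fit_line, fill_missing), the "descriptor" field and all sample values are hypothetical, and a least-squares line stands in for whatever machine-learning model is actually used.

```python
import re
from statistics import mean

# Hypothetical sentence pattern: "<Compound> has|shows a(n) <property> of <value> <unit>".
# Real literature needs far richer language processing than a single regex.
PATTERN = re.compile(
    r"(?P<compound>[A-Z][A-Za-z0-9]*)\s+(?:has|shows)\s+an?\s+"
    r"(?P<property>[a-z ]+?)\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%]+)"
)


def extract_records(text):
    """Return one dict per (compound, property, value, unit) match in text."""
    return [
        {
            "compound": m.group("compound"),
            "property": m.group("property"),
            "value": float(m.group("value")),
            "unit": m.group("unit"),
        }
        for m in PATTERN.finditer(text)
    ]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fill_missing(rows, feature, target):
    """Fill rows whose target is None using a line fitted to the known rows."""
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[feature] for r in known], [r[target] for r in known]
    )
    return [
        {**r, target: slope * r[feature] + intercept} if r[target] is None else r
        for r in rows
    ]


# Made-up sentences and numbers, for illustration only.
sample = "DyeA has an absorption maximum of 450 nm. DyeB shows a band gap of 2.1 eV."
for record in extract_records(sample):
    print(record)

table = [
    {"descriptor": 1.0, "value": 2.0},
    {"descriptor": 2.0, "value": 4.0},
    {"descriptor": 3.0, "value": None},  # missing measurement to be predicted
]
print(fill_missing(table, "descriptor", "value"))
```

A production pipeline would replace the regex with chemistry-aware parsing and the fitted line with a validated predictive model. The sketch only shows the shape of the resulting database: structured records first, then model-based completion of the gaps.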

The role of this tool in accelerating materials discovery is illustrated.

[1] J. M. Cole, K. S. Low, H. Ozoe, P. Stathi, C. Kitamura, H. Kurata, P. Rudolf, T. Kawase, “Data Mining with Molecular Design Rules Identifies New Class of Dyes for Dye-Sensitised Solar Cells”, Phys. Chem. Chem. Phys. 48 (2014) 26684-90. (Communication).
[2] M. C. Swain, J. M. Cole, “ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature”, J. Chem. Inf. Model. 56 (2016) 1894–1904.