What Else Can Uni-Mol Do? | Combining 3D Geometry and Text to Usher in a New Era of Molecular Multimodal Learning
As demand for molecular representation learning grows in drug discovery and materials design, multimodal representation learning that combines the 3D geometric structure of molecules with biomedical text is becoming a research hotspot. A recent arXiv paper, "GeomCLIP: Contrastive Geometry-Text Pre-training for Molecules" [1], introduces a framework called GeomCLIP. The framework pairs the 3D geometric information of molecules with text descriptions and strengthens multimodal molecular representations through contrastive learning and denoising pre-training tasks. The paper reports strong performance on several downstream tasks.
The research team built a large-scale dataset, PubChem3D, which contains more than 200,000 pairs of 3D molecular geometries and text descriptions covering rich chemical and biological information. GeomCLIP adopts a dual-encoder structure, with one encoder for molecular geometry and one for text. Contrastive learning aligns the representations of the two modalities while the geometric encoder retains its ability to model the characteristics of 3D molecules. In experiments, GeomCLIP performs strongly on molecular property prediction, molecule-to-text retrieval, and 3D molecular caption generation, offering useful support for drug research and development and materials design.
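To make the alignment step concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired geometry and text embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical, the two encoders are replaced by ready-made vectors, and the denoising objective is left out entirely.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(geom_emb, text_emb, temperature=0.07):
    """CLIP-style loss: row i of geom_emb and row i of text_emb form the positive pair.

    The temperature is an assumed illustrative value, not one taken from the paper.
    """
    g = l2_normalize(geom_emb)
    t = l2_normalize(text_emb)
    logits = g @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(g))
    loss_g2t = -log_softmax(logits)[idx, idx].mean()    # geometry -> text direction
    loss_t2g = -log_softmax(logits.T)[idx, idx].mean()  # text -> geometry direction
    return 0.5 * (loss_g2t + loss_t2g)


# Toy stand-ins for encoder outputs (hypothetical data, four molecules).
geom = np.eye(4)
matched_text = np.eye(4)                      # each text matches its own molecule
shuffled_text = np.roll(np.eye(4), 1, axis=0)  # each text matches a different molecule

matched_loss = symmetric_contrastive_loss(geom, matched_text)
shuffled_loss = symmetric_contrastive_loss(geom, shuffled_text)
print(f"matched pairs:  {matched_loss:.6f}")
print(f"shuffled pairs: {shuffled_loss:.6f}")
```

The loss is close to zero when each geometry embedding is most similar to its own text embedding, and large when the pairs are shuffled. Minimizing it therefore pulls matching geometry and text representations together and pushes mismatched ones apart.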
This research was completed by a joint team from the Artificial Intelligence Research Laboratory at Pennsylvania State University and the Shenzhen International Graduate School of Tsinghua University. Teng Xiao of Pennsylvania State University is the first and corresponding author of the paper; the co-authors are Chao Cui, Huaisheng Zhu, and Professor Vasant G. Honavar. The work opens a new direction for multimodal learning of molecular representations and offers new approaches for drug discovery and materials science research.