Rainer-Rudolph-Awards Session at the Mosbacher Kolloquium 2024
Designing triosephosphate isomerases using generative language models
Alexander Braun
University of Bayreuth
Enzymes are wonderful biocatalysts that increase reaction rates by several orders of magnitude. Their ability to catalyze reactions in aqueous solution, at atmospheric pressure and at ambient temperature makes them an environmentally friendly and cost-effective alternative to the synthetic catalysts used in industry. However, enzymes are mostly restricted to reactions that occur in the context of cellular life. The ability to design tailor-made enzymes is therefore of great interest to biotechnology and the chemical industry. Methods for creating proteins that catalyze a desired reaction have mostly relied on physics-based heuristics or on improving promiscuous activities of natural proteins. These approaches are highly time-consuming and require extensive screening. In a paradigm shift, recent successes in applying transformer-based language models to protein science have inspired the development of unconditional and conditional language models, such as ZymCTRL, for designing new protein sequences. Trained on the BRENDA enzyme database, ZymCTRL generates putative enzyme sequences according to the Enzyme Commission (EC) number given as input. We experimentally assess the ability of this language model to generate triosephosphate isomerases (TIMs), a well-studied, obligately oligomeric enzyme class that catalyzes its reaction near the diffusion limit. After shallow filtering of the generated putative enzyme sequences, three out of twelve de novo TIMs were active in vivo and able to complement a TIM-deficient E. coli strain. In-depth characterization of the best-performing artificial enzyme shows in vitro activity just two orders of magnitude below that of its natural counterparts. Based on these results, we propose a filtering mechanism to further increase the experimental success rate of artificially generated proteins. This study highlights the potential of protein language models as tools for generating tailored enzymes directly from sequence.