Essay Competition Winner: Artificial Intelligence and its Pursuit of Predicting Gene Expression by Amith Hariharasudhan

DNA is the self-replicating molecule that carries our genetic information. Sequences of DNA nucleotides compose genes, which encode the amino acid sequences that form proteins. There are approximately 20,000 genes in the human genome, determining many traits, from our susceptibility to specific diseases to our facial features. However, only about 2% of the DNA in our genome constitutes genes; the rest is known as non-coding DNA, and one of its roles is to regulate gene expression in different parts of the body according to the functions of the cells located there. Gene expression refers to the process by which the information in a gene is used to produce a protein, and regulation determines whether that production is enhanced or inhibited. However, gaining a clear understanding of the mechanisms of these non-coding regions, and of which genes they affect, has been an enduring challenge for scientists.

Advances in machine learning, a branch of artificial intelligence, have led to major developments in understanding these interactions. In October 2021, collaborating researchers from DeepMind and Calico published a paper on a novel deep learning architecture named Enformer. Enformer has improved our capacity to predict gene expression using just the DNA sequence as an input. It derives from another type of neural network architecture, the transformer. Transformers are widely used in natural language processing, which allows computers to work with human text or speech. An everyday example of natural language processing is spell check or autocomplete in a word processor. The transformer was modified to read long sequences of DNA instead of text, giving rise to Enformer.
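
To make "using just the DNA sequence as an input" more concrete, here is a minimal sketch of one common way to turn a DNA sequence into numbers a neural network can read, known as one-hot encoding. The function name and the A, C, G, T column order are assumptions made for illustration; this is not code from the Enformer paper.

```python
# Column order for the four bases (an assumed convention for this sketch).
BASES = "ACGT"

def one_hot_encode(sequence):
    """Return one 4-element row per base; unknown bases such as 'N' become all zeros."""
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in BASES:
            row[BASES.index(base)] = 1  # switch on the column for this base
        encoded.append(row)
    return encoded

# Print each base of a short example next to its encoded row.
for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```

Each base becomes a row with a single 1 marking which of the four letters it is, so a sequence of any length becomes a grid of numbers with four columns.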

Previous methods used to evaluate gene expression have had limited accuracy or required tedious work. For example, experimentally determining the effects of genetic variants (changes in the DNA base sequence) has proven to be an arduous task, and there are restrictions on which cell types can be analysed in the lab. Population-based association studies, also used to analyse genetic variants, mainly identify common ones and struggle to differentiate between correlation and causation. Initial studies using deep learning to predict gene expression relied on convolutional neural networks (CNNs) such as Basenji and its successor Basenji2. Basenji2 was able to take into account regulatory elements up to 20 kb either side of the transcription start site. However, regulatory elements further away (upstream and downstream of the gene) also influence gene expression, and leaving them unaccounted for compromises the accuracy of Basenji2.

Upstream and downstream components regulating the transcription of a gene include enhancers, silencers and insulators, which are often further than 20 kb away from the transcription start site (TSS).
Taken from: https://genomemedicine.biomedcentral.com/articles/10.1186/gm344/figures/1  

Enformer’s gene expression prediction accuracy was considerably higher than that of previous models. One reason is that it uses self-attention, a mechanism also used in natural language processing. This mimics how humans pay attention to information: the network emphasises some parts of the input whilst reducing the effect of others, enabling it to focus on the small but important parts of the data. The second reason relates to the difference between the receptive fields of Enformer and Basenji2. A receptive field is the region of the input that a unit of a network can view and process; these units then integrate their inputs to evaluate the effects on gene expression. This is analogous to how visual stimuli are received by restricted regions of the visual field (receptive fields) and how this information is integrated to cover the entire visual field. The receptive field of Enformer is larger than that of CNNs, enabling it to take into account regulatory elements up to 100 kb either side of the transcription start site, 5 times further than Basenji2. Because the influence of distal regulatory elements is considered, Enformer’s predictions are even comparable to experimental results.
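
The idea of emphasising some inputs and reducing others can be sketched in a few lines. The toy example below is a simplified form of self-attention in which each position is compared directly with every other position; real transformers first pass the inputs through learned projections, and the function names and toy numbers here are assumptions for illustration rather than Enformer's actual implementation.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Simplified self-attention: queries, keys and values are the inputs themselves."""
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # Score every position against the current one, then normalise.
        weights = softmax([dot(query, key) / scale for key in vectors])
        # Each output is a weighted average of all input positions.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy positions: the first and third are similar, the second differs.
positions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(positions)
print(weights[0])  # attention paid by the first position to each position
```

In this toy case the first position gives more weight to the third position, which resembles it, than to the second. Because every position is compared with every other, distance along the sequence does not limit which positions can influence each other.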

Enformer’s contribution scores (indicating the regions of the input sequence that had the greatest impact on predicted gene expression) extend beyond 20 kb, as enhancers beyond this point are included. In contrast, Basenji2 only shows spikes in the contribution score up to 20 kb.
Taken from: https://deepmind.com/blog/article/enformer 

The ability to predict which genetic variants alter gene expression using Enformer has vast applications. It could enable us to identify new disease-associated variants in diseases such as cystic fibrosis or Down syndrome, and to differentiate them from false positives that are not directly associated. Moreover, it could substantially improve the identification of rare variants in diseases when used alongside the population-based, genome-wide association studies mentioned earlier. Finally, it can be applied to natural and synthetic sequences that lack experimental data, improving our understanding of the genome.
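
One way to picture variant scoring is to ask a model for its prediction on the original sequence and again on the sequence carrying the variant, then compare the two. The sketch below assumes this comparison approach and uses a deliberately crude, hypothetical stand-in for a trained model (it simply counts a made-up motif); all names are invented for illustration.

```python
def toy_expression_model(sequence):
    """Hypothetical stand-in for a trained model: counts a made-up motif."""
    motif = "TATA"
    return sum(1 for i in range(len(sequence) - len(motif) + 1)
               if sequence[i:i + len(motif)] == motif)

def apply_variant(sequence, position, alt_base):
    """Return a copy of the sequence with one base substituted (0-based position)."""
    return sequence[:position] + alt_base + sequence[position + 1:]

def variant_effect(model, sequence, position, alt_base):
    """Prediction for the variant sequence minus prediction for the reference."""
    return model(apply_variant(sequence, position, alt_base)) - model(sequence)

reference = "GGTATAGGCC"
# A variant inside the toy motif changes the prediction...
print(variant_effect(toy_expression_model, reference, 3, "C"))
# ...whereas a variant outside it leaves the prediction unchanged.
print(variant_effect(toy_expression_model, reference, 8, "T"))
```

A score far from zero flags a variant as likely to alter expression, while a score near zero suggests it is less likely to matter, which is the kind of distinction that could help separate causal variants from false positives.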

Ultimately, these advancements, in addition to previous ones such as protein structure prediction using AlphaFold2, highlight the potential for AI to accelerate progress in genetics. Despite the more precise gene expression predictions yielded by Enformer, ample progress remains to be made on other complexities in gene expression and non-coding DNA. Further research using AI can help us explore the genome even further.
