March 2023 - April 2023
Source: https://www.france-genomique.org/technological-expertises/regulome/mapping-of-transcription-start-sites-tss/?lang=en
Identifying transcription start sites (TSSs) is a critical step in understanding gene regulation. A TSS marks the position where transcription of DNA into RNA begins. Accurate TSS identification helps researchers unravel gene regulatory networks and gives insight into biological processes, diseases, and potential therapeutic targets.
The objective of our project is to explore recent developments in TSS modeling and assess how well these models perform. To do this, we first use DeepTSS [1] to predict the approximate locations of TSSs in DNA sequences from CAGE-seq data. Because these predictions may contain errors, we then use DNABERT [2] and TSSFinder [3] to check whether the predicted locations correspond to true TSSs. As a secondary evaluation, we cross-reference our results with the Eukaryotic Promoter Database (EPDnew), which provides TSS annotations for various organisms. Through this multi-stage evaluation, we aim to increase the accuracy and reliability of our TSS predictions.
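The cross-referencing step against EPDnew can be illustrated with a minimal sketch. It is not the project's actual code: the `(chromosome, strand, position)` tuple format, the function names, and the 50 bp tolerance are all assumptions made for illustration. A predicted TSS counts as a hit if an annotated TSS lies within the tolerance on the same chromosome and strand.

```python
from bisect import bisect_left


def fraction_within(queries, references, tolerance=50):
    """Fraction of query sites with a reference site within `tolerance` bp.

    Sites are hypothetical (chromosome, strand, position) tuples; the 50 bp
    default is an illustrative assumption, not a value from the project.
    """
    if not queries:
        return 0.0

    # Group reference positions by (chromosome, strand) and sort them
    # so the nearest neighbour can be found by binary search.
    index = {}
    for chrom, strand, pos in references:
        index.setdefault((chrom, strand), []).append(pos)
    for positions in index.values():
        positions.sort()

    hits = 0
    for chrom, strand, pos in queries:
        ref = index.get((chrom, strand))
        if not ref:
            continue
        i = bisect_left(ref, pos)
        # The nearest reference is either just before or at the insertion point.
        neighbours = ref[max(i - 1, 0): i + 1]
        if min(abs(pos - r) for r in neighbours) <= tolerance:
            hits += 1
    return hits / len(queries)


def evaluate(predicted, annotated, tolerance=50):
    """Return (precision, recall) of predicted TSSs against annotations."""
    precision = fraction_within(predicted, annotated, tolerance)
    recall = fraction_within(annotated, predicted, tolerance)
    return precision, recall


if __name__ == "__main__":
    # Toy data, invented for the example.
    predicted = [("chr1", "+", 1000), ("chr1", "+", 5000), ("chr2", "-", 300)]
    annotated = [("chr1", "+", 1020), ("chr2", "-", 900), ("chr1", "-", 5000)]
    print(evaluate(predicted, annotated))
```

Precision here is the share of predictions that land near an annotation, and recall is the share of annotations recovered by at least one prediction. A stricter or looser tolerance changes both numbers, so the window should be reported alongside the results.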
This project is in progress, so check back in a month to see the final result.
Python, sklearn, PyTorch, Linux Bash Scripting, Pipeline Development
Mirudhula Mukundan, David Luo, Xueke Jin, Wroochit Mishra
[1] Grigoriadis, D., Perdikopanis, N., Georgakilas, G. K., & Hatzigeorgiou, A. G. (2022). DeepTSS: multi-branch convolutional neural network for transcription start site identification from CAGE data. BMC Bioinformatics, 23(2), 1-17.
[2] Ji, Y., Zhou, Z., Liu, H., & Davuluri, R. V. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15), 2112-2120.
[3] de Medeiros Oliveira, M., Bonadio, I., Lie de Melo, A., Mendes Souza, G., & Durham, A. M. (2021). TSSFinder—fast and accurate ab initio prediction of the core promoter in eukaryotic genomes. Briefings in Bioinformatics, 22(6), bbab198.