Review of Machine Learning Algorithms in Differential Expression Analysis

  • Irina Kuznetsova Graz University of Technology and University of Western Australia
  • Yuliya V Karpievitch University of Western Australia
  • Aleksandra Filipovska University of Western Australia
  • Artur Lugmayr Curtin University
  • Andreas Holzinger Graz University of Technology


In biological research, machine learning algorithms are part of nearly every analytical process. They are used to identify new insights into biological phenomena, interpret data, provide molecular diagnoses for diseases, and develop personalized medicine that will enable future treatments of diseases. In this paper we (1) illustrate the importance of machine learning in the analysis of large-scale sequencing data, (2) present an illustrative standardized workflow of the analysis process, (3) perform a Differential Expression (DE) analysis of a publicly available RNA sequencing (RNA-Seq) data set to demonstrate the capabilities of various algorithms at each step of the workflow, and (4) show a machine learning solution that reduces computing time, storage requirements, and memory utilization in analyses of RNA-Seq data sets. The source code of the analysis pipeline and associated scripts are presented in the paper's appendix to allow replication of the experiments.




How to Cite
KUZNETSOVA, Irina et al. Review of Machine Learning Algorithms in Differential Expression Analysis. International SERIES on Information Systems and Management in Creative eMedia (CreMedia), [S.l.], n. 2016/2, p. 11-24, june 2017. ISSN 2341-5576. Available at: <>. Date accessed: 01 july 2022.


Machine learning; big data; data mining; Next Generation Sequencing; Burrows-Wheeler transform; semiglobal alignment; clustering; biology; RNA-Seq