Machine Learning on Echocardiogram Data

Machine learning is a method of training machines to make predictions based on experience. It uses algorithms that learn iteratively from data and improve as more data become available. Such models can be used to estimate the effect of different variables and the relationships between them. Machine learning is now being used increasingly across the sciences.

In this article, I demonstrate the use of machine learning to predict the one-year outcome of patients on the basis of their echocardiogram parameters. The dataset was taken from the UCI Machine Learning Repository and contains echocardiogram measurements for 132 patients. Because of missing values, only 60 patient observations were used.

I will fit several alternative machine learning models and present the confusion matrix for each. A confusion matrix is a 2×2 table (for a binary outcome) comparing predicted values against actual values.
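As a minimal illustration of what such a table looks like, here is a toy example with made-up labels (the vectors below are purely illustrative and are not part of the echocardiogram analysis):

# Toy predicted and actual class labels
actual    <- factor(c(0, 0, 1, 1, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0, 0, 0), levels = c(0, 1))

# Base R cross-tabulation: rows are predictions, columns are the true classes
table(Predicted = predicted, Actual = actual)

# The caret package adds accuracy, sensitivity, specificity, kappa, etc.
library(caret)
confusionMatrix(predicted, actual)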

Importing dataset

data <- read.csv("echocardiogram.data.csv", sep = ',', na.strings = c('NA', '?'))
str(data)
names(data) <- c('survival', 'alive', 'age', 'pericardialeffusion', 'fractionalshortening',
                 'epss', 'lvdd', 'wallmotionscore', 'wallmotionindex', 'mult',
                 'name', 'group', 'aliveat1')

Omit rows with missing data

In this part, we remove the rows that contain missing data. We will not impute the missing values, since imputation may affect our model. We also remove the variables that are not useful for this study. The column “aliveat1”, which denotes whether the patient is alive at one year, is converted into a factor. This is the dependent variable that we will try to predict with our model on the basis of the echocardiogram parameters.

# Remove rows with missing values
data1 <- na.omit(data)
# Drop variables that are not used in the analysis
data1 <- subset(data1, select = -c(name, survival, group, mult, alive))
# Convert the outcome and the pericardial effusion indicator into factors
data1$aliveat1 <- factor(data1$aliveat1, levels = c(0, 1))
data1$pericardialeffusion <- factor(data1$pericardialeffusion, levels = c(0, 1))
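As an optional sanity check (not part of the original write-up), one may inspect the cleaned data and the class balance of the outcome:

# Structure of the remaining variables and the outcome class balance
str(data1)
table(data1$aliveat1)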

Splitting the dataset into training and test sets

The machine learning model learns from the training set, and its accuracy is tested by predicting the value of the dependent variable in the test set.

library(caTools)
set.seed(1234)  # for reproducibility
# 80% of the observations go to the training set, 20% to the test set
split <- sample.split(data1$aliveat1, SplitRatio = 0.8)
train <- subset(data1, split == TRUE)
test <- subset(data1, split == FALSE)
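Optionally, one can confirm that sample.split() has preserved similar class proportions in both subsets (an illustrative check, not part of the original write-up):

# Proportion of patients alive at one year in each subset
prop.table(table(train$aliveat1))
prop.table(table(test$aliveat1))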

Fitting the machine learning model

In this example, we are trying to predict whether the patient will be alive at one year, given the set of echocardiogram parameters. In other words, we are trying to classify the patient as either alive (1) or dead (0). This is a classic machine learning classification problem, and several algorithms may be used to solve it. Here we use a decision tree, Naive Bayes, a support vector machine (SVM) and a random forest. With larger datasets, these models can be compared on the basis of their confusion matrix statistics; a small comparison sketch is given after the random forest output below.

Decision tree classification

# Model Fitting
library(rpart)
classifier <- rpart(aliveat1 ~ ., data = train, method = 'class')
# Predict Test set results (columns 1:7 are the predictor variables)
y_pred <- predict(classifier, newdata = test[, 1:7], type = 'class')
# Confusion Matrix
library(caret)
confusionMatrix(test$aliveat1, y_pred)

Output of the Confusion matrix:

Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 8 1
         1 1 2
                                          
               Accuracy : 0.8333          
                 95% CI : (0.5159, 0.9791)
    No Information Rate : 0.75            
    P-Value [Acc > NIR] : 0.3907          
                                          
                  Kappa : 0.5556          
                                          
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.8889          
            Specificity : 0.6667          
         Pos Pred Value : 0.8889          
         Neg Pred Value : 0.6667          
             Prevalence : 0.7500          
         Detection Rate : 0.6667          
   Detection Prevalence : 0.7500          
      Balanced Accuracy : 0.7778          
                                          
       'Positive' Class : 0 

Support Vector Classification

# SVM Model Fitting
library(e1071)
classifier <- svm(formula = aliveat1~., data=train, type="C-classification", kernel='linear')
y_pred <- predict(classifier, newdata = test[, 1:7])
# Confusion Matrix
library(caret)
confusionMatrix(test$aliveat1, y_pred)

Output of the Confusion Matrix:

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 10  3
         1  2  3
                                          
               Accuracy : 0.7222          
                 95% CI : (0.4652, 0.9031)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : 0.4122          
                                          
                  Kappa : 0.3478          
                                          
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.8333          
            Specificity : 0.5000          
         Pos Pred Value : 0.7692          
         Neg Pred Value : 0.6000          
             Prevalence : 0.6667          
         Detection Rate : 0.5556          
   Detection Prevalence : 0.7222          
      Balanced Accuracy : 0.6667          
                                          
       'Positive' Class : 0     

Naive Bayes Classification


# Naive Bayes Model Fitting
library(e1071)
classifier <- naiveBayes(x = train[-8], y = train$aliveat1)
# Predict Test set results
y_pred <- predict(classifier, newdata = test[-8])
# Confusion Matrix
library(caret)
confusionMatrix(test$aliveat1, y_pred)

Output of the confusion matrix:

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 10  3
         1  2  3
                                          
               Accuracy : 0.7222          
                 95% CI : (0.4652, 0.9031)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : 0.4122          
                                          
                  Kappa : 0.3478          
                                          
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.8333          
            Specificity : 0.5000          
         Pos Pred Value : 0.7692          
         Neg Pred Value : 0.6000          
             Prevalence : 0.6667          
         Detection Rate : 0.5556          
   Detection Prevalence : 0.7222          
      Balanced Accuracy : 0.6667          
                                          
       'Positive' Class : 0    

Random Forest Classification

# Random forest model fitting
library(randomForest)
classifier <- randomForest(x=train[-8], y = train$aliveat1, ntree=400)
# Predict Test set results
y_pred <- predict(classifier, newdata = test[-8])
# Confusion Matrix
library(caret)
confusionMatrix(test$aliveat1, y_pred)

 

Output of the confusion matrix:

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 12  1
         1  3  2
                                          
               Accuracy : 0.7778          
                 95% CI : (0.5236, 0.9359)
    No Information Rate : 0.8333          
    P-Value [Acc > NIR] : 0.8318          
                                          
                  Kappa : 0.3684          
                                          
 Mcnemar's Test P-Value : 0.6171          
                                          
            Sensitivity : 0.8000          
            Specificity : 0.6667          
         Pos Pred Value : 0.9231          
         Neg Pred Value : 0.4000          
             Prevalence : 0.8333          
         Detection Rate : 0.6667          
   Detection Prevalence : 0.7222          
      Balanced Accuracy : 0.7333          
                                          
       'Positive' Class : 0  
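As mentioned above, with a larger dataset the classifiers could be compared more formally. Below is a minimal sketch of such a comparison; it assumes that the confusionMatrix() results shown above were saved in the (hypothetical) variables cm_tree, cm_svm, cm_nb and cm_rf instead of being printed directly.

# Collect accuracy and kappa from each saved confusionMatrix() object
comparison <- data.frame(
  model    = c("Decision tree", "SVM", "Naive Bayes", "Random forest"),
  accuracy = c(cm_tree$overall["Accuracy"], cm_svm$overall["Accuracy"],
               cm_nb$overall["Accuracy"],   cm_rf$overall["Accuracy"]),
  kappa    = c(cm_tree$overall["Kappa"],    cm_svm$overall["Kappa"],
               cm_nb$overall["Kappa"],      cm_rf$overall["Kappa"])
)
# Rank the models by accuracy (highest first)
comparison[order(-comparison$accuracy), ]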

Next Generation Sequencing (NGS) Analysis

I have followed the protocol published in the journal Nature Protocols, entitled:

“Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown” 
System setup:
This analysis is best run on a 64-bit Linux system. This particular workflow was carried out on a Core i7 system with 16 GB RAM, running Linux Mint 17.

Software required:

Hisat2: https://ccb.jhu.edu/software/hisat2/index.shtml

Stringtie: https://ccb.jhu.edu/software/stringtie/

samtools: http://samtools.sourceforge.net/

For each of the above programs, unpack the archive, open a terminal and add the program's directory to your PATH:

export PATH=$PATH:path/to/extracted/folderORexecutableFiles

To make this change permanent, it may be necessary to add the line “export PATH=$PATH:path/to/extracted/folderORexecutableFiles” to the “.bashrc” file in your home directory (for most users this is “/home/<username>”).

R: https://www.r-project.org/

In Ubuntu (and derivatives) it may be installed by

sudo apt install r-base r-base-dev

Rstudio: www.rstudio.com/ide/download/desktop

Setting up R environment

Before opening RStudio, open a terminal and install the following system packages (in Ubuntu):

sudo apt install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev gfortran

Start RStudio and enter the following commands at the R console:

install.packages("devtools")

>source("http://www.bioconductor.org/biocLite.R")

>biocLite(c("alyssafrazee/RskittleBrewer","ballgown","genefilter","dplyr","devtools"))

Note: if there is any error during installation, install the unmet system dependencies indicated in the error message and retry.

Files to be downloaded

Reference genome:

In this protocol, the reference genome is the human X chromosome. Only the X chromosome is used as the reference in order to reduce download and analysis time; the user may instead use the entire genome of the relevant organism. The reference genome can also be downloaded from the HISAT2 website, where it is provided in a pre-indexed format. The downloaded archive contains the reference genome as well as the sample FASTQ files, and the following workflow uses the files from that archive. They may be replaced with the user's own data.

Procedure

HISAT2: Alignment of RNA-Seq reads to the genome

Map the reads from each sample to the reference genome:

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam

The explanation for each part of the command is as follows:

hisat2: calls the hisat2 program. This works only if the hisat2 folder has been added to the PATH properly.

-p 8: “-p” specifies how many threads the program may use; here, 8 threads are allocated. This should always be less than or equal to the number of threads available on the system.

--dta: report alignments tailored for transcript assemblers such as StringTie.

-x: the argument that follows is the basename (or path to the basename) of the HISAT2 index files.

chrX_data/indexes/chrX_tran: path to the index files. “chrX_tran” is the common basename of all the index files in that folder, e.g. chrX_tran.1.ht2, chrX_tran.2.ht2, chrX_tran.3.ht2, etc.

-1: precedes the file containing the forward (mate 1) reads of the sample.

chrX_data/samples/ERR188044_chrX_1.fastq.gz: forward reads of the sample.

-2: precedes the file containing the reverse (mate 2) reads of the sample.

chrX_data/samples/ERR188044_chrX_2.fastq.gz: reverse reads of the sample.

-S: write the output in SAM format to the file that follows.

ERR188044_chrX.sam: file for the SAM output.

Likewise, the same commands have to be issued for all the input files, as follows:

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S ERR188104_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S ERR188234_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S ERR188245_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S ERR188257_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S ERR188273_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S ERR188337_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S ERR188383_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S ERR188401_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S ERR188428_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S ERR188454_chrX.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S ERR204916_chrX.sam

Samtools: Sort and convert the SAM files to BAM

samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam

Usage: samtools sort [options...] [in.bam]

samtools sort: invokes the sort command of samtools

-@: sets the number of sorting and compression threads

8: here, 8 threads are used, since this is being run on an 8-thread system

-o: write the sorted output to the file that follows

ERR188044_chrX.bam: Output file

ERR188044_chrX.sam: Input file

Likewise, the same commands have to be issued for all the input files, as follows:

samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam
samtools sort -@ 8 -o ERR188104_chrX.bam ERR188104_chrX.sam
samtools sort -@ 8 -o ERR188234_chrX.bam ERR188234_chrX.sam
samtools sort -@ 8 -o ERR188245_chrX.bam ERR188245_chrX.sam
samtools sort -@ 8 -o ERR188257_chrX.bam ERR188257_chrX.sam
samtools sort -@ 8 -o ERR188273_chrX.bam ERR188273_chrX.sam
samtools sort -@ 8 -o ERR188337_chrX.bam ERR188337_chrX.sam
samtools sort -@ 8 -o ERR188383_chrX.bam ERR188383_chrX.sam
samtools sort -@ 8 -o ERR188401_chrX.bam ERR188401_chrX.sam
samtools sort -@ 8 -o ERR188428_chrX.bam ERR188428_chrX.sam
samtools sort -@ 8 -o ERR188454_chrX.bam ERR188454_chrX.sam
samtools sort -@ 8 -o ERR204916_chrX.bam ERR204916_chrX.sam

Stringtie: Assemble and quantify expressed genes and transcripts

As per the Stringtie website:

“StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. In order to identify differentially expressed genes between experiments, StringTie’s output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.)”

Usage: stringtie <input.bam> [-G <guide_gff>] [-l <label>] [-o <output.gtf>] [-p <threads>] ...

stringtie -p 8 -G chrX_data/genes/chrX.gtf -o ERR188044_chrX.gtf ERR188044_chrX.bam

stringtie: invokes the program

-p: the integer that follows specifies the number of threads to assign to the program

8: 8 threads are assigned here

-G: specifies the reference annotation (GTF/GFF3) to use for guiding the assembly process

chrX_data/genes/chrX.gtf: path to the reference annotation file

-o: output GTF file name for the assembled transcripts (here, ERR188044_chrX.gtf)

ERR188044_chrX.bam: the input BAM file. As the usage syntax shows, these arguments may be given in a different order.

The rest of the files are processed as follows:

stringtie ERR188044_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188044_chrX.gtf
stringtie ERR188104_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188104_chrX.gtf
stringtie ERR188234_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188234_chrX.gtf
stringtie ERR188245_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188245_chrX.gtf
stringtie ERR188257_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188257_chrX.gtf
stringtie ERR188273_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188273_chrX.gtf
stringtie ERR188337_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188337_chrX.gtf
stringtie ERR188383_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188383_chrX.gtf
stringtie ERR188401_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188401_chrX.gtf
stringtie ERR188428_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188428_chrX.gtf
stringtie ERR188454_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR188454_chrX.gtf
stringtie ERR204916_chrX.bam -p 8 -G chrX_data/genes/chrX.gtf -o ERR204916_chrX.gtf

Stringtie: Merge transcripts from all samples

stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt

Here, StringTie creates the file “stringtie_merged.gtf” from the GTF files listed in “mergelist.txt”. “mergelist.txt” contains the names of all the per-sample GTF files generated in the previous step; the “--merge” mode combines them into a single, non-redundant set of transcripts. Paths are necessary if the GTF files are not in the working directory. In our experiment, “mergelist.txt” looks like this:

ERR188044_chrX.gtf
ERR188104_chrX.gtf
ERR188234_chrX.gtf
ERR188245_chrX.gtf
ERR188257_chrX.gtf
ERR188273_chrX.gtf
ERR188337_chrX.gtf
ERR188383_chrX.gtf
ERR188401_chrX.gtf
ERR188428_chrX.gtf
ERR188454_chrX.gtf
ERR204916_chrX.gtf

R: Differential expression analysis

Load the required libraries by entering the following commands at the R prompt:

library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(devtools)

You should also have a phenotype file in CSV format that provides information about your samples. The example data include a phenotype file (geuvadis_phenodata.csv) with the following content:

ids        sex     population
ERR188044  male    YRI
ERR188104  male    YRI
ERR188234  female  YRI
ERR188245  female  GBR
ERR188257  male    GBR
ERR188273  female  YRI
ERR188337  female  GBR
ERR188383  male    GBR
ERR188401  male    GBR
ERR188428  female  GBR
ERR188454  male    YRI
ERR204916  female  YRI

Since we already have the expression values of the genes for these samples, we will use R to see which genes are differentially expressed. Here we may choose to compare the samples on the basis of “population” or “sex”.

pheno_data <- read.csv("geuvadis_phenodata.csv")

Next, we read in the expression data that were calculated by StringTie. (This step assumes that, as in the protocol, transcript abundances were estimated with StringTie’s -e -B options, which writes the per-sample Ballgown input tables into a “ballgown” directory.)

bg_chrX <- ballgown(dataDir = "ballgown", samplePattern = "ERR", pData = pheno_data)

Removing low-abundance genes

RNA-seq data often include genes with very few or even zero counts. Here we remove all transcripts with a variance across samples of less than one.

bg_chrX_filt = subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)
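As an optional check (not part of the protocol text above), one can compare the number of transcripts before and after filtering:

# Number of transcripts (rows) before and after the variance filter
dim(texpr(bg_chrX))
dim(texpr(bg_chrX_filt))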

Finding genes with significantly different expression

Here we identify transcripts that show a statistically significant difference in expression between groups; in this case, we test for an expression difference on the basis of sex. We can do so using the stattest() function of Ballgown.

results_transcripts <- stattest(bg_chrX_filt, feature = "transcript", covariate = "sex", adjustvars = c("population"), getFC = TRUE, meas = "FPKM")

We use the getFC=TRUE parameter so that we can look at the confounder-adjusted fold change between the two groups.

Next, we identify genes that show statistically significant differences between groups. For this we run the same function that we used to identify differentially expressed transcripts, but set feature=“gene” in the stattest command:

results_genes = stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Add gene names and gene IDs to the results_transcripts data frame

results_transcripts <- data.frame(geneNames = ballgown::geneNames(bg_chrX_filt), geneIDs = ballgown::geneIDs(bg_chrX_filt), results_transcripts)

Sort the results from the smallest P value to the largest

results_transcripts = arrange(results_transcripts, pval)
results_genes = arrange(results_genes, pval)
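Finally, a brief sketch of how the results might be inspected. The q-value threshold and the output file names below are arbitrary choices for illustration (stattest() returns pval and qval columns):

# Look at the top-ranked transcripts and genes
head(results_transcripts)
head(results_genes)

# Transcripts that differ between the sexes at q < 0.05
subset(results_transcripts, qval < 0.05)

# Save the tables for later use
write.csv(results_transcripts, "chrX_transcript_results.csv", row.names = FALSE)
write.csv(results_genes, "chrX_gene_results.csv", row.names = FALSE)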