Chapter 2 Introduction
2.1 Course outline
The course is divided into two parts. PART 1 will teach you the basics of R programming and computational statistics, and how to apply them to real-world biological problems. PART 2 will teach you how to analyse RNAseq data and will provide informatic workflows that you can use in your current research.
This course is extremely condensed and you are not expected to complete all tasks before or during the three-day meeting. We will work through coded examples together. Additional tutorials may be scheduled after the meeting closes, if ESRs feel that this is necessary!
Please also note that this course material contains embedded video content, which is best viewed using the Chrome web browser.
2.1.1 PART 1
PART 1 is provided in a separate workbook, Applications of Computational Statistics.
By the end of PART 1, you should be able to:
- Understand basic statistical concepts
- Perform statistical tests to answer simple to more complex biological questions
- Mine and explore biological data sets
- Process microarray data and perform statistical analysis
- Use the R statistical programming language
The following topics are covered in PART 1:
Data description
- Visualizing distributions
- Measures of central tendency and variability
- Q-Q plot
- Boxplots
- Correlation measures
Probability distributions
- Probability distributions of discrete random variables (binomial, Poisson)
- Probability distributions of continuous random variables (normal, z, t-, chi-squared, F)
Statistical tests
- Hypothesis testing
- Statistical tests with continuous variables
- one sample, two samples, equal and unequal variances
- F-test and ANOVA
- Statistical tests with categorical variables
- Multiple testing
Clustering
- distance definition
- hierarchical clustering
- k-means
- SOM
Principal component analysis
Analysis of microarray data
- normalization
- transformation
- statistical tests
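To give a flavour of the "Statistical tests" topics listed above, the following is a minimal sketch in base R of a two-sample comparison with equal and unequal variances. The data are simulated and the variable names (`control`, `treated`) are illustrative assumptions, not part of the course data sets.

```r
# Simulated expression values for one gene in two groups (illustrative only)
set.seed(1)
control <- rnorm(10, mean = 5, sd = 1)
treated <- rnorm(10, mean = 6, sd = 2)

# F-test comparing the variances of the two groups
f_res <- var.test(control, treated)

# Two-sample t-tests: pooled variance (Student) and unequal variances (Welch)
student <- t.test(control, treated, var.equal = TRUE)
welch   <- t.test(control, treated, var.equal = FALSE)

# Inspect the p-values from each test
f_res$p.value
student$p.value
welch$p.value
```

The Welch test does not assume equal variances, so its degrees of freedom are adjusted downwards relative to the pooled test.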
2.1.1.1 Practicals
Practical workbooks are also provided:
- Programming with Applications in R: an introduction
- Programming with Applications in Bioconductor: an introduction
Data resources used for the Practicals are provided below:
- Comprehensive molecular portraits of human breast tumours
- Breast Cancer dataset (eSet.RData)
- AML dataset (Golub.RData)
Practicals P0.1 and P0.2 test R programming basics. Practicals P1 to P7 show how to apply descriptive statistics to the analysis of transcriptomic data. P1 to P5.3 include exercises as part of assignments A1 and A3.
The following topics are covered in the practical component of PART 1:
P0.1. Introduction to R + Getting started + Vectors + Matrices
P0.2. Intro to R continued + Lists + Data frames + Factors and tables
P1. Graphical outputs
P2. Statistical tests
P3. Clustering
P4. PCA
P5. Analysis of microarray data 1/3
P6. Analysis of microarray data 2/3
P7. Analysis of microarray data 3/3
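As a rough preview of the data structures named in P0.1 and P0.2, the sketch below builds a vector, a matrix, a list, a data frame and a factor in base R. All object names and values here are hypothetical and chosen only for illustration.

```r
# Vector: an ordered set of values of a single type
expr <- c(2.1, 3.4, 5.0)
mean_expr <- mean(expr)

# Matrix: genes in rows, samples in columns (filled column by column)
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
m["geneA", "s2"]  # value for geneA in sample s2

# List: a container whose elements can have different types
experiment <- list(name = "demo", values = expr, counts = m)

# Data frame with a factor column describing the sample groups
samples <- data.frame(id = c("s1", "s2", "s3"),
                      group = factor(c("tumour", "normal", "tumour")))

# Table: count the samples in each group
group_counts <- table(samples$group)
group_counts
```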
2.1.1.2 Assessment
Assignments
Assignment 1: Assessing R skills and interpretation of statistical tests. Starts in P1 (covers up to P3).
Assignment 2 (Optional): Review and explanation of statistical analysis from a scientific article of the student’s choice. The student should choose a recently published scientific article and explain the aim of the study and how the statistical analyses were performed.
Assignment 3: Assessing statistical analyses using data described in Comprehensive molecular portraits of human breast tumours. This assignment will be broken into different exercises covering the content of practicals P4 to P8, and will be handed out at each practical. Starts in P4 (covers up to P8).
2.1.2 PART 2
This part of the course will provide steps for the analysis of RNAseq data. To bring this part of the course into real-world focus, we will re-analyse RNAseq data published in Brunton et al. HNF4A and GATA6 Loss Reveals Therapeutically Actionable Subtypes in Pancreatic Cancer. Cell Reports 2020.
By the end of PART 2, you should be able to:
- Use the nf-core/rnaseq bioinformatic workflow to map raw RNAseq reads, assess RNAseq QC and generate count files for downstream analysis
- Perform PCA/Hierarchical clustering of normalized RNAseq data
- Perform Differential Gene Expression analysis
- Generate publication quality heatmaps and statistical plots
- Perform Gene Set Enrichment analysis to identify significantly enriched biological pathways
- Subtype pancreatic ductal adenocarcinoma (PDAC) using pre-defined transcriptional signatures
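The PCA and hierarchical clustering outcome listed above can be sketched in base R as follows. This is a minimal illustration on a simulated expression matrix, not the course workflow: the matrix, its dimensions and the sample names are assumptions, and real normalized RNAseq data would take the place of `expr`.

```r
set.seed(42)

# Simulated log-scale expression matrix: 100 genes x 6 samples (not real data)
expr <- matrix(rnorm(600), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))

# Shift the first 20 genes in samples 4 to 6 to mimic two sample groups
expr[1:20, 4:6] <- expr[1:20, 4:6] + 3

# PCA on samples: prcomp() expects observations in rows, so transpose
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
summary(pca)$importance["Proportion of Variance", 1:2]

# Hierarchical clustering of samples using Euclidean distance
hc <- hclust(dist(t(expr)), method = "complete")
groups <- cutree(hc, k = 2)  # cut the dendrogram into two clusters
groups
```

Calling `plot(pca$x[, 1:2])` and `plot(hc)` would draw the sample scatter plot and the dendrogram respectively.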