Pancreas Genome Phenome Atlas

1. Data Content

The Pancreatic Analytics Hub hosts four core data sources from the public domain: The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), the Genomics Evidence Neoplasia Information Exchange (GENIE) and the Cancer Cell Line Encyclopedia (CCLE).

TCGA: The Cancer Genome Atlas is a consortium dedicated to the systematic study of alterations in a variety of human cancers. It has made mRNA expression, mutation and methylation data from analysed cohorts publicly available, alongside associated clinical data. Currently, mRNA expression and mutation data from sequenced patients with pancreatic adenocarcinoma are available for analysis through the Analytics Hub, alongside associated clinical data.

ICGC: The International Cancer Genome Consortium is focussed on the generation of comprehensive catalogues of genomic abnormalities (somatic mutations, expression of genes, epigenetic modifications) in tumours from 50 different cancer types. It has made mRNA expression, DNA copy number, mutation and methylation data from analysed cohorts publicly available, alongside associated clinical data. Currently, mRNA expression and mutation data from sequenced patients with pancreatic adenocarcinoma or pancreatic endocrine neoplasms are available through the Analytics Hub.

GENIE: Genomics Evidence Neoplasia Information Exchange is a pilot project that seeks to identify and validate genomic biomarkers relevant to cancer treatment by linking tumour genomic data from clinical sequencing efforts with longitudinal clinical outcomes. It has made mutation data publicly available, alongside associated clinical data for a range of cancer types/subtypes. Mutation data from individuals with pancreatic cancer are available for analysis from the Analytics Hub.

CCLE: The Cancer Cell Line Encyclopedia project is an effort to conduct a detailed genetic characterisation of a large panel of human cancer cell lines. mRNA expression and mutation data for pancreatic cancer cell lines are available from the Analytics Hub.

Table: Features of the Analytics Hub for Publicly Available Data Sources
Results Tab	Analytical features	TCGA	ICGC	CCLE	GENIE
Genomics	Genomic Summary	✓	✓	✓	✓
	OncoPlot	✓	✓	✓	✓
	Somatic Interactions	✓	✓	✓	✓
	Tumour Mutational Burden	N/A	✓	✓	✓
	Lolliplot	✓	✓	✓	✓
	Protein-Protein Networks	✓	✓	N/A	✓
	Drug Prediction	✓	✓	✓	✓
	Oncogenic Pathways	✓	✓	✓	✓
	Survival Analysis	✓	✓	N/A	N/A
Transcriptomics	Principal Component Analysis	✓	✓	✓	N/A
	Expression Profiles	✓	✓	✓	N/A
	Correlation	✓	✓	✓	N/A
	Survival Analysis	✓	✓	N/A	N/A
Matched PCRF Tissue Bank resource		✓	✓	N/A	✓

The Hub also allows researchers to view the research already undertaken and published using specimens from the PCRF Tissue Bank. The Tissue Bank is a unique collaboration between the charity, Barts Cancer Institute, Queen Mary University of London and nine key NHS partners throughout the UK. The Tissue Bank collects and stores tissue, blood, saliva and urine from people with pancreatic cancer and other diseases of the pancreas, alongside samples of blood, urine and saliva from first-degree relatives of patient donors and other healthy volunteers. The Hub provides a basic overview of the patients whose donated specimens have contributed to various research projects.

2. Exploring Public Datasets

Researchers can start using the Analytics Hub through the Analytics Hub item on the Home menu.

Inside the Analytics Hub section, users can choose to explore publicly available datasets, explore PCRF Tissue Bank data, or compare public datasets across patient subgroups.

2.1 Applying Filters

Once on the page for a specific public dataset, researchers can view the clinical data table and pre-generated results straight away.

Researchers can also filter patients within the dataset based on clinical and/or molecular attributes, and generate a sub-cohort for further exploration.

Clinical filters: Depending on the availability of data, users can apply filters based on age, sex, race, diagnosis, tumour stage, survival status, survival period, history of diabetes, and family history.

Molecular filters: KRAS and TP53 missense mutation status for all datasets, as well as transcriptomic/genomic subtype for TCGA and ICGC PACA-AU datasets.
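
As an illustration of how clinical and molecular filters combine into a sub-cohort, the following minimal Python sketch applies every filter that is set and ignores the rest. The Patient record, its field names and the four example patients are hypothetical and invented for this sketch; they do not reflect the Hub's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are hypothetical, chosen for this sketch only
    patient_id: str
    age: int
    sex: str
    diagnosis: str
    kras_mutated: bool


def filter_cohort(patients, min_age=None, max_age=None,
                  sex=None, diagnosis=None, kras_mutated=None):
    """Keep patients that satisfy every filter that is set (None = not applied)."""
    selected = []
    for p in patients:
        if min_age is not None and p.age < min_age:
            continue
        if max_age is not None and p.age > max_age:
            continue
        if sex is not None and p.sex != sex:
            continue
        if diagnosis is not None and p.diagnosis != diagnosis:
            continue
        if kras_mutated is not None and p.kras_mutated != kras_mutated:
            continue
        selected.append(p)
    return selected


# Made-up example cohort
cohort = [
    Patient("P1", 62, "F", "PDAC", True),
    Patient("P2", 55, "M", "PDAC", False),
    Patient("P3", 71, "M", "PNET", False),
    Patient("P4", 48, "F", "PDAC", False),
]

# Sub-cohort: ductal adenocarcinoma without a KRAS mutation
sub_cohort = filter_cohort(cohort, diagnosis="PDAC", kras_mutated=False)
print([p.patient_id for p in sub_cohort])  # ['P2', 'P4']
```

Filters are combined with a logical AND, so adding a filter can only shrink the sub-cohort.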

2.2 View Clinical Summary

Overview: Key features of each project are presented as dynamic bar charts. These plots provide a quick and simple means to visualise multiple covariates in relation to each other and identify potential trends in the data.

Patient statistics: The main demographic and clinical features are summarised in this section.

Tumour statistics: Key statistics are summarised for projects that provide clinical/pathological evaluations of the tumours.

2.3 View Genomics Results

Genomics data from publicly available sequencing cohorts can be analysed using the integrated Bioconductor package MAFtools, which facilitates the analysis of somatic variants, comprising single-nucleotide variants (SNVs) and small insertions/deletions (indels), based on variant characteristics, gene interactions and protein changes.

Summary: From this tab, a MAFtools summary plot can be viewed for each user-selected cohort, displaying the range of variant classifications, variant types and base substitution profiles as bar plots and boxplots.

OncoPlot: Users can view the top 10, 25 or 50 most frequently mutated genes in their cohort.
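
The ranking behind a "top N mutated genes" view can be sketched as follows. This is a minimal illustration, assuming that a gene is counted once per sample regardless of how many variants it carries in that sample; the function name and the example records are hypothetical, and the Hub's own ranking is produced by MAFtools.

```python
from collections import Counter


def top_mutated_genes(mutations, n=10):
    """Rank genes by the number of samples in which they are mutated.

    mutations: iterable of (sample_id, gene) pairs, one per variant call.
    """
    # set() collapses repeated calls of the same gene in the same sample
    per_gene = Counter(gene for _, gene in set(mutations))
    # Sort by descending sample count, then alphabetically to break ties
    return sorted(per_gene.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up variant calls; S1 carries two KRAS variants but counts once
calls = [
    ("S1", "KRAS"), ("S1", "KRAS"), ("S1", "TP53"),
    ("S2", "KRAS"),
    ("S3", "KRAS"), ("S3", "SMAD4"),
    ("S4", "TP53"),
]
print(top_mutated_genes(calls, n=2))  # [('KRAS', 3), ('TP53', 2)]
```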

Lolliplot: Users can also choose to view the amino acid changes within each of the top 50 mutated genes in each cohort as a lollipop plot. The plots display the observed mutation distribution and protein domains, which are labelled for each selected gene. A summary of the observed somatic mutation rate for each selected gene is also provided alongside each plot.

Somatic Interactions: Mutually exclusive or co-occurring pairs among the top 25 mutated genes can be identified across samples using a pair-wise Fisher's exact test, and the significant gene pairs are visualised as a correlation matrix. Care must be taken when interpreting these correlations if multiple samples are taken from the same patient. Also note that genes mutated in all samples are excluded from this visualisation.
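
To make the pair-wise test concrete, the sketch below builds the 2x2 table for two genes (both mutated, only the first, only the second, neither) and computes a two-sided Fisher's exact p-value from the hypergeometric distribution. It is a standard-library illustration rather than the MAFtools implementation, and the sample identifiers and mutation sets are made up.

```python
from math import comb


def contingency(mutated_a, mutated_b, all_samples):
    """2x2 counts for two genes: (both, A only, B only, neither)."""
    both = len(mutated_a & mutated_b)
    a_only = len(mutated_a - mutated_b)
    b_only = len(mutated_b - mutated_a)
    neither = len(all_samples - mutated_a - mutated_b)
    return both, a_only, b_only, neither


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table whose top-left cell is x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


# Made-up cohort of 10 samples with two perfectly exclusive genes
samples = {f"S{i}" for i in range(1, 11)}
gene_a = {"S1", "S2", "S3", "S4", "S5"}
gene_b = {"S6", "S7", "S8", "S9", "S10"}

table = contingency(gene_a, gene_b, samples)
p_value = fisher_exact_two_sided(*table)
print(table, round(p_value, 4))  # (0, 5, 5, 0) 0.0079
```

A small p-value with few or no samples in the "both" cell points towards mutual exclusivity, while an excess in the "both" cell points towards co-occurrence.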

Compare Tumour Mutational Burden: Compare the tumour mutational burden of the selected cohort to that of 33 independent TCGA cohorts derived from the MC3 Project.

Cancer Genome Interpreter (CGI): The CGI is a third-party tool developed to help interpret sequenced cancer genome data, assessing the potential of somatic alterations to act as tumour drivers and their possible effects on treatment response. Potentially oncogenic alterations in the user-selected cohort are presented in tabular format and alluvial plots highlight those that may be therapeutically actionable.

Drug-Gene interactions: The bar plots show reported drug interactions or druggable categories compiled from the Drug Gene Interaction Database (DGIdb). These results are also presented in a searchable tabular format.

Protein-Protein and Drug-Target interactions: Drug-target interaction networks provide a vital tool for the characterisation of clinically actionable alterations across patient subgroups. Network plots displaying protein-protein and protein-drug interactions are presented. Variants within candidate genes can be queried against the DrugBank database for downstream analysis of potential therapeutic candidates.

Oncogenic Pathways: Biological pathways (taken from the KEGG pathway database) enriched in the selected samples are summarised in bar plots. Specific pathways of interest can be selected, with each red box representing a patient and gene names coloured by function: tumour suppressor genes in red; oncogenes in blue.

Reactome Pathways: Variants within each queried cohort are first mapped to their corresponding genes. These are then linked to altered biological pathways from the Reactome database. The results are presented as an interactive Voronoi diagram. All significant pathways (p<0.05) in the selected cohort are represented, with colour intensity proportional to the number of patients affected.

Survival analysis: The relationship between a selected somatic variant and overall/disease-free survival in years is presented as a Kaplan-Meier plot. A univariate Cox proportional hazards (PH) regression is applied to the survival data and the samples are assigned to risk groups based on the presence or absence of the specified gene variant. Hazard ratios (HR) and 95% confidence intervals (CI) from the Cox PH model and associated log-rank p-value are shown.
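
The Kaplan-Meier curve for each risk group is a product-limit estimate of the survival probability. The sketch below shows only that estimator, not the Cox PH regression or the log-rank test, and the follow-up times and event flags are made up for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each sample
    events: 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored samples leave the risk set
        at_risk -= leaving
    return curve


# Made-up group of five samples carrying the variant of interest (time in years)
variant_group = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
for t, s in variant_group:
    print(t, round(s, 3))
```

Running the same function on the samples without the variant gives the second curve of the plot; censored samples lower the number at risk without producing a step in the curve.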

2.4 View Transcriptomics Results

Principal Component Analysis: Principal component analysis (PCA) reduces the dimensionality of data while retaining most of the variation in the dataset, making it possible to visually assess similarities and differences between different samples and determine whether groupings can be identified between individual samples. This exploratory analysis facilitates identification of the key factors affecting the variability in the mRNA expression data.

For each dataset, scatterplots representing the first two and the first three principal components (PCs) of the data are presented. Each data point represents the orientation of a single sample in the transcriptomic space projected on the PCA, with different colours indicating the biological group to which each sample belongs. The percentage values in brackets on each axis indicate the amount of variance in the data explained by the corresponding PC.

The global variability of the data can also be assessed from the scree plot, which shows the fraction of total variance (y-axis) attributed to each PC (x-axis). The PCs are ordered by decreasing contribution to total variance.
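
The percentages on the PCA axes and the bars of the scree plot are the eigenvalues of the covariance matrix divided by their sum. As a toy illustration, the sketch below computes these fractions for only two genes, where the eigenvalues of the 2x2 covariance matrix have a closed form; a real analysis involves many genes and a numerical PCA routine. The expression values are made up.

```python
from math import sqrt


def explained_variance_2d(x, y):
    """Fractions of total variance carried by PC1 and PC2 for two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sample variances and covariance (entries of the 2x2 covariance matrix)
    sxx = sum((v - mx) ** 2 for v in x) / (n - 1)
    syy = sum((v - my) ** 2 for v in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Closed-form eigenvalues of [[sxx, sxy], [sxy, syy]]
    centre = (sxx + syy) / 2
    half_gap = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [centre + half_gap, centre - half_gap]
    total = sxx + syy  # trace = sum of eigenvalues = total variance
    return [ev / total for ev in eigenvalues]


# Made-up expression values for two strongly co-expressed genes
gene_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_2 = [2.1, 3.9, 6.2, 7.8, 10.1]
fractions = explained_variance_2d(gene_1, gene_2)
print([round(f, 4) for f in fractions])
```

Because the two genes vary almost perfectly together, nearly all of the variance falls on the first PC, which is the pattern a steep scree plot indicates.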

Expression Profiles: The distribution of mRNA expression measurements can be visualised across all samples for a user-defined gene (from the top 50 aberrantly expressed genes).

Correlation: Pairwise comparisons of expression profiles can be performed between multiple user-defined genes in each selected dataset and Pearson's correlation coefficients and p-values calculated for each comparison.

For a queried set of genes (minimum of 3 genes), the Analytics Hub computes Pearson's correlation coefficients and corresponding p-values for all pairwise combinations of genes and displays the correlation coefficients as a pairwise comparison heatmap. The colour of each cell indicates the correlation coefficient between the corresponding genes labelled on the x-axis and y-axis. The heatmap colour key is displayed on the right-hand side of the plot, with red and blue indicating high and low correlation values, respectively.
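
The values behind such a heatmap can be sketched as one Pearson coefficient per gene pair. The example below uses only the standard library and omits the p-values; the gene labels and expression values are placeholders invented for this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(expression):
    """expression: {gene: [value per sample]}; returns {(gene1, gene2): r}."""
    if len(expression) < 3:
        raise ValueError("at least 3 genes are required")
    return {
        (g1, g2): pearson(expression[g1], expression[g2])
        for g1, g2 in combinations(sorted(expression), 2)
    }


# Made-up expression values across four samples
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
}
for pair, r in pairwise_correlations(expression).items():
    print(pair, round(r, 3))
```

Here GENE_A and GENE_B rise together (r close to 1) while GENE_C falls as the others rise (r close to -1), which would appear as the two extremes of the colour key.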

Survival Analysis: From this tab the relationship between the expression of genes of interest and survival can be assessed. A univariate Cox proportional hazards (PH) regression is applied to the survival data and the samples are assigned to risk groups based on the median dichotomisation of mRNA expression intensities of the selected gene. Relationships are presented as Kaplan-Meier plots. The hazard ratio (HR) and 95% confidence intervals (CI) from the Cox PH model and associated log-rank p-value are presented in the top right corner of the figure.
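
Median dichotomisation and the log-rank comparison of the resulting groups can be sketched as follows. This is a standard-library illustration under two assumptions that the text does not state: samples whose expression equals the median are placed in the "low" group, and the p-value comes from a chi-square distribution with one degree of freedom. The Cox PH model is not shown, and all values are made up.

```python
from math import erfc, sqrt
from statistics import median


def dichotomise_by_median(expression):
    """Label each sample 'high' or 'low' relative to the median expression."""
    cutoff = median(expression)
    return ["high" if value > cutoff else "low" for value in expression]


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("exactly two groups are required")
    first = labels[0]
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        # Risk set just before t, overall and in the first group
        n = sum(1 for ti in times if ti >= t)
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == first)
        # Events at t, overall and in the first group
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e and g == first)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    return chi2, erfc(sqrt(chi2 / 2))


# Made-up data: expression of one gene, follow-up time and event flag per sample
expression = [5.1, 7.3, 2.2, 9.8, 4.4, 8.0]
times = [10, 3, 14, 2, 12, 4]
events = [1, 1, 0, 1, 1, 1]

groups = dichotomise_by_median(expression)
chi2, p_value = logrank_test(times, events, groups)
print(groups)
print(round(chi2, 2), round(p_value, 3))
```

In this toy example the three high-expression samples all have the shortest follow-up times, so the test reports a difference between the groups.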

2.5 View Similar PCRF Tissue Bank Resource

For the full public dataset or a filtered cohort, researchers can view a simple summary of the resources that the PCRF Tissue Bank can offer which match the selected clinical parameters: age, sex, race (when available), and diagnosis.

3. PCRFTB Data

This section of the portal provides an overview of the patient cohort from the PCRF Tissue Bank whose samples have been used by researchers in completed studies. A summary table of the studies can be found on the landing page. The section is divided into four subsections, based on the top-level experimental platform (e.g., genomics, proteomics) employed in the studies. The clinical summary tab under each subsection displays the patient characteristics in terms of age, sex, race and diagnosis. Details of the completed research studies can be found here.

Sign up here to apply for PCRF Tissue Bank specimens and submit an Expression of Interest. To learn more about the application workflow, please visit PCRF Tissue Bank. For more information on PCRFTB specimens, please contact the team by email.

4. Compare Public Datasets Across Patient Subgroups

For key clinical and molecular subgroups, researchers can easily compare the public study cohorts where the relevant data is available.

For a subgroup of interest, pre-generated results for key genomics features (cohort summary, oncoplot, somatic interactions, drug-gene interactions, top 10 oncogenic pathways) and transcriptomics features (PCA, gene expression profiles) can be viewed for each available dataset.

An additional feature here is an UpSet plot, which shows the intersections of the top mutated genes across the available datasets, both for each dataset alone and for every combination of datasets.
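
The bars of an UpSet plot correspond to exclusive intersections: each gene is assigned to the exact combination of datasets in which it appears among the top mutated genes. The sketch below computes these groups; the gene lists are made up for illustration and are not results from the Hub.

```python
from collections import defaultdict


def upset_intersections(top_genes):
    """Group genes by the exact set of datasets that list them.

    top_genes: {dataset name: set of top mutated genes}
    Returns {frozenset of dataset names: sorted list of genes}.
    """
    membership = defaultdict(set)
    for dataset, genes in top_genes.items():
        for gene in genes:
            membership[gene].add(dataset)
    groups = defaultdict(list)
    for gene, datasets in membership.items():
        groups[frozenset(datasets)].append(gene)
    return {combo: sorted(genes) for combo, genes in groups.items()}


# Made-up top-gene lists for three datasets
top_genes = {
    "TCGA": {"KRAS", "TP53", "SMAD4"},
    "ICGC": {"KRAS", "TP53", "CDKN2A"},
    "GENIE": {"KRAS", "BRAF"},
}
intersections = upset_intersections(top_genes)
# Largest combinations first, as in a typical UpSet layout
for combo, genes in sorted(intersections.items(), key=lambda kv: (-len(kv[0]), kv[1])):
    print(sorted(combo), genes)
```

Each gene falls into exactly one group, so the bar heights add up to the number of distinct genes across all datasets.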

5. Use Case: Exploring alternative genomic drivers in PDAC with KRAS^wt

In the largest available public dataset, Genomics Evidence Neoplasia Information Exchange (GENIE), 756 confirmed ductal adenocarcinomas with wild-type KRAS are identified by filtering using the available options.

The Clinical Summary tab on the results banner shows that while mutations in TP53 are frequent, the majority of KRAS^wt PDAC patients are also wild type for this tumour suppressor gene.

Summary (and extended) oncoplots under the Genomics tab show the extent of alternative driver mutations in the KRAS^wt cohort. Notably, mutations in BRAF are present in >10% of the cohort; these are known to be mutually exclusive with KRAS mutations.

Under Genomics, several drop-down menus are available for further exploration of the filtered dataset:

Drug Prediction is implemented by the Cancer Genome Interpreter tool embedded in PGPA, which shows the likely clinically actionable targets in the filtered subset, linking each gene (biomarker) with its associated drug in an alluvial plot. The proportion of patients to be included can be selected (5%-25%). Specific mutations and studies supporting these outputs are provided in a downloadable data table.

Oncogenic pathways from the KEGG pathway database that are enriched in the selected samples are summarised in bar plots, and can be explored individually. Here, ERBB signalling appears to be the most affected pathway, proportionally.

Indeed, oncogenes (in blue) BRAF and PIK3CA appear to be significantly mutated in the filtered cohort (red block = 1 patient), in this pathway.

Drug-target (and protein-protein) interaction networks associated with a specific target (here, BRAF protein) can be investigated using the interactive network graphs, where each spot links out to the DrugBank database for a detailed description of each suggested drug candidate.

The Cancer Cell Line Encyclopedia dataset in PGPA can be used to identify any commercially available cell lines to support functional follow-up experiments in PDAC KRAS^wt models, by filtering as before. Here, four potential cell lines for in vitro work are identified.

Similarly, the PCRFTB Cohort tab on the results banner shows that there are a multitude of tissue types from >300 patients available in the Tissue Bank that may be requested for study, based on the clinical filters selected (here, diagnosis).

The Expression of Interest link in PGPA links the user directly to the PCRFTB site, where an online Expression of Interest form is available to request patient samples to support wet lab investigation/validation studies.