Gene expression and protein pathways
Joel S Bader
CuraGen
jsbader@
IPAM 2000 Nov 8
Nov 8, 2000
Confidential
CuraGen Background
Product pipeline
Therapeutic proteins
Therapeutic antibodies
Drug targets
Technology
High-throughput biology labs
Bioinformatics/information-intensive
Outline
Clustering gene expression data
Interior-node test
Statistical significance
Power
Mapping biological pathways
Metabolic pathways
mRNA coregulation
Protein-protein interactions
Overlaying
Genetic studies
Disease risk
SNPs and association
Clustering
Standard method (now) for analyzing gene expression data
Unsupervised algorithms
Run to completion
Clustering eventually driven by noise, not biology
Supervised algorithms
Inconsistent, irreproducible
Not amenable to high-throughput
Goal: automated, unsupervised, with meaningful p-value for clusters produced
Collaboration with Rebecca W Doerge and Brian Munneke, Dept of Statistics, Purdue
Hierarchical, distance-based algorithm
Initialize: each gene in a single cluster
Repeat
Join clusters with shortest distance
Re-calculate effective distances
Until 1 cluster remains
Neighbor-joining: distance is corrected to be distance between ancestors
Studier & Keppler, Mol Biol Evol 5: 729 (1988)
Unweighted pair group method with arithmetic mean (UPGMA): distance is the mean of all pairwise distances
NJ distance
UPGMA distance
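The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function names `upgma` and `nj_ancestor_distance` are hypothetical, and the input is assumed to be a symmetric distance matrix with a zero diagonal. The main loop uses the UPGMA update (size-weighted mean, which equals the mean of all pairwise distances). The helper shows the neighbor-joining correction for the distance from a new ancestor to every other cluster. A full neighbor-joining run would also choose the pair to join by a rate-corrected criterion rather than the raw shortest distance, which is omitted here.

```python
import numpy as np

def upgma(dist, labels):
    """Agglomerative clustering with the UPGMA distance update.

    Returns a list of merges: (members_a, members_b, distance_at_merge).
    """
    d = np.array(dist, dtype=float)
    clusters = [(lab,) for lab in labels]  # initialize: each gene in a single cluster
    merges = []
    while len(clusters) > 1:               # until 1 cluster remains
        n = len(clusters)
        # Join clusters with shortest distance (diagonal masked out).
        masked = d + np.diag(np.full(n, np.inf))
        i, j = np.unravel_index(np.argmin(masked), masked.shape)
        if i > j:
            i, j = j, i
        ni, nj = len(clusters[i]), len(clusters[j])
        merges.append((clusters[i], clusters[j], float(d[i, j])))
        # Re-calculate effective distances: size-weighted mean of the two rows.
        new_row = (ni * d[i] + nj * d[j]) / (ni + nj)
        d[i, :] = new_row
        d[:, i] = new_row
        d[i, i] = 0.0
        d = np.delete(np.delete(d, j, axis=0), j, axis=1)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

def nj_ancestor_distance(dist, i, j):
    """Neighbor-joining correction: distance from the ancestor of i and j
    to every other node k, d_uk = (d_ik + d_jk - d_ij) / 2.
    Entries at positions i and j are not meaningful and should be dropped."""
    d = np.asarray(dist, dtype=float)
    return 0.5 * (d[i] + d[j] - d[i, j])

# Toy example (made-up distances): A and B are close, C and D are close.
toy = [[0, 1, 10, 10],
       [1, 0, 10, 10],
       [10, 10, 0, 2],
       [10, 10, 2, 0]]
toy_merges = upgma(toy, ["A", "B", "C", "D"])
```

In the toy example, A and B are joined first, then C and D, and the two pairs are joined last.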
(Typical) results
Data taken from X. Wen, …, R Somogyi, PNAS 95: 334 (1998)
Large-scale temporal gene expression mapping of central nervous system development
9 time points, embryonic to adult
Clustering
Multidimensional scaling, principal component/factor analysis
112 genes
Raw data and clusters
Set baseline
Normalize columns (time-points)
Log-transform
Subtract row averages (genes)
Neighbor-joining using correlation distance
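The preprocessing steps listed above can be sketched as follows. The slide does not give details, so several choices here are assumptions: "set baseline" is taken to mean flooring values at a small positive number (the `baseline` parameter is hypothetical), column normalization is taken to mean dividing each time point by its column mean, and correlation distance is taken to be 1 minus the Pearson correlation between gene profiles. The function names are illustrative only.

```python
import numpy as np

def preprocess(raw, baseline=1e-3):
    """Rows are genes, columns are time points."""
    x = np.maximum(np.asarray(raw, dtype=float), baseline)  # set baseline (assumed: floor)
    x = x / x.mean(axis=0, keepdims=True)                   # normalize columns (assumed: by column mean)
    x = np.log(x)                                           # log-transform
    x = x - x.mean(axis=1, keepdims=True)                   # subtract row averages
    return x

def correlation_distance(x):
    """Assumed definition: 1 - Pearson correlation between rows."""
    return 1.0 - np.corrcoef(x)

# Toy example (made-up numbers): genes 0 and 1 have proportional profiles,
# gene 2 runs in the opposite direction.
toy_raw = [[1.0, 2.0, 4.0],
           [2.0, 4.0, 8.0],
           [4.0, 2.0, 1.0]]
toy_dist = correlation_distance(preprocess(toy_raw))
```

After log-transforming and subtracting row averages, proportional profiles become identical, so genes 0 and 1 end up at distance close to zero. The resulting matrix can be passed to any distance-based clustering routine.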
Significance tests
Interior-branch test
Parametric based on bra