ToxSci Advance Access originally published online on January 4, 2008
Toxicological Sciences 2008 102(2):444-454; doi:10.1093/toxsci/kfn001
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Integration of Clinical Chemistry, Expression, and Metabolite Data Leads to Better Toxicological Class Separation
,1

* Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, Kemitorvet, Building 208, DK-2800 Lyngby, Denmark
Protein Structure and Biophysics, Protein Engineering
Molecular Genetics, Biotechnology, Novo Nordisk Park, DK-2760 Måløv, Denmark
1 To whom correspondence should be addressed. Fax: +45-45931585. E-mail: skytte{at}cbs.dtu.dk.
Received August 29, 2007; accepted December 21, 2007
| ABSTRACT |
|---|
|
|
|---|
A large number of databases are currently being implemented within toxicology aiming to integrate diverse biological data, such as clinical chemistry, expression, and other types of data. However, for these endeavors to be successful, tools for integration, visualization, and interpretation are needed. This paper presents a method for data integration using a hierarchical model based on either principal component analysis or partial least squares discriminant analysis of clinical chemistry, expression, and nuclear magnetic resonance data using a toxicological study as case. The study includes the three toxicants alpha-naphthyl-isothiocyanate, dimethylnitrosamine, and N-methylformamide administered to rats. Improved predictive ability of the different classes is seen, suggesting that this approach is a suitable method for data integration and visualization of biological data. Furthermore, the method allows for correlation of biological parameters between the different data types, which could lead to an improvement in biological interpretation.
Key Words: data integration; toxicology; clinical chemistry; microarray; metabonomics.
| INTRODUCTION |
|---|
|
|
|---|
Recent developments within high-throughput technologies enquiring biology through expression, protein, and metabolite analyses holds the promise to result in a better understanding and advancement within drug discovery, disease research, and biology in general. A paradox in the increased knowledge from these high-throughput technologies is the confirmation of the limitations of these individual techniques for describing the high nonlinearity and complexity of biological systems. Many resources are currently spent in the scientific community toward integration of different data types aiming to increase the understanding of biological systems due to the synergy of various data types. It is the hope that this synergy will lead to discoveries not achievable by analysis of individual data types (Nicholson et al., 2004
The number of papers describing integration of biological data has increased in recent years, and numerous groups are pursuing different approaches. Great effort is going into building databases with the future intent to integrate different data types such as conventional clinical chemistry (ClinChem) and omics. Many of these initiatives exist within toxicology (Fostel et al., 2005
; Mattes et al., 2004
; Rhodes et al., 2004
; Xirasagar et al., 2006
). Other groups have presented integration methods of expression and clinical chemistry in different areas of biology, such as virus research (Baskin et al., 2004
), toxicology (Boverhof et al., 2005
), and cancer (Rhodes and Chinnaiyan, 2005
). One of the main obstacles toward successful data integration that has to be solved before valuable information can be extracted is the development of appropriate tools for integration (Tyers and Mann, 2003
). Data integration methods for combining significance across data sets and data types already exist (Hwang et al., 2005
; Shannon et al., 2002
). Most of these methods are, however, univariate, and the majority are based on p values, which may not be the most optimal way to describe the highly connected variables of biological systems. A multivariate method for integration of multiple data set could be hierarchical modeling of either principal component analysis (PCA) (Wold et al., 1987
) or partial least squares (PLS) (Westerhuis et al., 1998
; Wold et al., 1996
) models (or a mix of both), representing an unsupervised and supervised method respectively to explore and visualize underlying structures in data sets. PCA will compress the data sets to a lower dimension by finding underlying features which are common to the samples included in the data sets and rank these in decreasing order of variance contribution. These common underlying features can subsequently be analyzed to gain increased understanding of why the samples are different or similarly why groups separate. PLS is a regression extension of PCA aiming at connecting the information of a response variable with a data set through a linear model. When the response variable is a discrete rather than continuous variable, separating data into groups, the approach is commonly referred to as partial least squares discriminant analysis (PLS-DA).
The simplest PCA or PLS-based approach for integration of several different data sets is to calculate individual models for all data sets (one for each data set) and subsequently try to overlay the extracted information. Since the models are calculated independently of each other, integration using this approach may be difficult if the underlying structures of the different data sets are not well aligned. To overcome the problem of having to overlay different models manually in order to integrate information from different data sets, concatenation of different data sets into one large data set can be performed, and a single PCA or PLS model representing all the individual data sets can be calculated. Clearly, this approach will simplify interpretation by eliminating the need to overlay individual models; however, the problem of the underlying structures of the different data sets not being aligned is still potential, and a suboptimal "mean underlying structure" will be calculated. Furthermore, if the size of the individual data sets is very different, this will introduce a bias favoring the largest data set, representing a second problem that also needs to be addressed. This problem can, however, be handled if the different data sets in the concatenated model are appropriately "block scaled", i.e. besides the individual variable scaling the different data sets (blocks) can be given equal weight regardless of their size (number of variables) allowing each block to be considered as a unit and to be given the appropriate variance. Generally, due to the large number of variables included in one model, interpretation may be difficult due to crowded plots.
The problem of overlaying multiple data sets with unaligned underlying structures may to a large extent be solved by performing the modeling iteratively in a so-called hierarchical model (Eriksson et al., 2006b
). The concept of a hierarchical model is to take elements from several models (base level) and use these in a new model (top level). In the setup described above, this could mean taking scores from a number of principal components from the different individual models, concatenating these score vectors into a new data set and calculating a new PCA or PLS model on these. If the size of the elements taken from the different base-level models is fairly equal, the weight problem of different blocks previously described is thus eliminated, and no block scaling is required. Furthermore, as an additional advantage, interpretation of the contributions of the individual data sets is also improved due to less crowded plots. An additional advantage with this approach is the fact that different number of samples may be used in the base-level models. An example could be that a measurement on one sample fails for one analysis but not for another and rather than having to exclude the sample altogether as will be the case when building a model of concatenated data, the good part of the data may be retained and included in a hierarchical model.
In this work, we test the hypothesis of hierarchical modeling superiority for simultaneous analysis of multiple data sets compared to modeling of concatenated data sets. This hypothesis is tested using ClinChem (serum), microarray (liver), and nuclear magnetic resonance (NMR) (urine) data from a toxicity study. All compounds in the study are toxic to liver, and the end point is thus changes in the measured parameters (ClinChem, genes or metabolites) that can be related to changes in liver. There is no specific disease in question in this case study. Proof of concept is accepted if better data separation and classification is observed in the hierarchical model compared to a similar model of concatenated data. The quality of the different models is subjectively evaluated by visual inspection of score plots and objectively evaluated by classification ability estimated through a manual cross-validation (CV) approach.
| MATERIALS AND METHODS |
|---|
|
|
|---|
The samples and data analyzed in this study originate from the COMET consortium (Lindon et al., 2003
Animals.
All animal studies were conducted by the contract research organization Scantox (Scantox, Lille Skensved, Denmark). The animal study served exploratory purposes and therefore was not conducted in accordance with specific guidelines. All compounds were investigated using male Sprague-Dawley rats (Crl:CD(SD)IGS Br, Charles River Deutschland GmbH, Sulzfeld, Germany). Sixteen rats were allocated to a pilot group (three animals), a control (group 1), treated group (group 2) (five animals each), and a group of three replacement animals (group 3). The animals in the pilot group were treated accordingly and observed for 2 days. The concentration of the different compounds as well as route of administration can be seen in Table 1. Group 1 animals (control group) were given vehicle, and group 2 animals were given vehicle plus compound. Only one dose was administered for each compound. At the start of the acclimatization period, the rats were 5–6 weeks old. At the start of treatment, the range of weight variation did not exceed ±2 times the standard deviation of the mean weight (203.9 ±10.2 g). An acclimatization period of 4 days was allowed for the pilot animals and 6 days for the main study animals in order to reject animals in poor condition or with weight outside allowed range. The study took place in a room provided with filtered air supply at a temperature of 21°C ± 3°C and relative humidity of 55 ± 15%. The ventilation system has been designed to give 10 air changes per hour. The room was illuminated to give a cycle of 12 h light and 12 h darkness. Light was on from 0600 h–1800 h. The temperature and relative humidity in the animal room were recorded hourly during the study. Each animal was identified by earmarks with a punched number. The rats were kept in transparent polycarbonate cages (macrolone type III, floor area: 810 cm2) with two or three in each cage. The cages were cleaned and the bedding changed at least twice a week. The bedding was softwood sawdust "Lignocel 3-4" from Hahn & Co., D-24796 Bredenbek-Kronsburg, Germany. A complete pelleted rodent diet (Purina chow 5002 [powder], PMI Nutrition International, Nottingham, UK) was available ad libitum. Analyses for major nutritive components and relevant possible contaminants were performed regularly. The animals had free access to bottles with domestic quality drinking water acidified with hydrochloric acid to pH 2.5 in order to prevent microbial growth. Analyses for relevant possible contaminants were performed regularly. Certificates of analysis have been retained.
|
Compounds.
Three different toxicants were included in this study: ANIT, DMN, and NMF. Control (untreated) (Con) samples were included in the analysis as well.
Sample collection.
For each study, individual animals were housed in metabolism cages and urine samples collected. Urine samples were collected continuously over ice into tubes containing 1% sodium azide between 24 and 48 h postdose. On collection, urine samples were centrifuged at
1200 x g for 10 min, in order to remove any solid debris and stored at – 40°C pending 1H NMR spectroscopic analysis. The animals were anaesthetized by inhalation of a high concentration of CO2 and sacrificed 48 hours after treatment, blood was drawn for clinical chemistry evaluation and organs were frozen for later histopathological analysis. The animals were not fasted prior to exsanguination. A macroscopic examination was performed externally as well as internally after opening the thoracic and abdominal cavities and by observing the appearance of the organs and tissues in situ. Any macroscopic change was recorded with details of the location, color, shape, and size. Tissue samples of the left lateral lobe of the liver were snap frozen in liquid nitrogen, transferred to dry ice, and stored at approximately – 80°C until RNA extraction and microarray analysis. After fixation in 4% buffered neutral formaldehyde, kidney and liver tissues sampled for microscopic examination were trimmed, embedded in paraffin and cut at a nominal thickness of 5 µm, stained with hematoxylin and eosin, and examined under light microscope.
Histopathology.
All animals treated with ANIT showed minimal, multifocal periportal oval cell hyperplasia and minimal, multifocal periportal necrosis and acute inflammation. Moderate multifocal necrosis of bile ducts was also observed in all these animals. All DMN-treated animals displayed mild to moderate diffuse centrilobular vacuolization, moderate to marked centrilobular hemorrhage and necrosis of parenchyma and endothelia, and moderate increase in mitosis. NMF treatment led to mild to moderate diffuse centrilobular degeneration and mild to moderate multifocal centrilobular necrosis. A peer review by internal peer reviewing pathologist was performed on one animal from each group (five animals per group) and on selected other slides. The diagnoses presented represent the consensus opinion of the study pathologist and peer review pathologist.
Clinical chemical data.
Blood sampling for clinical chemistry were taken 24 h after treatment. Blood samples were also taken on the day of necropsy. Blood samples were taken from the orbital venous plexus during CO2 anesthesia. To obtain serum, the blood samples were centrifuged for 10 min at 1270 x g. Glucose was also measured but was excluded from the analysis due to missing values for more than 50% of the samples.
RNA extraction and microarrays.
Frozen liver samples were homogenized in 2 ml TRIzol reagent (Invitrogen; Life Technologies, Scotland) using an Ultra Turrax T25 basic homogenizer (IKA Labortechnik; IKA-Werke, Staufen, Germany). Chloroform (400 µl) was added, and homogenized samples were shaken vigorously by hand for 15 s. Phase separation was achieved by centrifugation for 15 min at 10,000 x g at 4°C. Total RNA from the aqueous phase was purified using the RNeasy 96 Qiagen vacuum system following the "Clean-up" procedure. RNA integrity was confirmed for each of the purifications on a 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA) using the RNA 6000 Nano LabChip kit (Agilent Technologies). All compounds were run in biological quintuplicate (n = 5) experiments except controls, which were made in octaplicate (n = 8). This resulted in a total of 23 samples which were analyzed over 5 days. The samples were analyzed in such a way that repetitions of the same compound were spread out over the five hybridization days, and within each hybridization day, the order of the samples analyzed was randomized. GeneChip One-Cycle Target Labelling and Control Reagents (Affymetrix Inc., Santa Clara, CA) were used for preparing the labeled cRNA samples, following the instructions of the manufacturer. The GeneChip Rat Genome 230 2.0 Array (Affymetrix) was used for the analysis. According to Affymetrix, the microarray contains more than 31,000 probe sets, analyzing over 30,000 transcripts and variants from over 28,000 well-substantiated rat genes.
Microarray depository information.
The microarray data analyzed in this paper has been submitted to the Gene Expression Omnibus database and can be accessed via the ID: GSE5509
[NCBI GEO]
.
NMR analysis.
In all, 400 µl of each urine sample was added to 200 µl of sodium phosphate buffer (0.2M Na2HPO4 in H2O and 0.2M NaH2PO4 in 80:20 H2O:D2O, pH = 7.4) containing 1mM sodium trimethylsilyl [2,2,3,3-2H4] propionate and 3mM sodium azide. All samples were centrifuged at
1800 x g for 5 min to remove any solid debris. 1H NMR spectra were measured at 600 MHz at 300 K using a flow injection system (Bruker Biospin, Karlsruhe, Germany). The water resonance was suppressed by using a standard presaturation pulse sequence with irradiation during a 2-s relaxation delay and also during the 100 ms mixing time. For each sample, 64 transients were collected into 64 K data points using a spectral width of 20 ppm (12 kHz), resulting in a total acquisition time of
4 min per sample. Prior to Fourier transformation, an exponential line broadening of 1 Hz was applied to each free induction decay. The spectra were phased baseline corrected and referenced automatically using a MATLAB (The MathWorks, Natick, MA) routine developed by the COMET team.
Algorithm.
Standard and hierarchical PCA and PLS models were calculated in SIMCA-P+ v.11.0 (Umetrics AB, Umeå, Sweden). Unit variance scaling was used for the individual (base-level) model of ClinChem data since the variables are independent and have been measured on a different scale. Pareto scaling was used for microarray and NMR data for the individual (base-level) models because variables may be biologically connected while at the same time the dynamic range of signal intensity is very large (Eriksson et al., 2006a
). Application of univariate scaling to this kind of data has a tendency to amplify the noise from noninformative variables introducing unnecessary variation into the model (Eriksson et al., 2006a
). For models of concatenated data, all variables are scaled to unit variance and block scaled according to
where Kblock is the number of variables in each block. The block scaling is performed to handle large variation in the number of variables in each data set, and for this block scale to be appropriate all variables across data sets (blocks) have to be scaled to unit variance although this might seem contradictory to the statement above. In the top level of hierarchical models, unit variance was used for all variables (score vectors) but no block scaling applied. PCA is an unsupervised multivariate decomposition method where an original data matrix (X) is deconvoluted into sets of corresponding concentrations (T) and common underlying variables (P) leaving only a residual (E) of unaccounted for variation such that:
. Each vector in P is normalized to unit intensity. PCA can be used to identify common underlying features in complex data which results in a compression of data allowing for graphical interpretation of the major sources of variation between samples. Being unsupervised, PCA will not show anything that is not present in the data. PLS on the other hand is a supervised multivariate method where an original data matrix (X) and a response variable matrix (Y) are simultaneously and dependently decomposed aiming at building a linear relationship between the X and the Y matrices extracting only in the information in X that correlates with the information in Y. PLS-DA is a term commonly used when the Y matrix is used to, e.g., describe grouping in data and therefore consist of a set of discrete (binary) rather than continuous variables. A hierarchical PCA or PLS(-DA) model is simply a model based on, e.g., concentration vectors (scores) from previous calculated models. This can be exemplified as the following, where three PCA models are calculated on three different X-matrices:
![]() |
From the concentration matrices of these three models, a hierarchical model is calculated using, e.g., the first two principal components from each of the three different PCA models:
|
|
TH and PH can then be investigated and any findings can be referred back to the original models. The hierarchical models calculated in this paper are either based on PCA or PLS-DA both for base- and top-level models. This is not a requirement and models may be combined in any manner best suited for the available data. For each individual data set and the selected combinations of these described in Table 2, both PCA and PLS-DA models are calculated. PCA models are used to investigate for outliers and trends in data whereas PLS-DA models are used to perform a manual CV, allowing for estimation of classification performance. All plots shown are based on the PCA models.
|
Microarray data preprocessing.
The raw data files (*.cel files) were imported into R (statistical software R, build 2.0.1, http://www.r-project.org/) and initially processed using the "affy" package (Gautier et al., 2004
NMR data reduction and preprocessing.
A 1-Hz exponential line broadening was applied to each free induction decay prior to Fourier transformation. Subsequently, each NMR spectrum was phase corrected, baseline corrected, and referenced using a MATLAB routine developed in the COMET consortium. This code is not publicly available. Following this, the range of interest in each NMR spectrum was set to 0.2–10.0 ppm which was reduced to 244 variables by integrating regions of equal width (0.04 ppm) using a MATLAB routine developed by the COMET team. This code is not publicly available. Through a number of steps, the data are further reduced as the range from 4.50 to 5.98 ppm was excluded to remove water and any cross-relaxation effects on the urea peak, and the two ranges of 2.66–2.74 and 2.50–2.58 ppm around the shifting citrate peaks were merged by summation into two variables. Finally, in some spectra, peaks from metabolites of the dosed compounds were observed, and these variables were excluded (in all spectra) giving a total of 205 variables. This treatment simplifies statistical analysis of the data and reduces the impact of small variations in chemical shift due to, for example, variation in pH or ionic strength. The described approach was routinely used for NMR data acquired and analyzed in the COMET consortium (Ebbels et al., 2003
).
Data set.
Three different toxicants (ANIT, DMN, and NMF) in quintuplicate (n = 5) and eight control (untreated) (Con) samples were included in the analysis, 23 samples in all. The analyses are conducted on serum ClinChem, gene expression (microarrays), and metabonomics (NMR on urine) data. The ClinChem data set consists of 13 variables, the microarray data of 31,000 variables, and the NMR data of 205 variables.
Model evaluation.
All score and loading plots presented are based on PCA models and have been created without axes to allow for optimal visualization of the score patterns for each model. However, when the plots are not on the same scale, it is somewhat difficult to quantify the between-model quality in terms of group separation. In order to be able to compare group separation between models more objectively, classification performance based on the calculated PLS-DA models is used as a quantitative measure. This test is optimally performed by having two representative sets of samples, using one set of samples to build the model and the second set of samples to test classification performance. However, since the number of samples in this study is rather small and each group therefore consists of a limited number of samples, it is difficult to create two representative data sets of comparable size. The classification test is therefore performed in a manual CV manner by excluding a small number of samples, calculate a model on the remaining samples, and then classify the excluded samples using the calculated model. The results of all CV segments are summed to get an overall estimate of the classification performance. In this paper, the described procedure is performed five times, using the CV classification segments shown in Table 3, and a sample is considered belonging to a group when the classification values is >0.5 and not belonging to a group when <0.5.
|
With the hierarchical models, for each CV segment new base-level models are calculated. Although PCA may be used for base-level models, even with PLS-DA as top-level model, it was decided to use PLS-DA as base-level models to allow for optimal performance through the guided data decomposition of PLS-DA.
Regarding validity of the calculated PCA and PLS-DA models, bilinearity is the most important requirement but preferably data should also be normally distributed since least square algorithms works best with normally distributed data. This is, however, not a strict necessity. Also if missing data exist in the data sets, there should not be any structure in the distribution of the missing values. Since the number of samples in this study is rather small, proper statistical evaluation is not really feasible, and model statistics are used to ensure that the calculated models are valid.
| RESULTS |
|---|
|
|
|---|
Outlier Detection and Data Cleanup
Initial exploratory data analysis was performed on all three data sets (ClinChem, microarray, NMR) in order to detect possible outliers. The PCA model based on microarray data suggested one DMN sample as outlier as it clustered together with Con instead of the other DMN samples, and one NMF sample showed outlying behavior in ClinChem data also clustering tighter with Con than NMF (data not shown). These samples were hereafter excluded from all further PCA modeling, although they did not show outlying behavior in the NMR data set. Subsequently, all data sets used for PCA models then consisted of 21 samples. Particular in the ClinChem data set although also apparent in NMR data set, one of the Con samples is separated from the remaining Con samples. It was decided to keep this sample in the data sets since the sample remains clustered with Con and not any of the other groups when investigating score plots of different PC combinations.
PCA Analysis for Data Inspection
As part of the method evaluation, seven PCA models were calculated. An overview of the basis of the calculated PCA models can be seen in Table 2. Models 1–3 correspond to the cured PCA models described above on the individual data types. Model 4 is a PCA model calculated on concatenated data from all three data set (DS4). Model 5 is a hierarchical model which is created by taking score vectors from the first three principal components of DS1, DS2, and DS3 and collect these into a new data set (DS5 of dimension 21 x 9). Models 6 and 7 are identical to models 4 and 5, respectively, except in models 6 (DS6) and 7 (DS7) only data from ClinChem and microarray are included (dimension 21 x 6). Examining Figures 1–7 in general, it is obvious that the main source of variation is the separation of ANIT from the three other groups. The only data set where this is not the case is for microarray where the main source of variation is the separation of control from the three treatment groups. Examining the score plots of the three individual data sets, NMR data show superior performance regarding group separation, and it looks like there is a tendency toward improved group separation in the models combining data from multiple sources.
PLS-DA for Evaluation of Classification
For the evaluation of the data integration methods described, seven PLS-DA models corresponding to DS1–DS7 are calculated. An overview of the data sets can be seen in Table 2. All CV submodels are calculated using the same number of components as the full model containing all samples. Since one DMN and one NMF samples were found to be outliers in microarray and ClinChem data sets, respectively, these samples will be excluded from all classification models when using concatenation of data sets. However, for hierarchical modeling, it is possible to build the model on base-level models containing different number of samples, and therefore the outlying DMN sample is kept in the ClinChem and NMR base-level PLS-DA models, and the outlying NMF sample is kept in the base-level models for the microarray and NMR data sets. For both modeling approaches, classification of the outlying DMN and NMF samples will be attempted in their respective CV segments. The results of the PLS-DA CV classification can be seen in Table 4. No score plots of the PLS-DA are shown since these resemble those presented for the PCA models.
|
| DISCUSSION |
|---|
|
|
|---|
From analyzing the PCA models of the individual data sets, it appears that only NMR data can separate all four groups although microarray data show a clear separation of control versus toxic in general. As the only data set of the three, clinical chemistry data display a mixing of control and one of the toxic compounds (NMF), which is not resolved by including more PCs in the model and inspecting plots of other PC combinations. From comparing the three plots simultaneously, the underlying structure in the three data sets is slightly different with different location of groups in the three plots. It is therefore not a simple task to manually overlay and interpret, e.g., control versus general toxicity or control versus the individual toxic compounds. Thus, integration through analysis of the three data sets individually is not recommended.
If the three data sets are concatenated, maintaining the variable scaling from the individual models without doing block scaling, the resulting PCA model is virtually identical to the PCA of DS2 (data not shown). Clearly, microarray data are dominating the model due to the many variables of the microarray data set. If on the other hand, all three data sets are scaled to unit variance and the suggested block scaling applied, the resulting PCA of the three concatenated data sets (DS4) can be seen in Figure 4. The score plots in Figure 4 show clear similarities to that of Figure 1 for ClinChem data; however, now all four groups are well separated. It is obvious that the primary source of variation in this model is the distinction of ANIT from all the three other groups. The block scaling in connection with unit variance scaling of all variables has managed to handle the large difference in the size of the data sets. Comparing Figure 4 to Figure 3, which represents the individual data set displaying the best group separation (DS3), it is seen that the two groups of DMN and NMF apparently have switched place. Moreover, comparing the third PC for the same two models, the third component on DS3 is used to separate DMN and NMF, whereas for DS4 the third component separates NMF from Con and DMN (data not shown).
|
|
|
Application of scaling to unit variance is known not to be optimal for both microarray and NMR data due to the fact that these two data types hold a large number of parameters which are not regulated during the toxic insult, and thus, they represent a baseline of nonsystematic variation from animal to animal. This might explain why PCA on DS4 is similar to PCA on DS1. However, if different variable scaling should be applied within each block, different block scaling factors will also have to be used for each block and it is no easy task to determine the correct block-scaling factor. This is the reason why this has not been performed in this paper.
Comparing the hierarchical PCA model when including data (scores) from all three data sets (DS5) with all the previously calculated models, it is seen that the pattern of Figure 5 and Figure 3 are quite similar apart from the fact that again DMN and NMF have switched places. Also, a better separation of the DMN group and the NMF group is observed in the score plots of PCA on DS5 compared to the score plots of PCA on DS3. The large number of variables of the microarray data set compared to the other two sets is appropriately handled without any block scaling.
|
By simply looking at the score plots in Figure 4 and Figure 5, it is difficult to demonstrate a clear advantage of performing hierarchical modeling over modeling of concatenated data in terms of group separation. However, from a modeling point of view, as previously explained, there is an obvious advantage of hierarchical contra concatenated modeling due to the fact that different choices of variable scaling and block scaling can dramatically influence the resulting score plots, and thereby the visual interpretation in a model of concatenated data (data not shown). This is much less of a problem with hierarchical modeling.
Examining the results of the manual CV classification shown in Table 4, it can be seen that NMR data (DS3) show much better classification performance than ClinChem (DS1) or microarray data (DS2) which nicely confirms the results from the PCA analysis. Comparing models based on DS4 and DS5, it is seen that there is no difference in performance between concatenation of data (DS4) and hierarchical modeling (DS5) when all three data sets are included in the models. It is important to remember that the outlying DMN and NMF samples were included in the CV test sets (but not in the calibration sets), and these have therefore been correctly classified in both models of combined data although they were considered to be outliers in the microarray and ClinChem data set, respectively.
Since NMR data (DS3) by far exhibit the best group separation of the individual models, it is interesting to investigate the impact of NMR data on both the concatenated and hierarchical models. Therefore, new models on concatenated (DS6) and hierarchical (DS7) data only based on ClinChem and microarray data were calculated and analyzed. Microarray and ClinChem data are more common than NMR data in these kinds of studies, thus this will furthermore give an indication of the value of the method in this situation.
Comparing the score plots from PCA on DS6 to that of a PCA on DS4, it is seen that the nice group separation displayed in Figure 4 no longer holds for Figure 6, since Con and NMF are starting to overlap. Obviously, NMR data plays an important role in the group separation observed in the model of concatenated data of all three data sets (DS4). Comparing the hierarchical PCA models of Figure 7 (DS7) and Figure 5 (DS5), it is seen that all four groups remain separated upon exclusion of NMR data. It even appears that separation of DMN and NMF is slightly improved, although larger scattering within groups is observed when NMR data are excluded. Further examination of this using PLS-DA and CV classification for DS6, it is seen that two samples are assigned to wrong groups (false positive) and two samples are not assigned to their own groups (false negative). Two of these four samples are the outlying DMN and NMF samples where the DMN sample is not classified as belonging to any of the four groups and the NMF sample is classified as belonging to Con. For PLS-DA on DS7 classification performance is clearly better, with only one sample not being correctly classified. The sample which is not correctly classified is the outlying DMN samples which is wrongly classified as being both a DMN sample and a Con sample (false positive).
|
|
Besides the better classification performance, again the straightforward handling of variable scaling and the fact that block scaling is often not required in hierarchical modeling adds an extra benefit to this type of data integration.
It is important to stress that a benefit of the hierarchical model in this example is not only a better separation of groups and therefore better classification but also the ability to rank the importance of components between the individual data types included in the model and compare correlations of contributions between data types. As an example to highlight this ability, the contributions of the individual elements in the hierarchical PCA model of DS5 relating to the separation of Con and ANIT samples were evaluated. Figure 8 shows the contributions of the top-level hierarchical model (DS5) specific for separating ANIT and controls. These score contributions have been calculated by subtracting the group mean scores for Con from group mean scores for ANIT (i.e., ANIT—Con) and subsequently scale by the absolute value of the normalized linear combination of the loading for the first two components. This means that since control samples represent the untreated state, their biological profile can be considered a zero point and all impact of the ANIT treatment on the animals is displayed relative to this. The size of the individual bars directly expresses the importance of the different model elements in relation to the separation of Con and ANIT and in Figure 8, it can therefore be seen that in absolute values, the first PC of the NMR data set express the largest contribution to the group separation of Con and ANIT, the first PC of microarray data the second largest, and the first PC of the ClinChem data set the third largest contribution. All these show strong positive correlation and to a lesser extent negative correlation with the second component of ClinChem data. For a more in-depth analysis of the actual changes lying behind these finding, the models for the individual data sets must be investigated.
|
Although a thorough biological interpretation will not be attempted here, the general principle will be exemplified, and the important thing to understand is that each contribution in Figure 8 simply represents a principal component from one of the base-level models, and therefore the evaluation of the information content of each contribution is performed by looking at loadings in the relevant component of the corresponding base-level models. In this case score and loading plots from PCA on DS1 to DS3 should therefore be inspected and these can be found in Figures 1–3 and Figures 9–11, respectively. The identifiers in Figure 9 correspond to the individual clinical chemistry parameters; in Figure 10 the microarray probe IDs are directly shown, and in Figure 11 chemical shift values of the different buckets are displayed. The orientation of the loading plots can be directly overlaid the corresponding score plots (Figures 1–3, respectively), and it is therefore possible, using the importance (magnitude) and sign of the contributions in Figure 8, to select which variables for each model to investigate further. In all three loading plots, a solid ellipse indicates variables that show upregulation upon ANIT treatment whereas variables in the dotted ellipse show downregulation upon ANIT treatment. The two ellipses are only tentative and simply indicate the region in the plots where to look for interesting variables.
|
|
|
As stated, no further biological interpretation will be attempted here but based on these findings of correlated elements, it is possible to dig into the underlying data structures and attempt interpretation of correlated underlying features. Therefore, with enhanced group separation, it becomes easier to evaluate and distinguish differences in contributions when comparing individual group differences, e.g., what makes the three toxins different using control samples as a reference point.
As a final remark, it might be argued that the interpretative findings presented in this paper could have been obtained without performing hierarchical modeling and this cannot be denied. However, with the example of ANIT versus Con, all models show clear separation and selection of what to evaluate is not the most difficult task. In situations where the picture is less clear or the models on individual data types contain many principal components, hierarchical modeling may still turn out to be an advantage and the ability of hierarchical modeling to classify all samples correctly and the easy model building is a strong incentive to use this kind of modeling.
| Supplementary Data |
|---|
|
|
|---|
Supplementary table is available online at http://toxsci.oxfordjournals.org/.
| FUNDING |
|---|
|
|
|---|
The COMET consortium, Imperial College, London; Villum Kann Rasmussen Foundation.
|
| ACKNOWLEDGMENTS |
|---|
The authors would like to thank the COMET consortium for allowing access to the data. Conflicts of interest: none declared.
| REFERENCES |
|---|
|
|
|---|
Baskin CR, Garcia-Sastre A, Tumpey TM, Bielefeldt-Ohmann H, Carter VS, Nistal-Villβn E, Katze MG. Integration of clinical data, pathology, and cDNA microarrays in influenza virus-infected pigtailed macaques (Macaca nemestrina). J Virol (2004) 102(2):444–454.
Boverhof DR, Burgoon LD, Tashiro C, Chittim B, Harkema JR, Jump DB, Zacharewski TR. Temporal and dose-dependent hepatic gene expression patterns in mice provide new insights into TCDD-mediated hepatotoxicity. Toxicol Sci (2005) 85(2):1048–1063.
Ebbels T, Keun H, Beckonert O, Antti H, Bollard M, Holmes E, Lindon J, Nicholson J. Toxicity classification from metabonomic data using a density superposition approach: CLOUDS. Anal Chim Acta (2003) 490(1–2):109–122.[CrossRef][Web of Science]
Eriksson L, Johansson E, Kettaneh-Wonl N, Trygg J, Wikström C, Wold S. Multi- and Megavariate Data Analysis—Part 1: Basic Principles and Applications. (2006a) pp. 40–61, 207–220. Umetrics.
Eriksson L, Johansson E, Kettaneh-Wonl N, Trygg J, Wikström C, Wold S. Multi- and Megavariate Data Analysis—Part 2: Advanced Applications and Method Extensions (2006b) pp. 137–152. Umetrics.
Fostel J, Choi D, Zwickl C, Morrison N, Rashid A, Hasan A, Bao W, Richard A, Tong W, Bushel PR, et al. Chemical Effects in Biological Systems—Data Dictionary (CEBS-DD): A compendium of terms for the capture and integration of biological Study Design Description, conventional phenotypes, and Omics data. Toxicol Sci (2005) 88(2):585–601.
Gautier L, Cope L, Bolstad BM, Irizarry RA. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics (2004) 20(3):307–315.
Hwang D. Alistair G. Rust, Stephen Ramsey, Jennifer J. Smith, Deena M. Leslie, Andrea D. Weston, Pedro de Atauri, John D. Aitchison, Leroy Hood, Andrew F. Siegel, et al. (2005). A data integration methodology for systems biology. Proc Natl Acad Sci U.S.A. 102(48), 17296–17301.
Lindon JC, Keun HC, Ebbels TMD, Pearce JMT, Holmes E, Nicholson JK. The Consortium for Metabonomic Toxicology (COMET): Aims, activities and achievements. Pharmacogenomics (2005) 6(7):691–699.[CrossRef][Web of Science][Medline]
Lindon JC, Nicholson JK, Holmes E, Antti H, Bollard ME, Keun H, Beckonert O, Ebbels TM, Reily MD, Robertson D, et al. Contemporary issues in toxicology the role of metabonomics in toxicology and its evaluation by the COMET project. Toxicol Appl Pharmacol (2003) 187(3):137–146.[CrossRef][Web of Science][Medline]
Mattes WB, Pettit SD, Sansone SA, Bushel PR, Waters MD. Database development in toxicogenomics: Issues and efforts. Environ Health Perspect (2004) 112(4):495–505.[Web of Science][Medline]
Nicholson JK, Holmes E, Lindon JC, Wilson ID. The challenges of modeling mammalian biocomplexity. Nat Biotechnol (2004) 22(10):1268–1274.[CrossRef][Web of Science][Medline]
Rhodes DR, Chinnaiyan AM. Integrative analysis of the cancer transcriptome. Nat Genet (2005) 37(6 Suppl.):S31–S37.[CrossRef][Web of Science][Medline]
Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM. ONCOMINE: A cancer microarray database and integrated data-mining platform. Neoplasia (2004) 6(1):1–6.[Web of Science][Medline]
Shannon WD, Watson MA, Perry A, Rich K. Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genet Epidemiol (2002) 23:87–96.[CrossRef][Web of Science][Medline]
Tyers M, Mann M. From genomics to proteomics. Nature (2003) 422(6928):193–197.[CrossRef][Medline]
Westerhuis J, Kourti T, MacGregor JF. Analysis of multiblock and hierarchical PCA and PLS models. J Chemo (1998) 12:301.[CrossRef]
Wold S, Esbensen K, Geladi P. Principal component analysis—A tutorial. Chemo Intell Lab Syst (1987) 2:37–52.[CrossRef]
Wold S, Kettaneh N, Tjessem K. Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection. J Chemo (1996) 10:462.
Xirasagar S, Gustafson SF, Huang CC, Pan QY, Fostel J, Boyer P, Merrick BA, Tomer KB, Chan DD, Yost KJ, et al. Chemical effects in biological systems (CEBS) object model for toxicology data, SysTox-OM: Design and application. Bioinformatics (2006) 22(7):874–882.
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











