ToxSci Advance Access originally published online on May 5, 2007
Toxicological Sciences 2007 99(2):403-412; doi:10.1093/toxsci/kfm108
Clearing the Standards Landscape: the Semantics of Terminology and their Impact on Toxicogenomics
Toxicogenomic Informatics and Solutions, LLC, P.O. Box 27482, Lansing, Michigan 48909
1 To whom correspondence should be addressed. Fax: (517) 882-0080. E-mail: burgoonL{at}txisllc.com.
Received March 12, 2007; accepted May 1, 2007
| ABSTRACT |
|---|
The emergence of microarray data standards, especially the Minimum Information About a Microarray Experiment (MIAME), has spurred several organizations to develop their own standards for a myriad of technologies, including proteomics and metabolomics. These efforts have facilitated the creation of several large-scale gene expression repositories, including the toxicology-focused Chemical Effects in Biological Systems Knowledgebase at the National Institute of Environmental Health Sciences. Recently, efforts have shifted toward developing toxicogenomic data standards (e.g., MIAME-Tox), and the U.S. Food and Drug Administration and the U.S. Environmental Protection Agency either have developed or are developing regulatory guidance with respect to pharmaco- and toxicogenomics. However, for the toxicology community to be engaged in the process of standards development and approval, there needs to be a more thorough understanding of the terms associated with electronic data sharing and communication, especially with respect to defining the terms "standard," "controlled vocabulary," "object model," "markup language," and "ontology." This review will discuss these terms, especially as they pertain to toxicogenomics, how they relate to one another, and what current efforts exist that may impact toxicology.
Key Words: toxicogenomics; data standards; object model; controlled vocabulary; markup language; ontology.
| INTRODUCTION |
|---|
Biological standards are emerging to facilitate data exchange, data integration, and a critical review of large toxicology, genomic, proteomic, and metabolomic data sets. These standards will ensure all existing data is considered and will minimize redundant efforts, while enhancing comprehensive safety assessments. To spur future use of historical toxicogenomics data, organizations are beginning to expend significant resources towards the development of data repositories. Several efforts such as ArrayExpress (Brazma et al., 2003
In order to meet these challenges, the affected communities including drug developers, environmental health scientists, database developers, curators, and administrators as well as regulators, risk assessors, and managers must ensure that their needs are reflected in workable data reporting standards that will encourage compliance and use of the available resources. Consequently, there must be complementary efforts to facilitate the seamless computational reporting and sharing of disparate data between unrelated efforts which can be achieved through the cooperative development of data reporting standards. This review describes the data reporting standards landscape by defining "reporting standard," "controlled vocabulary," "object model," "markup language," and "ontology," and discusses their relationships in order to facilitate greater stakeholder involvement in the establishment of acceptable standards.
| DATA STANDARDS |
|---|
Data reporting standards specify requirements that must be met to describe a concept with a minimal amount of information detail. Providing the requested information and satisfying the level of required detail ensures compliance with the standard. Standards are commonplace within communications, information technology, and engineering, where they govern safety and ensure effective communication between unrelated systems (a.k.a. systems interoperability). For instance, the World Wide Web uses a series of standards, including the Hypertext Transfer Protocol (HTTP) (http://tools.ietf.org/html/rfc2616), Domain Name System (DNS; ftp://ftp.rfc-editor.org/in-notes/std/std13.txt), and the Transmission Control Protocol and Internet Protocol (TCP/IP; ftp://ftp.rfc-editor.org/in-notes/rfc1155.txt). The HTTP standard defines the text of all messages between client browsers and servers when requesting a webpage; DNS translates domain names, such as www.google.com, into IP addresses, such as 72.14.203.99; and TCP/IP defines the method for communicating the HTTP request message from the browser to the server, and vice versa.
Comparable standard requirements also exist within drug development and environmental health sciences, although they are not as rigorously defined. For instance, the U.S. Food and Drug Administration requires preclinical safety data to adhere to Good Laboratory Practices (21 Code of Federal Regulations [CFR] Part 58), and manufactured drugs to adhere to Good Manufacturing Practices (21 CFR Parts 210 and 211), while journals require all reports of new biological sequences to include database accession numbers from a recognized repository within the manuscript. In general, there is broad acceptance of data reporting and sharing, including the submission of toxicogenomic data sets to public repositories and to regulatory agencies to support drug and chemical sponsors' product applications (Mattes et al., 2004).
In recent years, there has been growing interest in standardizing the reporting of toxicological data. The Standard for Exchange of Nonclinical Data (SEND), from the Clinical Data Interchange Standards Consortium (CDISC), focuses on the exchange of nonclinical safety data from industry to the U.S. Food and Drug Administration. A cross-disciplinary group of toxicological scientists has recently submitted a "strawman" proposal for a minimal information checklist to spur discussion within the toxicology community, ultimately to define a minimal standard for toxicology data reporting (Fostel et al., in press).
Initial efforts within toxicogenomics have focused on developing MIAME/Tox (http://www.mged.org/Workgroups/rsbi/MIAME-Tox-Checklist.pdf), a toxicogenomic-specific standard based on the Minimum Information About a Microarray Experiment (MIAME; http://www.mged.org/Workgroups/MIAME/miame.html) standard. Although there have been issues, including the balance between the needs of the academic and safety assessment communities, and what constitutes minimum, new groups are being organized to update the initial drafts to form new consensus toxicogenomic data standards.
| CONTROLLED VOCABULARIES |
|---|
A controlled vocabulary is a well defined, agreed upon, exclusive set of terms that are used for communication, to eliminate redundant and synonymous terms. It is created by identifying and clearly defining orthogonal terms within a field's lexicon, thereby eliminating synonyms from the vocabulary, and eliminating ambiguity. For example, the Human Genome Organization (HUGO) list of official gene names and abbreviations is a controlled vocabulary where each gene has only one official name or abbreviation. For instance, the HUGO approved name for the CYP1A1 (official gene abbreviation, or symbol) gene is "cytochrome P450, family 1, subfamily A, polypeptide 1"; it has five unofficial aliases including "aryl hydrocarbon hydroxylase" and "cytochrome P1-450, dioxin-inducible." No other gene within the human genome may be given that gene abbreviation or name. Similarly, the National Toxicology Program (NTP) Pathology Code Tables represent a controlled vocabulary where specific pathology terms, such as lesion types, locations, descriptors, and systems, may be combined to completely describe a histological or gross pathological feature.
Problems arise when different groups develop independent controlled vocabularies within the same or similar fields, as evidenced within pathology with the NTP TDMS Pathology Code Tables, SNOMED_CT, and the Mouse Pathology Ontology. For instance, the term "abscess" is used within the NTP TDMS Pathology Code Table for Microscopic Lesions (code 1), and the SNOMED_CT in two different categories (disorder and morphologic abnormality), but it does not appear in the Pathology Ontology. The term "melanoma in situ" is used within SNOMED_CT and the Pathology Ontology to represent a lesion that may be part of a neoplastic track, while the NTP TDMS Pathology Code Table for Microscopic Lesions only defines Melanoma Benign and Melanoma Malignant. Similarly, only the NTP TDMS Pathology Code Table for Microscopic Lesions and the SNOMED_CT vocabularies define a benign melanoma term ("Melanoma Benign" and "Benign melanocytic neoplasm," respectively), whereas the Pathology Ontology draws no distinction between malignant and benign melanomas.
The problem of a field using multiple controlled vocabularies is the same as the field not using any controlled vocabularies. When the same concept can be referred to in many different ways, the computational system must maintain a list of synonymous terms. The larger problem, however, arises when the vocabularies use different levels of detail to describe the same class of concepts. In the melanoma example, where two vocabularies classify a melanoma as either benign or malignant, and the third vocabulary does not draw any distinction, the investigator designing the data management system must decide whether (1) the data using the Pathology Ontology should be excluded, (2) a special vocabulary should be developed to accommodate the lack of information regarding melanoma status ("benign," "malignant," "undefined"), (3) the data using the NTP TDMS or SNOMED_CT should be censored to match the Pathology Ontology, or (4) the data using the NTP TDMS or SNOMED_CT should be excluded.
The lack of consistent terminology to describe the same concept (e.g., "Melanoma Benign" vs. "Benign melanocytic neoplasm") makes automated concept mapping between the vocabularies virtually impossible. This is complicated further by differences in specificity among the vocabularies (e.g., one vocabulary defining differences between malignant and benign melanomas whereas another one does not). Ultimately, systems and data interoperability will be impossible without harmonization and standardization of the controlled vocabularies within a community.
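The synonym list and the "undefined" status discussed above can be illustrated with a minimal Python sketch. The two benign terms are taken from the text; the harmonized labels, the vocabulary keys, and the Pathology Ontology entry ("Melanoma") are hypothetical placeholders chosen only for illustration, not actual identifiers from any of these vocabularies.

```python
# Hypothetical synonym table: (vocabulary, term) -> harmonized concept label.
# The labels and the Pathology Ontology term are illustrative assumptions.
SYNONYMS = {
    ("NTP_TDMS", "Melanoma Benign"): "melanoma_benign",
    ("SNOMED_CT", "Benign melanocytic neoplasm"): "melanoma_benign",
    ("NTP_TDMS", "Melanoma Malignant"): "melanoma_malignant",
    # A vocabulary that draws no benign/malignant distinction maps to an
    # explicit "undefined" status (option 2 in the text).
    ("PATHOLOGY_ONTOLOGY", "Melanoma"): "melanoma_undefined",
}


def harmonize(vocabulary: str, term: str) -> str:
    """Map a vocabulary-specific term to its harmonized label."""
    try:
        return SYNONYMS[(vocabulary, term)]
    except KeyError:
        raise KeyError(f"No mapping for {term!r} in {vocabulary}") from None


ntp_label = harmonize("NTP_TDMS", "Melanoma Benign")
snomed_label = harmonize("SNOMED_CT", "Benign melanocytic neoplasm")
```

Every term pair must be entered by hand in such a table, which is the maintenance burden that a single harmonized vocabulary would avoid.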
| OBJECT MODELS |
|---|
In computer science and software engineering there is a school of programming referred to as object oriented programming (OOP). Programs written using OOP languages are based on the concept of using defined objects that perform specific tasks to intercommunicate and solve some computational problem. Some objects exist solely to store information, and may resemble tangible objects in the real world, while other objects exist solely to use information from other objects to perform a task. When creating OOP software, programmers write classes which are prototypes that specify the characteristics of the object. Classes can create objects based on other classes, and can pass these objects around to other classes.
When planning an OOP application, software engineers and OOP modelers create an object model (OM): a collection of objects and their relationships to one another. The OM helps software engineers and OOP modelers communicate application requirements to programmers, and facilitates the creation of interoperable software within a community. OMs are also built to facilitate the analysis, planning, and understanding of the computational structures, algorithms, and computer applications that use them. In a biological software application, for example, an OOP modeler may create an OM that specifies a lesion class that contains those characteristics which are true for all lesions (e.g., severity, tissue), and specific lesion classes that are derived from the generic lesion class and that specify characteristics particular to that lesion type (e.g., an inflammatory cell infiltration class may specify the infiltrating cell type). Using this model, the programmers could create an application that models the lesions within a specific animal's organs. For instance, if the application were to model all of the lesions within a particular animal's liver, a specific lesion object would be created for each specific lesion in the liver.
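The lesion example can be sketched in a few lines of Python. This is only an illustration of the generic/derived class idea under simple assumptions: the class names, field names, and sample values are hypothetical and are not drawn from any published object model.

```python
from dataclasses import dataclass


@dataclass
class Lesion:
    """Generic lesion: characteristics assumed true for all lesions."""
    severity: str
    tissue: str

    def describe(self) -> str:
        return f"{self.severity} lesion in {self.tissue}"


@dataclass
class InflammatoryCellInfiltration(Lesion):
    """Derived class adding a characteristic specific to this lesion type."""
    infiltrating_cell_type: str

    def describe(self) -> str:
        return (f"{self.severity} {self.infiltrating_cell_type} "
                f"infiltration in {self.tissue}")


# One object is created for each lesion observed in the animal's liver.
liver_lesions = [
    Lesion(severity="mild", tissue="liver"),
    InflammatoryCellInfiltration(severity="moderate", tissue="liver",
                                 infiltrating_cell_type="neutrophil"),
]
descriptions = [lesion.describe() for lesion in liver_lesions]
```

Both objects can be handled through the generic lesion interface, while the derived class carries the extra detail.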
Several examples of biological OMs exist, each created for a specific purpose. The Microarray and Gene Expression Object Model (MAGE-OM), created by the Microarray Gene Expression Data Society (MGED), facilitates interoperable gene expression microarray software application development (Spellman et al., 2002). The SysBio-OM, a product of the National Institute of Environmental Health Sciences' (NIEHS) CEBS project (Xirasagar et al., 2004), is used to create software that integrates toxicogenomic, proteomic, and metabolomic data. The Functional Genomics Experiment (FuGE) group has created the FuGE-OM to model common components found within genomics, proteomics, and metabolomics (Jones et al., 2006).
| MARKUP LANGUAGES AND REPORTING FORMATS |
|---|
Markup languages are computer languages where identifiers, referred to as tags, are used to specify what the contained information describes. For instance, in Hypertext Markup Language (HTML) the tags <P> and </P> enclose a block of text that is to be considered a paragraph, while the tags <B> and </B> enclose a block of text that should be displayed in a bold type. Web browsers recognize these tags and process the instructions to display the data appropriately. Markup languages are commonly used for data communication and exchange, including the submission of microarray data to a repository such as ArrayExpress.
eXtensible Markup Language (XML) in the biological domain allows individuals to create their own tags to mark data, allowing computers to recognize certain data elements more efficiently than reading unmarked data (Berman, 2005; Hermjakob et al., 2004; Hucka et al., 2003; Spellman et al., 2002). For example, the MAGE Markup Language (MAGE-ML) specifies the tags <BioSequence> and </BioSequence>, which enclose a block of data that describes a biological sequence, such as a gene, messenger RNA, or amino acid sequence (Spellman et al., 2002). The Leadscope ToxML is a focused toxicology-specific XML specification for data exchange (https://www.leadscope.com/toxml.php; https://www.leadscope.com/pub/toxml_public.xsd) which includes descriptive tags for a set of compounds, drug information, and specific toxicity tests (e.g., in vivo and in vitro micronucleus studies).
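To show how tagged data lets software pick out specific elements, the following sketch parses a small XML fragment with Python's standard library. Only the BioSequence tag name comes from the text; the enclosing Sequences element, the attributes, and the sequence strings are made up for illustration, and the fragment is not a valid MAGE-ML document.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: not valid MAGE-ML. The wrapper element,
# attribute names, and sequence content are assumptions.
DOCUMENT = """
<Sequences>
  <BioSequence identifier="seq1" name="CYP1A1">ATGC</BioSequence>
  <BioSequence identifier="seq2" name="example_gene">GGCA</BioSequence>
</Sequences>
"""

root = ET.fromstring(DOCUMENT)

# Because every sequence is enclosed in the same tag, software can find all
# of them without guessing at the layout of the surrounding text.
sequences = {
    element.get("name"): element.text
    for element in root.iter("BioSequence")
}
```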
Biological standards organizations tend to implement their OMs in terms of markup languages (Jones et al., 2006; Spellman et al., 2002), facilitating data exchange amongst software that adopts the OM. Consequently, the simplest method to export MAGE-ML formatted microarray data is to utilize a MIAME-compliant database coupled to software that is built on the MAGE-OM and takes advantage of the MAGE software toolkit (MAGEstk) to export MAGE-ML formatted data.
It should be noted that markup languages can be developed in the absence of OMs, OMs may exist independent of markup languages, and the mapping of OMs to markup languages is not exact. XML-based markup languages have certain idiosyncrasies that complicate the cross-mapping of OMs and XML models, since they are predicated upon different semantics and purposes than OMs. For instance, these mapping problems have led to the development of different flavors of MAGE-ML, further stagnating its broader application, and compromising data reporting and sharing across the genomics community (Burgoon, 2006).
XML-based files are not the only reporting formats utilized in biology. Other structured formats include SEND (http://www.cdisc.org/standards) and the preferred format for microarray data submission at the GEO: Simple Omnibus Format in Text (SOFT; http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html). SEND is a tabulation model where a single table represents a particular "SEND Domain" (i.e., a domain is a group of observations or measures which share some commonality, such as body weights and their descriptors), where each column represents a specific type of measurement or descriptor, and each row represents a single record. SOFT utilizes a structured text format that consists of four different types of data lines: (1) entity indicator (denoted by a "^"), (2) entity attribute (denoted by a "!"), (3) data table header (denoted by a "#"), and (4) data table rows (denoted by the lack of any of the above symbols). The structure of a typical SOFT file would be at least one entity indicator line, followed by at least one entity attribute line, followed by a single data table header line, followed by at least one data table row line.
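The four SOFT line types can be distinguished by their first character, as the following minimal Python sketch shows. The entity indicator character is taken here to be a caret (^), as in GEO's SOFT documentation; the function name, the type labels, and the sample lines are hypothetical and are not taken from a real GEO record.

```python
def classify_soft_line(line: str) -> str:
    """Classify one SOFT line by its leading character."""
    if line.startswith("^"):
        return "entity_indicator"
    if line.startswith("!"):
        return "entity_attribute"
    if line.startswith("#"):
        return "table_header"
    return "table_row"  # no leading symbol


# Hypothetical sample content, for illustration only.
sample_lines = [
    "^SAMPLE = example_sample",
    "!Sample_title = liver, vehicle control",
    "#ID_REF = probe identifier",
    "probe_1\t0.42",
]
line_types = [classify_soft_line(line) for line in sample_lines]
```

A full parser would also need to group attribute and table lines under the entity indicator that precedes them; this sketch covers only the line-level classification described in the text.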
Although many feel that data formats are merely a trivial software matter, investigators are the ones who will be most affected by the existence of a plethora of formats. As it is unlikely that a commercial software developer will align to every single data format, especially as the standards are constantly evolving, an investigator's choice of software may ultimately be affected by which data repository they choose to align with, or the expectations of a regulatory agency. This especially becomes problematic when dealing with XML formats where data may be formatted in many different ways. The existence of several different flavors of MAGE-ML, many of which are not compatible with the repositories, has significantly decreased the utility of the current version of MAGE-ML, and has helped to precipitate the development of MAGE v2 (Burgoon, 2006).
| ONTOLOGIES |
|---|
An ontology is a complete cataloguing and description of a particular field of study that defines all of the concepts, and their semantics, or relationships to each other. Controlled vocabularies are used in ontology generation to ensure that the concepts are unambiguous and orthogonal. Unlike OMs, ontologies are not tied to computational constructs, such as data and memory structures. Instead, the ontology is concerned with what concepts are contained within the field, what information is required for each concept to have existence, and how different concepts are related to each other. For instance, the Ontology for Biomedical Investigations (OBI; formerly the Functional Genomics Ontology (FuGO); http://obi.sourceforge.net/) specifies classes of entities that are required for describing and communicating functional genomics experiments, including the significant difference between experimental groups, termed a "P_value" (concepts in OBI/FuGO are given unique identifiers, in this case: FuGO_150). The P_value term is a type of "quantitative_confidence_value" (FuGO_83), thus FuGO_150 exhibits an "is-a" relationship with FuGO_83 (i.e., P_value is-a quantitative_confidence_value).
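The "is-a" relationship in the P_value example can be represented as a simple parent lookup. The two identifiers and their labels are those given in the text; the dictionary layout and function names are assumptions made for this sketch and do not reflect how OBI itself is encoded or distributed.

```python
# Concept labels and the single is-a link described in the text.
LABELS = {
    "FuGO_150": "P_value",
    "FuGO_83": "quantitative_confidence_value",
}
PARENT = {"FuGO_150": "FuGO_83"}  # child identifier -> parent identifier


def ancestors(term: str) -> list:
    """Follow is-a links upward and return every ancestor, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def is_subtype(child: str, parent: str) -> bool:
    """True if `child` is-a `parent`, directly or through intermediates."""
    return parent in ancestors(child)


statement = f"{LABELS['FuGO_150']} is-a {LABELS['FuGO_83']}"
```

The relationship is directional: P_value is a quantitative_confidence_value, but the reverse does not hold.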
Ontologies are application independent, existing at a higher level of abstraction than an OM. Depending upon their designs, ontologies have several uses, ranging from systems interoperability (i.e., interoperation of divergent systems, including software, workflows, data management, and data generation activities) to knowledge management, knowledge discovery, and artificial intelligence.
A common misconception is that markup languages and OMs are types of ontologies; however, the relationship between ontologies, markup languages, and OMs is more complex. An ontology may take any form necessary to communicate or represent it. An ontology may be communicated as a graph of nodes with arcs representing the semantics of the ontology, or as a markup language. This distinction defines an ontology as the concept, while OMs and markup languages are ways to describe the concept, share it with others, or to implement it within a software system.
| RELATIONSHIPS AMONG STANDARDS, CONTROLLED VOCABULARIES, ONTOLOGIES, AND MARKUP LANGUAGES |
|---|
Controlled vocabularies provide defined terms and concepts essential to the development of OMs, markup languages, ontologies, and standards (Fig. 1). OMs implement controlled vocabularies in the naming of classes within the model. Similarly, for markup languages, the controlled vocabulary provides the names of the tags that are used to denote specific concepts. Ontologies use the controlled vocabulary to describe concepts, properties of the concepts, and their semantics. Standards utilize controlled vocabularies to clearly define the terms being used, and to facilitate communication and implementation. Controlled vocabularies are essential for OMs, markup languages, and ontologies, as it is generally computationally expensive to accept the use of synonyms. Standards, however, do not have to utilize controlled vocabularies, but their use simplifies development and minimizes ambiguity and miscommunications.
Ontologies and reporting standards share similar goals, depending upon the purpose of the ontology. An ontology can be considered a computationally friendly, or mathematically defined, standard if it (1) completely models the field of toxicology, including subjects, treatments, chemicals, and outcomes, (2) defines semantic relationships between the concepts, and (3) differentiates between optional terms and those essential for a concept to exist. Both the ontology and the standard define minimal sufficiency to achieve compliance: for an ontology, this is the minimum information required to describe the concept; for a standard, this is the point where sufficient information is provided. Likewise, by defining the minimum amount of information required, it is possible to identify cases where sufficiency is not achieved. For instance, in developing an internal ontology for our laboratory, we have identified aspects of experimental annotation which are required to place experimental results within the context of the study (e.g., route of administration, husbandry details). This information is required by the ontology for a complete description of an experiment, and therefore is necessary for an experiment to exist. In the absence of the requested minimal information it is not possible to independently interpret the data and/or determine the experimental details involved in obtaining the data.
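Identifying cases where sufficiency is not achieved amounts to checking each record against a list of required annotations. The sketch below assumes, for illustration only, that the required set consists of the two examples named in the text; the field names, function name, and sample records are hypothetical and do not describe the laboratory's actual ontology.

```python
# Required annotations, using only the two examples named in the text.
REQUIRED_ANNOTATIONS = ("route_of_administration", "husbandry_details")


def missing_annotations(experiment: dict) -> list:
    """Return the required annotations that are absent or empty."""
    return [name for name in REQUIRED_ANNOTATIONS if not experiment.get(name)]


# Hypothetical experiment records.
complete = {
    "route_of_administration": "oral gavage",
    "husbandry_details": "12 h light/dark cycle",
    "dose_mg_per_kg": 10,  # optional annotation
}
incomplete = {"route_of_administration": "oral gavage"}

complete_gaps = missing_annotations(complete)
incomplete_gaps = missing_annotations(incomplete)
```

An empty result indicates that the record meets the minimal requirement; any names returned mark where the description falls short.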
To drive our knowledge management and discovery operations we are using the Web Ontology Language (OWL; http://www.w3.org/TR/owl-features/), an XML-based method for encoding an ontology. OWL-encoded files are machine-readable and understandable representations of our ontology and associated data. Reasoners, software applications that utilize the rules from the ontology to perform first-order logic and inference, are used to derive new knowledge, and to identify areas within our knowledgebase where insufficient data and information exist to comply with our standards.
| CURRENT EFFORTS AND FUTURE DIRECTIONS |
|---|
Several organizations have begun to focus attention on developing standards, markup languages, OMs, controlled vocabularies, and ontologies for various technologies and disciplines of biology (Table 1). This ever growing list includes reporting concepts for a wide diversity of technologies including cellular assays, flow cytometry, and RNAi, in addition to discipline-specific concepts, such as studies on Arabidopsis, environmental transcriptomics, and nutrigenomics.
Although no single reporting concept is all-encompassing for the entire field of toxicology, there is an effort to develop a foundational standard for toxicology data reporting that defines the information that is required to adequately describe the common toxicological aspects of biomedical investigations (Fostel et al., in press). It is envisioned that, once this standard has been vetted, updated, and approved by the toxicology community, it will serve as the foundation platform for more specific standards within toxicology to better enable the sharing and deposition of data into repositories. Utilizing this single base platform standard model will facilitate cross-technology data integration (e.g., transcriptomic, proteomic, metabolomic, clinical pathology, histopathology), as the same toxicology-specific metadata will be required to be reported in all cases. This will allow the creation of standard-based databases that are capable of large-scale data sharing, and the formation of data pipelines to streamline data deposition.
Standardizing the data and the databases also serves a practical function from the mechanistic and safety assessment standpoints. Standardized databases that can more readily share data will allow investigators to make better comparisons of data sets by quickly identifying the similarities and differences between studies. For instance, software that is tuned to the standards will be capable of identifying common study parameters such as exposure parameters (e.g., chemical treatment, vehicle chemical, initial exposure, dosing regimen, exposure duration, exposure dose, organ/tissue dose, route of administration), study timeline (e.g., subject age at exposure, subject age at sacrifice or harvest), and assay end-points (e.g., clinical pathology, histopathology, transcriptomics, proteomics, metabolomics). This allows investigators to quickly identify similar studies, to integrate results from different studies to enhance mechanistic data interpretation, and to identify gaps and overlaps in the data to effect more meaningful safety assessments.
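When studies report the same named parameters, finding their similarities and differences reduces to a comparison of keys and values. The following Python sketch assumes each study is available as a flat dictionary; the parameter names loosely follow the examples in the text, and the two studies and their values are invented for illustration.

```python
def compare_studies(study_a: dict, study_b: dict) -> dict:
    """Sort parameters into same, different, and reported by only one study."""
    common = study_a.keys() & study_b.keys()
    return {
        "same": sorted(k for k in common if study_a[k] == study_b[k]),
        "different": sorted(k for k in common if study_a[k] != study_b[k]),
        "unreported": sorted(study_a.keys() ^ study_b.keys()),
    }


# Hypothetical study metadata.
study_a = {
    "chemical_treatment": "TCDD",
    "vehicle_chemical": "sesame oil",
    "route_of_administration": "oral gavage",
    "exposure_duration_h": 24,
}
study_b = {
    "chemical_treatment": "TCDD",
    "vehicle_chemical": "corn oil",
    "route_of_administration": "oral gavage",
}

report = compare_studies(study_a, study_b)
```

The "unreported" list is where a reporting standard matters most: a parameter missing from one study cannot be compared at all.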
A data standard that requires specification of the dosing regimen and assay protocols in a standardized format will allow databases to be built which can more readily parse this information, and make more of the information available for cross-dataset comparison. This enables investigators to begin to (1) identify study parameters that differ between studies, (2) quickly assess how these differences may impact the results and how they may influence any differences between studies, (3) make determinations of the data quality, and (4) rank and prioritize the studies for inclusion within a safety assessment. For example, when compiling a database similar to the relative effect potency (REP2004) database (Haws et al., 2006; Van den Berg et al., 2006) to assess toxic equivalency factors, it is essential that all of the relevant studies report similar study parameters to allow stakeholders to make a reasonable judgment as to whether or not a study should be included. In this case, information regarding the calculated relative effect potency (REP) of the chemical relative to the potency of 2,3,7,8-tetrachlorodibenzo-p-dioxin is necessary, as are the assay end-points used to calculate the REP, treatment duration, animal strain (in vivo), cell type (in vitro), route of administration (in vivo), number of experimental units (e.g., animals) per group, time between the last treatment and the assay, the number of dose levels under consideration, whether a plateau was reached or a maximum dose was used, and how the REP was calculated. The REP2004 database, and its predecessors, were built by investigators combing the literature. By utilizing a standard data reporting format, coupled to an appropriately built and community-vetted standard, this same information could be deposited into a large-scale data repository to facilitate construction of the next generations of the REP and similar databases.
Although having a plethora of technology- and domain-specific standards shows a healthy interest in making data widely available and increasing the potential for reproducing studies, it also runs the risk of creating an undue data reporting burden. Many of these pioneering standards organizations and their supporters are beginning to assert that funding agencies must play a role in standards enforcement, and that data should be submitted to a repository (Brooksbank and Quackenbush, 2006; Fiehn et al., 2006). Although the goal of near complete adherence to the standards is admirable, there still remain many unresolved issues regarding funding agency enforcement: (1) how will funding agencies police and enforce standards compliance, (2) what will the repercussions be for noncompliance, (3) will funding agencies require deposition of data to a suitable repository, and (4) how will funding agencies compensate investigators for data deposition activities? In exchange for compliance, it is not unreasonable to expect that the scientific community, as well as the taxpayers, would want demonstrated evidence from the funding agencies that this additional reporting burden has benefited stakeholders and taxpayers, and has enhanced the current practice of scientific exchange.
These key benchmarks may become even more difficult to reach, as funding agency-based standards enforcement is not a "no-cost" program. Consider that, for an investigative team, the time and cost of data reporting for microarray data submissions alone may amount to several days or weeks. Consider further what would happen if an investigator were reporting data regarding cell-based assays, flow cytometry, and complementary DNA microarrays within the same study, and the time and costs associated with reporting all of these data. There are also costs associated with developing, managing, and maintaining the data repositories, which are funded through either public or private resources. All of these costs would be in addition to the costs associated with policing and enforcement activities that would have to be undertaken by the funding agency, including data reporting on oversight and stewardship. These real costs need to be considered and balanced against the benefits of a mandatory system, versus the costs and benefits of a voluntary reporting system.
| CONCLUSIONS |
|---|
Drug development and environmental health scientists must become cognizant of and involved in the ongoing standardization efforts, as organizations developing standards intend to use journals and funding agencies to enforce minimum public reporting and sharing requirements (Ball et al., 2004
A recent survey of GEO and ArrayExpress found that a majority of the submitted experiments within the sample failed to provide raw microarray data. The survey further concluded that stricter adherence to the MIAME standard must be enforced (Larsson and Sandberg, 2006). However, MIAME, which is a guideline as opposed to a standard, makes no specific requirements, including no absolute specifications requiring data deposition to repositories, nor the availability of raw data (Ball et al., 2002; Burgoon, 2006). Misinterpretation of the intent and content of the MIAME guidelines, which many incorrectly assume to be a standard, is rampant throughout the community, requiring a larger discourse to minimize further confusion.
The growing list of reporting concepts brings with it an increased potential for accessibility of biological data, but also harbors an increased reporting burden, and increased costs in terms of data reporting, infrastructure development, and policing and enforcement activities. At this time there has not been any substantial analysis of the costs and benefits associated with implementing all of these reporting concepts, comparisons of voluntary and mandatory reporting models, or the establishment and development of repositories. Only recently has a call for standardizing the standards, and identifying inefficiencies and redundancies, been made (Ball, 2006). No effort to consolidate the existing repositories or to develop a consolidated biological data repository has been made. Thus, the status quo is a disorganized system of tactically oriented grass-roots organizations, characterized by redundant and duplicative efforts, leading to inefficient use of talented resources, development of community fragmentation (i.e., already evidenced within the microarray community in re MAGE vs. SOFT), and further complicating any future data integration efforts. To combat this within the toxicology community, a group of scientists has created a "strawman" foundational toxicology standards proposal (Fostel et al., in press), upon which more technology- and community-focused standards may be built. By re-examining the standards movement within specific communities of practice, creating foundational frameworks within those communities, and building technology-specific standards upon the foundation, it will become possible to create better community-based data integration environments. But this can only be successful with widespread community input.
By becoming conversant in the data sharing vocabulary and technology, investigators will become educated assets to the standards and software development communities, with a voice in the development of new community standards. There is an ever-growing need for toxicology data standards to justify the large expenditure of resources in support of toxicogenomic data repositories (e.g., CEBS and the International Life Sciences Institute Health and Environmental Sciences Institute's ToxMIAMExpress). As the President of MGED pointed out in a recent commentary, MIAME was not developed by those most likely to be affected by it, which lessens the likelihood that it will be applied and used effectively (Ball, 2006). Nevertheless, journal editors, reviewers, and funding agencies continue to adopt or consider requiring these standards. Ideally, engagement with a more knowledgeable community will effect change with respect to the data reporting standards and repositories, making them more responsive to their overall goals and more useful to the user community.
| ACKNOWLEDGMENTS |
|---|
The author would like to thank Dr Tim Zacharewski, Dr Jeremy Burt, Ed Dere, Cora Fong, and Josh Kwekel for their critical reading of this manuscript.
| REFERENCES |
|---|
Ball CA. Are we stuck in the standards? Nat. Biotechnol. (2006) 24:1374–1376.
Ball CA, Brazma A, Causton H, Chervitz S, Edgar R, Hingamp P, Matese JC, Parkinson H, Quackenbush J, Ringwald M, et al. Submission of microarray data to public repositories. PLoS Biol. (2004) 2:E317.
Ball CA, Sherlock G, Parkinson H, Rocca-Serra P, Brooksbank C, Causton HC, Cavalieri D, Gaasterland T, Hingamp P, Holstege F, et al. Standards for microarray data. Science (2002) 298:539.
Berman JJ. Pathology data integration with eXtensible Markup Language. Hum. Pathol. (2005) 36:139–145.
Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, et al. ArrayExpress—A public repository for microarray gene expression data at the EBI. Nucleic Acids Res. (2003) 31:68–71.
Brooksbank C, Quackenbush J. Data standards: A call to action. OMICS (2006) 10:94–99.
Burgoon LD. The need for standards, not guidelines, in biological data reporting and sharing. Nat. Biotechnol. (2006) 24:1369–1373.
Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. (2002) 30:207–210.
Fiehn O, Kristal B, van Ommen B, Sumner LW, Sansone SA, Taylor C, Hardy N, Kaddurah-Daouk R. Establishing reporting standards for metabolomic and metabonomic studies: A call for participation. OMICS (2006) 10:158–163.
Fostel JM, Burgoon LD, Zwickl C, Lord PG, Corton JC, Bushel P, Cunningham M, Fan L, Edwards SW, Hester S, et al. Towards a checklist for exchange and interpretation of data from a toxicology study. Toxicol. Sci. (in press).
Haws LC, Su SH, Harris M, Devito MJ, Walker NJ, Farland WH, Finley B, Birnbaum LS. Development of a refined database of mammalian relative potency estimates for dioxin-like compounds. Toxicol. Sci. (2006) 89:4–30.
Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, et al. The HUPO PSI's molecular interaction format—A community standard for the representation of protein interaction data. Nat. Biotechnol. (2004) 22:177–183.
Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, et al. The systems biology markup language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics (2003) 19:524–531.
Jones AR, Pizarro A, Spellman P, Miller M. FuGE: Functional Genomics Experiment Object Model. OMICS (2006) 10:179–184.
Larsson O, Sandberg R. Lack of correct data format and comparability limits future integrative microarray research. Nat. Biotechnol. (2006) 24:1322–1323.
Mattes WB, Pettit SD, Sansone SA, Bushel PR, Waters MD. Database development in toxicogenomics: Issues and efforts. Environ. Health Perspect. (2004) 112:495–505.
Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. (2002) 3:RESEARCH0046.
Van den Berg M, Birnbaum LS, Denison M, De Vito M, Farland W, Feeley M, Fiedler H, Hakansson H, Hanberg A, Haws L, et al. The 2005 World Health Organization reevaluation of human and mammalian toxic equivalency factors for dioxins and dioxin-like compounds. Toxicol. Sci. (2006) 93:223–241.
Waters M, Boorman G, Bushel P, Cunningham M, Irwin R, Merrick A, Olden K, Paules R, Selkirk J, Stasiewicz S, et al. Systems toxicology and the Chemical Effects in Biological Systems (CEBS) knowledge base. EHP Toxicogenomics (2003) 111:15–28.
Xirasagar S, Gustafson S, Merrick BA, Tomer KB, Stasiewicz S, Chan DD, Yost KJ 3rd, Yates JR 3rd, Sumner S, Xiao N, et al. CEBS object model for systems biology data, SysBio-OM. Bioinformatics (2004) 20:2004–2015.