Using the same element for multiple purposes isn't always wise!
The current fashion is to standardize at the level of the individual common data element (CDE). Thus, when standards are developed, they tend to focus more on the characteristics of the CDE alone and less on its relationships with other CDEs. While this may initially seem to be the simplest and most logical approach, it risks omitting information that later users may need to interpret the data correctly, especially when the data are pooled for meta-analyses or data mining. If “quality data” are defined as “data that are fit for their intended uses,”1 then this omission threatens the quality of both the data and the conclusions drawn from them.
The Case for CDEs
Many organizations are developing CDE libraries, from CDISC2 to many parts of the US National Institutes of Health3 to medical associations such as the American Heart Association4. They differ in their uses of the data, e.g., for patient care vs. clinical research, but there are some basic characteristics that most include in their definitions, which can be found in Table 1. Some even standardize categories of elements, e.g., generic yes/no question, or generic start and end dates. Many standards are robust and resolve many issues, such as different variable names in different databases, different code lists, conflicting types (e.g., character vs. numeric) and so on, making data-sharing easier and more reliable. It makes sense to organize standards by CDE; from a data collection point of view, it is the smallest independent unit that cannot be further subdivided, and it can be grouped in different ways to form case report forms, health care charts, etc.
The Challenges with CDEs
There are drawbacks to this approach. Many CDEs are not independent: they either depend on or support other CDEs, without which they lose their meaning. Some of these relationships may seem self-evident, and many think they do not need to be defined, but other relationships are less obvious. Where and how to define these relationships is a challenge, as they reasonably belong in the definitions of all members of the relationship. It is not considered good practice, however, to duplicate information, and some relationships may not hold for all studies. Also, data have many lives, from capture (e.g., on a case report form (CRF) for a clinical study), to inclusion in a regulatory submission, to data repositories, to supporting clinical care, and much else that cannot even be envisioned yet. Even supposedly self-evident relationships may not be so obvious in other contexts.
Because the data will outlive their initial purpose, it is important for users to understand the assumptions and constraints that influenced the data elements’ definitions, or they may use the data inappropriately. Even if the data elements library accompanies the data and the elements were used exactly as defined, the libraries often do not define these relationships.
Examples of Challenges
1. Measurements and Units of Measure
This is the most obvious example. Measurements have no meaning if the unit of measure is not defined. For example, a weight of 25 is not useful unless we know whether it is in kilograms or pounds. Both are plausible pediatric values, but they can have different implications depending upon the age or condition of the subject. Many organizations preprint or “paint” the units on the CRF because the form filler either simply knows to use one unit or the other, or the protocol defines it. When the data are entered into a database or electronic CRF (eCRF), the units may be placed in the label of the measurement (perhaps to minimize the number of variables required) or may not be captured electronically at all, on the premise that the protocol defined the unit, so it can have only one value and therefore does not have to be “databased.” If the data without units from one organization are later pooled with data from another, one can no longer assume that the units are the same, and the weights become unusable.
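One way to keep the unit attached to the value is to store the two as a pair and refuse to pool any record whose unit is missing. The following is a minimal Python sketch of that idea; the `Weight` class, the `to_kg` function and the unit codes `kg` and `lb` are illustrative assumptions, not part of any CDE standard.

```python
from dataclasses import dataclass
from typing import Optional

# Conversion factors to kilograms; the unit codes are hypothetical examples.
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237}


@dataclass(frozen=True)
class Weight:
    value: float
    unit: Optional[str]  # None models a unit that was never databased


def to_kg(weight: Weight) -> float:
    """Convert a weight to kilograms, refusing to guess a missing unit."""
    if weight.unit is None:
        raise ValueError("weight has no unit and cannot be pooled safely")
    if weight.unit not in KG_PER_UNIT:
        raise ValueError(f"unknown unit: {weight.unit!r}")
    return weight.value * KG_PER_UNIT[weight.unit]


# The same number means very different things in different units.
print(round(to_kg(Weight(25, "kg")), 2))  # 25.0
print(round(to_kg(Weight(25, "lb")), 2))  # 11.34
```

A record such as `Weight(25, None)` raises an error instead of being silently treated as kilograms, which is the failure the pooled-data scenario calls for.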
2. Adverse Event and Severity of Event
There are two potential challenges here. The first is that Severity of Event has no meaning by itself, but only in relationship to the adverse event (AE). This may seem obvious, but when specific AEs are assessed they may be pre-printed on the CRF, and because the focus is on the study, it is assumed that everyone knows what the AE was. If the AE term is not included in the database, all the AE data become unusable, and not just the severity. The more general point here is that any element that describes or modifies another element may not be usable unless the modified element is included. The second challenge, while not directly about combinations of elements, is also a potential quality issue.
There are typically two code lists (or sets of controlled terminology) used with the severity field: Mild, Moderate and Severe (for most AEs), and the same three plus Life-Threatening and Fatal. It should be very clear which one is used in each study, and this information should remain with the data throughout their lifecycle. If only Mild, Moderate and Severe are present in the database, it is otherwise unknown whether Life-Threatening and Fatal were absent from the CRF or were simply not observed in the study. The same is true wherever subsets of code lists are used.
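The distinction between a response that was not offered and one that was offered but never chosen can be preserved by storing the offered code list alongside the responses. This is a small Python sketch under that assumption; `SeverityData` and `status` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# The two severity code lists described in the text.
SEVERITY_3 = frozenset({"Mild", "Moderate", "Severe"})
SEVERITY_5 = SEVERITY_3 | {"Life-Threatening", "Fatal"}


@dataclass(frozen=True)
class SeverityData:
    offered: FrozenSet[str]  # code list that appeared on the CRF
    observed: List[str]      # responses actually recorded


def status(data: SeverityData, term: str) -> str:
    """Report whether a term was observed, offered but unused, or never offered."""
    if term in data.observed:
        return "observed"
    if term in data.offered:
        return "offered, not observed"
    return "not offered"


# Two studies with identical responses but different code lists on the CRF.
study_a = SeverityData(SEVERITY_3, ["Mild", "Severe"])
study_b = SeverityData(SEVERITY_5, ["Mild", "Severe"])
print(status(study_a, "Fatal"))  # not offered
print(status(study_b, "Fatal"))  # offered, not observed
```

Without the `offered` field, the two studies would be indistinguishable once their data were pooled.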
3. Reference Ranges (aka Normal Ranges)
Reference ranges are elements that define the highest and lowest expected values of a specific response. The most common reference ranges are the upper and lower limits associated with laboratory tests. For example, the expected minimum and maximum values of urine glucose for healthy individuals might be 0 mg/dL and 15 mg/dL respectively. A value of 20 mg/dL is not interpretable in the absence of the reference range. Although it requires more storage space, it is best to include the relevant reference range on every record in a lab dataset as this helps to prevent ambiguity. While ranges for many tests can be found in textbooks or online, these may not be applicable to the subject population in the study, which can influence the interpretation of the data and affect the reliability of the conclusions. This requirement should be specified in the standard data element library.
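As a rough illustration of carrying the range on every record, the Python sketch below stores the limits with each result and declines to interpret a value when they are absent. The `LabResult` class, its field names and the flag labels are assumptions made for this example; the urine glucose figures are the ones used above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabResult:
    test: str
    value: float
    unit: str
    ref_low: Optional[float] = None   # lower limit of the reference range
    ref_high: Optional[float] = None  # upper limit of the reference range


def flag(result: LabResult) -> str:
    """Classify a result against its own reference range."""
    if result.ref_low is None or result.ref_high is None:
        return "UNINTERPRETABLE"  # no range travelled with the record
    if result.value < result.ref_low:
        return "LOW"
    if result.value > result.ref_high:
        return "HIGH"
    return "NORMAL"


with_range = LabResult("Urine glucose", 20, "mg/dL", ref_low=0, ref_high=15)
without_range = LabResult("Urine glucose", 20, "mg/dL")
print(flag(with_range))     # HIGH
print(flag(without_range))  # UNINTERPRETABLE
```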
4. Image Quality
A recent set of CRFs captured information about images, and the following question was seen:
Indicate the quality of the image:
- Adequate quality
- Exemplary quality
- Limited quality
- Not adequate quality
5. Overlapping Elements
In a data elements library reviewed recently, the following three CDEs were present:
1. Reason test was not completed
- Equipment failure/error
- Medical reason
- Other
- Participant death
- Participant refusal
- Participant withdrew
- Scheduling problem
- Unknown
3. Medical reason test was not completed
- Abnormal laboratory level
- Adverse Event
- Claustrophobia
- Injection complication
- Progressive disease
- Question 1 asks for a general reason why the test was not completed, and one of the responses is Medical Reason.
- Question 2 provides a place to specify a reason if it was not included in the list.
- Question 3 asks for the specific medical reason.
This issue could be resolved either by not using the three elements together, or by laying out the CRF such that the Medical Reason list from Question 3 was next to the Medical Reason response in Question 1, although this works better for paper CRFs than it does for electronic ones. For electronic ones, it might be managed using cursor controls where the list of medical reasons is only accessible if Medical Reason is checked for Question 1.
In any case, the issue remains that this kind of information is not typically included in the CDE library and thus may not be apparent to users, leading to the same data being captured inconsistently.
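The conditional rule described for electronic CRFs, where the medical reason list is only available when Medical Reason is selected, can also be written down as an explicit check that travels with the data. The Python sketch below is one possible form; the `check_reasons` function is hypothetical, and the rule that a medical reason is required whenever Medical reason is selected is an assumption of this example rather than something the library states.

```python
from typing import List, Optional

# Responses copied from the two lists in the text.
GENERAL_REASONS = {
    "Equipment failure/error", "Medical reason", "Other", "Participant death",
    "Participant refusal", "Participant withdrew", "Scheduling problem", "Unknown",
}
MEDICAL_REASONS = {
    "Abnormal laboratory level", "Adverse Event", "Claustrophobia",
    "Injection complication", "Progressive disease",
}


def check_reasons(general: str, medical: Optional[str]) -> List[str]:
    """Return a list of problems with a pair of responses (empty if consistent)."""
    problems = []
    if general not in GENERAL_REASONS:
        problems.append(f"unknown general reason: {general!r}")
    if general == "Medical reason":
        # Assumed rule: the specific medical reason must then be supplied.
        if medical is None:
            problems.append("medical reason is required but missing")
        elif medical not in MEDICAL_REASONS:
            problems.append(f"unknown medical reason: {medical!r}")
    elif medical is not None:
        problems.append("medical reason given without 'Medical reason' selected")
    return problems


print(check_reasons("Medical reason", "Claustrophobia"))  # []
print(check_reasons("Scheduling problem", "Claustrophobia"))
```

The second call reports an inconsistency, because a specific medical reason was recorded although the general reason was not Medical reason.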
Conclusions
For data to be high quality, current and future users must understand how to use the data appropriately. These examples show that having standard CDEs is not enough to ensure high-quality data, and that users must also understand, among many other things, the relationships that elements have to each other and how to use them together. This may seem obvious, and perhaps people who do not understand it should not be designing clinical trial data. That may or may not be true, but what is obvious to one person is not necessarily obvious to another, especially when the uses are very different. In some cases it may be appropriate to include these relationships in CDE libraries, but in others it may not be, for example because the relationships are very study- or institution-specific. In addition, CDE libraries are not typically stored with the data and may not be available to later users. What is really needed is a mechanism for ensuring that these design and handling rules and assumptions stay associated with the data as they progress through their lifecycle, and that the data are defined in a way that is easy to review and apply. No such structure is currently available, but to ensure that the data repositories of the future are robust and that their data are used appropriately, this issue will have to be addressed.
Table 1. Data Element Characteristics that are Commonly Standardized
End Notes
1 Institute of Medicine. Assuring Data Quality and Validity in Clinical Trials for Regulatory Decision Making. Jonathan R. Davis, Vivian P. Nolan, Janet Woodcock, and Ronald W. Estabrook, Editors. National Academy Press, c. 1999
2 Clinical Data Interchange Standards Consortium. www.cdisc.org
3 National Institutes of Health. Some institutes that have published CDE information:
National Cancer Institute Data Standards caBIG: https://caBIG.nci.nih.gov
National Institute of Neurological Disorders and Stroke: http://www.ninds.nih.gov/research/clinical_research/toolkit/common_data_elements.htm
Office of Rare Diseases Research: http://www.grdr.info/index.php/common-data-elements
4 American Heart Association: http://circ.ahajournals.org/content/112/12/1888.full
© Kit Howard, Kestrel Consultants, Inc.
Originally Appeared in K-News, Feb 2012.