
Using Define-XML for Faster, Better Quality and More Efficient Studies

Since its inception in 1997, the Clinical Data Interchange Standards Consortium (CDISC) has developed and supported globally adopted data standards to improve clinical trial efficiency. Clinical data standards are now recognized as playing a vital role in the entire end-to-end clinical trial process. Standardization allows for faster, better quality and less costly drug discovery. 

One of the most widely used standards today is Define-XML. The latest version of Define-XML is v2.1.7, which went live in March 2024. Get release information on the CDISC website.

What is Define-XML? 

According to CDISC: “Define-XML is required by the United States Food and Drug Administration (FDA) and the Japanese Pharmaceuticals and Medical Devices Agency (PMDA) for every study in each electronic submission to inform the regulators which datasets, variables, controlled terms, and other specified metadata were used.” 

The FDA’s Technical Conformance Guide explains that Define-XML is “arguably the most important part of the electronic dataset submission for regulatory review” because it helps reviewers gain familiarity with study data, its origins and derivations. 

The standard itself is known as ‘Define-XML’. The file that’s submitted to the FDA upon completion is the data definition file, known simply as ‘define.xml’. 

Define-XML as a dataset descriptor 

It is commonly thought that Define-XML is simply a dataset descriptor: a way to document what datasets look like, including the names and labels of datasets and variables, the terminology used, and so on. This is essentially what Define-XML was created for.
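To make the "dataset descriptor" idea concrete, here is a minimal sketch that reads dataset and variable names and labels out of a heavily trimmed Define-XML fragment. The fragment is an illustration only: it uses the ODM element names that Define-XML builds on, but omits the `def:` extension attributes, codelists, and other content a real define.xml contains, and the study and OID values are made up.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
NS = {"odm": ODM_NS}

# Trimmed, illustrative fragment (not a complete or valid define.xml).
DEFINE_FRAGMENT = """\
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="STUDY1">
    <MetaDataVersion OID="MDV1" Name="Example metadata">
      <ItemGroupDef OID="IG.DM" Name="DM" Repeating="No" Purpose="Tabulation">
        <Description><TranslatedText xml:lang="en">Demographics</TranslatedText></Description>
        <ItemRef ItemOID="IT.DM.USUBJID" Mandatory="Yes"/>
        <ItemRef ItemOID="IT.DM.SEX" Mandatory="Yes"/>
      </ItemGroupDef>
      <ItemDef OID="IT.DM.USUBJID" Name="USUBJID" DataType="text" Length="20">
        <Description><TranslatedText xml:lang="en">Unique Subject Identifier</TranslatedText></Description>
      </ItemDef>
      <ItemDef OID="IT.DM.SEX" Name="SEX" DataType="text" Length="1">
        <Description><TranslatedText xml:lang="en">Sex</TranslatedText></Description>
      </ItemDef>
    </MetaDataVersion>
  </Study>
</ODM>
"""


def label_of(element):
    """Return the text of Description/TranslatedText, or '' if absent."""
    text = element.find("odm:Description/odm:TranslatedText", NS)
    return text.text if text is not None else ""


def describe_datasets(define_xml):
    """Map each dataset name to its label and its (name, label, type) variables."""
    root = ET.fromstring(define_xml)
    # Index variable definitions by OID so ItemRef elements can be resolved.
    items = {item.get("OID"): item for item in root.iter(f"{{{ODM_NS}}}ItemDef")}
    datasets = {}
    for group in root.iter(f"{{{ODM_NS}}}ItemGroupDef"):
        variables = []
        for ref in group.findall("odm:ItemRef", NS):
            item = items[ref.get("ItemOID")]
            variables.append((item.get("Name"), label_of(item), item.get("DataType")))
        datasets[group.get("Name")] = {"label": label_of(group), "variables": variables}
    return datasets


datasets = describe_datasets(DEFINE_FRAGMENT)
for name, info in datasets.items():
    print(name, "-", info["label"])
    for var_name, var_label, data_type in info["variables"]:
        print(f"  {var_name:<8} {var_label} ({data_type})")
```

Because the descriptor is machine-readable, the same metadata that documents a dataset can also drive the design and validation steps described in the rest of this article.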

But by instead thinking of Define-XML as a tool to create better quality, more efficient clinical studies, users can unlock the true potential of the standard. 

Progressive uses of Define-XML 

You can use Define-XML to help you optimize the end-to-end clinical trial process in the following ways: 

1) Use Define-XML in your CRF design process

Many organizations treat Define-XML as an afterthought: only once case report forms (CRFs) have been designed, data has been collected and the study is complete do they think about creating the define.xml file for FDA submission.

But this approach can lead to incomplete data, protocol amendments, complex mapping, and increased quality control effort. How do you know when designing the CRF that you’re collecting all the relevant SDTM data? And when the data has been collected, how can you verify the submission is what you intended when you have no study definition to compare it with? It can take valuable time and resources to make sure all data has been collected in the right format, and this can ultimately delay the study.

A more efficient approach is to use Define-XML to define your study, end-to-end, right at the start. This includes defining SDTM, SEND, and ADaM datasets upfront. 

Using Define-XML and SDTM to design submission datasets at the start of a study makes it easier to set up your study and create your CRFs. By setting out what information should ultimately appear in your submission datasets before you collect any patient data, you can create CRFs with confidence, knowing that you’re collecting all the required information in the right format. 

For example, the SDTM standard gives the ‘Identifier’, ‘Topic’, ‘Qualifier’, and ‘Timing’ variables required in your submission datasets. If you know upfront what variables to use, you can create your CRFs accordingly. 

You can also annotate your CRFs with SDTM variables upfront. This helps ensure all your collected data has a place in SDTM, and has the additional benefit of providing basic mapping between the forms and the datasets. CDISC also provides a permissible mechanism for extending Define-XML, which allows additional metadata to be stored, such as complex dataset mappings (e.g. how data from two sources may be merged into a single dataset).
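As a rough illustration of checking annotations upfront, the sketch below compares a set of annotated CRF fields with the SDTM variables planned for the submission datasets. The field names, the annotation dictionary, and the `check_crf_coverage` helper are all hypothetical and are not part of any CDISC standard or tool.

```python
# Hypothetical CRF fields, each annotated with its target SDTM variable
# (None means the field has not been annotated yet).
crf_annotations = {
    "Date of birth": "DM.BRTHDTC",
    "Sex": "DM.SEX",
    "Systolic blood pressure": "VS.VSORRES",
    "Comments": None,
}

# Hypothetical set of variables planned in the study's dataset definitions.
planned_variables = {"DM.BRTHDTC", "DM.SEX", "DM.RACE", "VS.VSORRES"}


def check_crf_coverage(annotations, planned):
    """Return (unannotated fields, unknown targets, planned variables with no CRF source)."""
    unannotated = sorted(field for field, target in annotations.items() if target is None)
    targets = {target for target in annotations.values() if target is not None}
    unknown = sorted(targets - planned)
    uncollected = sorted(planned - targets)
    return unannotated, unknown, uncollected


unannotated, unknown, uncollected = check_crf_coverage(crf_annotations, planned_variables)
print("CRF fields with no SDTM target:", unannotated)
print("Annotations pointing at unplanned variables:", unknown)
print("Planned variables with no CRF source:", uncollected)
```

A planned variable with no CRF source is not necessarily a problem, since many SDTM variables are derived or assigned rather than collected. The value of the check is that such gaps are reviewed before data collection starts rather than after.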


In this way, using Define-XML upfront, rather than retrospectively, can help you ensure your study is a success.  

2) Use Define-XML in EDC data conversions 

Define-XML is not limited to describing CDISC SDTM and ADaM dataset structures. The proprietary dataset formats exported from an electronic data capture (EDC) system can also be described using the Define-XML model. With the right tools, you can automatically generate a Define-XML file that describes the EDC export datasets from the CRFs/eCRFs themselves. This can then be displayed in a readable HTML or PDF format, giving early visibility of the datasets that the EDC system will deliver.

The source proprietary dataset specification enables upfront mapping of EDC datasets to SDTM datasets. These mappings can be described (and made machine-executable) using extensions to Define-XML, and human-readable SDTM mapping specifications can be produced from them automatically, aiding review and approval of the mappings.

In addition, the Define-XML mapping extensions provide a machine-executable format that can be processed by data transformation code to enable the automatic conversion of datasets in commercially available tools. 
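The idea of a machine-executable mapping can be sketched in a few lines. The mapping format below is a plain Python stand-in invented for illustration; it is not the Define-XML extension syntax, which depends on the tool used. The EDC field names and the `apply_mapping` helper are likewise hypothetical.

```python
# Hypothetical EDC export records.
edc_records = [
    {"SUBJ": "001", "GENDER": "Male", "BIRTH_DATE": "1980-02-29"},
    {"SUBJ": "002", "GENDER": "Female", "BIRTH_DATE": "1975-11-03"},
]

SEX_CODES = {"Male": "M", "Female": "F"}

# Each rule names a target SDTM variable and either a constant value or a
# source field with an optional transform. This structure is an assumption.
mapping_spec = [
    {"target": "STUDYID", "constant": "STUDY1"},
    {"target": "USUBJID", "source": "SUBJ", "transform": lambda v: f"STUDY1-{v}"},
    {"target": "SEX", "source": "GENDER", "transform": SEX_CODES.get},
    {"target": "BRTHDTC", "source": "BIRTH_DATE"},
]


def apply_mapping(records, spec):
    """Build one target row per source record by applying every mapping rule."""
    output = []
    for record in records:
        row = {}
        for rule in spec:
            if "constant" in rule:
                row[rule["target"]] = rule["constant"]
            else:
                value = record[rule["source"]]
                transform = rule.get("transform")
                row[rule["target"]] = transform(value) if transform else value
        output.append(row)
    return output


dm_rows = apply_mapping(edc_records, mapping_spec)
for row in dm_rows:
    print(row)
```

Keeping the rules as data rather than hand-written conversion code is what allows the same specification to be rendered for human review and executed by transformation tooling.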

The diagram below shows the flow of data from data capture through to CDISC datasets and the part CDISC metadata plays. CDISC ODM metadata is used to design the data capture forms, and Define-XML metadata is used to design the destination datasets. This vendor-neutral metadata can form the basis of form and dataset libraries that can be re-used from study to study.

[Diagram: flow of data from data capture through to CDISC datasets, showing the role of CDISC metadata]

3) Creating and re-using dataset libraries 

Define-XML is the perfect tool to help you create libraries of datasets (EDC, SDTM, ADaM), mappings, page links to CRF variables, and so on for re-use from one study to the next. 

A metadata-driven approach using Define-XML can optimize a single study from set-up to submission. But creating libraries of reusable metadata will make future studies even more efficient. 

If you have a library of data acquisition forms, proprietary EDC datasets, SDTM and ADaM datasets, and dataset mappings that are approved internally and ready to use, you’ll only have to create new content where there is a specific requirement for it. All other approved metadata is already there in your library.
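As a simple illustration of library re-use, the sketch below splits the datasets a new study needs into those that can be taken from an approved library and those that need new content. The library structure, the status values, the custom domain name "XO", and the `plan_study_metadata` helper are hypothetical.

```python
# Hypothetical metadata library: dataset name -> version and approval status.
library = {
    "DM": {"version": "1.2", "status": "approved"},
    "AE": {"version": "2.0", "status": "approved"},
    "VS": {"version": "1.0", "status": "draft"},
}

# Datasets required by a new study; "XO" stands for a study-specific custom domain.
study_needs = ["DM", "AE", "VS", "XO"]


def plan_study_metadata(needs, lib):
    """Split required datasets into approved library content and content still to create."""
    reuse, create = [], []
    for name in needs:
        entry = lib.get(name)
        if entry is not None and entry["status"] == "approved":
            reuse.append(name)
        else:
            create.append(name)
    return reuse, create


reuse, create = plan_study_metadata(study_needs, library)
print("Re-use from library:", reuse)
print("Create or approve for this study:", create)
```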

4) Automating dataset validation 

Another major advantage to defining datasets upfront is that validation can also be done up front. By creating a prospective definition of the intended datasets at the start of the study, it is possible to machine-validate study dataset designs for conformance to external standards. It is also possible to validate that populated datasets match the original specifications. This way, data quality and submission compliance are built-in upfront with less reliance on downstream validation. 
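A minimal sketch of validating populated data against a prospective specification might look like the following. The specification format (a length and a mandatory flag per variable), the sample rows, and the `validate_rows` helper are assumptions made for illustration; a real check would read these attributes from the study's Define-XML.

```python
# Hypothetical specification for a DM dataset.
dm_spec = {
    "USUBJID": {"length": 12, "mandatory": True},
    "SEX": {"length": 1, "mandatory": True},
    "RACE": {"length": 40, "mandatory": False},
}

# Hypothetical populated rows, two of which deviate from the specification.
dm_rows = [
    {"USUBJID": "STUDY1-001", "SEX": "M", "RACE": "ASIAN"},
    {"USUBJID": "STUDY1-002", "SEX": "", "RACE": ""},
    {"USUBJID": "STUDY1-003", "SEX": "Female", "AGEU": "YEARS"},
]


def validate_rows(rows, spec):
    """Return (row number, variable, problem) for each deviation from the specification."""
    findings = []
    for number, row in enumerate(rows, start=1):
        # Variables present in the data but absent from the specification.
        for name in row:
            if name not in spec:
                findings.append((number, name, "not in specification"))
        # Specified variables that are missing, empty, or too long.
        for name, rule in spec.items():
            value = row.get(name, "")
            if rule["mandatory"] and value == "":
                findings.append((number, name, "mandatory value missing"))
            elif len(value) > rule["length"]:
                findings.append((number, name, "exceeds defined length"))
    return findings


findings = validate_rows(dm_rows, dm_spec)
for finding in findings:
    print(finding)
```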

We go into a little more detail on validation possibilities below: 

  1. Compare study dataset designs, including controlled terminology, to external and internal standards

    When designing SDTM datasets and creating controlled terms, it is imperative that they comply with the chosen version (often the latest) of National Cancer Institute Controlled Terminology (NCI CT). During the dataset design phase, automatic comparisons and compliance checks should be made against the appropriate version of NCI CT.

    Companies may also need to develop their own domains that comply with CDISC SDTM but include content that falls outside the standard Implementation Guide domains. For example, specialist findings domains may be required for a particular therapeutic area. In this scenario, companies should compare study dataset designs against their own data standards to check for differences and either accept or reject them accordingly.

  2. Compare ‘As specified’ study dataset specification against ‘As delivered’ study dataset designs

    Increasingly, studies are outsourced to Contract Research Organizations (CROs), which increases the burden on sponsors in two areas: (a) upfront specification of deliverables and (b) downstream validation of those deliverables.

    When dataset validation is done upfront, a human-readable target SDTM specification (in HTML, PDF, Word or Excel) can be given to a CRO to describe what is expected to be in the delivered datasets: an ‘As specified’ study dataset specification.

    When CROs return the datasets, they should also provide ‘As delivered’ study dataset metadata. With both ‘As specified’ and ‘As delivered’ study dataset metadata available, it is easy to compare the two and verify that the ‘As delivered’ datasets match what was specified.
     
     
  3. Compare dataset data to dataset metadata and SDTM or ADaM

    Having a target SDTM Define-XML available upfront allows automated comparison of delivered datasets against study dataset metadata, either ‘As specified’ or ‘As delivered’. Comparing data to the ‘As specified’ Define-XML verifies that the data matches what was originally intended. Comparing data to the ‘As delivered’ Define-XML ensures that the data matches the dataset definition. This is important because it is the ‘As delivered’ Define-XML that will ultimately be submitted to the FDA.
     
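The comparisons in the list above lend themselves to automation. The sketch below shows two of them under stated assumptions: a controlled terminology check against a reference codelist, and a difference report between ‘As specified’ and ‘As delivered’ metadata. The dictionaries stand in for metadata that would normally be read from Define-XML files, the reference terms are an illustrative subset rather than an actual NCI CT release, and the `diff_metadata` helper is hypothetical.

```python
# Illustrative subset of a reference codelist; check the real NCI CT release in use.
reference_sex_terms = {"F", "M", "U", "UNDIFFERENTIATED"}
study_sex_terms = {"M", "F", "Male"}
nonconforming_terms = sorted(study_sex_terms - reference_sex_terms)

# Hypothetical metadata: dataset -> variable -> attributes.
as_specified = {
    "DM": {"USUBJID": {"type": "text", "length": 20}, "SEX": {"type": "text", "length": 1}},
    "VS": {"VSTESTCD": {"type": "text", "length": 8}},
}
as_delivered = {
    "DM": {"USUBJID": {"type": "text", "length": 20}, "SEX": {"type": "text", "length": 6}},
    "AE": {"AETERM": {"type": "text", "length": 200}},
}


def diff_metadata(specified, delivered):
    """List dataset, variable, and attribute differences between two metadata sets."""
    differences = []
    for dataset in sorted(set(specified) | set(delivered)):
        if dataset not in delivered:
            differences.append(f"{dataset}: specified but not delivered")
            continue
        if dataset not in specified:
            differences.append(f"{dataset}: delivered but not specified")
            continue
        spec_vars, deliv_vars = specified[dataset], delivered[dataset]
        for variable in sorted(set(spec_vars) | set(deliv_vars)):
            if variable not in deliv_vars:
                differences.append(f"{dataset}.{variable}: missing from delivery")
            elif variable not in spec_vars:
                differences.append(f"{dataset}.{variable}: not in specification")
            else:
                for attribute, expected in spec_vars[variable].items():
                    actual = deliv_vars[variable].get(attribute)
                    if actual != expected:
                        differences.append(
                            f"{dataset}.{variable}: {attribute} specified as "
                            f"{expected!r}, delivered as {actual!r}"
                        )
    return differences


differences = diff_metadata(as_specified, as_delivered)
print("Terms not in reference codelist:", nonconforming_terms)
for line in differences:
    print(line)
```

Each reported difference still needs a human decision: it may be an error in the delivery, or an intentional change that should be accepted into the specification.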

Define.xml file submission

As we’ve shown, there are many benefits to using Define-XML – not only as a dataset descriptor, but to streamline the clinical study process. 

Define-XML should not be thought of as simply a submission deliverable, but as a CDISC model that helps optimize the end-to-end clinical trial process. It can be used to establish dataset libraries that promote study-to-study re-use, as well as to drive efficiencies through expedited study set-up and streamlined dataset conversions. 

Define-XML submission checklist 

Learn what mistakes to avoid when creating your define.xml with our submission checklist. 

How detailed should you be?  

When should you create a PDF file? 

Our checklist will help you avoid the pitfalls and tell you all you need to know to create a compliant define.xml submission.

About the author

By: Ed Chappell

Ed Chappell has been working as a Solutions Consultant with Formedix, now part of Certara, for over 15 years, and has 22 years’ experience in data programming. He authored and presents our training courses for SEND, SDTM, ODM-XML, Define-XML and Dataset-XML.

Ed was heavily involved in the development of our dataset mapper and works closely with customers on SDTM dataset mapping. As an expert in clinical data programming, Ed also supports customers with Interim Analysis (IA) SDTM and FDA SDTM clinical trial submissions.