The Data Validation Process

The GEMINI 2010-2017 dataset comprises administrative and clinical data from 245,599 patient admissions at 7 hospitals in Ontario.

Assessing the quality of data extracted from hospitals

Pasricha SV, Jung HY, Kushnir V, Mak D, Koppula R, Guo Y, Kwan JL, Lapointe-Shaw L, Rawal S, Tang T, Weinerman A, Razak F, Verma AA.

To assess GEMINI’s procedure for extracting and transferring data from hospital sites, the GEMINI team performed a computational data quality assessment consisting of a series of 7 data quality checks. This was followed by manual data validation to identify data quality issues arising from the extract-transform-load and transfer processes.

Computational checks were performed based on visualizations and tabulations of data. All data was deidentified, processed, and organized into related categories prior to commencing the quality checks. Checks 1-4 were designed to identify errors in data completeness associated with data extraction and transfer procedures while Checks 5-7 consisted of detailed inspections of selected variables:

1. Admissions over time: This check explored whether data for patient admissions was systematically missing from any data table.

2. Data volume over time: In contrast to Check 1, this check assessed missing data rather than missing patient admissions. Temporal patterns of missingness as well as variations in trends were carefully inspected.

3. Admission-specific data volume over time: This check examined the total volume of data per patient admission, to rule out variations in data volume that may be driven by variations in the number of patient admissions.

4. Distribution of data in relation to admission and discharge times: This check examined whether the date and time information collected for patient admissions and discharges was plausible, and ensured that data was not missing from a specific portion of a patient’s hospitalization (e.g. the emergency department stay).

5. Overall variable presence over time: The missingness of every variable in each data table was examined. This allowed the team to identify temporal patterns and clustering in variable missingness.

6. Specific variable presence over time: The quality of data categorization and standardization for a specific variable was assessed through visual inspection. This check allowed for identification of problems with variable mapping and standardization.

7. Plausibility check: Each variable was inspected to ensure it contained plausible values.
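To make the first three checks more concrete, the sketch below tabulates a toy data table by admission month. It is only an illustration under stated assumptions: the row layout, the grouping by admission month, and the function names are hypothetical and are not taken from the GEMINI team's own code, which relied on visualizations and tabulations that are not described in detail here.

```python
from collections import Counter
from datetime import date

# Hypothetical toy data (not GEMINI data): (admission_id, admission_date)
admissions = [
    (1, date(2015, 1, 5)),
    (2, date(2015, 1, 20)),
    (3, date(2015, 2, 3)),
    (4, date(2015, 3, 9)),
    (5, date(2015, 3, 15)),
]
# Hypothetical rows of one data table (e.g. lab results): (admission_id, result_date)
lab_rows = [
    (1, date(2015, 1, 5)), (1, date(2015, 1, 6)),
    (2, date(2015, 1, 21)),
    (4, date(2015, 3, 9)), (4, date(2015, 3, 10)),
    (5, date(2015, 3, 16)),
]


def month_key(d):
    """Group dates by (year, month)."""
    return (d.year, d.month)


def admissions_with_data_per_month(admissions, rows):
    """Check 1 analogue: admissions per month that have at least one row in the table."""
    ids_with_data = {adm_id for adm_id, _ in rows}
    return Counter(month_key(d) for adm_id, d in admissions if adm_id in ids_with_data)


def rows_per_month(admissions, rows):
    """Check 2 analogue: number of table rows, grouped by admission month (an assumption)."""
    adm_month = {adm_id: month_key(d) for adm_id, d in admissions}
    return Counter(adm_month[adm_id] for adm_id, _ in rows if adm_id in adm_month)


def rows_per_admission_per_month(admissions, rows):
    """Check 3 analogue: rows per month divided by all admissions in that month."""
    total_admissions = Counter(month_key(d) for _, d in admissions)
    volume = rows_per_month(admissions, rows)
    return {m: volume.get(m, 0) / n for m, n in total_admissions.items()}


if __name__ == "__main__":
    ratios = rows_per_admission_per_month(admissions, lab_rows)
    for month in sorted(ratios):
        # A month whose ratio drops to zero is a candidate for a missing extract.
        print(month, ratios[month])
```

In this toy example, February has an admission but no rows in the table, so it stands out in all three tabulations. Normalizing by the number of admissions (the Check 3 analogue) separates a genuine gap in the data from a month that simply had fewer admissions.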

When data quality issues were identified, data was re-extracted from hospitals iteratively until the problems were corrected.

The GEMINI dataset was then compared to the gold standard: data manually abstracted from patients’ individual hospital records. Comparison between the GEMINI dataset and 7,844 patients’ hospital electronic records showed that the GEMINI dataset achieved high specificity (99%-100%), high sensitivity (95%-100%) and high overall accuracy (98%-100%). This suggests that although manual data validation is labour-intensive, it may be necessary to validate individual hospital electronic records for each patient admission.
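For readers unfamiliar with these measures, the sketch below shows how sensitivity, specificity and overall accuracy are conventionally computed when an extracted value is compared against a manually abstracted reference. It assumes each data point can be reduced to a binary flag (for example, whether a given test or procedure is recorded for an admission). That framing, the function name and the toy values are assumptions for illustration and do not reproduce the GEMINI team's validation procedure.

```python
def validation_metrics(extracted, reference):
    """Compare binary flags from an extracted dataset against a manual reference.

    extracted, reference: equal-length sequences of booleans, one pair per data point.
    Returns (sensitivity, specificity, accuracy); a metric is None if its
    denominator is zero.
    """
    if len(extracted) != len(reference):
        raise ValueError("extracted and reference must have the same length")

    tp = sum(1 for e, r in zip(extracted, reference) if e and r)          # present in both
    tn = sum(1 for e, r in zip(extracted, reference) if not e and not r)  # absent in both
    fp = sum(1 for e, r in zip(extracted, reference) if e and not r)      # only in extract
    fn = sum(1 for e, r in zip(extracted, reference) if not e and r)      # missed by extract

    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    accuracy = (tp + tn) / total if total else None
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Hypothetical toy flags, not GEMINI results.
    reference = [True, True, True, True, False, False, False, False, False, False]
    extracted = [True, True, True, False, False, False, False, False, False, True]
    print(validation_metrics(extracted, reference))
```

Sensitivity here is the share of reference-positive data points that the extract also captured, specificity is the share of reference-negative data points that the extract correctly left absent, and accuracy is the share of all data points on which the two sources agree.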

As such, the GEMINI experience demonstrates that computational data quality assessment and manual data validation are complementary, and that combining the two is likely the best approach to assessing the quality of large clinical databases.