Skeletal development of the hand and wrist: digital bone age companion—a suitable alternative to the Greulich and Pyle atlas for bone age assessment?

This study is Health Insurance Portability and Accountability Act compliant and institutional review board approved. A waiver of informed consent was issued by the institutional review board.

Subjects (bone age study selection)

To limit the potential for recall bias, we identified bone age examinations that had been interpreted clinically with the G&P atlas more than 1 year before the start of our study. From the reports of these examinations stored on the PACS, we collected (1) patient gender, (2) interpreting physician, and (3) the qualitative impression of the original bone age report (i.e., normal vs. delayed vs. advanced skeletal maturity). The radiographs themselves were not reviewed as part of study selection. We identified 96 bone age examinations (Table 1) that had been interpreted at a tertiary care center by two fellowship-trained faculty pediatric radiologists (10 and 29 years of focused pediatric radiology experience) using the G&P method. The cohort was chosen to achieve an approximately even distribution by (1) patient gender and (2) interpreting physician. One-third of the studies in the cohort were purposefully chosen to have abnormal skeletal maturity, making the study more rigorous and more likely to reveal a difference in performance. The pediatric radiologist readers were blinded to this information.

Table 1 Characteristics of the children included in the bone age sample for interpretation as a group and by reader

The chronological age of the patients (n = 96) included in this study ranged from 1.8 to 17.2 years (median 11.3 years; SD 4.0 years). Additional details about the cohort are provided in Table 1.

Bone age interpretation

The same pediatric radiologists who had previously interpreted the selected examinations for clinical purposes with the G&P atlas were blinded to the original report and asked to re-read their own prior studies (n = 96, comprising 50 studies interpreted by radiologist 1 and 46 by radiologist 2). Recall bias was limited by using studies that were more than 1 year old. Research study reads were randomized to either G&P (n = 48, comprising 25 for radiologist 1 and 23 for radiologist 2) or DBAC (n = 48, comprising 25 for radiologist 1 and 23 for radiologist 2). For each research interpretation, we recorded the bone age standard selection, the overall impression (i.e., normal, delayed, or advanced maturity), the interpretation-report cycle time, and the number of typographical or speech recognition errors in the report (one error = a single instance of an incorrect word or spelling).
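The allocation above (half of each radiologist's studies to each method) could be reproduced with a per-reader shuffle and split. This is only a sketch: the text does not state how the randomization was performed, and the function name, study identifiers, and seed are hypothetical.

```python
import random


def randomize_reads(studies_by_reader, seed=0):
    """Assign half of each reader's studies to G&P and half to DBAC.

    Assumption: randomization is stratified by reader; the source gives
    only the resulting counts, not the mechanism.
    """
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    assignment = {}
    for reader, studies in studies_by_reader.items():
        shuffled = list(studies)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for study_id in shuffled[:half]:
            assignment[study_id] = "G&P"
        for study_id in shuffled[half:]:
            assignment[study_id] = "DBAC"
    return assignment


# Hypothetical study identifiers matching the cohort sizes in the text.
studies = {
    "radiologist_1": [f"r1-{i}" for i in range(50)],
    "radiologist_2": [f"r2-{i}" for i in range(46)],
}
allocation = randomize_reads(studies)
```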

Because the G&P method has been widely used for decades, we presume that its clinical workflow is understood. Because the integrated DBAC method is likely unfamiliar to many readers, it is illustrated in Fig. 1 and described briefly here. When interpreting a bone age study, the radiologist launches DBAC by clicking on an icon within the RIS or PACS, depending on the local system configuration. In our system, readers can click on an icon in either the RIS (Radiant, Epic Systems, Verona, WI) or the PACS (Carestream version 11.3, Carestream Health, Rochester, NY), whichever they find more convenient. This same click also passes the patient and study context to DBAC, which then displays the standard of the correct gender and the most closely matching age. The radiologist may optionally call up additional standards and/or guiding annotations until the best match is made. If the reader believes the clinical image falls between two standards, this adjustment can be manually entered into DBAC. With context sharing and standard matching complete, a structured report is then generated.

We wish to clarify our rationale for evaluating the overall skeletal maturity assessment (i.e., normal, delayed, or advanced) in addition to the bone age standard assessment. The bone age standard assessment reflects how closely the reader believes the candidate image matches one standard over another. The overall skeletal maturity assessment requires additional steps: looking up the standard deviation, considering the chronological age of the patient, potentially making a calculation (e.g., when the estimated skeletal age does not match the chronological age, the radiologist may compare the two to determine whether they are within two standard deviations of each other), and finally classifying the study into one of three basic categories (i.e., normal, delayed, or advanced maturity). These additional steps are further potential sources of human error beyond skeletal age assessment alone. Since DBAC automates standard deviation data look-up, calculates the number of standard deviations difference between estimated skeletal age and chronological age, and classifies the numerical result (normal vs. delayed vs. advanced), there is potential for differences in overall skeletal maturity assessment between DBAC and G&P methods, even in cases with matching bone age standard assessments.
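The calculation and classification steps described above can be sketched as follows. This is not DBAC's implementation. The function name and the example ages and standard deviation are hypothetical, and treating a difference of exactly two standard deviations as normal is an assumption.

```python
def classify_maturity(skeletal_age_months, chronological_age_months,
                      sd_months, threshold_sd=2.0):
    """Return (number of SDs, category) for a bone age reading.

    The two-SD threshold follows the text; counting exactly +/- 2 SD as
    "normal" is an assumption made for this sketch.
    """
    if sd_months <= 0:
        raise ValueError("standard deviation must be positive")
    n_sd = (skeletal_age_months - chronological_age_months) / sd_months
    if n_sd > threshold_sd:
        category = "advanced"
    elif n_sd < -threshold_sd:
        category = "delayed"
    else:
        category = "normal"
    return n_sd, category


# Hypothetical values: skeletal age 90 months, chronological age
# 120 months, looked-up SD of 10 months.
n_sd, category = classify_maturity(90, 120, 10)
```

Here the skeletal age is three standard deviations below the chronological age, so the study is classified as delayed.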

Benchmark reading

One of our objectives was to determine whether DBAC would yield bone age results similar to those of G&P; however, we recognized that if we compared DBAC-based interpretations only to the original G&P-based interpretations, the intraobserver variability known to be intrinsic to the G&P method would be a confounding variable [2]. Thus, we desired a stronger benchmark than the original clinical interpretation alone. Since there is no perfect gold standard for bone age results, we used a two-out-of-three approach to establish a benchmark, with at least one of the two matching results coming from G&P. When the research interpretation agreed with the original clinical G&P-based interpretation, the two matching results were considered the benchmark. When a research interpretation disagreed with the original clinical G&P interpretation, the radiologist who had produced the discrepant readings was informed of a conflict and asked to perform a blinded third reading using G&P. This generated a two-out-of-three tie-breaker result and established the benchmark. In such cases, the tie-breaking interpretation took place 4 weeks after the initial research interpretation to limit recall bias.

We wish to further clarify the rationale for using a third tie-breaking interpretation with G&P by the same reader as the benchmark. The original clinical reading alone was considered insufficient as a benchmark because of the expected intraobserver variability. An alternative study design could utilize a second reader as the tie-breaker, but that introduces interobserver variability, and our study was focused on intraobserver performance when using two different bone age methods. It should be emphasized that in all cases of discrepancy, at least one if not both of the two-out-of-three results came from a blinded G&P reading.

In summary, regardless of match or discrepancy between clinical and research reads, the benchmark always included a blinded interpretation with G&P. This is important because G&P is the widely accepted method, even though intraobserver variability can be expected.
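The two-out-of-three benchmark rule described in this section can be sketched as a small function. The function and argument names are hypothetical. The text does not say what happens when all three readings differ, so this sketch returns None in that case as an assumption.

```python
def establish_benchmark(clinical_gp, research_read, tiebreak_gp=None):
    """Return the benchmark result for one study.

    clinical_gp   -- original clinical G&P interpretation
    research_read -- blinded research interpretation (G&P or DBAC)
    tiebreak_gp   -- blinded third G&P read, needed only on discrepancy
    """
    if research_read == clinical_gp:
        # Two matching results: no tie-breaker is required.
        return clinical_gp
    if tiebreak_gp is None:
        raise ValueError("discrepant readings require a third G&P read")
    if tiebreak_gp in (clinical_gp, research_read):
        # The third read sides with one of the two earlier reads.
        return tiebreak_gp
    # Assumption: no benchmark when all three reads differ; the source
    # does not describe this case.
    return None


# Hypothetical example: the research read disagrees with the clinical
# read and the tie-breaking G&P read matches the research read.
result = establish_benchmark("normal", "delayed", "delayed")
```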

Timing and reporting

Interpretation-report cycle time was defined as the time interval beginning with the study being opened on PACS and ending with the corrected report being signed in our reporting application (Powerscribe, Nuance Communications, Inc., Burlington, MA). The cycle included reviewing the study, referencing a bone age resource (G&P or DBAC), and creating and editing the report in Powerscribe. The faculty radiologists signed reports in the same reporting application (i.e., Powerscribe) regardless of assigned atlas; however, the DBAC workflow included an additional step whereby the first report draft was initiated by DBAC and copied into Powerscribe for editing and signature. While Powerscribe offers automated structured report templates and auto-import of some patient information regardless of use of G&P or DBAC, the purpose of the extra DBAC step was to use context sharing to automatically launch a proposed matching standard by age and gender, automate look-up of standard deviation value by age and gender, auto-calculate the number of standard deviations the patient is from the mean to aid final maturity assessment by the radiologist, and populate this information into a structured report.

Subjective workflow experience

After completing all research interpretations in this study, each radiologist completed a Likert survey measuring subjective preferences between methods in the following nine categories: “image quality to include fine bone detail available in the images,” “utility of text and/or arrows in providing aid to interpretation,” “accuracy,” “efficiency,” “ease of use,” “confidence in my calculations,” “confidence in my overall result,” “resident checkout of bone age studies (from experience using both methods clinically with residents),” and “overall experience when reading bone age studies without a resident.” Each category included no further information beyond the above category names, and readers were asked to choose one of the following subjective responses: strongly prefer G&P, prefer G&P, no preference, prefer DBAC, or strongly prefer DBAC.

Statistical analysis

Intraobserver agreement and agreement with benchmark

Intraobserver agreement for both bone age standard determination and overall maturity assessment (i.e., normal, delayed, or advanced) was summarized as the percentage of reads in which old and new determinations agreed. Intraobserver (intrarater) agreement represents the degree of agreement for a single reader performing repeated bone age interpretations on the same study, i.e., how often the reader agrees with himself or herself when re-reading the same study. The intention was to compare how likely the reader was to agree with his or her original clinical G&P interpretation when re-reading with one method or the other. Since DBAC is based upon G&P, and G&P has known intraobserver variability, one might expect similar rates of intraobserver variability. Benchmark agreement represents the degree of agreement of each method with the benchmark result. Benchmark agreement allows a more reliable assessment of DBAC performance than comparison with a single G&P result because of the intraobserver variability intrinsic to G&P. Comparisons of agreement were conducted by way of Fisher’s exact tests, and p ≤ 0.05 was used to determine statistical significance.
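A comparison of agreement rates between the two methods can be laid out as a 2 × 2 table (method by agree/disagree) and tested with a two-sided Fisher's exact test. The sketch below uses only the standard library. The counts are hypothetical placeholders, not study results, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention that the text does not specify.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as observed.
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts: rows are G&P and DBAC research reads (48 each),
# columns are agree / disagree with the comparison reading.
p_value = fisher_exact_two_sided(40, 8, 42, 6)
significant = p_value <= 0.05
```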

Interpretation-report cycle time

Interpretation-report cycle times were analyzed on the natural logarithmic scale via a linear mixed model (LMM). Hypothesis testing was conducted by way of a linear contrast of the LMM least squares means using p ≤ 0.05 to determine statistical significance. Since the data were analyzed on the natural logarithmic scale, the results were back-transformed and are reported as geometric means.
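The back-transformation works because exponentiating a mean of log-transformed values yields a geometric mean, and exponentiating a difference of log-scale means yields a ratio of geometric means. The sketch below shows this with plain sample means. It does not fit the mixed model or compute least squares means, and the function names and times are hypothetical.

```python
import math


def geometric_mean(times_seconds):
    """Exponentiate the mean of the natural logs of the times."""
    if not times_seconds or any(t <= 0 for t in times_seconds):
        raise ValueError("times must be non-empty and positive")
    log_mean = sum(math.log(t) for t in times_seconds) / len(times_seconds)
    return math.exp(log_mean)


def geometric_mean_ratio(times_a, times_b):
    """Ratio of geometric means, i.e. exp(mean log a - mean log b)."""
    return geometric_mean(times_a) / geometric_mean(times_b)


# Hypothetical cycle times in seconds for two methods.
method_a = [240.0, 300.0, 180.0]
method_b = [200.0, 150.0, 260.0]
ratio = geometric_mean_ratio(method_a, method_b)
```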

Report typographical/transcription errors

Report error rates were compared between the two methods via a paired random permutation test. To clarify the rationale for including this metric, we emphasize that DBAC produces a standardized report based upon the standard selected by the reader. Report generation is possible because the gender, date of birth, and date of study have already been made available to DBAC by electronic context sharing. The digital atlas looks up the appropriate standard deviation and completes basic calculations. The expectation is that this semiautomated report may contain fewer textual errors than one generated by a radiologist.
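One common form of a paired random permutation test flips the sign of each paired difference at random and counts how often the permuted total is at least as extreme as the observed one. The sketch below follows that form. The pairing scheme, test statistic, and number of permutations are not stated in the text, so the choices here and the error counts are assumptions.

```python
import random


def paired_permutation_test(x, y, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation p-value for paired samples."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)  # fixed seed for reproducibility
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each difference is equally likely to be + or -.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-report error counts for paired observations.
errors_gp = [1, 0, 2, 1, 0, 1, 3, 0]
errors_dbac = [0, 0, 1, 0, 0, 0, 1, 0]
p_value = paired_permutation_test(errors_gp, errors_dbac)
```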

Subjective workflow experience

An exact binomial test was utilized to test whether the readers’ responses to the survey questions could have occurred simply by chance if the reader did not prefer one bone age assessment method over the other.
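An exact binomial test of this kind can be sketched with the standard library. The null probability of 0.5 (a reader with no preference is equally likely to favor either method in a given category) and the example counts are assumptions. The text does not state the null value or how "no preference" responses were handled.

```python
from math import comb


def binomial_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Sum the probabilities of all outcomes no more likely than observed.
    total = sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical example: a reader prefers one method in 8 of the 9
# survey categories.
p_value = binomial_test_two_sided(8, 9)
```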