How easily can an existing stroke outcome deep learning model become attuned to new acquisition protocols and patient cohorts?
Anne Nielsen1,2, Mikkel Bo Hansen1, Soren Christensen3, Maarten Lansberg3, Greg Zaharchuk4, and Kim Mouridsen1

1Center of Functionally Integrative Neuroscience and MINDLab, Department of Clinical Medicine, Aarhus University, Aarhus, Denmark, 2Cercare Medical, Aarhus, Denmark, 3Department of Neurology, Stanford University, Stanford, CA, United States, 4Department of Radiology, Stanford University, Stanford, CA, United States


Acute ischemic stroke is a major disease and one of the leading causes of adult death and disability. Final outcome prediction is hampered by the heterogeneity and physiological complexity of stroke progression. Convolutional neural networks have shown promising results in final outcome predictions. However, less attention has been paid to the generalizability of the results across patient cohorts. We test the applicability of an existing neural network trained on two clinical studies to completely independent cohort from the DEFUSE 2 trial. We examine how a few additional patients can be used to obtain performance comparable to the original studies.


Acute ischemic stroke is one of the leading causes of adult death and disability in the western part of the world. Stroke progression is heterogeneous, physiologically highly complex, and quick treatment initialization is crucial for good outcome prognosis. Accurate predictions of final tissue outcome via automatic methods will be an important tool for neuroradiologists treating stroke patients. The deep learning methods known as convolutional neural networks (CNN) has recently proved to be powerful tools in final outcome prediction. However, it is unclear how well such networks generalize to new data sets with different treatment protocols, which at the same is decisive for widespread use of such technology. In this project, we aim to investigate how a CNN perform on a new data set from facilities unknown by the original network and how a few patients can mediate the performance.


Two data sets were used: Dataoriginal: 229 patients from the I-KNOW multicenter1 (112) and perconditioning2 (117) studies. The mean of the follow-up infarct volume was 22 ml. Further details are listed in Hansen et al3. Datanew: 34 patients from the DEFUSE 2 study4. The mean of the follow-up infarct volume was 121 ml. Further details are listed in Lansberg et al4. For both data sets, patients with acute diffusion-weighted imaging (DWI), perfusion-weighted imaging (PWI) and T2-weighted-Fluid-Attenuated Inversion Recovery (T2-FLAIR) and follow-up T2-FLAIR with manual final infarct delineation were included. We compared two different models based on a CNN5 consisting of 37 layers (presented at ISMRM 20176). The first model (CNN0) was trained on 194 patients from Dataoriginal. The second model (CNN10) used CNN0 as a starting point and was then post-trained on 10 patients from Datanew for approximately 2.5 hours. The predictions were based on T2-FLAIR, trace DWI, apparent diffusion coefficient (ADC) and the following perfusion biomarkers: mean transit time (MTT), cerebral blood volume (CBV), cerebral blood flow (CBF), cerebral metabolism of oxygen7 (CMRO2), relative transit time heterogeneity (RTH), and the bolus delay estimated directly by the vascular model. The CNN0 and CNN10 models were evaluated on 24 independent patients from Datanew. The area under the receiver operation characteristic curve (AUC), the dice coefficient, and comparison between follow-up and predicted final infarct volume was evaluated and compared using Wilcoxon signed-rank tests (predictive maps are thresholded at 0.7). Moreover, the models were tested on 35 patients from Dataoriginal and the AUCs were compared through a Wilcoxon signed-rank test.


Figure 1 shows examples from the predictive models. For patient A, the predictions from the two models are similar whereas CNN0 underestimates the final infarct in the other cases. Both models underestimate patient D’s final infarct. Figure 2 shows boxplots of AUCs for the predictive models. The highest AUC was obtained for CNN10 (0.91±0.06) with a significant difference (p<0.0001) compared to CNN0 (0.71±0.13). For the dice coefficients, the difference was significant as well (p<0.0001), with CNN10 (0.35±0.19) being superior to CNN0 (0.03±0.06). As illustrated in Figure 3, the predicted volume of CNN10 is closer to the ground truth volume (mean of difference: 42.8 ml, std: 42.8) compared to CNN0 (mean of difference: 66.3 ml, std: 70.1). Both models underestimate the largest lesions. For reference, patient D from Figure 1 represents the right-most dot in Figure 3. Comparison of the models using test data from Dataoriginal showed no significant difference in AUCs (CNN0: 0.93±0.1, CNN10: 0.92±0.9, p=0.12), or in dice coefficient (CNN0: 0.39±0.28, CNN10: 0.28±0.19, p=0.08).


We show a significant performance improvement on Datanew when the predictive model was post-trained on a few patients. We speculate this difference to be caused by the substantial difference in follow-up infarct volumes between the two data sets. The post-training approach was fast and can lead to better site-specific predictions for the CNN models’ performance which are known to be very data dependent. Our results show the importance of including a large variety of data during the training process and how a few additional cases can improve performance. A few patients from Datanew have larger final infarct lesions than the hypoperfused area on the acute scan (e.g. patient D in Figure 1). This might be due to later occlusions and as a result both predictive models underestimate these lesions due to lack of information.


Predictive performance of CNNs on new data sets can be improved by post-training on a few patients. This will improve final outcome prediction and generalization of CNNs yielding better automated decision support systems providing recommendations for personalized treatment and thereby hopefully better outcome for the individual patient.


No acknowledgement found.


1. I-KNOW. Integrating information from molecules to man: Knowledge discovery accelerates drug development and personalized treatment in acute stroke. 2006

2. Hougaard KD, Hjort N, Zeidler D, Sorensen L, Norgaard A, Thomsen RB, et al. Remote ischemic perconditioning in thrombolysed stroke patients: Randomized study of activating endogenous neuroprotection - design and mri measurements. Int J Stroke. 2013;8:141-146

3. Hansen MB, Nagenthiraja K, Ribe LR, Dupont KH, Ostergaard L, Mouridsen K. Automated estimation of salvageable tissue: Comparison with expert readers. J Magn Reson Imaging. 2016;43:220-228

4. Lansberg MG, Straka M, Kemp S, Mlynash M, Wechsler LR, Jovin TG, et al. Mri profile and response to endovascular reperfusion after stroke (defuse 2): A prospective cohort study. Lancet Neurol. 2012;11:860-867

5. Badrinarayanan V, Kendall A, Cipolla R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. ArXiv e-prints. 2015;1511

6. Nielsen A, Hansen MB, Mouridsen K, Boldsen JK. Deep learning: Utilizing the potential in data bases to predict individual outcome in acute stroke. ISMRM. 2017

7. Jespersen SN, Østergaard L. The roles of cerebral blood flow, capillary transit time heterogeneity, and oxygen tension in brain oxygenation and metabolism. J Cereb Blood Flow Metab. 2012;32:264-277


Results from the predictive models for patients from Datanew. Four biomarkers used for predictions (TRACE DWI, CMRO2, MTT and Tmax) are shown as well as the acute T2-FLAIR with the final infarct shown as a red contour.

Boxplot showing the area under the receiver operation characteristic curve (AUC) for the 24 test patients in Datanew.

Ground truth final infarct volume vs. predicted final infarct volume for the two predictive models CNN0 and CNN10.

Proc. Intl. Soc. Mag. Reson. Med. 26 (2018)