Cascaded 3D fully convolutional neural network for segmenting amygdala and its subnuclei
Yilin Liu1, Brendon Nacewicz1, Gregory Kirk1, Andrew Alexander1, and Nagesh Adluru1

1University of Wisconsin Madison, Madison, WI, United States


We address the problem of segmenting subcortical brain structures that have small spatial extent but are associated with many neuropsychiatric disorders and neurodegenerative diseases. Specifically, we focused on the segmentation of amygdala and its subnuclei. Most existing methods including deep learning based focus on segmenting larger structures and the existing architectures do not perform well on smaller structures. Hence we designed a new cascaded fully convolutional neural network with architecture that can perform well even on small structures with limited training data. Several key characteristics of our architecture: (1) 3D convolutions (2) deep network with small kernels (3) no pooling layers.


The amygdala is a critical region of the “social brain”1, and is believed to be closely related to anxiety2 and autism3, but as a deep heterogenous cluster of subnuclei, surrounded by vasculature and sources of MRI field inhomogeneities, it remains an extremely difficult region to quantify by commonly-used automated methods. Additionally, animal models continue to differentiate subnuclei of the amygdala, identifying structural bases of fear generalization in basal and lateral nuclei distinct from output projections from centromedial regions. Therefore, a reliable quantitation of the amygdala and its substructures is urgently needed. Recently, deep learning methods such as convolutional neural networks (CNN) have demonstrated state-of-the-art performance in object-level and pixel-level classification4, due to its ability to learn hierarchical features from training data5. Although CNNs have already been investigated in several medical applications, most of the current approaches failed to incorporate the full spatial extent during segmentation and use only a slice-by-slice 2D neighborhoods6,7 for lowering the computation and memory costs. Our work, building on previous work8, takes advantage of the full 3D spatial neighborhood information by using 3D patches as inputs for the network. Our architecture for segmenting the amygdala has exhibited significant increase in accuracy compared to Freesurfer and FSL's FIRST, and better results than other recent deep learning techniques9,10, without requiring any post-processing steps. To the best of our knowledge, this work is also the first fully convolutional neural network (FCN) model for segmenting the subnuclei of the amygdala. Our architecture for segmenting these 10 subfields has also demonstrated very promising results.


The design and training of our networks were based on the LiviaNET framework7. The pre-processing step involves the standard skull-stripping for extracting the brain and volume-wise intensity normalization from the $$$T_1$$$-weighted MRI. We extracted patches of size $$$25\times 25\times 25$$$ from both the training set and the corresponding ground truth labels. Patches were randomly and uniformly sampled from the entire brain without any special emphasis on the regions of interests (ROIs). The key technical contribution is the design of cascaded FCN which uses the same architecture but learns different weights based on different tasks. The FCN was independently trained for (1) segmenting the full amygdala and (2) segmenting its subnuclei. Note that this is different from a hierarchical organization of tasks. The architecture (Fig. 1) consists of 10 successive convolutional layers with kernel size $$$3\times 3\times 3$$$ and a classification layer (a set of $$$1\times 1\times 1$$$ convolutional kernels) with softmax activation function for obtaining normalized probabilities for each class that each voxel belongs to. The number of kernels on each layer is $$$20,20,30,30,50,50,50,75,75,75,100,$$$ respectively. We employed negative likelihood as cost function, which is optimized using the RMSprop11 optimization method with the decay rate 0.9. The initial learning rate was set to 0.001, decaying 0.5 after every epoch. Each epoch consists of 2 sub-epochs. To avoid bias, we employed the leave-one-out cross validation scheme for our experiments, i.e., for each epoch, 11 subjects were used for training and the left one was for validation. This was repeated 12 times until every subject's segmentation had been used for validation.The network for each task was separately trained for 20 epochs. The similarity and discrepancy of the automatic and manual segmentation were evaluated using the Dice Similarity Coefficient (DSC)12 which is defined as the following for automatic segmentation $$$A$$$ and manual segmentation $$$M$$$, $$$DSC(A, M) = \frac{2|A\cap M}{|A|+|M|}$$$.$$$\mathtt{DSC}$$$ values are in the $$$[0,1]$$$ range, where 1 indicates 100% overlap with the ground truth, and 0 indicates no overlap.


Our architecture is presented in Fig.1. Our method shows excellent performance in segmenting the two whole amygdalae with the mean dice scores: 0.88 and 0.89, respectively. It also shows very promising results in segmenting the subnuclei (Fig.2). Fig.3 shows the effects of geometric features of the subfields on the dice score. Figs.4,5 show some representative segmentation results.

Conclusion and discussion

We found that the volume-to-surface ratio is inversely related to the difficulty of the segmentation, which could account for the lower accuracy yielded when segmenting the residue regions of the amygdala. For example, this difficulty persisted even with skip connections of intermediate layers. We will also validate our method on other amygdala dataset. These difficulties could potentially be tackled by (1) exploring the advantage of dilated convolution13 for enlarging the receptive field to incorporate knowledge from even bigger spatial extents, (2) applying data augmentations methods such as affine transformations and generative adversarial networks (GANs)14 to address the issue of limited data size and make the segmentation more robust, (3) enhance the training data for further generalization to data acquired from different $$$T1$$$-weighted sequences and different scanners.


NIH grants U54-HD090256 and P50-MH100031 are acknowledged.


1. Nacewicz BM, Angelos L, Dalton KM, Fischer R, Anderle MJ, Alexander AL, Davidson RJ (2012) Reliable non-invasive measurement of human neurochemistry using proton spectroscopy with an anatomically defined amygdala-specific voxel. Neuroimage 59(3):2548–2559

2. Shin, L.M. & Liberzon, I. The neurocircuitry of fear, stress, and anxiety disorders. Neuropsychopharmacology 35, 169-191 (2010)

3. Baron-Cohen, S., Ring, H.A., Bullmore, E.T., Wheelwright, S., Ashwin, C., & Williams, S.C.R. ( 2000). The amygdala theory of autism. Neuroscience and Biobehavioral Reviews, 24, 355–364.

4. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

5. Lecun, Y., Bengio, Y. & Hinton, G. Deep Learning. Nature 521, 436-444(2015)

6. Abhijit Guha Roy, Sailesh Conjeti, Debdoot Sheet, Amin Katouzian, Nassir Navab, Christian Wachinger:Error Corrective Boosting for Learning Fully Convolutional Networks with Limited Data.CoRR abs/1705.00938 (2017)

7. O. Ronneberger, P. Fischer, and T.Brox, U-net: Convolutional networks for biomedical image segmentation, in MICCAI, pp.234-241, Springer, 2015

8. Dolz J, Desrosiers C, Ben Ayed I. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. NeuroImage (2017).

9. Wachinger, C., Reuter, M., Klein, T.: DeepNAT: deep convolutional neural network for segmenting neuroanatomy. NeuroImage (2017)

10. Mehta, R., and Sivaswamy, J., 2017. M-net: A Convolutional Neural Network for deep brain structure segmentation. In Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on (pp. 437-440). IEEE

11. Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.

12. L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26 (3) (1945) 297-302

13. F. Yu and V. Koltun, Multi-scale context aggregation by dilated convolutionals, in ICLR, 2016.

14. Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. NIPS, 2014.


The network consists of 10 successive convolutional layers with kernel size $$$3\times 3\times 3$$$ and a fully connected layer replaced by a large set of $$$1\times 1\times 1$$$ convolutions. The number of kernels on each layer except for the classification layer is $$$20,20,30,30,50,50,50,75,75,75,100,$$$ respectively. The number of neurons on the classification layer could be either 3 or 11 (denotes the number of classes, including the background), depending on the task. Higher resolution image available online.

Segmentation results measured using Dice scores for the bilateral whole amygdalae (left and right) and their subfields. The bar graph shows our segmentation results compared with other deep learning based and classical approaches.Higher resolution image available online.

The relationships between several geometric features of the regions and the segmentation performance. Among these factors that could contribute to the final segmentation accuracy we find that the lower volume-surface ratio could account for the lower dice yielded by the residues of the two amygdalae. We can see that as soon as the volume to surface ratio is at least 1, the Dice scores rise exponentially fast. Higher resolution image available online.

Representative segmentation results for the two whole amygdalae and their subfields. Higher resolution image available online.

An animation (that could be viewed online in a browser) to show qualitatively how our cascaded 3D FCN method output agrees with the ground truth through all the slices in coronal, sagittal and axial views. The green outlines show the result of out network and the magenta and orange outlines are human expert designated ground truth.

Proc. Intl. Soc. Mag. Reson. Med. 26 (2018)