Modulation of Expectation on Sound-to-Meaning Mapping during Speech Processing: An fMRI Study
Bingjiang Lyu1,2,3, Jianqiao Ge1,2,3, Zhendong Niu4, Li Hai Tan5, Tianyi Qian6, and Jia-Hong Gao1,2,3

1Center for MRI Research, Peking University, Beijing, People's Republic of China, 2McGovern Institute for Brain Research, Peking University, Beijing, People's Republic of China, 3Beijing City Key Lab for Medical Physics and Engineering, Peking University, Beijing, People's Republic of China, 4School of Computer Science and Technology, Beijing Institute of Technology, Beijing, People's Republic of China, 5Center for Language and Brain, Shenzhen Institute of Neuroscience, Shenzhen, People's Republic of China, 6MR Collaborations NE Asia, Siemens Healthcare, Beijing, People's Republic of China


Spoken language comprehension relies on both the identification of individual words and the expectations arising from contextual information. A distributed fronto-temporal network is known to facilitate the mapping of speech sounds onto corresponding meanings. However, how prior expectations influence this efficient mapping at the neuroanatomical level, especially for individual words, remains unclear. Using functional magnetic resonance imaging, we addressed this question in the framework of the dual-stream model by investigating both the neural substrates and their mutual functional and effective connectivity. Our results revealed how this ubiquitous sound-to-meaning mapping in daily communication is achieved in a predictive manner.


To investigate the predictive brain mechanism underlying the effortless and efficient information exchange via speech, and to reveal how prior expectation influences the sound-to-meaning mapping at the neuroanatomical level in a natural language context.


Thirty right-handed native Mandarin Chinese speakers (aged 21-28, 15 male) were enrolled in this study. Three types of auditory stimuli were presented: expected phrases (EPs), unexpected phrases (UPs), and time-reversed phrases (TPs). Subjects were asked to judge the gender of the speakers. Chinese idioms generally have a high transitional probability; therefore, expectation violations can be naturally induced by manipulating the last portion of an idiom. The EPs were normal Chinese idioms, while the UPs were created by keeping the first portion of an idiom and replacing the last portion with character(s) from another, irrelevant idiom. The TPs were derived equally from EPs and UPs to provide a low-level acoustic match. The experimental procedure was adopted from a previous study1. MRI data were acquired using a MAGNETOM Trio 3T MR scanner (Siemens, Erlangen, Germany) with a 12-channel head coil. Thirty-five axial slices covering the whole brain were acquired using a T2*-weighted gradient-echo EPI sequence with the following parameters: TR/TE/FA = 2080 ms/30 ms/90°, matrix size = 64 x 64, in-plane resolution = 3 mm x 3 mm, slice thickness = 3 mm, slice gap = 0.75 mm. After the fMRI experiment, fifteen subjects participated in a post-scan dictation test in which they wrote down the phrases they had heard in the scanner. Both univariate and multivariate pattern analyses (MVPA) were conducted to give a comprehensive description of the involved neural substrates (Fig. 1). The voxel-wise Mahalanobis distances (MDs) between task conditions, which take the covariance structure into account and reduce the contribution of noisy voxels, were used as features for training support vector machine (SVM) classifiers.
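As a concrete illustration of this feature construction, the sketch below (Python, fully simulated data; not the study's actual pipeline) computes per-trial Mahalanobis distances to condition means and to baseline, then feeds them to a linear SVM:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Simulated trial-by-voxel patterns for two conditions (stand-ins for EPs
# and UPs); a real analysis would use per-trial patterns from the fMRI GLM.
n_trials, n_voxels = 40, 20
ep = rng.normal(0.0, 1.0, (n_trials, n_voxels))
up = rng.normal(0.5, 1.0, (n_trials, n_voxels))
X_raw = np.vstack([ep, up])
y = np.array([0] * n_trials + [1] * n_trials)

# Regularised inverse covariance: this is what lets the MD down-weight
# noisy, correlated voxels relative to a plain Euclidean distance.
cov = np.cov(X_raw, rowvar=False) + 1e-2 * np.eye(n_voxels)
icov = np.linalg.inv(cov)

# Feature vector per trial: MD to each condition mean and to baseline (zeros)
means = [ep.mean(axis=0), up.mean(axis=0), np.zeros(n_voxels)]
X = np.array([[mahalanobis(t, m, icov) for m in means] for t in X_raw])

clf = LinearSVC(max_iter=10_000)
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")
```

For brevity the covariance and condition means are estimated from all trials; a real pipeline would estimate them within training folds only to avoid circularity.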
According to the dual-stream model2,3, seven regions of interest in the left hemisphere (i.e., the anterior superior temporal gyrus (aSTG), superior temporal pole (STP), posterior middle temporal gyrus (pMTG), primary auditory cortex (PAC), pars triangularis of the inferior frontal gyrus (IFGtr), pars opercularis of the IFG (IFGop), and supplementary motor area (SMA)) were selected for further psychophysiological interaction (PPI)4 and dynamic causal modelling (DCM)5,6 analyses.
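The core of a PPI analysis4 is a GLM containing an interaction regressor formed from the seed time series and the psychological variable. The toy example below (simulated data, omitting HRF convolution and the deconvolution step of the real SPM procedure, so only a schematic) recovers a context-dependent change in seed-target coupling:

```python
import numpy as np

rng = np.random.default_rng(2)
n_scans = 200

# Psychological variable: 0 during EP blocks, 1 during UP blocks (toy design)
psych = np.repeat([0.0, 1.0], n_scans // 2)
seed = rng.normal(size=n_scans)        # seed (e.g., IFGtr) time series
ppi = seed * (psych - psych.mean())    # interaction (PPI) regressor

# Simulated target region whose coupling with the seed doubles for UPs
target = seed * (1.0 + psych) + rng.normal(scale=0.5, size=n_scans)

# GLM: target ~ intercept + seed + psych + ppi; the PPI beta captures the
# context-dependent change in coupling over and above the main effects.
X = np.column_stack([np.ones(n_scans), seed, psych, ppi])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print(f"PPI beta (change in coupling for UPs): {beta[3]:.2f}")
```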


The results of the univariate and multivariate analyses are shown separately in Fig. 2A&B. PPI analysis revealed that the left IFGtr tended to exhibit stronger functional connectivity with both streams for UPs than for EPs (Fig. 3). The strengths of two connections with the IFGtr were both positively correlated with subjects' performance in the dictation test: one along the ventral stream (IFGtr-aSTG; r = 0.78, P < 0.001; r = 0.54, P = 0.039; Fig. 4A&B) and the other within the dorsal stream (IFGtr-IFGop; r = 0.55, P = 0.035; Fig. 4B). Moreover, the strengthened connectivity between the aSTG and STP for UPs relative to TPs showed a positive correlation with the dictation accuracy for UPs (r = 0.83, P < 0.001; Fig. 4A). The winning model and the modulatory effects of EPs and UPs obtained from the post-hoc DCM analysis are shown in Fig. 5.
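These brain-behavior correlations are, in essence, per-subject Pearson correlations followed by FDR correction (Fig. 4 caption). A minimal sketch with simulated values (variable names and effect sizes are illustrative, not the study's data):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Toy per-subject values (n = 15, as in the dictation sub-sample): a PPI
# connection strength and a dictation score, built to be loosely coupled
n = 15
strength = rng.normal(size=n)
score = 0.7 * strength + rng.normal(scale=0.5, size=n)

r, p = pearsonr(strength, score)

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg step-up: boolean mask of p-values surviving FDR q."""
    p_arr = np.asarray(pvals, dtype=float)
    order = np.argsort(p_arr)
    m = len(p_arr)
    passed = p_arr[order] * m / np.arange(1, m + 1) <= q
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        keep[order[: np.nonzero(passed)[0].max() + 1]] = True
    return keep

print(f"r = {r:.2f}, P = {p:.3f}, survives FDR(0.05): {fdr_bh([p])[0]}")
```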


We found that UPs induced stronger activation than EPs in the left aSTG and IFGtr; moreover, stronger functional connectivity between these regions predicted better dictation performance when expectations were violated. These results suggest that the aSTG plays an important role in facilitating rapid sound-to-meaning mapping under top-down constraints. Furthermore, the cortical dynamics indicate enhanced modulation of the IFGtr-to-aSTG, aSTG-to-STP, and reciprocal IFGtr-STP connections in response to UPs. The enhanced feedback connections from the IFGtr may reflect either the additional demand for top-down constraints that help determine word identity or the lack of higher-level information needed to accomplish semantic integration/retrieval. The pattern-analysis results suggest that the pMTG might also encode lower-level information for further top-down modulation of speech processing. Consistent with this perspective, we identified an enhanced feedback connection from the left IFGop to the pMTG and an enhanced forward connection from the pMTG to the IFGop via the SMA during the processing of UPs. The functional connectivity results suggest that the IFG subserves the integration of the dorsal and ventral streams7, as indicated by the corresponding enhanced connections. Moreover, enhanced information flow from the IFGtr to the IFGop was found for both EPs and UPs, which may reflect a transformation from category-invariant representations into motor-articulatory representations of intelligible speech3.


The results of this study suggest that the human brain relies on adjacent cortical areas and their interconnections for efficient back-and-forth processing of local and contextual information, which facilitates speech processing in a predictive manner.


This work was supported by China's National Strategic Basic Research Program (973; Grant 2012CB720700), National Natural Science Foundation of China (Grants 31200761, 31421003, 81227003, and 81430037), Beijing Municipal Science & Technology Commission (Grant Z161100000216152), and Shenzhen Peacock Plan (Grant KQTD2015033016104926).


1 Leff AP, Schofield TM, Stephan KE, et al. The cortical dynamics of intelligible speech. J Neurosci. 2008;28:13209-13215.

2 Hickok G, Poeppel D. The cortical organization of speech processing. Nat Rev Neurosci. 2007;8:393-402.

3 Rauschecker JP, Scott SK. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat Neurosci. 2009;12:718-724.

4 Friston KJ, Buechel C, Fink GR, et al. Psychophysiological and modulatory interactions in neuroimaging. Neuroimage. 1997;6:218-229.

5 Friston KJ, Harrison L, Penny W. Dynamic causal modelling. Neuroimage. 2003;19:1273-1302.

6 Rosa MJ, Friston K, Penny W. Post-hoc selection of dynamic causal models. J Neurosci Methods. 2012;208:66-78.

7 Bornkessel-Schlesewsky I, Schlesewsky M, Small SL, et al. Neurobiological roots of language in primate audition: common computational properties. Trends Cogn Sci. 2015;19:142-150.


Figure 1 Schematic diagram showing complementary contributions of the univariate and multivariate analyses. (A) Local activation strength is indicated by parameter estimates of the target voxel (i.e., solid colors in the center of each matrix). Local activation patterns around the target voxel are shown in transparent colors. (B) Procedure for the multivariate pattern analysis, with brain activity patterns indexed by the Mahalanobis distance (MD) among the three task conditions and the baseline for each voxel. The MD patterns for each task condition were then entered into the linear support vector machine (SVM) classifiers.

Figure 2 The results of (A) univariate and (B) multivariate analyses showing neural substrates underlying speech processing. The results are shown after thresholding at voxel-level P < 0.001 with cluster-level FWE correction of P < 0.05.

Figure 3 Results of the psychophysiological interaction (PPI) analysis. (A) PPI connectivity for UP > EP among the seven seed areas within the left fronto-temporal cortex. Warm colors indicate a strengthened connection for unexpected phrases relative to expected phrases, and cool colors represent the opposite effect. (B) Enhanced PPI connectivity with the pars triangularis of the inferior frontal gyrus (IFGtr) as the seed area, induced by unexpected phrases relative to expected phrases. The results were thresholded at voxel-level P < 0.005 with cluster-level FWE correction of P < 0.05.

Figure 4 Inter-subject brain-behavior correlation results. The dictation accuracy denotes the percentage of correct characters identified out of all the phonologically correct characters during the post-experiment dictation test, which indicates successful sound-to-meaning mapping. These results survived multiple comparisons correction (FDR P < 0.05).

Figure 5 DCM specification and modulation results. (A) Full model specified according to the dual-stream model. The directed dashed lines indicate the hypothesized connections. The hollow arrows represent auditory inputs. Significant modulatory effects of expected phrases and unexpected phrases are shown separately in (B) and (C). The black and grey arrows indicate positive and negative modulation, respectively.

Proc. Intl. Soc. Mag. Reson. Med. 25 (2017)