Animals
Animal procedures were approved by the Stony Brook University Animal Care and Use Committee and carried out in accordance with the National Institutes of Health standards. Experiments were conducted using male C57BL/6 mice (Charles River Laboratories). Mice were housed with free access to food, but water was restricted after the initiation of behavioral training. On training days, water was available during task performance (2.5 μL for each correct trial); on non-training days, water bottles were provided to the mice for at least 1 h per day.
Behavior
Experiments were conducted in a dark, single-walled, sound-attenuating training chamber (22 cm × 15 cm). The chamber contained three nose pokes, each consisting of an infrared LED/infrared phototransistor pair connected to the Bpod system (Sanworks, LLC) for response detection. Activation of the central nose poke was required to initiate a trial. A speaker embedded in the wall delivered auditory cues or distractors. Sound intensity was calibrated monthly at three different positions in the chamber, and the measurements were within ±1 dB of one another. Two white LEDs were mounted in the two reward nose pokes for the “visual task.” Water rewards were controlled by the Bpod system and delivered through the wall-mounted nose pokes.
Freely moving mice were trained to perform a set of two-alternative forced-choice (2AFC) tasks, as previously described [30, 33]. Each trial was initiated when the mouse inserted its nose into the center port of the three-port operant chamber. After a delay period (200–300 ms, uniformly distributed), a 100-ms stimulus was presented, indicating which nose poke (left or right) would be rewarded with water. The mouse then selected the left or right goal port on the basis of the sensory stimulus. The mouse had to remain in the center port until the 100-ms stimulus ended, which prevented movement from influencing the measured auditory responses. Mice that withdrew from the center port before the end of the auditory stimulus were not rewarded, and these trials were excluded from analysis. Every mouse was trained to perform sound blocks and light blocks in randomized order.
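The trial structure above can be summarized as a small outcome classifier. This is a minimal sketch, not the Bpod state machine: it assumes center-port withdrawal times are measured in seconds from trial initiation, and the names `sample_delay` and `trial_outcome` are hypothetical.

```python
import random

STIM_DUR_S = 0.100              # stimulus duration stated in the text
DELAY_RANGE_S = (0.200, 0.300)  # pre-stimulus delay, uniformly distributed


def sample_delay(rng: random.Random) -> float:
    """Draw the pre-stimulus delay (s) for one trial."""
    return rng.uniform(*DELAY_RANGE_S)


def trial_outcome(center_out_s: float, delay_s: float, choice: str, target: str) -> str:
    """Classify one trial.

    center_out_s: time the mouse left the center port, relative to trial initiation (s).
    choice, target: "left" or "right".
    """
    if center_out_s < delay_s + STIM_DUR_S:
        # Left before the stimulus ended: no reward, and the trial is excluded from analysis.
        return "early_withdrawal"
    return "correct" if choice == target else "error"
```

Only "correct" and "error" trials would then enter the imaging analyses.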
Auditory stimuli consisted of a pseudorandom 100-ms stream of overlapping 30-ms pure tones presented at a rate of 200 Hz. Eighteen possible tone frequencies were logarithmically spaced from 5 to 40 kHz. On each trial, either the low stimulus (5–10 kHz) or the high stimulus (20–40 kHz) was selected as the target, and the mice were trained to report low or high by choosing the corresponding port for the water reward. Correct responses were rewarded with water (2.5 μL per correct trial), and error trials were punished with a 4-s timeout. Sound intensity was calibrated to 60 dB.
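A minimal sketch of how such a tone cloud might be synthesized. Several details are assumptions not given in the text: the 192-kHz sampling rate, 5-ms cosine on/off ramps, reading "200 Hz" as one tone onset every 5 ms, and keeping only the onsets (0–70 ms) whose 30-ms tone fits inside the 100-ms window. With 18 log-spaced frequencies over the three octaves from 5 to 40 kHz, six fall in the low band and six in the high band. `tone_cloud` is a hypothetical name.

```python
import numpy as np

FS = 192_000          # audio sampling rate in Hz (assumption; not stated in the text)
STIM_DUR = 0.100      # s, total stimulus duration
TONE_DUR = 0.030      # s, duration of each pure tone
ONSET_RATE = 200.0    # tone onsets per second (assumed reading of "200 Hz")
RAMP_DUR = 0.005      # s, cosine on/off ramp (assumption)

# 18 frequencies, log-spaced over the 3 octaves from 5 to 40 kHz
OCTAVES = np.arange(18) * 3 / 17
FREQS = 5_000.0 * 2.0 ** OCTAVES
LOW_IDX = np.flatnonzero(OCTAVES <= 1.0)   # 5-10 kHz
HIGH_IDX = np.flatnonzero(OCTAVES >= 2.0)  # 20-40 kHz


def tone_cloud(category, rng, fs=FS):
    """Return (waveform, tone frequencies) for a 'low' or 'high' tone cloud."""
    pool = LOW_IDX if category == "low" else HIGH_IDX
    n_total = int(round(STIM_DUR * fs))
    n_tone = int(round(TONE_DUR * fs))
    n_ramp = int(round(RAMP_DUR * fs))
    t = np.arange(n_tone) / fs
    # Raised-cosine envelope to avoid onset/offset clicks
    env = np.ones(n_tone)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    env[:n_ramp] = ramp
    env[-n_ramp:] = ramp[::-1]
    # Assumption: onsets every 5 ms from 0 to 70 ms, so every tone ends by 100 ms
    n_tones = int(round((STIM_DUR - TONE_DUR) * ONSET_RATE)) + 1
    freqs = FREQS[rng.choice(pool, size=n_tones)]
    wave = np.zeros(n_total)
    for k, f in enumerate(freqs):
        start = int(round(k / ONSET_RATE * fs))
        wave[start:start + n_tone] += env * np.sin(2 * np.pi * f * t)
    # Peak-normalize; the absolute level (60 dB) comes from speaker calibration, not here
    return wave / np.max(np.abs(wave)), freqs
```

Drawing frequencies independently for each onset is one way to make the stream pseudorandom; the original stimulus generator may constrain the draws differently.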
Training took about 4 weeks, with one 2–3-h session per day. Mice were first trained on a version of the task in which only visual cues were presented; after 1 week, most mice performed above 90% correct. Mice were then trained on the sound block (‘attending to sound’), which most learned within a week, and then on the light block (‘ignoring sound’), which also took about 1 week. Finally, mice were trained to perform the sound and light blocks on the same day while carrying a head-mounted dummy miniscope, to ensure that they could switch between tasks on the recording day.
Calcium imaging procedure
AAV9-CaMKII-GCaMP6f (CaMKII, Ca2+/calmodulin-dependent protein kinase II; University of Pennsylvania Vector Core) was injected into the auditory cortex at the following stereotaxic coordinates: 2.92 mm caudal to bregma, 4.2 mm lateral to the midline, and 2.25 mm deep relative to bregma. One week after the injection, a prism probe (diameter: 1.0 mm; length: approximately 4.3 mm; pitch: 0.5; numerical aperture: 0.5; Inscopix) was implanted 0.2 mm lateral to the injection site. Three weeks later, after the calcium signal had been checked, a baseplate was implanted.
Images were acquired at 20 frames per second using Inscopix nVista. At the beginning of each imaging session, the protective cap was removed from the previously implanted baseplate and the microscope was attached. The imaging field of view (maximal size, 1 × 1 mm) was then selected by adjusting the focus. Focal planes were 150–200 μm from the prism. During recording, the LED output power of the microscope was set to 30% of maximum. Timestamps of behavioral events were sent from the Bpod system to the microscope for synchronization.
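Because analysis windows are specified in milliseconds while images are sampled at 20 frames per second, event timestamps have to be converted to frame indices. The sketch below assumes event times have already been expressed in the imaging clock (seconds) and that frame k is acquired at t0 + k/20; the helper names are hypothetical.

```python
import numpy as np

FPS = 20.0  # imaging rate stated in the text


def time_to_frame(t, t0=0.0, fps=FPS):
    """Index of the first frame acquired at or after time t (s)."""
    # Rounding guards against float error, e.g. 1.12 * 20 -> 22.400000000000002
    return int(np.ceil(np.round((t - t0) * fps, 6)))


def window_frames(event_t, start_s, stop_s, t0=0.0, fps=FPS):
    """Frame indices with timestamps in [event_t + start_s, event_t + stop_s)."""
    first = time_to_frame(event_t + start_s, t0, fps)
    last = time_to_frame(event_t + stop_s, t0, fps)
    return np.arange(first, last)
```

Under this convention, the single frame immediately after sound offset would be `time_to_frame(onset + 0.1)`, and `window_frames(onset, -0.5, -0.2)` returns the six frames of a 200–500-ms pre-onset baseline.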
All task and passive sessions were recorded on the same day to prevent population shifts across days. The camera remained attached to the animal’s head for the entire recording period, across all recording blocks. Recording began 15 min after the start of each session, which allowed the mice to switch from the strategy used in the previous task. Each task session was recorded for 10–15 min. Between recording sessions, the mice had a 1-h break with free access to food so that their motivation could recover. The 1-h break also limited photobleaching of the calcium signal, which occurs during prolonged continuous imaging. We left the camera attached during the breaks to avoid potential shifts of the field of view between blocks. After the task sessions, the mice had a 1-h break with free access to water and food so that they lost motivation before passive sound responses were recorded. Passive sound responses were recorded for 5 min in the same chamber used for the task sessions, with the 100-ms tone cloud presented every 3–4 s.
The acquired images were spatially downsampled by a factor of 2 to reduce video size. Videos from different blocks of the same recording session were concatenated and motion-corrected [26] using Mosaic (version 1.2; Inscopix, Palo Alto, CA). After motion correction, the video was split back into blocks, and the spatial and temporal components (Z-scored ΔF/F) of the recorded neurons were extracted with the constrained non-negative matrix factorization for microendoscopic data (CNMF-E) algorithm [21, 31]; during CNMF-E initialization, the minimum local correlation was set to 0.95 and the minimum peak-to-noise ratio to 10. Cell registration was applied to the spatial components, allowing the same neuron to be tracked across sessions on the basis of spatial correlation and centroid distance [23].
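The registration criterion above, spatial correlation plus centroid distance, can be illustrated with a simplified greedy matcher. This is a sketch and not the method of [23]: it uses fixed thresholds (`max_dist` in pixels and `min_corr`, both hypothetical values) and assumes all footprints share the same motion-corrected pixel grid.

```python
import numpy as np


def centroid(footprint):
    """Intensity-weighted (row, col) center of a 2-D spatial footprint."""
    rows, cols = np.indices(footprint.shape)
    total = footprint.sum()
    return np.array([(rows * footprint).sum() / total, (cols * footprint).sum() / total])


def register_cells(session_a, session_b, max_dist=5.0, min_corr=0.7):
    """Greedy one-to-one matching of footprints between two sessions.

    session_a, session_b: lists of 2-D footprints on the same pixel grid.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    candidates = []
    for i, fa in enumerate(session_a):
        ca = centroid(fa)
        for j, fb in enumerate(session_b):
            if np.linalg.norm(ca - centroid(fb)) > max_dist:
                continue  # too far apart to be the same neuron
            r = np.corrcoef(fa.ravel(), fb.ravel())[0, 1]
            if r >= min_corr:
                candidates.append((r, i, j))
    matches, used_a, used_b = [], set(), set()
    # Accept the most correlated pairs first; each neuron is matched at most once
    for r, i, j in sorted(candidates, reverse=True):
        if i not in used_a and j not in used_b:
            matches.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return matches
```

Probabilistic registration methods replace the fixed thresholds with distributions of same-cell and different-cell similarity estimated from the data.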
Data analysis and statistics
Criteria for sound responses: A one-sided bootstrap test (10,000 bootstrap samples) was used to determine whether a neuron responded significantly to the stimulus. The p-value was defined as the probability of observing a calcium signal in the tested frame at least as large as the measured one, under the null hypothesis that the tested frame does not differ from the average intensity of the baseline frames. A neuron was classified as sound responsive if, for a given trial type, at least two consecutive frames within 500 ms after sound onset had bootstrap p-values below 0.01. The baseline window was 200–500 ms before sound onset in the passive block and 200–500 ms before trial initiation in the task block. Because of the limited signal-to-noise ratio of one-photon Ca2+ imaging and the properties of the calcium-trace extraction algorithm, single-neuron analyses focused on excitatory (enhanced) responses.
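One plausible implementation of this criterion, assuming the test is run per trial type on per-trial differences between each tested frame and that trial's mean baseline, with trials resampled with replacement after the differences are centered to impose the null. The exact resampling scheme is not stated in the text, and the function names are hypothetical. The caller supplies the response-window frames (10 frames cover 500 ms at 20 fps) and the baseline frames (before sound onset for passive trials, before trial initiation for task trials).

```python
import numpy as np


def bootstrap_pvalues(baseline, response, n_boot=10_000, rng=None):
    """One-sided bootstrap p-value for each response frame of one neuron.

    baseline: (n_trials, n_baseline_frames); response: (n_trials, n_response_frames).
    H0: the tested frame does not exceed the trial's mean baseline.
    """
    rng = np.random.default_rng() if rng is None else rng
    diff = response - baseline.mean(axis=1, keepdims=True)  # per-trial frame minus baseline
    observed = diff.mean(axis=0)                            # (n_response_frames,)
    centered = diff - observed                              # shift so H0 (zero mean) holds
    n_trials = diff.shape[0]
    idx = rng.integers(0, n_trials, size=(n_boot, n_trials))  # resample trials with replacement
    boot_means = centered[idx].mean(axis=1)                 # (n_boot, n_response_frames)
    exceed = (boot_means >= observed).sum(axis=0)
    return (exceed + 1) / (n_boot + 1)                      # +1 keeps p > 0


def has_consecutive(mask, k=2):
    """True if `mask` contains at least k consecutive True values."""
    run = 0
    for m in mask:
        run = run + 1 if m else 0
        if run >= k:
            return True
    return False


def is_sound_responsive(baseline, response, alpha=0.01, k=2, n_boot=10_000, rng=None):
    """Apply the two-consecutive-frames rule to the bootstrap p-values."""
    return has_consecutive(bootstrap_pvalues(baseline, response, n_boot, rng) < alpha, k)
```

Because the test is one-sided, suppressed responses yield p-values near 1 and are never flagged, which matches the focus on enhanced responses.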
The modulation index is defined as follows:
$$ \mathrm{Modulation\ Index}=\frac{\mathrm{Value}_A-\mathrm{Value}_B}{\mathrm{Value}_A+\mathrm{Value}_B} $$
Value is the peak of the trial-averaged calcium trace within 0–500 ms after sound onset. When the task state is compared with the passive state, A denotes the task state and B the passive state. When the attended state is compared with the ignored state, A denotes the ignored state and B the attended state.
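The index can be computed from two conditions' trial-averaged traces as sketched below. The half-open window [onset, onset + 500 ms), i.e. 10 frames at 20 fps, and the helper names are assumptions. Because Z-scored values can be negative or close to zero, the index is only bounded in [−1, 1] when both peaks are positive; the sketch does not guard against that.

```python
import numpy as np

FPS = 20.0  # imaging rate stated in the text


def peak_response(trials, onset_frame, fps=FPS, window_s=0.5):
    """Peak of the trial-averaged trace in [onset, onset + window_s).

    trials: (n_trials, n_frames) Z-scored dF/F of one neuron in one condition.
    """
    avg = np.asarray(trials, dtype=float).mean(axis=0)
    n = int(round(window_s * fps))  # 10 frames at 20 fps
    return float(avg[onset_frame:onset_frame + n].max())


def modulation_index(value_a, value_b):
    """(A - B) / (A + B), with A and B assigned as described in the text."""
    return (value_a - value_b) / (value_a + value_b)
```

For a task-versus-passive comparison, `value_a` would be the peak from task trials and `value_b` the peak from passive trials.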
A linear-kernel support vector machine (SVM) was used for all decoders. Decoders used the calcium signal from the single frame immediately after sound offset, to minimize the influence of other behavioral factors such as movement. To balance the decoder, the same number of trials (49.00 ± 2.61) was randomly selected from each block. Two-thirds of the data were used for training/validation, and the remaining third for testing. The model was regularized with an L1 penalty to prevent overfitting, with the regularization parameter selected by 5-fold cross-validation. Decoding accuracy was assessed for significance by comparison with decoders trained on shuffled data, in which the behavioral labels were shuffled relative to neuronal activity. All quantified decoding was performed in the full-dimensional neural space. To visualize the high-dimensional ensemble activity, principal component analysis was performed on population responses across attentional conditions (20 randomly selected trials per state), and single-trial data were projected onto the first three principal components.
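A minimal decoding sketch using scikit-learn, which is an assumption: the text does not name the software. `LinearSVC(penalty="l1", dual=False)` gives an L1-regularized linear SVM, and `GridSearchCV` with `cv=5` selects the regularization strength `C` on the training/validation two-thirds; the `C` grid and the helper names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC


def balance_trials(X, y, rng):
    """Randomly subsample so that every label has the same number of trials."""
    labels, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == lab), n, replace=False) for lab in labels]
    )
    return X[keep], y[keep]


def decode(X, y, seed=0):
    """Held-out accuracy and fitted model of an L1-regularized linear SVM.

    X: (n_trials, n_neurons), one frame per trial (first frame after sound offset).
    y: block label per trial.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, stratify=y, random_state=seed
    )
    search = GridSearchCV(
        LinearSVC(penalty="l1", loss="squared_hinge", dual=False, max_iter=10_000),
        param_grid={"C": np.logspace(-3, 2, 6)},  # hypothetical grid
        cv=5,
    )
    search.fit(X_tr, y_tr)  # refits the best C on all training/validation trials
    return search.score(X_te, y_te), search.best_estimator_


def shuffle_null(X, y, n_shuffles=100, seed=0):
    """Test accuracies after shuffling labels relative to neural activity."""
    rng = np.random.default_rng(seed)
    return np.array(
        [decode(X, rng.permutation(y), seed=seed)[0] for _ in range(n_shuffles)]
    )
```

One way to turn the null distribution into a p-value is the fraction of shuffled accuracies at least as large as the real accuracy; the text does not state which comparison was used.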
The modulation vector from the passive to the attended state (vPA) is defined as the normal vector of the decision boundary of the linear SVM that discriminates passive from attended trials, oriented from the passive toward the attended state. Likewise, the modulation vector from the passive to the ignored state (vPI) is the normal vector of the decision boundary of the SVM that discriminates passive from ignored trials, oriented from the passive toward the ignored state. The angle θ between the two modulation vectors is calculated as
$$ \theta =\arccos\left(\frac{\boldsymbol{v}_{PA}\cdot \boldsymbol{v}_{PI}}{\left\Vert \boldsymbol{v}_{PA}\right\Vert\,\left\Vert \boldsymbol{v}_{PI}\right\Vert}\right). $$
The chance level of θ was calculated from the decoder trained using shuffled data.
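Given two fitted linear decoders, the angle formula reduces to a few lines. In scikit-learn's binary `LinearSVC`, `coef_` is the boundary normal pointing toward `classes_[1]`, so encoding passive trials as 0 and attended (or ignored) trials as 1 orients each vector away from the passive state; both decoders must use the same neurons in the same order. The label encoding and the function name are assumptions. Applying the same function to decoders fit on shuffled labels would give the chance-level distribution of θ.

```python
import numpy as np


def modulation_angle_deg(w_pa, w_pi):
    """Angle (degrees) between two decision-boundary normal vectors."""
    w_pa = np.ravel(w_pa)  # accepts coef_ arrays of shape (1, n_neurons)
    w_pi = np.ravel(w_pi)
    cos = np.dot(w_pa, w_pi) / (np.linalg.norm(w_pa) * np.linalg.norm(w_pi))
    # Clip guards against rounding pushing |cos| slightly above 1
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

For example, `modulation_angle_deg(svm_pa.coef_, svm_pi.coef_)` with two hypothetical fitted models returns θ in degrees: 0° when attention and disengagement shift the population along the same direction, 90° when the shifts are orthogonal.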