Subjective Model Predicts Human Brain's Response to Timbral Differences Better Than Objective Models


What can we learn from this paper? The portion of the brain that perceives timbre is not responding to spectral content directly, via something like the spectral centroid, but is activated by aspects of spectral and temporal content that correlate well with subjective descriptions of timbre.

 

Yes, spectrograms of the signals are provided. The spectrogram of each signal shows frequency content over time; that is all it does. (It does not, however, 'hear' the signal content for you, and the spectrogram may not have enough resolution to show artifacts you are able to hear.)
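
To make that concrete, here is a minimal Python sketch of a spectrogram computation using SciPy; the decaying 440 Hz tone is just an assumed stand-in signal. The nperseg window-length parameter sets the trade-off between frequency and time resolution, which is exactly why a given spectrogram may fail to resolve an artifact you can hear.

```python
# Minimal spectrogram sketch: frequency content over time for a mono signal.
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 48_000                                    # sample rate in Hz (assumed)
t = np.arange(0, 2.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) * np.exp(-t)   # decaying 440 Hz tone, a stand-in

# STFT magnitude: rows are frequency bins, columns are time frames.
# nperseg trades frequency resolution against time resolution.
f, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=2048, noverlap=1536)

plt.pcolormesh(frames, f, 10 * np.log10(Sxx + 1e-12), shading='gouraud')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Spectrogram (dB)')
plt.show()
```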

 

What the article describes is how those sounds activate groups of neurons according to perceptual descriptions of timbre rather than according to descriptions of timbre based on spectral weighting toward lower or higher frequencies. In other words, the fMRI results are used to 'see' timbre in the form of the neural response generated, and that response tracks perceptual descriptions of timbre more consistently than mere spectral weighting.

 

"Grey (1977) used MDS to identify three dimensions that best represented the distribution of timbres. The first dimension was related to the spectral energy distribution of the sounds (ranging from a low to high spectral centroid, corresponding to timbral descriptors ranging from dull to bright), and the other two related to temporal patterns, such as whether the onset was rapid (like a struck piano note or a plucked guitar string) or slow (as is characteristic of many woodwind instruments) and the synchronicity of higher harmonic transients." (Allen, Neuroimage)

 

"Elliott et al. (2013) extended Grey’s approach by using 42 natural orchestral instruments from five instrument families, all with the same F0 (311 Hz, the E♭ above middle C). After collecting similarity and semantic ratings, they performed multiple analyses, including MDS. They consistently found five dimensions to be both necessary and sufficient for describing the timbre space of these orchestral sounds." (Allen, Neuroimage) Elliott paper

 

From the Elliott paper, five dimensions of timbre are identified:

 

1) Tendency of an instrument sound to be hard, sharp, explosive, and bright with high frequency balance

2) Tendency of an instrument sound to be ringing, dynamic, vibrato, or have a varying level

3) Tendency of an instrument sound to be noisy, small, and unpleasant

4) Tendency of an instrument sound to be compact, steady, and pure

5) A fifth dimension that had no correlation with any semantic descriptor but still appeared in the rated similarity between sounds

 

Each of these five dimensions describes a continuum between the semantic descriptor and its opposite:

 

1) hard, sharp, explosive, and bright with high frequency balance vs. dull, soft, calm, and having a low frequency balance

2) ringing, dynamic, vibrato, varying level vs. abrupt, static, constant (steady)

3) noisy, small, and unpleasant vs. tonal, big, pleasant

4) compact, steady, and pure vs. scattered, unsteady, and rich.

5) some undescribed quality vs. some other undescribed quality.

 

The Elliott study associated the following acoustic correlates with each of the five dimensions (a rough computational sketch of two of them follows the list):

1) Broad temporal modulation power, fast transients with equally fast harmonics

2) Small temporal modulations of broad spectral patterns, small decreases in the fluctuations of specific harmonics, slower-than-average modulation of partials

3) "Perceptual ordering of this “noisy, small instrument, unpleasant” dimension does not depend on spectrotemporal modulations, or...it does so in a nonlinear way" (Elliott); though the associated descriptors describe audible strain or compression.

4) Distribution of spectral power between odd and even harmonics; a small decrease in spectral power in the range of formants, where spectral modulation was slower; faster amplitude modulations typical

5) Slower amplitude modulations in certain areas of the spectrum; the subtlety of this is likely why it was not associated with a descriptor.
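
To give these correlates some computational flavor, here is a rough Python sketch of two of them: a spectrotemporal modulation power spectrum (a 2-D FFT of the log spectrogram, related to dimensions 1, 2, and 5) and an odd/even harmonic power ratio (dimension 4). The feature definitions in Elliott et al. are more involved than this; the synthetic tone and the tolerances below are illustrative assumptions only.

```python
# Rough sketch of two acoustic correlates from the list above, assuming a
# mono signal at sample rate fs. Illustrative, not Elliott's actual features.
import numpy as np
from scipy import signal

fs = 48_000
t = np.arange(0, 1.0, 1 / fs)
f0 = 311.0                                   # E-flat above middle C, as in Elliott
# Toy harmonic tone with more energy in odd harmonics (dimension 4 territory)
x = sum((1.0 if k % 2 == 1 else 0.3) / k * np.sin(2 * np.pi * k * f0 * t)
        for k in range(1, 16))

# --- Spectrotemporal modulation power (dimensions 1, 2, 5) ---
# 2-D FFT of the log spectrogram: axis 0 ~ spectral modulation,
# axis 1 ~ temporal modulation rate in Hz.
f, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=1024, noverlap=768)
log_spec = np.log(Sxx + 1e-12)
mps = np.abs(np.fft.fftshift(np.fft.fft2(log_spec))) ** 2

# --- Odd vs. even harmonic power (dimension 4) ---
mag = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

def harmonic_power(k):
    band = (freqs > k * f0 - 5) & (freqs < k * f0 + 5)  # +/- 5 Hz (assumed)
    return np.sum(mag[band] ** 2)

odd = sum(harmonic_power(k) for k in range(1, 16, 2))
even = sum(harmonic_power(k) for k in range(2, 16, 2))
print('odd/even harmonic power ratio:', odd / even)
```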

 

The subjective timbre model employed by the subject paper (Allen, Neuroimage) is based on the Elliott model, so understanding that model is crucial. The finding that the neural response aligned more closely with these five dimensions than with other models based on isolated spectral or temporal characteristics of sounds is strong evidence that Elliott's model of timbre is the one most closely associated with real brain activity, and thus with real perception of timbre.

 

The instruments used to produce the sounds introduce far more deviation from the fundamental pitch than audio circuits do, which is why those instruments are used to create the instrumental sounds on a recording rather than oscillators fed through choruses of audio circuits otherwise optimized for linearity. When we do feed oscillators into audio circuits to produce instrumental sounds, we intend for those circuits to introduce a huge amount of nonlinearity and spuriae. However, our sense of the correctness of an instrument's timbre is going to be based on the perceptual model proposed by Elliott and supported by Allen, so the purpose of playback electronics, electromechanical transducers, acoustical spaces, and so on is to reproduce those timbral dimensions unmolested. If those timbral dimensions are affected by adding or changing audio components, we must look to the spectral or temporal basis of each identified component of timbre to determine what happened.

 

Out of the Elliott five-dimensional timbral model, we can see that the following areas of a reproduction system's performance are especially important for accuracy: spectral modulation power, across the entire frequency range and within narrower spectral bands; temporal accuracy in the modulation of spectral power; low intermodulation distortion between spectral bands; low dynamic compression; and minimal impact of large signals and their harmonic structure upon quieter signals and their harmonic structure. In particular, low intermodulation distortion and low dynamic compression stand out as essential to timbral accuracy in a system where frequency response linearity and high signal-to-noise ratio are already assured. Control of reactive loads should not be neglected either, because the system must control not only the dynamic attack of sounds (through high spectral modulation on their leading edges) but also their decay. Modern, highly linear amplifiers, DACs, preamps, etc. do all of these things well, essentially perfectly, but modern speakers, and even headphones, remain the source of most audio-system nonlinearity. Harmonic distortion of audio components, if large, may also have a significant impact on timbre, but those distortions need not be present in the signal chain at all if not desired.
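
Of those, intermodulation distortion is the most directly measurable. Here is a hedged sketch of a SMPTE-style two-tone IMD estimate, with a soft-clipping tanh stage standing in for the device under test; the tone frequencies, 4:1 amplitude ratio, and measurement bandwidths are assumptions for illustration.

```python
# Two-tone (SMPTE-style) intermodulation sketch: drive a nonlinear stage with
# 60 Hz + 7 kHz at 4:1, then measure sideband energy around the 7 kHz carrier.
import numpy as np

fs = 96_000
t = np.arange(0, 1.0, 1 / fs)
x = 0.8 * np.sin(2 * np.pi * 60 * t) + 0.2 * np.sin(2 * np.pi * 7000 * t)

y = np.tanh(1.5 * x)                      # soft-clipping stand-in 'device'

win = np.hanning(len(y))                  # window to reduce spectral leakage
mag = np.abs(np.fft.rfft(y * win))
freqs = np.fft.rfftfreq(len(y), 1 / fs)

def power_at(fc, bw=5.0):
    band = (freqs > fc - bw) & (freqs < fc + bw)
    return np.sum(mag[band] ** 2)

carrier = power_at(7000)
sidebands = sum(power_at(7000 + k * 60) + power_at(7000 - k * 60)
                for k in (1, 2, 3, 4))    # 60 Hz sidebands around the carrier
print(f'IMD: {100 * np.sqrt(sidebands / carrier):.3f}%')
```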

 

Here is how the finding might be applied: a brightening or darkening of perceived timbre upon a component's addition or removal may not be properly restored to a realistic condition solely by modifying the component's frequency response to shift the centroid of its response curve. Because no conventional measurement of audio components directly predicts neural activation along these timbral dimensions, only the linearity of a device passing a signal from input node to output node, we have to synthesize from the available measurement data some expectation of what improvement in the conventional measurements would correct the timbral issue, even if the frequency response looks flat. It is essential to point out that the listener's perception of timbre is not considered within that group of conventional measurements, so the realistic perception of audio, timbral or otherwise, is not considered either; only the electrical linearity of the device passing the signal.

1 hour ago, pkane2001 said:

 

You seem to read too much into this paper. It was an evaluation of computer-based neural network performance, trained on a set of subjective descriptors as input and neuron activation patterns as output. Other models were similarly trained using spectral centroid and spectrotemporal inputs instead of subjective descriptors. Again, only COMPUTER-based neural network performance was compared in this study, not the human brain.

 

The spectrotemporal model was within the margin of error of the timbre model in predicting activation patterns:
 

[attached chart comparing the models' prediction accuracy]

 

There is nothing in the paper to conclude that one set of inputs is used by the brain to recognize timbre over any others. The only conclusion is that a computer-based neural net trained on subjective descriptors as input was better at predicting cortex activation patterns than a couple of other neural nets trained on different sets of inputs (cochlear mean and spectral centroid), while the STM model performed just as well as the timbre model in this prediction task.

 

The result is actually interesting, but doesn't justify any of the conclusions you seem to draw from it related to audio.

 

From the abstract: "In cortical regions at the medial border of Heschl’s gyrus, bilaterally, and regions at its posterior adjacency in the right hemisphere, the timbre model outperforms even the complex joint spectrotemporal modulation model. These findings suggest that the responses of cortical neuronal populations in auditory cortex may reflect the encoding of perceptual timbre dimensions." (Allen)

 

I take it somewhat for granted that these experiments were devised by someone smarter than I am and peer-reviewed by other people smarter than I am. If the conclusions of this article could be dismissed offhand because of a 'gotcha' hidden somewhere in it, the paper would have been taken apart by its reviewers and the broader academic community, and it would have been better not to publish it at all. The paper's methodology compared the outputs of computer brain models to the stimulation observed in human subjects via functional MRI. When the paper's modeling was restricted to the areas of the auditory cortex named above, the perceptual timbre model outperformed the STM model in predicting the real-world fMRI results. You cherry-picked your quote to contain the words "no significant difference", so I'll cherry-pick another:

 

However, we observed that the timbre model outperformed the joint STM model in a subset of the auditory cortical locations. Specifically, the timbre model performed significantly better in regions medial and posterior to HG, particularly in the right hemisphere. This suggests that while the timbre model only contains five features, it may be capturing some semantic or perceptual tuning properties of the auditory cortex that extend beyond those captured by the spectrotemporal model. (Allen)

 

The superiority of the timbre model, based on the 5-D timbre model of Elliott, in predicting activity in some regions of the auditory cortex of human fMRI subjects confirms that the 5-D model accounts for some part of brain activity that the STM model cannot; this is the key finding of the paper. Most of my other analysis comes from Elliott's work rather than the subject Allen paper, describing what may be important for audio designers in accurately capturing timbre, following from Elliott's model of timbre perception.
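
For readers unfamiliar with encoding-model comparisons of this kind, here is a toy sketch (not the authors' code) of the general approach: fit regularized regressions from two candidate feature spaces to a voxel's responses and compare cross-validated prediction accuracy. The feature matrices below are random stand-ins; in the study they would be the five timbre dimensions and the spectrotemporal modulation features of each stimulus.

```python
# Toy encoding-model comparison: which feature space better predicts a voxel?
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_stimuli = 42                                    # one row per instrument sound

timbre = rng.standard_normal((n_stimuli, 5))      # 5-D timbre features (stand-in)
stm = rng.standard_normal((n_stimuli, 50))        # spectrotemporal features (stand-in)

# Simulated voxel response driven mostly by the timbre features
voxel = timbre @ rng.standard_normal(5) + 0.3 * rng.standard_normal(n_stimuli)

for name, X in [('timbre', timbre), ('STM', stm)]:
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))          # regularized regression
    r2 = cross_val_score(model, X, voxel, cv=6).mean()      # cross-validated R^2
    print(f'{name:6s} model: mean CV R^2 = {r2:.2f}')
```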

26 minutes ago, pkane2001 said:

 

It's not a human brain that's processing these inputs -- it is a Mathnet neural network trained on some subset of inputs. There's no proof, nor any reasonable conclusion to be drawn from this study, that the human brain must work just like the timbre model.

 

 

Why involve human subjects and an fMRI facility, then? It seems they would just be wasting the subjects' time if they were merely comparing models to one another to look at the variance between them. The point is that the model based on perceptual descriptors of timbre more closely matched brain activity in certain areas of the brains of human subjects who were listening to actual sounds played for them.

 

The relevance to this forum's category of discussion is that conventional measurements can show you data about a signal passing through a device, but they cannot 'hear' the sound for you. This is the first paper confirming that Elliott's timbre model, which correlates acoustical dimensions of instrument timbre with subjective descriptions of timbre, mimics how human beings process timbre, in terms of neuron activation in parts of the brain, more closely than a model based simply on spectral or temporal distortions (STM), which may be easier to treat or correct in isolation. The Elliott model demands a more complex synthesis of the analysis of those distortions to figure out how a person will perceive them. The problem, in the minds of audio omniskeptics, is that this is a step away from eliminating the pesky listener entirely, because it suggests that a perceptually derived analysis has significant value, rather than eliminating the listener's perception from the analysis. The response is to dig in and insist that this thing is not a thing.

