
Subjective Model Predicts Human Brain's Response to Timbral Differences Better Than Objective Models



16 minutes ago, AcousticTheory said:

What can we learn from this paper? The portion of the brain that perceives timbre is not responding to spectral content directly via the spectral centroid; it is activated by spectral and temporal aspects of the sound that correlate well with subjective descriptions of timbre.

 

Yes, spectrograms of the signals are provided. The spectrogram of each signal shows frequency content over time; that is simply what a spectrogram does. (It does not, however, 'hear' the signal content for you, and it may not have enough resolution to show artifacts you are able to hear.)

 

What the article is describing is how those sounds activate groups of neurons according to perceptual descriptions of timbre rather than according to descriptions based purely on spectral weighting, lower or higher. In other words, the fMRI results are being used to 'see' timbre in the form of the neuron response it generates, and that response is predicted more consistently by perceptual descriptions of timbre than by mere spectral weighting.

 

"Grey (1977) used MDS to identify three dimensions that best represented the distribution of timbres. The first dimension was related to the spectral energy distribution of the sounds (ranging from a low to high spectral centroid, corresponding to timbral descriptors ranging from dull to bright), and the other two related to temporal patterns, such as whether the onset was rapid (like a struck piano note or a plucked guitar string) or slow (as is characteristic of many woodwind instruments) and the synchronicity of higher harmonic transients." (Allen, Neuroimage)

 

"Elliott et al. (2013) extended Grey’s approach by using 42 natural orchestral instruments from five instrument families, all with the same F0 (311 Hz, the E♭ above middle C). After collecting similarity and semantic ratings, they performed multiple analyses, including MDS. They consistently found five dimensions to be both necessary and sufficient for describing the timbre space of these orchestral sounds." (Allen, Neuroimage) Elliott paper

 

From the Elliott paper, five dimensions of timbre are identified:

 

1) Tendency of an instrument sound to be hard, sharp, explosive, and bright with high frequency balance

2) Tendency of an instrument sound to be ringing, dynamic, vibrato, or have a varying level

3) Tendency of an instrument sound to be noisy, small, and unpleasant

4) Tendency of an instrument sound to be compact, steady, and pure

5) A fifth dimension which had no correlation to a semantic descriptor but still appeared in identified similarity between sounds

 

Each of the five dimensions describes a continuum between a set of semantic descriptors and its opposite:

 

1) hard, sharp, explosive, and bright with high frequency balance vs. dull, soft, calm, and having a low frequency balance

2) ringing, dynamic, vibrato, varying level vs. abrupt, static, constant (steady)

3) noisy, small, and unpleasant vs. tonal, big, pleasant

4) compact, steady, and pure vs. scattered, unsteady, and rich.

5) some undescribed quality vs. some other undescribed quality.

 

The Elliott study associated the following acoustic correlates with each of the five dimensions (a rough feature-extraction sketch follows this list):

1) Broad temporal modulation power, fast transients with equally fast harmonics

2) Small temporal modulations of broad spectral patterns, small decreases in the fluctuations of specific harmonics, slower than average modulation of partials.

3) "Perceptual ordering of this “noisy, small instrument, unpleasant” dimension does not depend on spectrotemporal modulations, or...it does so in a nonlinear way" (Elliott); though the associated descriptors describe audible strain or compression.

4) Distribution of spectral power between odd and even harmonics; a small decrease in spectral power in the range of formants, where spectral modulation was slower; faster amplitude modulations typical

5) Slower amplitude modulations in certain areas of the spectrum; subtlety of this is likely the cause for not being associated with a descriptor.
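Several of these correlates are stated in terms of spectrotemporal modulations. As a rough illustration of what that means in practice (a simplification of my own, not Elliott's actual analysis pipeline), a modulation power spectrum can be computed as the 2-D Fourier transform of a log-magnitude spectrogram, giving power as a function of temporal modulation rate and spectral modulation scale:

```python
# Rough sketch (my own simplification, not Elliott's pipeline) of
# spectrotemporal modulation features: the 2-D FFT of a log-magnitude
# spectrogram gives power vs. temporal modulation rate (Hz) and
# spectral modulation scale (cycles/Hz).
import numpy as np
from scipy.signal import stft

def modulation_power_spectrum(x, fs, nperseg=1024, noverlap=768):
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    log_spec = np.log(np.abs(Z) + 1e-9)                # log-magnitude spectrogram
    mps = np.abs(np.fft.fftshift(np.fft.fft2(log_spec))) ** 2
    # Modulation axes: spectral modulation (cycles/Hz) and temporal modulation (Hz)
    spec_mod = np.fft.fftshift(np.fft.fftfreq(f.size, d=f[1] - f[0]))
    temp_mod = np.fft.fftshift(np.fft.fftfreq(t.size, d=t[1] - t[0]))
    return spec_mod, temp_mod, mps
```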

 

The subjective timbre model employed by the subject paper (Allen, Neuroimage) is based on the Elliott model, so understanding that model is crucial. The finding that the neuron response more closely aligned with these five dimensions than other models based on isolated spectral or temporal characteristics of sounds is essentially a strong proof that Elliott's model of timbre is the one most closely associated with real brain activity and thus real perception of timbre.

 

The instruments used to produce the sounds introduce far more deviation from the fundamental pitch than audio circuits do, which is why real instruments are used to create the instrumental sounds on a recording rather than oscillators fed into chains of audio circuits otherwise optimized for linearity. When we do feed oscillators into audio circuits to produce instrumental sounds, we intend for those circuits to introduce a large amount of nonlinearity and spuriae. Our sense of the correctness of an instrument's timbre, however, is based on the perceptual model proposed by Elliott and confirmed by Allen, so the job of playback circuits, electromechanical transducers, acoustical spaces, etc. is to reproduce those timbral dimensions unmolested. If those timbral dimensions are affected by adding or changing audio components, we must look to the spectral and temporal bases of the identified components of timbre to work out what happened.

 

From the Elliott 5-dimensional timbral model, we can see that the following areas of a reproduction system's performance are especially important for accuracy:

1) Spectral modulation power over the entire frequency range and over narrower spectral bands

2) Temporal accuracy of the modulation of spectral power

3) Low intermodulation distortion between spectral bands (a two-tone sketch follows below)

4) Low dynamic compression

5) Minimal impact of large signals and their harmonic structure upon quieter signals and their harmonic structure

In particular, low intermodulation distortion and low dynamic compression stand out as essential to timbral accuracy in a system where frequency-response linearity and high signal-to-noise ratio are already assured. Control of reactive loads should also not be neglected, because the system must control not only the dynamic attack of sounds (through high spectral modulation on their leading edges) but also their decay. Modern highly linear amplifiers, DACs, preamps, etc. do all of these things well, essentially perfectly, while modern speakers and even headphones remain the source of most audio-system nonlinearity. Harmonic distortion of audio components, if large, may also have a significant impact on timbre, but such distortion need not be present in the signal chain at all if it is not desired.
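For the intermodulation point above, here is an illustrative sketch (not a standardized test procedure; the 19 kHz/20 kHz tone pair and the tanh 'device' are stand-ins of my own): pass twin tones through a nonlinearity and look for energy at sum and difference frequencies.

```python
# Illustrative two-tone intermodulation check against a stand-in nonlinearity.
import numpy as np

fs = 96000
t = np.arange(fs) / fs
f1, f2 = 19000.0, 20000.0                       # CCIF-style twin tones
x = 0.5 * np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

def soft_clip(x, drive=1.5):
    """Stand-in nonlinearity for a hypothetical device under test."""
    return np.tanh(drive * x) / np.tanh(drive)

def level_at(sig, freq, fs):
    """Approximate level (dB) of the FFT bin nearest 'freq', Hann-windowed."""
    spec = np.abs(np.fft.rfft(sig * np.hanning(sig.size))) / (sig.size / 4)
    bin_idx = int(round(freq * sig.size / fs))
    return 20 * np.log10(spec[bin_idx] + 1e-15)

y = soft_clip(x)
for f in (f2 - f1, 2 * f1 - f2, f1, f2):        # difference tone, odd-order IMD, fundamentals
    print(f"{f:7.0f} Hz: {level_at(y, f, fs):6.1f} dB")
```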

 

Here is how the finding might be applied: a brightening or darkening of perceived timbre associated with adding or removing a component may not be restored to a realistic condition simply by modifying the component's frequency response to shift the centroid of its response curve. Because no conventional measurement of an audio component directly predicts neuron activation along these timbral dimensions (conventional measurements only characterize the linearity of the device passing a signal from input node to output node), we have to synthesize, from the available measurement data, some expectation of what improvement in the conventional measurements would correct the timbral issue, even when the frequency response already looks flat. It is essential to point out that the listener's perception of timbre is not considered within that group of conventional measurements, so realistic perception of the audio (timbral or otherwise) is not considered either; only the electrical linearity of the device passing the signal is.

 

You seem to read too much into this paper. It was an evaluation of computer-based neural network performance, trained on a set of subjective descriptors as input and neuron activation patterns as output. Other models were similarly trained using spectral-centroid and spectrotemporal inputs instead of subjective descriptors. Again, only COMPUTER-based neural network performance was compared in this study, not the human brain.

 

The spectrotemporal model was within the margin of error of the timbre model in predicting activation patterns:
 

[figure from the paper comparing the models' prediction accuracy]

 

There is nothing in the paper from which to conclude that one set of inputs is used by the brain to recognize timbre over any other. The only conclusion is that a computer-based neural net trained on subjective descriptors as input was better at predicting cortex activation patterns than a couple of other neural nets trained on different sets of inputs (cochlear mean and spectral centroid), while the STM model performed just as well as the timbre model in this prediction task.
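For readers unfamiliar with this kind of comparison, here is a toy sketch of the general voxel-wise encoding-model approach being described (this is not the authors' code, and the feature matrices and fMRI responses below are random placeholders): predict each voxel's response from a stimulus feature set, then score the feature sets by held-out prediction accuracy.

```python
# Toy sketch of comparing feature sets by how well they predict voxel responses.
# All data here are random placeholders, so the scores will hover around chance;
# the point is only the shape of the comparison, not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_stimuli, n_voxels = 42, 50                      # 42 instrument sounds, 50 fake voxels
voxel_responses = rng.normal(size=(n_stimuli, n_voxels))

feature_sets = {
    "timbre (5-D semantic)": rng.normal(size=(n_stimuli, 5)),
    "spectrotemporal (STM)": rng.normal(size=(n_stimuli, 60)),
    "spectral centroid":     rng.normal(size=(n_stimuli, 1)),
}

for name, X in feature_sets.items():
    # Mean cross-validated R^2 across voxels as a crude accuracy summary.
    scores = [cross_val_score(Ridge(alpha=1.0), X, voxel_responses[:, v], cv=5).mean()
              for v in range(n_voxels)]
    print(f"{name:24s} mean held-out R^2 = {np.mean(scores):+.3f}")
```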

 

The result is actually interesting, but doesn't justify any of the conclusions you seem to draw from it related to audio.

11 hours ago, AcousticTheory said:

 

From the abstract: "In cortical regions at the medial border of Heschl’s gyrus, bilaterally, and regions at its posterior adjacency in the right hemisphere, the timbre model outperforms even the complex joint spectrotemporal modulation model. These findings suggest that the responses of cortical neuronal populations in auditory cortex may reflect the encoding of perceptual timbre dimensions." (Allen)

 

I take it somewhat for granted that these experiments were devised by someone smarter than I am, and peer-reviewed by other people smarter than I am. If the conclusions in this article could be dismissed offhand based on a 'gotcha' hidden somewhere in it, the paper would have been taken apart by its reviewers and the broader community of academics, and it would never have been published at all. The methodology of this paper compared the outputs of computer brain models to the observed stimulation in human subjects as monitored using functional MRI. When the paper's modeling was restricted to the above-named areas of the auditory cortex, the perceptual timbre model outperformed the STM model in predicting the real-world fMRI results. You cherry-picked your quote to contain the words "no significant difference," so I'll cherry-pick another:

 

However, we observed that the timbre model outperformed the joint STM model in a subset of the auditory cortical locations. Specifically, the timbre model performed significantly better in regions medial and posterior to HG, particularly in the right hemisphere. This suggests that while the timbre model only contains five features, it may be capturing some semantic or perceptual tuning properties of the auditory cortex that extend beyond those captured by the spectrotemporal model. (Allen)

 

The superiority of the Timbre model, based on the 5D timbre model of Elliott, in predicting activity in some regions of the auditory cortex (of human fMRI subjects) confirms that the 5D model is able to account for some part of brain activity that the STM model cannot, and this is the key finding of the paper. Most of my other analysis comes from Elliott's work and not the subject Allen paper, in describing what may be important to audio designers for accurately capturing timbre, following from Elliott's model of timbre perception.

 

I quoted the results from the conclusion section, but which model performs better is academic. It's the performance of computer neural-network models, each with some specific set of inputs, that is compared in this paper. It's not a human brain that's processing these inputs; it is a MATLAB (MathWorks) neural network trained on some subset of inputs. No proof, nor any reasonable conclusion that the human brain must work just like the timbre model, can be drawn from this study.

 

6 hours ago, Audiophile Neuroscience said:

Hi Paul,

I originally missed the detail of how they generated the audio samples from the 42 natural orchestral instrument stimuli. If I have it right, they used Matlab (MathWorks) to analyze the audio and classify it into the different models, from which they then generated further audio samples for training and testing. Would that be correct? (I have only come across Matlab image tools before.)

 

Anyway, I can relate if you have some reservations about this brave new world of machine neural networks, but in all honesty I would have thought it was right up your alley: machines analyzing and outputting audio samples with Trump-like "HUGE" accuracy, objective measurements and all that jazz. They even have a Delta Audio Matlab!

 

I believe @AcousticTheory was actually just relaying the findings and conclusions in the study and its referenced source material. What do you have an issue with exactly and how would you have done things differently?

 

Hi David,

 

What I disagree with is the conclusion by @AcousticTheory that this study somehow proves anything about how the human brain processes timbre. There was no such analysis done. The goal of the study was to test computer models that predict neuron-activation patterns using computer-generated neural networks with different types of inputs, not the human brain. Drawing any conclusion from this study about how we humans process timbre is not supported by the facts.

43 minutes ago, Audiophile Neuroscience said:

 

One wonders whether objectively derived neural networks based on subjective perceptual models may be a much more realistic way to assess the performance of audio gear when it comes to matching what we hear.

Sorry, couldn't answer in detail earlier (and still really can't from this d*mn tiny screen), but neural networks are hardly an objective way to measure anything. Training them, selecting the right inputs, and choosing the training and testing data is an art rather than a science. What's more, there is no guarantee they will make as accurate a prediction on a wider data set, for example one that includes the same exact piece played through two different amplifiers.

 

The accuracy of the timbre model in predicting a brain activation pattern (63%, ±1%), while perhaps a few percent better than the previous, competing STM model (60%, ±1%), is still very low and not a major improvement, IMHO. Also remember, this is while trying to differentiate between completely different orchestral recordings. I think I could come up with a 100% accurate method of differentiating diverse orchestral pieces using any number of existing measurements, without resorting to subjective descriptors or fMRI (see the sketch below) :)

 

PS: the numbers quoted may be slightly off; I'm not looking at the article right now, so these are from memory.
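To make that claim concrete, here is a toy sketch (my own example, with arbitrary function names and a made-up threshold, not anything from the paper) of how coarse a measurement can be and still tell completely different recordings apart: correlate their long-term average spectra.

```python
# Toy sketch: distinguishing completely different recordings from a single
# coarse measurement (long-term average spectrum). The 0.99 threshold is an
# arbitrary illustration, not a calibrated criterion.
import numpy as np
from scipy.signal import welch

def long_term_spectrum(x, fs):
    """Long-term average power spectrum in dB, via Welch's method."""
    f, pxx = welch(x, fs=fs, nperseg=4096)
    return 10 * np.log10(pxx + 1e-15)

def likely_same_recording(x, y, fs, threshold=0.99):
    """Crude identity check: correlation of the two long-term spectra."""
    a, b = long_term_spectrum(x, fs), long_term_spectrum(y, fs)
    return np.corrcoef(a, b)[0, 1] > threshold
```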

 

6 hours ago, Audiophile Neuroscience said:

I suggest that to differentiate between diverse orchestral pieces with high accuracy would require no measurements at all, for most listeners.

 

Agree. And that's why, while the results of the study are interesting, a 63% recognition rate between different orchestral pieces is just not great. But then, the neural net isn't just recognizing the music; it's mapping the music to predicted areas of neuronal activation in the brain, which is a much more complex task.

 

6 hours ago, Audiophile Neuroscience said:

What interests me is when traditional measurements fail in more complex tasks of timbre perception or other audibility.

 

Perhaps because they were not designed for this purpose? 

7 hours ago, Jud said:


Certainly there are any number of perceptual tasks involving pattern recognition at which humans are currently better than AIs.

 

Of course this comes with a corollary, which is that humans are also prone to recognizing patterns that don’t exist (optical or auditory illusions). Measuring equipment isn’t subject to these (AFAIK). I suppose, depending on how an AI is trained, it might be given the ability to recognize the same sorts of illusions humans are subject to.

 

True. AI is starting to exhibit some of the same issues humans have ('hallucinations' in generative AI). At root, a hallucination is just a pattern match that isn't accurate, causing a missed recognition or an incorrect prediction. These are often the result of skewed or biased data, incomplete training, or over-training of a model, as well as insufficient model size.

 

The human brain contains about 100 trillion connections. ChatGPT-4 is getting close, at 1.8 trillion :)

 

 

 


A more interesting paper (IMHO) than the one in the OP is this:
 

Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones

Taffeta M. Elliott, Liberty S. Hamilton, and Frederic E. Theunissen. https://doi.org/10.1121/1.4770244

 

The five-dimensional timbre model in this paper is what was used to provide the "subjective" inputs to the neural net in the brain-activation study.

