Jud Posted February 8

"Timbre, the perceptual quality of a sound, is defined as everything by which a listener can distinguish between two sounds with the same loudness, pitch, spatial location, and duration." [Citation omitted.] Timbre therefore includes those subtle differences in sound quality pursued by audiophiles.

The academic paper linked below, published in 2019, describes an experiment in which subjects were asked to distinguish timbral differences between sounds while brain response was measured via fMRI. Using fMRI avoids some of the difficulties of experiments that require a conscious verbal response, such as responses that remain subconscious, or subjects who are insufficiently certain to give a definite answer.

The results showed that a pre-existing model of timbre constructed from subjective descriptions of timbral differences predicted brain response as measured by fMRI better than three models constructed from objective measurements of differences in the sounds themselves, or of lower-level auditory processing in the cochlea. That is, at least as of the publication date, models based on objective measurements didn't account for the higher-level auditory processing in the brain captured by the subjective model.

The best performing of the objective models takes into account not only frequency-based but also time-based objective characteristics, which makes a nice counterbalance to objective measurements that tend to concentrate primarily on the frequency-based characteristics of sound reproduction.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5747995/
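To make the comparison concrete: encoding-model studies of this kind typically represent each stimulus as a feature vector under each candidate model, fit those features to the measured voxel responses with cross-validated regression, and rank the models by held-out prediction accuracy. Below is a minimal sketch of that procedure with random stand-in data; the model names, dimensionalities, and regression settings are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch (not the paper's code) of encoding-model comparison:
# each model is a feature matrix describing the stimuli, fit to voxel
# responses with cross-validated ridge regression, judged by held-out
# prediction accuracy. All data and feature names here are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_stimuli, n_voxels = 42, 500          # 42 orchestral sounds, as in the study
voxel_responses = rng.normal(size=(n_stimuli, n_voxels))  # stand-in fMRI data

# Competing feature spaces (columns = model dimensions), all made up here:
models = {
    "spectral_centroid": rng.normal(size=(n_stimuli, 1)),
    "joint_STM":         rng.normal(size=(n_stimuli, 64)),
    "timbre_5D":         rng.normal(size=(n_stimuli, 5)),   # Elliott's 5 dims
}

def encoding_score(features, responses, n_splits=6):
    """Mean held-out correlation between predicted and measured responses."""
    scores = []
    for train, test in KFold(n_splits=n_splits).split(features):
        fit = Ridge(alpha=1.0).fit(features[train], responses[train])
        pred = fit.predict(features[test])
        # correlate prediction with measurement per voxel, then average
        r = [np.corrcoef(pred[:, v], responses[test][:, v])[0, 1]
             for v in range(responses.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))

for name, feats in models.items():
    print(f"{name:18s} mean held-out r = {encoding_score(feats, voxel_responses):+.3f}")
```

With real fMRI data in place of the random arrays, the model whose features best match what the cortex encodes produces the highest held-out correlations.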
AcousticTheory Posted July 28

The reliance upon subjective descriptions of timbre in the preferred model will disappoint the audio omniskeptics, for whom perception doesn't exist, only measurable linearity does.
Audiophile Neuroscience Posted July 29

A quick perusal of the article left me wondering how they constructed the sounds that represented the various models. In general, the sounds constructed using subjective descriptions more accurately predicted mapped areas in the brain, as determined by fMRI, than did sounds constructed using measured values of the stimulus or its downstream representation in the cochlea. Top-down processing of complex subjective musical phenomena like timbre likely better explains their perception, if judged by fMRI responses.
AcousticTheory Posted July 29

What can we learn from this paper? The portion of the brain that perceives timbre is not perceiving spectral content directly based on the spectral centroid, but is being activated by aspects of timbre and time content that correlate well with subjective descriptions of timbre.

Yes, spectrograms of the signals are provided. The spectrogram of each signal shows frequency content over time; that's just what it does. (It does not, however, 'hear' the signal content for you, and the spectrogram may not be of high enough resolution to show artifacts you are able to hear.) What the article is describing is how those sounds activate groups of neurons according to perceptual descriptions of timbre rather than according to descriptions of timbre based on spectral weighting, lower or higher. In other words, the fMRI results are being used to 'see' timbre in the form of the neuron response generated, and that neuron response is being generated more consistently according to perceptual descriptions of timbre than mere spectral weighting.

"Grey (1977) used MDS to identify three dimensions that best represented the distribution of timbres. The first dimension was related to the spectral energy distribution of the sounds (ranging from a low to high spectral centroid, corresponding to timbral descriptors ranging from dull to bright), and the other two related to temporal patterns, such as whether the onset was rapid (like a struck piano note or a plucked guitar string) or slow (as is characteristic of many woodwind instruments) and the synchronicity of higher harmonic transients." (Allen, Neuroimage)

"Elliott et al. (2013) extended Grey’s approach by using 42 natural orchestral instruments from five instrument families, all with the same F0 (311 Hz, the E♭ above middle C). After collecting similarity and semantic ratings, they performed multiple analyses, including MDS. They consistently found five dimensions to be both necessary and sufficient for describing the timbre space of these orchestral sounds." (Allen, Neuroimage)

From the Elliott paper, five dimensions of timbre are identified:
1) Tendency of an instrument sound to be hard, sharp, explosive, and bright with high frequency balance
2) Tendency of an instrument sound to be ringing, dynamic, vibrato, or have a varying level
3) Tendency of an instrument sound to be noisy, small, and unpleasant
4) Tendency of an instrument sound to be compact, steady, and pure
5) A fifth dimension which had no correlation to a semantic descriptor but still appeared in identified similarity between sounds

Each of these five dimensions describes a continuum between the semantic descriptor and its opposite:
1) hard, sharp, explosive, and bright with high frequency balance vs. dull, soft, calm, and having a low frequency balance
2) ringing, dynamic, vibrato, varying level vs. abrupt, static, constant (steady)
3) noisy, small, and unpleasant vs. tonal, big, pleasant
4) compact, steady, and pure vs. scattered, unsteady, and rich
5) some undescribed quality vs. some other undescribed quality

The Elliott study associated the following acoustic correlates with each of the five dimensions:
1) Broad temporal modulation power, fast transients with equally fast harmonics
2) Small temporal modulations of broad spectral patterns, small decreases in the fluctuations of specific harmonics, slower than average modulation of partials
3) "Perceptual ordering of this “noisy, small instrument, unpleasant” dimension does not depend on spectrotemporal modulations, or...it does so in a nonlinear way" (Elliott); though the associated descriptors describe audible strain or compression. 4) Distribution of spectral power between odd harmonics or even harmonics; small decrease in spectral power in the range of formants, where spectral modulation was slower; faster amplitude modulations typical 5) Slower amplitude modulations in certain areas of the spectrum; subtlety of this is likely the cause for not being associated with a descriptor. The subjective timbre model employed by the subject paper (Allen, Neuroimage) is based on the Elliott model, so understanding that model is crucial. The finding that the neuron response more closely aligned with these five dimensions than other models based on isolated spectral or temporal characteristics of sounds is essentially a strong proof that Elliott's model of timbre is the one most closely associated with real brain activity and thus real perception of timbre. The instruments used to produce the sounds introduce much more error to the fundamental pitch than audio circuits can, which is why those instruments are used to create the instrumental sounds on a recording instead of oscillators fed to choruses of audio circuits that were otherwise optimized for linearity. When we do feed oscillators to audio circuits to produce instrumental sounds, we mean for those audio circuits to introduce a huge amount of nonlinearity and spuriae. However, our sense of correctness of timbre of an instrument is going to be based on the perceptual model proposed by Elliott and confirmed by Allen, so the purpose of audio playback circuits and electromechanical means, acoustical spaces, etc., will be to reproduce those timbral dimensions unmolested. If those timbral dimensions are impacted by adding or changing audio components, we must look to the spectral or time basis of all of those identified components of timbre to identify what happened. Out of the Elliott 5-dimensional timbral model, we can see that the following areas of a reproduction system's performance are especially important for accuracy: Spectral modulation power over the entire spectral frequency range or narrower spectral bands, temporal accuracy of the modulation of spectral power, low intermodulation distortion between spectral bands, low dynamic compression, minimized impact of large signals and their harmonic structure upon quieter signals and their harmonic structure. In particular, low intermodulation distortion and low dynamic compression stick out as being essential to timbral accuracy in a system where frequency response linearity and high signal-to-noise ratio are already assured. Additionally, control of reactive loads is something that should not be neglected because of the need to control not only dynamic attack (through high spectral modulation on the leading edges of sounds) but also the decay of sounds. Modern highly linear amplifiers, DACs, preamps, etc. are able to do all of these things well, essentially perfectly, but modern speakers and even headphones stick out as the remaining source of most audio system nonlinearity. Harmonic distortion of audio components, if large, may also have a significant impact on timbre, but it is not necessary to have those distortions present at all in the signal chain if not desired. 
Here is how the finding might be applied: a brightening or darkening of perceived timbre associated with a component's addition or removal may not properly be restored to a realistic condition exclusively by modifying the component's frequency response to move the centroid of its response curve. Because no conventional measurements of audio components directly predict neuron activation along those timbral dimensions (they capture only the linearity of a component passing a signal from input node to output node), we have to synthesize from the available measurement data some expectation of what improvement we would need to see within the conventional measurements to correct a timbral issue, even if the frequency response looks flat. It is essential to point out that the listener's perception of timbre is not considered within that group of conventional measurements, so the realistic perception of audio (timbre or otherwise) is also not considered, only the electrical linearity of the device passing the signal.
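Two of the acoustic correlates recurring in the post above, the spectral centroid (the dull-to-bright axis) and onset speed (plucked vs. bowed attacks), are simple to compute. The following is a minimal sketch using synthetic toy tones, not Elliott's stimuli or analysis code; the 10%-to-90% attack-time definition is an illustrative assumption.

```python
# A minimal sketch (not from Elliott or Allen) of two objective timbre
# correlates: spectral centroid and a crude envelope attack time.
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Amplitude-weighted mean frequency of the magnitude spectrum, in Hz."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

def attack_time(signal, sample_rate, lo=0.1, hi=0.9):
    """Seconds for |signal| to first rise from 10% to 90% of its peak."""
    envelope = np.abs(signal)
    peak = envelope.max()
    t_lo = np.argmax(envelope >= lo * peak)   # first sample above 10% of peak
    t_hi = np.argmax(envelope >= hi * peak)   # first sample above 90% of peak
    return (t_hi - t_lo) / sample_rate

# Toy comparison: a "plucked" tone (abrupt onset, exponential decay) vs. a
# "bowed" tone (slow ramp), both at 311 Hz, the E-flat used by Elliott.
sr = 44100
t = np.linspace(0, 1.0, sr, endpoint=False)
plucked = np.exp(-6 * t) * np.sin(2 * np.pi * 311 * t)
bowed = np.minimum(t / 0.3, 1.0) * np.sin(2 * np.pi * 311 * t)

for name, tone in [("plucked", plucked), ("bowed", bowed)]:
    print(name, f"centroid={spectral_centroid(tone, sr):.0f} Hz",
          f"attack={attack_time(tone, sr):.3f} s")
```

The point of the thread's argument is that numbers like these, taken in isolation, are exactly the "objective" features that predicted cortical response less well than the five perceptual dimensions.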
pkane2001 Posted July 29

16 minutes ago, AcousticTheory said: What can we learn from this paper? …

You seem to read too much into this paper. It was an evaluation of computer-based neural network performance, trained on a set of subjective descriptors as input and neuron activation patterns as output. Other models were similarly trained using spectral centroid and spectrotemporal inputs instead of subjective descriptors. Again, only COMPUTER-based neural network performance was compared in this study, not the human brain. The spectrotemporal model was within the margin of error of the timbre model in predicting activation patterns. There is nothing in the paper to conclude that one set of inputs is used by the brain to recognize timbre over any others. The only conclusion is that a computer-based neural net trained on subjective descriptors as input was better at predicting cortex activation patterns than a couple of other neural nets trained on different sets of inputs (cochlear mean and spectral centroid), while the STM model performed just as well as the timbre model in this prediction task. The result is actually interesting, but doesn't justify any of the conclusions you seem to draw from it related to audio.
AcousticTheory Posted July 29

1 hour ago, pkane2001 said: You seem to read too much into this paper. …

From the abstract: "In cortical regions at the medial border of Heschl’s gyrus, bilaterally, and regions at its posterior adjacency in the right hemisphere, the timbre model outperforms even the complex joint spectrotemporal modulation model. These findings suggest that the responses of cortical neuronal populations in auditory cortex may reflect the encoding of perceptual timbre dimensions." (Allen)

I take it somewhat for granted that these experiments were devised by someone smarter than I am, and peer-reviewed by other people smarter than I am. If the conclusions in this article could be dismissed offhand based on a 'gotcha' hidden somewhere in the article, the paper would be taken apart by its reviewers and the broader community of academics, and it would be better not to publish it at all. The methodology of this paper compared the outputs of computer brain models to the stimulation observed in human subjects as monitored using functional MRI. When the paper's modeling was restricted to the above-named areas of the auditory cortex, the perceptual timbre model outperformed the STM model in predicting the real-world fMRI results. You cherry-picked your quote to contain the words "no significant difference", so I'll cherry-pick another:

"However, we observed that the timbre model outperformed the joint STM model in a subset of the auditory cortical locations. Specifically, the timbre model performed significantly better in regions medial and posterior to HG, particularly in the right hemisphere. This suggests that while the timbre model only contains five features, it may be capturing some semantic or perceptual tuning properties of the auditory cortex that extend beyond those captured by the spectrotemporal model." (Allen)

The superiority of the timbre model, based on the 5D timbre model of Elliott, in predicting activity in some regions of the auditory cortex (of human fMRI subjects) confirms that the 5D model is able to account for some part of brain activity that the STM model cannot, and this is the key finding of the paper. Most of my other analysis comes from Elliott's work rather than the subject Allen paper, in describing what may be important to audio designers for accurately capturing timbre, following from Elliott's model of timbre perception.
Audiophile Neuroscience Posted July 30

6 hours ago, pkane2001 said: You seem to read too much into this paper. …

Hi Paul, I originally missed the detail of how they generated the audio samples from the 42 natural orchestral instrument stimuli. If I have it right, they used Matlab (Mathworks) to analyze the audio and classify it into the different models, and then generated further audio samples for training and testing. Would that be correct? (I have only come across Matlab image tools before.) Anyway, I can relate if you have some reservations about this brave new world of machine neural networks, but in all honesty I would have thought it was right up your alley: machines analyzing and outputting audio samples with Trump-like "HUGE" accuracy, objective measurements and all that jazz. They even have a Delta Audio Matlab! I believe @AcousticTheory was actually just relaying the findings and conclusions of the study and its referenced source material. What do you have an issue with exactly, and how would you have done things differently?
pkane2001 Posted July 30

11 hours ago, AcousticTheory said: From the abstract: "In cortical regions at the medial border of Heschl’s gyrus…" …

I quoted the results from the conclusion section, but which model performs better is academic. It's the performance of computer neural network models with some specific set of inputs that is compared in this paper. It's not a human brain that's processing these inputs; it is a MATLAB-based neural network trained on some subset of inputs. No proof, or any reasonable conclusion that the human brain must work just like the timbre model, can be drawn from this study.
pkane2001 Posted July 30

6 hours ago, Audiophile Neuroscience said: Hi Paul, I originally missed the detail of how they generated the audio samples from the 42 natural orchestral instrument stimuli. … What do you have an issue with exactly, and how would you have done things differently?

Hi David, What I disagree with is the conclusion by @AcousticTheory that this study somehow proves anything about how the human brain processes timbre. There was no such analysis done. The goal of the study was to test a computer model that predicts neuron activation patterns using computer-generated neural networks with different types of inputs, not the human brain. To draw any conclusions from this study as to how we humans process timbre is not supported by the facts.
AcousticTheory Posted July 30

26 minutes ago, pkane2001 said: It's not a human brain that's processing these inputs; it is a MATLAB-based neural network trained on some subset of inputs. No proof, or any reasonable conclusion that the human brain must work just like the timbre model, can be drawn from this study.

Why involve human subjects and an fMRI facility, then? It seems they would just be wasting the subjects' time if they were only comparing models to one another to look at the variance between them. The point is that the model based on perceptual descriptors of timbre more closely matched brain activity in certain areas of the brains of human subjects listening to actual sounds played for them.

The relevance to this forum's category of discussion is that conventional measurements can show you data about a signal passing through a device, but they cannot 'hear' the sound for you. This is the first paper confirming that Elliott's timbre model (which correlates acoustical dimensions of instrument timbre to subjective descriptions of timbre) more closely mimics how human beings process timbre in parts of the brain, judged by neuron activation, than a model based simply on spectral or temporal distortions (STM), which may be easier to treat or correct in isolation. The Elliott model demands a more complex synthesis of the analysis of those distortions to figure out how a person will perceive them.

The problem, in the minds of audio omniskeptics, is that this is a step away from eliminating the pesky listener entirely, because it suggests that a perceptually derived analysis has significant value, instead of eliminating the listener's perception from the analysis. The response is to dig in and insist that this thing is not a thing.
pkane2001 Posted July 30

22 minutes ago, AcousticTheory said: Why involve human subjects and an fMRI facility, then?

It is a study designed to create a computer model predicting neural pattern activation. That is the goal. To do this, human brains are required, to collect information that can be used to train and then test the model.
Audiophile Neuroscience Posted July 31

14 hours ago, pkane2001 said: Hi David, What I disagree with is the conclusion by @AcousticTheory that this study somehow proves anything about how the human brain processes timbre. …

14 hours ago, pkane2001 said: It is a study designed to create a computer model predicting neural pattern activation. That is the goal. To do this, human brains are required, to collect information that can be used to train and then test the model.

Paul, I think we need to accept the aims or goals being explored as stated in the study, notwithstanding the method employed, whether you feel those goals were achieved, or whether the conclusions are valid. But your comments do raise some interesting points (separate post below).

First, the goal of the study is as stated by the authors (and, they conclude, supported by the study), quoted from different parts of the article:

"Here we test an encoding model that is based on five subjectively derived dimensions of timbre to predict cortical responses to natural orchestral sounds. [Other researchers] consistently found [these] five [subjective] dimensions to be both necessary and sufficient for describing the timbre space of these orchestral sounds. The aim of the current study was to determine whether similar dimensions can be identified in the cortical representations of timbral differences."

and they "explored the possibility that a subjectively based model of timbre could predict patterns of cortical activation in response to sound." This was done in comparison to other models that, although I don't recall it being stated specifically, could be held as more "objective" models compared to the "subjective model of timbre".

Second, their conclusions:

"The timbre model provides an efficient representation of processing in human auditory cortex via a compact model whose features are based on subjective ratings of timbre. Our results suggest that the distributed neural representation of timbre in the cortex may align with perceptual categorizations of timbre. Consequently, it may be possible to assign semantic labels to the multidimensional tuning of neuronal populations"

Third, their caveats. First is the universal caveat that we need more studies: "Since the employed timbre model was customized for this particular set of orchestral instruments, studies that test a broader range of stimuli (i.e., more musical instruments, speech, and other natural sounds) are recommended in order to determine the extent of this model’s generalizability." Then: "an area that warrants future research is the development of methods to optimally combine models that explain different parts of the variance (see e.g., de Heer et al., 2017)."
Audiophile Neuroscience Posted July 31

14 hours ago, pkane2001 said: It is a study designed to create a computer model predicting neural pattern activation. That is the goal. …

So, following on from the goals as stated by the authors, it was not, as you say, "designed to create a computer model predicting neural pattern activation". The computer model here is the test method by which they are testing various hypotheses or "goals". Some may object to the method they employed, that is, to drawing conclusions about (as you say) how the human brain processes timbre. I do find that interesting. Two points spring to mind.

They are assuming that the test method used, i.e., models generated from real sounds once processed by a machine neural network (AI to most?), with the same machine then used to generate real sounds representing the various models used in testing (if I have that right), is a valid method. This reminds me of other situations where I wish to establish the actual validity of a test procedure rather than just assuming it, e.g., ABX blind testing procedures and why there might be false negatives. I will not be drawn into an argument about blind testing, except to say that no one has provided to my satisfaction a real test of the test procedure itself, for example its positive and negative predictive values expressed as percentage probabilities.

It also raises the issue of how various audio measurements are sometimes held to represent what we hear. This study measures real sound, compares signals, concludes something about those signals being the same or different, and in this case outputs real sound based on measurements and analysis by this machine "neural network". I do see at least parallels to what other audio test devices do. In this case, using objective measures (and accepting their conclusions) favors a subjective outcome when it comes to the perception of timbre.
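For reference, the positive and negative predictive values mentioned above follow from a test's sensitivity, specificity, and the base rate of real differences via Bayes' rule. Here is a minimal sketch with made-up numbers for a hypothetical ABX-style protocol; nothing in it comes from the paper.

```python
# Standard Bayes identities for predictive values; all inputs are made up
# for illustration.
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV): P(real difference | positive) and P(no difference | negative)."""
    tp = sensitivity * prevalence              # true positives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# E.g. a protocol that detects 70% of real audible differences, with a 5%
# false-positive rate, applied where only 20% of tested differences are real:
ppv, npv = predictive_values(sensitivity=0.70, specificity=0.95, prevalence=0.20)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")  # ~78% and ~93%
```

The point being made above is that these two percentages are exactly what is rarely reported for listening-test protocols themselves.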
PeterG Posted July 31

On 2/8/2024 at 2:37 PM, Jud said: The best performing of the objective models takes into account not only frequency-based but also time-based objective characteristics…

Very interesting, and a reasonable explanation for why, just for example, a tube-based amplifier may produce a more realistic-sounding acoustic guitar or vocal than a solid-state amplifier with better measurements.
Audiophile Neuroscience Posted August 1

10 hours ago, PeterG said: Very interesting, and a reasonable explanation for why, just for example, a tube-based amplifier may produce a more realistic-sounding acoustic guitar or vocal than a solid-state amplifier with better measurements.

One wonders whether objectively derived neural networks based on subjective perceptual models may be a much more realistic way to assess the performance of audio gear when it comes to matching what we hear.
pkane2001 Posted August 1

43 minutes ago, Audiophile Neuroscience said: One wonders whether objectively derived neural networks based on subjective perceptual models may be a much more realistic way to assess the performance of audio gear when it comes to matching what we hear.

Sorry, couldn't answer in detail earlier (and still really can't from this d*mn tiny screen), but neural networks are hardly an objective way to measure anything. Training them, and selecting the right inputs as well as the training and testing data, is an art rather than a science. What's more, there is no guarantee these will make as accurate a prediction on a wider data set, for example one that includes the same exact piece played through two different amplifiers.

The accuracy of the timbre model (63%, +/-1%) in predicting a brain activation pattern, while it may be a few percent better than the previous, competing STM model (60%, +/-1%), is still very low and not a major improvement, IMHO. Also remember, this is while trying to differentiate between completely different orchestral recordings. I think I could come up with a 100% accurate method of differentiating diverse orchestral pieces using any number of existing measurements, without resorting to subjective descriptors or fMRI :)

PS: the numbers quoted may be slightly off, not looking at the article right now. These are from memory.
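Whether 63% +/-1% is meaningfully better than 60% +/-1% is a standard two-mean comparison. A quick sketch, assuming the recalled +/-1% figures are independent standard errors (the paper may report the uncertainty differently, so this is only the shape of the check, not its verdict):

```python
# Sketch of a significance check on two accuracies with stated uncertainties;
# the numbers are pkane2001's recollection, not taken from the paper.
import math

def difference_z(acc_a, se_a, acc_b, se_b):
    """z-score of the accuracy difference, assuming independent normal errors."""
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)

z = difference_z(0.63, 0.01, 0.60, 0.01)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests a real difference at ~95% confidence
# z comes out near 2.12 with these recalled numbers; with standard errors of
# 2% instead, the same 3-point gap would not be distinguishable from noise.
```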
Audiophile Neuroscience Posted August 1

1 hour ago, pkane2001 said: …neural networks are hardly an objective way to measure anything. Training them, and selecting the right inputs as well as the training and testing data, is an art rather than a science. What's more, there is no guarantee these will make as accurate a prediction on a wider data set, for example one that includes the same exact piece played through two different amplifiers.

Perhaps the word "measurement" is out of context in relation to ANNs, which seem to talk about "input signals" or nodes that, depending on the model, get somehow 'measured'/classified. A grayscale image may have nodes representing the level of each pixel, 'measuring' tens of thousands of input nodes. The important thing for me, however, is their predictive value in outputs, especially in the area of so-called "semantic segmentation" of complex nonlinear functions, as perception of timbre qualities might be. Those predictions can be objectively measured in various ways; the most obvious is accuracy, how often it gets it right. Both an objective and a meaningful figure. Traditional audio measurements do not always correlate to perceptual predictions. Even more interesting, as one person with a Bachelor of Engineering in Artificial Intelligence & Robotics put it, when measuring the learning performance of ANNs there is also a matrix (a confusion matrix) that "breaks down true positives, true negatives, false positives, and false negatives, giving a clearer picture, especially in binary classification tasks".
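The matrix described in that last quote is a confusion matrix. A small sketch with made-up labels and predictions, showing how the four cell counts give a fuller picture than the single accuracy figure:

```python
# Confusion-matrix breakdown for a binary classification task; the label
# and prediction arrays here are invented for illustration.
import numpy as np

actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
predicted = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

tp = int(np.sum((actual == 1) & (predicted == 1)))  # true positives
tn = int(np.sum((actual == 0) & (predicted == 0)))  # true negatives
fp = int(np.sum((actual == 0) & (predicted == 1)))  # false positives
fn = int(np.sum((actual == 1) & (predicted == 0)))  # false negatives

accuracy = (tp + tn) / len(actual)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}  accuracy={accuracy:.0%}")
# Accuracy alone hides the error structure: a false positive and a false
# negative contribute identically to the 80% figure but mean different things.
```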
Audiophile Neuroscience Posted August 1

2 hours ago, pkane2001 said: The accuracy of the timbre model (63%, +/-1%) in predicting a brain activation pattern, while it may be a few percent better than the previous, competing STM model (60%, +/-1%), is still very low and not a major improvement, IMHO. Also remember, this is while trying to differentiate between completely different orchestral recordings.

I do think you may have been a little dismissive of this study, based on your previous comments regarding the study's stated goals and its results/conclusions (see my previous posts and those from @AcousticTheory). You correctly point out one result, but in isolation from another important result according to the authors. The subjective timbre model does more accurately predict the mapping to some specific auditory cortical locations, "performing significantly better" than the spectrotemporal model, suggesting it is "capturing some semantic or perceptual tuning properties of the auditory cortex that extend beyond those captured by the spectrotemporal model". That is a finding related to how humans process timbre.

"The timbre model provides an efficient representation of processing in human auditory cortex via a compact model whose features are based on subjective ratings of timbre. Our results suggest that the distributed neural representation of timbre in the cortex may align with perceptual categorizations of timbre. Consequently, it may be possible to assign semantic labels to the multidimensional tuning of neuronal populations."
Jud Posted August 1

3 hours ago, pkane2001 said: I think I could come up with a 100% accurate method of differentiating diverse orchestral pieces using any number of existing measurements, without resorting to subjective descriptors or fMRI :)

Certainly there are any number of perceptual tasks involving pattern recognition at which humans are currently better than AIs. Of course this comes with a corollary, which is that humans are also prone to recognizing patterns that don’t exist (optical or auditory illusions). Measuring equipment isn’t subject to these (AFAIK). I suppose, depending on how an AI is trained, it might be given the ability to recognize the same sorts of illusions humans are subject to.
Audiophile Neuroscience Posted August 1

16 minutes ago, Jud said: Certainly there are any number of perceptual tasks involving pattern recognition at which humans are currently better than AIs. …

I suggest that differentiating between diverse orchestral pieces with high accuracy would require no measurements at all, for most listeners. What interests me is when traditional measurements fail in more complex tasks of timbre perception or other audibility. In the setting of poor correlation between audio measurements and audibility, is there a better and/or complementary method, resembling the way humans perceive, of the kind this study hints at? A method that might help answer questions about why one component sounds different, whether differences exist beyond illusion or bias, assist the interpretation of ABX tests, and provide a platform for increasing accuracy, ever learning without bias, and delivering predictive values expressed as probabilities.
pkane2001 Posted August 1

6 hours ago, Audiophile Neuroscience said: I suggest that differentiating between diverse orchestral pieces with high accuracy would require no measurements at all, for most listeners.

Agree. And that's why, while the results of the study are interesting, a 63% recognition rate between different orchestral pieces is just not great. But then, the neural net isn't just recognizing the music, it's mapping the music to predicted areas of neuronal activation in the brain. A much more complex task.

6 hours ago, Audiophile Neuroscience said: What interests me is when traditional measurements fail in more complex tasks of timbre perception or other audibility.

Perhaps because they were not designed for this purpose?
pkane2001 Posted August 1

7 hours ago, Jud said: Certainly there are any number of perceptual tasks involving pattern recognition at which humans are currently better than AIs. …

True. AI is starting to exhibit some of the same issues humans have ('hallucinations' in generative AI). All it is is a pattern match that isn't accurate, causing a missed recognition or an incorrect prediction. These are often the result of skewed/biased data, incomplete training, or over-training of a model, as well as insufficient model size. The human brain contains about 100 trillion connections. ChatGPT 4 is getting close at 1.8 trillion :)
pkane2001 Posted August 1

A more interesting paper (IMHO) than the one in the OP is this:

Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones. Taffeta M. Elliott, Liberty S. Hamilton, and Frederic E. Theunissen. [https://doi.org/10.1121/1.4770244]

The five-dimensional timbre model in this paper is what is used to provide the "subjective" inputs to the neural net in the brain activation study.
Audiophile Neuroscience Posted August 2

12 hours ago, pkane2001 said: A more interesting paper (IMHO) than the one in the OP is this: Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones …

Yes, I had a look previously, as it is referenced quite a bit within the present study; indeed, this study builds on that work by comparing models of temporal and spectral characteristics. That, I think, is the point of this study and what makes it so interesting.
yamamoto2002 Posted September 29

Is the source code and training dataset available on GitHub / paperswithcode / HuggingFace to reproduce their result?