Tutorial for Time Scale Modification Subjective Quality Assessment
What will you be doing?
In this testing, you will be asked for your opinion on the quality of a range of audio files that have been processed using Time-Scale Modification.
We are interested in your view on the quality of the Time-Scale processing. The test page will look like the image below. To rate files, click the play button and adjust the location of the slider
for each file, before proceeding to the next page. Reference files for each processed signal are provided at the top of each page.
These are useful for judging the impact of the Time-Scaling processing. You can listen to the files as many times as you want.
You will need to have played all reference and processed signals before moving to the next page.
Please use the best headphones or speakers that you have access to, in a quiet room, while doing the testing.
What is Time-Scale Modification?
Time-Scale Modification (TSM) is the process of changing the length of a signal without changing the frequency content of the signal.
For a recording of speech, this is analogous to talking faster or slower. Like-wise it is like a musician playing faster or slower.
Note the difference between this and changing the speed on a record player. As the record spins faster, the pitch increases.
This is what TSM avoids.
What is ideal TSM?
Ideal TSM is an important concept in this testing.
For Speech and Musical signals, the processed signal should be indistinguishable from person talking or musician playing faster or slower.
As an example, if you are to clap once per second, and then increase to twice per second, the sound of the clap instelf does not change,
only how often the claps occur. However, if you were to sing a song, slower than the normal tempo, you would need to hold the same notes longer than before.
Current TSM algorithms do not currently acheive this, however there has been a large amount of progress in the past 10 years.
This research will form the basis for the development of future TSM algorithms.
Common Artefacts in TSM
There are multiple methods used for TSM. These methods all give various advantages and disadvantages.
Commonly, these disadvantages are artefacts that are added to the signal during the processing. Below are some of the more common artifacts.
Audio examples are given.
Pitch Modification
This artefact occurs when maintaining the frequency content of the signal is not considered by the TSM algorithm.
If the distance between samples is increased, the signal will be longer, but the pitch will also be lower.
In this audio example the original signal is followed by a signal that has been slowed down.
You should be able to hear a difference in pitch between the two signals.
Transient Duplication/Skipping
This artefact of time domain TSM methods occurs when the same part of the signal is used multiple times at different offsets
and is most noticeable with percussive signals. Transients are either added to the output signal multiple times (Slower) or removed (Faster).
In this audio example the original signal is followed by copies of the signal that have been slowed down.
You should be able to hear the start of the bell happening multiple times.
Temporal Smearing
This artefact found in frequency domain methods occurs due to the stationarity requirement of the short time Fourier transform.
As the length of the frame increases, the frequency resolution increases, but the temporal resolution decreases and as a result
the location of transients can only be approximated.
In this audio example the original signal is followed by slow version of the same signal.
You should be able to hear the attack of the sound lengthen in the processed version.
Reverberation
This artefact, found mostly in frequency domain methods, is caused by a loss of phase coherence between bins.
In this audio example the original signal is followed by a signal that has been slowed down.
You should be able to hear that the second signal sound more distant or has more of an echo.
Musical Noise
This artefact occurs mostly in the high frequency parts of signals. Often, this part of the signal is filled with transient, percussive, noisy sources.
As the Fourier transform assumes the signal is a combination of sinusoids, the random nature of these sound sources becomes ordered,
but with a non-natural harmonics series.
In this audio example the original signal is followed by a signal that has been slowed down. This is the least noticeable artefact.
Listen for a metallic addition to the sound.
Suggestions for rating
If the quality of the audio is the same as the reference, rate high.
Don't consider the speed when rating quality, ie."This sounds good for the time scale". This will lead to skewed results.
More distortion and artefacts means a lower score.