EEEM030: Speech & Audio Processing & Recognition Semester 1 2021/2
FHEQ Level 7 Examination
Q1.
Assume the relationship between the speech signal s(n) produced at the lips and the excitation signal u(n) from the vocal cords can be described by the vocal tract filter using the following difference equation:

$$
s[n] = 2a\cos(\beta)\,s[n-1] - a^{2}\,s[n-2] + b^{N}\,s[n-N] - 2a\,b^{N}\cos(\beta)\,s[n-N-1] + a^{2}\,b^{N}\,s[n-N-2] + u[n] - b\,u[n-1]
$$

where n is the discrete time index, and a, b, N, and β are the coefficients that determine the shape of the filter, with β in radians. Denote S(z) as the Z-transform of the speech signal s(n) and U(z) as the Z-transform of the excitation signal u(n). Assume the values of the above coefficients and the sampling rate F (in kHz) are given in Table 1.1.
| Second rightmost digit of student number (URN) | a | b | N | β | F (kHz) |
|---|---|---|---|---|---|
| 0 | 0.4 | 0.8 | 4 | π/6 | 10 |
| 1 | -0.4 | 0.3 | 4 | 4π/5 | 6 |
| 2 | -0.3 | 0.5 | 4 | π/4 | 12 |
| 3 | 0.7 | 0.6 | 4 | π/8 | 9 |
| 4 | -0.6 | 0.4 | 4 | π/5 | 8 |
| 5 | -0.3 | 0.5 | 4 | π/4 | 12 |
| 6 | 0.7 | 0.6 | 4 | π/8 | 9 |
| 7 | -0.4 | 0.3 | 4 | 4π/5 | 6 |
| 8 | 0.4 | 0.8 | 4 | π/6 | 10 |
| 9 | -0.6 | 0.4 | 4 | π/5 | 8 |
Table 1.1. The values of the parameters specified with respect to your student number (URN). For example, if your URN is “6789012”, the second rightmost digit is “1”, so you will use: a = −0.4, b = 0.3, N = 4, β = 4π/5, and F = 6 kHz.
(a) Derive the transfer function of the vocal tract filter by using the Z-transform:
H(z) = S(z)/U(z) [20 %]
(b) Calculate the poles and zeros of this filter. [20 %]
(c) Sketch the pole-zero plot of H(z). [15 %]
(d) Estimate the formant frequencies (in Hz) that correspond to the poles displayed in the above pole-zero plot. [15 %]
(e) Analyse the bounded-input bounded-output (BIBO) stability of the system. If the
values of a and b are both doubled, analyse the stability of the system again. [15 %]
(f) Re-estimate the formant frequencies that correspond to the poles when N = 6. [15 %]
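The numerical side of parts (b), (d) and (e) can be cross-checked with a short script. The following is a minimal sketch, not a model answer: it assumes the bracketed terms in the difference equation are the powers a^2 and b^N, reads the numerator and denominator coefficients of H(z) = S(z)/U(z) directly off that equation, and uses NumPy's root finder. The function names are illustrative, and the conversion from pole angle to frequency assumes the sampling rate is supplied in Hz.

```python
import numpy as np

def filter_coefficients(a, b, N, beta):
    """Coefficients of H(z) in ascending powers of z^-1, read off the difference equation."""
    num = np.array([1.0, -b])            # u[n] - b u[n-1]
    den = np.zeros(N + 3)                # powers z^0 ... z^-(N+2)
    den[0] += 1.0
    den[1] += -2.0 * a * np.cos(beta)    # s[n-1] term moved to the left-hand side
    den[2] += a ** 2                     # s[n-2] term
    den[N] += -(b ** N)                  # s[n-N] term
    den[N + 1] += 2.0 * a * b ** N * np.cos(beta)
    den[N + 2] += -(a ** 2) * b ** N
    return num, den

def poles_and_zeros(num, den):
    # Multiplying by z^(N+2) gives polynomials in z with the same coefficient order,
    # which is the highest-power-first order that np.roots expects.
    # The extra zeros at the origin introduced by that multiplication are not returned.
    return np.roots(den), np.roots(num)

def is_bibo_stable(poles):
    # A causal LTI filter is BIBO stable when every pole lies strictly inside the unit circle.
    return bool(np.all(np.abs(poles) < 1.0))

def pole_frequencies_hz(poles, fs_hz, tol=1e-9):
    # Keep one pole of each complex-conjugate pair (positive imaginary part)
    # and map its angle in radians per sample to a frequency in Hz.
    upper = poles[poles.imag > tol]
    return np.sort(np.angle(upper) * fs_hz / (2.0 * np.pi))

# Example row from the Table 1.1 caption: a = -0.4, b = 0.3, N = 4, beta = 4*pi/5, F = 6 kHz.
num, den = filter_coefficients(-0.4, 0.3, 4, 4 * np.pi / 5)
poles, zeros = poles_and_zeros(num, den)
print("poles:", np.round(poles, 4))
print("zeros:", np.round(zeros, 4))
print("stable:", is_bibo_stable(poles))
print("pole frequencies (Hz):", pole_frequencies_hz(poles, 6000.0))
```

Changing `N` to 6, or doubling `a` and `b`, in the call to `filter_coefficients` repeats the check for parts (f) and (e). Poles on the real axis are left out of the frequency list because they do not produce a resonance peak between 0 Hz and the Nyquist frequency.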
Q2.
(a) The Mel-frequency scale was designed as a perceptual scale of pitches judged by listeners to be equal in distance from one another:
fmel = 1127.01048 ln(1 + f/700)
A test is carried out where a listener is asked to judge the intervals between three different sinusoidal tones. It is judged that the following tones are subjectively equidistant in frequency: f1, f2, and f3. Using the values of f1 and f2 in Table 2.1, estimate f3 (in Hz), assuming f3 is the highest of the three tones. [20 %]
(b) Humans perceive sounds to arrive from a certain location.
(i) Using about 100 words, explain the roles of the interaural phase difference
(IPD) and interaural level difference (ILD) for human sound source localization over a broadband frequency range. [10 %]
(ii) The IPD can be calculated as Φ = 2πfr(θ + sin θ)/c. A sound source
emitting a pure tone at frequency f is located at the angle θ from the median plane.
Using the values of f and θ in Table 2.1, calculate the IPD for a listener with a head radius of r = 0.085 m, assuming that c = 344 m/s. Is the IPD cue reliable for source localization? [20 %]
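For part (a), the mel mapping and its inverse can be sketched as below. This is an illustration under stated assumptions rather than a model answer: it assumes the common form of the formula with f/700 inside the logarithm (f in Hz), and it reads "subjectively equidistant" as equal spacing on the mel axis, so that mel(f3) − mel(f2) = mel(f2) − mel(f1). The function names are illustrative.

```python
import math

MEL_SCALE = 1127.01048   # constant from the formula in part (a)
MEL_BREAK_HZ = 700.0     # assumed break frequency in the denominator

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the mel scale."""
    return MEL_SCALE * math.log(1.0 + f_hz / MEL_BREAK_HZ)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return MEL_BREAK_HZ * (math.exp(m / MEL_SCALE) - 1.0)

def next_equidistant_tone(f1_hz, f2_hz):
    """Frequency f3 > f2 such that f1, f2, f3 are equally spaced in mel."""
    m1, m2 = hz_to_mel(f1_hz), hz_to_mel(f2_hz)
    return mel_to_hz(m2 + (m2 - m1))

# Example values from the Table 2.1 caption: f1 = 1030 Hz, f2 = 1780 Hz.
print(next_equidistant_tone(1030.0, 1780.0))
```

Because the mapping is logarithmic in (1 + f/700), equal mel spacing is equivalent to (1 + f3/700) = (1 + f2/700)^2 / (1 + f1/700), which gives a quick hand check on the script.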
| Second rightmost digit of student number (URN) | f1 (Hz) | f2 (Hz) | f (Hz) | θ (degrees) |
|---|---|---|---|---|
| 0 | 510 | 970 | 3320 | 40 |
| 1 | 1030 | 1780 | 1400 | 30 |
| 2 | 1340 | 2230 | 2660 | 50 |
| 3 | 1030 | 1780 | 1400 | 30 |
| 4 | 510 | 970 | 3320 | 40 |
| 5 | 1340 | 2230 | 2660 | 50 |
| 6 | 1030 | 1780 | 1400 | 30 |
| 7 | 510 | 970 | 3320 | 40 |
| 8 | 1340 | 2230 | 2660 | 50 |
| 9 | 510 | 970 | 3320 | 40 |
Table 2.1. The values of the parameters specified with respect to your student number (URN). For example, if your URN is “6789012”, the second rightmost digit is “1”, so you will use: f1 = 1030 Hz, f2 = 1780 Hz, f = 1400 Hz, and θ = 30 degrees.
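For part (b)(ii), the IPD expression can be evaluated as in the sketch below. Two points are assumptions rather than part of the question: θ is converted from degrees to radians before it enters θ + sin θ, and the reliability check uses the common rule of thumb that a phase difference is unambiguous only while its magnitude stays below π, since larger values wrap around. The function names are illustrative.

```python
import math

HEAD_RADIUS_M = 0.085      # r, from the question
SPEED_OF_SOUND_M_S = 344.0 # c, from the question

def interaural_phase_difference(f_hz, theta_deg, r=HEAD_RADIUS_M, c=SPEED_OF_SOUND_M_S):
    """IPD in radians: 2*pi*f*r*(theta + sin(theta))/c, with theta converted to radians."""
    theta = math.radians(theta_deg)
    return 2.0 * math.pi * f_hz * r * (theta + math.sin(theta)) / c

def is_phase_unambiguous(phi):
    # Assumed criterion: beyond pi the phase wraps, so different angles give the same cue.
    return abs(phi) < math.pi

# Example values from the Table 2.1 caption: f = 1400 Hz, theta = 30 degrees.
phi = interaural_phase_difference(1400.0, 30.0)
print(phi, is_phase_unambiguous(phi))
```

The same two calls with another row of Table 2.1 show how quickly the phase grows with frequency, which is the basis for discussing when the IPD cue stops being reliable.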
A set of domain-specific language models is developed as part of a restaurant recommendation service in the USA. Based on thousands of utterances recorded during trials, the raw 1-gram and 2-gram counts have been used to obtain corresponding unigram and bigram models, L0(1) and L0(2), with the probabilities shown in Tables 2.2 and 2.3, respectively.
| w | I | want | to | eat | Chinese | food | lunch | spend | <UNK> | <\s> |
|---|---|---|---|---|---|---|---|---|---|---|
| P(w) | 0.0418 | 0.0153 | 0.0123 | 0.0398 | 0.0026 | 0.0180 | 0.0056 | 0.0046 | 0 | 0.1538 |
Table 2.2. Unigram language model L0(1) for the restaurant recommendation service based on raw counts. Key:
| w | I | want | to | eat | Chinese | food | lunch | spend | <UNK> | <\s> |
|---|---|---|---|---|---|---|---|---|---|---|
| P(w\|<s>) | 0.2500 | 0.0153 | 0.0399 | 0.0123 | 0.0026 | 0.0180 | 0.0056 | 0.0046 | 0 | 0 |
| P(w\|I) | 0.0020 | 0.3265 | 0 | 0.0036 | 0 | 0 | 0 | 0.0008 | 0 | 0.0122 |
| P(w\|want) | 0.0022 | 0 | 0.6559 | 0.0011 | 0.0065 | 0.0065 | 0.0054 | 0.0011 | | |
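To illustrate how a bigram model of this kind scores a word sequence, the sketch below applies the chain rule to the rows of the bigram table that survive in this copy. It is a hedged illustration only: the conditioning context of the first row is assumed to be a sentence-start marker, written `<s>` here, the last row is incomplete so its final two entries are simply absent, and the function and variable names are illustrative.

```python
import math

VOCAB = ["I", "want", "to", "eat", "Chinese", "food", "lunch", "spend", "<UNK>", r"<\s>"]

# Rows copied from the bigram table; "<s>" as the first context is an assumption.
# The "want" row is truncated in the source, so it has only eight entries.
ROWS = {
    "<s>":  [0.2500, 0.0153, 0.0399, 0.0123, 0.0026, 0.0180, 0.0056, 0.0046, 0.0, 0.0],
    "I":    [0.0020, 0.3265, 0.0, 0.0036, 0.0, 0.0, 0.0, 0.0008, 0.0, 0.0122],
    "want": [0.0022, 0.0, 0.6559, 0.0011, 0.0065, 0.0065, 0.0054, 0.0011],
}

# Flatten to {(previous word, word): probability}; zip stops at the shorter list,
# so missing entries stay missing instead of being silently treated as zero.
BIGRAM = {(prev, w): p for prev, row in ROWS.items() for w, p in zip(VOCAB, row)}

def bigram_log_prob(words, bigram=BIGRAM, start="<s>"):
    """Natural-log probability of a word sequence under the unsmoothed bigram model."""
    log_p = 0.0
    prev = start
    for w in words:
        p = bigram[(prev, w)]        # KeyError if the entry is not in the table
        if p == 0.0:
            return float("-inf")     # an unseen bigram zeroes the whole product
        log_p += math.log(p)
        prev = w
    return log_p

print(math.exp(bigram_log_prob(["I", "want", "to"])))
```

Working in log probabilities avoids numerical underflow on longer utterances, and the `-inf` result makes visible why a model built from raw counts cannot score any sequence containing an unseen bigram.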