EEEM030: Speech & Audio Processing & Recognition 2019/0
FHEQ Level 7 Examination
LSA 2019/0
Section A
A1.
(a) Describe the masking effect in human hearing. Explain how the masking effect could be used for lossy compression of speech. [20 %]
(b) A vocal tract filter is described by the following difference equation:
S[n] = μ ∙ S[n − 2] + u[n],
where S[n] is the produced speech signal, u[n] is the excitation signal from the vocal cords, n is the discrete time index, and μ is a constant coefficient. Take the value of μ from Table 1.
Second rightmost digit of student number (URN) | μ | F
0 | -0.36 | 8
1 | -0.04 | 10
2 | -0.25 | 22
3 | -0.09 | 16
4 | -0.16 | 32
5 | 0.25 | 12
6 | 0.49 | 18
7 | -0.81 | 14
8 | 0.64 | 20
9 | 0.09 | 24
Table 1 The values of the parameters specified with respect to your student number (URN). Example: if your URN is “6789012”, the second rightmost digit is “1”, so you will use: μ = −0.04, and F = 10.
(i) Determine the transfer function of this filter, i.e. the z-transform H(z), and the causal impulse response h[n]. Show the working that leads to your answer. [20 %]
(ii) Sketch the pole-zero plot of H(z) and the magnitude spectrum |H(ω)|. [20 %]
(iii) Assuming the sampling rate is F kHz (take the value that corresponds to your URN), estimate the formant frequencies (in Hz) that correspond to the poles shown in the above pole-zero plot. Show the working that leads to your answer. [20 %]
(iv) Suppose the sign of μ is reversed, for example, changing from μ = −0.04 to μ = 0.04 if the second rightmost digit of your URN is “1”, or changing from μ = 0.64 to μ = −0.64 if the second rightmost digit of your URN is “8”. Re-sketch the pole-zero plot of H(z) and the magnitude spectrum |H(ω)|, and compare them with the plots in (ii). Explain how you reach the answer. [20 %]
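As a rough sketch of the quantities parts (i) to (iv) ask about: taking the z-transform of the difference equation gives H(z) = 1/(1 − μz⁻²), so both poles satisfy z² = μ and there is a double zero at the origin. The Python below is illustrative only. The function names are invented for this sketch, and the values μ = −0.04 and F = 10 are simply the example row quoted in the Table 1 caption; it is not a model answer.

```python
import cmath
import math


def impulse_response(mu, n_samples):
    """First n_samples of h[n], found by running S[n] = mu*S[n-2] + u[n]
    with u[n] set to a unit impulse and zero initial conditions."""
    h = []
    for n in range(n_samples):
        prev = h[n - 2] if n >= 2 else 0.0
        h.append(mu * prev + (1.0 if n == 0 else 0.0))
    return h


def poles(mu):
    """Roots of z^2 - mu = 0, i.e. the poles of H(z) = 1 / (1 - mu*z^-2)."""
    root = cmath.sqrt(mu)
    return [root, -root]


def pole_frequencies_hz(mu, fs_hz):
    """Map each pole angle to a frequency: f = |angle| * fs / (2*pi)."""
    return [abs(cmath.phase(p)) * fs_hz / (2 * math.pi) for p in poles(mu)]


if __name__ == "__main__":
    mu, fs_hz = -0.04, 10_000.0  # example row from the Table 1 caption
    print(impulse_response(mu, 6))
    print(poles(mu))
    print(pole_frequencies_hz(mu, fs_hz))
```

For negative μ the poles sit on the imaginary axis at angles ±π/2, which points to a single resonance near a quarter of the sampling rate. Reversing the sign of μ moves the poles onto the real axis (angles 0 and π), so the spectral peaks shift to 0 Hz and to half the sampling rate.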
(a) A signal containing a quasi-stationary segment of a vowel is band-limited using an ideal low-pass filter with a cut-off frequency at the Nyquist frequency, such that the power spectral density of the signal contains peaks at the first M harmonics of the vowel, and higher harmonics are cut off. Assume the sample rate is F kHz. Calculate the frequency difference between the first and the tenth harmonic, using the parameter values corresponding to your student number (URN), as given in Table 2. [30 %]
Second rightmost digit of student number (URN) | M | F
0 | 20 | 8
1 | 22 | 8
2 | 50 | 22
3 | 38 | 16
4 | 90 | 32
5 | 28 | 12
6 | 36 | 18
7 | 30 | 14
8 | 55 | 20
9 | 62 | 24
Table 2 The values of the parameters specified with respect to your student number (URN). Example: if your URN is “6789012”, the second rightmost digit is “1”, so you will use: M = 22, and F = 8.
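One way to reason about the harmonic spacing question is sketched below. It rests on an assumption that the question does not state outright: if exactly M harmonics survive below the Nyquist frequency F/2, then M·f0 ≤ F/2 < (M+1)·f0, which only brackets the fundamental f0 rather than fixing it. The gap between harmonic 1 and harmonic 10 is 9·f0, so it is bracketed in the same way. The function name is hypothetical, and the upper bound corresponds to the common simplification f0 ≈ (F/2)/M.

```python
def harmonic_gap_bounds_hz(m_harmonics, fs_khz, low=1, high=10):
    """Bounds on the spacing between harmonics `low` and `high`.

    Assumes exactly m_harmonics harmonics lie at or below the Nyquist
    frequency, so that nyquist/(M+1) < f0 <= nyquist/M.
    Returns (exclusive lower bound, inclusive upper bound) in Hz.
    """
    nyquist_hz = fs_khz * 1000.0 / 2.0
    f0_upper = nyquist_hz / m_harmonics
    f0_lower = nyquist_hz / (m_harmonics + 1)
    gap = high - low  # number of f0 steps between the two harmonics
    return gap * f0_lower, gap * f0_upper


if __name__ == "__main__":
    # Example row from the Table 2 caption: M = 22, F = 8 kHz.
    print(harmonic_gap_bounds_hz(22, 8))
```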
(c) Suppose we have a discrete speech signal, s[n], of length N samples.
(i) The real cepstrum of a quasi-stationary segment of a vowel of length N samples is computed, and prominent peaks of decreasing amplitude exist at quefrency bins τ = P, 2P, 3P, …. Assuming the sampling frequency is F kHz, what is the average pitch frequency of the vowel in Hz? The value of F can be found from Table 3 according to your student number (URN). Explain how you have reached your answer. [25 %]
(ii) Suppose the first four cepstral coefficients c[τ], τ = 0,1,2,3, derived from a frame of speech data, are specified in Table 3.
• What are the values of the final three cepstral coefficients c[N-3], c[N-2], and c[N-1]? Explain how you have reached your answer. [20 %]
• Derive a Fourier series expression for the resulting log-magnitude spectral envelope of the vocal tract that gives rise to this data. Explain how you have reached your answer. [25 %]
Second rightmost digit of your URN | P | F | c[0] | c[1] | c[2] | c[3]
0 | 40 | 8 | 6.0 | -1.2 | -1.8 | 2.5
1 | 45 | 8 | -5.5 | 1.3 | -2.4 | 0.9
2 | 90 | 22 | 4.9 | 2.6 | 1.8 | -3.7
3 | 85 | 16 | 7.7 | -2.2 | -3.5 | 2.3
4 | 110 | 32 | 9.6 | 2.5 | 1.6 | -3.5
5 | 60 | 12 | -3.7 | 3.9 | -2.7 | 2.9
6 | 80 | 18 | -6.5 | -4.6 | 1.4 | -1.3
7 | 70 | 14 | 8.2 | 1.7 | -2.2 | 2.4
8 | 90 | 20 | -8.4 | 5.2 | -3.1 | -2.2
9 | 120 | 24 | 6.5 | -4.4 | 1.8 | -3.1
Table 3 The values of the parameters specified with respect to your student number (URN). Example: if your URN is “6789012”, the second rightmost digit is “1”, so you will use: P = 45, F = 8, c[0] = -5.5, c[1] = 1.3, c[2] = -2.4, c[3] = 0.9.
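The cepstral relations behind part (c) can be sketched numerically as follows. Peaks at quefrency bins P, 2P, 3P, … indicate a pitch period of P samples, so the pitch is the sampling rate divided by P. The real cepstrum of a real signal is even, so c[N−k] = c[k]. The envelope formula assumes that the cepstrum is the inverse DFT of the log-magnitude spectrum and that only c[0] to c[3] and their mirror images are kept; other normalisation conventions would rescale it. The function names are hypothetical, and the numbers are the example row from the Table 3 caption.

```python
import math


def pitch_hz(period_bins, fs_khz):
    """Pitch frequency when cepstral peaks repeat every `period_bins` samples."""
    return fs_khz * 1000.0 / period_bins


def mirrored_tail(c_head):
    """Given [c[0], c[1], c[2], c[3]], return [c[N-3], c[N-2], c[N-1]]
    using the even symmetry c[N-k] = c[k] of the real cepstrum."""
    return [c_head[3], c_head[2], c_head[1]]


def log_envelope(c_head, omega):
    """Log-magnitude envelope c[0] + 2*sum_k c[k]*cos(k*omega), omega in rad/sample.

    Assumes the cepstrum is the inverse DFT of the log-magnitude spectrum
    and that only the coefficients in c_head (plus mirror images) are kept.
    """
    return c_head[0] + 2.0 * sum(
        c_head[k] * math.cos(k * omega) for k in range(1, len(c_head))
    )


if __name__ == "__main__":
    c = [-5.5, 1.3, -2.4, 0.9]  # example row from the Table 3 caption
    print(pitch_hz(45, 8))
    print(mirrored_tail(c))
    print(log_envelope(c, 0.0))
```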
Section B
B1.
(a) What dynamic programming method is typically employed for efficient decoding in automatic speech recognition systems, and how does it differ from that used in training? [10 %]
(b) A tracking system is built for a small delivery enterprise using an ergodic (fully-connected) 3-state HMM whose observations are 2D continuous features formed from noisy visual and GPS tracking data. The three states represent the presence of red ‘R’ (i=1), green ‘G’ (i=2) or blue ‘B’ (i=3) delivery vans in the company’s loading bay, respectively.
(i) Draw a state topology diagram for the model, including null states for entry and exit, considering the state-transition matrix A in Table B1.1. [15 %]
(ii) Using the state-transition matrix A in Table B1.1 and the output probability densities bi(t) in Table B1.2 for the given observations, show that the cumulative likelihoods δt(i) at time t=2 are δ2(1)=0.0049 and δ2(2)=0.0034. [25 %]
0 | 0.4 | 0.5 | 0.1 | 0
0 | 0.5 | 0.1 | 0.3 | 0.1
0 | 0.2 | 0.6 | 0.1 | 0.1
0 | 0.3 | 0.3 | 0.2 | 0.2
0 | 0 | 0 | 0 | 0
Table B1.1: State transition probability matrix incorporating the entry and exit probabilities, A={πj,aij,ηi}.
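The cumulative-likelihood recursion referred to in (b)(ii) can be sketched as below, under two stated assumptions. First, the 25 entries of Table B1.1 are read as a 5×5 matrix whose first row holds the entry probabilities πj, whose last column holds the exit probabilities ηi, and whose central 3×3 block holds aij. Second, the output densities in `B_HYPOTHETICAL` are placeholder numbers invented for this sketch, because the Table B1.2 values are not reproduced here; the output therefore does not reproduce the δ values quoted in the question.

```python
# Entry probabilities and 3x3 transition block, as read from Table B1.1
# (this reading of the flattened table is an assumption).
PI = [0.4, 0.5, 0.1]
A = [
    [0.5, 0.1, 0.3],
    [0.2, 0.6, 0.1],
    [0.3, 0.3, 0.2],
]

# HYPOTHETICAL output densities b[t][i] for two observations;
# these are NOT the Table B1.2 values.
B_HYPOTHETICAL = [
    [0.1, 0.2, 0.3],
    [0.2, 0.1, 0.1],
]


def viterbi_deltas(pi, a, b):
    """Return deltas[t][j], the likelihood of the best path ending in state j.

    delta_1(j) = pi_j * b_j(o_1)
    delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
    """
    n = len(pi)
    deltas = [[pi[j] * b[0][j] for j in range(n)]]
    for t in range(1, len(b)):
        prev = deltas[-1]
        deltas.append(
            [max(prev[i] * a[i][j] for i in range(n)) * b[t][j] for j in range(n)]
        )
    return deltas


if __name__ == "__main__":
    for t, row in enumerate(viterbi_deltas(PI, A, B_HYPOTHETICAL), start=1):
        print(t, row)
```

Replacing `max` with a sum over i turns the same recursion into the forward pass, which accumulates likelihood over all paths rather than only the single best one.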
t |
2023-08-17