
EEEM030: Speech & Audio Processing & Recognition

FHEQ Level 7 Examination

LSA 2019/0

Section A

A1.

(a)        Describe the masking effect in human hearing. Explain how the masking effect could be used for lossy compression of speech. [20 %]

(b)        A vocal tract filter is described by the following difference equation:

S[n] = μ ∙ S[n − 2] + u[n],

where S[n] is the produced speech signal, u[n] is the excitation signal from the vocal cords, n is the discrete time index, and μ is a constant coefficient. Take the value of μ from Table 1.

Second rightmost digit of
student number (URN)          μ        F
0                           -0.36      8
1                           -0.04     10
2                           -0.25     22
3                           -0.09     16
4                           -0.16     32
5                            0.25     12
6                            0.49     18
7                           -0.81     14
8                            0.64     20
9                            0.09     24

Table 1: The values of the parameters specified with respect to your student number (URN). Example: if your URN is “6789012”, the second rightmost digit is “1”, so you will use: μ = −0.04, and F = 10.

(i)         Determine the transfer function of this filter, i.e. the z-transform H(z), and the causal impulse response h(n). Show the calculation steps by which you reach your answer. [20 %]

(ii)        Sketch the pole-zero plot of H(z) and the magnitude spectrum |H(ω)|. [20 %]

(iii)       Assuming the sampling rate is F kHz (find the value in terms of your URN), estimate the formant frequencies (in Hz) that correspond to the poles displayed in the above pole-zero plot. Detail the calculation steps by which you reach your answer. [20 %]

(iv)       Suppose the sign of μ is reversed, for example, changing from μ = −0.04 to μ = 0.04 if the second rightmost digit of your URN is “1”, or changing from μ = 0.64 to μ = −0.64 if the second rightmost digit of your URN is “8”. Re-sketch the pole-zero plot of H(z) and the magnitude spectrum |H(ω)|, and compare them with the plots in (ii). Explain how you reach the answer. [20 %]
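As a numerical cross-check for parts (i) to (iii), the following is a minimal sketch rather than a model answer. It assumes the transfer function takes the form H(z) = 1 / (1 − μ z⁻²), which follows from taking the z-transform of the difference equation, and it uses the Table 1 example row (μ = −0.04, F = 10). The function names are hypothetical and chosen only for illustration.

```python
import cmath
import math


def poles(mu):
    """Poles of H(z) = 1 / (1 - mu * z**-2), i.e. the two roots of z**2 = mu."""
    root = cmath.sqrt(mu)
    return root, -root


def pole_frequency_hz(pole, fs_hz):
    """Convert a pole angle (radians per sample) to a frequency in Hz."""
    return abs(cmath.phase(pole)) * fs_hz / (2 * math.pi)


def impulse_response(mu, length):
    """Run S[n] = mu * S[n-2] + u[n] on a unit impulse, from rest."""
    h = []
    for n in range(length):
        u = 1.0 if n == 0 else 0.0
        prev = h[n - 2] if n >= 2 else 0.0
        h.append(mu * prev + u)
    return h


# Assumption: the Table 1 example row (second rightmost digit "1").
mu, fs_hz = -0.04, 10_000
p1, p2 = poles(mu)
freqs = sorted(pole_frequency_hz(p, fs_hz) for p in (p1, p2))
h = impulse_response(mu, 6)

print("poles:", p1, p2)
print("pole frequencies (Hz):", freqs)
print("h[0..5]:", h)
```

For a negative μ the poles form a conjugate pair on the imaginary axis, so both map to a quarter of the sampling rate; for a positive μ they lie on the real axis at angles 0 and π. The odd-indexed samples of the impulse response are zero because the recursion only reaches back two samples.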

(a)        A signal containing a quasi-stationary segment of a vowel is band-limited using an ideal low-pass filter with a cut-off frequency at the Nyquist frequency, such that the power spectral density of the signal contains peaks at the first M harmonics of the vowel, and higher harmonics are cut off. Assume the sample rate is F kHz. Calculate the frequency difference between the first and the tenth harmonic, using the parameter values corresponding to your student number (URN), as given in Table 2. [30 %]

Second rightmost digit of
student number (URN)          M        F
0                            20        8
1                            22        8
2                            50       22
3                            38       16
4                            90       32
5                            28       12
6                            36       18
7                            30       14
8                            55       20
9                            62       24

Table 2: The values of the parameters specified with respect to your student number (URN). Example: if your URN is “6789012”, the second rightmost digit is “1”, so you will use: M = 22, and F = 8.
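One way to set up the harmonic-spacing calculation is sketched below. It rests on a stated assumption that is not given in the question: the M-th harmonic is taken to sit at the Nyquist frequency F/2, so the fundamental is approximated as (F/2)/M. Strictly, the question only bounds the fundamental between F/(2(M+1)) and F/(2M), so this is the upper-bound estimate. The function name is hypothetical.

```python
def harmonic_gap_hz(m_harmonics, fs_khz, first=1, last=10):
    """Frequency gap between two harmonics of a band-limited vowel.

    Assumption: the M-th harmonic lies at the Nyquist frequency,
    so f0 is approximated as (Fs / 2) / M.
    """
    nyquist_hz = fs_khz * 1000 / 2
    f0_hz = nyquist_hz / m_harmonics
    return (last - first) * f0_hz


# Table 2 example row (second rightmost digit "1"): M = 22, F = 8 kHz.
gap = harmonic_gap_hz(22, 8)
print(f"gap between 1st and 10th harmonic: {gap:.1f} Hz")
```

The gap is nine times the fundamental because harmonics are equally spaced at multiples of f0.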

(c)         Suppose we have a discrete speech signal, s[n], of length N samples.

(i)         The real cepstrum of a quasi-stationary segment of a vowel of length N samples is computed, and prominent peaks of decreasing amplitude exist at quefrency bins τ = P, 2P, 3P, …. Assuming the sampling frequency is F kHz, what is the average pitch frequency of the vowel in Hz? The value of F can be found from Table 3 in terms of your student number (URN). Explain how you have reached your answer. [25 %]

(ii)         Suppose the first four cepstral coefficients c[τ], τ = 0,1,2,3, derived from a frame of speech data, are specified in Table 3.

•    What are the values of the final three cepstral coefficients c[N-3], c[N-2], and c[N-1]? Explain how you have reached your answer. [20 %]

•    Derive a Fourier series expression for the resulting log-magnitude spectral envelope of the vocal tract that gives rise to this data. Explain how you have reached your answer. [25 %]

Second rightmost digit
of your URN                P      F     c[0]    c[1]    c[2]    c[3]
0                         40      8      6.0    -1.2    -1.8     2.5
1                         45      8     -5.5     1.3    -2.4     0.9
2                         90     22      4.9     2.6     1.8    -3.7
3                         85     16      7.7    -2.2    -3.5     2.3
4                        110     32      9.6     2.5     1.6    -3.5
5                         60     12     -3.7     3.9    -2.7     2.9
6                         80     18     -6.5    -4.6     1.4    -1.3
7                         70     14      8.2     1.7    -2.2     2.4
8                         90     20     -8.4     5.2    -3.1    -2.2
9                        120     24      6.5    -4.4     1.8    -3.1

Table 3: The values of the parameters specified with respect to your student number (URN). Example: if your URN is “6789012”, the second rightmost digit is “1”, so you will use: P = 45, F = 8, c[0] = -5.5, c[1] = 1.3, c[2] = -2.4, c[3] = 0.9.
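The cepstrum parts can be checked numerically with the sketch below, which is illustrative only. It assumes the usual convention that the real cepstrum is the inverse DFT of the log-magnitude spectrum, so that it is an even sequence (c[N−k] = c[k]) and the low-quefrency coefficients give the envelope log|H(ω)| ≈ c[0] + 2 Σ c[k] cos(kω) for k = 1..3. The function names are hypothetical; the numbers are the Table 3 example row.

```python
import math


def pitch_hz(period_bins, fs_khz):
    """Pitch from the first cepstral peak: one pitch period spans P samples."""
    return fs_khz * 1000 / period_bins


def mirrored_tail(c_low):
    """Return [c[N-3], c[N-2], c[N-1]] using the symmetry c[N-k] = c[k]."""
    return [c_low[3], c_low[2], c_low[1]]


def log_envelope(c_low, omega):
    """Cosine-series envelope c[0] + 2 * sum_k c[k] * cos(k * omega)."""
    return c_low[0] + 2 * sum(
        c_low[k] * math.cos(k * omega) for k in range(1, len(c_low))
    )


# Table 3 example row (second rightmost digit "1").
P, F_KHZ = 45, 8
c = [-5.5, 1.3, -2.4, 0.9]

print(f"pitch: {pitch_hz(P, F_KHZ):.2f} Hz")
print("c[N-3], c[N-2], c[N-1]:", mirrored_tail(c))
print(f"envelope at omega = 0: {log_envelope(c, 0.0):.2f}")
```

The factor of 2 in the envelope comes from pairing each coefficient c[k] with its mirror image c[N−k].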

Section B

B1.

(a)        What dynamic programming method is typically employed for efficient decoding in automatic speech recognition systems, and how does it differ from that used in training? [10 %]

(b)        A tracking system is built for a small delivery enterprise using an ergodic (fully-connected) 3-state HMM whose observations are 2D continuous features formed from noisy visual and GPS tracking data. The three states represent the presence of red ‘R’ (i=1), green ‘G’ (i=2) or blue ‘B’ (i=3) delivery vans in the company’s loading bay respectively.

(i)        Draw a state topology diagram for the model, including null states for entry and exit, considering the state-transition matrix A in Table B1.1. [15 %]

(ii)        Using the state-transition matrix A in Table B1.1 and the output probability densities b_i(t) in Table B1.2 for the given observations, show that the cumulative likelihoods δ_t(i) at time t=2 are δ_2(1)=0.0049 and δ_2(2)=0.0034. [25 %]

0      0.4    0.5    0.1    0
0      0.5    0.1    0.3    0.1
0      0.2    0.6    0.1    0.1
0      0.3    0.3    0.2    0.2
0      0      0      0      0

Table B1.1: State transition probability matrix incorporating the entry and exit probabilities, A = {π_j, a_ij, η_i}.
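The recursion behind the cumulative likelihoods can be sketched as follows. This is a minimal illustration, not the expected working. It assumes Table B1.1 is read with the first row as the entry probabilities π_j, the middle three rows as a_ij, and the last column as the exit probabilities η_i. The output densities in `b_placeholder` are hypothetical stand-ins, not the values of Table B1.2, so the printed numbers will not match the targets in part (ii) until the real densities are substituted.

```python
# Assumed reading of Table B1.1: entry row, then a_ij for states 1..3.
PI = [0.4, 0.5, 0.1]
A = [
    [0.5, 0.1, 0.3],
    [0.2, 0.6, 0.1],
    [0.3, 0.3, 0.2],
]


def viterbi_deltas(pi, a, b):
    """Cumulative best-path likelihoods delta_t(j).

    b[t][j] holds the output density b_j(o_t) for frame t (0-based).
    delta_1(j) = pi_j * b_j(o_1)
    delta_t(j) = max_i(delta_{t-1}(i) * a_ij) * b_j(o_t)
    """
    n_states = len(pi)
    delta = [[pi[j] * b[0][j] for j in range(n_states)]]
    for t in range(1, len(b)):
        prev = delta[-1]
        delta.append([
            max(prev[i] * a[i][j] for i in range(n_states)) * b[t][j]
            for j in range(n_states)
        ])
    return delta


# Hypothetical densities for two frames; replace with Table B1.2 values.
b_placeholder = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.1]]
deltas = viterbi_deltas(PI, A, b_placeholder)
print("delta_1:", deltas[0])
print("delta_2:", deltas[1])
```

Replacing the `max` with a sum over predecessor states turns the same loop into the forward recursion, which accumulates likelihood over all paths instead of only the best one.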

t