
EEEM030: Speech & Audio Processing & Recognition

Semester 1 2018/9

Section A

A1.

(a) What is speech diarization? Give examples of key technologies that can be used to achieve speech diarization. [20%]

(b) Two files of speech data, A and B respectively, are recorded using a conventional sound card in a PC at a sampling rate of 8 kHz.

(i) A frame of a vowel segment was extracted from file A. The real cepstrum of this frame is computed. Prominent peaks of decreasing amplitude exist at quefrency bins 80, 160, 240, … Calculate the pitch frequency of this frame. [20%]

(ii) Another frame of a vowel segment is extracted from file B. Its pitch frequency is estimated to be twice as high as that of the vowel frame from file A. The plot of its autocorrelation function appears to be strongly periodic. If we measure ten complete periods of the autocorrelation function, how many samples should they occupy? Which frame of the above two vowel segments is more likely to be from female speech, and why? [20%]
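A minimal Python sketch of the arithmetic behind parts (b)(i) and (b)(ii); every figure used below (8 kHz sampling rate, the 80-bin peak spacing, the factor of two) comes from the question itself, and the variable names are illustrative only:

    fs = 8000                                    # sampling rate in Hz (given)

    # (b)(i): cepstral peaks every 80 quefrency bins means a pitch period of 80 samples.
    pitch_A = fs / 80                            # = 100 Hz

    # (b)(ii): frame B has twice the pitch of frame A, hence half the period.
    pitch_B = 2 * pitch_A                        # = 200 Hz
    period_B_samples = fs / pitch_B              # = 40 samples per period
    ten_periods_samples = 10 * period_B_samples  # = 400 samples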

(c) Cepstral coefficients can be calculated using the Fourier transform and the logarithmic function. Suppose the first five cepstral coefficients derived from a frame of speech data with N samples are found to be:

c[0] = 5.3;  c[1] = -1.9;  c[2] = 2.8;  c[3] = -4.3; c[4] = 2.1.

Derive a Fourier series expression for the resulting log-magnitude spectral envelope of the vocal tract that gives rise to this data.  [40%]
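One standard way to evaluate the kind of expression asked for in part (c): for a real cepstrum truncated to its low-quefrency coefficients, the log-magnitude envelope takes the cosine-series form c[0] + 2·Σ c[n]·cos(nω). A minimal numerical sketch (NumPy assumed; coefficient values are those given above):

    import numpy as np

    # Log-magnitude envelope as a truncated cosine series built from the real-cepstrum
    # coefficients given in the question: log|H(w)| ~ c[0] + 2 * sum_n c[n] * cos(n*w).
    c = [5.3, -1.9, 2.8, -4.3, 2.1]            # c[0]..c[4] (given)
    w = np.linspace(0, np.pi, 512)             # normalised angular frequency, 0..pi
    log_mag_envelope = c[0] + 2 * sum(c[n] * np.cos(n * w) for n in range(1, len(c)))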

A2.

(a) Describe the masking effect in human hearing. Use a sketch plot to illustrate your answer. State the difference between temporal masking and spectral masking. [30%]

(b) A 40-millisecond window is used for computing the spectrogram of a speech signal, sampled at 8 kHz. The overlap between two adjacent windows is 25%. Calculate the hop size in samples. [20%]
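A one-step Python sketch of the hop-size arithmetic in part (b), using only the values stated in the question:

    fs = 8000                                            # Hz (given)
    window_samples = int(0.040 * fs)                     # 40 ms window -> 320 samples
    overlap = 0.25                                       # 25 % overlap between adjacent windows
    hop_samples = int(window_samples * (1 - overlap))    # 320 * 0.75 = 240 samples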

(c) A sixth-order all-pole model of the vocal tract is obtained using linear prediction. Poles are located at:

(i) Sketch the pole-zero plot for the transfer function H(z), showing both poles and zeros.  [20%] 

(ii) Assuming the sampling rate is 8 kHz, estimate the formant frequencies (in Hz). [20%] 

(iii) Suppose a speech signal is produced based on H(z) and a residual signal is obtained by inverse filtering the speech signal. Briefly describe the main difference between the original speech signal and the residual signal. [10%]
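For part (c)(ii), a formant frequency is estimated from the angle of a complex pole, f = θ·fs/(2π). A short sketch follows; the pole values are placeholders, not the pole list printed in the original paper (which is not reproduced above):

    import numpy as np

    # A formant frequency follows from the angle of a complex pole: f = angle * fs / (2*pi).
    # The poles below are PLACEHOLDERS for illustration only.
    fs = 8000
    poles = [0.90 * np.exp(1j * 0.30 * np.pi),   # hypothetical pole (upper half-plane)
             0.95 * np.exp(1j * 0.50 * np.pi),   # hypothetical pole
             0.85 * np.exp(1j * 0.80 * np.pi)]   # hypothetical pole
    formants_hz = [np.angle(p) * fs / (2 * np.pi) for p in poles]   # one formant per conjugate pair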

Section B

B1.

(a)

(i) How does training alter the relationship between the parameters of an HMM and the training data? [10%]

(ii) The Expectation Maximization (EM) algorithm is a well-known machine learning technique for training a model that can subsequently be used for recognition. Explain what results from each of its two key steps. [10%]

(iii) Sketch the state topology for the HMM whose state transition probabilities a_ij are given in Table 1.1, in the usual format incorporating the entry and exit probabilities, π_i and η_j respectively. [10%]

(b)

(i) Using the state transition probabilities in Table 1.1, the observations o_t in Table 1.2, and their output probability densities b_i(o_t) in Table 1.3, calculate the values of the three forward likelihoods α_t(i) that are missing from Table 1.4 at t = 1, 2. [10%]

(ii) Hence calculate the two occupation likelihoods γ_t(i) missing from Table 1.5 to show that the HMM's occupation likelihood for state 1 is γ_t(1) ≈ 1 at times t = 1, 2, where γ_t(i) = α_t(i) β_t(i) / P(O|λ), and where α_t(i) is termed the forward likelihood and β_t(i) the backward likelihood (values for state 1: β_1(1) = 2.14e-07; β_2(1) = 1.70e-06), and P(O|λ) = 2.7286e-08 for this given observation sequence, O = {o_t} for t = 1..9. [10%]
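A minimal Python sketch of the forward recursion and occupation-likelihood formula used in part (b). The numbers are taken from Tables 1.1 and 1.3 and from the β and P(O|λ) values quoted above; the variable names are illustrative:

    # Forward recursion and occupation likelihood for a 2-state HMM (sketch):
    #   alpha_1(j) = pi_j * b_j(o_1)
    #   alpha_t(j) = ( sum_i alpha_{t-1}(i) * a_ij ) * b_j(o_t)
    #   gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|lambda)
    pi_ = [0.60, 0.40]                       # entry probabilities (Table 1.1)
    a = [[0.98, 0.01],                       # a_11, a_12 (Table 1.1)
         [0.07, 0.89]]                       # a_21, a_22
    b = [[2.12e-01, 1.28e-01],               # b_1(o_1), b_1(o_2) (Table 1.3)
         [9.53e-06, 3.14e-79]]               # b_2(o_1), b_2(o_2)

    alpha1 = [pi_[j] * b[j][0] for j in range(2)]                                      # t = 1
    alpha2 = [sum(alpha1[i] * a[i][j] for i in range(2)) * b[j][1] for j in range(2)]  # t = 2

    P_O = 2.7286e-08                         # P(O|lambda), given
    beta_state1 = [2.14e-07, 1.70e-06]       # beta_1(1), beta_2(1), given
    gamma_state1 = [alpha1[0] * beta_state1[0] / P_O,    # approx 1 at t = 1
                    alpha2[0] * beta_state1[1] / P_O]    # approx 1 at t = 2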

(c)

(i) State the Baum-Welch re-estimation formula for re-estimating the mean μ_i of a Gaussian pdf and, using your answers from part (b) together with the occupation likelihood values provided in Table 1.5, compute the maximum likelihood estimate of μ_2 for state 2 accordingly. [10%]

(ii) Comment on the value that you obtained for μ_2 with those observations. [10%]

(iii) Given that state 1 dominates the occupation likelihoods for the first five observations, whereas state 2 dominates the last three, what trend would you anticipate in the re-estimated variances Σ_1 and Σ_2, based on the values of the observations? [10%]
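A short Python sketch of the re-estimation formula referred to in part (c)(i), using the observations of Table 1.2 and the state-2 occupation likelihoods of Table 1.5 (variable names are illustrative):

    # Baum-Welch re-estimate of the state-2 mean:
    #   mu_i = sum_t gamma_t(i) * o_t  /  sum_t gamma_t(i)
    o = [4.01, 1.19, 4.08, 5.45, 3.73, 4.55, 5.15, 4.94, 5.06]        # Table 1.2
    gamma2 = [2.14e-06, 1.79e-81, 5.50e-07, 1.65e-03, 1.89e-08,
              4.61e-01, 9.60e-01, 9.94e-01, 9.96e-01]                 # Table 1.5

    mu2 = sum(g * x for g, x in zip(gamma2, o)) / sum(gamma2)         # occupation-weighted mean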

(d) In machine learning experiments, it is typical to split a development dataset into training and test data.

(i) Explain why this is done. [10%]

(ii) Give an example of one common approach to experimenting with a development set in this way, with details of how the data are split, used for training, used for testing and results obtained. [10%]
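As one illustration of the kind of protocol asked about in part (d)(ii), a minimal sketch of k-fold cross-validation (the data are split into k folds; each fold is held out once for testing while the remaining folds are used for training, and the k test results are averaged). The function and routine names are illustrative only:

    def k_fold_indices(n_items, k):
        """Yield (train_indices, test_indices) for each of k folds."""
        fold_size = n_items // k
        indices = list(range(n_items))
        for fold in range(k):
            test = indices[fold * fold_size:(fold + 1) * fold_size]
            train = [i for i in indices if i not in test]
            yield train, test

    # e.g. 5-fold cross-validation over 100 utterances:
    # for train_idx, test_idx in k_fold_indices(100, 5):
    #     model = train_model(train_idx)        # hypothetical training routine
    #     score = evaluate(model, test_idx)     # hypothetical scoring routine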

                 entry   state 1   state 2   exit
    entry          0      0.60      0.40      0
    state 1        0      0.98      0.01      0.01
    state 2        0      0.07      0.89      0.04
    exit           0      0         0         0

Table 1.1: State transition probabilities, A = {π_i, a_ij, η_j} (rows: from-state; columns: to-state; the first row holds the entry probabilities π_i and the last column the exit probabilities η_j).

    Time, t    1      2      3      4      5      6      7      8      9
    Obs, o_t   4.01   1.19   4.08   5.45   3.73   4.55   5.15   4.94   5.06

Table 1.2: Observation sequence, O = {o_t} for t = 1..9.

    Time, t    1          2          3          4          5          6          7          8          9
    b_1(o_t)   2.12e-01   1.28e-01   2.05e-01   7.01e-02   2.36e-01   1.56e-01   9.52e-02   1.15e-01   1.04e-01
    b_2(o_t)   9.53e-06   3.14e-79   5.07e-05   1.59e-01   3.50e-09   1.59e-01   1.51e+00   1.91e+00   1.91e+00

Table 1.3: Output probability densities, b_i(o_t) with o_t for t = 1..9 and states i = 1, 2.

    Time, t    1          2          3          4          5          6          7          8          9
    α_t(1)     -          -          3.22e-03   2.21e-04   5.13e-05   7.84e-06   7.32e-07   8.45e-08   1.15e-08
    α_t(2)     -          4.00e-82   8.12e-09   5.11e-06   2.37e-14   8.14e-08   2.27e-07   3.99e-07   6.79e-07

Table 1.4: Forward likelihoods, α_t(i) for t = 1..9 and states i = 1, 2.

    Time, t    1          2          3          4          5          6          7          8          9
    γ_t(1)     -          -          1.00e+00   9.98e-01   1.00e+00   5.39e-01   4.01e-02   5.50e-03   4.20e-03
    γ_t(2)     2.14e-06   1.79e-81   5.50e-07   1.65e-03   1.89e-08   4.61e-01   9.60e-01   9.94e-01   9.96e-01

Table 1.5: Occupation likelihoods, γ_t(i) for t = 1..9 and states i = 1, 2.

B2.

(a)

(i) With reference to the formula for Bayes’ theorem, describe how the acoustic model and the language model contribute to deciding the recognition output for a given utterance. [10%]

(ii) In terms of the probability of a word sequence W, comprising n words, write down an expression that would approximate P(W) using a bigram or 2-gram language model, where w_0 = {S} and w_(n+1) = {/S} denote the start and end of the utterance respectively. [5%]

(iii) For a voice-enabled drinks machine, sketch the recognition network for the following grammar, inserting null states as necessary, where $TAG defines a component, | denotes an alternative, [ ] an optional element, and ( ) groups items together:

$COLOUR = ( BLACK | WHITE | GREEN )

$DRINK = ( TEA | COFFEE )

$SUPPLEMENT = ( WITH ( MILK | SUGAR ) [ AND ( MILK | SUGAR ) ] )

{S} ( [ PLEASE ] [ $COLOUR ] $DRINK [ PLEASE ] [ $SUPPLEMENT [ PLEASE ] ] ) {/S}    [15%]
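For reference against parts (a)(i) and (a)(ii), the standard Bayes decision rule and bigram factorisation can be written in LaTeX as follows (standard notation, not text reproduced from the paper):

    \hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \; p(O \mid W)\, P(W)

    P(W) \approx \prod_{i=1}^{n+1} P(w_i \mid w_{i-1}), \qquad w_0 = \{S\}, \; w_{n+1} = \{/S\}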

(b) Calculate the word sequence probability for the phrase “tea, please, with milk and sugar”

(i) according to a 1-gram language model, using the values in Table 2.1   [15%]

(ii) according to a 2-gram language model, using the values in Table 2.2   [15%]

(c) Taking P($OOV) = 0.02 for an out-of-vocabulary word and adopting a backoff approach to deal with undefined probabilities, calculate the word sequence probability for the phrase “mint tea, please” according to a 2-gram language model with backoff. [15%]
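A Python sketch of the structure of the calculations in parts (b) and (c). Tables 2.1 and 2.2 are not reproduced above, so every probability below is a placeholder rather than a value from the paper; only the shape of the computation is illustrated:

    unigram = {"tea": 0.1, "please": 0.2, "with": 0.1,
               "milk": 0.1, "and": 0.2, "sugar": 0.1}        # placeholder 1-gram probabilities
    bigram = {("<s>", "tea"): 0.1, ("tea", "please"): 0.3, ("please", "with"): 0.2,
              ("with", "milk"): 0.4, ("milk", "and"): 0.2, ("and", "sugar"): 0.5,
              ("sugar", "</s>"): 0.3}                        # placeholder 2-gram probabilities

    words = ["tea", "please", "with", "milk", "and", "sugar"]

    # (b)(i) 1-gram model: product of the individual word probabilities.
    p_1gram = 1.0
    for w in words:
        p_1gram *= unigram[w]

    # (b)(ii) 2-gram model: product of P(w_i | w_{i-1}), including start/end symbols.
    p_2gram = 1.0
    prev = "<s>"
    for w in words + ["</s>"]:
        p_2gram *= bigram[(prev, w)]
        prev = w

    # (c) Backoff: an undefined bigram falls back to the 1-gram probability, and an
    # out-of-vocabulary word (e.g. "mint") takes P($OOV) = 0.02.
    P_OOV = 0.02
    def p_backoff(prev, w):
        if (prev, w) in bigram:
            return bigram[(prev, w)]
        return unigram.get(w, P_OOV)
    # e.g. p_backoff("<s>", "mint") -> 0.02; p_backoff("mint", "tea") -> unigram["tea"]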

(d)

(i) Explain what language model discounting is, including why it is needed in practical systems. [10%]

(ii) Name and describe one discounting method, being careful to define any terms you use.   [15%]
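As one illustration of the idea behind part (d), and not necessarily the method expected in the model answer, a minimal add-one (Laplace) discounting sketch with illustrative counts:

    # Add-one (Laplace) discounting for bigram counts: probability mass is moved from
    # seen events to unseen ones, so no bigram gets zero probability.
    def laplace_bigram_prob(count_bigram, count_history, vocab_size):
        """P(w | history) = (c(history, w) + 1) / (c(history) + |V|)."""
        return (count_bigram + 1) / (count_history + vocab_size)

    # An unseen bigram whose history occurred 50 times, with a 1000-word vocabulary:
    # laplace_bigram_prob(0, 50, 1000)  -> about 9.5e-4 instead of 0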