
EEEM030: Speech & Audio Processing & Recognition

FHEQ Level 7 Examination

Semester 1 2021/2

Q1.

Assume the relationship between the produced speech signal s(n) from the lips and the excitation signal u(n) from the vocal cords can be described by the vocal tract filter using the following difference equation:

$$s[n] = 2a\cos(\beta)\,s[n-1] - a^{2}\,s[n-2] + b^{N}\,s[n-N] - 2a\,b^{N}\cos(\beta)\,s[n-N-1] + a^{2}\,b^{N}\,s[n-N-2] + u[n] - b\,u[n-1]$$

where n is the discrete time index, a, b, N, and β are the coefficients used to determine the shape of the filter, and β is in radians. Denote S(z) as the Z-transform of the speech signal s(n) and U(z) as the Z-transform of the excitation signal u(n). Assume the values of the above coefficients and the sampling rate F (in kHz) are given in Table 1.1.

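| Second rightmost digit of student number (URN) | a | b | N | β (rad) | F (kHz) |
|---|---|---|---|---|---|
| 0 | 0.4 | 0.8 | 4 | π/6 | 10 |
| 1 | -0.4 | 0.3 | 4 | 4π/5 | 6 |
| 2 | -0.3 | 0.5 | 4 | π/4 | 12 |
| 3 | 0.7 | 0.6 | 4 | π/8 | 9 |
| 4 | -0.6 | 0.4 | 4 | π/5 | 8 |
| 5 | -0.3 | 0.5 | 4 | π/4 | 12 |
| 6 | 0.7 | 0.6 | 4 | π/8 | 9 |
| 7 | -0.4 | 0.3 | 4 | /5 | 6 |
| 8 | 0.4 | 0.8 | 4 | π/6 | 10 |
| 9 | -0.6 | 0.4 | 4 | π/5 | 8 |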

Table 1.1. The values of the parameters specified with respect to your student number (URN). For example, if your URN is “6789012”, the second rightmost digit is “1”, so you will use: a = −0.4, b = 0.3, N = 4, β = 4π/5, and F = 6 kHz.
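As a sanity check on the difference equation, it can be run directly on a unit impulse. The following is a minimal sketch rather than a model answer: the function name `vocal_tract_filter`, the zero initial conditions, and the 16-sample impulse are assumptions made for illustration, and the parameter values are the example row quoted in the Table 1.1 caption.

```python
import math

def vocal_tract_filter(u, a, b, N, beta):
    """Run excitation u through the difference equation, sample by sample.

    Assumes zero initial conditions (all samples before n = 0 are zero).
    """
    bN = b ** N
    c = math.cos(beta)

    def past(seq, k):
        # Guard against Python's negative indexing: samples before n = 0 are zero.
        return seq[k] if k >= 0 else 0.0

    s = []
    for n in range(len(u)):
        y = (2 * a * c * past(s, n - 1)
             - a ** 2 * past(s, n - 2)
             + bN * past(s, n - N)
             - 2 * a * bN * c * past(s, n - N - 1)
             + a ** 2 * bN * past(s, n - N - 2)
             + u[n] - b * past(u, n - 1))
        s.append(y)
    return s

# Example row from the Table 1.1 caption (second rightmost URN digit "1").
a, b, N, beta = -0.4, 0.3, 4, 4 * math.pi / 5
impulse = [1.0] + [0.0] * 15          # illustrative length, not from the paper
h = vocal_tract_filter(impulse, a, b, N, beta)
print([round(x, 4) for x in h[:6]])   # first samples of the impulse response
```

A decaying impulse response here is consistent with a stable filter, but it is only a numerical hint and does not replace the pole analysis asked for below.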

(a)        Derive the transfer function of the vocal tract filter by using the Z-transform:

H(z) = S(z)/U(z) [20 %]

(b)        Calculate the poles and zeros of this filter. [20 %]

(c)        Sketch the pole-zero plot of H(z). [15 %]

(d)        Estimate the formant frequencies (in Hz) that correspond to the poles displayed in the above pole-zero plot. [15 %]

(e)        Analyse the bounded-input bounded-output (BIBO) stability of the system. If the values of a and b are both doubled, analyse the stability of the system again. [15 %]

(f)         Re-estimate the formant frequencies that correspond to the poles when N = 6. [15 %]
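Parts (b), (d), (e) and (f) all follow from the pole locations, so hand calculations can be checked numerically. The sketch below assumes that the denominator of H(z) factorises as (1 − 2a cos(β) z⁻¹ + a² z⁻²)(1 − b^N z^(−N)), which can be verified by expanding the product and comparing it with the feedback terms of the difference equation. The helper names are hypothetical, and the values are the example row from the Table 1.1 caption.

```python
import cmath
import math

def filter_poles(a, b, N, beta):
    """Poles of H(z), assuming the factorised denominator described above."""
    resonator = [a * cmath.exp(1j * beta), a * cmath.exp(-1j * beta)]
    # Roots of z^N = b^N; the k = 0 root at z = b coincides with the zero of 1 - b z^-1.
    comb = [b * cmath.exp(2j * math.pi * k / N) for k in range(N)]
    return resonator + comb

def pole_frequency_hz(pole, fs_hz):
    """Map a pole angle in [0, pi] to a frequency in [0, fs/2]."""
    return abs(cmath.phase(pole)) / (2 * math.pi) * fs_hz

def is_bibo_stable(pole_list):
    """A causal filter is BIBO stable when every pole lies strictly inside the unit circle."""
    return all(abs(p) < 1.0 for p in pole_list)

# Example row from the Table 1.1 caption; F = 6 kHz expressed in Hz.
a, b, beta, fs_hz = -0.4, 0.3, 4 * math.pi / 5, 6000.0
cases = [("N = 4", (a, b, 4, beta)),
         ("N = 6", (a, b, 6, beta)),
         ("a and b doubled", (2 * a, 2 * b, 4, beta))]
for label, params in cases:
    ps = filter_poles(*params)
    freqs = sorted({round(pole_frequency_hz(p, fs_hz), 1) for p in ps})
    print(label, "| stable:", is_bibo_stable(ps), "| pole frequencies (Hz):", freqs)
```

Poles on the real axis (0 Hz or F/2) are listed for completeness; whether they count as formants is a judgement the written answer should justify.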

Q2.

(a)        The Mel-frequency scale was designed as a perceptual scale of pitches judged by listeners to be equal in distance from one another:

$$f_{\mathrm{mel}} = 1127.01048\,\ln\!\left(1 + \frac{f}{700}\right)$$

where f is the frequency in Hz.

A test is carried out where a listener is asked to judge the intervals between three different sinusoidal tones. It is judged that the following tones are subjectively equidistant in frequency: f1, f2, and f3. Using the values of f1 and f2 in Table 2.1, estimate f3 (in Hz), assuming f3 is the highest of the three tones. [20 %]

(b)        Humans perceive sounds to arrive from a certain location.

(i)          Using about 100 words, explain the roles of the interaural phase difference (IPD) and interaural level difference (ILD) for human sound source localization over a broadband frequency range. [10 %]

(ii)        The IPD can be calculated as Φ = 2πfr(θ + sin θ)/c. A sound source emitting a pure tone at frequency f is located at the angle θ from the median plane. Using the values of f and θ in Table 2.1, calculate the IPD for a listener with a head radius of r = 0.085 m, assuming that c = 344 m/s. Is the IPD cue reliable for source localization? [20 %]
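The IPD formula in (b)(ii) is straightforward to evaluate numerically, provided θ is converted to radians first. The sketch below is illustrative only: the function names are hypothetical, and the reliability test (phase magnitude below π, so that the phase has not wrapped) is a commonly used rule of thumb assumed here, not a criterion stated in the paper. The inputs are the example row quoted in the Table 2.1 caption.

```python
import math

def ipd_radians(f_hz, theta_deg, r_m=0.085, c_ms=344.0):
    """Interaural phase difference: 2*pi*f*r*(theta + sin(theta))/c, theta in radians."""
    theta = math.radians(theta_deg)
    return 2 * math.pi * f_hz * r_m * (theta + math.sin(theta)) / c_ms

def is_unambiguous(phi):
    """Assumed rule of thumb: the phase cue is unambiguous while |phi| < pi."""
    return abs(phi) < math.pi

# Example row from the Table 2.1 caption: f = 1400 Hz, theta = 30 degrees.
phi = ipd_radians(1400.0, 30.0)
print(f"IPD = {phi:.3f} rad, unambiguous: {is_unambiguous(phi)}")
```

A phase above π would mean the same measured phase could correspond to more than one source angle, which is the usual argument for relying on level differences at higher frequencies.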

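| Second rightmost digit of student number (URN) | f1 (Hz) | f2 (Hz) | f (Hz) | θ (degrees) |
|---|---|---|---|---|
| 0 | 510 | 970 | 3320 | 40 |
| 1 | 1030 | 1780 | 1400 | 30 |
| 2 | 1340 | 2230 | 2660 | 50 |
| 3 | 1030 | 1780 | 1400 | 30 |
| 4 | 510 | 970 | 3320 | 40 |
| 5 | 1340 | 2230 | 2660 | 50 |
| 6 | 1030 | 1780 | 1400 | 30 |
| 7 | 510 | 970 | 3320 | 40 |
| 8 | 1340 | 2230 | 2660 | 50 |
| 9 | 510 | 970 | 3320 | 40 |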

Table 2.1. The values of the parameters specified with respect to your student number (URN). For example, if your URN is “6789012”, the second rightmost digit is “1”, so you will use: f1 = 1030 Hz, f2 = 1780 Hz, f = 1400 Hz, and θ = 30 degrees.
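For Q2(a), tones that are perceptually equidistant are equally spaced on the mel axis, so f3 follows from converting f1 and f2 to mels, stepping once more by the same interval, and converting back. The sketch below assumes the standard form of the mel formula, f_mel = 1127.01048 ln(1 + f/700); the function names are hypothetical, and the inputs are the example row from the Table 2.1 caption.

```python
import math

MEL_SCALE = 1127.01048   # constant from the question
MEL_CORNER_HZ = 700.0    # assumed standard corner frequency of the mel formula

def hz_to_mel(f_hz):
    return MEL_SCALE * math.log(1.0 + f_hz / MEL_CORNER_HZ)

def mel_to_hz(m):
    return MEL_CORNER_HZ * (math.exp(m / MEL_SCALE) - 1.0)

def next_equidistant_tone(f1_hz, f2_hz):
    """Return f3 such that mel(f3) - mel(f2) equals mel(f2) - mel(f1)."""
    m1, m2 = hz_to_mel(f1_hz), hz_to_mel(f2_hz)
    return mel_to_hz(2.0 * m2 - m1)

# Example row from the Table 2.1 caption: f1 = 1030 Hz, f2 = 1780 Hz.
f3 = next_equidistant_tone(1030.0, 1780.0)
print(f"f3 = {f3:.1f} Hz")
```

The same result can be obtained in closed form as f3 = 700((1 + f2/700)² / (1 + f1/700) − 1), which is a useful cross-check on the numerical route.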

A set of domain-specific language models is developed as part of a restaurant recommendation service in the USA. Based on thousands of utterances recorded during trials, the raw 1-gram and 2-gram counts have been used to obtain corresponding unigram and bigram models, L0(1) and L0(2), with the probabilities shown in Tables 2.2 and 2.3, respectively.

| w | I | want | to | eat | Chinese | food | lunch | spend | <UNK> | <\s> |
|---|---|---|---|---|---|---|---|---|---|---|
| P(w) | 0.0418 | 0.0153 | 0.0123 | 0.0398 | 0.0026 | 0.0180 | 0.0056 | 0.0046 | 0 | 0.1538 |

Table 2.2. Unigram language model L0(1) for the restaurant recommendation service based on raw counts. Key: <UNK> unknown, <\s> sentence end.
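To illustrate how a unigram model scores an utterance, the sketch below multiplies the Table 2.2 probabilities of each word and of the sentence-end token, working in the log domain. It is an illustration under stated assumptions, not part of the question: the ninth value in Table 2.2 (0) is assumed to belong to an unknown-word token `<UNK>`, out-of-vocabulary words are assumed to map to that token, and the test sentence is hypothetical.

```python
import math

# Probabilities copied from Table 2.2; the "<UNK>" label is an assumption.
UNIGRAM = {
    "I": 0.0418, "want": 0.0153, "to": 0.0123, "eat": 0.0398,
    "Chinese": 0.0026, "food": 0.0180, "lunch": 0.0056, "spend": 0.0046,
    "<UNK>": 0.0, "<\\s>": 0.1538,
}

def unigram_log_prob(words, model=UNIGRAM):
    """Natural-log probability of the words followed by the sentence-end token."""
    total = 0.0
    for w in list(words) + ["<\\s>"]:
        p = model.get(w, model["<UNK>"])   # unseen words fall back to <UNK>
        if p == 0.0:
            return float("-inf")           # a zero-probability word rules out the sentence
        total += math.log(p)
    return total

sentence = "I want to eat Chinese food".split()   # hypothetical utterance
print(unigram_log_prob(sentence))
print(unigram_log_prob(["I", "want", "pizza"]))   # "pizza" is not in the vocabulary
```

Because the raw-count model assigns zero probability to `<UNK>`, any utterance with an unseen word scores negative infinity, which is the kind of behaviour that smoothing is meant to avoid.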

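| W | I | want | to | eat | Chinese | food | lunch | spend | <UNK> | <\s> |
|---|---|---|---|---|---|---|---|---|---|---|
| P(w\|<s>) | 0.2500 | 0.0153 | 0.0399 | 0.0123 | 0.0026 | 0.0180 | 0.0056 | 0.0046 | 0 | 0 |
| P(w\|I) | 0.0020 | 0.3265 | 0 | 0.0036 | 0 | 0 | 0 | 0.0008 | 0 | 0.0122 |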

P(w|want)

0.0022

0

0.6559

0.0011

0.0065

0.0065

0.0054

0.0011