Module: CMP-6026A/7016A Audio-visual Processing

Assignment: CW2: Audio-visual speech recognition

Set by: Daniel Paredes-Soto ([email protected])

Date set: Wednesday 22 October 2025

Value: 50%

Date due: Wednesday 10 December 2025 (Week 12)

Returned by: 17 December 2025

Submission: Blackboard

Checked by: Jacob Newman ([email protected])

Learning outcomes

•  Explain how humans produce speech from audio and visual perspectives and how these differ across different speech sounds and be able to give examples of how these are subject to noise and distortion.

•  Apply a range of tools to display and process audio and visual signals and be able to analyse these to find structure and identify sound or visual events.

•  Transfer knowledge learnt into code that extracts useful features from audio and visual data to provide robust and discriminative information in a compact format and apply this to machine learning methods.

•  Design and construct audio and visual speech recognisers and evaluate their performance under varying adverse operating conditions.

•  Work in a small team and organise work appropriately using simple project management techniques before demonstrating accomplishments within a professional setting.

Specification

Overview

This assignment involves the design, implementation and evaluation of a speaker-dependent audio-visual speech recognition system to recognise the names of 20 students in clean and noisy conditions. As this assignment builds on the first one, you will likely be using Python and TensorFlow to complete this work.

Description

This assignment has three parts:

1.  You will construct a lip-reading system using the vocabulary of names you developed in the first part of this module. Since you have already developed the architecture of an acoustic classifier, this new task ought to amount to extracting features from a video and using those features instead of the audio features.

2.  You will develop an audio-visual recogniser. To do this effectively you will need a set of audio features and a set of video features. The video features may be at a different sampling rate from the audio, so you may need to resample the video features. There are then two ways to combine the features: early integration or late integration. In early integration, the audio and video features are joined together to form a longer vector (sometimes called a supervector). These supervectors are then used to train and test a new classifier. In late integration, you build two separate classifiers for the audio and video streams, plus some logic that picks the most probable name given the list of possible outputs from each classifier.

3.  You will prepare a 15-minute presentation for Wednesday Week 12. Consider carefully the content of your presentation. Present concise and informative data that illustrates the results of your work. Favour graphical content over lots of text on your slides. If there are multiple presentations on the same problem, you may also want to adjust your content on the day.

To achieve your objectives, you will need to consider the following aspects:

Speech data collection and labelling: if you recorded simultaneous audio and visual sequences for the initial coursework then you can simply reuse the acoustic features and labels from those exercises. Otherwise, you will need to re-record and label instances of the student names so that you have visual features and acoustic features for audio and visual recognition. It is usually sensible practice to work with a subset of the data while you develop your techniques.

Feature extraction: working in small teams, usually of two people (use the same team as for the previous coursework), you should first consider what visual features you will use. Previously you may have used MFCCs, as these are the standard features for acoustic speech recognition. However, for visual speech (lip-reading) there is no real agreement as to what form the visual speech features should take. These could be image-based features (DCT is common, but PCA-based features could equally be used). Image-based features provide an implicit visual speech feature by coding the image containing the mouth; they do not code the mouth directly. Alternatively, the features might use higher-level knowledge and provide an explicit visual speech feature by measuring properties of the position and shape of the speech articulators directly (e.g. mouth width and height), or they might include both shape and appearance information. Typically, both visual-only and audio-visual speech recognisers perform better if both shape and appearance information are included. However, extracting shape adds complexity, as specific feature points must be identified in every video frame. Furthermore, for lip-reading to be practical there is a strong need to extract simple features.
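As an illustration of the image-based option, the sketch below computes a 2D DCT of a greyscale mouth region and keeps only the low-frequency coefficients as the feature vector. It is a minimal example under stated assumptions, not a prescribed method: the function names `dct_matrix` and `dct_features`, the 32 x 48 region size and the 4 x 4 coefficient block are all hypothetical choices, and the random array stands in for a cropped mouth image that you would obtain from your own video frames.

```python
import numpy as np


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II basis matrix."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] *= np.sqrt(1.0 / n)
    basis[1:, :] *= np.sqrt(2.0 / n)
    return basis


def dct_features(mouth_roi, block=4):
    """Keep the top-left block x block low-frequency 2D DCT coefficients."""
    roi = np.asarray(mouth_roi, dtype=np.float64)
    rows, cols = roi.shape
    # Separable 2D DCT: transform the rows, then the columns.
    coeffs = dct_matrix(rows) @ roi @ dct_matrix(cols).T
    return coeffs[:block, :block].flatten()


# Synthetic stand-in for one greyscale mouth image (assumed 32 x 48 pixels).
rng = np.random.default_rng(0)
frame = rng.random((32, 48))
features = dct_features(frame, block=4)
print(features.shape)
```

Applying `dct_features` to every frame of a recording gives a (frames x 16) feature matrix. The block size controls the trade-off between compactness and detail, which is one of the things you could compare in your testing.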

Marks will be awarded based on the effectiveness of the feature extraction used and the completeness of the testing (e.g. comparing different types of feature).

Together with your partner you will need to decide on the number of visual feature dimensions that you will use for training and testing your recognisers. You might consider which of the features are perceptually significant (e.g. by visualising the features), or you might determine the optimal number of features empirically using some form of objective measure of recogniser performance.
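One way to determine the number of dimensions empirically is to sweep over candidate values and score each on held-out data. The sketch below shows the shape of such a sweep. Everything in it is an assumption for illustration: the data is synthetic, `sweep_dimensions` is a hypothetical helper, and a simple nearest-centroid classifier stands in for your actual recogniser, which you would substitute in.

```python
import numpy as np


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a nearest-centroid classifier (stand-in for a recogniser)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test vector to every centroid.
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predicted = classes[dists.argmin(axis=1)]
    return float((predicted == test_y).mean())


def sweep_dimensions(train_x, train_y, val_x, val_y, candidates):
    """Score the first d feature dimensions for each candidate d."""
    return {
        d: nearest_centroid_accuracy(train_x[:, :d], train_y, val_x[:, :d], val_y)
        for d in candidates
    }


# Synthetic data: 3 classes, 12 dimensions, only the first 4 informative.
rng = np.random.default_rng(0)
n_classes, per_class, n_dims = 3, 40, 12
labels = np.repeat(np.arange(n_classes), per_class)
means = np.zeros((n_classes, n_dims))
means[:, :4] = rng.normal(scale=3.0, size=(n_classes, 4))
data = means[labels] + rng.normal(size=(labels.size, n_dims))

# Hold out a validation split so the choice is not made on training data.
order = rng.permutation(labels.size)
train, val = order[:90], order[90:]
scores = sweep_dimensions(data[train], labels[train],
                          data[val], labels[val], [2, 4, 8, 12])
best = max(scores, key=scores.get)
print(scores, best)
```

Taking the first d dimensions assumes the features are ordered by importance, as they are for low-frequency DCT coefficients or PCA components.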

Visual-only speech recognition: after extracting your visual features, you should measure the baseline performance of a speech recogniser using these visual features. That is, you should build a visual-only speech recogniser trained and tested using only the visual features. You may reuse many of the scripts developed for coursework one.

You do not need to consider the performance of this recogniser as a function of noise since the visual features from your video are not affected by acoustic noise.

Audio-visual speech recognition: you should integrate the acoustic and the visual information to build audio-visual speech recognisers. The data-rate for these two modalities is likely to be different, so one or the other of the features will need to be resampled. It is customary to up-sample the video data to the acoustic data-rate rather than down-sample the acoustic data to match the visual. This ensures that none of the original information is lost.
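A minimal sketch of up-sampling by linear interpolation is shown below. The rates are assumptions chosen for illustration (25 video frames per second, 100 acoustic feature vectors per second), and `upsample_features` is a hypothetical helper name. Other interpolation schemes, or simply repeating frames, are equally valid choices.

```python
import numpy as np


def upsample_features(visual, video_fps, audio_frame_rate):
    """Linearly interpolate visual features (frames x dims) onto audio frame times."""
    visual = np.asarray(visual, dtype=np.float64)
    video_times = np.arange(visual.shape[0]) / video_fps
    # Number of audio-rate frames spanning the same duration as the video.
    n_audio = int(round(video_times[-1] * audio_frame_rate)) + 1
    audio_times = np.arange(n_audio) / audio_frame_rate
    # np.interp works on 1D signals, so interpolate each dimension in turn.
    columns = [np.interp(audio_times, video_times, visual[:, d])
               for d in range(visual.shape[1])]
    return np.column_stack(columns)


# Five video frames of a 2-dimensional feature at an assumed 25 fps.
visual = np.array([[0.0, 10.0], [1.0, 8.0], [2.0, 6.0], [3.0, 4.0], [4.0, 2.0]])
upsampled = upsample_features(visual, video_fps=25, audio_frame_rate=100)
print(upsampled.shape)
```

In practice the result may differ from the acoustic feature matrix by a frame or two, so you would trim both to a common length before combining them.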

You then need to decide if you will consider early or late integration of the acoustic and the visual information and implement the recognisers accordingly. You might also consider comparing the performance of your features for both early and late integration to determine if one approach might be better than the other.
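The two integration strategies can be sketched as follows. This is an illustrative example under assumptions, not the required design: the function names and feature sizes are hypothetical, the probabilities are made-up placeholders rather than real classifier outputs, and the weighted sum of log-probabilities is only one of several possible late-integration rules.

```python
import numpy as np


def early_integration(audio_feats, visual_feats):
    """Concatenate time-aligned audio and visual features frame by frame."""
    audio_feats = np.asarray(audio_feats)
    visual_feats = np.asarray(visual_feats)
    # Trim to the shorter stream so every frame has both modalities.
    n = min(len(audio_feats), len(visual_feats))
    return np.concatenate([audio_feats[:n], visual_feats[:n]], axis=1)


def late_integration(audio_probs, visual_probs, audio_weight=0.5):
    """Fuse per-class probabilities with a weighted sum of log-probabilities.

    Returns the index of the winning class and the fused scores.
    """
    eps = 1e-12  # avoids log(0)
    fused = (audio_weight * np.log(np.asarray(audio_probs) + eps)
             + (1.0 - audio_weight) * np.log(np.asarray(visual_probs) + eps))
    return int(np.argmax(fused)), fused


# Early integration: assumed 13 audio and 16 visual dimensions per frame.
supervectors = early_integration(np.zeros((17, 13)), np.zeros((18, 16)))
print(supervectors.shape)

# Late integration: placeholder outputs from two 3-class classifiers.
audio_probs = [0.5, 0.4, 0.1]
visual_probs = [0.2, 0.7, 0.1]
winner, _ = late_integration(audio_probs, visual_probs, audio_weight=0.5)
print(winner)
```

The `audio_weight` parameter lets the fusion lean on the visual stream as the audio becomes noisier, which is one thing you could explore when comparing the two approaches.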

You should consider the performance of your audio-visual recogniser as a function of the acoustic noise (use the same noise files as used in the previous coursework). You can then report the effective gain (or otherwise) that arises from incorporating the visual information into your acoustic-only recogniser.
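If you need to regenerate noisy test audio at specific signal-to-noise ratios, the sketch below scales a noise signal so that the mixture has a chosen SNR in decibels. It is a minimal example with synthetic signals: `mix_at_snr` is a hypothetical helper, and the sine wave and random noise stand in for your recorded speech and the noise files from the previous coursework.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat the noise if it is too short, then trim to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve 10*log10(speech_power / (scale^2 * noise_power)) = snr_db for scale.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


# Synthetic stand-ins: a 1-second tone as "speech" and shorter random noise.
rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.normal(size=5000)
noisy = mix_at_snr(speech, noise, snr_db=10)
print(noisy.shape)
```

Running the audio-only and audio-visual recognisers on the same mixtures at each SNR gives directly comparable accuracy figures from which to report the gain.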

Relationship to formative assessment

Formative assessment takes place during all lab classes through discussion of your analysis, designs and implementations. These labs underpin the coursework and relate directly to the different parts.

Deliverables

This assignment has no written work. You will be assessed solely on the presentation. Note that an important part of the marking structure is our assessment of your credibility and professionalism, and it is worth discussing with us how this is best established.

Delivery of the assessment will be through an in-person oral presentation. The session will take the form of a mini-conference in Week 12, and the session will consist of a small group of presentations. You must attend the whole of the session allotted to you.

Both team members should submit the group presentation using the appropriate submission point on Blackboard. The first slide of your presentation should contain the title of your project and the names of your team members.

Unless you inform us otherwise (by email or in person), we will assume that each member of your team has made an equal contribution to the final submission. If there is a disagreement in your team about the share of contribution, there may be a delay in providing your final mark and feedback, as we will have to discuss the issue with each person in your group.

Resources

You will need audio/visual recording equipment and software, and the learning toolkits used in the lab classes. These resources have been introduced in the lectures and lab classes.

Marking scheme

The marking structure is as follows:

•  Background and introduction - 5 %

•  Visual feature extraction - 30 %

•  Evaluation of classifiers - 30 %

•  Quality of visual materials and presentation skills - 10 %

•  Question answering - 10 %

•  Organisational skills, professionalism and credibility - 15 %

Note: it is expected that all team members will make approximately equal contributions. If it becomes apparent that this is not the case then, for fairness, the distribution of marks may be adjusted.