Job opportunities

Research Fellow in Machine Learning for Audio Captioning

Department of Electrical & Electronic Engineering

Location:  Guildford
Salary:  £36,914 to £39,152
Post Type:  Full Time
Advert Placed:  Thursday 18 March 2021
Closing Date:  Thursday 15 April 2021
Interview Date:  Friday 23 April 2021
Reference:  015821

Applications are invited for a 22-month Research Fellow (RF) position within the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK, to work on a project titled “Automated Captioning of Image and Audio for Visually and Hearing Impaired”. This is a collaborative project between the University of Surrey and Izmir Katip Celebi University (IKCU), Turkey, with project partners from charities and industrial sectors working with the hearing and visually impaired. The project aims to address fundamental challenges in audio and image captioning, develop new algorithms that improve captioning performance, and build application tools that the hearing and visually impaired could use to access audio and image content.

The work at Surrey will focus on new methods and algorithms for automated audio captioning and natural language description of audio. It builds on CVSSP's significant contributions in acoustic scene analysis, audio event detection, environmental sound recognition, and audio tagging, together with preliminary results on audio captioning. This new project offers an opportunity to take this work to the next stage and to demonstrate the benefit of such technologies for the hearing and visually impaired. Surrey and IKCU will jointly develop a smartphone-based prototype for audio and visual captioning. New data will also be gathered, including audio-visual data for captioning and user feedback on the prototype system.

The postholder will be responsible for investigating and developing audio signal processing and machine learning algorithms for natural language description of sound, and for implementing software to prototype the concept and algorithms. The postholder should have doctoral-level (or equivalent) research and development experience in electronic engineering, applied mathematics, computer science, artificial intelligence, machine learning, natural language processing, or a related subject. Ideally, the postholder will have experience in at least one of the following areas: audio captioning, machine description of audio, audio classification, audio tagging, image captioning, video captioning, translation between audio/image and text, or translation between audio and video.

The postholder will be based in CVSSP and will work under the direction of the Principal Investigator, Prof Wenwu Wang, with co-supervision by Prof Sabine Braun, Director of the Centre for Translation Studies at the University of Surrey, and in collaboration with Dr Volkan Kilic of IKCU, Turkey.

CVSSP is an International Centre of Excellence for research in Audio-Visual Machine Perception, with over 150 researchers, a grant portfolio of £24M (£17.5M EPSRC) from EPSRC, EU, InnovateUK, charity and industry, and a turnover of £7M/annum. The Centre has state-of-the-art acoustic capture and analysis facilities and a Visual Media Lab with video and audio capture facilities supporting research in real-time video and audio processing and visualisation. CVSSP has a compute facility with 120 GPUs and >1PB of high-speed secure storage.

For informal inquiries, please contact Prof Wenwu Wang (Email:; Web:


Please note, it is University Policy to offer a starting salary equivalent to Level 3.6 (£31,866) to successful applicants who have been awarded, but are yet to receive, their PhD certificate.  Once the original PhD certificate has been submitted to the local HR Department, the salary will be increased to Level 4.1 (£32,817).


Athena Swan Bronze Award / Disability Confident Committed / Stonewall Diversity Champion