VoX

A brief summary of the challenge being addressed:

Two people often communicate with each other by e-mail. However, e-mail is well suited only to text and graphics. Standard sound formats for encoding human speech produce very large files that are impractical to send by e-mail. If certain assumptions are made about human speech, however, voice messages can be encoded far more compactly.

A speech profile can be created for each speaker, containing the collection of elementary sounds that he or she utters. Listeners download this profile once. Each audio message is then encoded relative to the profile, so listeners need to download only the encoded data, which is much smaller than the original audio. The listener decodes this data using the stored profile to regenerate the audio.
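The profile idea above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each elementary sound in the profile is represented by a short feature vector and that a message is a sequence of frames, each replaced by the index of its closest profile entry. The names `encode`, `decode` and `bits_per_frame`, and the toy two-dimensional vectors, are hypothetical and not part of the proposal.

```python
import math

def nearest_index(frame, profile):
    """Index of the profile entry closest to `frame` (squared Euclidean distance)."""
    return min(
        range(len(profile)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(frame, profile[i])),
    )

def encode(frames, profile):
    """Replace every frame by the index of its nearest profile entry."""
    return [nearest_index(frame, profile) for frame in frames]

def decode(indices, profile):
    """Look each index up in the profile the listener downloaded earlier."""
    return [profile[i] for i in indices]

def bits_per_frame(profile_size):
    """Bits needed to address one entry of a profile with `profile_size` entries."""
    return max(1, math.ceil(math.log2(profile_size)))

# Toy profile with four "elementary sounds" (assumed 2-D vectors).
profile = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.9)]

codes = encode(frames, profile)      # only these indices need to be sent
restored = decode(codes, profile)    # approximation rebuilt from the profile
```

Only `codes` travels with each message; the profile itself is transferred once. With a profile of 1024 entries, for example, each frame would cost 10 bits regardless of how many audio samples it covers.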

Kind of software being developed:

The project involves building a system for exchanging voice messages over e-mail using very high speech compression, as described above. The sender records a voice message and converts it into a coded, compressed file using the encoder module. The coded file is sent as an e-mail attachment. The receiver passes the attached file through the decoder module, which regenerates the speech. Both the encoder and the decoder use a repository of speech segments, which may be distributed on CDs or made available for download. The entire system (encoder, decoder and repository generator) will be developed for Linux.
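As a rough illustration of the transfer step, the sketch below attaches a coded file to an e-mail message and recovers it on the receiving side, using Python's standard `email` package. The addresses, the `message.vox` file name and the helper names are assumptions made for the example; actually sending the message (for instance over SMTP) is left out.

```python
import email
from email import policy
from email.message import EmailMessage

def build_voice_mail(sender, recipient, coded_bytes, filename="message.vox"):
    """Wrap the encoder output in an e-mail with a binary attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Voice message"
    msg.set_content("A coded voice message is attached.")
    msg.add_attachment(
        coded_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=filename,
    )
    return msg

def extract_coded_file(raw_message):
    """Return (filename, bytes) of the first attachment, or (None, None)."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.iter_attachments():
        return part.get_filename(), part.get_content()
    return None, None

# Stand-in for encoder output; a real coded file would come from the encoder.
coded = bytes([0, 1, 2, 3, 250, 251])
raw = build_voice_mail("alice@example.org", "bob@example.org", coded).as_bytes()
name, payload = extract_coded_file(raw)   # payload would go to the decoder
```

The attachment is base64-encoded in transit, so the coded file grows by roughly a third on the wire; the bytes handed to the decoder are identical to those produced by the encoder.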

Brief description of various components:

The project will deliver an easy-to-use package which will enable the proposed exchange of voice messages.

The repository-generator tool processes a large sample of speech to generate the repository corpus, using Mel-frequency cepstral coefficient (MFCC) feature extraction and clustering.
The encoder tool takes a sound file and converts it into a compressed binary file, using the repository.
The decoder tool does the reverse: it uses the repository to convert the compressed binary file back into sound.
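The three tools can be sketched together under some loud assumptions: feature extraction is taken as already done, so the input is a list of feature vectors standing in for MFCC vectors; the clustering step is plain k-means, which the proposal does not name specifically; and the binary layout (a 4-byte count followed by 16-bit indices) is a hypothetical format chosen for the example.

```python
import struct

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    """Index of the centroid closest to `vec`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))

def build_repository(vectors, k, iterations=20):
    """Repository generator: plain k-means, returning k centroids."""
    n = len(vectors)
    # Deterministic start: pick k evenly spaced input vectors.
    centroids = [tuple(vectors[i * n // k]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def pack_indices(indices):
    """Encoder output (assumed layout): big-endian count, then 16-bit indices."""
    return struct.pack(f">I{len(indices)}H", len(indices), *indices)

def unpack_indices(blob):
    """Decoder input: read the count, then that many 16-bit indices."""
    (count,) = struct.unpack_from(">I", blob, 0)
    return list(struct.unpack_from(f">{count}H", blob, 4))

# Toy feature vectors forming three well-separated groups.
features = [
    (0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (0.1, 0.1),
    (5.0, 5.1), (5.1, 5.0), (5.0, 5.0), (5.1, 5.1),
    (9.0, 0.1), (9.1, 0.0), (9.0, 0.0), (9.1, 0.1),
]
repository = build_repository(features, k=3)
indices = [nearest(v, repository) for v in features]   # encoder
blob = pack_indices(indices)                            # compressed binary file
decoded = [repository[i] for i in unpack_indices(blob)] # decoder lookup
```

In this sketch the decoder returns centroid vectors; in the proposed system the indices would instead select stored speech segments from the repository, from which the audio is regenerated.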

The system will be user-friendly. Once the repository has been generated and exchanged, communication can begin almost immediately.

References:

Ki-Seung Lee and Richard V. Cox, A very low bit rate speech coder based on a recognition/synthesis paradigm, IEEE Transactions on Speech and Audio Processing, 2001
Suresh Balakrishna, Speech Recognition using Mel Cepstrum features, Mississippi State University, 1998
George Tzanetakis and Perry Cook, Multifeature audio segmentation for browsing and annotation, Proceedings 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999
Regine Andre-Obrecht, A new statistical approach for automatic sound segmentation of continuous speech signals, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol 36, No. 1, January 1988
ISIP Automatic Speech Recognition (http://isip.msstate.edu/projects/speech/)
http://isl.ira.uka.de/speechCourse/overview/contents.html
Douglas O'Shaughnessy, Speech Communications - Human and Machine, Universities Press, 2001
John Watkinson, The Art of Digital Audio, 3rd Edition