Two people often communicate with each other by e-mail. However, e-mail is well suited only to text and graphics. Standard sound formats that encode human speech produce extremely large files, which are impractical to send by e-mail. If certain assumptions are made about human speech, however, voice communication can be made far more efficient.
A speech profile of a speaker can be created, containing the collection of elementary sounds uttered by him or her. Listeners download this profile once. Actual audio messages can then be encoded against the profile, so that users only need to download the encoded data (which will be much smaller than the actual audio data). The listener decodes this data using the previously stored profile and regenerates the audio.
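To give a rough sense of the possible saving, the arithmetic below compares raw PCM audio with a stream of profile indices. Every figure in it (8 kHz sampling, 16-bit samples, a profile of 1024 elementary sounds, 10 sounds per second) is an illustrative assumption, not a measurement from this project.

```python
# Illustrative assumptions only; none of these figures come from the project.
SAMPLE_RATE_HZ = 8000    # telephone-quality sampling rate
BITS_PER_SAMPLE = 16     # uncompressed PCM sample width
PROFILE_SIZE = 1024      # hypothetical number of elementary sounds in a profile
UNITS_PER_SECOND = 10    # hypothetical number of elementary sounds per second


def raw_bits_per_second(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of uncompressed mono PCM audio."""
    return sample_rate_hz * bits_per_sample


def coded_bits_per_second(profile_size: int, units_per_second: int) -> int:
    """Bit rate when each elementary sound is sent as an index into the profile."""
    bits_per_index = (profile_size - 1).bit_length()  # 10 bits for 1024 entries
    return bits_per_index * units_per_second


raw = raw_bits_per_second(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
coded = coded_bits_per_second(PROFILE_SIZE, UNITS_PER_SECOND)
print(f"raw PCM: {raw} bit/s, profile indices: {coded} bit/s")
```

Under these assumptions the raw stream is 128000 bit/s and the index stream is 100 bit/s. A practical coder would also need to send information such as pitch, duration and loudness, so the real ratio would be smaller.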
Kind of software being developed:
The project involves building a system for exchanging voice messages over e-mail, using very high speech compression, as described above. The sender records a voice message and transforms it into a coded, compressed file using the encoder module. The coded file is transferred as an e-mail attachment. The receiver passes the attached file through the decoder module, which reproduces the original speech. Both the encoder and the decoder use a repository of speech segments. The repository may be distributed on CDs or made available for download. The entire system (encoder, decoder and repository generator) will be designed and implemented for Linux.
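The encoder and decoder round trip described above can be sketched as follows. This is a minimal illustration, not the project's actual design: it assumes each speech segment is already represented as a short feature vector, that the repository is an ordered list of such vectors shared by both sides, and that indices are stored as 16-bit integers. The names `encode`, `decode` and `nearest_index` are hypothetical.

```python
import struct
from typing import List, Sequence

Segment = Sequence[float]


def nearest_index(segment: Segment, repository: Sequence[Segment]) -> int:
    """Return the index of the repository entry closest to `segment`."""
    def sq_dist(a: Segment, b: Segment) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(repository)), key=lambda i: sq_dist(segment, repository[i]))


def encode(segments: Sequence[Segment], repository: Sequence[Segment]) -> bytes:
    """Replace every segment by the index of its nearest repository entry."""
    indices = [nearest_index(s, repository) for s in segments]
    # Assumption: 16-bit little-endian indices, so at most 65536 entries.
    return struct.pack(f"<{len(indices)}H", *indices)


def decode(payload: bytes, repository: Sequence[Segment]) -> List[Segment]:
    """Look each index up in the receiver's copy of the repository."""
    count = len(payload) // 2
    indices = struct.unpack(f"<{count}H", payload)
    return [repository[i] for i in indices]


# Toy data: a three-entry repository and a three-segment message.
repository = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
message = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

payload = encode(message, repository)   # this is what would be attached to the e-mail
decoded = decode(payload, repository)
print(len(payload), decoded)
```

In this sketch the decoder returns the repository entries rather than the sender's exact segments, so the reconstruction is only as good as the repository's coverage of the speaker's sounds.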
Brief description of various components:
The project will deliver an easy-to-use package which will enable the proposed exchange of voice messages.
- The repository-generator tool processes a large sample of speech to generate the corpus, using Mel-frequency cepstral coefficient (MFCC) feature extraction and clustering.
- The encoder tool takes a sound file and converts it into a compressed binary file, using the repository.
- The decoder tool does the reverse: it uses the repository to convert the compressed binary file back into a sound file.
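The clustering step of the repository generator might look roughly like the sketch below. It is a plain k-means loop under stated assumptions: the input vectors stand in for MFCC features (MFCC extraction itself is not shown), the number of clusters `k` is chosen by the user, and the initial centroids are simply picked at even spacing from the input. The proposal does not say which clustering algorithm is used, so k-means here is only an example, and `build_repository` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def build_repository(features: Sequence[Vector], k: int, iterations: int = 20) -> List[Vector]:
    """Cluster feature vectors with k-means and return the k centroids."""
    n = len(features)
    # Naive deterministic initialisation: k evenly spaced input vectors.
    centroids = [features[i * n // k] for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda i: sq_dist(f, centroids[i]))
            clusters[nearest].append(f)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


# Toy input: two well-separated groups of 2-D vectors standing in for MFCC vectors.
features = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0),
            (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
repo = build_repository(features, k=2)
print(repo)
```

Each centroid returned would become one entry of the repository, and the encoder would then refer to entries by their position in this list.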
The system will be user-friendly. Once the repository has been generated and exchanged, communication can begin almost instantly.
References:
- Ki-Seung Lee and Richard V. Cox, A very low bit rate speech coder based on a recognition/synthesis paradigm, IEEE Transactions on Speech and Audio Processing, 2001
- Suresh Balakrishna, Speech Recognition using Mel Cepstrum features, Mississippi State University, 1998
- George Tzanetakis and Perry Cook, Multifeature audio segmentation for browsing and annotation, Proceedings 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999
- Regine Andre-Obrecht, A new statistical approach for automatic sound segmentation of continuous speech signals, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol 36, No. 1, January 1988
- ISIP Automatic Speech Recognition (http://isip.msstate.edu/projects/speech/)
- http://isl.ira.uka.de/speechCourse/overview/contents.html
- Douglas O'Shaughnessy, Speech Communications - Human and Machine, Universities Press, 2001
- John Watkinson, The Art of Digital Audio, 3rd Edition