Creation of a Doctor-Patient Dialogue Corpus Using Standardized Patients


Robert S. Belvin (1), Win May (2), Shrikanth Narayanan (3), Panayiotis Georgiou (3), Shadi Ganjavi (4)

(1) HRL Laboratories LLC; (2) University of Southern California, Keck School of Medicine; (3) University of Southern California, Department of Electrical Engineering; (4) University of Southern California, Department of Linguistics




In this paper we describe the development of a doctor-patient dialogue corpus to support a speech-to-speech machine translation effort for English-Persian medical dialogues. The corpus was developed by recording and transcribing English-to-English dialogues between medical students and standardized patients (actors who have been trained to portray victims of illness or injury), and then translating the transcripts into Persian. We discuss some of the benefits and drawbacks of creating a corpus in this way. Benefits include the ability to customize the corpus in a way that would be infeasible with actual doctor-patient data, and the avoidance of privacy and legal issues. Drawbacks include the fact that the Persian does not originate as speech, but as a text translation of English speech. We address concerns such as the authenticity of the dialogues and the value of such data for system development.


Medical Dialogue Data, Standardized Patients, Machine Translation, Corpus Development, Persian, Farsi, Doctor-Patient Interaction

Language(s): English, Persian
Full Paper