Abstract
This study explored the potential of automatic speech recognition (ASR) transcription for linguistic research through a preliminary analysis of transcripts generated by three ASR systems: Google Cloud Speech-to-Text, Rev AI, and Whisper. The analysis covered 214 sample files of spontaneous speech produced by Japanese learners of English in the International Corpus Network of Asian Learners of English (ICNALE) corpus. An OpenAI language model was employed to remove disfluencies from the transcripts, yielding highly accurate pruned versions. The performance of the ASR systems was then evaluated against these pruned transcripts, with word error rate (WER) as the evaluation metric. Of the three systems, Whisper performed best, achieving a WER of 28.7% on pruned transcripts and 18.8% on pruned-normalized transcripts. The study also identified challenges that spontaneous non-native speech poses for ASR. All three systems, including Whisper, exhibited higher WERs for non-native speakers than for native English speakers. In addition, all three systems tended to correct erroneous morphological forms in verbs and nouns, with Whisper showing this tendency most prominently, suggesting that transcripts generated by current ASR systems, particularly Whisper, may not be suitable for error analyses involving morphological forms.