Evaluation Framework Design of Spoken Term Detection Study at the NTCIR-9 IR for Spoken Documents Task

This paper describes the design of the spoken term detection (STD) sub-task and its evaluation framework at the NTCIR-9 IR for Spoken Documents (SpokenDoc) task. STD is one of the information access technologies for spoken documents. The goal of the STD sub-task is to rapidly detect the presence of a given query term, consisting of one word or a short sequence of spoken words, in the spoken documents included in the Corpus of Spontaneous Japanese. To run the sub-task successfully, we considered the design of the sub-task and the evaluation methods, and arranged the task schedule. In the end, seven teams participated in the STD sub-task and submitted 18 STD results. This paper explains the details of the STD sub-task we conducted, the data used in the sub-task, how the transcriptions distributed to participants were produced by speech recognition, the evaluation measures, an introduction to the participants' techniques, and the evaluation results of the task participants.


Introduction
The growth of the Internet and decreasing storage costs have led to a rapid increase in multimedia content. For retrieving such content, the available text-based tag information is limited. Spoken Document Retrieval (SDR) and Spoken Term Detection (STD) are promising technologies for retrieving this content using the speech data it contains.
Most general approaches to SDR and STD use automatic speech recognition (ASR).
First, the target speech is transcribed into text symbol sequences, which in many cases are word or subword sequences. Then, text information retrieval (IR) techniques are applied to search for speech or uttered terms related to a given query in a spoken document collection. Speech recognition errors, which make the transcriptions noisy, are an inherent problem in SDR and STD studies. Therefore, many SDR and STD researchers have worked to improve IR performance on very noisy text documents.
SDR and STD studies have been evolving since test collections were first constructed. For example, the Text REtrieval Conference (TREC) has dealt with SDR since 1996 (Garofolo, Auzanne, and Voorhees 2000). The SDR task is to identify a target spoken document from among a large number of spoken documents, where the target is often defined as one containing a particular topic, for example, the content of a news clip. However, SDR has a limitation: even if the target spoken document is identified, the user must browse the entire spoken document to confirm its content, even when the user would rather browse only the section containing the keyword of interest.
Therefore, STD functionality is needed. To this end, the National Institute of Standards and Technology (NIST) set up STD test collections and collected the results of participants (NIST 2006). Research and development of SDR and STD has been actively carried out owing to the construction of these test collections.
In SDR and STD research, a standard test collection is indispensable, because the performance of SDR and STD depends on the query set, the size and category of the spoken documents, and the correct documents, sections, or occurrences included in the test collection. Test collections for Japanese have been highly desired by the spoken document processing community in Japan.
Therefore, the Spoken Document Processing Working Group, which is part of the special interest group on spoken language processing (SIG-SLP) of the Information Processing Society of Japan, developed prototypes of SDR test collections: the CSJ Spoken Term Detection test collection (Itoh, Nishizaki, Hu, Nanjo, Akiba, Kawahara, Nakagawa, Matsui, Yamashita, and Aikawa 2010) and the CSJ Spoken Document Retrieval test collection (Akiba, Aikawa, Itoh, Kawahara, Nanjo, Nishizaki, Yasuda, Yamashita, and Itou 2009). The target documents of both test collections are spoken lectures in the Corpus of Spontaneous Japanese (CSJ) (Maekawa, Koiso, Furui, and Isahara 2000). By using (and extending) these test collections, two sub-tasks were conducted at NTCIR-9.
The NTCIR Workshop (http://research.nii.ac.jp/ntcir/ntcir-9/index.html) is a series of evaluation workshops designed to enhance research in information access technologies by providing large-scale test collections and a forum for researchers.
We proposed a new task called "IR for Spoken Documents," shortened to "SpokenDoc," and it was accepted as one of the core tasks of the ninth NTCIR Workshop (NTCIR-9) (Sakai and Joho 2011). In the NTCIR-9 SpokenDoc (Akiba, Nishizaki, Aikawa, Kawahara, and Matsui 2011), we evaluated STD and SDR under a realistic ASR condition, where the target documents are spontaneous speech data with high word error rates and high OOV rates.
In this paper, we focus on the STD sub-task at NTCIR-9, describing the task design and evaluation framework for STD studies at the NTCIR-9 SpokenDoc task. In our STD sub-task, we distributed reference transcriptions produced by ASR systems to the task participants. This aims at a fair comparison in which ASR performance does not influence the STD evaluation of each participant's result. ASR for spontaneous speech is difficult, so it is hard to achieve high STD performance with current STD techniques. The design of the STD sub-task at NTCIR-9 thus poses a challenge aimed at a breakthrough in STD research.
The organization of this paper is as follows. Section 2 describes an outline of the STD task, and Section 3 introduces other evaluation frameworks related to our STD sub-task. In Section 4, we explain the task definition of the NTCIR-9 SpokenDoc STD sub-task. Section 5 presents the evaluation results of the sub-task participants, and finally, we give our conclusions in Section 6.

Outline of STD
The goal of an STD task is to rapidly detect the presence of a term, consisting of one word or a sequence of consecutively spoken words, in a spoken document collection. The effectiveness of an STD system is evaluated on the detection rate and detection accuracy for a given query term, the retrieval speed, and the processing resources the computer requires. Therefore, a system that can rapidly identify the locations of a given query term in a spoken document with few computing resources is a good STD system. However, there is a trade-off between detection rate and accuracy, and STD research takes up the challenge of breaking through this trade-off.
Figure 1 shows a generic processing flow diagram of an STD task. First, the target speech is transcribed into text symbol (word or subword) sequences using an ASR system. Next, an index is created from the symbol sequences; there are various methods for making symbol sequences and an index so that query terms can be searched robustly. When a query term is input to the search engine, it detects the locations at which the query term occurs in the indexed speech.
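As a minimal illustration of this flow (transcribe, index, search), the sketch below assumes the ASR step has already produced one syllable string per IPU; all identifiers and transcriptions are invented for the example, and a real system would, of course, deal with recognition errors rather than exact matches.

```python
# A minimal sketch of the generic STD flow: transcriptions in, index built,
# query searched. All IPU ids and syllable strings here are illustrative.

def build_index(transcripts):
    """Index step: map each IPU id to its (noisy) syllable transcription."""
    return dict(transcripts)

def search(index, query):
    """Search step: return ids of IPUs whose transcription contains the query."""
    return [ipu_id for ipu_id, text in index.items() if query in text]

# Toy transcriptions (romanized syllables); a real system would obtain
# these from an ASR system, recognition errors included.
index = build_index([
    ("lec001-0001", "ko n ni chi wa"),
    ("lec001-0002", "o n se i ke n sa ku"),
    ("lec002-0001", "ke n sa ku e n ji n"),
])
print(search(index, "ke n sa ku"))  # both IPUs containing "kensaku"
```

A practical index would be structured for speed (e.g. an inverted index or suffix array) rather than scanned linearly as here.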
It is hard to detect terms corresponding to a query term because any ASR system generates speech recognition errors. In addition, words that are not listed in the ASR dictionary are never correctly transcribed; these words are called out-of-vocabulary (OOV) words. STD research is thus also a challenge to overcome the speech recognition error and OOV problems.
Three key points must be considered to put STD applications into practical use: detection performance (detection rate and accuracy), search time, and computing resources. For example, even if an STD system has good detection performance, it may not be useful if it processes queries slowly. Therefore, the participants' STD systems were evaluated in terms of these three key points.

Related evaluation frameworks
This section describes other evaluation frameworks related to our STD evaluation framework and clarifies the differences between our proposed framework and the TREC and NIST frameworks.
The TREC-6 SDR track (Garofolo, Voorhees, Stanford, and Jones 1997), held in 1997, was the first search task targeting speech data. The known-item retrieval task at TREC-6 was similar to our STD task, in that spoken documents containing the keywords composing a query were retrieved. In this task, however, the retrieval techniques tackling the OOV and transcription error problems caused by ASR were not compared across the task participants.
The TREC-8 SDR track (Garofolo et al. 2000), the final SDR evaluation in the TREC series, featured an ad-hoc SDR task. The ad-hoc SDR task differs from the STD task; however, the goal of both tasks is efficient IR from large-scale speech data. TREC-8 used a broadcast news corpus with a duration of 560 hours, a scale of speech data comparable to our STD target speech.
On the other hand, our STD sub-task used a lecture speech corpus, the CSJ. Most speeches in the CSJ are hard to transcribe by ASR because they are spontaneous and therefore contain many disfluencies, restarts, and filled pauses, which damage ASR performance. In addition, the CSJ has over 1,400 speakers, so the speech data we used has great variety in speaker individuality and speaking style.
In contrast to the TREC-8 SDR track, the NIST evaluation framework has the same goal as our STD evaluation framework. However, it had several problems, described as follows: (1) The query set in the NIST task was composed of highly frequent word N-grams automatically extracted from the speech corpus. Highly frequent words are easy to recognize and therefore easy to detect from the speech data, which made the NIST task easy.
(2) The NIST task did not distinguish between in-vocabulary (IV) terms and OOV terms in the query set, because the task organizers did not provide the resources necessary for this distinction. The task participants were required to perform ASR of the target speech themselves and thus made their own distinction between IV and OOV terms. Ideally, the definition of IV and OOV terms should be clarified in the evaluation sets so that the applicability of STD methods to practical retrieval can be examined precisely. It was therefore hard to compare STD performance across STD methods in the NIST task, because differences between the ASR systems changed the level of task difficulty.
(3) NIST prepared three sorts of target speech in the English task, each only a few hours long. This also made the NIST task too easy.
(4) In the NIST task, a detection error trade-off curve, which plots term false alarm probability versus term miss probability, and the optimal and actual points on the curve were used for evaluating STD performance. The optimal and actual points are called MTWV (maximum term-weighted value) and ATWV (actual term-weighted value), respectively. These evaluation measures are calculated using heuristic parameters that depend on the target speech, and are thus specialized to the NIST task.
We set up the SpokenDoc STD sub-task at NTCIR-9 with the problems of the NIST task in mind. Compared with the NIST task, ours has the following advantages: (1) The query set in our STD sub-task was formed by manually selecting keywords (mostly proper nouns) that are likely to be used for IR. Furthermore, the frequencies of the query terms are widely distributed, and the number of syllables composing a query term was also considered. Therefore, our task provides STD evaluation at a variety of levels of detection difficulty.
(2) We provided two sorts of reference transcriptions, produced by two ASR systems, to the task participants, who were free to use them. In addition, the ASR dictionary and the acoustic and language models were also distributed. We announced ASR guidelines for participants who wanted to use their own ASR systems, so that they would perform their STD under the same ASR conditions (such as training data and vocabulary) as the reference transcriptions. Participants therefore challenged the sub-task at the same level of detection difficulty, because the definition of IV and OOV terms was shared by all participants.
(3) Our target speech was much larger than NIST's: the durations were about 602 and 44 hours for the two query sets, respectively.
(4) A recall-precision curve and the F-measure, the harmonic mean of the recall and precision rates, were used for STD evaluation in our sub-task. These evaluation values allow us to compare STD performance among all participants' STD systems from a macroscopic perspective. Furthermore, we also evaluated STD performance from a microscopic perspective based on the MAP (mean average precision) metric. False detections, one type of STD error, were thereby adequately evaluated.

STD task design at the NTCIR-9

Target speech collection
Our target document collection is the CSJ, released by the National Institute for Japanese Language. Of the CSJ, 2,702 lectures (about 602 hours) are used as the target documents for our STD sub-task (referred to as ALL). A subset of 177 lectures (about 44 hours), called CORE, is also used as a target (referred to as CORE).
Participants were required to purchase the data themselves. Each lecture in the CSJ is segmented at pauses no shorter than 200 msec. Each segment is called an Inter-Pausal Unit (IPU). An IPU is short enough to serve as a proxy for a position within a lecture.
Therefore, the IPUs are used as the basic unit to be searched in our STD sub-task.

Transcription of speech
A standard STD method first transcribes the audio signal into a textual representation using ASR and then performs text-based retrieval. The participants could use the following two types of transcriptions.
(1) Reference automatic transcriptions. The organizers prepared two reference automatic transcriptions. These enabled those who were interested in STD but not in ASR to participate in our sub-task, and also enabled comparisons of STD methods based on the same underlying ASR performance. Participants could also use both transcriptions at the same time to boost performance. The textual representation of the transcriptions is the N-best list of the word or syllable sequences, depending on the two background ASR systems, along with their lattice and confusion network representations. Table 1 shows the word-based correct rate ("W.Corr.") and accuracy ("W.Acc.") and the syllable-based correct rate ("S.Corr.") and accuracy ("S.Acc.") for these reference transcriptions.
(2) Participant's own transcription. The participants could use their own ASR systems for transcription. To ensure the same IV and OOV conditions, we recommended that word-based ASR systems use the same vocabulary list as our reference transcription, though this was not mandatory. When participating with their own transcriptions, participants were encouraged to provide them to the organizers for future SpokenDoc test collections.

Speech recognition models
To realize open speech recognition, we used the following acoustic and language models, which were trained under the condition described below.
All speeches except the CORE parts were divided into two groups according to speech ID number: an odd group and an even group. We constructed two sets of acoustic and language models and performed automatic speech recognition on each group using the models trained on the other group.
The acoustic models are triphone-based, with 48 phonemes. The feature vectors have 38 dimensions: 12-dimensional Mel-frequency cepstral coefficients (MFCCs), their difference coefficients (delta MFCCs), their acceleration coefficients (delta-delta MFCCs), delta power, and delta-delta power. The components were calculated every 10 ms. The distribution of the acoustic features was modeled using 32-mixture diagonal-covariance Gaussians for the HMMs.
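The dimension arithmetic above (12 + 12 + 12 + 1 + 1 = 38) can be checked with a small sketch. The one-frame difference used for the deltas here is a simplification; real systems compute deltas with a regression formula over several frames.

```python
# A pure-Python sketch of assembling the 38-dimensional vector described
# above: 12 MFCCs + 12 delta MFCCs + 12 delta-delta MFCCs + delta power +
# delta-delta power. The simple one-frame difference is an assumption.

def delta(seq):
    # Per-frame difference of a sequence of feature vectors.
    return [[b - a for a, b in zip(seq[max(t - 1, 0)], v)]
            for t, v in enumerate(seq)]

def assemble(mfcc, power):
    d_mfcc, d_power = delta(mfcc), delta([[p] for p in power])
    dd_mfcc, dd_power = delta(d_mfcc), delta(d_power)
    return [m + dm + ddm + dp + ddp
            for m, dm, ddm, dp, ddp
            in zip(mfcc, d_mfcc, dd_mfcc, d_power, dd_power)]

mfcc = [[0.1] * 12, [0.2] * 12]   # two toy frames of 12 MFCCs each
power = [1.0, 1.5]
print(len(assemble(mfcc, power)[0]))  # 38
```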
The language models are word-based trigram models with a vocabulary of 27k words. In addition, syllable-based trigram models, trained on the syllable sequences of each training group, were used to make the syllable-based transcriptions.
We used Julius (Lee and Kawahara 2009) as a decoder, with the ASR dictionary containing the above vocabulary.All words registered in the dictionary appeared in both training sets.The odd-group lectures were recognized by Julius using the even-group acoustic model and language model, while the even-group lectures were recognized using the odd-group models.
Finally, we obtained N-best speech recognition results for all spoken documents. The following models and dictionary were made available to the participants of the SpokenDoc task:
• Odd acoustic models and language models
• Even acoustic models and language models
• The ASR dictionary

The task definition
Our STD task is to find all IPUs that include a specified query term in the CSJ.A term in this task is a sequence of one or more words.This is different from the STD task produced by NIST (NIST 2006).
Participants can specify a suitable threshold for the score of an IPU; if the score for a query term is greater than or equal to the threshold, the IPU is output. One of the evaluation metrics is based on these outputs. However, participants can output up to 1,000 IPUs for each query; therefore, IPUs with scores less than the threshold may also be submitted.

STD query set
We provided two sets of query term lists, one for ALL lectures and one for CORE lectures.
Each participant's submission (called a "run") should choose the list corresponding to their target document collection, i.e., either ALL or CORE.
We prepared a 50-query set for each of the CORE and ALL lecture sets. For CORE, 31 of the 50 queries are OOV queries that are not included in the ASR dictionary; the others are IV queries. For ALL, 24 of the 50 queries are OOV queries. The average number of occurrences per term is 7.1 for the CORE set and 20.5 for the ALL set.
Each query term consists of one or more words.Because the STD performance depends on the length of the query terms, we selected queries of differing length.Query lengths range from 4 to 14 morae.

System output requirement
When a term is supplied to an STD system, all occurrences of the term in the speech data are to be found, and a score for each occurrence of the given term is to be output.
All STD systems must output the following information:
• the document (lecture) ID,
• the IPU ID,
• a score indicating how likely it is that the term occurs there, with more positive values indicating a more likely occurrence, and
• a binary decision as to whether the detection is considered correct or not.
The score for each term occurrence can be on any scale; however, the range of scores must be standardized across all terms.
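One simple way to meet this requirement, offered here purely as an illustration (the sub-task did not prescribe any particular normalization), is min-max scaling of a system's raw scores onto a common [0, 1] range:

```python
# Hypothetical score standardization: map raw scores onto [0, 1] so that a
# single threshold is meaningful across all terms. Min-max scaling is an
# assumption here, not the method mandated by the sub-task.

def normalize(scores):
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0              # guard against constant scores
    return [(s - lo) / span for s in scores]

print(normalize([2.0, 4.0, 3.0]))  # [0.0, 1.0, 0.5]
```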

Evaluation measures
As described before, a system that can rapidly identify IPUs containing a given query term with few computing resources is a good system. Therefore, we officially use recall-precision curves, F-measures, and mean average precision (MAP) to evaluate detection performance. In addition, we also evaluate STD systems by retrieval time and the memory consumption of the computer.
The calculation of recall, precision, F-measure, and MAP is described as follows. The IPUs detected by each system were judged by whether or not they included the specified term. The judgment was based on a "correct IPU list" for each specified term. The definition of a correct IPU for a specified term is based on exact matching against the manual transcriptions of the CSJ in Japanese orthography (kanji, hiragana, and katakana).
The evaluation measure for effectiveness is the F-measure at the decision point specified by the participant, based on recall and precision averaged over queries (denoted "F-measure (spec.)"). The F-measure at the maximum decision point (denoted "F-measure (max)"), recall-precision curves, and mean average precision (MAP) are also used for analysis purposes.
They are defined as follows:

  Recall = N_corr / N_true,  Precision = N_corr / (N_corr + N_spurious)  (1)

  F-measure = (2 × Recall × Precision) / (Recall + Precision)  (2)

where N_corr and N_spurious are the total numbers of correct and spurious (false) term (IPU) detections whose scores are greater than or equal to the threshold, and N_true is the total number of true term occurrences in the speech data. Recall-precision curves can be plotted by changing the threshold value; in the evaluation, the threshold value is varied in 100 steps. The F-measure at the maximum decision point is calculated at the optimal balance of the Recall and Precision values on the recall-precision curve.
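The computation of these measures, including the threshold sweep behind "F-measure (max)", can be sketched as follows; for brevity the sweep here runs over the observed scores rather than the 100 fixed steps used in the official evaluation, and all detections are toy data.

```python
# Recall, precision, and F-measure at a threshold, plus F-measure (max)
# obtained by sweeping the threshold. detections: (score, is_correct) pairs.

def recall_precision_f(detections, n_true, threshold):
    kept = [ok for score, ok in detections if score >= threshold]
    n_corr = sum(kept)
    n_spurious = len(kept) - n_corr
    recall = n_corr / n_true if n_true else 0.0
    precision = n_corr / (n_corr + n_spurious) if kept else 0.0
    f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f

def f_measure_max(detections, n_true):
    # Sweep the threshold over the observed scores; keep the best F-measure.
    return max(recall_precision_f(detections, n_true, t)[2]
               for t in {s for s, _ in detections})

dets = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
print(recall_precision_f(dets, n_true=4, threshold=0.5))  # R = 0.5, P = 2/3
print(f_measure_max(dets, n_true=4))                      # 0.75, at t = 0.4
```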
MAP for the set of queries is the mean value of the average precision values for each query.
It can be calculated as follows:

  MAP = (1/Q) Σ_{i=1}^{Q} AveP(i)

where Q is the number of queries and AveP(i) is the average precision of the i-th query of the query set. The average precision is obtained by averaging the precision values computed at the rank of each relevant term in the list, in which the retrieved terms are ranked by a relevance measure:

  AveP(i) = (1/Rel_i) Σ_{r=1}^{N_i} δ_r × Precision(r)

where r is the rank, N_i is the rank at which all relevant terms of query i have been found, Rel_i is the number of relevant terms of query i, δ_r is a binary function indicating the relevance of the item at rank r, and Precision(r) is the precision computed over the top r items.
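This definition translates directly into code; the rankings below are hypothetical toy data.

```python
# Average precision per query and MAP over a query set.

def average_precision(ranked_relevance, n_relevant):
    # ranked_relevance: delta_r flags (1 = relevant) in system rank order.
    hits, total = 0, 0.0
    for r, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / r            # precision at each relevant rank
    return total / n_relevant if n_relevant else 0.0

def mean_average_precision(per_query):
    # per_query: one (ranked_relevance, n_relevant) pair per query.
    return sum(average_precision(rr, n) for rr, n in per_query) / len(per_query)

q1 = ([1, 0, 1], 2)   # AveP = (1/1 + 2/3) / 2 = 5/6
q2 = ([0, 1], 1)      # AveP = (1/2) / 1 = 1/2
print(mean_average_precision([q1, q2]))  # (5/6 + 1/2) / 2 = 2/3
```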

Schedule
The NTCIR-9 SpokenDoc task proceeded on the following schedule:

Task participants of STD task
In the NTCIR-9 SpokenDoc STD sub-task, seven teams participated with 18 submitted runs. The team IDs are listed in Table 2. All seven teams submitted results for the CORE query set; however, only two teams submitted results for the ALL query set.
AKBL (Kaneko et al. 2011) and IWAPU (Saito et al. 2011) each submitted two runs for the CORE set. IWAPU used multiple OWN transcriptions of various subword units, including monophone, triphone, syllable, demiphone, and Sub-Phonetic Segment (SPS), produced by the multiple speech recognition systems prepared for these units. The query was also converted to these subword sequences, and then detection was performed for each subword representation using the DTW algorithm. These multiple detection results were integrated into the final results by linearly interpolating their detection scores to improve STD performance.

Table 3 The number of transcription(s) used for each run.
NKGW (Iwami and Nakagawa 2011) submitted one run for the CORE set. Two OWN transcriptions were used at the same time: the word-based transcription was used for IV search queries, and the syllable-based transcription was used for both IV and OOV queries. For the syllable-based transcription, an inverted index based on syllable trigrams was used for indexing; substitution and insertion errors were handled by introducing extra error-prediction indices, while deletion errors were handled by removing a syllable from the input query sequence. The index was also augmented with a distance score according to the error prediction, which was used to reduce false detections without applying the expensive DTW-based confirmation.
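A hedged sketch of the trigram inverted-index idea described above: every syllable trigram is indexed with its position, and a query matches where its trigrams occur at consecutive positions. The error-prediction indices and distance scores of the actual system are omitted, and all data are invented.

```python
# Exact-match syllable-trigram inverted index, as a simplified sketch of the
# indexing scheme described above (error tolerance omitted).
from collections import defaultdict

def build_trigram_index(ipus):
    # Map every syllable trigram to the (IPU id, position) pairs where it occurs.
    index = defaultdict(list)
    for ipu_id, syllables in ipus.items():
        for i in range(len(syllables) - 2):
            index[tuple(syllables[i:i + 3])].append((ipu_id, i))
    return index

def search(index, query):
    # Assumes the query is at least three syllables long.
    trigrams = [tuple(query[i:i + 3]) for i in range(len(query) - 2)]
    # Keep candidate start positions where every query trigram lines up.
    hits = set(index.get(trigrams[0], []))
    for k, tg in enumerate(trigrams[1:], start=1):
        hits &= {(ipu, pos - k) for ipu, pos in index.get(tg, [])}
    return sorted({ipu for ipu, _ in hits})

ipus = {"ipu1": ["ke", "n", "sa", "ku", "ki"], "ipu2": ["o", "n", "se", "i"]}
idx = build_trigram_index(ipus)
print(search(idx, ["ke", "n", "sa", "ku"]))  # ['ipu1']
```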
NKI11 (Katsurada et al. 2011) submitted two runs for the CORE set and two runs for the ALL set.A suffix array was used for indexing, which was constructed from the phoneme sequences obtained from the transcriptions of spoken documents.At detection time, the suffix array was searched against the phoneme sequence of the query using the DTW algorithm.To improve the efficiency of the search on the suffix array, the phoneme sequence of the long query term was divided into subsequences, each of which was then searched against the suffix array.The detection results of the subsequences were further confirmed to form the final detection results.
The transcription used was REF-SYLLABLE.
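The suffix-array lookup at the core of this approach can be sketched as follows, with a naive construction and exact matching only; the actual system's query subdivision and DTW-based scoring are omitted, and the phoneme string is a toy example.

```python
# Suffix-array lookup over a phoneme string: a simplified sketch of the
# indexing just described (exact matching only, naive construction).

def build_suffix_array(text):
    # Naive O(n^2 log n) construction; adequate for a sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, query):
    # Binary search for the block of suffixes whose prefix equals the query.
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:                       # leftmost suffix with prefix >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # just past the last matching suffix
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "kensakukiken"                    # toy phoneme string for one document
sa = build_suffix_array(text)
print(occurrences(text, sa, "ken"))      # [0, 9]
```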
RYSDT (Nanjo et al. 2011) participated in both the CORE and ALL sets. YLAB (Yamashita et al. 2011) submitted one run for the CORE set. YLAB used no transcription obtained from speech recognition; instead, they used the vector quantization (VQ) code sequence of the spoken document. For each document group, consisting of lectures by the same speaker, an individual VQ code set was produced and used only for that group. The V-P score, which encodes the similarity between a phoneme and a VQ code, was obtained from the same document group and used for detection guided by the DTW algorithm.

Formal-run evaluation results
The evaluation results for the CORE query set are summarized in Figure 2 and Table 4 for the 13 submitted runs and the baseline. Figure 3 and Table 5 show the STD performance on the ALL query set for the five submitted runs and the baseline. The offline processing time and index size (memory consumption) are shown in Table 6, but only for the runs that used some indexing method for efficient search.
The baseline system used dynamic programming (DP)-based word spotting, which decides whether or not a query term is included in an IPU. The score between a query term and an IPU was calculated using the phoneme-based edit distance. The phoneme-based index for the baseline system was made from the REF-SYLLABLE transcriptions. The decision point for calculating F-measure (spec.) was determined from the results on the dry-run query set: we adjusted the threshold to give the best F-measure on the dry-run set, which was used as a development set.
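DP-based word spotting with phoneme edit distance can be sketched as below; the unit costs and the free-start/free-end formulation are assumptions in the spirit of the baseline, not its exact configuration.

```python
# Word-spotting edit distance: the query may start and end anywhere inside
# the IPU phoneme sequence (free start/end columns). Costs are assumed unit.

def spotting_distance(query, ipu):
    m, n = len(query), len(ipu)
    prev = [0] * (n + 1)                 # row 0: free start, cost 0 everywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n              # column 0: query prefix unmatched
        for j in range(1, n + 1):
            cur[j] = min(prev[j - 1] + (query[i - 1] != ipu[j - 1]),  # sub/match
                         prev[j] + 1,                                 # deletion
                         cur[j - 1] + 1)                              # insertion
        prev = cur
    return min(prev)                     # free end: best over all end positions

print(spotting_distance(list("kensaku"), list("XXkensakuXX")))  # 0: exact hit
print(spotting_distance(list("kensaku"), list("XXkensHkuXX")))  # 1: one substitution
```

A threshold on this distance (or a normalized version of it) then yields the binary detection decision described above.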
For the CORE query set, most of the runs that used subword-based indexing and a simple matching method (DP or exact matching) outperformed the baseline on F-measure (max) and F-measure (spec.). On the other hand, the runs based on the Hough Transform algorithm (teams AKBL and RYSDT) and the VQ code book (team YLAB) performed below the baseline.
The best STD performance was achieved by "ALPS-1," which used much more of the information in the speech: 10 kinds of transcriptions. However, its retrieval time was the worst among all the submissions. "IWAPU-1" also obtained good STD performance using a few subword-based indices. Therefore, combining multiple indexes may be effective in improving STD performance. NKGW and NKI11 achieved performance a little better than the baseline; however, their searches were faster than those of teams ALPS and IWAPU.
The tasks using the ALL query set may be more difficult than those using the CORE query set, because the baseline performance for ALL is lower than that for CORE. Nevertheless, only the runs of team RYSDT outperformed the baseline on F-measure (max); these results were even better than their results on the CORE query set.

Discussion
This section describes how our STD evaluation framework contributed to new findings through the conduct of the competition.
As described in Section 3, one characteristic of our STD evaluation framework is the distribution of reference transcriptions. In addition, the ASR dictionary, the acoustic models, and the language models were provided to the task participants. This contributed greatly to evaluating the participants' STD outputs. As shown in Figure 1, the ASR performance on the target speech data directly affects the STD performance. Therefore, it is very important to be able to compare indexing and search methods by providing the reference transcriptions, the dictionary, the models, and their training conditions.
As shown in Table 3, four participants used the reference transcriptions. This allowed us to compare the STD performance of the participants' STD techniques, such as indexing and search methods, under the same ASR accuracy on the target speech data. For example, teams NKI11 and AKBL used the same reference transcription (REF-SYLLABLE), and NKI11 obtained better STD performance; NKI11 thus had better indexing and search methods than AKBL, although AKBL's search engine found terms very quickly.
On the other hand, teams ALPS and IWAPU used their own transcriptions, obtained by their ASR systems under the same ASR conditions as the reference transcriptions. ALPS used not only the reference transcriptions but also eight additional sorts of transcriptions; combining these transcriptions yielded the best STD performance among all participants. IWAPU achieved good STD performance by improving the acoustic models, which produced good ASR under the ASR conditions shared by all participants. Although the search methods of ALPS, IWAPU, and NKI11 were similar, differences between their STD performances can be observed; these are attributable to the variety of ASR systems and modeling techniques. Providing the transcriptions, the dictionary, and the models enabled us to compare STD algorithms and to discuss STD performance when similar STD methods were used within our evaluation framework. This may be very useful for STD researchers around the world.
The other characteristic of our STD evaluation framework is that it provides two types of test sets, the ALL and CORE query sets, which target speech data of different sizes. In the sub-task, only two teams, NKI11 and RYSDT, together with the baseline, used both query sets. As shown in Tables 4 and 5, the larger target speech data (the ALL query set) certainly degraded the evaluation measures related to false detections (precision rate and MAP); however, the rate of decline was not so large, even though the speech data for the ALL query set is about 14 times as large as that for the CORE query set. For example, team RYSDT's results on the ALL and CORE query sets showed almost the same performance on F-measure (max) and MAP. For the baseline, as shown in Figures 2 and 3, the precision rate at 50% recall decreased from 55% (CORE) to 20% (ALL); however, false detections increased only 2.3-fold on the ALL query set, even though its speech data is 14 times larger than that of the CORE set. On the other hand, NKI11's STD performance degraded from the CORE to the ALL query set. This shows that RYSDT's STD technique was robust to changes in the size of the speech data.
In the NIST evaluation framework, false detection errors are evaluated by the false alarm probability, which is based on the duration of the target speech data. In other words, the false alarm probability declines in proportion to the target speech size. Therefore, the NIST evaluation cannot fully investigate how STD performance changes when speech data sets of various sizes are used.
Our evaluation framework clarified two findings: an STD researcher can adequately estimate STD performance on a large-scale speech data set without having target speech of that scale, and the STD performance of a specific STD method did not strongly depend on the size of the target speech data. These findings may also be useful for STD researchers.

Conclusions
This paper described the design and evaluation framework of the spoken term detection sub-task, one of the information access technologies for spoken documents. We managed the STD sub-task, one of the SpokenDoc tasks at the NTCIR-9 workshop. In the sub-task, we supplied the data, including rich transcriptions and models for ASR, and test query sets for the CORE and ALL lectures.
In the end, seven teams participated in the STD sub-task, and 18 runs were submitted. Each run was evaluated in terms of detection performance (recall-precision curves, F-measures, and MAP), retrieval time, and computer resources (memory consumption).
A variety of STD methods were proposed in the STD sub-task. Using rich transcriptions from multiple ASR systems (Nishizaki et al. 2011) yielded the best F-measure and MAP among all submissions, but its retrieval speed was very low, which is impractical for a real application. On the other hand, efficient indexing methods (Iwami and Nakagawa 2011; Katsurada et al. 2011) achieved high-speed retrieval; however, their detection performance was significantly below the top performance.
It is important to analyze the advantages and disadvantages of each STD technique and to share the analyses with the task participants. We hope that the NTCIR-9 SpokenDoc will contribute to generating new ideas in STD research and to developing information access technologies for spoken documents.

Fig. 1
Fig. 1 Generic STD processing flow diagram

(a) Word-based transcription (denoted "REF-WORD"), obtained using a word-based ASR system; a word n-gram model was used as the language model of the ASR system. Along with the textual representation, it also provides the vocabulary list used in the ASR, which determines the distinction between the IV and OOV query terms used in our STD sub-task.
(b) Syllable-based transcription (denoted "REF-SYLLABLE"), obtained using a syllable-based ASR system. A syllable n-gram model was used as the language model, where the vocabulary is all Japanese syllables. Using this model avoids the OOV problem of spoken document retrieval; participants who wanted to focus on open-vocabulary STD could use this transcription.
Call for participation: Oct. 2010 – Dec. 2010
Data distribution: Mar. 2011
Dry-run work and results release: May 2011
Formal-run work and results release: July 2011
Participants' draft paper submission: Sept. 2011
Participants' camera-ready paper submission: Nov. 2011
Workshop meeting: Dec. 2011

AKBL submitted two runs for the CORE set. Their indexing method, called Metric Subspace Indexing, is quite different from those used in text indexing. Term detection was performed on these indices based on the Hough Transform algorithm, which is usually used in image processing. Methods for incorporating the multiple candidates from speech recognition into their indexing were also investigated. They used the REF-SYLLABLE transcription.

ALPS (Nishizaki et al. 2011) submitted two runs for the CORE set. Ten transcriptions (including REF-WORD, REF-SYLLABLE, and OWN) obtained from various recognition systems were incorporated into a sausage-style lattice, called PTN, and the search was performed on it using the DTW (Dynamic Time Warping) algorithm. To reduce false detections, two additional scores, roughly corresponding to the degree of consensus and ambiguity among the competing syllables in the lattice, were also incorporated into the distance score used in the DTW process.
RYSDT (Nanjo et al. 2011) submitted three runs for the CORE set and three runs for the ALL set. Term detection was performed based on the Hough Transform, a line detection algorithm usually used in image processing. Several filtering methods were applied to the query-document image plane to improve the line detection performance. The transcription used was REF-WORD.

Fig. 2
Fig. 2 Recall-precision curves for the CORE query sets

Fig. 3
Fig. 3 Recall-precision curves for the ALL query sets

Table 2
STD sub-task participants.

Table 4
STD evaluation results on each measurement for all submitted runs of the CORE set. "Search time" shows the average time to finish the search process for each query.

Table 5
STD evaluation results on each measurement for all submitted runs of the ALL set. "Search time" shows the average time to finish the search process for each query.

Table 6
System information related to the offline processing for those runs using indexing method.