Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval
Advisor: Dr. Hsu
Presenter: Zih-Hui Lin
Authors: Ying Zhang and Phil Vines
Outline
Motivation
Objective
Previous work
Methodology
Experiments and results
Conclusions
Motivation
One of the major remaining reasons that CLIR does not perform as well as monolingual retrieval is the presence of out-of-vocabulary (OOV) terms.
A Chinese OOV term will not be recognized, and is instead segmented into either smaller sequences of characters or individual characters.
北野武 → (north limit military)
Previous work has either relied on manual intervention or has been only partially successful in solving this problem.
Objective
We propose a segmentation-free method that can be applied to both Chinese-English and English-Chinese CLIR. It automatically extracts correct translations of OOV terms from the Web, and is thus a significant improvement on earlier work.
English translation extraction in Chinese-English CLIR
Chinese OOV term detection
北野武 (north limit military) → the Pvalue given by the HMM will be very low; if Pvalue < Pmin, the query contains OOV terms
Web text extraction
We extract strings that contain the Chinese query terms and some English text from the Web.
Collection of co-occurrence statistics
Translation selection
Search for the longest Chinese substring Ct:
1. |Ctargets| = max(|Cij|).
2. f(et, Ct) = max(f(ei, Ctargets)).
3. Add (Ct, et) into the translation dictionary.
Search for the English term et with the highest frequency:
1. f(etargets) = max(f(ei)).
2. f(et’, Ct’) = max(f(etargets, Cij)).
3. If Ct’ ≠ Ct and et’ ≠ et, add (Ct’, et’) into the translation dictionary.
北野武 (Kitano Takeshi): c4 c5 c6, e1
導演北野武 (Kitano Takeshi): c2 c3 c4 c5 c6, e1
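The two selection passes above can be sketched in Python. This is a minimal illustration, not the authors' implementation. It assumes the co-occurrence statistics have already been collected into a dictionary mapping (Chinese substring, English term) pairs to frequencies. The function name `select_translations` and the input layout are hypothetical, and ties are broken by insertion order, which the slides do not specify.

```python
from collections import Counter


def select_translations(cooc):
    """Pick translation pairs from co-occurrence counts.

    cooc: dict mapping (chinese_substring, english_term) -> frequency.
    This input layout is an assumption made for illustration.
    """
    dictionary = []
    if not cooc:
        return dictionary

    # Pass 1: restrict to the longest Chinese substrings (Ctargets),
    # then take the English term that co-occurs with them most often.
    max_len = max(len(c) for c, _ in cooc)
    targets = {pair: f for pair, f in cooc.items() if len(pair[0]) == max_len}
    c_t, e_t = max(targets, key=targets.get)
    dictionary.append((c_t, e_t))

    # Pass 2: find the English terms with the highest overall frequency
    # (etargets), then the Chinese substring co-occurring with them most.
    english_freq = Counter()
    for (_, e), f in cooc.items():
        english_freq[e] += f
    top = max(english_freq.values())
    e_targets = {e for e, f in english_freq.items() if f == top}
    candidates = {pair: f for pair, f in cooc.items() if pair[1] in e_targets}
    c_t2, e_t2 = max(candidates, key=candidates.get)

    # Add the second pair only if both sides differ from the first pair,
    # following the condition as stated on the slide.
    if c_t2 != c_t and e_t2 != e_t:
        dictionary.append((c_t2, e_t2))
    return dictionary
```

Calling `select_translations` with counts gathered from web snippets returns one or two (Chinese, English) pairs. The first pass favors the longest matched Chinese substring, while the second favors the most frequent English term.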
Chinese translation extraction in English-Chinese CLIR
Extraction of web text
Use Google to fetch the top 100 Chinese documents, with the English OOV term eoov as the query.
Collection of co-occurrence statistics
Accumulate the frequency foov.
considering all
北野武(north