Publications - Published papers

©NOTICE: All papers are copyrighted by the authors. If you would like to use all or a portion of any paper, please contact the author.

litsift: Automated Text Categorization in Bibliographic Search

Lukas C. Faulstich, Peter F. Stadler, Caroline Thurner, Christina Witwer


PREPRINT 03-014

Status: Published

In: Data Mining and Text Mining for Bioinformatics [Proceedings of the European Workshop held in conjunction with ECML/PKDD-2003 in Dubrovnik, Croatia, 22 September 2003], Tobias Scheffer, Ulf Leser (eds.), Humboldt Univ., Berlin, 2003, pp. 20-25.


In bioinformatics there exist research topics that cannot be uniquely characterized by a set of keywords, because relevant keywords are (i) also heavily used in other contexts and (ii) often omitted in relevant documents because the context is clear to the target audience. Information retrieval interfaces such as Entrez/PubMed produce either low precision or low recall in this case. To achieve high recall at reasonable precision, the results of a broad information retrieval search have to be filtered to remove irrelevant documents. We use automated text categorization for this purpose. In this study we use the topic of conserved secondary RNA structures in viral genomes as a running example. PubMed result sets for two virus groups, Picornaviridae and Flaviviridae, have been manually labeled by human experts. We evaluated various classifiers from the Weka toolkit together with different feature selection methods to assess whether classifiers trained on documents dedicated to one virus group can be successfully applied to filter documents on other virus groups. Our results indicate that in this domain a bibliographic search tool trained on a reference corpus may significantly reduce the amount of time needed for extensive literature searches.
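The cross-group filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' litsift implementation: the paper evaluated Weka classifiers with feature selection, whereas this example assumes a small hand-written multinomial naive Bayes classifier with add-one smoothing and no feature selection. The class `NaiveBayesFilter`, the helper `precision_recall`, and all toy documents and labels are hypothetical and invented for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for tok in tokenize(doc):
                if tok in self.vocab:  # words unseen in training are ignored
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def precision_recall(predicted, actual, positive=True):
    """Precision and recall for the 'relevant' (positive) class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labeled titles for one virus group (True = relevant to the topic).
train = [
    ("conserved rna secondary structure in the picornavirus genome", True),
    ("stem loop structure elements of the poliovirus untranslated region", True),
    ("clinical outcome of picornavirus infection in children", False),
    ("vaccine trial against poliovirus in a hospital cohort", False),
]
# Hypothetical titles for a different virus group, used only for evaluation.
test = [
    ("conserved stem loop rna structure in the flavivirus genome", True),
    ("hospital cohort study of hepatitis c infection in children", False),
]

model = NaiveBayesFilter().fit([d for d, _ in train], [y for _, y in train])
predictions = [model.predict(d) for d, _ in test]
precision, recall = precision_recall(predictions, [y for _, y in test])
print(predictions, precision, recall)
```

The sketch trains on documents about one virus group and filters documents about another, which mirrors the transfer question the study asks. With only a handful of invented titles it says nothing about real-world precision or recall.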


Automated Text Categorization, Document Filtering