Leon Derczynski

Department of Computer Science, University of Sheffield

Current Projects

Question Answering against Very Large Text Collections

Background

While search engines return documents in response to a user query, the newer technology of open domain question answering (QA) attempts to return precise answers to specific questions. For example, given the question "How tall is the Eiffel Tower?", a search engine will return a set of pages which the user must read to determine an answer, while a QA system will return the answer itself, e.g. "324 metres".

In 1999 the US National Institute of Standards and Technology (NIST) introduced, as part of the Text REtrieval Conference (TREC), an open international evaluation exercise for question answering systems, which has run annually since then. Each year the task set has been made more realistic and hence more challenging. The Sheffield NLP group has taken part most years and is likely to do so again in 2008. While our results have always been competitive, restricted time and resources have limited our performance. One of the objectives of this project will be to put Sheffield in a better position next year in the ranking against other leading research groups, such as Microsoft, IBM, MIT, Tokyo, and Edinburgh.

Most QA systems, including those developed at Sheffield, use a conventional search engine to retrieve a set of texts deemed likely to contain an answer to a question. A second component, the answer extraction component, then identifies which segments of the returned texts actually are the answer.
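The two-stage design can be illustrated with a minimal sketch. It is not the Sheffield system: the term-overlap ranking, the regular-expression extractor, the function names and the three toy documents are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lower-case a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, docs, k=2):
    """Stage 1 (toy IR): rank documents by how many question terms they share."""
    q_terms = set(tokenize(question))
    ranked = sorted(docs, key=lambda d: len(q_terms & set(tokenize(d))), reverse=True)
    return ranked[:k]

def extract_answer(passages, answer_pattern):
    """Stage 2 (toy AE): return the first span matching the expected answer type."""
    for passage in passages:
        match = re.search(answer_pattern, passage)
        if match:
            return match.group(0)
    return None

# Invented toy collection, for illustration only.
DOCS = [
    "The Eiffel Tower is 324 metres tall.",
    "Paris is the capital of France.",
    "The tower was completed in 1889.",
]

passages = retrieve("How tall is the Eiffel Tower?", DOCS)
# A "How tall" question expects a length, so look for a number followed by a unit.
answer = extract_answer(passages, r"\d+(?:\.\d+)? metres")
print(answer)
```

If the retrieval stage does not return a passage containing the answer, the extraction stage has nothing to find, which is why the quality of the first stage matters so much.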

The Sheffield Natural Language Processing group now has implementations of two search engines for QA, as well as two answer extraction engines. One takes a shallow approach based on semantic tagging of question-specific answer types (e.g. persons in relation to "Who" questions); the other relies on deeper linguistic analysis of text. The group also has several interface components that allow these engines to be run in various configurations against various text collections (including the Web). One of these interfaces makes use of Web server technologies (Tomcat/Apache) to run the QA systems on a host machine and deliver answers to remote web clients. Finally, a previous Darwin project on this topic developed a web-based failure analysis tool that allows the results of different system configurations against various question sets to be stored in a relational database and then analysed, in order to better understand which techniques work best.

Progress

Data driven approach to query expansion

Passing a plaintext query (after pronoun resolution) to high-performance IR engines still produces unsatisfactory coverage. As the answers for previous years' TREC tasks are known, text from these is used to extend queries. Sample extension report
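One way such an expansion could work is sketched below. This is an assumption about the general idea, not the project's actual method: terms that recur in answer-bearing text for earlier questions of the same kind are appended to a new query. The stopword list, function names and example passages are invented for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "of", "is", "how", "has", "its"}

def tokenize(text):
    """Lower-case a string and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def expansion_terms(answer_texts, query, n=2):
    """Pick the n terms found in the most answer texts that the query lacks."""
    query_terms = set(tokenize(query))
    counts = Counter()
    for text in answer_texts:
        # dict.fromkeys removes duplicates while keeping first-seen order,
        # so each text contributes at most one count per term.
        for term in dict.fromkeys(tokenize(text)):
            if term not in query_terms and term not in STOPWORDS:
                counts[term] += 1
    return [term for term, _ in counts.most_common(n)]

def expand_query(query, answer_texts, n=2):
    """Append the selected expansion terms to the original query."""
    return " ".join([query] + expansion_terms(answer_texts, query, n))

# Invented answer-bearing passages for earlier "How tall" questions.
PAST_ANSWER_TEXTS = [
    "Mount Everest has a height of 8848 metres",
    "The statue reaches a height of 93 metres",
    "Its height is 828 metres",
]

expanded = expand_query("How tall is the Eiffel Tower", PAST_ANSWER_TEXTS)
print(expanded)
```

The words "height" and "metres" never appear in the original question, yet they are exactly what answer-bearing text tends to contain, which is the motivation for learning expansions from data.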

Formal comparison of IR engines for question answering

As Lucene only reaches around 60% coverage after question pre-processing, work is needed if the IR component is to reliably supply text containing the answers to an answer extraction (AE) processor. An IR stage that fails on 40% of questions before AE is even performed caps what the whole system can achieve. Alternative engines (including Indri and Terrier) are compared against Lucene in various configurations, and performance is measured across different TREC QA years and question types, in order to identify weak spots and provide an objective comparison of these engines on this task.
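Coverage here can be read as the proportion of questions for which at least one of the top n retrieved texts contains a correct answer. The sketch below computes it under that reading; the data layout (question ids mapped to ranked texts and to answer patterns) and all example values are assumptions made for illustration.

```python
import re

def coverage(retrieved, answer_patterns, n):
    """Fraction of questions with an answer-bearing text in the top n results.

    retrieved: dict mapping question id -> ranked list of retrieved texts
    answer_patterns: dict mapping question id -> regex matching a correct answer
    """
    hits = 0
    for qid, texts in retrieved.items():
        pattern = re.compile(answer_patterns[qid])
        if any(pattern.search(text) for text in texts[:n]):
            hits += 1
    return hits / len(retrieved)

# Invented toy results for five questions.
RETRIEVED = {
    "q1": ["The tower is 324 metres tall.", "Paris is in France."],
    "q2": ["Nothing relevant here.", "Canberra is the capital of Australia."],
    "q3": ["Unrelated text.", "More unrelated text."],
    "q4": ["Mozart was born in 1756."],
    "q5": ["No answer in this one."],
}
PATTERNS = {
    "q1": r"324 metres",
    "q2": r"Canberra",
    "q3": r"Amazon",
    "q4": r"1756",
    "q5": r"Everest",
}

print(coverage(RETRIEVED, PATTERNS, n=2))  # three of five questions are covered
print(coverage(RETRIEVED, PATTERNS, n=1))  # q2's answer sits at rank 2, so it is missed
```

Computing the same figure per engine, per TREC year and per question type gives a like-for-like basis for comparison.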

Spoken Language Processing by Mind and Machine

Fantastic and compelling set of lectures by Roger Moore.

Modelling and Simulation of Natural Systems


Selected past projects

Computational Systems Biology

Although modelling and simulation have been used to investigate processes and phenomena in the physical sciences and engineering for many years, they have been less widely used in biology, physiology and medicine until recently. This ART will introduce students to examples of complexity in biological systems, the philosophy of modelling and simulation, and the role of systems analysis and computation in the biosciences. The material covered will relate to activities in the Computational Systems Biology research group in the Department.

Machine learning techniques for document selection

As humans use information retrieval systems, a wealth of data is generated. The problem of determining documents relevant to a query can be learned instead of developing a blind information retrieval system. Feedback on relevance can be used as training data for machine learning algorithms, with the end goal of creating a system reliant on human relevance judgements instead of conventional information retrieval methods.

This project will review approaches used for returning search results over a collection of independent documents, evaluation of information retrieval systems, and teaching machine learning algorithms to classify documents given a natural language query.

The performance of a set of machine learning algorithms at classifying relevant documents is examined. Some exploratory work on optimising problem representations is undertaken, with varying degrees of success. Other approaches for gathering data and classifying documents to aid humans in search are also discussed.

COM3220 - Advanced Software Engineering

A series of software engineering related seminars, reviewing current publications in software engineering. A full list of presented SE papers is available.

My review of Nerur, S., Mahapatra, R.K., Mangalaraj, G. (2005) “Challenges of migrating to agile methodologies” in Communications of the ACM, 2005, Vol 48 issue 5, pp 72 – 78 can be found here: Derczynski, L. (2006) "Organisational changes in migration to agile development strategies" (pdf). You can also see the accompanying slides (PowerPoint).

COM3250 - Machine learning

Decision Trees in Weka

For both datasets your objective is to learn a decision tree to classify instances. In the case of the mushroom data the tree will be for classifying mushrooms into the classes ‘e’ (edible) and ‘p’ (poisonous), i.e. the target attribute is class. In the case of animals the tree will classify them into one of 7 types, i.e. the target attribute is type. For each dataset you should try to do this using two decision tree learning algorithms in Weka: ID3 and J48. The latter is the Weka implementation of Quinlan’s C4.5 decision tree algorithm, which provides various refinements to the basic ID3 decision tree algorithm (e.g. methods for handling numeric attributes, missing data, overfitting).
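At each node ID3 chooses the attribute with the highest information gain, i.e. the largest reduction in class entropy. The sketch below shows that calculation in Python rather than Weka; the four rows are invented and use descriptive values, not the actual encoding of the UCI mushroom data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Invented toy rows: odor separates the classes perfectly, cap shape not at all.
ROWS = [
    {"odor": "none", "cap": "flat", "class": "e"},
    {"odor": "none", "cap": "convex", "class": "e"},
    {"odor": "foul", "cap": "flat", "class": "p"},
    {"odor": "foul", "cap": "convex", "class": "p"},
]

for attribute in ("odor", "cap"):
    print(attribute, information_gain(ROWS, attribute, "class"))
```

On these rows ID3 would split on odor first, since it removes all uncertainty about the class while cap shape removes none.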

Once again use the mushroom.arff and zoo.arff datasets from the UCI datasets. Instead of using the default 10 fold cross-validation when testing the ID3 and C4.5 classifiers, choose the percentage split test option and vary the split at several points in the range 1% to 99%. How do ID3 and C4.5 compare as the proportion of test/training data is altered?
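A percentage split simply partitions the shuffled data into a training part and a held-out test part. The sketch below shows the idea in Python, assuming the percentage names the share used for training; the function name, the seed and the rounding rule are illustrative choices and are not taken from Weka.

```python
import random

def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for 100 dataset instances.
data = list(range(100))
for percent in (1, 33, 66, 99):
    train, test = percentage_split(data, percent)
    print(percent, len(train), len(test))
```

At the extremes either the training set or the test set becomes tiny, so accuracy figures there are expected to be unstable.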

Data: ARFF datasets, including those used

Findings: Building decision trees in WEKA for mushroom and animal classification

Text Classification in Weka

The aim of the assignment is to investigate how machine learning algorithms in Weka can be used to carry out a text classification task. The task is to classify movie reviews as positive or negative, a task sometimes referred to as sentiment detection. The data set to use is the polarity dataset v2.0 available from http://www.cs.cornell.edu/people/pabo/movie%2Dreview%2Ddata/

Findings: Sentiment detection with movie reviews, using a Naive Bayes classifier and n-gram analysis
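The general technique can be sketched in a few lines of Python. This is not the Weka setup used for the findings above: it is a minimal multinomial Naive Bayes classifier with add-one smoothing over unigram and bigram features, and the class name, helper function and four training snippets are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a whitespace-split text."""
    tokens = text.lower().split()
    features = []
    for n in range(1, n_max + 1):
        features += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return features

class NaiveBayes:
    def fit(self, texts, labels):
        """Count classes and per-class n-gram occurrences."""
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngrams(text))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, text):
        """Return the class with the highest log posterior (add-one smoothing)."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.class_counts:
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feature in ngrams(text):
                if feature in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy reviews, for illustration only.
TEXTS = ["a great film", "great acting and a good plot",
         "a dull film", "bad acting and a dull plot"]
LABELS = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(TEXTS, LABELS)
print(model.predict("great plot"), model.predict("dull acting"))
```

Raising n_max adds longer n-grams as features, at the cost of a larger and sparser vocabulary.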


Links

Information Retrieval
Usability
Machine Learning
Software Engineering

Contact details

email me - leondz@gmail.com
