Our NSF proposal is here.
For the first year, we have focused on the following media sources based on their electronic availability and their reputation: Washington Post, BBC, Agence France Presse (AFP), Reuters, Boston Globe, Houston Chronicle, Scripps Howard Newswire, and the Associated Press.
The programs for extracting articles from the Web are complete. As of this report filing, we have obtained all articles pertaining to the Middle East from:
Boston Globe: 1979 to 2003
Houston Chronicle: 1985 to 2003
Scripps Howard Newswire: 1990 to 2003
Washington Post: 1977 to 2003
Associated Press: 1998
We expect to complete the acquisition of all articles from the sources listed in (1) above by the end of July 2003.
We have completed the implementation of a MySQL database of news articles and loaded the stories from the sources above into it. The database also supports linking events data back to the stories from which they were derived. It will be an important tool for extracting and coding a diverse set of events data collections.
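To make the article-to-event linkage concrete, here is a minimal sketch of what such a schema could look like. It is not the project's actual schema: the table names, column names, and helper functions are all hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a database server.

```python
import sqlite3

def create_schema(conn):
    """Create hypothetical 'articles' and 'events' tables (illustrative only)."""
    conn.executescript("""
        CREATE TABLE articles (
            article_id INTEGER PRIMARY KEY,
            source     TEXT NOT NULL,      -- e.g. 'Washington Post'
            pub_date   TEXT NOT NULL,      -- ISO date string
            headline   TEXT,
            body       TEXT NOT NULL
        );
        CREATE TABLE events (
            event_id     INTEGER PRIMARY KEY,
            article_id   INTEGER NOT NULL REFERENCES articles(article_id),
            event_date   TEXT NOT NULL,
            source_actor TEXT NOT NULL,
            target_actor TEXT NOT NULL,
            event_code   TEXT NOT NULL     -- category code from a coding scheme
        );
    """)

def add_article(conn, source, pub_date, headline, body):
    """Insert one story and return its generated id."""
    cur = conn.execute(
        "INSERT INTO articles (source, pub_date, headline, body) "
        "VALUES (?, ?, ?, ?)",
        (source, pub_date, headline, body))
    return cur.lastrowid

def add_event(conn, article_id, event_date, source_actor, target_actor, event_code):
    """Insert one coded event linked to the story it was derived from."""
    cur = conn.execute(
        "INSERT INTO events (article_id, event_date, source_actor, "
        "target_actor, event_code) VALUES (?, ?, ?, ?, ?)",
        (article_id, event_date, source_actor, target_actor, event_code))
    return cur.lastrowid

def events_for_source(conn, source):
    """Return all events derived from stories published by one media source."""
    rows = conn.execute(
        "SELECT e.event_date, e.source_actor, e.target_actor, e.event_code "
        "FROM events e JOIN articles a ON a.article_id = e.article_id "
        "WHERE a.source = ? ORDER BY e.event_date",
        (source,))
    return rows.fetchall()
```

Keeping events in their own table, keyed by article_id, would let several events data collections be coded from the same stories without duplicating the article text.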
We have analyzed the coding program TABARI, developed by Phil Schrodt at the University of Kansas (with the support of the NSF), as a tool for extracting events from the articles that we have gathered. TABARI relies on static, user-generated noun and verb dictionaries. It codes simple declarative sentences well, but fails to detect events in passive-voice sentences and in complex sentence structures with relative clauses. Such complex sentences occur often in our sources, which makes parsing the sentences a necessary step.
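The limitation of purely dictionary-driven coding can be illustrated with a toy coder. This is not TABARI's algorithm or its dictionary format: the dictionaries, the actor and verb codes, and the function code_sentence are all hypothetical, and roles are assigned from word order alone.

```python
import re

# Hypothetical toy dictionaries. The entries and codes are placeholders,
# not TABARI's real dictionary format and not real WEIS codes.
ACTORS = {"israel": "ISR", "egypt": "EGY", "syria": "SYR"}
VERBS = {"criticized": "VERB_CRITICIZE", "met": "VERB_MEET"}

def code_sentence(sentence):
    """Return (source, event, target) using word order only, or None.

    The first dictionary verb is taken as the event; the nearest actor
    before it is the source and the nearest actor after it is the target.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    verb_idx = next((i for i, t in enumerate(tokens) if t in VERBS), None)
    if verb_idx is None:
        return None
    before = [ACTORS[t] for t in tokens[:verb_idx] if t in ACTORS]
    after = [ACTORS[t] for t in tokens[verb_idx + 1:] if t in ACTORS]
    if not before or not after:
        return None
    return (before[-1], VERBS[tokens[verb_idx]], after[0])

print(code_sentence("Israel criticized Syria on Monday."))
print(code_sentence("Syria was criticized by Israel."))
print(code_sentence("Egypt, which met Syria last week, criticized Israel."))
```

In this sketch the simple declarative sentence is coded as intended, the passive sentence comes out with source and target swapped, and the third sentence is coded from its relative clause rather than its main clause. Without syntactic structure, a word-order matcher has no way to tell these cases apart.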
We have successfully used Charniak's parser to generate the most likely parse of a sentence and to extract the main actors and verbs in a sentence. Once the actors and action are identified, the event can be generated using the WEIS coding standard. Our approach is to pre-process sentences using the Charniak parser in order to simplify sentences for processing by TABARI. We are currently working on techniques to dynamically augment the noun and verb dictionaries for use by TABARI.
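As a rough illustration of how actors and the action might be read off a parse, the sketch below works on Penn Treebank-style bracketed trees. It is not our extraction code: the function names, the crude head-noun rule, and the passive-voice handling are assumptions, and the example trees are hand-written in the style of parser output (including an AUX tag for auxiliaries) rather than produced by running the parser.

```python
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}

def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                          # skip ")"
        return (label, children)

    return node()

def child(tree, label):
    """First direct child subtree with the given label, or None."""
    for c in tree[1]:
        if isinstance(c, tuple) and c[0] == label:
            return c
    return None

def head_noun(np):
    """Crude head rule: last noun directly under the NP, else recurse."""
    nouns = [c[1][0] for c in np[1]
             if isinstance(c, tuple) and c[0].startswith("NN")]
    if nouns:
        return nouns[-1]
    inner = child(np, "NP")               # e.g. (NP (NP ...) (SBAR ...))
    return head_noun(inner) if inner else None

def is_verbal(c):
    return isinstance(c, tuple) and c[0].startswith(("VB", "AUX"))

def extract_event(tree):
    """Return (source, verb, target) for the main clause, or None."""
    s = tree if tree[0] == "S" else child(tree, "S")
    if s is None:
        return None
    subj, vp = child(s, "NP"), child(s, "VP")
    if subj is None or vp is None:
        return None
    passive = False
    while child(vp, "VP") is not None:    # auxiliaries sit above the main verb
        aux = [c[1][0].lower() for c in vp[1] if is_verbal(c)]
        passive = passive or any(a in BE_FORMS for a in aux)
        vp = child(vp, "VP")
    verbs = [c for c in vp[1] if is_verbal(c)]
    if not verbs:
        return None
    verb_tag, verb = verbs[0][0], verbs[0][1][0]
    subject = head_noun(subj)
    if passive and verb_tag == "VBN":     # "X was VERBed by Y": Y is the source
        pp = child(vp, "PP")
        prep = child(pp, "IN") if pp else None
        agent_np = child(pp, "NP") if pp else None
        if prep is None or prep[1][0].lower() != "by" or agent_np is None:
            return None
        return (head_noun(agent_np), verb, subject)
    obj = child(vp, "NP")
    if obj is None:
        return None
    return (subject, verb, head_noun(obj))

passive_tree = parse_tree(
    "(S1 (S (NP (NNP Syria)) (VP (AUX was) (VP (VBN criticized) "
    "(PP (IN by) (NP (NNP Israel))))) (. .)))")
print(extract_event(passive_tree))
```

Because the roles are taken from the tree rather than from word order, the passive example yields Israel as the source and Syria as the target. The resulting triple could then be rewritten as a simple declarative sentence for TABARI or mapped to a WEIS category; that mapping is not shown here.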