What's Going On: Answering Questions With Web Corpora

June 28, 2017

Pedro Szekely understands dark matter in an all-too-human, not cosmological, universe. Since 2015, his work with Craig Knoblock on sophisticated, cloud-based analytics has generated substantial media attention for its ability to expose human trafficking hidden deep online.

The system essentially uses open-source software to transform hard-to-search data into concrete law enforcement leads.

Szekely recently gave a seminar describing how cloud-based Domain-Specific Insight Graphs (DIGs) plumb the open and dark web. The dark web, a vast counterpart to the conventional surface web, is a favorite destination for human traffickers, gun runners and other illegal operatives. Part of ISI's continuing "What's Going On" series, which deepens researchers' knowledge of work being conducted Institute-wide, the talk was attended by about 30 ISIers in Marina del Rey, California, and Arlington, Virginia.

Szekely began with the mandate issued by his DARPA project manager: Search millions of web pages, all in different formats, for names, phone numbers and any other relevant information about specific people for whom little other identifying information was available. The task is part of Memex, a DARPA browser created to help law enforcement uncover patterns and relationships in previously unsearchable criminal data. His project manager, says Szekely, "poses these crazy challenges and people actually figure out how to do something useful."

While the Szekely/Knoblock team didn't have to identify pages, which were provided by a different Memex group, they did have to deal with the challenge of extremely noisy data. For example, the name "Charlotte" could refer either to a person or to the North Carolina city. In fact, names are almost always invented, and intentionally confusing, to make searching more difficult. Ethnicities may be couched in terms like "caramel", hair colors may use synonyms like "auburn" for "red", and identifying details such as phone number, height, weight, price and social media locators may or may not appear.

That meant the system had to be able to find relevant pages, build a search index and answer questions, many of which may be posed by users with little or no DIG training. Szekely and Knoblock also sought to create hybrid information retrieval (IR) and semantic web knowledge graphs, have those graphs scale well beyond traditional limits, and record data provenance so information could be traced back to its source.

Their solution involves locating the most precise matches first, then progressively relaxing criteria. Szekely walked through the steps involved, including data extraction, creation of IR/knowledge graphs (KGs), KG reasoning, candidate generation and ranking. "When we started, we thought it would be a miracle if it answered anything," says Szekely. "But surprisingly, it does."
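The "most precise matches first, then relax" idea can be illustrated with a minimal sketch. This is not DIG's actual code: the toy records, the field names and the `relaxed_search` function are all hypothetical, and real candidate generation and ranking are certainly more involved. The sketch simply tries to satisfy every constraint in a query, and if nothing matches, drops one constraint at a time until some records qualify.

```python
from itertools import combinations

# Hypothetical toy records standing in for data extracted from web pages.
RECORDS = [
    {"id": 1, "name": "charlotte", "city": "charlotte", "hair": "red"},
    {"id": 2, "name": "charlotte", "city": "raleigh", "hair": "brown"},
    {"id": 3, "name": "amber", "city": "charlotte", "hair": "red"},
]


def matches(record, constraints):
    """True if the record satisfies every field/value pair in constraints."""
    return all(record.get(field) == value for field, value in constraints.items())


def relaxed_search(records, constraints):
    """Return (number of constraints satisfied, matching records).

    Starts with the full query, then allows one more constraint to be
    dropped on each pass until at least one record matches.
    """
    fields = list(constraints)
    for keep in range(len(fields), 0, -1):
        hits = {}
        # Try every way of keeping `keep` of the original constraints.
        for subset in combinations(fields, keep):
            relaxed = {field: constraints[field] for field in subset}
            for record in records:
                if matches(record, relaxed):
                    hits[record["id"]] = record
        if hits:
            return keep, sorted(hits.values(), key=lambda r: r["id"])
    return 0, []


if __name__ == "__main__":
    query = {"name": "charlotte", "hair": "red", "city": "raleigh"}
    satisfied, found = relaxed_search(RECORDS, query)
    print(satisfied, [r["id"] for r in found])
```

In this toy data no record satisfies all three constraints of the sample query, so the search falls back to records that satisfy any two of them. Returning the number of constraints satisfied alongside the results gives a crude signal of how far the query had to be relaxed.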

In fact, the system's accuracy rate is above 0.9 out of a possible 1, extremely high considering the substantial analytic barriers involved. DIG has a base of over 100 million web pages, from which two billion records have been extracted, and collects 5,000 pages hourly.

Multiple US law enforcement agencies are now using DIG to answer questions and transform replies into maps, timelines and tables that help identify traffickers. Officials can pose such questions as, "What other ads were authored by the same person?", "What is the most common ethnicity of massage parlor workers in Orange County?" or "What is the average price of escort services in San Bernardino?"

Interestingly, DIG doesn't have a deep model, which means it answers questions without fully understanding what is happening on each web page. Szekely views that as a strength, since the system can respond quickly - and users can go to searched pages to verify DIG's results. While DIG doesn't perform well when questions have no good answers, he says, "at least we return the right documents."