Plush Thoughts

Where information lives…

Posted in Information Extraction, Information Sources by plushloony on August 6, 2009

I’d like to take a look at the problem of identifying the most valuable publicly available web sources of unstructured or semi-structured data, sources that could be used efficiently for extracting useful information. By the value of a source I mean the ratio of the amount of potentially useful information extracted to the complexity of the extraction algorithms required to obtain it; the best sources maximize this ratio. Here are just a few thoughts on this…
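One way to write this notion of value down, reading it as useful information gained per unit of extraction effort. The symbols V, I and C are illustrative shorthand only, and neither quantity comes with an agreed unit of measurement here:

```latex
V(s) = \frac{I(s)}{C(s)}, \qquad s^{*} = \arg\max_{s \in S} V(s)
```

Here I(s) stands for the amount of potentially useful information that can be extracted from source s, C(s) for the complexity of the extraction algorithms needed to get it, and S for the set of candidate sources.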

We don’t expect a single text taken from the web to be a trusted source. I think it’s obvious that a text reflects its author’s mistakes and subjective opinions, and that any method of information retrieval returns a number of “false positives”. Thus we need to consider sources that provide a significant amount of similar text on the same subject, written by different authors. In that case, the points the texts have in common can be treated as trusted, or at least as important, facts.
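A minimal sketch of that idea, assuming the hard part (extracting comparable claims from each text) is already done. The function name `common_claims`, the `min_share` threshold and the sample reviews are all made up for illustration:

```python
from collections import Counter


def common_claims(claims_by_author, min_share=0.5):
    """Return the claims made by at least `min_share` of the authors.

    `claims_by_author` maps an author id to the list of claims extracted
    from that author's text (hypothetical input format).
    """
    n_authors = len(claims_by_author)
    if n_authors == 0:
        return []
    counts = Counter()
    for claims in claims_by_author.values():
        # Count each claim once per author, even if it is repeated.
        counts.update(set(claims))
    return sorted(c for c, k in counts.items() if k / n_authors >= min_share)


# Made-up example: three authors reviewing the same product.
reviews = {
    "author_a": ["battery lasts long", "screen is dim"],
    "author_b": ["battery lasts long", "camera is great"],
    "author_c": ["battery lasts long", "screen is dim", "too expensive"],
}
trusted = common_claims(reviews, min_share=0.6)
print(trusted)
```

Claims made by only one author drop out, which is the point: a single mistake or personal opinion does not survive the vote, while statements repeated independently do.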

The next problem is homogeneity… One author can describe a lot of things in one text. This significantly increases the complexity of information processing, because we would need to identify which of the author’s thoughts relates to which subject.

And as the last item in the list of most significant problems I would name relevancy. For example, if we want to know opinions about a new movie, we are not interested in reading a press release cross-posted on someone’s blog. Filtering it out is technically possible, but it’s hard to identify that such a blog item is not a review.
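To illustrate the "technically possible" part, here is a rough sketch of one way to flag a cross-posted press release: compare the blog item against a known press release using word-shingle overlap (Jaccard similarity). This only works when the press release text is already at hand, which is a big assumption. The function names, the shingle size and the 0.8 threshold are illustrative choices, not tuned values:

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def looks_like_cross_post(blog_item, press_release, threshold=0.8):
    """Guess whether `blog_item` is mostly a copy of `press_release`."""
    return jaccard(blog_item, press_release) >= threshold


# Made-up texts for illustration.
press_release = ("the studio is proud to announce its new movie "
                 "opening this friday in theaters nationwide")
blog_copy = "reposting: " + press_release
blog_review = ("i watched the new movie last night and the plot "
               "felt slow but the acting was strong")

print(looks_like_cross_post(blog_copy, press_release))
print(looks_like_cross_post(blog_review, press_release))
```

A check like this catches verbatim copies only. A blogger who paraphrases the press release, or quotes it inside a genuine review, would need something smarter, which is why blogs remain a harder source.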

So from my point of view

  • good sources: customer reviews, professional product reviews
  • bad sources: blogs, forums, press releases

…other examples?
