Where information lives…
I’d like to look at the problem of identifying the most valuable publicly available web sources of unstructured or semi-structured data: sources from which useful information can be extracted efficiently. By the value of a source I mean the ratio of the amount of potentially useful information extracted to the complexity of the algorithms required to extract it; the higher this ratio, the better the source. Here are just a few thoughts on this…
We can’t expect a single text taken from the web to be a trusted source. I think it’s obvious that a text reflects its author’s mistakes and subjective opinions, and any method of information retrieval also returns a number of “false positives”. Thus we need to consider sources that provide a significant amount of similar text on the same subject, written by different authors. In that case the things the texts have in common can be treated as trusted, or at least as important, facts.
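As a rough illustration of the "common things across authors" idea, here is a minimal sketch that counts in how many texts each term appears and keeps the terms shared by most authors. It is only an approximation: plain words stand in for facts, and the stopword list, the function names `terms` and `common_terms`, the `min_share` threshold and the sample reviews are all made up for this example.

```python
import re
from collections import Counter

# Tiny hand-made stopword list; an assumption for this sketch only.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "but", "of", "to",
             "in", "i", "this", "that", "was", "very", "has"}

def terms(text):
    """Lowercased word tokens without stopwords, as a set (one vote per author)."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def common_terms(texts, min_share=0.6):
    """Return the terms found in at least min_share of the texts, with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(terms(text))  # each text adds at most 1 per term
    threshold = min_share * len(texts)
    return {term: n for term, n in counts.items() if n >= threshold}

# Hypothetical reviews of the same gadget by three different authors.
reviews = [
    "The battery is great but the screen scratches easily.",
    "Great battery, poor screen.",
    "I love the battery. The camera is weak.",
]

print(common_terms(reviews))
```

Terms mentioned by only one author (a possible mistake or a personal opinion) drop out, while those repeated by several authors survive. A real system would compare extracted statements rather than single words, but the counting step would look much the same.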
The next problem is homogeneity… One author can describe a lot of things in one text. It significantly increases the complexity of processing if we need to work out which of the author’s thoughts relates to which subject.
And as the last item in the list of the most significant problems I would name relevancy. For example, if we want to know opinions about a new movie, we are not interested in reading a press release cross-posted in someone’s blog. Technically it can be done, but it’s hard to identify that such a blog item is not a review.
So, from my point of view:
- good sources: customer reviews, professional product reviews
- bad sources: blogs, forums, press releases
…other examples?
NLP for free! NLP for fun!
I can’t say that I know a lot of services on the web that offer the results of sophisticated natural language processing (NLP). OK, there are the huge search engines, news aggregators, plagiarism detectors, some academic research projects and… what else? Sure, at the corporate and government level there are systems that process data such as unstructured customer feedback, communications data for security purposes, news etc., but NLP gives almost nothing to ordinary people. Nothing for having fun! It can hardly be difficult to suggest a cool problem to solve with NLP methods. I simply believe that not many people know how interesting it is to work on them. Moreover, most of the software created for NLP and information extraction is a product of research and absolutely free. Don’t think these are ugly student-made programs; if you do, take a look at GATE (General Architecture for Text Engineering) from the University of Sheffield.

It’s really powerful open source software that already has dozens of extensions and can be used in almost any text processing task. The system is well documented and has a preconfigured module called ANNIE that solves the standard problems of annotating English text: tokenizing, stemming, adding morphological information and extracting named entities. It’s also simple to write your own grammar rules to extract any kind of information you need. Try playing with it and maybe you’ll get an idea of how to get some value from it! I’ll share some experience of using it in further posts.
Paroles, paroles, paroles…
Don’t you think that there is a lot of absolutely useless information in the world? Information that is hard to even call information. Terabytes of redundant characters… Take a look at scientific articles, PhD theses, textbooks etc. All of them could be much thinner if their authors left only the main ideas and accurate proofs based on the rules of formal logic. Even in areas where there is no place for hot air, papers remain full of empty sentences and redundant word constructions. This note is not an exception! It’s probably in human nature to generate twaddle and hide ideas and facts behind a wall made of words and secondary mental images.
It’s not a problem if the topic you are interested in is described in a single trusted and complete source. You can allow yourself to spend hours or even days studying it. If you are a professional, you’ll also find what you need in a 1000-page book. Even if two or three sources are enough to get a result, context search would be a solution. But what if the search results for your query consist of hundreds of documents and you are not that familiar with the subject area?
Here is a simple example: reviews of a new gadget. For an advanced internet user it’s an easy task to get hundreds of related articles and identify the most promising, complete and trustworthy ones, and… in the end you get the subjective opinion of one or several “experts”. I’m not sure anyone will ever carefully read more than 3-4 reviews, or study badly organized or foreign-language resources.
Some companies say that their products “make sense from content”. Sometimes they mean aggregation, search, or data mining and analytics. How can we find the facts and knowledge behind words and sentences? That will be the topic of some posts in this blog…
Do you have any examples of services built on the same idea?
To be continued…