Open Calais looks to be one of the best ways to start implementing semantic web capabilities on your websites and blogs. As Web 2.0 standards become entrenched, the movement towards Web 3.0 will inevitably start to quicken. What is the semantic web? In short, it is a way to improve the meta-tagging of information to facilitate machine processing. To quote Tim Berners-Lee [1]:
“For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning. Artificial-intelligence researchers have studied such systems since long before the Web was developed. Knowledge representation, as this technology is often called, is currently in a state comparable to that of hypertext before the advent of the Web: it is clearly a good idea, and some very nice demonstrations exist, but it has not yet changed the world. It contains the seeds of important applications, but to realize its full potential it must be linked into a single global system.”
To build this global standard, the W3C has been promoting the Resource Description Framework (RDF) to structure data. RDF establishes a system of “subject-predicate-object” statements, known as triples, that can be used to increase the complexity of meta-tagging in a way that can be processed automatically by machine agents like search engines and AI. To give an example, say we are searching for “Berlin”. The query runs and millions of results return, matching web pages where the word “Berlin” appears in the text. This is excellent, but it gives the computer no way to really sort or sift the information (other than PageRank) unless the query is made more complex and searches for additional terms.
With RDF, we could code a simple structure like “Berlin” (subject) is the “capital” (predicate) of “Germany” (object). If we then asked the computer “what is the capital of Germany,” it could search through the sources of RDF data, analyze them according to pre-set rules of verification and trust, and report: according to 745,321 sites, “Berlin is the capital of Germany.” However simple this example is, it should be evident that the three parts can be combined in numerous ways to dynamically change how search results are drawn and how data is processed across a large, decentralized network of information and documents like the World Wide Web. There is a lot of great documentation on this online at sites like the W3C’s.
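To make the triple idea concrete, here is a minimal sketch in Python of a tiny in-memory triple list and a pattern query. Everything in it is an assumption made for illustration: the `Triple` type, the `match` helper, the `capitalOf` predicate name, and the example site names are all made up, and a real RDF store would use full URIs and proper trust rules rather than a simple vote count.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # the site that published the statement (made up here)


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical statements gathered from hypothetical sites.
triples = [
    Triple("Berlin", "capitalOf", "Germany", "site-a.example"),
    Triple("Berlin", "capitalOf", "Germany", "site-b.example"),
    Triple("Bonn", "capitalOf", "Germany", "site-c.example"),
    Triple("Paris", "capitalOf", "France", "site-a.example"),
]

# "What is the capital of Germany?" means: leave the subject open.
answers = Counter(
    t.subject for t in match(triples, predicate="capitalOf", obj="Germany")
)
best, votes = answers.most_common(1)[0]
print(f"According to {votes} sites, {best} is the capital of Germany.")
```

Leaving a different slot open changes the question: `match(triples, subject="Berlin")` asks what is known about Berlin, while `match(triples, predicate="capitalOf")` lists every capital statement in the data.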
So, to set up Open Calais on a website, where it will automatically scan your site’s content and code these “subject-predicate-object” relationships based on existing keywords and machine learning, the RDF framework must first be installed on the site. For example, in Drupal 6.x this can be done by installing and configuring the RDF module and, optionally, the ARC2 library. See the following pages for more information:
The RDF module and ARC2 library provide support for the following standards:
- N-Triples: http://www.w3.org/TR/rdf-testcases/#ntriples
- RDF/JSON: http://n2.talis.com/wiki/RDF_JSON_Specification
- RDF/PHP: http://drupal.org/node/219870
- RDF/XML: http://www.w3.org/TR/rdf-syntax-grammar/
- TriX: http://www.w3.org/2004/03/trix/
- Turtle: http://www.dajobe.org/2004/01/turtle/
You can look through the details and see which one you would like to use for your site. I chose RDF/XML, as it seems to be the broad favorite for RDF data.
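To get a feel for how these formats differ, the sketch below writes the same statement once as an N-Triples line and once as RDF/XML, using only the Python standard library. The `http://example.org/...` URIs and the `capitalOf` term are hypothetical placeholders, not part of any real vocabulary, and the two helper functions are illustrative rather than anything the RDF module or ARC2 provides.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/terms#"  # hypothetical vocabulary

subject = "http://example.org/place/Berlin"   # hypothetical URI
predicate = EX_NS + "capitalOf"
obj = "http://example.org/place/Germany"      # hypothetical URI


def to_ntriples(s, p, o):
    """One N-Triples line: three URIs in angle brackets, ended by ' .'"""
    return f"<{s}> <{p}> <{o}> ."


def to_rdfxml(s, p_ns, p_local, o):
    """The same statement as RDF/XML: the predicate becomes a child
    element of an rdf:Description that is 'about' the subject."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("ex", p_ns)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": s}
    )
    ET.SubElement(desc, f"{{{p_ns}}}{p_local}", {f"{{{RDF_NS}}}resource": o})
    return ET.tostring(root, encoding="unicode")


print(to_ntriples(subject, predicate, obj))
print(to_rdfxml(subject, EX_NS, "capitalOf", obj))
```

N-Triples is the easiest to read and to process line by line, while RDF/XML nests the predicate inside a description of the subject, which suits sites that already publish and parse XML.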
So what does Open Calais really do? It automatically sends your site’s content to the Reuters service, which scans it using machine learning processes to look for keywords and patterns, and then sets up the RDF meta-tag structure. Your site then begins to develop as part of the semantic web. As is written on the Open Calais site [2]:
“The Calais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well. This metadata gives you the ability to build maps (or graphs or networks) linking documents to people to companies to places to products to events to geographies to… whatever. You can use those maps to improve site navigation, provide contextual syndication, tag and organize your content, create structured folksonomies, filter and de-duplicate news feeds, or analyze content to see if it contains what you care about.”
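As a rough illustration of the “maps linking documents to people to companies to places” idea, here is a minimal Python sketch that turns per-document entity metadata into a lookup for related content. The `doc_entities` data is invented and deliberately simplified; it is not the actual Calais response format, and the `related` helper is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical, simplified metadata: document -> (entity type, entity name).
# This only stands in for whatever the service returns for each document.
doc_entities = {
    "post-1": [("City", "Berlin"), ("Country", "Germany")],
    "post-2": [("Person", "Tim Berners-Lee"), ("Organization", "W3C")],
    "post-3": [("City", "Berlin"), ("Organization", "W3C")],
}

# Invert the mapping: entity -> set of documents that mention it.
entity_to_docs = defaultdict(set)
for doc, entities in doc_entities.items():
    for entity in entities:
        entity_to_docs[entity].add(doc)


def related(doc):
    """Documents that share at least one entity with the given document."""
    others = set()
    for entity in doc_entities[doc]:
        others |= entity_to_docs[entity]
    others.discard(doc)  # a document is not related to itself
    return sorted(others)


print(related("post-3"))  # shares Berlin with post-1 and W3C with post-2
```

The same inverted index could drive site navigation (“more posts about Berlin”) or de-duplication of feeds, since two items that mention the same set of entities are likely to cover the same story.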
The Open Calais module installs easily on top of the RDF framework.
Drupal Module Page: http://drupal.org/project/opencalais
WordPress Plugin Page: http://tagaroo.opencalais.com/
[1] http://www.sciam.com/article.cfm?id=the-semantic-web&print=true
[2] http://www.opencalais.com/about