Why “unique” doesn’t tell the full story about URIs…

One of the fundamental parts – perhaps the most fundamental part – of the Semantic Web is the Uniform Resource Identifier, or URI. The humble URL (Uniform Resource Locator) – which users will no doubt be more familiar with – is a subset of the URI. Despite the “Uniform” in the name, a URI’s defining job is to identify a resource uniquely.

Every entity (or object) in a semantic web environment is assigned a URI. This is a key part of making an entity more semantically comprehensible to a computer system – if an entity is uniquely identified by a particular string, it explicitly can’t be something else. Elementary really, but it’s easy to forget how truly thick computers are… To put it another way, the URI is a simple way of getting round ambiguity: with URIs it’s no problem for a computer system to tell whether you’re talking about Birmingham in the West Midlands of the UK, or Birmingham, Alabama in the USA.
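
To make that concrete, here’s a minimal sketch in Python (using rdflib) – the namespace and identifiers are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

# Hypothetical namespace for this sketch
EX = Namespace("http://example.org/places/")

g = Graph()
# Two places share the human-readable label "Birmingham", but a machine
# can never confuse them, because each is identified by its own URI
g.add((EX.birmingham_uk, RDFS.label, Literal("Birmingham")))
g.add((EX.birmingham_al, RDFS.label, Literal("Birmingham")))

print(g.serialize(format="turtle"))
```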

However, this uniqueness only gets at part of the issue. On top of uniquely identifying a particular thing (yes, “thing” is a technical term in Semantic Web jargon) so it cannot be confused with another thing, it is just as important that the same URI is used whenever making reference to that particular entity. For example, it’s no good for your semantic model if you’ve gone to the trouble of defining separate, unique identifiers for Birmingham, UK and Birmingham, USA if it turns out that you’ve been using two different strings to refer to Birmingham, UK: your system will think there are three places called “Birmingham” that you’re interested in. On top of this, any properties or facts that you’ve associated with Birmingham, UK 1 will not be associated with Birmingham, UK 2, despite them being the same place in the real world. This example is a good way to start thinking about the problem; now let’s look at how it might affect us in practice.

A practical way to assign a URI to an entity in a semantic system is through a classification structure (a type of taxonomy). Essentially, this is what a library classification scheme or a biological taxonomy does, with each class and sub-class being assigned some sort of unique identifier.
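
As a rough sketch of the idea (the base URL and path scheme here are assumptions, not a standard):

```python
# Minting URIs from an entity's position in a classification structure;
# the base URL and path convention are hypothetical, for illustration only
BASE = "http://example.org/id/"

def mint_uri(*path):
    """Build a URI from a path of classes, sub-classes and entities."""
    return BASE + "/".join(path)

print(mint_uri("healthcare-providers"))                           # a class
print(mint_uri("healthcare-providers", "hospitals"))              # a sub-class
print(mint_uri("healthcare-providers", "hospitals", "st-marys"))  # an entity
```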

If you design the taxonomy and are the only person responsible for classifying entities within it, the risk of several examples of the same type of entity being assigned different URIs is less of a problem. However, in our increasingly collaborative working world, it is fairly likely that a number of people will be using the taxonomy to classify entities, and the inconsistency between different individuals’ subjective opinions becomes a problem. For example, when classifying a private hospital, is it first and foremost a healthcare provider, or a private company? When classifying a public hospital, is it first and foremost a healthcare provider, or a public service? The problems implicit in a two-dimensional taxonomy structure are immediately apparent: both points of view are valid, but neither fully captures the semantics of the entity being classified; and it stands to reason that wherever hospitals are classified, you’d want them together rather than spread across the classification structure. “But what does that do to the semantics?” I hear you cry! All sounds like a bit of a pickle…

… but – even though it may be a bit counter-intuitive to both taxonomy designer and taxonomy user to begin with – help is at hand!

  1. The first part of the solution is to understand that the taxonomy is primarily a tool to aid in assigning a URI: it is an effort to break a model down into discrete (or atomic) parts or types; it should not be treated as an effort to capture the semantics of a system.
  2. The second part is to inform your users of the first part. It can be a little counter-intuitive for new users to accept that by grouping different types of thing they are not applying the definitive semantics for a system, but making it possible to do so in future. Once users understand this, they should be less worried about putting everything in the correct place and more concerned with grouping like with like.
  3. Even so… you will need a few ground rules: it’s probably a good idea to group objects by function rather than by ownership, medium or other factors such as age; all of these things can be defined as properties later.
  4. Try to design your taxonomy to encourage users to group like with like: this may mean raising particular sub-classes up in the taxonomy to be visible to users sooner. For example – to return to the hospital example from earlier – if you have sections in your taxonomy for “private companies” and “public services”, it may make sense to include a class at the same level as these for “healthcare providers”.
  5. Try not to worry! Even if there are flaws in your initial taxonomy design, when you come to apply an ontology to your structure, properties such as owl:sameAs can act as get-out-of-jail-free cards – see the sketch below.
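
To illustrate that last point, here’s a minimal rdflib sketch – the URIs and facts are hypothetical – of how owl:sameAs patches over a duplicated identifier:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDFS

# Hypothetical URIs: two cataloguers have minted different identifiers
# for the same real-world place, each attaching facts to their own
EX = Namespace("http://example.org/places/")

g = Graph()
g.add((EX.birmingham_uk, RDFS.label, Literal("Birmingham, UK")))
g.add((EX.birmingham_west_mids, RDFS.label, Literal("Birmingham (West Midlands)")))

# The get-out-of-jail-free card: assert that both URIs denote the same
# thing, so an OWL reasoner will treat facts about one as facts about the other
g.add((EX.birmingham_uk, OWL.sameAs, EX.birmingham_west_mids))
```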

Ultimately, the reason for all of the above is that a taxonomy used in a semantic system should not be seen as an attempt to capture all of the semantics that your model needs. A taxonomy should simply be seen as a method of stating that “this is a different type of thing to that” and “this is the same type of thing as this”. Any shortcomings in the semantics captured by the taxonomy can be more than made up for through the use of an ontology, which adds layers of complexity that would not be possible using a two-dimensional structure.

So remember: while uniqueness is indeed crucial when generating identifiers for entities in your semantic system, it’s almost as important to make sure that those identifiers are applied consistently and accurately.

Guardian Data Blog


FactMint's visualization proved to be the most popular on the Guardian's UK News section

The FactMint Researcher was used to create a collection of heat-maps for the Guardian Data Blog’s London: The Data series. The heat-maps, built from open data around the 2008 Mayoral Election, demonstrated the voting tendencies of different geographic and social areas.

The visualization proved very popular with Guardian readers: it was the most read item on the UK News section and sat at the top of the “Zeitgeist” list for most of the 24 hours after it was published.

You can view the page on the Data Blog, here.

Find out more about the Guardian Data Blog on its website.

London Elects

London Elects is the independent team that organises the Mayor of London and London Assembly elections. This covers everything from designing and printing the ballot papers and managing the counting of votes, to delivering a public awareness campaign telling Londoners about the election and how they can vote.

London Elects uses the FactMint Researcher’s data-mining and semantic web technologies to provide its users with a powerful research tool.

Sarah Garrett, Communications Manager at London Elects, said:

“FactMint has allowed us to interrogate a huge amount of information – including years of electoral data across hundreds of wards, as well as social, economic and ethnicity data and other demographic information.”

“This has given us valuable intelligence and insight that has helped inform our campaign and reach as many people as possible.”

For more information, see the London Elects home page.

Using data to make predictions for the 2012 London Mayoral election

FactMint has recently done work for both London Elects (who run the Mayoral Elections) and the Guardian’s Data Blog. Both projects were based around the upcoming Mayoral Election, and we’ve gathered some really interesting data on the subject. So, independently of those two parties, I’ve decided to do a little data analysis myself and see if there are any predictions to be made.

Firstly, from the visualizations we created with the Guardian it became very clear that the 2nd preference vote in the Mayoral Election has almost no effect on the outcome of the poll. You can compare the heat-maps of the 1st and 2nd preference votes on the Guardian’s Data Blog, here. To prove this point, consider that Boris, who won the election, did not get the most 2nd preference votes in any one of London’s 624 wards.

So – focusing on the 1st preference votes, as by far the most influential – let’s look at the tendency of wards to vote a particular way based upon their extent of deprivation (a scale which considers a number of environmental criteria, such as wealth, crime and employment). The following graphic – produced by the FactMint Researcher – plots, for the 2008 Mayoral Election, each ward’s share of votes for each party against that ward’s extent of deprivation.

Share of votes cast by each ward for the Conservative (blue), Labour (red) and Lib Dem (yellow) candidates in the 2008 Mayoral Election, against the ward's extent of deprivation
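
If you fancied reproducing a plot like this yourself from raw ward data, a minimal matplotlib sketch along these lines would do it (the file name and column names are my assumptions):

```python
import csv
import matplotlib.pyplot as plt

# Hypothetical input: one row per ward, with a deprivation score and
# each party's share of 1st preference votes
deprivation, con, lab, lib = [], [], [], []
with open("wards_2008.csv") as f:
    for row in csv.DictReader(f):
        deprivation.append(float(row["deprivation"]))
        con.append(float(row["con_share"]))
        lab.append(float(row["lab_share"]))
        lib.append(float(row["libdem_share"]))

plt.scatter(deprivation, con, c="blue", s=8, label="Conservative")
plt.scatter(deprivation, lab, c="red", s=8, label="Labour")
plt.scatter(deprivation, lib, c="gold", s=8, label="Lib Dem")
plt.xlabel("Extent of deprivation")
plt.ylabel("Share of 1st preference votes")
plt.legend()
plt.show()
```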

There are some clear patterns which come from this visualization:

Firstly, at the low end of the scale (the wards which do not suffer from deprivation) Labour were the least popular party, the Lib Dems occupied slightly more popular territory, and the Conservatives were the clear favourites.

At the other end of the scale there is a very distinct flip between the Labour candidate’s popularity and that of his Conservative counterpart: in very deprived areas the Conservative and Liberal Democrat parties were both unpopular, while Labour were much more popular.

The pattern seems very strong, particularly with the Labour and Conservative parties, so it is reasonable to assume some causality here. Making the leap-of-faith that voting one way or the other will not notably affect how deprived you are (at least not relative to the rest of the population), we can postulate that the more deprived have a disposition towards Labour and the less deprived towards the Conservatives.

We can also see, by scanning along the right of the scatter graph, that the Lib Dems were not the favoured party in any of the wards.

So, if we are to use this data to predict the outcome of this year’s election, we need to understand how the extent of deprivation has changed in London over the past four years. The Guardian posted an article on this very subject (here). The headline figures from that article are that 430 of London’s neighbourhoods have become “significantly more deprived” since 2004, while only 374 have become “significantly less deprived”. If the trend is towards deprivation – as we might guess in the current economic climate – then this year’s votes should lean towards Labour more than they did in 2008. So it looks like it could be a close one…

That said, it’s politics and anything can happen! Looking forward to the post-mortem.

London Elects announce work with FactMint

We are really pleased and excited to announce our first customer: London Elects. On the 3rd of April 2012, London Elects issued the following press release…

London Elects announce work with FactMint

London Elects – the body that runs the Mayor of London and London Assembly elections – has announced it is working with London-based start-up FactMint to help its campaign to engage voters for the 2012 elections.
The FactMint Researcher takes advantage of data-mining and semantic web technologies to provide users with a powerful research tool. The joint research and development project has allowed FactMint to develop their product in a live environment, while helping London Elects to shape their campaign.
London Elects are running a campaign to raise awareness of the elections and provide Londoners with information on how to vote. The campaign will appear in print, radio, outdoor advertising and online, with the www.londonelects.org.uk website as a centre-piece.
Sarah Garrett, Communications Manager at London Elects, said:
“FactMint has allowed us to interrogate a huge amount of information – including years of electoral data across hundreds of wards, as well as social, economic and ethnicity data and other demographic information.”
“This has given us valuable intelligence and insight that has helped inform our campaign and reach as many people as possible.”
Chris Scott, CEO of FactMint, said:
“We are delighted that London Elects has chosen to use FactMint to help manage their domain knowledge around the London Mayoral election.
“The new system allows London Elects to bring together a previously disparate collection of data sources, including spreadsheets and online content, into a single unified network of knowledge. London Elects are now able to ask questions of their data and quickly create reports and visualizations of the answers.
“One challenge we were particularly excited about was generating heat-maps of election-related data. Using the FactMint system, London Elects can create borough or London-wide maps visualising, for example, the turnout in each ward, in seconds rather than hours.”
Technical details
The FactMint Researcher uses a web-based interface designed to make the Semantic Web accessible to any user. Users can build visual queries through a drag-and-drop interface, and pose actual questions to the system, which returns concrete answers.
Behind the FactMint Researcher is an advanced triplestore, the FactMint Engine. The FactMint Engine has taken RDF technology far beyond its historic limitations, seamlessly integrating data sources from across the Linked Data Cloud with local graphs and using advanced Natural Language Generation techniques.

A big question…

In order to explain FactMint’s reason for being (and, in fact, Linked Data in general) I often use the following anecdotal example:

What are the top 5 schools, in the UK, for providing British Prime Ministers?

  • Do you know?
  • How would you find out?
  • How long would that take?

In this post I would like to quickly explore those questions and try to define the problem, as I see it today.

So, to the first point: no, I don’t know the answer. I’d guess, with a pretty high level of confidence, that Eton is number one – Cambo’, at least, counts for one there. After that, not a clue. The only other school I can think of off the top of my head is my own, but, much as I liked Wrenn in Wellingborough, I severely doubt it ever produced a Prime Minister.

Given my state of ignorance, then, how could I find the answer? Well, the obvious answer is research.

First I tried AQA (the text service, Any Question Answered). I sent them the question, exactly as typed above, then waited… 32 minutes later my phone beeped its alert for a reply. That was actually quite a tense half an hour – this was before FactMint was incorporated (or even named) and if AQA could answer the question for a quid it would have seriously hurt my business case! Fortunately for me, the text read as follows:

“Sorry, 63336 can’t find the top 5 schools. The top 2 schools are Eton, which has produced 19 Prime Ministers, and Harrow, which has produced 7.”

So where next? I could ask (directly or via a Web search) as many schools as I could find and compare their answers, but that would doubtless be unreliable and I’d need to get a comprehensive list of schools from somewhere. Not to mention that it would take weeks. A better approach would be to investigate each Prime Minister – they’re doubtless well documented on Wikipedia and across the Web in general. So I began. First stop, a list of Prime Ministers of Britain… good work, Wikipedia community. The page included some “Before Walpole” entries, but I don’t think they count. The page was also split by monarch, so I had 9 charts to merge. Because of the layout of the HTML they didn’t paste into a spreadsheet properly, so 73 <ctrl-c> <ctrl-v>s later I had a list.

Next job: get their schools. 1 for Eton – Cambo’; 1 for Kirkcaldy High School – Gordon; 1 for Fettes College – Blair; 1 for Rutlish Grammar School – John Major. 5 minutes in and a little over 5% of the list done. Being the pragmatic / easily bored type, I decided not to complete this experiment. Roughly 1% a minute means I can hazard a guess at the answer to the third question I posed… about 10 minutes for building the spreadsheet, 100 minutes getting the list of schools and a couple of minutes to total up and sort… just shy of 2 hours.
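
For what it’s worth, the totalling-up step really is the trivial bit – a quick sketch, using a hypothetical extract of that spreadsheet:

```python
from collections import Counter

# A hypothetical extract of the hand-built spreadsheet: PM -> school
pm_schools = {
    "David Cameron": "Eton College",
    "Gordon Brown": "Kirkcaldy High School",
    "Tony Blair": "Fettes College",
    "John Major": "Rutlish Grammar School",
    # ...the other 69 rows would go here
}

# Totting up and sorting takes seconds; the research is the bottleneck
print(Counter(pm_schools.values()).most_common(5))
```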

That – in a nutshell – is FactMint’s reason for being. The technologies which make up the Semantic Web can make that kind of query trivially easy. As we were using Wikipedia for the traditional research, let’s try the same thing with Freebase (roughly, an RDF database built on Wikipedia and loads of other stuff). Freebase isn’t easy to use if you’re not a developer type, but it can get you the data quickly.

So, I get my phone timer ready, point one tab of my browser at the Freebase Query Editor and one at the Freebase page for David Cameron, and here I go…

6 and a half minutes later and I’m there. It still wasn’t the ideal process, and it would be completely inaccessible if you weren’t happy with JSON. Freebase also failed to give me the actual answer – it knew only 31 of the 73 British Prime Ministers’ schools. Still, a pretty good response (probably with the same coverage as Wikipedia would have given me) in just over a 20th of the research time.
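
To give a flavour of the kind of query involved, here’s a hedged sketch in Python – expressed as SPARQL against DBpedia rather than Freebase’s MQL, and with category and property names that are my assumptions about the dataset:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# The same question as a SPARQL query; the category and property
# names below are assumptions and may differ in the live dataset
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbc: <http://dbpedia.org/resource/Category:>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT ?school (COUNT(DISTINCT ?pm) AS ?pms) WHERE {
        ?pm dct:subject dbc:Prime_ministers_of_the_United_Kingdom .
        ?pm dbo:education ?school .
    }
    GROUP BY ?school
    ORDER BY DESC(?pms)
    LIMIT 5
""")
sparql.setReturnFormat(JSON)
for result in sparql.query().convert()["results"]["bindings"]:
    print(result["school"]["value"], result["pms"]["value"])
```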

The next step from Freebase is obvious. The calculations required to give me that result took a minuscule fraction of a second; the 6 and a half minutes were used up by my human mind trying to instruct a computer program on what, exactly, I wanted to know. And from that comes the mission statement of FactMint: “to create beautiful and intuitive ways for people to interact with the Semantic Web”. When these knowledge-bases become easy to query, the answer to the question I originally posed, and many others like it, become commodities; more complex queries become ask-able; and less time is spent searching, copying, pasting and sorting in Excel – surely everyone wants that!

Oh, and in case you were interested: Eton is number one, then Harrow, then Westminster. Charterhouse, Chatham House, Fettes, Haileybury, Rugby, Rutlish and Winchester all come in a fair way behind.

Mission statement

In order to focus our efforts – but leave us room to evolve as a company – we’ve decided to commit ourselves to a mission statement:

To create beautiful and intuitive ways for people to interact with the Semantic Web.

I’d like to take a minute to explain the rationale behind the statement and exactly what we mean by it.

Firstly, and perhaps most importantly, let’s consider what creating “beautiful and intuitive” interfaces entails. Clearly there are many applications which utilize Semantic Web technologies, but running a SPARQL or SeRQL query, viewing RDF/XML or deciphering a complex JSON payload does not count as intuitive. The aim for FactMint will always be to make these technologies accessible and easy to use for people with very little knowledge of the discipline.

Of course, there are beautiful (arguably, at least) applications which use RDF technologies, but let me draw your attention to the word “interact”. A number of publishers now use RDF, from simply pulling metadata into a site to categorizing content using inference. The BBC’s World Cup 2010 site is a great example: a page could be automatically constructed around “Group C” (or pretty much any other concept important to the competition) by inferring that articles about Wayne Rooney should be considered Group C news, as he plays for England, who were in Group C. That was a very good – and clever – use of the technology. The difference between that kind of use-case and helping people “interact” with the Semantic Web is perhaps tenuous, but here it is, from my point of view:

The BBC’s example, along with so many others, uses RDF to construct an HTML Web page. That page will be the same for every user who views it at any given point in time. Basically, they are facilitating a publish-consume model. While I concede that consuming content is a subset of interaction, the goal for FactMint is that its users can explore the data contained within the Semantic Web as they see fit, defining their own queries rather than consuming a query created by a developer working for a publisher. Basically, FactMint will always aim to make it easy for users to fully interact with the Semantic Web, in any way they could want.
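
As a rough sketch of the inference chain behind the BBC example (the vocabulary here is entirely hypothetical – it is not the BBC’s actual model):

```python
from rdflib import Graph, Namespace

# Entirely hypothetical vocabulary, for illustration only
EX = Namespace("http://example.org/worldcup/")

g = Graph()
g.add((EX.rooney_article, EX.mentions, EX.wayne_rooney))
g.add((EX.wayne_rooney, EX.playsFor, EX.england))
g.add((EX.england, EX.inGroup, EX.group_c))

# Infer which articles count as Group C news by following the chain of facts
query = """
    SELECT ?article WHERE {
        ?article ex:mentions ?player .
        ?player ex:playsFor ?team .
        ?team ex:inGroup ex:group_c .
    }
"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.article)
```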

So, that’s what we’re doing.

I’m on the Semantic Web

This article has been republished from my blog, here: http://chrisscott.org/technology/semantic-web/im-on-the-semantic-web

That’s right. About a fortnight ago I decided it was about time to practise what I preach (well, specifically what I was due to preach at last week’s excellent ePublishing Innovation Forum) and get myself onto the Semantic Web. For those new to the concept of the Semantic Web, I’m talking about creating an RDF graph which includes a resource describing me.

So, without further ado, here I am:

http://chrisscott.org/about/card#me

The document at the end of that link is a FOAF Personal Profile Document. As you can see, the URI above includes the fragment “me”. This is a fairly important part of the Linked Data concept, as it satisfies one of the axioms – that the URI is dereferenceable – whilst also identifying a resource, “me”, which can be used to link the graph to others. So, if you are curious, take a look at my personal profile and check out the “me” resource – it’s pretty simplistic, but a good starting point.
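
For the curious, here’s a rough sketch in Python (with rdflib) of the shape of such a card – a minimal subset for illustration, not my actual profile:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
doc = URIRef("http://chrisscott.org/about/card")    # the dereferenceable document
me = URIRef("http://chrisscott.org/about/card#me")  # the person it describes

g.add((doc, RDF.type, FOAF.PersonalProfileDocument))
g.add((doc, FOAF.primaryTopic, me))
g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("Chris Scott")))
g.add((me, FOAF.weblog, URIRef("http://chrisscott.org/")))

print(g.serialize(format="turtle"))
```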

So, how did I go about creating my personal profile on the Semantic Web? Well, I started with a step I urge everyone to take: I signed up to the Opera community. You can do the same here. Once you’ve done that you can go to your profile and click on the “FOAF” link on the right-hand side of the footer:

My profile page in the Opera community.

That’s the quickest and easiest way to get yourself represented on the Semantic Web, but for me Opera don’t give you enough control. For example, I cannot use the foaf:weblog predicate to point to this blog, only to the one which Opera host for me (that said, they do support the rdfs:seeAlso predicate, so my own personal profile is referenced by my Opera one). For that reason, I took the XML generated for my Opera community profile, tweaked it a bit and uploaded it onto this domain.

Give it a go! I’d love to hear how people get on…

NB: I ended up going on a bit in the draft of this post about the FOAF vocab design and got a bit technical, so I’ve separated that content off into this post.