Friday, May 31, 2013

LOD and licenses (citing another blog)

I have stumbled on a blog post on licenses of datasets in the LOD. I think it's worth having a look:
http://www.licensius.com/blog/lodlicenses

Cities in DBPedia

Here is the first post of this blog. Starting with something very simple: a list of cities and their size in DBpedia. Surprises at the end of the post.

How many cities are in DBpedia ?

SELECT count (distinct ?x) WHERE {
?x a <http://dbpedia.org/ontology/City>
}

Asking the endpoint at http://dbpedia.org/sparql results in 15868
Asking the endpoint at http://lod.openlinksw.com/sparql returns 15881
(from now on all queries refer to: http://dbpedia.org/sparql)


The above results are more or less consistent, we can imagine there is some slight misalignment of version between the two endpoint. Just to double check that we didn't get a partial results because of server overload we check the headers, that seem to be fine.


curl --head http://dbpedia.org/sparql?query=SELECT+count+%28distinct+%3Fx%29+WHERE+%7B%0D%0A%3Fx+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FCity%3E%0D%0A%7D
HTTP/1.1 200 OK
Date: Thu, 30 May 2013 23:26:39 GMT
Content-Type: application/sparql-results+xml; charset=UTF-8
Content-Length: 435
Connection: keep-alive
Server: Virtuoso/06.04.3135 (Linux) x86_64-generic-linux-glibc212-64  VDB
Accept-Ranges: bytes


The first surprise (to me) is that I made the same query a few days ago, and I had different results: I had more cities-population association, in a good part explainable by cities having more than one population associated, with extremely big numbers. I remember that the croatian city? of Sisak had about 10^5 people (together with some more plausible value. In fact the bigger value was seemingly a concatenation error). Now all data seems fine.

If I run a query for all cities whose population is bigger than that of the state they are in, I only get two results (the error being in the country size).

SELECT distinct * WHERE {
?x a <http://dbpedia.org/ontology/City>
optional {?x <http://dbpedia.org/ontology/populationTotal> ?p}
optional {?x <http://dbpedia.org/ontology/country> ?y .
?y <http://dbpedia.org/property/populationCensus> ?p2}
filter(?p2<?p)
}

"x","p","y","p2"
"http://dbpedia.org/resource/La_Asunci%C3%B3n",28500,"http://dbpedia.org/resource/Venezuela",27150
"http://dbpedia.org/resource/Barinas,_Barinas",251535,"http://dbpedia.org/resource/Venezuela",27150

The same queries a few days ago was (to my memory) returning more values, including Sisak. In my experience, results were actually varying from time to time, at times being limited to the two cities in Venezuela, at times including Sisak.

Everything seems fine with the data now, but wait until the end of the post...


So, we have populations for a subset of known cities:

SELECT ?x ?p  WHERE {
?x a <http://dbpedia.org/ontology/City> .
?x <http://dbpedia.org/ontology/populationTotal> ?p}

13628

If have a quick look at the distribution, it seems more or less normal (is this expected for city sizes?)  but it is also very spiky

hist(logPop,breaks=1000)



It's easy to explain where the spike-ness is coming from. If we see the distribution of the last 4 digits of city sizes, here is the result:

res3<-pop%%100000

Apart from a decreasing trend due to the presence of cities with less than 100k people, we could expect a flat distribution. But as many population are approximations, we see lot of spikes that corresponds to rounded numbers. Most frequent round-ups are at 50k, then at every major 10k. But many smaller approximations are visible as well in an almost constant line of smaller spikes (again, there is a bias to the left of the chart due to cities with less than 100k pp).


Getting to the end of the post, what about Sisak ?

If we query for it:

SELECT distinct * WHERE {
<http://dbpedia.org/resource/Sisak> <http://dbpedia.org/ontology/populationTotal> ?o }

476992030567891

The result returned is still some order of magnitude bigger than the world population. This is what I remember from a couple of days ago.


What happened ?

SELECT distinct * WHERE {
<http://dbpedia.org/resource/Sisak> <http://dbpedia.org/ontology/populationTotal> ?o .
<http://dbpedia.org/resource/Sisak> a ?x}

ox
476992030567891http://dbpedia.org/ontology/Place
476992030567891http://www.w3.org/2002/07/owl#Thing
476992030567891http://dbpedia.org/class/yago/SpaTownsInCroatia
476992030567891http://dbpedia.org/class/yago/RomanTownsAndCitiesInCroatia
476992030567891http://dbpedia.org/ontology/Town
476992030567891http://dbpedia.org/ontology/PopulatedPlace
476992030567891http://dbpedia.org/ontology/Settlement
476992030567891http://www.opengis.net/gml/_Feature
476992030567891http://umbel.org/umbel/rc/PopulatedPlace
476992030567891http://umbel.org/umbel/rc/Village
476992030567891http://schema.org/Place
476992030567891http://dbpedia.org/class/yago/CitiesAndTownsInCroatia
476992030567891http://umbel.org/umbel/rc/Location_Underspecified
476992030567891http://umbel.org/umbel/rc/Town
476992030567891http://dbpedia.org/class/yago/GeoclassSeatOfAFirst-orderAdministrativeDivision

Sisak is not a city anymore!

But it still identified as a city is in Wikipedia: http://en.wikipedia.org/wiki/Sisak.

Maybe some of this aspect was corrected in DBpedia since I made my first query, but the wrong way.
In any case... even something so simple as cities and populations reveal some surprises!