Tuesday, October 15, 2013

Sometimes you don't need SPARQL

I was just having a quick look at some public datasets, and in the specifics at the GP Practice prescribing data that you can find on the data.gov.uk website.


This is a big (500Mb) CSV file, whose headers are:

 SHA,PCT,PRACTICE,BNF CODE,BNF NAME                              ,ITEMS  ,NIC        ,ACT COST   ,PERIOD                                 
Q30,5D7,A86001,0703010F0,Combined Ethinylestradiol 30mcg         ,0000001,00000001.89,00000001.77,201109                                 
Q30,5D7,A86003,0101010G0,Co-Magaldrox(Magnesium/Aluminium Hydrox),0000027,00000074.94,00000069.92,201109                                 
Q30,5D7,A86003,0101010P0,Co-Simalcite (Simeticone/Hydrotalcite)  ,0000001,00000003.20,00000002.98,201109                                 
Q30,5D7,A86003,0101010R0,Simeticone                              ,0000002,00000007.35,00000006.84,201109 


I wanted to quickly know how many practices where in the UK (or at least in this dataset). Putting this data in a triplestore for a single query was a bit overkill (simple conversion, upload...). At the same time, the file was too large to open in OpenOffice.

The solution ? It's pretty obvious, but just for the people unfamiliar with Unix:

cut -d\,  -f3 T201109PDP\ IEXT.CSV | sort | uniq | nl

takes only a few seconds, and the result is 10192.

Need an explanation ?


cut       : extract only a range of columns from a file

-d\,      : (cut option) columns are separated by a comma
-f3       : (cut option) take only column 3
T201109PDP\ IEXT.CSV :  the file name!
|         : pipe all results through the next command
sort      : well, sort
|         : see above
uniq      : excluded repeated rows
|         : see above
nl        : count everything

It's Unix 101, but very useful ;)





Sunday, October 13, 2013

Stumbled on DPBedia

While preparing some slides for a tutorial... I have stumbled upon the following page in DBPedia live:

Note the last line "Motto: affittasi al miglior offerente" (on hire to whom makes the best offer...)

I think this is coming directly from Wikipedia... and clearly it is a joke!
Nevertheless... this highlight one issue that often goes neglected in open data: who is endorsing the original data ?

That said... old universities are known for students making jokes ("tradizione goliardica"). So this may be an accurate motto after all!

Friday, May 31, 2013

LOD and licenses (citing another blog)

I have stumbled on a blog post on licenses of datasets in the LOD. I think it's worth having a look:
http://www.licensius.com/blog/lodlicenses

Cities in DBPedia

Here is the first post of this blog. Starting with something very simple: a list of cities and their size in DBpedia. Surprises at the end of the post.

How many cities are in DBpedia ?

SELECT count (distinct ?x) WHERE {
?x a <http://dbpedia.org/ontology/City>
}

Asking the endpoint at http://dbpedia.org/sparql results in 15868
Asking the endpoint at http://lod.openlinksw.com/sparql returns 15881
(from now on all queries refer to: http://dbpedia.org/sparql)


The above results are more or less consistent, we can imagine there is some slight misalignment of version between the two endpoint. Just to double check that we didn't get a partial results because of server overload we check the headers, that seem to be fine.


curl --head http://dbpedia.org/sparql?query=SELECT+count+%28distinct+%3Fx%29+WHERE+%7B%0D%0A%3Fx+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FCity%3E%0D%0A%7D
HTTP/1.1 200 OK
Date: Thu, 30 May 2013 23:26:39 GMT
Content-Type: application/sparql-results+xml; charset=UTF-8
Content-Length: 435
Connection: keep-alive
Server: Virtuoso/06.04.3135 (Linux) x86_64-generic-linux-glibc212-64  VDB
Accept-Ranges: bytes


The first surprise (to me) is that I made the same query a few days ago, and I had different results: I had more cities-population association, in a good part explainable by cities having more than one population associated, with extremely big numbers. I remember that the croatian city? of Sisak had about 10^5 people (together with some more plausible value. In fact the bigger value was seemingly a concatenation error). Now all data seems fine.

If I run a query for all cities whose population is bigger than that of the state they are in, I only get two results (the error being in the country size).

SELECT distinct * WHERE {
?x a <http://dbpedia.org/ontology/City>
optional {?x <http://dbpedia.org/ontology/populationTotal> ?p}
optional {?x <http://dbpedia.org/ontology/country> ?y .
?y <http://dbpedia.org/property/populationCensus> ?p2}
filter(?p2<?p)
}

"x","p","y","p2"
"http://dbpedia.org/resource/La_Asunci%C3%B3n",28500,"http://dbpedia.org/resource/Venezuela",27150
"http://dbpedia.org/resource/Barinas,_Barinas",251535,"http://dbpedia.org/resource/Venezuela",27150

The same queries a few days ago was (to my memory) returning more values, including Sisak. In my experience, results were actually varying from time to time, at times being limited to the two cities in Venezuela, at times including Sisak.

Everything seems fine with the data now, but wait until the end of the post...


So, we have populations for a subset of known cities:

SELECT ?x ?p  WHERE {
?x a <http://dbpedia.org/ontology/City> .
?x <http://dbpedia.org/ontology/populationTotal> ?p}

13628

If have a quick look at the distribution, it seems more or less normal (is this expected for city sizes?)  but it is also very spiky

hist(logPop,breaks=1000)



It's easy to explain where the spike-ness is coming from. If we see the distribution of the last 4 digits of city sizes, here is the result:

res3<-pop%%100000

Apart from a decreasing trend due to the presence of cities with less than 100k people, we could expect a flat distribution. But as many population are approximations, we see lot of spikes that corresponds to rounded numbers. Most frequent round-ups are at 50k, then at every major 10k. But many smaller approximations are visible as well in an almost constant line of smaller spikes (again, there is a bias to the left of the chart due to cities with less than 100k pp).


Getting to the end of the post, what about Sisak ?

If we query for it:

SELECT distinct * WHERE {
<http://dbpedia.org/resource/Sisak> <http://dbpedia.org/ontology/populationTotal> ?o }

476992030567891

The result returned is still some order of magnitude bigger than the world population. This is what I remember from a couple of days ago.


What happened ?

SELECT distinct * WHERE {
<http://dbpedia.org/resource/Sisak> <http://dbpedia.org/ontology/populationTotal> ?o .
<http://dbpedia.org/resource/Sisak> a ?x}

ox
476992030567891http://dbpedia.org/ontology/Place
476992030567891http://www.w3.org/2002/07/owl#Thing
476992030567891http://dbpedia.org/class/yago/SpaTownsInCroatia
476992030567891http://dbpedia.org/class/yago/RomanTownsAndCitiesInCroatia
476992030567891http://dbpedia.org/ontology/Town
476992030567891http://dbpedia.org/ontology/PopulatedPlace
476992030567891http://dbpedia.org/ontology/Settlement
476992030567891http://www.opengis.net/gml/_Feature
476992030567891http://umbel.org/umbel/rc/PopulatedPlace
476992030567891http://umbel.org/umbel/rc/Village
476992030567891http://schema.org/Place
476992030567891http://dbpedia.org/class/yago/CitiesAndTownsInCroatia
476992030567891http://umbel.org/umbel/rc/Location_Underspecified
476992030567891http://umbel.org/umbel/rc/Town
476992030567891http://dbpedia.org/class/yago/GeoclassSeatOfAFirst-orderAdministrativeDivision

Sisak is not a city anymore!

But it still identified as a city is in Wikipedia: http://en.wikipedia.org/wiki/Sisak.

Maybe some of this aspect was corrected in DBpedia since I made my first query, but the wrong way.
In any case... even something so simple as cities and populations reveal some surprises!