Wednesday, December 3, 2014

Research output by country

I took the Nature country index (the number of papers a country has in Nature, weighted by its fractional share of co-authorships) and divided it by the country's population (source: DBpedia). Here are the results.
Not surprisingly, Vatican City tops the list (quantization issues: the population is tiny!).


Vatican City State (Holy See) 226.4600715
Switzerland 150.4753293
Singapore 95.84437903
United States of America (USA) 60.04527226
Israel 59.96219648
Denmark 55.21814083
Sweden 54.02169996
United Kingdom (UK) 52.13362303
Germany 51.08422995
Iceland 47.04878569
Canada 44.29182447
Netherlands 44.07014634
Australia 41.65815099
Austria 37.47406837
Finland 35.40119373
France 34.19050213
Belgium 29.2549024
Norway 27.59456603
Ireland 27.38415481
Japan 27.1078848
Spain 24.96486608
South Korea 23.38491695
New Zealand 22.59980241
Taiwan 22.50874958
Slovenia 20.22944462
Italy 17.38870468
Estonia 14.67527069
Luxembourg 14.51700928
Cyprus 14.00932401
Portugal 12.12659146
Czech Republic 11.29390967
Greece 9.241680918
Hungary 7.883389007
Poland 6.062448047
Bermuda 5.759920295
Croatia 5.419961486
Chile 5.191072433
Greenland 4.6185274
Lithuania 4.464849982
China 4.253672762
Malta 3.413010299
Barbados 3.228070175
Armenia 3.226127982
Monaco 2.97699594
Russia 2.615483509
Qatar 2.350552673
Argentina 2.323673819
Serbia 2.317134242
Saudi Arabia 2.249241356
Liechtenstein 2.154475924
Uruguay 2.091540746
Panama 1.868951491
Slovakia 1.853784074
Tonga 1.646457211
Latvia 1.536298825
South Africa 1.534017259
Seychelles 1.334089317
Brazil 1.164166491
Lebanon 1.079434698
Turkey 0.975376072
United Arab Emirates 0.970781283
Bulgaria 0.923309168
Kuwait 0.908692887
Romania 0.904092848
Ukraine 0.881236745
Iran 0.85279365
Georgia 0.82618862
Moldova 0.789858331
Bhutan 0.742193713
Belarus 0.740889278
India 0.737977439
Mexico 0.671103922
Cape Verde 0.636491811
Costa Rica 0.604688821
Malaysia 0.55035918
Mongolia 0.499098775
Macedonia 0.411469046
Thailand 0.387230041
Montenegro 0.387078669
Brunei 0.381318447
Fiji 0.33753192
Oman 0.303561833
Namibia 0.293410983
Trinidad and Tobago 0.286140484
Cuba 0.274753115
Tunisia 0.25039257
Mauritius 0.237867188
Colombia 0.214252601
Ecuador 0.213486826
Peru 0.185304328
Egypt 0.169769401
Gabon 0.169491525
Botswana 0.167909195
Papua New Guinea 0.167601541
Gambia 0.164679009
Jordan 0.151616593
Algeria 0.121963824
Palestine 0.116474096
Kazakhstan 0.112367968
Kenya 0.110526316
Vietnam 0.102330984
Jamaica 0.095658889
Azerbaijan 0.09526302
Morocco 0.086393734
Pakistan 0.084978915
Cambodia 0.082981452
Venezuela 0.073825642
Bolivia 0.072801587
Bosnia and Herzegovina 0.065934843
Kyrgyzstan 0.057127326
Senegal 0.052558663
Sri Lanka 0.050301818
Nepal 0.04449098
Cameroon 0.041693647
Libya 0.041580042
Uganda 0.035860995
Tanzania 0.035848502
Madagascar 0.034795082
Burkina Faso 0.034636441
Iraq 0.033329119
Uzbekistan 0.029843111
Niger 0.029173729
Benin 0.029034644
Philippines 0.0241483
Bahrain 0.022787695
Indonesia 0.020105899
Ghana 0.015900548
Syria 0.013947574
Dominican Republic 0.012526176
Bangladesh 0.012324738
Mali 0.012049721
Ethiopia 0.011483407
Malawi 0.010123226
Nigeria 0.00823451
Congo 0.007305294
Ivory Coast 0.00545737
Mozambique 0.004392634
Tajikistan 0.00367602
Guatemala 0.003163221
Sudan 0.002413554

Tuesday, October 15, 2013

Sometimes you don't need SPARQL

I was just having a quick look at some public datasets, specifically at the GP Practice prescribing data that you can find on the data.gov.uk website.


This is a big (500 MB) CSV file; its header and first rows are:

 SHA,PCT,PRACTICE,BNF CODE,BNF NAME                              ,ITEMS  ,NIC        ,ACT COST   ,PERIOD                                 
Q30,5D7,A86001,0703010F0,Combined Ethinylestradiol 30mcg         ,0000001,00000001.89,00000001.77,201109                                 
Q30,5D7,A86003,0101010G0,Co-Magaldrox(Magnesium/Aluminium Hydrox),0000027,00000074.94,00000069.92,201109                                 
Q30,5D7,A86003,0101010P0,Co-Simalcite (Simeticone/Hydrotalcite)  ,0000001,00000003.20,00000002.98,201109                                 
Q30,5D7,A86003,0101010R0,Simeticone                              ,0000002,00000007.35,00000006.84,201109 


I wanted to quickly know how many practices were in the UK (or at least in this dataset). Putting this data in a triplestore for a single query was a bit of an overkill (simple conversion, upload...). At the same time, the file was too large to open in OpenOffice.

The solution? It's pretty obvious, but here it is for people unfamiliar with Unix:

cut -d\,  -f3 T201109PDP\ IEXT.CSV | sort | uniq | nl

takes only a few seconds, and the result is 10192.

Need an explanation ?


cut       : extract only a range of columns from a file

-d\,      : (cut option) columns are separated by a comma
-f3       : (cut option) take only column 3
T201109PDP\ IEXT.CSV :  the file name!
|         : pipe all results through the next command
sort      : well, sort
|         : see above
uniq      : exclude repeated rows
|         : see above
nl        : number the remaining lines, so the last number is the count
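
If you prefer a script to a pipeline, the same idea can be sketched in Python. This is not what I actually ran, just an equivalent under a few assumptions: the function name and its parameters are made up for illustration, and the file is read as a stream so that its size does not matter.

```python
import csv
import io


def count_distinct(lines, column=2, skip_header=True):
    """Count distinct values in one column of a CSV stream.

    `column` is zero-based, so 2 corresponds to `cut -f3`.
    """
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)  # drop the header row, if any
    seen = set()
    for row in reader:
        if len(row) > column:
            seen.add(row[column].strip())
    return len(seen)


# A tiny in-memory sample shaped like the first columns of the file.
sample = io.StringIO(
    " SHA,PCT,PRACTICE,BNF CODE\n"
    "Q30,5D7,A86001,0703010F0\n"
    "Q30,5D7,A86003,0101010G0\n"
    "Q30,5D7,A86003,0101010P0\n"
)
print(count_distinct(sample))  # two distinct practices: A86001 and A86003
```

For the real file you would pass an open file object instead of the sample, e.g. `open("T201109PDP IEXT.CSV", newline="")`. Two differences from the shell version are worth keeping in mind: the sketch skips the header row and strips surrounding spaces, while the pipeline does neither, so if the file starts with a header line the two counts may differ by one.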

It's Unix 101, but very useful ;)





Sunday, October 13, 2013

Stumbled on DBpedia

While preparing some slides for a tutorial, I stumbled upon the following page in DBpedia Live:

Note the last line "Motto: affittasi al miglior offerente" (for rent to the highest bidder...)

I think this is coming directly from Wikipedia... and clearly it is a joke!
Nevertheless... this highlights one issue that is often neglected in open data: who is endorsing the original data?

That said... old universities are known for students making jokes ("tradizione goliardica"). So this may be an accurate motto after all!

Friday, May 31, 2013

LOD and licenses (citing another blog)

I have stumbled on a blog post on licenses of datasets in the LOD. I think it's worth having a look:
http://www.licensius.com/blog/lodlicenses

Cities in DBpedia

Here is the first post of this blog. Starting with something very simple: a list of cities and their size in DBpedia. Surprises at the end of the post.

How many cities are in DBpedia ?

SELECT count (distinct ?x) WHERE {
?x a <http://dbpedia.org/ontology/City>
}

Asking the endpoint at http://dbpedia.org/sparql results in 15868
Asking the endpoint at http://lod.openlinksw.com/sparql returns 15881
(from now on all queries refer to: http://dbpedia.org/sparql)


The above results are more or less consistent; we can imagine there is a slight version misalignment between the two endpoints. Just to double-check that we didn't get partial results because of server overload, we check the response headers, which seem to be fine.


curl --head 'http://dbpedia.org/sparql?query=SELECT+count+%28distinct+%3Fx%29+WHERE+%7B%0D%0A%3Fx+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FCity%3E%0D%0A%7D'
HTTP/1.1 200 OK
Date: Thu, 30 May 2013 23:26:39 GMT
Content-Type: application/sparql-results+xml; charset=UTF-8
Content-Length: 435
Connection: keep-alive
Server: Virtuoso/06.04.3135 (Linux) x86_64-generic-linux-glibc212-64  VDB
Accept-Ranges: bytes
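
The long URL in the curl call is just the count query above, percent-encoded. As a minimal sketch (the helper name `sparql_url` is made up, and nothing here contacts the endpoint), it could be built with the Python standard library like this:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """SELECT count (distinct ?x) WHERE {
?x a <http://dbpedia.org/ontology/City>
}"""


def sparql_url(endpoint, query):
    """Return a GET URL with the query percent-encoded in `query=`."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_url(ENDPOINT, QUERY)
print(url)

# Round-trip check: decoding the URL gives back the original query text.
assert parse_qs(urlsplit(url).query)["query"] == [QUERY]
```

Passing the resulting URL to `urllib.request.urlopen` would give a response object whose status and headers can be inspected, much like `curl --head` does. I left that part out so that the sketch runs without network access.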


The first surprise (to me) is that I ran the same queries a few days ago and got different results: there were more city-population associations, largely because some cities had more than one population value, some of them extremely large. I remember that the Croatian city (?) of Sisak had a population of about 10^14 (together with a more plausible value; the larger value was seemingly a concatenation error). Now all the data seems fine.

If I run a query for all cities whose population is bigger than that of the country they are in, I get only two results (the error being in the country's population).

SELECT distinct * WHERE {
?x a <http://dbpedia.org/ontology/City>
optional {?x <http://dbpedia.org/ontology/populationTotal> ?p}
optional {?x <http://dbpedia.org/ontology/country> ?y .
?y <http://dbpedia.org/property/populationCensus> ?p2}
filter(?p2<?p)
}

"x","p","y","p2"
"http://dbpedia.org/resource/La_Asunci%C3%B3n",28500,"http://dbpedia.org/resource/Venezuela",27150
"http://dbpedia.org/resource/Barinas,_Barinas",251535,"http://dbpedia.org/resource/Venezuela",27150

The same query, a few days ago, was (as far as I remember) returning more results, including Sisak. In my experience the results actually varied from time to time, sometimes limited to the two cities in Venezuela and sometimes including Sisak.

Everything seems fine with the data now, but wait until the end of the post...


So, we have populations for a subset of known cities:

SELECT ?x ?p  WHERE {
?x a <http://dbpedia.org/ontology/City> .
?x <http://dbpedia.org/ontology/populationTotal> ?p}

13628

If we have a quick look at the distribution of the logarithm of the population, it seems more or less normal (is this expected for city sizes?), but it is also very spiky:

hist(logPop,breaks=1000)



It's easy to explain where the spikiness comes from. If we look at the distribution of the last 5 digits of the city sizes (the population modulo 100,000), here is the result:

res3<-pop%%100000

Apart from a decreasing trend due to the presence of cities with fewer than 100k people, we would expect a flat distribution. But as many population figures are approximations, we see a lot of spikes that correspond to rounded numbers. The most frequent rounded values are at 50k, then at every multiple of 10k. Many smaller approximations are visible as well, as an almost constant line of smaller spikes (again, there is a bias towards the left of the chart due to cities with fewer than 100k people).
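
The same check on the trailing digits can be sketched in Python. The two helper functions and the toy populations are made up for illustration; the real analysis was done in R on the query results.

```python
from collections import Counter


def trailing_digit_counts(populations, modulus=100_000):
    """Count how often each remainder `p % modulus` occurs."""
    return Counter(p % modulus for p in populations)


def round_share(populations, step=1_000):
    """Fraction of values that are exact multiples of `step`."""
    populations = list(populations)
    if not populations:
        return 0.0
    return sum(p % step == 0 for p in populations) / len(populations)


# Made-up populations: a mix of rounded and exact figures.
toy = [250_000, 150_000, 30_000, 251_535, 28_500, 47_123]
counts = trailing_digit_counts(toy)
print(counts.most_common(1))  # the remainder 50,000 appears twice
print(round_share(toy))       # half of the toy values are multiples of 1,000
```

If the populations were all exact counts, the share of multiples of 1,000 should be close to one in a thousand. A much larger share is a quick numerical hint that many figures are rounded, without having to look at the histogram.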


Getting to the end of the post, what about Sisak ?

If we query for it:

SELECT distinct * WHERE {
<http://dbpedia.org/resource/Sisak> <http://dbpedia.org/ontology/populationTotal> ?o }

476992030567891

The result returned is still several orders of magnitude bigger than the world population. This is the value I remember from a couple of days ago.
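
A crude plausibility filter would have caught this value straight away. Here is a minimal sketch; the ceiling is my assumption (a deliberately generous bound, a bit above the roughly 7 billion people alive in 2013), and the function name is made up.

```python
import math

# Assumed ceiling: generous upper bound on any plausible population.
WORLD_POPULATION_CEILING = 8_000_000_000


def implausible(populations, ceiling=WORLD_POPULATION_CEILING):
    """Return the entries whose population exceeds the ceiling."""
    return {name: value for name, value in populations.items() if value > ceiling}


# Values taken from the query results shown in this post.
values = {
    "Sisak": 476992030567891,
    "La_Asunción": 28500,
    "Barinas,_Barinas": 251535,
}
flagged = implausible(values)
for name, value in flagged.items():
    excess = math.log10(value / WORLD_POPULATION_CEILING)
    print(f"{name}: {value} (about {excess:.1f} orders of magnitude above the ceiling)")
```

A fixed ceiling is of course a blunt instrument: it only catches absurd values, not a city that is merely larger than its own country, which is what the earlier SPARQL query checks.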


What happened ?

SELECT distinct * WHERE {
<http://dbpedia.org/resource/Sisak> <http://dbpedia.org/ontology/populationTotal> ?o .
<http://dbpedia.org/resource/Sisak> a ?x}

"o","x"
476992030567891,"http://dbpedia.org/ontology/Place"
476992030567891,"http://www.w3.org/2002/07/owl#Thing"
476992030567891,"http://dbpedia.org/class/yago/SpaTownsInCroatia"
476992030567891,"http://dbpedia.org/class/yago/RomanTownsAndCitiesInCroatia"
476992030567891,"http://dbpedia.org/ontology/Town"
476992030567891,"http://dbpedia.org/ontology/PopulatedPlace"
476992030567891,"http://dbpedia.org/ontology/Settlement"
476992030567891,"http://www.opengis.net/gml/_Feature"
476992030567891,"http://umbel.org/umbel/rc/PopulatedPlace"
476992030567891,"http://umbel.org/umbel/rc/Village"
476992030567891,"http://schema.org/Place"
476992030567891,"http://dbpedia.org/class/yago/CitiesAndTownsInCroatia"
476992030567891,"http://umbel.org/umbel/rc/Location_Underspecified"
476992030567891,"http://umbel.org/umbel/rc/Town"
476992030567891,"http://dbpedia.org/class/yago/GeoclassSeatOfAFirst-orderAdministrativeDivision"

Sisak is not a city anymore!

But it is still identified as a city in Wikipedia: http://en.wikipedia.org/wiki/Sisak.

Maybe something was corrected in DBpedia since I made my first query, but in the wrong way: the bad population value is still there, while Sisak has dropped out of the City class.
In any case... even something as simple as cities and populations reveals some surprises!