Tuesday, October 15, 2013

Sometimes you don't need SPARQL

I was just having a quick look at some public datasets, specifically at the GP Practice prescribing data that you can find on the data.gov.uk website.


This is a big (500 MB) CSV file; its header and a few sample rows look like this:

 SHA,PCT,PRACTICE,BNF CODE,BNF NAME                              ,ITEMS  ,NIC        ,ACT COST   ,PERIOD                                 
Q30,5D7,A86001,0703010F0,Combined Ethinylestradiol 30mcg         ,0000001,00000001.89,00000001.77,201109                                 
Q30,5D7,A86003,0101010G0,Co-Magaldrox(Magnesium/Aluminium Hydrox),0000027,00000074.94,00000069.92,201109                                 
Q30,5D7,A86003,0101010P0,Co-Simalcite (Simeticone/Hydrotalcite)  ,0000001,00000003.20,00000002.98,201109                                 
Q30,5D7,A86003,0101010R0,Simeticone                              ,0000002,00000007.35,00000006.84,201109 


I wanted to quickly know how many practices there were in the UK (or at least in this dataset). Putting this data into a triplestore for a single query would have been overkill (conversion, upload...). At the same time, the file was too large to open in OpenOffice.

The solution? It's pretty obvious, but here it is for people unfamiliar with Unix:

cut -d\,  -f3 T201109PDP\ IEXT.CSV | sort | uniq | nl

takes only a few seconds, and the result is 10192.
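One caveat: the first line of the file is the header, so "PRACTICE" gets counted as if it were a practice code. A minimal variant (assuming the header is exactly one line) that skips it and lets wc print the total directly:

tail -n +2 T201109PDP\ IEXT.CSV | cut -d\, -f3 | sort -u | wc -l

Here sort -u folds the sort | uniq pair into a single step, and wc -l outputs just the number of lines instead of numbering each one.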

Need an explanation?


cut       : extract selected fields (columns) from each line of a file

-d\,      : (cut option) fields are separated by a comma (the backslash shields the comma from the shell; it isn't strictly needed here)
-f3       : (cut option) take only field 3
T201109PDP\ IEXT.CSV :  the file name! (the backslash escapes the space)
|         : pipe the output into the next command
sort      : well, sort
|         : see above
uniq      : remove adjacent repeated rows (which is why we sort first)
|         : see above
nl        : number every line; the last line number gives the total count
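The same toolbox answers follow-up questions too. For example, here is a quick sketch (same file, same idea) of which practices have the most prescription rows:

cut -d\, -f3 T201109PDP\ IEXT.CSV | sort | uniq -c | sort -rn | head

uniq -c prefixes every distinct value with its number of occurrences, the second sort ranks them in descending order, and head keeps the top ten.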

It's Unix 101, but very useful ;)
