Tuesday, October 15, 2013

Sometimes you don't need SPARQL

I was just having a quick look at some public datasets, specifically at the GP Practice prescribing data that you can find on the data.gov.uk website.


This is a big (500 MB) CSV file; its header and a few sample rows look like this:

 SHA,PCT,PRACTICE,BNF CODE,BNF NAME                              ,ITEMS  ,NIC        ,ACT COST   ,PERIOD                                 
Q30,5D7,A86001,0703010F0,Combined Ethinylestradiol 30mcg         ,0000001,00000001.89,00000001.77,201109                                 
Q30,5D7,A86003,0101010G0,Co-Magaldrox(Magnesium/Aluminium Hydrox),0000027,00000074.94,00000069.92,201109                                 
Q30,5D7,A86003,0101010P0,Co-Simalcite (Simeticone/Hydrotalcite)  ,0000001,00000003.20,00000002.98,201109                                 
Q30,5D7,A86003,0101010R0,Simeticone                              ,0000002,00000007.35,00000006.84,201109 


I wanted to quickly know how many practices there were in the UK (or at least in this dataset). Putting this data into a triplestore for a single query would have been overkill (conversion, upload...). At the same time, the file was too large to open in OpenOffice.

The solution? It's pretty obvious, but here it is for people unfamiliar with Unix:

cut -d\,  -f3 T201109PDP\ IEXT.CSV | sort | uniq | nl

takes only a few seconds, and the result is 10192.
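One caveat: the first line of the file is the header, so "PRACTICE" gets counted as if it were a practice code. A minimal variant (assuming the header is exactly one line) that skips it and lets wc print the total directly:

tail -n +2 T201109PDP\ IEXT.CSV | cut -d\, -f3 | sort -u | wc -l

Here sort -u folds the sort | uniq pair into a single step, and wc -l outputs just the number of lines instead of numbering each one.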

Need an explanation?


cut       : extract selected fields (columns) from each line of a file

-d\,      : (cut option) fields are separated by a comma (the backslash shields the comma from the shell; it isn't strictly needed here)
-f3       : (cut option) take only field 3
T201109PDP\ IEXT.CSV :  the file name! (the backslash escapes the space)
|         : pipe the output into the next command
sort      : well, sort
|         : see above
uniq      : remove adjacent repeated rows (which is why we sort first)
|         : see above
nl        : number every line; the last line number gives the total count
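The same toolbox answers follow-up questions too. For example, here is a quick sketch (same file, same idea) of which practices have the most prescription rows:

cut -d\, -f3 T201109PDP\ IEXT.CSV | sort | uniq -c | sort -rn | head

uniq -c prefixes every distinct value with its number of occurrences, the second sort ranks them in descending order, and head keeps the top ten.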

It's Unix 101, but very useful ;)
