I was just having a quick look at some public datasets, specifically the GP practice prescribing data that you can find on the data.gov.uk website.
This is a big (500 MB) CSV file; its header and first few rows are:
SHA,PCT,PRACTICE,BNF CODE,BNF NAME ,ITEMS ,NIC ,ACT COST ,PERIOD
Q30,5D7,A86001,0703010F0,Combined Ethinylestradiol 30mcg ,0000001,00000001.89,00000001.77,201109
Q30,5D7,A86003,0101010G0,Co-Magaldrox(Magnesium/Aluminium Hydrox),0000027,00000074.94,00000069.92,201109
Q30,5D7,A86003,0101010P0,Co-Simalcite (Simeticone/Hydrotalcite) ,0000001,00000003.20,00000002.98,201109
Q30,5D7,A86003,0101010R0,Simeticone ,0000002,00000007.35,00000006.84,201109
I wanted to quickly find out how many practices there were in the UK (or at least in this dataset). Putting this data in a triplestore for a single query was a bit overkill (simple conversion, upload...). At the same time, the file was too large to open in OpenOffice.
The solution? It's pretty obvious, but just for people unfamiliar with Unix:
cut -d\, -f3 T201109PDP\ IEXT.CSV | sort | uniq | nl
This takes only a few seconds, and the result is 10192.
Need an explanation?
cut : extract only a range of columns from a file
-d\, : (cut option) columns are separated by a comma
-f3 : (cut option) take only column 3
T201109PDP\ IEXT.CSV : the file name!
| : pipe all results through the next command
sort : well, sort
| : see above
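uniq : drop repeated lines (it only spots adjacent duplicates, hence the sort before it)
| : see above
nl : number every line, so the number on the last line is the count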
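If you only want the number itself, here is a small variation, offered as a sketch rather than the one true way. It assumes the standard tail, sort -u and wc tools, and it runs on a tiny stand-in file built from the header and sample rows shown earlier, not on the real download. Since the header row also comes out of cut as a line (PRACTICE), it is presumably counted by the pipeline above, so this version skips it.

```shell
# Tiny stand-in for the real file: the header plus the four sample rows.
sample=$(mktemp)
cat > "$sample" <<'EOF'
SHA,PCT,PRACTICE,BNF CODE,BNF NAME ,ITEMS ,NIC ,ACT COST ,PERIOD
Q30,5D7,A86001,0703010F0,Combined Ethinylestradiol 30mcg ,0000001,00000001.89,00000001.77,201109
Q30,5D7,A86003,0101010G0,Co-Magaldrox(Magnesium/Aluminium Hydrox),0000027,00000074.94,00000069.92,201109
Q30,5D7,A86003,0101010P0,Co-Simalcite (Simeticone/Hydrotalcite) ,0000001,00000003.20,00000002.98,201109
Q30,5D7,A86003,0101010R0,Simeticone ,0000002,00000007.35,00000006.84,201109
EOF

# cut      : keep column 3 (the practice code)
# tail -n +2 : start at line 2, i.e. skip the header row
# sort -u  : sort and drop duplicates in one step
# wc -l    : print only the number of lines left
# tr -d ' ': strip the padding some wc implementations add
count=$(cut -d, -f3 "$sample" | tail -n +2 | sort -u | wc -l | tr -d ' ')
echo "$count"

rm -f "$sample"
```

On the real file you would swap "$sample" for T201109PDP\ IEXT.CSV; I have not re-run it there, so I am not quoting a number for it.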
It's Unix 101, but very useful ;)