This section develops a fairly advanced example of performance a simple data analytics task, by purely using Linux command line tools. It shows the power of these command line tools and of the ability to "pipe" the output of one tool into the other.
Task: Calculate to cumulative population of the 50 largest cities/towns in Scotland, based on data in the file like this:
Rank Locality Population Status Council area 1 Glasgow 599,650 City Glasgow City 2 Edinburgh 464,990 City City of Edinburgh 3 Aberdeen 196,670 City Aberdeen City ... |
In the first step, we separate the data in the input file by column, using the GNU gawk command. The command-line below, first feeds the contents of ScotCities.txt into the sed. The pipe combinator (|) takes the input of the left-hand component (here gawk) and feeds it into the right-hand component (here sed). The sed acts on the first line (1) and deletes that line (d). See the man page on sed for an explanation of the command string (the string after the -e option). The result of the sed command is the fed into the gawk. This command decomposes the input into columns, and numbers each column so that it can be refered to in the command string (after the -e option). This command is executed for each line of the input. Here, we want to print a label, with a colon, and then the value of the matching column in the input. For example, "Population" is the third column in the file, and we print "pop: " as a label, followed by the value in the third column, which is encoded as $3. We do this with all columns, and put "{}" as deliminators around the output. The final command on this line is less which just views the input, and allows paging in the text. Press 'q' to exit less.
![]() | We are aiming to produce a file in JSON format so that we can more easily process it further. Hence, we separate the entries (key-value pairs separated by a ':'), by commas (,), and put {} around each group of key-value pairs. |
Example 12. Separate input data into columns
# very basic: just drop header line and print column by colunm; NB: ; to separate commands in gawk cat ScotCities.txt | \ sed -e '1d' | \ gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \ less |
As you see, the output is spread over severall lines, where you would expect just one line for each line of input. To remedy this, we use a bit of sed magic to eliminate newlines in the right places.
Example 13. Put all data on one line
# this adds sed-magic to eliminate newlines; cat ScotCities.txt | \ sed -e '1d' | \ gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \ sed ':a;N;$!ba;s/\n/ /g' | \ less |
Running this command gives this output:
{ rank: 1 , name: Glasgow ; pop: 599,650 , status: City council: GlasgowCity } { rank: 2 , name: Edinburgh ; pop: 464,990 , status: City council: Cityof } { rank: 3 , name: Aberdeen ; pop: 196,670 , status: City council: AberdeenCity } |
Now we want to add commas as spearators between the key-value pairs in one entry. For that we again use sed
Example 14. Separate per-city entries by commas
# this adds separators between fields, searching for '} {' patterns; cat ScotCities.txt | \ sed -e '1d' | \ gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \ sed ':a;N;$!ba;s/\n/ /g' | \ sed -e 's/} {/} , {/g' | \ less |
Running this command-line prints the data for all cities, separated by commas:
{ rank: 1 , name: Glasgow ; pop: 599,650 , status: City council: GlasgowCity } , { rank: 2 , name: Edinburgh ; pop: 464,990 , status: City council: Cityof } , { rank: 3 , name: Aberdeen ; pop: 196,670 , status: City council: AberdeenCity } , ... |
Note that in some cases either the key or the value component contains spaces or commas. This makes it difficult to separate the contents from the meta-data, i.e. data that separates fields. To solve this issue, we put "" around entries where necessary. This again uses sed code which searches for numbers followed by a comma (\([0-9]*,[0-9]*\)) or alphabetical characters separated by a space ([a-zA-Z]*\)[ ]*\([;}]) and puts "" around the first match that was found, which is referred to in sed by \1. The output of the entire line is redirected into a file q2.json for further processing. This is an example of a more involved usage of sed pattern matching and replacement.
Example 15. Wrap entries into ""
# wrap all values into "" using sed code cat ScotCities.txt | \ sed -e '1d' | \ gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \ sed ':a;N;$!ba;s/\n/ /g' | sed -e 's/} {/} , {/g' | \ sed -e 's/\([a-z]*\):/"\1":/g' | sed -e 's/\([0-9]*,[0-9]*\)[ ]*,/"\1" ,/g' | \ sed -e 's/:[ ]*\([a-zA-Z]*\)[ ]*\([;}]\)/: "\1" \2/g' > \ q2.json |
Examining the start of the file q2.json (for example doing head -3 q2.json) gives the following output:
{ "rank": 1 , "name": "Glasgow" ; "pop": "599,650" , "status": City "council": "GlasgowCity" } , { "rank": 2 , "name": "Edinburgh" ; "pop": "464,990" , "status": City "council": "Cityof" } , { "rank": 3 , "name": "Aberdeen" ; "pop": "196,670" , "status": City "council": "AberdeenCity" } , ... |
The command-line / script so far produces a sequence of entries, each of which is in JSON format. However, the entire document isn't in JSON format, yet. We need to wrap the entire sequence into a pair of square brackets ([]), and separate the entries (line) by commas. Again, we can do this on the command line like this (we assume that the output of the command above has been redirected into a file q0.json:
Example 16. Wrapping the squence into [] and separating entries by commas
# post-processing: wrap [ ] around the entire thing echo "[ " >> q0.json ; \ cat q2.json >> q0.json ; \ echo "]" >> q0.json; \ mv q0.json qq.json |
To be continued ... (discuss pre-processing to eliminate blanks in names and commas in numbers)
In the data above, we can still spot several problems: (i) numbers in the population field contain a comma between the hundreds and thousands to make it human readable (but bad for computer processing); (ii) some names contain several words that belong together, but are currently separated (as an example the council area of Edinburgh should be "City of Edinburgh" but is only "Cityof" in the above output). The former is not an immediate issue, since we don't use the council field. But the same problem appears with town names containing two words, such as East Kilbride. The latter is a problem when trying to compute the sum over all population fields. To fix both problems we can introduce another pre-processing step, before we do the processing discussed so far. We can use the sed script below to both replace the space between 2 words in the name of a town or city by an underscore character (_), and also eliminated the comma (,) in the number of the population field. The second sed script goes over the data again, and eliminates the underscore. This is optional.
Example 17. Separate input data into columns
# pre-processing 2-word city names cat ScotCities.txt | \ sed -e 's/\([0-9]*\)\([ \t]*\)\([a-zA-Z]*\)[ ]\([a-zA-Z]*\)\([ \t]*\)\([0-9]*\),/\1\2\3_\4\5\6/g' | \ sed -e 's/\([a-zA-Z]*\)_\([ \t]\)/\1\2/g' > q1.json |
Running this script modifies the input data as follows:
Rank Locality Population Status Council area 1 Glasgow 599650 City Glasgow City 2 Edinburgh 464990 City City of Edinburgh 3 Aberdeen 196670 City Aberdeen City |
We can now re-run the script that we developed in the previous section, now on the contents of q1.json, to produce a sequence of entries each in JSON format:
Example 18. Main work from previous section
# now do the main work cat q1.json | \ sed -e '1d' | \ gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ","; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "," ; print "council: "; print $5 $6 ; print "}" }' | \ sed ':a;N;$!ba;s/\n/ /g' | \ sed -e 's/} {/} , {/g' | \ sed -e 's/\([a-z]*\):/"\1":/g' | \ sed -e 's/\([0-9]*,[0-9]*\)[ ]*,/"\1" ,/g' | \ sed -e 's/:[ ]*\([a-zA-Z_]*\)[ ]*\([,;}]\)/: "\1" \2/g' \ > q2.json |
Running this script now gives a clean data format:
{ "rank": 1 , "name": "Glasgow" , "pop": 599650 , "status": "City" , "council": "GlasgowCity" } , { "rank": 2 , "name": "Edinburgh" , "pop": 464990 , "status": "City" , "council": "Cityof" } , { "rank": 3 , "name": "Aberdeen" , "pop": 196670 , "status": "City" , "council": "AberdeenCity" } |
Finally, we repeat the step at the end of the previous section, to wrap the sequence into square brackets ([]) and separate entries by commas (,).
Now we have the data in JSON format, and we can use Python with its libraries to process the data in a more convenient way. Note that we could use different scripting languages as well, provided it has good libraries for processing data in standard formats such as JSON, and a good language support for composing and calculating over data.
We are now developing a Python script in several steps to read the JSON data-file and to print the total
population of all cities as a result.
Assuming our input file is qq.json we can use the json.loads library
function in Python to load the data in a hash-map data structure in Python.
The result will be in the variable city_list
.
Example 20. Python: Reading a JSON file file into a hash-map data structure
import json jfile = "qq.json" city_list = json.loads(open(jfile,'r').read()) |
Example 21. Testing: Printing the contents of the data structure
# entire dictionary print("Dictionary: ") for c in city_list: print (c) |
Next we sort this list (it is a special form of a hash-map) by its rank in the input
file. From that data in the original file we see that the rank is by size of population.
Note, that we define an anonymous function, or lambda expression, as an argument
to the sort method, which defines how to sort, in this case by rank.
Also note that the sort method performs an in-place sort operation. Thus, after the call
to city_sorted.sort the list city_sorted
will be sorted.
Beware that the value of the original list city_list
will be modified
by this call! In our example this is ok, because we don't need city_list
any more.
Example 22. Python: sorting a list
city_sorted = city_list city_sorted.sort(key=lambda c: c['rank']) # specify the field to sort by |
Now, we need to compute the sum over all population fields in the input data. There are different ways of doing
this and we discuss several variants.
The first version is using an explicit loop to compute the sum. This is probably the most obvious version
but it is more verbose than other versions we discuss.
The variable sum
accumulates the total population, and for each record in the
list, it adds the population field interpreted as an integer value (int(c['pop']
) if the city is on of the top 50 cities by population (c['rank'] <= 50
).
Example 23. Python: summation over population (explicit loop)
# Main task: combined population of top 50 cities/towns (explicit loop) # Verbose version, using an explicit loop sum = 0 for c in city_sorted: if (c['rank'] <= 50): sum += int(c['pop']) |
Another version of doing this summation uses higher-order functions in Python. Higher-order functions are
functions that take other functions as arguments, or produce functions as a result. This concept provides a
powerful mechanism of customising the behaviour of a function.
We have seen an example of this already: the sort function above took a function,
in the form of a lambda expression, as an argument to specify that sorting should be by rank.
In this case we use a composition of several higher order functions.
We use the reduce combinator (higher-order function) to do a pair-wise combination
over the list. Informally, we replace all the commas in the list by pluses and evaluate the entire
expression do to a sum over all components. The list we do this opertation on is obtained
by doing a map operation over list, which applies its argument (a function) to every
argument of the input list. In our case, it takes the pop
field of the list
and interprets it as an integer value. In order to make sure that we only consider the 50 largest cities,
we filter the input list, meaning we only take those elements of the input list that
match a criterion we provide as an argument. In out case that criterion is for the city's rank to be
less or equal to 50. This composition of 3 higher-order functions achieves the same result as the code above:
it computes the sum over all population fields.
![]() | When you compose functions like this, which is typical functional programming style, you read the code from right-to-left: the right-mode call (which is also the inner most call) is the first function that is applied (filter in our case), then the map function is applied to its result, and finally the reduce function is the last one to be applied. |
Example 24. Separate input data into columns
# Main task: combined population of top 50 cities/towns (using pre-defined higher-order functions) # Functional version s = functools.reduce( lambda s, x: s + x , map ( lambda c: int(c['pop']) , filter( lambda c: c['rank'] <= 50 , city_sorted)) ) |
If we use the second version of the code, the result is now in variable s
and
we can print its contents as the result of the computation.
Example 25. Python: printing the result
print("The total population of the 50 largest cities/towns is: " + str(s) |
![]() | The entire Python script discussed here, plus some print statements for testing, can be downloaded here. |
Now we can run the entire Python script, over the clean data produced in the previous section. Including the test print messages per entry, and running both versions of summation gives this output:
Sorting by rank ... The City of Glasgow in council GlasgowCity has a population of 599650 The City of Edinburgh in council Cityof has a population of 464990 The City of Aberdeen in council AberdeenCity has a population of 196670 ... The Town of Port_Glasgow in council Inverclyde has a population of 15190 The Town of Larkhall in council SouthLanarkshire has a population of 15020 ITER: The total population of the 50 largest cities/towns is: 2897410 FCTAL: The total population of the 50 largest cities/towns is: 2897410 |
To be continued ...
Having developed the data cleaning and data processing step by step, we can now tie things together and develop a one-liner that does the entire processing. Note that we call the python interpreter from the command line with option -c and a string. This string is the short version of the Python script that we developed in the previous section.
Example 27. The entire command-line to solve the task
# one huge command-line to do the entire job cat ScotCities.txt | \ sed -e 's/\([0-9]*\)\([ \t]*\)\([a-zA-Z]*\)[ ]\([a-zA-Z]*\)\([ \t]*\)\([0-9]*\),/\1\2\3_\4\5\6/g' | sed -e 's/\([a-zA-Z]*\)_\([ \t]\)/\1\2/g' | \ sed -e '1d' | \ gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ","; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "," ; print "council: "; print $5 $6 ; print "}" }' | \ sed ':a;N;$!ba;s/\n/ /g' | sed -e 's/} {/} , {/g' | sed -e 's/\([a-z]*\):/"\1":/g' | \ sed -e 's/\([0-9]*,[0-9]*\)[ ]*,/"\1" ,/g' | \ sed -e 's/:[ ]*\([a-zA-Z_]*\)[ ]*\([,;}]\)/: "\1" \2/g' > q2.json ; \ echo "[ " >> q0.json ; \ cat q2.json >> q0.json ; \ echo "]" >> q0.json; \ mv q0.json qq.json ; \ python3 -c 'import itertools; import functools; import json; jfile = "qq.json"; city_sorted = json.loads(open(jfile,"r").read()) ; city_sorted.sort(key=lambda c: c["rank"]) ; s = functools.reduce( lambda s, x: s + x, map ( lambda c: int(c["pop"]), city_sorted) ); print("The total population of the 50 largest cities/towns is: " + str(s)) ' |
Running this command-line giver the following output:
The total population of the 50 largest cities/towns is: 2958510 |
<<< Previous | Home | Next >>> |
Advanced Shell Scripting | Practical: Edit-Compile-Run Cycle for C Programs |