Advanced Example on Command line usage (sed, gawk, etc)

This section develops a fairly advanced example of performance a simple data analytics task, by purely using Linux command line tools. It shows the power of these command line tools and of the ability to "pipe" the output of one tool into the other.

Task: Calculate to cumulative population of the 50 largest cities/towns in Scotland, based on data in the file like this:


Rank 	Locality 	Population 	Status 	Council area
1 	Glasgow 	599,650 	City 	Glasgow City
2 	Edinburgh 	464,990 	City 	City of Edinburgh
3 	Aberdeen 	196,670 	City 	Aberdeen City
...

Pre-processing the data with sed and friends

In the first step, we separate the data in the input file by column, using the GNU gawk command. The command-line below, first feeds the contents of ScotCities.txt into the sed. The pipe combinator (|) takes the input of the left-hand component (here gawk) and feeds it into the right-hand component (here sed). The sed acts on the first line (1) and deletes that line (d). See the man page on sed for an explanation of the command string (the string after the -e option). The result of the sed command is the fed into the gawk. This command decomposes the input into columns, and numbers each column so that it can be refered to in the command string (after the -e option). This command is executed for each line of the input. Here, we want to print a label, with a colon, and then the value of the matching column in the input. For example, "Population" is the third column in the file, and we print "pop: " as a label, followed by the value in the third column, which is encoded as $3. We do this with all columns, and put "{}" as deliminators around the output. The final command on this line is less which just views the input, and allows paging in the text. Press 'q' to exit less.

Note

We are aiming to produce a file in JSON format so that we can more easily process it further. Hence, we separate the entries (key-value pairs separated by a ':'), by commas (,), and put {} around each group of key-value pairs.

Example 12. Separate input data into columns


  # very basic: just drop header line and print column by colunm; NB: ; to separate commands in gawk
  cat ScotCities.txt | \
  sed -e '1d' | \
  gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \
  less

As you see, the output is spread over severall lines, where you would expect just one line for each line of input. To remedy this, we use a bit of sed magic to eliminate newlines in the right places.

Example 13. Put all data on one line


  # this adds sed-magic to eliminate newlines; 
  cat ScotCities.txt | \
  sed -e '1d' | \
  gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \
  sed ':a;N;$!ba;s/\n/ /g' | \
  less

Running this command gives this output:


  { rank:  1 , name:  Glasgow ; pop:  599,650 , status: City council:  GlasgowCity } { rank:  2 , name:  Edinburgh ; pop:  464,990 , status: City council:  Cityof } { rank:  3 , name:  Aberdeen ; pop:  196,670 , status: City council:  AberdeenCity }

Now we want to add commas as spearators between the key-value pairs in one entry. For that we again use sed

Example 14. Separate per-city entries by commas


  # this adds separators between fields, searching for '} {' patterns; 
  cat ScotCities.txt | \
  sed -e '1d' | \
  gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \
  sed ':a;N;$!ba;s/\n/ /g' | \
  sed -e 's/} {/} , {/g' | \
  less

Running this command-line prints the data for all cities, separated by commas:


  { rank:  1 , name:  Glasgow ; pop:  599,650 , status: City council:  GlasgowCity } , { rank:  2 , name:  Edinburgh ; pop:  464,990 , status: City council:  Cityof } , { rank:  3 , name:  Aberdeen ; pop:  196,670 , status: City council:  AberdeenCity } , ...

Note that in some cases either the key or the value component contains spaces or commas. This makes it difficult to separate the contents from the meta-data, i.e. data that separates fields. To solve this issue, we put "" around entries where necessary. This again uses sed code which searches for numbers followed by a comma (\([0-9]*,[0-9]*\)) or alphabetical characters separated by a space ([a-zA-Z]*\)[ ]*\([;}]) and puts "" around the first match that was found, which is referred to in sed by \1. The output of the entire line is redirected into a file q2.json for further processing. This is an example of a more involved usage of sed pattern matching and replacement.

Example 15. Wrap entries into ""


  # wrap all values into "" using sed code
  cat ScotCities.txt | \
  sed -e '1d' | \
  gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ";"; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "council: "; print $5 $6 ; print "}" }' | \
  sed ':a;N;$!ba;s/\n/ /g' | sed -e 's/} {/} , {/g' | \
  sed -e 's/\([a-z]*\):/"\1":/g' | sed -e 's/\([0-9]*,[0-9]*\)[ ]*,/"\1" ,/g' | \
  sed -e 's/:[ ]*\([a-zA-Z]*\)[ ]*\([;}]\)/: "\1" \2/g' > \
  q2.json

Examining the start of the file q2.json (for example doing head -3 q2.json) gives the following output:


  { "rank":  1 , "name": "Glasgow" ; "pop":  "599,650" , "status": City "council": "GlasgowCity" } , { "rank":  2 , "name": "Edinburgh" ; "pop":  "464,990" , "status": City "council": "Cityof" } , { "rank":  3 , "name": "Aberdeen" ; "pop":  "196,670" , "status": City "council": "AberdeenCity" } , ...

The command-line / script so far produces a sequence of entries, each of which is in JSON format. However, the entire document isn't in JSON format, yet. We need to wrap the entire sequence into a pair of square brackets ([]), and separate the entries (line) by commas. Again, we can do this on the command line like this (we assume that the output of the command above has been redirected into a file q0.json:

Example 16. Wrapping the squence into [] and separating entries by commas


  # post-processing: wrap [ ] around the entire thing
  echo "[ " >>  q0.json ; \
  cat q2.json  >> q0.json ; \
  echo "]" >> q0.json; \
  mv q0.json qq.json

To be continued ... (discuss pre-processing to eliminate blanks in names and commas in numbers)

Pre-processing names and numbers

In the data above, we can still spot several problems: (i) numbers in the population field contain a comma between the hundreds and thousands to make it human readable (but bad for computer processing); (ii) some names contain several words that belong together, but are currently separated (as an example the council area of Edinburgh should be "City of Edinburgh" but is only "Cityof" in the above output). The former is not an immediate issue, since we don't use the council field. But the same problem appears with town names containing two words, such as East Kilbride. The latter is a problem when trying to compute the sum over all population fields. To fix both problems we can introduce another pre-processing step, before we do the processing discussed so far. We can use the sed script below to both replace the space between 2 words in the name of a town or city by an underscore character (_), and also eliminated the comma (,) in the number of the population field. The second sed script goes over the data again, and eliminates the underscore. This is optional.

Example 17. Separate input data into columns


  # pre-processing 2-word city names
  cat ScotCities.txt | \
  sed -e 's/\([0-9]*\)\([ \t]*\)\([a-zA-Z]*\)[ ]\([a-zA-Z]*\)\([ \t]*\)\([0-9]*\),/\1\2\3_\4\5\6/g' | \
  sed -e 's/\([a-zA-Z]*\)_\([ \t]\)/\1\2/g' > q1.json

Running this script modifies the input data as follows:


Rank    Locality        Population      Status  Council area
1       Glasgow 599650  City    Glasgow City
2       Edinburgh       464990  City    City of Edinburgh
3       Aberdeen        196670  City    Aberdeen City

We can now re-run the script that we developed in the previous section, now on the contents of q1.json, to produce a sequence of entries each in JSON format:

Example 18. Main work from previous section


  # now do the main work
  cat q1.json | \
  sed -e '1d' | \
  gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ","; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "," ; print "council: "; print $5 $6 ; print "}" }' | \
  sed ':a;N;$!ba;s/\n/ /g' | \
  sed -e 's/} {/} , {/g' | \
  sed -e 's/\([a-z]*\):/"\1":/g' | \
  sed -e 's/\([0-9]*,[0-9]*\)[ ]*,/"\1" ,/g' | \
  sed -e 's/:[ ]*\([a-zA-Z_]*\)[ ]*\([,;}]\)/: "\1" \2/g'  \
  > q2.json

Running this script now gives a clean data format:


{ "rank":  1 , "name": "Glasgow" , "pop":  599650 , "status": "City" , "council": "GlasgowCity" } , { "rank":  2 , "name": "Edinburgh" , "pop":  464990 , "status": "City" , "council": "Cityof" } , { "rank":  3 , "name": "Aberdeen" , "pop":  196670 , "status": "City" , "council": "AberdeenCity" }  

Finally, we repeat the step at the end of the previous section, to wrap the sequence into square brackets ([]) and separate entries by commas (,).

Example 19. Wrapping the squence into [] and separating entries by commas


  # post-processing: wrap [ ] around the entire thing
  echo "[ " >>  q0.json ; \
  cat q2.json  >> q0.json ; \
  echo "]" >> q0.json; \
  mv q0.json qq.json

Developing a Python script for processing the data

Now we have the data in JSON format, and we can use Python with its libraries to process the data in a more convenient way. Note that we could use different scripting languages as well, provided it has good libraries for processing data in standard formats such as JSON, and a good language support for composing and calculating over data.

We are now developing a Python script in several steps to read the JSON data-file and to print the total population of all cities as a result. Assuming our input file is qq.json we can use the json.loads library function in Python to load the data in a hash-map data structure in Python. The result will be in the variable city_list.

Example 20. Python: Reading a JSON file file into a hash-map data structure


import json
jfile = "qq.json"
city_list = json.loads(open(jfile,'r').read())

Example 21. Testing: Printing the contents of the data structure


# entire dictionary
print("Dictionary: ")
for c in city_list:
  print (c)

Next we sort this list (it is a special form of a hash-map) by its rank in the input file. From that data in the original file we see that the rank is by size of population. Note, that we define an anonymous function, or lambda expression, as an argument to the sort method, which defines how to sort, in this case by rank. Also note that the sort method performs an in-place sort operation. Thus, after the call to city_sorted.sort the list city_sorted will be sorted. Beware that the value of the original list city_list will be modified by this call! In our example this is ok, because we don't need city_list any more.

Example 22. Python: sorting a list


city_sorted = city_list
city_sorted.sort(key=lambda c: c['rank'])   # specify the field to sort by

Now, we need to compute the sum over all population fields in the input data. There are different ways of doing this and we discuss several variants. The first version is using an explicit loop to compute the sum. This is probably the most obvious version but it is more verbose than other versions we discuss. The variable sum accumulates the total population, and for each record in the list, it adds the population field interpreted as an integer value (int(c['pop']) if the city is on of the top 50 cities by population (c['rank'] <= 50).

Example 23. Python: summation over population (explicit loop)


# Main task: combined population of top 50 cities/towns (explicit loop)
# Verbose version, using an explicit loop
sum = 0
for c in city_sorted:
     if (c['rank'] <= 50):
         sum += int(c['pop'])

Another version of doing this summation uses higher-order functions in Python. Higher-order functions are functions that take other functions as arguments, or produce functions as a result. This concept provides a powerful mechanism of customising the behaviour of a function. We have seen an example of this already: the sort function above took a function, in the form of a lambda expression, as an argument to specify that sorting should be by rank. In this case we use a composition of several higher order functions. We use the reduce combinator (higher-order function) to do a pair-wise combination over the list. Informally, we replace all the commas in the list by pluses and evaluate the entire expression do to a sum over all components. The list we do this opertation on is obtained by doing a map operation over list, which applies its argument (a function) to every argument of the input list. In our case, it takes the pop field of the list and interprets it as an integer value. In order to make sure that we only consider the 50 largest cities, we filter the input list, meaning we only take those elements of the input list that match a criterion we provide as an argument. In out case that criterion is for the city's rank to be less or equal to 50. This composition of 3 higher-order functions achieves the same result as the code above: it computes the sum over all population fields.

Note

When you compose functions like this, which is typical functional programming style, you read the code from right-to-left: the right-mode call (which is also the inner most call) is the first function that is applied (filter in our case), then the map function is applied to its result, and finally the reduce function is the last one to be applied.

Example 24. Separate input data into columns


# Main task: combined population of top 50 cities/towns (using pre-defined higher-order functions)
# Functional version
s = functools.reduce( lambda s, x: s + x
                    , map ( lambda c: int(c['pop']) 
                          ,  filter( lambda c: c['rank'] <= 50 , city_sorted)) )

If we use the second version of the code, the result is now in variable s and we can print its contents as the result of the computation.

Example 25. Python: printing the result


print("The total population of the 50 largest cities/towns is: " + str(s)

Note

The entire Python script discussed here, plus some print statements for testing, can be downloaded here.

Now we can run the entire Python script, over the clean data produced in the previous section. Including the test print messages per entry, and running both versions of summation gives this output:

Example 26. Running the Python script over the clean data


python3 cities.py

Sorting by rank  ...
The City of Glasgow in council GlasgowCity has a population of 599650
The City of Edinburgh in council Cityof has a population of 464990
The City of Aberdeen in council AberdeenCity has a population of 196670
...
The Town of Port_Glasgow in council Inverclyde has a population of 15190
The Town of Larkhall in council SouthLanarkshire has a population of 15020
ITER:  The total population of the 50 largest cities/towns is: 2897410
FCTAL: The total population of the 50 largest cities/towns is: 2897410

To be continued ...

Tying things together

Having developed the data cleaning and data processing step by step, we can now tie things together and develop a one-liner that does the entire processing. Note that we call the python interpreter from the command line with option -c and a string. This string is the short version of the Python script that we developed in the previous section.

Example 27. The entire command-line to solve the task


 # one huge command-line to do the entire job
 cat ScotCities.txt | \
 sed -e 's/\([0-9]*\)\([ \t]*\)\([a-zA-Z]*\)[ ]\([a-zA-Z]*\)\([ \t]*\)\([0-9]*\),/\1\2\3_\4\5\6/g' | sed -e 's/\([a-zA-Z]*\)_\([ \t]\)/\1\2/g' | \
 sed -e '1d' | \
 gawk -e '{ print "{"; print "rank: "; print $1; print ","; print "name: " ; print $2; print ","; print "pop: " ; print $3; print ","; print "status:" ; print $4; print "," ; print "council: "; print $5 $6 ; print "}" }' | \
 sed ':a;N;$!ba;s/\n/ /g' | sed -e 's/} {/} , {/g' | sed -e 's/\([a-z]*\):/"\1":/g' | \
 sed -e 's/\([0-9]*,[0-9]*\)[ ]*,/"\1" ,/g' | \
 sed -e 's/:[ ]*\([a-zA-Z_]*\)[ ]*\([,;}]\)/: "\1" \2/g'  > q2.json ; \
 echo "[ " >> q0.json ; \
 cat q2.json  >> q0.json ; \
 echo "]" >> q0.json; \
 mv q0.json qq.json ; \
 python3 -c 'import itertools; import functools; import json; jfile = "qq.json"; city_sorted = json.loads(open(jfile,"r").read()) ; city_sorted.sort(key=lambda c: c["rank"]) ; s = functools.reduce( lambda s, x: s + x, map ( lambda c: int(c["pop"]),  city_sorted) );  print("The total population of the 50 largest cities/towns is: " + str(s)) '

Running this command-line giver the following output:


The total population of the 50 largest cities/towns is: 2958510