Data Analysis using Python – A Quick Exercise

Let’s start data analysis with Python right away. Open up the browser and enter following URL:

https://www.anaconda.com/products/distribution

download anaconda
Download Anaconda

Press on Download button. After download is complete, open the Downloads folder in your computer and find the file named Anaconda3… now start installation. Just click Next on each screen.

Once installation in finished, press Start button of Windows and find ‘Jupyter Notebook’ icon:

Jupyter Notebook
Jupyter Notebook

Click it and the browser will open up the Jupyter interface with the contents of user home folder. Now look at the upper right side and you’ll find ‘New’ button. Click it and then click over ‘Python 3’ from the dropdown menu:

Jupyter Interface
Jupyter Interface

Another browser tab will open with ‘Untitled’ Jupyter Notebook:

Untitled Jupyter Notebook
Untitled Jupyter Notebook

Now we can enter Python code here. Type following Python command:

print( "hello world" )

and then press ‘Run’ button located at the upper bar. After the code was run, it automatically moved on to new line. Now type following code in new line:

print( 5+6 )

Again press ‘Run’ button. Result is as expected. Now type following code in new line:

a=5
b=3
print( a+b )

Press ‘Run’ and result is as expected. Congrats! you’ve learned quite a bit of programming language. You’ve just seen how string (“hello world”) and numeric variables (a, b) work. You’ve also used a print( ) function which sends output to display. Functions are fundamental element of any programming language; they are ‘actions’ or we may say ‘verbs’ like go, run, talk etc. There are around 68 built-in functions in Python just like print( ).

But how about specialized functions like mathematical ‘actions’ e.g. square root, power, log etc. That’s where packages / modules come in. Specialized functions are normally clubbed in named packages / modules which can be imported to perform a specific task. Let’s import ‘math’ module to understand this. In new line enter following code:

import math as m

s = m.sqrt( 64 )
print( s )
x = 5
y = 3
p = m.pow( x, y )
print( p )

Let’s analyze the above code. In first line we imported the ‘math’ module and gave it a nickname ‘m’ so that we type less the next time we call ‘math’ module. In second line we created a new variable ‘s’ and then we called a function from math module i.e. sqrt( ) which actually reads the value given (parameter) within brackets of function and returns the square root of the given number. Then the assignment operator ‘=’ assigns the value which was returned by sqrt( ) function, to the variable ‘s’.

Next, we defined two variables ‘x’ and ‘y’ and assigned them values ‘5’ and ‘3’ respectively. Then, we called another function from ‘math’ module i.e. pow( ) to calculate power. Here the function pow( ) accepts two parameters – first is the number whose power is to be calculated and the second parameter is the power, therefore ‘5’ is calculated with power of ‘3’. The important programming lesson here is that a function can accept one or more parameters, which are separated by comma. Another thing which you must have observed is that we used dot ‘.’ when we call a function from a module.

Let’s move towards playing with actual data. For this, we may use any file containing data or actual database such as Oracle, SQL Server, MySQL etc. To keep things simple, we’ll use an Excel file which contains data – download this Excel file from: https://bittenbook.com/wp-content/uploads/2022/07/employee.xlsx and then put it at C:\Users\yourusername\ folder – just replace “yourusername” with your actual Windows username; this folder is where your Jupyter Notebooks are also being saved by default.

Now, enter following code in new line:

import pandas as pd

data = pd.read_excel( "employee.xlsx" )
data

Here, in first line we imported the ‘pandas’ package and gave it a nickname ‘pd’. In second line we created a new variable ‘data’ and then we called a function from pandas package i.e. read_excel( ) which actually reads an Excel file and returns the contents of that file.  The assignment operator ‘=’ assigns the contents which have been returned by read_excel( ) function, to the variable ‘data’. At the last line we outputted the contents of ‘data’ – we didn’t use print( ) here and voila! output got formatted nicely in table form.

Now let’s do some sorting on the data. For that, move on to next line and then enter code as below:

import pandas as pd

data = pd.read_excel( "employee.xlsx" )
result = data.sort_values( "Employee Name" )
result

Here, we called a function sort_values( ) on data variable and gave it the column name as parameter i.e. ‘Employee Name’ and the returned contents are being captured in a new variable named ‘result’.

Now let’s do some filtering on the data, so that only those employees who belong to ‘Finance’ department could be returned. For that, move on to next line and then enter code as below:

import pandas as pd

data = pd.read_excel( "employee.xlsx" )
filter1 = data[ "Department" ] == "Finance"
result = data.where( filter1 )
result

Here, we created a variable named ‘filter1′ which holds the condition i.e. the “Department” column of data must be equal to “Finance” (Note that it’s ‘==’ which means ‘equal to’, it’s not single ‘=’ which is assignment operator). In next line we called where( ) function on data variable and passed on ‘filter1’ as parameter. Finally ‘result’ variable gets the returned contents from data.where( ) function.

Move on to next line and let’s insert another filter as given below:

import pandas as pd

data = pd.read_excel( "employee.xlsx" )
filter1 = data[ "Department" ] == "Finance"
filter2 = data[ "Salary" ] > 75000
result = data.where( filter1 & filter2 )
result.dropna( )

Here, we created a variable named ‘filter2′ which holds the second condition i.e. the “Salary” column of data must be greater than 75000. In next line we called where( ) function on data variable and passed on “filter1’ and ‘filter2’ ( combining both conditions with & ) as parameter. At the end we also called dropna( ) function on ‘result’ which eliminates the irrelevant record lines from output.

Last but not least, let’s create a chart from data:

import matplotlib.pyplot as plt
import pandas as pd

data= pd.read_excel( "employee.xlsx" )
plt.plot( data[ "Salary" ] )
plt.title( "Employees Salaries" )
plt.xlabel( "Employee Name" )
plt.ylabel( "Salary" )
plt.xticks( data.index, data[ "Employee Name" ], rotation="vertical" )
plt.show( )

Worth noting code here is importing matplotlib.pyplot module for plotting purposes. Then at fourth line we used plot( ) function and passed on “Salary” column data for plotting. Most eye-catching line here is the second last where we used xticks( ) function to define ticks for x-axis; this function took three parameters – first parameter passed on the data index, the second parameter passed on “Employee Name” column from data and the third parameter passed on attribute for rotation of the names vertically.

This was a quick intro to data analysis using Python. You can build your knowledge from here onwards. Or you can also say…why bother? Well, technology is fiercely pushing every field to evolve, so remember that – if you don’t evolve, you dissolve!

(The above article was originally published in The Pakistan Accountant July-Sept 2022 edition. It’s a quick intro of Python and its basic packages which are used in data analysis. Hence, this article is addressed towards beginners who want to kick-start in the field of data analysis using Python language.)

Leave a Comment