Learning Data Analysis with Python - Introduction to Pandas

Cover image

What is Pandas and Why do we need it

Pandas is python library that provides rich data structures and functions that makes working with relational and structured data easy, fast and convenient. It is developed on top of the Numpy package for the high performance computing and it gives flexible data manipulation techniques of relational databases. Okay, now why do we need panadas at all? If we want to mainpulate high dimensional data, we have numpy, so why do we need it? Well, indeed numpy is there for the high dimensional data manipulation but what it doesn't provide is a way to index or label our data like this.

Pandas DataFrame

And, although we can do so many interesting things to our data using numpy functions such as finding statistics like mean,median,mode and standard deviation but we can't fill missing values in our data. Numpy doesn't provide any built-in method to handle missing values in our data and in real world problems, we don't always come around the perfect datasets.

There are other limitations to numpy such as no way to group data and no way to pivot data, and thus, Pandas was created to help programmers and data scientists with the data analysis task in Python.

Data Structures in Pandas

Well as we've mentioned earlier, pandas provides rich data structures and functions, let's start by looking at the data structures and then we'll look at the functions and methods that we can apply on these data structures to crunch data.

Pandas provide two major data structures namely "Series" and "DataFrames". In plain english, Series is a one-dimensional labeled array, whereas DataFrames is two-dimensional labeled array with columns of potentially different types. Although, Series are useful in many tasks, DataFrames are the most commonly used pandas object. You can think of DataFrame as an SQL table.

Let's look at the example for both of these data structures.

Series & DataFrames

Let's start by looking at the different ways of creating a Series object.But, Before we get into coding, we need to install pandas as python package. This can be done using pip install pandas or you can install it using conda install pandas if you are using Anaconda Distribution for Python. If you don't have python installed on your machine, the best way to get started without all the hustle of installing it is Google Colab. Learn more about it here. I'm going to use jupyter notebooks which you can find here.

Series

As mentioned above, Series is a one-dimensional array-like object containing an array of data and an array of labels associated with the data. The data labels are called 'index'. We can create a Series from a list or a dictionary or a numpy array. Let's start with a list.

Series using a list

In[1]:
    import pandas as pd
    import numpy as np

    friends = ["Monica", "Chandler", "Joey", "Pheobe", "Ross", "Rachel"]
    series = pd.Series(friends)
    series

Out[1]:
    0      Monica
    1    Chandler
    2        Joey
    3      Pheobe
    4        Ross
    5      Rachel
    dtype: object

As we can see, we get the Series object with the data that we provided with the associated index of for each row. We can access our data from this series using the index just like we do with the list.

In[2]:
    series[0]

Out[2]:
    'Monica'

Or we can get all the values in a list using .value attribute on series object.

In[3]:
    series.values

Out[3]:
    array(['Monica', 'Chandler', 'Joey', 'Pheobe', 'Ross', 'Rachel'],
      dtype=object)

Similarly, we can get all the indices in a list using .index attribute.

In[4]:
    series.index

Out[4]:
    RangeIndex(start=0, stop=6, step=1)

Note something different? Right, it doesn't return a list of indices as it did in values, that's because the implicit index that pandas automatically creates for us is just a list of consecutive numbers starting from 0. A more intiuative approach would be to provide index identifying each data point while creating a Series.

In[5]:
    nick_names = ["Monana", "Officer Bing", "Ken Adams", "Regina Filange", "Dinosaur Guy", "Fun Aunt Rachel"]
    series = pd.Series(data = friends, index = nick_names)
    series

Out[6]:
    Monana               Monica
    Officer Bing       Chandler
    Ken Adams              Joey
    Regina Filange       Pheobe
    Dinosaur Guy           Ross
    Fun Aunt Rachel      Rachel
    dtype: object

And now, we can access our data using the index that we provided like this.

In[7]:
    series["Dinosaur Guy"]

Out[7]:
    'Ross'

It's clearly a bad example of indexing but you got the idea, right?

Series using a dictionary

Now, let's instantiate Series from a dictionary. When creating a Series from dictionaries, we don't pass index, the Series gets ordered by dictionary's insertion order and the keys of dictionary becomes indices of Series.

In[8]:
    d = {"a" : 1, "b" : 2, "c" : 3}
    series = pd.Series(d)
    series

Out[8]:
    a    1
    b    2
    c    3
    dtype: int64

Series object also works like dictionaries. You can get or set the values in the Series by index label just like dictionaries.

In[9]:
    series["c"]

Out[9]:
    3

In[10]:
    series["d"] = 4
    series

Out[10]:
    a    1
    b    2
    c    3
    d    4
    dtype: int64

Series using a numpy array

And finally, we can create a Series from a numpy array.

In[11]:
    d = np.random.randn(5) # Create a numpy array of 5 random numbers
    series = pd.Series(d)
    series

Out[11]:
    0   -0.507243
    1   -0.115731
    2    0.526216
    3   -0.495840
    4    0.065357
    dtype: float64

Series is also similar to numpy array in functionalities. We can index data from a Series object as we do from a numpy array.

In[12]
    series[:3] # Returns first 3 elements

Out[12]:
    0   -0.507243
    1   -0.115731
    2    0.526216
    dtype: float64

In[13]:
    series[series > 0] # Returns elements whose value is greater than 0

Out[13]:
    2    0.526216
    4    0.065357
    dtype: float64

Now, once we created a Series, we can apply many different methods on it to achieve different tasks. A few of them are as below. Read more about all the methods available on Series object at pandas official documentation.

In[14]:
    series.size # Returns size of the series.
Out[14]:
    5

In[15]:
    series.shape # Returns shape of the series.
Out[15]:
    (5,)

In[16]:
    series.add(1) # Returns series with elements' value increased by 1
Out[16]:
    0    0.492757
    1    0.884269
    2    1.526216
    3    0.504160
    4    1.065357
    dtype: float64

In[17]:
    series.sub(1) # Returns series with elements' value decreased by 1
Out[17]:
    0   -1.507243
    1   -1.115731
    2   -0.473784
    3   -1.495840
    4   -0.934643
    dtype: float64

In[18]:
    series.mul(10) # Returns series with elements' value multiplied by 10
Out[18]:
    0   -5.072431
    1   -1.157307
    2    5.262160
    3   -4.958401
    4    0.653565
    dtype: float64

In[19]:
    series.div(10) # Returns series with elements' value divided by 10
Out[19]:
    0   -0.050724
    1   -0.011573
    2    0.052622
    3   -0.049584
    4    0.006536
    dtype: float64

In[20]:
    series.abs() # Returns series with absolute value of elements
Out[20]:
    0    0.507243
    1    0.115731
    2    0.526216
    3    0.495840
    4    0.065357
    dtype: float64

In[21]:
    series.max() # Returns maximum of all the elements.
Out[21]:
    0.5262159987471329

In[22]:
    series.min() # Returns minimum of all the elements.
Out[22]:
    -0.5072430551642492

In[23]:
    series.sum() # Returns sum of all the elements.
Out[23]:
    -0.5272413324692681

In[24]:
    series.mean() # Returns mean of the series.
Out[24]:
    -0.10544826649385361

In[25]:
    series.median() # Returns median of the series.
Out[25]:
    -0.1157306666767307

In[26]:
    series.std() # Returns standard deviation of the series.
Out[26]:
    0.43073327047221427

In[27]:
    series.drop(4) # Returns series with specified index removed.
Out[27]:
    0   -0.507243
    1   -0.115731
    2    0.526216
    3   -0.495840
    dtype: float64

In[28]:
    series
Out[28]:
    0   -0.507243
    1   -0.115731
    2    0.526216
    3   -0.495840
    4    0.065357
    dtype: float64

Look at the last code when we print the original series object, we get the elements same as we created, because all the operations that we applied on the series object gives us the view of the series and doesn't affect the original object. Okay, so this is just a beginning, there are many more interesting things that we can do with pandas such as dropping null values or replacing them with some other values from data, visualize data using so many plots and charts, combine and merge two or more Series, create and manipulate Time Series, etc.

That's it for this post.In the next couple of posts, I'll be playing with some real world datasets and make some analysis on those data using Pandas.

Get full code notebook here : Github

Thank you for reading. Happy coding.👨‍💻👩‍💻