Starting out with Pandas on my Data Science Journey
Last time we discussed NumPy extensively. We covered its importance and how it’s a basic building block of some other libraries in Python. Today we will be talking about one of those libraries, and a powerful one at that.
Pandas is an open source library built on top of NumPy, which is why NumPy was taught first in the course.
Reasons why Pandas is great
· It has great built-in visualization features
· It is Python’s version of Excel
· It excels in performance and productivity
· It allows for fast analysis, data cleaning and preparation
· It can also work with data from a variety of sources
To install pandas, enter the following command in the command prompt:
pip install pandas
To start working with pandas in Python you have to import it, just like NumPy was imported.
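The import usually looks like the sketch below. The aliases np and pd are a widely used convention, not a requirement.

```python
# Import NumPy and pandas under their conventional short aliases.
import numpy as np
import pandas as pd

# Print the installed versions to confirm that both imports worked.
print(np.__version__)
print(pd.__version__)
```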
Types of Data Structures in Pandas
1. Series: A Series looks very much like a NumPy array (in fact, it is built on top of the NumPy array object). The difference between a NumPy array and a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn’t need to hold numeric data; it can hold any arbitrary Python object, so it can store strings or other objects and not just numbers.
To create a Series, you can convert a list, a NumPy array, or a dictionary to a Series.
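Here is a minimal sketch of the three conversions. The names my_list, arr and d, and the values in them, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration.
my_list = [10, 20, 30]
arr = np.array(my_list)
d = {"a": 10, "b": 20, "c": 30}

s_list = pd.Series(my_list)  # from a list: default index 0, 1, 2
s_arr = pd.Series(arr)       # from a NumPy array: default index 0, 1, 2
s_dict = pd.Series(d)        # from a dictionary: the keys become the index

print(s_list)
print(s_arr)
print(s_dict)
```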
You can make the list called “labels” the index (the axis labels) of the Series by passing it as the index argument.
The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).
Check out an example of how to grab information from a Series. Let us create two Series, ser1 and ser2.
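As a small sketch, assuming a list named labels with three made-up letters and a list of three made-up values:

```python
import pandas as pd

# Hypothetical labels and values, used only for illustration.
labels = ["a", "b", "c"]
my_list = [10, 20, 30]

# Pass the labels as the index so each value can be looked up by its label.
s = pd.Series(data=my_list, index=labels)

print(s)
print(s["b"])  # look up a value by its label
```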
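The country labels and numbers below are placeholders I picked for the sketch. Only the names ser1 and ser2 come from the text.

```python
import pandas as pd

# Two Series with hypothetical labels and values.
ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "France", "Japan"])
ser2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# Grab a value by its index label, much like a dictionary lookup.
print(ser1["USA"])
print(ser2["Italy"])
```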
2. DataFrames: We can think of a DataFrame as a bunch of Series objects put together to share the same index.
Remember those NumPy functions? Here we set the index to “A B C D E” and the columns to “W X Y Z”. The .split() method splits each string at the spaces, giving a list of separate letters.
Now let’s learn how to select a Series (a column) from the DataFrame.
I wanted the Series “W”, so I selected it by typing df["W"]. You can also try selecting another column on your own.
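A DataFrame like this can be built roughly as follows. The random seed (101) and the use of np.random.randn are my assumptions, added so the numbers are repeatable.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed, only so the numbers repeat on every run

# 5 rows by 4 columns of random numbers, labelled with split() on the strings.
df = pd.DataFrame(
    np.random.randn(5, 4),
    index="A B C D E".split(),
    columns="W X Y Z".split(),
)

print(df)
```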
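A short sketch of column selection, rebuilding the same kind of DataFrame (the seed is an assumption):

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# Bracket notation with a column name returns that column as a Series.
w = df["W"]

print(w)
print(type(w))
```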
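A new column can be added by assigning to a column name that does not exist yet. In this sketch the column “new” is the sum of “W” and “Y”, which is an assumption on my part; any values of the right length would work.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# Assigning to a new column name creates the column in place.
df["new"] = df["W"] + df["Y"]

print(df)
```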
You create a new column in the existing DataFrame by assigning values to a new column name. To remove a column, use df.drop() with axis=1, which tells pandas to look along the columns and not the rows.
Do note that drop does not change the original DataFrame by default. You have to specify inplace=True for the change to be reflected in the original DataFrame: df.drop('new', axis=1, inplace=True).
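A minimal sketch of dropping a column, assuming a column called “new” was added earlier as the sum of “W” and “Y”:

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())
df["new"] = df["W"] + df["Y"]  # hypothetical extra column to remove

# axis=1 means "drop a column"; the result is a new DataFrame.
without_new = df.drop("new", axis=1)

print(without_new.columns.tolist())
print(df.columns.tolist())  # df itself still has the "new" column
```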
How to Select rows
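Rows can be selected by their index label with df.loc. A small sketch, using the same kind of random DataFrame (the seed is an assumption):

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# .loc selects by label; a single row comes back as a Series.
row_a = df.loc["A"]

print(row_a)
```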
You can also select a row by its integer position instead of its label.
We can also select specific values in the DataFrame, i.e. a subset of rows and columns.
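Position-based selection uses df.iloc. In this sketch, position 2 is the third row, because counting starts at zero.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# .iloc selects by integer position, regardless of the index labels.
third_row = df.iloc[2]

print(third_row)
```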
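As a sketch, df.loc accepts a row label and a column label together, or lists of each. The particular rows and columns picked here are arbitrary.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# A single value: row "B", column "Y".
value = df.loc["B", "Y"]

# A subset: rows "A" and "B", columns "W" and "Y".
subset = df.loc[["A", "B"], ["W", "Y"]]

print(value)
print(subset)
```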
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
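A comparison on a DataFrame gives back a DataFrame of True and False values, as in this sketch (the seed is an assumption):

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# Element-wise comparison: True where the value is greater than zero.
mask = df > 0

print(mask)
```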
To show the actual numbers, pass the condition back into the DataFrame inside brackets.
To select only the rows of the DataFrame where column “W” is greater than zero, you filter on that column.
Notice that row “C” was removed entirely. That is because it contains a number less than zero in column “W”. From the filtered DataFrame we can then grab other results, for instance column “Y”.
Or columns “Y” and “X”.
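A minimal sketch: values that fail the condition are shown as NaN, while the rest keep their original numbers.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# Keep values where the condition is True; everything else becomes NaN.
positive = df[df > 0]

print(positive)
```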
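Filtering on one column can be sketched like this. The condition df["W"] > 0 is a Series of True and False values, and only the rows marked True are kept.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# Keep only the rows where column "W" is greater than zero.
w_positive = df[df["W"] > 0]

print(w_positive)
```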
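Because the filtered result is itself a DataFrame, a column can be picked from it straight away. A small sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# Filter the rows first, then select column "Y" from what is left.
y_where_w_positive = df[df["W"] > 0]["Y"]

print(y_where_w_positive)
```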
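For more than one column, pass a list of column names, which is why there are double brackets in this sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # assumed seed for repeatable numbers
df = pd.DataFrame(np.random.randn(5, 4),
                  index="A B C D E".split(),
                  columns="W X Y Z".split())

# Filter the rows first, then select columns "Y" and "X" with a list.
yx_where_w_positive = df[df["W"] > 0][["Y", "X"]]

print(yx_where_w_positive)
```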