Data visualization is the presentation of data in a graphical format. It helps people grasp the significance of data by summarizing large amounts of it in a simple, easy-to-understand form, and it helps communicate information clearly and effectively.
Data Visualization with Pandas
In this tutorial, we will learn about pandas’ built-in capabilities for data visualization. They are built on Matplotlib but baked into pandas for easier use.
Let’s take a look!
Installation
The easiest way to install pandas is to use pip. Plotting also requires Matplotlib, which pandas does not install automatically:
pip install pandas matplotlib
Alternatively, download it from here.
This article illustrates pandas’ built-in data visualization features by plotting different types of charts.
Importing necessary libraries and data files
Sample CSV files df1 and df2 used in this tutorial can be downloaded from here.
Python3
import numpy as np
import pandas as pd

# There are some fake data csv files
# you can read in as dataframes
df1 = pd.read_csv('df1', index_col=0)
df2 = pd.read_csv('df2')
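If the sample CSV files are not at hand, stand-in data of a similar shape can be generated instead. The sketch below is only an approximation: the column names (A to D for df1, a to d for df2), the row counts, and the date index are assumptions, not the contents of the original files.

```python
import numpy as np
import pandas as pd

# Assumption: stand-ins for the tutorial's CSV files, not their real contents.
rng = np.random.default_rng(42)

# df1: 1000 rows of normally distributed values, indexed by date
df1 = pd.DataFrame(
    rng.standard_normal((1000, 4)),
    index=pd.date_range("2000-01-01", periods=1000),
    columns=["A", "B", "C", "D"],
)

# df2: 10 rows of values in [0, 1), which suits area and bar plots
df2 = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "d"])

print(df1.shape, df2.shape)
```

With these two frames defined, the remaining snippets in this tutorial can be run without the downloads, although the plots will not match the original figures.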
Style Sheets
Matplotlib has style sheets that can be used to make the plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot, and more. They create a set of style rules that your plots follow. We recommend using them because they give all your plots the same look and a more professional feel. We can even create our own if we want all of the company’s plots to have the same look (though it is a bit tedious to create one). Here is how to use them. Before calling plt.style.use(), plots look like this:
Python3
df1['A'].hist()
Output:
Call the style:
After calling the ggplot style, plots look like this:
Python3
import matplotlib.pyplot as plt

plt.style.use('ggplot')
df1['A'].hist()
Output :
Plots look like this after calling bmh style:
Python3
plt.style.use('bmh')
df1['A'].hist()
Output :
Plots look like this after calling dark_background style:
Python3
plt.style.use('dark_background')
df1['A'].hist()
Output :
Plots look like this after calling FiveThirtyEight style:
Python3
plt.style.use('fivethirtyeight')
df1['A'].hist()
Output :
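Calling plt.style.use() changes the style for every later plot in the session. As a rough sketch, the snippet below lists the installed style names and applies a style to a single plot with plt.style.context(). The sample series is hypothetical and stands in for df1['A'].

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Names that can be passed to plt.style.use()
print(plt.style.available)

# Hypothetical sample data standing in for df1['A']
data = pd.Series(np.random.default_rng(0).standard_normal(500), name="A")

# Apply a style to one plot only; the previous settings return afterwards
face_before = plt.rcParams["axes.facecolor"]
with plt.style.context("ggplot"):
    face_inside = plt.rcParams["axes.facecolor"]
    ax = data.hist()
face_after = plt.rcParams["axes.facecolor"]

# Return to Matplotlib's defaults after experimenting with plt.style.use()
plt.style.use("default")
```

The context manager is handy when only one figure in a report should differ from the rest.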
Pandas DataFrame Plots
There are several plot types built into pandas, most of them statistical plots by nature:
- df.plot.area
- df.plot.barh
- df.plot.density
- df.plot.hist
- df.plot.line
- df.plot.scatter
- df.plot.bar
- df.plot.box
- df.plot.hexbin
- df.plot.kde
- df.plot.pie
You can also call df.plot(kind='hist') and replace the kind argument with any of the key terms in the list above (e.g. 'box', 'barh', etc.). Let’s start going through them!
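Before going through the individual plot types, the two call styles can be compared side by side. This is a minimal sketch on hypothetical random data; it draws the same histogram once through the accessor and once through the kind argument, then checks that the bar heights agree.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame(
    np.random.default_rng(1).standard_normal((200, 2)), columns=["A", "B"]
)

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 3))

# Accessor form
df["A"].plot.hist(bins=20, ax=left)
# Keyword form
df["A"].plot(kind="hist", bins=20, ax=right)

# Both calls should draw identical bars
left_heights = [p.get_height() for p in left.patches]
right_heights = [p.get_height() for p in right.patches]
print(left_heights == right_heights)
```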
Area Plots using Pandas DataFrame
An area chart or area graph displays quantitative data graphically. It is based on the line chart. The area between the axis and the line is commonly emphasized with colors, textures, and hatchings. An area chart is commonly used to compare two or more quantities.
Python3
df2.plot.area(alpha=0.4)
Output :
Bar Plots using Pandas DataFrame
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart.
Python3
df2.head()
Output :
Python3
df2.plot.bar()
Output :
Python3
df2.plot.bar(stacked=True)
Output :
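The horizontal variant mentioned above is available as df.plot.barh. A small sketch, using a hypothetical three-row frame rather than df2, shows stacked horizontal bars, where each bar's length is its value and the second column starts where the first ends.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Hypothetical data; the index labels become the bar categories
df = pd.DataFrame(
    {"a": [0.2, 0.5, 0.9], "b": [0.6, 0.1, 0.4]},
    index=["x", "y", "z"],
)

# Horizontal bars, with column b stacked to the right of column a
ax = df.plot.barh(stacked=True)

# Bar lengths for column a, and starting positions for column b
a_widths = [p.get_width() for p in ax.patches[:3]]
b_starts = [p.get_x() for p in ax.patches[3:]]
print(a_widths, b_starts)
```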
Histogram Plot using Pandas DataFrame
A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.
Python3
df1['A'].plot.hist(bins=50)
Output :
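To see the numbers behind such a plot, the same kind of equal-width binning can be computed directly with np.histogram. This is only an illustration on hypothetical random data, not the tutorial's df1.

```python
import numpy as np

# Hypothetical sample standing in for df1['A']
data = np.random.default_rng(3).standard_normal(1000)

# 50 equal-width bins: counts per bin and the 51 bin edges
counts, edges = np.histogram(data, bins=50)

# Every observation falls into exactly one bin
print(counts.sum(), len(edges))

# The tallest bar and the interval it covers
tallest = counts.argmax()
print(counts[tallest], edges[tallest], edges[tallest + 1])
```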
Line Plot using Pandas DataFrame
A line plot is a graph that connects a series of data points with straight line segments. It is best suited to time series data, where it shows how a value changes over time. It is a quick, simple way to visualize a trend.
Python3
# The index is used for the x-axis by default
df1.plot.line(y='B', figsize=(12, 3), lw=1)
Output :
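Because the index supplies the x-axis, a date index gives a time axis without extra work. The following sketch assumes a hypothetical daily series (a random walk) rather than the tutorial's df1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical daily time series: a random walk over one year
ts = pd.DataFrame(
    {"B": rng.standard_normal(365).cumsum()},
    index=pd.date_range("2020-01-01", periods=365),
)

# The date index is used for the x-axis
ax = ts.plot.line(y="B", figsize=(12, 3), lw=1)

# One line is drawn, with one point per day
n_points = len(ax.lines[0].get_ydata())
print(n_points)
```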
Scatter Plot using Pandas DataFrame
Scatter plots are used when you want to show the relationship between two variables. Scatter plots are sometimes called correlation plots because they show how two variables are correlated.
Python3
df1.plot.scatter(x='A', y='B')
Output :
You can use c to color the points by another column’s values, and cmap to choose the colormap. For all the colormaps, check out: http://matplotlib.org/users/colormaps.html
Python3
df1.plot.scatter(x='A', y='B', c='C', cmap='coolwarm')
Output :
Or use s to set the marker size based on another column. Passing an array, as below, works across pandas versions; recent versions also accept a column name:
Python3
df1.plot.scatter(x='A', y='B', s=df1['C'] * 200)
Output :
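Marker sizes are areas, so they should not be negative. If the sizing column can contain negative values, one option is to take absolute values first. The sketch below assumes hypothetical normally distributed data and combines size and color in one call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical data with both positive and negative values
df = pd.DataFrame(
    np.random.default_rng(6).standard_normal((100, 3)),
    columns=["A", "B", "C"],
)

# Assumption: the magnitude of C is what should drive the marker size
sizes = df["C"].abs() * 200

# Size by |C|, color by the signed value of C
ax = df.plot.scatter(x="A", y="B", s=sizes, c="C", cmap="coolwarm")

points = ax.collections[0]
print(len(points.get_offsets()), len(points.get_sizes()))
```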
Box Plots using Pandas DataFrame
A box plot is a plot in which a rectangle is drawn to represent the second and third quartiles, usually with a line inside to indicate the median value. The lower and upper quartiles are shown as lines (whiskers) extending from either side of the rectangle. A box plot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
Python3
# Can also pass a by= argument for groupby
df2.plot.box()
Output:
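The “minimum” and “maximum” are quoted above because box plots usually do not draw whiskers to the true extremes. By Matplotlib's default, whiskers stop at the furthest data points within 1.5 times the interquartile range of the box, and anything beyond is drawn as an outlier. A small sketch on hypothetical values computes these quantities by hand:

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles that define the box and the median line
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; points outside are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(lower_fence, upper_fence)
print(outliers.tolist())
```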
Hexagonal Bin Plots using Pandas DataFrame
Hexagonal binning is another way to manage the problem of having too many points that start to overlap. Hexagonal binning plots density rather than points. Points are binned into gridded hexagons, and the distribution (the number of points per hexagon) is displayed using either the color or the area of the hexagons. It is useful for bivariate data as an alternative to a scatter plot:
Python3
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')
Output :
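The per-hexagon counts that drive the colors can be read back from the plot. This is a rough sketch on hypothetical random data; it relies on the hexagons being the first collection on the returned axes, which is an assumption about how the plot is assembled.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical bivariate data
df = pd.DataFrame(
    np.random.default_rng(7).standard_normal((1000, 2)), columns=["a", "b"]
)

ax = df.plot.hexbin(x="a", y="b", gridsize=25, cmap="Oranges")

# Assumption: the hexagons are the first collection drawn on the axes
hexagons = ax.collections[0]
counts = np.asarray(hexagons.get_array())

# Number of hexagons and the count in the densest one
print(len(counts), counts.max())
```

A smaller gridsize gives fewer, larger hexagons with higher counts, and a larger gridsize gives a finer but noisier picture.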
Kernel Density Estimation plot (KDE) using Pandas DataFrame
KDE is a technique that lets you create a smooth curve from a set of data. This can be useful if you want to visualize just the “shape” of some data, as a kind of continuous replacement for the discrete histogram. It can also be used to generate points that look like they came from a certain dataset. This behavior can power simple simulations in which simulated objects are modeled on real data.
Python3
df2['a'].plot.kde()
Output :
Python3
df2.plot.density()
Output :
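To make the idea concrete, here is a minimal NumPy sketch of a Gaussian KDE. It is not the implementation pandas uses for plotting. The sample data is hypothetical and the Scott's-rule bandwidth is an assumption. The curve is the average of small Gaussian bumps centred on each observation, and new points are simulated by picking an observation at random and adding kernel-sized noise.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.standard_normal(300)  # hypothetical sample

# Assumption: Scott's-rule bandwidth for one-dimensional data
bandwidth = data.std(ddof=1) * len(data) ** (-1 / 5)


def kde(points):
    """Average of Gaussian bumps centred on each observation."""
    z = (points[:, None] - data[None, :]) / bandwidth
    norm = len(data) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


# Evaluate the smooth curve on a grid wider than the data range
grid = np.linspace(data.min() - 3, data.max() + 3, 400)
density = kde(grid)

# A density should integrate to roughly 1
area = density.sum() * (grid[1] - grid[0])
print(area)

# Simulate new points: pick an observation, then add kernel noise
simulated = rng.choice(data, size=1000) + rng.standard_normal(1000) * bandwidth
print(simulated.mean(), simulated.std())
```

A larger bandwidth gives a smoother curve that hides detail, and a smaller one follows the individual observations more closely.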
That’s it! Hopefully, you can see why this method of plotting is a lot easier to use than full-on Matplotlib: it balances ease of use with control over the figure. Many of the plot calls also accept the additional arguments of their underlying Matplotlib plt functions.