Let’s start with pandas groupby()

In Python, pandas is one of the most popular tools for handling data. Its groupby() API is very useful for understanding and preprocessing data. Let’s look at some of the use cases and operations we can perform with groupby().

First, import pandas.

In [ ]:
import pandas as pd

Create the sample dataframe used in this example.

In [9]:
raw_data = {'Subject': ['Maths', 'Maths', 'Maths', 'Maths', 'Science', 'Science', 'Science', 'Science', 'English', 'English', 'English', 'English'], 
        'School': ['SCH1', 'SCH1', 'SCH2', 'SCH2', 'SCH1', 'SCH1', 'SCH2', 'SCH2','SCH1', 'SCH2', 'SCH2', 'SCH2'], 
        'name': ['Hemant', 'Hetav', 'Prajakta', 'Milan', 'Daisy', 'Sangeeta', 'Bhoomi', 'Jone', 'Jamy', 'Satiah', 'Rani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['Subject', 'School', 'name', 'preTestScore', 'postTestScore'])
df
Out[9]:
    Subject School      name  preTestScore  postTestScore
0     Maths   SCH1    Hemant             4             25
1     Maths   SCH1     Hetav            24             94
2     Maths   SCH2  Prajakta            31             57
3     Maths   SCH2     Milan             2             62
4   Science   SCH1     Daisy             3             70
5   Science   SCH1  Sangeeta             4             25
6   Science   SCH2    Bhoomi            24             94
7   Science   SCH2      Jone            31             57
8   English   SCH1      Jamy             2             62
9   English   SCH2    Satiah             3             70
10  English   SCH2      Rani             2             62
11  English   SCH2       Ali             3             70

View a grouping

Use list() to show what a grouping looks like

In [3]:
list(df['preTestScore'].groupby(df['Subject']))
Out[3]:
[('English', 8     2
  9     3
  10    2
  11    3
  Name: preTestScore, dtype: int64), ('Maths', 0     4
  1    24
  2    31
  3     2
  Name: preTestScore, dtype: int64), ('Science', 4     3
  5     4
  6    24
  7    31
  Name: preTestScore, dtype: int64)]

The result shows the preTestScore values for each subject, along with their row indices.
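
Besides list(), a grouping can also be inspected through its groups attribute and its get_group() method. The following is a minimal sketch on a reduced copy of the sample data (two subjects only); the variable names group_labels and maths_scores are made up for this illustration.

```python
import pandas as pd

# Reduced version of the sample data (two subjects only), for illustration
df = pd.DataFrame({
    'Subject': ['Maths', 'Maths', 'Science', 'Science'],
    'preTestScore': [4, 24, 3, 4],
})

grouped = df['preTestScore'].groupby(df['Subject'])

# .groups maps each group key to the row labels that belong to it
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}
print(group_labels)

# get_group() returns the values of a single group as a Series
maths_scores = grouped.get_group('Maths')
print(maths_scores)
```

This is handy when only one group is of interest and printing every group would be too much output.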

Descriptive statistics by group

In [4]:
df['preTestScore'].groupby(df['Subject']).describe()
Out[4]:
         count   mean        std  min   25%   50%    75%   max
Subject
English    4.0   2.50   0.577350  2.0  2.00   2.5   3.00   3.0
Maths      4.0  15.25  14.453950  2.0  3.50  14.0  25.75  31.0
Science    4.0  15.50  14.153916  3.0  3.75  14.0  25.75  31.0

Here describe() is the API that generates descriptive statistics. It is applied to the output of df['preTestScore'].groupby(df['Subject']), so the statistics are computed separately for each subject.
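
When only a few of these statistics are needed, agg() can be used with a list of function names instead of describe(). The sketch below assumes a reduced copy of the sample data (Maths and English only); the name summary is just an illustrative choice.

```python
import pandas as pd

# Reduced version of the sample data (Maths and English rows only)
df = pd.DataFrame({
    'Subject': ['Maths'] * 4 + ['English'] * 4,
    'preTestScore': [4, 24, 31, 2, 2, 3, 2, 3],
})

# Request only the statistics of interest instead of the full describe() table
summary = df.groupby('Subject')['preTestScore'].agg(['count', 'mean', 'min', 'max'])
print(summary)
```

Each name in the list becomes one column of the result, with one row per subject.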

Group the entire dataframe by Subject and School

In [5]:
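# numeric_only=True skips the text column 'name', which newer pandas versions refuse to average
df.groupby(['Subject', 'School']).mean(numeric_only=True)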
Out[5]:
                preTestScore  postTestScore
Subject School
English SCH1        2.000000      62.000000
        SCH2        2.666667      67.333333
Maths   SCH1       14.000000      59.500000
        SCH2       16.500000      59.500000
Science SCH1        3.500000      47.500000
        SCH2       27.500000      75.500000
In [6]:
# Same aggregation with the key order swapped; 'name' is again skipped
df.groupby(['School', 'Subject']).mean(numeric_only=True)
Out[6]:
                preTestScore  postTestScore
School Subject
SCH1   English      2.000000      62.000000
       Maths       14.000000      59.500000
       Science      3.500000      47.500000
SCH2   English      2.666667      67.333333
       Maths       16.500000      59.500000
       Science     27.500000      75.500000

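Here mean() is the API that calculates the mean of each group produced by groupby(). Comparing the two results above, one can see that the order of the keys matters in groupby(): it decides which column forms the outer level of the index, while the computed means stay the same.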
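
To check that the key order changes only the index layout and not the numbers, the two results can be compared after swapping the index levels. This is a small sketch on a reduced copy of the sample data (Maths and Science only); by_subject, by_school and same_values are names chosen for this illustration.

```python
import pandas as pd

# Reduced version of the sample data (Maths and Science rows only)
df = pd.DataFrame({
    'Subject': ['Maths'] * 4 + ['Science'] * 4,
    'School': ['SCH1', 'SCH1', 'SCH2', 'SCH2'] * 2,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31],
})

by_subject = df.groupby(['Subject', 'School'])['preTestScore'].mean()
by_school = df.groupby(['School', 'Subject'])['preTestScore'].mean()

# Index tuples follow the key order passed to groupby()
print(by_subject.loc[('Maths', 'SCH1')])
print(by_school.loc[('SCH1', 'Maths')])

# Swapping the levels and re-sorting gives back the same Series
same_values = by_school.swaplevel().sort_index().equals(by_subject)
print(same_values)
```

So the choice of key order is mostly about which level is more convenient to read or to select from first.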

Iterating over groups

In [7]:

# Group the dataframe by Subject, and for each Subject,
for name, group in df.groupby('Subject'): 
    # print the name of the Subject
    print(name)
    # print the data of that Subject
    print(group)
English
    Subject School    name  preTestScore  postTestScore
8   English   SCH1    Jamy             2             62
9   English   SCH2  Satiah             3             70
10  English   SCH2    Rani             2             62
11  English   SCH2     Ali             3             70
Maths
  Subject School      name  preTestScore  postTestScore
0   Maths   SCH1    Hemant             4             25
1   Maths   SCH1     Hetav            24             94
2   Maths   SCH2  Prajakta            31             57
3   Maths   SCH2     Milan             2             62
Science
   Subject School      name  preTestScore  postTestScore
4  Science   SCH1     Daisy             3             70
5  Science   SCH1  Sangeeta             4             25
6  Science   SCH2    Bhoomi            24             94
7  Science   SCH2      Jone            31             57
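
The same loop also works with more than one key, in which case the group name is a tuple of the key values. The sketch below uses a reduced copy of the sample data and collects the average score gain (postTestScore minus preTestScore) per group; the gains dictionary is a hypothetical helper for this illustration, not part of pandas.

```python
import pandas as pd

# Reduced version of the sample data (one row per Subject/School pair)
df = pd.DataFrame({
    'Subject': ['Maths', 'Maths', 'Science', 'Science'],
    'School': ['SCH1', 'SCH2', 'SCH1', 'SCH2'],
    'preTestScore': [4, 31, 3, 24],
    'postTestScore': [25, 57, 70, 94],
})

# With several keys, each group name is a tuple of the key values
gains = {}
for (subject, school), group in df.groupby(['Subject', 'School']):
    # Average improvement from pre-test to post-test within this group
    gain = (group['postTestScore'] - group['preTestScore']).mean()
    gains[(subject, school)] = float(gain)

print(gains)
```

For simple aggregations like this one, a direct call such as mean() on the grouped object is usually shorter; the explicit loop is mainly useful when each group needs custom handling.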
