
Guide to using pandas join in Python

In this article we look at how to restructure data so that it suits a given purpose. We will use several common functions, such as groupby, concat, aggregate and append, and work through examples on real datasets to understand them better. (The examples were run on Python 3.7.0 and pandas 0.23.4.)


Topics covered:

Grouping of data
Merging and concatenating data

I. Grouping of data

Groupby is an operation on DataFrames that, when executed, performs three steps:

Split the dataset
Analyze the data (apply a function to each group)
Group the data (combine the results)
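The three steps above can be sketched by hand and compared with a single groupby call. This is only an illustration: the toy DataFrame and the "Goals" column are made up for the example and are not part of euro_winners.csv.

```python
import pandas as pd

# Hypothetical toy data (not the article's dataset).
df = pd.DataFrame({
    "Nation": ["Spain", "Italy", "Spain", "England", "Italy", "Spain"],
    "Goals": [3, 1, 2, 4, 2, 1],
})

# Step 1 (split): one sub-DataFrame per distinct key.
pieces = {nation: part for nation, part in df.groupby("Nation")}

# Step 2 (apply): compute something on each piece independently.
totals = {nation: int(part["Goals"].sum()) for nation, part in pieces.items()}

# Step 3 (combine): collect the per-group results into one object.
manual = pd.Series(totals, name="Goals").sort_index()

# groupby performs all three steps in one expression.
auto = df.groupby("Nation")["Goals"].sum()

print(manual)
print(auto)
```

Both objects hold the same per-nation totals; the manual version only makes the intermediate pieces visible.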

The groupby function:

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)

Commonly used parameters:

by : mapping, function, label, or list of labels. Determines the groups for the groupby.
axis : int, 0: group along rows, 1: group along columns (default 0).
level : int/string, level number or level name, default None. Used with a MultiIndex to specify which index level to group by.
as_index : boolean, default True. When True, the grouped column becomes the key (index) of the result.
sort : boolean, default True. Sorts the group keys; set it to False for better performance.
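A small sketch of how as_index and sort change the shape and order of the result. The four-row frame is a made-up stand-in that borrows the column names Nation and Winners from the article's dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset used later in the article.
df = pd.DataFrame({
    "Nation": ["Spain", "Italy", "Spain", "England"],
    "Winners": ["Real Madrid", "Milan", "Barcelona", "Liverpool"],
})

# Defaults (as_index=True, sort=True): keys become the index, in sorted order.
by_index = df.groupby("Nation")["Winners"].count()

# as_index=False: the key stays an ordinary column of a DataFrame.
by_column = df.groupby("Nation", as_index=False)["Winners"].count()

# sort=False: groups appear in order of first occurrence.
unsorted_keys = df.groupby("Nation", sort=False)["Winners"].count()

print(by_index)
print(by_column)
print(unsorted_keys)
```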

Grouping by one column

Now let's work with a dataset of the clubs that have won the UEFA Champions League: euro_winners.csv

In [1]: import pandas as pd
        uefaDF=pd.read_csv('./euro_winners.csv')
In [2]: uefaDF.head()

In [4]: nationsGrp=uefaDF.groupby('Nation');
         type(nationsGrp)
Out[5]: pandas.core.groupby.DataFrameGroupBy

The type of nationsGrp is pandas.core.groupby.DataFrameGroupBy. The column used for the groupby becomes the key. Use the "groups" attribute to inspect it:

In [6]: nationsGrp.groups
Out[7]:{'England': Int64Index([12, 21, 22, 23, 24, 25, 26, 28, 43, 49, 52, 56], dtype='int64'),
'France': Int64Index([37], dtype='int64'),
'Germany': Int64Index([18, 19, 20, 27, 41, 45, 57], dtype='int64'),
'Italy': Int64Index([7, 8, 9, 13, 29, 33, 34, 38, 40, 47, 51, 54], dtype='int64'),
'Netherlands': Int64Index([14, 15, 16, 17, 32, 39], dtype='int64'),
'Portugal': Int64Index([5, 6, 31, 48], dtype='int64'),
'Romania': Int64Index([30], dtype='int64'),
'Scotland': Int64Index([11], dtype='int64'),
'Spain': Int64Index([0, 1, 2, 3, 4, 10, 36, 42, 44, 46, 50, 53, 55], dtype='int64'),
'Yugoslavia': Int64Index([35], dtype='int64')}

Essentially, this is a dictionary that maps each distinct group name to the corresponding row indexes.

To get the number of nations with a winning club (the number of groups) and the number of title-winning entries for each nation (the number of records in each group):

In [8]: len(nationsGrp.groups)
Out[9]: 10
In [10]: nationWins=nationsGrp['Winners'].count();
        nationWins
Out[11]:
Nation
England        12
France          1
Germany         7
Italy          12
Netherlands     6
Portugal        4
Romania         1
Scotland        1
Spain          13
Yugoslavia      1
dtype: int64

To convert the result to a DataFrame, use reset_index():

In [12]: nationWins.reset_index(name='FC_count')
Out[13]: 
        Nation  FC_count
0      England        12
1       France         1
2      Germany         7
3        Italy        12
4  Netherlands         6
5     Portugal         4
6      Romania         1
7     Scotland         1
8        Spain        13
9   Yugoslavia         1

The 'name' argument sets the name of the result column, here the number of title-winning entries.

Instead of calling reset_index(), we can pass as_index=False to groupby, which gives an equivalent result (the count column then keeps the name 'Winners').

In [14]: nationsGrp=uefaDF.groupby('Nation', as_index=False)['Winners'].count();

Grouping by several columns

To see how many times each club has won, broken down by nation:

In [15]: winnersGrp =uefaDF.groupby(['Nation','Winners'])
          clubWins=winnersGrp.size()
          clubWins
Out[16]: 
Nation       Winners          
England      Aston Villa          1
             Chelsea              1
             Liverpool            5
             Manchester United    3
             Nottingham Forest    2
France       Marseille            1
Germany      Bayern Munich        5
             Borussia Dortmund    1
             Hamburg              1
Italy        Internazionale       3
             Juventus             2
             Milan                7
Netherlands  Ajax                 4
             Feyenoord            1
             PSV Eindhoven        1
Portugal     Benfica              2
             Porto                2
Romania      Steaua București     1
Scotland     Celtic               1
Spain        Barcelona            4
             Real Madrid          9
Yugoslavia   Red Star Belgrade    1
dtype: int64

One small difference from grouping by a single column: when grouping by several columns, they must be passed as a list (inside []).

With that, we have a first understanding of how groupby works and how to use it.

Grouping with a MultiIndex

On a DataFrame with a MultiIndex, we can group by level. Data: goal_stats_euro_leagues_2012-13.csv

In [17]: goalStatsDF=pd.read_csv('./goal_stats_euro_leagues_2012-13.csv')
         goalStatsDF=goalStatsDF.set_index(['Month','Stat'])
         goalStatsDF
Out[18]: 
                            EPL  La Liga  Serie A  Bundesliga
Month      Stat                                              
08/01/2012 MatchesPlayed   20.0       20     10.0        10.0
09/01/2012 MatchesPlayed   38.0       39     50.0        44.0
10/01/2012 MatchesPlayed   31.0       31     39.0        27.0
11/01/2012 MatchesPlayed   50.0       41     42.0        46.0
12/01/2012 MatchesPlayed   59.0       39     39.0        26.0
01/01/2013 MatchesPlayed   42.0       40     40.0        18.0
02/01/2013 MatchesPlayed   30.0       40     40.0        36.0
03/01/2013 MatchesPlayed   35.0       38     39.0        36.0
04/01/2013 MatchesPlayed   42.0       42     41.0        36.0
05/01/2013 MatchesPlayed   33.0       40     40.0        27.0
06/02/2013 MatchesPlayed    NaN       10      NaN         NaN
08/01/2012 GoalsScored     57.0       60     21.0        23.0
09/01/2012 GoalsScored    111.0      112    133.0       135.0
10/01/2012 GoalsScored     95.0       88     97.0        77.0
11/01/2012 GoalsScored    121.0      116    120.0       137.0
12/01/2012 GoalsScored    183.0      109    125.0        72.0
01/01/2013 GoalsScored    117.0      121    104.0        51.0
02/01/2013 GoalsScored     87.0      110    100.0       101.0
03/01/2013 GoalsScored     91.0      101     99.0       106.0
04/01/2013 GoalsScored    105.0      127    102.0       104.0
05/01/2013 GoalsScored     96.0      109    102.0        92.0
06/01/2013 GoalsScored      NaN       80      NaN         NaN

Following the order used in set_index, Month is level 0 and Stat is level 1, so grouping with level=1 and grouping with level='Stat' give the same result.

In [19]: monthStatGroup=goalStatsDF.groupby(level=1).count()
         monthStatGroup
Out[20]: 
               EPL  La Liga  Serie A  Bundesliga
Stat                                            
GoalsScored     10       11       10          10
MatchesPlayed   10       11       10          10

Working with the aggregate function:

In [21]: import numpy as np
       monthStatGroup=goalStatsDF.groupby(level='Stat')
       monthStatGroup.agg(np.sum)
Out[22]: 
                  EPL  La Liga  Serie A  Bundesliga
Stat                                               
GoalsScored    1063.0     1133   1003.0       898.0
MatchesPlayed   380.0      380    380.0       306.0

Note that NaN values are ignored by the agg operation.
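A minimal illustration of that NaN behaviour, on a made-up two-column frame rather than the goal statistics file. The string "sum" is used here in place of np.sum; both name the same aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per group.
df = pd.DataFrame({
    "Stat": ["GoalsScored", "GoalsScored", "MatchesPlayed", "MatchesPlayed"],
    "EPL": [57.0, np.nan, 20.0, np.nan],
})

grouped = df.groupby("Stat")["EPL"]

sums = grouped.agg("sum")      # NaN entries are skipped, not propagated
counts = grouped.agg("count")  # only non-NaN entries are counted

print(sums)
print(counts)
```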

Several functions can be passed at once:
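As a sketch of passing a list of functions to agg, the following uses a four-row excerpt whose values are copied from the goal statistics table shown earlier; the choice of "sum" and "mean" is an assumption made for illustration.

```python
import pandas as pd

# Small excerpt shaped like goalStatsDF (MultiIndex of Month and Stat).
df = pd.DataFrame(
    {"EPL": [20.0, 38.0, 57.0, 111.0], "La Liga": [20, 39, 60, 112]},
    index=pd.MultiIndex.from_tuples(
        [
            ("08/01/2012", "MatchesPlayed"),
            ("09/01/2012", "MatchesPlayed"),
            ("08/01/2012", "GoalsScored"),
            ("09/01/2012", "GoalsScored"),
        ],
        names=["Month", "Stat"],
    ),
)

# A list of function names yields one sub-column per function and column.
result = df.groupby(level="Stat").agg(["sum", "mean"])
print(result)
```

The result has two-level columns, for example ("EPL", "sum") and ("EPL", "mean").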


You can also specify which function to apply to which column:
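A sketch of per-column aggregation using a dictionary that maps column names to functions. The data is a made-up flat excerpt, and the pairing of columns with functions is arbitrary.

```python
import pandas as pd

# Hypothetical flat excerpt of the goal statistics.
df = pd.DataFrame({
    "Stat": ["MatchesPlayed", "MatchesPlayed", "GoalsScored", "GoalsScored"],
    "EPL": [20.0, 38.0, 57.0, 111.0],
    "La Liga": [20, 39, 60, 112],
})

# Keys are column names, values are the function applied to that column.
result = df.groupby("Stat").agg({"EPL": "sum", "La Liga": "mean"})
print(result)
```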


II. Merging and concatenating data

The concat function

The concat function joins pandas data structures together.

pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)

Commonly used parameters:

objs: a list of Series, DataFrame, or Panel objects
axis: int, 0: concatenate along the index (stack rows), 1: concatenate along the columns (default 0)
join: inner/outer (default outer)
ignore_index: boolean (default False). When True, the existing index values are not used during concatenation and the result is re-indexed from 0
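A minimal sketch of concat along each axis. The frames df1 and df4 borrow their names from the note later in this section, but their contents (and the extra frame) are invented for illustration.

```python
import pandas as pd

# Hypothetical frames; only the names df1/df4 come from the article.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df4 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[2, 3])
extra = pd.DataFrame({"C": ["C0", "C1"]}, index=[0, 1])

# axis=0 (default): stack the rows of df4 under df1.
rows = pd.concat([df1, df4])

# axis=1: place the columns side by side, aligned on the index.
cols = pd.concat([df1, extra], axis=1)

print(rows)
print(cols)
```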


Using ignore_index:
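A short sketch of the effect of ignore_index, with two invented frames that deliberately share the same index labels.

```python
import pandas as pd

# Hypothetical frames with overlapping index labels 0 and 1.
df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df4 = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

kept = pd.concat([df1, df4])                           # labels are kept as-is
renumbered = pd.concat([df1, df4], ignore_index=True)  # labels rebuilt from 0

print(list(kept.index))
print(list(renumbered.index))
```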


Join logic: if the join parameter is not specified, it defaults to 'outer'.
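A sketch of the default outer behaviour on two invented frames that share only column B. Columns missing from one input are filled with NaN.

```python
import pandas as pd

# Hypothetical frames: df1 has columns A and B, df4 has B and C.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df4 = pd.DataFrame({"B": ["B2", "B3"], "C": ["C2", "C3"]}, index=[2, 3])

# join='outer' is the default: the result keeps the union of the columns.
outer = pd.concat([df1, df4])
print(outer)
```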


Specifying join='inner':
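A sketch of join='inner' on two invented frames that share only column B. Only the columns present in every input survive.

```python
import pandas as pd

# Hypothetical frames: df1 has columns A and B, df4 has B and C.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df4 = pd.DataFrame({"B": ["B2", "B3"], "C": ["C2", "C3"]}, index=[2, 3])

# join='inner': keep the intersection of the columns.
inner = pd.concat([df1, df4], join="inner")
print(inner)
```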


The append function

The append function is a simplified version of concat with axis=0.

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)

Commonly used parameters:

other: a DataFrame or Series/dictionary object, or a list of these
ignore_index: boolean (default False). When True, the existing index values are not used


Note: neither concat nor append modifies df1; they only create a copy with df4 attached.
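A sketch of that behaviour with invented contents for df1 and df4. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the example uses the pd.concat equivalent, which also runs on the pandas 0.23.4 used in this article.

```python
import pandas as pd

# Hypothetical frames; only the names df1/df4 come from the article.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df4 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[2, 3])

# Equivalent of df1.append(df4) on versions that still have append.
appended = pd.concat([df1, df4])

# Equivalent of df1.append(df4, ignore_index=True).
renumbered = pd.concat([df1, df4], ignore_index=True)

# df1 itself is untouched: the result is a new object.
print(len(df1), len(appended))
```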

Appending a row to a DataFrame

As a Series/dictionary:
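One possible way to add a single row held in a Series, sketched with invented data. It uses pd.concat so that it also runs on pandas versions without DataFrame.append; the label 9 and the values are arbitrary.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])

# The Series index holds the column names; its name becomes the row label.
row = pd.Series({"A": "A9", "B": "B9"}, name=9)

# Turn the Series into a one-row DataFrame, then stack it under df1.
with_row = pd.concat([df1, row.to_frame().T])
print(with_row)
```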


As a dictionary:
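A comparable sketch for a row held in a plain dictionary, again with invented values and using pd.concat instead of DataFrame.append.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])

# Keys are column names, values are the cells of the new row.
new_row = {"A": "A9", "B": "B9"}

# Wrap the dict in a one-row DataFrame; a dict carries no row label,
# so the index is rebuilt with ignore_index=True.
with_row = pd.concat([df1, pd.DataFrame([new_row])], ignore_index=True)
print(with_row)
```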


SQL-like merging/joining

The merge function is similar to a join query in an SQL database, and a DataFrame object is similar to a table in an SQL database.

Pandas provides full-featured, high-performance join operations. These methods perform better than other open source implementations (for example base::merge.data.frame in R).

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Commonly used parameters:

left: DataFrame
right: DataFrame
how: {'left', 'right', 'outer', 'inner'}, default 'inner'
on: column names or index levels to join on (note: they must be present in both DataFrames)
left_on: column names or index levels of the left DataFrame to join on
right_on: column names or index levels of the right DataFrame to join on
sort: boolean, default False
indicator: boolean or string, default False. When True, a column named "_merge" is added (if a string is passed, it is used as the column name instead) with information on the source of each row:
- left_only: the key appears only in the left DataFrame
- right_only: the key appears only in the right DataFrame
- both: the key appears in both DataFrames
validate: string, default None.
- "one_to_one" or "1:1": checks that the merge keys are unique in both DataFrames
- "one_to_many" or "1:m": checks that the merge keys are unique in the left DataFrame
- "many_to_one" or "m:1": checks that the merge keys are unique in the right DataFrame
- "many_to_many" or "m:m": allowed, but no check is performed

Values of the "how" parameter and the equivalent SQL joins:

Merge method  SQL Join Name     Description                                    _merge
left          LEFT OUTER JOIN   Use keys from the left frame only              left_only
right         RIGHT OUTER JOIN  Use keys from the right frame only             right_only
outer         FULL OUTER JOIN   Use the union of keys from both frames         both
inner         INNER JOIN        Use the intersection of keys from both frames  both
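The four values of "how" can be compared side by side on two small frames. The frames, the column name "key" and its values are invented for this sketch.

```python
import pandas as pd

# Hypothetical frames: K1 and K2 are shared, K0 and K3 are one-sided.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

# Collect the keys that survive each kind of join.
keys_by_how = {}
for how in ["left", "right", "outer", "inner"]:
    merged = pd.merge(left, right, how=how, on="key")
    keys_by_how[how] = list(merged["key"])

for how, keys in keys_by_how.items():
    print(how, keys)
```

Rows whose key is missing on one side get NaN in the columns coming from that side.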


The "indicator" parameter
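A minimal sketch of indicator=True on an outer merge, with invented frames. The extra "_merge" column records where each key was found.

```python
import pandas as pd

# Hypothetical frames: K1 and K2 are shared, K0 and K3 are one-sided.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

merged = pd.merge(left, right, how="outer", on="key", indicator=True)

# Map each key to its source label for easy inspection.
source = merged.set_index("key")["_merge"].astype(str).to_dict()
print(merged)
```

Passing a string, for example indicator="source", would use that string as the column name instead of "_merge".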


The "validate" parameter

Pandas provides the "validate" parameter to check whether the merge keys contain duplicates. Key uniqueness is checked before the merge, which protects against memory overflows and is also a good way to make sure the data structure is as expected. If the validation passes, the result is returned; otherwise an error is raised.
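A sketch of both outcomes, using invented frames in which the key K1 is duplicated on the left side only. The error class is imported from pandas.errors, which is where recent pandas versions expose it.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical frames: K1 appears twice on the left, once on the right.
left = pd.DataFrame({"key": ["K0", "K1", "K1"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K1"], "B": [10, 20]})

# many_to_one only requires unique keys on the right side: this passes.
ok = pd.merge(left, right, on="key", validate="many_to_one")

# one_to_one requires unique keys on both sides: this raises MergeError.
try:
    pd.merge(left, right, on="key", validate="one_to_one")
    failed = False
except MergeError as exc:
    failed = True
    print("validation failed:", exc)

print(ok)
```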
