Handling Missing Data in pandas - [Lao Yu Learns pandas]

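Suppose our dataset contains missing values. How should we handle them?

Dropping rows or columns with missing values

First, we build a dataset that contains missing values: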

import pandas as pd
import numpy as np

dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
# Mark two cells as missing
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan
print("data:")
print(data)

Here the missing values are set with np.nan. The output is:

data:
             A     B     C   D
2017-01-08   0   NaN   2.0   3
2017-01-09   4   5.0   NaN   7
2017-01-10   8   9.0  10.0  11
2017-01-11  12  13.0  14.0  15
2017-01-12  16  17.0  18.0  19
2017-01-13  20  21.0  22.0  23

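Dropping missing data

The dropna function drops the rows or columns that contain missing values.

Here we drop the rows that contain missing values as an example: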

import pandas as pd
import numpy as np

dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
# Mark two cells as missing
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan
print("data:")
print(data)
print("Result:")
# axis=0 drops every row that contains a NaN
print(data.dropna(axis=0))

The output is:

data:
             A     B     C   D
2017-01-08   0   NaN   2.0   3
2017-01-09   4   5.0   NaN   7
2017-01-10   8   9.0  10.0  11
2017-01-11  12  13.0  14.0  15
2017-01-12  16  17.0  18.0  19
2017-01-13  20  21.0  22.0  23
Result:
             A     B     C   D
2017-01-10   8   9.0  10.0  11
2017-01-11  12  13.0  14.0  15
2017-01-12  16  17.0  18.0  19
2017-01-13  20  21.0  22.0  23

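This drops the rows 2017-01-08 and 2017-01-09, which contain NaN.

The parameters of dropna include:

axis: 0 = drop rows, 1 = drop columns

how: 'any' (the default) = drop a row or column if it contains at least one NaN, 'all' = drop it only if all of its values are NaN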
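
The two parameters above can be tried out side by side. The following is a minimal sketch, not part of the original example: it assumes the frame is created as float from the start (so NaN can be stored without changing any column's dtype), and the names with_empty_row, by_column, only_empty_rows, and any_nan_rows are hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-01-08", periods=6)
# Assumption: a float frame, unlike the integer frame used in the article.
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

# axis=1 drops every column that contains a NaN (B and C here).
by_column = data.dropna(axis=1)
print(by_column.columns.tolist())  # ['A', 'D']

# Blank out one whole row to see the difference between 'any' and 'all'.
with_empty_row = data.copy()
with_empty_row.iloc[2, :] = np.nan

# how='all' drops only the row whose values are all NaN.
only_empty_rows = with_empty_row.dropna(axis=0, how="all")
print(len(only_empty_rows))  # 5

# how='any' drops every row that has at least one NaN.
any_nan_rows = with_empty_row.dropna(axis=0, how="any")
print(len(any_nan_rows))  # 3
```

In this sketch how='all' keeps the two partially filled rows, which is useful when only completely empty records should be discarded.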

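Replacing missing values with other values

When handling missing values, we can also replace them with another value, using the fillna function.

For example, to set the missing values to -1: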

import pandas as pd
import numpy as np

dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
# Mark two cells as missing
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan
print("data:")
print(data)
# fillna returns a new DataFrame in which every NaN is replaced by -1
ret = data.fillna(-1)
print("Result:")
print(ret)

The output is:

data:
             A     B     C   D
2017-01-08   0   NaN   2.0   3
2017-01-09   4   5.0   NaN   7
2017-01-10   8   9.0  10.0  11
2017-01-11  12  13.0  14.0  15
2017-01-12  16  17.0  18.0  19
2017-01-13  20  21.0  22.0  23
Result:
             A     B     C   D
2017-01-08   0  -1.0   2.0   3
2017-01-09   4   5.0  -1.0   7
2017-01-10   8   9.0  10.0  11
2017-01-11  12  13.0  14.0  15
2017-01-12  16  17.0  18.0  19
2017-01-13  20  21.0  22.0  23
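
A single fill value is not always what we want. As a hedged sketch beyond the original example, fillna also accepts a dict that maps column names to fill values, so each column can get its own replacement. The choice of 0.0 for B and the column mean for C is only an illustration, and the names fill_values and filled are hypothetical. The frame is created as float here so that NaN can be stored without changing any column's dtype.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-01-08", periods=6)
# Assumption: a float frame, unlike the integer frame used in the article.
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

# One fill value per column: a constant for B, the mean of the
# non-missing values for C (mean() skips NaN by default).
fill_values = {"B": 0.0, "C": data["C"].mean()}
filled = data.fillna(fill_values)

print(filled.iloc[0, 1])  # value that replaced the NaN in column B
print(filled.iloc[1, 2])  # value that replaced the NaN in column C

# fillna returned a new frame; the original still has its two NaN cells.
print(int(data.isnull().sum().sum()))  # 2
```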

Checking for missing data

The isnull() function checks for missing values; every position that holds a missing value shows True:

import pandas as pd
import numpy as np

dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
# Mark two cells as missing
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan
print("data:")
print(data)
# isnull returns a boolean DataFrame of the same shape
ret = data.isnull()
print("Result:")
print(ret)

The output is:

data:
             A     B     C   D
2017-01-08   0   NaN   2.0   3
2017-01-09   4   5.0   NaN   7
2017-01-10   8   9.0  10.0  11
2017-01-11  12  13.0  14.0  15
2017-01-12  16  17.0  18.0  19
2017-01-13  20  21.0  22.0  23
Result:
                A      B      C      D
2017-01-08  False   True  False  False
2017-01-09  False  False   True  False
2017-01-10  False  False  False  False
2017-01-11  False  False  False  False
2017-01-12  False  False  False  False
2017-01-13  False  False  False  False
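
On a larger dataset the full True/False table is hard to read, so it can help to summarize it. The sketch below is an addition to the original example: it counts the missing values per column and selects only the rows that contain one. The names missing_per_column and rows_with_missing are hypothetical, and the frame is created as float so that NaN can be stored without changing any column's dtype.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-01-08", periods=6)
# Assumption: a float frame, unlike the integer frame used in the article.
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

# True counts as 1 when summed, so this gives the number of NaN per column.
missing_per_column = data.isnull().sum()
print(missing_per_column.tolist())  # [0, 1, 1, 0]

# any(axis=1) is True for each row with at least one NaN;
# using it as a mask keeps only those rows.
rows_with_missing = data[data.isnull().any(axis=1)]
print(len(rows_with_missing))  # 2
```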

If we want to know whether the data as a whole contains any missing value, we can do this:

import pandas as pd
import numpy as np

dates = pd.date_range("2017-01-08", periods=6)
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=["A", "B", "C", "D"])
# Mark two cells as missing
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan
print("data:")
print(data)
# Reduce the boolean table to a single True/False over all cells
ret = data.isnull().values.any()
print("Result:")
print(ret)

The output is:

data:
             A     B     C   D
2017-01-08   0   NaN   2.0   3
2017-01-09   4   5.0   NaN   7
2017-01-10   8   9.0  10.0  11
2017-01-11  12  13.0  14.0  15
2017-01-12  16  17.0  18.0  19
2017-01-13  20  21.0  22.0  23
Result:
True
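
The whole-dataset check can also be wrapped in a small function and combined with a per-column check. This is only a sketch under stated assumptions: has_missing is a hypothetical helper name, not a pandas function, and the frame is created as float so that NaN can be stored without changing any column's dtype.

```python
import numpy as np
import pandas as pd


def has_missing(frame):
    """Return True if the frame holds at least one missing value (hypothetical helper)."""
    return bool(frame.isnull().values.any())


dates = pd.date_range("2017-01-08", periods=6)
# Assumption: a float frame, unlike the integer frame used in the article.
data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=["A", "B", "C", "D"],
)
data.iloc[0, 1] = np.nan
data.iloc[1, 2] = np.nan

print(has_missing(data))           # True
print(has_missing(data.dropna()))  # False

# any() without arguments reduces each column to one boolean.
columns_with_missing = data.isnull().any()
print(columns_with_missing.tolist())  # [False, True, True, False]
```

A check like this is handy before and after a cleaning step such as dropna or fillna, to confirm that no missing values remain.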