Reading CSV/TSV files with pandas

饺子大人 2022-05-23

Introduction

To read CSV and TSV files into a pandas.DataFrame, use the pandas functions read_csv() or read_table().

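This article explains the following topics.

The difference between read_csv() and read_table()
Reading a CSV without a header
Reading a CSV with a header
Reading a CSV with an index
Specifying (selecting) the columns to read
Skipping (excluding) rows
Reading with a specified dtype
Handling NaN missing values
Reading files compressed with zip etc.
Reading TSV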

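The difference between read_csv() and read_table()

The functions pd.read_csv() and pd.read_table() do the same thing; only the default delimiter differs.

In read_csv() the delimiter is a comma (,), and in read_table() it is a tab (\t).

In the pandas source code at the time of writing, both are built from the same function.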

read_csv = _make_parser_function('read_csv', sep=',')
read_csv = Appender(_read_csv_doc)(read_csv)

read_table = _make_parser_function('read_table', sep='\t')
read_table = Appender(_read_table_doc)(read_table)

To read a csv file (comma-separated), use read_csv(). To read a tsv file (tab-separated), read_table() works as well.

If the delimiter is neither a comma nor a tab, set it with the sep (or delimiter) parameter.
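As a minimal sketch of a custom delimiter, the following reads semicolon-separated text. The data is made up for illustration and is held in memory with io.StringIO rather than read from a file.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, kept in memory instead of a file
text = "a;b;c\n1;2;3\n4;5;6\n"

# sep tells the parser which character separates the fields
df_semicolon = pd.read_csv(io.StringIO(text), sep=';')
print(df_semicolon)
```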

The rest of this article uses read_csv(), but everything applies to read_table() as well.

Reading a CSV without a header

Read the following csv file, which has no header.

11,12,13,14
21,22,23,24
31,32,33,34

If no parameters are set, the first row is treated as the header and its values become the column names (columns).

df = pd.read_csv('./data/03/sample.csv')
print(df)
#    11  12  13  14
# 0  21  22  23  24
# 1  31  32  33  34

print(df.columns)
# Index(['11', '12', '13', '14'], dtype='object')

With header=None, sequential numbers are assigned as the column names.

df_none = pd.read_csv('./data/03/sample.csv', header=None)
print(df_none)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

To set arbitrary column names, use the names parameter, for example names=('A', 'B', 'C', 'D'). Pass a list or a tuple.

df_names = pd.read_csv('./data/03/sample.csv', names=('A', 'B', 'C', 'D'))
print(df_names)
#     A   B   C   D
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

Reading a CSV with a header

Read the following csv file, which has a header.

a,b,c,d
11,12,13,14
21,22,23,24
31,32,33,34

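Specify the row number of the header, counting from 0, for example header=0. The default, header='infer', behaves like header=0 when names is not given, so if the first row is the header you get the same result without specifying anything.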

df_header = pd.read_csv('./data/03/sample_header.csv')
print(df_header)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

df_header_0 = pd.read_csv('./data/03/sample_header.csv', header=0)
print(df_header_0)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

header can also point at a later row. That row becomes the header, and the rows above it are ignored.

df_header_2 = pd.read_csv('./data/03/sample_header.csv', header=2)
print(df_header_2)
#    21  22  23  24
# 0  31  32  33  34

Reading a CSV with an index

Read the following csv file, which has both a header and an index (a column of row labels).

,a,b,c,d
ONE,11,12,13,14
TWO,21,22,23,24
THREE,31,32,33,34

If nothing is specified, the index column is not recognized as an index.

df_header_index = pd.read_csv('./data/03/sample_header_index.csv')
print(df_header_index)
#   Unnamed: 0   a   b   c   d
# 0        ONE  11  12  13  14
# 1        TWO  21  22  23  24
# 2      THREE  31  32  33  34

print(df_header_index.index)
# RangeIndex(start=0, stop=3, step=1)

Specify the number of the column to use as the index, counting from 0, for example index_col=0.

df_header_index_col = pd.read_csv('./data/03/sample_header_index.csv',
                                  index_col=0)
print(df_header_index_col)
#         a   b   c   d
# ONE    11  12  13  14
# TWO    21  22  23  24
# THREE  31  32  33  34

print(df_header_index_col.index)
# Index(['ONE', 'TWO', 'THREE'], dtype='object')

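Specifying (selecting) the columns to read

To read only specific columns, use the usecols parameter. Pass the column numbers to read as a list. Use a list even when there is only one column.

df_none_usecols = pd.read_csv('./data/03/sample.csv',
                              header=None, usecols=[1, 3])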
print(df_none_usecols)
#     1   3
# 0  12  14
# 1  22  24
# 2  32  34

df_none_usecols = pd.read_csv('./data/03/sample.csv',
                              header=None, usecols=[2])
print(df_none_usecols)
#     2
# 0  13
# 1  23
# 2  33

Columns can also be specified by name instead of by number.

df_header_usecols = pd.read_csv('./data/03/sample_header.csv',
                                usecols=['a', 'c'])
print(df_header_usecols)
#     a   c
# 0  11  13
# 1  21  23
# 2  31  33

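To exclude specific columns, an anonymous function (lambda expression) is convenient. The function receives each column name, and only the columns for which it returns True are read. This is much easier than listing many column numbers when you want to drop a few columns from a file that has many.

df_header_usecols = pd.read_csv('./data/03/sample_header.csv',
                                usecols=lambda x: x != 'b')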
print(df_header_usecols)
#     a   c   d
# 0  11  13  14
# 1  21  23  24
# 2  31  33  34

df_header_usecols = pd.read_csv('./data/03/sample_header.csv', 
                              usecols=lambda x: x not in ['a', 'c'])
print(df_header_usecols)
#     b   d
# 0  12  14
# 1  22  24
# 2  32  34

When usecols is combined with index_col, the column specified by index_col must also be included in usecols.

df_index_usecols = pd.read_csv('./data/03/sample_header_index.csv',
                              index_col=0, usecols=[0, 1, 3])
print(df_index_usecols)
#         a   c
# ONE    11  13
# TWO    21  23
# THREE  31  33
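The same combination also works with names instead of numbers. The sketch below assumes a hypothetical file whose index column is labelled id, and uses a lambda that keeps every column except b. Because the lambda returns True for id, the index column stays available to index_col.

```python
import io
import pandas as pd

# Hypothetical data with a named index column "id"
text = "id,a,b,c,d\nONE,11,12,13,14\nTWO,21,22,23,24\n"

# The callable keeps "id", so index_col can still refer to it by name
df_named = pd.read_csv(io.StringIO(text), index_col='id',
                       usecols=lambda name: name != 'b')
print(df_named)
```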

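Skipping (excluding) rows

skiprows

To skip (exclude) specific rows while reading, use the skiprows parameter. If an integer is passed to skiprows, that many lines at the beginning of the file are skipped.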

df_none = pd.read_csv('./data/03/sample.csv', header=None)
print(df_none)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

df_none = pd.read_csv('./data/03/sample.csv', header=None, skiprows=2)
print(df_none)
#     0   1   2   3
# 0  31  32  33  34

A list of row numbers to skip can also be given. Unlike usecols, this names the rows to skip, not the rows to read. Use a list even for a single row.

df_none_skiprows = pd.read_csv('./data/03/sample.csv',
                               header=None, skiprows=[0, 2])
print(df_none_skiprows)
#     0   1   2   3
# 0  21  22  23  24

df_none_skiprows = pd.read_csv('./data/03/sample.csv',
                               header=None, skiprows=[1])
print(df_none_skiprows)
#     0   1   2   3
# 0  11  12  13  14
# 1  31  32  33  34

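To read only specific rows, an anonymous function (lambda expression) is convenient. The function receives each row number, and the rows for which it returns True are skipped. This is much easier than listing every row to skip when you want only a few rows from a file with many.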

df_none_skiprows = pd.read_csv('./data/03/sample.csv', header=None,
                           skiprows=lambda x: x not in [0, 2])
print(df_none_skiprows)
#     0   1   2   3
# 0  11  12  13  14
# 1  31  32  33  34

Note that if the file has a header, the header row must be counted as well.

df_header_skiprows = pd.read_csv('./data/03/sample_header.csv', skiprows=[1])
print(df_header_skiprows)
#     a   b   c   d
# 0  21  22  23  24
# 1  31  32  33  34

df_header_skiprows = pd.read_csv('./data/03/sample_header.csv', skiprows=[0, 3])
print(df_header_skiprows)
#    11  12  13  14
# 0  21  22  23  24

Note that skiprows cannot be given row names, even when an index is specified.
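As a sketch of a callable skiprows on a file with a header, the following keeps the header (row 0) and skips every even-numbered row after it. The data is made up and held in memory. The row numbers passed to the function are line positions in the file, so the header counts as row 0.

```python
import io
import pandas as pd

# Hypothetical file: row 0 is the header, rows 1-4 are data
text = "a,b\n11,12\n21,22\n31,32\n41,42\n"

# Skip rows 2 and 4; row 0 (the header) is never skipped because i > 0 fails
df_every_other = pd.read_csv(io.StringIO(text),
                             skiprows=lambda i: i > 0 and i % 2 == 0)
print(df_every_other)
```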

skipfooter

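To skip lines at the end of the file, use the skipfooter parameter. Pass the number of lines to skip as an integer. Depending on the environment, the following warning may appear, so specify engine='python'.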

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.

df_none_skipfooter = pd.read_csv('./data/03/sample.csv', header=None,
                                 skipfooter=1, engine='python')
print(df_none_skipfooter)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24

nrows

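It is also possible to read only the first few rows, using the nrows parameter. This is useful for checking the contents of a large file.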

df_none_nrows = pd.read_csv('./data/03/sample.csv', header=None, nrows=2)
print(df_none_nrows)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24
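To illustrate the preview use case, the following sketch builds a 1,000-row CSV in memory as a stand-in for a large file and reads only its first five data rows. The column names id and value are invented for the example.

```python
import io
import pandas as pd

# Build a hypothetical 1,000-row CSV in memory to stand in for a large file
lines = ["id,value"] + [f"{i},{i * i}" for i in range(1000)]
big_csv = "\n".join(lines)

# nrows counts data rows, so the header is read in addition to these 5 rows
df_preview = pd.read_csv(io.StringIO(big_csv), nrows=5)
print(df_preview.shape)
```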

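Reading with a specified dtype

In a pandas.DataFrame every column has a data type (dtype), which can be converted (cast) with the astype() method.

Take the following file as an example.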

,a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z

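By default, a sequence of digits that starts with 0 is treated as a number rather than a string, whether or not it is quoted, and the leading zeros are dropped.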

df_default = pd.read_csv('./data/03/sample_header_index_dtype.csv', index_col=0)
print(df_default)
#        a    b      c  d
# ONE    1    1  100.0  x
# TWO    2   20    NaN  y
# THREE  3  300  300.0  z

print(df_default.dtypes)
# a      int64
# b      int64
# c    float64
# d     object
# dtype: object

print(df_default.applymap(type))  # applymap is deprecated since pandas 2.1; DataFrame.map is its replacement
#                    a              b                c              d
# ONE    <class 'int'>  <class 'int'>  <class 'float'>  <class 'str'>
# TWO    <class 'int'>  <class 'int'>  <class 'float'>  <class 'str'>
# THREE  <class 'int'>  <class 'int'>  <class 'float'>  <class 'str'>

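To keep such values as strings, leading zeros included, specify the dtype parameter of read_csv().

If a single data type is passed to dtype, every column, including the one specified by index_col, is converted to that type when read. For example, with dtype=str all columns are cast to strings. Even in this case, however, missing values remain NaN of type float.

df_str = pd.read_csv('./data/03/sample_header_index_dtype.csv',
                     index_col=0, dtype=str)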
print(df_str)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_str.dtypes)
# a    object
# b    object
# c    object
# d    object
# dtype: object

print(df_str.applymap(type))
#                    a              b                c              d
# ONE    <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'float'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>

The same is true for dtype=object.

df_object = pd.read_csv('./data/03/sample_header_index_dtype.csv',
                        index_col=0, dtype=object)
print(df_object)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_object.dtypes)
# a    object
# b    object
# c    object
# d    object
# dtype: object

print(df_object.applymap(type))
#                    a              b                c              d
# ONE    <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'float'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>

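Note that specifying a type in dtype that the data cannot be converted to raises an error. In this example the error occurs when the string index column specified by index_col is converted to the integer type int.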

# df_int = pd.read_csv('data/src/sample_header_index_dtype.csv',
#                      index_col=0, dtype=int)
# ValueError: invalid literal for int() with base 10: 'ONE'

To convert the column types of a pandas.DataFrame after reading, pass a dictionary to the astype() method.

df_str_cast = df_str.astype({'a': int})
print(df_str_cast)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_str_cast.dtypes)
# a     int64
# b    object
# c    object
# d    object
# dtype: object

When reading with read_csv(), column types can also be given to dtype as a dictionary. The types of columns not listed are chosen automatically.

df_str_col = pd.read_csv('./data/03/sample_header_index_dtype.csv',
                     index_col=0, dtype={'b': str, 'c': str})
print(df_str_col)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_str_col.dtypes)
# a     int64
# b    object
# c    object
# d    object
# dtype: object

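Column numbers can be used as keys as well as column names. Note that when an index column is specified, the column numbers must be counted with the index column included.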

df_str_col_num = pd.read_csv('./data/03/sample_header_index_dtype.csv',
                     index_col=0, dtype={2: str, 3: str})
print(df_str_col_num)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_str_col_num.dtypes)
# a     int64
# b    object
# c    object
# d    object
# dtype: object
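A common reason to set dtype per column is an identifier with leading zeros. The sketch below uses made-up data with a code column that must stay a string and a qty column that is left to automatic inference.

```python
import io
import pandas as pd

# Hypothetical data: "code" has leading zeros that must be preserved
text = "code,qty\n001,10\n020,20\n"

# Only "code" is forced to str; "qty" is still inferred as an integer column
df_codes = pd.read_csv(io.StringIO(text), dtype={'code': str})
print(df_codes)
```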

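Handling NaN missing values

By default, read_csv() and read_table() treat certain values as the missing value NaN.

Values such as the empty string '', the strings 'NaN' and 'nan', and 'null' are interpreted as NaN by default. The documentation lists them as follows:

By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

Use the following file to check the behavior.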

,a,b
ONE,,NaN
TWO,-,nan
THREE,null,N/A

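If you read the file with the defaults, without setting any parameters, and check the result with the isnull() method, you can see that every value except '-' is treated as the missing value NaN.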

df_nan = pd.read_csv('./data/03/sample_header_index_nan.csv', index_col=0)
print(df_nan)
#          a   b
# ONE    NaN NaN
# TWO      - NaN
# THREE  NaN NaN

print(df_nan.isnull())
#            a     b
# ONE     True  True
# TWO    False  True
# THREE   True  True

To treat values other than the defaults as missing, use the na_values parameter.

df_nan_set_na = pd.read_csv('./data/03/sample_header_index_nan.csv',
                            index_col=0, na_values='-')
print(df_nan_set_na)
#         a   b
# ONE   NaN NaN
# TWO   NaN NaN
# THREE NaN NaN

print(df_nan_set_na.isnull())
#           a     b
# ONE    True  True
# TWO    True  True
# THREE  True  True

If keep_default_na is set to False and values are given to na_values, only the values given to na_values are treated as missing. The default values are no longer treated as missing unless they are listed in na_values.

df_nan_set_na_no_keep = pd.read_csv('./data/03/sample_header_index_nan.csv',
                                    index_col=0,
                                    na_values=['-', 'NaN', 'null'],
                                    keep_default_na=False)
print(df_nan_set_na_no_keep)
#          a    b
# ONE         NaN
# TWO    NaN  nan
# THREE  NaN  N/A

print(df_nan_set_na_no_keep.isnull())
#            a      b
# ONE    False   True
# TWO     True  False
# THREE   True  False

If na_filter is set to False, all values are read as they are and none are treated as missing, regardless of what na_values and keep_default_na specify.

df_nan_no_filter = pd.read_csv('./data/03/sample_header_index_nan.csv',
                               index_col=0, na_filter=False)
print(df_nan_no_filter)
#           a    b
# ONE          NaN
# TWO       -  nan
# THREE  null  N/A

print(df_nan_no_filter.isnull())
#            a      b
# ONE    False  False
# TWO    False  False
# THREE  False  False
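na_values also accepts a dictionary, so a marker can be treated as missing in one column only. The sketch below uses made-up data in which '-' means "missing" in column a but is a legitimate value in column b.

```python
import io
import pandas as pd

# Hypothetical data: "-" should count as missing only in column "a"
text = "a,b\n-,-\n1,x\n"

# Dictionary keys are column names; values are the extra NaN markers per column
df_per_col = pd.read_csv(io.StringIO(text), na_values={'a': ['-']})
print(df_per_col.isnull())
```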

Reading files compressed with zip etc.

A csv file compressed with zip or a similar format can be read as it is.

df_zip = pd.read_csv('./data/03/sample_header.zip')
print(df_zip)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

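If the extension is .gz, .bz2, .zip, or .xz, the compression is detected and the file is decompressed automatically. If the extension is something else, pass one of the strings 'gzip', 'bz2', 'zip', or 'xz' explicitly to the compression parameter.
Note that only an archive containing a single csv file is supported. If several files are compressed together, an error occurs.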
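As a sketch of the explicit case, the following writes gzip-compressed CSV text to a temporary file whose name does not end in .gz, then reads it back with compression='gzip'. The file name sample.dat and the data are made up for the example.

```python
import gzip
import os
import tempfile

import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name without a .gz extension, so nothing can be inferred
    path = os.path.join(tmp, "sample.dat")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(csv_text)

    # The compression format is stated explicitly
    df_gz = pd.read_csv(path, compression="gzip")

print(df_gz)
```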

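Reading TSV

As mentioned at the beginning, read_table() can be used to read a tsv (tab-separated) file.

Consider a file like the following (the fields are tab-separated; the tabs appear as spaces here).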

a b c d
ONE 11 12 13 14
TWO 21 22 23 24
THREE 31 32 33 34

The parameters are the same as for read_csv().

df_tsv = pd.read_table('./data/03/sample_header_index.tsv', index_col=0)
print(df_tsv)
#         a   b   c   d
# ONE    11  12  13  14
# TWO    21  22  23  24
# THREE  31  32  33  34

The file can also be read with read_csv() by setting the delimiter to a tab, \t.

df_tsv_sep = pd.read_csv('./data/03/sample_header_index.tsv', 
                            index_col=0, sep='\t')
print(df_tsv_sep)
#         a   b   c   d
# ONE    11  12  13  14
# TWO    21  22  23  24
# THREE  31  32  33  34
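To confirm that the two calls are interchangeable, the following sketch reads the same tab-separated text both ways and compares the results. The data is a shortened, in-memory version of the sample above, with the tabs written explicitly as \t.

```python
import io
import pandas as pd

# Tab-separated text; the header line starts with an empty index field
tsv_text = "\ta\tb\nONE\t11\t12\nTWO\t21\t22\n"

df_table = pd.read_table(io.StringIO(tsv_text), index_col=0)
df_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

# Both calls produce the same DataFrame
print(df_table.equals(df_csv))
# True
```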