The pandas I/O API is a set of top-level reader functions, accessed like pandas.read_csv(), that generally return a pandas object. The corresponding writer functions are object methods, accessed like DataFrame.to_csv(). Below is a table of the available readers and writers.
| Format Type | Data Description | Reader | Writer |
|---|---|---|---|
| text | CSV | read_csv | to_csv |
| text | Fixed-Width Text File | read_fwf | |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | LaTeX | | Styler.to_latex |
| text | XML | read_xml | to_xml |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | OpenDocument | read_excel | |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Parquet Format | read_parquet | to_parquet |
| binary | ORC Format | read_orc | to_orc |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | SPSS | read_spss | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google BigQuery | read_gbq | to_gbq |
Here is an informal performance comparison for some of these IO methods.
Note
For examples that use the StringIO class, make sure you import it with from io import StringIO for Python 3.
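The reader/writer split described above can be sketched as follows. This is a minimal illustration, not part of the original guide: the CSV text and the column names are made up, and the data is read from an in-memory StringIO buffer rather than a file.

```python
from io import StringIO

import pandas as pd

# Illustrative CSV text; the column names are assumptions for this sketch.
csv_text = "name,score\nalice,1\nbob,2\n"

# Reader: a top-level pandas function that returns a pandas object (a DataFrame).
df = pd.read_csv(StringIO(csv_text))

# Writer: a method on the object. With no path argument, to_csv returns the CSV as a string.
round_trip = df.to_csv(index=False)
print(round_trip)
```

Passing index=False keeps the default integer index out of the output, so the written text has the same columns as the input.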
CSV & text files
The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some advanced strategies.
Parsing options
read_csv() accepts the following common arguments:
Basic
filepath_or_buffer : various
Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).
sep : str, defaults to ',' for read_csv(), '\t' for read_table()
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, so the Python engine is used and the separator is detected with Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than one character and different from '\s+' are interpreted as regular expressions and also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\\r\\t'.
delimiter : str, default None
Alternative argument name for sep.
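As a rough sketch of the two sep behaviours described above (delimiter sniffing and regex separators), the snippet below reads two small in-memory samples. The sample text and variable names are illustrative assumptions; engine="python" is passed explicitly so that pandas does not have to fall back to the Python engine on its own.

```python
from io import StringIO

import pandas as pd

# Semicolon-delimited sample (illustrative data).
semicolon_text = "a;b;c\n1;2;3\n4;5;6\n"

# sep=None asks pandas to detect the delimiter; only the Python engine supports this.
sniffed = pd.read_csv(StringIO(semicolon_text), sep=None, engine="python")

# Pipe-delimited sample with padding spaces around each pipe (illustrative data).
padded_text = "a | b | c\n1 | 2 | 3\n"

# A separator longer than one character is treated as a regular expression.
regex_split = pd.read_csv(StringIO(padded_text), sep=r"\s*\|\s*", engine="python")

print(sniffed)
print(regex_split)
```

Because regex delimiters are prone to ignoring quoted data, a plain single-character sep is usually the safer choice when the file allows it.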
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the delimiter. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.
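A small sketch of whitespace-delimited parsing, using the sep='\s+' form that the text gives as the equivalent of delim_whitespace=True. The sample data and variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Columns separated by runs of spaces of varying width (illustrative data).
spaced_text = "x  y   z\n1  2   3\n4  5   6\n"

# sep=r"\s+" treats any run of whitespace as a single delimiter.
spaced = pd.read_csv(StringIO(spaced_text), sep=r"\s+")
print(spaced)
```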
Column and index locations and names
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data. The default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.
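The three cases in the paragraph above can be compared side by side. This is a minimal sketch with made-up data; the replacement names x, y, z are assumptions chosen for the example.

```python
from io import StringIO

import pandas as pd

# Illustrative data with a header line followed by two data rows.
data = "a,b,c\n1,2,3\n4,5,6\n"

# No names passed: behaves like header=0, so the first line supplies the column names.
inferred = pd.read_csv(StringIO(data))

# names plus header=0: the first line is consumed as a header and replaced by names.
renamed = pd.read_csv(StringIO(data), names=["x", "y", "z"], header=0)

# names alone: behaves like header=None, so the first line is kept as a data row.
as_data = pd.read_csv(StringIO(data), names=["x", "y", "z"])

print(inferred.shape, renamed.shape, as_data.shape)
```

In the last case the original header line ends up as the first data row, which is why header=0 is needed when the intent is to rename existing columns.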
The header can be a list of ints that specify row locations for a MultiIndex on the columns, e.g. [0, 1, 3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, default None
List of column names to use. If the file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.
index_col : int, str, sequence of int / str, or False, optional, default None
Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
Note
index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
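A short sketch of the situation this note describes, using made-up data in which every body line ends with a trailing delimiter. The variable names are illustrative.

```python
from io import StringIO

import pandas as pd

# Each data row has a trailing comma, so the body has one more field than the header.
trailing = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

# With the default index_col, pandas treats the extra leading field as the index.
guessed = pd.read_csv(StringIO(trailing))

# index_col=False keeps the first field as column "a" and ignores the empty trailing field.
forced = pd.read_csv(StringIO(trailing), index_col=False)

print(guessed)
print(forced)
```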
The default value of None instructs pandas to guess. If the number of fields in the column header row is equal to the number of fields in the body of the data file, then a default index is used. If the body has more fields than the header, then the first columns are used as the index so that the remaining number of fields in the body equals the number of fields in the header.
The first row after the header is used to determine the number of columns that will go into the index. If the subsequent rows contain fewer columns than the first row, they are filled with NaN.
This can be avoided through usecols. This ensures that the columns are taken as is and the trailing data are ignored.
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved, use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
Using this parameter results in much faster parsing time and lower memory usage when using the C engine. The Python engine loads the data first before deciding which columns to drop.
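The list-like and callable forms of usecols can be sketched as below. The data and column names are illustrative assumptions; the point is that a list selects columns without controlling their order, while a callable is evaluated against each column name.

```python
from io import StringIO

import pandas as pd

# Illustrative three-column data.
data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3\n"

# List-like usecols: element order is ignored, so the columns come back in file order.
by_name = pd.read_csv(StringIO(data), usecols=["col3", "col1"])

# To get a specific order, index the result with the desired column list.
reordered = by_name[["col3", "col1"]]

# Callable usecols: keep the columns for which the function returns True.
by_callable = pd.read_csv(
    StringIO(data), usecols=lambda name: name.upper() in ["COL1", "COL3"]
)

print(list(by_name.columns), list(reordered.columns), list(by_callable.columns))
```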
squeeze boolean, default False
If the parsed data only contains one column then return a Series.
Deprecated since version 1.4.0: Append .squeeze("columns") to the call to read_csv to squeeze the data.
prefix str, default None
Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …
Deprecated since version 1.4.0: Use a list comprehension on the DataFrame’s columns after calling read_csv.
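The two deprecation notes above point to replacements that can be sketched as follows. The sample data and the "X" prefix are assumptions chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Replacement for squeeze=True: squeeze the single-column result afterwards.
single = "value\n1\n2\n3"
series = pd.read_csv(StringIO(single)).squeeze("columns")

# Replacement for prefix="X": rename the integer column labels after reading.
headerless = "a,b,1\nc,d,2"
df = pd.read_csv(StringIO(headerless), header=None)
df.columns = [f"X{col}" for col in df.columns]
print(df.columns.tolist())
```

Both replacements run after parsing, so they do not depend on the deprecated keyword arguments.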
mangle_dupe_cols boolean, default True
Duplicate columns will be specified as ‘X’, ‘X.1’, …, ‘X.N’, rather than ‘X’, …, ‘X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
Deprecated since version 1.5.0: The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead.
General parsing configuration
dtype Type name or dict of column -> type, default None
Data type for data or columns, e.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.
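A short sketch of the two dtype forms mentioned above, assuming pandas 1.5 or later for the defaultdict case. The column names id, code and score are hypothetical.

```python
from collections import defaultdict
from io import StringIO

import pandas as pd

# Illustrative data: "code" has leading zeros that numeric inference would drop.
data = "id,code,score\n1,007,3.5\n2,042,4.0"

# A plain dict sets the dtype only for the listed columns.
as_text = pd.read_csv(StringIO(data), dtype={"code": str})

# A defaultdict supplies a fallback dtype for every column that is not listed.
fallback = defaultdict(lambda: "float64", code=str)
mixed = pd.read_csv(StringIO(data), dtype=fallback)
print(mixed.dtypes)
```

Reading "code" as str keeps the values exactly as written in the file, while the fallback makes the unlisted id column float64 instead of the inferred integer type.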
engine {'c', 'python', 'pyarrow'}
Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.
New in version 1.4.0: The “pyarrow” engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.
converters dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
true_values list, default None
Values to consider as True.
false_values list, default None
Values to consider as False.
skipinitialspace boolean, default False
Skip spaces after delimiter.
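The following sketch combines converters, true_values, false_values and skipinitialspace on a small made-up file. The column names and the "$" stripping converter are assumptions for illustration only.

```python
from io import StringIO

import pandas as pd

# Illustrative data with a space after each delimiter.
data = "name, active, amount\nalice, yes, $10\nbob, no, $25"

df = pd.read_csv(
    StringIO(data),
    skipinitialspace=True,   # drop the space that follows each comma
    true_values=["yes"],     # parsed as True
    false_values=["no"],     # parsed as False
    # The converter receives the raw string of each cell in "amount".
    converters={"amount": lambda s: int(s.strip().lstrip("$"))},
)
print(df)
```

The converter is keyed by column label here; an integer position would work as well.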
skiprows list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise.
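As a sketch of the list-like and callable forms of skiprows, using made-up data (row index 0 is the header line):

```python
from io import StringIO

import pandas as pd

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# List-like: skip the line with index 1, i.e. the first data row.
skip_list = pd.read_csv(StringIO(data), skiprows=[1])

# Callable: skip every line whose index is odd, keeping lines 0 and 2.
skip_odd = pd.read_csv(StringIO(data), skiprows=lambda idx: idx % 2 != 0)
print(skip_odd)
```

Because the header is line 0, the callable keeps it and only one data row (line 2) survives.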
skipfooter int, default 0
Number of lines at bottom of file to skip (unsupported with engine='c').
nrows int, default None
Number of rows of file to read. Useful for reading pieces of large files.
low_memory boolean, default True
Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types, either set False or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks. (Only valid with the C parser.)
memory_map boolean, default False
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
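To illustrate the difference between nrows and chunksize mentioned above, here is a small sketch on generated data. The ten-row input and the chunk size of 4 are arbitrary choices.

```python
from io import StringIO

import pandas as pd

# Ten illustrative data rows under a two-column header.
data = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# nrows reads only the first rows and returns a single DataFrame.
head = pd.read_csv(StringIO(data), nrows=3)

# chunksize returns an iterator that yields DataFrames of at most that many rows.
chunk_sizes = [len(chunk) for chunk in pd.read_csv(StringIO(data), chunksize=4)]
print(len(head), chunk_sizes)
```

Iterating in chunks keeps only one piece of the file in memory at a time, which low_memory alone does not do.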
NA and missing data handling
na_values scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If a dict is passed, specific per-column NA values. See na values const below for a list of the values interpreted as NaN by default.
keep_default_na boolean, default True
Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified in na_values are used for parsing.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
na_filter boolean, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose boolean, default False
Indicate number of NA values placed in non-numeric columns.
skip_blank_lines boolean, default True
If True, skip over blank lines rather than interpreting as NaN values.
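The interaction of na_values, keep_default_na and na_filter can be sketched on a tiny made-up file. Here "NA" is one of the default NaN markers, and "missing" is a hypothetical custom marker.

```python
from io import StringIO

import pandas as pd

data = "a,b\nNA,missing\n1,x"

# Defaults only: "NA" becomes NaN, "missing" stays a string.
default = pd.read_csv(StringIO(data))

# Custom marker appended to the defaults: both cells become NaN.
extended = pd.read_csv(StringIO(data), na_values=["missing"])

# Custom marker only: "NA" is kept as the literal string "NA".
only_custom = pd.read_csv(StringIO(data), na_values=["missing"], keep_default_na=False)

# No NA detection at all: nothing is converted to NaN.
no_filter = pd.read_csv(StringIO(data), na_filter=False)

counts = [int(df.isna().sum().sum()) for df in (default, extended, only_custom, no_filter)]
print(counts)
```

The four reads correspond to the combinations listed for keep_default_na, plus the na_filter=False case in which the other two parameters are ignored.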
Datetime handling#
parse_dates boolean or list of ints or names or list of lists or dict, default False.
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result 'foo'.
Note
A fast-path exists for iso8601-formatted dates
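As a minimal sketch of the two simplest parse_dates forms described above, the following reads a small, made-up CSV string (the column names date and value are assumptions for illustration). It uses ISO 8601 dates, so the fast path mentioned in the note should apply.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data with ISO 8601 dates.
data = "date,value\n2020-01-01,10\n2020-01-02,20\n"

# A list of names: parse each named column as a separate date column.
by_name = pd.read_csv(StringIO(data), parse_dates=["date"])

# parse_dates=True: try parsing the index, so an index column is selected first.
by_index = pd.read_csv(StringIO(data), index_col="date", parse_dates=True)

print(by_name.dtypes)
print(type(by_index.index).__name__)
```

The list-of-lists and dict forms combine several columns into one; their availability may depend on the installed pandas version, so they are left out of this sketch.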
infer_datetime_format boolean, default False
If True and parse_dates is enabled for a column, attempt to infer the datetime format to speed up the processing.
keep_date_col boolean, default False
If True and parse_dates specifies combining multiple columns then keep the original columns.
date_parser function, default None
Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst boolean, default False
DD/MM format dates, international and European format.
cache_dates boolean, default True
If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.
New in version 0.25.0.
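The effect of dayfirst is easiest to see on an ambiguous date. The sketch below uses made-up data in which every date could be read either as MM/DD or DD/MM; the column names are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: both dates are ambiguous between MM/DD and DD/MM.
data = "date,value\n01/02/2020,1\n03/04/2020,2\n"

# Default behaviour: the first field is read as the month.
month_first = pd.read_csv(StringIO(data), parse_dates=["date"])

# dayfirst=True: the first field is read as the day.
day_first = pd.read_csv(StringIO(data), parse_dates=["date"], dayfirst=True)

print(month_first["date"].iloc[0])
print(day_first["date"].iloc[0])
```

With dayfirst=True the first row is read as 1 February 2020 rather than 2 January 2020.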
Iteration#
iterator boolean, default False
Return TextFileReader object for iteration or getting chunks with get_chunk().
chunksize int, default None
Return TextFileReader object for iteration. See iterating and chunking below.
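The two iteration options above can be sketched as follows, assuming a small in-memory CSV (the data and variable names are illustrative only). With chunksize the reader yields DataFrames of a fixed maximum size; with iterator=True rows are pulled on demand with get_chunk().

```python
from io import StringIO

import pandas as pd

# Hypothetical five-row data set.
data = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"

# chunksize=2: iterating the reader yields DataFrames of at most two rows.
reader = pd.read_csv(StringIO(data), chunksize=2)
chunk_lengths = [len(chunk) for chunk in reader]

# iterator=True: nothing is read until get_chunk(n) asks for the next n rows.
lazy_reader = pd.read_csv(StringIO(data), iterator=True)
first_three = lazy_reader.get_chunk(3)
lazy_reader.close()

print(chunk_lengths)
print(first_three)
```

Chunked reading is mainly useful when the file is too large to hold in memory at once, since each chunk can be processed and discarded before the next is read.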
Quoting, compression, and file format#
compression {'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None, dict}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, zip, xz, or zstandard if filepath_or_buffer is path-like ending in '.gz', '.bz2', '.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, or zstandard.ZstdDecompressor. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
Changed in version 1.1.0: dict option extended to support gzip and bz2.
Changed in version 1.2.0: Previous versions forwarded dict entries for 'gzip' to gzip.open.
thousands str, default None
Thousands separator
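Returning to the compression option described above, here is a minimal round-trip sketch. It writes a small, made-up DataFrame to a temporary .gz file using the dict form and reads it back; the file name and the frame contents are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data to round-trip through a gzip-compressed CSV file.
original = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv.gz")

    # Dict form: 'method' selects gzip, the other keys are forwarded to the
    # gzip writer (fast compression level, fixed mtime for reproducibility).
    original.to_csv(
        path,
        index=False,
        compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
    )

    # The default compression='infer' picks gzip from the '.gz' suffix.
    restored = pd.read_csv(path)

    # The method can also be named explicitly.
    explicit = pd.read_csv(path, compression="gzip")

print(restored.equals(original))
```

Inference relies on the file extension, so a compressed file with a nonstandard suffix needs the method passed explicitly.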
decimal str, default '.'
Character to recognize as decimal point. E.g. use ',' for European data.
float_precision string, default None
Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.
lineterminator str (length 1), default None
Character to break file into lines. Only valid with C parser
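The number-format and line-terminator options above can be sketched together. The example assumes European-style data with ';' as the field separator (passed through the sep parameter), '.' as the thousands separator, and ',' as the decimal point; all data and names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical European-style data: ';' separates fields,
# '.' groups thousands and ',' marks the decimal point.
european = "id;amount\n1;1.234,5\n2;10,25\n"
amounts = pd.read_csv(StringIO(european), sep=";", decimal=",", thousands=".")

# Hypothetical data that uses '~' instead of a newline to end each row.
tilde_data = "a,b~1,2~3,4"
tilde = pd.read_csv(StringIO(tilde_data), lineterminator="~")

print(amounts)
print(tilde)
```

Here the amount column is parsed as floating-point numbers rather than left as strings, and the second frame has two data rows.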
quotechar str (length 1)
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
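A short sketch of quotechar in action, using made-up data in which one field contains the delimiter. The first read relies on the usual double-quote character; the second assumes a file that quotes with single quotes instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the motto field contains a comma, so it is quoted.
data = 'name,motto\nAda,"Hello, world"\nBob,"Plain"\n'
default_quotes = pd.read_csv(StringIO(data))

# The same idea with single quotes as the quoting character.
single_quoted = "name,motto\nAda,'Hello, world'\n"
custom_quotes = pd.read_csv(StringIO(single_quoted), quotechar="'")

print(default_quotes)
print(custom_quotes)
```

In both cases the comma inside the quoted item is kept as part of the value instead of starting a new field.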
quoting int or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote boolean, default True
When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar element.
escapechar str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment str, default None
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 will result in 'a,b,c' being treated as the header.
encoding str, default None
Encoding to use for UTF when reading/writing (e.g. 'utf-8'). See the list of Python standard encodings.
dialect str or csv.Dialect instance, default None
If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See the csv.Dialect documentation for more details.
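As a rough illustration of the comment and dialect parameters described above, the sketch below parses two small in-memory strings. The sample data and the variable names (raw, commented, pipe_dialect, piped) are made up for this example and are not part of the original guide.

```python
import csv
from io import StringIO

import pandas as pd

# A fully commented first line is skipped before the header is located,
# so header=0 picks up "a,b,c". Text after "#" on a data line is dropped too.
raw = "#empty\na,b,c\n1,2,3#trailing note"
commented = pd.read_csv(StringIO(raw), comment="#", header=0)
print(commented)

# A csv.Dialect instance supplies delimiter, quotechar, quoting, and so on.
# Here only the delimiter differs from the standard "excel" dialect.
pipe_dialect = csv.excel()
pipe_dialect.delimiter = "|"
piped = pd.read_csv(StringIO("x|y\n1|2\n3|4"), dialect=pipe_dialect)
print(piped)
```

Because no conflicting sep or delimiter is passed alongside the dialect here, no ParserWarning should be expected in this sketch.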
Error handling#
error_bad_lines boolean, optional, default None
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. See bad lines below.
Deprecated since version 1.3.0: The on_bad_lines parameter should be used instead to specify behavior upon encountering a bad line.
warn_bad_lines boolean, optional, default None
If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output.
Deprecated since version 1.3.0: The on_bad_lines parameter should be used instead to specify behavior upon encountering a bad line.
on_bad_lines (‘error’, ‘warn’, ‘skip’), default ‘error’
Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are:
‘error’, raise a ParserError when a bad line is encountered.
‘warn’, print a warning when a bad line is encountered and skip that line.
‘skip’, skip bad lines without raising or warning when they are encountered.
New in version 1.3.0.
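The difference between the ‘error’ and ‘skip’ settings can be sketched as follows, assuming pandas 1.3.0 or later. The three-column sample with one four-field line is invented for illustration, as are the names raw, raised and skipped.

```python
from io import StringIO

import pandas as pd
from pandas.errors import ParserError

# The line "4,5,6,7" has one field too many for a three-column header.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# Default behaviour (on_bad_lines="error"): the bad line raises.
try:
    pd.read_csv(StringIO(raw))
    raised = False
except ParserError:
    raised = True

# on_bad_lines="skip" drops the bad line without raising or warning.
skipped = pd.read_csv(StringIO(raw), on_bad_lines="skip")
print(raised, len(skipped))
```

Passing on_bad_lines="warn" drops the same line but also reports it; how the warning is surfaced may differ between pandas versions.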
Specifying column data types#
You can indicate the data type for the whole DataFrame or individual columns:
In [13]: import numpy as np

In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

In [15]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [16]: df = pd.read_csv(StringIO(data), dtype=object)

In [17]: df
Out[17]:
   a   b   c    d
0  1   2   3    4
1  5   6   7    8
2  9  10  11  NaN

In [18]: df["a"][0]
Out[18]: '1'

In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})

In [20]: df.dtypes
Out[20]:
a      int64
b     object
c    float64
d      Int64
dtype: object
Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype. If you’re unfamiliar with these concepts, see the pandas documentation on dtypes and on object conversion.
For instance, you can use the converters argument of read_csv().
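A minimal sketch of the converters approach, assuming a single column named col_1 that mixes numbers with one quoted letter. The data and the names raw, as_text and types_seen are illustrative only.

```python
from io import StringIO

import pandas as pd

# One column whose raw values would not all parse as numbers.
raw = "col_1\n1\n2\n'A'\n4.22"

# Run every value of col_1 through str, so nothing is inferred as numeric.
as_text = pd.read_csv(StringIO(raw), converters={"col_1": str})

# Collect the Python types that actually ended up in the column.
types_seen = set(as_text["col_1"].map(type))
print(types_seen)
```

The converter runs on each raw field, so every value keeps its original text, including the single quotes around A.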
Or you can use the to_numeric() function to coerce the dtypes after reading in the data, which will convert all valid values to floats and leave the invalid ones as NaN.
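The to_numeric() route might look like the sketch below. It reuses the same made-up col_1 sample as an assumption; errors="coerce" is the to_numeric() option that turns unparseable values into NaN.

```python
from io import StringIO

import pandas as pd

raw = "col_1\n1\n2\n'A'\n4.22"

# Read with default inference first; the stray 'A' keeps the column non-numeric.
df2 = pd.read_csv(StringIO(raw))

# Coerce afterwards: valid entries become numbers, invalid ones become NaN.
coerced = pd.to_numeric(df2["col_1"], errors="coerce")
print(coerced)
```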
Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs. In the case above, if you wanted to NaN out the data anomalies, then to_numeric() is probably your best option. However, if you wanted all the data to be coerced, no matter the type, then using the converters argument of read_csv() would certainly be worth trying.
Note
In some cases, reading in abnormal data with columns containing mixed dtypes will result in an inconsistent dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will infer the dtypes for different chunks of the data, rather than for the whole dataset at once. Consequently, you can end up with column(s) with mixed dtypes. For example, reading a large file whose column holds mostly integers plus a few strings into a DataFrame named mixed_df will result in mixed_df containing an int dtype for certain chunks of the column, and str for others, due to the mixed dtypes in the data that was read in. It is important to note that the overall column will be marked with a dtype of object, which is used for columns with mixed dtypes.
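The situation in this note can be reproduced roughly as follows. This is a sketch under assumptions: the column is built in memory rather than read from a real file, the sizes are arbitrary, and whether mixed types actually show up depends on the parsing engine, the data size and the pandas version. The names buffer, mixed_df and uniform_df are illustrative.

```python
import warnings
from io import StringIO

import pandas as pd

# Mostly integers, with two strings buried in the middle.
col_1 = list(range(500_000)) + ["a", "b"] + list(range(500_000))
buffer = StringIO()
pd.DataFrame({"col_1": col_1}).to_csv(buffer, index=False)

# Let pandas infer the dtype; the parser may work chunk by chunk.
buffer.seek(0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a DtypeWarning is possible here
    mixed_df = pd.read_csv(buffer)
print(mixed_df["col_1"].map(type).value_counts())

# Stating the dtype up front avoids chunk-dependent inference.
buffer.seek(0)
uniform_df = pd.read_csv(buffer, dtype={"col_1": str})
uniform_types = set(uniform_df["col_1"].map(type))
print(uniform_types)
```

Declaring the dtype explicitly is one way to get a consistent column; the converters and to_numeric() approaches described earlier are alternatives.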
Specifying categorical dtype#
Categorical columns can be parsed directly by specifying dtype='category' or dtype=CategoricalDtype(categories, ordered).
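A small sketch of the dtype='category' form, using an invented three-column sample (raw) and an illustrative result name (all_cat):

```python
from io import StringIO

import pandas as pd

raw = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# Every column, including the numeric-looking col3, is parsed as categorical.
all_cat = pd.read_csv(StringIO(raw), dtype="category")
print(all_cat.dtypes)
```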
Individual columns can be parsed as a Categorical using a dict specification.
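The dict form could look like this, again with made-up sample data; only col1 is requested as categorical and the remaining columns are left to normal inference.

```python
from io import StringIO

import pandas as pd

raw = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# Map column name to dtype; unlisted columns are inferred as usual.
partial = pd.read_csv(StringIO(raw), dtype={"col1": "category"})
print(partial.dtypes)
```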
Specifying dtype='category' will result in an unordered Categorical whose categories are the unique values observed in the data. For more control over the categories and order, create a CategoricalDtype ahead of time, and pass that for that column’s dtype.
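As a hedged example of fixing the categories and their order in advance, the sketch below builds a CategoricalDtype with a reversed, ordered category list. The category values and the names order and ordered_df are assumptions chosen to fit the sample data.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

raw = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# Declare the full category set and mark it as ordered (d < c < b < a).
order = CategoricalDtype(["d", "c", "b", "a"], ordered=True)
ordered_df = pd.read_csv(StringIO(raw), dtype={"col1": order})
print(ordered_df["col1"])
```

The resulting column keeps all four declared categories even though only a and c occur in the data.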
When using dtype=CategoricalDtype, “unexpected” values outside of dtype.categories are treated as missing values.
This matches the behavior of Categorical.set_categories().
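To illustrate, the sketch below leaves the value c out of the declared categories and compares the parsed result with Categorical.set_categories(). The sample data and the names narrow, parsed and reference are hypothetical.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

raw = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# "c" is deliberately missing from the declared categories.
narrow = CategoricalDtype(["a", "b", "d"])
parsed = pd.read_csv(StringIO(raw), dtype={"col1": narrow})
print(parsed["col1"])

# The same values pushed through set_categories() lose "c" in the same way.
reference = pd.Categorical(["a", "a", "c"]).set_categories(["a", "b", "d"])
print(reference)
```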
Note
With dtype='category', the resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter such as to_datetime().
When
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object88 is a
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object0131 with homogeneous
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object0130 [ all numeric, all datetimes, etc. ], the conversion is done automatically
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object31
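The conversion described in the note can be sketched as follows. This is a minimal illustration, not the guide's own example: the sample data and the column name col3 are made up, and the categories are renamed through the cat accessor after parsing.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data; col3 holds numeric-looking values.
data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# With dtype="category" the categories are parsed as strings.
df = pd.read_csv(StringIO(data), dtype="category")
string_categories = list(df["col3"].cat.categories)

# Convert the string categories to numbers after parsing.
numeric_categories = pd.to_numeric(df["col3"].cat.categories)
df["col3"] = df["col3"].cat.rename_categories(numeric_categories)
```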
Naming and using columns#
Handling column names#
A file may or may not have a header row. pandas assumes the first row should be used as the column names.
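A small sketch of the default header handling, using made-up data: with no extra arguments the first line supplies the column names and the remaining lines become rows.

```python
from io import StringIO

import pandas as pd

# Hypothetical data whose first line is a header row.
data = "a,b,c\n1,2,3\n4,5,6\n7,8,9"

# No header-related arguments: the first row is used as column names.
df = pd.read_csv(StringIO(data))
```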
By specifying the names argument in conjunction with header you can indicate other names to use and whether or not to throw away the header row (if any).
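The two combinations can be compared side by side. The sketch below assumes made-up data and replacement names (foo, bar, baz): header=0 discards the original header row, while header=None keeps it as an ordinary data row.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n7,8,9"
new_names = ["foo", "bar", "baz"]  # illustrative names only

# header=0: the first line is treated as a header and replaced by new_names.
replaced = pd.read_csv(StringIO(data), names=new_names, header=0)

# header=None: no line is treated as a header, so "a,b,c" becomes a data row.
kept = pd.read_csv(StringIO(data), names=new_names, header=None)
```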
If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows.
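A minimal sketch with made-up data whose real header sits on the second line. Row numbers are zero-based, so header=1 selects it and drops the line before it.

```python
from io import StringIO

import pandas as pd

# The first line is junk; the actual header is on the second line.
data = "skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9"

# header=1 uses the second line as column names and skips the line above it.
df = pd.read_csv(StringIO(data), header=1)
```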
Note
Default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and column names are inferred from the first non-blank line of the file; if column names are passed explicitly, the behavior is identical to header=None.
Duplicate names parsing#
Deprecated since version 1.5.0: mangle_dupe_cols was never implemented, and a new argument where the renaming pattern can be specified will be added instead.
If the file or header contains duplicate names, pandas will by default distinguish between them so as to prevent overwriting data.
The parsed result contains no duplicate column names because mangle_dupe_cols=True by default, which modifies a series of duplicate columns ‘X’, …, ‘X’ to become ‘X’, ‘X.1’, …, ‘X.N’.
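The default renaming can be observed with a small made-up input. The sketch deliberately does not pass mangle_dupe_cols, since the argument is deprecated, and relies only on the default behavior.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with the column name "a" appearing twice.
data = "a,b,a\n0,1,2\n3,4,5"

# The second "a" is renamed so that no data is overwritten.
df = pd.read_csv(StringIO(data))
column_names = list(df.columns)
```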
Filtering columns (usecols)#
The usecols argument allows you to select any subset of the columns in a file, either using the column names, position numbers or a callable.
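The three selection styles can be sketched on one made-up input. The callable receives each column name and keeps the column when it returns True.

```python
from io import StringIO

import pandas as pd

data = "a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz"

# Select by column name.
by_name = pd.read_csv(StringIO(data), usecols=["b", "d"])

# Select by zero-based position.
by_position = pd.read_csv(StringIO(data), usecols=[0, 2, 3])

# Select with a callable evaluated against each column name.
by_callable = pd.read_csv(StringIO(data), usecols=lambda name: name.upper() in ["A", "C"])
```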
The usecols argument can also be used to specify which columns not to use in the final result. A callable that returns False for the names “a” and “c”, for instance, excludes the “a” and “c” columns from the output.
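A short sketch of the exclusion pattern, again with made-up data: the callable returns False for the unwanted names, so only the other columns are read.

```python
from io import StringIO

import pandas as pd

data = "a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz"

# Keep every column except "a" and "c".
df = pd.read_csv(StringIO(data), usecols=lambda name: name not in ["a", "c"])
```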
Comments and empty lines#
Ignoring line comments and empty lines#
If the comment parameter is specified, then completely commented lines will be ignored. By default, completely blank lines will be ignored as well.
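A minimal sketch with made-up data that mixes blank lines and one fully commented line. With comment="#" neither kind of line reaches the result.

```python
from io import StringIO

import pandas as pd

# Blank lines and a fully commented line surround the real content.
data = "\na,b,c\n\n# commented line\n1,2,3\n\n4,5,6"

df = pd.read_csv(StringIO(data), comment="#")
```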
If skip_blank_lines=False, then read_csv will not ignore blank lines.
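The effect can be sketched with made-up data containing three blank lines. With skip_blank_lines=False each blank line becomes a row of missing values.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n\n1,2,3\n\n\n4,5,6"

# Blank lines are kept and appear as all-NaN rows.
df = pd.read_csv(StringIO(data), skip_blank_lines=False)
blank_rows = int(df.isna().all(axis=1).sum())
```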
Warning
The presence of ignored lines might create ambiguities involving line numbers; the parameter header uses row numbers (ignoring commented/empty lines), while skiprows uses line numbers (including commented/empty lines).
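The difference in counting can be sketched with two made-up inputs that each contain one comment line. In the first, header=1 does not count the comment; in the second, skiprows=2 does.

```python
from io import StringIO

import pandas as pd

# header counts rows after comment lines are dropped:
# row 0 is "a,b,c", row 1 is "A,B,C".
data_header = "#comment\na,b,c\nA,B,C\n1,2,3"
by_header = pd.read_csv(StringIO(data_header), comment="#", header=1)

# skiprows counts physical lines, so the comment line is one of the two skipped.
data_skiprows = "A,B,C\n#comment\na,b,c\n1,2,3"
by_skiprows = pd.read_csv(StringIO(data_skiprows), comment="#", skiprows=2)
```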
If both header and skiprows are specified, header will be relative to the end of skiprows.
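A sketch of the combined case with made-up data. The first four physical lines are skipped, and header=1 is then counted from the first line that remains.

```python
from io import StringIO

import pandas as pd

data = (
    "# empty\n"
    "# second empty line\n"
    "# third emptyline\n"
    "X,Y,Z\n"
    "1,2,3\n"
    "A,B,C\n"
    "1,2.,4.\n"
    "5.,NaN,10.0\n"
)

# skiprows=4 drops the three comment lines and "X,Y,Z";
# header=1 then picks "A,B,C", the second remaining line.
df = pd.read_csv(StringIO(data), comment="#", skiprows=4, header=1)
```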
Comments#
Sometimes comments or metadata may be included in a file.
By default, the parser includes the comments in the output.
We can suppress the comments using the comment keyword.
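Both behaviors can be sketched on made-up data with trailing comments on each line. Without the comment keyword the comment text ends up inside the last field; with comment="#" everything from the marker onward is dropped.

```python
from io import StringIO

import pandas as pd

data = (
    "ID,level,category\n"
    "Patient1,123000,x # really unpleasant\n"
    "Patient2,23000,y # wouldn't take his medicine\n"
    "Patient3,1234018,z # awesome"
)

# Default: the comment text is kept as part of the category field.
with_comments = pd.read_csv(StringIO(data))

# comment="#": the rest of each line after the marker is ignored.
without_comments = pd.read_csv(StringIO(data), comment="#")
```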
Dealing with Unicode data#
The encoding argument should be used for encoded unicode data, which will result in byte strings being decoded to unicode in the result.
Some formats which encode all characters as multiple bytes, like UTF-16, won’t parse correctly at all without specifying the encoding. See the Python documentation for the full list of standard encodings.
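A minimal sketch of reading encoded bytes, assuming a made-up UTF-16 payload held in memory rather than a file on disk. Passing the matching encoding lets the parser decode the bytes back to text.

```python
from io import BytesIO

import pandas as pd

# Hypothetical text, encoded to UTF-16 bytes as it might arrive from a file.
text = "word,length\nTräumen,7\nGrüße,5"
raw = text.encode("utf-16")

# The encoding argument tells read_csv how to decode the bytes.
df = pd.read_csv(BytesIO(raw), encoding="utf-16")
```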
Index columns and trailing delimiters#
If a file has one more column of data than the number of column names, the first column will be used as the DataFrame’s row names.
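A sketch of the inference with made-up data: the header names three columns but each data line has four fields, so the first field becomes the index.

```python
from io import StringIO

import pandas as pd

# Three column names, four fields per data line.
data = "a,b,c\n4,apple,bat,5.7\n8,orange,cow,10"

df = pd.read_csv(StringIO(data))
```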
Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False.
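The trailing-delimiter case can be sketched with made-up data. The default call treats the first field as an index, whereas index_col=False keeps it as column "a" and drops the empty trailing field.

```python
from io import StringIO

import pandas as pd

# Each data line ends with a stray delimiter.
data = "a,b,c\n4,apple,bat,\n8,orange,cow,"

# Default: the extra field makes the parser infer an index column.
inferred = pd.read_csv(StringIO(data))

# index_col=False disables that inference.
explicit = pd.read_csv(StringIO(data), index_col=False)
```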
If a subset of data is being parsed using the usecols option, the index_col specification is based on that subset, not the original data.
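A small sketch of how the two options interact, using made-up data without trailing delimiters to keep the effect isolated. Position 0 here refers to the first selected column, "b", not to "a".

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,apple,bat\n2,orange,cow"

# Only "b" and "c" are read; index_col=0 is counted within that subset.
df = pd.read_csv(StringIO(data), usecols=["b", "c"], index_col=0)
```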
Date Handling#
Specifying date columns#
To better facilitate working with datetime data, read_csv() uses the keyword arguments parse_dates and date_parser to allow users to specify a variety of columns and date/time formats to turn the input text data into datetime objects.
The simplest case is to just pass in parse_dates=True.
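A minimal sketch of the simplest case with made-up data: the first column is used as the index, and parse_dates=True asks the parser to turn that index into datetimes.

```python
from io import StringIO

import pandas as pd

data = "date,A,B,C\n2009-01-01,a,1,2\n2009-01-02,b,3,4\n2009-01-03,c,4,5"

# Use the first column as the index and parse it as dates.
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True)
```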
It is often the case that we may want to store date and time data separately, or store various date fields separately. The parse_dates keyword can be used to specify a combination of columns to parse the dates and/or times from.
You can specify a list of column lists to parse_dates; the resulting date columns will be prepended to the output (so as to not affect the existing column order) and the new column names will be the concatenation of the component column names.
By default the parser removes the component date columns, but you can choose to retain them via the keep_date_col keyword.
Note that if you wish to combine multiple columns into a single date column, a nested list must be used. In other words, parse_dates=[1, 2] indicates that the second and third columns should each be parsed as separate date columns while parse_dates=[[1, 2]] means the two columns should be parsed into a single column.
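The flat form can be sketched with made-up data holding two date columns. Only parse_dates=[1, 2] is exercised here; the nested form is left out because its support is assumed to differ between pandas versions.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second and third columns each hold a date.
data = "name,start,end\nx,2009-01-01,2009-02-01\ny,2009-03-01,2009-04-01"

# A flat list parses each listed column as its own date column.
df = pd.read_csv(StringIO(data), parse_dates=[1, 2])
```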
You can also use a dict to specify custom names for the resulting date columns.
It is important to remember that if multiple text columns are to be parsed into a single date column, then a new column is prepended to the data. The index_col specification is based on this new set of columns rather than the original data columns.
Note
If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.
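One way to follow this advice, sketched under the assumption that unparsable entries should simply become missing values: read the column as text, then convert it with to_datetime(). The sample data and the choice of errors="coerce" are illustrative, not prescribed by the text above.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second row holds a value that is not a date.
data = "when,value\n2011-12-30,1\nnot a date,2\n2012-01-05,3"

# Read the column as plain text first, then convert explicitly.
df = pd.read_csv(StringIO(data))
# errors="coerce" turns unparsable entries into NaT instead of raising.
df["when"] = pd.to_datetime(df["when"], errors="coerce")

print(df)
```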
Note
read_csv has a fast path for parsing datetime strings in ISO 8601 format, e.g. “2000-01-01T00:01:02+00:00” and similar variations. If you can arrange for your data to store datetimes in this format, load times will be significantly faster; speedups of around 20x have been observed.
Date parsing functions#
Finally, the parser allows you to specify a custom date_parser function to take full advantage of the flexibility of the date parsing API.
pandas will try to call the date_parser function in three different ways. If an exception is raised, the next one is tried:
1. date_parser is first called with one or more arrays as arguments, as defined using parse_dates (e.g., date_parser(['2013', '2013'], ['1', '2'])).
2. If #1 fails, date_parser is called with all the columns concatenated row-wise into a single array (e.g., date_parser(['2013 1', '2013 2'])).
3. If #2 fails, date_parser is called once for each row, with one or more strings (corresponding to the columns defined by parse_dates) as arguments.
Note that performance-wise, you should try these methods of parsing dates in order:
1. Try to infer the format using infer_datetime_format=True (see section below).
2. If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x, format=...).
3. If you have a really non-standard format, use a custom date_parser function. For optimal performance, this should be vectorized, i.e., it should accept arrays as arguments.
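For the second option, a minimal sketch: when the format is known, passing it explicitly to pd.to_datetime() avoids any guessing. For simplicity the conversion is applied to the column after reading rather than through date_parser; the data and the format string are made up for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with day-first dates and dots as separators.
data = "date,value\n30.12.2011,1\n31.12.2011,2"

df = pd.read_csv(StringIO(data))
# pd.to_datetime is vectorized: it converts the whole column in one call.
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

print(df["date"])
```

Because the call receives the whole column at once, it also illustrates the "accept arrays as arguments" advice for custom parsers.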
Parsing a CSV with mixed timezones#
pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.
To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True as the date_parser.
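The effect of a partially-applied to_datetime() with utc=True can be sketched as follows. As an assumption of this example, the partial is applied to the column after reading instead of being handed to read_csv, and the two timestamps are invented.

```python
from functools import partial
from io import StringIO

import pandas as pd

# Two timestamps with different UTC offsets (hypothetical data).
content = "a\n2000-01-01T00:00:00+05:00\n2000-01-01T00:00:00+06:00"

df = pd.read_csv(StringIO(content))

# A partially-applied to_datetime that always converts to UTC.
to_utc = partial(pd.to_datetime, utc=True)
df["a"] = to_utc(df["a"])

print(df["a"])
```

Both values end up in a single UTC-aware datetime column, with the original offsets folded into the instants they represent.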
Inferring datetime format#
If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted the same way, you may get a large speed-up by setting infer_datetime_format=True. If set, pandas will attempt to guess the format of your datetime strings, and then use a faster means of parsing the strings. 5-10x parsing speeds have been observed. pandas will fall back to the usual parsing if either the format cannot be guessed or the format that was guessed cannot properly parse the entire column of strings. So in general, infer_datetime_format should not have any negative consequences if enabled.
Here are some examples of datetime strings that can be guessed (all representing December 30th, 2011 at 00:00:00):
“20111230”
“2011/12/30”
“20111230 00:00:00”
“12/30/2011 00:00:00”
“30/Dec/2011 00:00:00”
“30/December/2011 00:00:00”
Note that infer_datetime_format is sensitive to dayfirst. With dayfirst=True, it will guess “01/12/2011” to be December 1st. With dayfirst=False (default), it will guess “01/12/2011” to be January 12th.
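The same ambiguity can be seen directly with pd.to_datetime(), used here as a stand-in (an assumption of this sketch) because it accepts the same dayfirst flag.

```python
import pandas as pd

ambiguous = "01/12/2011"

# Month-first interpretation (the default): January 12th.
month_first = pd.to_datetime(ambiguous, dayfirst=False)
# Day-first interpretation: December 1st.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.date(), day_first.date())
```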
International date formats#
While US date formats tend to be MM/DD/YYYY, many international formats use DD/MM/YYYY instead. For convenience, a dayfirst keyword is provided.
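A short sketch of the dayfirst keyword in read_csv, using invented DD/MM/YYYY data:

```python
from io import StringIO

import pandas as pd

# Hypothetical data with DD/MM/YYYY dates.
data = "date,value,cat\n01/06/2000,5,a\n02/06/2000,10,b\n03/06/2000,15,c"

# dayfirst=True reads 01/06/2000 as 1 June rather than 6 January.
df = pd.read_csv(StringIO(data), parse_dates=[0], dayfirst=True)

print(df)
```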
Writing CSVs to binary file objects#
New in version 1.2.0.
df.to_csv(..., mode="wb") allows writing a CSV to a file object opened in binary mode. In most cases, it is not necessary to specify mode, as pandas will auto-detect whether the file object is opened in text or binary mode.
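A minimal sketch of writing to a binary file object, assuming pandas 1.2.0 or later and using an in-memory io.BytesIO buffer in place of a real file. The frame contents are arbitrary.

```python
import io

import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2]})

# BytesIO is a binary file object; pandas detects this without an explicit mode.
buffer = io.BytesIO()
df.to_csv(buffer, index=False, encoding="utf-8")

# The buffer now holds encoded bytes rather than text.
csv_bytes = buffer.getvalue()
print(csv_bytes.decode("utf-8"))
```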
Specifying method for floating-point conversion#
The parameter float_precision can be specified in order to use a specific floating-point converter during parsing with the C engine. The options are the ordinary converter, the high-precision converter, and the round-trip converter (which is guaranteed to round-trip values after writing to a file).
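The three converters can be compared with a sketch like the one below. The value in val is just an arbitrary long decimal string, and the comparison against Python's float() is an illustrative choice rather than an official benchmark.

```python
from io import StringIO

import pandas as pd

val = "0.3066101993807095471566981359501369297504425048828125"
data = "a,b,c\n1,2,{0}".format(val)

# Parse the same field with each converter supported by the C engine.
errors = {}
for precision in (None, "high", "round_trip"):
    parsed = pd.read_csv(StringIO(data), engine="c", float_precision=precision)["c"][0]
    # Absolute difference from Python's own float() conversion of the text.
    errors[precision] = abs(parsed - float(val))

print(errors)
```

With "round_trip" the difference is exactly zero; the other converters may differ in the last bits.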
Thousand separators#
For large numbers that have been written with a thousands separator, you can set the thousands keyword to a string of length 1 so that integers will be parsed correctly.
By default, numbers with a thousands separator will be parsed as strings. The thousands keyword allows them to be parsed as integers instead.
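The difference can be sketched with a small pipe-delimited sample (the data and column names are hypothetical):

```python
from io import StringIO

import pandas as pd

# Hypothetical pipe-delimited data whose "level" column uses commas as thousands separators.
data = "ID|level|category\nPatient1|123,000|x\nPatient2|23,000|y\nPatient3|1,234,018|z"

# Without the keyword, the values stay strings such as "123,000".
as_text = pd.read_csv(StringIO(data), sep="|")

# With thousands=",", the separators are dropped and the column becomes integer.
as_int = pd.read_csv(StringIO(data), sep="|", thousands=",")

print(as_text["level"].tolist())
print(as_int["level"].tolist())
```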
NA values#
To control which values are parsed as missing values (which are signified by NaN), specify a string in na_values. If you specify a list of strings, then all values in it are considered to be missing values. If you specify a number (a float, like 5.0, or an integer, like 5), the corresponding equivalent values will also imply a missing value (in this case effectively [5.0, 5] are recognized as NaN).
To completely override the default values that are recognized as missing, specify keep_default_na=False.
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
Let us consider some examples:
pd.read_csv("path_to_file.csv", na_values=[5])
In the example above, 5 and 5.0 will be recognized as NaN, in addition to the defaults. A string will first be interpreted as a numerical 5, then as a NaN.
pd.read_csv("path_to_file.csv", keep_default_na=False, na_values=[""])
Above, only an empty field will be recognized as NaN.
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object086
Above, both
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object2026 and
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object84 as strings are
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object46
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object087
The default values, in addition to the string
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object2029 are recognized as
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object46
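The missing-value options described above can be tried on a small in-memory file. This is a minimal sketch: the column names and sample values are invented for illustration and are not the original examples.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data, invented for this sketch.
csv_text = "id,score,label\n1,5,NA\n2,,Nope\n3,0,ok\n"

# Numeric marker: 5 (and 5.0) are treated as missing, on top of the defaults.
numeric_marker = pd.read_csv(StringIO(csv_text), na_values=[5])

# Defaults plus the extra string "Nope".
with_extra = pd.read_csv(StringIO(csv_text), na_values=["Nope"])

# Defaults switched off: only an empty field counts as missing.
only_empty = pd.read_csv(StringIO(csv_text), keep_default_na=False, na_values=[""])

print(numeric_marker["score"].isna().tolist())
print(with_extra["label"].isna().tolist())
print(only_empty["label"].tolist())
```

With the defaults switched off, the literal text NA in the label column survives as an ordinary string, which is usually the reason to reach for keep_default_na=False.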
Infinity#
Values that look like inf will be parsed as np.inf (positive infinity), and -inf as -np.inf (negative infinity). Parsing ignores the case of the value, meaning Inf will also be parsed as np.inf.
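A short sketch of the case-insensitive infinity parsing described above. The one-column input is hypothetical and chosen only to show a few spellings side by side.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical single-column input with several spellings of infinity.
csv_text = "x\ninf\n-inf\nInf\n-INF\n2\n"

df = pd.read_csv(StringIO(csv_text))

# The column is floating point because infinities have no integer form.
print(df["x"].tolist())
print(np.isinf(df["x"]).tolist())
```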
Returning Series#
Using the squeeze keyword, the parser will return output with a single column as a Series.
Deprecated since version 1.4.0: Users should append .squeeze("columns") to the DataFrame returned by read_csv instead.
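The recommended replacement for the deprecated keyword can be sketched as follows. The column name and values are placeholders invented for this example.

```python
from io import StringIO

import pandas as pd

# Hypothetical single-column file.
csv_text = "level\nPatient1\nPatient2\nPatient3\n"

df = pd.read_csv(StringIO(csv_text))

# Collapse the one-column DataFrame into a Series.
series = df.squeeze("columns")

print(type(df).__name__, "->", type(series).__name__)
```

Calling squeeze("columns") only collapses the column axis, so a file with a single row but several columns still comes back as a DataFrame.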
Boolean values#
The common values True, False, TRUE, and FALSE are all recognized as boolean. Occasionally you might want to recognize other values as being boolean. To do this, use the true_values and false_values options, each of which takes a list of strings to treat as True or False respectively.
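A minimal sketch of custom boolean markers, assuming a file that spells its flags as Yes and No. The data is illustrative only.

```python
from io import StringIO

import pandas as pd

# Hypothetical data in which column b uses Yes/No instead of True/False.
csv_text = "a,b,c\n1,Yes,2\n3,No,4\n"

df = pd.read_csv(StringIO(csv_text), true_values=["Yes"], false_values=["No"])

print(df["b"].tolist())
print(df.dtypes)
```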
Handling “bad” lines#
Some files may have malformed lines with too few fields or too many. Lines with too few fields will have NA values filled in the trailing fields. Lines with too many fields will raise an error by default.
You can elect to skip bad lines by passing on_bad_lines="skip".
Or pass a callable function to handle the bad line if engine="python". The bad line will be a list of strings that was split by the sep.
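The three behaviours for over-long lines can be compared on one small input. This is a sketch under the assumption of a pandas version that accepts a callable for on_bad_lines (1.4 or later); the helper name keep_last_three and the sample data are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second data line has four fields instead of three.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# Default behaviour: the extra field raises a parser error.
default_error = None
try:
    pd.read_csv(StringIO(data))
except pd.errors.ParserError as exc:
    default_error = str(exc)

# Drop the offending line entirely.
skipped = pd.read_csv(StringIO(data), on_bad_lines="skip")

# Repair the line instead: record it and keep its last three fields.
seen = []

def keep_last_three(bad_line):
    seen.append(bad_line)
    return bad_line[-3:]

repaired = pd.read_csv(StringIO(data), on_bad_lines=keep_last_three, engine="python")

print(default_error)
print(skipped)
print(repaired)
print(seen)
```

Returning None from the callable would skip the line, so the same hook can be used to log bad rows without keeping them.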
You can also use the usecols parameter to eliminate extraneous column data that appear in some lines but not others.
In case you want to keep all data including the lines with too many fields, you can specify a sufficient number of names. This ensures that lines with not enough fields are filled with NaN.
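Both alternatives can be sketched on the same kind of ragged input. The data is hypothetical. Note that in the second call the names are supplied without a header argument, so the original header line is read as an ordinary data row.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second data line carries one extra field.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# Keep only the first three columns; the stray fourth field is ignored.
trimmed = pd.read_csv(StringIO(data), usecols=[0, 1, 2])

# Keep everything: four names make room for the extra field,
# and shorter lines are padded with NaN in column d.
wide = pd.read_csv(StringIO(data), names=["a", "b", "c", "d"])

print(trimmed)
print(wide)
```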
Dialect#
The dialect keyword gives greater flexibility in specifying the file format. By default it uses the Excel dialect but you can specify either the dialect name or a csv.Dialect instance.
Suppose you had data with unenclosed quotes, that is, a field that opens with a double quote that is never closed.
By default, read_csv uses the Excel dialect and treats the double quote as the quote character, which causes it to fail when it finds a newline before it finds the closing double quote.
We can get around this using dialect.
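A sketch of the workaround, assuming the standard library csv module is used to build the dialect. The sample text is hypothetical: its second line opens a double quote that is never closed.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical data with an unclosed double quote on the second line.
data = 'label1,label2,label3\nindex1,"a,c,e\nindex2,b,d,f'

# With the default dialect the parser keeps looking for the closing quote.
default_error = None
try:
    pd.read_csv(StringIO(data))
except pd.errors.ParserError as exc:
    default_error = str(exc)

# Start from the Excel dialect but switch quote handling off.
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE

df = pd.read_csv(StringIO(data), dialect=dia)

print(default_error)
print(df)
```

Because the data lines have one more field than the header, the first field of each line becomes the index, and the stray quote is kept as part of the value.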
All of the dialect options can be specified separately by keyword arguments.
Another common dialect option is skipinitialspace, to skip any whitespace after a delimiter.
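Two dialect options passed directly as keywords, as a minimal sketch. The inputs are invented for illustration; lineterminator is shown here with the default C engine.

```python
from io import StringIO

import pandas as pd

# Hypothetical data that uses "~" instead of a newline to end each record.
tilde_data = "a,b,c~1,2,3~4,5,6"
by_terminator = pd.read_csv(StringIO(tilde_data), lineterminator="~")

# Hypothetical data with a space after every comma.
spaced_data = "a, b, c\n1, 2, 3\n4, 5, 6"
without_spaces = pd.read_csv(StringIO(spaced_data), skipinitialspace=True)

print(by_terminator)
print(list(without_spaces.columns))
```

Without skipinitialspace=True the column labels in the second case would keep their leading space, which makes later lookups by name fail in confusing ways.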
The parsers make every attempt to “do the right thing” and not be fragile. Type inference is a pretty big deal. If a column can be coerced to integer dtype without altering the contents, the parser will do so. Any non-numeric columns will come through as object dtype, as with the rest of pandas objects.
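The inference rule can be observed on a tiny input. This is a sketch with made-up data; the exact dtype reported for the text column depends on the pandas version, so the code only checks whether each column is numeric.

```python
from io import StringIO

import pandas as pd
from pandas.api import types

# Hypothetical data: one integer column, one float column, one text column.
csv_text = "a,b,c\n1,2.5,x\n2,3.5,y\n"

df = pd.read_csv(StringIO(csv_text))

print(df.dtypes)
print(types.is_integer_dtype(df["a"]))   # whole numbers stay integers
print(types.is_float_dtype(df["b"]))     # fractional values become floats
print(types.is_numeric_dtype(df["c"]))   # text is not coerced to a number
```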
Quoting and Escape Characters#
Quotes (and other escape characters) in embedded fields can be handled in any number of ways. One way is to use backslashes; to properly parse this data, you should pass the escapechar option.
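A minimal sketch of backslash-escaped quotes inside a quoted field. The sample text is hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the quoted field contains backslash-escaped double quotes.
data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'
print(data)

df = pd.read_csv(StringIO(data), escapechar="\\")

# The backslashes are consumed and the inner quotes are kept.
print(df.loc[0, "a"])
```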
Files with fixed width columns#
While read_csv() reads delimited data, the read_fwf() function works with data files that have known and fixed column widths. The function parameters to read_fwf are largely the same as read_csv with two extra parameters, and a different usage of the delimiter parameter:
colspecs: A list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open intervals (i.e., [from, to)). The string value 'infer' can be used to instruct the parser to try detecting the column specifications from the first 100 rows of the data. The default behavior, if not specified, is to infer.
widths: A list of field widths which can be used instead of colspecs if the intervals are contiguous.
delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., '~').
Consider a typical fixed-width data file.
In order to parse this file into a DataFrame, we simply need to supply the column specifications to the read_fwf function along with the file name.
Note how the parser automatically picks integer column names when the header=None argument is specified. Alternatively, you can supply just the column widths for contiguous columns.
The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between the columns in the file.
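The two ways of describing the layout can be sketched on an in-memory file. The rows are illustrative sample values, and a StringIO buffer stands in for a file name.

```python
from io import StringIO

import pandas as pd

# Hypothetical fixed-width rows: an id, then three right-padded numeric fields.
fwf_text = (
    "id8141    360.242940   149.910199   11950.7\n"
    "id1594    444.953632   166.985655   11788.4\n"
    "id1849    364.136849   183.628767   11806.2\n"
)

# Explicit half-open [from, to) extents for each field.
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
by_spec = pd.read_fwf(StringIO(fwf_text), colspecs=colspecs, header=None, index_col=0)

# Contiguous fields can be described by their widths alone.
widths = [6, 14, 13, 10]
by_width = pd.read_fwf(StringIO(fwf_text), widths=widths, header=None)

print(by_spec)
print(by_width)
```

The extents in colspecs deliberately include some of the padding around each number; the surrounding whitespace is stripped before the values are converted.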
By default,
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object2060 will try to infer the file’s
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object2063 by using the first 100 rows of the file. Nó chỉ có thể làm điều đó trong trường hợp khi các cột được căn chỉnh và phân tách chính xác bằng
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object33 được cung cấp [dấu phân cách mặc định là khoảng trắng]
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object203
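As a minimal sketch of the two approaches described above, the following reads the same fixed-width text once with explicit widths and once with inferred column boundaries. The sample rows and the variable names are made up for illustration and are not taken from the original example.

```python
from io import StringIO

import pandas as pd

# Hypothetical fixed-width data: a 6-character id followed by two numeric fields.
data = (
    "id8141    360.242940   149.910199\n"
    "id1594    444.953632   166.985655\n"
    "id1849    364.136849   183.628767\n"
)

# Explicit widths for contiguous columns (number of characters per field).
by_width = pd.read_fwf(StringIO(data), widths=[6, 14, 13], header=None)

# No widths or colspecs given: the boundaries are inferred from the first rows,
# which works here because the columns are aligned and separated by whitespace.
inferred = pd.read_fwf(StringIO(data), header=None)

print(by_width)
print(inferred)
```

Both calls should yield the same three columns; the surrounding padding is stripped in either case.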
read_fwf supports the dtype parameter for specifying the types of parsed columns to be different from the inferred type.
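A short sketch of the dtype override, assuming made-up fixed-width rows: the last column is inferred as a float by default and kept as raw strings when dtype maps its position to "object".

```python
from io import StringIO

import pandas as pd

# Hypothetical fixed-width data, used only for illustration.
data = (
    "id8141    360.242940   149.910199\n"
    "id1594    444.953632   166.985655\n"
)

# Default behaviour: numeric-looking columns are inferred as floats.
inferred = pd.read_fwf(StringIO(data), header=None).dtypes

# Override the inferred type of the column labelled 2.
forced = pd.read_fwf(StringIO(data), header=None, dtype={2: "object"}).dtypes

print(inferred)
print(forced)
```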
Indexes#
Files with an “implicit” index column#
Consider a file with one less entry in the header than the number of data columns.
In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame.
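The implicit-index behaviour can be sketched as follows. The data string is a hypothetical stand-in: its header names three columns while every data row carries four fields.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: three header entries, four fields per row.
data = "A,B,C\n2009-01-01,a,1,2\n2009-01-02,b,3,4\n2009-01-03,c,4,5"

df = pd.read_csv(StringIO(data))

# The first field of each row becomes the index; A, B and C label the rest.
print(df)
print(df.index)
```

The index values stay plain strings here, since nothing asked the parser to treat them as dates.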
Note that the dates weren’t automatically parsed. In that case you would need to do as before.
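One way to get a parsed index, assuming "as before" refers to the parse_dates option of read_csv, is sketched below on hypothetical data with ISO-formatted dates in the implicit index column.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with an implicit index column holding ISO dates.
data = "A,B,C\n2009-01-01,a,1,2\n2009-01-02,b,3,4\n2009-01-03,c,4,5"

# Without parse_dates the index keeps the raw strings.
plain = pd.read_csv(StringIO(data))

# parse_dates=True asks the parser to try converting the index to datetimes.
parsed = pd.read_csv(StringIO(data), parse_dates=True)

print(plain.index)
print(parsed.index)
```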
Reading an index with a MultiIndex#
Suppose you have data indexed by two columns.
The index_col argument to read_csv can take a list of column numbers to turn multiple columns into a MultiIndex for the index of the returned object.
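A minimal sketch of a two-column index, using invented column names and values:

```python
from io import StringIO

import pandas as pd

# Hypothetical data indexed by the first two columns (year, indiv).
data = (
    "year,indiv,zit,xit\n"
    "1977,A,1.2,0.6\n"
    "1977,B,1.5,0.5\n"
    "1978,A,0.2,0.06\n"
    "1978,B,0.7,0.2"
)

# A list of column positions turns those columns into a MultiIndex.
df = pd.read_csv(StringIO(data), index_col=[0, 1])

print(df)
print(df.loc[1977])  # select every row of the first index level
```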
Reading columns with a MultiIndex#
By specifying a list of row locations for the header argument, you can read in a MultiIndex for the columns. Specifying non-consecutive rows will skip the intervening rows.
read_csv is also able to interpret a more common format of multi-column indices.
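The following sketch reads two header rows into column levels. The layout (an empty first cell in both header rows, followed by row labels) is an assumed example of the compact format mentioned above.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: two header rows, then rows labelled "one" and "two".
data = ",a,a,b\n,q,r,s\none,1,2,3\ntwo,4,5,6"

# header=[0, 1] stacks the first two rows into a two-level column index.
df = pd.read_csv(StringIO(data), header=[0, 1], index_col=0)

print(df)
print(df.columns)
```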
Note
If an index_col is not specified (e.g. you don’t have an index, or wrote it with df.to_csv(..., index=False)), then any names on the columns index will be lost.
Automatically “sniffing” the delimiter#
read_csv is capable of inferring delimited (not necessarily comma-separated) files, as pandas uses the csv.Sniffer class of the csv module. For this, you have to specify sep=None.
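A small sketch of delimiter sniffing on semicolon-separated text. The data is invented, and engine="python" is passed explicitly here because sniffing is handled by the Python engine.

```python
from io import StringIO

import pandas as pd

# Hypothetical semicolon-delimited data.
data = "name;qty;price\nbolt;10;0.25\nnut;20;0.10"

# sep=None lets the parser detect the delimiter instead of assuming a comma.
df = pd.read_csv(StringIO(data), sep=None, engine="python")

print(df)
```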
Reading multiple files to create a single DataFrame#
It’s best to use concat() to combine multiple files. See the cookbook for an example.
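As a rough illustration, the sketch below combines several CSV sources with pd.concat. In-memory buffers stand in for real file paths, and their contents are made up.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-ins for several files that share the same columns.
sources = [
    StringIO("a,b\n1,2\n3,4"),
    StringIO("a,b\n5,6"),
]

# Read each source and stack the pieces; ignore_index renumbers the rows.
combined = pd.concat((pd.read_csv(src) for src in sources), ignore_index=True)

print(combined)
```

With files on disk, the same pattern applies to a list of paths in place of the buffers.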
Iterating through files chunk by chunk#
Suppose you wish to iterate through a (potentially very large) file lazily rather than reading the entire file into memory.
By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader.
Changed in version 1.2: read_csv/json/sas return a context-manager when iterating through a file.
Specifying iterator=True will also return the TextFileReader object.
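Both ways of obtaining a TextFileReader can be sketched on a small generated table. The ten-row data set and the chunk sizes are arbitrary choices for illustration, and the with-statement form assumes pandas 1.2 or later.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: a header plus ten rows.
data = "a,b\n" + "\n".join(f"{i},{i * i}" for i in range(10))

# chunksize makes read_csv return a reader that yields DataFrames lazily.
with pd.read_csv(StringIO(data), chunksize=4) as reader:
    sizes = [len(chunk) for chunk in reader]

# iterator=True returns the same kind of reader; pull rows on demand.
with pd.read_csv(StringIO(data), iterator=True) as reader:
    first = reader.get_chunk(3)  # the next three rows
    rest = reader.read()         # everything that is left

print(sizes)
print(first)
print(len(rest))
```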
Specifying the parser engine#
pandas currently supports three engines: the C engine, the Python engine, and an experimental pyarrow engine (requires the pyarrow package). In general, the pyarrow engine is fastest on larger workloads and is equivalent in speed to the C engine on most other workloads. The Python engine tends to be slower than the pyarrow and C engines on most workloads. However, the pyarrow engine is much less robust than the C engine, which in turn lacks a few features compared to the Python engine.
Where possible, pandas uses the C parser (specified as engine="c"), but it may fall back to Python if C-unsupported options are specified.
Currently, options unsupported by the C and pyarrow engines include:
sep other than a single character (e.g. regex separators)
skipfooter
sep=None with delim_whitespace=False
Specifying any of the above options will produce a ParserWarning unless the python engine is selected explicitly using engine="python".
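The fallback can be observed with a regex separator, as in this sketch. The data and the separator pattern are invented for illustration; the warning is captured with the standard warnings module rather than printed.

```python
import warnings
from io import StringIO

import pandas as pd

# Hypothetical data separated by semicolons with optional surrounding spaces.
data = "a ; b ; c\n1 ; 2 ; 3\n4 ; 5 ; 6"
pattern = r"\s*;\s*"  # multi-character separator, interpreted as a regex

# No engine given: pandas falls back to the Python engine and warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = pd.read_csv(StringIO(data), sep=pattern)

warned = any(issubclass(w.category, pd.errors.ParserWarning) for w in caught)

# Selecting the Python engine explicitly avoids the warning.
explicit = pd.read_csv(StringIO(data), sep=pattern, engine="python")

print(warned)
print(explicit)
```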
Options that are unsupported by the pyarrow engine which are not covered by the list above include:
float_precision
chunksize
comment
nrows
thousands
memory_map
dialect
warn_bad_lines
error_bad_lines
on_bad_lines
delim_whitespace
quoting
lineterminator
converters
decimal
iterator
dayfirst
infer_datetime_format
verbose
skipinitialspace
2925
Chỉ định các tùy chọn này với
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object2926 sẽ tăng
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object2927
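As a minimal sketch of that behaviour, the snippet below passes iterator=True, one of the options the pyarrow engine does not support, and captures the resulting error. The sample data and variable names are made up for illustration, and the exact error message may differ between pandas versions.

```python
import io

import pandas as pd

# Illustrative in-memory CSV; any readable source would do.
csv_data = io.StringIO("a,b\n1,2\n3,4\n")

# iterator=True is not supported by the pyarrow engine, so pandas is
# expected to reject the call with a ValueError instead of parsing.
try:
    pd.read_csv(csv_data, engine="pyarrow", iterator=True)
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```

Dropping the unsupported option, or switching to the C or Python engine, avoids the error.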
Reading/writing remote files
You can pass a URL to many of pandas' IO functions to read or write remote files. For example, pd.read_csv accepts a URL in place of a local path.
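The sketch below shows the URL form of a read without needing network access: it writes a small tab-separated file to a temporary directory and reads it back through a file:// URL. The file name and contents are made up for illustration; an http:// or https:// URL would be passed to pd.read_csv in exactly the same way.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file standing in for a remote resource.
    tsv_path = Path(tmp_dir) / "items.tsv"
    tsv_path.write_text("item_code\titem_name\nA1\tbread\nA2\tmilk\n")

    # Path.as_uri() produces a file:// URL for the absolute path.
    url = tsv_path.as_uri()

    # The URL goes where a local path would normally go.
    df = pd.read_csv(url, sep="\t")

print(df)
```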
New in version 1.3.0.
A custom header can be sent alongside HTTP(s) requests by passing a dictionary of header key-value mappings to the storage_options keyword argument.
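A minimal sketch of the header mechanism follows. The helper name read_csv_with_headers and the User-Agent value are assumptions made for illustration; the only pandas feature relied on is the storage_options keyword of pd.read_csv.

```python
import pandas as pd


def read_csv_with_headers(url, headers, **read_csv_kwargs):
    """Read a CSV over HTTP(S), sending `headers` with the request.

    `headers` is a plain dict of header names to values; pandas
    forwards it to the request when the URL is an HTTP(S) URL.
    """
    return pd.read_csv(url, storage_options=headers, **read_csv_kwargs)


# Hypothetical header set; servers that reject the default client
# identification are a common reason to override User-Agent.
headers = {"User-Agent": "pandas"}
```

You would then call read_csv_with_headers(url, headers, sep="\t") with the address of the file you want to fetch.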
All URLs which are not local files or HTTP(s) are handled by fsspec, if installed, and its various filesystem implementations (including Amazon S3, Google Cloud, SSH, FTP, webHDFS...). Some of these implementations will require additional packages to be installed; for example, S3 URLs require the s3fs library.
When dealing with remote storage systems, you might need extra configuration with environment variables or config files in special locations. For example, to access data in your S3 bucket, you will need to define credentials in one of the several ways listed in the S3Fs documentation. The same is true for several of the storage backends, and you should follow the links at fsimpl1 for implementations built into fsspec and fsimpl2 for those not included in the main fsspec distribution.
You can also pass parameters directly to the backend driver. For example, if you do not have S3 credentials, you can still access public data by specifying an anonymous connection with storage_options={"anon": True}.
New in version 1.2.0.
fsspec also allows complex URLs, for accessing data in compressed archives, local caching of files, and more. To cache an anonymous S3 read locally, you would chain a caching protocol such as simplecache:: in front of the URL and nest the "anon" parameter under an "s3" key in storage_options. This specifies that the "anon" parameter is meant for the "s3" part of the implementation, not for the caching implementation. Note that this caches to a temporary directory for the duration of the session only, but you can also specify a permanent store.
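The difference between the plain and the cached call can be sketched as follows. The helper names public_s3_read_args and read_public_s3_csv are hypothetical, and actually running the read would need s3fs installed and a real public bucket; only the URL prefix and the shape of storage_options are the point here.

```python
import pandas as pd


def public_s3_read_args(s3_url, cache_locally=False):
    """Return (url, storage_options) for an anonymous S3 read."""
    if not s3_url.startswith("s3://"):
        raise ValueError("expected an s3:// URL")
    if cache_locally:
        # Chained URL: options are keyed by protocol, so "anon" reaches
        # the "s3" layer rather than the caching layer.
        return "simplecache::" + s3_url, {"s3": {"anon": True}}
    # Single backend: options are passed to it directly.
    return s3_url, {"anon": True}


def read_public_s3_csv(s3_url, cache_locally=False, **read_csv_kwargs):
    """Read a public CSV from S3, optionally through a local cache."""
    url, storage_options = public_s3_read_args(s3_url, cache_locally)
    return pd.read_csv(url, storage_options=storage_options, **read_csv_kwargs)
```

Keeping the argument construction in its own function makes it easy to inspect what pandas will receive before any network request is made.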
Writing out data
Writing to CSV format
The Series and DataFrame objects have an instance method to_csv which allows storing the contents of the object as a comma-separated-values file. The function takes a number of arguments. Only the first is required.
path_or_buf: A string path to the file to write, or a file object. If a file object, it must be opened with newline=''
sep: Field delimiter for the output file (default ",")
na_rep: A string representation of a missing value (default '')
float_format: Format string for floating point numbers
columns: Columns to write (default None)
header: Whether to write out the column names (default True)
index: Whether to write row (index) names (default True)
index_label: Column label(s) for index column(s) if desired. If None (default), and header and index are True, then the index names are used. (A sequence should be given if the DataFrame uses MultiIndex)
mode: Python write mode, default 'w'
encoding: A string representing the encoding to use if the contents are non-ASCII, for Python versions prior to 3
lineterminator: Character sequence denoting line end (default os.linesep)
quoting: Set quoting rules as in the csv module (default csv.QUOTE_MINIMAL). Note that if you have set a float_format, then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric
quotechar: Character used to quote fields (default '"')
doublequote: Control quoting of quotechar in fields (default True)
escapechar: Character used to escape sep and quotechar when appropriate (default None)
chunksize: Number of rows to write at a time
date_format: Format string for datetime objects
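A short sketch of several of these arguments working together is given below. The frame contents and variable names are made up for illustration. The second half illustrates the quoting note: once float_format is set, floats are written as strings, so csv.QUOTE_NONNUMERIC quotes them.

```python
import csv
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, np.nan]})

# Write to an in-memory buffer instead of a file path, with a custom
# separator, a marker for missing values and two-decimal floats.
buf = io.StringIO()
df.to_csv(buf, sep=";", na_rep="NA", float_format="%.2f", index=False)
rows = buf.getvalue().splitlines()

# With path_or_buf omitted, to_csv returns the CSV text as a string.
plain = df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC)
formatted = df.to_csv(
    index=False, quoting=csv.QUOTE_NONNUMERIC, float_format="%.2f"
)

print(rows)
print(plain)
print(formatted)
```

Comparing plain and formatted shows the float column switching from unquoted numbers to quoted strings.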
Writing a formatted string
The DataFrame object has an instance method to_string which allows control over the string representation of the object. All arguments are optional.
buf: default None, for example a StringIO object
columns: default None, which columns to write
col_space: default None, minimum width of each column
na_rep: default NaN, representation of NA value
formatters: default None, a dictionary (by column) of functions each of which takes a single argument and returns a formatted string
float_format: default None, a function which takes a single (float) argument and returns a formatted string; to be applied to floats in the DataFrame
sparsify: default True, set to False for a DataFrame with a hierarchical index to print every MultiIndex key at each row
index_names: default True, will print the names of the indices
index: default True, will print the index (i.e., row labels)
header: default True, will print the column labels
justify: default left, will print column headers left- or right-justified
The Series object also has a to_string method, but with only the buf, na_rep, and float_format arguments. There is also a length argument which, if set to True, will additionally output the length of the Series.
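A minimal sketch of these options, assuming made-up sample data (the frame, the series, and the variable names below are illustrative only): to_string can hide the row labels and substitute a marker for missing values, and Series.to_string can append the length.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, None]})
s = pd.Series([10, 20, 30])

# Hide the row labels and choose how missing values are shown.
frame_text = df.to_string(index=False, na_rep="-")

# length=True appends a "Length: ..." footer to the Series output.
series_text = s.to_string(length=True)

print(frame_text)
print(series_text)
```

Because both calls return plain strings when buf is left unset, the result can be logged or written to any text sink.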
JSON#
Read and write JSON format files and strings.
Writing JSON#
A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:
path_or_buf: the pathname or buffer to write the output. This can be None, in which case a JSON string is returned.
orient:
  Series: default is index; allowed values are {split, records, index}
  DataFrame: default is columns; allowed values are {split, records, index, columns, values, table}
  The format of the JSON string:
    split: dict like {index -> [index], columns -> [columns], data -> [values]}
    records: list like [{column -> value}, … , {column -> value}]
    index: dict like {index -> {column -> value}}
    columns: dict like {column -> {index -> value}}
    values: just the values array
    table: adhering to the JSON Table Schema
date_format: string, type of date conversion; 'epoch' for timestamp, 'iso' for ISO8601.
double_precision: the number of decimal places to use when encoding floating point values, default 10.
force_ascii: force encoded string to be ASCII, default True.
date_unit: the time unit to encode to; governs timestamp and ISO8601 precision. One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default 'ms'.
default_handler: the handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
lines: if the orient is records, write each record on its own line as JSON.
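A few of these parameters can be sketched as follows. The sample frame and variable names are assumptions made for illustration; only to_json and the parameters listed above come from the text.

```python
import json

import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"a": [1, 2], "b": [0.123456, 2.5]})

# path_or_buf is left as None, so a JSON string is returned.
as_columns = df.to_json()

# double_precision limits the decimal places of encoded floats.
rounded = df.to_json(double_precision=2)

# lines=True with the records orient writes one JSON record per line.
as_lines = df.to_json(orient="records", lines=True)
records = [json.loads(line) for line in as_lines.splitlines() if line]

print(as_columns)
print(rounded)
print(as_lines)
```

Passing a path or an open buffer as path_or_buf would write the same text to that destination instead of returning it.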
Note
NaN's, NaT's and None will be converted to null, and datetime objects will be converted based on the date_format and date_unit parameters.
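The null conversion can be checked with a small sketch. The column names and values are hypothetical, and date_format is passed explicitly so the example does not depend on the library default.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical frame containing one of each kind of missing value.
df = pd.DataFrame(
    {
        "num": [1.5, np.nan],
        "when": [pd.Timestamp("2013-01-01"), pd.NaT],
        "text": ["x", None],
    }
)

encoded = df.to_json(date_format="iso")

# Decoding with the standard library turns JSON null into Python None.
decoded = json.loads(encoded)

print(encoded)
```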
Orient options#
There are a number of different options for the format of the resulting JSON file / string, both for a DataFrame and for a Series.
Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting as the primary index.
Index oriented (the default for Series) is similar to column oriented, but the index labels are now primary.
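The column and index orientations can be compared on a small example. The data and the names dfjo and sjo below are illustrative assumptions, and the JSON is decoded with the standard library so the nesting is easy to inspect.

```python
import json

import pandas as pd

# Hypothetical sample data, for illustration only.
dfjo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("xyz"))
sjo = pd.Series({"x": 15, "y": 16, "z": 17}, name="D")

# Column labels are the outer keys.
by_column = json.loads(dfjo.to_json(orient="columns"))

# Index labels are the outer keys.
by_index = json.loads(dfjo.to_json(orient="index"))

# A Series uses the index orientation when orient is not given.
series_default = json.loads(sjo.to_json())

print(by_column)
print(by_index)
print(series_default)
```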
Record oriented serializes the data to a JSON array of column -> value records; index labels are not included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js.
Value oriented is a bare-bones option which serializes to nested JSON arrays of values only; column and index labels are not included.
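A short sketch of the record and value orientations, again on hypothetical data. Both drop the index labels, and the value orientation drops the column labels as well.

```python
import json

import pandas as pd

# Hypothetical sample data, for illustration only.
dfjo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("xyz"))

# One {column -> value} object per row; the index labels are dropped.
as_records = json.loads(dfjo.to_json(orient="records"))

# Nested arrays of values only; no labels at all.
as_values = json.loads(dfjo.to_json(orient="values"))

print(as_records)
print(as_values)
```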
Split oriented serializes to a JSON object containing separate entries for values, index and columns. The name is also included for Series.
Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including, but not limited to, dtypes and index names.
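The split and table orientations might look like this on illustrative data (the frame, the series, and the variable names are assumptions). Only the top-level structure is inspected, since the finer details of the schema can vary between pandas versions.

```python
import json

import pandas as pd

# Hypothetical sample data, for illustration only.
dfjo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("xyz"))
sjo = pd.Series({"x": 15, "y": 16, "z": 17}, name="D")

# Separate entries for columns, index and data.
split_frame = json.loads(dfjo.to_json(orient="split"))

# For a Series, the name is stored alongside index and data.
split_series = json.loads(sjo.to_json(orient="split"))

# The table orientation pairs a schema with the row data.
table = json.loads(dfjo.to_json(orient="table"))
field_names = [field["name"] for field in table["schema"]["fields"]]

print(split_frame)
print(split_series)
print(field_names)
```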
Note
Any orient option that encodes to a JSON object will not preserve the ordering of index and column labels during round-trip serialization. If you wish to preserve label ordering, use the split option, as it uses ordered containers.
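A round trip through the split orientation can be sketched as below. The labels are deliberately out of alphabetical order, and both the data and the variable names are hypothetical. pd.read_json is given a StringIO wrapper around the encoded string.

```python
from io import StringIO

import pandas as pd

# Hypothetical frame whose labels are not in sorted order.
original = pd.DataFrame(
    {"beta": [1, 2, 3], "alpha": [4, 5, 6]},
    index=["row_z", "row_x", "row_y"],
)

# Encode with the split orientation, then decode with the same orientation.
encoded = original.to_json(orient="split")
restored = pd.read_json(StringIO(encoded), orient="split")

print(list(restored.columns))
print(list(restored.index))
```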
Xử lý ngày#
Viết ở định dạng ngày ISO
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object318
Writing in ISO date format with microseconds additionally takes `date_unit="us"`.
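The same idea with microsecond precision, again as a sketch on a hypothetical frame `dfd`:

```python
import pandas as pd

# Hypothetical frame with a datetime column.
dfd = pd.DataFrame({"when": pd.to_datetime(["2013-01-01", "2013-01-02"])})

# date_unit="us" asks for six fractional digits in the ISO strings.
iso_us_json = dfd.to_json(date_format="iso", date_unit="us")
print(iso_us_json)
```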
Epoch timestamps in seconds are written with `date_format="epoch"` and `date_unit="s"`.
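A short sketch of epoch output on a hypothetical frame `dfd`. Some recent pandas versions may emit a deprecation warning for the epoch format, so treat this as illustrative rather than recommended.

```python
import json

import pandas as pd

# Hypothetical frame with a datetime column.
dfd = pd.DataFrame({"when": pd.to_datetime(["2013-01-01", "2013-01-02"])})

# Datetimes become integer counts of seconds since 1970-01-01 (UTC).
epoch_json = dfd.to_json(date_format="epoch", date_unit="s")

first_value = json.loads(epoch_json)["when"]["0"]
expected = int(pd.Timestamp("2013-01-01").timestamp())
print(first_value, expected)
```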
Writing to a file, with a date index and a date column, works by passing a path to `to_json`.
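As a sketch, the snippet below writes a hypothetical frame `dfj` to a temporary file and reads it back. The file name, column names, and the choice of `date_format="iso"` are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a DatetimeIndex and a datetime column.
dfj = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0]},
    index=pd.date_range("2013-01-01", periods=3),
)
dfj["date"] = pd.Timestamp("2013-01-01")

# Illustrative location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "test.json")
dfj.to_json(path, date_format="iso")

with open(path) as fh:
    text = fh.read()

# Reading the file back yields a frame of the same shape.
restored = pd.read_json(path)
print(restored.shape)
```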
Fallback behavior#
If the JSON serializer cannot handle the container contents directly, it will fall back in the following manner:
- If the dtype is unsupported (e.g. a complex dtype such as `np.complex128`), then the `default_handler`, if provided, will be called for each value; otherwise an exception is raised.
- If an object is unsupported, it will attempt the following:
    - Check whether the object has defined a `toDict` method and call it. A `toDict` method should return a `dict`, which will then be JSON serialized.
    - Invoke the `default_handler` if one was provided.
    - Convert the object to a `dict` by traversing its contents. However, this will often fail with an `OverflowError` or give unexpected results.
In general, the best approach for unsupported objects or dtypes is to provide a `default_handler`. For example, a frame containing complex values, which cannot be serialized by default, can be handled by specifying a simple `default_handler`.
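A minimal sketch of that approach, assuming `str` is an acceptable handler for the data at hand (the frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame; mixing in a complex number gives a complex dtype,
# which the JSON serializer does not support natively.
df = pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)])

# default_handler is called for each value the serializer cannot encode;
# here every value is simply converted to its string form.
out = df.to_json(default_handler=str)
print(out)
```

A handler may return any JSON-serializable value, so a function that returns, say, a two-element list of real and imaginary parts would work equally well.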
Reading JSON#
Reading a JSON string to a pandas object can take a number of parameters. The parser will try to parse a `DataFrame` if `typ` is not supplied or is `None`. To explicitly force `Series` parsing, pass `typ="series"`.
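The difference between the two modes can be sketched as follows. The JSON payloads are made up for the example, and the text is wrapped in `StringIO` on the assumption of a recent pandas version.

```python
from io import StringIO

import pandas as pd

# Without typ, the parser builds a DataFrame.
frame = pd.read_json(StringIO('{"x": {"0": 1, "1": 2}}'))

# With typ="series", a flat JSON object is parsed into a Series.
series = pd.read_json(StringIO('{"a": 1, "b": 2}'), typ="series")

print(type(frame).__name__, type(series).__name__)
```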
- `filepath_or_buffer` : a VALID JSON string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json
- `typ` : type of object to recover (series or frame), default 'frame'
- `orient` :
    - `Series`:
        - default is `index`
        - allowed values are {`split`, `records`, `index`}
    - `DataFrame`:
        - default is `columns`
        - allowed values are {`split`, `records`, `index`, `columns`, `values`, `table`}
    - The format of the JSON string:
        - `split` : dict like {index -> [index], columns -> [columns], data -> [values]}
        - `records` : list like [{column -> value}, … , {column -> value}]
        - `index` : dict like {index -> {column -> value}}
        - `columns` : dict like {column -> {index -> value}}
        - `values` : just the values array
        - `table` : adhering to the JSON Table Schema
- `dtype` : if `True`, infer dtypes; if a dict of column to dtype, then use those; if `False`, then don't infer dtypes at all. Default is `True`. Applies only to the data.
- `convert_axes` : boolean, try to convert the axes to the proper dtypes, default is `True`
- `convert_dates` : a list of columns to parse for dates; if `True`, then try to parse date-like columns. Default is `True`.
- `keep_default_dates` : boolean, default `True`. If parsing dates, then parse the default date-like columns.
- `numpy` : direct decoding to NumPy arrays. Default is `False`. Supports numeric data only, although labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if `numpy=True`.
- `precise_float` : boolean, default `False`. Set to enable usage of the higher precision (strtod) function when decoding strings to double values. The default (`False`) is to use fast but less precise builtin functionality.
- `date_unit` : string, the timestamp unit to detect if converting dates. Default `None`. By default the timestamp precision will be detected; if this is not desired, then pass one of 's', 'ms', 'us' or 'ns' to force timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.
- `lines` : reads file as one JSON object per line.
- `encoding` : the encoding to use to decode py3 bytes.
- `chunksize` : when used in combination with `lines=True`, return a JsonReader which reads in `chunksize` lines per iteration.
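To illustrate the last few parameters, here is a small sketch that reads line-delimited JSON, first in one go and then in chunks. The input text is invented for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical line-delimited JSON: one object per line.
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}\n'

# lines=True parses each line as a separate record.
df = pd.read_json(StringIO(jsonl), lines=True)

# Adding chunksize returns an iterator that yields frames of up to
# chunksize rows each instead of a single frame.
reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=2)
chunk_sizes = [len(chunk) for chunk in reader]

print(df.shape, chunk_sizes)
```

Chunked reading is mostly useful when the file is too large to hold in memory at once.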
The parser will raise one of `ValueError`/`TypeError`/`AssertionError` if the JSON is not parseable.
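A defensive read might therefore look like the sketch below. The truncated payload is a made-up example of unparseable input.

```python
from io import StringIO

import pandas as pd

bad_payload = '{"a": [1, 2'  # deliberately truncated JSON

caught = False
try:
    pd.read_json(StringIO(bad_payload))
except (ValueError, TypeError, AssertionError) as exc:
    # Catch the documented exception types rather than a bare Exception.
    caught = True
    print(f"could not parse: {type(exc).__name__}")
```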
If a non-default `orient` was used when encoding to JSON, be sure to pass the same option here so that decoding produces sensible results; see Orient Options for an overview.
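The sketch below round-trips a small hypothetical frame through each orientation, passing the same `orient` on both sides. It is only meant to show the shape of each format.

```python
from io import StringIO

import pandas as pd

# Hypothetical frame used only for illustration.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

orients = ["split", "records", "index", "columns", "values", "table"]

# Encode with each orient, then decode with the matching orient.
payloads = {o: df.to_json(orient=o) for o in orients}
restored = {
    o: pd.read_json(StringIO(p), orient=o) for o, p in payloads.items()
}

for o in orients:
    print(o, payloads[o])
```

Note that `records` and `values` do not store the index labels, so frames restored from them get a default integer index.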
Data conversion
The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and all of the data, into appropriate types, including dates. If you need to override specific dtypes, pass a dict to dtype. convert_axes should only be set to False if you need to preserve string-like numbers (e.g. '1', '2') in an axes.
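The effect of convert_axes can be sketched as follows. The payload is invented for illustration; with convert_axes=False the labels come back exactly as they were written in the JSON, while the default call is free to convert them.

```python
from io import StringIO

import pandas as pd

# Column and index labels are string-like numbers.
payload = '{"1": {"10": "x", "20": "y"}, "2": {"10": "z", "20": "w"}}'

# Default: labels that look numeric are usually converted to numbers.
converted = pd.read_json(StringIO(payload))

# convert_axes=False keeps the labels as the strings found in the JSON.
preserved = pd.read_json(StringIO(payload), convert_axes=False)

print(converted.columns.tolist(), converted.index.tolist())
print(preserved.columns.tolist(), preserved.index.tolist())
```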
Note
Large integer values may be converted to dates if convert_dates=True and the data and / or column labels appear 'date-like'. The exact threshold depends on the date_unit specified. 'date-like' means that the column label meets one of the following criteria:
it ends with '_at'
it ends with '_time'
it begins with 'timestamp'
it is 'modified'
it is 'date'
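A small sketch of the date-like rule, assuming a made-up payload in which two columns hold the same large integer. Only the column whose label ends in '_at' is a candidate for date conversion; date_unit="ms" is passed explicitly here so that the integers are interpreted as millisecond epochs.

```python
from io import StringIO

import pandas as pd

# Both columns hold the same epoch-milliseconds integer (2020-01-01).
payload = '[{"created_at": 1577836800000, "count": 1577836800000}]'

# "created_at" is date-like by its label; "count" is not.
converted = pd.read_json(StringIO(payload), date_unit="ms")

# Turning date conversion off leaves both columns as integers.
untouched = pd.read_json(StringIO(payload), convert_dates=False)

print(converted.dtypes)
print(untouched.dtypes)
```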
Warning
When reading JSON data, automatic coercing into dtypes has some quirks:
an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
a column that was float data will be converted to integer if it can be done safely, e.g. a column of 1.
bool columns will be converted to integer on reconstruction
Thus there are times where you may want to specify specific dtypes via the dtype keyword argument.
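The float-to-integer quirk and the dtype workaround can be sketched like this. The payload is illustrative; column "a" is pinned to float64 through the dtype dict, while column "b" is left to the default inference.

```python
from io import StringIO

import pandas as pd

# Whole-number floats, as they might be written by another program.
payload = '{"a": {"0": 1.0, "1": 2.0}, "b": {"0": 3.0, "1": 4.0}}'

# Default inference coerces whole-number floats to integers.
inferred = pd.read_json(StringIO(payload))

# A dict passed to dtype pins only the listed columns.
explicit = pd.read_json(StringIO(payload), dtype={"a": "float64"})

print(inferred.dtypes)
print(explicit.dtypes)
```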
Common ways of calling read_json with these options:
Reading from a JSON string
Reading from a file
Don't convert any data (but still convert axes and dates)
Specify dtypes for conversion
Preserve string indices
Dates written in nanoseconds need to be read back in nanoseconds
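A rough sketch of several of these cases, using small invented frames and a temporary file (the file name test.json is arbitrary). It is meant as an illustration under those assumptions rather than a reproduction of the original examples.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# Reading from a JSON string (wrapped in StringIO).
from_string = pd.read_json(StringIO(df.to_json()))

# Reading from a file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.json")
    df.to_json(path)
    from_file = pd.read_json(path)

# dtype=False: do not convert the data (axes and dates are still converted).
raw = pd.read_json(StringIO('{"a": {"0": 1.0, "1": 2.0}}'), dtype=False)

# Dates written in nanoseconds are read back with date_unit="ns".
dates = pd.DataFrame({"date": pd.to_datetime(["2013-01-01", "2013-01-02"])})
ns_text = dates.to_json(date_format="epoch", date_unit="ns")
from_ns = pd.read_json(StringIO(ns_text), date_unit="ns")
```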
The Numpy parameter
Note
This param has been deprecated as of version 1.0.0 and will raise a FutureWarning.
This supports numeric data only. Index and columns labels may be non-numeric, e.g. strings, dates etc.
If numpy=True is passed to read_json an attempt will be made to sniff an appropriate dtype during deserialization and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python objects.
This can provide speedups if you are deserialising a large amount of numeric data. The speedup is less noticeable for smaller datasets.
Warning
Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected output if these assumptions are not satisfied:
data is numeric.
data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised, or incorrect output may be produced, if this condition is not satisfied.
labels are ordered. Labels are only read from the first container; it is assumed that each subsequent row / column has been encoded in the same order. This should be satisfied if the data was encoded using to_json but may not be the case if the JSON is from another source.
Normalization
pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table.
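The utility in question is pandas.json_normalize. A minimal sketch with invented records: nested dicts become dotted column names, and record_path together with meta expands a nested list into one row per element.

```python
import pandas as pd

# Nested dicts are flattened into dotted column names.
people = [
    {"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
    {"id": 2, "name": {"first": "Mark", "last": "Regner"}},
]
flat = pd.json_normalize(people)

# record_path expands a nested list; meta carries parent fields along.
states = [
    {
        "state": "Florida",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
        ],
    },
    {
        "state": "Ohio",
        "info": {"governor": "John Kasich"},
        "counties": [{"name": "Summit", "population": 1234}],
    },
]
counties = pd.json_normalize(
    states, record_path="counties", meta=["state", ["info", "governor"]]
)
print(flat)
print(counties)
```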
The max_level parameter provides more control over the level at which normalization ends. With max_level=1, normalization stops at the first nesting level of the provided dict.
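A sketch of max_level with an invented record: with max_level=1 the second-level dict under "Lookup" is kept as a dict value, whereas the default flattens it fully.

```python
import pandas as pd

data = [
    {
        "CreatedBy": {"Name": "User001"},
        "Lookup": {
            "TextField": "Some text",
            "UserField": {"Id": "ID001", "Name": "Name001"},
        },
        "Image": {"a": "b"},
    }
]

# Stop after the first nesting level; deeper dicts stay as dict values.
level1 = pd.json_normalize(data, max_level=1)

# Default: flatten every level.
full = pd.json_normalize(data)

print(level1.columns.tolist())
print(full.columns.tolist())
```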
Line delimited json
pandas is able to read and write line-delimited json files that are common in data processing pipelines using Hadoop or Spark.
For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or to read from a stream.
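A minimal sketch of both modes, using a three-line in-memory document in place of a real file. With chunksize set, read_json returns a reader that yields one DataFrame per chunk.

```python
from io import StringIO

import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}'

# Read the whole line-delimited document at once.
whole = pd.read_json(StringIO(jsonl), lines=True)

# Write it back out, one record per line.
written = whole.to_json(orient="records", lines=True)

# Read two lines per iteration instead of loading everything.
with pd.read_json(StringIO(jsonl), lines=True, chunksize=2) as reader:
    chunks = [chunk for chunk in reader]

for chunk in chunks:
    print(chunk)
```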
Table schema
Table Schema is a spec for describing tabular datasets as a JSON object. The JSON includes information on the field names, types, and other attributes. You can use the orient table to build a JSON string with two fields, schema and data.
The schema field contains the fields key, which itself contains a list of column name to type pairs, including the Index or MultiIndex (see below for a list of types). The schema field also contains a primaryKey field if the (Multi)index is unique.
The second field, data, contains the serialized data with the records orient. The index is included, and any datetimes are ISO 8601 formatted, as required by the Table Schema spec.
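To make the layout concrete, here is a small sketch that serializes an invented frame with orient="table" and inspects the result with the standard json module. The column and index names are arbitrary.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [0.5, 1.5, 2.5], "C": [True, False, True]},
    index=pd.Index([10, 20, 30], name="idx"),
)

parsed = json.loads(df.to_json(orient="table"))

# Two top-level fields: "schema" and "data".
print(sorted(parsed))

# Column name -> Table Schema type, including the index.
field_types = {f["name"]: f["type"] for f in parsed["schema"]["fields"]}
print(field_types)

# The index is unique, so it is reported as the primary key.
print(parsed["schema"]["primaryKey"])

# "data" holds the rows in records orient, index included.
print(parsed["data"][0])
```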
The full list of supported types is described in the Table Schema spec. This table shows the mapping from pandas types:
pandas type        Table Schema type
int64              integer
float64            number
bool               boolean
datetime64[ns]     datetime
timedelta64[ns]    duration
categorical        any
object             str
A few notes on the generated table schema
The
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object
3716 object contains aIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object
3727 field. This contains the version of pandas’ dialect of the schema, and will be incremented with each revisionAll dates are converted to UTC when serializing. Even timezone naive values, which are treated as UTC with an offset of 0
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object
311datetimes with a timezone [before serializing], include an additional field
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object
3728 with the time zone name [e. g.In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object
3729]In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object
Periods are converted to timestamps before serialization, and so are likewise converted to UTC. In addition, periods contain an additional field freq with the period's frequency, e.g. 'A-DEC'.
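As a minimal sketch of the two rules above, the schema for a timezone-aware column and a period column can be inspected with build_table_schema. The column names and data here are illustrative assumptions, and a daily frequency is used instead of an annual one only to keep the example stable across pandas versions.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# Illustrative data: one timezone-aware column and one period column.
df_time = pd.DataFrame(
    {
        "when": pd.date_range("2016-01-01", periods=3, tz="US/Central"),
        "span": pd.period_range("2016-01-01", periods=3, freq="D"),
    }
)

# index=False keeps the schema limited to the two data columns.
schema = build_table_schema(df_time, index=False)
fields = {field["name"]: field for field in schema["fields"]}

print(fields["when"])  # the datetime field should carry a "tz" entry
print(fields["span"])  # the period field should carry a "freq" entry
```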
Categoricals use the any type and an enum constraint listing the set of possible values. Additionally, an ordered field is included.
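A short sketch of the categorical rule, using a small made-up Series; the variable names are assumptions for illustration only.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# A categorical Series with two categories, unordered by default.
s_cat = pd.Series(pd.Categorical(["a", "b", "a"]))

schema = build_table_schema(s_cat)
# fields[0] describes the index; fields[1] describes the values.
cat_field = schema["fields"][1]
print(cat_field)
```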
A primaryKey field, containing an array of labels, is included if the index is unique.
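The uniqueness condition can be checked with a sketch like the following, which compares a Series with a unique index against one with duplicated labels (both made up for illustration).

```python
import pandas as pd
from pandas.io.json import build_table_schema

unique_index = pd.Series([10, 20], index=[1, 2])
duplicated_index = pd.Series([10, 20], index=[1, 1])

unique_schema = build_table_schema(unique_index)
duplicated_schema = build_table_schema(duplicated_index)

# Only the schema built from the unique index should list a primary key.
print(unique_schema.get("primaryKey"))
print(duplicated_schema.get("primaryKey"))
```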
The primaryKey behavior is the same with MultiIndexes, but in this case the primaryKey is an array with one entry per index level.
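A minimal sketch of the MultiIndex case, assuming two named levels (the names "letter" and "number" are invented for this example).

```python
import pandas as pd
from pandas.io.json import build_table_schema

index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1]], names=["letter", "number"]
)
s_multi = pd.Series([1, 2, 3, 4], index=index)

multi_schema = build_table_schema(s_multi)
# The primary key lists every level name of the (unique) MultiIndex.
primary_key = list(multi_schema["primaryKey"])
print(primary_key)
```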
The default naming roughly follows these rules:
For Series, the object.name is used. If that is None, the name is values.
For DataFrames, the stringified version of the column name is used.
For Index (not MultiIndex), index.name is used, with a fallback to index if that is None.
For MultiIndex, mi.names is used. If any level has no name, level_<i> is used.
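The naming rules above can be observed with a sketch along these lines. The helper field_names is a hypothetical convenience function written for this example, not part of pandas.

```python
import pandas as pd
from pandas.io.json import build_table_schema


def field_names(obj):
    """Hypothetical helper: list the field names in the generated schema."""
    return [field["name"] for field in build_table_schema(obj)["fields"]]


unnamed = pd.Series([1, 2])
named = pd.Series([1, 2], name="score", index=pd.Index(["x", "y"], name="key"))
multi = pd.Series(
    [1, 2, 3, 4], index=pd.MultiIndex.from_product([["a", "b"], [0, 1]])
)

print(field_names(unnamed))  # index and values fall back to default names
print(field_names(named))    # explicit names are used as given
print(field_names(multi))    # unnamed levels get positional level names
```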
read_json also accepts orient='table' as an argument. This allows metadata such as dtypes and index names to be preserved in a round-trippable manner.
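A sketch of such a round trip, assuming an in-memory buffer rather than a file on disk; the frame contents are made up for illustration.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {
        "foo": [1, 2, 3, 4],
        "baz": pd.date_range("2018-01-01", periods=4, freq="D"),
        "qux": pd.Categorical(["a", "b", "c", "c"]),
    },
    index=pd.Index(range(4), name="idx"),
)

# Serialize with the table schema embedded, then read it back.
payload = df.to_json(orient="table")
new_df = pd.read_json(StringIO(payload), orient="table")

print(new_df.index.name)
print(new_df.dtypes)
```

Because the schema travels with the data, the index name and the categorical and datetime dtypes should not need to be re-declared on the reading side.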
Please note that the literal string 'index' as the name of an Index is not round-trippable, nor are any names beginning with 'level_' within a MultiIndex. These are used by default in DataFrame.to_json() to indicate missing values, and the subsequent read cannot distinguish the intent.
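The limitation can be seen in a sketch like this one, where an index deliberately named 'index' loses its name after a round trip. Warnings are silenced here because pandas is expected to warn about exactly this situation.

```python
import warnings
from io import StringIO

import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 20], name="index"))

with warnings.catch_warnings():
    # pandas may warn that this index name is not round-trippable.
    warnings.simplefilter("ignore")
    payload = df.to_json(orient="table")

restored = pd.read_json(StringIO(payload), orient="table")
print(restored.index.name)  # the original name is not recovered
```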
When using orient='table' along with a user-defined ExtensionArray, the generated schema will contain an additional extDtype key in the respective fields element. This extra key is not standard but does enable JSON round trips for extension types (e.g. read_json(df.to_json(orient="table"), orient="table")).
The extDtype key carries the name of the extension. If you have properly registered the ExtensionDtype, pandas will use that name to perform a lookup into the registry and re-convert the serialized data into your custom dtype.
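As a rough illustration, the built-in nullable "Int64" dtype can stand in for a user-defined extension type, since it is already registered with pandas. This substitution is an assumption made to keep the sketch self-contained, and it presumes a pandas version recent enough to emit the extDtype key.

```python
from io import StringIO

import pandas as pd
from pandas.io.json import build_table_schema

# "Int64" is a registered extension dtype shipped with pandas.
df_ext = pd.DataFrame({"n": pd.array([1, 2, 3], dtype="Int64")})

ext_field = build_table_schema(df_ext, index=False)["fields"][0]
print(ext_field)  # should include an "extDtype" entry naming the dtype

# Round trip: the name stored in extDtype is used to restore the dtype.
payload = df_ext.to_json(orient="table")
restored_ext = pd.read_json(StringIO(payload), orient="table")
print(restored_ext["n"].dtype)
```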
HTML
Reading HTML content
Warning
We highly encourage you to read the HTML Table Parsing gotchas below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into a list of pandas DataFrames. Let's look at a few examples.
Note
read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content.
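A minimal offline sketch of this behavior, using a small made-up HTML fragment instead of a live URL. It assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>name</th><th>qty</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>5</td></tr>
  </tbody>
</table>
"""

tables = pd.read_html(StringIO(html))
# Even with one table in the document, the result is a list.
first = tables[0]
print(type(tables), len(tables))
print(first)
```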
Read a URL with no options.
Note
The data from the above URL changes every Monday, so the resulting data above may be slightly different.
Read in the content of the file from the above URL and pass it to read_html as a string.
You can even pass in an instance of StringIO if you so desire.
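The sketch below compares reading the same made-up HTML from a file path and from a StringIO buffer. The temporary file name is an arbitrary choice for this example, and an installed parser such as lxml is assumed. Recent pandas versions discourage passing literal HTML text directly, so wrapping it in StringIO is the safer habit.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>name</th><th>qty</th></tr></thead>
  <tbody><tr><td>apple</td><td>3</td></tr></tbody>
</table>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    # A path to an existing file is read from disk.
    from_path = pd.read_html(path)[0]

# The same markup held in memory, wrapped in a file-like buffer.
from_buffer = pd.read_html(StringIO(html))[0]
print(from_path.equals(from_buffer))
```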
Note
The following examples are not run by the IPython evaluator because having so many network-accessing functions slows down the documentation build. If you spot an error or an example that doesn't run, please report it on the pandas GitHub issues page.
Read a URL and match a table that contains specific text.
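An offline sketch of text matching, using two invented tables in one document rather than a URL; the match argument keeps only tables whose text matches the given pattern. A parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>Year</th><th>Revenue</th></tr></thead>
  <tbody><tr><td>2020</td><td>100</td></tr></tbody>
</table>
<table>
  <thead><tr><th>City</th><th>Population</th></tr></thead>
  <tbody><tr><td>Oslo</td><td>700000</td></tr></tbody>
</table>
"""

# Only the table containing the text "Revenue" should be returned.
matched = pd.read_html(StringIO(html), match="Revenue")
print(len(matched))
print(matched[0])
```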
Specify a header row (by default, <th> or <td> elements located within a <thead> are used to form the column index; if multiple rows are contained within <thead>, a MultiIndex is created).
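The following sketch contrasts a table without a thead element, where header=0 promotes the first row to column labels, with a table whose thead holds two rows. Both fragments are invented for illustration and a parser such as lxml is assumed.

```python
from io import StringIO

import pandas as pd

no_thead = """
<table>
  <tr><td>name</td><td>qty</td></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""

two_row_thead = """
<table>
  <thead>
    <tr><th>fruit</th><th>fruit</th></tr>
    <tr><th>name</th><th>qty</th></tr>
  </thead>
  <tbody><tr><td>apple</td><td>3</td></tr></tbody>
</table>
"""

# Without a header hint, the first row is treated as ordinary data.
raw = pd.read_html(StringIO(no_thead))[0]
# header=0 uses the first row as the column labels instead.
labelled = pd.read_html(StringIO(no_thead), header=0)[0]
# Two header rows inside <thead> should yield MultiIndex columns.
nested = pd.read_html(StringIO(two_row_thead))[0]

print(raw.shape, list(labelled.columns), nested.columns.nlevels)
```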
Specify an index column.
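A small sketch of index_col on an invented table, assuming a parser such as lxml is installed: the first column becomes the index instead of a regular column.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>name</th><th>qty</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>5</td></tr>
  </tbody>
</table>
"""

# Use the first column ("name") as the row index.
indexed = pd.read_html(StringIO(html), index_col=0)[0]
print(indexed)
```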
Specify a number of rows to skip.
Specify a number of rows to skip using a list (range works as well).
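The three spellings of skiprows can be compared with a sketch like this. The table is invented: it has no thead, and its first row is a junk line that should be dropped before the header row is read. A parser such as lxml is assumed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><td>generated</td><td>2024-01-01</td></tr>
  <tr><td>name</td><td>qty</td></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""

# Skip the first row, then treat the next row as the header.
by_count = pd.read_html(StringIO(html), skiprows=1, header=0)[0]
by_list = pd.read_html(StringIO(html), skiprows=[0], header=0)[0]
by_range = pd.read_html(StringIO(html), skiprows=range(1), header=0)[0]

print(by_count)
```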
Specify an HTML attribute.
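A sketch of attribute-based selection with attrs, using two invented tables of which only one carries the id being requested. The id value "prices" is an assumption for this example, and a parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

html = """
<table id="prices">
  <thead><tr><th>item</th><th>price</th></tr></thead>
  <tbody><tr><td>tea</td><td>4</td></tr></tbody>
</table>
<table class="other">
  <thead><tr><th>city</th><th>zone</th></tr></thead>
  <tbody><tr><td>Oslo</td><td>1</td></tr></tbody>
</table>
"""

# Keep only tables whose attributes match the given dictionary.
selected = pd.read_html(StringIO(html), attrs={"id": "prices"})
print(len(selected))
print(selected[0])
```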
Specify values that should be converted to NaN.
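A minimal sketch of na_values, assuming a made-up placeholder string "missing" in the source table and an installed parser such as lxml.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>name</th><th>qty</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>missing</td></tr>
  </tbody>
</table>
"""

# Treat the literal text "missing" as a missing value.
cleaned = pd.read_html(StringIO(html), na_values=["missing"])[0]
print(cleaned)
```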
Specify whether to keep the default set of NaN values.
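The effect of keep_default_na can be sketched with a column in which "NA" is a legitimate label (here an invented region code) rather than a missing value. A parser such as lxml is assumed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>region</th><th>sales</th></tr></thead>
  <tbody>
    <tr><td>NA</td><td>10</td></tr>
    <tr><td>EU</td><td>20</td></tr>
  </tbody>
</table>
"""

# By default the text "NA" is interpreted as a missing value.
default = pd.read_html(StringIO(html))[0]
# Disabling the default set keeps "NA" as ordinary text.
literal = pd.read_html(StringIO(html), keep_default_na=False)[0]

print(default)
print(literal)
```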
Specify converters for columns. This is useful for numerical text data that has leading zeros. By default, columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.
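A sketch of the leading-zero problem and the converters workaround, using an invented postal-code column and assuming a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>zip</th><th>city</th></tr></thead>
  <tbody><tr><td>00501</td><td>Holtsville</td></tr></tbody>
</table>
"""

# Default parsing casts the column to a number and drops the zeros.
as_number = pd.read_html(StringIO(html))[0]
# A converter keeps the original text for that column.
as_text = pd.read_html(StringIO(html), converters={"zip": str})[0]

print(as_number["zip"].iloc[0])
print(as_text["zip"].iloc[0])
```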
Use some combination of the above.
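As one possible combination, this sketch applies match, index_col and na_values together to an invented two-table document; the placeholder text "missing" is again an assumption, as is an installed parser such as lxml.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>Fruit</th><th>qty</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>missing</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>City</th><th>pop</th></tr></thead>
  <tbody><tr><td>Oslo</td><td>700000</td></tr></tbody>
</table>
"""

combined = pd.read_html(
    StringIO(html),
    match="Fruit",          # keep only the table mentioning "Fruit"
    index_col=0,            # use the first column as the index
    na_values=["missing"],  # treat this placeholder as NaN
)
fruit = combined[0]
print(fruit)
```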
Read in pandas to_html output (with some loss of floating point precision).
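A sketch of the round trip through to_html, relying on the default float formatting so that the precision loss is visible. The frame is made up for illustration and a parser such as lxml is assumed.

```python
from io import StringIO

import numpy as np
import pandas as pd

original = pd.DataFrame({"x": [0.1, 1 / 3], "y": [2.5, 4.0]})

# Render to HTML with default formatting, then parse it back.
rendered = original.to_html()
parsed = pd.read_html(StringIO(rendered), index_col=0)[0]

# Values agree approximately, but digits beyond the rendered precision are lost.
print(np.allclose(parsed.to_numpy(), original.to_numpy()))
print(parsed.loc[1, "x"] == original.loc[1, "x"])
```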
The `lxml` backend will raise an error on a failed parse if that is the only parser you provide. If you only have a single parser, you can provide just a string, but it is considered good practice to pass a list with one string if, for example, the function expects a sequence of strings. In that case, pass `flavor=["lxml"]`.
Or you could pass `flavor="lxml"` without a list.
However, if you have bs4 and html5lib installed and pass `None` or `["lxml", "bs4"]`, then the parse will most likely succeed. Note that as soon as a parse succeeds, the function will return.
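As a minimal sketch of passing a list of flavors, the snippet below parses a small inline table rather than a URL. The HTML string, the column names, and the `available_flavors` helper are illustrative assumptions, not part of the original example; the helper only checks which optional parsers appear to be installed, so the call is skipped instead of failing when none is present.

```python
from importlib.util import find_spec
from io import StringIO

import pandas as pd


def available_flavors():
    """Return the read_html flavors whose optional dependencies seem installed."""
    flavors = []
    if find_spec("lxml") is not None:
        flavors.append("lxml")
    # The "bs4" flavor needs both BeautifulSoup4 and html5lib.
    if find_spec("bs4") is not None and find_spec("html5lib") is not None:
        flavors.append("bs4")
    return flavors


# Hypothetical table used only for illustration.
html = (
    "<table>"
    "<tr><th>bank</th><th>city</th></tr>"
    "<tr><td>Metcalf Bank</td><td>Lenexa</td></tr>"
    "</table>"
)

flavors = available_flavors()
# Parsers are tried in list order; the first successful parse is returned.
tables = (
    pd.read_html(StringIO(html), match="Metcalf Bank", flavor=flavors)
    if flavors
    else []
)
```

Listing `"lxml"` first keeps the fast parser as the primary choice while leaving `"bs4"` available as a fallback.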
Links can be extracted from cells along with the text using `extract_links="all"`.
New in version 1.5.0
Writing to HTML files#
`DataFrame` objects have an instance method `to_html` which renders the contents of the `DataFrame` as an HTML table. The function arguments are as in the method `to_string` described above.
Note
Not all of the possible options for `DataFrame.to_html` are shown here for brevity's sake. See `DataFrame.to_html()` for the full set of options.
Note
In an environment that supports HTML rendering, such as a Jupyter Notebook, `display(HTML(...))` will render the raw HTML into the environment.
The `columns` argument will limit the columns shown.
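A short sketch of the `columns` argument follows. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Illustrative data; the column names are arbitrary.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5], "c": ["x", "y"]})

# Only columns "a" and "c" are written to the HTML table.
html = df.to_html(columns=["a", "c"])
print(html)
```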
`float_format` takes a Python callable to control the precision of floating point values.
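As a hedged example, the bound `str.format` method of a format string can serve as the callable. The data and the three-decimal format are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0 / 3.0, 2.0 / 3.0]})

# Each float is passed to the callable, which returns the text to render.
html = df.to_html(float_format="{:.3f}".format)
print(html)
```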
`bold_rows` will make the row labels bold by default, but you can turn that off.
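The sketch below compares the default output with `bold_rows=False`. The index labels `r1` and `r2` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["r1", "r2"])

bold = df.to_html()                  # row labels rendered in <th> cells
plain = df.to_html(bold_rows=False)  # row labels rendered in <td> cells
print(plain)
```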
The `classes` argument provides the ability to give the resulting HTML table CSS classes. Note that these classes are appended to the existing `'dataframe'` class.
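A minimal sketch, assuming two arbitrary class names, shows how the extra classes end up after `dataframe` in the `class` attribute.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# The class names below are placeholders for whatever your stylesheet defines.
html = df.to_html(classes=["awesome_table_class", "even_more_awesome_class"])
print(html.splitlines()[0])
```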
The `render_links` argument provides the ability to add hyperlinks to cells that contain URLs.
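The following sketch uses a small made-up frame to show that only cells holding URLs are wrapped in anchor tags.

```python
import pandas as pd

# Illustrative data: one text column and one column of URLs.
df = pd.DataFrame(
    {
        "name": ["Python", "pandas"],
        "url": ["https://www.python.org/", "https://pandas.pydata.org"],
    }
)

html = df.to_html(render_links=True)
print(html)
```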
Finally, the `escape` argument allows you to control whether the "<", ">" and "&" characters are escaped in the resulting HTML (by default it is `True`). So to get the HTML without escaped characters pass `escape=False`.
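To make the difference concrete, the sketch below renders the same illustrative frame twice, once escaped and once not escaped.

```python
import pandas as pd

# A column holding exactly the characters affected by escaping.
df = pd.DataFrame({"a": list("&<>"), "b": [1.0, 2.0, 3.0]})

escaped = df.to_html()              # default: escape=True
raw = df.to_html(escape=False)      # characters are written as-is

print(escaped)
print(raw)
```

Leaving `escape=True` is the safer choice when cell contents come from untrusted input, since unescaped cells are inserted into the markup verbatim.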
Note
Some browsers may not show a difference in the rendering of the previous two HTML tables.
HTML Table Parsing Gotchas#
There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas io function `read_html`.
Issues with lxml
Benefits
lxml is very fast.
lxml requires Cython to install correctly.
Drawbacks
lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup.
In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this backend will use html5lib if lxml fails to parse.
It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result (provided everything else is valid) even if lxml fails.
Issues with BeautifulSoup4 using lxml as a backend
The above issues hold here as well, since BeautifulSoup4 is essentially just a wrapper around a parser backend.
Issues with BeautifulSoup4 using html5lib as a backend
Benefits
html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way rather than just, e.g., dropping an element without notifying you.
html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is "correct", since the process of fixing markup does not have a single definition.
html5lib is pure Python and requires no additional build steps beyond its own installation.
Drawbacks
The biggest drawback to using html5lib is that it is slow as molasses. However, consider the fact that many tables on the web are not big enough for the parsing algorithm runtime to matter. It is more likely that the bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO (input-output). For very large tables, this might not be true.
LaTeX#
New in version 1.3.0
Currently there are no methods to read from LaTeX, only output methods.
Writing to LaTeX files#
Note
DataFrame and Styler objects currently have a `to_latex` method. We recommend using the Styler.to_latex() method over DataFrame.to_latex() due to the former's greater flexibility with conditional styling, and the latter's possible future deprecation.
Review the documentation for Styler.to_latex, which gives examples of conditional styling and explains the operation of its keyword arguments.
For simple applications, calling `to_latex` on `df.style` with no arguments is sufficient.
To format values before output, chain the Styler.format method.
XML#
Reading XML#
New in version 1.3.0
The top-level `read_xml` function can accept an XML string/file/URL and will parse nodes and attributes into a pandas `DataFrame`.
Note
Because XML has no standard structure and designs can vary in many ways, `read_xml` works best with flatter, shallow documents. If an XML document is deeply nested, use the `stylesheet` feature to transform the XML into a flatter version.
Let's look at a few examples.
`read_xml` can read an XML string.
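A minimal sketch of reading XML held in a string is shown below. The bookstore document is illustrative, and two choices are assumptions made for portability: the string is wrapped in `StringIO` because recent pandas releases deprecate passing literal XML text directly, and `parser="etree"` is used so the example does not depend on lxml being installed.

```python
from io import StringIO

import pandas as pd

# Illustrative document: each <book> becomes one row.
xml = """<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

# The default xpath selects the children of the root, here the <book> nodes.
df = pd.read_xml(StringIO(xml), parser="etree")
print(df)
```

The `category` attribute of each row node and its child elements become columns; the `lang` attribute on `<title>` belongs to a descendant and is not picked up.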
`read_xml` can also read a URL with no options.
You can read in the content of the "books.xml" file and pass it to `read_xml` as a string.
You can also read in the content of "books.xml" as an instance of `StringIO` or `BytesIO` and pass it to `read_xml`.
You can even read XML from AWS S3 buckets, such as the NIH NCBI PMC Article Datasets providing biomedical and life science journals.
With lxml as the default `parser`, you access the full-featured XML library that extends Python's ElementTree API. One powerful tool is the ability to query nodes selectively or conditionally with more expressive XPath.
You can specify only elements or only attributes to parse.
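As a sketch, the `elems_only` and `attrs_only` flags of `read_xml` restrict parsing to child elements or to attributes of the row nodes. The document is illustrative, and `parser="etree"` is an assumption made to avoid the optional lxml dependency.

```python
from io import StringIO

import pandas as pd

# Illustrative document with one attribute and two child elements per row.
xml = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

# Child elements only: the "category" attribute is ignored.
elems = pd.read_xml(StringIO(xml), elems_only=True, parser="etree")

# Attributes only: the <title> and <year> elements are ignored.
attrs = pd.read_xml(StringIO(xml), attrs_only=True, parser="etree")

print(elems)
print(attrs)
```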
XML documents can have namespaces with prefixes and default namespaces without prefixes, both of which are denoted with the special attribute xmlns. To parse by node under a namespace context, xpath must reference a prefix.
For example, consider an XML document that contains a namespace with the prefix doc and the URI https://example.com. To parse doc:row nodes, namespaces must be used.
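As a minimal sketch of the prefixed case, the snippet below parses doc:row nodes from a small made-up document. The sample data is an assumption for illustration. parser="etree" and the relative path .//doc:row are used only so the sketch runs without lxml installed; with the default lxml parser, //doc:row should work as well.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document: the prefix "doc" is bound to https://example.com
xml = """\
<doc:data xmlns:doc="https://example.com">
  <doc:row>
    <doc:shape>square</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides>4</doc:sides>
  </doc:row>
  <doc:row>
    <doc:shape>triangle</doc:shape>
    <doc:degrees>180</doc:degrees>
    <doc:sides>3</doc:sides>
  </doc:row>
</doc:data>"""

# The prefix used in xpath must be declared in the namespaces mapping
df = pd.read_xml(
    StringIO(xml),
    xpath=".//doc:row",
    namespaces={"doc": "https://example.com"},
    parser="etree",
)
print(df)
```

The namespace prefix is stripped from the resulting column names, so the frame has plain shape, degrees and sides columns.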
Similarly, an XML document can have a default namespace without a prefix. Failing to assign a temporary prefix returns no nodes and raises a ValueError. However, assigning any temporary name to the correct URI allows parsing by nodes.
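The default-namespace case can be sketched as follows. The document and the temporary prefix name tmp are assumptions chosen for illustration, and parser="etree" is used only to avoid depending on lxml.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document with a default (unprefixed) namespace
xml = """\
<data xmlns="https://example.com">
  <row>
    <shape>square</shape>
    <sides>4</sides>
  </row>
  <row>
    <shape>triangle</shape>
    <sides>3</sides>
  </row>
</data>"""

# Without a prefix, the path matches nothing and read_xml raises ValueError
missing_prefix_failed = False
try:
    pd.read_xml(StringIO(xml), xpath=".//row", parser="etree")
except ValueError:
    missing_prefix_failed = True

# Any temporary prefix works as long as it maps to the correct URI
df = pd.read_xml(
    StringIO(xml),
    xpath=".//tmp:row",
    namespaces={"tmp": "https://example.com"},
    parser="etree",
)
print(missing_prefix_failed)
print(df)
```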
However, if the XPath does not reference node names, as with the default /*, then namespaces is not required.
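A short sketch of this case, assuming a made-up document with a default namespace: no xpath is passed, so the default expression is used and no namespaces mapping is needed. parser="etree" is an assumption made to keep the sketch free of the lxml dependency.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document with a default namespace
xml = """\
<data xmlns="https://example.com">
  <row>
    <shape>square</shape>
    <sides>4</sides>
  </row>
  <row>
    <shape>triangle</shape>
    <sides>3</sides>
  </row>
</data>"""

# The default xpath selects the children of the root without naming them,
# so the namespace never has to be spelled out
df = pd.read_xml(StringIO(xml), parser="etree")
print(df)
```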
With lxml as the parser, you can flatten nested XML documents with an XSLT script, which can also be a string, file, or URL. As background, XSLT is a special-purpose language written in a special XML file that can transform original XML documents into other XML, HTML, or even text (CSV, JSON, etc.) using an XSLT processor.
For example, consider a somewhat nested structure of Chicago "L" Rides where station and rides elements encapsulate data in their own sections. With a suitable XSLT script, lxml can transform the original nested document into a flatter output that is easier to parse into a DataFrame.
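The flattening step can be sketched as below. The ride figures are made-up sample values, and the stylesheet is one possible way to pull the station attributes and the nested rides elements up to the row level. Because XSLT requires lxml, the call is guarded so the sketch simply leaves flat as None when lxml is not installed.

```python
import importlib.util
from io import StringIO

import pandas as pd

# Hypothetical nested document: station data sits in attributes,
# ride counts sit one level deeper under <rides>
xml = """\
<response>
  <row>
    <station id="40850" name="Library"/>
    <month>2020-09-01T00:00:00</month>
    <rides>
      <avg_weekday_rides>864.2</avg_weekday_rides>
      <avg_saturday_rides>534</avg_saturday_rides>
    </rides>
  </row>
  <row>
    <station id="41700" name="Washington/Wabash"/>
    <month>2020-09-01T00:00:00</month>
    <rides>
      <avg_weekday_rides>2707.4</avg_weekday_rides>
      <avg_saturday_rides>1909.8</avg_saturday_rides>
    </rides>
  </row>
</response>"""

# XSLT that emits one flat <row> per input row
xsl = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/response">
    <xsl:copy>
      <xsl:apply-templates select="row"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="row">
    <xsl:copy>
      <station_id><xsl:value-of select="station/@id"/></station_id>
      <station_name><xsl:value-of select="station/@name"/></station_name>
      <xsl:copy-of select="month|rides/*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""

flat = None
if importlib.util.find_spec("lxml") is not None:
    # The stylesheet is applied first, then the flattened result is parsed
    flat = pd.read_xml(StringIO(xml), stylesheet=StringIO(xsl), parser="lxml")
    print(flat)
```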
For very large XML files, which can range from hundreds of megabytes to gigabytes, pandas.read_xml() supports parsing such sizeable files using lxml's iterparse and etree's iterparse, which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes without holding the entire tree in memory.
New in version 1.5.0.
To use this feature, you must pass a physical XML file path into read_xml and use the iterparse argument. Files should not be compressed or point to online sources but should be stored on local disk. Also, iterparse should be a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of any elements or attributes that are descendants (i.e., children, grandchildren) of the repeating node. Since XPath is not used in this method, descendants do not need to share the same relationship with one another. This method suits, for example, reading Wikipedia's very large (12 GB+) latest article data dump.
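A minimal sketch of the iterparse argument on a tiny file, assuming pandas 1.5.0 or later. The file name shapes.xml and its contents are made up for illustration; a real use case would point at a large file already on local disk. parser="etree" is chosen here only to avoid the lxml dependency.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample content standing in for a very large file
xml = """\
<data>
  <row><shape>square</shape><degrees>360</degrees><sides>4</sides></row>
  <row><shape>circle</shape><degrees>360</degrees><sides>0</sides></row>
  <row><shape>triangle</shape><degrees>180</degrees><sides>3</sides></row>
</data>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # iterparse needs a physical, uncompressed file on local disk
    path = os.path.join(tmp_dir, "shapes.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(xml)

    # Key: the repeating node that becomes a row.
    # Value: the descendant elements or attributes to extract.
    df = pd.read_xml(
        path,
        iterparse={"row": ["shape", "degrees", "sides"]},
        parser="etree",
    )

print(df)
```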
Writing XML#
New in version 1.3.0.
DataFrame objects have an instance method to_xml which renders the contents of the DataFrame as an XML document.
Note
This method does not support special properties of XML including DTD, CData, XSD schemas, processing instructions, comments, and others. Only namespaces at the root level are supported. However, stylesheet allows design changes after the initial output.
Let's look at a few examples.
Write an XML without options
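A minimal sketch of the no-options call, using a made-up two-row frame. parser="etree" is passed only so the sketch does not need lxml; the default parser is lxml.

```python
import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# With no path given, to_xml returns the document as a string
xml_out = geom_df.to_xml(parser="etree")
print(xml_out)
```

By default the root element is named data, each record is a row element, and the index is written as an index element.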
Write an XML with new root and row name
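A sketch of renaming the root and row elements. The names geometry and objects and the sample frame are assumptions for illustration, and parser="etree" is used to avoid requiring lxml.

```python
import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# root_name and row_name replace the default "data" and "row" element names
xml_out = geom_df.to_xml(root_name="geometry", row_name="objects", parser="etree")
print(xml_out)
```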
Write an attribute-centric XML
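A sketch of attribute-centric output, where every column is written as an attribute of the row element instead of a child element. The sample frame is made up, and parser="etree" is an assumption to keep lxml optional.

```python
import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# Listing all columns in attr_cols turns each value into a row attribute
xml_out = geom_df.to_xml(attr_cols=geom_df.columns.tolist(), parser="etree")
print(xml_out)
```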
Write a mix of elements and attributes
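A sketch of mixing the two styles: shape becomes an attribute while degrees and sides stay child elements. The frame and the choice of columns are illustrative assumptions, as is parser="etree".

```python
import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

xml_out = geom_df.to_xml(
    index=False,                     # leave the index out of the output
    attr_cols=["shape"],             # written as attributes of <row>
    elem_cols=["degrees", "sides"],  # written as child elements
    parser="etree",
)
print(xml_out)
```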
Any DataFrames with hierarchical columns will be flattened for XML element names, with levels delimited by underscores
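The flattening of hierarchical columns can be sketched with a small frame built directly with MultiIndex columns. The level names and values are made up, and parser="etree" is used so lxml is not required.

```python
import pandas as pd

# Hypothetical frame with two column levels, e.g. (measure, type)
columns = pd.MultiIndex.from_tuples(
    [("degrees", "polygon"), ("sides", "polygon")]
)
pvt_df = pd.DataFrame([[360, 4], [180, 3]], columns=columns)

# Each column tuple is joined with "_" to form the element name
xml_out = pvt_df.to_xml(index=False, parser="etree")
print(xml_out)
```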
Write an XML with default namespace
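A sketch of declaring a default namespace by mapping the empty-string prefix to a URI. The URI and the sample frame are assumptions for illustration, and parser="etree" keeps lxml optional.

```python
import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# An empty-string key declares the default (unprefixed) namespace on the root
xml_out = geom_df.to_xml(namespaces={"": "https://example.com"}, parser="etree")
print(xml_out)
```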
Write an XML with namespace prefix
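A sketch of prefixed output: the prefix passed to prefix must appear as a key in namespaces. The prefix doc, the URI and the frame are illustrative assumptions, as is parser="etree".

```python
import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# Every element name is qualified with the chosen prefix
xml_out = geom_df.to_xml(
    namespaces={"doc": "https://example.com"},
    prefix="doc",
    parser="etree",
)
print(xml_out)
```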
Write an XML without declaration or pretty print
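A sketch of compact output with the XML declaration and indentation both switched off. The sample frame is made up, and parser="etree" is used to avoid requiring lxml.

```python
import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# No <?xml ...?> header and no line breaks or indentation
xml_out = geom_df.to_xml(xml_declaration=False, pretty_print=False, parser="etree")
print(xml_out)
```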
Write an XML and transform with stylesheet
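A sketch of post-processing the output with a stylesheet. The XSLT below is a made-up example that reshapes each row into a single shape element with attributes; it is not taken from the original guide. Since stylesheets need lxml, the call is guarded and styled stays None when lxml is not installed.

```python
import importlib.util
from io import StringIO

import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# Illustrative XSLT: turn each <row> into <shape name=... sides=.../>
xsl = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
  <xsl:template match="/data">
    <shapes>
      <xsl:apply-templates select="row"/>
    </shapes>
  </xsl:template>
  <xsl:template match="row">
    <shape name="{shape}" sides="{sides}"/>
  </xsl:template>
</xsl:stylesheet>"""

styled = None
if importlib.util.find_spec("lxml") is not None:
    # The default document is built first, then transformed by the stylesheet
    styled = geom_df.to_xml(stylesheet=StringIO(xsl), parser="lxml")
    print(styled)
```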
XML Final Notes#
All XML documents adhere to W3C specifications. Both the etree and lxml parsers will fail to parse any markup document that is not well-formed or does not follow XML syntax rules. Be aware that HTML is not an XML document unless it follows XHTML specs. However, other popular markup types including KML, XAML, RSS, MusicML, and MathML are compliant XML schemas.
For the above reason, if your application builds XML prior to pandas operations, use appropriate DOM libraries like etree and lxml to build the necessary document, and not string concatenation or regex adjustments. Always remember that XML is a special text file with markup rules.
With very large XML files (several hundred MBs to GBs), XPath and XSLT can become memory-intensive operations. Be sure to have enough available RAM for reading and writing large XML files (roughly 5 times the size of the text).
Because XSLT is a programming language, use it with caution, since such scripts can pose a security risk in your environment and can run large or infinite recursive operations. Always test scripts on small fragments before a full run.
The etree parser supports all functionality of both read_xml and to_xml except for complex XPath and any XSLT. Though limited in features, etree is still a reliable and capable parser and tree builder. Its performance may trail lxml to a certain degree for larger files, but the difference is relatively unnoticeable on small to medium size files.
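The advice to build XML with a DOM library rather than string concatenation can be sketched with the standard library's ElementTree. The records are made up; one value deliberately contains characters that would break hand-built markup, and the library escapes them automatically.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

import pandas as pd

# Hypothetical records; "R&D <lab>" would be invalid if pasted into markup as-is
records = [("square", 4), ("R&D <lab>", 3)]

# Build the tree node by node instead of concatenating strings
root = ET.Element("data")
for shape, sides in records:
    row = ET.SubElement(root, "row")
    ET.SubElement(row, "shape").text = shape
    ET.SubElement(row, "sides").text = str(sides)

# Serialization escapes special characters, so the result is well-formed
xml_bytes = ET.tostring(root, encoding="utf-8")

df = pd.read_xml(BytesIO(xml_bytes), parser="etree")
print(df)
```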
Excel files#
The read_excel() method can read Excel 2007+ (.xlsx) files using the openpyxl Python module. Excel 2003 (.xls) files can be read using xlrd. Binary Excel (.xlsb) files can be read using pyxlsb. The to_excel() instance method is used for saving a DataFrame to Excel. Generally the semantics are similar to working with csv data. See the cookbook for some advanced strategies.
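A minimal round-trip sketch with the openpyxl engine, writing to an in-memory buffer instead of a file on disk. The sample frame and the use of a buffer are assumptions for illustration. The calls are guarded so that round_trip stays None when openpyxl is not installed.

```python
import importlib.util
from io import BytesIO

import pandas as pd

# Hypothetical sample frame used only for illustration
geom_df = pd.DataFrame({"shape": ["square", "triangle"], "sides": [4, 3]})

round_trip = None
if importlib.util.find_spec("openpyxl") is not None:
    buffer = BytesIO()
    # Write an .xlsx workbook into the buffer
    geom_df.to_excel(buffer, index=False, engine="openpyxl")
    # Rewind and read the same workbook back
    buffer.seek(0)
    round_trip = pd.read_excel(buffer, engine="openpyxl")
    print(round_trip)
```

Passing a path ending in .xlsx instead of the buffer works the same way for both calls.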
Warning
The xlwt package for writing old-style .xls excel files is no longer maintained. The xlrd package is now only for reading old-style .xls files.
Before pandas 1.3.0, the default argument engine=None to read_excel() would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. pandas will now default to using the openpyxl engine.
It is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report issues when using xlrd to read .xlsx files. This is no longer supported; switch to using openpyxl instead.
Attempting to use the ``xlwt`` engine will raise a ``FutureWarning`` unless the option ``io.excel.xls.writer`` is set to ``"xlwt"``. Although this option is now deprecated and will also raise a ``FutureWarning``, it can be set globally and the warning suppressed. Users are recommended to write ``.xlsx`` files using the ``openpyxl`` engine instead.
Reading Excel files#
In the most basic use case, ``read_excel`` takes a path to an Excel file and the ``sheet_name`` indicating which sheet to parse.
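The basic call can be sketched as follows. This is a minimal illustration rather than the original example: it assumes ``openpyxl`` is installed, builds a small workbook in memory instead of reading a file from disk, and the sheet names ``Sheet1`` and ``Summary`` are hypothetical.

```python
import io

import pandas as pd

# Build a small workbook in memory so the sketch is self-contained.
# The sheet names and column names are made up for illustration.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_excel(
        writer, sheet_name="Sheet1", index=False
    )
    pd.DataFrame({"total": [10]}).to_excel(
        writer, sheet_name="Summary", index=False
    )
payload = buffer.getvalue()

# read_excel accepts a path or a file-like object; sheet_name selects the sheet.
df = pd.read_excel(io.BytesIO(payload), sheet_name="Summary", engine="openpyxl")
print(df)
```

With a file on disk, the first argument would simply be the path to that file.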
ExcelFile class#
To facilitate working with multiple sheets from the same file, the ``ExcelFile`` class can be used to wrap the file and can be passed into ``read_excel``. There is a performance benefit when reading multiple sheets, because the file is read into memory only once.
The ``ExcelFile`` class can also be used as a context manager.
The ``sheet_names`` property returns a list of the sheet names in the file.
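A minimal sketch of the context-manager form together with ``sheet_names``, assuming ``openpyxl`` is installed. The in-memory workbook and the names ``Sheet1`` and ``Sheet2`` are hypothetical stand-ins for a real file.

```python
import io

import pandas as pd

# Hypothetical two-sheet workbook, written to memory.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)
payload = buffer.getvalue()

# The workbook is opened once; every sheet is then parsed from the same object.
with pd.ExcelFile(io.BytesIO(payload), engine="openpyxl") as xls:
    names = xls.sheet_names
    frames = {name: pd.read_excel(xls, sheet_name=name) for name in names}

print(names)
```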
The primary use case for ``ExcelFile`` is parsing multiple sheets with different parameters.
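Parsing each sheet with its own parameters might look like the sketch below. It assumes ``openpyxl`` is installed; the workbook contents, the sheet names, and the ``"missing"`` marker are all invented for the example.

```python
import io

import pandas as pd

# Hypothetical workbook: Sheet1 uses a custom missing-value marker,
# Sheet2 carries its row labels in the second column.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"x": [1, "missing"], "y": [3, 4]}).to_excel(
        writer, sheet_name="Sheet1", index=False
    )
    pd.DataFrame({"id": [10, 20], "key": ["k1", "k2"]}).to_excel(
        writer, sheet_name="Sheet2", index=False
    )
payload = buffer.getvalue()

data = {}
with pd.ExcelFile(io.BytesIO(payload), engine="openpyxl") as xls:
    # Each sheet gets the parsing options that match its layout.
    data["Sheet1"] = pd.read_excel(xls, sheet_name="Sheet1", na_values=["missing"])
    data["Sheet2"] = pd.read_excel(xls, sheet_name="Sheet2", index_col=1)
```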
Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to ``read_excel`` with no loss in performance.
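The equivalence between the two approaches can be sketched like this, assuming ``openpyxl`` is installed and using a hypothetical in-memory workbook. It compares results only and does not measure performance.

```python
import io

import pandas as pd

# Hypothetical workbook with two identically structured sheets.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"a": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)
payload = buffer.getvalue()

wanted = ["Sheet1", "Sheet2"]

# Using the ExcelFile class, one read_excel call per sheet.
with pd.ExcelFile(io.BytesIO(payload), engine="openpyxl") as xls:
    via_class = {name: pd.read_excel(xls, sheet_name=name) for name in wanted}

# Equivalent: a single read_excel call with a list of sheet names,
# which returns a dict keyed by those names.
via_function = pd.read_excel(
    io.BytesIO(payload), sheet_name=wanted, engine="openpyxl"
)
```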
``ExcelFile`` can also be called with an ``xlrd.book.Book`` object as a parameter. This allows the user to control how the Excel file is read. For example, sheets can be loaded on demand by calling ``xlrd.open_workbook()`` with ``on_demand=True``.
Specifying sheets#
Note
The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``.
Note
The ``sheet_names`` attribute of an ``ExcelFile`` provides access to a list of sheets.
The ``sheet_name`` argument allows specifying the sheet or sheets to read.
The default value for ``sheet_name`` is 0, indicating that the first sheet is read.
Pass a string to refer to the name of a particular sheet in the workbook.
Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0.
Pass a list of strings or integers to return a dictionary of the specified sheets.
Pass ``None`` to return a dictionary of all available sheets.
Using the sheet index: pass an integer such as ``0`` as ``sheet_name``.
Using all default values: pass only the path, which reads the first sheet.
Using ``None`` to get all sheets: pass ``sheet_name=None``.
Using a list to get multiple sheets: pass a list of names or positions as ``sheet_name``.
``read_excel`` can read more than one sheet by setting ``sheet_name`` to a list of sheet names, a list of sheet positions, or ``None`` to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or a string respectively.
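The different forms of ``sheet_name`` described in this section can be sketched together. This is an illustrative example, not the original one: it assumes ``openpyxl`` is installed, and the sheet names ``First`` and ``Second`` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-sheet workbook, written to memory.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="First", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="Second", index=False)
payload = buffer.getvalue()


def read(**kwargs):
    """Read the in-memory workbook with the given read_excel options."""
    return pd.read_excel(io.BytesIO(payload), engine="openpyxl", **kwargs)


by_name = read(sheet_name="Second")      # a string selects a sheet by name
by_index = read(sheet_name=1)            # an integer selects by 0-based position
default = read()                         # default sheet_name=0: the first sheet
some = read(sheet_name=["First", 1])     # a list returns a dict keyed by its items
everything = read(sheet_name=None)       # None returns a dict of all sheets
```

Note that the dictionary returned for a list is keyed by whatever was passed in, so mixing names and positions yields mixed keys.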
Reading a MultiIndex#
``read_excel`` can read a ``MultiIndex`` index by passing a list of columns to ``index_col``, and ``MultiIndex`` columns by passing a list of rows to ``header``. If either the index or the columns have serialized level names, those are read in as well by specifying the rows/columns that make up the levels.
For example, a ``MultiIndex`` index without names can be read by passing a list of column positions to ``index_col``.
If the index has level names, they are parsed as well, using the same parameters.
If the source file has both a ``MultiIndex`` index and ``MultiIndex`` columns, lists specifying each should be passed to ``index_col`` and ``header``.
Missing values in columns specified in ``index_col`` are forward filled to allow round-tripping with ``to_excel`` for ``merged_cells=True``. To avoid forward filling the missing values, use ``set_index`` after reading the data instead of ``index_col``.
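A round trip of a ``MultiIndex`` index can be sketched as follows. It covers only the row-index case with named levels, assumes ``openpyxl`` is installed, and uses hypothetical level and column names.

```python
import io

import pandas as pd

# Hypothetical frame with a two-level row index and named levels.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["c", "d"]], names=["lvl1", "lvl2"]
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    # Repeated outer labels are written as merged cells by default.
    df.to_excel(writer)
payload = buffer.getvalue()

# The first two columns of the sheet make up the index levels. The cells left
# empty by merging are forward filled when they are read as index columns.
result = pd.read_excel(io.BytesIO(payload), index_col=[0, 1], engine="openpyxl")
print(result)
```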
Parsing specific columns#
Users often insert columns in Excel for temporary computations, and you may not want to read those columns in. ``read_excel`` takes a ``usecols`` keyword that lets you specify a subset of columns to parse.
Changed in version 1.0.0.
Passing in an integer for ``usecols`` will no longer work. Please pass in a list of integers from 0 to ``usecols`` inclusive instead.
You can specify a comma-delimited set of Excel columns and ranges as a string, for example ``"A,C:E"``.
If ``usecols`` is a list of integers, it is taken to be the file column indices to be parsed.
Element order is ignored, so
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object54 is the same as
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object55
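The following sketch shows that element order does not matter. So that it runs without an Excel file, it uses read_csv on an in-memory string as a stand-in, since read_csv accepts the same list form of usecols. The sample data is made up for the illustration.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"

# Both calls select the first and third columns; the order of the list
# does not change the order of the resulting columns.
first = pd.read_csv(StringIO(csv_text), usecols=[0, 2])
second = pd.read_csv(StringIO(csv_text), usecols=[2, 0])

print(list(first.columns))
print(first.equals(second))
```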
If usecols is a list of strings, each string is assumed to correspond to a column name, either provided by the user in names or inferred from the document header row(s). Those strings define which columns will be parsed.
Element order is ignored, so usecols=['baz', 'joe'] is the same as ['joe', 'baz'].
If usecols is callable, the callable is evaluated against the column names, and the names for which it returns True are parsed.
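A short sketch of selecting columns by name and by callable. It again uses read_csv on an in-memory string as a stand-in for an Excel file, because read_csv accepts the same forms of usecols. The column names foo, bar and x1 are invented for the example.

```python
from io import StringIO

import pandas as pd

csv_text = "foo,bar,x1\n1,2,3\n4,5,6\n"

# A list of strings selects columns by name.
by_name = pd.read_csv(StringIO(csv_text), usecols=["foo", "bar"])

# A callable is called with each column name; columns for which it
# returns True are kept ("x1" is dropped because it contains a digit).
by_callable = pd.read_csv(StringIO(csv_text), usecols=lambda name: name.isalpha())

print(list(by_name.columns))
print(list(by_callable.columns))
```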
Parsing dates#
Datetime-like values are normally converted automatically to the appropriate dtype when reading the Excel file. But if you have a column of strings that look like dates (but are not actually formatted as dates in Excel), you can use the parse_dates keyword to parse those strings to datetimes.
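As a rough illustration of parse_dates, the sketch below reads date-like strings from an in-memory CSV, used here as a stand-in because every CSV value starts out as a string. The column name date_strings is an assumption for the example. With a workbook, the same keyword would be passed to read_excel.

```python
from io import StringIO

import pandas as pd

csv_text = "date_strings,value\n2023-01-05,1\n2023-02-10,2\n"

# Without parse_dates the column would be read as plain strings.
df = pd.read_csv(StringIO(csv_text), parse_dates=["date_strings"])

print(df.dtypes)

# Excel equivalent (placeholder path):
# pd.read_excel("path_to_file.xlsx", parse_dates=["date_strings"])
```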
Cell converters#
It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a column to boolean, map the column name to bool in the converters dictionary.
This option handles missing values and treats exceptions in the converters as missing data. Transformations are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype.
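One possible way to mask missing data is a converter that substitutes a sentinel value for empty cells. In the sketch below, the function name cfun, the sentinel -1 and the column name MyInts are all illustrative choices. It runs against an in-memory CSV through read_csv, which also accepts converters, so no Excel file is needed.

```python
from io import StringIO

import pandas as pd


def cfun(x):
    """Return the cell as an int, or -1 when the cell is empty."""
    return int(x) if x else -1


csv_text = "MyInts,label\n1,a\n,b\n3,c\n"

# The converter is called once per cell in the MyInts column.
df = pd.read_csv(StringIO(csv_text), converters={"MyInts": cfun})

print(df["MyInts"].tolist())

# Excel equivalent (placeholder path):
# pd.read_excel("path_to_file.xlsx", converters={"MyInts": cfun})
```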
Dtype specifications#
As an alternative to converters, the type for an entire column can be specified using the dtype keyword, which takes a dictionary mapping column names to types. To interpret data with no type inference, use the type str or object.
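A small sketch of the dtype keyword. The column names MyInts and MyText are invented for the example. It reads from an in-memory CSV with read_csv, which takes the same kind of dtype mapping, so that it runs without a workbook. Note how str keeps the leading zeros that type inference would otherwise drop.

```python
from io import StringIO

import pandas as pd

csv_text = "MyInts,MyText\n1,001\n2,002\n"

# MyText is kept as text, so "001" is not turned into the integer 1.
df = pd.read_csv(StringIO(csv_text), dtype={"MyInts": "int64", "MyText": str})

print(df["MyText"].tolist())

# Excel equivalent (placeholder path):
# pd.read_excel("path_to_file.xlsx", dtype={"MyInts": "int64", "MyText": str})
```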
Writing Excel files#
Writing Excel files to disk#
To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments are largely the same as for to_csv described above: the first argument is the name of the Excel file, and the optional second argument is the name of the sheet to which the DataFrame should be written.
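The sketch below writes a small DataFrame to a temporary .xlsx file and reads it back. It assumes the optional openpyxl package and skips the round trip when openpyxl is not installed. The file name and sample data are placeholders.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

# Writing .xlsx needs an optional engine; this sketch assumes openpyxl.
has_openpyxl = importlib.util.find_spec("openpyxl") is not None

round_trip = None
if has_openpyxl:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.xlsx")
        # index=False leaves the row index out of the sheet.
        df.to_excel(path, sheet_name="Sheet1", index=False)
        round_trip = pd.read_excel(path, sheet_name="Sheet1")

print(round_trip)
```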
Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written using xlsxwriter (if available) or openpyxl.
The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be placed in the second row instead of the first. You can place it in the first row by setting the merge_cells option in to_excel() to False.
To write separate DataFrames to separate sheets in a single Excel file, you can pass an ExcelWriter.
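A minimal sketch of writing two DataFrames to two sheets through one ExcelWriter. It assumes openpyxl is installed and does nothing otherwise. The names df1, df2 and the sheet names are illustrative.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"y": ["a", "b"]})

has_openpyxl = importlib.util.find_spec("openpyxl") is not None

sheet_names = None
if has_openpyxl:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "two_sheets.xlsx")
        # The context manager saves and closes the workbook on exit.
        with pd.ExcelWriter(path, engine="openpyxl") as writer:
            df1.to_excel(writer, sheet_name="Sheet1")
            df2.to_excel(writer, sheet_name="Sheet2")
        # sheet_name=None reads every sheet into a dict keyed by sheet name.
        sheets = pd.read_excel(path, sheet_name=None)
        sheet_names = list(sheets)

print(sheet_names)
```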
Writing Excel files to memory#
pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.
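The following sketch writes a workbook into a BytesIO buffer and then copies the bytes into a variable. It assumes openpyxl as the engine and skips the write when that package is missing. The variable names are illustrative.

```python
import importlib.util
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

has_openpyxl = importlib.util.find_spec("openpyxl") is not None

workbook = None
if has_openpyxl:
    bio = BytesIO()
    # Naming the engine explicitly fixes the workbook format that is produced.
    with pd.ExcelWriter(bio, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name="Sheet1")
    # Seek to the beginning and read to copy the workbook into memory.
    bio.seek(0)
    workbook = bio.read()

print(None if workbook is None else len(workbook))
```

An .xlsx file is a zip archive, so the resulting bytes start with the zip signature PK.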
Note
engine is optional but recommended. Setting the engine determines the version of workbook produced. Setting engine='xlwt' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl' or 'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted workbook is produced.
Excel writer engines#
Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the xlwt engine will be removed from a future version of pandas. This is the only engine in pandas that supports writing to .xls files.
pandas chooses an Excel writer via two methods:
1. the engine keyword argument
2. the filename extension (via the default specified in config options)
By default, pandas uses XlsxWriter for .xlsx, openpyxl for .xlsm, and xlwt for .xls files. If you have multiple engines installed, you can set the default engine through the config options io.excel.xlsx.writer and io.excel.xls.writer. pandas will fall back on openpyxl for .xlsx files if XlsxWriter is not available.
To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to ExcelWriter. The built-in engines are:
- openpyxl: version 2.4 or higher is required
- xlsxwriter
- xlwt
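The three ways of choosing a writer can be sketched as follows. The example assumes openpyxl is installed and is skipped otherwise. The file name is a placeholder, and option_context is used here only so that the configuration change does not leak outside the example.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

has_openpyxl = importlib.util.find_spec("openpyxl") is not None

wrote = None
if has_openpyxl:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "engine_demo.xlsx")

        # 1. The engine keyword on to_excel.
        df.to_excel(path, sheet_name="Sheet1", engine="openpyxl")

        # 2. The engine keyword on the ExcelWriter constructor.
        with pd.ExcelWriter(path, engine="openpyxl") as writer:
            df.to_excel(writer, sheet_name="Sheet1")

        # 3. The pandas configuration option, applied temporarily.
        with pd.option_context("io.excel.xlsx.writer", "openpyxl"):
            df.to_excel(path, sheet_name="Sheet1")

        wrote = os.path.getsize(path) > 0

print(wrote)
```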
Style and formatting#
The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the DataFrame's to_excel method:
float_format: Format string for floating point numbers (default None).
freeze_panes: A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default None).
Using the XlsxWriter engine provides many options for controlling the format of an Excel worksheet created with the to_excel method. Excellent examples can be found in the XlsxWriter documentation: https://xlsxwriter.readthedocs.io/working_with_pandas.html
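The float_format and freeze_panes parameters can be sketched as follows. The helper name write_styled and the sample frame are assumptions made for illustration, and an Excel writer engine must be installed for the call to succeed.

```python
import pandas as pd


def write_styled(frame, path, engine=None):
    """Write frame with two-decimal floats and a frozen first row and column."""
    frame.to_excel(
        path,
        sheet_name="Sheet1",
        float_format="%.2f",   # format string applied to floating point cells
        freeze_panes=(1, 1),   # one-based: freeze the first row and first column
        engine=engine,         # None lets pandas pick the default writer
    )


# Hypothetical sample data used only for illustration.
prices = pd.DataFrame({"price": [1.23456, 7.891]}, index=["x", "y"])
```

Calling write_styled(prices, "path_to_file.xlsx") would store the prices rounded to two decimals, with the header row and index column frozen when the file is opened in a spreadsheet application.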
OpenDocument Spreadsheets#
New in version 0.25.
The read_excel() method can also read OpenDocument spreadsheets using the odfpy module. The semantics and features for reading OpenDocument spreadsheets match what can be done for Excel files using engine='odf'.
Note
Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented
Binary Excel (.xlsb) files#
New in version 1.0.0.
The read_excel() method can also read binary Excel files using the pyxlsb module. The semantics and features for reading binary Excel files mostly match what can be done for Excel files using engine='pyxlsb'. pyxlsb does not recognize datetime types in files and will return floats instead.
Note
Currently pandas only supports reading binary Excel files. Writing is not implemented
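A minimal sketch of reading both formats, and of turning the floats that pyxlsb returns for dates back into datetimes. The helper names and the sample serial numbers are illustrative assumptions, and the conversion assumes the workbook uses Excel's 1900 date system.

```python
import os

import pandas as pd

# Engines named in the two sections above, keyed by file extension.
ENGINE_BY_EXTENSION = {".ods": "odf", ".xlsb": "pyxlsb"}


def read_spreadsheet(path, **kwargs):
    """Read an .ods or .xlsb file with the matching read_excel engine."""
    extension = os.path.splitext(path)[1].lower()
    engine = ENGINE_BY_EXTENSION[extension]  # KeyError for other extensions
    return pd.read_excel(path, engine=engine, **kwargs)


def excel_serial_to_datetime(values):
    """Convert Excel serial day numbers (floats) to datetimes.

    Assumes the 1900 date system, for which day 0 is conventionally
    taken as 1899-12-30.
    """
    return pd.to_datetime(values, unit="D", origin="1899-12-30")


# Hypothetical serial values, as a date column might come back from pyxlsb.
serials = pd.Series([44197.0, 44197.5])
converted = excel_serial_to_datetime(serials)
```

Here 44197.0 maps to 2021-01-01 and the fractional part carries the time of day, so 44197.5 maps to noon on the same date.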
Clipboard#
A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard buffer and passes them to the read_csv method. For instance, you can copy a block of tabular text to the clipboard (CTRL-C on many operating systems) and then import the data directly into a DataFrame by calling pd.read_clipboard().
The to_clipboard method can be used to write the contents of a DataFrame to the clipboard, after which you can paste the clipboard contents into other applications (CTRL-V on many operating systems). Writing a DataFrame to the clipboard and reading it back with read_clipboard returns the same content that was written.
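The clipboard calls need a desktop clipboard backend, so this sketch parses a hypothetical copied text with read_csv directly, using a whitespace separator on the assumption that this mirrors what read_clipboard does by default. The round-trip helper is defined but not called, and its name is an illustrative assumption.

```python
from io import StringIO

import pandas as pd

# Hypothetical text, as it might sit in the clipboard after copying.
copied_text = "A B C\n1 4 p\n2 5 q\n3 6 r\n"

# Hand the text to read_csv with whitespace as the separator.
parsed = pd.read_csv(StringIO(copied_text), sep=r"\s+")


def clipboard_round_trip(frame):
    """Write frame to the system clipboard and read it back.

    Requires a working clipboard backend, so it is not called here.
    """
    frame.to_clipboard()
    return pd.read_clipboard()
```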
Note
You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.
Pickling#
All pandas objects are equipped with to_pickle methods, which use Python's pickle module to save data structures to disk using the pickle format.
The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any other pickled object) from file.
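A minimal round trip through to_pickle and read_pickle might look like this. The sample frame and the file name foo.pkl are illustrative assumptions, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data used only for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.pkl")
    df.to_pickle(path)               # save using the pickle format
    restored = pd.read_pickle(path)  # load it back; only do this for trusted files

print(restored.equals(df))
```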
Warning
Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html
Warning
read_pickle() is only guaranteed backwards compatible back to pandas version 0.20.3.
Compressed pickle files#
read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can read and write compressed pickle files. The compression types gzip, bz2, xz and zstd are supported for reading and writing. The zip file format only supports reading and must contain only one data file to be read.
The compression type can be an explicit parameter or be inferred from the file extension. If 'infer', then gzip, bz2, zip, xz or zstd is used if the filename ends in '.gz', '.bz2', '.zip', '.xz' or '.zst', respectively.
The compression parameter can also be a dict in order to pass options to the compression protocol. It must have a 'method' key set to the name of the compression protocol, which must be one of {'zip', 'gzip', 'bz2', 'xz', 'zstd'}. All other key-value pairs are passed to the underlying compression library.
The compression type can be given explicitly, inferred from the file extension (the default is 'infer'), or supplied as a dict that passes options to the compression protocol, for example to speed up compression.
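The four cases can be sketched as follows. The sample frame and file names are illustrative assumptions, and compresslevel is the option understood by Python's gzip module, to which the extra dict entries are forwarded.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data used only for illustration.
df = pd.DataFrame({"A": [float(i) for i in range(1000)], "B": "foo"})

with tempfile.TemporaryDirectory() as tmp:
    # Explicit compression type: the extension does not matter.
    explicit = os.path.join(tmp, "data.pkl.compress")
    df.to_pickle(explicit, compression="gzip")
    rt_explicit = pd.read_pickle(explicit, compression="gzip")

    # Compression inferred from the ".gz" extension.
    inferred = os.path.join(tmp, "data.pkl.gz")
    df.to_pickle(inferred, compression="infer")
    rt_inferred = pd.read_pickle(inferred, compression="infer")

    # 'infer' is the default, so the argument can be omitted.
    default = os.path.join(tmp, "default.pkl.gz")
    df.to_pickle(default)
    rt_default = pd.read_pickle(default)

    # Options for the protocol: a low compresslevel trades size for speed.
    fast = os.path.join(tmp, "fast.pkl.gz")
    df.to_pickle(fast, compression={"method": "gzip", "compresslevel": 1})
    rt_fast = pd.read_pickle(fast)

    # gzip files start with the two magic bytes 0x1f 0x8b.
    with open(default, "rb") as handle:
        magic = handle.read(2)
```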
msgpack#
pandas support for msgpack has been removed in version 1.0.0. It is recommended to use pickle instead.
Alternatively, you can also use the Arrow IPC serialization format for on-the-wire transmission of pandas objects. For details, see the pyarrow documentation.
HDF5 (PyTables)#
HDFStore is a dict-like object which reads and writes pandas objects using the high performance HDF5 format, via the PyTables library. See the cookbook for some advanced strategies.
Warning
pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle. Loading pickled data received from untrusted sources can be unsafe.
See https://docs.python.org/3/library/pickle.html for more.
Objects can be written to the file just like adding key-value pairs to a dict.
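A minimal sketch of dict-style writing, assuming PyTables (the tables package) is installed. The file name store.h5 and the sample objects are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.0), index=list("abcde"))
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

with tempfile.TemporaryDirectory() as tmp:
    store = pd.HDFStore(os.path.join(tmp, "store.h5"))
    # Assigning to a key behaves like store.put(key, value).
    store["s"] = s
    store["df"] = df
    stored_keys = sorted(store.keys())
    store.close()
```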
In a current or later Python session, you can retrieve stored objects.
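Retrieval might look like the sketch below, which reopens the file read-only to mimic a later session. It assumes PyTables is installed; the path and frame are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path) as store:
        store["df"] = df

    # Reopen read-only, as a later session would.
    with pd.HDFStore(path, mode="r") as store:
        by_key = store["df"]   # same as store.get("df")
        by_attr = store.df     # dotted access works for root-level keys
```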
An object can be deleted by specifying its key.
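A short sketch of deletion by key, assuming PyTables is installed. The key names are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "store.h5")) as store:
        store["df"] = df
        store["other"] = df
        # del store[key] is equivalent to store.remove(key).
        del store["df"]
        remaining = store.keys()
```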
A store can be closed explicitly, or opened in a context manager that closes it on exit.
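Both ways of closing can be sketched as follows, assuming PyTables is installed. The is_open property is used here only to make the state visible.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")

    # Explicit close.
    store = pd.HDFStore(path)
    store["df"] = df
    store.close()
    closed_explicitly = not store.is_open

    # The context manager closes the file when the block exits.
    with pd.HDFStore(path) as store:
        open_inside = store.is_open
        keys = store.keys()
    closed_by_context = not store.is_open
```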
Read/write API#
HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how read_csv and to_csv work.
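A minimal sketch of the top-level API, assuming PyTables is installed. The key "table" and the where filter are illustrative; append=True is used so the data is written in a queryable layout.

```python
import os
import tempfile

import pandas as pd

df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store_tl.h5")
    # Write without opening an HDFStore explicitly.
    df_tl.to_hdf(path, key="table", append=True)
    # Read back only the rows whose index is greater than 2.
    subset = pd.read_hdf(path, key="table", where=["index>2"])
```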
HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting dropna=True.
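The effect of dropna can be sketched as below, assuming PyTables is installed. The frame and the keys "kept" and "dropped" are illustrative; the second row is entirely missing.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_with_missing = pd.DataFrame(
    {"col1": [0, np.nan, 2], "col2": [1, np.nan, np.nan]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.h5")
    # Default: the all-missing row is written as-is.
    df_with_missing.to_hdf(path, key="kept", format="table", mode="w")
    # dropna=True: rows where every value is missing are skipped.
    df_with_missing.to_hdf(
        path, key="dropped", format="table", mode="a", dropna=True
    )
    kept = pd.read_hdf(path, key="kept")
    dropped = pd.read_hdf(path, key="dropped")
```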
Fixed format#
The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format, called the fixed format. These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores. This format is specified by default when using put or to_hdf, or explicitly by format='fixed' or format='f'.
Warning
A fixed format will raise a TypeError if you try to retrieve using a where.
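The failure mode can be reproduced with a sketch like the following, assuming PyTables is installed. The file name and the where expression are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_fixed.h5")
    df.to_hdf(path, key="df", format="fixed", mode="w")
    try:
        # A fixed store cannot be filtered; it must be read in full.
        pd.read_hdf(path, key="df", where="index > 1")
        raised = False
    except TypeError:
        raised = True
    whole = pd.read_hdf(path, key="df")
```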
Table format#
HDFStore supports another PyTables format on disk, the table format. Conceptually, a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by passing format='table' or format='t' to append, put, or to_hdf.
This format can be set as an option as well: pd.set_option('io.hdf.default_format', 'table') makes put, append, and to_hdf store in the table format by default.
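A minimal sketch of appending to a table and of the default-format option, assuming PyTables is installed. option_context is used instead of set_option so the change stays local to the example; get_storer(...).is_table is used only to inspect which format was written.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(8.0), "B": np.arange(8.0) * 2})

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "store.h5")) as store:
        # Write the first half, then append the rest to the same table.
        store.append("df", df.iloc[:4], format="table")
        store.append("df", df.iloc[4:])
        combined = store.select("df")
        appended_is_table = store.get_storer("df").is_table

        # With the option set, put() writes a table instead of fixed.
        with pd.option_context("io.hdf.default_format", "table"):
            store.put("df_opt", df)
        put_is_table = store.get_storer("df_opt").is_table
```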
Note
You can also create a table by passing format='table' or format='t' to a put operation.
Hierarchical keys#
Keys to a store can be specified as a string. These can be in a hierarchical path-name like format (e.g. foo/bar/bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be specified without the leading '/' and are always absolute (e.g. 'foo' refers to '/foo'). Removal operations can remove everything in the sub-store and below, so be careful.
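Hierarchical keys and the cascading removal can be sketched as follows, assuming PyTables is installed. The key names are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "store.h5")) as store:
        store.put("foo/bar/bah", df)
        store.append("food/orange", df)
        store.append("food/apple", df)
        keys_before = sorted(store.keys())

        # Removing a group removes every object stored beneath it.
        store.remove("food")
        keys_after = store.keys()
```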
You can walk through the group hierarchy using the walk method, which will yield a tuple for each group key along with the relative keys of its contents.
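A small sketch of walking a store, assuming PyTables is installed. Each yielded tuple is unpacked into a group path, its sub-group names, and the names of the pandas objects it holds; the full keys are rebuilt by joining them.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "store.h5")) as store:
        store.put("top", df)
        store.put("foo/bar/bah", df)

        leaf_keys = []
        for group_path, subgroups, leaves in store.walk():
            for leaf in leaves:
                # Leaf names are relative to the group they live in.
                leaf_keys.append(f"{group_path}/{leaf}")
        leaf_keys.sort()
```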
Warning
Hierarchical keys cannot be retrieved with dotted (attribute) access as described above for items stored under the root node.
Instead, use explicit string-based keys.
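String-based access to a nested key might look like this sketch, assuming PyTables is installed. The key is illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "store.h5")) as store:
        store.put("foo/bar/bah", df)
        # Keys are absolute, so the leading "/" is optional.
        with_slash = store["/foo/bar/bah"]
        without_slash = store["foo/bar/bah"]
```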
Storing types#
Storing mixed types in a table#
Storing mixed-dtype data is supported. Strings are stored as fixed-width, using the maximum size of the appended column. Subsequent attempts at appending longer strings will raise a ValueError.
Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the string columns. Storing floats, strings, ints, bools, and datetime64 is currently supported. For string columns, passing nan_rep='nan' to append will change the default nan representation on disk (which converts to/from np.nan); this defaults to nan.
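The string-width rule can be sketched as below, assuming PyTables is installed. The column names, the width of 20, and the string lengths are illustrative choices, not values from the text.

```python
import os
import tempfile

import pandas as pd

df_mixed = pd.DataFrame({"value": [1.0, 2.0], "label": ["ab", "cd"]})

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "store.h5")) as store:
        # Reserve 20 characters for string columns instead of the
        # 2 characters the first chunk would otherwise imply.
        store.append("df_mixed", df_mixed, min_itemsize={"values": 20})

        # A 15-character string fits within the reserved width.
        store.append(
            "df_mixed", pd.DataFrame({"value": [3.0], "label": ["x" * 15]})
        )

        # A 40-character string exceeds it and is rejected.
        try:
            store.append(
                "df_mixed", pd.DataFrame({"value": [4.0], "label": ["x" * 40]})
            )
            too_long_rejected = False
        except ValueError:
            too_long_rejected = True

        n_rows = store.select("df_mixed").shape[0]
```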
Storing MultiIndex DataFrames#
Storing MultiIndex DataFrames as tables is very similar to storing/selecting from homogeneous index DataFrames.
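A minimal sketch of storing and querying a MultiIndex frame, assuming PyTables is installed. The level names foo and bar and the level values are illustrative; the level names are used directly in the query string.

```python
import os
import tempfile

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["foo", "baz"], ["one", "two"]], names=["foo", "bar"]
)
df_mi = pd.DataFrame(
    np.arange(8.0).reshape(4, 2), index=index, columns=["A", "B"]
)

with tempfile.TemporaryDirectory() as tmp:
    with pd.HDFStore(os.path.join(tmp, "store.h5")) as store:
        store.append("df_mi", df_mi)
        full = store.select("df_mi")
        # Index levels can be referenced by name in the criterion.
        subset = store.select("df_mi", "foo == 'baz' and bar == 'two'")
```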
Note
The index keyword is reserved and cannot be used as a level name.
Querying#
Querying a table#
select and delete operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.
A query is specified using the Term class under the hood, as a boolean expression.
index and columns are supported indexers of DataFrames.
if data_columns are specified, these can be used as additional indexers.
level name in a MultiIndex, with default name level_0, level_1, … if not provided.
Valid comparison operators are:
=, ==, !=, >, >=, <, <=
Valid boolean expressions are combined with:
| : or
& : and
( and ) : for grouping
These rules are similar to how boolean expressions are used in pandas for indexing
Note
= will be automatically expanded to the comparison operator ==
~ is the not operator, but can only be used in very limited circumstances
If a list/tuple of expressions is passed they will be combined via &
The following are valid expressions:
'index >= date'
"columns = ['A', 'D']"
"columns in ['A', 'D']"
'columns = A'
'columns == A'
"~(columns = ['A', 'B'])"
'index > df.index[3] & string = "bar"'
'(index > df.index[3] & index <= df.index[6]) | string = "bar"'
"ts >= Timestamp('2012-02-01')"
"major_axis>=20130101"
The indexers are on the left-hand side of the sub-expression:
columns, major_axis, ts
The right-hand side of the sub-expression (after a comparison operator) can be:
functions that will be evaluated, e.g. Timestamp('2012-02-01')
strings, e.g. "bar"
date-like, e.g. 20130101, or "20130101"
lists, e.g. "['A', 'B']"
variables that are defined in the local namespace, e.g. date
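A few of these right-hand-side forms can be sketched as follows. This assumes PyTables is installed; the frame, the key "df", and the local variable date are illustrative names, not part of the original text.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame: 10 daily rows with two float columns.
df = pd.DataFrame(
    {"A": np.arange(10, dtype="float64"), "B": np.arange(10, dtype="float64") * 2},
    index=pd.date_range("2013-01-01", periods=10),
)
date = pd.Timestamp("2013-01-06")  # local variable referenced by name below

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("df", df)
        # Right-hand side is a function call evaluated at query time.
        by_function = store.select("df", "index >= Timestamp('2013-01-06')")
        # Right-hand side is a variable from the calling namespace.
        by_variable = store.select("df", "index >= date")
        # Right-hand side is a list of column names.
        by_list = store.select("df", "columns = ['A']")

print(len(by_function), len(by_variable), list(by_list.columns))
```

The function and variable forms should select the same rows here, since both resolve to the same timestamp.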
Note
Passing a string to a query by interpolating it into the query expression is not recommended. Simply assign the string of interest to a variable and use that variable in an expression. For example, do this
string = "HolyMoly'"
store.select("df", "index == string")
instead of this
string = "HolyMoly'"
store.select('df', f'index == {string}')
The latter will not work and will raise a SyntaxError. Note that there’s a single quote followed by a double quote in the string variable.
If you must interpolate, use the '%r' format specifier
store.select("df", "index == %r" % string)
which will quote string.
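The effect of the two interpolation styles can be checked without a store. The sketch below uses Python's ast.parse as a stand-in for the query parser (an assumption made for illustration; pandas applies its own preprocessing before parsing), and the helper parses is hypothetical.

```python
import ast

string = "HolyMoly'"  # the value itself ends with a single quote

# Interpolating the raw value drops the quoting the parser needs.
unsafe = f"index == {string}"
# %r inserts repr(string), which adds quotes and escapes as needed.
safe = "index == %r" % string


def parses(expr):
    """Return True if expr is a syntactically valid Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False


print(unsafe, parses(unsafe))
print(safe, parses(safe))
```

Only the %r version yields an expression in which the value is a properly quoted string literal.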
Here are some examples
Use boolean expressions, with in-line function evaluation
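A minimal sketch of a boolean expression with in-line function evaluation, assuming PyTables is installed. The frame dfq, its values, and the file name are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame: 10 daily rows, columns A to D.
dfq = pd.DataFrame(
    np.arange(40, dtype="float64").reshape(10, 4),
    columns=list("ABCD"),
    index=pd.date_range("2013-01-01", periods=10),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("dfq", dfq, format="table", data_columns=True)
        # Timestamp(...) is evaluated in-line; & combines the two conditions.
        result = store.select(
            "dfq", "index > Timestamp('20130104') & columns = ['A', 'B']"
        )

print(result.shape)
```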
Use inline column reference
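A sketch of an inline column reference, assuming PyTables is installed and that the columns were written as data columns. The frame and its values are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Illustrative frame; A and C are stored as data columns so they can be queried.
dfq = pd.DataFrame(
    {"A": [-1.0, 1.0, -1.0, 1.0, -1.0], "C": [-1.0, -1.0, 1.0, 1.0, -1.0]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("dfq", dfq, format="table", data_columns=True)
        # Column names appear directly on the left-hand side of each comparison.
        result = store.select("dfq", where="A > 0 | C > 0")

print(list(result.index))
```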
The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter'.
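The equivalence can be sketched as follows, assuming PyTables is installed (frame, key, and file name are illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame with three float columns.
df = pd.DataFrame(
    np.arange(15, dtype="float64").reshape(5, 3), columns=["A", "B", "C"]
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("df", df)
        # Column filter expressed inside the query string ...
        via_where = store.select("df", "columns = ['A', 'B']")
        # ... and the same filter expressed through the columns keyword.
        via_keyword = store.select("df", columns=["A", "B"])

print(via_where.equals(via_keyword))
```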
start and stop parameters can be specified to limit the total search space. These are in terms of the total number of rows in a table.
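A short sketch of row-range selection, assuming PyTables is installed. The frame and the chosen bounds are illustrative; stop is taken to be exclusive, as with Python slices.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame with 10 rows.
df = pd.DataFrame({"A": np.arange(10, dtype="float64")})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("df", df)
        # Only table rows 2, 3 and 4 are searched and returned.
        subset = store.select("df", start=2, stop=5)

print(subset["A"].tolist())
```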
Note
select will raise a ValueError if the query expression has an unknown variable reference. Usually this means that you are trying to select on a column that is not a data_column.
select will raise a SyntaxError if the query expression is not valid.
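Both failure modes can be reproduced with a small sketch, assuming PyTables is installed. The frame, the choice of A as the only data column, and the malformed query "A >" are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})
errors = {}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("df", df, data_columns=["A"])  # only A is queryable
        try:
            store.select("df", "B > 4")  # B was not stored as a data column
        except ValueError as exc:
            errors["unknown_reference"] = type(exc).__name__
        try:
            store.select("df", "A >")  # incomplete expression
        except SyntaxError as exc:
            errors["invalid_expression"] = type(exc).__name__

print(errors)
```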
Query timedelta64[ns]#
You can store and query using the timedelta64[ns] type. Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, ns for the timedelta.
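A minimal sketch of a timedelta query, assuming PyTables is installed. The frame dftd is invented for illustration, and the term is written as a quoted string such as '-3.5D', which is assumed to be parsed as a timedelta.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame: C holds 0, -1, ..., -9 days as timedelta64[ns].
dftd = pd.DataFrame(
    {
        "A": np.arange(10, dtype="float64"),
        "C": pd.to_timedelta(-np.arange(10), unit="D").astype("timedelta64[ns]"),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("dftd", dftd, data_columns=True)
        # Keep rows whose timedelta is below minus three and a half days.
        result = store.select("dftd", "C < '-3.5D'")

print(result["A"].tolist())
```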
Query MultiIndex#
Selecting from a MultiIndex can be achieved by using the name of the level.
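A sketch of selecting by level name, assuming PyTables is installed. The level names year and quarter and the sample data are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame indexed by two named integer levels.
index = pd.MultiIndex.from_product(
    [[2012, 2013], [1, 2, 3, 4]], names=["year", "quarter"]
)
df_mi = pd.DataFrame(
    np.arange(16, dtype="float64").reshape(8, 2), index=index, columns=["A", "B"]
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")  # throwaway file for the sketch
    with pd.HDFStore(path) as store:
        store.append("df_mi", df_mi)
        # Level names act as indexers on the left-hand side.
        result = store.select("df_mi", "year == 2013 & quarter > 2")

print(result.index.tolist())
```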
If the `MultiIndex` level names are `None`, the levels are automatically made available via the `level_n` keyword, where `n` is the level of the `MultiIndex` you want to select from.
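The same idea for unnamed levels might look like the sketch below. The data and the key `df_mi_2` are assumptions made for illustration; only the `level_0` and `level_1` names follow from the rule described above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame whose MultiIndex levels have no names.
index = pd.MultiIndex.from_product([["foo", "bar"], ["one", "two", "three"]])
df_mi_2 = pd.DataFrame(
    np.arange(18.0).reshape(6, 3), index=index, columns=["A", "B", "C"]
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df_mi_2", df_mi_2)
        # Unnamed levels are addressed as level_0, level_1, ...
        result = store.select(
            "df_mi_2", "(level_0 == 'bar') & (level_1 == 'one')"
        )

print(result)
```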
Indexing#
You can create or modify an index for a table with `create_table_index` after data is already in the table (after an `append`/`put` operation). Creating a table index is highly encouraged. It will speed up your queries a great deal when you use a `select` with the indexed dimension as the `where`.
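A short sketch of rebuilding the index with different settings and then querying on the indexed dimension. The frame, the key `df`, and the chosen `optlevel` and `kind` values are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(8.0), "B": np.arange(8.0)[::-1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df", df)
        # Rebuild the index on the table's index column with new settings.
        store.create_table_index("df", optlevel=9, kind="full")
        index_col = store.get_storer("df").table.cols.index
        settings = (index_col.index.optlevel, index_col.index.kind)
        # Query on the indexed dimension.
        result = store.select("df", "index >= 4")

print(settings)
print(result)
```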
Note
Indexes are automagically created on the indexables and any data columns you specify. This behavior can be turned off by passing `index=False` to `append`.
Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then create the index once when you have finished appending.
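The bulk-append pattern could be sketched as follows. The chunked data, the key `df`, and the data column `B` are hypothetical; the `is_indexed` attribute is read from the underlying PyTables column only to show the effect.

```python
import os
import tempfile

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        for start in range(0, 30, 10):
            values = np.arange(start, start + 10, dtype="float64")
            chunk = pd.DataFrame(
                {"A": values, "B": values * 2},
                index=np.arange(start, start + 10),
            )
            # Skip index creation while appending.
            store.append("df", chunk, data_columns=["B"], index=False)

        indexed_before = store.get_storer("df").table.cols.B.is_indexed
        # Build the index once, after all appends are done.
        store.create_table_index("df", columns=["B"], optlevel=9, kind="full")
        indexed_after = store.get_storer("df").table.cols.B.is_indexed
        nrows = store.get_storer("df").nrows

print(indexed_before, indexed_after, nrows)
```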
See here for how to create a completely-sorted-index (CSI) on an existing store.
Query via data columns#
You can designate (and index) certain columns that you want to be able to perform queries on (other than the `indexable` columns, which you can always query). For instance, say you want to perform this common operation on-disk and return just the frame that matches the query. You can specify `data_columns=True` to force all columns to be `data_columns`.
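A minimal sketch of an on-disk query against data columns, assuming a small hypothetical frame with columns `A`, `B`, and `string` stored under the key `df_dc`:

```python
import os
import tempfile

import pandas as pd

df_dc = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [-1.0, 2.0, 3.0, -4.0],
        "string": ["foo", "foo", "bar", "foo"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        # Only B and string become queryable data columns; A does not.
        store.append("df_dc", df_dc, data_columns=["B", "string"])
        result = store.select("df_dc", where="(B > 0) & (string == 'foo')")

print(result)
```

Passing `data_columns=True` instead of a list would make every column queryable, at the cost described in the next paragraph.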
There is some performance degradation from making lots of columns into `data columns`, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation. (Of course, you can simply read in the data and create a new table.)
Iterator#
You can pass `iterator=True` or `chunksize=number_in_a_chunk` to `select` and `select_as_multiple` to return an iterator on the results. The default is 50,000 rows returned in a chunk.
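For illustration, iterating over a stored table in chunks might look like this. The ten-row frame and the chunk size of 4 are arbitrary assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(10)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df", df)
        # Each iteration yields a DataFrame of at most 4 rows.
        sizes = [len(chunk) for chunk in store.select("df", chunksize=4)]

print(sizes)
```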
Note
You can also use the iterator with `read_hdf`, which will open the store and then automatically close it when finished iterating.
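A small sketch of the `read_hdf` variant, assuming a hypothetical file written in table format under the key `df`:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(10)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    # Chunked reads need the table format.
    df.to_hdf(path, key="df", format="table")

    total = 0
    # The file is opened here and closed once the iterator is exhausted.
    for chunk in pd.read_hdf(path, "df", chunksize=4):
        total += len(chunk)

print(total)
```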
Note that the chunksize keyword applies to the source rows. So if you are doing a query, the chunksize will subdivide the total rows in the table and apply the query to each subdivision, returning an iterator on potentially unequal-sized chunks.
Here is a recipe for generating a query and using it to create equal-sized return chunks.
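One way to sketch the equal-sized-chunks recipe is to fetch the coordinates of the matching rows first and then select them in fixed-size slices. The helper `chunks`, the frame `dfeq`, and the query `number > 4` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def chunks(seq, n):
    """Split seq into consecutive slices of length n (hypothetical helper)."""
    return [seq[i : i + n] for i in range(0, len(seq), n)]


dfeq = pd.DataFrame({"number": np.arange(1, 11)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("dfeq", dfeq, data_columns=["number"])
        # Row locations of every match, computed once.
        coordinates = store.select_as_coordinates("dfeq", "number > 4")
        # Select the matches two rows at a time.
        pieces = [store.select("dfeq", where=c) for c in chunks(coordinates, 2)]

print([len(p) for p in pieces])
```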
Advanced queries#
To retrieve a single indexable or data column, use the method `select_column`. This will, for example, enable you to get the index very quickly. The method returns a `Series` of the result, indexed by the row number. It does not currently accept the `where` selector.
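A brief sketch, assuming a hypothetical three-row frame stored under `df_dc` with `B` as a data column:

```python
import os
import tempfile

import pandas as pd

df_dc = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [1.5, 2.5, 3.5]}, index=[10, 20, 30]
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df_dc", df_dc, data_columns=["B"])
        # Both results are Series labelled by row number, not by the frame's index.
        index_values = store.select_column("df_dc", "index")
        b_values = store.select_column("df_dc", "B")

print(index_values)
print(b_values)
```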
Sometimes you want to get the coordinates (a.k.a. the index locations) of your query. This returns an `Index` of the resulting locations. These coordinates can also be passed to subsequent `where` operations.
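As a sketch with a hypothetical ten-row frame stored under `df_coord`, the coordinates of a query can be computed once and reused:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_coord = pd.DataFrame({"A": np.arange(10.0)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df_coord", df_coord)
        # Row locations that satisfy the query.
        c = store.select_as_coordinates("df_coord", "index > 4")
        # The locations can be fed back in as a where argument.
        subset = store.select("df_coord", where=c)

print(list(c))
print(subset)
```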
Sometimes your query can involve creating a list of rows to select. Usually this `mask` would be a resulting `index` from an indexing operation. For example, you can select the rows of a DatetimeIndex whose month is 5.
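The mask approach might be sketched like this, assuming a hypothetical daily series covering the year 2000 stored under `df_mask`:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_mask = pd.DataFrame(
    {"A": np.arange(366.0)},
    index=pd.date_range("2000-01-01", periods=366),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df_mask", df_mask)
        # Read only the index column, then keep the row numbers that fall in May.
        c = store.select_column("df_mask", "index")
        where = c[pd.DatetimeIndex(c).month == 5].index
        result = store.select("df_mask", where=where)

print(len(result))
```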
If you want to inspect the stored object, retrieve it via `get_storer`. You could use this programmatically to, say, get the number of rows in an object.
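For instance, a row count can be read from the storer without loading the data. The frame and the key `df` below are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(7.0)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df", df)
        # The storer exposes metadata about the stored table.
        nrows = store.get_storer("df").nrows

print(nrows)
```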
Multiple table queries#
The methods `append_to_multiple` and `select_as_multiple` can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) in which you index most or all of the columns and against which you perform your queries. The other table(s) are data tables with an index matching the selector table’s index. You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.
The `append_to_multiple` method splits a given single DataFrame into multiple tables according to `d`, a dictionary that maps the table names to a list of ‘columns’ you want in that table. If `None` is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame. The argument `selector` defines which table is the selector table (which you can make queries from). The argument `dropna` will drop rows from the input `DataFrame` to ensure tables are synchronized. This means that if a row for one of the tables being written to is entirely `np.nan`, that row will be dropped from all tables.
If `dropna` is False, THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES. Remember that entirely `np.nan` rows are not written to the HDFStore, so if you choose to call `dropna=False`, some tables may have more rows than others, and therefore `select_as_multiple` may not work or it may return unexpected results.
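A minimal sketch of the selector-table pattern. The frame `df_mt`, the table names `df1_mt` and `df2_mt`, and the query are assumptions for illustration; the data contains no missing values, so `dropna` plays no role here.

```python
import os
import tempfile

import pandas as pd

df_mt = pd.DataFrame(
    {
        "A": [1.0, -1.0, 2.0, -2.0],
        "B": [1.0, 1.0, -1.0, 1.0],
        "C": [10.0, 20.0, 30.0, 40.0],
        "D": [0.1, 0.2, 0.3, 0.4],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        # df1_mt is the selector table holding A and B;
        # df2_mt receives the remaining columns (C and D).
        store.append_to_multiple(
            {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
        )
        # Query the selector table, get columns from both tables back.
        result = store.select_as_multiple(
            ["df1_mt", "df2_mt"],
            where="(A > 0) & (B > 0)",
            selector="df1_mt",
        )

print(result)
```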
Delete from a table#
You can delete from a table selectively by specifying a `where`. In deleting rows, it is important to understand that `PyTables` deletes rows by erasing the rows, then moving the following data. Thus, deleting can potentially be a very expensive operation depending on the orientation of your data. To get optimal performance, it’s worthwhile to have the dimension you are deleting be the first of the `indexables`.
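A selective delete could be sketched with `HDFStore.remove`, which accepts a `where` and reports how many rows were removed. The frame and the key `df` are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(10.0)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path, mode="w") as store:
        store.append("df", df)
        # Delete only the rows matching the query.
        removed = store.remove("df", where="index >= 5")
        remaining = store.select("df")

print(removed)
print(remaining)
```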
Data is ordered (on the disk) in terms of the `indexables`. Here’s a simple use case. You store panel-type data, with dates in the `major_axis` and ids in the `minor_axis`. The data is then interleaved like this:
- date_1
  - id_1
  - id_2
  - .
  - id_n
- date_2
  - id_1
  - .
  - id_n
It should be clear that a delete operation on the `major_axis` will be fairly quick, as one chunk is removed and the following data is then moved. On the other hand, a delete operation on the `minor_axis` will be very expensive. In this case it would almost certainly be faster to rewrite the table using a `where` that selects all but the missing data.
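The cost difference can be illustrated without touching a file. The following is a minimal sketch in plain Python, not PyTables code: it lays out hypothetical rows in (date, id) order as in the interleaving above and counts how many separate contiguous chunks each kind of delete has to remove. The names `contiguous_runs`, `date_hits` and `id_hits` are made up for the illustration.

```python
from itertools import product


def contiguous_runs(positions):
    """Count maximal runs of consecutive integers in a sorted list of row positions."""
    runs = 0
    previous = None
    for pos in positions:
        if previous is None or pos != previous + 1:
            runs += 1
        previous = pos
    return runs


# Rows as they would sit on disk, ordered by (date, id).
dates = [f"date_{i}" for i in range(1, 4)]  # 3 dates
ids = [f"id_{j}" for j in range(1, 5)]      # 4 ids per date
rows = list(product(dates, ids))

# Deleting one date removes a single contiguous chunk of rows.
date_hits = [i for i, (d, _) in enumerate(rows) if d == "date_2"]
# Deleting one id removes one row inside every date block.
id_hits = [i for i, (_, k) in enumerate(rows) if k == "id_2"]

print(contiguous_runs(date_hits))  # 1
print(contiguous_runs(id_hits))    # 3
```

With n dates, deleting a single id touches n scattered chunks, each followed by a data move, which is why rewriting the table is usually the cheaper option in that case.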
Warning
Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again WILL TEND TO INCREASE THE FILE SIZE.
To repack and clean the file, use ptrepack.
Notes & caveats
Compression
`PyTables` allows the stored data to be compressed. This applies to all kinds of stores, not just tables. Two parameters are used to control compression: `complevel` and `complib`.
- `complevel` specifies if and how hard data is to be compressed. `complevel=0` and `complevel=None` disable compression, and `0<complevel<10` enables compression.
- `complib` specifies which compression library to use. If nothing is specified, the default library `zlib` is used. A compression library usually optimizes for either good compression rates or speed, and the results will depend on the type of data. Which type of compression to choose depends on your specific needs and data.
The list of supported compression libraries:
- zlib: The default compression library. A classic in terms of compression; it achieves good compression rates but is somewhat slow.
- lzo: Fast compression and decompression.
- bzip2: Good compression rates.
- blosc: Fast compression and decompression.
Support for alternative blosc compressors:
- blosc:blosclz: This is the default compressor for `blosc`.
- blosc:lz4: A compact, very popular and fast compressor.
- blosc:lz4hc: A tweaked version of LZ4; produces better compression ratios at the expense of speed.
- blosc:snappy: A popular compressor used in many places.
- blosc:zlib: A classic; somewhat slower than the previous ones, but achieving better compression ratios.
- blosc:zstd: An extremely well balanced codec; it provides the best compression ratios among the others above, and at reasonably fast speed.
If `complib` is defined as something other than the listed libraries, a `ValueError` exception is issued.
Note
If the library specified with the `complib` option is missing on your platform, compression defaults to `zlib` without further ado.
To enable compression for all objects within the file, pass `complevel` and `complib` when opening the `HDFStore`.
Alternatively, use on-the-fly compression (this only applies to tables) in stores where compression is not enabled, by passing `complib` and `complevel` to `append`.
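The two ways of turning on compression might look like the sketch below. It assumes PyTables is installed; the helper names, the key `"df"` and the chosen levels are illustrative only, and `zlib` is used simply because it is the default library.

```python
import pandas as pd


def write_store_compressed(df, path, complevel=9, complib="zlib"):
    """Open the store with compression so every object written to it is compressed."""
    with pd.HDFStore(path, mode="w", complevel=complevel, complib=complib) as store:
        store.put("df", df)


def append_table_compressed(df, path, complevel=5, complib="zlib"):
    """Open the store without compression and compress only this table on append."""
    with pd.HDFStore(path, mode="w") as store:
        store.append("df", df, complevel=complevel, complib=complib)
```

Calling `write_store_compressed(df, "store_compressed.h5")` on repetitive data should give a noticeably smaller file than an uncompressed write, although the ratio depends on the data and the library chosen.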
ptrepack
`PyTables` offers better write performance when tables are compressed after they are written, as opposed to turning on compression at the very beginning. You can use the supplied `PyTables` utility `ptrepack`. In addition, `ptrepack` can change compression levels after the fact, for example:
ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5
Furthermore, `ptrepack in.h5 out.h5` will repack the file to allow you to reuse previously deleted space. Alternatively, one can simply remove the file and write again, or use the `copy` method.
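As a rough sketch of the `copy` route, the helper below rewrites every key of an existing store into a fresh file using `HDFStore.copy`. It assumes PyTables is installed; the function name and paths are hypothetical, and whether the new file ends up smaller depends on how much deleted space the old one carried.

```python
import pandas as pd


def repack_with_copy(src_path, dst_path, complevel=None, complib=None):
    """Copy every key of src_path into a new file at dst_path.

    Only the data that is still present gets written, so space left behind
    by earlier deletions is not carried over to the new file.
    """
    with pd.HDFStore(src_path, mode="r") as src:
        new_store = src.copy(dst_path, mode="w", complevel=complevel, complib=complib)
        new_store.close()
```

Passing `complevel` and `complib` here also changes the compression of the copy, similar in spirit to what `ptrepack` does from the command line.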
Caveats
Warning
`HDFStore` is not thread-safe for writing. The underlying `PyTables` only supports concurrent reads (via threading or processes). If you need reading and writing at the same time, you need to serialize these operations in a single thread in a single process. You will corrupt your data otherwise. See (GH2397) for more information.
- If you use locks to manage write access between multiple processes, you may want to use `fsync()` before releasing write locks. For convenience you can use `store.flush(fsync=True)` to do this for you.
- Once a `table` is created, its columns (DataFrame) are fixed; only exactly the same columns can be appended.
- Be aware that timezones (e.g., `pytz.timezone('US/Eastern')`) are not necessarily equal across timezone versions. So if data is localized to a specific timezone in the HDFStore using one version of a timezone library and that data is updated with another version, the data will be converted to UTC since these timezones are not considered equal. Either use the same version of the timezone library or use `tz_convert` with the updated timezone definition.
Warning
`PyTables` will show a `NaturalNameWarning` if a column name cannot be used as an attribute selector. Natural identifiers contain only letters, numbers, and underscores, and may not begin with a number. Other identifiers cannot be used in a `where` clause and are generally a bad idea.
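Column names can be screened before writing. The sketch below encodes the rule as stated above (letters, numbers and underscores, not starting with a number) with a regular expression; it is an approximation written for illustration, not the check PyTables itself runs. Rejecting Python keywords is an extra assumption, and both function names are hypothetical.

```python
import keyword
import re

# Letters, digits and underscores only, and the first character is not a digit.
_NATURAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_natural_name(name):
    """Return True if name looks usable as an attribute selector."""
    return bool(_NATURAL_NAME.fullmatch(name)) and not keyword.iskeyword(name)


def unnatural_columns(columns):
    """List the column names that would likely trigger a NaturalNameWarning."""
    return [c for c in columns if not is_natural_name(str(c))]


print(unnatural_columns(["price_usd", "unit price", "2nd_col", "_tmp"]))
# ['unit price', '2nd_col']
```

Renaming the flagged columns before the first append avoids both the warning and the restriction on `where` clauses.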
DataTypes
`HDFStore` will map an object dtype to the `PyTables` underlying dtype. This means the following types are known to work:

| Type | Represents missing values |
|---|---|
| floating: `float64`, `float32`, `float16` | `np.nan` |
| integer: `int64`, `int32`, `int8`, `uint64`, `uint32`, `uint8` | |
| boolean | |
| `datetime64[ns]` | `NaT` |
| `timedelta64[ns]` | `NaT` |
| categorical: see the section below | |
| object: `strings` | `np.nan` |

`unicode` columns are not supported, and WILL FAIL.
Categorical data
You can write data that contains `category` dtypes to a `HDFStore`. Queries work the same as if it was an object array. However, the `category` dtyped data is stored in a more efficient manner.
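A minimal sketch of writing and querying a categorical column, assuming PyTables is installed. The frame contents, the key `"dfcat"` and the function name are made up for the illustration; the column is declared as a data column so that it can be used in the `where` clause.

```python
import numpy as np
import pandas as pd


def categorical_roundtrip(path):
    """Write a frame with a category column as a table and query it back."""
    dfcat = pd.DataFrame(
        {
            "A": pd.Series(list("aabbcdba")).astype("category"),
            "B": np.arange(8.0),
        }
    )
    with pd.HDFStore(path, mode="w") as store:
        # data_columns makes "A" queryable with a where clause.
        store.append("dfcat", dfcat, format="table", data_columns=["A"])
        return store.select("dfcat", where="A in ['b', 'c']")
```

The query is written exactly as it would be for a plain string column, and the selected column comes back with a categorical dtype.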
String columns
min_itemsize
The underlying implementation of `HDFStore` uses a fixed column width (itemsize) for string columns. A string column's itemsize is calculated as the maximum length of the data (for that column) that is passed to the `HDFStore` in the first append. If a subsequent append introduces a string larger than the column can hold, an Exception will be raised (otherwise you could have a silent truncation of these columns, leading to loss of information). In the future we may relax this and allow a user-specified truncation to occur.
Pass `min_itemsize` on the first table creation to a-priori specify the minimum length of a particular string column. `min_itemsize` can be an integer, or a dict mapping a column name to an integer. You can pass `values` as a key to allow all indexables or data_columns to have this min_itemsize.
Passing a `min_itemsize` dict will cause all passed columns to be created as data_columns automatically.
Note
If you are not passing any `data_columns`, then the `min_itemsize` will be the maximum of the length of any string passed.
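The effect of `min_itemsize` can be sketched as below, assuming PyTables is installed. The keys `"reserved"` and `"unreserved"`, the sample strings and the function name are hypothetical. The columns are created with `dtype=object` here only to keep them as plain Python strings across pandas versions.

```python
import pandas as pd


def append_with_min_itemsize(path):
    """Show that reserving width up front lets later, longer strings fit."""
    short = pd.DataFrame({"A": ["foo"], "B": ["bar"]}, dtype=object)
    longer = pd.DataFrame({"A": ["a much longer string"], "B": ["baz"]}, dtype=object)
    with pd.HDFStore(path, mode="w") as store:
        # Reserve 30 characters for column A when the table is first created.
        store.append("reserved", short, min_itemsize={"A": 30})
        store.append("reserved", longer)

        # Without a reservation the width is fixed by the first append (3 here).
        store.append("unreserved", short)
        try:
            store.append("unreserved", longer)
            overflow_error = None
        except ValueError as exc:
            overflow_error = exc
        return store.select("reserved"), overflow_error
```

The first table accepts the longer string because its width was reserved in advance, while the second append to the unreserved table is rejected instead of being silently truncated.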
nan_rep
String columns will serialize a np.nan (a missing value) with the nan_rep string representation. This defaults to the string value nan. You could inadvertently turn an actual nan value into a missing value
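The effect of nan_rep can be sketched as follows. The keys, the file name and the replacement string "_nan_" are illustrative choices, and the PyTables check is an assumption added so the snippet degrades gracefully when the optional dependency is absent.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# The third value is the literal string "nan", not a missing value.
df = pd.DataFrame({"A": ["foo", "bar", "nan"]}, dtype=object)

default_missing = custom_missing = None  # stay None without PyTables
if importlib.util.find_spec("tables") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nan_rep.h5")  # hypothetical file name
        with pd.HDFStore(path) as store:
            # Default nan_rep: the string "nan" collides with the missing-value marker.
            store.append("default_rep", df)
            # A custom marker keeps the literal string "nan" distinguishable.
            store.append("custom_rep", df, nan_rep="_nan_")
            default_missing = store.select("default_rep")["A"].isna().tolist()
            custom_missing = store.select("custom_rep")["A"].isna().tolist()

print(default_missing, custom_missing)
```

Choosing a marker that cannot occur in the data is the simplest way to avoid the collision.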
External compatibility#
HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, HDFStore can read native PyTables format tables
It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package website). Create a table format store like this
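A minimal sketch of creating such a table format store is shown here. The file name "export.h5", the key "df_for_r" and the column names are hypothetical; append is used because it always writes the table format.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_for_r = pd.DataFrame(
    {
        "first": rng.random(100),
        "second": rng.random(100),
        "class": rng.integers(0, 2, 100),
    }
)

keys = None  # stays None when PyTables is not installed
if importlib.util.find_spec("tables") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "export.h5")  # hypothetical file name
        with pd.HDFStore(path) as store:
            # append() writes the table format, which external readers can walk.
            store.append("df_for_r", df_for_r)
            keys = store.keys()

print(keys)
```

In real use the file would be written to a persistent location rather than a temporary directory.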
In R this file can be read into a data.frame object using the rhdf5 library. The following example function reads the corresponding column names and data values from the stored values and assembles them into a data.frame
Now you can import the DataFrame into R
Note
The R function lists the entire HDF5 file’s contents and assembles the data.frame object from all matching nodes, so use this only as a starting point if you have stored multiple DataFrame objects to a single HDF5 file
Performance#
The table format comes with a writing performance penalty compared to fixed stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer than with regular stores. Query times can be quite fast, especially on an indexed axis
You can pass chunksize=<int> to append, specifying the write chunksize (default is 50000). This will significantly lower your memory usage on writing
You can pass expectedrows=<int> to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance
Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
A PerformanceWarning will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). See Here for more information and some solutions
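The chunksize and expectedrows hints above can be sketched like this. The frame, the key "df", the file name and the specific numbers are illustrative assumptions, and the PyTables check only guards against the optional dependency being absent.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) * 0.5})

n_rows = None  # stays None when PyTables is not installed
if importlib.util.find_spec("tables") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "perf.h5")  # hypothetical file name
        with pd.HDFStore(path) as store:
            # First append: tell PyTables the total row count it should expect,
            # and write in chunks of 100 rows to limit memory use.
            store.append("df", df.iloc[:500], chunksize=100, expectedrows=1000)
            # Later appends only need the chunk size.
            store.append("df", df.iloc[500:], chunksize=100)
            n_rows = int(store.get_storer("df").nrows)

print(n_rows)
```

The numbers are deliberately small; in practice chunksize matters when a single append is too large to buffer comfortably in memory.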
Feather#
Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy
Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as categorical and datetime with tz
Several caveats
The format will NOT write an Index, or MultiIndex for the DataFrame and will raise an error if a non-default one is provided. You can .reset_index() to store the index or .reset_index(drop=True) to ignore it
Duplicate column names and non-string column names are not supported
Actual Python objects in object dtype columns are not supported. These will raise a helpful error message on an attempt at serialization
See the Full Documentation
Write to a feather file
Read from a feather file
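A minimal round-trip sketch for Feather, assuming pyarrow is installed (the check below simply skips the I/O otherwise). The frame, the index name "key" and the file name are hypothetical. The index is moved into a column with .reset_index() before writing, following the caveat above, and restored after reading.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"a": list("abc"), "b": [1, 2, 3]},
    index=pd.Index(["x", "y", "z"], name="key"),
)

restored = None  # stays None when pyarrow is not installed
if importlib.util.find_spec("pyarrow") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.feather")  # hypothetical file name
        # Feather does not store the index, so move it into a regular column.
        df.reset_index().to_feather(path)
        # Read the file back and rebuild the index from the stored column.
        restored = pd.read_feather(path).set_index("key")

print(restored)
```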
Parquet#
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz
Several caveats
Duplicate column names and non-string column names are not supported
The pyarrow engine always writes the index to the output, but fastparquet only writes non-default indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can force including or omitting indexes with the index argument, regardless of the underlying engine
Index level names, if specified, must be strings
In the pyarrow engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype
The pyarrow engine preserves the ordered flag of categorical dtypes with string types. fastparquet does not preserve the ordered flag
Unsupported types include Interval and actual Python object types. These will raise a helpful error message on an attempt at serialization. The Period type is supported with pyarrow >= 0.16.0
The pyarrow engine preserves extension data types such as the nullable integer and string data type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols, see the extension types documentation)
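The index argument mentioned in the caveats can be sketched as follows. The frame, the index name "id" and the file names are hypothetical, and the check for an installed engine is an assumption added so the snippet does nothing when neither pyarrow nor fastparquet is available.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="id"))

have_engine = any(
    importlib.util.find_spec(name) is not None for name in ("pyarrow", "fastparquet")
)

kept = dropped = None  # stay None when no Parquet engine is installed
if have_engine:
    with tempfile.TemporaryDirectory() as tmp:
        with_index = os.path.join(tmp, "with_index.parquet")  # hypothetical names
        without_index = os.path.join(tmp, "without_index.parquet")
        # index=True always stores the index, whichever engine is used.
        df.to_parquet(with_index, index=True)
        # index=False omits it, which suits non-pandas consumers.
        df.to_parquet(without_index, index=False)
        kept = pd.read_parquet(with_index).index.tolist()
        dropped = pd.read_parquet(without_index).index.tolist()

print(kept, dropped)
```

Passing index explicitly makes the output independent of which engine happens to be installed.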
You can specify an engine to direct the serialization. This can be one of pyarrow, fastparquet, or auto. If the engine is NOT specified, then the pd.options.io.parquet.engine option is checked; if this is also auto, then pyarrow is tried first, falling back to fastparquet
See the documentation for pyarrow and fastparquet
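A short sketch of inspecting and temporarily overriding the engine option. It only touches the option itself, so no Parquet engine needs to be installed; the variable names are illustrative.

```python
import pandas as pd

# The option consulted when no engine argument is given.
default_engine = pd.get_option("io.parquet.engine")

# Temporarily prefer one engine for a block of code.
with pd.option_context("io.parquet.engine", "pyarrow"):
    inside = pd.get_option("io.parquet.engine")

# The previous value is restored when the block exits.
after = pd.get_option("io.parquet.engine")

print(default_engine, inside, after)
```

For a single call, passing engine="pyarrow" or engine="fastparquet" directly to to_parquet or read_parquet has the same effect without changing the option.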
Note
These engines are very similar and should read/write nearly identical parquet format files. Recent versions of pyarrow support timedelta data, and recent versions of fastparquet support timezone aware datetimes. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library)
Write to a parquet file with `DataFrame.to_parquet()`, read from a parquet file with `pd.read_parquet()`, and pass the `columns` argument to read only certain columns of a parquet file.
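As a minimal sketch of the write, read, and column-subset operations described above, the following round-trips a small frame through a temporary directory. The frame contents and the file name `example.parquet` are illustrative assumptions, and the round trip only runs when the optional pyarrow dependency is importable.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical sample data; any frame with parquet-compatible dtypes works.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [0.5, 1.5, 2.5]})

# The parquet engine is an optional dependency, so probe for it first.
have_pyarrow = importlib.util.find_spec("pyarrow") is not None

if have_pyarrow:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.parquet")
        # Write the frame to a parquet file.
        df.to_parquet(path, engine="pyarrow")
        # Read the whole file back.
        full = pd.read_parquet(path, engine="pyarrow")
        # Read only certain columns.
        subset = pd.read_parquet(path, engine="pyarrow", columns=["a", "b"])
    print(list(full.columns))
    print(list(subset.columns))
```

Reading only the columns you need can reduce I/O, because parquet stores data column by column.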
Handling indexes#
Serializing a `DataFrame` to parquet may include the implicit index as one or more columns in the output file. Thus, writing a `DataFrame` that has two columns, `a` and `b`, creates a parquet file with three columns if you use `pyarrow` for serialization: `a`, `b`, and `__index_level_0__`. If you're using `fastparquet`, the index may or may not be written to the file.
This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column doesn't exist in the target table.
If you want to omit a dataframe's indexes when writing, pass `index=False` to `to_parquet()`.
With `index=False`, the parquet file contains just the two expected columns, `a` and `b`. If your `DataFrame` has a custom index, you won't get it back when you load this file into a `DataFrame`.
Passing `index=True` will always write the index, even if that's not the underlying engine's default behavior.
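The effect of the `index` argument can be checked by inspecting the column names stored in the file. This is a sketch under stated assumptions: the frame, its string index, and the file names are hypothetical, and the schema is read with `pyarrow.parquet.read_schema`. A custom index is used on purpose, because some pyarrow versions may record a default `RangeIndex` as metadata rather than as a column.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical frame with a custom (non-default) index.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

have_pyarrow = importlib.util.find_spec("pyarrow") is not None

if have_pyarrow:
    import pyarrow.parquet as pq

    with tempfile.TemporaryDirectory() as tmp:
        with_index = os.path.join(tmp, "with_index.parquet")
        no_index = os.path.join(tmp, "no_index.parquet")
        df.to_parquet(with_index, engine="pyarrow", index=True)
        df.to_parquet(no_index, engine="pyarrow", index=False)

        # Column names physically stored in each file.
        cols_with_index = pq.read_schema(with_index).names
        cols_no_index = pq.read_schema(no_index).names

        # The custom index is gone after a round trip without it.
        restored = pd.read_parquet(no_index, engine="pyarrow")
    print(cols_with_index)
    print(cols_no_index)
    print(list(restored.index))
```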
Partitioning Parquet files#
Parquet supports partitioning of data based on the values of one or more columns.
The `path` specifies the parent directory to which data will be saved. The `partition_cols` are the column names by which the dataset will be partitioned. Columns are partitioned in the order they are given. The partition splits are determined by the unique values in the partition columns. For example, partitioning on a column named `a` that takes the values 0 and 1 creates one subdirectory per value, `a=0` and `a=1`, each holding one or more parquet files.
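A partitioned write can be sketched as follows. The frame and the directory name `test` are illustrative assumptions, and the block only runs when pyarrow is importable. It lists the subdirectories that were created and reads the whole dataset back from the parent directory.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical frame; column "a" has two unique values, so two partitions.
df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})

have_pyarrow = importlib.util.find_spec("pyarrow") is not None

if have_pyarrow:
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "test")
        df.to_parquet(path=root, engine="pyarrow", partition_cols=["a"])

        # One subdirectory per unique value of the partition column.
        partitions = sorted(
            name for name in os.listdir(root)
            if os.path.isdir(os.path.join(root, name))
        )
        # Reading the parent directory loads every partition.
        result = pd.read_parquet(root, engine="pyarrow")
    print(partitions)
    print(len(result))
```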
ORC#
New in version 1.0.0.
Similar to the parquet format, the ORC Format is a binary columnar serialization for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the ORC format, `read_orc()` and `to_orc()`. This requires the pyarrow library.
Warning
It is highly recommended to install pyarrow using conda, due to some known issues with pyarrow.
`to_orc()` requires pyarrow>=7.0.0.
`read_orc()` and `to_orc()` are not supported on Windows yet; you can find valid environments in the install optional dependencies section.
For supported dtypes, please refer to supported ORC features in Arrow.
Currently, timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
Write to an ORC file with `DataFrame.to_orc()`, read from an ORC file with `pd.read_orc()`, and pass the `columns` argument to read only certain columns of an ORC file.
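The ORC operations above can be sketched in the same way as the parquet ones. The frame and the file name `example.orc` are assumptions. Because ORC support is not available in every environment, the block probes for `pyarrow.orc` and for `DataFrame.to_orc` before running.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data with a default index.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]})

# ORC support lives in an optional pyarrow submodule, so probe for it.
try:
    import pyarrow.orc  # noqa: F401

    have_orc = hasattr(df, "to_orc")
except ImportError:
    have_orc = False

if have_orc:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.orc")
        # Write the frame to an ORC file.
        df.to_orc(path, engine="pyarrow")
        # Read the whole file back.
        full = pd.read_orc(path)
        # Read only certain columns.
        subset = pd.read_orc(path, columns=["a", "b"])
    print(list(full.columns))
    print(list(subset.columns))
```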
SQL queries#
The `pandas.io.sql` module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python's standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter that respects the Python DB-API.
See also some cookbook examples for some advanced strategies.
The key functions are:
`read_sql_table(table_name, con[, schema, ...])`: Read SQL database table into a DataFrame.
`read_sql_query(sql, con[, index_col, ...])`: Read SQL query into a DataFrame.
`read_sql(sql, con[, index_col, ...])`: Read SQL query or database table into a DataFrame.
`DataFrame.to_sql(name, con[, schema, ...])`: Write records stored in a DataFrame to a SQL database.
Note
The function `read_sql()` is a convenience wrapper around `read_sql_table()` and `read_sql_query()` (and for backward compatibility) and will delegate to the specific function depending on the provided input (database table name or SQL query). Table names do not need to be quoted if they have special characters.
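The delegation described in the note can be illustrated with a query string, for which `read_sql()` and `read_sql_query()` should return the same frame. This sketch uses the standard-library `sqlite3` fallback rather than SQLAlchemy, so the table-name path through `read_sql_table()` is not exercised here. The table `points` and its contents are hypothetical.

```python
import sqlite3
from contextlib import closing

import pandas as pd

with closing(sqlite3.connect(":memory:")) as conn:
    # Hypothetical table used only for this illustration.
    conn.execute("CREATE TABLE points (x INTEGER, y INTEGER)")
    conn.executemany("INSERT INTO points VALUES (?, ?)", [(1, 2), (3, 4)])
    conn.commit()

    query = "SELECT x, y FROM points ORDER BY x"
    via_query = pd.read_sql_query(query, conn)
    # read_sql receives a query string, so it takes the query path.
    via_wrapper = pd.read_sql(query, conn)

print(via_wrapper)
```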
In the following example, we use the SQLite SQL database engine. You can use a temporary SQLite database where data are stored in "memory".
To connect with SQLAlchemy, you use the `create_engine()` function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to. For more information on `create_engine()` and the URI formatting, see the examples below and the SQLAlchemy documentation.
If you want to manage your own connections, you can pass one of those instead. Opening the connection with a Python context manager automatically closes it after the block has completed. See the SQLAlchemy docs for an explanation of how the database connection is handled.
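The idea of a connection that closes itself at the end of a block can be sketched without SQLAlchemy. This example swaps in a standard-library `sqlite3` connection wrapped in `contextlib.closing`, which is an assumption made for portability and not the SQLAlchemy pattern the text describes. The table name `data` and its contents are hypothetical.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() guarantees conn.close() runs when the block exits. A bare
# "with sqlite3.connect(...)" only manages transactions, not closing.
with closing(sqlite3.connect(":memory:")) as conn:
    pd.DataFrame({"x": [1, 2, 3]}).to_sql("data", conn, index=False)
    data = pd.read_sql_query("SELECT x FROM data", conn)

# The connection is closed now, so further use raises an error.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(data["x"].tolist(), still_open)
```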
Warning
When you open a connection to a database, you are also responsible for closing it. Side effects of leaving a connection open may include locking the database or other breaking behaviour.
Writing DataFrames#
Assuming the following data is in a `DataFrame` `data`, we can insert it into the database using `to_sql()`.
id  Date        Col_1  Col_2  Col_3
26  2012-10-18  X      25.7   True
42  2012-10-19  Y      -12.4  False
63  2012-10-20  Z      5.73   True
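Inserting the table above can be sketched as follows. The example assumes the standard-library `sqlite3` fallback with an in-memory database instead of a SQLAlchemy engine, and the table name `data` is illustrative.

```python
import datetime
import sqlite3
from contextlib import closing

import pandas as pd

# Rebuild the table shown above as a DataFrame.
columns = ["id", "Date", "Col_1", "Col_2", "Col_3"]
rows = [
    (26, datetime.datetime(2012, 10, 18), "X", 25.7, True),
    (42, datetime.datetime(2012, 10, 19), "Y", -12.4, False),
    (63, datetime.datetime(2012, 10, 20), "Z", 5.73, True),
]
data = pd.DataFrame(rows, columns=columns)

with closing(sqlite3.connect(":memory:")) as conn:
    # index=False keeps the frame's default index out of the table.
    data.to_sql("data", conn, index=False)
    row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
    round_trip = pd.read_sql_query("SELECT * FROM data", conn)

print(row_count)
print(round_trip["Col_1"].tolist())
```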
With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This can be avoided by setting the `chunksize` parameter when calling `to_sql`. For example, passing `chunksize=1000` writes `data` to the database in batches of 1000 rows at a time.
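A chunked write can be sketched like this. The 2500-row frame and the table name `data_chunked` are assumptions chosen so that the write needs more than one batch, and the standard-library `sqlite3` fallback stands in for a real database server.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical frame large enough to need several batches.
data = pd.DataFrame({"id": range(2500), "value": [i * 0.5 for i in range(2500)]})

with closing(sqlite3.connect(":memory:")) as conn:
    # Rows are inserted in batches of at most 1000 rows each.
    data.to_sql("data_chunked", conn, index=False, chunksize=1000)
    row_count = conn.execute("SELECT COUNT(*) FROM data_chunked").fetchone()[0]

print(row_count)
```

The final table is the same as with a single write; only the size of each insert batch changes.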
SQL data types#
`to_sql()` will try to map your data to an appropriate SQL data type based on the dtype of the data. When you have columns of dtype `object`, pandas will try to infer the data type.
You can always override the default type by specifying the desired SQL type of any column with the `dtype` argument. This argument takes a dictionary mapping column names to SQLAlchemy types (or to strings for the sqlite3 fallback mode). For example, you can specify the SQLAlchemy `String` type instead of the default `Text` type for string columns.
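As a minimal sketch of the `dtype` override, the example below uses the sqlite3 fallback mode, where the dictionary values are SQL type strings. The table name `data_dtype`, the sample columns, and the `VARCHAR(20)` type are illustrative assumptions; with an SQLAlchemy engine the value would instead be an SQLAlchemy type such as `sqlalchemy.types.String`.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; table and column names are illustrative only.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

conn = sqlite3.connect(":memory:")

# In sqlite3 fallback mode the dtype values are plain SQL type strings.
df.to_sql("data_dtype", conn, index=False, dtype={"name": "VARCHAR(20)"})

# Read back the declared column types to see whether the override was used.
declared = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(data_dtype)")}
print(declared)
```

Columns that are not listed in the dictionary keep whatever type pandas infers for them.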
Note
Due to the limited support for timedelta in the different database flavors, columns of type `timedelta64` will be written to the database as integer values in nanoseconds, and a warning will be raised.
Note
Columns of `category` dtype will be converted to the dense representation, as you would get with `np.asarray(categorical)` (e.g. for string categories this gives an array of strings). Because of this, reading the database table back in does not generate a categorical.
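The round trip described in this note can be sketched as follows. The table name `grades` and the sample values are assumptions made for illustration, and the sqlite3 fallback is used only to keep the example self-contained.

```python
import sqlite3

import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"grade": pd.Categorical(["low", "high", "low"])})

conn = sqlite3.connect(":memory:")
df.to_sql("grades", conn, index=False)

restored = pd.read_sql_query("SELECT grade FROM grades", conn)

# The values survive the round trip, but the categorical dtype does not.
dtype_after_read = restored["grade"].dtype
print(dtype_after_read)

# Re-create the categorical explicitly if it is needed downstream.
restored["grade"] = restored["grade"].astype("category")
```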
Datetime data types#
Using SQLAlchemy, `to_sql()` is capable of writing datetime data that is timezone naive or timezone aware. However, the resulting data stored in the database ultimately depends on the data types for datetime data supported by the database system being used.
The following table lists supported data types for datetime data for some common databases. Other database dialects may have different data types for datetime data.
Database | SQL datetime types | Timezone support
SQLite | `TEXT` | No
MySQL | `TIMESTAMP` or `DATETIME` | No
PostgreSQL | `TIMESTAMP` or `TIMESTAMP WITH TIME ZONE` | Yes
When writing timezone aware data to databases that do not support timezones, the data will be written as timezone naive timestamps that are in local time with respect to the timezone.
`read_sql_table()` is also capable of reading datetime data that is timezone aware or naive. When reading `TIMESTAMP WITH TIME ZONE` types, pandas will convert the data to UTC.
Insertion method#
The `method` parameter controls the SQL insertion clause used. Possible values are:
- `None`: uses the standard SQL `INSERT` clause (one per row).
- `'multi'`: passes multiple values in a single `INSERT` clause. It uses a special SQL syntax that is not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information, check the SQLAlchemy documentation.
- a callable with signature `(pd_table, conn, keys, data_iter)`: this can be used to implement a more performant insertion method based on specific backend dialect features.
A typical use is a callable that loads the data with the PostgreSQL `COPY` clause.
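A PostgreSQL variant needs a running server and a driver, so the following sketch illustrates the same callable signature against an in-memory SQLite database instead. The function name `insert_with_executemany` and the table are hypothetical. The sketch assumes that the `conn` object handed to the callable exposes the DB-API `executemany` method, as seems to be the case in sqlite3 fallback mode, and it swaps a batched `executemany` in for `COPY`.

```python
import sqlite3

import pandas as pd


def insert_with_executemany(pd_table, conn, keys, data_iter):
    """Insert all rows of one chunk with a single executemany call."""
    columns = ", ".join(f'"{key}"' for key in keys)
    placeholders = ", ".join("?" for _ in keys)
    rows = list(data_iter)
    # Assumption: conn offers DB-API executemany (true for sqlite3 objects).
    conn.executemany(
        f'INSERT INTO "{pd_table.name}" ({columns}) VALUES ({placeholders})',
        rows,
    )
    return len(rows)


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
conn = sqlite3.connect(":memory:")
df.to_sql("custom_insert", conn, index=False, method=insert_with_executemany)

row_count = conn.execute("SELECT COUNT(*) FROM custom_insert").fetchone()[0]
print(row_count)
```

With an SQLAlchemy engine, `conn` would be an SQLAlchemy connection rather than a raw DB-API object, so the body of the callable would need to be adapted to that backend.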
Reading tables#
`read_sql_table()` will read a database table given the table name and, optionally, a subset of columns to read.
Note
In order to use `read_sql_table()`, you must have the SQLAlchemy optional dependency installed.
Note
Note that pandas infers column dtypes from query outputs, and not by looking up data types in the physical database schema. For example, assume `userid` is an integer column in a table. Then, intuitively, `select userid ...` will return an integer-valued series, while `select cast(userid as text) ...` will return an object-valued (str) series. Accordingly, if the query output is empty, then all resulting columns will be returned as object-valued (since they are the most general). If you foresee that your query will sometimes generate an empty result, you may want to explicitly typecast afterwards to ensure dtype integrity.
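The explicit typecast suggested in this note might look like the sketch below. The `users` table and its contents are assumptions for illustration, and the sqlite3 fallback is used only so that the example runs without extra dependencies.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"userid": [1, 2, 3]}).to_sql("users", conn, index=False)

# A query that returns rows: the dtype is inferred from the values.
non_empty = pd.read_sql_query("SELECT userid FROM users", conn)

# A query that returns no rows: there are no values to infer a dtype from.
empty = pd.read_sql_query("SELECT userid FROM users WHERE userid < 0", conn)

# Cast explicitly so downstream code sees the same dtype in both cases.
empty = empty.astype({"userid": "int64"})
print(non_empty.dtypes, empty.dtypes)
```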
You can also specify the name of a column to use as the `DataFrame` index, and specify a subset of columns to be read.
You can also explicitly force columns to be parsed as dates.
If needed, you can explicitly specify a format string, or a dict of arguments to pass to `pandas.to_datetime()`.
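The options above can be combined in a single call, as in this sketch. It assumes SQLAlchemy is installed; the in-memory SQLite engine, the `data` table, and its column names are illustrative choices rather than part of the original text.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")

# Hypothetical sample table.
df = pd.DataFrame(
    {
        "id": [26, 42, 63],
        "Date": pd.to_datetime(["2010-10-18", "2010-10-19", "2010-10-20"]),
        "Col_1": ["X", "Y", "Z"],
        "Col_2": [27.5, -12.5, 5.73],
    }
)
df.to_sql("data", engine, index=False)

# Use "id" as the index, read only two columns, and parse "Date" as dates.
# parse_dates also accepts a dict, e.g. {"Date": "%Y-%m-%d"} for a format
# string or {"Date": {"format": "%Y-%m-%d"}} for to_datetime keyword arguments.
subset = pd.read_sql_table(
    "data",
    engine,
    index_col="id",
    columns=["Date", "Col_2"],
    parse_dates=["Date"],
)
print(subset)
```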
You can check whether a table exists using `has_table()`.
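A short sketch of that check is shown below. It assumes the helper is reachable as `pandas.io.sql.has_table`, which is where it lives in the pandas versions this sketch was written against; the table names are hypothetical.

```python
import sqlite3

import pandas as pd
from pandas.io import sql

conn = sqlite3.connect(":memory:")
pd.DataFrame({"a": [1]}).to_sql("data", conn, index=False)

# has_table takes the table name first and the connection second.
exists = sql.has_table("data", conn)
missing = sql.has_table("no_such_table", conn)
print(exists, missing)
```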
Schema support#
Reading from and writing to different schemas is supported through the `schema` keyword in the `read_sql_table()` and `to_sql()` functions. Note, however, that this depends on the database flavor (sqlite does not have schemas).
Querying#
You can query using raw SQL in the `read_sql_query()` function. In this case you must use the SQL variant appropriate for your database. When using SQLAlchemy, you can also pass SQLAlchemy Expression language constructs, which are database-agnostic.
Of course, you can also specify a more “complex” query.
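Both kinds of raw query can be sketched against an in-memory SQLite database. The `data` table and its values are assumptions for illustration, and the SQL is written in the SQLite dialect.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "id": [26, 42, 63, 77],
        "Col_1": ["X", "Y", "X", "Y"],
        "Col_2": [27.5, -12.5, 5.0, 2.5],
    }
).to_sql("data", conn, index=False)

# A simple query that selects a single row.
one_row = pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42", conn)

# A more involved query: aggregate Col_2 per value of Col_1.
summary = pd.read_sql_query(
    "SELECT Col_1, SUM(Col_2) AS total FROM data GROUP BY Col_1 ORDER BY Col_1",
    conn,
)
print(one_row)
print(summary)
```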
The `read_sql_query()` function supports a `chunksize` argument. Specifying this will return an iterator through chunks of the query result.
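A minimal sketch of chunked reading follows; the `data_chunks` table and the chunk size of 5 are arbitrary choices made for illustration.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"a": range(20), "b": range(20, 40)}).to_sql(
    "data_chunks", conn, index=False
)

# With chunksize set, read_sql_query yields DataFrames instead of returning one.
chunk_lengths = []
for chunk in pd.read_sql_query("SELECT * FROM data_chunks", conn, chunksize=5):
    chunk_lengths.append(len(chunk))

print(chunk_lengths)
```

Processing each chunk inside the loop keeps only one chunk in memory at a time, which is the main reason to use this argument on large results.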
You can also run a plain query without creating a `DataFrame` with `execute()`. This is useful for queries that don't return values, such as INSERT. This is functionally equivalent to calling `execute` on the SQLAlchemy engine or db connection object. Again, you must use the SQL syntax variant appropriate for your database.
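The sketch below relies on the equivalence stated above and calls `execute` on the database connection object directly, here a standard-library `sqlite3` connection. The `data` table and the inserted row are hypothetical.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": [26, 42], "Col_1": ["X", "Y"]}).to_sql("data", conn, index=False)

# Run a statement that returns no rows; no DataFrame is created.
conn.execute("INSERT INTO data (id, Col_1) VALUES (?, ?)", (63, "Z"))
conn.commit()

ids = pd.read_sql_query("SELECT id FROM data ORDER BY id", conn)["id"].tolist()
print(ids)
```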
Engine connection examples#
To connect with SQLAlchemy, you use the `create_engine()` function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to.
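As a sketch of this pattern, the example below creates one engine for an in-memory SQLite database and reuses it for a write and a read. It assumes SQLAlchemy is installed; the table name is hypothetical, and the commented URIs only show the general URL shape for other backends.

```python
import pandas as pd
from sqlalchemy import create_engine

# URI shapes for other backends (not executed here; each needs its driver):
#   postgresql://scott:tiger@localhost:5432/mydatabase
#   mysql+mysqldb://scott:tiger@localhost/foo
#   sqlite:///foo.db

# One engine per database; it is reused for every read and write below.
engine = create_engine("sqlite:///:memory:")

pd.DataFrame({"a": [1, 2, 3]}).to_sql("numbers", engine, index=False)
total = pd.read_sql_query("SELECT SUM(a) AS total FROM numbers", engine)["total"].iloc[0]
print(total)
```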
For more information, see the examples in the SQLAlchemy documentation.
Advanced SQLAlchemy queries#
You can use SQLAlchemy constructs to describe your query.
Use `sqlalchemy.text()` to specify query parameters in a backend-neutral way.
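A small sketch of a parameterized text query is shown below. It assumes SQLAlchemy is installed; the `data` table, the parameter name `col1`, and the sample values are illustrative assumptions.

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")
pd.DataFrame({"id": [26, 42, 63], "Col_1": ["X", "Y", "Z"]}).to_sql(
    "data", engine, index=False
)

# Named parameters use the ":name" syntax regardless of the backend's own style.
query = sa.text("SELECT * FROM data WHERE Col_1 = :col1")
matches = pd.read_sql(query, engine, params={"col1": "X"})
print(matches)
```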
If you have an SQLAlchemy description of your database, you can express where conditions using SQLAlchemy expressions.
You can combine SQLAlchemy expressions with parameters passed to `read_sql()` using `sqlalchemy.bindparam()`.
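Both ideas can be sketched together. The example assumes SQLAlchemy 1.4 or later; the `data` table description, the parameter name `wanted_id`, and the sample values are hypothetical.

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")
pd.DataFrame(
    {"id": [26, 42, 63], "Col_1": ["X", "Y", "Z"], "Col_2": [27.5, -12.5, 5.73]}
).to_sql("data", engine, index=False)

# SQLAlchemy description of the table that was just written.
metadata = sa.MetaData()
data_table = sa.Table(
    "data",
    metadata,
    sa.Column("id", sa.Integer),
    sa.Column("Col_1", sa.String),
    sa.Column("Col_2", sa.Float),
)

# WHERE condition written as an SQLAlchemy expression.
positive = pd.read_sql(
    sa.select(data_table).where(data_table.c.Col_2 > 0).order_by(data_table.c.id),
    engine,
)

# The same kind of expression, with the value supplied through params.
expr = sa.select(data_table).where(data_table.c.id == sa.bindparam("wanted_id"))
by_id = pd.read_sql(expr, engine, params={"wanted_id": 42})
print(positive)
print(by_id)
```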
Sqlite fallback#
The use of sqlite is supported without using SQLAlchemy. This mode requires a Python database adapter that respects the Python DB-API.
You can create a connection with such an adapter and then issue queries against it with the pandas SQL functions.
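A minimal sketch of the fallback mode, using the standard-library `sqlite3` adapter, might look like this. The `data` table and its values are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# A plain DB-API connection takes the place of an SQLAlchemy engine.
con = sqlite3.connect(":memory:")

data = pd.DataFrame({"id": [26, 42, 63], "Col_1": ["X", "Y", "Z"]})
data.to_sql("data", con, index=False)

all_rows = pd.read_sql_query("SELECT * FROM data", con)

# sqlite3 uses the "qmark" paramstyle, so placeholders are written as "?".
one_row = pd.read_sql_query("SELECT * FROM data WHERE id = ?", con, params=(42,))
con.close()
print(all_rows)
print(one_row)
```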
Google BigQuery#
Warning
Starting in 0. 20. 0, pandas đã tách hỗ trợ Google BigQuery thành gói riêng biệt
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object3544. You can
In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print[data] a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv[StringIO[data], dtype=object] In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv[StringIO[data], dtype={"b": object, "c": np.float64, "d": "Int64"}] In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object3545 to get it
The pandas-gbq package provides functionality to read/write from Google BigQuery.
pandas integrates with this external package. If pandas-gbq is installed, you can use the pandas methods pd.read_gbq and DataFrame.to_gbq, which will call the respective functions from pandas-gbq.
Full documentation is available in the pandas-gbq project documentation.
Stata format#
Writing to Stata format#
The method to_stata() will write a DataFrame into a .dta file. The format version of this file is always 115 (Stata 12).
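As a minimal sketch of the call described above, a small DataFrame can be written with to_stata() and read back to check the round trip. The temporary path and the column names A and B are illustrative assumptions, not part of the original text.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative data: 10 rows of random floats in two columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=list("AB"))

# Write to a temporary .dta file; write_index=False omits the index column.
path = os.path.join(tempfile.mkdtemp(), "stata.dta")
df.to_stata(path, write_index=False)

# Read the file back to confirm the contents survived the round trip.
roundtrip = pd.read_stata(path)
print(roundtrip.shape)
```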
Stata data files have limited data type support; only strings with 244 or fewer characters, int8, int16, int32, float32 and float64 can be stored in .dta files. Additionally, Stata reserves certain values to represent missing data. Exporting a non-missing value that is outside of the permitted range in Stata for a particular data type will retype the variable to the next larger size. For example, int8 values are restricted to lie between -127 and 100 in Stata, and so variables with values above 100 will trigger a conversion to int16. nan values in floating-point data types are stored as the basic missing data type (. in Stata).
Note
It is not possible to export missing data values for integer data types.
The Stata writer gracefully handles other data types including int64, bool, uint8, uint16 and uint32 by casting to the smallest supported type that can represent the data. For example, data with a type of uint8 will be cast to int8 if all values are less than 100 (the upper bound for non-missing int8 data in Stata), or, if values are outside of this range, the variable is cast to int16.
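The casting rules above can be observed by writing a few integer columns and inspecting the dtypes that come back. This is a sketch under the assumption that the file is read with the default preserve_dtypes=True; the column names and values are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Columns chosen to sit inside and outside Stata's non-missing int8 range.
df = pd.DataFrame(
    {
        "small_int8": np.array([1, 99], dtype=np.int8),
        "large_int8": np.array([1, 127], dtype=np.int8),
        "small_uint8": np.array([0, 100], dtype=np.uint8),
        "large_uint8": np.array([0, 255], dtype=np.uint8),
    }
)

path = os.path.join(tempfile.mkdtemp(), "types.dta")
df.to_stata(path, write_index=False)

# Values above 100 do not fit in a Stata byte, so those columns are widened.
back = pd.read_stata(path)
print(back.dtypes)
```

Columns whose values stay within the permitted range should come back as int8, while the two columns holding 127 and 255 should come back as int16.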
Warning
Conversion from int64 to float64 may result in a loss of precision if int64 values are larger than 2**53.
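The 2**53 threshold comes from the 53-bit significand of float64: above it, not every integer has its own floating-point value. A short NumPy illustration, independent of Stata and using arbitrary example values:

```python
import numpy as np

# Two distinct int64 values at the edge of float64's exact-integer range.
values = np.array([2**53, 2**53 + 1], dtype=np.int64)
as_float = values.astype(np.float64)

# After conversion the two values are no longer distinguishable.
print(values[0] == values[1])
print(as_float[0] == as_float[1])
```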
Warning
StataWriter and to_stata() only support fixed width strings containing up to 244 characters, a limitation imposed by the version 115 dta file format. Attempting to write Stata dta files with strings longer than 244 characters raises a ValueError.
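A hedged sketch of the string-length limit, assuming the default format version used by to_stata(). The column name text and the file names are illustrative.

```python
import os
import tempfile

import pandas as pd

tmpdir = tempfile.mkdtemp()

# A 244-character string is within the fixed-width limit.
ok = pd.DataFrame({"text": pd.Series(["x" * 244], dtype=object)})
ok_path = os.path.join(tmpdir, "ok.dta")
ok.to_stata(ok_path, write_index=False)

# A 245-character string exceeds the limit and is rejected.
too_long = pd.DataFrame({"text": pd.Series(["x" * 245], dtype=object)})
try:
    too_long.to_stata(os.path.join(tmpdir, "too_long.dta"), write_index=False)
    raised = False
except ValueError as exc:
    raised = True
    print(type(exc).__name__)
```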
Reading from Stata format#
The top-level function read_stata will read a dta file and return either a DataFrame or a StataReader that can be used to read the file incrementally.
Specifying a chunksize yields a StataReader instance that can be used to read chunksize lines from the file at a time. The StataReader object can be used as an iterator.
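A minimal sketch of chunked reading, assuming a 10-row file written just for the example; the path, column names and chunk size are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small 10-row file to iterate over.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=list("AB"))
path = os.path.join(tempfile.mkdtemp(), "stata.dta")
df.to_stata(path, write_index=False)

# With chunksize set, read_stata returns a StataReader that yields DataFrames.
shapes = []
with pd.read_stata(path, chunksize=3) as reader:
    for chunk in reader:
        shapes.append(chunk.shape)

print(shapes)
```

With 10 rows and a chunk size of 3, the last chunk holds the single remaining row.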
For more fine-grained control, use iterator=True and specify chunksize with each call to read().
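The same idea with explicit read() calls, again as a sketch on an illustrative 10-row file:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=list("AB"))
path = os.path.join(tempfile.mkdtemp(), "stata.dta")
df.to_stata(path, write_index=False)

# iterator=True returns a StataReader; each read() call picks the row count.
with pd.read_stata(path, iterator=True) as reader:
    chunk1 = reader.read(5)
    chunk2 = reader.read(5)

print(len(chunk1), len(chunk2))
```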
Currently the index is retrieved as a column.
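To illustrate: when a DataFrame is written with the default write_index=True, its unnamed index should come back as an ordinary column called index, which can be restored with set_index. The data and path below are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})
path = os.path.join(tempfile.mkdtemp(), "with_index.dta")

# Default behaviour: the index is written as an extra variable.
df.to_stata(path)

back = pd.read_stata(path)
print(list(back.columns))

# Move the column back into the index if the original layout is wanted.
restored = back.set_index("index")
```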
The parameter convert_categoricals indicates whether value labels should be read and used to create a Categorical variable from them. Value labels can also be retrieved by the function value_labels, which requires read() to be called before use.
The parameter convert_missing indicates whether missing value representations in Stata should be preserved. If False (the default), missing values are represented as np.nan. If True, missing values are represented using StataMissingValue objects, and columns containing missing values will have object data type.
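A sketch of the two behaviours, assuming a float column with one NaN written by pandas itself; the column name x and the path are illustrative. StataMissingValue is imported from pandas.io.stata.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from pandas.io.stata import StataMissingValue

df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
path = os.path.join(tempfile.mkdtemp(), "missing.dta")
df.to_stata(path, write_index=False)

# Default: the Stata missing value is turned back into NaN.
default = pd.read_stata(path)

# convert_missing=True: the missing value is kept as a StataMissingValue.
preserved = pd.read_stata(path, convert_missing=True)

print(default["x"].dtype, preserved["x"].dtype)
```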
Note
read_stata() and StataReader support .dta formats 113-115 (Stata 10-12), 117 (Stata 13), and 118 (Stata 14).
Note
Setting preserve_dtypes=False will upcast to the standard pandas data types: int64 for all integer types and float64 for floating point data. By default, the Stata data types are preserved when importing.
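The effect of preserve_dtypes can be sketched by writing narrow types and reading them back both ways. The column names and values are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "i": np.array([1, 2, 3], dtype=np.int8),
        "f": np.array([0.5, 1.5, 2.5], dtype=np.float32),
    }
)
path = os.path.join(tempfile.mkdtemp(), "dtypes.dta")
df.to_stata(path, write_index=False)

# Default: the narrow Stata types are kept.
kept = pd.read_stata(path)

# preserve_dtypes=False: integers become int64, floats become float64.
upcast = pd.read_stata(path, preserve_dtypes=False)

print(kept.dtypes.tolist(), upcast.dtypes.tolist())
```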
Categorical data#
Categorical data can be exported to Stata data files as value labeled data. The exported data consists of the underlying category codes as integer data values and the categories as value labels. Stata does not have an explicit equivalent to a Categorical, and information about whether the variable is ordered is lost when exporting.
Warning
Stata only supports string value labels, and so str is called on the categories when exporting data. Exporting Categorical variables with non-string categories produces a warning, and can result in a loss of information if the str representations of the categories are not unique.
Tương tự, dữ liệu được gắn nhãn có thể được nhập từ các tệp dữ liệu Stata dưới dạng các biến
Categorical variables using the keyword argument convert_categoricals (True by default). The keyword argument order_categoricals (True by default) determines whether imported Categorical variables are ordered.
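The effect of the two keyword arguments can be sketched with an in-memory round trip. This is a minimal illustration, not taken from the original page: the column name grade and its values are made up, and the Stata file is written to a BytesIO buffer so that no file on disk is needed.

```python
import io

import pandas as pd

# Hypothetical example data: a single ordered categorical column.
df = pd.DataFrame(
    {
        "grade": pd.Categorical(
            ["low", "high", "mid", "low"],
            categories=["low", "mid", "high"],
            ordered=True,
        )
    }
)

# Write a Stata file into memory; the categories are stored as value labels.
buffer = io.BytesIO()
df.to_stata(buffer, write_index=False)
stata_bytes = buffer.getvalue()

# Default import: labeled data comes back as an ordered Categorical.
labeled = pd.read_stata(io.BytesIO(stata_bytes))

# order_categoricals=False still builds a Categorical, but without ordering.
unordered = pd.read_stata(io.BytesIO(stata_bytes), order_categoricals=False)

# convert_categoricals=False returns the stored integer values instead.
raw = pd.read_stata(io.BytesIO(stata_bytes), convert_categoricals=False)

print(labeled["grade"].dtype)
print(unordered["grade"].cat.ordered)
print(raw["grade"].tolist())
```

Comparing labeled["grade"].cat.codes with raw["grade"] shows that the category codes and the stored integer values line up.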
Note
When importing categorical data, the values of the variables in the Stata data file are not preserved, because Categorical variables always use integer data types between -1 and n-1, where n is the number of categories. If the original values in the Stata data file are required, they can be imported by setting convert_categoricals=False, which imports the original data (but not the variable labels). The original values can be matched to the imported categorical data because there is a simple mapping between the original Stata data values and the category codes of the imported Categorical variables: missing values are assigned code -1, the smallest original value is assigned 0, the second smallest is assigned 1, and so on until the largest original value is assigned code n-1.
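The mapping between original values and category codes can be illustrated without a Stata file. The following sketch uses a plain pandas Categorical to mimic the rule described in the note; the values in original are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical original Stata values, including one missing value.
original = pd.Series([20.0, 10.0, np.nan, 30.0, 10.0])

# Building a Categorical sorts the distinct values into categories and
# replaces each value by its position among them; missing values get -1.
codes = pd.Categorical(original).codes

# Invert the mapping: code i is the i-th smallest distinct value,
# and -1 maps back to a missing value.
categories = np.sort(original.dropna().unique())
recovered = pd.Series(codes).map(
    lambda code: np.nan if code == -1 else categories[code]
)

print(codes.tolist())
print(recovered.tolist())
```

Here n is 3, so the codes range from -1 (missing) to 2 (the largest original value, 30.0).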
Note
Stata supports partially labeled series. These series have value labels for some but not all data values. Importing a partially labeled series will produce a Categorical with string categories for the values that are labeled and numeric categories for the values with no label.
SAS formats
The top-level function read_sas() can read (but not write) SAS XPORT (.xpt) and (since v0.18.0) SAS7BDAT (.sas7bdat) format files.
SAS files only contain two value types: ASCII text and floating point values (usually 8 bytes but sometimes truncated). For xport files, there is no automatic type conversion to integers, dates, or categoricals. For SAS7BDAT files, the format codes may allow date variables to be automatically converted to dates. By default the whole file is read and returned as a DataFrame.
Specify a chunksize or use iterator=True to obtain reader objects (XportReader or SAS7BDATReader) for incrementally reading the file. The reader objects also have attributes that contain additional information about the file and its variables.
A SAS7BDAT file is read by passing its path to read_sas().
To read an XPORT file 100,000 lines at a time, obtain an iterator by passing chunksize=100000 to read_sas().
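Both ways of reading a SAS file can be sketched as follows. This is an illustrative sketch rather than the original example: the helper names read_whole_sas_file and count_rows_in_chunks are hypothetical, and the with statement assumes a recent pandas version in which the reader objects can be used as context managers.

```python
import pandas as pd


def read_whole_sas_file(path):
    """Read an entire SAS file (XPORT or SAS7BDAT) into one DataFrame."""
    # With the default format=None, the format is inferred from the
    # file extension (.xpt or .sas7bdat).
    return pd.read_sas(path)


def count_rows_in_chunks(path, chunksize=100_000):
    """Read a SAS file incrementally and return the total number of rows."""
    total = 0
    # Passing chunksize makes read_sas return a reader object instead of a
    # DataFrame; iterating over it yields DataFrames of up to chunksize rows.
    with pd.read_sas(path, chunksize=chunksize) as reader:
        for chunk in reader:
            total += len(chunk)
    return total
```

A call such as count_rows_in_chunks("sas_xport.xpt") (the file name is a placeholder) keeps at most one chunk in memory at a time, which is the point of reading incrementally.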
The specification for the xport file format is available from the SAS web site.
No official documentation is available for the SAS7BDAT format.
SPSS formats
New in version 0.25.0.
The top-level function read_spss() can read (but not write) SPSS SAV (.sav) and ZSAV (.zsav) format files.
SPSS files contain column names. By default the whole file is read, categorical columns are converted into pd.Categorical, and a DataFrame with all columns is returned.
Specify the usecols parameter to obtain a subset of columns. Specify convert_categoricals=False to avoid converting categorical columns into pd.Categorical.
An SPSS file is read by passing its path to read_spss().
To extract a subset of the columns from an SPSS file and avoid converting categorical columns into pd.Categorical, pass the column names as usecols together with convert_categoricals=False.
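The two calls might look like the sketch below. The helper names and any file or column names passed to them are hypothetical. Note that read_spss() relies on the optional pyreadstat package, so these functions raise an error if it is not installed.

```python
import pandas as pd


def read_spss_all(path):
    """Read every column; categorical columns become pd.Categorical."""
    return pd.read_spss(path)


def read_spss_subset(path, columns):
    """Read only the given columns and keep the raw stored values."""
    # usecols must be list-like; convert_categoricals=False skips the
    # conversion of labeled columns into pd.Categorical.
    return pd.read_spss(path, usecols=list(columns), convert_categoricals=False)
```

For example, read_spss_subset("spss_data.sav", ["foo", "bar"]) would return a DataFrame with just those two columns, assuming such a file and columns exist.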
More information about the SAV and ZSAV file formats is available here.
Other file formats
pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model. For reading and writing other file formats into and from pandas, we recommend these packages from the broader community.
netCDF
xarray provides data structures inspired by the pandas DataFrame for working with multi-dimensional datasets, with a focus on the netCDF file format and easy conversion to and from pandas.
Performance considerations
This is an informal comparison of various IO methods, using pandas 0.24.2. Timings are machine dependent, and small differences should be ignored.
The performance of several IO methods is compared below using a set of test functions.
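The original benchmark code is not reproduced here. As a rough idea of what such test functions might look like, the following sketch times a write and a read function for two formats that need no optional dependencies (pickle and CSV). The function names, data size, and file names are all assumptions made for illustration.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data; the size and columns are assumptions.
sz = 100_000
df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz})

tmpdir = tempfile.mkdtemp()
pkl_path = os.path.join(tmpdir, "test.pkl")
csv_path = os.path.join(tmpdir, "test.csv")


def bench_pickle_write(frame):
    frame.to_pickle(pkl_path)


def bench_pickle_read():
    return pd.read_pickle(pkl_path)


def bench_csv_write(frame):
    frame.to_csv(csv_path, mode="w")


def bench_csv_read():
    return pd.read_csv(csv_path, index_col=0)


repeats = 3
# Writers are timed before readers so that the files exist when read.
timings = {
    "pickle write": timeit.timeit(lambda: bench_pickle_write(df), number=repeats),
    "csv write": timeit.timeit(lambda: bench_csv_write(df), number=repeats),
    "pickle read": timeit.timeit(bench_pickle_read, number=repeats),
    "csv read": timeit.timeit(bench_csv_read, number=repeats),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds / repeats:.4f} s per call")
```

As the text notes, absolute numbers from a run like this depend on the machine, so only large differences between methods are meaningful.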
When writing, the top three functions in terms of speed are test_feather_write, test_hdf_fixed_write and test_hdf_fixed_write_compress.
When reading, the top three functions in terms of speed are test_feather_read, test_pickle_read and test_hdf_fixed_read.