8 Things to Know to Master Value Sorting in Pandas

Sorting values with pandas effectively

Yong Cui

Aug 5·5 min read

Photo by Markus Spiske on Unsplash

When we deal with data, sorting is an important preprocessing step that lets us visually examine the quality of our data. With pandas, although we may sometimes use the related sort_index method, we sort data with the sort_values method most of the time. In this article, I'd like to share 8 things that are essential for completing this preprocessing step, with a focus on the sort_values method.

Without further ado, let's get started.

1. Sort by a Single Column

In this article, we'll be using the flights dataset, which records the monthly passenger numbers from 1949 to 1960. For the purpose of this tutorial, we'll select a random subset, as shown below.

Dataset for Sorting
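The article's exact random subset is not reproduced in the text, so here is a minimal stand-in you can use to follow along. It is an assumption on my part: it contains only the rows that appear in the printed outputs throughout this article, keyed by their original index labels, and the variable name rows is just for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the article's random subset.
# Only rows visible in the article's printed outputs are included;
# the dictionary keys are the original index labels.
rows = {
    0: (1949, "July", 148),
    1: (1957, "June", 422),
    4: (1958, "October", 359),
    5: (1952, "May", 183),
    6: (1958, "April", 348),
    7: (1957, "March", 356),
    8: (1951, "April", 163),
    9: (1952, "August", 242),
    13: (1951, "February", 150),
    17: (1952, "January", 171),
    18: (1960, "June", 535),
    19: (1951, "December", 166),
}

df = pd.DataFrame.from_dict(
    rows, orient="index", columns=["year", "month", "passengers"]
)
print(df.shape)  # (12, 3)
```

Because some rows of the original subset are missing here, outputs from this stand-in may differ slightly from the ones printed in the article.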

When we want to sort the data by a single column, we pass the column name directly as the first argument of the function call. As a side note, you'll see me use head a lot, just to show the top rows without wasting space.

>>> df.sort_values("year").head()
    year     month  passengers
0   1949      July         148
13  1951  February         150
8   1951     April         163
19  1951  December         166
5   1952       May         183

2. Sort Values Inplace

In the previous sorting, one thing you may have noticed is that the sort_values method creates a new DataFrame object, as shown below.

>>> df.sort_values("year") is df
False

To avoid creating a new DataFrame, you can request the sorting to be done in place by setting the inplace parameter to True. When you do that, note that calling sort_values will return None.

>>> df.sort_values("year", inplace=True)
>>> df.head()
    year     month  passengers
0   1949      July         148
13  1951  February         150
8   1951     April         163
19  1951  December         166
5   1952       May         183
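A small sketch of the None return value mentioned above, using a tiny made-up DataFrame (the values are only for illustration). The practical consequence is that you should not assign the result of an in-place sort back to a variable.

```python
import pandas as pd

# Tiny illustrative DataFrame (values chosen for the example only)
df = pd.DataFrame({"year": [1952, 1949, 1951],
                   "passengers": [183, 148, 150]})

# With inplace=True the method mutates df and returns None
result = df.sort_values("year", inplace=True)

print(result)                # None
print(df["year"].tolist())   # [1949, 1951, 1952]
```

Writing df = df.sort_values("year", inplace=True) would therefore replace df with None, which is a common slip.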

3. Reset Index After Sorting

In the previous sorting, you may have noticed that each row keeps its original index label after sorting, which sometimes puzzles me when I want the sorted DataFrame to have an ordered index. In this case, you can either reset the index after sorting or simply take advantage of the ignore_index parameter, as shown below.

>>> df.sort_values("year", ignore_index=True).head()
   year     month  passengers
0  1949      July         148
1  1951  February         150
2  1951     April         163
3  1951  December         166
4  1952       May         183
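For comparison, here is a sketch of the other option mentioned above, resetting the index after sorting. The small DataFrame is made up for the example; drop=True tells reset_index to discard the old labels instead of keeping them as a new column.

```python
import pandas as pd

# Illustrative data with non-sequential index labels
df = pd.DataFrame({"year": [1952, 1949, 1951]}, index=[5, 0, 13])

# Option 1: sort, then reset the index and drop the old labels
via_reset = df.sort_values("year").reset_index(drop=True)

# Option 2: let sort_values renumber the rows directly
via_ignore = df.sort_values("year", ignore_index=True)

print(via_reset.equals(via_ignore))  # True
print(via_ignore.index.tolist())     # [0, 1, 2]
```

The ignore_index parameter requires pandas 1.0.0 or later; on older versions, the reset_index route is the one to use.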

4. Sort by Multiple Columns

We don't always sort by just one column. In many cases, we need to sort the data frame by multiple columns. It's also simple with sort_values, because the by parameter accepts not only a single column but also a list of columns, without any special syntax.

>>> df.sort_values(["year", "passengers"]).head()
    year     month  passengers
0   1949      July         148
13  1951  February         150
8   1951     April         163
19  1951  December         166
17  1952   January         171

5. Sort in Descending Order

As we've seen so far, every sort uses ascending order, which is the default behavior. However, we often want the data sorted in descending order. To do that, we set the ascending parameter to False.

>>> df.sort_values("year", ascending=False).head()
    year    month  passengers
18  1960     June         535
6   1958    April         348
4   1958  October         359
1   1957     June         422
7   1957    March         356

What should we do if we sort by multiple columns and have different ascending requirements for these columns? In this case, we can pass a list of boolean values with each corresponding to one column.

>>> df.sort_values(["year", "passengers"], ascending=[False, True]).head()
    year    month  passengers
18  1960     June         535
6   1958    April         348
4   1958  October         359
7   1957    March         356
1   1957     June         422

6. Sort by Custom Functions

What if we want to sort by year and month with the current dataset? Let's try it without too much thinking.

>>> df.sort_values(["year", "month"]).head()
    year     month  passengers
0   1949      July         148
8   1951     April         163
19  1951  December         166
13  1951  February         150
9   1952    August         242

Apparently, the sorted data isn't what we expect: the months are in alphabetical order rather than calendar order. To fix this, we can take advantage of the key parameter of the sort_values method, to which we can pass a custom function for sorting, just like Python's built-in sorted function. A possible solution is shown below.

>>> def _month_sorting(x):
...     # Leave the year column untouched; only remap the month column.
...     if x.name == "year":
...         return x
...     months = ["January", "February", "March", "April",
...               "May", "June", "July", "August",
...               "September", "October", "November", "December"]
...     # Map each month name to its calendar position (0-11).
...     return x.map(dict(zip(months, range(len(months)))))
...
>>> df.sort_values(["year", "month"], key=_month_sorting).head()
    year     month  passengers
0   1949      July         148
13  1951  February         150
8   1951     April         163
19  1951  December         166
17  1952   January         171
  • The key parameter takes a callable, and we use a custom function here. Note that this parameter is only available with pandas 1.1.0+.
  • Unlike the key parameter used in sorted(), the key function in the sort_values method is applied to each of the sorting columns as a whole. Because we only want to customize the sorting for the month column, when the column is year, we return the original values of the year column.
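To make the second point concrete, here is a small sketch showing that the key function receives a whole column (a Series) at a time rather than individual values. The helper name record_and_pass_through and the tiny DataFrame are made up for this illustration.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    "year": [1951, 1951, 1949],
    "month": ["December", "April", "July"],
})

seen = []  # names of the columns handed to the key function


def record_and_pass_through(col):
    # col is an entire Series (one sorting column), not a single value
    seen.append(col.name)
    return col  # return the column unchanged, so sorting is unaffected


result = df.sort_values(["year", "month"], key=record_and_pass_through)
print(seen)  # the names of the sorting columns, "year" and "month"
```

This is why the month-sorting function above checks x.name before deciding whether to remap the values.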

7. Sort Lexicographically Unordered Columns After Casting to Categorical

The above sorting using the key parameter can be confusing to some people. Is there a cleaner way? Pandas is arguably the most versatile library for data processing, and you can expect that there is something neat to solve this relatively common problem: converting these lexicographically unordered columns to categorical data.

Sort by Casted Categories
  • We define a CategoricalDtype by specifying the order of the months.
  • We cast the month column to the newly defined category.
  • When we sort by month, pandas uses the order of the months given in the category definition.
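The original code for this step was shown as an image and is not available here, so the following is a reconstruction of the three steps above under my own assumptions: the variable name month_dtype and the four sample rows are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A few rows in the style of the article's dataset (illustrative subset)
df = pd.DataFrame(
    {
        "year": [1951, 1951, 1951, 1949],
        "month": ["December", "April", "February", "July"],
        "passengers": [166, 163, 150, 148],
    },
    index=[19, 8, 13, 0],
)

# Step 1: define an ordered categorical dtype in calendar order
months = ["January", "February", "March", "April",
          "May", "June", "July", "August",
          "September", "October", "November", "December"]
month_dtype = CategoricalDtype(categories=months, ordered=True)

# Step 2: cast the month column to that dtype
df["month"] = df["month"].astype(month_dtype)

# Step 3: sorting now follows the category order, not the alphabet
result = df.sort_values(["year", "month"])
print(result["month"].tolist())
# ['July', 'February', 'April', 'December']
```

One thing to keep in mind: any value not listed in the categories becomes NaN when cast, so a misspelled month name would silently turn into a missing value.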

8. Don't Forget About NaNs

It's important to remember that your datasets can always contain NaNs. Unless you've examined your data quality and know that there are no NaNs, you should pay attention to them. When we sort values, these NaNs are placed after all the other valid values by default. If we want to change this default behavior, we set the na_position parameter.

Sorting with NaNs
  • We first inject one NaN into the DataFrame object.
  • When we leave na_position at its default, the NaN value is placed at the end of the sorted data.
  • When we set na_position to "first", the NaN value appears at the top.
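The original code for this step was also an image, so here is a minimal sketch of the three steps above. The small DataFrame and the choice of which cell to overwrite are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data; passengers is float so it can hold NaN
df = pd.DataFrame({
    "year": [1951, 1949, 1952],
    "passengers": [150.0, 148.0, 183.0],
})

# Step 1: inject one NaN
df.loc[1, "passengers"] = np.nan

# Step 2: default behavior (na_position="last") puts NaN at the end
nan_last = df.sort_values("passengers")
print(nan_last.index.tolist())   # [0, 2, 1]

# Step 3: na_position="first" moves NaN to the top
nan_first = df.sort_values("passengers", na_position="first")
print(nan_first.index.tolist())  # [1, 0, 2]
```

The NaN placement does not depend on the sort direction: with ascending=False, NaNs still go last unless na_position says otherwise.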

Conclusions

In this article, we reviewed 8 things to know about sorting values with pandas, which should cover most use cases. If you feel I'm missing anything important, please feel free to leave a comment!
