
Remove duplicates from a dataframe in PySpark

It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, once you apply .collect() you have a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())
df1.collect()
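
As a minimal, self-contained sketch of the same fix, assuming a local Spark session and fabricated sample rows (the column names and data below are illustrative only; on Spark 2.x+ the SparkSession entry point plays the role of sqlContext but works the same way here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dedupe-demo').getOrCreate()

rdd1 = spark.sparkContext.parallelize([
    (1, 'a', 'x', 10),
    (1, 'a', 'x', 10),  # exact duplicate row
    (2, 'b', 'y', 20),
])

# dropDuplicates() must be called while the object is still a DataFrame;
# collect() comes last and returns a plain Python list of Row objects.
df1 = (spark
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())
print(df1.collect())  # the duplicate row appears only once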

If you have a DataFrame and want to remove all duplicates with reference to a specific column (called 'colName'):

Count before the de-dupe:

df.count()

Do the de-dupe (convert the column you are de-duping to string type):

from pyspark.sql.functions import col

df = df.withColumn('colName', col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
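
For context, drop_duplicates is an alias of dropDuplicates, and passing subset=['colName'] restricts the duplicate check to that one column. A hedged, runnable sketch with fabricated data (the 'spark' session and sample rows are assumptions, not from the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('subset-dedupe').getOrCreate()

df = spark.createDataFrame(
    [(1, 'apple'), (2, 'apple'), (3, 'banana')],  # made-up sample rows
    ['id', 'colName'])
print(df.count())  # 3 rows before de-duping

# Cast the key column to string so values compare consistently, then keep
# one (arbitrary) row per distinct value of 'colName'.
df = df.withColumn('colName', col('colName').cast('string'))
print(df.drop_duplicates(subset=['colName']).count())  # 2 rows remain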

You can use a sorted groupBy to check that duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
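
Continuing with the same fabricated data, a self-contained sketch of this verification step; note that toPandas() assumes the grouped result is small enough to fit in driver memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dedupe-check').getOrCreate()

deduped = (spark
           .createDataFrame([(1, 'apple'), (2, 'apple'), (3, 'banana')],
                            ['id', 'colName'])
           .drop_duplicates(subset=['colName']))

counts = (deduped.groupBy('colName').count()
          .toPandas()
          .set_index('count')
          .sort_index(ascending=False))
print(counts)  # after de-duping, every value of 'colName' should count 1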
