Remove duplicates from a dataframe in PySpark
It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:
```python
df1 = sqlContext.createDataFrame(
    rdd1, ['column1', 'column2', 'column3', 'column4']
).dropDuplicates()
df1.collect()
```

If you have a data frame and want to remove all duplicates -- with reference to duplicates in a specific column (called 'colName'):
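To see what dropDuplicates() does without spinning up a Spark session, here is a pure-Python sketch of the same idea: treat each row as a tuple and keep the first occurrence of each distinct row. The function name `drop_duplicates` is just an illustrative stand-in, not part of any API.

```python
def drop_duplicates(rows):
    """Pure-Python analogue of DataFrame.dropDuplicates() on full rows:
    keep the first occurrence of each distinct row tuple."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

rows = [(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c')]
print(drop_duplicates(rows))  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Spark does the same thing in a distributed fashion, so unlike this sketch it makes no guarantee about which copy of a duplicate row survives or in what order rows come back.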
Count the rows before the de-dupe, e.g. with `df.count()`.
Do the de-dupe (convert the column you are de-duping to string type):

```python
from pyspark.sql.functions import col

df = df.withColumn('colName', col('colName').cast('string'))
df = df.drop_duplicates(subset=['colName'])
df.count()
```

You can use a sorted groupby to check that the duplicates have been removed:
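Passing subset=['colName'] changes the semantics: rows are considered duplicates when they share a value in that one column, even if other columns differ. A pure-Python sketch of that behavior (the function and its `key_index` parameter are illustrative, not a Spark API):

```python
def drop_duplicates_by_key(rows, key_index):
    """Keep the first row seen for each value of the key column,
    mirroring drop_duplicates(subset=['colName'])."""
    seen = set()
    out = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [(1, 'a'), (2, 'a'), (3, 'b')]
print(drop_duplicates_by_key(rows, 1))  # [(1, 'a'), (3, 'b')]
```

Note that (2, 'a') is dropped even though the full row is unique; only the key column is compared.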
```python
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
```
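The logic of that check can be sketched in plain Python with collections.Counter: count occurrences per key and sort by count descending, so any leftover duplicates (count > 1) appear at the top. The function name `duplicate_report` is hypothetical.

```python
from collections import Counter

def duplicate_report(values):
    """Analogue of groupBy('colName').count() sorted by count descending:
    any entry with count > 1 is a remaining duplicate."""
    return sorted(Counter(values).items(), key=lambda kv: kv[1], reverse=True)

print(duplicate_report(['x', 'y', 'x', 'z']))  # [('x', 2), ('y', 1), ('z', 1)]
```

After a successful de-dupe every count should be 1, so the first entry of the report tells you immediately whether anything slipped through.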