How Do I Extract A Value (I Want An Int, Not A Row) From A DataFrame And Do Simple Calculations On It?

I've got a DataFrame, let's call it df, in Apache Spark with 3 columns and about 1000 rows. One of the columns stores a double in each row that is either 1.00 or 0.00; let's call it 'class_label'.

Solution 1:

Once you have the final DataFrame with the aggregated counts per column, you can call collect() on it; this returns the rows of the DataFrame as a list of Row objects.

From the list of Rows, you can access a column value by column name and assign it to a variable, as below:

>>> df.show()
+--------+----+
|    col1|col2|
+--------+----+
|column_x|1000|
|column_y|2000|
+--------+----+

>>>
>>> test = df.collect()
>>> test
[Row(col1=u'column_x', col2=1000), Row(col1=u'column_y', col2=2000)]
>>>
>>> count_x = test[0].col2
>>> count_x
1000
>>>
>>> count_y = test[1].col2
>>> count_y
2000
>>>
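By the way, if you only need a single row, DataFrame.first() returns just the first Row without materialising the whole result list. A minimal sketch against the same df shown above:

>>> count_x = df.first()['col2']
>>> count_x
1000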

Solution 2:

Edit: I didn't notice that you're asking about Python; I wrote the code in Scala, but in principle the solution is the same, you just need to use the Python API.

The DataFrame is essentially a wrapper around a collection of data: distributed, but a collection nevertheless. There is an operation, org.apache.spark.sql.Dataset#collect, which unwraps that collection into a plain Scala Array. Once you have an array you can take the n-th element from it, or, since you only care about the first element, call head on it. Because you're using a DataFrame, you get a collection of org.apache.spark.sql.Row elements; to retrieve a value from a Row you call getDouble(i) (or getInt, getString, and so on) with the index of the column you want to extract.

To summarise, this is the code that would do what you want (roughly):

import org.apache.spark.sql.Row

// aggregate the counts, then collect the result back as an Array[Row]
val grouped_df = df2.groupBy("class_label").count()
val collectionOfValues: Array[Row] = grouped_df.collect()
val topRow: Row = collectionOfValues.head
val value: Double = topRow.getDouble(0) // index 0 is the class_label column
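Since the question is about Python, here are the same steps as a rough PySpark sketch (assuming df2 carries the 1.00/0.00 column from the question, here called 'class_label'):

grouped_df = df2.groupBy('class_label').count()
collection_of_values = grouped_df.collect()   # a Python list of Row objects
top_row = collection_of_values[0]             # plain indexing replaces Scala's head
value = top_row['class_label']                # Rows are indexable by column name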

Hope this is what you're looking for.

Please note, as per the documentation:

Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError
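A sketch of the usual way around that, assuming the 'class_label' column from the question: do the aggregation on the cluster, so only one tiny Row is ever moved to the driver.

from pyspark.sql import functions as F

# only the single aggregated Row crosses to the driver, not the full dataset
value = df.agg(F.sum('class_label').alias('total')).first()['total']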

Solution 3:

Edit: I forgot to write the import.

I solved it by converting the result to a pandas DataFrame and then calling int() on the cell at position [0, 0] to get the result into the variable x as an integer. Alternatively, you can use float().

import pyspark.sql.functions as f

# assumes an active SparkSession named 'spark', as in the pyspark shell
data = [(1,1,1),(1,2,0),(0,3,1),(1,4,1),(0,1,0),(0,2,0),(1,3,1)]
df = spark.createDataFrame(data, ['class_label','review','words'])

print(type(df))

> <class 'pyspark.sql.dataframe.DataFrame'>

df.show()

+-----------+------+-----+ 
|class_label|review|words| 
+-----------+------+-----+ 
|          1|     1|    1| 
|          1|     2|    0| 
|          0|     3|    1| 
|          1|     4|    1| 
|          0|     1|    0| 
|          0|     2|    0| 
|          1|     3|    1| 
+-----------+------+-----+

df2 = df.groupBy().agg(f.sum('class_label').alias('result')).toPandas()

x = int(df2.iloc[0, 0])

print(type(x))
> <type 'int'>
print(x)
> 4
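Once x is a plain Python int, simple calculations are just ordinary arithmetic; for example, a small sketch using the same df as above:

total_rows = df.count()              # 7 rows in the sample data
fraction = x / float(total_rows)     # 4/7, the share of rows whose class_label is 1
print(fraction)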
