CopyPastor

Detecting plagiarism made easy.

Score: 0.9276883033948524; Reported for: String similarity

Possible Plagiarism

Reposted on 2023-04-01
by arudsekaberne

Original Post

Original - Posted on 2023-03-21
by arudsekaberne




Your DataFrame (df_1):

```
+----------+----------+----------------------------+
|item_name |item_value|timestamp                   |
+----------+----------+----------------------------+
|hpc_max   |0.25      |2023-03-01T17:20:00.000+0000|
|asset_min |0.34      |2023-03-01T17:20:00.000+0000|
|off_median|0.3       |2023-03-01T17:30:00.000+0000|
|hpc_max   |0.54      |2023-03-01T17:30:00.000+0000|
|asset_min |0.32      |2023-03-01T17:35:00.000+0000|
|off_median|0.67      |2023-03-01T17:20:00.000+0000|
|asset_min |0.54      |2023-03-01T17:30:00.000+0000|
|off_median|0.32      |2023-03-01T17:35:00.000+0000|
|hpc_max   |0.67      |2023-03-01T17:35:00.000+0000|
+----------+----------+----------------------------+
```

Try this:

```
from pyspark.sql.functions import collect_list

# Order rows so each timestamp group is collected in item_name order
df_2 = df_1.orderBy("timestamp", "item_name")

# Collect each group's names and values into parallel arrays
df_2.groupBy("timestamp").agg(
    collect_list("item_name").alias("item_name"),
    collect_list("item_value").alias("item_value")
).show(truncate=False)
```

***Output***

```
+----------------------------+--------------------------------+------------------+
|timestamp                   |item_name                       |item_value        |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[asset_min, hpc_max, off_median]|[0.32, 0.67, 0.32]|
|2023-03-01T17:30:00.000+0000|[asset_min, hpc_max, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[asset_min, hpc_max, off_median]|[0.34, 0.25, 0.67]|
+----------------------------+--------------------------------+------------------+
```
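If you want to run the snippet above end to end, here is a minimal, illustrative sketch that rebuilds `df_1` from the rows shown; it assumes an active SparkSession and keeps `timestamp` as a plain string:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the table above; timestamp is kept as a string here,
# cast it to a timestamp type if your real column requires it.
rows = [
    ("hpc_max",    0.25, "2023-03-01T17:20:00.000+0000"),
    ("asset_min",  0.34, "2023-03-01T17:20:00.000+0000"),
    ("off_median", 0.3,  "2023-03-01T17:30:00.000+0000"),
    ("hpc_max",    0.54, "2023-03-01T17:30:00.000+0000"),
    ("asset_min",  0.32, "2023-03-01T17:35:00.000+0000"),
    ("off_median", 0.67, "2023-03-01T17:20:00.000+0000"),
    ("asset_min",  0.54, "2023-03-01T17:30:00.000+0000"),
    ("off_median", 0.32, "2023-03-01T17:35:00.000+0000"),
    ("hpc_max",    0.67, "2023-03-01T17:35:00.000+0000"),
]
df_1 = spark.createDataFrame(rows, ["item_name", "item_value", "timestamp"])
```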
Your DataFrame (df_1):

```
+----------+----------+----------------------------+
|item_name |item_value|timestamp                   |
+----------+----------+----------------------------+
|hpc_max   |0.25      |2023-03-01T17:20:00.000+0000|
|asset_min |0.34      |2023-03-01T17:20:00.000+0000|
|off_median|0.3       |2023-03-01T17:30:00.000+0000|
|hpc_max   |0.54      |2023-03-01T17:30:00.000+0000|
|asset_min |0.32      |2023-03-01T17:35:00.000+0000|
|off_median|0.67      |2023-03-01T17:20:00.000+0000|
|asset_min |0.54      |2023-03-01T17:30:00.000+0000|
|off_median|0.32      |2023-03-01T17:35:00.000+0000|
|hpc_max   |0.67      |2023-03-01T17:35:00.000+0000|
+----------+----------+----------------------------+
```
Import the required function:

```
from pyspark.sql.functions import collect_list
```
1. Order the DataFrame by the columns `timestamp` and `item_name`:

```
df_2 = df_1.orderBy("timestamp", "item_name")
```
2. Group by `timestamp` and use the `collect_list` aggregation function to gather each group's `item_name` and `item_value` entries into arrays:

```
df_2.groupBy("timestamp").agg(
    collect_list("item_name").alias("item_name"),
    collect_list("item_value").alias("item_value")
).show(truncate=False)
```
***Output***

```
+----------------------------+--------------------------------+------------------+
|timestamp                   |item_name                       |item_value        |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[asset_min, hpc_max, off_median]|[0.32, 0.67, 0.32]|
|2023-03-01T17:30:00.000+0000|[asset_min, hpc_max, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[asset_min, hpc_max, off_median]|[0.34, 0.25, 0.67]|
+----------------------------+--------------------------------+------------------+
```
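One caveat worth flagging: Spark does not guarantee that the row order from a preceding `orderBy` survives the shuffle introduced by `groupBy`, so the arrays produced by `collect_list` may not always come out sorted by `item_name`. A defensive variant, sketched here with the standard `struct`/`sort_array` pattern (`pairs` and `df_sorted` are illustrative names), would be:

```
from pyspark.sql import functions as F

# Collect (item_name, item_value) pairs as structs, sort each array by
# item_name (the first struct field), then split back into aligned arrays.
df_sorted = (
    df_1.groupBy("timestamp")
    .agg(F.sort_array(F.collect_list(F.struct("item_name", "item_value"))).alias("pairs"))
    .select(
        "timestamp",
        F.col("pairs.item_name").alias("item_name"),
        F.col("pairs.item_value").alias("item_value"),
    )
)
df_sorted.show(truncate=False)
```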
