Your DataFrame (`df_1`):
```
+----------+----------+----------------------------+
|item_name |item_value|timestamp |
+----------+----------+----------------------------+
|hpc_max |0.25 |2023-03-01T17:20:00.000+0000|
|asset_min |0.34 |2023-03-01T17:20:00.000+0000|
|off_median|0.3 |2023-03-01T17:30:00.000+0000|
|hpc_max |0.54 |2023-03-01T17:30:00.000+0000|
|asset_min |0.32 |2023-03-01T17:35:00.000+0000|
|off_median|0.67 |2023-03-01T17:20:00.000+0000|
|asset_min |0.54 |2023-03-01T17:30:00.000+0000|
|off_median|0.32 |2023-03-01T17:35:00.000+0000|
|hpc_max |0.67 |2023-03-01T17:35:00.000+0000|
+----------+----------+----------------------------+
```
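For anyone who wants to reproduce `df_1` locally, here is a minimal sketch; the names and values are taken from the table above, while the plain ISO-8601 strings and the `to_timestamp` parsing are assumptions about how the column was built:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

rows = [
    ("hpc_max", 0.25, "2023-03-01T17:20:00"),
    ("asset_min", 0.34, "2023-03-01T17:20:00"),
    ("off_median", 0.3, "2023-03-01T17:30:00"),
    ("hpc_max", 0.54, "2023-03-01T17:30:00"),
    ("asset_min", 0.32, "2023-03-01T17:35:00"),
    ("off_median", 0.67, "2023-03-01T17:20:00"),
    ("asset_min", 0.54, "2023-03-01T17:30:00"),
    ("off_median", 0.32, "2023-03-01T17:35:00"),
    ("hpc_max", 0.67, "2023-03-01T17:35:00"),
]

df_1 = (
    spark.createDataFrame(rows, ["item_name", "item_value", "ts"])
    # parse the ISO-8601 strings into a proper timestamp column
    .withColumn("timestamp", to_timestamp("ts"))
    .drop("ts")
)
```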
Import the necessary function:
```
from pyspark.sql.functions import collect_list
```
1. Order the DataFrame by the columns `timestamp` and `item_name`, so that the names and values line up in the same order within each group:
```
df_2 = df_1.orderBy("timestamp", "item_name")
```
2. Group by `timestamp` and use the `collect_list` aggregation function to gather the matching names and values into arrays:
```
df_2.groupBy("timestamp").agg(
collect_list("item_name").alias("item_name"),
collect_list("item_value").alias("item_value")
).show(truncate=False)
```
***Output***
```
+----------------------------+--------------------------------+------------------+
|timestamp |item_name |item_value |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[asset_min, hpc_max, off_median]|[0.32, 0.67, 0.32]|
|2023-03-01T17:30:00.000+0000|[asset_min, hpc_max, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[asset_min, hpc_max, off_median]|[0.34, 0.25, 0.67]|
+----------------------------+--------------------------------+------------------+
```
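One caveat: `collect_list` is non-deterministic with respect to element order, so the ordering established by the `orderBy` in step 1 is not formally guaranteed to survive the shuffle that `groupBy` introduces (it usually does in practice). If the order must be guaranteed, here is a sketch that sorts inside the aggregation instead, assuming Spark 2.4+ for `array_sort`; `df_3` and `pairs` are just illustrative names:
```
from pyspark.sql.functions import array_sort, col, collect_list, struct

df_3 = (
    df_1.groupBy("timestamp")
    .agg(
        # collect (item_name, item_value) pairs, then sort the array;
        # structs compare field by field, so pairs are ordered by item_name
        array_sort(collect_list(struct("item_name", "item_value"))).alias("pairs")
    )
    .select(
        "timestamp",
        # field access on an array of structs returns an array of that field
        col("pairs.item_name").alias("item_name"),
        col("pairs.item_value").alias("item_value"),
    )
)
df_3.show(truncate=False)
```
This way the ordering is part of the aggregation expression itself rather than a property of how the rows happened to arrive, so it holds regardless of partitioning.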