CopyPastor

Detecting plagiarism made easy.

Score: 0.9276883033948524; Reported for: String similarity

Possible Plagiarism

Reposted on 2023-04-01
by arudsekaberne

Original Post

Original - Posted on 2023-03-21
by arudsekaberne




Your DataFrame (df_1):

```
+----------+----------+----------------------------+
|item_name |item_value|timestamp                   |
+----------+----------+----------------------------+
|hpc_max   |0.25      |2023-03-01T17:20:00.000+0000|
|asset_min |0.34      |2023-03-01T17:20:00.000+0000|
|off_median|0.3       |2023-03-01T17:30:00.000+0000|
|hpc_max   |0.54      |2023-03-01T17:30:00.000+0000|
|asset_min |0.32      |2023-03-01T17:35:00.000+0000|
|off_median|0.67      |2023-03-01T17:20:00.000+0000|
|asset_min |0.54      |2023-03-01T17:30:00.000+0000|
|off_median|0.32      |2023-03-01T17:35:00.000+0000|
|hpc_max   |0.67      |2023-03-01T17:35:00.000+0000|
+----------+----------+----------------------------+
```

Try this:

```
from pyspark.sql.functions import collect_list

# Order rows so each timestamp group is collected in item_name order
df_2 = df_1.orderBy("timestamp", "item_name")

# Collect each group's names and values into parallel arrays
df_2.groupBy("timestamp").agg(
    collect_list("item_name").alias("item_name"),
    collect_list("item_value").alias("item_value")
).show(truncate=False)
```

***Output***

```
+----------------------------+--------------------------------+------------------+
|timestamp                   |item_name                       |item_value        |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[asset_min, hpc_max, off_median]|[0.32, 0.67, 0.32]|
|2023-03-01T17:30:00.000+0000|[asset_min, hpc_max, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[asset_min, hpc_max, off_median]|[0.34, 0.25, 0.67]|
+----------------------------+--------------------------------+------------------+
```
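If you want to run the snippet above end to end, here is a minimal, illustrative sketch that rebuilds `df_1` from the rows shown; it assumes an active SparkSession and keeps `timestamp` as a plain string:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the table above; timestamp is kept as a string here,
# cast it to a timestamp type if your real column requires it.
rows = [
    ("hpc_max",    0.25, "2023-03-01T17:20:00.000+0000"),
    ("asset_min",  0.34, "2023-03-01T17:20:00.000+0000"),
    ("off_median", 0.3,  "2023-03-01T17:30:00.000+0000"),
    ("hpc_max",    0.54, "2023-03-01T17:30:00.000+0000"),
    ("asset_min",  0.32, "2023-03-01T17:35:00.000+0000"),
    ("off_median", 0.67, "2023-03-01T17:20:00.000+0000"),
    ("asset_min",  0.54, "2023-03-01T17:30:00.000+0000"),
    ("off_median", 0.32, "2023-03-01T17:35:00.000+0000"),
    ("hpc_max",    0.67, "2023-03-01T17:35:00.000+0000"),
]
df_1 = spark.createDataFrame(rows, ["item_name", "item_value", "timestamp"])
```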
Your DataFrame (df_1):

```
+----------+----------+----------------------------+
|item_name |item_value|timestamp                   |
+----------+----------+----------------------------+
|hpc_max   |0.25      |2023-03-01T17:20:00.000+0000|
|asset_min |0.34      |2023-03-01T17:20:00.000+0000|
|off_median|0.3       |2023-03-01T17:30:00.000+0000|
|hpc_max   |0.54      |2023-03-01T17:30:00.000+0000|
|asset_min |0.32      |2023-03-01T17:35:00.000+0000|
|off_median|0.67      |2023-03-01T17:20:00.000+0000|
|asset_min |0.54      |2023-03-01T17:30:00.000+0000|
|off_median|0.32      |2023-03-01T17:35:00.000+0000|
|hpc_max   |0.67      |2023-03-01T17:35:00.000+0000|
+----------+----------+----------------------------+
```
Import the required function:

```
from pyspark.sql.functions import collect_list
```
1. Order the DataFrame by the columns `timestamp` and `item_name`:

```
df_2 = df_1.orderBy("timestamp", "item_name")
```
2. Group by `timestamp` and use the `collect_list` aggregation function to gather each group's `item_name` and `item_value` entries into arrays:

```
df_2.groupBy("timestamp").agg(
    collect_list("item_name").alias("item_name"),
    collect_list("item_value").alias("item_value")
).show(truncate=False)
```
***Output***

```
+----------------------------+--------------------------------+------------------+
|timestamp                   |item_name                       |item_value        |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[asset_min, hpc_max, off_median]|[0.32, 0.67, 0.32]|
|2023-03-01T17:30:00.000+0000|[asset_min, hpc_max, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[asset_min, hpc_max, off_median]|[0.34, 0.25, 0.67]|
+----------------------------+--------------------------------+------------------+
```
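One caveat worth flagging: Spark does not guarantee that the row order from a preceding `orderBy` survives the shuffle introduced by `groupBy`, so the arrays produced by `collect_list` may not always come out sorted by `item_name`. A defensive variant, sketched here with the standard `struct`/`sort_array` pattern (`pairs` and `df_sorted` are illustrative names), would be:

```
from pyspark.sql import functions as F

# Collect (item_name, item_value) pairs as structs, sort each array by
# item_name (the first struct field), then split back into aligned arrays.
df_sorted = (
    df_1.groupBy("timestamp")
    .agg(F.sort_array(F.collect_list(F.struct("item_name", "item_value"))).alias("pairs"))
    .select(
        "timestamp",
        F.col("pairs.item_name").alias("item_name"),
        F.col("pairs.item_value").alias("item_value"),
    )
)
df_sorted.show(truncate=False)
```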
