CopyPastor

Detecting plagiarism made easy.

Score: 0.9843247989577126; Reported for: String similarity

Possible Plagiarism

Reposted on 2024-07-29
by Ehsan Fathi

Original Post

Original - Posted on 2024-07-29
by Ehsan Fathi



            
Present in both answers; Present only in the new answer; Present only in the old answer;

Another approach would be to do the merge manually. In my case one dataset is huge `[9000000, 580]` and the other one is small `[10000, 3]`.
The problem with merging normally is that pandas first builds a third DataFrame holding the merge result and only then assigns it to the variable. That means there is a point in time when your memory must hold `df1`, `df2`, and the join result all at once.
So if you have two data frames `df1` and `df2`, and the merge result is not going to be much smaller than the inputs, then when you run:
```
df1 = pd.merge(df1, df2)
```
the merge result sits in memory alongside both inputs before it is assigned back to `df1`. To avoid materializing that extra DataFrame, you can do the join manually. This is generally not recommended and is not the most computationally efficient way to join, but it is definitely more memory efficient:

```
# columns_to_add: the columns you want to bring into df1 from df2
for col in columns_to_add:
    mapping = df2.set_index('joining_column_in_df2')[col].to_dict()
    df1[col] = df1['joining_column_in_df1'].map(mapping)
```

At the end, `df1` is the result of the merge. When `df2` is much smaller than `df1`, this approach can actually beat a normal merge, because it skips the overhead of building another huge DataFrame and then discarding the original one.
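A minimal runnable sketch of this manual join, using made-up frame and column names (`key`, `label`, `score` are illustrative, not from the original datasets). It assumes the join key is unique in `df2`, in which case the loop is equivalent to a left merge:

```python
import pandas as pd

# Large frame (stand-in for the [9000000, 580] one) and a small lookup frame.
df1 = pd.DataFrame({"key": [1, 2, 3, 2], "a": [10, 20, 30, 40]})
df2 = pd.DataFrame({"key": [1, 2, 3],
                    "label": ["x", "y", "z"],
                    "score": [0.1, 0.2, 0.3]})

# Bring each wanted column from df2 into df1 one at a time,
# so a full merged copy never exists alongside the originals.
for col in ["label", "score"]:
    mapping = df2.set_index("key")[col].to_dict()
    df1[col] = df1["key"].map(mapping)

print(df1)
# Each df1 row now carries the label and score of its key from df2.
```

Note that if `df2` had duplicate join keys, `to_dict()` would silently keep only the last value per key, whereas a real merge would duplicate rows; this shortcut only matches a left merge for many-to-one joins.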

        