CopyPastor

Let's try `groupby` + `agg` with Named Aggregation, then `reindex` to match the `horse_id` column, and `join` the result back to the initial DataFrame:

```
df = df.join(
    df.groupby('Sire_horse_id')
      .agg(Offspring=('horse_id', 'count'), Offspring_races=('Races', 'sum'))
      .reindex(df['horse_id'], fill_value=0)
      .reset_index(drop=True)
)
```
`df`:

```
   horse_id horse_type  Sire_horse_id  Dam_horse_id  Races  Offspring  Offspring_races
0       101  Stalllion             50            80     20          3               62
1       102       Mare             51            81      3          1                5
2       103   Stallion             90            70     33          2               51
3       104       Colt            101            77     27          0                0
4       105      Filly             52           102     17          0                0
5       106      Filly            101           102     23          0                0
6       107       Mare            103            35     33          0                0
7       108       Colt            103            77     18          0                0
8       109       Colt            102           107      5          0                0
9       110      Filly            101           107     12          0                0
```
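To see why each step in the chain is needed, here is a minimal sketch that runs the same approach one step at a time. The input frame is an assumption: it is rebuilt from the `horse_id`, `Sire_horse_id` and `Races` columns of the output above (the other columns are left out because the aggregation does not use them), and the intermediate names `per_sire` and `aligned` are made up for illustration.

```python
import pandas as pd

# Assumed input: only the columns the aggregation needs, copied from the output above.
df = pd.DataFrame({
    'horse_id':      [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Sire_horse_id': [50, 51, 90, 101, 52, 101, 103, 103, 102, 101],
    'Races':         [20, 3, 33, 27, 17, 23, 33, 18, 5, 12],
})

# Step 1: one row per distinct sire, indexed by Sire_horse_id.
per_sire = df.groupby('Sire_horse_id').agg(
    Offspring=('horse_id', 'count'),
    Offspring_races=('Races', 'sum'),
)

# Step 2: look up every horse_id in that table, in df's row order.
# Horses that never appear as a sire get 0 instead of NaN.
aligned = per_sire.reindex(df['horse_id'], fill_value=0)

# Step 3: drop the horse_id labels so the rows carry a fresh 0..n-1 index,
# then join on that index.
result = df.join(aligned.reset_index(drop=True))

print(result)
```

Note that the final `join` aligns on the index, so this relies on `df` having the default `0..n-1` index. With any other index, the `reset_index(drop=True)` result would no longer line up with the original rows.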

Just out of curiosity I've taken a look at what happens under the hood, and I've used [dtruss/strace][1] on each test.

C++

    ./a.out < in
    Saw 6512403 lines in 8 seconds. Crunch speed: 814050

syscalls `sudo dtruss -c ./a.out < in`

    CALL                                        COUNT
    __mac_syscall                                   1
    <snip>
    open                                            6
    pread                                           8
    mprotect                                       17
    mmap                                           22
    stat64                                         30
    read_nocancel                               25958

Python

    ./a.py < in
    Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls `sudo dtruss -c ./a.py < in`

    CALL                                        COUNT
    __mac_syscall                                   1
    <snip>
    open                                            5
    pread                                           8
    mprotect                                       17
    mmap                                           21
    stat64                                         29

[1]: http://en.wikipedia.org/wiki/Strace
