CopyPastor

### `groupby` + `bfill` and `ffill`

    df = df.groupby('cluster').bfill().ffill()
    df

        cluster Value
    0         1     A
    1         1     A
    2         1     A
    3         1     A
    4         1     A
    5         2     B
    6         2     B
    7         2     B
    8         2     B
    9         3     B
    10        3     B
    11        3     C
    12        3     C
    13        4     S
    14        4     S
    15        4     S
    16        5     A
    17        5     A
    18        5     A
    19        5     A

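For anyone who wants to try the fill on a small frame, here is a minimal sketch. The input data is hypothetical (the question's original frame is not reproduced here), and this variant applies both fills to the `Value` column inside each cluster rather than calling them on the whole frame.

```python
import numpy as np
import pandas as pd

# Hypothetical input: each cluster has at least one known Value.
df = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 3, 3, 3],
    "Value": [np.nan, "A", np.nan, "B", np.nan, np.nan, np.nan, "C"],
})

# Back-fill within each cluster, then forward-fill within each cluster.
# Both results keep the original row index, so they align with df.
back_filled = df.groupby("cluster")["Value"].bfill()
filled = back_filled.groupby(df["cluster"]).ffill()

result = df.assign(Value=filled)
print(result)
```

Doing the forward fill per cluster as well means a cluster with no known value stays empty instead of picking up a value from the rows above it. A plain `.ffill()` chained after the grouped `bfill()` runs over the whole frame.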
---
Or,
### `groupby` + `transform` with `first`

    df['Value'] = df.groupby('cluster').Value.transform('first')
    df

        cluster Value
    0         1     A
    1         1     A
    2         1     A
    3         1     A
    4         1     A
    5         2     B
    6         2     B
    7         2     B
    8         2     B
    9         3     B
    10        3     B
    11        3     C
    12        3     C
    13        4     S
    14        4     S
    15        4     S
    16        5     A
    17        5     A
    18        5     A
    19        5     A
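
A small sketch of the `transform('first')` variant, again on hypothetical data that is not taken from the question. It relies on `first` skipping missing entries by default, so each row receives the first non-null `Value` of its cluster.

```python
import numpy as np
import pandas as pd

# Hypothetical input; cluster 2 starts with a missing Value.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 2],
    "Value": ["A", np.nan, np.nan, "B", np.nan],
})

# 'first' picks the first non-null Value in each cluster, and
# transform broadcasts it back to every row of that cluster.
df["Value"] = df.groupby("cluster")["Value"].transform("first")
print(df)
```

Note that this overwrites every row in the cluster with that one value, so it suits the case where each cluster should end up with a single `Value`.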

Just out of curiosity, I took a look at what happens under the hood by running [dtruss/strace][1] on each test.

C++

    ./a.out < in
    Saw 6512403 lines in 8 seconds. Crunch speed: 814050

syscalls `sudo dtruss -c ./a.out < in`

    CALL               COUNT
    __mac_syscall          1
    <snip>
    open                   6
    pread                  8
    mprotect              17
    mmap                  22
    stat64                30
    read_nocancel      25958

Python

    ./a.py < in
    Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls `sudo dtruss -c ./a.py < in`

    CALL               COUNT
    __mac_syscall          1
    <snip>
    open                   5
    pread                  8
    mprotect              17
    mmap                  21
    stat64                29

[1]: http://en.wikipedia.org/wiki/Strace
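
The `a.py` script itself is not shown above, so the following is only a sketch of what a line counter with that output format might look like. The function names `count_lines` and `report` are hypothetical, and the whole-second timing is an assumption based on the integer values in the printed result.

```python
import io
import time


def count_lines(stream):
    """Count the lines in a text stream, timing the loop in whole seconds."""
    start = int(time.time())
    count = 0
    for _ in stream:
        count += 1
    elapsed = int(time.time()) - start
    return count, elapsed


def report(count, elapsed):
    # Treat a run shorter than one second as one second to avoid dividing by zero.
    lps = count // elapsed if elapsed else count
    return f"Read {count} lines in {elapsed} seconds. LPS: {lps}"


# Small in-memory sample standing in for the redirected input file.
sample = io.StringIO("first\nsecond\nthird\n")
count, elapsed = count_lines(sample)
print(report(count, elapsed))
```

Passing `sys.stdin` instead of the sample stream would correspond to the `./a.py < in` invocation.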
