###`groupby` + `bfill`, and `ffill`
df = df.groupby('cluster').bfill().ffill()
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 B
10 3 B
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
---
Or,
###`groupby` + `transform` with `first`
df['Value'] = df.groupby('cluster').Value.transform('first')
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 B
10 3 B
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Just out of curiosity I've taken a look at what happens under the hood, and I've used [dtruss/strace][1] on each test.
C++
./a.out < in
Saw 6512403 lines in 8 seconds. Crunch speed: 814050
syscalls `sudo dtruss -c ./a.out < in`
CALL COUNT
__mac_syscall 1
<snip>
open 6
pread 8
mprotect 17
mmap 22
stat64 30
read_nocancel 25958
Python
./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402
syscalls `sudo dtruss -c ./a.py < in`
CALL COUNT
__mac_syscall 1
<snip>
open 5
pread 8
mprotect 17
mmap 21
stat64 29
[1]: http://en.wikipedia.org/wiki/Strace