You need to define the escape character so that commas (`,`) inside quoted text are not treated as field delimiters while parsing.
This can be done as ***spark.read.option("escape", "\"")***
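For instance, a row like the following (column names and values are made up for illustration) has both an embedded comma and a `""`-escaped quote inside a quoted field:

```
id,essay
p001,"She said ""hello, class"" and we read books, maps, and charts."
```

With the reader's default escape character (a backslash), the doubled quote can end the quoted field early, and the commas after it are then parsed as column separators; setting the escape character to `"` keeps the whole field as one value.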
Working example:
scala> val df = spark.read.option("header",true).option("escape","\"").csv("train.csv");
df: org.apache.spark.sql.DataFrame = [id: string, teacher_id: string ... 14 more fields]
scala> df.select($"project_is_approved").show
+-------------------+
|project_is_approved|
+-------------------+
| 1|
| 0|
| 1|
| 0|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 0|
| 1|
| 1|
| 1|
| 1|
| 1|
| 0|
+-------------------+
only showing top 20 rows
Just out of curiosity, I took a look at what happens under the hood, using [dtruss/strace][1] on each test.
C++
./a.out < in
Saw 6512403 lines in 8 seconds. Crunch speed: 814050
syscalls `sudo dtruss -c ./a.out < in`
CALL COUNT
__mac_syscall 1
<snip>
open 6
pread 8
mprotect 17
mmap 22
stat64 30
read_nocancel 25958
Python
./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402
syscalls `sudo dtruss -c ./a.py < in`
CALL COUNT
__mac_syscall 1
<snip>
open 5
pread 8
mprotect 17
mmap 21
stat64 29
[1]: http://en.wikipedia.org/wiki/Strace