You need to define the escape character so that commas (`,`) inside quoted text are not treated as field delimiters while parsing.
This can be done as ***spark.read.option("escape", "\"")***
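For instance, a row like the following (column names and values are made up for illustration) has both an embedded comma and a `""`-escaped quote inside a quoted field:

```
id,essay
p001,"She said ""hello, class"" and we read books, maps, and charts."
```

With the reader's default escape character (a backslash), the doubled quote can end the quoted field early, and the commas after it are then parsed as column separators; setting the escape character to `"` keeps the whole field as one value.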
Working example:
scala> val df = spark.read.option("header",true).option("escape","\"").csv("train.csv");
df: org.apache.spark.sql.DataFrame = [id: string, teacher_id: string ... 14 more fields]
scala> df.select($"project_is_approved").show
+-------------------+
|project_is_approved|
+-------------------+
| 1|
| 0|
| 1|
| 0|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 0|
| 1|
| 1|
| 1|
| 1|
| 1|
| 0|
+-------------------+
only showing top 20 rows
Just out of curiosity, I took a look at what happens under the hood, using [dtruss/strace][1] on each test.
C++
./a.out < in
Saw 6512403 lines in 8 seconds. Crunch speed: 814050
syscalls `sudo dtruss -c ./a.out < in`
CALL COUNT
__mac_syscall 1
<snip>
open 6
pread 8
mprotect 17
mmap 22
stat64 30
read_nocancel 25958
Python
./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402
syscalls `sudo dtruss -c ./a.py < in`
CALL COUNT
__mac_syscall 1
<snip>
open 5
pread 8
mprotect 17
mmap 21
stat64 29
[1]: http://en.wikipedia.org/wiki/Strace