`spark` is conventionally the Python variable denoting the `SparkSession`, and you will often find PySpark code where `spark` is simply assumed to exist. In your own Python script, however, you have to define it yourself:
from pyspark.sql import SparkSession

# Create the SparkSession, or return the existing one if it was already created
spark = SparkSession.builder \
    .appName('My PySpark App') \
    .getOrCreate()
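Once the session exists you can run a quick sanity check, for instance printing the Spark version and counting a trivial range DataFrame. A minimal sketch, using only the standard `SparkSession` API:

```python
# Quick sanity check on the freshly created session
print(spark.version)     # e.g. '3.1.2'

df = spark.range(5)      # DataFrame with a single 'id' column holding 0..4
print(df.count())        # 5
```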
Alternatively, you can try the code right off the bat in the [**PySpark shell**](https://spark.apache.org/docs/latest/index.html#running-the-examples-and-shell), an interactive Python interpreter with Spark already set up. In the PySpark shell the `spark` variable is pre-defined (and so is `sc`, the Spark _context_).
% pyspark
Python 3.9.7 (v3.9.7:1016ef3790, Aug 30 2021, 16:39:15)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
24/04/20 12:13:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Python version 3.9.7 (v3.9.7:1016ef3790, Aug 30 2021 16:39:15)
Spark context Web UI available at http://192.168.0.199:4040
Spark context available as 'sc' (master = local[*], app id = local-1713608012950).
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x7fe190a96610>
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
You even get a Web UI, served locally (on port 4040 by default), to monitor your jobs!
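Since `spark` and `sc` are already defined, you can start experimenting straight away. A minimal sketch of a first interactive session (the data is made up for illustration):

```python
>>> df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
>>> df.count()
2
>>> sc.parallelize(range(10)).sum()   # the SparkContext works on plain RDDs
45
```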
By typing `help(spark)` you can get useful information about the Spark session:
Help on SparkSession in module pyspark.sql.session object:
class SparkSession(pyspark.sql.pandas.conversion.SparkConversionMixin)
| SparkSession(sparkContext, jsparkSession=None)
|
| The entry point to programming Spark with the Dataset and DataFrame API.
|
| A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
| tables, execute SQL over tables, cache tables, and read parquet files.
| To create a SparkSession, use the following builder pattern:
|
| .. autoattribute:: builder
| :annotation:
|
| Examples
| --------
| >>> spark = SparkSession.builder \
| ... .master("local") \
| ... .appName("Word Count") \
| ... .config("spark.some.config.option", "some-value") \
| ... .getOrCreate()
|
| >>> from datetime import datetime
| >>> from pyspark.sql import Row
| >>> spark = SparkSession(sc)
| >>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1,
| ... b=True, list=[1, 2, 3], dict={"s": 0}, row=Row(a=1),
| ... time=datetime(2014, 8, 1, 14, 1, 5))])
| >>> df = allTypes.toDF()
| >>> df.createOrReplaceTempView("allTypes")
| >>> spark.sql('select i+1, d+1, not b, list[1], dict["s"], time, row.a '
| ... 'from allTypes where b and i > 0').collect()
| [Row((i + CAST(1 AS BIGINT))=2, (d + CAST(1 AS DOUBLE))=2.0, (NOT b)=False, list[1]=2, dict[s]=0, time=datetime.datetime(2014, 8, 1, 14, 1, 5), a=1)]
| >>> df.rdd.map(lambda x: (x.i, x.s, x.d, x.l, x.b, x.time, x.row.a, x.list)).collect()
| [(1, 'string', 1.0, 1, True, datetime.datetime(2014, 8, 1, 14, 1, 5), 1, [1, 2, 3])]
|
| Method resolution order:
| SparkSession
| pyspark.sql.pandas.conversion.SparkConversionMixin
| builtins.object
[. . .]
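Note that the builder pattern shown in the docstring is also safe to run when a session already exists: `getOrCreate()` should hand back the running session rather than start a new one, as this small sketch (run from the PySpark shell, where `spark` is pre-defined) suggests:

```python
>>> from pyspark.sql import SparkSession
>>> spark2 = SparkSession.builder.appName('Another App').getOrCreate()
>>> spark2 is spark   # getOrCreate() returns the already-running session
True
```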