`spark` is conventionally the Python variable denoting the `SparkSession`, and you will often find PySpark code where `spark` is simply assumed to exist. In your own Python script, however, you have to define it yourself:
from pyspark.sql import SparkSession

# Create the SparkSession, or return the existing one if it was already created
spark = SparkSession.builder \
    .appName('My PySpark App') \
    .getOrCreate()
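Once the session exists you can run a quick sanity check, for instance printing the Spark version and counting a trivial range DataFrame. A minimal sketch, using only the standard `SparkSession` API:

```python
# Quick sanity check on the freshly created session
print(spark.version)     # e.g. '3.1.2'

df = spark.range(5)      # DataFrame with a single 'id' column holding 0..4
print(df.count())        # 5
```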
Alternatively, you can try the code right off the bat in the [**PySpark shell**](https://spark.apache.org/docs/latest/index.html#running-the-examples-and-shell), an interactive Python interpreter with Spark already set up. In the PySpark shell the `spark` variable is pre-defined (and so is `sc`, the Spark _context_).
% pyspark
Python 3.9.7 (v3.9.7:1016ef3790, Aug 30 2021, 16:39:15)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
24/04/20 12:13:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Python version 3.9.7 (v3.9.7:1016ef3790, Aug 30 2021 16:39:15)
Spark context Web UI available at http://192.168.0.199:4040
Spark context available as 'sc' (master = local[*], app id = local-1713608012950).
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x7fe190a96610>
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
You even get a Web UI, served locally (on port 4040 by default), to monitor your jobs!
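Since `spark` and `sc` are already defined, you can start experimenting straight away. A minimal sketch of a first interactive session (the data is made up for illustration):

```python
>>> df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
>>> df.count()
2
>>> sc.parallelize(range(10)).sum()   # the SparkContext works on plain RDDs
45
```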
By typing `help(spark)` you can get useful information about the Spark session:
Help on SparkSession in module pyspark.sql.session object:
class SparkSession(pyspark.sql.pandas.conversion.SparkConversionMixin)
| SparkSession(sparkContext, jsparkSession=None)
|
| The entry point to programming Spark with the Dataset and DataFrame API.
|
| A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
| tables, execute SQL over tables, cache tables, and read parquet files.
| To create a SparkSession, use the following builder pattern:
|
| .. autoattribute:: builder
| :annotation:
|
| Examples
| --------
| >>> spark = SparkSession.builder \
| ... .master("local") \
| ... .appName("Word Count") \
| ... .config("spark.some.config.option", "some-value") \
| ... .getOrCreate()
|
| >>> from datetime import datetime
| >>> from pyspark.sql import Row
| >>> spark = SparkSession(sc)
| >>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1,
| ... b=True, list=[1, 2, 3], dict={"s": 0}, row=Row(a=1),
| ... time=datetime(2014, 8, 1, 14, 1, 5))])
| >>> df = allTypes.toDF()
| >>> df.createOrReplaceTempView("allTypes")
| >>> spark.sql('select i+1, d+1, not b, list[1], dict["s"], time, row.a '
| ... 'from allTypes where b and i > 0').collect()
| [Row((i + CAST(1 AS BIGINT))=2, (d + CAST(1 AS DOUBLE))=2.0, (NOT b)=False, list[1]=2, dict[s]=0, time=datetime.datetime(2014, 8, 1, 14, 1, 5), a=1)]
| >>> df.rdd.map(lambda x: (x.i, x.s, x.d, x.l, x.b, x.time, x.row.a, x.list)).collect()
| [(1, 'string', 1.0, 1, True, datetime.datetime(2014, 8, 1, 14, 1, 5), 1, [1, 2, 3])]
|
| Method resolution order:
| SparkSession
| pyspark.sql.pandas.conversion.SparkConversionMixin
| builtins.object
[. . .]
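Note that the builder pattern shown in the docstring is also safe to run when a session already exists: `getOrCreate()` should hand back the running session rather than start a new one, as this small sketch (run from the PySpark shell, where `spark` is pre-defined) suggests:

```python
>>> from pyspark.sql import SparkSession
>>> spark2 = SparkSession.builder.appName('Another App').getOrCreate()
>>> spark2 is spark   # getOrCreate() returns the already-running session
True
```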