in this series, PySpark for Beginners: Mastering the Basics then you already understand the heart of Spark: distributed data, DataFrames, and lazy execution. You’ve installed PySpark, stood up a SparkSession, read a CSV and performed simple manipulations of the data in a Dataframe. I’ll leave a link to that story at the end of this one.
One thing worth repeating from that original article is that I often use the terms PySpark and Spark interchangeably, but strictly speaking, Spark is the overarching distributed computing framework (written in Scala), and PySpark is a dedicated Python API to Spark.
Now, something interesting happens when you get past that beginner stage. You quickly realise your second PySpark project requires a slightly different mindset:
This article takes you through those next steps. It’s deliberately slow‑paced and practical. No deep internals. No cluster tuning. No complicated Spark optimisations. Just the things real beginners need to know when they move from toy examples to small, real-world work.
We’re using open‑source Spark, running locally, just like before.
In my first article, we used the simplest possible CSV loader:
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
It works - and it’s fine for early experiments - but it hides a subtle problem.
When you use the inferSchema=True directive, Spark looks at a small sample of your file and uses that information to guess whether a column is an integer, string, boolean, or double. That means:
This can lead to mysterious behaviour later , the kind of bugs beginners find hardest to diagnose.
Think of a schema as Spark’s version of a blueprint for reading data. Before building anything, you tell Spark things like:
The names of the columns
What data type they should be
Whether or not a column value is optional.
Here’s what it looks like for our sales data example. Recall that the data looked like this:
transaction_id,customer_name,net_amount,tax_amount, is_member
101,Alice,250.50,25.05,true
102,Bob,120.00,6.00, false
103,Charlie,450.75,25.07,true
104,David,89.99,5.73,false
To specify the types of the above fields in Spark, we define our schema using code like the following.
from pyspark.sql import types as T
schema = T.StructType([
T.StructField("transaction_id", T.IntegerType(), False),
T.StructField("customer_name", T.StringType(), False),
T.StructField("net_amount", T.DoubleType(), True),
T.StructField("tax_amount", T.DoubleType(), True),
T.StructField("is_member", T.BooleanType(), True),
])
The column names and type parameters are self-explanatory. The True[False] parameter indicates that there may [not] be NULL values in the column. Note, the True/False nullability flag is mostly schema metadata and optimisation info. It’s not always strictly enforced for every data source the way a database NOT NULL constraint is.
There’s a bunch of handy CSV read options you can combine with the schema directive that make loading CSV data even more reliable.
The more common options include:
Now we can load the sales_data into a Dataframe like this:
df = (
spark.read
.option("header", True)
# Other modes: "PERMISSIVE" and "DROPMALFORMED".
.option("mode", "FAILFAST")
.option("nullValue", "N/A")
.schema(schema)
.csv("sales_data.csv")
)
Recall, in my previous article, in our first steps with manipulating dataframes with PySpark, we added an extra, derived column to our Dataframe using code like this:
df2 = df.withColumn("gross_amount", df.net_amount + df.tax_amount)
I explained that this line doesn’t calculate anything yet. It simply adds a step to Spark’s internal plan:
1. Read the CSV
2. Add a new column (gross_amount = net + tax)
Then you might add more steps like this:
df3 = df2.withColumn("tax_percentage", df2.tax_amount / df2.gross_amount * 100)
Still, no computation has happened. Only when you perform an action like …
df3.show()
… does Spark say:
“Okay, now I need to actually run all these steps.”
This is what “lazy execution” means, but the important bit for beginners isn’t the name. It’s the effect, and it means,
Think of it like you would an everyday task, like making a sandwich:
Real data is usually messy and often contains missing values, blank strings, duplicate records, or placeholder values like “N/A” and “unknown”.
In PySpark, the goal is to catch and deal with obvious problems early so the rest of your workflow behaves predictably. PySpark has a number of useful functions that enable you to do this.
The simplest cleaning function is dropna().
df_clean = df.dropna()
This removes any row that contains a null value in any column. That can be useful, but it is often too aggressive.
More commonly, you only drop rows where important columns in that particular row are missing:
df_clean = df.dropna(subset=["net_amount", "tax_amount"])
This means:
Keep the row as long as net_amount and tax_amount are present.
Other columns may still contain nulls, and that might be fine.
Sometimes you don’t want to remove rows. You just want to replace missing values with something sensible.
That is where fillna() is useful.
df_clean = df.fillna({"city": "Unknown"})
You can also fill numeric columns:
df_clean = df.fillna({"tax_amount": 0.0})
This is useful when a missing value has a clear meaning. For example, a missing discount amount might reasonably become 0.0. But be careful. Filling missing values can change the meaning of your data if you choose the wrong default.
Sometimes Spark reads a column as the wrong type, especially when working with CSV files. If so, you can convert a column using the cast() operator:
from pyspark.sql import functions as F
df_clean = df.withColumn("net_amount",F.col("net_amount").cast("double") )
This is especially common when dates, numbers, or booleans have been read as strings.
Duplicate rows can appear when files are exported more than once, joined incorrectly, or combined from multiple sources. You can remove exact duplicates like this:
df_clean = df.dropDuplicates()
Or remove duplicates based on one or more selected columns.
df_clean = df.dropDuplicates(["transaction_id"])
That second version is often more useful because it says:
Each transaction ID should only appear once.
Putting those ideas together:
from pyspark.sql import functions as F
df_clean = (
df
# Remove transactions missing required values.
.dropna(subset=["transaction_id", "net_amount"])
# Supply defaults for optional values.
.fillna(
{
"city": "Unknown",
"tax_amount": 0.0,
}
)
# Apply the expected numeric types.
.withColumn(
"net_amount",
F.col("net_amount").cast("double"),
)
.withColumn(
"tax_amount",
F.col("tax_amount").cast("double"),
)
# Keep one row for each transaction.
.dropDuplicates(["transaction_id"])
)
If you’ve worked with databases before, you’ve probably written SQL statements that join two or more tables together. Joins in Spark work the same way, but on Dataframes.
If the concept of a join is new to you, they are a way to match rows from one DataFrame with related rows from another DataFrame. In other words, it answers a question like:
“Which rows in this DataFrame correspond to rows in that DataFrame?”
That is the main idea behind every join in PySpark. Once that part is clear, the syntax and join types become much easier to understand.
If you have two Dataframes like this:
sales_data.csv
transaction_id, customer_name, net_amount, tax_amount
101, Alice, 250.50, 25.05
102, Bob, 120.00, 6.00
customers.csv
customer_name, city, loyalty_level
Alice, New York, Gold
Bob, London, Silver
You can join them on their common customer_name field like this:
df_sales = spark.read.csv("sales_data.csv", header=True)
df_customers = spark.read.csv("customers.csv", header=True)
df_joined = df_sales.join(df_customers, on="customer_name", how="inner")
df_joined.show()
# Output
+-------------+--------------+----------+----------+--------+-------------+
|customer_name|transaction_id|net_amount|tax_amount|city |loyalty_level|
+-------------+--------------+----------+----------+--------+-------------+
|Alice |101 |250.50 |25.05 |New York|Gold |
|Bob |102 |120.00 |6.00 |London |Silver |
+-------------+--------------+----------+----------+--------+-------------+
There are several different types of joins available in Spark. For 99% of beginner use‑cases, you’ll use one of the following:
And of these, the inner join will be far and away the most common type of join you’ll use in your day-to-day work
Don’t worry about “broadcast”, “sort‑merge”, “shuffle‑hash”, or any other advanced join strategy yet. As your experience of Spark grows, you can read up on these at your leisure.
Just remember:
Joins are computaionally more expensive than simple column operations, so use them when necessary — but not casually.
Most beginners stick with CSV because it’s familiar. But CSV is slow, rigid, and lacks support for data types, and in real life, Parquet is Spark’s native data format. Parquet is a columnar, compressed data format ideally suited for data analytics, data reporting and read-heavy workloads.
When Spark reads a Parquet data set:
You write out the Dataframe contents in Parquet format files like this:
df_joined.write.mode("overwrite").parquet("output/enriched_sales")
Then you can read it back instantly like this,
df_fast = spark.read.parquet("output/enriched_sales")
df_fast.show()
NB. Using Parquet for file input and output is the single easiest performance “upgrade” for any Spark beginner.
Once you understand how to read data, clean it, transform it, join it, and write it back out, the next step is learning how to organise those actions into a simple workflow. A beginner PySpark project usually follows this sequence:
Read data
-> check and clean it
-> add useful columns
-> combine with other data
-> write the result
That may sound obvious, but it is an important shift. You are no longer just experimenting with one DataFrame at a time. You are building a repeatable process.
A useful beginner habit is to give each stage of your workflow a clear purpose. For example:
df_raw = spark.read.schema(schema).csv("sales_data.csv", header=True)
df_clean = df_raw.dropna(subset=["net_amount", "tax_amount"])
df_enriched = df_clean.withColumn(
"gross_amount",
F.col("net_amount") + F.col("tax_amount")
)
df_final = df_enriched.join(df_customers, on="customer_name", how="left")
df_final.write.mode("overwrite").parquet("output/final_dataset")
This style is slightly more verbose than chaining everything into one long expression, but it is much easier to read when you are learning.
Each DataFrame name tells you where you are in the workflow:
df_raw -> the data as it arrived
df_clean -> the data after basic cleaning
df_enriched -> the data after adding new meaning
df_final -> the dataset ready to save
When something goes wrong, this structure makes debugging much easier.
You can inspect each stage by looking at the data:
df_raw.show()
df_clean.show()
df_enriched.show()
You can check row counts:
df_raw.count()
df_clean.count()
df_final.count()
This helps to answer useful questions like:
Did rows disappear unexpectedly during cleaning?
Did the join create more rows than expected?
Did a calculated column produce nulls?
The simple mental model of: Inputs → preparation → combination → output will take you surprisingly far in your PySpark journey.
Spark has a nice little web UI that switches on when you run an action like .count() or .write(). When your Spark job is running locally, visit:
http://localhost:4040
You should see something like this displayed.

It looks a bit overwhelming, but you don’t need to understand every tab. At this stage, you only need to know that the UI exists and why it’s useful. And it’s useful because it helps you see which Spark jobs have run or are currently running.
And, as your experience in Spark grows, the UI can help you understand why jobs failed or are taking longer to run than expected. But that comes much later. For now, treat the Spark UI like the dashboard in your car — you don’t need to understand the engine to notice when something looks odd.
At this point, you’ve moved beyond “I can run Spark” into “I can build a clean, simple Spark pipeline.”
You now know how to:
Nothing in this article required a cluster. Nothing required advanced tuning. This is exactly how many real PySpark projects begin.
When you’re more experienced, you may want to build on your knowledge by researching some of these topics.
These are some of the topics I hope to cover in a future article, but for now, you’ve mastered your next major milestone, and you can build something meaningful and useful with PySpark.
BTW, here is that link to the first article in this series,
PySpark for Beginners: Mastering the Basics, which I mentioned at the start.