Last week, I found myself wrestling with an issue for my client. The mission? Rebuild a Spark SQL query that creates a finance table for reporting on their organizational formation. Sounds straightforward, right?
Spoiler alert: it wasn’t.
To present the data, we report everything against a specific organizational structure called an organigram. Nothing fancy, just your typical hierarchy chart. But then the trouble began. The goal was simple: find the correct organigram for the active period of the formation value.
In other words:
- Which organigram belongs to code 20000 during the time period 2023-01-01 to 2099-12-31?
In SQL, that looks as follows:
left join lh_enriched.vw_dim_organigram org
on beheer.code_oe = org.OrganisatorischeEenheid
AND dd_eind_formatieplaats between BeginOrganigram and EindOrganigram
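For context, this join sits inside a larger query that looks roughly like the sketch below. The fact table name (some_fact_table, aliased as beheer) and the selected columns are placeholders for illustration; only the join condition comes from the real query.

select
    beheer.code_oe,
    beheer.dd_eind_formatieplaats,
    org.OrganisatorischeEenheid,
    org.BeginOrganigram,
    org.EindOrganigram
from some_fact_table beheer                          -- placeholder: the real fact table behind the beheer alias
left join lh_enriched.vw_dim_organigram org
    on beheer.code_oe = org.OrganisatorischeEenheid
    and beheer.dd_eind_formatieplaats between org.BeginOrganigram and org.EindOrganigram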
However, things didn’t go as planned. Instead of the expected results, I got… NULLs. Everywhere. My join apparently decided to ghost me. Strange, because I knew the data was there. I double-checked that every date field was a timestamp in Spark SQL, and, above all, I could clearly see values for code 20000 in the right time window (see the quick checks sketched below). Yet, the result? A big fat pile of NULLs.
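The double-checking itself was nothing fancy. It looked roughly like this (a sketch; the view and column names come from the join above, the 20000 literal from the goal):

-- check the data types of the date columns on the dimension view
describe lh_enriched.vw_dim_organigram;

-- confirm that code 20000 really has rows covering the expected window
select OrganisatorischeEenheid, BeginOrganigram, EindOrganigram
from lh_enriched.vw_dim_organigram
where OrganisatorischeEenheid = 20000;      -- quote the literal if the column turns out to be a string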
At this point, I felt like a detective at a crime scene where all the suspects are timestamps. So, I tried to force the join to behave by explicitly casting the fields to timestamps, like this:
to_timestamp(coalesce(beheer.dd_eind_formatieplaats, '2099-12-31'))
between to_timestamp(org.BeginOrganigram) and to_timestamp(org.EindOrganigram)
Alas, this still gave me a big fat pile of NULLs!
Here’s where it got really weird: I ran the same query in two separate Fabric notebook sessions… and got two different results. One session happily returned data. The other? Still drowning in NULLs. At this point, I was questioning everything: my notebook settings, my Spark settings. Could some obscure configuration be sabotaging me?
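If you ever find yourself in the same spot, one thing worth inspecting (and I’m speculating here, this is not a confirmed root cause) is the session configuration that governs datetime parsing and cast behaviour. You can check the relevant settings straight from a notebook cell:

-- LEGACY, CORRECTED or EXCEPTION: controls how to_date/to_timestamp parse patterns since Spark 3.0
set spark.sql.legacy.timeParserPolicy;

-- when ANSI mode is on, an invalid cast throws an error instead of silently returning NULL
set spark.sql.ansi.enabled;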
After some deep diving (and a few cups of coffee), I discovered the likely culprit:
- ISO 8601 strings with T/Z are not parsed by a plain CAST(... AS DATE).
Yep, my innocent-looking CAST('2019-01-01T00:00:00Z' AS DATE) was silently returning NULL. Sneaky.
I had been parsing values with to_timestamp(...) or casting to DATE, but Spark wanted something more specific: an explicit format.
As an example:
CAST('2019-01-01T00:00:00Z' AS DATE)
needs to be:
to_date(col, "yyyy-MM-dd'T'HH:mm:ssX").
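A quick way to see the difference is to run both expressions side by side in a notebook cell, using the same literal as above:

select
    cast('2019-01-01T00:00:00Z' as date)                      as via_cast,    -- the silent NULL from my session
    to_date('2019-01-01T00:00:00Z', "yyyy-MM-dd'T'HH:mm:ssX") as via_to_date; -- 2019-01-01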
So eventually, I ended up with the following Spark SQL join condition, where the coalesce calls substitute a far-future date whenever an end date is missing, so open-ended rows still fall inside the BETWEEN range:
on coalesce(to_date(eind_formatieplaats, "yyyy-MM-dd'T'HH:mm:ssX"), date'2099-12-31')
BETWEEN
to_date(BeginOrganigram, "yyyy-MM-dd'T'HH:mm:ssX")
AND
coalesce(to_date(EindOrganigram, "yyyy-MM-dd'T'HH:mm:ssX"), date'9999-12-31')
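If you want to convince yourself that the explicit format does the trick on your own data, a quick spot-check on the dimension view is enough (same assumptions as the earlier sketch):

select
    OrganisatorischeEenheid,
    to_date(BeginOrganigram, "yyyy-MM-dd'T'HH:mm:ssX") as begin_dt,
    to_date(EindOrganigram,  "yyyy-MM-dd'T'HH:mm:ssX") as eind_dt
from lh_enriched.vw_dim_organigram
where OrganisatorischeEenheid = 20000;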
Once I ran the query, I finally got the expected results. Me happy, client happy! But let’s be honest, some frustrating hours had passed.
Now the report delivers the correct data to power HR insights. Mission accomplished! It just took a little longer than I had planned. And that’s the thing: no matter how long you’ve worked with data platforms, or even in Fabric specifically, there will always be challenges that look simple on the surface but end up eating more time than you’d like.
In the end, the client is happy, and I learned something new this week.
Let’s see what next week’s challenge will be. Until then… happy coding!