Ten years ago, a friend introduced me to climbing.
I started off with bouldering, which was very easy to get into. All you need is a pair of climbing shoes.
About a year later, I learned to top rope. This required one short lesson on rope management and belay techniques, but the payoff was immense. A whole new area of the climbing gym was opened up for me to explore.
A few years later, more out of curiosity than courage, I wanted to try lead climbing. Unlike top rope, where the rope is already set up for you, in lead climbing you carry the rope up as you go, clipping it into quickdraws one at a time. The responsibility for both the climber and the belayer is a lot greater. One careless mistake can lead to serious injury.
That's why lead climbing requires several hours of class. Afterwards, climbers often practice some more with a monkey tail, before taking a mandatory test that finally earns them the certificate for lead climbing.
I did not find success with the lead climbing test. I took the class. I practiced many times. But still I failed the test. Twice.
I was getting despondent, until a much more experienced climber gave me a suggestion. During the test, he said, do everything you are supposed to do, like keeping the brake hand on the rope as a belayer and clipping in the correct orientation as a climber. But after a climb, also try to catch the tip of the rope as it falls from the wall.
What? Catch the tip of the rope? As the rope falls from the wall? Like you'd catch popcorn with your mouth? I do like catching the rope, but isn't that just a little play?
On my third lead test, that's what I did. And I passed.
One Little Action Can Make a Big Impression in a Test (and an Interview)
Obviously, catching the rope said nothing about my ability as a lead belayer or climber. I still needed to do all the right things.
But when it comes to the lead test, the examiner is not just looking to see if you have memorized all the moves. They also need to ensure you can execute them fluidly. Things happen fast in a lead climb. If your climber falls, the catch needs to be immediate. You don't have time to remember what to do; you just react. As in a driving test, doing the right thing is just the baseline. You also need to do the right thing without thinking, with comfort and confidence.
And looking back, that's exactly the message I inadvertently conveyed by catching the rope. It showed the examiner that I was comfortable. My motion was fluid: retrieve the rope, catch the tip, pull it in for a new figure-eight knot and repeat. At the same time, doing this little "play" also put my mind at ease. It made me less mechanical, less anxious, which in turn created a truly more relaxed and confident mindset. The transformation was magical.
Data Engineering Interview vs Lead Climbing Test
Now, having been a certified lead climber for many years, and a data engineer for just as long, I'd say data engineering is the more difficult of the two. Despite my troubles with the lead test, I'd also say going through a data engineering interview is more difficult than passing the lead test.
And like the lead test, there are fundamentals you need to know in a data engineering interview. You still need to know your batch processing vs stream processing, your data warehouse vs data lake. You should be able to propose a data pipeline architecture given a set of business requirements.
But if you are looking for an edge that helps you stand out, I have a humble suggestion. Like the suggestion of "catching the rope" above, it'll both give you an effective mindset and signal to your interviewer that you are an expert data engineer.
Idempotence: A Must-Have Concept for Data Engineers
Idempotence describes the property of an operation where performing the operation multiple times yields the same result as performing it just once.
For example, consider the arithmetic operation of adding one to a number. Each time you do this, the result increases. So this operation is not idempotent.
However, the operation of multiplying a number by one is idempotent. No matter how many times you do this, the result is the same (the original number).
Similarly, the operation of multiplying a number by zero is also idempotent, as it'll give you zero regardless of how many times you do it.
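These three arithmetic examples can be checked in a few lines of Python. This is only a sketch: `apply_n_times` and `looks_idempotent` are helper names I made up for illustration, and the sample value 7 is arbitrary.

```python
def apply_n_times(op, x, n):
    """Apply op to x n times in a row and return the final value."""
    for _ in range(n):
        x = op(x)
    return x


def add_one(x):
    return x + 1


def times_one(x):
    return x * 1


def times_zero(x):
    return x * 0


def looks_idempotent(op, x, repeats=5):
    """True if applying op `repeats` times matches applying it once.

    This only checks a single sample input, so it is evidence, not proof.
    """
    return apply_n_times(op, x, repeats) == op(x)


print(looks_idempotent(add_one, 7))     # False: 12 != 8
print(looks_idempotent(times_one, 7))   # True: 7 == 7
print(looks_idempotent(times_zero, 7))  # True: 0 == 0
```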
In the context of data engineering, designing your data processing operations to be idempotent can significantly enhance the reliability and robustness of your data pipelines. In addition, idempotent operations are often easier to maintain and debug.
Scenario: Daily Sales Report
Let's illustrate the power of an idempotent data pipeline with an example. Imagine you are a data engineer at an e-commerce company. You are tasked with generating a sales report that summarizes total revenue each day. You need to combine sales data from various sources, including transactions made on the website and the mobile app.
Your data pipeline would involve several steps:
- Ingest sales data from multiple sources.
- Transform the data to a common format.
- Load the data into a central data warehouse.
- Aggregate daily sales data.
- Generate the daily sales report.
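To make the five steps concrete, here is a minimal sketch with one plain function per step. Everything in it is an assumption for illustration: the record fields, the sample data, and the use of an in-memory list as the "warehouse".

```python
from collections import defaultdict

# Hypothetical raw records; each source uses its own field names.
WEB_SALES = [
    {"order_id": "w-1", "total_usd": 20.0, "day": "2024-05-01"},
    {"order_id": "w-2", "total_usd": 35.5, "day": "2024-05-02"},
]
APP_SALES = [
    {"id": "a-1", "amount": 10.0, "date": "2024-05-01"},
]


def ingest():
    """Step 1: collect raw records from every source."""
    return {"web": WEB_SALES, "app": APP_SALES}


def transform(raw):
    """Step 2: map each source's fields onto one common format."""
    rows = [
        {"txn_id": r["order_id"], "amount": r["total_usd"], "day": r["day"]}
        for r in raw["web"]
    ]
    rows += [
        {"txn_id": r["id"], "amount": r["amount"], "day": r["date"]}
        for r in raw["app"]
    ]
    return rows


def load(rows, warehouse):
    """Step 3: append rows to the warehouse (a plain list in this sketch)."""
    warehouse.extend(rows)


def aggregate(warehouse):
    """Step 4: sum revenue per day."""
    totals = defaultdict(float)
    for row in warehouse:
        totals[row["day"]] += row["amount"]
    return dict(totals)


def report(totals):
    """Step 5: render one line per day, sorted by date."""
    return [f"{day}: ${amount:.2f}" for day, amount in sorted(totals.items())]


warehouse = []
load(transform(ingest()), warehouse)
print(report(aggregate(warehouse)))  # ['2024-05-01: $30.00', '2024-05-02: $35.50']
```

Notice that `load` blindly appends. If the pipeline is rerun after a failure, every row lands in the warehouse twice and every daily total doubles.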
Sounds straightforward, right? However, real-world data pipelines are rarely smooth sailing. What happens if there's a network glitch, a database timeout, or a sudden surge in traffic causing an interruption? What if a user made a purchase on the mobile app but then later canceled it on the website? Without idempotence, you might end up with duplicate records or incomplete data, leading to inaccurate reports. And an inaccurate sales report would then lead your company to wrong business decisions.
Applying Idempotence
Here's how keeping the concept of idempotence in mind can help.
- Unique Identifiers for Transactions instead of Increments:
- To get the total sales volume each day, you might keep a "total sales" counter that resets to zero each day and is incremented as each new transaction comes in.
- But as we saw with the "adding one to a number" example, the increment operation is not idempotent.
- Non-idempotent operations are brittle in the face of unpredictable network behavior. For example, the purchase request might be dropped when the user walks into an elevator, and retried later. The user would still make just the one purchase, but your sales summary would record multiple purchases.
- Instead of using an increment operation, which is not idempotent, you might use an identifier that is unique to each transaction.
- If the same purchase request arrives at your database multiple times, every copy carries the same identifier, so you won't double count it.
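The contrast between the two approaches can be sketched in a few lines. The request log, the `txn_id` field, and the function names are all hypothetical; the point is only that retries inflate a counter but not a collection keyed by transaction ID.

```python
def summarize_with_increment(requests):
    """Non-idempotent: every delivery of a request bumps the totals."""
    count = 0
    revenue = 0.0
    for request in requests:
        count += 1
        revenue += request["amount"]
    return count, revenue


def summarize_with_unique_ids(requests):
    """Idempotent: a retried request carries the same ID and is recorded once."""
    by_id = {}
    for request in requests:
        by_id[request["txn_id"]] = request["amount"]
    return len(by_id), sum(by_id.values())


# Hypothetical request log: purchase "t-2" was retried twice after a dropped connection.
requests = [
    {"txn_id": "t-1", "amount": 20.0},
    {"txn_id": "t-2", "amount": 15.0},
    {"txn_id": "t-2", "amount": 15.0},
    {"txn_id": "t-2", "amount": 15.0},
]

print(summarize_with_increment(requests))   # (4, 65.0): the retries are double counted
print(summarize_with_unique_ids(requests))  # (2, 35.0): one entry per real purchase
```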
As you can see, by applying the principle of idempotence, you are steered away from a bad implementation (using increments) to a more robust one (using unique identifiers). Below are a couple more examples.
- Upserts instead of Inserts:
- Use upsert operations ("update or insert") when loading data into your data warehouse. If a sales record already exists, update it; if not, insert it.
- Idempotent Transformation Logic:
- For example, if you need to convert all transaction currencies to USD, make sure that applying the conversion function repeatedly to the same transaction doesn't change the result. A transaction that is already in USD should pass through unchanged.
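Both ideas fit in one small sketch using Python's built-in `sqlite3` module as a stand-in for a data warehouse. The table layout, the fixed exchange rates, and the function names are assumptions for illustration, and the `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer. Your warehouse's upsert syntax (often `MERGE`) will likely differ.

```python
import sqlite3

# Hypothetical fixed rates for the sketch; a real pipeline would look rates up.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}


def to_usd(txn):
    """Return a copy of txn converted to USD.

    The currency field is updated together with the amount, so a transaction
    that is already in USD passes through unchanged on any later call.
    """
    rate = RATES_TO_USD[txn["currency"]]
    return {**txn, "amount": round(txn["amount"] * rate, 2), "currency": "USD"}


def upsert_sale(conn, txn):
    """Insert the sale, or overwrite the existing row with the same txn_id."""
    conn.execute(
        """
        INSERT INTO sales (txn_id, amount, currency)
        VALUES (:txn_id, :amount, :currency)
        ON CONFLICT (txn_id) DO UPDATE SET
            amount = excluded.amount,
            currency = excluded.currency
        """,
        txn,
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (txn_id TEXT PRIMARY KEY, amount REAL, currency TEXT)"
)

sale = {"txn_id": "t-1", "amount": 80.0, "currency": "EUR"}
once = to_usd(sale)
twice = to_usd(once)

# Loading the same record three times still leaves exactly one row.
for _ in range(3):
    upsert_sale(conn, twice)

print(once == twice)  # True: converting again changes nothing
print(conn.execute("SELECT txn_id, amount, currency FROM sales").fetchall())
```

The conversion is idempotent only because it rewrites the currency label along with the amount. A version that multiplied the amount but left the currency as EUR would convert again on every rerun.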
During Interviews
Of course, in your data engineering interview, you shouldn't just keep repeating the word "idempotent". Just like no one will ever pass a lead climbing test by simply catching the rope again and again.
Yet being deeply comfortable with the concept of idempotence, and integrating it into your data architecture decisions, will give you the mindset of an expert data engineer.
So take a thorough look at a data pipeline you've recently worked on. Break it down into individual operations. Is each step an idempotent operation? If not, should it be exempt from idempotence, and why? Can you redesign the operation to be idempotent, and how? Across all sorts of network failure scenarios, how would idempotence improve the reliability and consistency of your data pipeline?
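One hedged way to run this review in code is to execute a step twice and compare the results. The helper below is a sketch under assumptions: each step is modeled as a function from a state dictionary to a new state dictionary, and `append_load` and `keyed_load` are made-up examples of a step that fails the check and one that passes.

```python
import copy


def is_idempotent(step, state):
    """Run `step` once and twice on copies of `state` and compare the results.

    Passing this check for one sample state is evidence of idempotence,
    not a proof.
    """
    once = step(copy.deepcopy(state))
    twice = step(copy.deepcopy(once))
    return once == twice


def append_load(state):
    """Blindly append the incoming batch to the table."""
    return {**state, "table": state["table"] + state["batch"]}


def keyed_load(state):
    """Merge the incoming batch into the table by transaction ID."""
    merged = {row["txn_id"]: row for row in state["table"] + state["batch"]}
    return {**state, "table": list(merged.values())}


state = {
    "table": [],
    "batch": [{"txn_id": "t-1", "amount": 20.0}, {"txn_id": "t-2", "amount": 15.0}],
}

print(is_idempotent(append_load, state))  # False: the second run duplicates rows
print(is_idempotent(keyed_load, state))   # True: the second run changes nothing
```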
By demonstrating how you can apply the principle of idempotence in practice, you'll convey to your interviewer that you're not only technically proficient, but also a thoughtful and seasoned data engineer.