How to plan a successful Data Architecture
In the era of cloud computing, it’s easy to create and change data services, so every project involves architecture decisions, and every developer has to deal with them.
I recently gave a session for the Microsoft Data Engineers Club community about the considerations and questions I deal with when planning a data architecture. Here is a short summary:
Phase 1 – Collect information from the business customer
How? Ask a billion questions.
You should focus on 4 main areas:
Business requirements:
- What kind of project is this?
- What are the data sources?
- What kinds of transformations should we do on the data?
- What is the overall data volume and how much will change daily?
- How much time is allocated to complete the project, and can we parallelize the work across multiple developers? (This can push our choice of technology toward one that is easier to learn or more widely available.)
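The data volume question above can be turned into a rough projection early on. Here is a minimal sketch in Python, assuming simple linear growth from a constant net daily change; the function name and all the numbers are hypothetical and only illustrate the idea.

```python
def project_volume_gb(current_gb: float, daily_change_gb: float, days: int) -> float:
    """Project total data volume after `days`, assuming constant net daily growth."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return current_gb + daily_change_gb * days


# Hypothetical numbers for illustration only; use the customer's real answers.
current_gb = 500.0
daily_change_gb = 2.0

for years in (1, 3, 5):
    projected = project_volume_gb(current_gb, daily_change_gb, years * 365)
    print(f"After {years} year(s): about {projected:,.0f} GB")
```

Real growth is rarely perfectly linear, so treat the result as a starting point for the conversation with the customer, not as a forecast.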
Technical requirements:
Especially what we can’t do.
- Can I host the solution in the cloud? Or should it be on-premises?
- Do I have to create a virtual network and connect only through a VPN?
- Do we have regulatory or legal limitations (for example, for financial companies)?
Inventory:
- What kind of system does the customer have now?
- What kind of processes do they run?
- Is there a code base that we need to migrate?
- How familiar is the data team with the proposed technologies or languages (did someone say Python?)?
Long-term thinking:
- Scalability – how will the system perform as the data grows over time?
- Flexibility – how easy will it be to make changes, add features, and connect new systems?
- Modernization – will the platform I choose support common data processes, like connecting to APIs? Does it support modern developer tools (source control, code reuse, etc.)? Does the platform vendor add new features often?
Phase 2 – Design and Draw
After we have our answers, we can start to think about candidate designs that will meet the requirements.
Drawing the process and the included services in a diagram helps me understand the solution better and find issues faster. I use the free draw.io app.
Once we have a design, we can do a cost analysis. We can go to the Azure Pricing Calculator (or any other cloud or software pricing information) and get a cost estimate. Don’t forget to include:
- Data storage
- Data movement
- Compute (usually the most expensive item)
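To make the three cost items above concrete, here is a minimal sketch in Python of a monthly estimate. It assumes a simple linear pricing model; the class, the field names, and every price are hypothetical placeholders, so take the real rates from your provider's pricing page or calculator.

```python
from dataclasses import dataclass, replace


@dataclass
class CostInputs:
    # All values are placeholders for illustration, not real prices.
    storage_gb: float
    storage_price_per_gb_month: float
    moved_gb_per_refresh: float
    movement_price_per_gb: float
    compute_hours_per_refresh: float
    compute_price_per_hour: float
    refreshes_per_day: int
    days_per_month: int = 30


def monthly_cost_breakdown(c: CostInputs) -> dict:
    """Return the monthly cost per item plus the total, under a linear model."""
    refreshes = c.refreshes_per_day * c.days_per_month
    breakdown = {
        "storage": c.storage_gb * c.storage_price_per_gb_month,
        "movement": c.moved_gb_per_refresh * c.movement_price_per_gb * refreshes,
        "compute": c.compute_hours_per_refresh * c.compute_price_per_hour * refreshes,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Hypothetical scenario: one refresh per day versus four.
daily = CostInputs(
    storage_gb=1000, storage_price_per_gb_month=0.02,
    moved_gb_per_refresh=10, movement_price_per_gb=0.05,
    compute_hours_per_refresh=0.5, compute_price_per_hour=4.0,
    refreshes_per_day=1,
)
four_times = replace(daily, refreshes_per_day=4)

for label, scenario in (("1 refresh/day", daily), ("4 refreshes/day", four_times)):
    costs = monthly_cost_breakdown(scenario)
    print(label, {item: round(value, 2) for item, value in costs.items()})
```

Keeping the inputs separate from the formula makes it easy to rerun the estimate when a requirement changes, for example when the customer asks for more frequent refreshes. In this model storage stays flat while movement and compute scale with the number of refreshes.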
Phase 3 – Presenting the solution to the customer
This is an individual choice, but I like to present 2-3 alternative designs and explain which one I think is best and why.
Don’t forget to present the architecture diagram; it will help the customer understand you better.
Costs: don’t just show the estimate. Explain how you arrived at the numbers, so the customer understands how changes will affect costs (for example, refreshing the data more times per day).
Good luck with your data projects!
If you have more ideas about what is important in a data architecture, I would love to hear them in the comments!