How to create beautiful architectural diagrams with Python? (~7min)

This article discusses technical diagrams in Data Engineering projects, looking at solutions such as Google Slides and draw.io. It highlights their alignment challenges and concludes with a reflection on how to simplify the process.

Nicolas Le Gall, Data Scientist

Introduction

In a Data Engineering project, a technical diagram summarizing the entire pipeline is necessary to have a complete representation of the project. How can you quickly and easily create such a diagram? This question frequently arises when it comes to representing your pipeline. You can do it directly in Google Slides by inserting shapes and arrows, but this quickly becomes tedious when you have to resize the different parts, move them together, or align the text. There are also tools, like draw.io, that facilitate the creation of diagrams by linking the different parts, but alignment problems persist and the pipeline can only be built from generic shapes.

To avoid these problems and limitations, you can use the Diagrams package, which produces an easily readable technical diagram in a few lines of code. In addition, creating a technical diagram with code lets you reuse what has already been done and, if several people collaborate on the same diagram, it lets you use a version control tool easily.

 

We will see in detail that Diagrams is a flexible tool that produces technical diagrams easily while keeping them clear for readers. We will go step by step through how to use this package and its features.

 

Prerequisites

 

To use the diagrams package, you must have Python 3.6 or higher. You will also need to install Graphviz, because that is what renders the graphs. You can find installation instructions in the “Getting Started” section of the Diagrams project's GitHub.

Then you can install the diagrams library with your package manager, and then you will be ready to start creating beautiful diagrams.

For my part, I installed the package with pip:

pip install diagrams

The basics

In this package there are 4 different elements:

    • Diagrams (Diagram)
    • Groups (Cluster)
    • Links (Edge)
    • Nodes

The first 3 elements each correspond to a single class. As for nodes, there are many classes offered for different providers, such as AWS, Azure, or GCP for the clouds, or even Kubernetes. You can find all the classes in the official package documentation.

Finally, these 4 elements fit together: a diagram is made up of nodes, which can be grouped into clusters and linked together by edges. You will therefore need to import the necessary classes to represent your architecture diagram correctly.

Now let's try to code a first diagram to understand the basics of the package.

 

    from diagrams import Diagram
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3
    
    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_GO', show=True, direction="LR"):
        db = RDS('RDS Database')
        jobs = Glue('ETL')
        log = Cloudwatch('Logging')
        bucket = S3('S3 Buckets')
        dashboard = Quicksight('Dashboard')

        db >> jobs >> bucket >> dashboard
        jobs >> log

This diagram describes a data engineering pipeline, using a table contained in an RDS database, processed via AWS Glue ETL. The processing results are stored in an S3 bucket and the logs in Cloudwatch. Finally, a Quicksight dashboard is connected to the S3 bucket.

Let's look at this first piece of code in detail. First, we import the Diagram class, which is needed to produce a diagram. Next, we import a few node classes from the AWS provider, for example RDS, Glue, etc.

We then create a new Diagram named 'Pipeline - Global Overview'. Since we passed the filename parameter, the diagram will be saved at the indicated location. Be careful: the path is relative (the root is the location from which the code is executed; for example, if the code is launched from the desktop, the diagram is saved to the desktop), not absolute. Since the show parameter is True, Python opens the diagram immediately after the code runs. The direction parameter indicates in which direction the graph is constructed; here it is left to right ("LR"), which is the default. The other options are right to left ("RL"), top to bottom ("TB"), and bottom to top ("BT"). Inside the diagram, we create several nodes with the classes we imported. To create a link between two nodes, add '>>' between them if you want the arrow to go from left to right, or '<<' for the opposite direction.
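The `>>` and `<<` notation works because the package overloads Python's shift operators on its node objects. As a purely illustrative sketch of the mechanism (a toy `Node` class written for this article, not the actual diagrams source code), it could look like this:

```python
# Toy illustration of how '>>' and '<<' can be implemented with
# operator overloading -- NOT the real diagrams implementation.
class Node:
    def __init__(self, label):
        self.label = label
        self.edges = []  # (source_label, target_label) pairs recorded on the target

    def __rshift__(self, other):
        # self >> other: arrow from self to other
        other.edges.append((self.label, other.label))
        return other  # returning the right-hand node makes chaining work

    def __lshift__(self, other):
        # self << other: arrow from other to self
        self.edges.append((other.label, self.label))
        return other

db, jobs, bucket = Node("RDS"), Node("Glue"), Node("S3")
db >> jobs >> bucket          # chains left to right
print(jobs.edges)             # [('RDS', 'Glue')]
print(bucket.edges)           # [('Glue', 'S3')]
```

Chaining works because each operator returns its right-hand node, so the next `>>` in the expression starts from where the previous one ended.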

 

To finish with all the classes in the package, let's try a slightly more complex diagram incorporating clusters.

    from diagrams import Diagram, Cluster
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3
    
    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_GO', show=True, direction="LR"):
        db = RDS('RDS Database')

        with Cluster('AWS Glue (ETL) \n Data Engineering \n (Filter, join, rename...)'):
            jobs = [Glue('Job1'), Glue('Job2')]

        log = Cloudwatch('Logging')
        bucket = S3('S3 Buckets')
        dashboard = Quicksight('Dashboard')

        db >> jobs >> bucket >> dashboard
        jobs >> log

This diagram describes the same pipeline as the previous one; the only difference is that there are now two jobs represented inside AWS Glue.

The primary purpose of Clusters is to group similar elements into the same subset.

The second use I find for clusters is delineating the different parts of the pipeline even more clearly, as the following example shows:

    from diagrams import Diagram, Cluster
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3
    
    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_GO', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\n stored in RDS')

        with Cluster('AWS Glue (ETL)'):
            jobs = Glue('Data Engineering (Filter, join, rename..)')

        with Cluster('Cloudwatch'):
            log = Cloudwatch('Monitoring Scripts')

        with Cluster('S3'):
            bucket = S3('S3 Buckets\n to store\n AWS Glue outputs')

        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard\n for monitoring')

        db >> jobs >> bucket >> dashboard
        jobs >> log

The code produces the same diagram again, but with a different appearance: each AWS service that was used is now even more clearly identified.

Advanced settings

 

Now that we know how to use diagrams, clusters, edges and nodes, let's look at customization. There are two customizable objects: nodes and edges.

 

Customizing Edges

 

Let's first look at how to customize edges. There are 3 customization parameters: color, style, and label. The default color is gray; if you want to set another color, the available colors are the same ones used by the matplotlib package.

Then, you can play with the style, and there are 4 available:

    • The default style, a continuous line
    • Bold, a continuous line in bold
    • Dashed, a line made of dashes
    • Dotted, a line made of dots

It is not possible to combine different styles, i.e. to have a line made of bold dashes.

Finally, you can add a label to an Edge if you wish to explain what it represents.

Let's see what this looks like in code.

    from diagrams import Diagram, Cluster, Edge
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3
    
    def arrow(color='black', style='line', label=None):
        """Define an edge between two parts of the diagram.

        :param color: the color of the edge, could be any color
        :type color: str
        :param style: the style of the edge, could be dashed, dotted, bold or line (default)
        :type style: str
        :param label: the text you want to show on the edge
        :type label: str
        :return: Edge object with the different parameters we set up
        :rtype: Edge
        """
        return Edge(color=color, style=style, label=label)

    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_GO', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\n stored in RDS')

        with Cluster('AWS Glue (ETL)'):
            jobs = Glue('Data Engineering (Filter, join, rename..)')

        with Cluster('Cloudwatch'):
            log = Cloudwatch('Monitoring Scripts')

        with Cluster('S3'):
            bucket = S3('S3 Buckets\n to store\n AWS Glue outputs')

        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard\n for monitoring')

        db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >> \
            bucket >> arrow(color='red', style='bold') >> dashboard
        jobs >> arrow(color='hotpink', style='dashed') >> log

Here I created a function, arrow(), which by default produces a plain black arrow, which I prefer to the package's default gray one. I then use this function to define the different arrows I want in my graph. When you want to customize an Edge, you must place it explicitly in the diagram between the two nodes concerned. Here I wanted the edges of the pipeline to be red and bold, except the arrow for the logs, which is pink and dashed. We can see this in the last two lines of the code.

Customizing Nodes

 

Let's address the second point, the customization of nodes. What does that mean? Customizing nodes means displaying a node with an image that is not already in the package's image bank, and therefore creating a node that does not exist yet.

Do you want to represent sending an email in the event of an error? That is not among the options available in the package, but you just need to provide an image representing an email and, using the Custom node, you can integrate this new node into your diagram. There are therefore many possibilities for nodes, and the only limit is your imagination.

    from diagrams import Diagram, Cluster, Edge
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3
    from diagrams.custom import Custom

    # arrow() is the helper function defined in the previous snippet
    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_GO', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\n stored in RDS')

        with Cluster('AWS Glue (ETL)'):
            jobs = Glue('Data Engineering (Filter, join, rename..)')

        with Cluster('Cloudwatch'):
            log = Cloudwatch('Monitoring Scripts')

        with Cluster('S3'):
            bucket = S3('S3 Buckets\n to store\n AWS Glue outputs')

        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard for monitoring')

        with Cluster('Devs'):
            houcem = Custom('Houcem\n Lead DS', '.../Custom/houcem.png')
            nico = Custom('Nico\n DS', '.../Custom/nico.png')
            dev = [nico, houcem]

        db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >> \
            bucket >> arrow(color='red', style='bold') >> dashboard
        jobs >> arrow(color='hotpink', style='dashed') >> log
        houcem >> arrow(color='sandybrown', style='dotted') >> jobs
        houcem >> arrow(color='sandybrown', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> jobs
        nico >> arrow(color='blue', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> bucket
        nico >> arrow(color='blue', style='dotted') >> dashboard

Here, I have chosen to represent the developers who worked on this project by specifying which parts of the pipeline they worked on. To make it more visual, I created two new nodes with photos of the developers; this way, we can clearly identify who to contact in the event of a problem with the pipeline.

Finally, once you have mastered the different functionalities of the package, you can produce very detailed diagrams. Here is an example:

    from diagrams import Diagram, Cluster, Edge
    from diagrams.aws.analytics import Glue, GlueCrawlers, GlueDataCatalog, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3, SimpleStorageServiceS3Object, SimpleStorageServiceS3BucketWithObjects
    from diagrams.custom import Custom
    
    # arrow() is the helper function defined in a previous snippet
    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_GO', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\n stored in RDS')

        with Cluster('AWS Glue'):
            crawler = GlueCrawlers('Glue\n Crawler')
            data_catalog = GlueDataCatalog('Glue\n DataCatalog')
            jobs = Glue('Glue Jobs')

        with Cluster('Jobs'):
            job = [Glue('Job for\n Activity\n transformation'),
                   Glue('Job for\n Appointment\n transformation')]

        with Cluster('Cloudwatch'):
            log = Cloudwatch('\n\n\nMonitoring Scripts')

        with Cluster('S3'):
            bucket = SimpleStorageServiceS3BucketWithObjects('S3 Buckets\n to store\n AWS Glue outputs')

            with Cluster('Objects within S3 bucket'):
                obj = [SimpleStorageServiceS3Object('Output from \nActivity\n Transformation Job'),
                       SimpleStorageServiceS3Object('Output from \nAppointment\n Transformation Job')]

        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard for monitoring')

        with Cluster('Devs'):
            houcem = Custom('Houcem\n Lead DS', '.../Custom/houcem.png')
            nico = Custom('Nico\n DS', '.../Custom/nico2.png')
            dev = [nico, houcem]

        db >> arrow(color='red') >> data_catalog
        db << arrow(color='purple', style='dotted', label='Connect DB ') << crawler >> \
            arrow(color='purple', style='dotted', label='to AWS Glue') >> data_catalog >> \
            arrow(color='red') >> jobs >> arrow(color='purple') >> job
        job >> arrow(color='hotpink', style='dashed') >> log
        job >> arrow(color='red') >> bucket >> arrow(color='darkgreen') >> \
            obj >> arrow(color='red') >> dashboard
        houcem >> arrow(color='sandybrown', style='dotted') >> jobs
        houcem >> arrow(color='sandybrown', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> jobs
        nico >> arrow(color='blue', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> bucket
        nico >> arrow(color='blue', style='dotted') >> dashboard

We first notice that the code is considerably more complex than in the previous diagrams. As for the diagram itself, it gives a truly detailed overview of the data-processing pipeline, from the database to the dashboard, all within the AWS environment.

Conclusion

 

Diagrams is a package that lets you represent pipelines through diagrams with ease and flexibility. If you want more information or more advanced commands, I advise you to look at the Diagrams project's GitHub, particularly the Issues section.

NB: This article was freely inspired by the article Create Beautiful Architecture Diagrams with Python written by Dylan Roy and available here.
