Data contextualization is the process of adding contextual information to raw data in order to enhance its meaning and relevance. It involves the use of additional information like metadata, annotations, and other relevant details to provide a better understanding of the data. Contextualization can help analysts understand the relationships between data points and the environment in which they were collected. For example, contextualization can provide information about the time, location, and other environmental factors that might have influenced the data. In data processing, contextualization is becoming increasingly important as datasets become larger and more complex. Without proper contextualization, it can be difficult to interpret data accurately and make informed decisions based on it.
This article demonstrates how to contextualize data by looking up relevant context that's stored in a graph database in Azure SQL Database.
Architecture
In this architecture, data stored in Delta Lake in the silver layer is read incrementally, contextualized based on a graph lookup, and merged into Azure SQL Database and another Delta Lake instance in the gold layer.
Dataflow
The following dataflow corresponds to the preceding diagram:
- The incoming data that needs to be contextualized is appended into the Delta table in the silver layer.
- The incoming data is incrementally loaded into Azure Databricks.
- Contextual information is retrieved from a graph database.
- The incoming data is contextualized.
- The contextualized data is merged into the corresponding table in SQL Database.
- Optionally, the contextualized data is appended into the corresponding Delta table in the gold layer.
Components
- Azure Data Lake Storage is a scalable data lake for high-performance analytics workloads. In this solution, it stores input data and contextualized data in Delta tables.
- Azure Databricks is a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. In this solution, it provides the platform on which Python notebook files are used to contextualize data.
- SQL Database is an always-up-to-date, fully managed relational database service that's built for the cloud. In this solution, it stores a graph database and contextualized data.
Alternatives
Many graph databases are available. For more information, see:
- Graph processing with SQL Database
- Azure Cosmos DB for Apache Gremlin
- Neo4J
- RedisGraph
- Apache Age for PostgreSQL
There are pros and cons associated with each of these products and services. Some of them are Azure managed services, and some aren't. This architecture uses SQL Database, because:
- It's an Azure-managed relational database service that has graph capabilities.
- It's easy to get started if you're familiar with SQL Server or SQL Database.
- Solutions often benefit from the use of Transact-SQL in parallel. SQL Database graph relationships are integrated into Transact-SQL.
Scenario details
Data layers
This solution is based on the Databricks medallion architecture. In this design pattern, data is logically organized in various layers. The goal is to incrementally and progressively improve the structure and quality of the data as it moves from one layer to the next.
For simplicity, this architecture has only two layers:
- The silver layer stores the input data.
- The gold layer stores the contextualized data.
The data in the silver layer is stored in Delta Lake and exposed as Delta tables.
Incremental data load
This solution implements incremental data processing, so only data that has been modified or added since the previous run is processed. Incremental data load is typical in batch processing because it helps keep data processing fast and economical.
For more information, see incremental data load.
Data contextualization
Data contextualization can be applied in various ways. In this architecture, contextualization is the process of performing a graph lookup and retrieving matching values.
The solution assumes that a graph has already been created in a graph database. The internal complexity of the graph isn't a concern because the graph query is passed via a configuration and executed dynamically with passed input values.
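As a minimal sketch of that idea, the following Python example keeps the lookup query in a configuration entry and binds values from each incoming row at run time. The configuration layout, the `run_lookup` helper, and the flattened `alarm_asset` table are assumptions made for illustration. SQLite stands in for SQL Database so that the example runs anywhere, and it isn't the article's actual implementation.

```python
import sqlite3

# Hypothetical configuration: one entry per lookup, holding the query text and
# the input columns whose values are bound to the query's placeholders.
LOOKUP_CONFIG = {
    "alarm_to_asset": {
        "query": "SELECT asset_id FROM alarm_asset WHERE alarm_key = ?",
        "params": ["alarm_key"],
    }
}


def run_lookup(conn, config, name, row):
    """Run the configured query, binding values taken from the input row."""
    entry = config[name]
    values = [row[column] for column in entry["params"]]
    return conn.execute(entry["query"], values).fetchall()


# Stand-in for the graph database: a flattened lookup table in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alarm_asset (alarm_key TEXT, asset_id TEXT)")
conn.execute("INSERT INTO alarm_asset VALUES ('MA_0520', 'AE0520')")

incoming_row = {"alarm_key": "MA_0520", "alarm_desc": "example"}
matches = run_lookup(conn, LOOKUP_CONFIG, "alarm_to_asset", incoming_row)
print(matches)
```

Because the calling code only knows the configuration key and the input row, the query text can change without any change to the code that runs it. Binding values as parameters, rather than formatting them into the query string, also avoids SQL injection.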
The solution uses Azure Databricks for the data contextualization process.
Graph database
The graph database is the database that stores the graph models. As noted earlier, there are many graph databases available. In this solution, the graph capabilities of SQL Server are used to create the graph.
SQL Database
In this architecture, SQL Database is used to store the contextualized data, but you can use any storage option. To ensure idempotent processing, the data is merged into the target table rather than appended.
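The difference between merging and appending can be illustrated with a small Python sketch. The `merge_rows` and `append_rows` helpers, the in-memory targets, and the asset ID `AE0521` are hypothetical. The point is only that replaying the same batch leaves a merged target unchanged but duplicates rows in an appended one.

```python
def merge_rows(target, rows, key):
    """Upsert rows into a dict keyed by the business key (idempotent)."""
    for row in rows:
        target[row[key]] = dict(row)
    return target


def append_rows(target, rows):
    """Append rows to a list; replaying a batch creates duplicates."""
    target.extend(dict(row) for row in rows)
    return target


batch = [
    {"alarm_id": 1, "asset_id": "AE0520"},
    {"alarm_id": 2, "asset_id": "AE0521"},  # hypothetical second asset
]

merged = {}
merge_rows(merged, batch, "alarm_id")
merge_rows(merged, batch, "alarm_id")  # the same batch is processed again

appended = []
append_rows(appended, batch)
append_rows(appended, batch)  # the same batch is processed again

print(len(merged), len(appended))
```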
Contoso scenario
The solution in this article is based on the scenario that's described in this section.
Gary is an operations engineer at Contoso, Ltd. One of his responsibilities is to provide a weekly health report for the assets in Contoso factories within a specific city.
First, Gary needs to fetch all the asset IDs that he's interested in from the company's asset system. He then looks for all the attributes that belong to the assets to use as input for the health report. One example is the operational efficiency data of the asset with ID AE0520.
The following diagram illustrates some Contoso data relationships:
Contoso has many applications that help factory managers monitor processes and operations. Operational efficiency data is recorded in the quality system, another stand-alone application.
Gary signs in to the quality system and looks up the asset ID AE0520 in the AE_OP_EFF table. That table contains all the key attributes for operational efficiency data.
There are many columns in the AE_OP_EFF table. Gary is especially interested in the alarm status. However, the details for the most critical alarms of the asset are kept in another table called Alarm. Therefore, Gary needs to record that the key ID MA_0520 of the Alarm table corresponds to the asset AE0520, because they use different naming conventions.
The relationship is actually much more complicated. Gary needs to search for more than one attribute of the asset and query many tables in different systems to get all the data for a complete report. He uses queries and scripts to perform his work, but the queries are complicated and hard to maintain. Even worse, the systems are growing, and more data needs to be added to the report for different decision makers.
One of the main problems for Gary is that the IDs of a given asset in various systems are different. The systems were developed and are maintained separately, and they even use different protocols. Gary needs to manually query the various tables to get data for a single asset. The queries are complex and difficult to understand. As a result, Gary spends a lot of time training new operations engineers and explaining the relationships in the data.
Gary needs a mechanism to link the various names that belong to a single asset across systems. This mechanism will make report queries simpler and make Gary's job easier.
Graph design
SQL Database provides graph database capabilities for modeling many-to-many relationships. The graph relationships are integrated into Transact-SQL.
A graph database is a collection of nodes (or vertices) and edges (or relationships). A node represents an entity, like a person or an organization. An edge represents a relationship between the two nodes that it connects, for example, likes or friends.
Graph model for the scenario
This is the graph model for the Contoso scenario:
- Alarm is one of the metrics that belong to Quality_System.
- Quality_System is associated with an Asset.
This is what the data looks like:
In the graph model, the nodes and edges need to be defined. Azure SQL graph uses edge tables to represent relationships. In this scenario, there are two edge tables. They record the relationships between Alarm and Quality_System, and between Quality_System and Asset.
The following table shows the nodes and edges:
Nodes | Edges |
---|---|
Alarm | Alarm -> belongs_to -> Quality_System |
Quality_System | Quality_System -> is_associated_with -> Asset |
Asset | |
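To make the model concrete, the following Python sketch represents the three node types and two edge types from the preceding table as plain dictionaries and lists, and then walks the path from an alarm to its asset. The numeric IDs and the value `Q_0520` are hypothetical. Only `MA_0520` and `AE0520` come from the scenario, and using `MA_0520` as the alarm's attribute value is an assumption.

```python
# Node "tables": internal ID -> attribute value.
alarm = {1: "MA_0520"}
quality_system = {10: "Q_0520"}  # hypothetical quality system ID
asset = {100: "AE0520"}

# Edge "tables": (from_id, to_id) pairs.
belongs_to = [(1, 10)]             # Alarm -> Quality_System
is_associated_with = [(10, 100)]   # Quality_System -> Asset


def match_alarm_asset():
    """Follow Alarm -> Quality_System -> Asset and return matched pairs."""
    pairs = []
    for alarm_id, quality_id in belongs_to:
        for from_quality_id, asset_id in is_associated_with:
            if from_quality_id == quality_id:
                pairs.append((alarm[alarm_id], asset[asset_id]))
    return pairs


print(match_alarm_asset())
```

The traversal links the two naming conventions without either system needing to know the other's IDs, which is the mechanism that the scenario calls for.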
To create these nodes and edges in SQL Database, you can use the following SQL commands:
…
CREATE TABLE Alarm (ID INTEGER PRIMARY KEY, Alarm_Type VARCHAR(100)) AS NODE;
CREATE TABLE Asset (ID INTEGER PRIMARY KEY, Asset_ID VARCHAR(100)) AS NODE;
CREATE TABLE Quality_System (ID INTEGER PRIMARY KEY, Quality_ID VARCHAR(100)) AS NODE;
CREATE TABLE belongs_to AS EDGE;
CREATE TABLE is_associated_with AS EDGE;
…
These commands create the following graph tables:
- dbo.Alarm
- dbo.Asset
- dbo.belongs_to
- dbo.is_associated_with
- dbo.Quality_System
To query the graph database, you can use the MATCH clause to match patterns and traverse the graph:
SELECT [dbo].[Alarm].Alarm_Type, [dbo].[Asset].Asset_ID
FROM [dbo].[Alarm], [dbo].[Asset], [dbo].[Quality_System], [dbo].[belongs_to], [dbo].[is_associated_with]
WHERE MATCH (Alarm-(belongs_to)->Quality_System -(is_associated_with)-> Asset)
You can then use the query result to join the incoming raw data for contextualization.
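The join itself can be sketched in a few lines of Python. The example assumes that the graph query returns (Alarm_Type, Asset_ID) pairs and that incoming rows carry an `alarm_type` column. The `contextualize` helper, the sample rows, and the choice to keep unmatched rows with an empty asset ID (a left join) are illustrative assumptions, not the article's implementation.

```python
def contextualize(rows, lookup_rows):
    """Left-join incoming rows to (alarm_type, asset_id) lookup pairs."""
    assets_by_alarm = {}
    for alarm_type, asset_id in lookup_rows:
        assets_by_alarm.setdefault(alarm_type, []).append(asset_id)

    result = []
    for row in rows:
        # Unmatched rows are kept with asset_id set to None.
        for asset_id in assets_by_alarm.get(row["alarm_type"], [None]):
            result.append({**row, "asset_id": asset_id})
    return result


# Pairs in the shape that the graph query returns.
lookup = [("MA_0520", "AE0520")]

# Hypothetical incoming raw data.
incoming = [
    {"alarm_id": 1, "alarm_type": "MA_0520"},
    {"alarm_id": 2, "alarm_type": "MA_9999"},
]

contextualized = contextualize(incoming, lookup)
print(contextualized)
```

Keeping unmatched rows makes gaps in the graph visible downstream. An inner join that drops them is an equally valid choice if only fully contextualized rows are wanted.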
Incremental data load
As the architecture diagram shows, the solution contextualizes only new incoming data, not the entire dataset in the Delta table. To meet this requirement, it uses an incremental data loading solution.
In Delta Lake, change data feed is a feature that simplifies the architecture for implementing change data capture. The following diagram illustrates how it works. When change data feed is enabled, the system records data changes, which, in this case, include inserted rows and two rows that represent the pre-image and post-image of an updated row. If you need to, you can use the pre-image and post-image information to evaluate the changes. There's also a delete change type that represents deleted rows. To query the change data, you can use the table_changes function.
In this solution, change data feed is enabled for Delta tables that store the source data. You can enable it by using this command:
CREATE TABLE tbl_alarm (alarm_id INT, alarm_type STRING, alarm_desc STRING, valid_from TIMESTAMP, valid_till TIMESTAMP)
USING DELTA
TBLPROPERTIES (delta.enableChangeDataFeed = true)
The following query gets the changed rows in the table. 2 is the commit version number.
SELECT *
FROM table_changes('tbl_alarm', 2)
If you only need information about newly inserted data, you can use this query:
SELECT *
FROM table_changes('tbl_alarm', 2)
WHERE _change_type = 'insert'
For more samples, see Change data feed demo.
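To show what the change records look like, the following Python sketch simulates a change feed in memory. The `_change_type` and `_commit_version` columns and the four change types mirror what Delta Lake records. The sample rows and the `table_changes_sim` helper are hypothetical stand-ins for the real `table_changes` function, whose starting version is inclusive.

```python
# Simulated change feed for tbl_alarm (hypothetical data).
change_log = [
    {"alarm_id": 1, "alarm_type": "A", "_change_type": "insert", "_commit_version": 1},
    {"alarm_id": 2, "alarm_type": "A", "_change_type": "insert", "_commit_version": 2},
    {"alarm_id": 1, "alarm_type": "A", "_change_type": "update_preimage", "_commit_version": 3},
    {"alarm_id": 1, "alarm_type": "B", "_change_type": "update_postimage", "_commit_version": 3},
    {"alarm_id": 2, "alarm_type": "A", "_change_type": "delete", "_commit_version": 4},
]


def table_changes_sim(log, start_version):
    """Return change records from start_version (inclusive) onward."""
    return [row for row in log if row["_commit_version"] >= start_version]


changes = table_changes_sim(change_log, 2)
inserts = [row for row in changes if row["_change_type"] == "insert"]

print(len(changes), [row["alarm_id"] for row in inserts])
```

An update produces two records in the same commit, so filtering on the insert change type is what limits the result to newly added rows.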
You can use change data feed to load data incrementally. To do that, you need the version number of the most recent commit. You can create a Delta table to store that version number:
CREATE TABLE table_commit_version(table_name STRING, last_commit_version LONG)
USING DELTA
Every time you load new data into tbl_alarm, you need to complete these steps:
- Get the last_commit_version for the tbl_alarm table from table_commit_version.
- Query and load the data added since the version that's stored in last_commit_version.
- Get the highest commit version number of the tbl_alarm table.
- Update last_commit_version in the table_commit_version table to prepare it for the next query.
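The four steps above can be sketched in Python as follows. The example simulates the change feed and the `table_commit_version` table with in-memory structures. The `load_increment` helper and the sample data are hypothetical, and the sketch reads only inserted rows with a version strictly greater than the stored one, which is an assumption about how "since" is interpreted.

```python
def load_increment(change_log, commit_versions, table_name):
    """Return rows inserted since the last processed commit version."""
    # Step 1: get the last processed commit version (-1 if none yet).
    last_version = commit_versions.get(table_name, -1)

    # Step 2: load the rows that were added after that version.
    new_rows = [
        row for row in change_log
        if row["_commit_version"] > last_version
        and row["_change_type"] == "insert"
    ]

    # Step 3: get the highest commit version of the source table.
    latest = max((row["_commit_version"] for row in change_log), default=last_version)

    # Step 4: store it for the next run.
    commit_versions[table_name] = latest
    return new_rows


log = [
    {"alarm_id": 1, "_change_type": "insert", "_commit_version": 1},
    {"alarm_id": 2, "_change_type": "insert", "_commit_version": 2},
]
versions = {}  # stand-in for the table_commit_version table

first_run = load_increment(log, versions, "tbl_alarm")
second_run = load_increment(log, versions, "tbl_alarm")  # nothing new

log.append({"alarm_id": 3, "_change_type": "insert", "_commit_version": 3})
third_run = load_increment(log, versions, "tbl_alarm")

print(len(first_run), len(second_run), len(third_run))
```

In a real pipeline, the version should be updated only after the contextualized data is written successfully, so that a failed run is retried from the same starting point.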
Enabling change data feed doesn't have a significant effect on system performance or cost. The change data records are generated inline during the query execution process and are much smaller than the total size of the rewritten files.
Potential use cases
- A manufacturing solution provider wants to continuously contextualize the data and events that are provided by its customers. Because the context information is too complicated to represent in relational tables, the company uses graph models for data contextualization.
- A process engineer in a factory needs to troubleshoot a problem with factory equipment. The graph model stores all data that's directly or indirectly related to the equipment, so the engineer can use it to get information for root cause analysis.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.
For this scenario, you need to consider the security of data at rest (that is, data that's stored in Data Lake Storage, SQL Database, and Azure Databricks) and data that's in transit between the storage solutions.
For Data Lake Storage:
- Azure Storage service-side encryption (SSE) is enabled to help protect data at rest.
- You should use a shared access signature (SAS) to restrict access and permissions to data. Use HTTPS to protect data in transit.
For SQL Database:
- Use role-based access control (RBAC) to limit access to specific operations and resources within a database.
- Use strong passwords to access SQL Database. Save passwords in Azure Key Vault.
- Enable TLS to help secure in-transit data between SQL Database and Azure Databricks.
For Azure Databricks:
- Use RBAC.
- Enable Azure Monitor to monitor your Azure Databricks workspace for unusual activity. Enable logging to track user activity and security events.
- To provide a layer of protection for data in transit, enable TLS for the JDBC connection to SQL Database.
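As an illustration of the TLS recommendation, the following Python sketch builds a JDBC connection string for SQL Database with encryption turned on. The `encrypt`, `trustServerCertificate`, `hostNameInCertificate`, and `loginTimeout` properties belong to the Microsoft JDBC Driver for SQL Server. The `build_jdbc_url` helper and the server and database names are hypothetical.

```python
def build_jdbc_url(server, database):
    """Build a SQL Database JDBC URL that requires an encrypted connection."""
    return (
        f"jdbc:sqlserver://{server}:1433;database={database};"
        "encrypt=true;"                    # require TLS for the connection
        "trustServerCertificate=false;"    # validate the server certificate
        "hostNameInCertificate=*.database.windows.net;"
        "loginTimeout=30;"
    )


# Hypothetical server and database names.
url = build_jdbc_url("contoso-sql.database.windows.net", "contextualization")
print(url)
```

The URL deliberately contains no user name or password. Credentials are better supplied separately at connection time, for example from a secret store such as Azure Key Vault.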
In your production environment, put these resources into an Azure virtual network that isolates them from the public internet to reduce the attack surface and help protect against data exfiltration.
Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational efficiencies. For more information, see Overview of the cost optimization pillar.
Cost optimization for SQL Database:
- Because solution performance isn't a goal for this architecture, it uses the lowest pricing tier that meets requirements.
- You should use the serverless compute tier, which is billed per second based on the number of compute cores that are used.
Cost optimization for Azure Databricks:
- Use the All-Purpose Compute workload and the Premium tier. Choose the instance type that meets your workload requirements while minimizing costs.
- Use autoscaling to scale the number of nodes based on workload demand.
- Turn off clusters when they aren't in use.
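The autoscaling and shutdown recommendations can be expressed in a cluster definition. The following Python sketch builds a fragment that uses the `autoscale` and `autotermination_minutes` fields of the Azure Databricks Clusters API. The `cluster_spec` helper, the cluster name, and the numeric values are hypothetical. Required fields such as the runtime version and node type are omitted because they depend on your workload.

```python
def cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a partial cluster definition with autoscaling and auto-termination."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("min_workers must be positive and not exceed max_workers")
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive")
    return {
        "cluster_name": name,
        # Scale the number of workers with demand.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Terminate the cluster after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = cluster_spec("contextualization", min_workers=1, max_workers=4, idle_minutes=30)
print(spec)
```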
For more information about the cost of this scenario, see this monthly cost estimate.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal authors:
- Hong Bu | Senior Program Manager
- Chenshu Cai | Software Engineer
- Anuj Parashar | Senior Data Engineer
- Bo Wang | Software Engineer
- Gary Wang | Principal Software Engineer
Other contributor:
- Mick Alberts | Technical Writer
Next steps
- What is Azure Cosmos DB for Apache Gremlin?
- The Leading Graph Data Platform on Microsoft Azure
- Graph processing with SQL Server and Azure SQL Database
- Use Delta Lake change data feed on Azure Databricks
- How to Simplify CDC With Delta Lake's Change Data Feed
- PostgreSQL Graph Search Practices - 10 Billion-Scale Graph with Millisecond Response
- Azure security baseline for Azure Databricks
- Databases architecture design