DP-203: Data Engineering on Microsoft Azure
Question #: 30
Topic #: 4
A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices.
The company must be able to monitor the devices in real time.
You need to design the solution.
What should you recommend?
A. Azure Analysis Services using Azure PowerShell
B. Azure Data Factory instance using Azure PowerShell
C. Azure Stream Analytics cloud job using Azure Portal
D. Azure Data Factory instance using Microsoft Visual Studio
Selected Answer: C
Question #: 31
Topic #: 2
You are developing a solution that will stream data to Azure Stream Analytics. The solution will have both streaming data and reference data.
Which input type should you use for the reference data?
A. Azure Cosmos DB
B. Azure Blob storage
C. Azure IoT Hub
D. Azure Event Hubs
Selected Answer: B
Question #: 32
Topic #: 2
You are designing an Azure Stream Analytics job to process incoming events from sensors in retail environments.
You need to process the events to produce a running average of shopper counts during the previous 15 minutes, calculated at five-minute intervals.
Which type of window should you use?
A. snapshot
B. tumbling
C. hopping
D. sliding
Selected Answer: C
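A hopping window is correct because the 15-minute windows overlap, repeating every five minutes. A minimal Stream Analytics query sketch, assuming hypothetical input, timestamp, and column names (SensorInput, EventTime, ShopperCount):
SELECT System.Timestamp() AS WindowEnd, AVG(ShopperCount) AS AvgShopperCount
FROM SensorInput TIMESTAMP BY EventTime
GROUP BY HoppingWindow(minute, 15, 5) -- 15-minute window, advancing every 5 minutes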
Question #: 32
Topic #: 1
You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.
What should you recommend?
A. JSON
B. Parquet
C. CSV
D. Avro
Selected Answer: B
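Parquet is a columnar, compressed format that stores column types in its metadata, which is why both Databricks and serverless SQL can query it quickly with type information intact. A hedged sketch of reading such files from a serverless SQL pool, with a placeholder storage path:
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/<container>/output/*.parquet', -- placeholder path
    FORMAT = 'PARQUET' -- column types are read from the Parquet metadata
) AS rows;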
Question #: 33
Topic #: 1
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions.
You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times.
What should you do?
A. Insert the data from stg.Sales into dbo.Sales.
B. Switch the first partition from dbo.Sales to stg.Sales.
C. Switch the first partition from stg.Sales to dbo.Sales.
D. Update dbo.Sales from stg.Sales.
Selected Answer: C
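Switching a partition is a metadata-only operation, so it completes almost instantly regardless of partition size. A minimal T-SQL sketch; the TRUNCATE_TARGET option discards the existing rows in the target partition so it is overwritten:
ALTER TABLE stg.Sales SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON); -- metadata-only overwrite, no data movement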
Question #: 33
Topic #: 4
You have a partitioned table in an Azure Synapse Analytics dedicated SQL pool.
You need to design queries to maximize the benefits of partition elimination.
What should you include in the Transact-SQL queries?
A. JOIN
B. WHERE
C. DISTINCT
D. GROUP BY
Selected Answer: B
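Partition elimination happens when the WHERE clause filters on the partitioning column, letting the engine skip entire partitions at scan time. A sketch, assuming a hypothetical fact table partitioned on OrderDateKey:
SELECT SUM(SalesAmount)
FROM dbo.FactSales
WHERE OrderDateKey BETWEEN 20210101 AND 20210131; -- only partitions covering this range are scanned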
Question #: 34
Topic #: 1
You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.
You plan to keep a record of changes to the available fields.
The supplier data contains the following columns.
Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. surrogate primary key
B. effective start date
C. business key
D. last modified date
E. effective end date
F. foreign key
Selected Answer: ABE
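A Type 2 SCD adds a new row for every change, so it needs a surrogate key to distinguish the versions of a supplier and effective dates to bound each version's validity. A minimal sketch with hypothetical column names:
CREATE TABLE dbo.DimSupplier
(
    SupplierKey INT IDENTITY(1,1) NOT NULL, -- surrogate primary key
    SupplierID INT NOT NULL,                -- business key carried over from the source
    SupplierName NVARCHAR(100) NULL,
    EffectiveStartDate DATE NOT NULL,       -- when this version became current
    EffectiveEndDate DATE NULL              -- NULL (or 9999-12-31) marks the current row
);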
Question #: 34
Topic #: 2
You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day.
You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times.
What should you include in the solution?
A. Partition by DateTime fields.
B. Sink to Azure Queue storage.
C. Include a watermark column.
D. Use a JSON format for physical data storage.
Selected Answer: C
Question #: 34
Topic #: 3
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 that contains a table named Sales.
Sales has row-level security (RLS) applied. RLS uses the following predicate filter.
A user named SalesUser1 is assigned the db_datareader role for Pool1.
Which rows in the Sales table are returned when SalesUser1 queries the table?
A. only the rows for which the value in the User_Name column is SalesUser1
B. all the rows
C. only the rows for which the value in the SalesRep column is Manager
D. only the rows for which the value in the SalesRep column is SalesUser1
Selected Answer: D
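The predicate image is not reproduced here, but the standard pattern compares the SalesRep column to USER_NAME(), so each user sees only their own rows. A representative sketch of such a filter, with hypothetical schema and function names:
CREATE FUNCTION Security.fn_securitypredicate(@SalesRep AS NVARCHAR(128))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result WHERE @SalesRep = USER_NAME();

CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE Security.fn_securitypredicate(SalesRep) ON dbo.Sales
WITH (STATE = ON);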
Question #: 34
Topic #: 4
You have an Azure Stream Analytics query. The query returns a result set that contains 10,000 distinct values for a column named clusterID.
You monitor the Stream Analytics job and discover high latency.
You need to reduce the latency.
Which two actions should you perform? Each correct answer presents a complete solution.
NOTE: Each correct selection is worth one point.
A. Add a pass-through query.
B. Increase the number of streaming units.
C. Add a temporal analytic function.
D. Scale out the query by using PARTITION BY.
E. Convert the query to a reference query.
Selected Answer: BD
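PARTITION BY lets each partition of the stream be processed in parallel, and additional streaming units supply the compute for those parallel paths. A sketch, assuming a hypothetical input named Input whose events arrive partitioned on clusterID:
SELECT clusterID, COUNT(*) AS EventCount
FROM Input PARTITION BY clusterID -- one parallel processing path per partition
GROUP BY clusterID, TumblingWindow(minute, 1)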
Question #: 35
Topic #: 4
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and a database named DB1. DB1 contains a fact table named Table1.
You need to identify the extent of the data skew in Table1.
What should you do in Synapse Studio?
A. Connect to the built-in pool and query sys.dm_pdw_nodes_db_partition_stats.
B. Connect to the built-in pool and run DBCC CHECKALLOC.
C. Connect to Pool1 and query sys.dm_pdw_node_status.
D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.
Selected Answer: D
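Uneven row counts across the 60 distributions indicate skew. A hedged query sketch against the DMV; the joins to table metadata needed to filter on Table1 are omitted for brevity:
SELECT distribution_id, SUM(row_count) AS total_rows
FROM sys.dm_pdw_nodes_db_partition_stats
GROUP BY distribution_id
ORDER BY total_rows DESC; -- a large spread between distributions suggests skew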
Question #: 36
Topic #: 4
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a fact table named Table1.
You need to identify the extent of the data skew in Table1.
What should you do in Synapse Studio?
A. Connect to Pool1 and DBCC PDW_SHOWSPACEUSED.
B. Connect to the built-in pool and run DBCC PDW_SHOWSPACEUSED.
C. Connect to the built-in pool and run DBCC CHECKALLOC.
D. Connect to the built-in pool and query sys.dm_pdw_sys_info.
Selected Answer: A
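DBCC PDW_SHOWSPACEUSED must run against the dedicated pool itself; the built-in serverless pool has no distributions to report on. A sketch for the table assumed to be named dbo.Table1:
DBCC PDW_SHOWSPACEUSED('dbo.Table1'); -- one row per distribution; compare ROWS values to gauge skew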
Question #: 36
Topic #: 2
You have an Azure Databricks workspace named workspace1 in the Standard pricing tier.
You need to configure workspace1 to support autoscaling all-purpose clusters. The solution must meet the following requirements:
• Automatically scale down workers when the cluster is underutilized for three minutes.
• Minimize the time it takes to scale to the maximum number of workers.
• Minimize costs.
What should you do first?
A. Enable container services for workspace1.
B. Upgrade workspace1 to the Premium pricing tier.
C. Set Cluster Mode to High Concurrency.
D. Create a cluster policy in workspace1.
Selected Answer: B
Question #: 36
Topic #: 3
You have an Azure subscription that is linked to a tenant in Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra. The tenant contains a security group named Group1. The subscription contains an Azure Data Lake Storage account named myaccount1. The myaccount1 account contains two containers named container1 and container2.
You need to grant Group1 read access to container1. The solution must use the principle of least privilege.
Which role should you assign to Group1?
A. Storage Table Data Reader for myaccount1
B. Storage Blob Data Reader for container1
C. Storage Blob Data Reader for myaccount1
D. Storage Table Data Reader for container1
Selected Answer: B
Question #: 37
Topic #: 1
You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following specifications:
• Contain sales data for 20,000 products.
• Use hash distribution on a column named ProductID.
• Contain 2.4 billion records for the years 2019 and 2020.
Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?
A. 40
B. 240
C. 400
D. 2,400
Selected Answer: A
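The rule of thumb is that a clustered columnstore index compresses best with about 1 million rows per rowgroup, and a dedicated SQL pool always spreads a table across 60 distributions. A worked estimate:
2,400,000,000 rows ÷ 60 distributions ÷ 1,000,000 rows per rowgroup ≈ 40 partitions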
Question #: 37
Topic #: 2
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a tumbling window, and you set the window size to 10 seconds.
Does this meet the goal?
A. Yes
B. No
Selected Answer: A
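Tumbling windows are fixed-size, contiguous, and non-overlapping, so every tweet falls into exactly one window. A query sketch, with hypothetical input and timestamp names (TwitterStream, CreatedAt):
SELECT System.Timestamp() AS WindowEnd, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY TumblingWindow(second, 10) -- each event is counted exactly once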
Question #: 37
Topic #: 4
You use Azure Data Lake Storage Gen2.
You need to ensure that workloads can use filter predicates and column projections to filter data at the time the data is read from disk.
Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Reregister the Azure Storage resource provider.
B. Create a storage policy that is scoped to a container.
C. Reregister the Microsoft Data Lake Store resource provider.
D. Create a storage policy that is scoped to a container prefix filter.
E. Register the query acceleration feature.
Selected Answer: BE
Question #: 37
Topic #: 3
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named dbo.Users.
You need to prevent a group of users from reading user email addresses from dbo.Users.
What should you use?
A. column-level security
B. row-level security (RLS)
C. Transparent Data Encryption (TDE)
D. dynamic data masking
Selected Answer: A
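Column-level security is applied by granting SELECT on an explicit column list (or denying the sensitive column), so the group can read everything except the email address. A sketch with hypothetical column and role names:
GRANT SELECT ON dbo.Users (UserId, UserName, City) TO SalesReaders; -- email column omitted
DENY SELECT ON dbo.Users (Email) TO SalesReaders; -- or deny just the sensitive column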
Question #: 38
Topic #: 4
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a fact table named Table1.
You need to identify the extent of the data skew in Table1.
What should you do in Synapse Studio?
A. Connect to Pool1 and run DBCC PDW_SHOWSPACEUSED.
B. Connect to the built-in pool and run DBCC PDW_SHOWSPACEUSED.
C. Connect to Pool1 and run DBCC CHECKALLOC.
D. Connect to the built-in pool and query sys.dm_pdw_sys_info.
Selected Answer: A
Question #: 38
Topic #: 2
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a session window that uses a timeout size of 10 seconds.
Does this meet the goal?
A. Yes
B. No
Selected Answer: B
Question #: 39
Topic #: 1
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT SupplierKey, StockItemKey, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on DateKey
Selected Answer: D
Question #: 39
Topic #: 4
You have an Azure Data Lake Storage Gen2 account that contains two folders named Folder1 and Folder2.
You use Azure Data Factory to copy multiple files from Folder1 to Folder2.
You receive the following error.
Operation on target Copy_sks failed: Failure happened on 'Sink' side. ErrorCode=DelimitedTextMoreColumnsThanDefined, 'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Error found when processing 'Csv/Tsv Format Text' source '0_2020_11_09_11_43_32.avro' with row number 53: found more columns than expected column count 27., Source=Microsoft.DataTransfer.Common,'
What should you do to resolve the error?
A. Change the Copy activity setting to Binary Copy.
B. Lower the degree of copy parallelism.
C. Add an explicit mapping.
D. Enable fault tolerance to skip incompatible rows.
Selected Answer: A
Question #: 39
Topic #: 3
You have an Azure subscription that contains a storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool. The storage1 account contains a CSV file that requires an account key for access.
You plan to read the contents of the CSV file by using an external table.
You need to create an external data source for the external table.
What should you create first?
A. a database role
B. a database scoped credential
C. a database view
D. an external file format
Selected Answer: B
Question #: 39
Topic #: 2
You use Azure Stream Analytics to receive data from Azure Event Hubs and to output the data to an Azure Blob Storage account.
You need to output the count of records received from the last five minutes every minute.
Which windowing function should you use?
A. Session
B. Tumbling
C. Sliding
D. Hopping
Selected Answer: D
Question #: 40
Topic #: 1
You are implementing a batch dataset in the Parquet format. Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool.
You need to minimize storage costs for the solution.
What should you do?
A. Use Snappy compression for the files.
B. Use OPENROWSET to query the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files.
D. Store all data as string in the Parquet files.
Selected Answer: C
Question #: 40
Topic #: 3
You have a tenant in Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra. The tenant contains a group named Group1. You have an Azure subscription that contains the resources shown in the following table.
You need to ensure that members of Group1 can read CSV files from storage1 by using the OPENROWSET function. The solution must meet the following requirements:
• The members of Group1 must use credential1 to access storage1.
• The principle of least privilege must be followed.
Which permission should you grant to Group1?
A. EXECUTE
B. CONTROL
C. REFERENCES
D. SELECT
Selected Answer: C
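Reading through a credential with OPENROWSET requires only REFERENCES on that credential, which is far narrower than CONTROL. A sketch, assuming credential1 is a database scoped credential that already exists:
GRANT REFERENCES ON DATABASE SCOPED CREDENTIAL::credential1 TO Group1;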
Question #: 40
Topic #: 4
A company plans to use Apache Spark analytics to analyze intrusion detection data.
You need to recommend a solution to analyze network and system activity data for malicious activities and policy violations. The solution must minimize administrative efforts.
What should you recommend?
A. Azure HDInsight
B. Azure Data Factory
C. Azure Data Lake Storage
D. Azure Databricks
Selected Answer: D
Question #: 41
Topic #: 3
You have an Azure subscription that contains an Azure Data Lake Storage account named dl1 and an Azure Synapse Analytics workspace named workspace1.
You need to query the data in dl1 by using an Apache Spark pool named Pool1 in workspace1. The solution must ensure that the data is accessible to Pool1.
Which two actions achieve the goal? Each correct answer presents a complete solution.
NOTE: Each correct answer is worth one point.
A. Implement Azure Synapse Link.
B. Load the data to the primary storage account of workspace1.
C. From workspace1, create a linked service for dl1.
D. From Microsoft Purview, register dl1 as a data source.
Selected Answer: BC
Question #: 42
Topic #: 4
You have an Azure Data Factory pipeline named pipeline1 that includes a Copy activity named Copy1. Copy1 has the following configurations:
• The source of Copy1 is a table in an on-premises Microsoft SQL Server instance that is accessed by using a linked service connected via a self-hosted integration runtime.
• The sink of Copy1 uses a table in an Azure SQL database that is accessed by using a linked service connected via an Azure integration runtime.
You need to maximize the amount of compute resources available to Copy1. The solution must minimize administrative effort.
What should you do?
A. Scale out the self-hosted integration runtime.
B. Scale up the data flow runtime of the Azure integration runtime and scale out the self-hosted integration runtime.
C. Scale up the data flow runtime of the Azure integration runtime.
Selected Answer: A
Question #: 42
Topic #: 3
You have an Azure Synapse Analytics dedicated SQL pool named SQL1 and a user named User1.
You need to ensure that User1 can view requests associated with SQL1 by querying the sys.dm_pdw_exec_requests dynamic management view. The solution must follow the principle of least privilege.
Which permission should you grant to User1?
A. VIEW DATABASE STATE
B. SHOWPLAN
C. CONTROL SERVER
D. VIEW ANY DATABASE
Selected Answer: A
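A sketch of the grant; VIEW DATABASE STATE exposes the database-level DMVs without the sweeping rights of CONTROL SERVER:
GRANT VIEW DATABASE STATE TO User1;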
Question #: 42
Topic #: 1
You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and employee transactions.
From a source system, you have a flat extract that has the following fields:
• EmployeeID
• FirstName
• LastName
• Recipient
• GrossAmount
• TransactionID
• GovernmentID
• NetAmountPaid
• TransactionDate
You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.
Which two tables should you create? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. a dimension table for Transaction
B. a dimension table for EmployeeTransaction
C. a dimension table for Employee
D. a fact table for Employee
E. a fact table for Transaction
Selected Answer: CE
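The descriptive employee attributes form the dimension, and the measurable transaction fields form the fact table that references it by key. A minimal sketch with illustrative column placements and distribution choices:
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey INT IDENTITY(1,1) NOT NULL, -- surrogate key
    EmployeeID INT NOT NULL,                -- business key from the extract
    FirstName NVARCHAR(50) NULL,
    LastName NVARCHAR(50) NULL,
    GovernmentID NVARCHAR(20) NULL
)
WITH (DISTRIBUTION = REPLICATE); -- small dimension, replicated to every distribution

CREATE TABLE dbo.FactTransaction
(
    TransactionID BIGINT NOT NULL,
    EmployeeKey INT NOT NULL, -- references dbo.DimEmployee
    Recipient NVARCHAR(100) NULL,
    TransactionDate DATE NOT NULL,
    GrossAmount DECIMAL(18, 2) NULL,
    NetAmountPaid DECIMAL(18, 2) NULL
)
WITH (DISTRIBUTION = HASH(EmployeeKey), CLUSTERED COLUMNSTORE INDEX);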
Question #: 43
Topic #: 3
You have a Microsoft Entra tenant.
The tenant contains an Azure Data Lake Storage Gen2 account named storage1 that has two containers named fs1 and fs2.
You have a Microsoft Entra group named DepartmentA.
You need to meet the following requirements:
• DepartmentA must be able to read, write, and list all the files in fs1.
• DepartmentA must be prevented from accessing any files in fs2.
• The solution must use the principle of least privilege.
Which role should you assign to DepartmentA?
A. Contributor for fs1
B. Storage Blob Data Owner for fs1
C. Storage Blob Data Contributor for storage1
D. Storage Blob Data Contributor for fs1
Selected Answer: D
Question #: 43
Topic #: 4
You are designing a solution that will use tables in Delta Lake on Azure Databricks.
You need to minimize how long it takes to perform the following:
• Queries against non-partitioned tables
• Joins on non-partitioned columns
Which two options should you include in the solution? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. the clone command
B. Z-Ordering
C. Apache Spark caching
D. dynamic file pruning (DFP)
Selected Answer: BD
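Z-Ordering co-locates related values in the same data files, which lets dynamic file pruning skip files at query time, including during joins on non-partitioned columns. A Spark SQL sketch with a hypothetical table and column:
OPTIMIZE events ZORDER BY (deviceId); -- cluster files by the frequently joined or filtered column
SET spark.databricks.optimizer.dynamicFilePruning = true; -- typically on by default; shown for clarity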
Question #: 43
Topic #: 1
You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the history of the data by adding new rows as the data changes.
Which type of slowly changing dimension (SCD) should you use?
A. Type 0
B. Type 1
C. Type 2
D. Type 3
Selected Answer: C
Question #: 44
Topic #: 4
You have an Azure Data Lake Storage Gen2 account named account1 that contains a container named container1.
You plan to create lifecycle management policy rules for container1.
You need to ensure that you can create rules that will move blobs between access tiers based on when each blob was accessed last.
What should you do first?
A. Configure object replication
B. Create an Azure application
C. Enable access time tracking
D. Enable the hierarchical namespace
Selected Answer: C
Question #: 44
Topic #: 2
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
• A workload for data engineers who will use Python and SQL.
• A workload for jobs that will run notebooks that use Python, Scala, and SQL.
• A workload that data scientists will use to perform ad hoc analysis in Scala and R.
The enterprise architecture team at your company identifies the following standards for Databricks environments:
• The data engineers must share a cluster.
• The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.
• All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.
You need to create the Databricks clusters for the workloads.
Solution: You create a Standard cluster for each data scientist, a Standard cluster for the data engineers, and a High Concurrency cluster for the jobs.
Does this meet the goal?
A. Yes
B. No
Selected Answer: B