Certified Data Engineer Associate Topic 2
Question #: 112
Topic #: 1
A data engineer is working with two tables. Each of these tables is displayed below in its entirety.
The data engineer runs the following query to join these tables together:
A.
B.
C.
D.
Selected Answer: C
Question #: 77
Topic #: 1
A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.
Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?
A. None of these changes will need to be made
B. The pipeline will need to stop using the medallion-based multi-hop architecture
C. The pipeline will need to be written entirely in SQL
D. The pipeline will need to use a batch source in place of a streaming source
E. The pipeline will need to be written entirely in Python
Selected Answer: A
Question #: 45
Topic #: 1
A new data engineering team has been assigned to work on a project. The team will need access to the database customers in order to see what tables already exist. The team has its own group, named team.
Which of the following commands can be used to grant the necessary permission on the entire database to the new team?
A. GRANT VIEW ON CATALOG customers TO team;
B. GRANT CREATE ON DATABASE customers TO team;
C. GRANT USAGE ON CATALOG team TO customers;
D. GRANT CREATE ON DATABASE team TO customers;
E. GRANT USAGE ON DATABASE customers TO team;
Selected Answer: E
Question #: 101
Topic #: 1
A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.
Which keyword can be used to compact the small files?
A. OPTIMIZE
B. VACUUM
C. COMPACTION
D. REPARTITION
Selected Answer: A
Question #: 91
Topic #: 1
Which data lakehouse feature results in improved data quality over a traditional data lake?
A. A data lakehouse stores data in open formats.
B. A data lakehouse allows the use of SQL queries to examine data.
C. A data lakehouse provides storage solutions for structured and unstructured data.
D. A data lakehouse supports ACID-compliant transactions.
Selected Answer: D
Question #: 74
Topic #: 1
Which of the following must be specified when creating a new Delta Live Tables pipeline?
A. A key-value pair configuration
B. The preferred DBU/hour cost
C. A path to cloud storage location for the written data
D. A location of a target database for the written data
E. At least one notebook library to be executed
Selected Answer: E
Question #: 6
Topic #: 1
Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the existing Delta table my_table and save the updated table?
A. SELECT * FROM my_table WHERE age > 25;
B. UPDATE my_table WHERE age > 25;
C. DELETE FROM my_table WHERE age > 25;
D. UPDATE my_table WHERE age <= 25;
E. DELETE FROM my_table WHERE age <= 25;
Selected Answer: C
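The DELETE semantics can be sketched with standard SQL. This is a minimal illustration using sqlite3 as a stand-in engine; on Databricks the same statement would be run against the Delta table via Spark SQL, and the table/column names are taken from the question.

```python
import sqlite3

# sqlite3 is only a stand-in here to illustrate DELETE semantics;
# a Delta table accepts the same statement through Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [("ana", 22), ("bo", 30), ("cy", 25)])

# Remove rows where age is greater than 25 and persist the change.
conn.execute("DELETE FROM my_table WHERE age > 25")
conn.commit()

remaining = conn.execute(
    "SELECT name, age FROM my_table ORDER BY age").fetchall()
print(remaining)  # [('ana', 22), ('cy', 25)]
```

Note that a row with age exactly 25 survives, since the predicate is strictly greater-than.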
Question #: 5
Topic #: 1
Which of the following describes the storage organization of a Delta table?
A. Delta tables are stored in a single file that contains data, history, metadata, and other attributes.
B. Delta tables store their data in a single file and all metadata in a collection of files in a separate location.
C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.
D. Delta tables are stored in a collection of files that contain only the data stored within the table.
E. Delta tables are stored in a single file that contains only the data stored within the table.
Selected Answer: C
Question #: 4
Topic #: 1
Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta Lake?
A. The ability to manipulate the same data using a variety of languages
B. The ability to collaborate in real time on a single notebook
C. The ability to set up alerts for query failures
D. The ability to support batch and streaming workloads
E. The ability to distribute complex data operations
Selected Answer: D
Question #: 2
Topic #: 1
Which of the following describes a scenario in which a data team will want to utilize cluster pools?
A. An automated report needs to be refreshed as quickly as possible.
B. An automated report needs to be made reproducible.
C. An automated report needs to be tested to identify errors.
D. An automated report needs to be version-controlled across multiple collaborators.
E. An automated report needs to be runnable by all stakeholders.
Selected Answer: A
Question #: 1
Topic #: 1
A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame.
Which of the following describes how a data lakehouse could alleviate this issue?
A. Both teams would autoscale their work as data size evolves
B. Both teams would use the same source of truth for their work
C. Both teams would reorganize to report to the same department
D. Both teams would be able to collaborate on projects in real-time
E. Both teams would respond more quickly to ad-hoc requests
Selected Answer: B
Question #: 128
Topic #: 1
A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which approach can the data engineer use to set up the new task?
A. They can clone the existing task in the existing Job and update it to run the new notebook.
B. They can create a new task in the existing Job and then add it as a dependency of the original task.
C. They can create a new task in the existing Job and then add the original task as a dependency of the new task.
D. They can create a new job from scratch and add both tasks to run concurrently.
Selected Answer: B
Question #: 90
Topic #: 1
A new data engineering team has been assigned to an ELT project. The team, whose group is named team, will need full privileges on the table sales to fully manage the project.
Which of the following commands can be used to grant full permissions on the table to the new data engineering team?
A. GRANT ALL PRIVILEGES ON TABLE sales TO team;
B. GRANT SELECT CREATE MODIFY ON TABLE sales TO team;
C. GRANT SELECT ON TABLE sales TO team;
D. GRANT USAGE ON TABLE sales TO team;
E. GRANT ALL PRIVILEGES ON TABLE team TO sales;
Selected Answer: A
Question #: 88
Topic #: 1
A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.
Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?
A. They can turn on the Auto Stop feature for the SQL endpoint.
B. They can ensure the dashboard's SQL endpoint is not one of the included query's SQL endpoint.
C. They can reduce the cluster size of the SQL endpoint.
D. They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.
E. They can set up the dashboard's SQL endpoint to be serverless.
Selected Answer: A
Question #: 18
Topic #: 1
A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.
Which of the following approaches could be used by the data engineering team to complete this task?
A. They could submit a feature request with Databricks to add this functionality.
B. They could wrap the queries using PySpark and use Python’s control flow system to determine when to run the final query.
C. They could only run the entire program on Sundays.
D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.
E. They could redesign the data model to separate the data used in the final query into a new table.
Selected Answer: B
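The PySpark-wrapping approach amounts to ordinary Python control flow around the queries. A minimal sketch, with placeholder query strings (the actual queries are not given in the question):

```python
import datetime

def should_run_final_query(today: datetime.date) -> bool:
    # datetime.date.weekday() returns 6 for Sunday
    return today.weekday() == 6

# Placeholder queries; in practice each would be run via spark.sql(...)
queries = ["SELECT 1", "SELECT 2", "FINAL SELECT"]

def run_program(today: datetime.date) -> list:
    executed = list(queries[:-1])        # daily queries always run
    if should_run_final_query(today):
        executed.append(queries[-1])     # final query only on Sundays
    return executed

print(len(run_program(datetime.date(2024, 1, 7))))  # Sunday -> 3
print(len(run_program(datetime.date(2024, 1, 8))))  # Monday -> 2
```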
Question #: 109
Topic #: 1
What is a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
A. Parquet files can be partitioned
B. Parquet files will become Delta tables
C. Parquet files have a well-defined schema
D. Parquet files have the ability to be optimized
Selected Answer: C
Question #: 106
Topic #: 1
A data engineer has created a new database using the following command:
CREATE DATABASE IF NOT EXISTS customer360;
In which location will the customer360 database be located?
A. dbfs:/user/hive/database/customer360
B. dbfs:/user/hive/warehouse
C. dbfs:/user/hive/customer360
D. dbfs:/user/hive/database
Selected Answer: B
Question #: 105
Topic #: 1
A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which command could the data engineering team use to access sales in PySpark?
A. SELECT * FROM sales
B. spark.table("sales")
C. spark.sql("sales")
D. spark.delta.table("sales")
Selected Answer: B
Question #: 103
Topic #: 1
A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:
After running the command today, the data engineer notices that the number of records in table transactions has not changed.
What explains why the statement might not have copied any new records into the table?
A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
B. The COPY INTO statement requires the table to be refreshed to view the copied rows.
C. The previous day’s file has already been copied into the table.
D. The PARQUET file format does not support COPY INTO.
Selected Answer: C
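COPY INTO is idempotent: files that have already been loaded are skipped on re-runs, which is why re-running the statement on the same file adds no records. A hypothetical Python sketch of that bookkeeping (not the actual implementation):

```python
# Sketch of COPY INTO idempotency: track already-loaded files and
# skip them, so re-running the same copy adds nothing.
loaded_files = set()
table_rows = []

def copy_into(files_with_rows: dict) -> int:
    new = 0
    for path, rows in files_with_rows.items():
        if path in loaded_files:      # already-ingested file: skipped
            continue
        table_rows.extend(rows)
        loaded_files.add(path)
        new += len(rows)
    return new

# Hypothetical file name under the question's source location
day1 = {"/transactions/raw/day1.parquet": [("t1", 9.99)]}
print(copy_into(day1))  # 1 -> first run loads the file
print(copy_into(day1))  # 0 -> rerun copies nothing; file already loaded
```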
Question #: 104
Topic #: 1
Which command can be used to write data into a Delta table while avoiding the writing of duplicate records?
A. DROP
B. INSERT
C. MERGE
D. APPEND
Selected Answer: C
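MERGE avoids duplicates by matching incoming rows against the target on a key: matched rows are updated in place, only unmatched rows are inserted. A plain-Python sketch of those semantics (key/value names are illustrative only):

```python
# Sketch of MERGE semantics with a dict keyed by the merge condition:
# WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT,
# so the same key is never appended twice.
def merge(target: dict, source: list) -> dict:
    for key, value in source:
        target[key] = value
    return target

table = {"a1": 9.4}
merge(table, [("a1", 9.5), ("b2", 7.1)])
print(table)  # {'a1': 9.5, 'b2': 7.1} -> 'a1' updated, not duplicated
```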
Question #: 100
Topic #: 1
A data engineer has been given a new record of data:
id STRING = 'a1'
rank INTEGER = 6
rating FLOAT = 9.4
Which SQL commands can be used to append the new record to an existing Delta table my_table?
A. INSERT INTO my_table VALUES ('a1', 6, 9.4)
B. INSERT VALUES ('a1', 6, 9.4) INTO my_table
C. UPDATE my_table VALUES ('a1', 6, 9.4)
D. UPDATE VALUES ('a1', 6, 9.4) my_table
Selected Answer: A
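The INSERT INTO ... VALUES shape from option A can be demonstrated with any SQL engine. Here sqlite3 stands in for Spark SQL purely to show the syntax; on Databricks the same statement appends the record to the Delta table:

```python
import sqlite3

# sqlite3 as a stand-in: the INSERT INTO ... VALUES statement below
# has the same shape as option A against a Delta table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id TEXT, rank INTEGER, rating REAL)")
conn.execute("INSERT INTO my_table VALUES ('a1', 6, 9.4)")

row = conn.execute("SELECT * FROM my_table").fetchone()
print(row)  # ('a1', 6, 9.4)
```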
Question #: 99
Topic #: 1
A data architect has determined that a table of the following format is necessary:
Which code block is used by SQL DDL command to create an empty Delta table in the above format regardless of whether a table already exists with this name?
A. CREATE OR REPLACE TABLE table_name ( employeeId STRING, startDate DATE, avgRating FLOAT )
B. CREATE OR REPLACE TABLE table_name WITH COLUMNS ( employeeId STRING, startDate DATE, avgRating FLOAT ) USING DELTA
C. CREATE TABLE IF NOT EXISTS table_name ( employeeId STRING, startDate DATE, avgRating FLOAT )
D. CREATE TABLE table_name AS SELECT employeeId STRING, startDate DATE, avgRating FLOAT
Selected Answer: A
Question #: 97
Topic #: 1
A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.
Which Git operation does the data engineer need to run to accomplish this task?
A. Clone
B. Pull
C. Merge
D. Push
Selected Answer: B
Question #: 92
Topic #: 1
In which scenario will a data team want to utilize cluster pools?
A. An automated report needs to be version-controlled across multiple collaborators.
B. An automated report needs to be runnable by all stakeholders.
C. An automated report needs to be refreshed as quickly as possible.
D. An automated report needs to be made reproducible.
Selected Answer: C
Question #: 98
Topic #: 1
Which file format does Delta Lake use to store a table's data?
A. CSV
B. Parquet
C. JSON
D. Delta
Selected Answer: B
Question #: 75
Topic #: 1
A data engineer has joined an existing project and they see the following query in the project repository:
CREATE STREAMING LIVE TABLE loyal_customers AS
SELECT customer_id
FROM STREAM(LIVE.customers)
WHERE loyalty_level = 'high';
Which of the following describes why the STREAM function is included in the query?
A. The STREAM function is not needed and will cause an error.
B. The table being created is a live table.
C. The customers table is a streaming live table.
D. The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.
E. The data in the customers table has been updated since its last run.
Selected Answer: C
Question #: 37
Topic #: 1
A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which of the following approaches can the data engineer use to set up the new task?
A. They can clone the existing task in the existing Job and update it to run the new notebook.
B. They can create a new task in the existing Job and then add it as a dependency of the original task.
C. They can create a new task in the existing Job and then add the original task as a dependency of the new task.
D. They can create a new job from scratch and add both tasks to run concurrently.
E. They can clone the existing task to a new Job and then edit it to run the new notebook.
Selected Answer: B
Question #: 16
Topic #: 1
Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?
A. DROP
B. IGNORE
C. MERGE
D. APPEND
E. INSERT
Selected Answer: C
Question #: 81
Topic #: 1
Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?
A. Silver tables contain a less refined, less clean view of data than Bronze data.
B. Silver tables contain aggregates while Bronze data is unaggregated.
C. Silver tables contain more data than Bronze tables.
D. Silver tables contain a more refined and cleaner view of data than Bronze tables.
E. Silver tables contain less data than Bronze tables.
Selected Answer: D
Question #: 80
Topic #: 1
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
What is the expected behavior when a batch of data containing data that violates these constraints is processed?
A. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
B. Records that violate the expectation cause the job to fail.
C. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
E. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
Selected Answer: B
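With ON VIOLATION FAIL UPDATE, the first violating record stops the pipeline update: nothing from the batch is committed, unlike DROP ROW (which discards violators) or a plain EXPECT (which records them and continues). A hypothetical sketch of that behavior:

```python
# Sketch of ON VIOLATION FAIL UPDATE: a single violating record
# aborts the whole batch instead of being dropped or flagged.
def process_batch(records: list) -> int:
    for r in records:
        if not (r["timestamp"] > "2020-01-01"):   # the expectation
            raise ValueError(f"valid_timestamp violated by {r}")
    return len(records)

print(process_batch([{"timestamp": "2021-05-01"}]))  # 1 -> batch passes

failed = False
try:
    process_batch([{"timestamp": "2019-12-31"}])
except ValueError:
    failed = True
print(failed)  # True -> the update fails; no records are written
```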
Question #: 78
Topic #: 1
A data engineer is using the following code block as part of a batch ingestion pipeline to read from a table:
Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?
A. Replace predict with a stream-friendly prediction function
B. Replace schema(schema) with option("maxFilesPerTrigger", 1)
C. Replace "transactions" with the path to the location of the Delta table
D. Replace format("delta") with format("stream")
E. Replace spark.read with spark.readStream
Selected Answer: E
Question #: 76
Topic #: 1
Which of the following describes the type of workloads that are always compatible with Auto Loader?
A. Streaming workloads
B. Machine learning workloads
C. Serverless workloads
D. Batch workloads
E. Dashboard workloads
Selected Answer: A
Question #: 65
Topic #: 1
A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:
DROP TABLE IF EXISTS my_table;
After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.
Which of the following describes why all of these files were deleted?
A. The table was managed
B. The table’s data was smaller than 10 GB
C. The table’s data was larger than 10 GB
D. The table was external
E. The table did not have a location
Selected Answer: A
Question #: 63
Topic #: 1
A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.
They have the following incomplete code block:
____(f"SELECT customer_id, spend FROM {table_name}")
Which of the following can be used to fill in the blank to successfully complete the task?
A. spark.delta.sql
B. spark.delta.table
C. spark.table
D. dbutils.sql
E. spark.sql
Selected Answer: E
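The pattern is an ordinary Python f-string interpolating the variable into the query text before spark.sql executes it. Since Spark isn't assumed available here, the same pattern is shown with sqlite3 as a stand-in engine:

```python
import sqlite3

table_name = "customers"   # the Python variable to splice into the SQL

# sqlite3 stands in for spark.sql(...) to show the f-string pattern.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {table_name} (customer_id TEXT, spend REAL)")
conn.execute(f"INSERT INTO {table_name} VALUES ('c1', 10.0)")

query = f"SELECT customer_id, spend FROM {table_name}"
rows = conn.execute(query).fetchall()
print(rows)  # [('c1', 10.0)]
```

With Spark, the blank is filled the same way: spark.sql(f"SELECT customer_id, spend FROM {table_name}").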
Question #: 62
Topic #: 1
A data engineer needs to apply custom logic to identify employees with more than 5 years of experience in array column employees in table stores. The custom logic should create a new column exp_employees that is an array of all of the employees with more than 5 years of experience for each row. In order to apply this custom logic at scale, the data engineer wants to use the FILTER higher-order function.
Which of the following code blocks successfully completes this task?
A.
B.
C.
D.
E.
Selected Answer: A
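The SQL higher-order FILTER function keeps only the array elements that satisfy a lambda, per row, e.g. FILTER(employees, e -> e.years_exp > 5). A plain-Python sketch of that per-row behavior (the field names years_exp and store are hypothetical, since the actual options are not shown):

```python
# Plain-Python sketch of SQL's higher-order FILTER over an array
# column: keep only employees with more than 5 years of experience.
rows = [
    {"store": "s1", "employees": [{"name": "ana", "years_exp": 7},
                                  {"name": "bo", "years_exp": 3}]},
    {"store": "s2", "employees": [{"name": "cy", "years_exp": 6}]},
]

for row in rows:
    row["exp_employees"] = [e for e in row["employees"]
                            if e["years_exp"] > 5]

print([len(r["exp_employees"]) for r in rows])  # [1, 1]
```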