Certified Data Engineer Associate Topic 3
Question #: 60
Topic #: 1
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
A. SELECT * FROM sales
B. spark.delta.table
C. spark.sql
D. There is no way to share data between PySpark and SQL.
E. spark.table
Selected Answer: C
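A minimal PySpark sketch of option C, assuming the analyst's table is named sales; the column checked in the test is hypothetical. spark.sql runs the SQL text and returns a DataFrame, so the data engineering team can write its tests in Python against the result.

# In a Databricks notebook, the SparkSession is available as the predefined variable "spark".
result_df = spark.sql("SELECT * FROM sales")   # run the analyst's query as-is

# The result is an ordinary DataFrame, so Python-based tests can operate on it directly.
assert result_df.count() > 0, "query returned no rows"
assert "customer_id" in result_df.columns      # hypothetical column used for illustration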
Question #: 61
Topic #: 1
Which of the following commands will return the number of null values in the member_id column?
A. SELECT count(member_id) FROM my_table;
B. SELECT count(member_id) - count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
D. SELECT null(member_id) FROM my_table;
E. SELECT count_null(member_id) FROM my_table;
Selected Answer: C
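A short PySpark sketch of option C, assuming a table named my_table with a member_id column. count_if aggregates the rows for which the condition is true, which here counts the NULL values.

null_count = spark.sql(
    "SELECT count_if(member_id IS NULL) AS null_members FROM my_table"
).first()["null_members"]
print(null_count)   # number of rows where member_id is NULL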
Question #: 57
Topic #: 1
Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
A. TRANSFORM
B. PIVOT
C. SUM
D. CONVERT
E. WHERE
Selected Answer: B
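A sketch of option B run through spark.sql; the table, columns, and quarter values are hypothetical. PIVOT turns the distinct values of a column in the long table into separate columns in the wide result.

# Hypothetical long-format table sales_long with one row per (store, quarter, revenue).
wide_df = spark.sql("""
    SELECT * FROM sales_long
    PIVOT (
        SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
    )
""")
wide_df.show()   # one row per store, with Q1-Q4 as columns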
Question #: 55
Topic #: 1
A data architect has determined that a table of the following format is necessary:
Which of the following code blocks uses SQL DDL commands to create an empty Delta table in the above format regardless of whether a table already exists with this name?
A.
B.
C.
D.
E.
Selected Answer: E
Question #: 54
Topic #: 1
Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?
A. None of these
B. Data lake
C. Data warehouse
D. All of these
E. Data lakehouse
Selected Answer: E
Question #: 53
Topic #: 1
Which of the following is stored in the Databricks customer’s cloud account?
A. Databricks web application
B. Cluster management metadata
C. Repos
D. Data
E. Notebooks
Selected Answer: D
Question #: 52
Topic #: 1
In which of the following file formats is data from Delta Lake tables primarily stored?
A. Delta
B. CSV
C. Parquet
D. JSON
E. A proprietary, optimized format specific to Databricks
Selected Answer: C
Question #: 51
Topic #: 1
A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.
Which of the following keywords can be used to compact the small files?
A. REDUCE
B. OPTIMIZE
C. COMPACTION
D. REPARTITION
E. VACUUM
Selected Answer: B
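A minimal sketch of option B run from PySpark; the table and Z-order column names are illustrative. OPTIMIZE rewrites the table's many small files into fewer, larger files.

spark.sql("OPTIMIZE my_table")                         # compact small files into larger ones
spark.sql("OPTIMIZE my_table ZORDER BY (event_date)")  # optional: also co-locate data by a hypothetical column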
Question #: 50
Topic #: 1
A data engineer has been given a new record of data:
id STRING = 'a1'
rank INTEGER = 6
rating FLOAT = 9.4
Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?
A. INSERT INTO my_table VALUES ('a1', 6, 9.4)
B. my_table UNION VALUES ('a1', 6, 9.4)
C. INSERT VALUES ('a1', 6, 9.4) INTO my_table
D. UPDATE my_table VALUES ('a1', 6, 9.4)
E. UPDATE VALUES ('a1', 6, 9.4) my_table
Selected Answer: A
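Option A run from PySpark, a minimal sketch assuming my_table already exists with the id, rank, and rating columns above.

spark.sql("INSERT INTO my_table VALUES ('a1', 6, 9.4)")   # appends one row to the Delta table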
Question #: 49
Topic #: 1
Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?
A. When they are working interactively with a small amount of data
B. When they are running automated reports to be refreshed as quickly as possible
C. When they are working with SQL within Databricks SQL
D. When they are concerned about the ability to automatically scale with larger data
E. When they are manually running reports with a large amount of data
Selected Answer: A
Question #: 48
Topic #: 1
A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.
In which of the following locations can the data engineer review their permissions on the table?
A. Databricks Filesystem
B. Jobs
C. Dashboards
D. Repos
E. Data Explorer
Selected Answer: E
Question #: 47
Topic #: 1
Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?
A. Cloud-specific integrations
B. Simplified governance
C. Ability to scale storage
D. Ability to scale workloads
E. Avoiding vendor lock-in
Selected Answer: E
Question #: 46
Topic #: 1
A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.
Which of the following Git operations does the data engineer need to run to accomplish this task?
A. Merge
B. Push
C. Pull
D. Commit
E. Clone
Selected Answer: C
Question #: 44
Topic #: 1
A new data engineering team has been assigned to an ELT project. The new data engineering team will need full privileges on the database customers to fully manage the project.
Which of the following commands can be used to grant full permissions on the database to the new data engineering team?
A. GRANT USAGE ON DATABASE customers TO team;
B. GRANT ALL PRIVILEGES ON DATABASE team TO customers;
C. GRANT SELECT PRIVILEGES ON DATABASE customers TO teams;
D. GRANT SELECT CREATE MODIFY USAGE PRIVILEGES ON DATABASE customers TO team;
E. GRANT ALL PRIVILEGES ON DATABASE customers TO team;
Selected Answer: E
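A minimal sketch of option E executed from a notebook; it assumes the database customers and the principal team already exist and that the current user is allowed to grant privileges.

spark.sql("GRANT ALL PRIVILEGES ON DATABASE customers TO team")   # full privileges on the database for the team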
Question #: 43
Topic #: 1
A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.
Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?
A. They can use endpoints available in Databricks SQL
B. They can use jobs clusters instead of all-purpose clusters
C. They can configure the clusters to be single-node
D. They can use clusters that are from a cluster pool
E. They can configure the clusters to autoscale for larger data sizes
Selected Answer: D
Question #: 42
Topic #: 1
A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job’s current run. The data engineer asks a tech lead for help in identifying why this might be the case.
Which of the following approaches can the tech lead use to identify why the notebook is running slowly as part of the Job?
A. They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.
B. They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.
C. They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.
D. There is no way to determine why a Job task is running slowly.
E. They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.
Selected Answer: C
Question #: 39
Topic #: 1
A data analysis team has noticed that their Databricks SQL queries are running too slowly when connected to their always-on SQL endpoint. They claim that this issue is present when many members of the team are running small queries simultaneously. They ask the data engineering team for help. The data engineering team notices that each of the team’s queries uses the same SQL endpoint.
Which of the following approaches can the data engineering team use to improve the latency of the team’s queries?
A. They can increase the cluster size of the SQL endpoint.
B. They can increase the maximum bound of the SQL endpoint’s scaling range.
C. They can turn on the Auto Stop feature for the SQL endpoint.
D. They can turn on the Serverless feature for the SQL endpoint.
E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to “Reliability Optimized.”
Selected Answer: B
Question #: 35
Topic #: 1
Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?
A.
B.
C.
D.
E.
Selected Answer: E
Question #: 34
Topic #: 1
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
A. Unity Catalog
B. Delta Lake
C. Databricks SQL
D. Data Explorer
E. Auto Loader
Selected Answer: E
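A hedged Auto Loader sketch for option E; the file format, paths, and target table name are assumptions. Auto Loader tracks which files in the directory have already been ingested, so existing files are left in place and only new files are processed on each run.

(spark.readStream
    .format("cloudFiles")                                    # Auto Loader source
    .option("cloudFiles.format", "json")                     # assumed format of the incoming files
    .option("cloudFiles.schemaLocation", "/path/to/schema")  # where inferred schema information is stored
    .load("/path/to/source/directory")                       # the shared directory; files are not moved or deleted
    .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")     # records which files have already been ingested
    .toTable("bronze_events"))                               # hypothetical target table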
Question #: 31
Topic #: 1
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?
A. trigger(“5 seconds”)
B. trigger()
C. trigger(once=”5 seconds”)
D. trigger(processingTime=”5 seconds”)
E. trigger(continuous=”5 seconds”)
Selected Answer: D
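A minimal sketch of a streaming write with option D filled in; the source and target table names and checkpoint path are illustrative, since the question's code block is not shown. trigger(processingTime="5 seconds") makes the query run one micro-batch every 5 seconds.

(spark.readStream
    .table("source_table")                                # read the source table as a stream
    .writeStream
    .trigger(processingTime="5 seconds")                  # one micro-batch every 5 seconds
    .option("checkpointLocation", "/path/to/checkpoint")
    .outputMode("append")
    .toTable("target_table"))                             # hypothetical target table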
Question #: 30
Topic #: 1
Which of the following tools is used by Auto Loader to process data incrementally?
A. Checkpointing
B. Spark Structured Streaming
C. Data Explorer
D. Unity Catalog
E. Databricks SQL
Selected Answer: B
Question #: 29
Topic #: 1
Which of the following describes the relationship between Bronze tables and raw data?
A. Bronze tables contain less data than raw data files.
B. Bronze tables contain more truthful data than raw data.
C. Bronze tables contain aggregates while raw data is unaggregated.
D. Bronze tables contain a less refined view of data than raw data.
E. Bronze tables contain raw data with a schema applied.
Selected Answer: E
Question #: 25
Topic #: 1
A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.
Which of the following tools can the data engineer use to solve this problem?
A. Unity Catalog
B. Data Explorer
C. Delta Lake
D. Delta Live Tables
E. Auto Loader
Selected Answer: D
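A Delta Live Tables sketch showing how expectations automate quality monitoring; the table, source, and rule names are hypothetical, and the code runs only as part of a DLT pipeline.

import dlt

@dlt.table
@dlt.expect("valid_member_id", "member_id IS NOT NULL")   # violations are recorded in the pipeline's quality metrics
@dlt.expect_or_drop("positive_amount", "amount > 0")      # rows failing this rule are dropped
def clean_orders():
    return spark.read.table("raw_orders")                 # hypothetical raw source table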
Question #: 13
Topic #: 1
Which of the following commands will return the location of database customer360?
A. DESCRIBE LOCATION customer360;
B. DROP DATABASE customer360;
C. DESCRIBE DATABASE customer360;
D. ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');
E. USE DATABASE customer360;
Selected Answer: C
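Option C run from PySpark; DESCRIBE DATABASE returns the database's metadata, including its location (DESCRIBE DATABASE EXTENDED adds the database properties).

spark.sql("DESCRIBE DATABASE customer360").show(truncate=False)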
Question #: 10
Topic #: 1
A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their project using Databricks Repos.
Which of the following is an advantage of using Databricks Repos over the Databricks Notebooks versioning?
A. Databricks Repos automatically saves development progress
B. Databricks Repos supports the use of multiple branches
C. Databricks Repos allows users to revert to previous versions of a notebook
D. Databricks Repos provides the ability to comment on specific changes
E. Databricks Repos is wholly housed within the Databricks Lakehouse Platform
Selected Answer: B
Question #: 8
Topic #: 1
Which of the following Git operations must be performed outside of Databricks Repos?
A. Commit
B. Pull
C. Push
D. Clone
E. Merge
Selected Answer: E
Question #: 69
Topic #: 1
A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.
They run the following command:
Which of the following lines of code fills in the above blank to successfully complete the task?
A. None of these lines of code are needed to successfully complete the task
B. USING CSV
C. FROM CSV
D. USING DELTA
E. FROM “path/to/csv”
Selected Answer: B
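A hedged reconstruction of the kind of statement the question implies, since the original command is not shown; the table name and options are assumptions. USING CSV tells Databricks to read the table's data directly from the CSV source.

spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_customers
    USING CSV
    OPTIONS (header = 'true', inferSchema = 'true')
    LOCATION '/path/to/csv'
""")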
Question #: 67
Topic #: 1
In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
A. When the location of the data needs to be changed
B. When the target table is an external table
C. When the source table can be deleted
D. When the target table cannot contain duplicate records
E. When the source is not a Delta table
Selected Answer: D
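A minimal MERGE INTO sketch for option D; the table and key column names are illustrative. Matching records are updated rather than inserted again, which keeps the target table free of duplicates, unlike INSERT INTO, which always appends.

spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")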