Certified Data Engineer Professional Topic 2
Question #: 143
Topic #: 1
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, while the max duration for a task is roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
A. Task queueing resulting from improper thread pool assignment.
B. Spill resulting from attached volume storage being too small.
C. Network latency due to some cluster nodes being in different regions from the source data
D. Skew caused by more data being assigned to a subset of spark-partitions.
Selected Answer: D
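A common way to confirm and mitigate this kind of skew is to check the row count per suspected key and let Adaptive Query Execution split skewed partitions at runtime. A minimal sketch, assuming a hypothetical DataFrame df with a skewed customer_id column:

# Quick skew check: how many rows land on each value of the suspected key
from pyspark.sql import functions as F
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(10)

# One mitigation: enable AQE and its skew-join handling (Spark 3.x / Databricks runtimes)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")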
Question #: 142
Topic #: 1
Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?
A. spark.sql.files.maxPartitionBytes
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.adaptive.advisoryPartitionSizeInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
Selected Answer: A
Question #: 141
Topic #: 1
Which statement characterizes the general programming model used by Spark Structured Streaming?
A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
C. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
Selected Answer: D
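The "unbounded table" model is visible in a minimal read/transform/write pipeline: each micro-batch of newly arrived data is treated as new rows appended to the input table. A sketch, assuming a hypothetical JSON landing directory, checkpoint path, and target table name:

stream_df = (spark.readStream
             .format("json")
             .schema("device_id INT, reading DOUBLE, ts TIMESTAMP")   # streaming file sources need an explicit schema
             .load("/mnt/landing/events/"))                           # hypothetical source path

query = (stream_df.writeStream
         .outputMode("append")                                        # new input rows become new output rows
         .option("checkpointLocation", "/mnt/checkpoints/events/")    # hypothetical checkpoint path
         .toTable("bronze_events"))                                   # hypothetical target Delta table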
Question #: 129
Topic #: 1
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?
A. configure
B. fs
C. workspace
D. libraries
Selected Answer: B
Question #: 121
Topic #: 1
A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:
SELECT COUNT(*) FROM table
Which of the following describes how results are generated each time the dashboard is updated?
A. The total count of rows is calculated by scanning all data files
B. The total count of rows will be returned from cached results unless REFRESH is run
C. The total count of records is calculated from the Delta transaction logs
D. The total count of records is calculated from the parquet file metadata
Selected Answer: C
Question #: 119
Topic #: 1
Which statement regarding Spark configuration on the Databricks platform is true?
A. The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs currently running on the cluster.
B. Spark configurations set within a notebook will affect all SparkSessions attached to the same interactive cluster.
C. When the same Spark configuration property is set for an interactive cluster and a notebook attached to that cluster, the notebook setting will always be ignored.
D. Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.
Selected Answer: D
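Cluster-level properties set through the Clusters UI apply to every notebook attached to that cluster, while a value set with spark.conf.set inside a notebook affects only that notebook's SparkSession (and only for properties that can be changed at runtime). A small sketch using a hypothetical property value:

# Set in the cluster's Spark config box (Clusters UI); applies to all attached notebooks:
#   spark.sql.shuffle.partitions 64

# Set inside one notebook; overrides the value for this notebook's SparkSession only:
spark.conf.set("spark.sql.shuffle.partitions", "32")
print(spark.conf.get("spark.sql.shuffle.partitions"))   # 32 here; other notebooks still see 64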
Question #: 1
Topic #: 1
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
A. date = spark.conf.get("date")
B. input_dict = input()
date = input_dict["date"]
C. import sys
date = sys.argv[1]
D. date = dbutils.notebooks.getParam("date")
E. dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")
Selected Answer: E
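The widgets API is how job parameters passed through the Jobs API reach notebook code. A minimal sketch mirroring the question's load logic:

# Register a widget with a default value, then read whatever value the job run supplied
dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

# Interpolate the parameter into the source path
df = spark.read.format("parquet").load(f"/mnt/source/{date}")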
Question #: 2
Topic #: 1
The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?
A. “Can Manage” privileges on the required cluster
B. Workspace Admin privileges, cluster creation allowed, “Can Attach To” privileges on the required cluster
C. Cluster creation allowed, “Can Attach To” privileges on the required cluster
D. “Can Restart” privileges on the required cluster
E. Cluster creation allowed, “Can Restart” privileges on the required cluster
Selected Answer: D
Question #: 98
Topic #: 1
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the auditing group executes the following query:
SELECT * FROM user_ltv_no_minors
Which statement describes the results returned by this query?
A. All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.
B. All age values less than 18 will be returned as null values, all other columns will be returned with the values in user_ltv.
C. All values for the age column will be returned as null values, all other columns will be returned with the values in user_ltv.
D. All records from all columns will be displayed with the values in user_ltv.
E. All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.
Selected Answer: A
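The view definition is not reproduced above, but the selected answer implies a row-filtering dynamic view along the following lines. This is a plausible sketch, not the exam's exact DDL, issued here through spark.sql:

spark.sql("""
  CREATE OR REPLACE VIEW user_ltv_no_minors AS
  SELECT email, age, ltv
  FROM user_ltv
  WHERE CASE
          WHEN is_member('auditing') THEN TRUE   -- members of the auditing group see every row
          ELSE age >= 18                         -- everyone else only sees records with age greater than 17
        END
""")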
Question #: 148
Topic #: 1
A DLT pipeline includes the following streaming tables:
• raw_iot ingests raw device measurement data from a heart rate tracking device.
• bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.
How can the data engineer configure this pipeline to retain manually deleted or updated records in the raw_iot table, while recomputing the downstream bpm_stats table when a pipeline update is run?
A. Set the pipelines.reset.allowed property to false on raw_iot
B. Set the skipChangeCommits flag to true on raw_iot
C. Set the pipelines.reset.allowed property to false on bpm_stats
D. Set the skipChangeCommits flag to true on bpm_stats
Selected Answer: A
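In Delta Live Tables this property is declared per table. A sketch of the raw_iot declaration with the reset-protection property set; the source path and options are illustrative, not taken from the exam:

import dlt

@dlt.table(
    name="raw_iot",
    table_properties={"pipelines.reset.allowed": "false"}   # full refreshes will not truncate this table
)
def raw_iot():
    # Hypothetical ingestion of raw heart-rate device measurements
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/iot/"))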
Question #: 117
Topic #: 1
A data engineer, User A, has promoted a pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
A workspace admin, User C, inherits responsibility for managing this pipeline. User C uses the Databricks Jobs UI to take “Owner” privileges of each job. Jobs continue to be triggered using the credentials and tooling configured by User B.
An application has been configured to collect and parse run information returned by the REST API. Which statement describes the value returned in the creator_user_name field?
A. Once User C takes “Owner” privileges, their email address will appear in this field; prior to this, User A’s email address will appear in this field.
B. User B’s email address will always appear in this field, as their credentials are always used to trigger the run.
C. User A’s email address will always appear in this field, as they still own the underlying notebooks.
D. Once User C takes “Owner” privileges, their email address will appear in this field; prior to this, User B’s email address will appear in this field.
E. User C will only ever appear in this field if they manually trigger the job, otherwise it will indicate User B.
Selected Answer: B
Question #: 112
Topic #: 1
Each configuration below is identical to the extent that each cluster has a total of 400 GB of RAM, 160 total cores, and only one Executor per VM.
Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?
A. • Total VMs: 8
• 50 GB per Executor
• 20 Cores / Executor
B. • Total VMs: 16
• 25 GB per Executor
• 10 Cores / Executor
C. • Total VMs: 1
• 400 GB per Executor
• 160 Cores/Executor
D. • Total VMs: 4
• 100 GB per Executor
• 40 Cores / Executor
E. • Total VMs: 2
• 200 GB per Executor
• 80 Cores / Executor
Selected Answer: B
Question #: 162
Topic #: 1
A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the “Owner” for each job. They attempt to transfer “Owner” privileges to the “DevOps” group, but cannot successfully accomplish this task.
Which statement explains what is preventing this privilege transfer?
A. Databricks jobs must have exactly one owner; “Owner” privileges cannot be assigned to a group.
B. The creator of a Databricks job will always have “Owner” privileges; this configuration cannot be changed.
C. Only workspace administrators can grant “Owner” privileges to a group.
D. A user can only transfer job ownership to a group if they are also a member of that group.
Selected Answer: A
Question #: 11
Topic #: 1
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake’s time travel functionality. They are concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
A. Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
B. Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
C. Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
D. Because Delta Lake’s delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
E. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
Selected Answer: A
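VACUUM's behavior is governed by the table's deleted-file retention threshold, which defaults to 7 days (delta.deletedFileRetentionDuration): only unreferenced data files older than that threshold are removed. A sketch of inspecting and running the command via spark.sql, with a hypothetical table name:

# Inspect table properties; delta.deletedFileRetentionDuration defaults to 7 days if unset
spark.sql("SHOW TBLPROPERTIES user_data").show(truncate=False)   # user_data is a hypothetical table

# Remove data files no longer referenced by the current version and older than the retention threshold
spark.sql("VACUUM user_data")                   # uses the default 7-day threshold
spark.sql("VACUUM user_data RETAIN 168 HOURS")  # equivalent explicit form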
Question #: 136
Topic #: 1
A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.
Which strategy will yield the best performance without shuffling data?
A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.
Selected Answer: A
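The selected answer relies on controlling partition size at read time so that narrow transformations preserve roughly 512 MB partitions and no shuffle is needed before the write. A sketch under those assumptions; the paths and filter are hypothetical:

# Cap each input partition at 512 MB on ingestion
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.format("json").load("/mnt/raw/events_json/")   # hypothetical 1 TB JSON source
cleaned = df.filter("payload IS NOT NULL")                     # narrow transformations only; no shuffle

cleaned.write.mode("overwrite").format("parquet").save("/mnt/curated/events_parquet/")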
Question #: 10
Topic #: 1
A Delta table of weather records is partitioned by date and has the below schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
A. All records are cached to an operational database and then the filter is applied
B. The Parquet file footers are scanned for min and max statistics for the latitude column
C. All records are cached to attached storage and then the filter is applied
D. The Delta log is scanned for min and max statistics for the latitude column
E. The Hive metastore is scanned for min and max statistics for the latitude column
Selected Answer: D
Question #: 102
Topic #: 1
What is the first line of a Databricks Python notebook when viewed in a text editor?
A. %python
B. // Databricks notebook source
C. # Databricks notebook source
D. — Databricks notebook source
E. # MAGIC %python
Selected Answer: C
Question #: 63
Topic #: 1
A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:
SELECT COUNT(*) FROM table
Which of the following describes how results are generated each time the dashboard is updated?
A. The total count of rows is calculated by scanning all data files
B. The total count of rows will be returned from cached results unless REFRESH is run
C. The total count of records is calculated from the Delta transaction logs
D. The total count of records is calculated from the parquet file metadata
E. The total count of records is calculated from the Hive metastore
Selected Answer: C
Question #: 60
Topic #: 1
The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:
The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is: store_id INT, sales_date DATE, total_sales FLOAT
If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?
A. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each Update.
B. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.
C. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
D. Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
E. Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
Selected Answer: C
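An upsert of batch-computed aggregates is typically expressed with MERGE INTO. A minimal sketch keyed on store_id; the aggregate logic is abbreviated and the summary table's column names are assumed, since its schema is not shown above:

from pyspark.sql import functions as F

# Recompute the aggregate as a batch read of the source table
agg = (spark.table("daily_store_sales")
       .groupBy("store_id")
       .agg(F.sum("total_sales").alias("total_sales")))   # simplified; real logic also covers 7-day, QTD, and YTD windows
agg.createOrReplaceTempView("updates")

# Upsert the recomputed values into the summary table
spark.sql("""
  MERGE INTO store_sales_summary AS t
  USING updates AS s
  ON t.store_id = s.store_id
  WHEN MATCHED THEN UPDATE SET t.total_sales = s.total_sales
  WHEN NOT MATCHED THEN INSERT (store_id, total_sales) VALUES (s.store_id, s.total_sales)
""")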
Question #: 8
Topic #: 1
An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
A. Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
C. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
D. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.
E. Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
Selected Answer: B
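The behavior in the selected answer matches a batch dedupe-then-append pattern like the following. This is a sketch, not the question's exact code; the path is hypothetical and the date variable is assumed to be set as described:

df = spark.read.format("parquet").load(f"/mnt/daily_orders/{date}/")   # hypothetical daily directory

(df.dropDuplicates(["customer_id", "order_id"])   # deduplicates within this batch only
   .write
   .mode("append")                                # appends blindly; cannot see duplicates already in the table
   .saveAsTable("orders"))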
Question #: 123
Topic #: 1
A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.
Which approach allows this user to share their code updates without the risk of overwriting the work of their teammates?
A. Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.
B. Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.
C. Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as changes were pulled.
D. Use Repos to merge all differences and make a pull request back to the remote repository.
Selected Answer: C
Question #: 4
Topic #: 1
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.
The below query is used to create the alert:
The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean (temperature) > 120. Notifications are triggered to be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?
A. The total average temperature across all sensors exceeded 120 on three consecutive executions of the query
B. The recent_sensor_recordings table was unresponsive for three consecutive runs of the query
C. The source query failed to update properly for three consecutive minutes and then restarted
D. The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query
E. The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
Selected Answer: E
Question #: 20
Topic #: 1
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:
Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?
A. No; Delta Lake manages streaming checkpoints in the transaction log.
B. Yes; both of the streams can share a single checkpoint directory.
C. No; only one stream can write to a Delta Lake table.
D. Yes; Delta Lake supports infinite concurrent writers.
E. No; each of the streams needs to have its own checkpoint directory.
Selected Answer: E
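Each streaming write needs its own checkpoint location because the checkpoint tracks that particular query's offsets and state. A sketch of the two Kafka streams writing to one bronze Delta table, with broker, topics, and paths as hypothetical placeholders:

def write_topic(topic, checkpoint_path):
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
            .option("subscribe", topic)
            .load()
            .writeStream
            .outputMode("append")
            .option("checkpointLocation", checkpoint_path)       # must be unique per stream
            .toTable("bronze_events"))                           # both streams append to the same table

q1 = write_topic("topic_a", "/mnt/checkpoints/bronze/topic_a/")
q2 = write_topic("topic_b", "/mnt/checkpoints/bronze/topic_b/")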
Question #: 33
Topic #: 1
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.
The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.
Which statement exemplifies best practices for implementing this system?
A. Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.
B. Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.
C. Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.
D. Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.
E. Because all tables must live in the same storage containers used for the database they’re created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.
Selected Answer: A
Question #: 55
Topic #: 1
Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.
Which statement describes a main benefit that offsets this additional effort?
A. Improves the quality of your data
B. Validates a complete use case of your application
C. Troubleshooting is easier since all steps are isolated and tested individually
D. Yields faster deployment and execution times
E. Ensures that all steps interact correctly to achieve the desired end result
Selected Answer: C
Question #: 57
Topic #: 1
Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?
A. /jobs/runs/list
B. /jobs/runs/get-output
C. /jobs/runs/get
D. /jobs/get
E. /jobs/list
Selected Answer: D
Question #: 168
Topic #: 1
What is the retention of job run history?
A. It is retained until you export or delete job run logs
B. It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3
C. It is retained for 60 days, during which you can export notebook run results to HTML
D. It is retained for 60 days, after which logs are archived
Selected Answer: C
Question #: 110
Topic #: 1
A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.
Which of the following solutions would you implement to achieve this requirement?
A. Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
B. Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
C. Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
D. Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
E. Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.
Selected Answer: B
Question #: 122
Topic #: 1
A Delta Lake table was created with the below query:
Consider the following query:
DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?
A. Data will be marked as deleted but still recoverable with Time Travel.
B. The table will be removed from the catalog but the data will remain in storage.
C. The table will be removed from the catalog and the data will be deleted.
D. An error will occur because Delta Lake prevents the deletion of production data.
Selected Answer: C
Question #: 166
Topic #: 1
An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
The following logic was executed to create a database for the finance team:
After the database was successfully created and permissions configured, a member of the finance team runs the following code:
If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?
A. A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.
B. An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.
C. A managed table will be created in the DBFS root storage container.
D. A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.
Selected Answer: D
Question #: 163
Topic #: 1
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the auditing group executes the following query:
SELECT * FROM user_ltv_no_minors
Which statement describes the results returned by this query?
A. All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.
B. All age values less than 18 will be returned as null values, all other columns will be returned with the values in user_ltv.
C. All values for the age column will be returned as null values, all other columns will be returned with the values in user_ltv.
D. All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.
Selected Answer: A
Question #: 139
Topic #: 1
A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.
Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?
A. Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
B. Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
C. Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
D. Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
Selected Answer: C
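With the default trigger, a new micro-batch starts as soon as the previous one finishes, so a mostly-idle stream still issues frequent list and read requests against cloud storage; lengthening the trigger interval reduces those API calls while keeping end-to-end latency under the 10-minute requirement. A sketch of the 10-minute processing-time trigger, with source and target names as placeholders:

query = (spark.readStream
         .table("bronze_events")                                 # hypothetical streaming source table
         .writeStream
         .trigger(processingTime="10 minutes")                   # a micro-batch at most once every 10 minutes
         .option("checkpointLocation", "/mnt/checkpoints/silver_events/")
         .toTable("silver_events"))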
Question #: 151
Topic #: 1
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
A. The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
B. Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
C. Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
D. Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
Selected Answer: D
Question #: 167
Topic #: 1
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to using these credentials?
A. No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
B. “Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.
C. “Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.
D. “Manage” permissions should be set on a secret scope containing only those credentials that will be used by a given team.
Selected Answer: C
Question #: 164
Topic #: 1
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the “registration” topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.
Which solution meets the requirements?
A. All data should be deleted biweekly; Delta Lake’s time travel functionality should be leveraged to maintain a history of non-PII information.
B. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.
C. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.
D. Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.
Selected Answer: C
Question #: 160
Topic #: 1
A data architect has heard about Delta Lake’s built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.
The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.
Which piece of information is critical to this decision?
A. Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.
B. Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.
C. Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.
D. Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.
Selected Answer: D
Question #: 159
Topic #: 1
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.
Which consideration will impact the decisions made by the engineer while migrating this workload?
A. Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.
B. Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake’s upsert functionality.
C. Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.
D. All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.
Selected Answer: D
Question #: 157
Topic #: 1
A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company’s data is stored in regional cloud storage in the United States.
The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.
Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?
A. Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.
B. Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.
C. Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.
D. Databricks notebooks send all executable code from the user’s browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.
Selected Answer: C
Question #: 152
Topic #: 1
The data engineering team maintains the following code:
Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?
A. A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
B. The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.
C. No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.
D. An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.
Selected Answer: B
Question #: 50
Topic #: 1
A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.
When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?
A. The five Minute Load Average remains consistent/flat
B. Bytes Received never exceeds 80 million bytes per second
C. Total Disk Space remains constant
D. Network I/O never spikes
E. Overall cluster CPU utilization is around 25%
Selected Answer: E
Question #: 135
Topic #: 1
A view is registered with the following code:
Both users and orders are Delta Lake tables.
Which statement describes the results of querying recent_orders?
A. All logic will execute when the view is defined and store the result of joining tables to the DBFS; this stored data will be returned when the view is queried.
B. Results will be computed and cached when the view is defined; these cached results will incrementally update as new records are inserted into source tables.
C. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
Selected Answer: D
Question #: 101
Topic #: 1
Which indicators would you look for in the Spark UI’s Storage tab to signal that a cached table is not performing optimally? Assume you are using Spark’s MEMORY_ONLY storage level.
A. Size on Disk is < Size in Memory
B. The RDD Block Name includes the “*” annotation signaling a failure to cache
C. Size on Disk is > 0
D. The number of Cached Partitions > the number of Spark Partitions
E. On Heap Memory Usage is within 75% of Off Heap Memory Usage
Selected Answer: C
Question #: 108
Topic #: 1
Which statement describes the default execution mode for Databricks Auto Loader?
A. Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; the target table is materialized by directly querying all valid files in the source directory.
B. New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
C. Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data are automatically merged into target tables using rules inferred from the data.
D. New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
E. Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
Selected Answer: D
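Auto Loader's default (directory listing) mode looks like the sketch below; file notification mode would additionally set cloudFiles.useNotifications to true. Paths and table names are hypothetical:

stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/autoloader_schema/")  # hypothetical
          .load("/mnt/landing/events/"))        # new files are discovered by listing this directory

(stream.writeStream
 .option("checkpointLocation", "/mnt/checkpoints/autoloader/")
 .toTable("bronze_events"))                     # processed files are tracked, so loads are incremental and idempotent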
Question #: 95
Topic #: 1
A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.
If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
A. Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
B. All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
C. The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
D. Duplicate records arriving more than 2 hours apart will be dropped, but duplicates that arrive in the same batch may both be written to the orders table.
E. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.
Selected Answer: A
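That behavior matches watermark-bounded deduplication: the state store only remembers keys within the watermark window, so a duplicate arriving after the window closes is treated as a new record. A sketch of the pattern the question likely refers to (the exam's exact code is not shown); the schema is limited to the fields named in the question:

deduped = (spark.readStream
           .format("parquet")
           .schema("customer_id INT, order_id INT, time TIMESTAMP")
           .load("/mnt/raw_orders/")
           .withWatermark("time", "2 hours")                    # key state is kept for ~2 hours of event time
           .dropDuplicates(["customer_id", "order_id"]))

(deduped.writeStream
 .outputMode("append")
 .option("checkpointLocation", "/mnt/checkpoints/orders_dedupe/")
 .toTable("orders"))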
Question #: 42
Topic #: 1
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?
A. Three columns will be returned, but one column will be named “REDACTED” and contain only null values.
B. Only the email and ltv columns will be returned; the email column will contain all null values.
C. The email and ltv columns will be returned with the values in user_ltv.
D. The email, age, and ltv columns will be returned with the values in user_ltv.
E. Only the email and ltv columns will be returned; the email column will contain the string “REDACTED” in each row.
Selected Answer: E
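The elided view definition presumably masks the email column for users outside the marketing group and omits the age column. A plausible sketch, not the exam's exact DDL, issued through spark.sql:

spark.sql("""
  CREATE OR REPLACE VIEW email_ltv AS
  SELECT
    CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
    ltv
  FROM user_ltv
""")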
Question #: 34
Topic #: 1
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.
Which approach will ensure that this requirement is met?
A. Whenever a database is being created, make sure that the LOCATION keyword is used
B. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
C. Whenever a table is being created, make sure that the LOCATION keyword is used.
D. When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.
E. When the workspace is being configured, make sure that external cloud object storage has been mounted.
Selected Answer: C
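Supplying LOCATION in the CREATE TABLE statement is what makes a Delta table external (unmanaged). A sketch via spark.sql with a hypothetical mounted path and columns:

spark.sql("""
  CREATE TABLE sales_external (
    order_id BIGINT,
    amount DOUBLE
  )
  USING DELTA
  LOCATION '/mnt/external/sales'   -- hypothetical external storage path
""")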
Question #: 16
Topic #: 1
A table is registered with the following code:
Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?
A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
C. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
E. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
Selected Answer: B