Certified Associate Developer for Apache Spark Topic 3
Question #: 166
Topic #: 1
The code block shown below contains an error. The code block is intended to read JSON at the file path filePath into a DataFrame with the specified schema schema. Identify the error.
Code block:
spark.read.schema("schema").format("json").load(filePath)
A. The schema operation from read takes a schema object rather than a string — the argument should be schema.
B. There is no load() operation for DataFrameReader — it should be replaced with the json() operation.
C. The spark.read operation should be followed by parentheses in order to return a DataFrameReader object.
D. There is no read property of spark — spark should be replaced with DataFrame.
E. The schema operation from read takes a column rather than a string — the argument should be col("schema").
Selected Answer: A
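For reference, a minimal sketch of the corrected block, assuming a hypothetical two-column schema (the point is only that schema() receives a schema object, not a string):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# hypothetical schema for illustration
schema = StructType([
    StructField("storeId", IntegerType()),
    StructField("managerName", StringType()),
])
df = spark.read.schema(schema).format("json").load(filePath)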
Question #: 165
Topic #: 1
Which of the following code blocks writes DataFrame storesDF to file path filePath as text files overwriting any existing files in that location?
A. storesDF.write(filePath, mode = "overwrite", source = "text")
B. storesDF.write.mode("overwrite").text(filePath)
C. storesDF.write.mode("overwrite").path(filePath)
D. storesDF.write.option("text", "overwrite").path(filePath)
E. storesDF.write().mode("overwrite").text(filePath)
Selected Answer: B
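A minimal sketch of the correct pattern. Note that text() requires a DataFrame with exactly one string column, so a cast of an assumed storeId column is shown:
# text() accepts only a single string column, hence the cast
storesDF.selectExpr("CAST(storeId AS STRING)").write.mode("overwrite").text(filePath)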
Question #: 163
Topic #: 1
The code block shown below should return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId and column employeeId. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.join(employeesDF, [__1__ == __2__, __3__ == __4__])
A. 1. storesDF.storeId
2. storesDF.employeeId
3. employeesDF.storeId
4. employeesDF.employeeId
B. 1. col("storeId")
2. col("storeId")
3. col("employeeId")
4. col("employeeId")
C. 1. storeId
2. storeId
3. employeeId
4. employeeId
D. 1. col("storeId")
2. col("employeeId")
3. col("employeeId")
4. col("storeId")
E. 1. storesDF.storeId
2. employeesDF.storeId
3. storesDF.employeeId
4. employeesDF.employeeId
Selected Answer: E (each column must be qualified by its source DataFrame; D compares storeId to employeeId)
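A sketch of the completed block using response E's qualified references:
# default join type is inner, matching the question's intent
joinedDF = storesDF.join(
    employeesDF,
    [storesDF.storeId == employeesDF.storeId,
     storesDF.employeeId == employeesDF.employeeId],
)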
Question #: 164
Topic #: 1
The code block shown below should return a new DataFrame that is the result of a position-wise union between DataFrame storesDF and DataFrame acquiredStoresDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__(__3__)
A. 1. DataFrame
2. union
3. storesDF, acquiredStoresDF
B. 1. DataFrame
2. concat
3. storesDF, acquiredStoresDF
C. 1. storesDF
2. union
3. acquiredStoresDF
D. 1. storesDF
2. unionByName
3. acquiredStoresDF
E. 1. DataFrame
2. unionAll
3. storesDF, acquiredStoresDF
Selected Answer: C (union() is an instance method that performs a position-wise union; there is no DataFrame.union(df1, df2) form)
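A sketch of the completed block:
# union() matches columns by position and requires the same column count
combinedDF = storesDF.union(acquiredStoresDF)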
Question #: 162
Topic #: 1
Which of the following operations can be used to perform a left join on two DataFrames?
A. DataFrame.join()
B. DataFrame.crossJoin()
C. DataFrame.merge()
D. DataFrame.leftJoin()
E. Standalone join() function
Selected Answer: A (DataFrame.join() with how = "left"; merge() is a pandas method, not a Spark one)
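A minimal sketch, assuming both DataFrames share a storeId column:
# how="left" keeps all rows from storesDF
leftJoinedDF = storesDF.join(employeesDF, on="storeId", how="left")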
Question #: 161
Topic #: 1
The code block shown below should return a DataFrame containing a column openDateString, a string representation of the date in column openDate using Java's SimpleDateFormat. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Note that column openDate is of type integer and represents a date in the UNIX epoch format — the number of seconds since midnight on January 1st, 1970.
An example of Java's SimpleDateFormat output is "Sunday, Dec 4, 2008 1:05 pm".
A sample of storesDF was displayed in the original question (not reproduced here).
Code block:
storesDF.__1__("openDateString", __2__(__3__, __4__))
A. 1. withColumn
2. from_unixtime
3. col("openDate")
4. "EEEE, MMM d, yyyy h:mm a"
B. 1. withColumn
2. date_format
3. col("openDate")
4. "EEEE, mmm d, yyyy h:mm a"
C. 1. newColumn
2. from_unixtime
3. "openDate"
4. "EEEE, MMM d, yyyy h:mm a"
D. 1. withColumn
2. from_unixtime
3. col("openDate")
4. SimpleDateFormat
E. 1. withColumn
2. from_unixtime
3. col("openDate")
4. "dw, MMM d, yyyy h:mm a"
Selected Answer: A
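A sketch of the completed block; from_unixtime() converts epoch seconds into a formatted string:
from pyspark.sql.functions import col, from_unixtime

storesDF.withColumn(
    "openDateString",
    from_unixtime(col("openDate"), "EEEE, MMM d, yyyy h:mm a"),
)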
Question #: 159
Topic #: 1
Which of the following code blocks creates a Python UDF assessPerformanceUDF() using the integer-returning Python function assessPerformance() and applies it to Column customerSatisfaction in DataFrame storesDF?
A. assessPerformanceUDF = udf(assessPerformance, IntegerType) storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
B. assessPerformanceUDF = udf(assessPerformance, IntegerType()) storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
C. assessPerformanceUDF = udf(assessPerformance) storesDF.withColumn("result", assessPerformance(col("customerSatisfaction")))
D. assessPerformanceUDF = udf(assessPerformance) storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
E. assessPerformanceUDF = udf(assessPerformance, IntegerType()) storesDF.withColumn("result", assessPerformance(col("customerSatisfaction")))
Selected Answer: B
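A runnable sketch; the body of assessPerformance() is a hypothetical placeholder, since the question only states that it returns an integer:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

def assessPerformance(satisfaction):
    # hypothetical scoring logic for illustration
    return 1 if satisfaction >= 4 else 0

assessPerformanceUDF = udf(assessPerformance, IntegerType())
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))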
Question #: 158
Topic #: 1
The code block shown below should return a 25 percent sample of rows from DataFrame storesDF with reproducible results. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__(__2__ = __3__, __4__ = __5__)
A. 1. sample
2. fraction
3. 0.25
4. seed
5. True
B. 1. sample
2. withReplacement
3. True
4. seed
5. True
C. 1. sample
2. fraction
3. 0.25
4. seed
5. 1234
D. 1. sample
2. fraction
3. 0.15
4. seed
5. 1234
E. 1. sample
2. withReplacement
3. True
4. seed
5. 1234
Selected Answer: C (fraction = 0.25 with a fixed integer seed gives a reproducible 25 percent sample)
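A sketch of the completed block; any fixed integer seed makes the sample reproducible:
# withReplacement defaults to False
sampledDF = storesDF.sample(fraction=0.25, seed=1234)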
Question #: 154
Topic #: 1
The code block shown below should return a new DataFrame where rows in DataFrame storesDF containing at least one missing value have been dropped. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__.__2__(__3__ = __4__)
A. 1. na
2. drop
3. subset
4. "any"
B. 1. na
2. drop
3. how
4. "all"
C. 1. na
2. drop
3. subset
4. "all"
D. 1. na
2. drop
3. how
4. "any"
E. 1. drop
2. na
3. how
4. "any"
Selected Answer: D (na.drop(how = "any") drops rows with at least one missing value; subset selects columns, not the drop mode)
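A sketch of the completed block:
# how="any" drops a row if any column is null; how="all" only if every column is null
cleanDF = storesDF.na.drop(how="any")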
Question #: 151
Topic #: 1
Which of the following cluster configurations is least likely to experience delays due to garbage collection of a large DataFrame?
Note: each configuration has roughly the same compute power, using 100 GB of RAM and 200 cores. (The scenario descriptions were shown as an image in the original question and are not reproduced here.)
A. Scenario #4
B. Scenario #1
C. Scenario #5
D. More information is needed to determine an answer.
E. Scenario #6
Selected Answer: A
Question #: 150
Topic #: 1
Which of the following operations is least likely to result in a shuffle?
A. DataFrame.join()
B. DataFrame.filter()
C. DataFrame.orderBy()
D. DataFrame.distinct()
E. DataFrame.intersect()
Selected Answer: B (filter() is a narrow transformation; join, orderBy, distinct, and intersect generally require a shuffle)
Question #: 148
Topic #: 1
Which of the following describes slots?
A. Slots are the most coarse level of execution in the Spark execution hierarchy.
B. Slots are resource threads that can be used for parallelization within a Spark application.
C. Slots are resources that are used to run multiple Spark applications at once on a single cluster.
D. Slots are the most granular level of execution in the Spark execution hierarchy.
E. Slots are unique segments of data from a DataFrame that are split up by row.
Selected Answer: B (slots are the resource threads on executors that run tasks in parallel within one application)
Question #: 146
Topic #: 1
In what order should the below lines of code be run in order to read a Parquet file at the file path filePath into a DataFrame?
Lines of code:
1. storesDF
2. .load(filePath, source = "parquet")
3. .read \
4. spark \
5. .read() \
6. .parquet(filePath)
A. 1, 5, 2
B. 4, 5, 2
C. 4, 3, 6
D. 4, 5, 6
E. 4, 3, 2
Selected Answer: C (spark \ .read \ .parquet(filePath); read is a property, not a method, and load() has no source parameter)
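The correct chain written out; read takes no parentheses because it is a property, not a method:
storesDF = spark \
    .read \
    .parquet(filePath)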
Question #: 145
Topic #: 1
Which of the following code blocks returns a new DataFrame that is the result of a position-wise union between DataFrame storesDF and DataFrame acquiredStoresDF?
A. storesDF.unionByName(acquiredStoresDF)
B. unionAll(storesDF, acquiredStoresDF)
C. union(storesDF, acquiredStoresDF)
D. concat(storesDF, acquiredStoresDF)
E. storesDF.union(acquiredStoresDF)
Selected Answer: E (union() is position-wise; unionByName() matches columns by name)
Question #: 141
Topic #: 1
Which of the following code blocks uses SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF?
A. storesDF.createOrReplaceTempView()
spark.sql("SELECT storeId, managerName FROM stores")
B. storesDF.query("SELECT storeid, managerName from stores")
C. spark.createOrReplaceTempView("storesDF")
storesDF.sql("SELECT storeId, managerName from stores")
D. storesDF.createOrReplaceTempView("stores")
spark.sql("SELECT storeId, managerName FROM stores")
E. storesDF.createOrReplaceTempView("stores")
storesDF.query("SELECT storeId, managerName FROM stores")
Selected Answer: D (the temp view must be created with the name "stores" that the SQL query references)
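A sketch of the correct two-step pattern:
# the view name in SQL must match the name given to createOrReplaceTempView
storesDF.createOrReplaceTempView("stores")
resultDF = spark.sql("SELECT storeId, managerName FROM stores")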
Question #: 139
Topic #: 1
Which of the following code blocks returns a 15 percent sample of rows from DataFrame storesDF without replacement?
A. storesDF.sample(True, fraction = 0.15)
B. storesDF.sample(fraction = 0.15)
C. storesDF.sampleBy(fraction = 0.15)
D. storesDF.sample(fraction = 0.10)
E. storesDF.sample()
Selected Answer: B (sample(fraction = 0.15); withReplacement defaults to False)
Question #: 132
Topic #: 1
Which of the following best describes the similarities and differences between the MEMORY_ONLY storage level and the MEMORY_AND_DISK storage level?
A. The MEMORY_ONLY storage level will store as much data as possible in memory and will store any data that does not fit in memory on disk and read it as it's called.
The MEMORY_AND_DISK storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it's called.
B. The MEMORY_ONLY storage level will store as much data as possible in memory on two cluster nodes and will recompute any data that does not fit in memory as it's called.
The MEMORY_AND_DISK storage level will store as much data as possible in memory on two cluster nodes and will store any data that does not fit in memory on disk and read it as it's called.
C. The MEMORY_ONLY storage level will store as much data as possible in memory on two cluster nodes and will store any data that does not fit in memory on disk and read it as it's called.
The MEMORY_AND_DISK storage level will store as much data as possible in memory on two cluster nodes and will recompute any data that does not fit in memory as it's called.
D. The MEMORY_ONLY storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it's called.
The MEMORY_AND_DISK storage level will store as much data as possible in memory and will store any data that does not fit in memory on disk and read it as it's called.
E. The MEMORY_ONLY storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it’s called.
The MEMORY_AND_DISK storage level will store half of the data in memory and store half of the memory on disk. This provides quick preview and better logical plan design.
Selected Answer: D (MEMORY_ONLY recomputes partitions that do not fit in memory; MEMORY_AND_DISK spills them to disk)
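A minimal sketch contrasting the two levels (a DataFrame's storage level is fixed until unpersist() is called):
from pyspark import StorageLevel

storesDF.persist(StorageLevel.MEMORY_ONLY)      # partitions that don't fit are recomputed on access
storesDF.count()                                # an action materializes the cache
storesDF.unpersist()                            # release before switching levels
storesDF.persist(StorageLevel.MEMORY_AND_DISK)  # partitions that don't fit spill to disk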
Question #: 131
Topic #: 1
Which of the following will cause a Spark job to fail?
A. Never pulling any amount of data onto the driver node.
B. Trying to cache data larger than an executor’s memory.
C. Data needing to spill from memory to disk.
D. A failed worker node.
E. A failed driver node.
Selected Answer: E (Spark recovers from worker failures, but a failed driver node ends the application)
Question #: 130
Topic #: 1
Spark’s execution/deployment mode determines where the driver and executors are physically located when a Spark application is run. Which of the following Spark execution/deployment modes does not exist? If they all exist, please indicate so with Response E.
A. Client mode
B. Cluster mode
C. Standard mode
D. Local mode
E. All of these execution/deployment modes exist
Selected Answer: C (client, cluster, and local modes all exist; there is no "standard" mode)
Question #: 129
Topic #: 1
Which of the following identifies multiple narrow operations that are executed in sequence?
A. Slot
B. Job
C. Stage
D. Task
E. Executor
Selected Answer: C (a stage pipelines consecutive narrow operations; a task is a single unit of work)
Question #: 128
Topic #: 1
Which of the following describes a partition?
A. A partition is the amount of data that fits in a single executor.
B. A partition is an automatically-sized segment of data that is used to create efficient logical plans.
C. A partition is the amount of data that fits on a single worker node.
D. A partition is a portion of a Spark application that is made up of similar jobs.
E. A partition is a collection of rows of data that fit on a single machine in a cluster.
Selected Answer: E
Question #: 127
Topic #: 1
Which of the following cluster configurations will induce the least network traffic during a shuffle operation?
Note: each configuration has roughly the same compute power, using 100 GB of RAM and 200 cores. (The scenario descriptions were shown as an image in the original question and are not reproduced here.)
A. This cannot be determined without knowing the number of partitions.
B. Scenario 5
C. Scenario 1
D. Scenario 4
E. Scenario 6
Selected Answer: C
Question #: 126
Topic #: 1
Which of the following types of processes induces a stage boundary?
A. Shuffle
B. Caching
C. Executor failure
D. Job delegation
E. Application failure
Selected Answer: A (a shuffle marks the boundary between stages)
Question #: 124
Topic #: 1
Which of the following statements about the Spark driver is true?
A. Spark driver is horizontally scaled to increase overall processing throughput.
B. Spark driver is the most coarse level of the Spark execution hierarchy.
C. Spark driver is fault tolerant — if it fails, it will recover the entire Spark application.
D. Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode.
E. Spark driver is only compatible with its included cluster manager.
Selected Answer: D
Question #: 123
Topic #: 1
Which of the following code blocks returns a new DataFrame with a new column customerSatisfactionAbs that is the absolute value of column customerSatisfaction in DataFrame storesDF? Note that column customerSatisfactionAbs is not in the original DataFrame storesDF.
A. storesDF.withColumn("customerSatisfactionAbs", abs(col("customerSatisfaction")))
B. storesDF.withColumnRenamed("customerSatisfactionAbs", abs(col("customerSatisfaction")))
C. storesDF.withColumn(col("customerSatisfactionAbs", abs(col("customerSatisfaction")))
D. storesDF.withColumn("customerSatisfactionAbs", abs(col(customerSatisfaction)))
E. storesDF.withColumn("customerSatisfactionAbs", abs("customerSatisfaction"))
Selected Answer: A (withColumn() adds the new column; withColumnRenamed() only renames an existing one)
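A sketch of response A; abs here is the Spark SQL function, which shadows Python's built-in abs in this scope:
from pyspark.sql.functions import abs, col

storesDF.withColumn("customerSatisfactionAbs", abs(col("customerSatisfaction")))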
Question #: 99
Topic #: 1
Which of the following code blocks returns a new DataFrame from DataFrame storesDF with no duplicate rows?
A. storesDF.removeDuplicates()
B. storesDF.getDistinct()
C. storesDF.duplicates.drop()
D. storesDF.duplicates()
E. storesDF.dropDuplicates()
Selected Answer: E (dropDuplicates() exists in PySpark; removeDuplicates() does not)
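A minimal sketch; distinct() is an equivalent alternative:
dedupedDF = storesDF.dropDuplicates()            # considers all columns
# storesDF.dropDuplicates(["storeId"]) would deduplicate on storeId only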
Question #: 87
Topic #: 1
The code block shown below should return a new 12-partition DataFrame from DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__(__3__)
A. 1. storesDF
2. coalesce
3. 4
B. 1. storesDF
2. coalesce
3. 4, "storeId"
C. 1. storesDF
2. repartition
3. "storeId"
D. 1. storesDF
2. repartition
3. 12
E. 1. storesDF
2. repartition
3. Nothing
Selected Answer: D (repartition(12) produces exactly 12 partitions; coalesce() can only reduce the partition count)
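A sketch of response D; repartition() shuffles to produce exactly the requested number of partitions:
repartitionedDF = storesDF.repartition(12)
print(repartitionedDF.rdd.getNumPartitions())    # 12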
Question #: 69
Topic #: 1
Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column modality is the constant string “PHYSICAL”? Assume DataFrame storesDF is the only defined language variable.
A. storesDF.withColumn("modality", lit(PHYSICAL))
B. storesDF.withColumn("modality", col("PHYSICAL"))
C. storesDF.withColumn("modality", lit("PHYSICAL"))
D. storesDF.withColumn("modality", StringType("PHYSICAL"))
E. storesDF.withColumn("modality", "PHYSICAL")
Selected Answer: C
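A sketch of response C; lit() wraps the constant as a Column, which withColumn() requires:
from pyspark.sql.functions import lit

storesDF.withColumn("modality", lit("PHYSICAL"))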
Question #: 68
Topic #: 1
The code block shown below should return a new DataFrame from DataFrame storesDF where column storeId is of the type string. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__("storeId", __2__("storeId").__3__(__4__))
A. 1. withColumn
2. col
3. cast
4. StringType()
B. 1. withColumn
2. cast
3. col
4. StringType()
C. 1. newColumn
2. col
3. cast
4. StringType()
D. 1. withColumn
2. cast
3. col
4. StringType
E. 1. withColumn
2. col
3. cast
4. StringType
Selected Answer: A (the column is wrapped with col() and then cast with .cast(StringType()))
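A sketch of the completed block; cast() also accepts the type name as a string:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

storesDF.withColumn("storeId", col("storeId").cast(StringType()))
# equivalent: col("storeId").cast("string")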
Question #: 125
Topic #: 1
The code block shown below should write DataFrame storesDF to file path filePath as parquet and partition by values in column division. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__.__2__(__3__).__4__(__5__)
A. 1. write
2. partitionBy
3. "division"
4. path
5. filePath, mode = parquet
B. 1. write
2. partitionBy
3. "division"
4. parquet
5. filePath
C. 1. write
2. partitionBy
3. col("division")
4. parquet
5. filePath
D. 1. write()
2. partitionBy
3. col("division")
4. parquet
5. filePath
E. 1. write
2. repartition
3. "division"
4. path
5. filePath, mode = "parquet"
Selected Answer: B
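A sketch of the completed block; a write mode could be chained in before parquet() if overwriting is needed:
storesDF.write.partitionBy("division").parquet(filePath)
# e.g. storesDF.write.mode("overwrite").partitionBy("division").parquet(filePath)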