Certified Associate Developer for Apache Spark Topic 2
Question #: 94
Topic #: 1
Which of the following operations performs a cross join on two DataFrames?
A. DataFrame.join()
B. The standalone join() function
C. The standalone crossJoin() function
D. DataFrame.crossJoin()
E. DataFrame.merge()
Selected Answer: D
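Example (Scala sketch): assuming storesDF and employeesDF are existing DataFrames in the same SparkSession, the selected operation looks like this.
// Cross join: every row of storesDF is paired with every row of employeesDF.
val crossedDF = storesDF.crossJoin(employeesDF)
crossedDF.count() // equals storesDF.count() * employeesDF.count()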
Question #: 1
Topic #: 1
Which of the following describes the Spark driver?
A. The Spark driver is responsible for performing all execution in all execution modes – it is the entire Spark application.
B. The Spark driver is fault tolerant – if it fails, it will recover the entire Spark application.
C. The Spark driver is the coarsest level of the Spark execution hierarchy – it is synonymous with the Spark application.
D. The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application.
E. The Spark driver is horizontally scaled to increase overall processing throughput of a Spark application.
Selected Answer: D
Question #: 143
Topic #: 1
Which of the following code blocks returns a DataFrame containing a column openDateString, a string representation of column openDate using Java’s SimpleDateFormat?
Note that column openDate is of type integer and represents a date in the UNIX epoch format — the number of seconds since midnight on January 1st, 1970.
An example of Java’s SimpleDateFormat is “Sunday, Dec 4, 2008 1:05 pm”.
A sample of storesDF is displayed below:
A. storesDF.withColumn(“openDateString”, from_unixtime(col(“openDate”), “EEEE, MMM d, yyyy h:mm a”))
B. storesDF.withColumn(“openDateString”, from_unixtime(col(“openDate“), “EEEE, MMM d, yyyy h:mm a”, TimestampType()))
C. storesDF.withColumn(“openDateString”, date(col(“openDate”), “EEEE, MMM d, yyyy h:mm a”))
D. storesDF.newColumn(col(“openDateString”), from_unixtime(“openDate”, “EEEE, MMM d, yyyy h:mm a”))
E. storesDF.withColumn(“openDateString”, date(col(“openDate“), “EEEE, MMM d, yyyy h:mm a”, TimestampType))
Selected Answer: A
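Example (Scala sketch): assuming storesDF has an integer column openDate holding UNIX epoch seconds, from_unixtime formats it as a string.
import org.apache.spark.sql.functions.{col, from_unixtime}
// Convert epoch seconds to a formatted string column named openDateString.
val withDateStringDF = storesDF.withColumn(
  "openDateString",
  from_unixtime(col("openDate"), "EEEE, MMM d, yyyy h:mm a")
)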
Question #: 119
Topic #: 1
Which of the following cluster configurations is most likely to experience delays due to garbage collection of a large DataFrame?
Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores.
A. More information is needed to determine an answer.
B. Scenario #5
C. Scenario #4
D. Scenario #1
E. Scenario #2
Selected Answer: D
Question #: 115
Topic #: 1
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 AND the value in column customerSatisfaction is greater than or equal to 30?
A. storesDF.filter(col(“sqft”) <= 25000 and col("customerSatisfaction") >= 30)
B. storesDF.filter(col(“sqft”) <= 25000 or col("customerSatisfaction") >= 30)
C. storesDF.filter(sqft) <= 25000 and customerSatisfaction >= 30)
D. storesDF.filter(col(“sqft”) <= 25000 & col("customerSatisfaction") >= 30)
E. storesDF.filter(sqft <= 25000) & customerSatisfaction >= 30)
Selected Answer: A
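Example (Scala sketch): assuming storesDF has numeric columns sqft and customerSatisfaction; in Scala, Column predicates are combined with && (or the and method).
import org.apache.spark.sql.functions.col
// Keep only rows that satisfy both conditions.
val filteredDF = storesDF.filter(
  (col("sqft") <= 25000) && (col("customerSatisfaction") >= 30)
)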
Question #: 110
Topic #: 1
The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId. Identify the error.
Code block:
StoresDF.join(employeesDF, Seq(“storeId”)
A. The key column storeId needs to be a string like “storeId”.
B. The key column storeId needs to be specified in an expression of both DataFrame columns like storesDF.storeId === employeesDF.storeId.
C. The default argument to the joinType parameter is “inner” – an additional argument of “left” must be specified.
D. There is no DataFrame.join() operation – DataFrame.merge() should be used instead.
E. The key column storeId needs to be wrapped in the col() operation.
Selected Answer: A
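Example (Scala sketch): assuming both DataFrames contain a storeId column, an inner join can be written as follows (“inner” is the default join type).
// Join on the shared key column; the result keeps a single storeId column.
val joinedDF = storesDF.join(employeesDF, Seq("storeId"))
// Equivalent, with the join type spelled out explicitly:
val joinedExplicitDF = storesDF.join(employeesDF, Seq("storeId"), "inner")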
Question #: 107
Topic #: 1
The code block shown below should create a single-column DataFrame from the Scala list years, which is made up of integers. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__(__3__).__4__
A. 1. spark
2. createDataFrame
3. years
4. IntegerType
B. 1. spark
2. createDataset
3. years
4. IntegerType
C. 1. spark
2. createDataset
3. List(years)
4. toDF
D. 1. spark
2. createDataFrame
3. List(years)
4. IntegerType
Selected Answer: C
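Example (Scala sketch): one common way to build a single-column DataFrame from a Scala list of integers, assuming a SparkSession named spark so its implicit encoders are in scope.
import spark.implicits._
val years = List(2018, 2019, 2020, 2021) // illustrative values
// createDataset yields a Dataset[Int]; toDF turns it into a one-column DataFrame.
val yearsDF = spark.createDataset(years).toDF("year")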
Question #: 106
Topic #: 1
The code block shown below contains an error. The code block is intended to create the Scala UDF assessPerformanceUDF() and apply it to the integer column customerSatisfaction in DataFrame storesDF. Identify the error.
Code block:
A. The input type of customerSatisfaction is not specified in the udf() operation.
B. The return type of assessPerformanceUDF() must be specified.
C. The withColumn() operation is not appropriate here – UDFs should be applied by iterating over rows instead.
D. The assessPerformanceUDF() must first be defined as a Scala function and then converted to a UDF.
E. UDFs can only be applied via SQL and not through the DataFrame API.
Selected Answer: A
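For context (Scala sketch): defining and applying a UDF on an integer column; the scoring logic below is only a placeholder assumption, since the original code block is not reproduced here.
import org.apache.spark.sql.functions.{col, udf}
// The UDF's return type is inferred from the Scala function literal.
val assessPerformanceUDF = udf((customerSatisfaction: Int) =>
  if (customerSatisfaction >= 30) "high" else "low"
)
val scoredDF = storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))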
Question #: 103
Topic #: 1
The code block shown below should extract the integer value for column sqft from the first row of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__.__3__[Int](__4__)
A. 1. storesDF
2. first()
3. getAs()
4. “sqft”
B. 1. storesDF
2. first
3. getAs
4. sqft
C. 1. storesDF
2. first()
3. getAs
4. col(“sqft”)
D. 1. storesDF
2. first
3. getAs
4. “sqft”
Selected Answer: D
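Example (Scala sketch): assuming storesDF has an integer column sqft, the first row's value can be extracted as follows.
// first() returns the first Row; getAs[Int] pulls out the named field as an Int.
val sqftValue: Int = storesDF.first().getAs[Int]("sqft")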
Question #: 83
Topic #: 1
Which of the following code blocks creates and registers a SQL UDF named “ASSESS_PERFORMANCE” using the Scala function assessPerformance() and applies it to column customerSatisfaction in table stores?
A. spark.udf.register(“ASSESS_PERFORMANCE”, assessPerformance)
spark.sql(“SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores”)
B. spark.udf.register(“ASSESS_PERFORMANCE”, assessPerformance)
C. spark.udf.register(“ASSESS_PERFORMANCE”, assessPerformance)
spark.sql(“SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores”)
D. spark.udf.register(“ASSESS_PERFORMANCE”, assessPerformance)
storesDF.withColumn(“result”, assessPerformance(col(“customerSatisfaction”)))
E. spark.udf.register(“ASSESS_PERFORMANCE”, assessPerformance)
storesDF.withColumn(“result”, ASSESS_PERFORMANCE(col(“customerSatisfaction”)))
Selected Answer: C
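Example (Scala sketch): registering a Scala function for use in SQL, assuming a stores view already exists; the function body is a placeholder, and Spark SQL resolves the registered function name case-insensitively.
// Hypothetical Scala function to expose to SQL.
val assessPerformance = (customerSatisfaction: Int) => if (customerSatisfaction >= 30) "high" else "low"
spark.udf.register("ASSESS_PERFORMANCE", assessPerformance)
val resultDF = spark.sql(
  "SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores"
)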
Question #: 78
Topic #: 1
Which of the following code blocks returns a DataFrame sorted alphabetically based on column division?
A. storesDF.sort(“division”)
B. storesDF.orderBy(desc(“division”))
C. storesDF.orderBy(col(“division”).desc())
D. storesDF.orderBy(“division”, ascending = true)
E. storesDF.sort(desc(“division”))
Selected Answer: A
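Example (Scala sketch): assuming storesDF has a string column division, both sort and orderBy default to ascending (alphabetical) order.
import org.apache.spark.sql.functions.asc
val sortedDF = storesDF.sort("division")          // ascending by default
val orderedDF = storesDF.orderBy(asc("division")) // equivalent, with the direction explicit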
Question #: 31
Topic #: 1
Which of the following code blocks will most quickly return an approximation for the number of distinct values in column division in DataFrame storesDF?
A. storesDF.agg(approx_count_distinct(col(“division”)).alias(“divisionDistinct”))
B. storesDF.agg(approx_count_distinct(col(“division”), 0.01).alias(“divisionDistinct”))
C. storesDF.agg(approx_count_distinct(col(“division”), 0.15).alias(“divisionDistinct”))
D. storesDF.agg(approx_count_distinct(col(“division”), 0.0).alias(“divisionDistinct”))
E. storesDF.agg(approx_count_distinct(col(“division”), 0.05).alias(“divisionDistinct”))
Selected Answer: C
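Example (Scala sketch): the second argument to approx_count_distinct is the maximum allowed relative standard deviation (rsd, default 0.05); a larger rsd generally trades accuracy for speed.
import org.apache.spark.sql.functions.{approx_count_distinct, col}
// Looser error bound (0.15) than the default, so the estimate is computed faster.
val approxDF = storesDF.agg(
  approx_count_distinct(col("division"), 0.15).alias("divisionDistinct")
)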
Question #: 135
Topic #: 1
Which of the following code blocks returns a new DataFrame where column division is the first two characters of column division in DataFrame storesDF?
A. storesDF.withColumn(“division”, substr(col(“division”), 0, 2))
B. storesDF.withColumn(“division”, susbtr(col(“division”), 1, 2))
C. storesDF.withColumn(“division”, col(“division”).substr(0, 3))
D. storesDF.withColumn(“division”, col(“division”).substr(0, 2))
E. storesDF.withColumn(“division”, col(“division”).substr(1, 2))
Selected Answer: D
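Example (Scala sketch): assuming division is a string column; Column.substr positions are effectively 1-based, and a start position of 0 behaves the same as 1.
import org.apache.spark.sql.functions.col
// Overwrite division with its first two characters.
val trimmedDF = storesDF.withColumn("division", col("division").substr(0, 2))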
Question #: 133
Topic #: 1
Which of the following Spark properties is used to configure whether DataFrames found to be below a certain size threshold at runtime will be automatically broadcasted?
A. spark.sql.broadcastTimeout
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.shuffle.partitions
D. spark.sql.inMemoryColumnarStorage.batchSize
E. spark.sql.adaptive.localShuffleReader.enabled
Selected Answer: B
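Example (Scala sketch): adjusting the property at runtime; 10485760 bytes (10 MB) is the documented default, and -1 disables automatic broadcast joins.
// Tables below this size (in bytes) are broadcast automatically when joined.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
// Disable automatic broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")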
Question #: 73
Topic #: 1
Which of the following code blocks returns a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000?
A sample of DataFrame storesDF is below:
A. storesDF.na.fill(30000, Seq(“sqft”))
B. storesDF.nafill(30000, col(“sqft”))
C. storesDF.na.fill(30000, col(“sqft”))
D. storesDF.fillna(30000, col(“sqft”))
E. storesDF.na.fill(30000, “sqft”)
Selected Answer: E
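Example (Scala sketch): assuming storesDF has a numeric column sqft containing nulls, the na.fill variant that targets specific columns takes a Seq of column names.
// Replace nulls in sqft with 30000; all other columns are left untouched.
val filledDF = storesDF.na.fill(30000, Seq("sqft"))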
Question #: 144
Topic #: 1
Which of the following code blocks returns a new DataFrame that is the result of a cross join between DataFrame storesDF and DataFrame employeesDF?
A. storesDF.crossJoin(employeesDF)
B. storesDF.join(employeesDF, “storeId”, “cross”)
C. crossJoin(storesDF, employeesDF)
D. join(storesDF, employeesDF, “cross”)
E. storesDF.join(employeesDF, “cross”)
Selected Answer: A
Question #: 137
Topic #: 1
The code block shown below should return a new DataFrame where rows in DataFrame storesDF with missing values in every column have been dropped. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__.__2__(__3__ = __4__)
A. 1. na
2. drop
3. how
4. “any”
B. 1. na
2. drop
3. subset
4. “all”
C. 1. na
2. drop
3. subset
4. “any”
D. 1. na
2. drop
3. how
4. “all”
E. 1. drop
2. na
3. how
4. “all”
Selected Answer: D
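Example (Scala sketch): how = “all” drops a row only when every column in it is null, whereas “any” drops a row as soon as a single column is null.
// Remove rows that are entirely null.
val cleanedDF = storesDF.na.drop(how = "all")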
Question #: 84
Topic #: 1
The code block shown below should use SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__(“stores”)
__3__.__4__(“SELECT storeId, managerName FROM stores”)
A. 1. spark
2. createOrReplaceTempView
3. storesDF
4. query
B. 1. spark
2. createTable
3. storesDF
4. sql
C. 1. storesDF
2. createOrReplaceTempView
3. spark
4. query
D. 1. spark
2. createOrReplaceTempView
3. storesDF
4. sql
E. 1. storesDF
2. createOrReplaceTempView
3. spark
4. sql
Selected Answer: E
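Example (Scala sketch): assuming storesDF contains columns storeId and managerName, the selected flow registers a temporary view and then queries it with SQL.
// Register the DataFrame under a name that SQL statements can reference.
storesDF.createOrReplaceTempView("stores")
val managersDF = spark.sql("SELECT storeId, managerName FROM stores")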
Question #: 2
Topic #: 1
Which of the following describes the relationship between nodes and executors?
A. Executors and nodes are not related.
B. A node is a processing engine running on an executor.
C. An executor is a processing engine running on a node.
D. There are always the same number of executors and nodes.
E. There are always more nodes than executors.
Selected Answer: C
Question #: 169
Topic #: 1
Which of the following code blocks returns a DataFrame with column storeSlogan where single quotes in column storeSlogan in DataFrame storesDF have been replaced with double quotes?
A sample of DataFrame storesDF is below:
A. storesDF.withColumn(“storeSlogan”, col(“storeSlogan”).regexp_replace(“’”, “\””))
B. storesDF.withColumn(“storeSlogan”, regexp_replace(col(“storeSlogan”), “’”))
C. storesDF.withColumn(“storeSlogan”, regexp_replace(col(“storeSlogan”), “’”, “\””))
D. storesDF.withColumn(“storeSlogan”, regexp_replace(“storeSlogan”, “’”, “\””))
E. storesDF.withColumn(“storeSlogan”, regexp_extract(col(“storeSlogan”), “’”, “\””))
Selected Answer: C
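Example (Scala sketch): assuming storeSlogan is a string column; the single-quote pattern and the escaped double-quote replacement are both plain literals here.
import org.apache.spark.sql.functions.{col, regexp_replace}
// Replace every single quote in storeSlogan with a double quote.
val updatedDF = storesDF.withColumn(
  "storeSlogan",
  regexp_replace(col("storeSlogan"), "'", "\"")
)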
Question #: 153
Topic #: 1
Which of the following code blocks returns a new DataFrame with column storeReview where the pattern “End” has been removed from the end of column storeReview in DataFrame storesDF?
A sample DataFrame storesDF is below:
A. storesDF.withColumn(“storeReview”, col(“storeReview”).regexp_replace(” End$”, “”))
B. storesDF.withColumn(“storeReview”, regexp_replace(col(“storeReview”), ” End$”, “”))
C. storesDF.withColumn(“storeReview”, regexp_replace(col(“storeReview”), ” End$”))
D. storesDF.withColumn(“storeReview”, regexp_replace(“storeReview”, ” End$”, “”))
E. storesDF.withColumn(“storeReview”, regexp_extract(col(“storeReview”), ” End$”, “”))
Selected Answer: B
Question #: 180
Topic #: 1
Which of the following code blocks returns a DataFrame where rows in DataFrame storesDF containing missing values in every column have been dropped?
A. storesDF.na.drop()
B. storesDF.dropna()
C. storesDF.na.drop(“all”, subset = “sqft”)
D. storesDF.na.drop(“all”)
E. storesDF.nadrop(“all”)
Selected Answer: D
Question #: 111
Topic #: 1
Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with “a” and “b” respectively, to specify two key columns column1 and column2?
A. joinExprs = col(“a.column1”) === col(“b.column1”) and col(“a.column2”) === col(“b.column2”)
B. usingColumns = Seq(col(“column1”), col(“column2”))
C. All of these options can be used to perform an inner join with two key columns.
D. joinExprs = storesDF(“column1”) === employeesDF(“column1”) and storesDF(“column2”) === employeesDF(“column2”)
E. usingColumns = Seq(“column1”, “column2”)
Selected Answer: D
Question #: 122
Topic #: 1
The code block shown below should read a JSON at the file path filePath into a DataFrame with the specified schema schema. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__.__3__(__4__).format(“json”).__5__(__6__)
A. 1. spark
2. read()
3. schema
4. schema
5. json
6. filePath
B. 1. spark
2. read()
3. json
4. filePath
5. format
6. schema
C. 1. spark
2. read()
3. schema
4. schema
5. load
6. filePath
D. 1. spark
2. read
3. schema
4. schema
5. load
6. filePath
E. 1. spark
2. read
3. format
4. “json”
5. load
6. filePath
Selected Answer: D
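Example (Scala sketch): reading JSON with an explicit schema, assuming filePath is a path string; the StructType shown is only an illustrative schema.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(Seq(
  StructField("storeId", IntegerType),
  StructField("division", StringType)
))
// Apply the schema up front so Spark skips schema inference.
val df = spark.read.schema(schema).format("json").load(filePath)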
Question #: 52
Topic #: 1
Which of the following operations can perform an outer join on two DataFrames?
A. DataFrame.crossJoin()
B. Standalone join() function
C. DataFrame.outerJoin()
D. DataFrame.join()
E. DataFrame.merge()
Selected Answer: D
Question #: 155
Topic #: 1
Which of the following operations calculates the simple average of a group of values, like a column?
A. simpleAvg()
B. mean()
C. agg()
D. average()
E. approxMean()
Selected Answer: B
Question #: 175
Topic #: 1
Which of the following code blocks returns a DataFrame containing a column month, an integer representation of the month from column openDate from DataFrame storesDF?
Note that column openDate is of type integer and represents a date in the UNIX epoch format — the number of seconds since midnight on January 1st, 1970.
A sample of storesDF is displayed below:
A. storesDF.withColumn(“month”, getMonth(col(“openDate”)))
B. storesDF.withColumn(“month”, substr(col(“openDate”), 4, 2))
C. (storesDF.withColumn(“openDateFormat”, col(“openDate”).cast(“Date”))
.withColumn(“month”, month(col(“openDateFormat”))))
D. (storesDF.withColumn(“openTimestamp”, col(“openDate”).cast(“Timestamp”))
.withColumn(“month”, month(col(“openTimestamp”))))
E. storesDF.withColumn(“month”, month(col(“openDate”)))
Selected Answer: A
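Example (Scala sketch): assuming openDate holds UNIX epoch seconds, one working approach is to cast the integer to a timestamp and then apply the month function.
import org.apache.spark.sql.functions.{col, month}
val withMonthDF = storesDF
  .withColumn("openTimestamp", col("openDate").cast("Timestamp"))
  .withColumn("month", month(col("openTimestamp")))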
Question #: 174
Topic #: 1
Which of the following operations will always return a new DataFrame with updated partitions from DataFrame storesDF by inducing a shuffle?
A. storesDF.coalesce()
B. storesDF.rdd.getNumPartitions()
C. storesDF.repartition()
D. storesDF.union()
E. storesDF.intersect()
Selected Answer: D
Question #: 173
Topic #: 1
Which of the following code blocks attempts to cache the partitions of DataFrame storesDF only in Spark’s memory?
A. storesDF.cache(StorageLevel.MEMORY_ONLY).count()
B. storesDF.persist().count()
C. storesDF.cache().count()
D. storesDF.persist(StorageLevel.MEMORY_ONLY).count()
E. storesDF.persist(“MEMORY_ONLY”).count()
Selected Answer: D
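Example (Scala sketch): persist with StorageLevel.MEMORY_ONLY marks the DataFrame for in-memory-only caching; because caching is lazy, the count() action is what actually materializes it.
import org.apache.spark.storage.StorageLevel
storesDF.persist(StorageLevel.MEMORY_ONLY)
storesDF.count() // action that triggers the caching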
Question #: 172
Topic #: 1
The code block shown below contains an error. The code block is intended to create and register a SQL UDF named “ASSESS_PERFORMANCE” using the Python function assessPerformance() and apply it to column customerSatisfaction in table stores. Identify the error.
Code block:
spark.udf.register(“ASSESS_PERFORMANCE”, assessPerformance)
spark.sql(“SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores”)
A. There is no sql() operation — the DataFrame API must be used to apply the UDF assessPerformance().
B. The order of the arguments to spark.udf.register() should be reversed.
C. The customerSatisfaction column cannot be called twice inside the SQL statement.
D. Registered UDFs cannot be applied inside of a SQL statement.
E. The wrong SQL function is used to compute column result — it should be ASSESS_PERFORMANCE instead of assessPerformance.
Selected Answer: A
Question #: 171
Topic #: 1
The code block shown below should print the schema of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__(__3__)
A. 1. storesDF
2. schema
3. Nothing
B. 1. storesDF
2. str
3. schema
C. 1. storesDF
2. printSchema
3. True
D. 1. storesDF
2. printSchema
3. Nothing
E. 1. storesDF
2. printSchema
3. “all”
Selected Answer: D
Question #: 170
Topic #: 1
Which of the following operations can be used to rename and replace an existing column in a DataFrame?
A. DataFrame.renamedColumn()
B. DataFrame.withColumnRenamed()
C. DataFrame.withColumn()
D. col()
E. DataFrame.newColumn()
Selected Answer: C
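Example (Scala sketch): contrasting the two operations most often confused here, assuming storesDF has a string column division.
import org.apache.spark.sql.functions.{col, upper}
// Rename an existing column without changing its values.
val renamedDF = storesDF.withColumnRenamed("division", "divisionName")
// Replace (overwrite) an existing column with a new expression.
val replacedDF = storesDF.withColumn("division", upper(col("division")))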
Question #: 168
Topic #: 1
The code block shown below should return a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 AND the value in column customerSatisfaction is greater than or equal to 30. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__(__2__ __3__ __4__)
A. 1. filter
2. (col(“sqft”) <= 25000)
3. &
4. (col("customerSatisfaction") >= 30)
B. 1. filter
2. (col(“sqft”) <= 25000
3. &
4. col("customerSatisfaction") >= 30
C. 1. filter
2. (col(“sqft”) <= 25000)
3. and
4. (col("customerSatisfaction") >= 30)
D. 1. drop
2. (col(sqft) <= 25000)
3. &
4. (col(customerSatisfaction) >= 30)
E. 1. filter
2. col(“sqft”) <= 25000
3. and
4. col("customerSatisfaction") >= 30
Selected Answer: C
Question #: 167
Topic #: 1
Which of the following describes executors?
A. Executors are the communication pathways from the driver node to the worker nodes.
B. Executors are the most granular level of execution in the Spark execution hierarchy.
C. Executors always have a one-to-one relationship with worker nodes.
D. Executors are synonymous with worker nodes.
E. Executors are processing engine instances for performing data computations which run on a worker node.
Selected Answer: C