Certified Associate Developer for Apache Spark Topic 1
Question #: 30
Topic #: 1
Which of the following operations fails to return a DataFrame with no duplicate rows?
A. DataFrame.dropDuplicates()
B. DataFrame.distinct()
C. DataFrame.drop_duplicates()
D. DataFrame.drop_duplicates(subset = None)
E. DataFrame.drop_duplicates(subset = "all")
Selected Answer: E
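For study purposes, a minimal PySpark sketch (df and its contents are hypothetical) of why A–D deduplicate while E fails: distinct(), dropDuplicates(), and drop_duplicates() are aliases, and subset=None means "consider every column", but subset otherwise expects column names, so the string "all" only resolves if a column literally named all exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])  # hypothetical data

df.distinct().count()                    # 2 rows – full-row deduplication
df.dropDuplicates().count()              # 2 rows – same behavior
df.drop_duplicates(subset=None).count()  # 2 rows – subset=None considers all columns
# df.drop_duplicates(subset="all")       # raises an error – "all" is not a list of existing column names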
Question #: 44
Topic #: 1
The code block shown below should create a single-column DataFrame from Python list years which is made up of integers. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
_1_._2_(_3_, _4_)
A. 1. spark
2. createDataFrame
3. years
4. IntegerType
B. 1. DataFrame
2. create
3. [years]
4. IntegerType
C. 1. spark
2. createDataFrame
3. [years]
4. IntegerType
D. 1. spark
2. createDataFrame
3. [years]
4. IntegerType()
E. 1. spark
2. createDataFrame
3. years
4. IntegerType()
Selected Answer: E
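The pattern the correct response assembles, as a runnable sketch (the contents of years are hypothetical): createDataFrame() is called on the SparkSession, takes the bare list, and requires an instantiated type, hence IntegerType() with parentheses.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
years = [2019, 2020, 2021]  # hypothetical input list

df = spark.createDataFrame(years, IntegerType())  # single integer column named "value"
df.printSchema()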
Question #: 43
Topic #: 1
The code block shown below contains an error. The code block is intended to use SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF. Identify the error.
Code block:
storesDF.createOrReplaceTempView("stores")
storesDF.sql("SELECT storeId, managerName FROM stores")
A. The createOrReplaceTempView() operation does not make a DataFrame accessible via SQL.
B. The sql() operation should be accessed via the spark variable rather than DataFrame storesDF.
C. There is no sql() operation in DataFrame storesDF – the query() operation should be used instead.
D. This cannot be accomplished using SQL – the DataFrame API should be used instead.
E. The createOrReplaceTempView() operation should be accessed via the spark variable rather than DataFrame storesDF.
Selected Answer: B
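A minimal sketch of the corrected code (the storesDF contents are hypothetical): sql() hangs off the SparkSession, not the DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, "Kaylee")], ["storeId", "managerName"])  # hypothetical

storesDF.createOrReplaceTempView("stores")  # DataFrame method: registers the view
resultDF = spark.sql("SELECT storeId, managerName FROM stores")  # SparkSession method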
Question #: 42
Topic #: 1
The code block shown below contains an error. The code block is intended to create a Python UDF assessPerformanceUDF() using the integer-returning Python function assessPerformance() and apply it to column customerSatisfaction in DataFrame storesDF. Identify the error.
Code block:
assessPerformanceUDF = udf(assessPerformance)
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
A. The assessPerformance() operation is not properly registered as a UDF.
B. The withColumn() operation is not appropriate here – UDFs should be applied by iterating over rows instead.
C. UDFs can only be applied via SQL and not through the DataFrame API.
D. The return type of the assessPerformanceUDF() is not specified in the udf() operation.
E. The assessPerformance() operation should be used on column customerSatisfaction rather than the assessPerformanceUDF() operation.
Selected Answer: D
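A sketch of the corrected registration (the assessPerformance() body is hypothetical): udf() defaults to a StringType return, so the integer return type must be passed explicitly.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, 87)], ["storeId", "customerSatisfaction"])  # hypothetical

def assessPerformance(satisfaction):  # hypothetical integer-returning function
    return 1 if satisfaction >= 80 else 0

assessPerformanceUDF = udf(assessPerformance, IntegerType())  # return type specified
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))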
Question #: 41
Topic #: 1
The code block shown below should create and register a SQL UDF named "ASSESS_PERFORMANCE" using the Python function assessPerformance() and apply it to column customerSatisfaction in table stores. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
spark._1_._2_(_3_, _4_)
spark.sql("SELECT customerSatisfaction, _5_(customerSatisfaction) AS result FROM stores")
A. 1. udf
2. register
3. "ASSESS_PERFORMANCE"
4. assessPerformance
5. ASSESS_PERFORMANCE
B. 1. udf
2. register
3. assessPerformance
4. "ASSESS_PERFORMANCE"
5. "ASSESS_PERFORMANCE"
C. 1. udf
2. register
3. "ASSESS_PERFORMANCE"
4. assessPerformance
5. "ASSESS_PERFORMANCE"
D. 1. register
2. udf
3. "ASSESS_PERFORMANCE"
4. assessPerformance
5. "ASSESS_PERFORMANCE"
E. 1. udf
2. register
3. ASSESS_PERFORMANCE
4. assessPerformance
5. ASSESS_PERFORMANCE
Selected Answer: A
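The completed code as a runnable sketch (the function body and stores table are hypothetical): the registration name is a quoted string, the second argument is the bare function, and inside the SQL string the UDF name needs no quotes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(87,)], ["customerSatisfaction"]).createOrReplaceTempView("stores")  # hypothetical table

def assessPerformance(satisfaction):  # hypothetical function
    return 1 if satisfaction >= 80 else 0

spark.udf.register("ASSESS_PERFORMANCE", assessPerformance)  # a return type may be passed as a third argument
spark.sql("SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores")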
Question #: 39
Topic #: 1
Which of the following code blocks applies the function assessPerformance() to each row of DataFrame storesDF?
A. [assessPerformance(row) for row in storesDF.take(3)]
B. [assessPerformance() for row in storesDF]
C. storesDF.collect().apply(lambda: assessPerformance)
D. [assessPerformance(row) for row in storesDF.collect()]
E. [assessPerformance(row) for row in storesDF]
Selected Answer: D
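A sketch of the selected pattern (assessPerformance() is hypothetical): collect() materializes every row as a Row object on the driver, so a plain list comprehension reaches all rows, whereas take(3) would cover only the first three.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["storeId"])  # hypothetical

def assessPerformance(row):  # hypothetical per-row function
    return row["storeId"] * 10

results = [assessPerformance(row) for row in storesDF.collect()]  # applied to all four rows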
Question #: 27
Topic #: 1
Which of the following code blocks returns a new DataFrame with column storeDescription where the pattern "Description: " has been removed from the beginning of column storeDescription in DataFrame storesDF?
A sample of DataFrame storesDF is below:
A. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: "))
B. storesDF.withColumn("storeDescription", col("storeDescription").regexp_replace("^Description: ", ""))
C. storesDF.withColumn("storeDescription", regexp_extract(col("storeDescription"), "^Description: ", ""))
D. storesDF.withColumn("storeDescription", regexp_replace("storeDescription", "^Description: ", ""))
E. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))
Selected Answer: E
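A runnable sketch of the selected answer (the sample text is hypothetical): regexp_replace() from pyspark.sql.functions takes the column, the pattern, and the replacement; omitting the replacement, as option A does, raises a TypeError.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([("Description: cozy corner store",)], ["storeDescription"])  # hypothetical

storesDF.withColumn(
    "storeDescription",
    regexp_replace(col("storeDescription"), "^Description: ", "")  # column, pattern, replacement
)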
Question #: 26
Topic #: 1
Which of the following code blocks returns a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF?
A sample of storesDF is displayed below:
A. storesDF.withColumn("productCategories", explode(col("productCategories")))
B. storesDF.withColumn("productCategories", split(col("productCategories")))
C. storesDF.withColumn("productCategories", col("productCategories").explode())
D. storesDF.withColumn("productCategories", col("productCategories").split())
E. storesDF.withColumn("productCategories", explode("productCategories"))
Selected Answer: A
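The selected answer as a sketch (the array contents are hypothetical): explode() emits one output row per array element, which is what multiplies the row count.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, ["produce", "bakery", "dairy"])], ["storeId", "productCategories"])  # hypothetical

storesDF.withColumn("productCategories", explode(col("productCategories")))  # 3 rows out of 1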
Question #: 85
Topic #: 1
The code block shown below contains an error. The code block is intended to create a single-column DataFrame from Scala List years which is made up of integers. Identify the error.
Code block:
spark.createDataset(years)
A. The years list should be wrapped in another list like List(years) to make clear that it is a column rather than a row.
B. The data type is not specified – the second argument to createDataset should be IntegerType.
C. There is no operation createDataset – the createDataFrame operation should be used instead.
D. The result of the above is a Dataset rather than a DataFrame – the toDF operation must be called at the end.
E. The column name must be specified as the second argument to createDataset.
Selected Answer: C
Question #: 25
Topic #: 1
Which of the following code blocks returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory?
A sample of DataFrame storesDF is displayed below:
A. (storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[1])
.withColumn("storeSizeCategory", split(col("storeCategory"), "_")[2]))
B. (storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[0])
.withColumn("storeSizeCategory", col("storeCategory").split("_")[1]))
C. (storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
.withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
D. (storesDF.withColumn("storeValueCategory", split("storeCategory", "_")[0])
.withColumn("storeSizeCategory", split("storeCategory", "_")[1]))
E. (storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[1])
.withColumn("storeSizeCategory", col("storeCategory").split("_")[2]))
Selected Answer: C
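A runnable sketch of the selected answer (the sample value is hypothetical): split() comes from pyspark.sql.functions, and the resulting array is zero-indexed, which rules out options A and E.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([("value_medium",)], ["storeCategory"])  # hypothetical

(storesDF
    .withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])  # "value"
    .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))  # "medium"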
Question #: 24
Topic #: 1
The code block shown below should return a new DataFrame from DataFrame storesDF where column modality is the constant string "PHYSICAL". Assume DataFrame storesDF is the only defined variable. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF._1_(_2_, _3_(_4_))
A. 1. withColumn
2. "modality"
3. col
4. "PHYSICAL"
B. 1. withColumn
2. "modality"
3. lit
4. PHYSICAL
C. 1. withColumn
2. "modality"
3. lit
4. "PHYSICAL"
D. 1. withColumn
2. "modality"
3. StringType
4. "PHYSICAL"
E. 1. newColumn
2. modality
3. StringType
4. PHYSICAL
Selected Answer: C
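A minimal sketch of the selected answer (the storesDF contents are hypothetical): lit() wraps a Python constant as a Column, which withColumn() requires; col() would instead look up a nonexistent column named PHYSICAL.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1,)], ["storeId"])  # hypothetical

storesDF.withColumn("modality", lit("PHYSICAL"))  # constant string column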
Question #: 23
Topic #: 1
Which of the following code blocks returns a new DataFrame with a new column employeesPerSqft that is the quotient of column numberOfEmployees and column sqft, both of which are from DataFrame storesDF? Note that column employeesPerSqft is not in the original DataFrame storesDF.
A. storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))
B. storesDF.withColumn("employeesPerSqft", "numberOfEmployees" / "sqft")
C. storesDF.select("employeesPerSqft", "numberOfEmployees" / "sqft")
D. storesDF.select("employeesPerSqft", col("numberOfEmployees") / col("sqft"))
E. storesDF.withColumn(col("employeesPerSqft"), col("numberOfEmployees") / col("sqft"))
Selected Answer: A
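The selected answer as a sketch (numbers hypothetical): withColumn() adds the new column; select() cannot reference employeesPerSqft because that column does not exist yet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(10, 2000)], ["numberOfEmployees", "sqft"])  # hypothetical

storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))  # 0.005 for this row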
Question #: 22
Topic #: 1
Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column storeId is of the type string?
A. storesDF.withColumn("storeId", cast(col("storeId"), StringType()))
B. storesDF.withColumn("storeId", col("storeId").cast(StringType()))
C. storesDF.withColumn("storeId", cast(storeId).as(StringType))
D. storesDF.withColumn("storeId", col(storeId).cast(StringType))
E. storesDF.withColumn("storeId", cast("storeId").as(StringType()))
Selected Answer: B
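A runnable sketch of the selected answer (the storesDF contents are hypothetical): cast() is a method on Column and takes an instantiated type; the string shorthand cast("string") also works.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1,)], ["storeId"])  # hypothetical

storesDF.withColumn("storeId", col("storeId").cast(StringType()))  # storeId is now a string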
Question #: 21
Topic #: 1
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30?
A. storesDF.filter(col("sqft") <= 25000 | col("customerSatisfaction") >= 30)
B. storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)
C. storesDF.filter(sqft <= 25000 or customerSatisfaction >= 30)
D. storesDF.filter(col(sqft) <= 25000 | col(customerSatisfaction) >= 30)
E. storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))
Selected Answer: E
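The selected answer as a sketch (data hypothetical): Column comparisons combine with |, and each comparison must be wrapped in parentheses because | binds more tightly than <= and >=; Python's or does not work on Columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(20000, 10), (40000, 35), (40000, 10)], ["sqft", "customerSatisfaction"])  # hypothetical

storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))  # keeps the first two rows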
Question #: 40
Topic #: 1
The code block shown below contains an error. The code block is intended to print the schema of DataFrame storesDF. Identify the error.
Code block:
storesDF.printSchema
A. There is no printSchema member of DataFrame – schema and the print() function should be used instead.
B. The entire line needs to be a string – it should be wrapped by str().
C. There is no printSchema member of DataFrame – the getSchema() operation should be used instead.
D. There is no printSchema member of DataFrame – the schema() operation should be used instead.
E. The printSchema member of DataFrame is an operation and needs to be followed by parentheses.
Selected Answer: E
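For completeness, a sketch (storesDF hypothetical): printSchema is a method, so it needs parentheses to execute; the bare storesDF.printSchema merely references the method object.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1,)], ["storeId"])  # hypothetical

storesDF.printSchema()  # prints the schema tree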
Question #: 36
Topic #: 1
Which of the following code blocks fails to return a DataFrame reverse sorted alphabetically based on column division?
A. storesDF.orderBy("division", ascending = False)
B. storesDF.orderBy(["division"], ascending = [0])
C. storesDF.orderBy(col("division").asc())
D. storesDF.sort("division", ascending = False)
E. storesDF.sort(desc("division"))
Selected Answer: C
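A sketch of the working variants (division values hypothetical): option C fails because asc() sorts ascending; a reverse alphabetical sort needs desc(), the Column desc() method, or ascending=False.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([("east",), ("west",)], ["division"])  # hypothetical

storesDF.sort(desc("division"))                # reverse alphabetical
storesDF.orderBy(col("division").desc())       # equivalent
storesDF.orderBy("division", ascending=False)  # equivalent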
Question #: 35
Topic #: 1
Which of the following code blocks returns a collection of summary statistics for all columns in DataFrame storesDF?
A. storesDF.summary("mean")
B. storesDF.describe(all = True)
C. storesDF.describe("all")
D. storesDF.summary("all")
E. storesDF.describe()
Selected Answer: E
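A sketch contrasting the passing options (data hypothetical): describe() with no arguments covers every column; summary() with no arguments returns the same statistics plus quartiles, while summary("mean") would return only the mean row.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, 10000.0), (2, 12000.0)], ["storeId", "sqft"])  # hypothetical

storesDF.describe().show()  # count, mean, stddev, min, max for all columns
storesDF.summary().show()   # the same statistics plus 25%, 50%, 75%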
Question #: 33
Topic #: 1
Which of the following operations can be used to return the number of rows in a DataFrame?
A. DataFrame.numberOfRows()
B. DataFrame.n()
C. DataFrame.sum()
D. DataFrame.count()
E. DataFrame.countDistinct()
Selected Answer: D
Question #: 29
Topic #: 1
The code block shown contains an error. The code block is intended to return a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000. Identify the error.
A sample of DataFrame storesDF is displayed below:
Code block:
storesDF.na.fill(30000, col("sqft"))
A. The argument to the subset parameter of fill() should be a string column name or a list of string column names rather than a Column object.
B. The na.fill() operation does not work and should be replaced by the dropna() operation.
C. The argument to the subset parameter of fill() should be the numerical position of the column rather than a Column object.
D. The na.fill() operation does not work and should be replaced by the nafill() operation.
E. The na.fill() operation does not work and should be replaced by the fillna() operation.
Selected Answer: A
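The corrected call as a sketch (data hypothetical): the subset argument is a column-name string or a list of such strings, never a Column object.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, None), (2, 12000)], ["storeId", "sqft"])  # hypothetical

storesDF.na.fill(30000, "sqft")    # a string column name works
storesDF.na.fill(30000, ["sqft"])  # so does a list of names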
Question #: 18
Topic #: 1
Which of the following operations can be used to create a DataFrame with a subset of columns from DataFrame storesDF that are specified by name?
A. storesDF.subset()
B. storesDF.select()
C. storesDF.selectColumn()
D. storesDF.filter()
E. storesDF.drop()
Selected Answer: B
Question #: 12
Topic #: 1
Which of the following statements about Spark’s stability is incorrect?
A. Spark is designed to support the loss of any set of worker nodes.
B. Spark will rerun any failed tasks due to failed worker nodes.
C. Spark will recompute data cached on failed worker nodes.
D. Spark will spill data to disk if it does not fit in memory.
E. Spark will reassign the driver to a worker node if the driver’s node fails.
Selected Answer: E
Question #: 3
Topic #: 1
Which of the following will occur if there are more slots than there are tasks?
A. The Spark job will likely not run as efficiently as possible.
B. The Spark application will fail – there must be at least as many tasks as there are slots.
C. Some executors will shut down and allocate all slots on larger executors first.
D. More tasks will be automatically generated to ensure all slots are being used.
E. The Spark job will use just one single slot to perform all tasks.
Selected Answer: A
Question #: 147
Topic #: 1
The code block shown below should read a CSV at the file path filePath into a DataFrame with the specified schema schema. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__.__3__(__4__).format("csv").__5__(__6__)
A. 1. spark
2. read()
3. schema
4. schema
5. json
6. filePath
B. 1. spark
2. read()
3. schema
4. schema
5. load
6. filePath
C. 1. spark
2. read
3. format
4. "json"
5. load
6. filePath
D. 1. spark
2. read()
3. json
4. filePath
5. format
6. schema
E. 1. spark
2. read
3. schema
4. schema
5. load
6. filePath
Selected Answer: E
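The completed reader chain as a runnable sketch (the schema and filePath are hypothetical): read is a property, so no parentheses, and load() takes the path once the schema and format are set.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("storeId", IntegerType()),
                     StructField("managerName", StringType())])  # hypothetical schema
filePath = "/tmp/stores.csv"  # hypothetical path

df = spark.read.schema(schema).format("csv").load(filePath)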
Question #: 152
Topic #: 1
The code block shown below should return a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
A sample of storesDF is displayed below:
Code block:
storesDF.__1__(__2__, __3__(__4__(__5__)))
A. 1. newColumn
2. "productCategories"
3. col
4. split
5. "productCategories"
B. 1. withColumn
2. "productCategory"
3. split
4. col
5. "productCategories"
C. 1. withColumn
2. "productCategory"
3. explode
4. col
5. "productCategories"
D. 1. newColumn
2. "productCategory"
3. explode
4. col
5. "productCategories"
E. 1. withColumn
2. "productCategories"
3. explode
4. col
5. "productCategories"
Selected Answer: E
Question #: 160
Topic #: 1
The code block shown below contains an error. The code block is intended to create a single-column DataFrame from Python list years which is made up of integers. Identify the error.
Code block:
spark.createDataFrame(years, IntegerType)
A. The column name must be specified.
B. The years list should be wrapped in another list like [years] to make clear that it is a column rather than a row.
C. There is no createDataFrame operation in spark.
D. The IntegerType call must be followed by parentheses.
E. The IntegerType call should not be present – Spark can tell that list years is full of integers.
Selected Answer: D
Question #: 176
Topic #: 1
The code block shown below should read a Parquet file at the file path filePath into a DataFrame. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__.__3__(__4__)
A. 1. spark
2. read()
3. parquet
4. filePath
B. 1. spark
2. read()
3. load
4. filePath
C. 1. spark
2. read
3. load
4. filePath, source = "parquet"
D. 1. storesDF
2. read()
3. load
4. filePath
E. 1. spark
2. read
3. load
4. filePath
Selected Answer: E
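The completed chain as a sketch (filePath hypothetical): read is a property, and load() with no format specified defaults to Parquet; spark.read.parquet(filePath) is the explicit equivalent.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
filePath = "/tmp/stores.parquet"  # hypothetical path

df = spark.read.load(filePath)     # Parquet is the default source format
df = spark.read.parquet(filePath)  # explicit equivalent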
Question #: 15
Topic #: 1
A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?
A. Either DataFrame can be broadcasted. Their results will be identical in result and efficiency.
B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
C. DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of DataFrame B.
D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
E. DataFrame A should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
Selected Answer: B
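A hedged sketch of the idea (tiny stand-in DataFrames): marking the small side with broadcast() ships a full copy of it to every executor, so the join proceeds without shuffling the large side.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
dfA = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "a"])  # stands in for the 128 GB side
dfB = spark.createDataFrame([(1, "small")], ["id", "b"])        # stands in for the 1 GB side

dfA.join(broadcast(dfB), "id")  # dfB is replicated to all executors; dfA stays in place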
Question #: 178
Topic #: 1
Which of the following describes why garbage collection in Spark is important?
A. Logical results will be incorrect if inaccurate data is not collected and removed from the Spark job.
B. Spark jobs will fail or run slowly if inaccurate data is not collected and removed from the Spark job.
C. Spark jobs will fail or run slowly if memory is not available for new objects to be created.
D. Spark jobs will produce inaccurate results if there are too many different transformations called before a single action.
E. Spark jobs will produce inaccurate results if memory is not available for new tasks to run and complete.
Selected Answer: C
Question #: 105
Topic #: 1
The code block shown below contains an error. The code block is intended to create and register a SQL UDF named “ASSESS_PERFORMANCE” using the Scala function assessPerformance() and apply it to column customerSatisfaction in the table stores. Identify the error.
Code block:
spark.udf.register("ASSESS_PERFORMANCE", assessPerformance)
spark.sql("SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores")
A. The customerSatisfaction column cannot be called twice inside the SQL statement.
B. Registered UDFs cannot be applied inside of a SQL statement.
C. The order of the arguments to spark.udf.register() should be reversed.
D. The wrong SQL function is used to compute column result – it should be ASSESS_PERFORMANCE instead of assessPerformance.
E. There is no sql() operation – the DataFrame API must be used to apply the UDF assessPerformance().
Selected Answer: D
Question #: 70
Topic #: 1
The code block shown below contains an error. The code block is intended to return a new DataFrame where column managerName from DataFrame storesDF is split at the space character into column managerFirstName and column managerLastName. Identify the error.
A sample of DataFrame storesDF is displayed below:
Code block:
storesDF.withColumn("managerFirstName", col("managerName").split(" ").getItem(0))
.withColumn("managerLastName", col("managerName").split(" ").getItem(1))
A. The index values of 0 and 1 are not correct – they should be 1 and 2, respectively.
B. The index values of 0 and 1 should be provided as second arguments to the split() operation rather than indexing the result.
C. The split() operation comes from the imported functions object. It accepts a string column name and split character as arguments. It is not a method of a Column object.
D. The split() operation comes from the imported functions object. It accepts a Column object and split character as arguments. It is not a method of a Column object.
E. The withColumn operation cannot be called twice in a row.
Selected Answer: C
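The corrected version as a sketch (the sample name is hypothetical): split() is a function from pyspark.sql.functions, not a Column method; getItem() then picks the array elements.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([("River Song",)], ["managerName"])  # hypothetical

(storesDF
    .withColumn("managerFirstName", split(col("managerName"), " ").getItem(0))
    .withColumn("managerLastName", split(col("managerName"), " ").getItem(1)))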
Question #: 142
Topic #: 1
The code block shown below should adjust the number of partitions used in wide transformations like join() to 32. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__(__2__, __3__)
A. 1. spark.conf.get
2. "spark.sql.shuffle.partitions"
3. "32"
B. 1. spark.conf.set
2. "spark.default.parallelism"
3. 32
C. 1. spark.conf.text
2. "spark.default.parallelism"
3. "32"
D. 1. spark.conf.set
2. "spark.default.parallelism"
3. "32"
E. 1. spark.conf.set
2. "spark.sql.shuffle.partitions"
3. "32"
Selected Answer: D
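For reference, a sketch of the setting at issue: spark.sql.shuffle.partitions governs the partition count for DataFrame wide transformations such as join(), whereas spark.default.parallelism applies to RDD operations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "32")  # partitions used by shuffles in wide transformations
spark.conf.get("spark.sql.shuffle.partitions")        # returns "32"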