Debug Apache Spark jobs running in HDInsight

This article shows you how to track and debug Apache Spark jobs running on HDInsight clusters using the Apache Hadoop YARN UI, the Spark UI, and the Spark History Server. You start a Spark job using the notebook available with the Spark cluster, Machine Learning: Predictive analysis on food inspection data using MLLib. You can also use the following steps to track an application that you submitted by another method, such as spark-submit.
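If you do submit your own application with spark-submit, giving it an explicit name makes it easier to find in the YARN UI. The following is a minimal sketch, not part of the article's notebook walkthrough; the script name my_job.py, the application name, and the spark-submit invocation in the comment are illustrative:

```python
# my_job.py -- submit with, for example: spark-submit --master yarn my_job.py
from pyspark.sql import SparkSession

# The appName set here is what appears in the YARN UI's application list.
spark = SparkSession.builder.appName("food-inspections-debug-demo").getOrCreate()

# A trivial job so that something shows up in the YARN UI and Spark UI.
df = spark.range(1000)
print(df.count())

spark.stop()
```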

If you don't have an Azure subscription, you can create a free account before you start.

Prerequisites

Track an application in the YARN UI

  1. Launch the YARN UI. Under Cluster dashboards, select Yarn.

    Tip

    Alternatively, you can launch the YARN UI from the Ambari UI. To launch the Ambari UI, select Ambari home under Cluster dashboards. From the Ambari UI, navigate to Yarn > Quicklinks, select the active Resource Manager, and then select Resource Manager UI.

  2. Because you started the Spark job using Jupyter Notebooks, the application is named remotesparkmagics (this is the name for all applications started from the notebooks). Select the application ID against the application name to get more information about the job. This action launches the application view.

    For applications launched from the Jupyter Notebooks, the status is always RUNNING until you exit the notebook.

  3. In the application view, you can drill down further to find the containers associated with the application and the logs (stdout/stderr). You can also launch the Spark UI by selecting the link corresponding to the Tracking URL, as shown below. If you prefer to track applications from a script, see the sketch below this list.
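The YARN ResourceManager exposes the same application information through its REST API (/ws/v1/cluster/apps), which is handy for scripted tracking. Here is a minimal sketch in Python; the cluster name, credentials, and the /yarnui proxy path are assumptions you would replace with your own cluster's values:

```python
# Poll the YARN ResourceManager REST API for running Spark applications.
import requests

CLUSTER = "https://CLUSTERNAME.azurehdinsight.net"  # hypothetical cluster URL
AUTH = ("admin", "PASSWORD")                        # HDInsight cluster login

# /ws/v1/cluster/apps is the standard YARN ResourceManager REST endpoint;
# on HDInsight it is typically reachable through the /yarnui proxy.
resp = requests.get(
    f"{CLUSTER}/yarnui/ws/v1/cluster/apps",
    params={"states": "RUNNING", "applicationTypes": "SPARK"},
    auth=AUTH,
)
resp.raise_for_status()

apps = (resp.json().get("apps") or {}).get("app") or []
for app in apps:
    # Notebook-launched applications show up here named remotesparkmagics.
    print(app["id"], app["name"], app["state"], app["trackingUrl"])
```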

Track an application in the Spark UI

In the Spark UI, you can view details of the Spark jobs that are generated by the application you launched earlier.

  1. As shown in the screenshot above, in the application view, select the link for the Tracking URL to launch the Spark UI. The Spark UI displays all the Spark jobs launched by the application running in the Jupyter Notebook.

  2. Select the Executors tab to see processing and storage information for each executor. You can also retrieve the call stack by selecting the Thread Dump link.

  3. Select the Stages tab to see the stages associated with the application.

    Each stage can have multiple tasks for which you can view execution statistics, as shown below.

  4. You can launch the DAG visualization from the stage details page. Expand the DAG Visualization link at the top of the page, as shown below.

    The DAG (Directed Acyclic Graph) represents the different stages in the application. Each blue box in the graph represents a Spark operation invoked from the application. For a small job of your own that produces a multi-stage DAG, see the sketch after this list.

  5. You can also launch the event timeline view for the application from the stage details page. Expand the Event Timeline link at the top of the page, as shown below.

    This image displays the Spark events in the form of a timeline. The timeline view is available at three levels: across jobs, within a job, and within a stage. The image above captures the timeline view for a specific stage.

    Tip

    Select the Enable zooming check box to scroll left and right across the timeline view.

  6. Other tabs in the Spark UI also contain useful information about the Spark instance.

    • Storage Tab: If your application creates an RDD object, refer to the Storage tab for information.
    • Environment Tab: This tab contains a lot of useful information about your Spark instance, including the following:
      • Scala version
      • Event log directory associated with the cluster
      • Number of executor cores for the application
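To see these views populated by a job of your own, you can run a snippet like the following in a notebook cell on the cluster. This is only a sketch: it assumes the `spark` session the notebook provides, and the wide reduceByKey step forces a shuffle, so the job splits into two stages:

```python
# Run in a Jupyter (PySpark) cell; `spark` is the session the notebook provides.
# The shuffle at reduceByKey creates a stage boundary visible in the Stages tab
# and in the DAG visualization.
rdd = spark.sparkContext.parallelize(range(100000), 8)  # 8 partitions => 8 tasks
pairs = rdd.map(lambda x: (x % 10, 1))                  # narrow: same stage
counts = pairs.reduceByKey(lambda a, b: a + b)          # wide: new stage
print(sorted(counts.collect()))                         # collect() triggers the job
```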

View information about completed jobs using the Spark History Server

When a job is completed, the information about the job is retained on the Spark History Server.
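The history server reads the event logs that Spark writes while the application runs (the Environment tab shown earlier lists the event log directory). If you want to check these settings from a notebook cell, a sketch like the following works, assuming the notebook's `spark` session and that event logging is enabled on the cluster:

```python
# Inspect the event-log settings the Spark History Server depends on.
print(spark.conf.get("spark.eventLog.enabled", "not set"))  # expected: "true"
print(spark.conf.get("spark.eventLog.dir", "not set"))      # where logs are written
```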

  1. To launch the Spark History Server, from the Overview page, select Spark history server under Cluster dashboards.

    Tip

    Alternatively, you can launch the Spark History Server UI from the Ambari UI. To launch the Ambari UI, select Ambari home under Cluster dashboards on the Overview pane. From the Ambari UI, navigate to Spark2 > Quicklinks > Spark2 History Server UI.

  2. You see a list of completed applications. Select an application ID to view more information about the application. To query completed applications from a script instead, see the sketch below.
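Completed applications can also be listed through the Spark History Server REST API (/api/v1/applications). Here is a minimal sketch; the cluster name, credentials, and the /sparkhistory proxy path are assumptions you would replace with your own cluster's values:

```python
# List completed applications via the Spark History Server REST API.
import requests

CLUSTER = "https://CLUSTERNAME.azurehdinsight.net"  # hypothetical cluster URL
AUTH = ("admin", "PASSWORD")                        # HDInsight cluster login

# /api/v1/applications is the standard Spark monitoring REST endpoint.
resp = requests.get(
    f"{CLUSTER}/sparkhistory/api/v1/applications",
    params={"status": "completed"},
    auth=AUTH,
)
resp.raise_for_status()

for app in resp.json():
    attempt = app["attempts"][-1]  # the most recent attempt
    print(app["id"], app["name"], attempt["startTime"], attempt["endTime"])
```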

Additional information
