Jekyll2018-12-22T21:22:37+00:00http://blog.kprajapati.com/feed.xmlKaushal PrajapatiAn amazing website.Kaushal PrajapatiHow to extend Spark UI2018-09-05T00:00:00+00:002018-09-05T00:00:00+00:00http://blog.kprajapati.com/spark-ui-extension<p>As you know, Spark is a framework with lots of unique, developer-friendly features, and one of them
is its web UI (the Spark UI). It provides a variety of information about a running application, like currently running jobs,
their stages and tasks, memory usage, and plenty more.</p>
<p>But when you develop an application with Spark, at times you want to surface some additional
details. There is no point in building a separate UI module just to show a few basic details or metrics.</p>
<p>Spark already provides the hooks you need to add such details and visualize custom
metrics inside the Spark UI itself.</p>
<p>Let’s take an example in which I want to show the schema information of my Spark dataframes in the Spark UI.
In that case you need three things:</p>
<ul>
<li>
<p>First, a data object to store the details you want to show in the UI, which can be updated with new information as required.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">import</span> <span class="nn">java.util.concurrent.</span><span class="o">{</span><span class="nc">ConcurrentHashMap</span><span class="o">,</span> <span class="nc">ConcurrentMap</span><span class="o">}</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.sql.Dataset</span>
 <span class="k">object</span> <span class="nc">Utility</span> <span class="o">{</span>
 <span class="c1">// Thread-safe map from call site to schema tree string</span>
 <span class="k">val</span> <span class="n">schemas</span><span class="k">:</span> <span class="kt">ConcurrentMap</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ConcurrentHashMap</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]()</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">DataFrameSchema</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">df</span><span class="k">:</span> <span class="kt">Dataset</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">registerSchema</span><span class="k">:</span> <span class="kt">Dataset</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="n">schemas</span><span class="o">.</span><span class="n">put</span><span class="o">(</span><span class="nc">Thread</span><span class="o">.</span><span class="n">currentThread</span><span class="o">().</span><span class="n">getStackTrace</span><span class="o">.</span><span class="n">slice</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span><span class="mi">4</span><span class="o">).</span><span class="n">mkString</span><span class="o">(</span><span class="s">"\n"</span><span class="o">),</span> <span class="n">df</span><span class="o">.</span><span class="n">schema</span>
<span class="o">.</span><span class="n">treeString</span><span class="o">)</span>
<span class="n">df</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div> </div>
<p>Here the <strong>schemas</strong> map stores your schema information and the <strong>registerSchema</strong> method adds an entry to it for each registered dataframe.</p>
</li>
<li>
<p>Second, a class that extends <strong>WebUIPage</strong>, in which you write the HTML logic
for your visualization.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">DataFrameSchemaUIPage</span><span class="o">(</span><span class="n">parent</span><span class="k">:</span> <span class="kt">ExtendedUIServer</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">WebUIPage</span><span class="o">(</span><span class="s">""</span><span class="o">)</span> <span class="k">with</span> <span class="nc">Logging</span> <span class="o">{</span>
<span class="cm">/** Render the page */</span>
<span class="k">def</span> <span class="n">render</span><span class="o">(</span><span class="n">request</span><span class="k">:</span> <span class="kt">HttpServletRequest</span><span class="o">)</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Node</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">import</span> <span class="nn">scala.collection.JavaConversions._</span>
 <span class="k">val</span> <span class="n">content</span> <span class="k">=</span> <span class="o"><</span><span class="n">h4</span><span class="o">></span><span class="nc">The</span> <span class="n">below</span> <span class="n">table</span> <span class="n">shows</span> <span class="n">registered</span> <span class="n">dataframes</span> <span class="n">on</span> <span class="n">the</span> <span class="n">left</span><span class="o">,</span> <span class="k">with</span> <span class="n">their</span> <span class="n">schemas</span> <span class="n">on</span> <span class="n">the</span>
<span class="n">right</span><span class="o">.</</span><span class="n">h4</span><span class="o">></span>
<span class="o"><</span><span class="n">br</span><span class="o">/></span>
<span class="o"><</span><span class="n">div</span><span class="o">></span>
<span class="o"><</span><span class="n">table</span> <span class="n">class</span><span class="o">=</span><span class="s">"table table-bordered table-condensed"</span> <span class="n">id</span><span class="o">=</span><span class="s">"task-summary-table"</span><span class="o">></span>
<span class="o"><</span><span class="n">thead</span><span class="o">></span>
<span class="o"><</span><span class="n">tr</span> <span class="n">style</span><span class="o">=</span><span class="s">"background-color: rgb(255, 255, 255);"</span><span class="o">></span>
<span class="o"><</span><span class="n">th</span> <span class="n">width</span><span class="o">=</span><span class="s">"50%"</span> <span class="n">class</span><span class="o">=</span><span class="s">""</span><span class="o">></span><span class="nc">DataFrame</span><span class="o"></</span><span class="n">th</span><span class="o">></span>
<span class="o"><</span><span class="n">th</span> <span class="n">width</span><span class="o">=</span><span class="s">"50%"</span> <span class="n">class</span><span class="o">=</span><span class="s">""</span><span class="o">></span><span class="nc">Schema</span><span class="o"></</span><span class="n">th</span><span class="o">></span>
<span class="o"></</span><span class="n">tr</span><span class="o">></span>
<span class="o"></</span><span class="n">thead</span><span class="o">></span>
<span class="o"><</span><span class="n">tbody</span><span class="o">></span>
<span class="o">{</span><span class="nc">Utility</span><span class="o">.</span><span class="n">schemas</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span>
<span class="o"><</span><span class="n">tr</span> <span class="n">style</span><span class="o">=</span><span class="s">"background-color: rgb(249, 249, 249);"</span><span class="o">></span>
<span class="o"><</span><span class="n">td</span><span class="o">>{</span><span class="n">s</span><span class="s">"${x._1}"</span><span class="o">}</</span><span class="n">td</span><span class="o">></span>
<span class="o"><</span><span class="n">td</span><span class="o">><</span><span class="n">pre</span><span class="o">>{</span><span class="n">s</span><span class="s">"${x._2}"</span><span class="o">}</</span><span class="n">pre</span><span class="o">></</span><span class="n">td</span><span class="o">></span>
<span class="o"></</span><span class="n">tr</span><span class="o">>)}</span>
<span class="o"></</span><span class="n">tbody</span><span class="o">></span>
<span class="o"><</span><span class="n">tfoot</span><span class="o">></</span><span class="n">tfoot</span><span class="o">></span>
<span class="o"></</span><span class="n">table</span><span class="o">></span>
<span class="o"></</span><span class="n">div</span><span class="o">></span>
<span class="nc">UIUtils</span><span class="o">.</span><span class="n">headerSparkPage</span><span class="o">(</span>
<span class="s">"This is the extension to Spark UI to display custom information about your application."</span><span class="o">,</span>
<span class="n">content</span><span class="o">,</span> <span class="n">parent</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div> </div>
<p>This class contains all the logic for rendering your HTML page.</p>
</li>
<li>
<p>Third, attach your page (the class holding the HTML logic) to the existing Spark UI.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">import</span> <span class="nn">ExtendedUIServer._</span> <span class="c1">// brings getSparkUI into scope</span>
 <span class="k">class</span> <span class="nc">ExtendedUIServer</span><span class="o">(</span><span class="n">sparkContext</span><span class="k">:</span> <span class="kt">SparkContext</span><span class="o">)</span>
<span class="k">extends</span> <span class="nc">SparkUITab</span><span class="o">(</span><span class="n">getSparkUI</span><span class="o">(</span><span class="n">sparkContext</span><span class="o">),</span> <span class="s">"dataframeschema"</span><span class="o">)</span>
<span class="k">with</span> <span class="nc">Logging</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">val</span> <span class="n">name</span> <span class="k">=</span> <span class="s">"Dataframe Schema"</span>
<span class="k">val</span> <span class="n">parent</span><span class="k">:</span> <span class="kt">SparkUI</span> <span class="o">=</span> <span class="n">getSparkUI</span><span class="o">(</span><span class="n">sparkContext</span><span class="o">)</span>
<span class="n">attachPage</span><span class="o">(</span><span class="k">new</span> <span class="nc">DataFrameSchemaUIPage</span><span class="o">(</span><span class="k">this</span><span class="o">))</span>
<span class="n">parent</span><span class="o">.</span><span class="n">attachTab</span><span class="o">(</span><span class="k">this</span><span class="o">)</span>
<span class="k">def</span> <span class="n">detach</span><span class="o">()</span> <span class="o">{</span>
<span class="n">getSparkUI</span><span class="o">(</span><span class="n">sparkContext</span><span class="o">).</span><span class="n">detachTab</span><span class="o">(</span><span class="k">this</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">object</span> <span class="nc">ExtendedUIServer</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">getSparkUI</span><span class="o">(</span><span class="n">sparkContext</span><span class="k">:</span> <span class="kt">SparkContext</span><span class="o">)</span><span class="k">:</span> <span class="kt">SparkUI</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">sparkContext</span><span class="o">.</span><span class="n">ui</span><span class="o">.</span><span class="n">getOrElse</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">SparkException</span><span class="o">(</span><span class="s">"Parent SparkUI to attach this tab to not found!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div> </div>
<p>Here I attach my page with the call <strong>attachPage(new DataFrameSchemaUIPage(this))</strong>.</p>
</li>
</ul>
<p>That is it; you are now ready to test your custom webpage.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">TestUIExtension</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">spark</span> <span class="k">=</span> <span class="o">...</span>
<span class="k">new</span> <span class="nc">ExtendedUIServer</span><span class="o">(</span><span class="n">spark</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">)</span>
<span class="k">import</span> <span class="nn">spark.implicits._</span>
<span class="k">import</span> <span class="nn">Utility._</span>
<span class="nc">Seq</span><span class="o">(</span><span class="s">"1"</span><span class="o">).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"id"</span><span class="o">).</span><span class="n">registerSchema</span>
<span class="n">println</span><span class="o">(</span><span class="s">"First Dataframe"</span><span class="o">)</span>
<span class="nc">Thread</span><span class="o">.</span><span class="n">sleep</span><span class="o">(</span><span class="mi">10000</span><span class="o">)</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.count</span>
<span class="nc">Seq</span><span class="o">((</span><span class="s">"1"</span><span class="o">,</span> <span class="mi">1</span><span class="o">)).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"id"</span><span class="o">,</span> <span class="s">"count"</span><span class="o">).</span><span class="n">groupBy</span><span class="o">(</span><span class="s">"id"</span><span class="o">).</span><span class="n">agg</span><span class="o">(</span><span class="n">count</span><span class="o">(</span><span class="s">"id"</span><span class="o">)</span> <span class="n">as</span> <span class="s">"count"</span><span class="o">)</span>
<span class="o">.</span><span class="n">registerSchema</span>
<span class="n">println</span><span class="o">(</span><span class="s">"Second Dataframe"</span><span class="o">)</span>
<span class="nc">Seq</span><span class="o">(</span><span class="s">"1"</span><span class="o">).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"otherId"</span><span class="o">).</span><span class="n">distinct</span><span class="o">().</span><span class="n">registerSchema</span><span class="o">.</span><span class="n">show</span>
<span class="n">println</span><span class="o">(</span><span class="s">"Third Dataframe"</span><span class="o">)</span>
<span class="nc">Thread</span><span class="o">.</span><span class="n">sleep</span><span class="o">(</span><span class="mi">60000</span><span class="o">)</span>
<span class="n">println</span><span class="o">(</span><span class="s">"Test Done .."</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>So as a first step I create an instance of the <strong>ExtendedUIServer</strong> class, which attaches and
renders the page; later, each call to <strong>registerSchema</strong> adds the schema of a dataframe to the UI page.</p>
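<p>For clarity, here is a standalone sketch (independent of Spark, names are illustrative) of how <strong>registerSchema</strong> derives the map key from the caller’s stack trace; exact frame contents vary by JVM:</p>

```scala
object StackKeyDemo {
  // Frame 0 is Thread.getStackTrace and frame 1 is this method itself;
  // frames 2..3 identify the call site, which registerSchema uses as the map key.
  def callerKey: String =
    Thread.currentThread().getStackTrace.slice(2, 4).mkString("\n")

  def main(args: Array[String]): Unit =
    println(callerKey) // prints the two stack frames of the caller
}
```

Because the key encodes the call site, registering two dataframes from different places in your code produces two distinct rows in the UI table.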
<p>For full code you can visit my github repo <a href="https://github.com/skp33/spark-ui-extension">link</a>.</p>Kaushal PrajapatiIn this post, I'll be adding a new tab in Spark UI to show custom details about any spark applicationIntegrating Rest services with Apache Spark without any Web-Server2018-08-26T00:00:00+00:002018-08-26T00:00:00+00:00http://blog.kprajapati.com/spark-rest-integration<p>There is often a use case in which you have to expose a web service and allow user intervention to accomplish some task.
But what about with Spark?
In Spark there can be situations where you have to perform some tasks dynamically: trigger a Spark job, change a config parameter, or run other application-specific
tasks.</p>
<p>In order to achieve this, we usually need to integrate a REST framework like Jersey, which provides API endpoints for those services.</p>
<p>But have you noticed that when you start a Spark application, it opens a port for its UI?
This UI port is <strong>4040</strong> by default, and it also serves a few REST APIs that expose information about your application, like the number of jobs, stages, tasks, environment variables, etc.
For more information you can go through this link: <a href="https://spark.apache.org/docs/latest/monitoring.html#rest-api">Spark Monitoring</a>.</p>
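<p>For instance, you can query that built-in monitoring API from any HTTP client. A quick sketch, assuming a Spark application is already running locally with its UI on the default port 4040:</p>

```scala
import scala.io.Source

object MonitoringApiDemo {
  def main(args: Array[String]): Unit = {
    // Lists the applications known to this UI as a JSON array.
    val apps = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
    println(apps)
  }
}
```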
<p>So why not reuse this existing Spark feature and add our own REST APIs for basic requirements?</p>
<p>If you look at the code below from the <a href="https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala#L238"><strong>getServletHandler</strong></a>
method of the <strong>org.apache.spark.status.api.v1.ApiRootResource</strong> class,</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="n">getServletHandler</span><span class="o">(</span><span class="n">uiRoot</span><span class="k">:</span> <span class="kt">UIRoot</span><span class="o">)</span><span class="k">:</span> <span class="kt">ServletContextHandler</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">jerseyContext</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ServletContextHandler</span><span class="o">(</span><span class="nc">ServletContextHandler</span><span class="o">.</span><span class="nc">NO_SESSIONS</span><span class="o">)</span>
<span class="n">jerseyContext</span><span class="o">.</span><span class="n">setContextPath</span><span class="o">(</span><span class="s">"/api"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">holder</span><span class="k">:</span> <span class="kt">ServletHolder</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">ServletHolder</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">ServletContainer</span><span class="o">])</span>
<span class="n">holder</span><span class="o">.</span><span class="n">setInitParameter</span><span class="o">(</span><span class="nc">ServerProperties</span><span class="o">.</span><span class="nc">PROVIDER_PACKAGES</span><span class="o">,</span> <span class="s">"org.apache.spark.status.api.v1"</span><span class="o">)</span>
<span class="nc">UIRootFromServletContext</span><span class="o">.</span><span class="n">setUiRoot</span><span class="o">(</span><span class="n">jerseyContext</span><span class="o">,</span> <span class="n">uiRoot</span><span class="o">)</span>
<span class="n">jerseyContext</span><span class="o">.</span><span class="n">addServlet</span><span class="o">(</span><span class="n">holder</span><span class="o">,</span> <span class="s">"/*"</span><span class="o">)</span>
<span class="n">jerseyContext</span>
<span class="o">}</span>
</code></pre></div></div>
<p>you can see that Spark registers the <strong>org.apache.spark.status.api.v1</strong> package with Jersey. This means that if you implement any REST API within this package, it will automatically be registered with Jersey.</p>
<p>Cool, let’s write a small REST API.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="nn">org.apache.spark.status.api.v1</span>
<span class="k">import</span> <span class="nn">javax.ws.rs.</span><span class="o">{</span><span class="nc">GET</span><span class="o">,</span> <span class="nc">Path</span><span class="o">,</span> <span class="nc">PathParam</span><span class="o">,</span> <span class="nc">Produces</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">javax.ws.rs.core.MediaType</span>
<span class="nd">@Path</span><span class="o">(</span><span class="s">"/custom"</span><span class="o">)</span>
<span class="k">class</span> <span class="nc">TestService</span> <span class="o">{</span>
<span class="nd">@GET</span>
<span class="nd">@Path</span><span class="o">(</span><span class="s">"sum/{x}/{y}"</span><span class="o">)</span>
<span class="nd">@Produces</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="nc">MediaType</span><span class="o">.</span><span class="nc">APPLICATION_JSON</span><span class="o">))</span>
<span class="k">def</span> <span class="n">sum</span><span class="o">(</span>
<span class="nd">@PathParam</span><span class="o">(</span><span class="s">"x"</span><span class="o">)</span> <span class="n">x</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span>
<span class="nd">@PathParam</span><span class="o">(</span><span class="s">"y"</span><span class="o">)</span> <span class="n">y</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">x</span><span class="o">+</span><span class="n">y</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Now call this API once your Spark context is up, using the <a href="http://localhost:4040/api/custom/sum/43/8897">http://localhost:4040/api/custom/sum/43/8897</a> endpoint.</p>
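<p>A quick way to sanity-check the endpoint from code, assuming the application is running locally on the default UI port:</p>

```scala
import scala.io.Source

object SumEndpointCheck {
  def main(args: Array[String]): Unit = {
    // The service adds the two path parameters: 43 + 8897 = 8940.
    val body = Source.fromURL("http://localhost:4040/api/custom/sum/43/8897").mkString
    println(body) // expected response body: 8940
  }
}
```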
<p>So far everything seems fine, but what if you want to use your own package for your REST services?</p>
<p>There’s a workaround for this as well; check the code below:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="nn">org.apache.spark.rest</span>
<span class="k">import</span> <span class="nn">org.apache.spark.</span><span class="o">{</span><span class="nc">SparkContext</span><span class="o">,</span> <span class="nc">SparkException</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">org.apache.spark.internal.Logging</span>
<span class="k">import</span> <span class="nn">org.apache.spark.ui.SparkUI</span>
<span class="k">import</span> <span class="nn">org.eclipse.jetty.servlet.</span><span class="o">{</span><span class="nc">ServletContextHandler</span><span class="o">,</span> <span class="nc">ServletHolder</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">org.glassfish.jersey.server.ServerProperties</span>
<span class="k">import</span> <span class="nn">org.glassfish.jersey.servlet.ServletContainer</span>
<span class="k">class</span> <span class="nc">RestAPI</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">getServletHandler</span><span class="k">:</span> <span class="kt">ServletContextHandler</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">jerseyContext</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ServletContextHandler</span><span class="o">(</span><span class="nc">ServletContextHandler</span><span class="o">.</span><span class="nc">NO_SESSIONS</span><span class="o">)</span>
<span class="n">jerseyContext</span><span class="o">.</span><span class="n">setContextPath</span><span class="o">(</span><span class="s">"/rest"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">holder</span><span class="k">:</span> <span class="kt">ServletHolder</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">ServletHolder</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">ServletContainer</span><span class="o">])</span>
<span class="n">holder</span><span class="o">.</span><span class="n">setInitParameter</span><span class="o">(</span><span class="nc">ServerProperties</span><span class="o">.</span><span class="nc">PROVIDER_PACKAGES</span><span class="o">,</span> <span class="s">"org.apache.spark.rest.services"</span><span class="o">)</span>
<span class="n">jerseyContext</span><span class="o">.</span><span class="n">addServlet</span><span class="o">(</span><span class="n">holder</span><span class="o">,</span> <span class="s">"/*"</span><span class="o">)</span>
<span class="n">jerseyContext</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">object</span> <span class="nc">RestAPI</span> <span class="k">extends</span> <span class="nc">Logging</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">getSparkUI</span><span class="o">(</span><span class="n">sparkContext</span><span class="k">:</span> <span class="kt">SparkContext</span><span class="o">)</span><span class="k">:</span> <span class="kt">SparkUI</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">sparkContext</span><span class="o">.</span><span class="n">ui</span><span class="o">.</span><span class="n">getOrElse</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">SparkException</span><span class="o">(</span><span class="s">"Parent SparkUI to attach this tab to not found!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">sparkContext</span><span class="k">:</span> <span class="kt">SparkContext</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="n">attach</span><span class="o">(</span><span class="n">sparkContext</span><span class="o">)</span>
<span class="k">def</span> <span class="n">attach</span><span class="o">(</span><span class="n">sparkContext</span><span class="k">:</span> <span class="kt">SparkContext</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">getSparkUI</span><span class="o">(</span><span class="n">sparkContext</span><span class="o">).</span><span class="n">attachHandler</span><span class="o">(</span><span class="k">new</span> <span class="nc">RestAPI</span><span class="o">().</span><span class="n">getServletHandler</span><span class="o">)</span>
<span class="n">logInfo</span><span class="o">(</span><span class="s">"Started rest-server"</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">def</span> <span class="n">detach</span><span class="o">(</span><span class="n">sparkContext</span><span class="k">:</span> <span class="kt">SparkContext</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">getSparkUI</span><span class="o">(</span><span class="n">sparkContext</span><span class="o">).</span><span class="n">detachHandler</span><span class="o">(</span><span class="k">new</span> <span class="nc">RestAPI</span><span class="o">().</span><span class="n">getServletHandler</span><span class="o">)</span>
<span class="n">logInfo</span><span class="o">(</span><span class="s">"Stopped rest-server"</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>After that, you just have to call the <strong>attach(sparkContext: SparkContext)</strong> method in your code to start serving your API.
This way you can implement basic REST features and integrate them with your Spark application.</p>
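<p>A minimal driver sketch tying the pieces together; the app name and the sleep are illustrative, and <strong>RestAPI</strong> is the class defined above:</p>

```scala
import org.apache.spark.rest.RestAPI
import org.apache.spark.sql.SparkSession

object RestDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rest-demo").getOrCreate()
    // Attach the Jersey handler; endpoints are now served under /rest on the UI port.
    RestAPI.attach(spark.sparkContext)
    // ... run your jobs; the endpoints stay available while the context is alive ...
    Thread.sleep(60000)
    spark.stop()
  }
}
```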
<p>For full code you can visit my github repo <a href="https://github.com/skp33/spark-rest">link</a>.</p>Kaushal PrajapatiThis post is about enabling basic Rest API in your Spark application with the minimal codeUpdating Spark Table Metadata in append mode2018-07-23T00:00:00+00:002018-07-23T00:00:00+00:00http://blog.kprajapati.com/Update-metadata-of-Spark-table-also-in-append-mode<p>Spark supports a feature that attaches metadata to the columns of a Spark table. The metadata can be a number, a string, or an array, and can be used to store table-specific stats or aggregation-related info:
for example, how many classes are there in your feature column,
or what is the maximum date in your date column?</p>
<p>Once persisted, these details can be reused in further calculations whenever you work with these tables in the future.</p>
<p>Let’s see the above case through an example:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.</span><span class="o">{</span><span class="n">count</span><span class="o">,</span> <span class="n">max</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">sparkSession.implicits._</span>
<span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">((</span><span class="mi">12</span><span class="o">,</span> <span class="mi">20180411</span><span class="o">),</span> <span class="o">(</span><span class="mi">5</span><span class="o">,</span> <span class="mi">20180411</span><span class="o">),</span> <span class="o">(</span><span class="mi">11</span><span class="o">,</span> <span class="mi">20180411</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="mi">20180412</span><span class="o">),</span> <span class="o">(</span><span class="mi">29</span><span class="o">,</span> <span class="mi">20180413</span><span class="o">),</span>
<span class="o">(</span><span class="mi">31</span><span class="o">,</span> <span class="mi">20180414</span><span class="o">),</span> <span class="o">(</span><span class="mi">18</span><span class="o">,</span> <span class="mi">20180415</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="mi">20180412</span><span class="o">),</span> <span class="o">(</span><span class="mi">31</span><span class="o">,</span> <span class="mi">20180413</span><span class="o">),</span> <span class="o">(</span><span class="mi">8</span><span class="o">,</span> <span class="mi">20180416</span><span class="o">),</span> <span class="o">(</span><span class="mi">29</span><span class="o">,</span> <span class="mi">20180413</span><span class="o">),</span>
<span class="o">(</span><span class="mi">31</span><span class="o">,</span> <span class="mi">20180414</span><span class="o">),</span> <span class="o">(</span><span class="mi">8</span><span class="o">,</span> <span class="mi">20180415</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="mi">20180412</span><span class="o">),</span> <span class="o">(</span><span class="mi">23</span><span class="o">,</span> <span class="mi">20180413</span><span class="o">),</span> <span class="o">(</span><span class="mi">51</span><span class="o">,</span> <span class="mi">20180414</span><span class="o">),</span> <span class="o">(</span><span class="mi">15</span><span class="o">,</span> <span class="mi">20180415</span><span class="o">))</span>
<span class="k">val</span> <span class="n">orders</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">toDF</span><span class="o">(</span><span class="s">"order id"</span><span class="o">,</span> <span class="s">"date"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">maxDate</span> <span class="k">=</span> <span class="n">orders</span><span class="o">.</span><span class="n">agg</span><span class="o">(</span><span class="n">max</span><span class="o">(</span><span class="s">"date"</span><span class="o">)).</span><span class="n">as</span><span class="o">[</span><span class="kt">Int</span><span class="o">].</span><span class="n">take</span><span class="o">(</span><span class="mi">1</span><span class="o">)(</span><span class="mi">0</span><span class="o">)</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.types.</span><span class="o">{</span><span class="nc">Metadata</span><span class="o">,</span> <span class="nc">MetadataBuilder</span><span class="o">}</span>
<span class="k">val</span> <span class="n">metadata</span><span class="k">:</span> <span class="kt">Metadata</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">MetadataBuilder</span><span class="o">().</span><span class="n">putLong</span><span class="o">(</span><span class="s">"max_dt"</span><span class="o">,</span> <span class="n">maxDate</span><span class="o">).</span><span class="n">build</span>
<span class="n">orders</span><span class="o">.</span><span class="n">groupBy</span><span class="o">(</span><span class="ss">'date </span><span class="n">as</span><span class="o">(</span><span class="s">"date"</span><span class="o">,</span> <span class="n">metadata</span><span class="o">)).</span><span class="n">agg</span><span class="o">(</span><span class="n">count</span><span class="o">(</span><span class="s">"order id"</span><span class="o">)</span> <span class="n">as</span> <span class="s">"order_count"</span><span class="o">)</span>
<span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">saveAsTable</span><span class="o">(</span><span class="s">"daily_order_count"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">tillProcessDate</span> <span class="k">=</span> <span class="n">sparkSession</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">table</span><span class="o">(</span><span class="s">"daily_order_count"</span><span class="o">)</span>
<span class="o">.</span><span class="n">schema</span><span class="o">(</span><span class="s">"date"</span><span class="o">).</span><span class="n">metadata</span><span class="o">.</span><span class="n">getLong</span><span class="o">(</span><span class="s">"max_dt"</span><span class="o">)</span>
</code></pre></div></div>
<p>So in the above example you can see that I’ve created a dataset first and then, after aggregation, added the maxDate to a metadata instance.
This metadata is attached to the schema of the table I’m writing. Later we can read the metadata back from the table schema and use it in our processing.
This way we are able to persist table details in the schema itself.</p>
<p>In the example below, you can see I’ve updated the metadata information:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="n">moreOrders</span> <span class="k">=</span> <span class="o">(</span><span class="n">data</span> <span class="o">++</span> <span class="nc">Seq</span><span class="o">((</span><span class="mi">2</span><span class="o">,</span> <span class="mi">20180417</span><span class="o">),</span> <span class="o">(</span><span class="mi">41</span><span class="o">,</span> <span class="mi">20180417</span><span class="o">),</span> <span class="o">(</span><span class="mi">25</span><span class="o">,</span> <span class="mi">20180417</span><span class="o">),</span>
<span class="o">(</span><span class="mi">41</span><span class="o">,</span> <span class="mi">20180418</span><span class="o">),</span> <span class="o">(</span><span class="mi">25</span><span class="o">,</span> <span class="mi">20180418</span><span class="o">))).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"order id"</span><span class="o">,</span> <span class="s">"date"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">maxDate</span> <span class="k">=</span> <span class="n">moreOrders</span><span class="o">.</span><span class="n">agg</span><span class="o">(</span><span class="n">max</span><span class="o">(</span><span class="s">"date"</span><span class="o">)).</span><span class="n">as</span><span class="o">[</span><span class="kt">Int</span><span class="o">].</span><span class="n">take</span><span class="o">(</span><span class="mi">1</span><span class="o">)(</span><span class="mi">0</span><span class="o">)</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.types.</span><span class="o">{</span><span class="nc">Metadata</span><span class="o">,</span> <span class="nc">MetadataBuilder</span><span class="o">}</span>
<span class="k">val</span> <span class="n">metadata</span><span class="k">:</span> <span class="kt">Metadata</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">MetadataBuilder</span><span class="o">().</span><span class="n">putLong</span><span class="o">(</span><span class="s">"max_dt"</span><span class="o">,</span> <span class="n">maxDate</span><span class="o">).</span><span class="n">build</span>
<span class="n">moreOrders</span><span class="o">.</span><span class="n">groupBy</span><span class="o">(</span><span class="ss">'date </span><span class="n">as</span><span class="o">(</span><span class="s">"date"</span><span class="o">,</span> <span class="n">metadata</span><span class="o">)).</span><span class="n">agg</span><span class="o">(</span><span class="n">count</span><span class="o">(</span><span class="s">"order id"</span><span class="o">)</span> <span class="n">as</span> <span class="s">"order_count"</span><span class="o">)</span>
<span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">mode</span><span class="o">(</span><span class="s">"overwrite"</span><span class="o">).</span><span class="n">saveAsTable</span><span class="o">(</span><span class="s">"daily_order_count"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">tillProcessDate</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">table</span><span class="o">(</span><span class="s">"daily_order_count"</span><span class="o">)</span>
<span class="o">.</span><span class="n">schema</span><span class="o">(</span><span class="s">"date"</span><span class="o">).</span><span class="n">metadata</span><span class="o">.</span><span class="n">getLong</span><span class="o">(</span><span class="s">"max_dt"</span><span class="o">)</span>
</code></pre></div></div>
<p>So you must’ve noticed that I used <strong>overwrite</strong> mode to update my table. This will update my data
as well as my schema according to the details I specify.</p>
<p>But what about append mode?
There could also be a case where you don’t want to drop the existing data but just want to append new data along with new table metadata details.
In this case we hit a limitation in Spark, as it doesn’t support this feature in append mode:
if you try to update the metadata, Spark will simply keep the old values.</p>
<p>So here is a solution to update the metadata in append mode, for Spark 2.2.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="nn">org.apache.spark.sql</span>
<span class="k">import</span> <span class="nn">scala.language.implicitConversions</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.types.StructType</span>
<span class="k">object</span> <span class="nc">UpdateSparkMetadata</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">alterTableSchema</span><span class="o">(</span><span class="nc">_table</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">schema</span><span class="k">:</span> <span class="kt">StructType</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">spark</span><span class="o">.</span><span class="n">sessionState</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">alterTableSchema</span><span class="o">(</span>
<span class="n">spark</span><span class="o">.</span><span class="n">sessionState</span><span class="o">.</span><span class="n">sqlParser</span><span class="o">.</span><span class="n">parseTableIdentifier</span><span class="o">(</span><span class="nc">_table</span><span class="o">),</span> <span class="n">schema</span>
<span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>So in the above example, I simply created an object inside the <strong>org.apache.spark.sql</strong> package.
Note that this package exposes a developer API called <strong>sessionState</strong>, which gives us a <strong>SessionCatalog</strong> object. This catalog provides a method called <strong>alterTableSchema</strong>, which accepts a <strong>TableIdentifier</strong> and a <strong>Schema</strong> as its parameters.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="n">moreOrders</span> <span class="k">=</span> <span class="o">(</span><span class="n">data</span> <span class="o">++</span> <span class="nc">Seq</span><span class="o">((</span><span class="mi">2</span><span class="o">,</span> <span class="mi">20180417</span><span class="o">),</span> <span class="o">(</span><span class="mi">41</span><span class="o">,</span> <span class="mi">20180417</span><span class="o">),</span> <span class="o">(</span><span class="mi">25</span><span class="o">,</span> <span class="mi">20180417</span><span class="o">),</span>
<span class="o">(</span><span class="mi">41</span><span class="o">,</span> <span class="mi">20180418</span><span class="o">),</span> <span class="o">(</span><span class="mi">25</span><span class="o">,</span> <span class="mi">20180418</span><span class="o">))).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"order id"</span><span class="o">,</span> <span class="s">"date"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">maxDate</span> <span class="k">=</span> <span class="n">moreOrders</span><span class="o">.</span><span class="n">agg</span><span class="o">(</span><span class="n">max</span><span class="o">(</span><span class="s">"date"</span><span class="o">)).</span><span class="n">as</span><span class="o">[</span><span class="kt">Int</span><span class="o">].</span><span class="n">take</span><span class="o">(</span><span class="mi">1</span><span class="o">)(</span><span class="mi">0</span><span class="o">)</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.types.</span><span class="o">{</span><span class="nc">Metadata</span><span class="o">,</span> <span class="nc">MetadataBuilder</span><span class="o">}</span>
<span class="k">val</span> <span class="n">metadata</span><span class="k">:</span> <span class="kt">Metadata</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">MetadataBuilder</span><span class="o">().</span><span class="n">putLong</span><span class="o">(</span><span class="s">"max_dt"</span><span class="o">,</span> <span class="n">maxDate</span><span class="o">).</span><span class="n">build</span>
<span class="k">val</span> <span class="n">orderAgg</span> <span class="k">=</span> <span class="n">moreOrders</span><span class="o">.</span><span class="n">groupBy</span><span class="o">(</span><span class="ss">'date </span><span class="n">as</span><span class="o">(</span><span class="s">"date"</span><span class="o">,</span> <span class="n">metadata</span><span class="o">)).</span><span class="n">agg</span><span class="o">(</span><span class="n">count</span><span class="o">(</span><span class="s">"order id"</span><span class="o">)</span> <span class="n">as</span> <span class="s">"order_count"</span><span class="o">)</span>
<span class="n">orderAgg</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">mode</span><span class="o">(</span><span class="s">"overwrite"</span><span class="o">).</span><span class="n">saveAsTable</span><span class="o">(</span><span class="s">"daily_order_count"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">tillProcessDate</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">table</span><span class="o">(</span><span class="s">"daily_order_count"</span><span class="o">)</span>
<span class="o">.</span><span class="n">schema</span><span class="o">(</span><span class="s">"date"</span><span class="o">).</span><span class="n">metadata</span><span class="o">.</span><span class="n">getLong</span><span class="o">(</span><span class="s">"max_dt"</span><span class="o">)</span>
<span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="nc">UpdateSparkMetadata</span><span class="o">.</span><span class="n">alterTableSchema</span><span class="o">(</span><span class="s">"daily_order_count"</span><span class="o">,</span> <span class="n">orderAgg</span><span class="o">.</span><span class="n">schema</span><span class="o">)</span>
</code></pre></div></div>
<p>So once you call the method and pass the required arguments, it will forcefully update the metadata information in the table.</p>
<p>Hope you find this article useful; for the complete code please follow my GitHub <a href="www.github.com/skp33/update-spark-table-metadata">link</a></p>Kaushal PrajapatiThis post is about one of the Spark features with which you can modify the metadata of a table.Evaluating column expressions without a dataframe in apache spark2018-07-11T00:00:00+00:002018-07-11T00:00:00+00:00http://blog.kprajapati.com/spark-column-expression<p>Spark these days is a de-facto standard for ETL and for building data pipelines. During the pipeline building phase we generally need to write test cases for the code.
There will be many instances where we write complex column expressions or UDFs and need to test them.</p>
<p>So I’ll discuss how I used the “eval()” method to test my expressions and UDFs without even creating a dataframe or running a Spark application.
Let me explain this with a few simple examples:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scala</span><span class="o">></span> <span class="k">:</span><span class="kt">paste</span>
<span class="c1">// Entering paste mode (ctrl-D to finish)
</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.Column</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.array_contains</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.split</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.lit</span>
<span class="k">def</span> <span class="n">checkDevice</span><span class="o">(</span><span class="n">device</span><span class="k">:</span> <span class="kt">Column</span><span class="o">)</span> <span class="k">=</span> <span class="n">array_contains</span><span class="o">(</span><span class="n">split</span><span class="o">(</span><span class="n">device</span><span class="o">,</span> <span class="s">","</span><span class="o">),</span> <span class="s">"mobile"</span><span class="o">)</span>
<span class="c1">// Exiting paste mode, now interpreting.
</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.Column</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.array_contains</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.split</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.lit</span>
<span class="n">checkDevice</span><span class="k">:</span> <span class="o">(</span><span class="kt">device:</span> <span class="kt">org.apache.spark.sql.Column</span><span class="o">)</span><span class="kt">org.apache.spark.sql.Column</span>
<span class="n">scala</span><span class="o">></span> <span class="n">checkDevice</span><span class="o">(</span><span class="n">lit</span><span class="o">(</span><span class="s">"mobile,desktop"</span><span class="o">)).</span><span class="n">expr</span><span class="o">.</span><span class="n">eval</span><span class="o">()</span>
<span class="n">res3</span><span class="k">:</span> <span class="kt">Any</span> <span class="o">=</span> <span class="kc">true</span>
<span class="n">scala</span><span class="o">></span> <span class="n">checkDevice</span><span class="o">(</span><span class="n">lit</span><span class="o">(</span><span class="s">"mobiles,desktop"</span><span class="o">)).</span><span class="n">expr</span><span class="o">.</span><span class="n">eval</span><span class="o">()</span>
<span class="n">res4</span><span class="k">:</span> <span class="kt">Any</span> <span class="o">=</span> <span class="kc">false</span>
</code></pre></div></div>
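<p>To see what the column expression above actually computes, here is the equivalent logic written as plain Scala (a hypothetical helper, not part of the Spark API):</p>

```scala
// Plain-Scala equivalent of array_contains(split(device, ","), "mobile"),
// useful as a mental model for what eval() returns above.
def checkDeviceLocal(device: String): Boolean =
  device.split(",").contains("mobile")

println(checkDeviceLocal("mobile,desktop"))  // true
println(checkDeviceLocal("mobiles,desktop")) // false
```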
<p>Here is another example in which I am evaluating a UDF:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scala</span><span class="o">></span> <span class="k">:</span><span class="kt">paste</span>
<span class="c1">// Entering paste mode (ctrl-D to finish)
</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.udf</span>
<span class="k">val</span> <span class="n">cleanColumn</span> <span class="k">=</span> <span class="n">udf</span><span class="o">{</span> <span class="o">(</span><span class="n">str</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=></span> <span class="o">{</span>
<span class="n">str</span><span class="o">.</span><span class="n">toLowerCase</span><span class="o">.</span><span class="n">replaceAll</span><span class="o">(</span><span class="s">"\\W"</span><span class="o">,</span> <span class="s">" "</span><span class="o">).</span><span class="n">replaceAll</span><span class="o">(</span><span class="s">"\\s+"</span><span class="o">,</span> <span class="s">" "</span><span class="o">).</span><span class="n">trim</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">filter</span><span class="o">(</span><span class="n">w</span> <span class="k">=></span> <span class="n">w</span><span class="o">.</span><span class="n">size</span> <span class="o">></span> <span class="mi">2</span><span class="o">).</span><span class="n">distinct</span>
<span class="o">}}</span>
<span class="c1">// Exiting paste mode, now interpreting.
</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.udf</span>
<span class="n">cleanColumn</span><span class="k">:</span> <span class="kt">org.apache.spark.sql.expressions.UserDefinedFunction</span> <span class="o">=</span> <span class="nc">UserDefinedFunction</span><span class="o">(<</span><span class="n">function1</span><span class="o">>,</span><span class="nc">ArrayType</span><span class="o">(</span><span class="nc">StringType</span><span class="o">,</span><span class="kc">true</span><span class="o">),</span><span class="nc">Some</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="nc">StringType</span><span class="o">)))</span>
<span class="n">scala</span><span class="o">></span> <span class="n">cleanColumn</span><span class="o">(</span><span class="n">lit</span><span class="o">(</span><span class="s">" here is a test string ... "</span><span class="o">)).</span><span class="n">expr</span><span class="o">.</span><span class="n">eval</span><span class="o">()</span>
<span class="n">res5</span><span class="k">:</span> <span class="kt">Any</span> <span class="o">=</span> <span class="o">[</span><span class="kt">here</span>,<span class="kt">test</span>,<span class="kt">string</span><span class="o">]</span>
</code></pre></div></div>
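<p>A related trick for test-driven development: if you extract the UDF body into a plain function, you can unit test the logic directly without Spark at all. The helper name below is hypothetical, a sketch rather than part of the original post:</p>

```scala
// The body of the cleanColumn UDF as a standalone function,
// testable without a SparkSession or even a Spark dependency.
def cleanText(str: String): Array[String] =
  str.toLowerCase
    .replaceAll("\\W", " ")   // replace non-word characters with spaces
    .replaceAll("\\s+", " ")  // collapse runs of whitespace
    .trim
    .split(" ")
    .filter(_.length > 2)     // keep words longer than two characters
    .distinct

println(cleanText(" here is a test string ... ").mkString(","))  // here,test,string
```

<p>The UDF can then be defined as <code>udf(cleanText _)</code>, while its logic stays covered by ordinary unit tests.</p>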
<p>So what I’ve tried to show here is that you can evaluate your column expression or UDF without using a dataframe. This can be useful when you are doing test-driven development and want to test your changes more frequently.</p>Kaushal PrajapatiThis post is about one of the Spark features which evaluates column expressions without a dataframe