Saturday 6 October 2018

Debugging and Remote Submission of Spark Job in HDInsight Spark Cluster


Introduction


In this article we create a Scala program in IntelliJ, debug it locally, then debug the same program within an HDInsight Spark cluster, and finally submit the Spark job to the HDInsight Spark cluster remotely.

Steps Involved:
  1. From IntelliJ, create an application that can read data from Blob storage and write the count output back to Blob storage
  2. Run the application in HDInsight (IntelliJ HDInsight submission)
  3. Create the JAR file
  4. Upload the JAR file into Blob storage
  5. Install the cURL utility for command-line submission
  6. Submit the Spark job with the cURL command line (see the sketch after the note below)

Our HDInsight Spark cluster must use Blob storage as its primary storage. This process does NOT work if the cluster's primary storage is Azure Data Lake Storage (ADLS).
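For step 6, the remote submission goes through the cluster's Livy REST endpoint. A minimal sketch, assuming a cluster named mycluster, the cluster login admin, and a JAR already uploaded to the cluster's primary Blob container (the cluster name, credentials, JAR path and class name are all placeholders):

curl -k --user "admin:<password>" \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{ "file":"wasb:///example/jars/WordCountApp.jar", "className":"com.example.WordCountApp" }' \
  "https://mycluster.azurehdinsight.net/livy/batches"

The response contains a batch id, and the job's state can then be polled at /livy/batches/<id> on the same endpoint.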



Sunday 24 June 2018

Spark working with Unstructured data


Introduction
In my previous articles with Spark, we worked with structured and semi-structured data sources. Here in this article we work with an unstructured data source.
Hope it will be interesting.

Case Study
We have a notebook text file and we want to find the word count in it.



Scala Code

// Path to the input text file (a local Windows path in this example)
val dfsFilename = "D:/spark/bin/examples/src/main/resources/notebook.txt"

// Read the file, split each line into words, and count each word's occurrences
val text = sc.textFile(dfsFilename)
val counts = text.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect.foreach(println)
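
To show the most frequent words first, the pair RDD can be sorted by count before printing; a small optional extension of the same pipeline:

// Sort by count in descending order and print the ten most frequent words
counts.sortBy(_._2, ascending = false).take(10).foreach(println)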

Output






Hope you like it.

Posted by: MR. JOYDEEP DAS

Wednesday 20 June 2018

Spark to Connect with Azure SQL DB and read Table


Introduction
In this article we connect Spark with Azure SQL DB and simply read a table.
Hope it will be interesting.

Scala Code
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata // only needed for bulk-copy writes, not for this read
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val url = "[Enter your url here]"
val databaseName = "[Enter your database name here]"
val dbTable = "[Enter your database table here]"
val user = "[Enter your username here]"
val password = "[Enter your password here]"

// READ FROM CONFIG
val readConfig = Config(Map(
  "url"            -> url,
  "databaseName"   -> databaseName,
  "user"           -> user,
  "password"       -> password,
  "connectTimeout" -> "5",
  "queryTimeout"   -> "5",
  "dbTable"        -> dbTable
))
val df = sqlContext.read.sqlDB(readConfig)
println("Total rows: " + df.count)
df.show()
// TRADITIONAL SYNTAX
import java.util.Properties
val properties = new Properties()
properties.put("databaseName", databaseName)
properties.put("user", user)
properties.put("password", password)
properties.put("connectTimeout", "5")
properties.put("queryTimeout", "5")
// Named df2 so the script does not redeclare df from the config-based read above
val df2 = sqlContext.read.sqlDB(url, dbTable, properties)
println("Total rows: " + df2.count)
df2.show()
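
If the azure-sqldb-spark connector is not available, the same table can be read with Spark's built-in JDBC data source. A minimal sketch, assuming url holds just the server name (as the connector expects) and that the Microsoft SQL Server JDBC driver is on the classpath:

// Built-in JDBC read; no external connector required
val jdbcUrl = s"jdbc:sqlserver://$url:1433;database=$databaseName"
val jdbcDF = spark.read.
  format("jdbc").
  option("url", jdbcUrl).
  option("dbtable", dbTable).
  option("user", user).
  option("password", password).
  load()
jdbcDF.show()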



Hope you like it.



Posted by: MR. JOYDEEP DAS



Monday 18 June 2018

SSIS Folder Traversing in SPARK SQL


Introduction
Here in this article, we demonstrate the folder traversing of the SSIS ForEach Loop container in Spark, searching for specified files across sub-folders.
Hope it will be interesting.

Scenario
We have a folder named “Sample”. Under this folder, we have three other folders named “Sample-1”, “Sample-2” and “Sample-3”. In each folder there is a flat file, named “Student-1.txt”, “Student-2.txt” and “Student-3.txt” respectively.

We need to read all three files from their different folder locations.
The folder and file structure is displayed by the DOS TREE command.
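
A sketch of that structure, reconstructed from the scenario description above:

Sample
├── Sample-1
│   └── Student-1.txt
├── Sample-2
│   └── Student-2.txt
└── Sample-3
    └── Student-3.txt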



Scala Code
//---------------------------------------
// Scala for SPARK to Read Flat Files from Different Folders
// Implements the Folder Traversing of SSIS in Spark
// Creation Date: 06/18/2018
//-----------------------------------------
import org.apache.spark.sql.SparkSession
import spark.implicits._


case class Student(roll: Long, name: String)

// The wildcard path reads Student-*.txt from every Sample-* sub-folder in one pass
val studentDF = spark.sparkContext.
  textFile("d:/spark/bin/examples/src/main/resources/sample/*/student-*.txt").
  map(_.split(",")).
  map(attributes => Student(attributes(0).trim.toLong, attributes(1).trim)).
  toDF()

studentDF.createOrReplaceTempView("student")

val resultDF = spark.sql("SELECT roll, name FROM student")

resultDF.show
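
As an aside, on Spark 2.3 and later the built-in CSV reader accepts the same wildcard path, so the RDD split-and-map step can be skipped; a sketch assuming the same two columns:

// Alternative: read the wildcard path directly with the CSV data source (Spark 2.3+)
val studentCsvDF = spark.read.
  schema("roll LONG, name STRING").
  csv("d:/spark/bin/examples/src/main/resources/sample/*/student-*.txt")

studentCsvDF.show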

Output






Hope you like it.



Posted By: MR. JOYDEEP DAS

Sunday 17 June 2018

SSIS Conditional Split with SPARK SQL


Introduction
Here in this article we are recreating the SSIS Conditional Split transform by using Spark SQL; in Spark terms, this means splitting a DataFrame using filter transformations.
Hope it will be interesting.

What We Want to Do
We have a flat file named studentmarksdetails.txt, shown below.

101,Joydeep Das,Math,10
102,Deepasree Das,Math,89
103,Shipra Roy Chowdhury,Math,100
104,Rajesh Roy,Math,45
105,Sunita Tendon,Math,98
106,Amit Basu,Math,20
107,Debalina Bhattacharya,Math,99
108,Pritam Das,Math,40
109,Partha Deb Das,Math,89
110,Onkita Gupta,Math,100

The file contains Student Roll, Student Name, Subject and Marks. Based on Marks, we need to split the records into two outputs:

If [ Marks ] > 50
Display the records in the first output
Otherwise ([ Marks ] <= 50)
Display the records in the second output

(Using “otherwise” rather than “< 50” ensures that records with exactly 50 marks are not dropped; the code below filters on the condition and its negation.)

Scala Code
//---------------------------------------
// Scala for SPARK to Read Flat File
// Make a Conditional Split Depending on Marks (Marks > 50)
// Creation Date: 06/17/2018
//-----------------------------------------
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, not} // provides col() and not(), used for the split condition
import spark.implicits._

case class Student(roll: Long, name: String, subject: String, marks: Long)

//Read FLAT File
val studentDF = spark.sparkContext.
  textFile("d:/spark/bin/examples/src/main/resources/studentmarksdetails.txt").
  map(_.split(",")).
  map(attributes => Student(attributes(0).trim.toLong, attributes(1).trim, attributes(2).trim, attributes(3).trim.toLong)).
  toDF()

studentDF.cache() // cached because the DataFrame is scanned twice below

val condition = col("marks") > 50                 // split condition
val studentDF1 = studentDF.filter(condition)      // marks > 50
val studentDF2 = studentDF.filter(not(condition)) // marks <= 50

//Making View for FLAT file
studentDF1.createOrReplaceTempView("studentrecord1")
studentDF2.createOrReplaceTempView("studentrecord2")

val splitDF1 = spark.sql("SELECT roll, name, subject, marks FROM studentrecord1")
splitDF1.show

val splitDF2 = spark.sql("SELECT roll, name, subject, marks FROM studentrecord2")
splitDF2.show
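
Equivalently, the split condition can be pushed into the SQL itself from a single view, instead of creating two pre-filtered views; a sketch using one view named student:

//Alternative: one view, with the split condition expressed in SQL
studentDF.createOrReplaceTempView("student")

spark.sql("SELECT roll, name, subject, marks FROM student WHERE marks > 50").show
spark.sql("SELECT roll, name, subject, marks FROM student WHERE marks <= 50").show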

Output





Hope you like it.

Posted by: MR. JOYDEEP DAS

Friday 15 June 2018

Download JSON file from Azure Storage and Read it by SSIS – Part 2

Introduction

This article is the continuation of my previous article, “Download JSON file from Azure Storage and Read it by SSIS”.
In this article we read the JSON file and store it in a relational database table. For that we use an SSIS Script Component. Hope it will be interesting.

JSON File

A sample of the JSON file:
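
A minimal sketch of the structure, reconstructed from the jData and files classes defined in the script below (all values are placeholders):

{
  "rows": [
    {
      "id": "A-1001",
      "dis": "Air Handler 1",
      "siteRef": "Site-01",
      "assetGmars": "GMARS-01",
      "assetVfa": "VFA-01",
      "equipRef": "EQ-01"
    }
  ]
}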




Data Flow Task






Script Component Settings
Connection Manager with JSON File
Input Output Columns

Edit Script
References needed: System.Web.Extensions (it provides the System.Web.Script.Serialization namespace used below)

Name Spaces needed:
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Web.Script.Serialization;

C# Code:
#region Help:  Introduction to the Script Component
/* The Script Component allows you to perform virtually any operation that can be accomplished in
 * a .Net application within the context of an Integration Services data flow.
 *
 * Expand the other regions which have "Help" prefixes for examples of specific ways to use
 * Integration Services features within this script component. */
#endregion

#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Web.Script.Serialization;
#endregion

/// <summary>
/// This is the class to which to add your code.  Do not change the name, attributes, or parent
/// of this class.
/// </summary>
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
   
    IDTSConnectionManager100 connMgr;

    /// <summary>
    /// This method is called once, before rows begin to be processed in the data flow.
    ///
    /// You can remove this method if you don't need to do anything here.
    /// </summary>
    public override void PreExecute()
    {
        base.PreExecute();
        /*
         * Add your code here
         */
        connMgr = this.Connections.MyFlatFileConnectionMgr;
    }

    /// <summary>
    /// This method is called after all the rows have passed through this component.
    ///
    /// You can delete this method if you don't need to do anything here.
    /// </summary>
    public override void PostExecute()
    {
        base.PostExecute();
        /*
         * Add your code here
         */
    }

    public override void CreateNewOutputRows()
    {
        /*
          Add rows by calling the AddRow method on the member variable named "<Output Name>Buffer".
          For example, call MyOutputBuffer.AddRow() if your output was named "MyOutput".
        */
        JavaScriptSerializer js = new JavaScriptSerializer();

        // Read the whole JSON file; the connection manager's ConnectionString holds the file path
        byte[] jsonbyte = System.IO.File.ReadAllBytes(connMgr.ConnectionString);
        // UTF-8 is the usual encoding for JSON; ASCII would mangle any non-ASCII characters
        string reviewConverted = System.Text.Encoding.UTF8.GetString(jsonbyte);


        // deserialize the string
        jData jdata = js.Deserialize<jData>(reviewConverted);

        foreach (files f in jdata.rows)
        {
            Output0Buffer.AddRow();
           
            // The fields are already strings, so no ToString() call is needed
            Output0Buffer.id = string.IsNullOrEmpty(f.id) ? "" : f.id;
            Output0Buffer.dis = string.IsNullOrEmpty(f.dis) ? "" : f.dis;
            Output0Buffer.siteRef = string.IsNullOrEmpty(f.siteRef) ? "" : f.siteRef;
            Output0Buffer.assetGmars = string.IsNullOrEmpty(f.assetGmars) ? "" : f.assetGmars;
            Output0Buffer.assetVfa = string.IsNullOrEmpty(f.assetVfa) ? "" : f.assetVfa;
            Output0Buffer.equipRef = string.IsNullOrEmpty(f.equipRef) ? "" : f.equipRef;
        }
    }

}

internal class jData
{
    public files[] rows { get; set; }
}

internal class files
{
    public string id { get; set; }
    public string dis { get; set; }
    public string siteRef { get; set; }
    public string assetGmars { get; set; }
    public string assetVfa { get; set; }
    public string equipRef { get; set; }
}
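
Note: the Output0Buffer property names (id, dis, siteRef, assetGmars, assetVfa and equipRef) are generated by SSIS from the columns defined on the Script Component's Input and Output Columns page, and Connections.MyFlatFileConnectionMgr must match the name given on the Connection Managers page, so both must be configured before this script will compile.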


Hope you like it.



Posted By: MR. JOYDEEP DAS