Wednesday 7 February 2018

Azure Data Lake – U-SQL Script Construction


Introduction

In this article we are trying to discuss about basic U-SQL query construction Technique. Hope it will be informative.

 

First U-SQL Project Solution Explorer

View à Solution Explorer



 

The solution contains single U-SQL project. Here in this example it is “Test” project. Inside that project is the empty Script.usql file.

Notice that we can expand the .usql file node. It contains a C# file. This is called the U-SQL
Code-Behind file. U-SQL Code-behind is a convenience feature of ADLToolsForVS and not a part of the U-SQL language itself.

Input and Output

Compiling U-SQL file required Input. U-SQL script transforms Input into Output. There are two types of Input and Output.

·         Files
·         U-SQL Table


Location of Input and Output

During U-SQL Cloud Execution the inputs/outputs must all be in the cloud - typically this means Azure Data Lake Store.

During U-SQL Local Execution the inputs/outputs must all be on your own box. There's a special name for this location: The U-SQL Local Data Root.

We can find the local data root by going to Tools à Data Lake à Options and Settings. It in the field called DataRoot at the top.





Sample U-SQL Script

This script don’t have any transformation steps.

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog  
    TO "/output/SearchLog-first-u-sql.csv"
    USING Outputters.Csv();


This script doesn't have any transformation steps. It reads from the source file called SearchLog.tsv, schematizes it, and writes the rowset back into a file called SearchLog-first-u-sql.csv.
Notice the question mark next to the data type in the Duration field. It means that the Duration field could be null.

Here we have to understand Rowset, Extract and Output.





·         Rowset variables: Each query expression that produces a rowset can be assigned to a variable. U-SQL follows the T-SQL variable naming pattern (@searchlog, for example) in the script.

·         The EXTRACT keyword reads data from a file and defines the schema on read. Extractors.Tsv is a built-in U-SQL extractor for tab-separated-value files. We can develop custom extractors.

·         The OUTPUT writes data from a rowset to a file. Outputters.Csv() is a built-in U-SQL outputter to create a comma-separated-value file. You can develop custom outputters.
The EXTRACT and OUTPUT statements use file paths. File paths can be absolute or relative.


Here we can use scalar variable to make the script maintenance easier.

DECLARE @in  string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM @in
    USING Extractors.Tsv();

OUTPUT @searchlog  
    TO @out
    USING Outputters.Csv();



Hope you like it.


Posted by: MR. JOYDEEP DAS

4 comments:

  1. Good article that help me to understand the basic of usql .... Waiting for next in this series... Thanks

    ReplyDelete
  2. Waiting for next article. Hope it will come as early as possible.

    ReplyDelete
  3. Yes... I too layman of Azure... Its help me... to know basic of U-SQl...Waiting for next in this series...

    ReplyDelete
  4. It is a good article for understanding basic U-Sql.

    ReplyDelete