World of Whatever

A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Tuesday, February 9, 2021

Including a local python module

Including a local python module

As we saw in reusing your python code, you can create a python file, a module, that contains our core business logic and then re-use that file. This post is going to talk about to make that happen.

What happens when you import antigravity? The python interpreter is going to check all the places it knows to find modules. It's going to check the install location for a module, it's going to check if you defined pythonhome/pythonpath environment variables, and you can hint to your hearts desire where to find files. If I import a real module, import pprint, I can access the __file__ property which will tell you where it found the module. In my case it was C:\Python38\lib\pprint.py

Python is happy to tell you what the current search path is import sys print(sys.path)

On my machine, that displays an array of ['', 'C:\\Python38\\python38.zip', 'C:\\Python38\\DLLs', 'C:\\Python38\\lib', 'C:\\Python38', 'C:\\Python38\\lib\\site-packages'] If I want reusable_code.py to be callable by another module, then it needs to exist in one of those locations. That first entry of a blank path is the current directory so as long as the module I need is in the same folder, we're golden! I have added i_use_resuable.py to the same folder as the above

# This module lives in the same folder as our reusable_code.py file
import reusable_code


def main():
    c = reusable_code.Configuration()
    print(c.get_modify_date())


if __name__ == '__main__':
    main()

Executing that, I get the expected timestamp - the exact same experience as running our corporate file but now I can focus on using the business logic instead of writing it. We're going to need something more though if we're going to get this reuable code into our pyspark cluster.

In the next post, we'll learn how to package our business module up into something we can install instead of just assuming where the file is.

As always, the code is on my github repro

Monday, February 8, 2021

Reusing your python code

Reusing your ptyhon code

I learned python in 2003 and used it for all the ETL work I was doing. It was beautiful and I would happilly wax to any programmer friends about the language and how they should be learning it. It turns out, my advocacy was just 15+ years too early. I recently had a client reach out to engage me to work on their Databricks project. No gentle reader, I don't much of anything about Databricks. But I do know about working with data, python programming (which I was already updating my mental model to 3.0) and pandas. Yes, pandas is not what we do in databricks but the concepts are similar.

One of the early observations is that they had dozens of notebooks with copy and paste code across them. Copy and paste code in a metadata driven solution isn't an evil but when you're hand crafting boiler plate code artifacts by hand, you're going to sneak a code mutation in there. So, let's look at how we can avoid this with code re-use.

Let's assume we use an important business process that needs to be consistent across our infrastructure. In this case, it's a modification date which is used as part of our partition strategy. This code nugget is spread across all those notebooks datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ") When processing starts up, we set a timestamp so that all activities accrue under that same timestamp. It's a common pattern across data processing. How could we do this better?

In classic python programming, we would abstract that logic away in a reusable library. In this example, I have created a module (file) named reusable_code.py In it, I created a class named Configuration and it exposes a method get_modify_date

# reusable_code.py is a python module that simulates our desire to 
#consolidate our corporate business logic into a re-usable entity

from datetime import datetime

class Configuration():
    """A very important class that provides a standardized approach for our company"""
    def __init__(self):
        # The modify date drives very important business functionality
        # so let's be consitent in how it is defined (ISO 8601)
        # 2021-02-07T17:58:20Z
        self.__modify_date__ = datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")

    def get_modify_date(self):
        """Provide a standard interface for accessing the modify date"""
        return self.__modify_date__


def main():
    c = Configuration()
    print(c.get_modify_date())

if __name__ == "__main__":
    main()

Usage is simple, I create an instance of my class c, which causes the constructor/initalizer to fire and set the modify date for the life of that object. Calling the get_modify_date method results in an ISO 8601 date to be emitted

2021-02-07T17:58:20Z

At this point, I hope you have an understanding of how we can make a reusable widget. Think about your business processes that you need to encapsulate in to reusable components and tomorrow we'll review using existing python modules in new files. After that, we'll cover converting this module into a wheel. And then we'll walk through installing it to a DataBricks cluster and using it from a notebook. Sound good?

All of this code is available on my github repository -> 2021-02-08_PythonReusableCode

Thursday, February 20, 2020

Making a delimited list

Making a delimited list

There are various ways to concatenate values together. A common approach I see is that people will add a delimiter and then the value and loop until they finish. Then they take away the first delimiter. Generally, that's easier coding, prepending a delimiter, than to append the delimiter and not do it for the final element of a list. Or add it and then remove the final delimiter from the resulting string.

Gentle reader, there is a better way. And has been for quite some time but if you weren't looking for it, you might not know it exists. In .NET, it's String.Join

Look at the following code, what would you rather write? The Avoid this block or simply use the libraries?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Text;
                    
public class Program
{
    public static void Main()
    {
        // generate a list of numbers
        List data = (Enumerable.Range(0, 10)).ToList();

        string delimiter = ",";
        // Avoid this, unless you know you need to do it for a specific reason
        {
            StringBuilder sb = new StringBuilder();
            foreach(var i in data)
            {
                sb.Append(delimiter);
                sb.Append(i);
            }
            
            // Convert the string builder to a string and then strip the first
            // character out of it
            string final = sb.ToString().Substring(delimiter.Length);
            
            Console.WriteLine(final);
        }
        
        // Make a comma delimited list
        Console.WriteLine(String.Join(delimiter, data));
        
        // What if we want to do something, like put each element in an XML tag?
        Console.WriteLine(String.Join(string.Empty, data.Select(x => string.Format("{0}", x)).ToList()));

    }
}
Output of running the above
0,1,2,3,4,5,6,7,8,9
0,1,2,3,4,5,6,7,8,9
0123456789
Gist for .net

But Bill, I use Python. The syntax changes but the concept remains the same. delimiter.join(data). Whatever language you use, there's probably an equivalent method. Look for it and use it. Don't write your own implementation.

Was there a better way to have done this? Let me know.

Tuesday, November 19, 2019

Generating characters in TSQL

Generating characters in TSQL

I had to do a thing* and it involved generating "codes" as numbers were too hard for people. So, if you have need to convert an arbitrary number into characters, this is your lucky day/post.

Background

As I get longer in the tooth programming becomes more accessible, I find that people might not have been exposed to underpinnings of how things used to work. Strings were just a bunch of characters put together and a character was a subset of the Latin alphabet shoved into 128 characters (0 to 127). The characters below 32 were referred to as the non-printable characters or control characters. Things above 32 are what you see on a US keyboard. There was a time, if you bought a programming book, it would have an ASCII table somewhere in the reference. Capital A is character 65, Capital Z is character 90 (65/A + 25 characters later). In TSQL, the CHAR function takes a number and gives you the ASCII character for the value so SELECT CHAR(66) AS B; will generate a capital B.

The mod or modulus function will return the remainder after division. Modding a value is a handy way to constrain a value between 0 and an upper threshold. In this case, if I modded any number by 26 (because there are 26 characters in the English alphabet), I'll get 0 to 25 as my result.

Knowing that the modulus function will give me 0 to 25 and knowing that my target character range starts at 65, I could use the previous expression to print any number's ascii value like SELECT CHAR((2147483625 % 26) + 65) AS StillB;. Break that apart, we do the modulus, %, which gives us the value of 1 which we then add to the starting offset (65).

Rolling all that together, here's a quick little tester to see what we can then do with it.

SELECT
    D.rn
,   ASCII_ORD.ord_value
,   ASCII_ORD.replicate_count
    -- CHAR converts a number to a character
,   CHAR(ASCII_ORD.ord_value) AS ord_value_as_character
    -- REPLICATE repeats a string N times
,   REPLICATE(CHAR(ASCII_ORD.ord_value), ASCII_ORD.replicate_count) AS RepeatedCharacter
    -- CONCAT is a null and type approach for string building (requires 2012+)
,   CONCAT(CHAR(ASCII_ORD.ord_value), ASCII_ORD.replicate_count) AS ConcatenatedCharacter
FROM
(
    -- Generate 0 to N-1 rows
    SELECT TOP (300)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1
    FROM
        sys.all_columns AS AC
)D(rn)
CROSS APPLY
(
    -- There are 26 characters in the English language
    -- 65 is the ASCII ordinal position of a capital A
    SELECT
        D.rn % 26 + 65
    ,   D.rn / 26 + 1
) ASCII_ORD(ord_value, replicate_count)
ORDER BY
    D.rn
;

Ultimately, it was decided that using a combination of character and digits (ConcatenatedCharacter) might be more user friendly than purely a repeated character approach. Neither of which will help you when you're in the 2 billion range like our sample input of 2147483625

Key takeaways

Don't confuse the CHAR function with the char data type. Similar but different

That's why books always had ASCII tables in them

Modulus function can generate a bounded set of numbers

Older developers might know some weird tricks/trivia

Even older developers will scoff at memorized ASCII tables in favor of EBCDIC tables

Thursday, September 5, 2019

Using Newtonsoft.Json with Biml

Using Newtonsoft.Json with Biml

Twitter provided an opportunity for a quick blog post

#sqlhelp #biml I would have the metadata in a Json structure. How would you parse the json in the C# BIML Script? I was thinking use Newtonsoft.Json but I don't know how to add the reference to it

Adding external assemblies is a snap but here I'll show how to use the NewtonSoft Json library to parse a Json based metadata structure and then use that in our Biml.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Given the following structure

///{
///  "packages": [
///    "p1",
///    "p2",
///    "p3"
///  ]
///}

// Assume the json file is located as specified
string sourceFile = @"C:\ssisdata\trivial.json";

// Read the data into a string variable
string json = System.IO.File.ReadAllText(sourceFile);
 
// Deserialize the json into a dictionary of strings (packages) and a list of strings (p1, p2, p3)
Dictionary<string, List<string>> metadata = JsonConvert.DeserializeObject<Dictionary<string, List<string>>>(json);
#>
<Packages>
<#
// Shred the dictionary for our values
foreach (string item in metadata["packages"])
{
    //WriteLine(String.Format("<!-- {0} -->", item));
#>
    <Package Name="<#=item #>" />
<#
}
#> 
</Packages>
</Biml>

<#@ import namespace="Newtonsoft.Json" #>
<#* Assuming we have GAC'ed the assembly *#>
<#@ assembly name= "Newtonsoft.Json.dll" #>

The gist is also posted in case I mangled the hand crafted html entities above.

Also, not covered is GAC'ing the assembly but you can use an explicit path to your DLL name="C:\where\did\I\put\this\Newtonsoft.Json.dll"

Wednesday, February 27, 2019

Biml in Azure aka CallBimlScriptContent

CallBimlScriptContent was introduced with the migration from Mist to BimlStudio. Why is this cool? You do not have to use files sitting on your computer as the source for your Biml. As long as you can reconstitute the Biml contents into a string, you can store your scripts where ever you'd like. If you want them in a database, that's great. Store them in the cloud? Knock yourself out.

As a consultant, the latter is rather compelling. Maybe I'm only licensing my clients to use accelerators during our engagement. If I leave files on the file system after I roll off, or they image my computer and accidentally collect them, I am David fighting Goliath. CallBimlScriptContent is a means to protect myself and my IP. Let's look at a trivial example. I set a C# string with an empty Package tag (hooray for doubling up my double quotes). Within my Packages collection, I invoke CallBimlScriptContent passing in my Biml content.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Do something here to populate myBimlFile
string myBimlFile = @"<Package Name=""ABC"" />";
#>    
    <Packages>
        <#=CallBimlScriptContent(myBimlFile)#>
    </Packages>
</Biml>
The rendered Biml for the above would look like
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package Name="ABC" />
  </Packages>
</Biml>

It's super that it works, but that's not convenient. Like, at all! Plus, good luck trying to embed any complexity in that string.

So, let's try something a little more complex. Conceptually, imagine we have two Biml Scripts we might choose to call inc_Package_00.biml and inc_Package_10.biml <#@ property name="parameterName" type="string" #>

inc_Package_00.biml

<Package Name="ABC" />

inc_Package_10.biml

<#@ property name="packageName" type="string" #>
<Package Name="packageName" />
Our original code could then look like
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Do something here to populate myBimlFile
string myBimlFile =System.IO.File.ReadAllText(@"C:\tmp\inc_Package_00.biml");
#>    
    <Packages>
        <#=CallBimlScriptContent(myBimlFile, "Package_00)"#>
    </Packages>
</Biml>
Do you need to pass parameters? It's no different than what you're used to doing
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Do something here to populate myBimlFile
string myBimlFile =System.IO.File.ReadAllText(@"C:\tmp\inc_Package_10.biml");
#>    
    <Packages>
        <#=CallBimlScriptContent(myBimlFile, "Package_10)"#>
    </Packages>
</Biml>

In the next post, I'll show you how to use reference data stored in tables or Azure as your BimlScript Content. Stay tuned!

Thursday, December 20, 2018

My github repository

In preparation for my talk at the Kansas City SQL Server User Group this afternoon, I am putting this post here so people can get the materials easily.

Lightning Talks