ETL file processing pattern
In this post, I'd like to talk about something I always feel silly mentioning because it seems so elementary, yet plenty of people stumble on it. The lesson is simple: be consistent. For file processing, the consistency I crave is knowing where my files are going to be for processing. This post will have examples in SSIS, but whether you're using a custom Python script, Informatica, or Pentaho Kettle, the concept remains true: have your files in a consistent location.
You can tell when I've been somewhere, because I bring the following file structure for ETL processing with me.
Root node
All ETL processing will use a common root node/directory. I call it SSISData to make it fairly obvious what is in there, but call it what you will. On my dev machine, this is usually sitting right off C:. On servers though, oh hell no! The C drive is reserved for the OS. Instead, I work with the DBAs to determine where this will actually be located. Heck, they could make it a UNC path and my processing won't care, because I ensure that root location is an externally configurable "thing." Whatever you're using for ETL, I'm certain it will support the concept of configuration. Make sure the base node is configurable.
A nice thing about anchoring all your file processing to a head location is that if you are at an organization that is judicious about handing out file permissions, you don't have to create a permission request for each source of data. Get your security team to sign off on the ETL process having full control over a directory structure starting here. The number of SSIS-related permissions issues I answer on Stack Overflow is silly.
Disaster recovery! If you need to stand up a new processing site to simulate your current production environment, it'd be really convenient to only have to pull in one directory tree and say "we're ready."
I find it useful to make the root folder a network share as well so that whatever team is responsible for supporting the ETL process can immediately open files without having to RDP into a box just to check data. Make the share read-only because SOX, SAS70, etc.
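If your ETL is a custom script rather than an SSIS project, the same rule applies: the root has to come from configuration, not from code. Here's a minimal Python sketch, assuming a hypothetical SSISDATA_ROOT environment variable as the configuration source; swap in a config file, table, or parameter store as your shop prefers.

import os
from pathlib import Path

# SSISDATA_ROOT is a made-up environment variable standing in for whatever
# configuration mechanism you actually use. Because the value is external,
# the root can be C:\ssisdata on a dev box, D:\SSISData on a server, or a
# UNC path like \\fileserver\SSISData without any code change.
folder_root = Path(os.environ.get("SSISDATA_ROOT", r"C:\ssisdata"))

if not folder_root.exists():
    raise FileNotFoundError(f"Configured ETL root does not exist: {folder_root}")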
Subject area
Immediately under my root node, I have a subject area. If you're just beginning your ETL, this is a step people usually skip. "We'll commingle all the files in one folder, it's easier." Or they don't even think about it, because we gotta get this done.
Please, create subject areas to segment your files. Let's look at some reasons why you're likely going to want some isolation. "Data.csv". Great, is that the Sales data or the Employee data? You may not know until you open it up, because you're just receiving data with no control over what the original file is called. If you had work areas for your files, you could direct the data to be delivered to the correct location with no chance of one file clobbering another.
And while we're talking about processes you can't control, let's talk about how Marketing is just going to copy the file into the folder whenever it's ready. Creating a folder structure by subject area also allows you to better control folder permissions. Remember how I said to open a share? If the process is that Marketing copies the file in, give them write-only access to the folder they need to deliver to. If everyone copies files to a common folder, how many curious eyes will want to open Salaries_2016.xlsx? Folders make a fine mechanism for separating sensitive data.
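To make that segmentation concrete, here's a small Python sketch that lays down one folder per subject area (jumping ahead slightly to the Input/Output/Archive work folders described in the next sections). The subject area names are placeholders for your own domains.

from pathlib import Path

folder_root = Path(r"C:\ssisdata")  # or the externally configured root from earlier
subject_areas = ["Sales", "HumanResources", "Marketing"]  # placeholder names
work_folders = ["Input", "Output", "Archive"]

for subject in subject_areas:
    for work in work_folders:
        # e.g. C:\ssisdata\Sales\Input; exist_ok makes this safe to re-run
        (folder_root / subject / work).mkdir(parents=True, exist_ok=True)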
Input
The first folder under a given subject area is called Input. If my process needs to find data, it need only look in the Input folder. That's really about it: source data goes here.
Output
This is usually the least used of my folders, but I ensure every subject area has an Output folder. This way, I always know where I'm going to write output files. Output files might be immediately swept off to an FTP server or consumed by some other process, but this folder is where I put the data and where I can control access for external consumers of my data. I've been in places where developers made the great agreement of "you supply the data as CSV and we'll generate an Excel file when we're done." Except the people picking up the files weren't terribly computer savvy and didn't have their file extensions turned on... Yeah, so have an output folder and dump your data there.
Archive
This is my kitchen utility drawer. I throw everything in here when I'm done with it. For inbound files, the pattern looks something like
foreach source file
- process data
- move file to archive folder (possibly with date processed on the name)
Outbound processing is identical. The core process completes and generates files. A new process fires off and delivers those files.
foreach outbound file
- deliver file
- move file to archive folder (possibly with date processed on the name)
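If you were scripting this pattern instead of using SSIS, the inbound loop would look something like the following Python sketch; process_file, the Sales subject area, and the *.csv mask are placeholders for whatever your process actually does.

import shutil
from pathlib import Path

folder_input = Path(r"C:\ssisdata\Sales\Input")
folder_archive = Path(r"C:\ssisdata\Sales\Archive")

def process_file(source_file: Path) -> None:
    """Placeholder for the real work: load to staging, validate, transform, etc."""
    pass

# foreach source file: process it, then sweep it into Archive
for source_file in folder_input.glob("*.csv"):
    process_file(source_file)
    shutil.move(str(source_file), str(folder_archive / source_file.name))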
Two things to call out with this approach. The first is whether you rename the files to have a date stamp on them. That's great for having a trail of when you actually processed the data. For example, we'd get BigFile.csv in on New Year's Day, but due to year-end processes running long, we didn't actually load the file until January 2nd. Thus, when it gets archived, we might tack on a processed date like BigFile_2016-01-02.csv. On January 5th, bad things happen and we have to restore the database to January 1st. ETL processing is no problem, you just copy those files back into the Input folder and... oh, we expect the file to be named BigFile.csv exactly. Now you have to manipulate the file name before you can reprocess the data. That's a pain. Or, if the process accepts a file mask, you'll end up with BigFile_2016-01-02_2016-01-05.csv in the Archive folder because now we have processed the file twice.
The second thing is to use your language's library methods for renaming files. Don't assume everything from the first period to the end of the file name is the file extension, and don't assume the file extension is three characters.
Do not archive your files with any date stamp that isn't in the form yyyy-mm-dd. Month names sort horribly. When I'm looking for files, I'm looking by year, then month, then day. Do use a delimiter between year, month, and day. I know yyyymmdd is nicer in T-SQL and such, but for whatever reason I find it harder to mentally parse in a file name.
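Tying those cautions together, here's a hedged Python helper (the function is mine, not a standard library routine) that archives a file with a yyyy-mm-dd processed date, letting pathlib work out the stem and extension instead of guessing at periods or three-character suffixes.

import datetime
import shutil
from pathlib import Path

def archive_with_date(source_file: Path, folder_archive: Path, processed=None) -> Path:
    """Move source_file into folder_archive as name_yyyy-mm-dd.ext.

    pathlib's stem/suffix supply the real name and extension, so a file like
    Sales.Data.2016.csv keeps its full name and whatever extension it has.
    """
    processed = processed or datetime.date.today()
    target = folder_archive / f"{source_file.stem}_{processed:%Y-%m-%d}{source_file.suffix}"
    shutil.move(str(source_file), str(target))
    return target

# BigFile.csv processed on 2016-01-02 would land in Archive as BigFile_2016-01-02.csv:
# archive_with_date(Path(r"C:\ssisdata\Sales\Input\BigFile.csv"),
#                   Path(r"C:\ssisdata\Sales\Archive"), datetime.date(2016, 1, 2))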
SSIS file pattern
The following Biml expresses two different implementations of the same pattern. The first uses project- and package-level parameters in addition to SSIS variables. The project-level parameter expresses the root node. The package-level parameter defines the subject area. Maybe Subject Area gets promoted to a project parameter, based on the size and scope of your work.
Within each package, we'll then use an expression to build a FolderBase, which is RootNode + SubjectArea. We'll then use expressions to define FolderArchive, FolderInput, and FolderOutput. I name them in this manner because I want them to sort into the same area in my Variables window. If you really like to get clever, define a namespace for your variables beyond User, like "template" or "framework."
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Projects>
        <PackageProject Name="POC">
            <Parameters>
                <Parameter DataType="String" Name="FolderRoot">C:\ssisdata</Parameter>
            </Parameters>
            <Packages>
                <Package PackageName="FileProcessingParam"></Package>
            </Packages>
        </PackageProject>
    </Projects>
    <Packages>
        <Package Name="FileProcessingParam">
            <Parameters>
                <Parameter DataType="String" Name="SubjectArea">Sales</Parameter>
            </Parameters>
            <Variables>
                <Variable DataType="String" Name="FolderBase" EvaluateAsExpression="true">@[$Project::FolderRoot] + "\\" + @[$Package::SubjectArea]</Variable>
                <Variable DataType="String" Name="FolderArchive" EvaluateAsExpression="true">@[User::FolderBase] + "\\" + "Archive"</Variable>
                <Variable DataType="String" Name="FolderInput" EvaluateAsExpression="true">@[User::FolderBase] + "\\" + "Input"</Variable>
                <Variable DataType="String" Name="FolderOutput" EvaluateAsExpression="true">@[User::FolderBase] + "\\" + "Output"</Variable>
            </Variables>
        </Package>
        <Package Name="FileProcessingClassic">
            <Variables>
                <Variable DataType="String" Name="FolderRoot">C:\ssisdata</Variable>
                <Variable DataType="String" Name="SubjectArea">Sales</Variable>
                <Variable DataType="String" Name="FolderBase" EvaluateAsExpression="true">@[User::FolderRoot] + "\\" + @[User::SubjectArea]</Variable>
                <Variable DataType="String" Name="FolderArchive" EvaluateAsExpression="true">@[User::FolderBase] + "\\" + "Archive"</Variable>
                <Variable DataType="String" Name="FolderInput" EvaluateAsExpression="true">@[User::FolderBase] + "\\" + "Input"</Variable>
                <Variable DataType="String" Name="FolderOutput" EvaluateAsExpression="true">@[User::FolderBase] + "\\" + "Output"</Variable>
            </Variables>
        </Package>
    </Packages>
</Biml>
Using the above Biml generates two SSIS packages: FileProcessingParam, which uses parameters as part of the 2012+ Project Deployment Model.
FileProcessingClassic is an approach that will work across all versions of SSIS from 2005 to 2016 whether you use the Project Deployment Model or the Package Deployment Model.
Takeaway
Create a consistent environment for all of your file-based processing. It'll reduce the number of decisions junior developers need to make. It'll lower the complexity of recreating your environment and all of its permissions as projects migrate through environments. It'll ensure everyone knows where the data should be in the event of an emergency or when someone is on vacation. Finally, it simplifies your code, because all I need to do to be successful is ensure the file lands in the right spot.
2 comments:
I like this approach as well; it seems like I have worked with a guy who liked it.
What I find interesting is the lack of day-to-day ETL knowledge on the internet. We find lots of articles solving specific problems, but we rarely find ones solving the problem of how to do good release management and good file management.
There are some good articles, and some good information, but you never hear about any good books or definitive patterns for it that aren't ancient.
Things work differently now than they did 20 years ago, but there is still this mindset around older technologies dictating the new information.
Glad to see someone tackling something that "seems" so obvious but that you can really get bitten by when you go to other locations.
The only thing I have to note about this is that it CAN fall down as a general rule when you start to receive a LOT of files into a specific folder. The archive itself can become burdened with cruft. I haven't seen a method for solving this in any particular way yet, but I imagine you could set up some archive process that moved things to dated folders once they were older than X months.
I had a problem at my last job where we had many files per day, all very large, and just bringing up and sorting the folder contents was slow. It would have been useful to have another method, but unfortunately there wouldn't have been one as easy as just leaving them in the same folder and waiting.
I often see some variations on this pattern:
1) this pattern, but instead of altering the original file name to include a timestamp, put each file into a folder timestamped with the ETL start time ("SSISData\Subject Area\Archive\201604051504\input.csv")
2) the "Subject Area" and "Input/Output/Archive" folders turned around: "SSISData\Input\Subject Area"; "SSISData\Archive\201604051504\Subject Area\input.csv"
3) Another folder on the "Input/Output/Archive" level, called "In Progress". This is a folder users don't have access to, so they can't mess with the files SSIS is processing. It also enhances the debugging experience: after a failing ETL process, it's even more clear which files were being processed.