A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Thursday, April 22, 2010

Python text parser

@Neil_Hambly is having some file import issues "BCP file Quoted identifier, comma seperated. some data fields have quotes. so client having import issues that issue is that these TEXT files are not going to work form them due to the data already having quotes etc.. 2/2 any suggestions" It sounded like some python scripting might be helpful for standardizing/resetting his input files. Python was my preferred tool for parsing through files as one point in my life and I thought I'd dust the cobwebs off that part of my life and attempt a solution.

Much thanks to Doug Hellmann and his Python Module of the Week for the syntax refresher.

Given these two files f1.txt
header1|"header 2"|"header | 3"|header 4
delimited file|1234|Pop goes the weasel|bob's your uncle
Looks at these c,,,,|5321|abc|bob's your uncle
and f2.txt
h1,h2,h3
a,"b,c",d
",1",2,"3"
Z,Y,"X,"

I created a file called parseIt.py
mport csv
import glob

# helpful refreshers from http://www.doughellmann.com/PyMOTW/csv/index.html
csv.register_dialect('pipes', delimiter='|')
for fname in glob.glob(r'*.txt'):
    fin = open(fname, 'rt')
    print "processing ", fname
    if fname == 'f1.txt' :
        reader = csv.reader(fin, dialect='pipes')
    else:
        reader = csv.reader(fin)

    # Perform whatever action with the parsed data, either rewrite into something simpler or
    # make direct database calls
    # This code will simply rewrite everything into a consistent output format

    fout = open(fname + ".out.csv", 'wt')
    # writer = csv.writer(fout, quoting=csv.QUOTE_ALL)
    # writer = csv.writer(fout, quoting=csv.QUOTE_MINIMAL)
    writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
    # writer = csv.writer(fout, quoting=csv.QUOTE_NONE)
    for line in reader:
        print(line)
        writer.writerow(line)
    fin.close()
    fout.close()

So what's that do for us? Depending on which writer/quoting style you select above, it could give you any of the following. QUOTE_ALL, QUOTE_MINIMAL and QUOTE_NONNUMERIC presented below for f1.txt and f2.txt
"header1","header 2","header | 3","header 4"
"delimited file","1234","Pop goes the weasel","bob's your uncle"
"Looks at these c,,,,","5321","abc","bob's your uncle"

header1,header 2,header | 3,header 4
delimited file,1234,Pop goes the weasel,bob's your uncle
"Looks at these c,,,,",5321,abc,bob's your uncle

"header1","header 2","header | 3","header 4"
"delimited file","1234","Pop goes the weasel","bob's your uncle"
"Looks at these c,,,,","5321","abc","bob's your uncle"

"h1","h2","h3"
"a","b,c","d"
",1","2","3"
"Z","Y","X,"

h1,h2,h3
a,"b,c",d
",1",2,3
Z,Y,"X,"

"h1","h2","h3"
"a","b,c","d"
",1","2","3"
"Z","Y","X,"

Unfortunately, I'm going to be late for work or I'd love to show off some of the nifty things you can do with the data or the hacks you can employ to standardize it if it's even less formatted than I'm imagining.

No comments: