Much thanks to Doug Hellmann and his Python Module of the Week for the syntax refresher.
Given these two files f1.txt
header1|"header 2"|"header | 3"|header 4 delimited file|1234|Pop goes the weasel|bob's your uncle Looks at these c,,,,|5321|abc|bob's your uncleand f2.txt
h1,h2,h3 a,"b,c",d ",1",2,"3" Z,Y,"X,"
I created a file called parseIt.py
mport csv import glob # helpful refreshers from http://www.doughellmann.com/PyMOTW/csv/index.html csv.register_dialect('pipes', delimiter='|') for fname in glob.glob(r'*.txt'): fin = open(fname, 'rt') print "processing ", fname if fname == 'f1.txt' : reader = csv.reader(fin, dialect='pipes') else: reader = csv.reader(fin) # Perform whatever action with the parsed data, either rewrite into something simpler or # make direct database calls # This code will simply rewrite everything into a consistent output format fout = open(fname + ".out.csv", 'wt') # writer = csv.writer(fout, quoting=csv.QUOTE_ALL) # writer = csv.writer(fout, quoting=csv.QUOTE_MINIMAL) writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC) # writer = csv.writer(fout, quoting=csv.QUOTE_NONE) for line in reader: print(line) writer.writerow(line) fin.close() fout.close()
So what's that do for us? Depending on which writer/quoting style you select above, it could give you any of the following. QUOTE_ALL, QUOTE_MINIMAL and QUOTE_NONNUMERIC presented below for f1.txt and f2.txt
"header1","header 2","header | 3","header 4" "delimited file","1234","Pop goes the weasel","bob's your uncle" "Looks at these c,,,,","5321","abc","bob's your uncle" header1,header 2,header | 3,header 4 delimited file,1234,Pop goes the weasel,bob's your uncle "Looks at these c,,,,",5321,abc,bob's your uncle "header1","header 2","header | 3","header 4" "delimited file","1234","Pop goes the weasel","bob's your uncle" "Looks at these c,,,,","5321","abc","bob's your uncle" "h1","h2","h3" "a","b,c","d" ",1","2","3" "Z","Y","X," h1,h2,h3 a,"b,c",d ",1",2,3 Z,Y,"X," "h1","h2","h3" "a","b,c","d" ",1","2","3" "Z","Y","X,"
Unfortunately, I'm going to be late for work or I'd love to show off some of the nifty things you can do with the data or the hacks you can employ to standardize it if it's even less formatted than I'm imagining.
No comments:
Post a Comment