Archive for the ‘csv’ Tag
Dumping Large Volumes Of Data In Oracle
I recently had to do a dump of our system to csv files for analysis by auditors. Now exporting 23 million rows is not the easiest thing to do. Luckily I picked up a few tricks from our datawarehousing guru.
The first thing you need to do is move the data from the production system to the datawarehouse. For example if you need to export data from a table called supplier_claims this is roughly what you would do:
First logon to the warehouse and run the following command:
create table supplier_claims as select * from sharedb.supplier_claims@dblink where 1 = 2; alter table supplier_claims compress;
This ensures that the table is created with the exact columns and types as the live table. The where clause ensure no rows are dumped right yet. The compress statement will ensure that the data is compressed for this table. This not only saves database storage but it is also more efficient since the server has less I/O to do. Since I/O is slower this makes sense.
Now you can dump the rows you need:
insert into <code>supplier_claims </code>/* APPEND */ select * from sharedb.supplier_claims@dblink where billing_date >= '01-JAN-09'
Now you have the data you need on the warehouse which means the production system will not have to suffer and you can work with the data anytime of the day.
A very useful and valueable tool that has saved me hours is the sqluldr2 tool. It is so simple and easy to use. You can export the data you need into a csv file specifying the delimiter you want and if you need headings included.
Clean A Dirty Text File Into CSV Using Python
So since I’ve been working in the BI team at MMSA in Durbs there wasn’t many requests for me to export data to csv. Now as simple as this sounds it is very difficult doing this on millions of rows. Since I’m still a nOOb at Oracle and datawarehousing, it proved to be a challenge completing this task.
One of the barriers is that there is an access limit on the production system that basically won’t let you put too much pressure on it (an I/O usage restriction). So what I have to do is dump the data at night or during certain times. Using the SQL Plus tool I spooled the data to a text file. The problem is that I didn’t set the page size and a whole host of other commands that would write the data in a much easier to work with format.
Unfortunately I was in such a rush and ended up with a very dirty text file. Since I know Python I thought maybe I can use my skills to clean the file. I spent about 4-5 hours on this little baby:
def make_csv_row(temp):
count = 1
the_row = ''
for t in temp:
t = t.replace('\t',' ')
t = '"|"'.join(t.split())
t = '"' + t + '"'
if count != 4:
t = t + "|"
count += 1
the_row += t
return the_row
print 'Opening files'
OPEN_FILE = 'AUG08.csv'
OUT_FILE = 'AUG08_step_1.csv'
f = open(OPEN_FILE, 'rb') # file to read
fout = open(OUT_FILE, 'w') # file to write
print 'Process file'
counter = 0
for line in f.readlines():
if not(line.find('---------') != -1 or line.find('FIELD1') != -1 or line.find('FIELD2') != -1 or line.find('FIELD3') != -1 or line.find('FIELD4) != -1):
if line != ('\r\n'):
fout.write(line)
print 'Completed'
# close all files
f.close()
fout.close()
print 'process step 2'
f = open(OUT_FILE, 'rb') # file to read
fout = open(OPEN_FILE, 'w') # file to write
count = 1
temp = []
fout.write('"FIELD1"|"FIELD2"|"FIELD3"' + '\r\n')
for line in f.readlines():
temp.append(line.replace('\r\n', ''))
if count == 4:
t = make_csv_row(temp)
fout.write(str(t) + '\r\n')
count = 0
temp = []
count = count + 1
f.close()
fout.close()
Please don’t even ask what this does since it is too much of an effort to explain and still confuses me. Unfortunately this script almost worked but there was one little bug that prevented me from using it. So close yet so far.
There are probably a lot of refinements that could be done with this code but remember that I hadn’t touched Python in almost 6 months.
Leave a Comment