Archive for the ‘Databases’ Tag
Deleting Rows In PostgreSQL Based On A Duplicate Column Values
Back at my previous company we usually had the task of deleting rows that had duplicate values in a specific column. This column would usually be the email field. As simple as this sounds, it’s not that easy to do in PostgreSQL.
My friend was talking about this after I had left and I thought I’d write a little SQL query to help solve the problem. It took me around an hour to do but it was worth the time and effort. It’s not the most efficient query and there may be other ways of doing it but this does what you need in SQL only.
DELETE FROM st_emails where id NOT IN --Any ids not in the list returned below are duplicates and therefore will be deleted
(
--Find all unique rows
SELECT
id
FROM
st_emails
WHERE
--Determine the smallest ids for rows with duplicate emails
id IN(
select id
from st_emails
where id in (select min(id) from st_emails group by email having (count(*) > 1) )
)
--Determine all emails that are unique
OR email IN (select email from st_emails group by email having (count(*) = 1))
)
The query basically deletes rows where the row id is not in a list of unique rows.
The unique rows are determined by finding the unique emails and then finding the smallest row id where the email is duplicated (this will ensure that all duplicates, except a single unique row is deleted – kinda like your limit clause).
If you want the newest email duplicate entry not to be deleted then just replace “min” with “max”.
Usual Things You Need To Find In Oracle
There are always silly little things that you always need to do but always can’t remember the syntax to do it. Well in Oracle some of these things are finding the full column and table name when you only know part of the column name; and the other is getting the full value (meaning) of a list of values from the cg_ref_codes table.
So lets look at the query to get the table name and column name:
select table_name, column_name from all_tab_columns where column_name like '%COL_NAME%'
This will give you a list of table names and column names which will hopefully help you find what you where looking for.
Oracle has this wonderful technology called Oracle Forms that basically helps you develop applications that makes very good use of all of the Oracle’s database features.
One of the tables that is created with Oracle Forms is the cg_ref_codes table. This table holds the column name, low value, meaning, and lots of other info that you probably don’t need to know. For example, if you have a column that holds the type of customers with values: ‘VIP’, ‘Common’, and ‘Special’; you could specify them in the Oracle Forms as single characters: ‘V’, ‘C’, and ‘S’.
If you’re running a large system you will most likely forget the meanings of ‘V’, ‘C’, and ‘S’. However, you can simply use the following query to get the full values you need:
select rv_low_value, rv_meaning, rv_domain from cg_ref_codes where rv_domain = 'CUST_TYPE'
The rv_domain is the column name that holds the values for your customer types. The rv_low_value will be ‘V’, ‘C’, and ‘S’; and the rv_meaning will be ‘VIP’, ‘Common’, and ‘Special’.
Simple yet easy to forget!
A Simple Way To Optimized Your Query
If you’re running a large query against your Oracle database and don’t want to kill it, then writing optimized queries is a must. One of the most simple and basic, yet powerful ways of doing this is by using an indexed column in your where clause.
The first thing to do is find all the indexes for a table and then use the most appropriate one.
Here is the code:
SELECT INDEX_NAME, COLUMN_NAME FROM ALL_IND_COLUMNS WHERE TABLE_NAME = 'CUSTOMER';
The code if you’re going via a link:
SELECT INDEX_NAME, COLUMN_NAME FROM ALL_IND_COLUMNS@DB_LINK WHERE TABLE_NAME = 'CUSTOMER';
Although this is something so simple, many people forget about indexes and how useful they are. All that’s left to do now is incorporating the index(s) you found using the query above into your main query. That will make a huge difference when you’re selecting millions of rows and joining millions of rows.
SQLite:The Most Widely Deployed Database
If you’re wondering how your cell phone remembers all your personal preferences and at the same time keeps all of your applications settings, SQLite could be the cause for all that convenience! It also claims to be one of the most deployed SQL databases around.
Now I know most of out there would much rather use Access with it’s simple interface, but the sqlite manager extension for Firefox makes using SQLite a breeze. Here is a screenshot of this little gem:
Here is a complete list of applications that can support this plugin.
SQLite is very useful when it comes to developing lightweight applications or, prototyping and testing applications. It is also integrated into Python and PHP so you can develop an app really quick without worrying about setting up a database server and all the admin that goes with it.
An impressive list of companies are also using SQLite.
Clean A Dirty Text File Into CSV Using Python
So since I’ve been working in the BI team at MMSA in Durbs there wasn’t many requests for me to export data to csv. Now as simple as this sounds it is very difficult doing this on millions of rows. Since I’m still a nOOb at Oracle and datawarehousing, it proved to be a challenge completing this task.
One of the barriers is that there is an access limit on the production system that basically won’t let you put too much pressure on it (an I/O usage restriction). So what I have to do is dump the data at night or during certain times. Using the SQL Plus tool I spooled the data to a text file. The problem is that I didn’t set the page size and a whole host of other commands that would write the data in a much easier to work with format.
Unfortunately I was in such a rush and ended up with a very dirty text file. Since I know Python I thought maybe I can use my skills to clean the file. I spent about 4-5 hours on this little baby:
def make_csv_row(temp):
count = 1
the_row = ''
for t in temp:
t = t.replace('\t',' ')
t = '"|"'.join(t.split())
t = '"' + t + '"'
if count != 4:
t = t + "|"
count += 1
the_row += t
return the_row
print 'Opening files'
OPEN_FILE = 'AUG08.csv'
OUT_FILE = 'AUG08_step_1.csv'
f = open(OPEN_FILE, 'rb') # file to read
fout = open(OUT_FILE, 'w') # file to write
print 'Process file'
counter = 0
for line in f.readlines():
if not(line.find('---------') != -1 or line.find('FIELD1') != -1 or line.find('FIELD2') != -1 or line.find('FIELD3') != -1 or line.find('FIELD4) != -1):
if line != ('\r\n'):
fout.write(line)
print 'Completed'
# close all files
f.close()
fout.close()
print 'process step 2'
f = open(OUT_FILE, 'rb') # file to read
fout = open(OPEN_FILE, 'w') # file to write
count = 1
temp = []
fout.write('"FIELD1"|"FIELD2"|"FIELD3"' + '\r\n')
for line in f.readlines():
temp.append(line.replace('\r\n', ''))
if count == 4:
t = make_csv_row(temp)
fout.write(str(t) + '\r\n')
count = 0
temp = []
count = count + 1
f.close()
fout.close()
Please don’t even ask what this does since it is too much of an effort to explain and still confuses me. Unfortunately this script almost worked but there was one little bug that prevented me from using it. So close yet so far.
There are probably a lot of refinements that could be done with this code but remember that I hadn’t touched Python in almost 6 months.
My View on PostgreSQL & MySQL
Well as most of you may know, PostgreSQL and MySQL are 2 of the most commonly used open source databases.
I have experience in using both of these excellent databases but a little more in MySQL.

MySQL Logo
I’ve always found MySQL easier to use because I do not have enough experience and knowledge in PostgreSQL. PostgreSQL certainly has many more features than MySQL and is extremely stable and standard compliant.
Recently I’ve been looking into settling on a database system that provides me with all the features I need while being easy to use, standard compliant and stable.
So on I go onto trusty Google and search for MySQL 6 since I remember that it was the next major releases. Clicking the link I am surprised to find that MySQL 6 is no longer in development.
Rather, the guys at MySQL have decided on a new Development Cycle. This may or may not be a good move but just dropping an expected release makes me feel uneasy about stability.
So I have a look at the latest version of PostgreSQL and it has the features that they promised and some very advanced ones too.

PostgreSQL Logo
So in my opinion, I see PostgreSQL as a much more stable database that has some really advanced features that I will hopefully learn to use one day.
Sure it may not be as popular as MySQL but it certainly is more technically sound and has enterprise features that sets it apart from all other databases out there, even commercial ones that cost an arm and a leg.
Limiting a Result Set in Oracle
The limit clause in SQL is pretty easy to use and very handy. This clause limits the result of your query and is used a lot for pagination and simple testing or debugging.
Here is a simple SQL statment that uses the limit clause:
select * from table_name limit 5;
The above will return 5 rows from the table.
Now this works find on most databases except for oracle. In oracle the limit functionality is as follows:
select * from table_name where rownum <= 5;
It’s kind of strange that Oracle does not implement the limit clause like most other databases but hopefully now you know how to do it using the where clause.
Comments (1)