Saturday, April 10, 2010

SELECT * FROM spreadsheet (or How to parse a CSV file using PL/SQL)



I recently needed to retrieve/download a comma-separated values (CSV) file from a website, and insert the data in an Oracle database table.

After googling around a bit, I found various pieces of the solution on AskTom, ExpertsExchange and other sites, which I put together in the following generic utility package for CSV files.

Usage


Because I have implemented the main parsing routine as a pipelined function, you can process the data either using straight SQL, or in a PL/SQL program.



For example, you can download a CSV file as a clob directly from the web and return it as a table with a single statement:


select *
from table(csv_util_pkg.clob_to_csv(httpuritype('http://www.foo.example/bar.csv').getclob()))


And maybe do a direct insert via INSERT ... SELECT:

insert into my_table (first_column, second_column)
select c001, c002
from table(csv_util_pkg.clob_to_csv(httpuritype('http://www.foo.example/bar.csv').getclob()))
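
If the file starts with a header row, the line_number column exposed by t_csv_line can be used to skip it (a sketch, assuming the first line is numbered 1):

insert into my_table (first_column, second_column)
select c001, c002
from table(csv_util_pkg.clob_to_csv(httpuritype('http://www.foo.example/bar.csv').getclob()))
where line_number > 1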


You can of course also use SQL to filter the results (although this may affect performance):

select *
from table(csv_util_pkg.clob_to_csv(httpuritype('http://www.foo.example/bar.csv').getclob()))
where c002 = 'Chevy'


Or you can do it in a more procedural fashion, like this:

create table x_dump
(clob_value clob,
 dump_date date default sysdate,
 dump_id number);


declare
  l_clob clob;

  cursor l_cursor
  is
  select csv.*
  from x_dump d, table(csv_util_pkg.clob_to_csv(d.clob_value)) csv
  where d.dump_id = 1;

begin

  l_clob := httpuritype('http://www.foo.example/bar.csv').getclob();
  insert into x_dump (clob_value, dump_id) values (l_clob, 1);
  commit;
  dbms_lob.freetemporary (l_clob);

  for l_rec in l_cursor loop
    dbms_output.put_line ('row ' || l_rec.line_number || ', col 1 = ' || l_rec.c001);
  end loop;

end;



Auxiliary functions


There are a few additional functions in the package that are not necessary for normal usage, but may be useful if you are doing any sort of lower-level CSV parsing. The csv_to_array function operates on a single CSV-encoded line (so to use this you would have to split the CSV lines yourself first, and feed them one by one to this function):

declare
  l_array t_str_array;
  l_val varchar2(4000);
begin

  l_array := csv_util_pkg.csv_to_array ('10,SMITH,CLERK,"1200,50"');

  for i in l_array.first .. l_array.last loop
    dbms_output.put_line('value ' || i || ' = ' || l_array(i));
  end loop;

  -- should output SMITH
  l_val := csv_util_pkg.get_array_value(l_array, 2);
  dbms_output.put_line('value = ' || l_val);

  -- should give an error message stating that there is no column called DEPTNO because the array does not contain seven elements
  -- leave the column name out to fail silently and return NULL instead of raising exception
  l_val := csv_util_pkg.get_array_value(l_array, 7, 'DEPTNO');
  dbms_output.put_line('value = ' || l_val);

end;



Installation


In order to compile the package, you will need these SQL types in your schema:

create type t_str_array as table of varchar2(4000);
/

create type t_csv_line as object (
  line_number  number,
  line_raw     varchar2(4000),
  c001         varchar2(4000),
  c002         varchar2(4000),
  c003         varchar2(4000),
  c004         varchar2(4000),
  c005         varchar2(4000),
  c006         varchar2(4000),
  c007         varchar2(4000),
  c008         varchar2(4000),
  c009         varchar2(4000),
  c010         varchar2(4000),
  c011         varchar2(4000),
  c012         varchar2(4000),
  c013         varchar2(4000),
  c014         varchar2(4000),
  c015         varchar2(4000),
  c016         varchar2(4000),
  c017         varchar2(4000),
  c018         varchar2(4000),
  c019         varchar2(4000),
  c020         varchar2(4000)
);
/

create type t_csv_tab as table of t_csv_line;
/


UPDATE 04.04.2012: The latest version of the package itself (CSV_UTIL_PKG) can be found as part of the Alexandria Utility Library for PL/SQL.


Performance


On my test server (not my laptop), it takes about 35 seconds to process 12,000 rows in CSV format. I don't consider this super-fast, but probably fast enough for many CSV processing scenarios.

If you have any performance-enhancing tips, do let me know!

Bonus: Exporting CSV data


You can also use this package to export CSV data, for example with a query like this:

select csv_util_pkg.array_to_csv (t_str_array(company_id, company_name, company_type)) as the_csv_data
from company
order by company_name

THE_CSV_DATA
--------------------------------
260,Acorn Oil & Gas,EXT
261,Altinex,EXT
262,Amerada Hess,EXT
263,Atlantic Petroleum,EXT
264,Beryl,EXT
265,BG,EXT
266,Bow Valley Energy,EXT
267,BP,EXT

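
To write the result to a file on the database server, the same query can be combined with UTL_FILE (a sketch; it assumes a writable directory object named EXPORT_DIR exists in your environment):

declare
  l_file utl_file.file_type;
begin
  -- EXPORT_DIR is an assumed directory object; adjust to your environment
  l_file := utl_file.fopen('EXPORT_DIR', 'companies.csv', 'w');

  for l_rec in (select csv_util_pkg.array_to_csv
                         (t_str_array(company_id, company_name, company_type)) as the_csv_data
                from company
                order by company_name) loop
    utl_file.put_line(l_file, l_rec.the_csv_data);
  end loop;

  utl_file.fclose(l_file);
end;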
This might come in handy, even in these days of XML and JSON ... :-)

Tuesday, April 6, 2010

Using TRUNC and ROUND on dates

Maybe this is old news to some, but I recently became aware that it is possible to use TRUNC and ROUND not just on a NUMBER, but also on a DATE value.

For example, you can get the start of the month for a given date (using TRUNC), or the "closest" start of the month, rounded forward or backward in time as appropriate (using ROUND):

select sysdate,
  trunc(sysdate, 'YYYY') as trunc_year,
  trunc(sysdate, 'MM') as trunc_month,
  round(sysdate, 'MM') as round_month,
  round(sysdate + 15, 'MM') as round_month2
from dual

The above gives the following results:

SYSDATE                   TRUNC_YEAR                TRUNC_MONTH               ROUND_MONTH               ROUND_MONTH2              
------------------------- ------------------------- ------------------------- ------------------------- ------------------------- 
06.04.2010 20:10:56       01.01.2010 00:00:00       01.04.2010 00:00:00       01.04.2010 00:00:00       01.05.2010 00:00:00       
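
TRUNC accepts other format elements as well. For instance, calling it without a format element truncates the time portion to midnight, and 'IW' truncates to the start of the ISO week:

select sysdate,
  trunc(sysdate) as trunc_day,           -- midnight of the current day (the default)
  trunc(sysdate, 'IW') as trunc_iso_week -- Monday of the current ISO week
from dual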


Somewhat related to this topic is the relatively obscure (?) EXTRACT function, which allows you to extract a part of a DATE:

select sysdate,
  extract(day from sysdate) as extract_day,
  extract(month from sysdate) as extract_month,
  extract(year from sysdate) as extract_year
from dual

Which gives the following results:

SYSDATE                   EXTRACT_DAY            EXTRACT_MONTH          EXTRACT_YEAR           
------------------------- ---------------------- ---------------------- ---------------------- 
06.04.2010 20:13:01       6                      4                      2010                   

If you try to extract the "hour", "minute" or "second" from a DATE, however, you get ORA-30076: invalid extract field for extract source.

For some reason, these fields only work on TIMESTAMP values, not on the DATE datatype (which seems like an arbitrary limitation to me, given that a DATE does store a time component). With a TIMESTAMP, however:

select systimestamp,
  extract(hour from systimestamp) as extract_hour,
  extract(minute from systimestamp) as extract_minute,
  extract(second from systimestamp) as extract_second
from dual

The above gives the following results:

SYSTIMESTAMP                          EXTRACT_HOUR   EXTRACT_MINUTE   EXTRACT_SECOND
------------------------------------- -------------- ---------------- ----------------
06.04.2010 20.17.12,047000000 +02:00  18             17               12,047

Note that the extracted hour is 18 rather than 20: when the source is a TIMESTAMP WITH TIME ZONE value such as SYSTIMESTAMP, EXTRACT returns the hour in UTC.
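
If you do need the time components of a DATE, a simple workaround is to cast it to a TIMESTAMP first (or fall back to TO_CHAR):

select sysdate,
  extract(hour from cast(sysdate as timestamp)) as extract_hour,
  to_number(to_char(sysdate, 'MI')) as extract_minute
from dual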