Archive for the ‘Informatica’ Category
Informatica Issue
I got this question as a comment.
Hi,
In informatica one of our mappings are using oracle target table. Daily it runs in production and complete in 10 min. but suddenly its just running and running. we stopped it forcely. what could be the potential cause for its continuous run?
Thanks.
There could be many reasons for this. One that comes to my mind is database statistics. When was the last time you gathered stats? The other reasons could be duplicate keys on the target table or if you have setup your session to collect all kinds of stats and data in the logs (it would just take longer but not a whole lot) or huge lookup source with no calculated cache settings….I could go on with this but it would be useless. I would like you to provide some information. Tell me about the version, the OS, the DB version, and all other architectural details. And tell me a bit about your mapping as well. If you want, go ahead and paste the session log here. I will take a look at it, see if I can find the issue and then remove it from the site.
Kirtan
Oracle timestamp in Informatica
To extract values from an oracle table column of timestamp datatype, you need to convert it to a character datatype since the Oracle Timestamp datatype is not currently supported by PowerCenter. To extract the value do the following.
SQL> insert into tst1 values (current_timestamp);
1 row created.
SQL> /
1 row created.
SQL> /
1 row created.
SQL> /
1 row created.
SQL> /
1 row created.
SQL> commit;
Commit complete.
SQL> select * from tst1;
TMSTMP
————————————————-
10-SEP-07 02.44.14.807930 PM
10-SEP-07 02.44.16.700373 PM
10-SEP-07 02.44.17.357597 PM
10-SEP-07 02.44.18.163876 PM
10-SEP-07 02.44.18.685926 PM
SQL> select tmstmp, to_char(tmstmp,’YYYY-MM-DD HH:MM:SS.FF6′) as CHAR_TMSTMP from tst1;
TMSTMP CHAR_TMSTMP
—————————- ————————–
10-SEP-07 02.44.14.807930 PM 2007-09-10 02:09:14.807930
10-SEP-07 02.44.16.700373 PM 2007-09-10 02:09:16.700373
10-SEP-07 02.44.17.357597 PM 2007-09-10 02:09:17.357597
10-SEP-07 02.44.18.163876 PM 2007-09-10 02:09:18.163876
10-SEP-07 02.44.18.685926 PM 2007-09-10 02:09:18.685926
The resultant column CHAR_TMSTMP in the output of the query is now ready to be extracted in a mapping.
Hope this helps…
Kirtan
nvarchar2 & varchar2
Informatica imports all VARCHAR2 fields as nVARCHAR2. This is a big deal. Especially when you are using update strategy to perform update-else-insert operations on target columns that are of VARCHAR2 datatype. One could expect noticable performance gain after converting all target column definitions for VARCHAR2 columns in Informatica.
Kirtan
Cache
Informatica uses cache files for various transformation wherever it needs to store data throughout the life of the mapping. For all these transformations Informatica uses Index Cache and Data Cache. This is done using either memory alone or using memory and physical disks. When the cache settings are calculated correctly and enough room is provided, Informatica PowerCenter stores data in memory. In cases where not enough size is provided for index and a data cache, Informatica stores as much as it can in the memory and then starts creating cache files on the physical disks. For example, if your cache file was supposed to have 500,000 lines but you only provided room for 100,000 lines through your cache settings, Informatica will write rest (400,000) of the lines into cache files on physical disks. I would think the size of these files depends on the kind of platform you are on.
In case of Lookup Transformation, Informatica also sorts the data by condition ports sequence. In cases where lookup cache is needed to be written to the physical disk, this sorting operation itself can become time consuming. But this could be advantageous if you can select the rows to be processed from the source qualifier in the same sorted order as the lookup condition ports sequence. So, if cache data for a particular key is in memory, then most probably the data for the adjacent keys will also be in memory. In this case, the time that lookup transformation would take to lookup data would keep increasing for each row as it would have to keep digging deeper in the file for each new row. Eventhough it sounds bad, it’s not it’s better than having data in random order. With appropriate indexes on the source tables, this operation could become very fast.
Wherever possible, use sorted data for these transformations. It is advisable to count appropriate amount of cache settings for each transformation that uses caching. Also one must consult the DBAs to come up with a good indexing strategy for all source tables.
Transformations that use caching for their process are…
Lookup Tranformation
Joiner Tranformation
Aggregator Tranformation
Rank Tranformation
Sorter Tranformation
Informatica – Seq. Gen. performance issue
Someone emailed me yesterday with a “never heard of”, at least for me, Informatica performance issue. The message said that they only have a sequence generator that is adding something to the data in the mapping. Everything else is coming straight from the source. The session took less than a minute to execute without a sequence generator and seven minutes with a sequence generator. It sounded a little odd because normally you would think that seq. gen. doesn’t make too much impact on the performance. But after a little investigation and a screenshot of the transformation properties we found out that a setting called “Number of Cached Values” was causing the issue. The value for it was set to 1 instead of recommended 0 (zero) or some large number. Setting this number to a small value could and would cause some issues for you. This setting is useful when there is a need to keep a unique set of values across mappings that share the same sequence generator. Usually these mappings are ran in parallel and the targets are same across all mappings that use this sequence generator.
This setting tells Informatica about the number of values that are stored in the repository when the session is started. The way Informatica processes this value is the key. Every time during the life of a session when this value is reached, Informatica queries the repository to get the next value for the Seq. Generator. So if you have this value set to 1, Informatica will query the repository database for every row that asks for a value from this transformation. And as you can imagine, this would definitely add some time to your the life of a session run.
If you set this number to some higher value like 15000, Informatica will query the repository after every 15000 rows. That may not necessarily be a bad thing. It would not have a major impact on total execution time. If the number is set to zero this setting would no more be a factor. Because then Informatica would NOT store the value of in the repository. Instead, it would be kept in the session memory.
The sequence number is kept on the server if the value is set to zero. In all other cases where it is set to be 1 or greater, the value is stored in the repository and the Informatica server process retrieves it from the database.
hope this helps…
Kirtan
Pivot Query
So how to pivot data in a table?
I was asked about this by someone at work today. Let’s say you have a table that looks like following. I have included create and insert statements so that you can see it and understand it yourself.
Create table t1 (a number, b number);
insert into t1 values (1,1);
insert into t1 values (1,2);
insert into t1 values (1,3);
insert into t1 values (2,1);
insert into t1 values (2,2);
insert into t1 values (2,3);
insert into t1 values (3,1);
insert into t1 values (3,2);
insert into t1 values (3,3);
select * from t1 order by 1,2;
####
(A hint aside: You must specify the order in which you want the data to be back. It is completely wrong to say that oracle will return the data in order it was inserted.)
####
A B
----- ------
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
Now, let’s say you want to retrieve all the records from table t1 in a fashion where all the values in column B that belong to the group Column A should be shown side by side. Something like
A Val1 Val2 Val3
-- ---- ---- ----
1 1 2 3
One way would be to write a procedural code that stores different values of a group in an array or a series of variables and then displays them on screen or returns them to the calling program.
Or
You could just write a simple SQL statement that does it for you. You could write a piece of code that looks like this…
select
a
,max(decode(b,1,b,null)) col1
,max(decode(b,2,b,null)) col2
,max(decode(b,3,b,null)) col3
from t1
group by a;
A COL1 COL2 COL3
—– ——- ——- ——-
1 1 2 3
2 1 2 3
3 1 2 3
3 rows selected
… and you won’t have to worry about supporting an object such as a function or a procedure.
What if you had null values in one of the columns.
update t1 set b=null where a=2 and b=2;
One row updated
select
a
,max(decode(b,1,b,null)) col1
,max(decode(b,2,b,null)) col2
,max(decode(b,3,b,null)) col3
from t1
group by a;
A COL1 COL2 COL3
—– ——- ——- ——-
1 1 2 3
2 1 3
3 1 2 3
3 rows selected
There are many ways to write pivot queries. This is probably the easiest way of doing it.
Using Informatica to perform pivot operation.
You can use the Normalizer Transformation to generate rows out of columns. But to pivot values in rows into columns you would have to use Aggregator Transformation. You have to use the *FIRST* or *LAST* function to achieve this. First or last functions will pick first or last of the incoming rows if there are duplicates. Here is an example…
Suppose you have a source table with this data that is a record of monthly expenses for each of your departments. create table deptexp (sales_id number(5), mon char(3), amt number(10,2));
insert into deptexp values (1,’JAN’,100);
insert into deptexp values (1,’FEB’,120);
insert into deptexp values (1,’MAR’,135);
insert into deptexp values (2,’JAN’,110);
insert into deptexp values (2,’FEB’,130);
insert into deptexp values (2,’MAR’,120);
Commit;
You want to load this data after denormalizing it into this structure.
create table deptexp1 (DEPT_ID number(5), JAN_AMT number(10,2), FEB_AMT number(10,2), MAR_AMT number(10,2))
The data after our process would look like…
DEPT_ID JAN_AMT FEB_AMT MAR_AMT
1 100 120 135
2 110 130 120
Do the following to accomplish this.
1. Create an Aggregator transformation with the following ports and expressions:
NAME IN/OUT EXPRESSION GROUP BY 2. Connect the DEPT_ID, MONTH and the AMOUNT ports from the Source Qualifier to the Aggregator.
DEPT_ID IN YES
MONTH IN NO
AMOUNT IN NO
JAN_AMT OUT FIRST(AMOUNT, MONTH='JAN')
FEB_AMT OUT FIRST(AMOUNT, MONTH='FEB')
MAR_AMT OUT FIRST(AMOUNT, MONTH='MAR')
APR_AMT OUT FIRST(AMOUNT, MONTH='APR')
3. Connect the JAN_AMT, FEB_AMT, MAR_AMT, etc. ports to the target.
And that would do it…
Hope this helps…
Kirtan
Informatica PowerCenter Connect for Web Services
(You may find similar contents in Informatica Documentation. I have borrowed some description from the manuals to keep single version of truth alive.)
Overview:
Informatica PowerCenter Connect for Web Services works as a consumer of a Web Service. You can use PowerCenter Connect for Web Services to integrate with PowerCenter to read data from a web service source and write data to a web service target. You can also transform data during a session using PowerCenter Connect for Web Services.
For example, in a financial system where you are required to submit only ‘validated’ data to your financial application (eg. peoplesoft). Your financial application may provide you with a functionality that lets you validate incoming data. It may provide you with a web services oriented functionallity to do that. In such case, while processing incoming data for some other validation, you would use Web Services Consumer Transformation using PowerCenter Connect for Web Services to send and recieve messages from the provided web service.
A web service is a set of operations that you can ‘consume’ over a intranet or the internet. You can look at a web service as one or more functions that return results. When you access a web service, you request that the web service perform an operation and return data. A web service can contain many web service operations.
Web service operations contain input and output messages. These messages are XML-formatted messages. They specify how to structure a request for a web service. Web service access involves providers and consumers. A web service provider refers to the server that hosts the web service. A web service consumer refers to the client that requests a web service. PowerCenter Connect for Web Services accesses web services as a web service consumer.
The web service you access can be remote or local. Someone at another organization can create and publish the web service, or someone at your organization can create and publish it.
You only use PowerCenter Connect for Web Services as a web service consumer. To expose a PowerCenter workflow as a web service and make it available to others, you use PowerCenter Web Services Provider.
To read data from a web service, write data to a web service, or transform data using a web service, you must import a web service operation. You can import a web service operation from a Web Services Description Language (WSDL) file. WSDL files describe web services and web service operations. PowerCenter Connect for Web Services uses the information in the WSDL file to access a web service operation.
PowerCenter Connect for Web Services uses the Simple Object Access Protocol (SOAP). SOAP is a protocol for exchanging information between computers. It specifies how to encode XML data so that programs on different operating systems can pass information to each other.
Web services hosts contain WSDL files and web services.
Coming soon : WSDL Files
Informatica Performance Tuning
Informatica Performance Tuning
I have gathered this list over time while working at different client sites. It is not necessarily my list. But it definitely is a good list. If you don’t agree with anything below, don’t follow it. Anyways, here it goes….
INFORMATICA BASIC TUNING GUIDELINES
The following points are high-level issues on where to go to perform “tuning” in Informatica’s products. These are, in no way, permanent problem solvers, nor are they the end-all solution. Just some items (which if tuned first) might make a difference. The level of skill available for certain items will cause the results to vary.
To ‘test’ performance throughput it is generally recommended that the source set of data produce about 200,000 rows to process. Beyond this – the performance problems / issues may lie in the database – partitioning tables, dropping / re-creating indexes, striping raid arrays, etc… Without such a large set of results to deal with, your average timings will be skewed by other users on the database, processes on the server, or network traffic. This seems to be an ideal test size set for producing mostly accurate averages.
Try tuning your maps with these steps first. Then move to tuning the session, iterate this sequence until you are happy, or cannot achieve better performance by continued efforts. If the performance is still not acceptable, then the architecture must be tuned which can mean changes to what maps are created). In this case, you can contact us – we tune the architecture and the whole system from top to bottom.
KEEP THIS IN MIND: In order to achieve optimal performance, it’s always a good idea to strike a balance between the tools, the database, and the hardware resources. Allow each to do what they do best. Varying the architecture can make a huge difference in speed and optimization possibilities.
Utilize a database for significant data handling operations (such as sorts, groups, aggregates). In other words, staging tables can be a huge benefit to parallelism of operations. In parallel design – simply defined by mathematics, nearly always cuts your execution time.
Localize. Localize all target tables on to the SAME instance of Oracle (same SID), or same instance of Sybase. Try not to use Synonyms (remote database links) for anything (including: lookups, stored procedures, target tables, sources, functions, privileges, etc…). Utilizing remote links will most certainly slow things down. For Sybase users, remote mounting of databases can definitely be a hindrance to performance.
If you can – localize all target tables, stored procedures, functions, views, and sequences in the SOURCE database. Again, try not to connect across synonyms. Synonyms (or remote database tables) could potentially affect performance by as much as a factor of 3 times or more.
Remember that Informatica suggests that each session takes roughly 1 to 1 1/2 CPU’s. In keeping with this – Informatica play’s well with RDBMS engines on the same machine, but does NOT get along (performance wise) with ANY other engine (reporting engine, java engine, OLAP engine, java virtual machine, etc…)
Remove any database based sequence generators. This requires a wrapper function / stored procedure call. Utilizing these stored procedures can cause performance to drop by a factor of 3 times. This slowness is not easily debugged – it can only be spotted in the Write Throughput column. Copy the map, replace the stored proc call with an internal sequence generator for a test run – this is how fast you COULD run your map. If you must use a database generated sequence number, use them as part of the source qualifier query. If you’re dealing with GIG’s or Terabytes of information – this should save you lot’s of hours tuning.
TURN OFF VERBOSE LOGGING. The session log has a tremendous impact on the overall performance of the map. Force over-ride in the session, setting it to NORMAL logging mode. Unfortunately the logging mechanism is not “parallel” in the internal core, it is embedded directly in to the operations.
Turn off ‘collect performance statistics’. This also has an impact – although minimal at times – it writes a series of performance data to the performance log. Removing this operation reduces reliance on the flat file operations. However, it may be necessary to have this turned on DURING your tuning exercise. It can reveal a lot about the speed of the reader, and writer threads.
Try to eliminate the use of non-cached lookups. By issuing a non-cached lookup, your performance will be impacted significantly. Particularly if the lookup table is also a “growing” or “updated” target table – this generally means the indexes are changing during operation, and the optimizer looses track of the index statistics. Again – utilize staging tables if possible. In utilizing staging tables, views in the database can be built which join the data together; or Informatica’s joiner object can be used to join data together – either one will help dramatically increase speed. The thing with Informatica’s Joiner Transformation is that it needs caching of the data to scan through it to find matches. In such cases, make sure you tweak the cache sizes. Unless your source tables are from different RDBMS, let the RDBMS join them.
Separate complex maps – try to break the maps out in to logical threaded sections of processing. Re-arrange the architecture if necessary to allow for parallel processing. There may be more smaller components doing individual tasks, however the throughput will be proportionate to the degree of parallelism that is applied.
Balance between Informatica and the power of SQL and the database. Try to utilize the DBMS for what it was built for reading / writing / sorting / grouping / filtering data en-masse. Use Informatica for the more complex logic, outside joins, data integration, multiple source feeds, etc… The balancing act is difficult without DBA knowledge. In order to achieve a balance, you must be able to recognize what operations are best in the database, and which ones are best in Informatica. This does not degrade from the use of the ETL tool, rather it enhances it – it’s a MUST if you are performance tuning for high-volume throughput.
TUNE the DATABASE. Don’t be afraid to estimate: small, medium, large, and extra large source data set sizes (in terms of: numbers of rows, average number of bytes per row), expected throughput for each, turnaround time for load, is it a trickle feed? Give this information to your DBAs and ask them to tune the database for “worst case”. Help them assess which tables are expected to be high read/high write, which operations will sort (order by) etc. Moving disks, assigning the right table to the right disk space could make all the difference. Utilize a PERL script to generate “fake” data for small, medium, large, and extra large data sets. Run each of these through your mappings – in this manner, the DBA can watch or monitor throughput as a real load size occurs.
Be sure there is enough SWAP, and TEMP space on your PMSERVER machine. Not having enough disk space could potentially slow down your entire server during processing (in an exponential fashion). Sometimes this means watching the disk space while your session runs. Otherwise you may not get a good picture of the space available during operation. Particularly if your maps contain aggregates, or lookups that flow to disk Cache directory or if you have a JOINER object with heterogeneous sources.
Place some good server load monitoring tools on your PMServer in development – watch it closely to understand how the resources are being utilized, and where the hot spots are. Try to follow the recommendations – it may mean upgrading the hardware to achieve throughput.
TWEAK SESSION SETTINGS. In the session, there is only so much tuning you can do. Balancing the throughput is important – by turning on “Collect Performance Statistics” you can get a good feel for what needs to be set in the session – or what needs to be changed in the database. Read the performance section carefully in the Informatica manuals. Basically what you should try to achieve is: OPTIMAL READ, OPTIMIAL THROUGHPUT, and OPTIMAL WRITE. Over-tuning one of these three pieces can result in ultimately slowing down your session.
Try to merge expression objects, set your lookups to unconnected (for re-use if possible), check your Index and Data cache settings if you have aggregation, or lookups being performed. Etc… If you have a slow writer, change the map to a single target table at a time – see which target is causing the “slowness” and tune it. Make copies of the original map, and break down the copies. Once the “slower” of the N targets is discovered, talk to your DBA about partitioning the table, updating statistics, removing indexes during load, etc… There are many database things you can do here.
Remove all other “applications” on the PMServer. Except for the database / staging database or Data Warehouse itself. PMServer plays well with RDBMS (relational database management system) – but doesn’t play well with application servers, particularly JAVA Virtual Machines, Web Servers, Security Servers, application, and Report servers. All of these items should be broken out to other machines. This is critical to improving performance on the PMServer machine.
Hope this helps.
Kirtan Desai
Updates
Dan Linstedt from Myers-Holum Inc made a few comments a few days ago on this post. I have added his comments below.
- If you use a joiner, always use a sorted joiner – the cost of sorting data coming from the RDBMS staging area is frequently less than the cost of building the caches in place. Furthermore, if you are NOT on 64 bit Informatica, or you don’t have unlimited (seemingly unlimited) RAM, you have an upper limit to the caching mechanisms in all the cached objects – including Joiner.
- If the RDBMS is NOT tuned properly, putting more work into the RDBMS will actually slow things down.
- Too many Instances of an RDBMS on the SAME MACHINE will actually kill performance rather than help it, and it really is not necessary in order to handle fail-over as many set it up to do.
- Replacing a Lookup with a SORTED JOINER can improve performance dramatically.
- The manuals have the formula wrong for the Data / Index Cache settings, even though it’s counter-intuitive, you want 100% of the INDEX cached if you can get it, giving up Data Cache for disk. Why? Because if you can’t access RAM to check for the existence of data you are actually increasing I/O – when you increase I/O you slow performance dramatically. Set your Index caches to twice the size of your data caches to be safe.
There are many variables which can change the course of one’s actions. Feel free to ask questions on this blog.
You are currently browsing the archives for the Informatica category.