
Calling RFC from BODS

 

Introduction:-

 

In this scenario I demonstrate how to call a remote-enabled function module (RFC) from BODS.

 

1) Create SAP Application Datastore.

     In this example I am using “SAP_BI” as the SAP Application datastore.

     As I have created the function module in the BI system, I have created the datastore for that system.

 

2) Import RFC from SAP system.

  • In Local Object Library expand SAP datastore.
  • Right click on Functions & click "Import By Name".

 

Image1_1.jpg

 

  • Enter the name of the RFC to import & click on "Import".

        Here I am using the “ZBAPI_GET_EMPLOYEE_DETAILS” as the RFC.

 

Image2.jpg

  • RFC will be imported & can be seen in the Local Object Library.

 

Image3.jpg

Note :- This RFC takes an Employee ID as input & returns the employee's details.

            I have stored the employee IDs in a text file, so I am using a file format as the source to read it.

 

3) Create File Format for flat (text) file.

 

   This file format(here "Emp_Id_Format") has the list of employee ids.

 

Image4.jpg

 

4) Create Job, Workflow, Dataflow as usual.

 

5) Drag File Format into dataflow & mark it as a Source.

 

6) Drag a Query transform into the data flow as well & name it (here "Query_fcn_call").

 

Image5.jpg

7) Assign RFC call from Query

 

  • Double click on Query.

 

Image6.jpg

 

  • Right click on "Query_fcn_call" & click "New Function Call".

 

Image7.jpg

 

  • “Select Function” window will open. Choose appropriate function & click "Next".

Image8.jpg

 

  • In the below window, click the button.jpg button & define an input parameter.

 

Image9.jpg

 

  • Select the file format that we have created earlier in "Input Parameter" window & press OK.

 

Image10.jpg

 

  • Select Column name from the input file format & press "OK".

        Here the file format has one column only with name as “Id”.

 

Image11.jpg

 

  • Click "Next" & select Output Parameters.

 

Image12.jpg

 

  • Select the required output parameters & click "Finish".

        Here I am selecting all the fields.

 

Image13.jpg

 

Image14.jpg

 

Now the Query editor for the Query transform "Query_fcn_call" looks as follows.

 

Image15.jpg

 

8) Add another Query transform to the dataflow for mapping & name it (here "Query_Mapping").

 

Image16.jpg

 

9) Add a template table also.

 

Image17.jpg

 

10) Mapping.

  • Double click on query "Query_Mapping" & do the necessary mappings.

 

Image18.jpg

 

11) Save the Job, validate & execute.

 

12) During execution each employee ID is taken as input to the RFC & the output of the RFC is stored in the table.

        The output can be seen as follows after execution.

        Here the employee IDs are taken from the File Format & given to the RFC as input.

        The output of the RFC is given as input to the query "Query_Mapping", where it is mapped to the target table fields.
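Conceptually, the function call behaves like a per-row lookup of employee details. A rough SQL analogy, assuming hypothetical stagings of the flat file (EMP_IDS) and of the RFC output (EMPLOYEE_DETAILS); these names are illustrative only:

-- Illustration only: BODS calls ZBAPI_GET_EMPLOYEE_DETAILS once per input row;
-- the final target content is logically equivalent to this join.
SELECT d.*
FROM   EMP_IDS f
JOIN   EMPLOYEE_DETAILS d
       ON d.EMPLOYEE_ID = f.ID;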

 

Image19.jpg

 

Thanks,

Rahul S. More

(Technical Lead)

IGATE Global Solutions Pvt Ltd.

 

igateLogo.jpg


Magic Quadrant for Data Integration Tools: Déjà vu All Over Again


If you follow Data Integration, then you probably already know that Gartner has released its 2014 Magic Quadrant for Data Integration Tools. It’s a fairly long document and like most people, you won’t have the time to thoroughly read it. So here’s a bit of advice: If you still have last year’s report, then review it again . . . Not that much has changed. Below is a brief review that summarizes the few changes in the Magic Quadrant for Data Integration Tools between 2013 and 2014.

 

If you look at the Magic Quadrant, you can see that Informatica is now positioned at the upper-right most spot in the Magic Quadrant (“Leaders”), by overtaking IBM in “ability to execute”, making them the outright leader in both execution and vision in the data integration space. Informatica is a company striving to create broader solutions and has launched or will be launching more comprehensive offerings with a breadth of functionality and usage in the near future...

 

Click here for the entire article.

An approach to build Target file in Excel format in SAP BODS using XSL


In this blog I would like to explain an approach to build the target file in the desired Excel format using an XSL style sheet. As we are aware, SAP BusinessObjects Data Services accesses Excel workbooks as sources only (not as targets). To overcome this limitation, we can adopt this approach to display our output in the desired Excel format with the help of XSL.


Details on the approach

In this approach we will build an XML file using BODS and display the XML content in the desired tabular format with the help of XSL.

So first we have to create a batch job that creates an XML file containing the required data. Special care must be taken while designing the XML structure that holds the data to be displayed in tabular form. Consider the Excel structure in the below example.

XlStructure.JPG

 

In this we have two tabular structures: one to hold the header part and a second to hold the category part. So when we define the XML structure in BODS we need to create two schemas, one to hold the Header tabular information and one to hold the Category tabular information. These schemas will hold the records that need to be populated in the target. For our sample scenario the XML structure will be as follows.

XmlStructure.JPG.

 

Next we have to build the XSL that describes how to display an XML document. An XSL style sheet is, like a CSS file, a file that describes how to display an XML document of a given type. XML does not use predefined tags (we can use any tag names we like), and therefore the meaning of each tag is not self-evident. So without an XSL sheet a browser does not know how to display an XML document.

XSL consists of three parts:

XSLT - a language for transforming XML documents

XPath - a language for navigating in XML documents

XSL-FO - a language for formatting XML documents

 

The root element that declares the document to be an XSL style sheet is <xsl:stylesheet>

xsl_header.JPG

An XSL style sheet consists of one or more sets of rules called templates. A template contains rules to apply when a specified node is matched.

The <xsl:template> element is used to build templates. The match attribute is used to associate a template with an XML element (match="/" defines the whole document, i.e. the match="/" attribute associates the template with the root of the XML source document).

Xsl_Template.JPG

The XSL <xsl:for-each> element can be used to select every XML element of a specified node-set, so we can specify how to display the values coming in that node-set. Considering our sample scenario, we can select every element in the Header & Category schemas to define how the values coming inside those node-sets are displayed. The <xsl:value-of> element can be used to extract the value of an XML element and add it to the output stream of the transformation.

xsl_for each.JPG

After building the XSL file we need to place it in the target folder where BODS will build the target file. We also need to alter the XML header in the target XML structure inside the job. The default header defined in the XML header is <?xml version="1.0" encoding = "UTF-8" ?>; we need to change it to <?xml version="1.0" encoding = "UTF-8" ?><?xml-stylesheet type="text/xsl" href="<xsl_fileName>"?>

XML_Target.JPG

And in our target XML, the header will look like this:

Xml_File_Header.JPG

 

The target XML generated after the execution of the job can be opened with Excel, where you will be prompted with an option to open the XML after applying a stylesheet. There we need to select our stylesheet to get the output in the desired Excel format.

ExcelOpenOption.JPG

 

And our output in Excel will be displayed as given below

Final_target.JPG

 

Note: Both the XSL file and the xml target file should be available in the same folder for getting the desired output.

Attaching the sample xsl and xml file for reference.

SAP Data Services 4.2 SP3 – What’s New On a Page


The new quarterly release cycle for SAP Data Services speeds up the delivery of new innovations without the requirement of a ramp-up period. SP3, the 2nd service pack of 2014, is now available with new innovations based on four themes: BIG DATA, SIMPLE, ENTERPRISE & GOVERNANCE.

 

BIG DATA


SIMPLE


ENTERPRISE


 

GOVERNANCE


Please see the What’s new documentation at http://help.sap.com/bods42 for more detail.

Performant extraction of most recent data from a history table


You have a table that contains multiple time stamped records for a given primary key:

 

Key  Att  Timestamp
03   747  2012.11.11 04:17:30
01   ABC  2014.09.30 17:45:54
02   UVW  2014.04.16 17:45:23
01   DEF  2014.08.17 16:16:27
02   XYZ  2014.08.25 18:15:45
01   JKL  2012.04.30 04:00:00
03   777  2014.07.15 12:45:12
01   GHI  2013.06.08 23:11:26
03   737  2010.12.06 06:43:52

 

Output required is the most recent record for every key value:

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
02   XYZ  2014.08.25 18:15:45
03   777  2014.07.15 12:45:12

 

 

 

Solution #1: Use the gen_row_num_by_group function

 

Build a dataflow as such:

  df1.png

In the first query transform, sort the input stream according to Key and Timestamp desc(ending). The sort will be pushed to the underlying database, which is often good for performance.

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
01   DEF  2014.08.17 16:16:27
01   GHI  2013.06.08 23:11:26
01   JKL  2012.04.30 04:00:00
02   XYZ  2014.08.25 18:15:45
02   UVW  2014.04.16 17:45:23
03   777  2014.07.15 12:45:12
03   747  2012.11.11 04:17:30
03   737  2010.12.06 06:43:52

 

In the second query transform, add a column Seqno and map it to gen_row_num_by_group(Key).

 

Key  Att  Timestamp            Seqno
01   ABC  2014.09.30 17:45:54  1
01   DEF  2014.08.17 16:16:27  2
01   GHI  2013.06.08 23:11:26  3
01   JKL  2012.04.30 04:00:00  4
02   XYZ  2014.08.25 18:15:45  1
02   UVW  2014.04.16 17:45:23  2
03   777  2014.07.15 12:45:12  1
03   747  2012.11.11 04:17:30  2
03   737  2010.12.06 06:43:52  3

 

In the third query transform, add a where-clause Seqno = 1 (and don’t map the Seqno column).

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
02   XYZ  2014.08.25 18:15:45
03   777  2014.07.15 12:45:12
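For readers who think in SQL, a minimal sketch of what the three query transforms compute together, assuming the source table is called HISTORY (an illustrative name, not generated DS code):

-- Illustrative SQL equivalent of solution #1: sort, number the rows per Key, keep the first row.
SELECT Key, Att, Timestamp
FROM (
    SELECT Key, Att, Timestamp,
           ROW_NUMBER() OVER (PARTITION BY Key ORDER BY Timestamp DESC) AS Seqno
    FROM HISTORY
) numbered
WHERE Seqno = 1;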

 

 

 

Solution #2: use a join

 

Suppose we’re talking Big Data here, there are millions of records in the source table. On HANA. Obviously. Although the sort is pushed down to the database, the built-in function is not. Therefore every single record has to be pulled into DS memory; and then eventually written back to the database.

 

Now consider this approach:

df2.png

 

The first query transform selects two columns from the source table only: Key and Timestamp. Define a group by on Key and set the mapping for Timestamp to max(Timestamp).

 

Key  Timestamp
01   2014.09.30 17:45:54
02   2014.08.25 18:15:45
03   2014.07.15 12:45:12

 

In the second transform, (inner) join on Key and Timestamp and map all columns from the source table to the output.

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
02   XYZ  2014.08.25 18:15:45
03   777  2014.07.15 12:45:12

 

If you uncheck bulk loading of the target table, you’ll notice that the full sql (read and write) will be pushed to the underlying database. And your job will run so much faster!
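A minimal sketch of the kind of statement that gets pushed down in this second approach (illustrative table names; the SQL that DS actually generates will differ):

-- Keep only the rows whose timestamp equals the per-key maximum.
INSERT INTO TARGET (Key, Att, Timestamp)
SELECT s.Key, s.Att, s.Timestamp
FROM   HISTORY s
JOIN  (SELECT Key, MAX(Timestamp) AS Timestamp
       FROM   HISTORY
       GROUP  BY Key) m
  ON   m.Key = s.Key
 AND   m.Timestamp = s.Timestamp;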

 

Note: This second approach produces correct results only if there are no duplicate most recent timestamps within a given primary key.

SCD Type 1 Full Load With Error Handle - For Beginners


This example may help us understand the usage of SCD Type 1 and how to handle error messages.


Brief about Slowly Changing Dimensions: Slowly Changing Dimensions are dimensions that have data that changes over time.

There are three methods available for handling Slowly Changing Dimensions; here we are concentrating only on SCD Type 1.


Type 1-  No history preservation - Natural consequence of normalization.

 

For a SCD Type 1 change, you find and update the appropriate attributes on a specific dimensional record. For example, to update a record in the SALES_PERSON_DIMENSION table to show a change to an individual's SALES_PERSON_NAME field, you simply update one record in the SALES_PERSON_DIMENSION table. This action would update or correct that record for all fact records across time. In a dimensional model, facts have no meaning until you link them with their dimensions. If you change a dimensional attribute without appropriately accounting for the time dimension, the change becomes global across all fact records.
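As a minimal SQL sketch of what a Type 1 change amounts to (table and column names follow the example tables below; this is an illustration, not the code the job generates):

-- SCD Type 1: overwrite the attribute in place, no history is kept.
UPDATE SALES_PERSON_DIMENSION
SET    NAME = 'Smith, John B'
WHERE  SALES_PERSON_ID = '00120';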

 

This is the data before the change:

 

SALES_PERSON_KEY   SALES_PERSON_ID   NAME          SALES_TEAM
15                 00120             Doe, John B   Atlanta

 

This is the same table after the salesperson’s name has been changed:


SALES_PERSON_KEY   SALES_PERSON_ID   NAME            SALES_TEAM
15                 00120             Smith, John B   Atlanta


However, suppose a salesperson transfers to a new sales team. Updating the salesperson's dimensional record would update all previous facts so that the salesperson would appear to have always belonged to the new sales team. This may cause issues in terms of reporting sales numbers for both teams. If you want to preserve an accurate history of who was on which sales team, Type 1 is not appropriate.


Below is the step-by-step batch job creation for SCD Type 1 with error handling.


Create new job

 

 

Add "Try" and "Script" controls from the palette and drag them to the work area

 

Create a Global variable for SYSDATE

 

6.png

 

Add below script in the script section.

 

# SET TODAYS DATE

$SYSDATE = cast( sysdate( ), 'date');

print( 'Today\'s date:' || cast( $SYSDATE, 'varchar(10)' ) ); 

 

 

Add DataFlow.

 

Now double click on DF and add Source Table.

 

Add Query Transformation

 

Add LOAD_DATE new column in Query_Extract

Map the created global variable $SYSDATE. If we used sysdate() directly, the function would be called for every row, which may hurt performance.

 

14.png

 

Add another query transform for lookup table

 

  Create new Function Call for Lookup table.

 

 

20.png

 

21.png

 

Required column added successfully via Lookup Table.

 

Add another Query transform. This query will decide whether a source record will be inserted or updated.

 

Now remove the primary key from the target fields.

 

 

 

Create new column to set FLAG to update or Insert.

 

Now write an ifthenelse function: if LKP_PROD_KEY is null, set FLAG to 'INS', otherwise to 'UPD'.

 

ifthenelse(Query_LOOKUP_PRODUCT_TIM.LKP_PROD_KEY is null, 'INS', 'UPD')

 

27.png
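Logically, the lookup plus this flag implement the classic SCD Type 1 split. A minimal SQL sketch of the idea, assuming a source table PRODUCT_SRC and a target dimension PRODUCT_DIM (illustrative names, not the code DS generates):

-- Rows with no match in the dimension are new (INS); matched rows are updates (UPD).
SELECT s.*,
       CASE WHEN d.PROD_KEY IS NULL THEN 'INS' ELSE 'UPD' END AS FLAG
FROM   PRODUCT_SRC s
LEFT JOIN PRODUCT_DIM d
       ON d.SOURCE_PROD_ID = s.PROD_ID;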

 

Now Create case Transform.

 

30.png

 

 

Create two rules on the FLAG field to route the “INS” and “UPD” records

Create Insert and Update Query to align the fields

Change LKP_PROD_KEY to PROD_KEY and PROD_ID to SOURCE_PROD_ID for better understanding in the target table.

Now create  Key Generation transform to generate Surrogate key

Select Target Dimension table with Surrogate key (PROD_KEY)

Set Target instance

 

 

43.png

Add a Key_Generation transform for the Query_Insert branch to generate surrogate key values for the new records.

 

And for Query_Update we need the surrogate key and the other attributes. Use the Map Operation transform to update records.

 

By default the Map Operation outputs rows with opcode Normal as Normal. Change Normal to Update, because we want these records to update the target in normal mode.

 

53.png

 

Update Surrogate key, Product key and other attributes.

 

Go back to insert target table -->  Options --> Update Error Handling as below:

55.png

 

Go back to Job screen and create catch block

 

57.png

 

Select the required exceptions you want to catch, and create a script to display the error messages.

 

60.png

 

Compose your message to print errors in the script_ErrorLogs as below.

 

print( 'Error Handling');

print( error_message() || ' at ' || cast( error_timestamp(), 'varchar(24)'));

raise_exception( 'Job Failed');

 

Now validate the script before proceeding further.

 

Now these messages will catch errors along with the job completion status.

Now create a script to print an error message if there are any database rejections:

 

65.png

 

# print ( ' DB Error Handling');

if( get_file_attribute( '[$$LOG_DIR]/VENKYBODS_TRG_dbo_Product_dim.txt', 'SIZE') > 0 )

raise_exception( 'Job Failed Check Rejection File');

 

note: VENKYBODS_TRG_dbo_Product_dim.txt is the file name which we mentioned in the target table error handling section.

 

Before executing, here is the source and target table data for Last_Updated_Date.

 

73.png

 

Now Execute the job and we can see the Last_Updated_Dates.

72.png

 

Now try to generate an error to see that the error log captures it through our error handling.

 

Try to implement the same and let me know if you need any further explanation on this.

 

Thanks

Venky

DataServices Text Analysis and Hadoop - the Details


I have already used the text analysis feature within SAP DataServices in various projects (the transform in DataServices is called Text Data Processing or TDP in short). Usually, the TDP transform runs in the DataServices engine, means that DataServices first loads the source text in its own memory  and then runs the text analysis on its own server / engines. The text sources are usually unstructured text or binary files such as Word, Excel, PDF files etc. If these files reside on a Hadoop cluster as HDFS files, DataServices can also push down the TDP transform as a MapReduce job to the Hadoop cluster.

Why run DataServices Text Analysis within Hadoop?


Running the text analysis within Hadoop (means as MapReduce jobs) can be an appealing approach if the total volume of source files is big and at the same time the Hadoop cluster has enough resources. Then the text analysis might run much quicker inside Hadoop than within DataServices. The text analysis extracts entities from unstructured text and as such it will transform unstructured data into structured data. This is a fundamental pre-requisite in order to be able to run any kind of analysis with the data in the text sources. You only need the text sources as input for the text analysis. Afterwards you could theoretically delete the text sources, because they are no longer needed for the analysis process. But in reality you still want to keep the text sources for various reasons:


    1. Quality assurance of the text analysis process: doing manual spot checks by reviewing the text source. Is the generated sentiment entity correct? Is the generated organisation entity Apple correct or does the text rather refer to the fruit apple? ...

    2. As an outcome of the QA you might want to improve the text analysis process and therefore rerun the text analysis with documents that already had been analyzed previously.

    3. You might want to run other kinds of text analysis with the same documents. Let's say you did a sentiment analysis on text documents until now. But in future you might have a different use case and you want to analyze customer requests with the same documents.

     

    Anyway, in all these scenarios the text documents are only used as a source. Any BI solution will rely on the structured results of the text analysis, but it will not need to look up any more information from these documents. Keeping large volumes of these documents in a Hadoop cluster can therefore be a cost-effective solution. If at the same time the text analysis process can run in parallel on multiple nodes of that cluster, the overall performance might be improved, again on comparatively cheap hardware.


    I was therefore curious how the push-down of the DataServices TDP transform to Hadoop works technically. Unfortunately my test cluster is too small to test the performance. Instead my test results focus on technical functionality only. Below are my experiences and test results:

    My Infrastructure

     

    For my tests I am using a Hortonworks Data Platform HDP 2.1 installation on just one cluster. All HDP services are running on this node. The cluster is installed on a virtual machine with CentOS 6.5, 8 GB memory and 4 cores.


    DataServices 4.2.3 is installed on another virtual machine based on CentOS 6.5, 8 GB memory and 4 cores.

     

     

    My Test Cases

     

    1. Text analysis on 1000 Twitter tweets in one CSV file:
      The tweets are a small sample from a bigger file extracted from Twitter on September 9 using a DataSift stream. The tweets in the stream were filtered, they all mention Apple products like iPhone, iPad,AppleWatch and so on.

      Dataflow TA_HDP_APPLE_TWWET

    2. 3126 binary MSGfiles. These files contain emails that I had sent out in 2013:

      Dataflow TA_HDP_EMAILS

    3. 3120 raw text files. These files contain emails that I had sent out in 2014

      Dataflow DF_TA_HDP_EMIALS_TEXT

     

    I have been running each of these test cases in two variants:

      • Running the TDP transform within the DS engine. In order to achieve this, the source files were saved on a local filesystem
      • Running the TDP transform within Hadoop. In order to achieve this, the source files were saved on the Hadoop cluster.

    The TDP transform of both variants of the same test case is basically the same. This means they are copies of each other with exactly the same configuration options. For all the test cases I have configured standard entity extraction in addition to a customer provided dictionary which defines the names of Apple products. Furthermore, I specified sentiment and request analysis for English and German languages:


    TDP configuration options

     

    Whether a TDP transform runs in the DS engine or is pushed down to Hadoop depends solely on the type of source documents: on a local file system or in HDFS. Finally, I wanted to compare the results between both variants. Ideally, there should be no differences.

     

    DataServices vs. Hadoop: comparing the results

    Surprisingly, there are significant differences between the two variants. When the TDP transform runs in the DS engine the transform produces much more output than when running the same transform inside Hadoop as a MapReduce job. More precise, for some source files the MapReduce job did not generate any output at all, while the DS engine generates output for the same documents:

     

    Comparing number of documents and entities between the two variants

     

    For  example, the DS engine produced 4647 entities from 991 documents. This makes sense, because 9 tweets do not contain any meaningful content at all, so that the text analysis does not generate any entities for these documents. But the MapReduce job generated 2288 entities from only 511 documents. This does no longer make sense. The question is what happened with the other 502 documents?

     

    The picture is similar for the other two test cases analyzing email files. So, the problem is not related to the format of the source files. I tried to narrow down this problem within Hadoop – please see the section Problem Tracking –> Missing entities at the end of this blog for more details.

    On the other hand, for those documents when both, the DS engine and the MapReduce job generated entities, the results between the two variants are nearly identical. For example, these are the results of both variants for one example tweet.

     

    TDP output for one tweet – TDP running in DS

     

    TDP output for one tweet – TDP running in Hadoop

     

    The minor differences are highlighted in the Hadoop version. They can actually be ignored, because they won’t have any impact on a BI solution processing this data. Please note that the entity type APPLE_PRODUCT has been derived from the custom dictionary that I provided, whereas the entity type PRODUCT has been derived from the system provided dictionary.

     

    We can also see that the distribution of entity types is nearly identical between the two variants, just the number of entities is different. The chart below shows the distribution of entity types across all three test cases:

     

    Distribution of entity types

    Preparing Hadoop for TDP

     

    It is necessary to read the relevant section in the SAP DataServices reference manual. Section 12.1.2 Configuring Hadoop for text data processing describes all the necessary prerequisites. The most important topics here are:

     

    Transferring TDP libraries to the Hadoop cluster

    The $LINK_DIR/hadoop/bin/hadoop_env.sh -c script will simply transfer some required libraries and the complete $LINK_DIR/TextAnalysis/languages directory from the DS server to the Hadoop cluster. The target location in the Hadoop cluster is set in the environment variable $HDFS_LIB_DIR (this variable is set in the same script when it is called with the -e option). You need to provide this directory in the Hadoop cluster with appropriate permissions. I recommend that the directory in the Hadoop cluster is owned by the same login under which the Data Services engines are running (in my environment this is the user ds). After $LINK_DIR/hadoop/bin/hadoop_env.sh -c has been executed successfully the directory in HDFS will look something like this:

     

    $HDFS_LIB_DIR in HDFS

     

    Note that the $LINK_DIR/TextAnalysis/languages directory will be compressed in HDFS.

     

    Important: If you are using the Hortonworks Data Platform (HDP) distribution version 2.x you have to use another script instead: hadoop_HDP2_env.sh. hadoop_env.sh does not transfer all the required libraries for HDP 2.x installations.

     

    Optimizing text data processing for use in the Hadoop framework

    The description in the SAP DataServices reference manual, section 12.1.2 Configuring Hadoop only works for older Hadoop distributions as the referred Hadoop parameters are deprecated. For HDP 2.x I had to use different parameters. See the section Memory Configuration in YARN and Hadoop in this blog below for more details.


    HDFS source file formats

    Reading the SAP DataServices reference manual word-for-word, it says that only unstructured text files work for the push-down to Hadoop and furthermore that the source file must be connected (directly?) to the TDP transform:

     

    Extract from DataServices Reference Guide: sources for TDP

     

    My tests – fortunately – showed that the TDP MapReduce job can handle more formats:

    • HDFS Unstructured Text

     

    • HDFS Unstructured Binary
      I just tested MSG and PDF files, but I assume that the same binary formats as documented in the SAP DataServices 4.2 Designer Guide (see section 6.10, Unstructured file formats) are supported. The TDP transform is able to generate some metadata for unstructured binary files (the TDP option Document Properties): this feature does not work when the TDP transform is pushed down to Hadoop. The job still analyzes the text in the binary file, but the MapReduce job will temporarily stage the entity records for the document metadata, and when this staging file is read and the records are passed back to the DataServices file format, warnings will be thrown. Thus you will still receive all generated text entities, but not those for the document metadata.

     

    • HDFS CSV files:
      I was also able to place a query transform between the HDFS source format and the TDP transform and specify some filters, mappings and so in the query transform. All these settings still got pushed down to Hadoop via a Pig script. This is a useful feature, the push-down seems to work very similar to the push-down to relational databases: as long as DataServices knows about corresponding functions in the source system (here HDFS and Pig Latin language) it pushes as much as possible to the underlying system instead of processing the functions in its own engine.

     

    TDP pushed down to Hadoop

     

    It is very useful to understand roughly how the TDP push-down to Hadoop works, because it will ease performance and problem tracking. Once all prerequisites for the push-down are met (see previous section Preparing Hadoop for TDP –> HDFS source file formats) DataServices will generate a Pig script which in turn starts the TDP MapReduce job on the Hadoop cluster. The generated Pig script will be logged in the DataServices trace log file:

     

    DS trace log with pig script

     

    The Pig script runs on the DataServices server. But the commands in the Pig script will be executed against the Hadoop cluster. This means – obviously – that the Hadoop client libraries and tools must be installed on the DataServices server. The Hadoop client does not necessarily need to be part of the Hadoop cluster, but it must have access to the Hadoop cluster. More details on the various installation options are provided in the SAP DataServices reference manual or on SCN – Configuring Data Services and Hadoop.

     

    In my tests I have noticed two different sorts of Pig scripts that can be generated, depending on the source file formats.

     

    1. CSV files
      The Pig script for analyzing text fields in CSV files usually looks similar to this. Obviously, DataServices provides its own Pig functions for loading and storing data. They are also used for staging temporary results within the Pig script. During problem tracking it might be useful to check these files. All required files and information are located in one local directory on the DataServices server:

      Pig script for CSV files

      Given all these information, you might also run performance tests or problem analysis by executing or modifying the Pig script directly on the DataServices server.

      Pig directory

    2. Unstructured binary or text files
      The Pig script for unstructured binary or text files looks a bit different. This is obviously because the MapReduce jar file provided by DataServices can read these files directly; in contrast, in the case of CSV files, some additional pre-processing needs to happen in order to extract the text fields from the CSV file. In the case of unstructured binary or text files the Pig script calls another Pig script which in turn runs a Unix shell command (I don't know what the purpose of these two wrapper Pig scripts is, though). The Unix shell command simply runs the MapReduce job within the Hadoop framework:

      Pig script for unstructured files

    What if DS does not push down the TDP transform to Hadoop?

    Unstructured binary or text HDFS files:

    If unstructured binary or text HDFS files are directly connected to the TDP transform, the transform should get pushed down to Hadoop. In most cases it does not make sense to place other transforms between the HDFS source file format and the TDP transform. If you do have one or more query transforms after the HDFS source file format, you should only use functions in the query transforms that DataServices is able to push down to Pig. As long as you are using common standard functions such as substring() and so on, the chances are high that DataServices will push them down to Pig. In most cases you will have to find out on your own which functions are candidates for a push-down, because they are not documented in the SAP manuals.

     

    CSV files:

    Basically the same rules apply as for unstructured binary or text files. In addition there are potential pitfalls with the settings in the CSV file format:

    Keep in mind that DataServices provides its own functions for reading HDFS CSV files (see previous section TDP pushed down to Hadoop). We don't know the implemented features of the DSLoad Pig function and they are not documented in the SAP manuals. It may be that some of the settings in the CSV file format are not supported by the DSLoad function. In this case DataServices cannot push down the TDP transform to Hadoop, because it first needs to read the complete CSV file from HDFS into its own engine. It will then process the specific CSV settings locally and also process the TDP transform locally in its own engine.

     

    A typical example for this behaviour are the row skipping options in the CSV file format. They are apparently not supported by the DSLoad function. If you set these options, DataServices will not push down the TDP transform to Hadoop:

     

     

    HDFS file format options for CSV

     

    Memory Configuration in YARN and Hadoop

     

    I managed to run a MapReduce job using standard entity extraction and English sentiment analysis with the default memory settings for YARN and MapReduce (meaning the settings initially provided when I set up the cluster). But when using the German voice-of-customer libraries (for sentiment or request analysis) I had to increase some memory settings. According to SAP the German language is more complex, so these libraries require much more memory. Please note that the memory required for the MapReduce jobs depends on these libraries and not on the size of the source text files that will be analyzed.

     

    If you do not configure enough memory the TDP MapReduce job will fail. Such kind of problems cannot be tracked using the DataServices errorlogs. See the section Problem Tracking in this blog below.

     

    I followed the descriptions in the Hortonworks manuals about YARN and MapReduce memory configurations. This way I had overridden the following default settings in my cluster (just as an example!):

     

    Overridden YARN configurations

     

    Overridden MR2 configurations

     

    Using these configurations I managed to run the TDP transform with German VOC libraries as MapReduce jobs.

     

    Problem Tracking

     

    Problems while running a TDP transform as a MapReduce job are usually more difficult to track. The errorlog of the DataServices job will simply contain the errorlog provided by the generated Pig script. It might already point you to the source of the problem. On the other hand, other kind of problems might not be listed here. Instead you need to review the relevant Hadoop or MapReduce log files within Hadoop.

     

    For example, in my case I identified a memory issue within MapReduce like this:

     

    DataServices errorlog

    If a TDP MapReduce jobs fails due to insufficient memory the DataServices errorlog might look similar to this:

     

    DS errorlog with failed TDP MapReduce job

     

    In this DS errorlog excerpt you just see that one MapReduce job from the generated Pig script failed, but you don't see any more details about the failure. Also, the other referenced log files pig_*.err and hdfsRead.err do not provide any more details. In such cases you need to review the errorlog of the MapReduce job in Hadoop to get more details.

     

    MapReduce errorlogs

    In general it is helpful to roughly understand how logging works inside Hadoop. In this specific case of YARN and/or MapReduce I found the chapter Log files in Hortonworks’ YARN and MR2 documentation useful (especially section MapReduce V2 Container Log Files).

     

    It is also good to know about YARN log aggregation, see  HDP configuration files for more information. Depending on the setting of the yarn.log-aggregation-enable parameter the most recent log files are either still on the local file system of the NameNode or they are compressed within the HDFS cluster.

     

    In any case, you can start the error log review with tools such as Hue or the JobHistory server provided with Hadoop. In my example I used the Job browser view within Hue. Clicking on the TDP MapReduce job lists various failed attempts of the job to obtain a so-called memory container from YARN:

     

    Failed task attempts

     

    The log file of one of these attempts will show the reason for failure, for instance low memory:

    Error log for low memory

     

    In this case the MapReduce job couldn’t start at all and an error will be returned to the DataServices engine which started the Pig script. So in this case the DataServices job will also get a failed status.

     

    Other type of errors to watch out

    Once a task attempt succeeds it does not necessarily mean that the task itself will succeed. It means that a task has successfully started. Any errors during task execution will then be logged in the task log file. In case of TDP, a task on one node will sequentially analyze all the text source files. If the task fails when analyzing one of the source documents the error will be logged but the task will go ahead with the next source document. This is a typical fault tolerant behaviour of MapReduce jobs in Hadoop.

     

    Important: although errors with individual documents or records get logged in the log file of the MapReduce TDP task, the overall result of the MapReduce job will be SUCCEEDED and no errors or even warnings get returned to DataServices. The DataServices job will get a successful status in such situations.

     

    It is therefore important to monitor and analyze the MapReduce log files in addition to the DataServices error logs!


    During my tests I found some other typical errors in the MapReduce log files that hadn’t been visible in the DataServices errorlog. For example:

     

    • Missing language libraries:

      MapReduce error log: missing language libraries

      I configured automatic language detection in the TDP transform, but forgot to install all the available TDP languages during DS installation. After installing all the languages and re-executing $LINK_DIR/hadoop/bin/hadoop_env.sh -c these errors disappeared.
      (BTW: when running the same job with the TDP transform in the DS engine instead of having it pushed down to Hadoop, DataServices did not print any warnings in the errorlog, so I probably would have never caught this issue)

    • Non-ASCII characters in text sources:
      In the case of the Twitter test case some tweets may contain special characters such as the © sign. The TDP transform (or MapReduce job) cannot interpret such characters as raw text and then aborts the analysis of that document.
      If the TDP runs within the DS engine these errors will be printed as warnings on the DS errorlog.
      If the TDP runs as MapReduce job there will be no warnings in the DS errorlog, but the errorlog of the MapReduce will contain these errors.

    • Missing entities:
      As described above I found that the TDP transform running as a MapReduce job generates much less entities than when running the same TDP transform (with the same input files) within the DS engine. To narrow down the problem I checked the stdout log file of the MapReduce job. For each text source file (or for each record in a CSV file) it prints a message like this if the analysis succeeds:

      Map Reduce stdout: analyzing text documents

      In order to understand how many text sources had been successfully analyzed I grepped the stdout file:

      MapReduce stdout: check number of successfully analyzed documents.

      Well, in this particular case it actually didn't help me to solve the problem: there had been 3126 text files as source for the TDP transform. According to stdout all of them had been successfully analyzed by the TDP job. Nevertheless, text entities had been generated for only xyz documents. This would actually mean that the remaining documents are empty or have no meaningful content, so that the text analysis does not generate text entities for them. But because the same TDP transform, when running in the DS engine, generates entities for many more documents, I believe that this is a bug somewhere in the overall process. Unfortunately, I did not manage to find the root cause of this problem so far. I appreciate it if anybody who encountered the same problems can provide some feedback!

    Data Pre-Validation tool for SAP Conversions


    Introduction: Today organizations need effective SAP implementations in order to create value for their customers while at the same time saving cost of services for themselves. For resolving key business issues or making strategic business decisions, decision makers look at data. Hence effective data conversion is gaining importance.

    Data migration basically means moving data from one system to another. Data Migration could be driven by several initiatives taken up by the customer like application changes (moving from Oracle to SAP) or upgrade (moving to newer SAP Releases).

     

    Conversion does not simply mean moving data from one system to another, rather it means moving meaningful data.

     

    Just to emphasize this fact, let's consider a simple example: I have a data conversion requirement for customer master, and during the data load, due to unknown reasons, a digit goes missing from a customer's contact number. Just imagine how much impact this small miss is going to have on customer service.

     

    Data Conversion Challenges: In general, data migration is considered to be a simple task, which understates the real risks involved.

    • Another major challenge is knowing the data before it's too late. We may lose both time and resources, eventually resulting in loss of money.
    • The source of data governs the course of migration, but it may itself change, as there could be other initiatives within the organization driving changes to the source system.
    • The surge in data volumes and the need to migrate data from multiple source systems pose multiple migration challenges.

     

     

    How can data Pre-validation tool help?

    This tool is a step closer towards smooth data migration by giving the data migration team the ability to perform checks on the data before we actually start to load it. We see pre-validation as a step in between the sequential steps defined by the industry, Transform and Load, to ensure quality of data and also save time on the migration activity.

    Capture.JPG

    This Preload Validation Tool is generic and scalable; it can be used across SAP systems for diverse conversion requirements pertaining to data load activities across various functional modules.

     

    Value Proposition

    This tool gives you the flexibility to identify and resolve issues related to data even before it’s loaded to SAP.

    • The Power of this tool lies in its ability to incorporate complex business rules to validate the data.
    • It comes with ability to carry out DDIC checks harnessing the value of definition attributes.
    • It focuses on saving the time and cost addressing the key considerations for ensuring the data Quality.
    • Furthermore, it contrasts to the standard SAP Load program as it has ability to capture multiple errors with a single field value.
    • This tool Works for both standard and custom conversions requirements.
    • In case moving data from legacy system to SAP is a periodic task, Using this tool we can identify the root cause and request the legacy team to fix the issue from their side. Example: Issues like length mismatch between the two systems.

     

    Technical Design

    The Idea is to create a generic tool and for that it’s necessary to determine the input file structure at run time.

     

    We will need to create couple of custom tables along with their maintenance views;

    1. Header Table: To hold unique Conversion ID to help the program uniquely identify the file structure along with description and other unique attributes associated with the conversion.
    2. Item Table: To hold specifics about the file structure related to the conversion ID like table name associated with each field, Sequence in which the fields are going to appear in the file and also additional attributes like a flag to ignore the field value during validation run.
      These tables will serve as the backbone of the pre-load validation program (a minimal SQL sketch of both tables follows this list).
    3. Error Classification: We can create a customizing table and store the categories specified below. We have classified the errors into four categories:
      • Length Mismatch
      • Type Conflict
      • Input Data not defined in SAP
      • Input format issue
    4. Define Output Structure for ALV Display
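    As a rough illustration of this control-table design (the actual tables would be created as custom tables in the ABAP Dictionary; all names and field sizes below are assumptions):

    -- Hypothetical header table: one row per conversion object.
    CREATE TABLE ZCONV_HEADER (
        CONV_ID     VARCHAR(10) NOT NULL PRIMARY KEY,   -- unique conversion ID
        DESCRIPTION VARCHAR(60)                         -- description of the conversion
    );

    -- Hypothetical item table: one row per field of the input file structure.
    CREATE TABLE ZCONV_ITEM (
        CONV_ID     VARCHAR(10) NOT NULL,               -- links to ZCONV_HEADER
        FIELD_SEQ   INTEGER     NOT NULL,               -- position of the field in the file
        TABNAME     VARCHAR(30),                        -- DDIC table the field belongs to
        FIELDNAME   VARCHAR(30),                        -- DDIC field used for the checks
        IGNORE_FLAG CHAR(1),                            -- 'X' = skip this field during validation
        PRIMARY KEY (CONV_ID, FIELD_SEQ)
    );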

     

    The next step would be to create a report program with the conversion identification number and the input file path as mandatory selection-screen fields. After execution, the report will display an ALV output listing the errors for each record of the Excel file used as input. The output will be easy to understand, with only a few fields: it will just tell the user the Excel row that has the error, the field name, the field value and the error description.

     

    The report can also generate a graphical output displaying the errors associated with each field based on the predefined categories.

     

    The core validation logic of the report reads the DDIC attributes associated with each field and table name stored in the item table, and uses them to perform type checks, value table or check table checks, format checks for date or currency fields, and length checks.
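    For instance, the length attribute can be read from the ABAP Dictionary table DD03L. A minimal SQL sketch of a length check against a hypothetical staging table of the input file (ZCONV_INPUT and its columns are assumptions):

    -- Flag input values that exceed the DDIC-defined field length (length mismatch category).
    SELECT i.ROW_NO, i.FIELDNAME, i.FIELD_VALUE
    FROM   ZCONV_INPUT i
    JOIN   DD03L d
           ON  d.TABNAME   = i.TABNAME
           AND d.FIELDNAME = i.FIELDNAME
    WHERE  LENGTH(i.FIELD_VALUE) > d.LENG;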

     


    Tool Development

     

    Step 1: Create Header Table

    Capture1.JPG

    Step 2: Create Item Table

    Capture3.JPG

    Step 3: Define generic output structure for ALV display.

    Capture8.JPG

    Step 4: Define a customizing table for error categories defined (Non-mandatory Step)

    Capture9.JPG

    Capture10.JPG

     

    Step 5: Create a Report Program in transaction SE38 with some name like "Z_VALIDATE_DATA_READ_VALIDATE" with Selection screen as described in the above section.

    Capture4.JPG

    Step 6: Build the code similar to code snapshot in appendix section at the end.

     

    Step 7: The ALV Output/Graphical Output

    Capture5.JPG

    Capture6.jpg

    You can also choose the Chart Type, to club the count for errors classified in predefined categories.

    Capture7.jpg

     

    Appendix: Code Snapshot is attached


    File based CDC in Data Services 4.2


    Databases like SQL Server, Oracle etc. have a CDC feature to enable tracking changed/inserted/deleted records. However, there are situations where we might need to implement the same using a flat file source.

    For the purpose of this blog, I am going to consider an example of product master data that comes as a flat file and is loaded through Data Services.

     

    Outline:

    1) Load the initial data normally as you would do and dump it in a physical table.

    2) Create separate data flow to handle the delta.

    3) Use a Table Comparison transform to identify the inserted/updated/deleted records by comparing against the target table.

    4) Use Map Operation transforms to filter the records based on the operation-type flag coming out of the Table Comparison transform (a logical SQL sketch of this comparison follows).
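    As a rough SQL illustration of what the Table Comparison transform decides per source row (illustrative table names, not the code DS generates; detecting deletes additionally requires comparing in the opposite direction):

    -- New rows (no match in the target) become inserts; matched rows with changed attributes become updates.
    SELECT s.PRODUCT_ID, s.PRODUCT_NAME,
           CASE WHEN t.PRODUCT_ID IS NULL THEN 'I' ELSE 'U' END AS OPCODE
    FROM   PRODUCT_FILE_STG s
    LEFT JOIN PRODUCT_TARGET t
           ON t.PRODUCT_ID = s.PRODUCT_ID
    WHERE  t.PRODUCT_ID IS NULL
       OR  t.PRODUCT_NAME <> s.PRODUCT_NAME;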

     

    Below is the sample flow:

    C1.png

     

    Configuration for Table Comparison transform

    c2.png

    Configuration for Map Operation - Filter out only records that needs to be inserted

    c3.png

    Similarly, filter for records that needs to be updated

     

    c4.png

    Mapping fields in Map Operation transform

    c5.png

    Attached atl file and sample data used for loading.

    Below video demonstrates working example of flat file based CDC using table comparison and map operation transforms.

    Capture Killed job status in BODS


    Error handling and recovery mechanisms are very important aspects of any ETL tool. BO Data Services has built-in error handling and automatic recovery mechanisms in place. Also, by using different dataflow designs, we can manually recover a job from a failed execution and ensure proper data in the target.


    In manual recovery, each dataflow's/workflow's execution status should be captured in a table (we call it the control table), which helps to execute only the failed dataflows/workflows in the next run.


    But if we have a scenario where the job is stuck and we have to kill it manually, the status of the killed job will not be automatically updated from 'Running' to 'Killed'/'Failed' in the control table: when a job is killed, it terminates right there, and the flow never reaches the catch block where we put the script or dataflow that captures the job status.


    In this scenario, we can put a script at the start of our job which first checks the previous execution status of the job in the control table. If it shows 'Running', we can update the previous instance's status in the control table to 'Failed'/'Completed' using the AL_HISTORY table (this metadata table captures the status of all jobs along with job name, job run ID, start and end date):


    $G_PREV_RUNID = sql('<DATASTORE_NAME>', 'select max(JOB_RUN_ID) from JOB_CONTROL where JOB_NAME = {$G_JOB_NAME} and JOB_STATUS = \'R\'');

     

    $G_ERR_STATUS = sql('DS_DBH', 'select STATUS from AL_HISTORY where SERVICE = {$G_JOB_NAME} and END_TIME = (select max(END_TIME) from JOB_CONTROL where JOB_NAME = {$G_JOB_NAME})');


    IF ($G_ERR_STATUS = 'E')

    sql('DS_DBH', 'UPDATE JOB_CONTROL SET JOB_STATUS = \'F\' WHERE JOB_RUN_ID = [$G_PREV_RUNID]');

     

    The AL_HISTORY table contains the following columns:

     

    upload.JPG

     

    NOTE : We need to have 'Select' access to the database on which BODS repository is created.

    History_Preserving Transform



    Use of History Preserving Transform

     

     

    Introduction:-

     

    The History Preserving transform is used to preserve the history of the source records. If a source row has an operation code of Insert or Update, it inserts a new record into the target table.

     

    Scenario:-


    We are doing a scenario where we want to insert the updated record into target table to preserve the history of the source records.

     

    1) Create project, job, workflow & dataflow as usual.

     

    2) Drag a source table to dataflow. Its contents are as follows.

     

    Image1.png

     

    3) Drag a target table to dataflow. Its contents are as follows.

     

    Image2.png

     

    4) Drag query, Table-Comparison, History_Preserving transform as shown in the figure.

     

    Image3.png

     

    5)  Open Query & do mappings as you do normally.

     

    Image9.png

     

    6) Open Table_Comparison block & enter all the properties.

     

    Image5.png

     

    • Table Name:- Select Target Table from the dropdown box.
    • Generated Key Column:- Specify key column
    • Select the "EMP_ID" node from the tree on the LHS & drag it into the "Input primary key columns" list box. The comparison with the target table will take place based on whether the source EMP_ID is present in the target or not, & the comparison will be made based on the columns given under the "Compare columns" list box.
    • Similarly select the columns that are to be compared while transferring the data & drag it to "Compare Columns" list box.
    • Select "Cached comparison table" radio button.

     

    7) Similarly provide details for the History_Preserving block.

     

    Image6.png

    • In Compare columns, select the columns as specified in the Table Comparison transform.
    • Specify the date columns as shown.
    • Here we are using 9000.12.31 as the valid-to date.
    • In the target table we have maintained a column "Flag"; based on the Update operation, the original value of this column for that particular record will be changed from Y to N, and new records will be inserted with the status 'Y'.

     

    8) Now after this, the first 3 rows of the source records are updated & the 4th row is deleted.

     

    Image7.png

         Fields where changes are made are circled with the red marks as seen in the above figure.

     

    9) Validate & Execute the job.

     

    10) 3 new records got added in the target table as shown below.


    Image8.png  

    You can see that a new entry for each updated record is made in the target table with the 'Y' flag & a new END_DATE of '9000.12.31', & the flag of the original records is changed to 'N'.
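    A minimal SQL sketch of what this History Preserving pattern amounts to for one changed employee (table name, extra column names and values here are illustrative, based on this example; it is not the code DS generates):

    -- Close the currently valid version of the record ...
    UPDATE EMP_TARGET
    SET    FLAG = 'N'
    WHERE  EMP_ID = 101
      AND  FLAG   = 'Y';

    -- ... and insert the new version with the open-ended validity date.
    INSERT INTO EMP_TARGET (EMP_ID, EMP_NAME, FLAG, END_DATE)
    VALUES (101, 'New Name', 'Y', '9000.12.31');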

     

    Summary:-


    So in this way the History Preserving transform is useful for preserving the history of the source records.

     

    Thanks & Regards,

     

    Rahul S. More

    (Technical Lead)

     

    IGATE Global Solutions Pvt. Ltd.

    logo.jpg

    UNIX/LINUX commands to know the version of Data Services and other S/W Components


    Here Unix/Linux commands are listed for the software components which are relevant for both the fresh Install and upgrade of Data Services

     

     

    • Operating System Version, this becomes crucial when you are planning to upgrade the Data Services.

     

    • Database Version, this becomes relevant when you are planning to upgrade the Data Services

     

    • IPS Version: this also becomes critical when you are planning to upgrade or want to know the information at patch level

     

    • Data Services Version, which can be checked at multiple places to get the correct list of versions for all the components like Local Repository Version, Job Server Version, Designer Version

     

     

    I will add more information to this blog; however, as of now I am putting down the below commands, which may be useful to you if you are new to the world of Linux/Data Services administration.

     

    Installation directories may be different for your environment; however, the idea here is to share the commands to get the version of the installed software which affects Data Services installation and upgrade.

     

     

    Operating System Version

     

    Command: - more /etc/*-release

     

    Database DB2 Version

     

    Command :- /usr/local/bin/db2ls

     

    Data Services Version

     

    Path for navigation :- cd $LINK_DIR/bin

     

    Command: - ./al_jobserver -v

     

    This will give you the Job Server version. To get the complete version list you may use the GUI-based Data Services development tool, Data Services Designer: Menu >> Help.

     

    There is a command to get the repository version using the repoman utility from the command line in Linux as well:

     

    repoman -U<username/Schemaname> -P<password> -S<databaseServerHostName> -s -N<Database_DB2_Oracle_Etc> -Q<DatabaseName> -p<Port_on_which_database_is_installed> -V<Database version> -tlocal -v

     

     

    IPS Version (Information Platform Services)

     

    Path for Navigation :- /usr/sap/<SID>/businessobjects/sap_bobj/enterprise_xi40/linux_x64

     

    Command: - strings boe_cmsd | grep BOBJVERSION

     

    More detailed version specific related information is discussed in below blog

     

    Data Services on Linux - Version check of Data Services Components

    Pre-requisites for connecting SAP BODS with ECC system


    For connecting SAP BODS with ECC system, we need to create a SAP Applications datastore in Data Services. For this we need to specify the data transfer method. This method defines how data that is extracted by the ABAP running on the SAP application server becomes available to the Data Services server.

     

    The options are:

     

    o    RFC: Use to stream data from the source SAP system directly to the Data Services data flow process using RFC.

    o    Direct download: The SAP server transfers the data directly to the Local directory using the SAP-provided function GUI_DOWNLOAD or WS_DOWNLOAD.

    o    Shared directory: Default method. The SAP server loads the transport file into the Working directory on SAP server. The file is read using the Application path to the shared directory from the Job Server computer.

    o    FTP: The SAP server loads the Working directory on SAP server with the transport file. Then the Job Server calls an FTP program and connects to the SAP server to download the file to the Local directory.

    o    Custom Transfer: SAP server loads the Working directory on SAP server with the transport file. The file is read by a third-party file transfer (custom transfer) program and loaded to the Custom transfer local directory.

    Prerequisites:

1.     Define a SAP Applications datastore, which includes the following information:

    o    Connection information including the application server name, the language used by the SAP client application, the client and system numbers

    o    Data transfer method used to exchange information between Data Services and the SAP application.

    o    Security information, specifically the SAP security profile to be used by all connections instigated from this datastore between Data Services and the SAP application.

2.     If the data transfer method is Direct Download, the following checks should be made:

o    Check whether direct download is the right method, as it actually calls the GUI_DOWNLOAD ABAP function, which is very unreliable with larger amounts of data.

    o    Transport of data takes about 40 times longer than with the other protocols.

    o    We cannot use 'execute in background' with this option

    o    Configuring it is simple; we just specify a directory on the jobserver in the field Client Download Directory.

o    But we need to ensure that this directory actually exists.

3.     If the data transfer method is Shared Directory, the following checks should be made:

    o    While the 'working directory on SAP server' is the point where the ABAP will write the file to, the 'Application path to the shared directory' is the path to access this same directory from the jobserver.

o    SAP must have write access to whatever we specify as the working directory.

o    The BODS user must have read permission for the files generated by the SAP account. Typically, this is done by placing the BODS user in the same group as the SAP user.

     

4.     If the data transfer method is FTP, the following checks should be made:

o    Ensure that from the command prompt we are able to log in using the hostname the FTP server is running on, the FTP username and the password (in the command prompt, call ftp 'hostname' and enter the username and password).

o    Next, check which 'cd' (change directory) command is needed to get to the working directory on the SAP server. Use this path as the 'FTP relative path' in the datastore properties.

    o    Next step would be to check permissions on the files. In general, SAP should create the files with read permission on its main group; the ftp user should be part of that SAP group so it can read the files.

o    Ensure that the directory the file is downloaded to is a directory on the Job Server computer.

5.     If the data transfer method is Custom Transfer, we need to ensure the following:

o    A batch file or script needs to be specified that does all the download (see the sketch after this list).

    6.     The execution mode should be generate_and_execute
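For illustration, a custom transfer program is typically just a small script that copies the generated transport file from the working directory on the SAP server to the custom transfer local directory on the Job Server. Below is a minimal sketch using sftp; the host, user and directory values are placeholders, and the exact arguments Data Services passes to the program depend on how the custom transfer options are configured in the datastore.

#!/bin/sh
# Hypothetical custom transfer script - every value below is a placeholder.
# Usage: custom_transfer.sh <transport_file_name>

SAP_HOST=sapapp01                 # SAP application server (placeholder)
TRANSFER_USER=bodsxfer            # OS user with read access to the working directory (placeholder)
WORKING_DIR=/usr/sap/trans/bods   # working directory on the SAP server (placeholder)
LOCAL_DIR=/bods/custom_transfer   # custom transfer local directory on the Job Server (placeholder)

FILE="$1"

# Pull the generated transport file from the SAP server to the Job Server.
sftp "${TRANSFER_USER}@${SAP_HOST}:${WORKING_DIR}/${FILE}" "${LOCAL_DIR}/${FILE}"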

     

     

     

    To define SAP Application Datastore:

    a)     In the Datastore tab of the object library, right-click and select New.

    b)     Enter a unique name for the datastore in the Datastore name box.

    c)     The name can contain alphanumeric characters and underscores. It cannot contain spaces.

    d)     For Datastore type, select SAP Applications.

    e)     Enter the Application server name.

    f)      Enter the User name and Password information.

    g)     To add more parameters, click Advanced, enter the information as below and click OK to successfully create a SAP Application Datastore.

    Capture1.PNG

     

    Capture2.PNG

     

Here the Working directory on SAP server is the point where the ABAP will write the file to, and the Generated ABAP directory is the path to access this same directory from the Job Server.

    Performance Tuning for Table-Comparison


Often you have an EIM process where you get source data that includes new and updated records. In the data flow, you can use the Table Comparison transform to identify the changes.

     

If you have a large dimension or fact table and only a small amount of load data, the process often has a long processing time. This is caused by the large comparison table in the Table Comparison transform.

     

You can tune this by reducing the comparison data to a minimum. For a customer dimension this can be done like this:

     

• Source table for new/changed customer data: imp_customer
• Target table for customers: dim_customer
• For the comparison, create a view like:
  • create view comp_customer as
    select d.* from dim_customer d inner join imp_customer i on (d.customer_num = i.customer_num);
• Change the Table Comparison transform and set the comparison table to the view comp_customer.

     

Now the processing time of the transform improves, because the comparison table contains only the target rows that actually appear in the incoming load data.

    SAP Data Services and SAP Replication Server


    SAP Data Services & SAP Replication Server

     

In previous versions SAP Data Services could already use SAP Replication Server for change data capture, but this involved PowerDesigner and a staging area on Sybase ASE. As of Data Services 4.2 SP3 you no longer need PowerDesigner or the staging area: the integration is now much simpler, and SAP Data Services can connect directly to SAP Replication Server.

Architecture.png

    SAP Data Services connects to the source systems to collect metadata and then to SAP Replication Server / Agent for configuration and change data retrieval.

     

    You create 2 datastores, one with No CDC that will be used for the initial load

    No CDC.png

    and one using the CDC option which will be used for the delta loads.

    CDC datastore.png

    Under the advanced section you now add the Replication Server and Replication Agent details.

     

(This example uses Sybase ASE as the source system; note that as of Data Services 4.2 SP3, Sybase ASE is not supported as a CDC source. Please refer to the Product Availability Matrix for supported systems.)

     

    Import the same tables from both datastores.

datastores.png

You can now start creating your job. You may want to start with a conditional workflow to check whether the job needs to do an initial or a delta load. The initial load is just a standard dataflow using the table from the Initial datastore.

     

    The dataflow for the delta (Rep Server CDC) part of the job has to be inside a Continuous Workflow.

    cont_WF.png

    Once the job has been started it will continue to run and fetch changed data from the Replication Server. The Continuous Workflow has options (e.g. number of runs or custom function) to control how long it should run for.

     

    The dataflow inside the continuous workflow now reads the table from the Delta datastore.

dataflow.png

When viewing the data from the Delta datastore you will notice two additional columns, DI_SEQUENCE_NUMBER and DI_OPERATION_TYPE. The operation type is used to determine whether the record is an insert, update or delete.

    data.png

    In the example dataflow above the Operation type is then used in a Map_CDC_Operation transform to tell SAP Data Services to generate update, insert or delete statements. If you don’t want to delete data from the target then a Map_Operation transform can be used to turn a delete into an update for example.

     

    With SAP Data Services 4.2 SP3 this new simplified architecture makes it even easier to perform real time transformations as part of a data integration process.


    How to setup SuccessFactors Adapter with SAP Data Services


    Within SAP, we created a SAP Data Services Job to extract some recruiting data from SuccessFactors in order to store those data on an internal HANA system for reporting.

     

Since SAP Data Services 4.1, a specific adapter has been available to connect with SuccessFactors. An adapter is a Java service running in the background of Data Services which is able to deal with cloud web services. I would like to share with you some of my findings with SAP Data Services 4.2 and the SFSF adapter.

     

    Prerequisite

     

    Please identify which Cloud instance of SuccessFactors you will use. It can be done via this URL:

     

    https://sfapitoolsflms.hana.ondemand.com/SFIntegration/sfapitools.jsp


    06-01-2015 09-39-13.gif

Please validate with this URL that you are able to open the connection from the SAP Data Services Job Server. To achieve this you need to ask the SuccessFactors administrator for the following:

    • provide an API user with proper authorization and password
    • provide the Company ID
• Set up the Data Services IP address in the SuccessFactors whitelist, otherwise you will face this kind of error message:

    22-12-2014 15-02-22.gif

     

    Adapter - Proxy Setup:

     

    As your SAP Data Services is running in your internal network, please pay attention to the proxy definition by adding the following parameters in the adapter web interface setup:

    22-12-2014 13-57-08.gif

     

    Parameters to be added : -Dhttps.proxyHost=<proxy> -Dhttps.proxyPort=<8080>

     

Please also set the Trace mode to True to be able to view traces and errors.

     

    Certificate setup

     

If you receive the error below, you need to import the right certificates into the keystore used by the adapter in order to enable the SSL connection (there are sometimes mistakes in some documentation).

    22-12-2014 13-55-26.gif

     

    Please perform the steps below:

     

    Obtain 3 SFSF certificates

     

Below are the steps to export all certificates from the certificate path using Firefox.

    1. Click on the lock

    certificate_01.png

     

2. Click ‘More Information’

    certificate_02.png

    3. Click ‘View Certificate’

    certificate_03.png

     

    4. Click ‘Details’

    certificate_04.png

     

5. The ‘Certificate Hierarchy’ shows 3 certificates - it is really important to get those 3 certificates to enable the SSL connection.

    • The root certificate is ‘VeriSign Class 3 Public Primary Certification Authority – G5’.
    • The child certificate is ‘VeriSign Class 3 Secure Server CA - G3’.
    • The grandchild certificate is ‘*.successfactors.eu’.

    certificate_05.png

     

    6. To export a certificate, click to highlight the certificate. Below shows how to export the root certificate.

    certificate_06.png

    7. Click ‘Export’, to save the certificate file.

    certificate_07.png

     

Import those 3 certificates into the right keystore

     

     

a) Open a DOS command prompt, type set JAVA_HOME=%LINK_DIR%\ext and press Enter.

     

    b) Type set path=%LINK_DIR%\ext\jre\bin;%path% and press Enter.

     

    c) Type cd %link_dir%\ssl\trusted_certs and press Enter.

     

    d) Type notepad sslks.key to view the keystore password.

     

    e) Type keytool -import -alias verisign_class3g5ca -file "need full path to VeriSignClass3PublicPrimaryCertificationAuthority-G5.crt" -keystore jssecacerts and press Enter.

When asked ‘Enter keystore password:’, paste the password from step (d).

When asked ‘Trust this certificate? [no]:’, type yes.

     

    f) Type keytool -import -alias verisign_g3 -file "need full path to VeriSignClass3SecureServerCA-G3.crt" -keystore jssecacerts and press Enter.

When asked ‘Enter keystore password:’, paste the password from step (d).

When asked ‘Trust this certificate? [no]:’, type yes.

     

    g) Type keytool -import -alias sfsf_eu -file "need full path to sap.successfactors.eu.crt" -keystore jssecacerts and press Enter.

When asked ‘Enter keystore password:’, paste the password from step (d).

When asked ‘Trust this certificate? [no]:’, type yes.

     

    h) Restart the adapter
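The steps above are written for a Windows Job Server (DOS prompt, %LINK_DIR%). If your Job Server runs on Linux, the same imports can be scripted. The following is only a minimal sketch; it assumes the same directory layout under $LINK_DIR and that the three exported certificate files have been copied to /tmp (the file locations are placeholders).

#!/bin/sh
# Minimal sketch for a Linux Job Server - certificate file locations are placeholders.
export JAVA_HOME="$LINK_DIR/ext"
export PATH="$LINK_DIR/ext/jre/bin:$PATH"

cd "$LINK_DIR/ssl/trusted_certs"

# The keystore password is stored in sslks.key (see step d above).
KS_PASS=$(cat sslks.key)

# -noprompt answers the 'Trust this certificate?' question automatically.
keytool -import -alias verisign_class3g5ca -file /tmp/VeriSignClass3PublicPrimaryCertificationAuthority-G5.crt -keystore jssecacerts -storepass "$KS_PASS" -noprompt
keytool -import -alias verisign_g3 -file /tmp/VeriSignClass3SecureServerCA-G3.crt -keystore jssecacerts -storepass "$KS_PASS" -noprompt
keytool -import -alias sfsf_eu -file /tmp/sap.successfactors.eu.crt -keystore jssecacerts -storepass "$KS_PASS" -noprompt

# Restart the adapter afterwards (step h).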

     

Nota bene: some technical documentation mentions the path %LINK_DIR%\ext\jre\lib\security, but to have the connection working properly we work here with %LINK_DIR%\ssl\trusted_certs. The documentation also mentions the keystore cacerts, whereas here we use jssecacerts.


    Reference blogs:

     

    I hope those tips will help you if you need to enable the communication between SuccessFactors and SAP Data Services (on premise).

Other possibilities are available with SAP PI and SAP HANA Cloud Integration, which are presented in the reference blogs.

     

    I will continue to update this blog with other findings later in 2015.

     

    Best regards

     

    Thomas

    Fields in EDIDC table that can be used


When generating IDocs using BusinessObjects Data Services (BODS), the EDIDC control table data is passed. There is a need to match each IDoc against the corresponding records that were used to generate it. One option that I have used is to store the legacy value in some of the fields of the EDIDC table when generating the IDocs using BODS. I would like to know whether there are alternative options that are more structured than using EDIDC fields for storing this correlation between legacy data and the generated IDocs.

     

Below is an example where the EDIDC field RCVLAD was used to store the legacy reference for the Business Partner when creating/updating business partners through IDocs generated using BODS.

     

    Fig1.jpg

     

    The IDocs loaded using this method all have NEW_<GUID> in column RCVLAD. A sample is shown in Fig2 below.

     

    Fig2.jpg

     

    The same approach can be applied on column "SNDSAD".

     

What would be good to understand is whether there are any other standard solutions that we can utilise to facilitate the IDoc-to-legacy-record reference.

     

All comments on this topic are welcome.

    Remove second password query at DS repository login


    Reason

In some cases you get a second password query when you log on to the repository. The second password query looks like the following screenshot.

    Repo Password.png

The reason for this second login is missing permissions on the enterprise server. To remove the second login you have to change the permissions in the CMC.

     

    Solution

First you have to log on to the CMC with an admin user. Once you are logged in, open the Data Services section of the CMC.

CMC Home.JPG

In the Data Services section you have two options to change the permissions, depending on your requirements.

If you want to remove the second query for every repository, open the User Security of the repository folder on the left side. Otherwise, open the User Security of the specific repository.

    CMC DS.JPG

    On the new user security window you see the different user groups. Select the user group (in my case it is the Administrators group) and click Assign Security.

    User Security.JPG

     

The problem is that the Full Control access level does not include all permissions. Therefore you have to open the Advanced tab. In the Advanced tab, click the hyperlink Add/Remove Rights.

    Advanced Tab.png

In the new window, open the topic Application on the left side and the subtopic Data Services Repository. Now you can see the specific Data Services rights for repositories. To remove the second password query, grant the first two rights and apply them to objects and sub-objects.

    Add Remove Rights.png

After confirming all windows with OK, the password query is removed.

    NEW TO SAP DATA SERVICES: NATIVE SUPPORT FOR MONGODB


     


    This is a guest blog by Subha Ramachandran, Vice President, Product Management at SAP Labs. It represents her personal views, thoughts, and opinions. It is not endorsed by SAP nor does it constitute any official communication from SAP.

     

     

    The bulk of work in any Big Data initiative is in preparation of the data – specifically data integration and ensuring data quality. With native support of MongoDB in the new release of SAP Data Services 4.2 SP04, those tasks just got easier.


    Bring in Data that Lives in Various Source Systems

SAP Data Services for MongoDB simplifies the extract, transform and load (ETL) of data from the database by preserving the fidelity of JSON structures instead of artificially flattening them, which can result in data redundancies/repetition. Within SAP Data Services, ETL developers can operate on hierarchical structures, perform required transformations, and flatten as needed to fuel analytics use cases. For example, users can load multi-level machine/equipment data stored in MongoDB into SAP HANA, Hadoop, or any other data warehouse on a regular basis (e.g. daily or weekly) for analytics.


    Simplify & Maximize Performance with SAP Data Services

    SAP Data Services provides a rich set of native out-of-the-box transformations, with over 80 built-in functions in its library, including native text data processing, data masking, and data quality transformations to standardize, validate, cleanse and enrich data.

     

MongoDB's dynamic schema allows SAP Data Services to automatically scan collections in parallel to quickly infer the metadata. The software also supports the pushdown of allowed operations to MongoDB and thus maximizes performance. The SAP Data Services 4.2 SP04 release supports both single node and replica set deployments of MongoDB.

     

    In summary, the SAP Data Services 4.2 SP04 release allows you to combine MongoDB’s schema flexibility with a market leading set of data integration and data quality capabilities. This gives developers the power to easily extract, transform, and load MongoDB data as part of any Big Data initiative. SAP Data Services and MongoDB help you deliver a complete and accurate view of your data, allowing you to identify new insights and convert them into business value.

     

    Stay tuned for further details on what’s to come in 2015, including plans for support of MongoDB sharded clusters.

     

Find out more about SAP Data Services, the Data Services 4.2 SP04 native support of MongoDB, and why SAP Data Services is recognized as a Leader in the Magic Quadrants for both Data Integration and Data Quality tools.

     

    More security features in SAP Data Services


    This message contains some internal system details which have been hidden for security. If you need to see the full contents of the original message, ask your administrator to assign additional privileges to your account.


Have you ever run into this error message before and been curious to see the original message? Here's how to get it.


    Start the Central Management Console. Navigate to Data Services Application:

    1.png

    Select User Security:


    2.png


    Select the user or group you want to authorise and select "Assign Security":

    3.png

    Select the Advanced tab, then "Add/remove Rights":

    4.png

    Grant "View internal information in log" and apply changes in both panels.


    Next time your DS job runs into an error, you'll see the complete original error message.
