Xpansiv XSignals Python SDK

This document describes the Xpansiv Python SDK, which enables external users to use the Xpansiv platform tools within their Python scripts.

The Python SDK can be used within:

  • a single Python script run via the Python Runner task within a pipeline in the Xpansiv application
  • a single Jupyter Notebook run via the Jupyter Runner task within a pipeline in the Xpansiv application
  • an ensemble of Python scripts that are part of a container, for a Task created by the user, used in a pipeline in the Xpansiv application

Examples of usage can be found at the bottom of the documentation.

Note that the SDK does not cover everything in the API documentation, only commonly used features.

The scope of the SDK:

  • Datalake Handler - downloading / uploading / reading files from the data lake
  • Status Handler - sending statuses about the task run
  • Task Handler - enables communication between tasks within a pipeline and reading/writing parameters
  • Time Series Handler - retrieves data directly from the time series database
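
Each of these areas is exposed through a dedicated handler class. As a quick orientation, here is a minimal sketch (assuming the environment variables described below are already set):

import xpansiv_xsignals as xp

dh = xp.DatalakeHandler()   # download / upload / read data lake files
sh = xp.StatusHandler()     # send statuses about the task run
th = xp.TaskHandler()       # read inputs and write outputs between tasks
ts = xp.Timeseries()        # query the time series database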

How to install and set up the package:

Install

pip3 install xpansiv_xsignals==1.0.0

As the library is available from pip, a specific version can be installed within a Python Task simply by adding the following line to requirements.txt:

xpansiv_xsignals==1.0.0

The package relies on the requests library, so the user must also add this library to the project's requirements.txt file (or install it directly):

pip3 install requests
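
Putting both together, a minimal requirements.txt for a Task could look like this:

xpansiv_xsignals==1.0.0
requests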

Environment Variables

The package uses information from environment variables. They are provided automatically when running a script within a pipeline (as a Task or within the Python/Jupyter Runners). When running the script locally, users must set them in the project themselves.

Mandatory environment variables to set:

  • LOGIN → login received from Xpansiv
  • PASSWORD → password to log in. Credentials are used to generate the token so that each request is authenticated.
  • NG_API_ENDPOINT → the URL of the Xpansiv platform API (by default, it is set to https://api.xsignals.xpansiv.com)

These variables complete the authentication process and direct the user's requests to the Xpansiv environment API. Alternatively, instead of LOGIN and PASSWORD, you may set the NG_API_KEY environment variable with an API key generated within the Xpansiv platform to enable the authentication process.
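
For example, a minimal API-key setup might look like the following sketch (using the NG_API_KEY variable described above):

import os
os.environ['NG_API_KEY']      = ''  # API key generated within the Xpansiv platform
os.environ['NG_API_ENDPOINT'] = 'https://api.xsignals.xpansiv.com'

import xpansiv_xsignals as xp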

For SSO-only user accounts, the following environment variables for Auth0 must additionally be set:

  • CLIENT_ID
  • CLIENT_SECRET

When using an NG_API_ENDPOINT different from the default, also set the following environment variables: DOMAIN, GRANT_TYPE, REALM, SCOPE, AUDIENCE.

The full SSO user setup in code would look like the following:

import os
os.environ['LOGIN']         = ''
os.environ['PASSWORD']      = ''
os.environ['CLIENT_ID']     = ''
os.environ['CLIENT_SECRET'] = ''

os.environ['NG_API_ENDPOINT']   = ''
os.environ['DOMAIN']            = ''
os.environ['GRANT_TYPE']        = ''
os.environ['REALM']             = ''
os.environ['SCOPE']             = ''
os.environ['AUDIENCE']          = ''

import xpansiv_xsignals as xp

Other variables may be useful when creating the tasks within the platform:

  • NG_STATUS_GROUP_NAME → the group on the data lake where the pipeline is located; it is used to display the statuses
  • JOB_ID → any value; when the pipeline is executed, this value is set by the Xpansiv platform
  • PIPELINE_ID → any value; when the pipeline is created, this value is set by the Xpansiv platform
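
When running locally, these can be stubbed with placeholder values, for example:

import os
os.environ['NG_STATUS_GROUP_NAME'] = ''            # group used to display statuses
os.environ['JOB_ID']               = 'local-test'  # set by the platform when the pipeline is executed
os.environ['PIPELINE_ID']          = 'local-test'  # set by the platform when the pipeline is created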

Datalake Handler

How to download or read a file from the data lake by its name?

The DatalakeHandler class can be used as follows within a script to download or upload a file:

import xpansiv_xsignals as xp
import pandas as pd

dh = xp.DatalakeHandler()

# Download the file and save it locally
dh.download_by_name(file_name='my_file.csv', 
                    group_name='My Group', 
                    file_type='SOURCE',
                    dest_file_name='folder/local_name.csv',
                    save=True,
                    unzip=False)

# Or read the file directly into memory as a BytesIO object
fileIO = dh.download_by_name(file_name='my_file.csv', 
                            group_name='My Group',
                            file_type='SOURCE',
                            dest_file_name=None,
                            save=False,
                            unzip=False)

df = pd.read_csv(fileIO)

The download method allows you to either:

  • download and save the wanted file locally, if save=True
  • read the file directly from the datalake and get a BytesIO object (kept in memory only, which can, for example, be read directly by pandas as a dataframe)

Note that by default:

  • the file is NOT saved locally, but returned as a BytesIO object (streamed from the datalake).
  • the argument dest_file_name=None, which, when saving, writes the downloaded file to the root folder with its original name.

How to download or read a file from the data lake by its ID?

If the file ID is known, the file can be downloaded or read directly as follows:

import xpansiv_xsignals as xp
import pandas as pd

dh = xp.DatalakeHandler()


# Download the file and save it locally
dh.download_by_id(file_id='XXXX-XXXX', 
                  dest_file_name='folder/local_name.csv',
                  save=True,
                  unzip=False)

# Or read the file directly into memory as a BytesIO object
fileIO = dh.download_by_id(file_id='XXXX-XXXX', 
                            dest_file_name=None,
                            save=False,
                            unzip=False)

df = pd.read_csv(fileIO)

As with download_by_name(), the method either downloads and saves the file locally (save=True) or returns it as a BytesIO object streamed from the datalake. By default the file is not saved and dest_file_name=None (which, when saving, writes the file to the root folder under its original name).

How to upload a file to the data lake?

The upload method uploads the file at the specified path to the given group, and returns its ID on the lake:

import xpansiv_xsignals as xp

dh = xp.DatalakeHandler()

file_id = dh.upload_file(file='path/local_name.csv', 
                        group_name='My Group', 
                        file_upload_name='name_in_the_datalake.csv')

It is also possible to stream a Python object's content directly to the datalake from memory, without having to save the file on disk.

The prerequisite is to pass a BytesIO object to the upload method (other objects, such as a pandas DataFrame, are not accepted).

import xpansiv_xsignals as xp
import io

dh = xp.DatalakeHandler()

# df is an existing pandas DataFrame; encode its CSV content in memory
fileIO = io.BytesIO(df.to_csv().encode()) 

file_id = dh.upload_file(file=fileIO, 
                        group_name='My Group', 
                        file_upload_name='name_in_the_datalake.csv')

Timeseries Queries

How to get the list of existing symbols for a given group?

Data saved in the time series database is structured by group, keys, and timestamp. Each set of keys has unique date entries in the database, with corresponding column values.

To explore the available symbols for a given group, the following method can be used:

import xpansiv_xsignals as xp

ts = xp.Timeseries()

group = 'My Group'
group_symbols = ts.get_symbols(group_name=group)

The default size of the returned list is 1 000 items.

Note that the object returned by the get_symbols() method is a JSON (Python dict) where the keys and columns are accessible under its items key.
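
For instance, the entries can be listed as follows (only the items key is documented here; the exact shape of each entry may vary):

for entry in group_symbols['items']:
    print(entry)  # each entry describes a set of keys and its columns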

How to query by metadata or descriptions?

To find symbols by querying metadata, column, or symbol names, the search_for parameter may be used. It looks for the passed string across the whole time series database and returns a JSON with the keys and columns where the searched string appears.

import xpansiv_xsignals as xp

ts = xp.Timeseries()

search = 'Data description'
searched_symbols = ts.get_symbols(search_for=search)

Passing both the group_name and search_for parameters to the get_symbols() method narrows the results down to the selected group. The user must provide either group_name or search_for to the method in order to obtain the symbols.

If the list of symbols for a group contains more than 1 000 items, the results are paginated (by default into chunks of 1 000 items). To navigate large result sets, the get_symbols() method takes as extra arguments the size of the returned list (_size) and the page to start from (_from):

import xpansiv_xsignals as xp

ts = xp.Timeseries()

group = 'My Group'

group_symbols = ts.get_symbols(group_name=group, _size=200, _from=5)

By default, these parameters are _size=2000 (the maximum limit for the list length) and _from=0.
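
As an illustrative sketch only (assuming _from indexes pages as described above), a large group could be paged through like this:

all_items = []
page = 0
while True:
    chunk = ts.get_symbols(group_name=group, _size=1000, _from=page)
    items = chunk.get('items', [])
    if not items:
        break
    all_items.extend(items)
    page += 1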

How to read data from the Timeseries database?

It is possible to use the SDK to directly query the TimeSeries database for data, given the symbol's keys, columns, and the datalake group it is stored in.

In the application, this is similar to creating a Dataprep instance, which selects a set of symbols from groups into a basket.

The retrieved data can be:

  • streamed directly to memory, retrieved as a BytesIO object, by setting file_name to None (the default value),
  • saved locally as a CSV file with the provided path and name as file_name.

The symbols are the keys the data was saved with in the database. For a given symbol, all of its keys must be passed, as a dictionary of key names and values. The character * can be used as a wildcard for a key's value, to retrieve all the values for that key.

The wanted columns are then passed as a list that can contain one or more items. If an empty list [] is passed to the function, all available columns are returned.

To read all available data for specific symbols and columns with no time frame, no start or end date is passed to the method.

Extra settings are also available to query data:

  • Metadata: whether or not to return it with the query results
  • Format: either a Normalized CSV (NCSV) or a dataframe format
  • Timezone: get the timestamps in the timezone of the user's account, or in a specified timezone
  • Corrections: how to handle corrections to the TimeSeries database (corrections set to 'yes', 'no', 'history' or 'only')
  • Delta: whether to show rows by data timestamp (delta=False) or by insert time (delta=True); see the sketch after the example below

The following code shows an example of how to query the TimeSeries database:

import xpansiv_xsignals as xp
import pandas as pd

ts = xp.Timeseries()

symbols = {'Key1': "Val1", "Key2": "Val2"}
columns = ['Open', 'Close']

ts.retrieve_data_as_csv(file_name='test/query.csv',
                        symbols=symbols,
                        columns=columns,
                        group_name='My Group'
                        )

df = pd.read_csv("test/query.csv")


fileIO = ts.retrieve_data_as_csv(file_name=None,
                                 symbols=symbols,
                                 columns=columns,
                                 group_name='My Group'
                                 )

df = pd.read_csv(fileIO)
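
The corrections and delta settings listed earlier can be passed to the same method. The following is a sketch only; the corrections parameter name is assumed from the settings list above:

fileIO = ts.retrieve_data_as_csv(file_name=None,
                                 symbols=symbols,
                                 columns=columns,
                                 group_name='My Group',
                                 corrections='history',
                                 delta=True
                                 )

df = pd.read_csv(fileIO)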

How to read data from the Timeseries database for specific dates?

To retrieve data within a specific time frame, the user can specify the start and end dates.

The start and end dates can take two forms:

  • only date (e.g., 2021-01-04)
  • date and time (e.g., 2021-02-01T12:00:00; ISO format must be followed)

For example, if the user specifies start_date=2021-02-01 and end_date=2021-02-06, then data is retrieved from 2021-02-01 00:00:00 till 2021-02-06 23:59:59.

If date and time are specified, then data is retrieved exactly for the specified time frame.

Note that ISO format must be followed: YYYY-MM-DDTHH:mm:ss. Pay attention to the "T" letter between date and time.

import xpansiv_xsignals as xp
import pandas as pd

ts = xp.Timeseries()

# Symbols to query from database
symbols = {'Key1': "Val1", "Key2": "Val2"}
columns = ['Open']

ts.retrieve_data_as_csv(file_name='test/test.csv',
                        symbols=symbols,
                        columns=columns,
                        group_name='My Group',
                        start_date='2021-01-04',
                        end_date='2021-02-05'
                        )

ts.retrieve_data_as_csv(file_name='test/test.csv',
                        symbols=symbols,
                        columns=columns,
                        group_name='My Group',
                        start_date='2021-01-04T12:30:00',
                        end_date='2021-02-05T09:15:00'
                        )

fileIO = ts.retrieve_data_as_csv(file_name=None,
                                 symbols=symbols,
                                 columns=columns,
                                 group_name='My Group',
                                 start_date='2021-01-04',
                                 end_date='2021-02-05'
                                 )

df = pd.read_csv(fileIO)

How to use a wildcard for a key's values?

To get all the values for one or several keys in the query, the character * can be used as a wildcard. The argument allow_wildcard must be set to True in the retrieval function to enable the use of wildcards.

Please note that by default, the use of wildcards is DISABLED.

import xpansiv_xsignals as xp
import pandas as pd

ts = xp.Timeseries()

symbols = {'Key1': "Val1", "Key2": "Val2", "Key3": "*"}
columns = ['Open', 'Close']

fileIO = ts.retrieve_data_as_csv(file_name=None,
                                 symbols=symbols,
                                 columns=columns,
                                 group_name='My Group',
                                 allow_wildcard=True
                                 )

df = pd.read_csv(fileIO)

How to get all the columns for a given set of keys?

To get all the column values for a given set of keys in the database, the query can take an empty list as the queried columns, as follows:

import xpansiv_xsignals as xp
import pandas as pd

ts = xp.Timeseries()

symbols = {'Key1': "Val1", "Key2": "Val2", "Key3": "Val3"}
columns = []

fileIO = ts.retrieve_data_as_csv(file_name=None,
                                 symbols=symbols,
                                 columns=columns,
                                 group_name='My Group'
                                 )

df = pd.read_csv(fileIO)

Note that this configuration can be used with key wildcards (with allow_wildcard=True) and any other setting.

How to modify the Time Zone of the data?

The timestamps in the queried time series are set by default to the timezone of the account of the user who created the script or the pipeline. It is indicated in the Date column header between brackets (for example, Date(UTC)).

To modify the timezone of the retrieved dataset, the timezone can be passed directly to the retrieval function as follows. It must follow the Continent/City format.

import xpansiv_xsignals as xp
import pandas as pd

ts = xp.Timeseries()

symbols = {'Key1': "Val1", "Key2": "Val2"}
columns = ['Open', 'Close']

fileIO = ts.retrieve_data_as_csv(file_name=None,
                                 symbols=symbols,
                                 columns=columns,
                                 group_name='My Group',
                                 timezone='Europe/London'
                                 )

df = pd.read_csv(fileIO)

How to get the metadata along with the data?

It is possible to get extra columns in the retrieved data, along with the keys and column values, containing the metadata of the symbols. This is done by setting the argument metadata=True in the retrieval function.

By default, no metadata is included in the queried data.

import xpansiv_xsignals as xp
import pandas as pd

ts = xp.Timeseries()

symbols = {'Key1': "Val1", "Key2": "Val2"}
columns = ['Open', 'Close']


fileIO = ts.retrieve_data_as_csv(file_name=None,
                                 symbols=symbols,
                                 columns=columns,
                                 group_name='My Group',
                                 metadata=True
                                 )

df = pd.read_csv(fileIO)

How to modify the format of the received data?

The queried data comes by default in the Normalized CSV (NCSV) format, containing, in this order:

  • the keys columns,
  • the date column, with timestamps in either the default timezone or the specified one (timezone argument in the function),
  • the values columns,
  • the metadata columns, if requested (metadata=True)

By setting NCSV=False in the retrieval method, the data is returned in dataframe format (called PANDAS in the API docs), as a JSON. The JSON (Python dict) has timestamps as keys, each mapping to a dictionary of symbol_column pairs and their values.

import xpansiv_xsignals as xp
import pandas as pd

ts = xp.Timeseries()

symbols = {'Key1': "Val1", "Key2": "Val2"}
columns = ['Open', 'Close']

file_json = ts.retrieve_data_as_csv(file_name=None,
                                    symbols=symbols,
                                    columns=columns,
                                    group_name='My Group',
                                    metadata=True,
                                    NCSV=False
                                    )

df = pd.DataFrame(file_json).T

Note that the dataframe created from the JSON containing the data is then transposed so that the timestamps lie along the rows axis as the index.
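
If an explicit DatetimeIndex is needed, the index can be converted after transposing (a small, assumed convenience step):

df.index = pd.to_datetime(df.index)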


Task Handler

Users can extend the existing set of tasks on the Xpansiv platform by executing scripts or notebooks from the Python Runner Task or the Jupyter Runner Task, respectively.

Such a task can then be used in a pipeline and communicate with other tasks by:

  • reading outputs from other tasks, as inputs
  • writing outputs that can be used by other tasks as inputs

A task can also receive inputs directly as a file picked from the datalake, either as a specific file or as the newest available version of a file on the lake, given its name and group.

Within a Python script run in the Python Runner Task, this is implemented as follows:

Read a Task Input

The input file passed to the Task can be:

  • downloaded to disk, or
  • read on the fly (useful when disk space is limited but memory is not).
import xpansiv_xsignals as xp
import pandas as pd

th = xp.TaskHandler()

# Download the input file and save it locally
th.download_from_input_parameter(arg_name='Input #1', 
                                 dest_file_name='data.csv',
                                 save=True)

# Or read the input file directly into memory
file_content = th.download_from_input_parameter(arg_name='Input #2', 
                                                dest_file_name=None,
                                                save=False)

df = pd.read_csv(file_content)

If dest_file_name=None, then the file is saved on disk with its original name from the datalake.

How to obtain additional file information

It is possible to retrieve information related to the input parameters by using the DatalakeHandler (dh.get_info_from_id()) and the TaskHandler (th.read_task_parameter_value()) together.

import xpansiv_xsignals as xp
import pandas as pd

th = xp.TaskHandler()
dh = xp.DatalakeHandler()

dh.get_info_from_id(th.read_task_parameter_value('Input #1'))

This can be used to retrieve any metadata associated with the file itself, such as the file name or arrival time (the time it was uploaded to the system).
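
For example (the content of the returned object depends on the API response; inspect it for the exact field names):

info = dh.get_info_from_id(th.read_task_parameter_value('Input #1'))
print(info)  # e.g., file name, group, arrival time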

Set a Task Output

The output of the Task can be set:

  • either by uploading a file saved on disk to the datalake,
  • or by streaming the Python object's content directly to the datalake as the destination file.

Once uploaded, the Output is set to point to the file on the datalake (by its ID, name, and group name).

import xpansiv_xsignals as xp
import io

th = xp.TaskHandler()

# Upload a file saved on disk
th.upload_to_output_parameter(output_name='Output #1', 
                              file='path/dataset.csv', 
                              group_name='My Final Group',
                              file_upload_name=None,
                              file_type='SOURCE')

# Or stream a DataFrame's CSV content directly from memory (df is an existing DataFrame)
df_io = io.BytesIO(df.to_csv().encode())

th.upload_to_output_parameter(output_name='Output #1', 
                              file=df_io, 
                              group_name='My Final Group',
                              file_upload_name='dataset.csv',
                              file_type='SOURCE')

If file_upload_name=None, then the saved file is uploaded with its original name. If the file is streamed directly to the datalake, the file_upload_name argument must be set.


Statuses

Sending statuses can be used to show the progress of the task execution in the application. Three different levels are available:

  • FINISHED (green),
  • WARNING (orange),
  • ERROR (red).

Sending statuses remains optional, as the Xpansiv platform sends general statuses. It is only worth using when the user needs to pass specific information in the status.

import xpansiv_xsignals as xp

sh = xp.StatusHandler()

sh.send_status(status='INFO', message='Crucial Information')

sh.info(message='Pipeline Finished Successfully')
sh.warn(message='Something suspicious is happening ...')
sh.error(message='Oops, the task failed ...')

Note that the info status informs the status service that the task executed successfully and is finished.

It is also possible to send a warning status with a custom warning message under some circumstances and immediately stop the execution of the pipeline:

from xpansiv_xsignals.ExceptionHandler import PythonStepWarnException

i = 1
if i > 1:
    raise PythonStepWarnException(message='The value of i is bigger than 1! Stopping pipeline execution.')

Example 1 - OOP

To simplify the use of the SDK methods in a script, the SDK handlers can be wrapped in the user's main class.

Below is an example of a class that has 3 methods:

  • Download raw data (or take it from the previous task)
  • Process the data
  • Upload the data to the datalake and pass it to the next task
import io
import xpansiv_xsignals as xp
import pandas as pd

class Runner:
    def __init__(self):
        self.handler = xp.TaskHandler()
        self.df = None
        
    def download_data(self):
        self.handler.download_from_input_parameter(arg_name='Dataset', dest_file_name='data.csv', save=True)
        
        return pd.read_csv("data.csv")
        
    
    def process_data(self, df):
        # Placeholder processing step; replace with the actual transformation logic
        df_processed = df.dropna()
        return df_processed

    
    def upload_data(self, df_processed):
        fileIO = io.BytesIO(df_processed.to_csv().encode())
        
        self.handler.upload_to_output_parameter(output_name='Processed Dataset', file=fileIO, group_name='Final Group')
    
        
    def run(self):
        df = self.download_data()
        df_processed = self.process_data(df)
        self.upload_data(df_processed)
        
        
status = xp.StatusHandler()
Runner().run()
status.info('Test Pipeline Finished')

Example 2 - functional programming

The SDK methods may also be used in simple functional-programming scripts. Below is an example of a script that:

  • Downloads the data from Input #1
  • Processes the data
  • Uploads the data to the datalake and passes it to the next task
# Environment stubs for a local run; within a pipeline these values are provided automatically
import os
os.environ['LOGIN'] = ''
os.environ['PASSWORD'] = ''

os.environ['Input #1'] = "{'name':'', 'groupName':''}"

os.environ['NG_STATUS_GROUP_NAME'] = ''

os.environ['NG_API_ENDPOINT'] = ''



import os, sys
import io
import xpansiv_xsignals as xp
import pandas as pd

import logging ; logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
log = logging.getLogger()

th = xp.TaskHandler()

df_io = th.download_from_input_parameter('Input #1')
df = pd.read_csv(df_io)
log.info('Data from Input #1 downloaded.')

df_processed = your_processing_function(df)  # placeholder for the user's own processing function

file_io = io.BytesIO(df_processed.to_csv(index=False).encode())

th.upload_to_output_parameter(output_name='Output #1',
                                        file=file_io,
                                        file_upload_name='',
                                        group_name=os.environ.get('NG_STATUS_GROUP_NAME'),
                                        file_type='SOURCE')

Who do I talk to?

  • Admin: Xpansiv info@xpansiv.com