Data Ingestion using Backend

A guide to backend data ingestion for Amazon S3 and Azure.

Overview

Data ingestion using the backend is available for Amazon S3 and Azure.

The maximum size of a file you can upload is 5 GB.

Once your backend integration is set up, you get access to dedicated cloud storage. Upload your files to the in directory there so that Fynapse can ingest them. Fynapse ingests files in the following formats:

  • CSV - these files can be uploaded via the Fynapse UI and via the backend.
  • gzip - these files can be uploaded only via the backend. The system detects the file format automatically, so you do not have to add a file extension such as .gz. After upload to the dedicated cloud storage folder, gzip files are decompressed and then go through the same requirement checks, validation, and deduplication as CSV files before they are ingested.
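
A pre-upload check along the following lines can catch oversized files early and illustrates what content-based gzip detection looks like. This is a minimal sketch, not the Fynapse implementation: the helper names are hypothetical, the 5 GB limit is assumed to mean 5 x 1024^3 bytes, and detection relies on the standard two-byte gzip signature (0x1f 0x8b) rather than on any documented Fynapse behavior.

```python
import gzip

# Assumption: "5 GB" is read as 5 * 1024**3 bytes; adjust if decimal units apply.
MAX_UPLOAD_BYTES = 5 * 1024**3

# Standard gzip signature (RFC 1952): the first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(data: bytes) -> bool:
    """Return True if the payload starts with the gzip signature."""
    return data[:2] == GZIP_MAGIC


def read_csv_text(data: bytes, encoding: str = "utf-8") -> str:
    """Return the CSV text, decompressing first when the payload is gzip."""
    if is_gzip(data):
        data = gzip.decompress(data)
    return data.decode(encoding)


def within_size_limit(size_in_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if a file of this size fits within the assumed upload limit."""
    return 0 <= size_in_bytes <= limit
```

Because the check looks at the content and not at the file name, a gzip file named test_data and one named test_data.gz are treated the same way.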

There can be up to four directories for a particular Entity in the dedicated cloud storage:

  • in - a directory where you upload files and from which files are picked up for processing. Only one file is processed at a time for a single route, and files are processed in alphabetical order. Files are moved to the in progress directory once their processing starts.

All new files placed in the in directory are scanned for malware. After the scan is complete, all safe files are moved to the in progress directory for further processing. Files tagged as unsafe are replaced with files stating that the originals were detected as malicious. The original malicious files are quarantined, and the safe replacement files (with the replaced content) are moved to the done and error directories.

  • in progress - a temporary directory that holds files only while they are being processed. Once processing ends, files are moved to the done directory.
  • done - a directory where uploaded and processed files are kept in folders named with the syntax yyyymmddhhmmss, which corresponds to the timestamp of the particular file. For example, a file uploaded on 2022-07-29 at 11:11:12 is placed in the folder named 20220729111112. If the file contained incorrect records, an additional file with error descriptions is created in the error directory. The done directory also contains a receipt file that confirms that the file was delivered. Its name has the syntax {file_name}-receipt.csv, for example, test_data-receipt.csv. The receipt file includes information about the processing of a particular ingestion, such as:
    • The Ingestion ID
    • The name of the ingestion
    • The status of the ingestion
    • The total number of processed records
    • The total number of records processed successfully
    • The total number of records processed with errors
    • Validation errors (on the ingestion level)
    • A checksum. The checksum is:
      • Always available in the receipt file when the data file is sent via the Fynapse GUI, because in this case the checksum is generated and added to the data file header automatically.
      • Available in the receipt file when the data file is sent via AWS CLI only if the user sends the proper metadata in a header, including a string with the checksum, because in this case it is not added automatically.
  • error - a directory that holds files containing all incorrect records from the processed files, with an error message for each invalid record. The name of an error file always has the -error suffix, for example, test_data-error.csv. The file is placed in a folder whose name has the syntax yyyymmddhhmmss, which corresponds to the timestamp of the original file uploaded to the bucket.
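
The naming rules for the done and error directories can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and stripping the final extension from the uploaded file name to obtain {file_name} is an assumption based on the test_data examples above, not documented behavior.

```python
from datetime import datetime
from pathlib import PurePosixPath


def timestamp_folder(moment: datetime) -> str:
    """Format a timestamp using the yyyymmddhhmmss folder syntax."""
    return moment.strftime("%Y%m%d%H%M%S")


def receipt_name(file_name: str) -> str:
    """Build the receipt file name, e.g. test_data.csv -> test_data-receipt.csv."""
    # Assumption: {file_name} is the uploaded name without its final extension.
    return f"{PurePosixPath(file_name).stem}-receipt.csv"


def error_name(file_name: str) -> str:
    """Build the error file name, e.g. test_data.csv -> test_data-error.csv."""
    return f"{PurePosixPath(file_name).stem}-error.csv"


folder = timestamp_folder(datetime(2022, 7, 29, 11, 11, 12))
print(folder)                          # 20220729111112
print(receipt_name("test_data.csv"))   # test_data-receipt.csv
print(error_name("test_data.csv"))     # test_data-error.csv
```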

Input Files Placement

All input files are placed in cloud storage under dedicated paths that depend on the ingestion status of the particular input file.

Note that fynapse is the default namespace created by the system.

All files placed in the in directory are scanned using anti-malware software.

General Path for Input Files

namespace/entityName/in/

Example:

fynapse/BusinessEvent/in/

Full S3 URI pattern

s3://bucketName/namespace/entityName/in/fileName.csv

Example:

s3://793f2445-52f3-4964-92cc-7bac25af3bef/fynapse/BusinessEvent/in/test-fynapse-data_14.csv

Full Azure Blob URI pattern

https://storageAccountName.blob.core.windows.net/storageContainerName/namespace/entityName/in/fileName.csv

Example:

https://i793f244552f3496492cc7ba.blob.core.windows.net/i793f244552f3496492cc7bac25af3bef-storage-container/fynapse/BusinessEvent/in/test-fynapse-data_14.csv
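
The input path patterns above can be assembled programmatically, for example when preparing upload targets in a script. The following is a minimal sketch: the function names are hypothetical, and the code only builds strings from the patterns shown above without contacting Amazon S3 or Azure.

```python
# Default namespace created by the system, as noted above.
DEFAULT_NAMESPACE = "fynapse"


def input_key(entity_name: str, file_name: str,
              namespace: str = DEFAULT_NAMESPACE) -> str:
    """Build the general input path: namespace/entityName/in/fileName."""
    return f"{namespace}/{entity_name}/in/{file_name}"


def s3_input_uri(bucket_name: str, entity_name: str, file_name: str,
                 namespace: str = DEFAULT_NAMESPACE) -> str:
    """Build the full S3 URI of an input file."""
    return f"s3://{bucket_name}/{input_key(entity_name, file_name, namespace)}"


def azure_input_url(storage_account_name: str, storage_container_name: str,
                    entity_name: str, file_name: str,
                    namespace: str = DEFAULT_NAMESPACE) -> str:
    """Build the full Azure Blob URL of an input file."""
    key = input_key(entity_name, file_name, namespace)
    return (f"https://{storage_account_name}.blob.core.windows.net/"
            f"{storage_container_name}/{key}")


# Placeholder bucket, account, and container names; replace with your own.
print(s3_input_uri("bucketName", "BusinessEvent", "fileName.csv"))
print(azure_input_url("storageAccountName", "storageContainerName",
                      "BusinessEvent", "fileName.csv"))
```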

Path for done files

namespace/entityName/done/dateTime/

Full S3 URI pattern

s3://bucketName/namespace/entityName/done/dateTime/fileName.csv

Example:

s3://793f2445-52f3-4964-92cc-7bac25af3bef/fynapse/BusinessEvent/done/20221213104843/test-fynapse-data_14.csv

Full Azure Blob URI pattern

https://storageAccountName.blob.core.windows.net/storageContainerName/namespace/entityName/done/dateTime/fileName.csv

Example:

https://i793f244552f3496492cc7ba.blob.core.windows.net/i793f244552f3496492cc7bac25af3bef-storage-container/fynapse/BusinessEvent/done/20221213104843/test-fynapse-data_14.csv

Path for errored files

The files placed under this path failed because of file parsing problems during the ingestion process.

namespace/entityName/error/dateTime/

Full S3 URI pattern

s3://bucketName/namespace/entityName/error/dateTime/fileName.csv

Example:

s3://793f2445-52f3-4964-92cc-7bac25af3bef/fynapse/BusinessEvent/error/20221129144507/test-fynapse-data_11-error.csv

Full Azure Blob URI pattern

https://storageAccountName.blob.core.windows.net/storageContainerName/namespace/entityName/error/dateTime/fileName.csv

Example:

https://i793f244552f3496492cc7ba.blob.core.windows.net/i793f244552f3496492cc7bac25af3bef-storage-container/fynapse/BusinessEvent/error/20221129144507/test-fynapse-data_11-error.csv

Path for receipt files

namespace/entityName/done/dateTime/

Full S3 URI pattern

s3://bucketName/namespace/entityName/done/dateTime/fileName-receipt.csv

Example:

s3://793f2445-52f3-4964-92cc-7bac25af3bef/fynapse/BusinessEvent/done/20221213104843/test-fynapse-data_14-receipt.csv

Full Azure Blob URI pattern

https://storageAccountName.blob.core.windows.net/storageContainerName/namespace/entityName/done/dateTime/fileName-receipt.csv

Example:

https://i793f244552f3496492cc7ba.blob.core.windows.net/i793f244552f3496492cc7bac25af3bef-storage-container/fynapse/BusinessEvent/done/20221213104843/test-fynapse-data_14-receipt.csv
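
When monitoring the storage from a script, it can help to split a path into its parts and tell data, receipt, and error files apart. The sketch below does this for the in, done, and error path patterns described in this section. The type and function names are hypothetical, and the in progress directory is left out because its path is not described here.

```python
from typing import NamedTuple, Optional


class IngestionPath(NamedTuple):
    namespace: str
    entity_name: str
    status: str               # "in", "done", or "error"
    date_time: Optional[str]  # yyyymmddhhmmss folder; None for "in"
    file_name: str


def parse_key(key: str) -> IngestionPath:
    """Split a path such as namespace/entityName/done/dateTime/fileName.csv."""
    parts = key.strip("/").split("/")
    if len(parts) == 4 and parts[2] == "in":
        namespace, entity_name, status, file_name = parts
        return IngestionPath(namespace, entity_name, status, None, file_name)
    if (len(parts) == 5 and parts[2] in ("done", "error")
            and len(parts[3]) == 14 and parts[3].isdigit()):
        return IngestionPath(*parts)
    raise ValueError(f"Unrecognized ingestion path: {key!r}")


def file_kind(path: IngestionPath) -> str:
    """Classify a file as "receipt", "error", or "data" by its name suffix."""
    if path.file_name.endswith("-receipt.csv"):
        return "receipt"
    if path.file_name.endswith("-error.csv"):
        return "error"
    return "data"


receipt = parse_key(
    "fynapse/BusinessEvent/done/20221213104843/test-fynapse-data_14-receipt.csv"
)
print(receipt.date_time, file_kind(receipt))  # 20221213104843 receipt
```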

To learn where extract files are placed in the dedicated cloud storage, refer to the Extract Files Placement section in the Data Extraction chapter.
