NUTRINTG (1) File Fetcher
Short Summary
Input | Epsilon SFTP of a specific region |
---|---|
Output | specific directory structure on S3 in a single multi-region bucket |
Region-awareness | 4 instances with different Spring Boot profiles (amer, sea, eu, us) |
Scalability | possible per region, but scaling up makes little sense because files appear too infrequently |
Metadata Collections | |
Flow Chart
1. Poll a file from SFTP every N seconds. The interval is set via the process.fixedDelay property and the source directory via the process.remoteDirectory property. On production, process.fixedDelay is currently set to 5 seconds and process.remoteDirectory is set to outgoing for every region. The order in which files are polled depends on the underlying Spring Integration implementation.
2. Proceed if the file name matches a regex pattern; otherwise skip the file in this iteration. The file name pattern is specified by the process.fileNamePattern property and on production looks like:
   US -
   (?<filename>rb_(?<region>amer_usa)_.*_(?<date>[0-9]{8})[0-9]{6}(\\.md5|\\.dat\\.pgp))
   SEA -
   (?<filename>rb_(?<region>sea)_.*_(?<date>[0-9]{8})[0-9]{6}(\\.md5|\\.dat\\.pgp))
   EU -
   (?<filename>rb_(?<region>eu)_.*_(?<date>[0-9]{8})[0-9]{6}(\\.md5|\\.dat\\.pgp))
   AMER -
   (?<filename>rb_(?<region>amer)_.*_(?<date>[0-9]{8})[0-9]{6}(\\.md5|\\.dat\\.pgp))
3. Proceed if the file metadata is not in the metadata store; otherwise skip the file in this iteration. The name of the metadata collection in MongoDB is specified by the s3.fileMetadataStoreCollection property and has the value cdpToSfmc_fileMetaDataStore for all regions on production.
4. Proceed if the file transfer is not already in progress in another thread; otherwise skip the file in this iteration. The name of the progress mark collection in MongoDB is cdpToSfmcFileFetcher_progressMark.
5. Upload the file to S3 via multipart upload.
6. If the upload fails, check whether a file with the same name already exists on S3. Files are generally overwritten on S3, so an existing file should not cause an exception by itself. The only situation in which the file can already be on S3 is when it was uploaded earlier but its metadata was not stored in MongoDB because of an unexpected error during the save operation. If the file is already on S3, delete it and log the operation; otherwise just finish. No metadata is stored for the file, so the process will eventually run for it again.
7. If the upload succeeds, save the metadata to the metadata store in MongoDB and finish. If saving the metadata fails, retry the number of times specified by the process.retryCount property.
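The decision order of the flow above can be sketched in plain Java. This is only an illustration, not the service's code: the class FetchFlowSketch, the Outcome enum, the in-memory sets standing in for the two MongoDB collections and the S3 bucket, and the upload and saveMeta callbacks are all hypothetical. It also assumes that process.retryCount means "one initial attempt plus that many retries" and that a failed upload leaves the progress mark behind.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class FetchFlowSketch {
    enum Outcome { SKIPPED_NAME, SKIPPED_KNOWN, SKIPPED_IN_PROGRESS, UPLOAD_FAILED, METADATA_FAILED, DONE }

    // Hypothetical in-memory stand-ins for the two MongoDB collections and the S3 bucket.
    final Set<String> metadataStore = new HashSet<>();
    final Set<String> progressMarks = new HashSet<>();
    final Set<String> s3Objects = new HashSet<>();

    private final Pattern fileNamePattern;    // value of process.fileNamePattern
    private final int retryCount;             // value of process.retryCount
    private final Predicate<String> upload;   // hypothetical: returns false when the upload fails
    private final Predicate<String> saveMeta; // hypothetical: returns false when the save fails

    FetchFlowSketch(Pattern fileNamePattern, int retryCount,
                    Predicate<String> upload, Predicate<String> saveMeta) {
        this.fileNamePattern = fileNamePattern;
        this.retryCount = retryCount;
        this.upload = upload;
        this.saveMeta = saveMeta;
    }

    Outcome process(String fileName) {
        if (!fileNamePattern.matcher(fileName).matches()) return Outcome.SKIPPED_NAME;
        if (metadataStore.contains(fileName)) return Outcome.SKIPPED_KNOWN;
        // Set.add returns false when the mark already exists, i.e. another worker holds it.
        if (!progressMarks.add(fileName)) return Outcome.SKIPPED_IN_PROGRESS;

        if (!upload.test(fileName)) {
            // Remove a leftover object from an earlier run whose metadata save failed.
            if (s3Objects.remove(fileName)) {
                System.out.println("Deleted leftover S3 object: " + fileName);
            }
            // Assumption: the progress mark stays behind and later expires via its TTL.
            return Outcome.UPLOAD_FAILED;
        }
        s3Objects.add(fileName);

        // Assumption: one initial attempt plus retryCount retries.
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            if (saveMeta.test(fileName)) {
                metadataStore.add(fileName);
                progressMarks.remove(fileName);
                return Outcome.DONE;
            }
        }
        progressMarks.remove(fileName); // assumption: the mark is released so a later poll can retry
        return Outcome.METADATA_FAILED;
    }

    public static void main(String[] args) {
        Pattern amer = Pattern.compile(
            "(?<filename>rb_(?<region>amer)_.*_(?<date>[0-9]{8})[0-9]{6}(\\.md5|\\.dat\\.pgp))");
        FetchFlowSketch flow = new FetchFlowSketch(amer, 3, f -> true, f -> true);
        System.out.println(flow.process("rb_amer_actdet_20211001120000.dat.pgp")); // first poll
        System.out.println(flow.process("rb_amer_actdet_20211001120000.dat.pgp")); // metadata exists
        System.out.println(flow.process("notes.txt"));                             // name mismatch
    }
}
```

In this sketch the cheap checks (name, metadata, progress mark) run before any upload work, which is why re-polling the same directory every few seconds stays inexpensive.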
Support Tips and Notes
As there are 4 regions, there are 4 instances of this service and 4 Spring profiles, each with properties set to a different SFTP user and password. The properties are available at https://bitbucket.org/rbdigital/spring-boot-cdp-to-sfmc-integration. Production logs can be viewed and downloaded from https://kubernetes-dashboard-production.frankfurt.rbdigitalcloud.com under the namespace cdp-to-sfmc-integration. To get access, contact the platform team.
To re-download files to S3, delete the corresponding metadata documents from the cdpToSfmc_fileMetadataStore collection; the files on S3 will then be overwritten. Deleting files from S3 without deleting their metadata will not trigger a re-download. Example query to delete the metadata of all files dated 20211001 for region amer:
db.cdpToSfmc_fileMetadataStore.deleteMany({"_id": {"$regex": "rb_amer_.*20211001.*"}})
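Before running a delete like this, it can help to check which document ids the regex would actually hit. The following is a minimal sketch, assuming the _id holds the file name (as the query suggests); the class DeleteFilterPreview and its methods are hypothetical helpers, not part of the service. MongoDB's $regex is unanchored, so the sketch uses Matcher.find() rather than matches().

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class DeleteFilterPreview {
    // Builds the same kind of regex as the example query, e.g. rb_amer_.*20211001.*
    static String filterRegex(String region, String yyyymmdd) {
        return "rb_" + region + "_.*" + yyyymmdd + ".*";
    }

    // Returns the ids an unanchored regex filter would select.
    static List<String> preview(String regex, List<String> ids) {
        Pattern p = Pattern.compile(regex);
        return ids.stream().filter(id -> p.matcher(id).find()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ids = List.of(
            "rb_amer_actdet_20211001120000.dat.pgp",
            "rb_amer_usa_actdet_20211001120000.dat.pgp",
            "rb_sea_actdet_20211001120000.dat.pgp",
            "rb_amer_actdet_20211002120000.dat.pgp");
        preview(filterRegex("amer", "20211001"), ids).forEach(System.out::println);
    }
}
```

With these sample ids the preview selects the first two entries: the amer regex also matches names starting with rb_amer_usa_. Whether that matters depends on whether US and AMER metadata live in the same collection and database, which is not stated here.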
If a file download fails, a document may remain in the cdpToSfmcFetcher_progressMark collection. It has a TTL of one hour, so either wait for it to expire or delete it manually before retrying the download.
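The one-hour rule can be expressed as a small calculation. This is a sketch only: the class ProgressMarkTtl and its methods are hypothetical, and it assumes the creation time of the progress mark document is known.

```java
import java.time.Duration;
import java.time.Instant;

final class ProgressMarkTtl {
    // One-hour TTL as stated in the note above.
    static final Duration TTL = Duration.ofHours(1);

    // Earliest moment at which a leftover mark no longer blocks a retry.
    static Instant earliestRetry(Instant markCreatedAt) {
        return markCreatedAt.plus(TTL);
    }

    // True while the leftover mark would still cause the file to be skipped.
    static boolean blocksRetry(Instant markCreatedAt, Instant now) {
        return now.isBefore(earliestRetry(markCreatedAt));
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2021-10-01T12:00:00Z");
        System.out.println(earliestRetry(created));
        System.out.println(blocksRetry(created, Instant.parse("2021-10-01T12:30:00Z")));
    }
}
```

If the TTL is implemented as a MongoDB TTL index (an assumption), the actual removal can lag a little behind the computed time, because MongoDB's background TTL task runs periodically rather than at the exact expiry moment.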
A file appears on S3 only once it has been fully downloaded. Files generally appear on Epsilon SFTP only when they are complete and ready, not while they are still being written. However, if we upload a file to Epsilon SFTP manually, it may be picked up too early and appear on S3 with a size of 0 bytes. In this case the file has to be re-downloaded.
A file with an example name of rb_amer_actdet_20211001120000.dat.pgp will appear in the S3 bucket under the path amer/20211001/rb_amer_actdet_20211001120000.dat.pgp.
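The path in this example lines up with the region, date and filename named groups of the production file name pattern. The sketch below shows one way such a key could be derived; it is an assumption that the service builds the key from these groups, and the class S3KeySketch and the method toS3Key are hypothetical. The pattern is the AMER production pattern; the doubled backslashes are the Java string-literal form of the regex escape.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class S3KeySketch {
    // AMER production pattern; "\\." in a Java literal is the regex "\.".
    static final Pattern AMER = Pattern.compile(
        "(?<filename>rb_(?<region>amer)_.*_(?<date>[0-9]{8})[0-9]{6}(\\.md5|\\.dat\\.pgp))");

    // Returns region/date/filename for a matching name, or empty when the file would be skipped.
    static Optional<String> toS3Key(Pattern pattern, String fileName) {
        Matcher m = pattern.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group("region") + "/" + m.group("date") + "/" + m.group("filename"));
    }

    public static void main(String[] args) {
        System.out.println(toS3Key(AMER, "rb_amer_actdet_20211001120000.dat.pgp").orElse("skipped"));
        System.out.println(toS3Key(AMER, "rb_sea_actdet_20211001120000.dat.pgp").orElse("skipped"));
    }
}
```

For the first name the sketch yields amer/20211001/rb_amer_actdet_20211001120000.dat.pgp, matching the example path; the second name does not match the AMER pattern and is reported as skipped.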