Sometimes I need to keep a local copy of an S3 bucket. Using the AWS console is OK if you have just a few objects in the S3 bucket. But what do you do if you have hundreds of objects in your S3 bucket? The AWS CLI comes to the rescue with this simple command:
aws s3 cp --recursive s3://my_s3_bucket .
The recursive flag downloads the entire S3 bucket recursively into the local directory (that’s what the dot at the end is for). The operation may take some time depending on the number of objects stored in the S3 bucket so be patient!
I recently wrote a blog post for the AWS blog. The blog post is available on the AWS Public Sector blog and describes how we are using AWS in the Dictionaries department of Oxford University Press to make high-quality language data available to licensees, software developers, and the wider public.
As I’m using Jenkins Pipelines more and more often, I found the need to validate a Jenkinsfile in order to fix syntax errors before committing the file into version control and running the pipeline job in Jenkins. This can save some time during development and helps you follow best practices when writing a Jenkinsfile, as validation returns both errors and warnings.
There are several ways to validate a Jenkinsfile and some editors like VS Code even have a built-in linter. Personally, the easiest way I found to validate a Jenkinsfile is to run the following command from the command line (provided you have Jenkins running somewhere):
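As a rough sketch of one way to lint a Jenkinsfile remotely, here is a small Python version. It assumes the Jenkins instance has the Declarative Pipeline plugin installed, which exposes a pipeline-model-converter/validate endpoint that accepts the file in a form field named jenkinsfile. The function names, the Jenkins URL, the user name, and the API token are placeholders of mine, and depending on how your Jenkins is configured you may also need a CSRF crumb.

```python
import base64
import urllib.request
import uuid


def build_multipart(field_name, content, boundary):
    """Encode a single text form field as a multipart/form-data body."""
    lines = [
        "--" + boundary,
        'Content-Disposition: form-data; name="%s"' % field_name,
        "",
        content,
        "--" + boundary + "--",
        "",
    ]
    return "\r\n".join(lines).encode("utf-8")


def validate_jenkinsfile(path, jenkins_url, user, api_token):
    """POST the Jenkinsfile to Jenkins and return the linter's text response."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    boundary = uuid.uuid4().hex
    request = urllib.request.Request(
        jenkins_url.rstrip("/") + "/pipeline-model-converter/validate",
        data=build_multipart("jenkinsfile", content, boundary),
        method="POST",
    )
    request.add_header("Content-Type", "multipart/form-data; boundary=" + boundary)
    # Basic authentication with a Jenkins user name and API token.
    credentials = base64.b64encode(("%s:%s" % (user, api_token)).encode("utf-8"))
    request.add_header("Authorization", "Basic " + credentials.decode("ascii"))
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example call (hypothetical URL and credentials):
# print(validate_jenkinsfile("Jenkinsfile", "http://localhost:8080", "user", "api-token"))
```

The response is plain text listing any errors found, so it can be printed as is or checked in a pre-commit hook.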
This weekend I am attending Markup UK at King’s College London, a two-day conference on XML and other markup technologies. I am presenting a paper on running XSpec tests in a serverless environment with AWS Lambda (which I blatantly titled XSpec in the Cloud with Diamonds). The paper is available here and the slides of my presentation are available here.
Somewhere I read that sending unencrypted email is like sending postcards: anyone can potentially read them. This is not nice for privacy but becomes very dangerous when the content of the email or attached files contains secrets like passwords, access keys, etc. Anyone who can get hold of your email can also potentially access your systems.
For sending encrypted email I generally use Enigmail, which is a data encryption and decryption extension for the Thunderbird email client. I have also used Mailvelope, an add-on for Firefox and Chrome that integrates encryption into webmail providers such as Gmail, Outlook, etc. These tools simplify the encryption/decryption process, especially if you are not familiar with it.
However, I have sometimes needed to encrypt large files containing data dumps. The challenge with email extensions is that they don’t allow you to send email with such huge attachments. Plus, Mailvelope doesn’t allow you to encrypt files larger than 25 MB. This is when knowing how to encrypt and decrypt a file via the command line comes in handy. You can easily upload a large encrypted file to an FTP server or cloud hosting service without worrying that the file will end up in the wrong hands. As a bonus, an encrypted file is generally smaller than the original, because GPG compresses the data by default before encrypting it, so the upload is also quicker.
The encryption process requires you to first get the GPG public key of the person you want to send the encrypted file or email to. Once you have the recipient’s public key, you can encrypt a file with that key. You send the email or upload the file and then ask the recipient to decrypt it at their end using their GPG private key. I’m going to cover both processes. Note that this is also useful for encrypting the content of an email that you want to keep secret and sending it as an attachment in a non-encrypted email.
Generate GPG public and private keys
Install gpg or gpg2 on Linux or macOS. This is generally part of the standard packages, for example on Ubuntu:
sudo apt install gnupg2
If you are on Windows, you can use Cygwin and install gpg or use the GnuPG utility which should work similarly (although I have not tried it).
Generate a GPG key and follow the instructions. I recommend selecting RSA and RSA (default) as the kind of key and 4096 as the key size:
You should now have two files in .gnupg within your home directory (e.g.
Verify your public key with:
Verify your private key with:
Encrypt and decrypt files
You have received a public key from someone and you want to encrypt a file with their public key in order to transmit it securely. The file containing the public key will typically have an extension
Import the public key (e.g. someonekey.asc is the filename of the key):
Trust the public key (email@example.com is the email associated with the key and should be shown as output of the import command):
You’ll get a prompt. Type trust and select 5 = I trust ultimately. Type quit to exit.
Encrypt the file with the public key of the user (replace the email address with the email address of the user associated with the public key):
This will generate an encrypted file mysecretdocument.txt.gpg which is smaller than the original file. Transmit the encrypted file and tell the user to decrypt it at their end with the following command:
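The import, encrypt, and decrypt steps can also be scripted. The following is a minimal sketch in Python that only builds the gpg command lines and hands them to subprocess. The helper names are mine, the file names and the email address are the placeholders used above, and it assumes gpg is on the PATH.

```python
import subprocess


def gpg_import_cmd(key_file):
    """Command that imports a recipient's public key from a file."""
    return ["gpg", "--import", key_file]


def gpg_encrypt_cmd(recipient, path):
    """Command that encrypts path for recipient; gpg writes path + '.gpg'."""
    return ["gpg", "--encrypt", "--recipient", recipient, path]


def gpg_decrypt_cmd(encrypted_path, output_path):
    """Command that decrypts encrypted_path and writes the result to output_path."""
    return ["gpg", "--output", output_path, "--decrypt", encrypted_path]


def run(cmd):
    """Run a gpg command and raise CalledProcessError if gpg reports a failure."""
    subprocess.run(cmd, check=True)


# Sender side (placeholder file name and address):
# run(gpg_import_cmd("someonekey.asc"))
# run(gpg_encrypt_cmd("email@example.com", "mysecretdocument.txt"))
# Recipient side:
# run(gpg_decrypt_cmd("mysecretdocument.txt.gpg", "mysecretdocument.txt"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. The trust step is interactive, so it is left out of the sketch.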
We were kindly hosted by Perusha and Ming from the local Google Developer Group Meetup and got lots of interesting questions from the audience. This also gave us a few ideas on how to further develop the Oxford Dictionaries API and what developers may want to do with it in the fields of Natural Language Processing and Machine Learning.
This week I am attending XML Prague at the University of Economics College campus in Prague, a conference on markup languages and data on the web. Together with other XSpec developers I am organising the XSpec Users Meetup. I’m also giving a lightning talk in the Schematron Users Meetup on how to test Schematron with XSpec.
Slides of the XSpec Users Meetup are available here whereas my lightning talk on testing Schematron with XSpec is available here.
Sometimes you prefer not to update a specific package in Linux. This may be because you don’t want to upgrade to a new version with new features but no security updates. Or maybe because upgrading requires a service restart that you want to avoid for now. This was the case for me recently when a new version of Docker came out and upgrading would have restarted the Docker daemon and stopped the running containers.
It is possible to exclude a package from being updated. On Linux RPM systems (RedHat, CentOS, Fedora, etc.) this is the command to install all updates but exclude a specific package (say docker):
sudo yum update --exclude=docker
On Debian-like systems (Debian, Ubuntu, Mint, etc.) it is slightly more convoluted because you need to hold a package first and then upgrade the system:
sudo apt-mark hold docker && sudo apt-get upgrade
and remember to remove the hold (with apt-mark unhold) when you’re ready to upgrade that package too
This weekend I am attending XML London at University College London, a two-day conference on XML, Open Data, Digital Publishing, and Data Management. I am presenting a paper on XSpec. The paper is available here and the slides of my presentation are available here.
Serverless is gaining attention as the next big thing in the DevOps space after containers. Developers are excited because they don’t have to worry about servers any more; Ops may be sceptical and slightly worried to hear about a world without servers (and without the sysadmins maintaining them). Can these two worlds co-exist? Can serverless just be another tool in the DevOps toolkit?
I recently implemented a real use case at work where we took advantage of an event-driven workflow to trigger Jenkins jobs originally created to be executed manually or on a schedule. The workflow is as follows:
1. New data is uploaded to an S3 bucket
2. The S3 event calls a lambda function that triggers a Jenkins job via the Jenkins API
3. The Jenkins job validates the data according to various criteria
4. If the job passes, the data is uploaded to an S3 bucket and a success message is sent to a Slack channel
5. If the job fails, a message with a link to the failed job is sent to a Slack channel
Let’s start by creating a new user with the correct permissions in Jenkins. This makes it possible to restrict what the lambda function can do in Jenkins.
In Manage Jenkins -> Manage Users -> Create User I create a user called
In Manage Jenkins -> Configure Global Security -> Authorization -> Matrix-based Security add the user lambda to User/group to add and set the permissions as in the matrix below:
This is a minimal setup and allows the lambda user to build jobs. Depending on your security policies, you may want to further restrict the permissions of the lambda user so that it can run only some specific jobs (you may need role-based authentication for setting this up).
AWS IAM Role
Now let’s move to AWS and set up an IAM role for the lambda function. Head to IAM -> Roles and create a new role with the following policies (my role name is digiteum-file-transfer; sensitive information is obfuscated for security reasons):
This role grants permission to execute lambda functions and to access S3 buckets and the Virtual Private Cloud (VPC).
I create an empty S3 bucket using the wizard configuration in S3 and name it gadictionaries-leap-dev-digiteum. This is the bucket that is going to trigger the lambda function.
AWS Lambda Configuration
Finally, let’s write the lambda function. Go to Lambda -> Functions -> Create a Lambda Function. Select Python 2.7 (read Limitations to see why I’m not using Python 3) as the runtime environment and select a blank function.
In Configure Trigger, set up the trigger from S3 to Lambda, select the S3 bucket you created above (my S3 bucket is named gadictionaries-leap-dev-digiteum), select the event type to occur in S3 (my trigger is set to respond to any new file drop in the bucket) and optionally select prefixes or suffixes for directories and file names (I only want the trigger to occur on XML files). Here is my trigger configuration:
In Configure Function, choose a name for your function (mine is file_transfer) and check out the following Python code before uploading it:
print('Loading lambda function')
# TODO: private IP of the EC2 instance where Jenkins is deployed, public IP won't work
# TODO: these environment variables should be encrypted in Lambda
# Get the S3 object and its filename from the S3 event
return 'Job Digiteum File Transfer started on Jenkins'
print('Cannot connect to Jenkins server or run build job')
Note the following:
Line 6 imports the python-jenkins module. This module is not in Python’s standard library and needs to be provided within the zip file (more on this in a minute).
Line 12 sets up the URL of the EC2 instance where Jenkins is deployed. Note that you need to use the private IP address as shown in EC2; it won’t work if you use the public IP address or the Elastic IP address.
Lines 15 and 16 set up the credentials of the Jenkins user lambda. The credentials will be exposed to the lambda function as environment variables and, unlike in this example, it is recommended to encrypt them.
Lines 18-31 contain the handler function that is triggered automatically by a new file upload in the S3 bucket. The handler function does the following:
retrieve the filename of the new file uploaded on S3 (lines 21-22)
log into Jenkins via username and password for the lambda user (line 25)
build the job called Digiteum_File_Transfer in the folder Pipeline (line 26)
throw an error if it can’t connect to Jenkins or start the job (lines 28-31)
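The steps of the handler can be sketched as follows. This is an illustration written for Python 3, not the original function: the environment variable names JENKINS_URL, JENKINS_USER, and JENKINS_PASS, the placeholder IP address, and the helper get_s3_object are assumptions of mine, and the python-jenkins module still has to be bundled in the zip file.

```python
import os
import urllib.parse

# Assumption: private IP of the EC2 instance where Jenkins runs (placeholder value).
JENKINS_URL = os.environ.get("JENKINS_URL", "http://10.0.0.1:8080")
# Job Digiteum_File_Transfer inside the Jenkins folder Pipeline.
JOB_NAME = "Pipeline/Digiteum_File_Transfer"


def get_s3_object(event):
    """Return (bucket, key) for the first record of an S3 event."""
    s3 = event["Records"][0]["s3"]
    bucket = s3["bucket"]["name"]
    # Object keys are URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(s3["object"]["key"])
    return bucket, key


def lambda_handler(event, context):
    bucket, key = get_s3_object(event)
    print("New object s3://%s/%s" % (bucket, key))
    try:
        # Imported lazily so get_s3_object can be tested without python-jenkins.
        import jenkins

        server = jenkins.Jenkins(
            JENKINS_URL,
            username=os.environ["JENKINS_USER"],  # hypothetical variable name
            password=os.environ["JENKINS_PASS"],  # hypothetical variable name
        )
        server.build_job(JOB_NAME)
        return "Job Digiteum File Transfer started on Jenkins"
    except Exception:
        print("Cannot connect to Jenkins server or run build job")
        raise
```

Re-raising the exception marks the invocation as failed in Lambda, so the failure also shows up in the function's logs and metrics rather than only in the printed message.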
As an example, here is the zip file to upload in Configure Function. It contains the lambda function and all the Python modules needed, including the python-jenkins module. Make sure you edit the private IP address of your Jenkins instance in line 12. If you need to install additional Python modules, you can follow these instructions.
Here is what my Configure Function looks like:
Note the name (it should read file_transfer instead of file_transfe), the handler (as in the Python code above), and the role (as created in IAM). Note also that the username and the password of the Jenkins user lambda are provided as environment variables (ideally, you should encrypt these values by using the option Enable encryption helpers).
Once you’ve done the basic configuration, click on Advanced Settings. In here you need to select the VPC, subnet, and security group of the EC2 instance where Jenkins is running (all these details about the instance are in EC2 -> Instances). The lambda function needs to run in the same VPC as Jenkins, otherwise it cannot connect to Jenkins. For example, here is what my advanced settings look like (sensitive information is obfuscated):
Finally, review your settings and click on Create Function.
Test the Lambda Function
Once you have created the lambda function, configure a test event to make sure it behaves as intended. Go to Actions -> Configure test event and select S3 Put to simulate a data upload in the S3 bucket. You need to replace the bucket name (in this example gadictionaries-leap-dev-digiteum) and the name of an object in that bucket (in this example I uploaded a file in the bucket and called it test.xml). Here is a test example to adapt:
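As an illustration of the shape of such an event, the sketch below builds a trimmed-down S3 Put event in Python and serialises it to JSON. The S3 Put template in the Lambda console contains many more fields; this version keeps only the ones a handler typically reads (bucket name and object key), and the function name is mine.

```python
import json


def s3_put_test_event(bucket, key):
    """Build a minimal S3 Put event with the fields a handler usually reads."""
    return {
        "Records": [
            {
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket, "arn": "arn:aws:s3:::" + bucket},
                    "object": {"key": key},
                },
            }
        ]
    }


# JSON text that can be pasted into the test event editor.
event_json = json.dumps(
    s3_put_test_event("gadictionaries-leap-dev-digiteum", "test.xml"), indent=2
)
```

Printing event_json gives a document you can adapt in the console, or you can pass the dictionary straight to the handler when testing locally.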
Click on Save and Test and you should see the lambda function in action. Go to Jenkins and check that the job has been executed by user lambda. If it doesn’t work, have a look at the logging in AWS Lambda to debug what went wrong.
Finally, I set up a Slack integration in Jenkins so that every time the Jenkins job is executed, a notification is sent to a Slack channel. This also allows several people to get notified about a new data delivery.
First, install and configure the Slack plugin in Jenkins following the instructions on the GitHub page. The main configuration is done in Manage Jenkins -> Configure System -> Global Slack Notifier Settings. For example, this is my configuration:
Team Subdomain is the name of your Slack account
Channel is the name of your default Slack channel (you can override this in every job)
Integration Token Credential ID is created by clicking Add and creating a token in Jenkins’ credentials. As the message says, it is recommended to use a token for security reasons. Here is an example of a Token Credential ID for Slack in Jenkins:
You typically want to add a notification to a specific Slack channel in your Jenkins job as a post-build action in order to report the result of a job. In Jenkins go to your job’s configuration, add Post-build Actions -> Slack Notifications and use settings similar to these:
This sends a notification to the Slack channel (either the default one set in Global Slack Notifier Settings or a new one set here in Project Channel) every time a job passes or fails. When a notification is sent to Slack, it will look like this:
Now you can keep both technical and non-technical users informed without having to create specific accounts on Jenkins or AWS or spamming users with emails.
I ran into two problems that I have not yet been able to solve due to lack of time. I want to flag them because solving them would improve the lambda function and make it more maintainable. If anyone wants to help me fix them, please send me your comments.
Encryption: I tried to encrypt the Jenkins password but I could not make the lambda function decrypt the password. I set up an encryption key in IAM -> Encryption keys -> Configuration -> Advanced settings -> KMS key and pasted the sample code in the lambda function but the lambda function timed out without giving an error message. I imported the b64decode function from the base64 module in the Python code but there must be an issue with the instruction that decrypts the variable
Python 2.7: I wanted to use Python 3 but I had issues with the installation of some modules. Therefore I used Python 2.7 but the code should be compatible with Python 3 (apart from the imported modules).
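For the encryption issue, decrypting a KMS-encrypted environment variable usually boils down to a single call to the KMS decrypt operation. The sketch below shows that call in isolation. The function name decrypt_env and the variable name JENKINS_PASS are hypothetical, the client is passed in so the logic can be tried without AWS, and newer versions of the console's encryption helpers may additionally expect an encryption context. One possible cause of a silent timeout, offered as a guess rather than a diagnosis, is that a function attached to a VPC can only reach the KMS API if the VPC provides a route to it, for example through a NAT gateway or a VPC endpoint.

```python
import os
from base64 import b64decode


def decrypt_env(name, kms_client):
    """Decrypt a base64-encoded, KMS-encrypted environment variable.

    kms_client is expected to behave like boto3.client("kms").
    """
    ciphertext = b64decode(os.environ[name])
    response = kms_client.decrypt(CiphertextBlob=ciphertext)
    return response["Plaintext"].decode("utf-8")


# Inside Lambda (hypothetical variable name):
# import boto3
# password = decrypt_env("JENKINS_PASS", boto3.client("kms"))
```

Doing the decryption once at module load time, outside the handler, avoids paying for a KMS call on every invocation.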
Integrating AWS Lambda and Jenkins requires a little bit of configuration but I hope this tutorial may help other people to set it up. If the integration needs to be done the other way round (i.e. trigger a lambda function from a Jenkins job), check out the AWS Lambda Plugin.
I believe integrating AWS Lambda (or any FaaS) with Jenkins (or any CI/CD server) is particularly suited for the following use cases:
Organisations that already have some DevOps practices in place and a history of build jobs but want to take advantage of the serverless workflow without completely re-architecting their infrastructure.
CI/CD pipelines that need to be triggered by events but are too complex or long to be crammed into a single function.