Hello EO-Masters!
Recently I stumbled upon a blog post written by Sinergise for the Copernicus Data Space Ecosystem (CDSE), titled: Why it’s time to switch your EO data processing and analysis to the cloud
But I ask myself: is it really like that? I will discuss this scenario, its pros and cons, and compare the processing task presented in the CDSE article with a processing task I recently did. I would also like to hear what you think about this topic, so let's discuss it in the comments.
Before I start the analysis, I want to make it clear that I am not questioning the usefulness or quality of CDSE, Sentinel Hub, or any other cloud processing provider. There are good reasons to use cloud processing. I just want to point out that there are also reasons not to move into the cloud, and I want to encourage a discussion with the EO community about this. In particular, I would like to hear opinions other than mine.
Reasons to Move into the Cloud
In the Sinergise article an example is introduced that produces a simple harvest detection map. For the detection, the Normalized Difference Vegetation Index (NDVI) and the Bare Soil Index (BSI) are calculated, and based on thresholds it is decided for two scenes whether a pixel is classified as harvested or not.
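As a rough sketch of how such a threshold classification can look in code (the band math follows the common NDVI and BSI definitions; the threshold values are placeholders of my own, not the ones from the article):

```python
import numpy as np

def harvest_mask(blue, red, nir, swir, ndvi_max=0.4, bsi_min=0.0):
    """Flag pixels as harvested based on NDVI and BSI thresholds."""
    ndvi = (nir - red) / (nir + red)
    # Common Bare Soil Index formulation; the article may use a variant.
    bsi = ((swir + red) - (nir + blue)) / ((swir + red) + (nir + blue))
    # Weak vegetation signal plus strong bare-soil signal -> likely harvested
    return (ndvi < ndvi_max) & (bsi > bsi_min)
```

For the two-scene comparison in the article, one would evaluate this mask for both dates and, for example, flag pixels that switch from "not harvested" to "harvested".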
The workflow in the cloud is much simpler and involves fewer steps compared to classical processing: no download and validation of input data, no image stacking and mosaicking. In the cloud we only need to define the area, the dates, and the algorithm. Afterwards we get a result very quickly. That’s great.
I will not go further into the details of the processing, as this has already been done in the mentioned article. Instead, I will focus on the next chapter.
Reasons Not to Move into the Cloud
There are three main reasons why, in general, it does not yet make sense to switch completely to the cloud.
Costs
The cost of processing is not mentioned in the article, but I think it should be considered. For the presented simple task it can be left out, but what if you run a more complex real-world example? Is the processing still covered by a free account? Later in this post I will come back to the costs of processing in the cloud.
Data Availability
Sentinel Hub already offers various data sets: Sentinel, MODIS, Envisat, Landsat, and more. But, you know, Murphy’s Law: there is always some data set that is critical to your processing and not available. Sure, you can provide it yourself, but at additional effort and additional cost.
Development
While developing an algorithm, it is necessary to test the processing repeatedly on a small area.
Depending on the complexity, this can be very quick when using cloud processing. But it is often still faster to run the test processing locally, especially when using so-called unit-level testing. Otherwise, the development time increases: even waiting only 30 seconds for each test adds up.
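To illustrate what unit-level testing means here: a hypothetical pytest case like the one below runs in milliseconds on a laptop, with no round trip to any cloud API (it assumes the harvest_mask sketch from above lives in a module called harvest):

```python
import numpy as np
from harvest import harvest_mask  # hypothetical module with the sketch above

def test_bare_soil_pixel_is_flagged_as_harvested():
    # Reflectances mimicking bare soil: weak NIR, strong red and SWIR.
    blue, red = np.array([[0.12]]), np.array([[0.30]])
    nir, swir = np.array([[0.32]]), np.array([[0.40]])
    assert harvest_mask(blue, red, nir, swir).all()

def test_vegetated_pixel_is_not_flagged():
    # Healthy vegetation: strong NIR, weak red.
    blue, red = np.array([[0.04]]), np.array([[0.05]])
    nir, swir = np.array([[0.50]]), np.array([[0.15]])
    assert not harvest_mask(blue, red, nir, swir).any()
```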
In the past I have experienced that it could take 15 minutes or even up to an hour until the processing was started. This slowed down the development process in the project so much that we implemented the algorithm twice: once for fast local testing and once for production in the cloud. This has improved today, but it still needs to be considered.
And while we are talking about the development process: going into the cloud often forces you to use a specific programming language. In many cases this is Python, which is not a big issue for data scientists; they are used to it. But in the case of Sentinel Hub it is JavaScript, a language most scientists first need to learn. Also, the support you get from the IDE during development is limited in the cloud. A local IDE can help you better when refactoring code or debugging problems. This is changing, and cloud-based IDEs can be very good, but they are not yet usable in all cases.
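To make this concrete, here is a minimal sketch of a Process API request using the sentinelhub Python package: the request itself is Python, but the per-pixel logic must be written as a JavaScript evalscript. The area, dates, and data collection are placeholders of mine, not the setup from the article.

```python
from sentinelhub import (
    CRS, BBox, DataCollection, MimeType, SentinelHubRequest, SHConfig
)

# The per-pixel algorithm is a JavaScript "evalscript", even though the
# request is driven from Python.
evalscript = """
//VERSION=3
function setup() {
  return {
    input: ["B04", "B08"],
    output: { bands: 1, sampleType: "FLOAT32" }
  };
}
function evaluatePixel(sample) {
  return [(sample.B08 - sample.B04) / (sample.B08 + sample.B04)];
}
"""

request = SentinelHubRequest(
    evalscript=evalscript,
    input_data=[SentinelHubRequest.input_data(
        data_collection=DataCollection.SENTINEL2_L2A,
        time_interval=("2023-06-01", "2023-06-30"),
    )],
    responses=[SentinelHubRequest.output_response("default", MimeType.TIFF)],
    bbox=BBox((13.35, 52.45, 13.45, 52.55), crs=CRS.WGS84),  # placeholder AOI
    size=(512, 512),
    config=SHConfig(),  # credentials from the local sentinelhub profile
)
ndvi = request.get_data()[0]  # numpy array with the NDVI values
```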
What if you want to use a specific library or executable in your processing that is not available in the cloud or in the required programming language? Then you are stuck, you need workarounds, or you rely on what the cloud provider offers. In that case the provider becomes a gatekeeper: for example, only the atmospheric corrections provided by the cloud provider can be used, and the provider decides which one becomes popular.
Comparison of Processing Tasks
Recently, I processed a global coastal map. It provides the usual land/water flag and a coastline flag, but also a flag for intertidal areas and a vicinity indicator for land/water pixels. I will not go into details here, but you can follow me on social media to get notified when it is released.
This processing was not very complex, and certainly not rocket science. I think it is just an average processing task, but on a global scale.
In the CDSE article it is pointed out that in the traditional workflow the download time of the data is several orders of magnitude larger than the processing time. Let's have a look.
The article assumes that you have a 100 Mbit/s connection. That’s okay, but one can already assume 250 Mbit/s, especially for people who often work with large datasets. Even in Germany, where internet speed is a meme, 250 Mbit/s is quite common.
But the internet connection doesn’t help much when data providers limit the throughput and the number of simultaneous connections. So, in the end, 100 Mbit/s is a fair number.
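To put rough numbers on this, here is a back-of-the-envelope calculation; the 500 GB data volume is an assumption of mine for illustration, not a figure from the article:

```python
# Estimate download times for an assumed 500 GB of input data.
volume_gb = 500
for mbit_per_s in (100, 250):
    seconds = volume_gb * 8 * 1000 / mbit_per_s  # GB -> Gbit -> Mbit
    print(f"{volume_gb} GB at {mbit_per_s} Mbit/s: {seconds / 3600:.1f} h")
# 500 GB at 100 Mbit/s: 11.1 h
# 500 GB at 250 Mbit/s: 4.4 h
```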
The article shows a graph comparing the processing and download times of the traditional method and the cloud approach. The following image is a subset of this graph, focusing on the traditional method.
This is a reasonable ratio for small processing tasks with low complexity. But often the computation takes longer than calculating an NDVI. For my coastal map, the graph looks different.
Downloading the data took only about 32 hours, maybe less; I’m not sure anymore. But the processing took ~2500 hours, roughly 80 times longer. This is the total opposite of the example in the article.
Both examples sit at opposite ends of the performance comparison, and the average use case lies somewhere in between. But I think in most cases the processing time is far more significant than in the harvest detection example.
Even though processing the coastal map is not possible on Sentinel Hub, because not all input data is available and morphological functions are not supported, I will calculate the cost as if it were possible. I calculate the costs using the pricing tables and examples provided by Sentinel Hub.
| Parameter | Quantity | Factor | Details |
|---|---|---|---|
| Output size (width x height) | 36000 x 36000 px | x 4943 | One tile is 36000 x 36000 / (512 x 512) = 4943.847 |
| Number of input bands | 1 | x 1/3 | The processing needs multiple inputs that are not available on Sentinel Hub, but I count it as if they were there. |
| Output format | 16-bit | x 1 | Only 16-bit is used; a 32-bit float TIFF is not needed. |
| Number of data samples | 2 | x 2 | Two data samples, one for the flags and one for the vicinity indicator, give a multiplication factor of 2. |
| Orthorectification | No | x 1 | Orthorectification is not requested, which results in a multiplication factor of 1. |
| Single tile | | = 3295 PU | To calculate the processing units for a single tile request, multiply all individual factors: 4943 x 1/3 x 1 x 2 x 1 = 3295.33 |
| For the globe | | x 7200 | There are 7200 tiles |
| Total | | = 23,724,000 PU | 3295 x 7200 |
Using the price of 0.001 € for a single PU in the Enterprise L plan, the price for a single tile is 3.29 €. This still sounds reasonable, but when doing the processing on a larger scale, it adds up to 23,724.00 € for the globe.
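For reference, the same arithmetic as a small script, following the table's intermediate rounding, so you can play with the factors yourself:

```python
# Reproduce the PU estimate from the table above.
area_factor = (36000 * 36000) // (512 * 512)  # 4943; one PU covers 512 x 512 px
band_factor = 1 / 3   # a single input band
sample_factor = 2     # two data samples: flags and vicinity
# 16-bit output and no orthorectification both contribute a factor of 1.

pu_per_tile = int(area_factor * band_factor * sample_factor)  # 3295
tiles = 7200
total_pu = pu_per_tile * tiles  # 23,724,000 PU

price_per_pu = 0.001  # EUR, Enterprise L plan
print(f"Per tile: {pu_per_tile * price_per_pu:.2f} EUR")  # ~3.30 EUR
print(f"Globe: {total_pu * price_per_pu:,.2f} EUR")       # 23,724.00 EUR
```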
Maybe by contacting the cloud provider it would be possible to get special conditions for such a processing task. But how much would it be? Maybe I would end up at 20K€, or at best at 15K€. Still too much for me; I couldn’t afford it.
I also wonder whether the prices will stay as they are or go up in the future. At the moment, such processing services are often subsidised by ESA projects, or the providers want to gain market share. But this can change, and prices can rise, like the prices of streaming services such as Netflix or Spotify.
For example, you get 10,000 PUs when you register for a free CDSE account, and you can use the Processing API. At Sentinel Hub you only get 5,000 PUs with a free account, and you cannot use the Processing API; for that you would need to subscribe to the Exploration price plan, which costs 30 €/month or 300 €/year (VAT not included). It also includes some more features compared to the free CDSE account, but considering this price, the free CDSE account is subsidised by something around 5 € per month. Probably less, because of contractual arrangements, but you get the idea. How long will ESA/Copernicus keep doing this?
Summary
Overall, I would say there is no black or white. It is not true that you should do all your processing in the cloud, but doing all processing locally is not the solution either. Especially for small tasks, cloud processing is great. I think of first-year students who can already do great processing in the cloud that wouldn’t be possible locally. But there are also still many cases where it is beneficial to use your local computer.
A simple cluster that is easy to set up, and that allows using the existing network resources in an office or a shared apartment, would be nice. This would allow massive processing even at low cost. But I’m not aware of such a solution; there is always a big overhead involved. Do you know one?
That’s it. If you have a different opinion than mine, that’s fine; let me know in the comments. You can also tell me if you share my opinion. Just let us discuss in a friendly way.
Tschüss & Goodbye
Marco