In this post I’ll show you how to write a program that automatically collects data from a website and then processes it. More specifically, I’ll show you how I wrote a PHP script that periodically downloads half-hourly METAR reports from the website of the Slovak Hydrometeorological Institute, and how I then processed the reports to plot how temperature and pressure change over time. You can easily adapt this to any kind of data collection you like, for example logging exchange rates over time. So, let’s start!
Introduction
If you’re reading this, then probably there’s some data that you would like to collect over time, whether it’s exchange rates, meteorological data such as temperature, or some other information which changes with time. Ideally, this information should be accessible on some website and the website’s structure shouldn’t be changing, which is the case for example with RSS feeds or websites whose content is updated automatically.
In this case, we’ll use the website of the Slovak Hydrometeorological Institute (SHMU) which provides half-hourly meteorological reports in the so-called METAR format, which contains the date and time of the measurement and the measured data such as temperature, pressure, visibility, wind speed, etc. The institute has a specific webpage which is automatically updated every hour and which we’ll use to extract the data we want: http://www.shmu.sk/sk/?page=483. A screenshot from the site is shown below.
Notice that the website contains measurements for two consecutive times which differ by half an hour. In this case, the measurements were taken at 16:30 and 17:00 UTC (see image above), so we only need to download the reports at one-hour intervals to cover the measurements from each half hour.
Also notice that each set of measurements contains METAR data measured at different places, each denoted by the 4-letter code following the “METAR” directive. In this case the codes “LZIB”, “LZKZ”, “LZPP”, etc., represent the cities of Bratislava, Kosice, Piestany, and so on. I only care about measurements made in Bratislava, so we’ll only save those; they’re shown in red rectangles in the screenshot above.
Next, notice that each METAR measurement starts with the directive “METAR” and ends with the equals sign “=”, between which various data are stored. For example, one of the lines in this case is METAR LZIB 151700Z 12003KT CAVOK 08/05 Q1031 NOSIG=. This makes extracting the data easier, since we know that everything we want is stored between “METAR” and “=”.
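To illustrate the “METAR” … “=” delimiters, here is a minimal sketch (assuming PHP 7 or newer) that pulls every report out of a block of text with a regular expression. The function name extract_metar_reports is made up for this example, and the sample text reuses two reports quoted in this post, with a line break added to the second one on purpose.

```php
<?php
// Hypothetical helper: return every METAR report found in a block of text.
function extract_metar_reports(string $text): array
{
    // Lazily match from each "METAR" keyword up to the next "=".
    // The "s" modifier lets "." match line breaks too, so a report that
    // wraps onto a second line is still captured whole.
    preg_match_all('/METAR\b.*?=/s', $text, $matches);

    $reports = [];
    foreach ($matches[0] as $report) {
        // Collapse line breaks and repeated spaces into single spaces.
        $reports[] = trim(preg_replace('/\s+/', ' ', $report));
    }
    return $reports;
}

$sample = "METAR LZIB 151700Z 12003KT CAVOK 08/05 Q1031 NOSIG=\n"
        . "METAR LZIB 151600Z 14005KT 9999\nFEW025 BKN042 09/04 Q1031 NOSIG=\n";

print_r(extract_metar_reports($sample));
```

Each element of the returned array is one complete report on a single line, including its closing equals sign.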
Finally, in order to extract the METAR report, we need to know in which HTML structure it is located (by which HTML tags it is surrounded). To determine this, I’ve opened the SHMU website and used the Google Chrome Developer Tools (View»Developer»Developer Tools) to see where the line we want to extract is located. You can also just view the source of the website and search for the line, but I find the Developer Tools to be easier to use. Anyway, below is the screenshot of the HTML structure.
From the above screenshot we can see that the first set of METAR measurements that we want to extract (taken at 17:00 UTC in this case) is enclosed by the <pre> and </pre> HTML tags, and the same holds for the second set of measurements (in this case taken at 16:30 UTC). This will allow the PHP script which we’ll write shortly to identify the portions of the webpage which we want to extract, since we know they’ll always be surrounded by the <pre> tags.
Script for saving METAR data into a file
Now, to actually get the described data, I have made a PHP script which uses the cURL extension to download the METAR data from the SHMU website and parses them using DOM (Document Object Model) and some PHP functions (I’ve used parts of the parsing example from http://htmlparsing.com/php.html as a starting point). See the source code below and save it as a script named metar-parser.php (I’m assuming it’s in the directory /path/to/your/metar/file/).
<!-- file /path/to/your/metar/file/metar-parser.php -->
<?php
// Use the CURL extension to query SHMU and get back a page of results
$url = "http://www.shmu.sk/sk/?page=483";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
// Create a DOM parser object
$dom = new DOMDocument();
// Parse the HTML
@$dom->loadHTML($html);
// Load METAR data
$numberOfParsedContents = 0;
foreach($dom->getElementsByTagName('pre') as $pre) {
    if($numberOfParsedContents == 0) {
        $metar_data1 = $pre->textContent;
    }
    if($numberOfParsedContents == 1) {
        $metar_data2 = $pre->textContent;
    }
    $numberOfParsedContents++;
}
// Extract METAR data for Bratislava
$metar_data_bratislava1 = explode("=", $metar_data1)[0];
$metar_data_bratislava2 = explode("=", $metar_data2)[0];
// Replace possible line breaks (\r\n, \n or \r) by spaces and trim both ends
$metar_data_bratislava1 = trim(str_replace(array("\r\n", "\n", "\r"), " ", $metar_data_bratislava1));
$metar_data_bratislava2 = trim(str_replace(array("\r\n", "\n", "\r"), " ", $metar_data_bratislava2));
// Insert "METAR" at the beginning if there's none
if($metar_data_bratislava1 == "") $metar_data_bratislava1 = "METAR";
if($metar_data_bratislava2 == "") $metar_data_bratislava2 = "METAR";
// Combine date/time and METAR data
$lineToSave1 = date("Y-m-d H:i:s") . " " . $metar_data_bratislava1 . "=";
$lineToSave2 = date("Y-m-d H:i:s") . " " . $metar_data_bratislava2 . "=";
// Save the METAR measurements to a file
$file = "/path/to/your/metar/file/metar-data.csv";
$current = file_get_contents($file);
$current .= $lineToSave2."\n";
$current .= $lineToSave1."\n";
file_put_contents($file, $current);
?>
So, let’s now look at what each part of the code does. The first part downloads the whole source code of the SHMU webpage into the variable $html using the cURL extension, so that we can process it later. cURL is an extension for transferring data over various protocols, and in this case we use it to transfer an HTML page from the SHMU server to my server. To download the page, we first initialize cURL with curl_init(), then set the page’s URL (CURLOPT_URL), ask for the page to be returned as a string rather than printed (CURLOPT_RETURNTRANSFER), and set the connection timeout in seconds (CURLOPT_CONNECTTIMEOUT). Finally, we execute the request with curl_exec() and close the connection with curl_close(). Now we have the whole SHMU webpage stored in the variable $html.
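One thing the download step doesn’t do is check whether the request succeeded. Here is a minimal sketch of how that check might look, assuming PHP 7.1 or newer; the function name fetch_page and the choice of an overall timeout of twice the connection timeout are my own assumptions, not part of the original script.

```php
<?php
// Hypothetical helper: download a page and return its body,
// or null if the request failed or did not return HTTP 200.
function fetch_page(string $url, int $timeout = 5): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);  // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, 2 * $timeout);     // seconds allowed for the whole transfer (arbitrary choice)

    $html = curl_exec($ch);                              // false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);     // 0 if no response was received

    // The handle is released automatically when the function returns.
    if ($html === false || $status !== 200) {
        return null;
    }
    return $html;
}
```

Calling fetch_page("http://www.shmu.sk/sk/?page=483") and stopping the script when it returns null would avoid saving an empty line whenever the site is unreachable.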
Next, we create a DOMDocument object and parse the HTML of the downloaded page into it with loadHTML() (the @ in front suppresses the warnings which the parser raises for HTML that isn’t perfectly valid). With the HTML parsed in the $dom variable, we go through all the <pre> elements in the HTML structure, and for the first two of them (corresponding to the two sets of METAR measurements we want to extract) we put the contents of the <pre> tags into the two variables $metar_data1 and $metar_data2. These two variables now contain the two sets of METAR measurements for all cities.
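If the page ever contains fewer than two <pre> elements, the two variables would be left undefined. A small sketch of one way to guard against that is to collect the <pre> contents into an array and check its size. The name pre_blocks is hypothetical, and the sample page is just two reports quoted in this post wrapped in minimal HTML.

```php
<?php
// Hypothetical helper: return the text of every <pre> element in an HTML string.
function pre_blocks(string $html): array
{
    if (trim($html) === '') {
        return [];                               // nothing was downloaded
    }
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $blocks = [];
    foreach ($dom->getElementsByTagName('pre') as $pre) {
        $blocks[] = $pre->textContent;
    }
    return $blocks;
}

$page = '<html><body>'
      . '<pre>METAR LZIB 151600Z 14005KT 9999 FEW025 BKN042 09/04 Q1031 NOSIG=</pre>'
      . '<pre>METAR LZIB 151530Z 13008KT 9999 SCT034 BKN041 09/04 Q1031 NOSIG=</pre>'
      . '</body></html>';

$blocks = pre_blocks($page);
if (count($blocks) < 2) {
    echo "Expected two <pre> blocks, found " . count($blocks) . "\n";
} else {
    echo $blocks[0] . "\n" . $blocks[1] . "\n";
}
```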
Now, since we only want the measurements for Bratislava, which is the first METAR measurement in each set, we split each set into individual measurements by “exploding” it at the equals sign “=” and keep only the first piece, since each individual METAR measurement must end with the equals sign. After this step, the variables $metar_data_bratislava1 and $metar_data_bratislava2 contain the two METAR measurements for Bratislava which were taken half an hour apart. In other words, these two variables contain only the contents of the two red boxes shown in the first screenshot at the top of this post.
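Taking the first piece relies on Bratislava always being listed first. If you’d rather not depend on the order, a possible alternative is to look for the station code itself. This is only a sketch: find_station_report is a made-up name, and the LZKZ line in the sample is a shortened, invented report used purely for illustration.

```php
<?php
// Hypothetical helper: find the report for one station code in a block of
// "="-terminated METAR reports. Returns the report without the trailing "="
// (like the explode() approach), or null if the station is not present.
function find_station_report(string $block, string $station): ?string
{
    foreach (explode("=", $block) as $report) {
        // Normalise line breaks and repeated spaces first.
        $report = trim(preg_replace('/\s+/', ' ', $report));
        if (strpos($report, "METAR " . $station . " ") === 0) {
            return $report;
        }
    }
    return null;
}

$block = "METAR LZKZ 151700Z NOSIG=\n"                      // invented, shortened report
       . "METAR LZIB 151700Z 12003KT CAVOK\n08/05 Q1031 NOSIG=\n";

var_dump(find_station_report($block, "LZIB"));
var_dump(find_station_report($block, "LZPP"));
```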
Next, since a METAR measurement may be split across two lines on the SHMU website, we make sure that any line breaks are converted into spaces with str_replace().
Also, it may happen that no METAR measurements are shown on the SHMU website (this usually seems to happen around midnight, probably because of maintenance), in which case the variables with the METAR measurements will be empty. To make these cases easier to handle later on, we replace the empty string with the “METAR” directive, so that the saved line ends in “METAR=”.
Next, to make it easier to look up the measurements later on, we put the date and time at which our script downloaded the data at the beginning of each line, followed by the parsed METAR report for Bratislava and a closing equals sign. The variables $lineToSave1 and $lineToSave2 will therefore contain something like 2015-03-15 17:15:01 METAR LZIB 151600Z 14005KT 9999 FEW025 BKN042 09/04 Q1031 NOSIG=, where $lineToSave1 contains the measurement taken half an hour after the one in $lineToSave2.
Finally, we save these two lines into a file on our server, whose path is stored in the variable $file. In order not to overwrite the data saved previously, we first read the current contents of the file with file_get_contents(), then append the two measurements to them, $lineToSave2 first and $lineToSave1 second so that the older measurement comes first, and finally write everything back into the metar-data.csv file with file_put_contents(). So, for example, the following two lines may be appended to the file:
2015-03-15 17:15:01 METAR LZIB 151530Z 13008KT 9999 SCT034 BKN041 09/04 Q1031 NOSIG=
2015-03-15 17:15:01 METAR LZIB 151600Z 14005KT 9999 FEW025 BKN042 09/04 Q1031 NOSIG=
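As the file grows, reading it in full just to add two lines becomes wasteful. PHP can append directly, so here is a small sketch of that alternative; append_lines is a hypothetical name, while FILE_APPEND and LOCK_EX are standard flags of file_put_contents().

```php
<?php
// Hypothetical helper: append lines to a file without reading it first.
function append_lines(string $file, array $lines): bool
{
    $data = implode("\n", $lines) . "\n";
    // FILE_APPEND adds to the end of the file (creating it if needed);
    // LOCK_EX takes an exclusive lock while the data is written.
    return file_put_contents($file, $data, FILE_APPEND | LOCK_EX) !== false;
}

// Usage with a temporary file and the two example lines from this post.
$file = tempnam(sys_get_temp_dir(), "metar");
append_lines($file, [
    "2015-03-15 17:15:01 METAR LZIB 151530Z 13008KT 9999 SCT034 BKN041 09/04 Q1031 NOSIG=",
    "2015-03-15 17:15:01 METAR LZIB 151600Z 14005KT 9999 FEW025 BKN042 09/04 Q1031 NOSIG=",
]);
echo file_get_contents($file);
unlink($file);
```

This also works when the file doesn’t exist yet, which is handy on the very first run.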
CRON job
Now we need to make the script above run automatically every hour, so that we capture the METAR reports continuously. This can be done by setting up what’s called a CRON job on our server, which is just a way of telling the Linux OS to periodically execute some script for us. To do this, log in to your server and run the command crontab -e in the terminal, which will bring up an editor where you can add and edit your CRON jobs. It looks like the screenshot below.
In order to add a CRON job, add the following line to the end of your CRON file, as shown in the screenshot above:
15 * * * * /usr/bin/php /path/to/your/metar/file/metar-parser.php >/dev/null
What this line means is that the first 5 fields (15 * * * *) represent the minute, hour, day of the month, month, and day of the week at which to execute your script, in that order (see this link for more detail). Since a new pair of METAR reports appears on the SHMU website approximately on the hour, we want our PHP script to run some 15 minutes after that (to make sure the data has already been uploaded), and so we tell CRON to execute the script at minute 15 (15) of every hour (*), on every day of the month (*), in every month (*), and on every day of the week (*). The rest of the line is the actual command to execute, and since we want PHP to execute our script metar-parser.php, the command launches PHP (/usr/bin/php) with the path to the script as its parameter (/path/to/your/metar/file/metar-parser.php). Lastly, to suppress the output printed by this command, we redirect its standard output to /dev/null (>/dev/null), which is a place where everything sent to it gets discarded. And so now our script will get executed every hour!
Extract data from METAR reports
Now that our half-hourly METAR reports are being saved into a file, we’ll want to parse this file to get some useful information, such as the temperature and pressure, which we’ll plot later on. Just for reference, here are some lines which you may see in the file metar-data.csv:
2015-03-15 10:15:01 METAR LZIB 150830Z 12010KT 3400 -RA BR BKN008 BKN036 06/05 Q1031 BECMG SCT009 BKN020=
2015-03-15 10:15:01 METAR LZIB 150900Z 13012KT 4200 BR BKN008 06/05 Q1031 BECMG SCT009 BKN020=
2015-03-15 11:15:02 METAR LZIB 150930Z 13014KT 4800 BR SCT008 BKN012 07/05 Q1031 BECMG BKN015=
2015-03-15 11:15:02 METAR LZIB 151000Z 13014KT 5000 BR FEW008 SCT010 BKN014 07/05 Q1031 BECMG BKN015=
2015-03-15 12:15:01 METAR LZIB 151030Z 13012KT 6000 FEW008 BKN013 07/05 Q1031 BECMG BKN015=
2015-03-15 12:15:01 METAR LZIB 151100Z 13012KT 7000 FEW008 BKN015 08/06 Q1031 NOSIG=
So, to extract some useful information from these measurements, I have created another PHP script called metar-split.php, which extracts the date and time of each measurement together with the temperature and pressure, and presents them in CSV format which we’ll later import into Excel for further processing. So, here’s the script:
<!-- file /path/to/your/metar/file/metar-split.php -->
<h1>METAR Splitter</h1>
<?php
echo "<b>Date,Time,Temperature,Pressure</b><br>";
$file = fopen("metar-data.csv", "r");
if($file) {
    while(($line = fgets($file)) !== false) {
        // if there's an empty METAR report, skip it
        if(strpos($line, "METAR=") !== FALSE) continue;
        $output_line = "";
        // get date
        $date = explode(" ", $line)[0];
        $output_line .= $date . ",";
        // get time of measurement
        $time_of_measurement = explode(" ", $line)[4];
        // select only time in HH:MM
        $time_of_measurement = substr($time_of_measurement, 2, 4);
        // insert colon between HH and MM
        $time_of_measurement = substr_replace($time_of_measurement, ":", 2, 0);
        $output_line .= $time_of_measurement . ",";
        // get Temp/DewPoint
        preg_match("/M?[0-9]{2}\/M?[0-9]{2}/", $line, $matches);
        $temperature = $matches[0];
        // get just Temp
        $temperature = explode("/", $temperature)[0];
        // replace M by minus sign
        $temperature = str_replace("M", "-", $temperature);
        $output_line .= $temperature . ",";
        // select pressure
        preg_match("/Q.{4}/", $line, $matches);
        $pressure = $matches[0];
        $pressure = explode("Q",$pressure)[1];
        $output_line .= $pressure . "<br>";
        // display parsed data
        echo $output_line;
    }
    fclose($file);
} else {
    echo "Error reading METAR file.";
}
?>
What this script does is that it first opens the file where our CRON job saves the METAR reports (fopen()) and reads it line by line with fgets() until the end of the file. Then, for each line (corresponding to one METAR report) it does the following.
First, it checks whether the current report is an empty one (a line containing “METAR=”), in which case it skips it. If the report is not empty, an output line string is initialized.
Then, the date of the measurement is extracted by exploding the line by spaces, where the date is the first element of the resulting array, and this date is then appended to the output line. For example, if the line starts with 2015-03-15 10:15:01 METAR LZIB 150830Z …, then the date will be 2015-03-15 (note that the date is not extracted from the group 150830Z in the actual report, because unlike the date 2015-03-15 written by our PHP script, it doesn’t contain the year and month of the measurement).
Next, the time of measurement is extracted by exploding the line by spaces and selecting the fifth element of the resulting array (index 4). For example, for a line starting with 2015-03-15 10:15:01 METAR LZIB 150830Z …, the extracted group will be 150830Z (note that, contrary to the case before, here we use the time 150830Z from the actual report rather than the time 10:15:01 provided by our script, since the latter is the time when the METAR report was downloaded, not when it was measured). This 150830Z group is in the format DDHHMMZ, where DD is the day of the month, HH is the hour and MM is the minute when the measurement was taken, and Z indicates UTC time. To extract just the hours and minutes, we take the substring starting at position 2 (i.e. the 3rd character) with a length of 4, which gives just the part HHMM. Finally, we insert a colon between HH and MM with substr_replace() and append this time to the output line.
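The substring approach assumes the fifth field always has the DDHHMMZ shape. As a sketch of a slightly more defensive variant, the group can be validated and split with one regular expression; parse_time_group is a made-up name for this example.

```php
<?php
// Hypothetical helper: split a METAR time group such as "150830Z" into the
// day of the month and an "HH:MM" time (UTC). Returns null when the group
// does not have the DDHHMMZ shape.
function parse_time_group(string $group): ?array
{
    if (!preg_match('/^(\d{2})(\d{2})(\d{2})Z$/', $group, $m)) {
        return null;
    }
    return ['day' => (int) $m[1], 'time' => $m[2] . ':' . $m[3]];
}

var_dump(parse_time_group("150830Z"));
var_dump(parse_time_group("METAR="));
```

The first call gives day 15 and the time 08:30, while the second returns null because the input isn’t a time group.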
The third part we’re going to extract from the METAR report is the temperature. The temperature in METAR reports is always in the format MTT/MDD, where M represents a possible minus sign (which may or may not be there), TT is the temperature in °C and DD is the dew point in °C. To extract this pattern, we use a regular expression (preg_match()). Next, since we’re only interested in the temperature, we split the string MTT/MDD at the forward slash to get just MTT. Finally, to make things easier to process later, we replace M by an actual minus sign and append the temperature to the output line.
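One caveat with a pattern like “two digits, slash, two digits” is that other METAR groups can contain it too, for example a runway visual range group such as R31/0600. Here is a hedged sketch of a stricter version which requires whitespace before the group and whitespace or “=” after it. The name parse_temperature is hypothetical, and the second report in the usage lines is invented to show the edge case.

```php
<?php
// Hypothetical helper: return the air temperature in °C as an integer,
// or null if the report has no temperature/dew point group.
function parse_temperature(string $report): ?int
{
    // (?<=\s)  the group must start right after whitespace
    // (?=\s|=) and must be followed by whitespace or the closing "="
    if (!preg_match('/(?<=\s)(M?\d{2})\/M?\d{2}(?=\s|=)/', $report, $m)) {
        return null;
    }
    // "M02" becomes "-02", which casts to the integer -2.
    return (int) str_replace('M', '-', $m[1]);
}

var_dump(parse_temperature("METAR LZIB 151700Z 12003KT CAVOK 08/05 Q1031 NOSIG="));
// Invented report containing a runway visual range group before the temperature:
var_dump(parse_temperature("METAR LZIB 010600Z 00000KT 0400 R31/0600 FG M02/M03 Q1025="));
```

Returning an integer also drops the leading zero, so “08” becomes 8.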
The last thing we’ll extract from the report is the pressure. Since these METAR reports give the pressure in the form QPPPP, where PPPP is the pressure in hPa, we extract this string with another regular expression and remove the leading letter Q to get just the pressure in hPa. This pressure is then appended to the output line.
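The pattern “Q followed by any four characters” would also match a Q that happens to appear elsewhere in a line. A minimal sketch of a stricter match, which insists on exactly four digits in a group of its own, might look like this (parse_pressure is a made-up name):

```php
<?php
// Hypothetical helper: return the pressure in hPa as an integer,
// or null if the report has no QPPPP group.
function parse_pressure(string $report): ?int
{
    // "Q" must start a group and be followed by exactly four digits.
    if (!preg_match('/(?<=\s)Q(\d{4})(?=\s|=)/', $report, $m)) {
        return null;
    }
    return (int) $m[1];
}

var_dump(parse_pressure("METAR LZIB 151700Z 12003KT CAVOK 08/05 Q1031 NOSIG="));
```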
Finally, as we have now extracted the date, time, temperature and pressure from the METAR report and separated them by commas (thus forming a CSV line), we display the whole output line. As the while loop repeats, these data are extracted and displayed for every METAR report, and an example output from the metar-split.php script might look like this:
Date,Time,Temperature,Pressure
2015-01-23,14:30,07,1014
2015-01-23,15:00,06,1014
2015-01-23,15:30,06,1014
2015-01-23,16:00,06,1014
2015-01-23,16:30,06,1014
2015-01-23,17:00,06,1015
2015-01-23,17:30,05,1015
2015-01-23,18:00,05,1015
2015-01-23,18:30,05,1015
Plot data in Excel
Now, let’s actually use the data that we have extracted by plotting them on a graph in Excel, and try to deduce whether there’s any correlation between temperature and pressure over time. To do this, run the script metar-split.php (for example by opening it in your browser), copy all of its output, paste it into a file on your computer and save the file in the CSV (comma-separated values) format, e.g. as metar-parsed.csv.
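If copying and pasting from the browser gets tedious, the CSV file could also be written directly by PHP. The following is only a sketch of that idea, not the script I actually used: metar_line_to_row and write_parsed_csv are made-up names, and the parsing is a condensed version of the steps described above that skips any line with a missing field.

```php
<?php
// Hypothetical helper: turn one saved line into [date, time, temperature, pressure],
// or null when a field is missing (for example the empty "METAR=" lines).
function metar_line_to_row(string $line): ?array
{
    $fields = explode(" ", trim($line));
    if (count($fields) < 5
        || !preg_match('/^\d{2}(\d{2})(\d{2})Z$/', $fields[4], $time)
        || !preg_match('/(?<=\s)(M?\d{2})\/M?\d{2}(?=\s|=)/', $line, $temp)
        || !preg_match('/(?<=\s)Q(\d{4})(?=\s|=)/', $line, $pres)) {
        return null;
    }
    return [$fields[0], $time[1] . ':' . $time[2], str_replace('M', '-', $temp[1]), $pres[1]];
}

// Hypothetical helper: read the saved reports from $source and write the
// parsed rows to $target as CSV. Returns the number of data rows written.
function write_parsed_csv(string $source, string $target): int
{
    $lines = file($source, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $out = fopen($target, "w");
    if ($lines === false || $out === false) {
        throw new RuntimeException("Cannot read $source or write $target");
    }
    fputcsv($out, ["Date", "Time", "Temperature", "Pressure"], ",", '"', "\\");
    $written = 0;
    foreach ($lines as $line) {
        $row = metar_line_to_row($line);
        if ($row !== null) {
            fputcsv($out, $row, ",", '"', "\\");
            $written++;
        }
    }
    fclose($out);
    return $written;
}
```

A call such as write_parsed_csv("metar-data.csv", "metar-parsed.csv") would then produce a file that Excel can import as is.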
Then, open Excel and import this CSV file into your spreadsheet (File » Import » CSV File » Import).
When the data has been imported, select the data you want to plot and select Marked Scatter from the Charts tab. This should plot the temperature and pressure against time, as shown below.
However, as you can see, there is not much that can be deduced from the graph, because the scale is too large to see the differences in temperature and pressure. To reduce the scale, one thing we can do is to decrease the values for pressure, since they are roughly two orders of magnitude greater than the values for temperature. To do this, we can subtract the standard atmospheric pressure of 1013 hPa from each pressure value, which brings the pressure values down to magnitudes similar to those of the temperature. So, as shown below, we create another column where we subtract 1013 hPa from the pressure in each row and plot that column instead.
Now we can see that the values of temperature can be easily distinguished, and the same applies for pressure. Therefore, this graph shows the variation with measurement number (on the x-axis) of temperature in °C (on the y-axis) as well as the pressure in hPa -1013hPa (also on the y-axis). In this case, we can see that the temperature (denoted blue) decreases over time while the pressure oscillates.
However, it might be much more interesting to see how the temperature and pressure change over a longer period than just a few measurements, say one month. To do this, I let CRON on my Linux server run the PHP script for about a month, quietly gathering METAR reports every hour. After importing all this data into Excel and plotting it on a scatter plot, I got the following graph. The x-axis shows the measurement number and the y-axis shows the measured temperature (in °C) and pressure (in hPa minus 1013 hPa, i.e. relative to 1 atm).
So, although I didn’t find any correlation between pressure and temperature as I had hoped, the graph at least shows that the mean temperature rose over the course of the month. This corresponds with reality, as these measurements were made from late January to early March, and during that time temperatures in the Northern Hemisphere increase.
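I judged the correlation by eye from the graph, but it could also be put into a number. One common measure is the Pearson correlation coefficient, which ranges from -1 to 1, with values near 0 meaning little linear relationship. Below is a minimal sketch of how it might be computed in PHP. The function name is my own, and the sample values are the nine rows from the example output earlier in this post, far too few to conclude anything.

```php
<?php
// Hypothetical helper: Pearson correlation coefficient of two equally long
// series. Returns null if the series are too short, differ in length, or if
// one of them is constant (the coefficient is undefined in that case).
function pearson_correlation(array $x, array $y): ?float
{
    $x = array_values($x);
    $y = array_values($y);
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        return null;
    }
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $cov = 0.0;
    $varX = 0.0;
    $varY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;
        $dy = $y[$i] - $meanY;
        $cov  += $dx * $dy;
        $varX += $dx * $dx;
        $varY += $dy * $dy;
    }
    if ($varX == 0.0 || $varY == 0.0) {
        return null;
    }
    return $cov / sqrt($varX * $varY);
}

$temperature = [7, 6, 6, 6, 6, 6, 5, 5, 5];
$pressure    = [1014, 1014, 1014, 1014, 1014, 1015, 1015, 1015, 1015];
printf("r = %.2f\n", pearson_correlation($temperature, $pressure));
```

Running the same function over the full month of data would show whether “no correlation” holds up numerically.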
It might be even more interesting to make a similar plot for values measured over the course of a whole year, to determine how much the average temperature rises during the summer and how much it falls during the winter. Another interesting experiment might be to collect the data over the course of multiple years, which might show whether global warming is happening and by how many degrees the temperature increases each year.
So, I’ll leave my METAR downloading script running for some more time and see if I’ll discover any interesting patterns. Thanks for reading and I hope you enjoyed the post!