Apple Event: May 7th at 7 am PT

Looks like no one’s replied in a while. To start the conversation again, simply ask a new question.

Sorting data and averaging multiple and uneven numbers of rows

Hi Folks, I have a pressing problem. I have a large data table with about 11000 rows. It represents the hourly wind speed and direction data for a weather station. There are several things that I need to do with this data.


1. Overall I would like to obtain an average the daily data for Wind speed

2. Similarly for Wind secondly direction if possible.


The complication is that the data sequences appear to be disrupted or inconsistent. If the data were single entries for each hour, there should be 8724 rows as 2012 was a leap year.


a) However, thre are actually 10970 rows, so there may be duplicate entries. How do I clean this up? to remove duplicates


b) Some days have entries of data every hour, others have missed hours, i.e. every other hour. How do I average according to the date only (note teh entries are logged by a date and time field)


The table looks like this at teh start:



ob_time

version_num

wind_direction

wind_speed/knots

wind speed ms-1

01/01/2012 00:00

1

220

17

8.75

01/01/2012 01:00

1

210

19

9.77

01/01/2012 02:00

1

240

17

8.75

01/01/2012 03:00

1

230

15

7.72

01/01/2012 04:00

1

230

13

6.69

01/01/2012 05:00

1

230

14

7.20

01/01/2012 06:00

1

220

16

8.23

01/01/2012 07:00

1

230

17

8.75

01/01/2012 08:00

1

220

16

8.23

01/01/2012 09:00

1

220

14

7.20

01/01/2012 10:00

1

220

13

6.69

01/01/2012 11:00

1

220

16

8.23

01/01/2012 12:00

1

230

17

8.75

01/01/2012 13:00

1

230

13

6.69

01/01/2012 14:00

1

230

13

6.69

01/01/2012 15:00

1

230

11

5.66

01/01/2012 16:00

1

220

12

6.17

01/01/2012 17:00

1

220

14

7.20

01/01/2012 18:00

1

210

12

6.17

01/01/2012 19:00

1

240

8

4.12

01/01/2012 20:00

1

230

12

6.17

01/01/2012 21:00

1

240

13

6.69

01/01/2012 22:00

1

240

17

8.75

01/01/2012 23:00

1

230

14

7.20


but at the end, it looks like this...




29/12/2012 20:00

1

210

17

8.75

29/12/2012 22:00

1

220

17

8.75

30/12/2012 00:00

1

230

16

8.23

30/12/2012 02:00

1

220

16

8.23

30/12/2012 04:00

1

230

12

6.17

30/12/2012 06:00

1

230

14

7.20

30/12/2012 08:00

1

220

15

7.72

30/12/2012 10:00

1

220

16

8.23

30/12/2012 12:00

1

230

15

7.72

30/12/2012 14:00

1

240

18

9.26

30/12/2012 16:00

1

240

20

10.29

30/12/2012 18:00

1

230

17

8.75

30/12/2012 19:00

1

230

17

8.75

30/12/2012 20:00

1

230

14

7.20

30/12/2012 22:00

1

230

17

8.75

31/12/2012 00:00

1

220

20

10.29

31/12/2012 02:00

1

220

23

11.83

31/12/2012 04:00

1

210

21

10.80

31/12/2012 06:00

1

220

17

8.75

31/12/2012 08:00

1

210

21

10.80

31/12/2012 11:00

1

220

17

8.75

31/12/2012 13:00

1

250

17

8.75

31/12/2012 15:00

1

260

15

7.72

31/12/2012 17:00

1

240

12

6.17

31/12/2012 19:00

1

220

9

4.63

31/12/2012 21:00

1

250

10

5.14

31/12/2012 23:00

1

250

12

6.17







sorry - it seems to have lost the formatting - not sure why, but the point is to show that the later entries are different to those at teh top - note the observation times.


I know this is a tough one, but woudl appreciate any help that could be given.


cheers


John

MacBook Pro (13-inch, Mid 2012), OS X Mountain Lion (10.8.2)

Posted on Feb 18, 2013 7:59 AM

Reply
20 replies

Feb 18, 2013 10:18 PM in response to diveo

Hi John,


I took a stab at this before reading the rest of the thread.


Looking at the results of copying your sample data from the thread, then pasting it into a Numbers table, my impression is that the Date and Time values are each in a single cell. Gven that, I'm surprised at the frequency of duplicates identified by Badunit's formula.


An issue with the gaps in the data is that dayswith measurements at one hour intervals will outweigh days with data at two hour intervals if all non-duplicate data is included in a simple average calculation.


My approach was to calculate an average for each date, then use that daily average in the calculation of an annual average. Example below uses on the sample data you provided, but will also work with the full table.


Column A has been added to contain the date of each sample, with the time of day for all samples on that date set to 00:00:00.


On the separate, small table, averages have been calculated for each date using AVERAGEIF. In a Footer row, the 'annual' average has been calculated using the Daily average data.

User uploaded file

Formulas:

A2 and filled down (to A25): =DATEVALUE(B2)

A26 and filled down: =DATEVALUE(LEFT(B26,10))


The two formulas were necessary in my tables as Numbers recognized the first rows as date and time values, but saw the December values as text. This might be an artifact of my custom date format. You may need only the first formula.


Small table:


Column A contains entered dates. these are Date and Time values, with the time set (by Numbers) to 00:00;00, used to match the D&T values in column A of the large table.


B2, filled down to B5: =AVERAGEIF(Table 2 :: $A,A2,Table 2 :: $F)

This calculates the average wind speed (in m/s) for each date in column A, giving a single value for that date.


B6: =AVERAGE(B)

This calculates the annual average, with equal weight given to each date on which measurements were recorded.

Row 6 is a Footer Row.


Regards,

Barry

Feb 19, 2013 6:21 AM in response to diveo

diveo wrote:


Hi Jerry,


Thanks for joining in. The reason for averaging the daily wind data, is to allow a comparison with the average daily electricity consumption from a different data source. I cannot obtain consumption data with any greater resolution than daily, but the wind data (and ultimately the energy that can be derived from the wind via a turbine), has been supplied in hourly format.

For the purposes of comparing supply with demand a daily average will be adequate, as the aim of the work is to consider the surplus / deficit and mitigation measures, i.e. export to grid, storage capacity required or additional electical energy to be purchased. The assignment is to consider the financial viability of a community owned wind turbine.


Hope that is sufficient. I will try to mail the data to you separately, I am not clear on whether I can attach it here directly.


Thanks again...

John,


Once you have a handle on the data, you can do lots of analysis. The daily averages can be calculated using AVERAGEIFS. Here's an example:


User uploaded file

The trend line is a 7-day moving average, so following that takes out some of the noise and gives you an idea of the trends by week.


Jerry

Feb 20, 2013 7:12 AM in response to Jerrold Green1

I'd just like to extend a warm thank you to Jerrold, Barry and Badunit for your excellent help on this problem and your explanations of how you have worked through the solutions. It has been a busy couple of days, so I have just had the chance to look through the latest posts today and will work through your methodologies over the next day or so to understand the solutions for future use (I expect additional similar problems in the coming weeks).


Thanks once again and best regards


John

Sorting data and averaging multiple and uneven numbers of rows

Welcome to Apple Support Community
A forum where Apple customers help each other with their products. Get started with your Apple ID.