
Sorting data and averaging multiple and uneven numbers of rows

Hi Folks, I have a pressing problem. I have a large data table with about 11000 rows. It represents the hourly wind speed and direction data for a weather station. There are several things that I need to do with this data.


1. Overall, I would like to obtain an average of the daily wind speed data.

2. Similarly for wind direction, if possible.


The complication is that the data sequences appear to be disrupted or inconsistent. If the data were single entries for each hour, there should be 8,784 rows (366 days x 24 hours), as 2012 was a leap year.


a) However, there are actually 10,970 rows, so there may be duplicate entries. How do I clean this up to remove the duplicates?


b) Some days have entries for every hour; others have missed hours, e.g. data every other hour. How do I average according to the date only? (Note: the entries are logged in a combined date and time field.)


The table looks like this at the start:



ob_time          | version_num | wind_direction | wind_speed/knots | wind speed ms-1
01/01/2012 00:00 | 1 | 220 | 17 | 8.75
01/01/2012 01:00 | 1 | 210 | 19 | 9.77
01/01/2012 02:00 | 1 | 240 | 17 | 8.75
01/01/2012 03:00 | 1 | 230 | 15 | 7.72
01/01/2012 04:00 | 1 | 230 | 13 | 6.69
01/01/2012 05:00 | 1 | 230 | 14 | 7.20
01/01/2012 06:00 | 1 | 220 | 16 | 8.23
01/01/2012 07:00 | 1 | 230 | 17 | 8.75
01/01/2012 08:00 | 1 | 220 | 16 | 8.23
01/01/2012 09:00 | 1 | 220 | 14 | 7.20
01/01/2012 10:00 | 1 | 220 | 13 | 6.69
01/01/2012 11:00 | 1 | 220 | 16 | 8.23
01/01/2012 12:00 | 1 | 230 | 17 | 8.75
01/01/2012 13:00 | 1 | 230 | 13 | 6.69
01/01/2012 14:00 | 1 | 230 | 13 | 6.69
01/01/2012 15:00 | 1 | 230 | 11 | 5.66
01/01/2012 16:00 | 1 | 220 | 12 | 6.17
01/01/2012 17:00 | 1 | 220 | 14 | 7.20
01/01/2012 18:00 | 1 | 210 | 12 | 6.17
01/01/2012 19:00 | 1 | 240 | 8 | 4.12
01/01/2012 20:00 | 1 | 230 | 12 | 6.17
01/01/2012 21:00 | 1 | 240 | 13 | 6.69
01/01/2012 22:00 | 1 | 240 | 17 | 8.75
01/01/2012 23:00 | 1 | 230 | 14 | 7.20


but at the end, it looks like this...




ob_time          | version_num | wind_direction | wind_speed/knots | wind speed ms-1
29/12/2012 20:00 | 1 | 210 | 17 | 8.75
29/12/2012 22:00 | 1 | 220 | 17 | 8.75
30/12/2012 00:00 | 1 | 230 | 16 | 8.23
30/12/2012 02:00 | 1 | 220 | 16 | 8.23
30/12/2012 04:00 | 1 | 230 | 12 | 6.17
30/12/2012 06:00 | 1 | 230 | 14 | 7.20
30/12/2012 08:00 | 1 | 220 | 15 | 7.72
30/12/2012 10:00 | 1 | 220 | 16 | 8.23
30/12/2012 12:00 | 1 | 230 | 15 | 7.72
30/12/2012 14:00 | 1 | 240 | 18 | 9.26
30/12/2012 16:00 | 1 | 240 | 20 | 10.29
30/12/2012 18:00 | 1 | 230 | 17 | 8.75
30/12/2012 19:00 | 1 | 230 | 17 | 8.75
30/12/2012 20:00 | 1 | 230 | 14 | 7.20
30/12/2012 22:00 | 1 | 230 | 17 | 8.75
31/12/2012 00:00 | 1 | 220 | 20 | 10.29
31/12/2012 02:00 | 1 | 220 | 23 | 11.83
31/12/2012 04:00 | 1 | 210 | 21 | 10.80
31/12/2012 06:00 | 1 | 220 | 17 | 8.75
31/12/2012 08:00 | 1 | 210 | 21 | 10.80
31/12/2012 11:00 | 1 | 220 | 17 | 8.75
31/12/2012 13:00 | 1 | 250 | 17 | 8.75
31/12/2012 15:00 | 1 | 260 | 15 | 7.72
31/12/2012 17:00 | 1 | 240 | 12 | 6.17
31/12/2012 19:00 | 1 | 220 | 9 | 4.63
31/12/2012 21:00 | 1 | 250 | 10 | 5.14
31/12/2012 23:00 | 1 | 250 | 12 | 6.17







Sorry, it seems to have lost the formatting; I'm not sure why. The point is to show that the later entries are different from those at the top: note the observation times.


I know this is a tough one, but I would appreciate any help that could be given.


cheers


John

MacBook Pro (13-inch, Mid 2012), OS X Mountain Lion (10.8.2)

Posted on Feb 18, 2013 7:59 AM


Feb 18, 2013 8:41 AM in response to diveo

If you sort by date (column A) first you can use an IF formula in a new column to flag duplicates.


=IF(A2=A1, "Duplicate","")


You can copy/paste that down the entire column, but you must sort the table by column A before creating this column of formulas. Once the new column is complete, sort by it to bring all the duplicates together so you can delete them at once. One caution: if your criterion for a duplicate is based solely on the date and time (as in the formula above), first look through the table and confirm that the flagged rows really are duplicates, rather than two different data points that happen to share a date and time (I'm not sure how that could happen, but I'd still check).
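Outside the spreadsheet, the same sort-then-compare-neighbours idea can be sketched in Python. The column layout and date format below are assumptions based on the table in the question:

```python
from datetime import datetime

# (ob_time, wind_speed_knots) pairs; layout assumed from the posted table
rows = [
    ("01/01/2012 01:00", 19),
    ("01/01/2012 00:00", 17),
    ("01/01/2012 01:00", 19),  # duplicate timestamp
    ("01/01/2012 02:00", 17),
]

def dedupe(rows):
    """Sort by timestamp, then keep only the first row for each ob_time,
    mirroring sort-by-column-A plus IF(A2=A1, "Duplicate", "")."""
    rows = sorted(rows, key=lambda r: datetime.strptime(r[0], "%d/%m/%Y %H:%M"))
    seen, kept = set(), []
    for t, speed in rows:
        if t not in seen:          # any later row with the same t is a duplicate
            seen.add(t)
            kept.append((t, speed))
    return kept

print(len(dedupe(rows)))  # 3 rows survive
```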


To get an average of an unspecified number of items, use SUMIF/COUNTIF. We can provide more detail if that isn't enough to get you going.
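The SUMIF/COUNTIF idea, grouping on the date part of the timestamp, might look like this as a Python sketch (column names and date format are again assumptions):

```python
from collections import defaultdict
from datetime import datetime

# (ob_time, wind_speed_knots) pairs; layout assumed from the posted table
rows = [
    ("01/01/2012 00:00", 17),
    ("01/01/2012 01:00", 19),
    ("02/01/2012 00:00", 15),
    ("02/01/2012 01:00", 17),
    ("02/01/2012 02:00", 16),
]

# SUMIF / COUNTIF per calendar date, then divide
sums, counts = defaultdict(float), defaultdict(int)
for t, speed in rows:
    day = datetime.strptime(t, "%d/%m/%Y %H:%M").date()
    sums[day] += speed
    counts[day] += 1

daily_avg = {day: sums[day] / counts[day] for day in sums}
print(daily_avg)  # one average per day, however many rows that day has
```

Because the count is taken per day, this handles days with missing hours automatically: each day is averaged over however many readings it actually has.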


There is another method for flagging duplicates using the COUNTIF function, but I suspect it would be very slow on a table of 11,000 entries.

Feb 18, 2013 8:48 AM in response to Badunit

Hi Badunit,

Thanks for taking the time on this and getting back so rapidly. I tried your suggestion and am not sure if I am doing it right. I copied your formula into a blank cell and dragged it down and got this...



ob_time          | version_num | wind_direction | wind_speed/knots | wind speed ms-1 | (formula column)
01/01/2012 00:00 | 1 | 220 | 17 | 8.75 |
01/01/2012 01:00 | 1 | 210 | 19 | 9.77 | Duplicate
01/01/2012 02:00 | 1 | 240 | 17 | 8.75 | Duplicate
01/01/2012 03:00 | 1 | 230 | 15 | 7.72 | Duplicate
01/01/2012 04:00 | 1 | 230 | 13 | 6.69 | Duplicate
01/01/2012 05:00 | 1 | 230 | 14 | 7.20 | Duplicate
01/01/2012 06:00 | 1 | 220 | 16 | 8.23 |
01/01/2012 07:00 | 1 | 230 | 17 | 8.75 |
01/01/2012 08:00 | 1 | 220 | 16 | 8.23 |
01/01/2012 09:00 | 1 | 220 | 14 | 7.20 |
01/01/2012 10:00 | 1 | 220 | 13 | 6.69 |
01/01/2012 11:00 | 1 | 220 | 16 | 8.23 |
01/01/2012 12:00 | 1 | 230 | 17 | 8.75 |
01/01/2012 13:00 | 1 | 230 | 13 | 6.69 |
01/01/2012 14:00 | 1 | 230 | 13 | 6.69 |
01/01/2012 15:00 | 1 | 230 | 11 | 5.66 |
01/01/2012 16:00 | 1 | 220 | 12 | 6.17 |
01/01/2012 17:00 | 1 | 220 | 14 | 7.20 |
01/01/2012 18:00 | 1 | 210 | 12 | 6.17 |
01/01/2012 19:00 | 1 | 240 | 8 | 4.12 |
01/01/2012 20:00 | 1 | 230 | 12 | 6.17 |
01/01/2012 21:00 | 1 | 240 | 13 | 6.69 |
01/01/2012 22:00 | 1 | 240 | 17 | 8.75 |
01/01/2012 23:00 | 1 | 230 | 14 | 7.20 |
02/01/2012 00:00 | 1 | 230 | 15 | 7.72 |
02/01/2012 01:00 | 1 | 230 | 17 | 8.75 |
02/01/2012 02:00 | 1 | 230 | 16 | 8.23 |
02/01/2012 03:00 | 1 | 230 | 14 | 7.20 |
02/01/2012 04:00 | 1 | 220 | 17 | 8.75 |
02/01/2012 05:00 | 1 | 230 | 15 | 7.72 |
02/01/2012 06:00 | 1 | 220 | 15 | 7.72 |

So it seems to be picking up some of the date/time data as duplicates, but not others. It does this further down the sheet in many more instances, even though the hour entries are different. Any thoughts? I did not go further than that, as I think I need the sorted data before trying to average.


thanks again.

Feb 18, 2013 9:06 AM in response to diveo

That is strange.

  • What is the highlighting on the table in your screenshot?
  • Take a look and tell me what the formula is on the row for 01/01/2012 02:00 (3rd row down). What row is that entry in?
  • Is ob_time entirely in column A, rather than the date in column A and the time in column B? I can't tell these things from your screenshots.

Feb 18, 2013 9:44 AM in response to Badunit

Hi again,


Apologies, as I had to go and collect my daughter from after school club....


The highlighting is not significant; it was only to mark the first day as a full day...


I take it you mean the formula that is returning the "Duplicate" entry at cell G3; if so, it is

=IF(A4=A3, "Duplicate","")

which I hope is a correct paste of the formula from your first reply.



Column A is a combined date and time field, i.e. 01/01/2012 01:00, etc. I don't know whether they can be separated, or whether that would help if they could.

Feb 18, 2013 10:00 AM in response to Jerrold Green1

Hi Jerry,


Thanks for joining in. The reason for averaging the daily wind data, is to allow a comparison with the average daily electricity consumption from a different data source. I cannot obtain consumption data with any greater resolution than daily, but the wind data (and ultimately the energy that can be derived from the wind via a turbine), has been supplied in hourly format.

For the purposes of comparing supply with demand, a daily average will be adequate, as the aim of the work is to consider the surplus/deficit and mitigation measures, i.e. export to the grid, storage capacity required, or additional electrical energy to be purchased. The assignment is to consider the financial viability of a community-owned wind turbine.


Hope that is sufficient. I will try to mail the data to you separately, I am not clear on whether I can attach it here directly.


Thanks again...

Feb 18, 2013 10:07 AM in response to diveo

I am mystified. You can email it to me, too. Click on my name to go to my profile to get my email address.


I was suggesting a slightly different formula. Row 3 would have =IF(A3=A2, "Duplicate","") , so you can fill/paste the formula down to the bottom row of the table. With your version, the last row will produce an error because it references a row that does not exist (the nonexistent row below the last row). But that's a different problem.


The two values (A3 and A4) are obviously different when you look at them. I'm baffled as to why the formula would not work. How about this as another test, using three cells in that new column:

  1. In a cell put the formula =A4 . What's the result? Looks exactly like the value in A4?
  2. In another cell put the formula =A3 . What's the result? Looks exactly like the value in A3?
  3. In a third cell put =A4-A3 . Is the result 0d ?

Feb 18, 2013 10:30 AM in response to diveo

Regarding the average, if you have regularly spaced data (like hourly), the average is simple to do. If you have missing data or unevenly spaced data, an average for the day might not be completely accurate.


Time Data

1:00 10

2:00 12

3:00 13

4:00 18


  • Average of all four = (10+12+13+18)/4 = 13.25
  • If missing the 3:00 data, the average = (10 +12+18)/3 = 13.33
  • If you try to interpolate to get the 3:00 data point you would get a different "average"
  • If you summed windspeed x "hours until next data point taken" & divided by the total number of hours you would get a different "average". With the missing 3:00 data it would be =(10*1 + 12*2 + 18*1)/4 = 13.00


My suggestion is to do a simple average of the data points you have. If data points are missing, no method will be absolutely correct.
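The worked example above can be checked in a few lines of Python (treating the last reading as covering one hour, which is itself an assumption):

```python
# Readings at 1:00, 2:00 and 4:00; the 3:00 point is missing
samples = [(1, 10), (2, 12), (4, 18)]  # (hour, wind speed)

# Simple average of the points you actually have
simple = sum(speed for _, speed in samples) / len(samples)

# Interval-weighted average: each reading counts for the hours until
# the next reading; the final reading is assumed to cover one hour
weighted_sum = 0.0
for i, (hour, speed) in enumerate(samples):
    width = samples[i + 1][0] - hour if i + 1 < len(samples) else 1
    weighted_sum += speed * width
weighted = weighted_sum / (samples[-1][0] + 1 - samples[0][0])

print(round(simple, 2), weighted)  # 13.33 13.0
```

The two results differ exactly as in the bullet list above, which is the point: with missing data, the "average" depends on the convention you choose.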

Feb 18, 2013 11:15 AM in response to diveo

John,


The most significant issue with your plan, I think, is the fact that a simple average of wind speed will have little bearing on the power produced over a 24-hour period, especially if the speed deviation is significant compared to the average.


Here's a typical output curve (Power generated vs. Wind Speed)


[User-uploaded image: turbine power output curve]

Notice that at low wind speeds you have a cube-law effect, and at higher speeds the output is clipped because of the limitations of the generator and pitch controls.
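Jerry's caution can be made concrete with a toy power-curve function. Every number below (cut-in, rated speed, rated power, cut-out) is an illustrative assumption, not data for any real turbine:

```python
def turbine_power(v, cut_in=3.5, rated_speed=13.0, rated_power=225.0, cut_out=25.0):
    """Idealised power curve in kW: zero below cut-in and above cut-out,
    roughly cubic between cut-in and rated speed, flat at rated power above."""
    if v < cut_in or v >= cut_out:
        return 0.0
    if v >= rated_speed:
        return rated_power
    # Cube-law region, scaled to meet rated power exactly at rated speed
    return rated_power * (v**3 - cut_in**3) / (rated_speed**3 - cut_in**3)

# Converting the mean speed to power is not the same as converting each
# reading and then averaging -- the nonlinearity is exactly Jerry's point
speeds = [4.0, 20.0]                  # mean speed is 12.0
power_of_mean = turbine_power(12.0)
mean_of_power = sum(turbine_power(s) for s in speeds) / len(speeds)
print(round(power_of_mean, 1), round(mean_of_power, 1))
```

Because the curve is nonlinear, the two quantities printed at the end differ, which is why a daily mean speed only approximates the daily energy yield.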


Jerry

Feb 18, 2013 11:26 AM in response to Jerrold Green1

Hi Jerry,


You are correct in your explanation.

However, I would like to be able to take account of the days when the wind speed lies between the cut-in speed and the speed at which rated power is reached, as this will reduce the total energy generated.


Also, part of the reason for chasing this down rather than using the more simplified approach is to build a picture of the difference between the supply and consumption of energy. For a grid-connected system that is not such a big problem, as you can draw from or export to the grid. For an island or isolated system, however, that option does not exist, and I would like to develop the means to work with the differentials and consider storage systems, etc. I appreciate that on a daily basis this is a bit of an academic exercise, but I think the approach would be the same for a data set at minute-level resolution. Does that make sense to you?


thanks again

Feb 18, 2013 11:33 AM in response to Badunit

Hi again Badunit,


I hope this doesn't distract us all off on a tangent.


I tried your suggestion, using =A5-$A$3, and drag-copied the formula down the cells; it gave differences of 1h, 2h, 3h, ... 1d 1h, 1d 2h, etc.


So there might be another step: identify cell pairs where the difference is not 1h, and modify the calculation to account for those?


Although I would not want you to waste your time on a wild goose chase over this idea...
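That step, flagging consecutive pairs whose spacing is not one hour, is straightforward to sketch in Python (date format assumed from the posted table):

```python
from datetime import datetime, timedelta

# Timestamps copied from the tail of the posted table
times = ["30/12/2012 18:00", "30/12/2012 19:00",
         "30/12/2012 20:00", "30/12/2012 22:00"]
parsed = [datetime.strptime(t, "%d/%m/%Y %H:%M") for t in times]

# Flag consecutive pairs whose difference is not exactly one hour
gaps = [(a, b) for a, b in zip(parsed, parsed[1:]) if b - a != timedelta(hours=1)]
print(len(gaps))  # 1: the 20:00 -> 22:00 step
```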

Feb 18, 2013 4:41 PM in response to diveo

The technique I used to remove the duplicates is similar to Badunit's. I'll give you my approach:


Add a column to the left of the date/time column


In row 2 of the new column, write:


=OFFSET(A2, -1,1)=OFFSET(A2, 0,1)


Copy this cell and Paste to the rest of the column.


All the duplicate rows will be identified with TRUE.


Select the entire Aux column and copy it (Command-C).


Edit > Paste Values.


Sort the comparison column ascending.


Delete the rows containing TRUE.


Delete the Aux column.


Jerry
