Performance Issues when looking up reference data

Yesterday I encountered a really good example of how using an online lookup of reference data from a survey can dramatically improve performance. The follow image shows the CPU usage on an 8 processor data collection server at the point when a single survey was modified to use online lookups instead of downloading all the reference data to the phones.

The heavy load prior to the switch was negatively impacting all users.

The cause of the problem

A large number of field workers were updating a survey with a hundred thousand records that was the source of reference data for other surveys. These people were working online with data sims in their phones. So as soon as a field worker completed a survey the results were sent to the server updating the reference data. That automatically triggered the downloading of the 100 thousand records, now including updated reference data, to all the other field workers. Each field worker would have the latest reference data each time they completed a survey but the load was high and response times were getting slow.

Choosing a solution for using reference data

The decision about how to access reference data depends on the type of work your field workers are doing and whether or not they have a network connection while they are doing it. Here are three scenarios to help you choose:

Scenario 1 – Working offline without a network connection

In this scenario your workforce might return to a central location that has wi-fi every evening where they will refresh their phones, uploading any data they have collected and downloading changes to surveys and reference data. The next day they head back out into the field and work offline.

The surveys can use the following functions to get reference data from other surveys and from csv files:

  • search. To get choices for a select question.
  • pulldata. To lookup reference values.

So what happens if the reference data gets really large and is updated continuously by the field worker. Well this is not actually a problem. The server will not be overwhelmed since the workers are only synchronizing their phones once per day. They should actually press the refresh button twice, maybe once in the evening and once in the morning. The first time it will upload all the new data and the second time they will download the updated reference information that includes all of the previous days work from all of the field workers.

Scenario 2 – working online with a permanent network connection

Now the field workers are submitting data as they complete surveys and getting updates to reference data whenever that changes. Because they are online you can use the following functions to get reference data:

  • lookup_choices. To get choice for a select question
  • lookup. To lookup reference data.

Now when completing a survey, if reference data is required, there will be a small delay while the network call is made and then the required reference data will be made available. It is no longer necessary to download large amounts of reference data, most of which will never be used, and store it on the phone. If the reference data is large you should find this a faster approach on the phone as you no longer need to wait for the reference data to be loaded into the survey when you open it.

If your reference data files are small, only a few thousand records, or they are not updated often then you can use either the online or the offline functions in this scenario.

Scenario 3 – A hybrid approach

In this scenario perhaps the workers are online most of the time and want immediate updates to reference data but occasionally they wander outside of the network and still need to lookup those references.

At the moment you would need to use the offline approach and look out for potential performance problems from large frequently updated sets of reference data. However we do have a change request pending to only synchronise the updates to the reference data and not to download all the data each time a single record changes. This change request will allow the hybrid scenario to scale without causing performance problems. Let us know if you need it and I will increase its priority.

5 thoughts on “Performance Issues when looking up reference data

  1. daudruy Marc

    Dear Neil,

    the third scenario would solve lots of issues we meet in dfficult environnement.
    We have a hundred of field workers retrieving 7000 lines of producers data to make our follow-up forms.
    We have used parent-child form and tricky fonctions to lead the enumerator to the right follow-up form.
    (fill form 1, if form 1 exists, then fill form 2 etc …).

    When network connection is missing, this would not allow to jump to the next form before the data are updated. THis cause a lot of frustration for the users.

    We now wish to cross data between forms leading to more data download with search() function or pull data() when update is available from network…

    If we don’t have network available this cannot work…

    Having a kind of “local copy” of the database that would only update necessary data, would trummendesly improve the possibilities. Especialy if local data updated within a form could be available staight after the form has been filled in the next form, without the need of new synchronisation.

    You could also think of triggers, that would only sync the necessary data for a specific user.
    on the server side we have role based questions or “view only user data” role that only shows data for a specific user on the server side, retrieving only those data to field task could also narrow the volume of data.

    Making year by year follow up forms is challenging if we want to retrieve previous data to help the enumerators in their job.

    This also solve an issue where data plans are expensive in some places around the world !
    Dicreasing the volume per 10 or 100 would make a difference in the business model we built.

    As we did in the past year, i’ll be glad if Olvea could contribute to such a functionnality …
    let me know if that’s interesting in your plans.

    ma daudruy

    Reply
  2. Neil Penman Post author

    Hi Marc,

    This sounds really interesting. I would be very happy to work with Olvea on another extension to fieldTask. Here are my comments on your requirements:

    1. Having a local copy of the database.
    This is what you get now when you use pulldata() and search(). However it only includes the data that was on the server at the time of the last synchronisation over the network. Effectively it is readonly until it gets refreshed by the next synchronisation. I don’t have any plans to update this local data store with information from locally completed surveys. This would be complex.

    2.There is mechanism to pass data from one form to another when launching forms within forms. (Which we created for you). Does this not meet your requirements?

    3. Scenario 3 would allow for more frequent lower cost synchronisation of this local data store with the latest data on the server. Whether that came from the device that has the local data store or some other users device. However this does require a network to be present, otherwise you are just left with the readonly copy of the server data which could be quite old.

    4. Data costs. Yes scenario 3 synchronisation of local data should significantly reduce data volumes.

    5. Triggers or maybe filters to restrict synchronisation of data to only that which is relevant. Yes this could be done with the new approach. Actually I used to incorporate role based filters to only include relevant reference data for each user however this was creating a significant server load due to the need to generate a CSV file per user and I have removed it. Are you relying on this?

    6. In general I can see the advantages for you in moving to a more connected model. This does not have to mean on the network all the time but more frequent synchronisations of smaller volumes of data. Scenario 3 could assist.

    Reply
    1. MARC-ANTOINE DAUDRUY

      Dear Neil,

      thanks for the clear reply.

      1. ok I understand the readonly approach.
      We use it for a “sequencial” approach child form 2 can be filled only if child form 1 has be filled in a previsous visit. this is handle by a key that is build within the parent form.
      The counter part is that if you don’t have network available that key is not updated, and the user can not go to child form 2 from parent form as far as the key is not update in the parent form data set.
      This can be handle differently on the parent form to allow allow launching child form 2 as soon as the child form 1 has been completed.
      But now the experience shows that after 2 years we still have users not always comfortable with the mechanisms set in the forms.
      having a local writable copy would allow to save and quit the parent form. Returning to it straight after would allow to advance in the process of going to form 2 without the need of an update.
      We met that the business model is hard to handle if enumerators need to make 3 or 4 visits to the farmer, some would need to go for the form 1 & 2 in a row or 2&3 or 3& 4… to make less visits. We also would need to pull data from form 2 to fill the form 4. Scenario 3 would help making it feaseable from the data volume point of view.

      2. yes this is pretty usefull and a must have fonction when using parent-child forms !
      this allow the user to make sure he’s working on the right producer in the child form
      this allow a “post-linking” of the child form to parent form when the user misses to save the parent form after the child form has been saved. (however this breaks the link with UUID in the console, unfortunatly this often happens).
      this allow to handle matching PRI Keys in child forms that matches specific policies (add, merge etc…)
      that’s what we wanted to achieve, and it’s working fine, this also helps to clean the data for analysis.

      3. in our case I calculate this would decrease the volume of data to be exchange by a factor of 70 000 !
      this should help to handle poor connection issues, transfering a few Bits instead of several MB should be more reliable.

      4. we have a set of 7000 lines (2Mo) (should reach 30 000 later). we have 100 enumerators, if you have ten of them updating 1 line each, this make the 100 devices updating 7000 lines 10 times a day… instead we could only update 100 x 1 line 10 times …

      5. I understand that pretty well, I met that sometimes the update of the data set would need to wait a few minutes to be updated on the phone. (this drove me crazy several times when doing some testing…). I ended doubling the size of our server.
      Scenario 3 in our use case would narrow the data by 70000. We have 5 logins set in the system for collectors (1 per farmer organisation). This would devide by 5 the volume of data… which is less interessting. However this would avoid the guys from organisation 1 to see the lists of organisation 2 that is more a security issue in our case.

      6. Our next campain should start in april/may if this can be handle in this timing fill free to contact me to go further in that way !

      coming back to point 2. if we could set an “auto-save & quit” value in forms, that would be usefull to avoid the loss of data in some cases and user would not need to save 2, or 3 forms within the app to complete operation. This is disturbing for most of them actually.

      Reply

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.