Test Data

Oh Test Data! Why dost thou trouble me? Anyone who has worked long enough in a QA or tester role, whether manual, programming or automation, knows what I am talking about. Test Data is a critical piece in ensuring that Automation can succeed.


  1. Context
  2. Use Case
  3. Strategies
  4. Excel parsing
  5. XML parsing
  6. JSON parsing
  7. YAML parsing
  8. Database and SQL
  9. Other formats parsers


Any application, web or non-web, has data flowing through the different layers of the system [front end, middle tier (n layers), back end]. Unless we have a handle on the test data as it flows across those layers, we cannot call Automation successful. Yes, we can advance to an extent by randomizing and faking data; however, a majority of applications in maintenance mode have data scattered across formats, files and databases, and sometimes that data is unstructured and fuzzy. Some of the systems in which data has been stagnating for years in many complex environments are:

  1. Mainframes
  2. SQL databases
  3. Flat files (Excel, CSV, etc.)
  4. OLAP/OLTP (Business Intelligence tool jargon)
  5. Big Data (Structured, unstructured, video, images, audio and pretty much any data that relates to a system)

A Use Case:

Let’s talk about a simple use case on a web application and decompose the problem.

  • Hit the web app URL in a browser (a form opens up)
  • Fill the form with data and submit the form
  • The server accepts the data, persists it in a database behind the scenes, and sends a response back
  • The DOM refreshes, and the response can be a success message, a redirection to another page, or whatever the workflow defines

If we need a repeatable test Automation script for the above scenario, think about a very simple and abstract outline as follows:

My input test data resides in a data structure (let’s say an Excel file), and my test automation script is most likely Selenium, which identifies each element on the form, fills the page with the Excel data and then submits. Now how do we validate what we just did? And, better, how do we rerun the same script again? Why we need the capability to rerun is a bigger discussion; for now we assume the premise that we must be able to reset the system to the state it was in before the script executed.
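
The outline above can be sketched in Python. This is only a minimal sketch under stated assumptions, not a definitive implementation: the `csv` module stands in for the Excel file, an in-memory SQLite table stands in for the application’s database, and the hypothetical `submit_form` function stands in for the Selenium steps that fill and submit the form.

```python
import csv
import io
import sqlite3

# Assumption: CSV text stands in for the Excel sheet holding input test data.
TEST_DATA = """name,email
Alice,alice@example.com
Bob,bob@example.com
"""


def load_rows(text):
    """Parse the test data into a list of dicts, one per form submission."""
    return list(csv.DictReader(io.StringIO(text)))


def submit_form(conn, row):
    """Hypothetical stand-in for the Selenium steps: fill the form, submit.

    Here the 'server' simply persists the row, as in the use case.
    """
    conn.execute(
        "INSERT INTO users (name, email) VALUES (?, ?)",
        (row["name"], row["email"]),
    )
    conn.commit()


def reset_state(conn):
    """Return the system to the state it had before the script executed."""
    conn.execute("DELETE FROM users")
    conn.commit()


def run_script(conn):
    """Submit every row, validate what was persisted, then always reset."""
    rows = load_rows(TEST_DATA)
    try:
        for row in rows:
            submit_form(conn, row)
        actual = conn.execute(
            "SELECT name, email FROM users ORDER BY name"
        ).fetchall()
        expected = sorted((r["name"], r["email"]) for r in rows)
        return actual == expected
    finally:
        reset_state(conn)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
first_run = run_script(conn)
second_run = run_script(conn)  # passes again only because state was reset
```

Without the `reset_state` call, the second run would find four rows instead of two and the comparison would fail, which is the rerun problem described above.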

So now we need the following capabilities within our solution:

  • If, after the form is submitted, the application behind it makes a call to an API layer, the test automation script must be able to make the same call, because that is how we validate actual vs. expected data [this assumes that the API layer is our final source of data truth]
  • After the form is submitted, the data may flow through a number of layers. For simplicity, let’s say the application makes an API call, the API works through some message queues, and eventually a data access layer persists the data into a relational database. In this case, the test Automation script can either have programmatic access to each of those layers or, if we have enough confidence in the database quality, test only the layers upward from the database layer all the way to the front end
  • Depending on how complex the application gets, the n layers behind the front end can each convert the data as it passes through, which means we need to know how those conversions are done if we intend to do end-to-end test Automation
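
To make the expected vs. actual comparison in the list above concrete, here is a small sketch in Python. The conversions (trimming names, lower-casing emails, rewriting dates) and the field names are assumptions chosen for illustration; a real application would have its own rules, which the test would need to mirror.

```python
from datetime import datetime


def expected_after_conversion(form_input):
    """Apply the conversions we assume the layers behind the form perform.

    Assumption: names are trimmed, emails lower-cased, and dates converted
    from DD/MM/YYYY to ISO 8601 before the data is persisted.
    """
    joined = datetime.strptime(form_input["joined"], "%d/%m/%Y").date()
    return {
        "name": form_input["name"].strip(),
        "email": form_input["email"].strip().lower(),
        "joined": joined.isoformat(),
    }


def diff_records(expected, actual):
    """Return {field: (expected, actual)} for every field that differs."""
    fields = sorted(set(expected) | set(actual))
    return {
        field: (expected.get(field), actual.get(field))
        for field in fields
        if expected.get(field) != actual.get(field)
    }


# What the script typed into the form.
form_input = {
    "name": "  Alice ",
    "email": "Alice@Example.com",
    "joined": "31/12/2024",
}
# Hypothetical record as read back from the API or database layer.
actual = {
    "name": "Alice",
    "email": "alice@example.com",
    "joined": "2024-12-31",
}

mismatches = diff_records(expected_after_conversion(form_input), actual)
```

An empty `mismatches` dictionary means the layer under test holds what we expected; any entry names the field and shows both values, which makes failures easier to read than a bare pass or fail.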

So, as we can see, test data (expected vs. actual) can get complex as we delve into the details. I am not trying to scare you away, only emphasizing that getting hold of test data is very important. If we can get this puzzle under control to a certain extent, meaning we have ingestion scripts that populate data in all layers behind the front end and a mechanism to retrieve that data, all programmatically, then we are already on the road to success with Automation.


There are multiple strategies for dealing with Test Data, some of them being:

  • Boil the ocean and have access to all data-access layers [ideal, but not practical]
  • Use data/service virtualization tools to shield the tests from the inconsistencies or performance issues of some layers in the chain. For example, the content API layer might be slow all the time. We cannot afford to fail the test Automation scripts just because one layer was down while all other layers were up. Some of the tools are HP SV, IBM SV, etc. This strategy works really well when the application development team wants to focus on building the application and rely on mock or virtualized services that serve data to the application until the real back-end services become stable. We will have a separate section dedicated to virtualization on this website very soon.
  • Have a Test Automation Strategy that does Test Automation at different layers in the application as per the below pyramid and handles test data at each layer efficiently.
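
The virtualization strategy in the list above can be illustrated with a tiny in-process stub in Python. This is not how HP SV or IBM SV work internally; it is a hand-written stand-in that shows the idea only. The class names, the `get_article` method and the canned article are all hypothetical.

```python
class SlowContentApi:
    """Hypothetical real client; imagine it is slow or frequently down."""

    def get_article(self, article_id):
        raise TimeoutError("content API did not respond")


class VirtualContentApi:
    """Virtualized stand-in that serves canned data for known requests."""

    def __init__(self, canned):
        self._canned = canned

    def get_article(self, article_id):
        return self._canned[article_id]


def render_title(api, article_id):
    """Application code under test; it depends only on get_article()."""
    article = api.get_article(article_id)
    return article["title"].upper()


# The test wires in the virtual service instead of the unstable one.
virtual_api = VirtualContentApi({42: {"title": "Test data strategies"}})
title = render_title(virtual_api, 42)
```

Because `render_title` depends only on the `get_article` call, the test keeps running when the real content layer is unavailable, and the real client can be swapped back in once it is stable.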



Another high-level way of looking at it is as below.


Let’s get tactical:

So with that background and context, let’s move on and get into some technical stuff. To start with, let’s talk about the different data formats we encounter and how we can handle them in our Automation Solution.