Preparing the Files
Mapping the Data: Determining the Relationships Between Locations and Schools
Mapping the Data: Accounting for the Differences Between the Data
Uniting the Data
The Actual Graphgist
Today we are looking at data modeling with ONgDB. 1 state, over 9 million children, and 42,981 rows of CSV immunization data. After many rough drafts, I was finally able to land on an efficient and aesthetically pleasing way to map out the immunization data of children in California (found and downloaded online from the California Department of Education*).In this post our goal is to walk through the data modeling with ONgDB process to show how this CSV data can be connected meaningfully with ONgDB. What makes this data so interesting is its varying degrees of location, three distinct grade levels, and a dense record of immunization numbers and percentages-all spanning over two separate school years.
After successfully mapping the data, I could then easily explore it, answering questions such as: Where in California has the lowest amount of children vaccinated?, Are less parents vaccinating their children in 2015 compared to 2014?, and Which age group is more up to date on its vaccinations?. Furthermore, I was able to clearly visualize the data in small and large quantities using the open native graph database.
* The public school dataset can be found at http://www.cde.ca.gov/ds/si/ds/pubschls.asp . The private school dataset can be found at http://www.cde.ca.gov/ds/si/ps/ .
Before anything else, we have to modify the CSVs by:
- Changing the column names: capitalize only the first letter of each word and leave the rest of the word in lower case.
- Replacing # with Number, % with Percent, 1 with One, + with Plus, and taking out any dashes and parentheses. More simply, spell out the symbols and numbers found in the headers.
Before getting into the code, I took pen to paper by simply drawing out the nodes and their relationships to each other as first step for creating the data modeling with ONgDB. One of the complex parts of this process was determining the relationships between the schools/facilities and their different location labels. The data has three location labels: county, city, and district. Since we cannot be sure whether district borders overlap in multiple cities or vice versa (including county lines), we cannot simply connect county>city>district>school in a single string of relationships. Therefore we connect each school to each individual label. Since the location labels are related we also connect them to each other. We create another location identifier for the state the school is in: in this case, California. A unique constraint is put on the state’s name since all fifty states have unique names. The other three location labels are indexed in the case that there will be multiple, counties, cities, or districts with the same name within the whole country.
Although they are not accounted for in the column headers, the grade level and the grade year of each file can be created into nodes. The schools and facilities are related first to the grade level using an [:INCLUDES] relationship and the grade level nodes are then related to the grade year using the [:BELONGS_TO] relationship.The grade levels PreK, K, and 7th are indexed by name and the grade years 2014 and 2015 are uniquely constrained. We do not put a uniqueness constraint on the grade level since the grade level is also connected to the school’s or facility’s immunization record. If there was only one node for each grade level, too many immunization records would be attached to the grade level node.
In order to pull data from multiple CSVs, we have to load each file with sensitivity towards the differences between each one. A good place to start is to read the headers on the files, noting the similarities and differences from one to the next. In this case, the PreK institutions are called facilities while the Kindergarten and 7th grade institutions are called schools. Furthermore, the immunization record properties vary greatly even within the same grade levels; for example, the Kindergarten class of 2015 includes Healthcare Practitioner Counseled Pbe Number/Percent while the Kindergarten class of 2014 has no such thing. Furthermore, the PreK and Kindergarten files include specific records for each vaccine while the 7th grade records do not. From these observations, we can then accurately merge and create nodes unique to each file.
The validations for each file are essentially the same. These ensure that everything besides the immunization records are not null so that all of the names of the schools and facilities can be identified and their locations accounted for. We also make row.Reported not null.
Below are the Geequel queries from the data modeling process with ONgDB used to create the graph along with a select few of the queries and illustrations used to explore the data. These can quickly be setup and explored with GraphGrid Connected Data Platform.