What is a Data Scientist? – Brian Devitt LLC

I am old enough to remember the cool guys who worked on mainframe computers in the late seventies. They would hang out in the college computer center wearing jeans and t-shirts, impressing us with their mystical knowledge of early computer programming. I would patiently wait in line with my stack of punch cards to ask them why my simple Fahrenheit-to-Celsius temperature conversion program failed to compile. Like a magician, one of them would look at my green-bar printout and tell me I had missed a comma on the seventh line of code.

When preparing to enter the civilian world after serving seven active duty years in the Navy, I decided to pursue an MS in Computer Science. Thank God I did. It was 1984 (a profound year in retrospect), and I fortunately caught the information technology wave. I could surf between many diversified industries: aerospace, Wall Street, banking, pharmaceuticals, insurance, and the power industry. Although the software applications differed significantly, they had one thing in common: the implicit importance of their data. Nothing is more critical to a digital process than the data that runs through it. However, it took a while for this fundamental concept to be entirely accepted. When applications ran smoothly and produced the expected results, no one cared much about data. Even when things went to hell and the data integrity was suspect, the computer system support team would provide a ‘hot patch’ to fix the immediate problem and not investigate the data issue any further.

After Y2K, software applications became more sophisticated, and managers in every industry started asking more fundamental questions like “How are we doing?”, “How do we get better?”, and “Who are our best performers?” The only way to answer these and related questions is to review the stored data. The application databases seemed like the obvious place to start the search, but the critical historical data was only sometimes there. So, databases dedicated to storing generations of corporate data were built with new and elaborate organizational structures. They were called Data Warehouses, and they are invaluable assets to corporate decision-making. An early breed of Data Scientists did the heavy lifting of creating and maintaining these data warehouses.

Data Warehouses caught on quickly during the early two-thousands, and every major industry sought one. The demand for Data Scientists soared. Most were sourced internally because of their familiarity with existing databases and their business acumen. Like many other new initiatives, Data Science struggled at first to form its best practices. Should the operational database be kept separate from the data warehouse to limit contention for resources? Which schema is the most efficient for mining the data in your corporation’s Data Warehouse? Is it better to ‘outsource’ the Data Warehouse to an external vendor or keep it in-house?

Almost twenty-five years later, the role of Data Scientist has become formalized. A reasonable definition of the roles of a Data Scientist would include:

Data Acquisition

No matter the source, data needs to be captured and recorded correctly. There is zero tolerance for inaccuracies here because a loss of data integrity, no matter how small, will cripple any application that analyzes the data. Data Scientists need to anticipate the full spectrum of data possibilities and ensure the data extraction mechanism supports it.

Maintaining data in any formal repository is a full-time operation. As data is added or updated, it needs to be cleansed of inaccuracies to maintain the paramount data integrity requirement. It may also be necessary to reorganize the data to support new corporate initiatives. Depending on the size of the data repository, a data staging area could be required to first capture the data, cleanse it as necessary, perform any necessary updates, and then store it persistently with the other data.
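The staging flow described above (capture, cleanse, update, store) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Reading` record, the sensor data, the integrity rules, and the dictionary standing in for the persistent repository are all hypothetical, not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical record type for the illustration."""
    sensor_id: str
    celsius: float


def cleanse(raw_rows):
    """Split staged rows into clean records and rejected rows."""
    clean, rejected = [], []
    for row in raw_rows:
        sensor_id = str(row.get("sensor_id", "")).strip()
        try:
            celsius = float(row.get("celsius"))
        except (TypeError, ValueError):
            rejected.append(row)  # missing or non-numeric temperature
            continue
        # Assumed integrity rules: an id is required, and nothing can be
        # colder than absolute zero.
        if not sensor_id or celsius < -273.15:
            rejected.append(row)
            continue
        clean.append(Reading(sensor_id, celsius))
    return clean, rejected


def load(repository, clean_rows):
    """Update step: a later reading for a sensor replaces the earlier one."""
    for reading in clean_rows:
        repository[reading.sensor_id] = reading
    return repository


# Captured rows wait in the staging area until they are cleansed.
staging = [
    {"sensor_id": " A1 ", "celsius": "21.5"},
    {"sensor_id": "B2", "celsius": "not-a-number"},
    {"sensor_id": "", "celsius": "19.0"},
    {"sensor_id": "A1", "celsius": 22.0},
]

repository = {}  # stand-in for the persistent data store
clean, rejected = cleanse(staging)
load(repository, clean)
```

Keeping the rejected rows rather than silently dropping them is a deliberate choice in this sketch: they are the evidence someone needs when a data integrity question comes up later.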

Data Organization

Data in any well-maintained data repository should not stagnate. It is a living thing that needs constant attention. Beyond routine care and upkeep, it must be continually re-examined to stay valid. Continuous investigation for new data relationships, and the modeling that results, ensures optimal data mining. So many exciting and sometimes profound discoveries come from simple ‘what if?’ questions about the data when it’s efficiently summarized or classified. This is especially true when working with data collected from machine learning.

Metadata repositories were created to better understand the data organization within a corporate data repository. These coresident data structures contain information about the data (metadata) and act like a data blueprint, elaborating the details of the structures that hold the data. This information is critical in the development of efficient data searches and queries.
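As a small-scale illustration of the idea, most database engines keep a built-in catalog that describes their own tables. The sketch below uses Python's standard `sqlite3` module, with SQLite's `sqlite_master` table and `PRAGMA table_info` standing in for a corporate metadata repository. The `employee` and `sale` tables and the `describe` helper are hypothetical names made up for the example.

```python
import sqlite3

# An in-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL, region TEXT)"
)
conn.execute(
    "CREATE TABLE sale (id INTEGER PRIMARY KEY, employee_id INTEGER NOT NULL, amount REAL)"
)


def describe(connection):
    """Build a {table: {column: declared_type}} blueprint from the catalog."""
    catalog = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
        columns = connection.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = {row[1]: row[2] for row in columns}
    return catalog


catalog = describe(conn)
for table, columns in catalog.items():
    print(table, columns)
```

An analyst can consult a blueprint like this before writing a query, for example to confirm that `sale.employee_id` exists and is an integer before joining it to `employee.id`.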

Data Analysis

My first encounter with a large data repository was like a fisherman walking up to a large lake. Water, water everywhere, but I was nervous about where to start fishing and what bait to use. Only after trial and error was I able to effectively ‘fish’ and pull out meaningful data. This included first learning the database organization to develop an in-depth understanding of the data integrity rules. Eventually, I could ‘catch’ some big fish and contribute to the business line I supported. Today, the data fishing expeditions are much more supervised. Every legitimate data repository details its data organization for use by business intelligence (BI) applications and individual data analysts.

Data Promotion

Communication of data findings is essential to any organization. Once crucial information is mined from corporate data, it should be promoted and propagated so everyone can take advantage of it. With all the sophisticated data aggregation and display tools available today, elaborate digital illustrations are easy to produce and understand. They have come a long way from simple pie and bar charts. This generation of users is also better at interpreting data results in multidimensional display formats.

Nothing is ever set in stone, least of all in information technology. However, Data Science as a discipline has achieved a great deal of respect and appreciation. Everyone from entry-level to executive-level positions understands the importance of data and how best to use it to be successful. Next to Cybersecurity Forces, there is no more critical group within any organization than the Data Scientists.