I have been writing examples of LRMI metadata for schema.org. Of course I want these to be valid, so I have been hitting various online validators quite frequently. This was getting tedious. Fortunately, the Yandex structured data validator has an API, so I could write a python script to automate the testing.
Here it is
#!/usr/bin/python import httplib, urllib, json, sys from html2text import html2text from sys import argv noerror = False def errhunt(key, responses): # a key and a dictionary, print "Checking %s object" % key # we're going on an err hunt if (responses[key]): for s in responses[key]: for object_key in s.keys(): if (object_key == "@error"): print "Errors in ", key for error in s['@error']: print "tError code: ", error['error_code'][0] print "tError message: ", html2text(error['message'][0]).replace('n',' ') noerror = False elif (s[object_key] != ''): errhunt(object_key, s) else: print "No errors in %s object" % key else: print "No %s objects" % key try: script, file_name = argv except: print "tError: Missing argument, name of file to check.ntUsage: yandexvalidator.py filename" sys.exit(0) try: file = open( file_name, 'r' ) except: print "tError: Could not open file ", file_name, " to read" sys.exit(0) content = file.read() try: validator_url = "validator-api.semweb.yandex.ru" key = "12345-1234-1234-1234-123456789abc" params = urllib.urlencode({'apikey': key, 'lang': 'en', 'pretty': 'true', 'only_errors': 'true' }) validator_path = "/v1.0/document_parser?"+params headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "*/*"} validator_connection = httplib.HTTPSConnection( validator_url ) except: print "tError: something went wrong connecting to the Yandex validator." try: validator_connection.request("POST", validator_path, content, headers) response = validator_connection.getresponse() if (response.status == 204): noerror= True response_data = response.read() # to clear for next connection else: response_data = json.load(response) validator_connection.close() except: print "tError: something went wrong getting data from the Yandex validator by API." print "tcontent:n", content print "tresponse: ", response.read() print "tstatus: ", response.status print "tmessage: ", response.msg print "treason: ", response.reason print "n" raise sys.exit(0) if noerror : print "No errors found." else: for k in response_data.keys(): errhunt(k, response_data)
Usage:
$ ./yandexvalidator.py test.html No errors found. $ ./yandexvalidator.py test2.html Checking json-ld object No json-ld objects Checking rdfa object No rdfa objects Checking id object No id objects Checking microformat object No microformat objects Checking microdata object Checking http://schema.org/audience object Checking http://schema.org/educationalAlignment object Checking http://schema.org/video object Errors in http://schema.org/video Error code: missing_empty Error message: WARNING: Не выполнено обязательное условие для передачи данных в Яндекс.Видео: **isFamilyFriendly** field missing or empty Error code: missing_empty Error message: WARNING: Не выполнено обязательное условие для передачи данных в Яндекс.Видео: **thumbnail** field missing or empty $
Points to note:
- I’m no software engineer. I’ve tested this against valid and invalid files. You’re welcome to use this, but it might not work for you. (You’ll need your own API key). If you see something needs fixing, drop me a line.
- Line 51: has to be an HTTPS connection.
- Line 58: we ask for errors only (at line 46) so no news is good news.
- The function errhunt does most of the work, recursively.
The response from the API is a json object (and json objects are converted into python dictionary objects by line 62), the keys of which are the “id” you sent and each of the structured data formats that are checked. For each of these there is an array/list of objects, and these objects are either simple key-value pairs or the value may be an array/list of objects. If there is an error in any of the objects, the value for the key “@error” gives the details, as a list of Error_code and Error_message key-value pairs. errhunt iterates and recurses through these lists of objects with embedded lists of objects.